databricks-industry-solutions/hls-tcga

Data Intelligence for R&D: The Cancer Genome Atlas (TCGA)

The Cancer Genome Atlas (TCGA) represents a comprehensive and coordinated initiative aimed at expediting our understanding of the molecular foundations of cancer by leveraging genome analysis technologies, including large-scale genome sequencing. Spearheaded in 2006 by the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), TCGA has set forth the following objectives:

  1. Enhance our capacity for cancer diagnosis, treatment, and prevention by delving into the genomic alterations in cancer, which will pave the way for more refined diagnostic and therapeutic strategies.
  2. Pinpoint molecular therapy targets by discerning common molecular traits of tumors, enabling the development of treatments that specifically target these markers.
  3. Uncover carcinogenesis mechanisms by identifying the genomic shifts that lead to the transition of normal cells into tumors.
  4. Strengthen predictions of cancer recurrence by understanding the genomic modifications in tumors, facilitating the recognition of indicators that signify an increased likelihood of cancer resurgence post-treatment.
  5. Foster new breakthroughs via data sharing. TCGA shares all its data and findings with the global scientific community, promoting independent research, novel discoveries, and the development of improved solutions.

TCGA holds over 2.5 petabytes of genomic, epigenomic, transcriptomic, and proteomic data spanning 33 cancer types. Contributions from over 10,000 patients include tumor samples and matched controls from blood or nearby normal tissues. The Genomic Data Commons (GDC) offers complete access to this data, and users can visually navigate it using the Integrative Genomics Viewer (IGV).

TCGA serves as a comprehensive repository of pivotal genomic variations in major cancers, continually propelling significant advancements in cancer biology comprehension. It illuminates the mechanisms underlying tumorigenesis and sets the stage for the next generation of diagnostic and therapeutic methods.

Workflow Overview

This solution accelerator provides a template for loading RNA expression profiles and associated clinical data from TCGA into the Databricks platform and running analyses on the resulting dataset. Specifically, we demonstrate how to construct a database of gene expression profiles combined with pertinent metadata, and how to manage all data assets, including raw files, using Unity Catalog (UC). Below is an outline of the workflow:

  • Modify the config.json file, if needed, to customize your catalog.

  • Run the setup notebook to create the catalog, the schema, and the associated volume that stores the raw data.

  • Download RNA expression profiles and clinical metadata with the 01-data-download notebook. This notebook uses the GDC APIs to download the data and store it in a managed volume, a Unity Catalog-governed storage volume housed in the schema's default storage location.

  • Next, in etl_pipelines, we create tables from the raw data using Lakeflow ETL pipelines.

  • As an example of using Databricks Data Intelligence, the 02-tcga-expression-clsutering notebook uses dimensionality reduction and clustering to investigate relationships between RNA expression profiles and metadata such as tissue or organ of origin. Most of this notebook was created using the Databricks Data Science Agent.
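The setup step in the list above creates a catalog, a schema, and a volume. The following is a minimal sketch of how those three objects could be derived from a configuration, not the repository's actual setup notebook. The config keys (`catalog`, `schema`, `volume`), the example values, and the helper name `build_setup_sql` are all assumptions made for illustration; the real config.json may use different names.

```python
import json


def build_setup_sql(config):
    """Return the SQL statements that create a catalog, schema, and volume.

    The keys "catalog", "schema", and "volume" are hypothetical; adjust
    them to whatever the real config.json contains.
    """
    catalog = config["catalog"]
    schema = config["schema"]
    volume = config["volume"]
    return [
        f"CREATE CATALOG IF NOT EXISTS {catalog}",
        f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}",
        # Without a LOCATION clause, Unity Catalog creates a managed volume.
        f"CREATE VOLUME IF NOT EXISTS {catalog}.{schema}.{volume}",
    ]


# Illustrative values only, not the accelerator's defaults.
example_config = json.loads(
    '{"catalog": "tcga_demo", "schema": "rna_expression", "volume": "raw_files"}'
)
statements = build_setup_sql(example_config)

for statement in statements:
    print(statement)
```

In a Databricks notebook, each statement could be passed to `spark.sql(statement)`. Files written to the volume would then be addressable under `/Volumes/<catalog>/<schema>/<volume>/`.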

Workflow

Data Download

We use the following endpoints to download open access data:

cases_endpt: https://api.gdc.cancer.gov/cases

files_endpt: https://api.gdc.cancer.gov/files

data_endpt: https://api.gdc.cancer.gov/data
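As a rough illustration of how the files and data endpoints fit together, the sketch below builds a filtered query for open-access gene expression files and defines helpers to list and download them. It is not the code from the 01-data-download notebook. The project ID, the selected fields, the page size, and all function names are assumptions chosen for the example, and only the Python standard library is used.

```python
import json
import urllib.parse
import urllib.request

FILES_ENDPT = "https://api.gdc.cancer.gov/files"
DATA_ENDPT = "https://api.gdc.cancer.gov/data"


def build_files_query(project_id="TCGA-BRCA", size=10):
    """Build query parameters for the files endpoint.

    The project ID, field list, and size are illustrative choices.
    """
    filters = {
        "op": "and",
        "content": [
            {"op": "in", "content": {"field": "cases.project.project_id",
                                     "value": [project_id]}},
            {"op": "in", "content": {"field": "files.data_type",
                                     "value": ["Gene Expression Quantification"]}},
            {"op": "in", "content": {"field": "files.access",
                                     "value": ["open"]}},
        ],
    }
    return {
        "filters": json.dumps(filters),
        "fields": "file_id,file_name,cases.case_id",
        "format": "JSON",
        "size": str(size),
    }


def list_files(params):
    """Query the files endpoint and return the list of matching file records."""
    url = FILES_ENDPT + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url, timeout=60) as resp:
        return json.load(resp)["data"]["hits"]


def download_file(file_id, dest_path):
    """Download one file by UUID from the data endpoint to dest_path."""
    url = f"{DATA_ENDPT}/{file_id}"
    with urllib.request.urlopen(url, timeout=300) as resp, open(dest_path, "wb") as out:
        out.write(resp.read())


# Building the query needs no network access.
params = build_files_query("TCGA-LUAD", size=5)
```

A typical use would call `list_files(params)` and then `download_file(hit["file_id"], path)` for each hit, with `path` pointing into the managed volume. Clinical metadata could be requested from the cases endpoint with the same filter structure.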

ETL

After landing the files in a managed volume, we transform the data into the following tables:
