The Cancer Genome Atlas (TCGA) represents a comprehensive and coordinated initiative aimed at expediting our understanding of the molecular foundations of cancer by leveraging genome analysis technologies, including large-scale genome sequencing. Spearheaded in 2006 by the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), TCGA has set forth the following objectives:
- Enhance our capacity for cancer diagnosis, treatment, and prevention by delving into the genomic alterations in cancer, which will pave the way for more refined diagnostic and therapeutic strategies.
- Pinpoint molecular therapy targets by discerning common molecular traits of tumors, enabling the development of treatments that specifically target these markers.
- Uncover carcinogenesis mechanisms by identifying the genomic shifts that lead to the transition of normal cells into tumors.
- Strengthen predictions of cancer recurrence by understanding the genomic modifications in tumors, facilitating the recognition of indicators that signify an increased likelihood of cancer resurgence post-treatment.
- Foster new breakthroughs via data sharing. TCGA has adopted a policy of sharing all its data and findings with the global scientific fraternity, promoting independent research, novel discoveries, and the development of improved solutions.
TCGA boasts over 2.5 petabytes of genomic, epigenomic, transcriptomic, and proteomic data spanning 33 cancer types. Contributions from over 10,000 patients include tumor samples and matched controls from blood or nearby normal tissues. The Genomic Data Commons (GDC) provides access to this data, and users can explore it visually with the Integrative Genomics Viewer (IGV).
TCGA serves as a comprehensive repository of pivotal genomic variations in major cancers, continually propelling significant advancements in cancer biology comprehension. It illuminates the mechanisms underlying tumorigenesis and sets the stage for the next generation of diagnostic and therapeutic methods.
Within this solution accelerator, we present a template illustrating the ease with which one can load RNA expression profiles from TCGA and associated clinical data into the Databricks lakehouse platform, and subsequently perform diverse analyses on the dataset. Specifically, we demonstrate how to construct a database of gene expression profiles combined with pertinent metadata and manage all data assets, including raw files, using Unity Catalog (UC). Below is an outline of the workflow:
- Initially, RNA expression profiles and clinical metadata are downloaded using the 00-data-download notebook. This action leverages GDC APIs to store the data in a managed volume, a Unity Catalog-governed storage volume housed in the schema's default storage location.
- Subsequently, in the 01-tcga-etl notebook, we establish tables and publish them to Unity Catalog. Alternatively, 01-tcga-dlt can be employed to accomplish the same tasks using DLT pipelines.
- In 02-rna-tcga-analysis, we provide examples illustrating data exploration using SQL and pyspark-ai to interact with the tables through natural language.
- In the subsequent phase, within 03-rna-tcga-expression-profiles, we curate a dataset of normalized gene expressions for each sample. We then select the most variable features, apply UMAP dimensionality reduction to these features for data visualization, and design an interactive dashboard for exploratory RNA cluster analysis.
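
As a rough illustration of the managed-volume step in the workflow above, the sketch below builds the SQL statements and the `/Volumes/...` path that a download notebook could use. The catalog, schema, and volume names (`omics_demo`, `tcga`, `tcga_files`) and both helper functions are hypothetical and are not taken from the accelerator's notebooks.

```python
from typing import List


def volume_setup_statements(catalog: str, schema: str, volume: str) -> List[str]:
    """Return SQL statements that create a schema and a managed volume in it."""
    return [
        f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}",
        f"CREATE VOLUME IF NOT EXISTS {catalog}.{schema}.{volume}",
    ]


def volume_path(catalog: str, schema: str, volume: str, *parts: str) -> str:
    """Build the file path under which a Unity Catalog volume is exposed."""
    return "/".join(["/Volumes", catalog, schema, volume, *parts])


# Hypothetical names, chosen only for this example.
statements = volume_setup_statements("omics_demo", "tcga", "tcga_files")
target_dir = volume_path("omics_demo", "tcga", "tcga_files", "expressions")

# In a Databricks notebook, the statements could then be run with:
#   for stmt in statements:
#       spark.sql(stmt)
```

Files written under `target_dir` would then be governed by Unity Catalog along with the tables created in later steps.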
We use the following endpoints to download open-access data:
- cases_endpt: https://api.gdc.cancer.gov/cases
- files_endpt: https://api.gdc.cancer.gov/files
- data_endpt: https://api.gdc.cancer.gov/data
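
The sketch below shows one way a query against the files endpoint might be assembled. The filter fields, the chosen project (`TCGA-BRCA`), and the helper names are assumptions made for illustration; the 00-data-download notebook may build its requests differently.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

FILES_ENDPT = "https://api.gdc.cancer.gov/files"


def build_files_query(project_id: str, size: int = 10) -> dict:
    """Build query parameters for open-access gene expression files of one project."""
    filters = {
        "op": "and",
        "content": [
            {"op": "in", "content": {"field": "cases.project.project_id",
                                     "value": [project_id]}},
            {"op": "in", "content": {"field": "files.data_type",
                                     "value": ["Gene Expression Quantification"]}},
            {"op": "in", "content": {"field": "files.access",
                                     "value": ["open"]}},
        ],
    }
    return {
        "filters": json.dumps(filters),  # the filter tree is sent as a JSON string
        "fields": "file_id,file_name,cases.case_id",
        "format": "JSON",
        "size": str(size),
    }


def list_files(params: dict, endpoint: str = FILES_ENDPT) -> list:
    """Send the query and return the matching file records (needs network access)."""
    with urlopen(endpoint + "?" + urlencode(params), timeout=60) as response:
        return json.load(response)["data"]["hits"]


# Building the parameters does not contact the API.
params = build_files_query("TCGA-BRCA", size=5)
```

Calling `list_files(params)` would return a list of records, and each `file_id` could then be requested from the data endpoint to fetch the file itself.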
After landing the files in a managed volume, we transform the data into the following tables:
To create gene expression profiles, we group the expression data by sample, calculate gene-level and sample-level summary statistics, and select the most variable features for visualization with UMAP.
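
To make the feature-selection step concrete, here is a minimal pure-Python sketch that ranks genes by the variance of their log-transformed expression across samples. The `log2(x + 1)` transform, the use of population variance, the function name, and the toy values are all assumptions for illustration; the notebook's actual normalization and selection criteria may differ, and at TCGA scale this would normally be done in Spark or pandas.

```python
import math
from statistics import pvariance
from typing import Dict, List


def top_variable_genes(expression: Dict[str, List[float]], n_top: int) -> List[str]:
    """Return the n_top genes with the highest variance of log2(x + 1) expression.

    `expression` maps a gene name to its expression values, one per sample.
    """
    variances = {
        gene: pvariance([math.log2(value + 1.0) for value in values])
        for gene, values in expression.items()
    }
    # Sort by descending variance; break ties by gene name for a stable order.
    ranked = sorted(variances, key=lambda gene: (-variances[gene], gene))
    return ranked[:n_top]


# Toy data with made-up values: three genes measured in four samples.
toy_expression = {
    "GENE_A": [0.0, 10.0, 100.0, 1000.0],  # varies strongly across samples
    "GENE_B": [5.0, 5.0, 5.0, 5.0],        # constant, so variance is zero
    "GENE_C": [1.0, 2.0, 1.0, 2.0],        # varies slightly
}
selected = top_variable_genes(toy_expression, n_top=2)
print(selected)  # ['GENE_A', 'GENE_C']
```

The selected features would then form the input matrix for a dimensionality-reduction step such as UMAP.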