JE4: Exploring The Cancer Genome Atlas Program

Annie Liu edited this page Feb 25, 2023 · 3 revisions

Objective

  • To explore an annotation data source for human genes. The source I picked is The Cancer Genome Atlas (TCGA) Program.

Started: Feb 25, 2023

Finished: Feb 25, 2023

Time Estimated: 2 hours

Time Spent: 1.5 hours

Procedure

According to the Quercus assignment, we are expected to search for the following:

"Find an annotation data set (excluding GO and Reactome which I have outlined below as an example) for human genes - any data set that adds functional, process, location, disease status ... to a set of genes."

I picked The Cancer Genome Atlas (TCGA) Program.

Results

1. What sort of data is it? What sort of information does it offer us?

TCGA is a major cancer genomics program that consolidates genomic, transcriptomic, proteomic, and epigenomic information on 33 cancer types. It characterizes over 20,000 primary cancers and matched normal samples. The collection includes clinical, copy number, DNA sequencing, imaging, and methylation data, among others. An outline of the data types can be found here: https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga/using-tcga/types

2. When and where was it published? Was it published?

This project was launched in 2006 as a joint effort between the National Cancer Institute and the National Human Genome Research Institute under the US government. The first cancers to be studied (lung, brain, and ovarian) were selected in that same year. There does not seem to be one particular publication that represents the program as a whole, although the Pan-Cancer analysis project paper (Weinstein et al., 2013) describes its cross-cancer analysis effort.

3. Is this annotation set updated regularly or is it a static source?

This annotation set is not static: it is updated multiple times a year through data releases on the Genomic Data Commons.

4. Where can I find this data? (link to the download web address or ftp site or publication where it can be found)

The data is available on the Genomic Data Commons (GDC) Data Portal: https://portal.gdc.cancer.gov/
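Besides the web portal, the Genomic Data Commons (GDC) also offers a REST API. The sketch below only builds a request URL for listing a few TCGA cases; it does not send the request. The endpoint `https://api.gdc.cancer.gov/cases`, the filter field `project.program.name`, and the requested field names are my assumptions about the GDC API and should be checked against its documentation before use. The helper name `build_tcga_cases_url` is made up for this example.

```python
import json
from urllib.parse import urlencode

# Assumed GDC API endpoint for case records (verify against the GDC API docs).
GDC_CASES_ENDPOINT = "https://api.gdc.cancer.gov/cases"


def build_tcga_cases_url(size=5):
    """Build (but do not send) a URL asking for a few cases in the TCGA program."""
    # Assumed filter syntax: keep cases whose program name is TCGA.
    filters = {
        "op": "in",
        "content": {"field": "project.program.name", "value": ["TCGA"]},
    }
    params = {
        "filters": json.dumps(filters),
        # Assumed field names: case UUID, case ID, and project ID.
        "fields": "case_id,submitter_id,project.project_id",
        "format": "JSON",
        "size": str(size),
    }
    return GDC_CASES_ENDPOINT + "?" + urlencode(params)


url = build_tcga_cases_url(size=5)
print(url)
```

The printed URL can be pasted into a browser or passed to an HTTP client to check whether the assumed fields return what is expected.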

5. How is the data formatted and released? Does it exist in some sort of standard file format?

The data comes in many different file formats, such as VCF and TSV, so there is no single standard file format for the dataset. On the GDC portal, the datasets are listed with a case UUID and a case ID, along with details about the patient and the type of cancer profiled in a particular sample.
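Since TSV is one of the formats mentioned above, here is a minimal sketch of reading a tab-separated file with Python's standard `csv` module. The column names and values in `sample_tsv` are invented for illustration and do not come from a real GDC file; a downloaded file would have its own header.

```python
import csv
import io

# Hypothetical TSV content; real GDC files define their own columns.
sample_tsv = "gene_id\tgene_name\tvalue\nG1\tTP53\t12.5\nG2\tBRCA1\t3.0\n"


def read_tsv(handle):
    """Parse a tab-separated file handle into a list of dicts keyed by header."""
    return list(csv.DictReader(handle, delimiter="\t"))


# io.StringIO stands in for an opened file such as open("file.tsv", newline="").
rows = read_tsv(io.StringIO(sample_tsv))
for row in rows:
    print(row["gene_name"], float(row["value"]))
```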

6. What identifiers are associated with these annotations?

Each case appears to have a case UUID, while each file has its own file UUID. In addition, each case has an entity ID that begins with the prefix TCGA.
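To tell the two kinds of identifiers apart in practice, a small helper like the one below could be used. It is only a sketch: the function name `classify_identifier` is made up, the example IDs are placeholders rather than real TCGA records, and the only rule assumed for entity IDs is the `TCGA` prefix followed by a hyphen.

```python
import uuid


def classify_identifier(identifier):
    """Return 'tcga_entity_id', 'uuid', or 'unknown' for an identifier string."""
    # Assumption: entity IDs start with the prefix "TCGA" followed by a hyphen.
    if identifier.upper().startswith("TCGA-"):
        return "tcga_entity_id"
    try:
        # uuid.UUID raises ValueError if the string is not a valid UUID.
        uuid.UUID(identifier)
        return "uuid"
    except ValueError:
        return "unknown"


# Placeholder identifiers for illustration only.
examples = [
    "123e4567-e89b-12d3-a456-426614174000",
    "TCGA-XX-0000",
    "not-an-id",
]
for example in examples:
    print(example, classify_identifier(example))
```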

References

National Cancer Institute. (2018). The Cancer Genome Atlas Program. Retrieved February 25, 2023, from https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga

Cancer Genome Atlas Research Network, Weinstein, J. N., Collisson, E. A., Mills, G. B., Shaw, K. R., Ozenberger, B. A., Ellrott, K., Shmulevich, I., Sander, C., & Stuart, J. M. (2013). The Cancer Genome Atlas Pan-Cancer analysis project. Nature Genetics, 45(10), 1113–1120. https://doi.org/10.1038/ng.2764