JE4: Exploring The Cancer Genome Atlas Program
- Objective: explore an annotation database source. The source I picked is The Cancer Genome Atlas Program (TCGA).
Started: Feb 25, 2023
Finished: Feb 25, 2023
Time Estimated: 2 hours
Time Spent: 1.5 Hours
According to the Quercus assignment, we are expected to search for the following:
"Find an annotation data set (excluding GO and Reactome which I have outlined below as an example) for human genes - any data set that adds functional, process, location, disease status ... to a set of genes."
I am picking The Cancer Genome Atlas Program (TCGA).
1. What sort of data is it? What sort of information does it offer us?
TCGA is a major cancer genomics program that consolidates transcriptomic, genomic, proteomic, and epigenomic information about 33 cancer types. It offers molecular characterization of over 20,000 primary cancer and matched normal samples. The collection includes clinical, copy number, DNA sequencing, imaging, and methylation data, and more. An outline of the data types can be found here: https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga/using-tcga/types
2. When and where was it published? Was it published?
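This project was launched in 2006 as a joint effort between the National Cancer Institute and the National Human Genome Research Institute, both part of the US National Institutes of Health. The pilot phase that began that year focused on three cancer types: lung, brain (glioblastoma), and ovarian. There does not seem to be one particular publication that represents this program as a whole, although the Pan-Cancer analysis project paper (Cancer Genome Atlas Research Network et al., 2013) describes an analysis across cancer types.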
3. Is this annotation set updated regularly or is it a static source?
This annotation set is updated multiple times a year.
4. Where can I find this data? (link to the download web address or ftp site or publication where it can be found)
The data is available on the Genomic Data Commons (GDC) Data Portal: https://portal.gdc.cancer.gov/
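Besides the web portal, the GDC also exposes a REST API. As a rough sketch of how a query for cases in one TCGA project could be put together, the snippet below only builds the request URL and does not send it. The endpoint `https://api.gdc.cancer.gov/cases`, the field names `project.project_id`, `case_id`, and `submitter_id`, and the helper name `build_cases_url` are my assumptions based on how I understand the GDC API, so they should be checked against the GDC documentation before use.

```python
import json
from urllib.parse import urlencode

# Assumed GDC API endpoint for case records (verify against the GDC docs).
GDC_CASES_ENDPOINT = "https://api.gdc.cancer.gov/cases"


def build_cases_url(project_id, size=5):
    """Build (but do not send) a query URL for cases in one project.

    `build_cases_url` is a hypothetical helper written for this journal entry.
    """
    # Assumed filter syntax: match cases whose project ID is in the given list.
    filters = {
        "op": "in",
        "content": {"field": "project.project_id", "value": [project_id]},
    }
    params = {
        "filters": json.dumps(filters),
        "fields": "case_id,submitter_id",  # assumed field names
        "format": "JSON",
        "size": str(size),  # maximum number of records to return
    }
    return GDC_CASES_ENDPOINT + "?" + urlencode(params)
```

Calling `build_cases_url("TCGA-BRCA")` returns a URL that could then be opened in a browser or fetched with `urllib.request.urlopen`. I kept the network call out of the sketch so that it runs offline.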
5. How is the data formatted and released? Does it exist in some sort of standard file format?
The data is released in many different file formats, such as VCF and TSV. Because the project covers so many data types, there is no single standard file format for the dataset. The datasets are presented on the GDC portal with a case UUID and a case ID, along with details of the patient and the type of cancer that is profiled in a particular sample.
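Since TSV is one of the formats mentioned, here is a minimal sketch of reading a tab-separated table and counting rows per value of one column. The table contents, the column names (`case_id`, `primary_site`, `file_format`), and the helper `count_by_column` are all made up for illustration and are not taken from an actual GDC download.

```python
import csv
import io
from collections import Counter

# Made-up example table; real GDC TSV files will have different columns.
SAMPLE_TSV = (
    "case_id\tprimary_site\tfile_format\n"
    "case-0001\tBreast\tVCF\n"
    "case-0002\tLung\tTSV\n"
    "case-0003\tBreast\tTSV\n"
)


def count_by_column(tsv_text, column):
    """Count how many rows share each value in the given column."""
    # DictReader uses the first line as the header row.
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return Counter(row[column] for row in reader)
```

For example, `count_by_column(SAMPLE_TSV, "primary_site")` tallies the made-up cases per primary site. For a real file, the same reader could be given an open file handle instead of `io.StringIO`.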
6. What identifiers are associated with these annotations?
Each case appears to have a case UUID while each file has its own UUID. In addition, there is an entity ID for each case that has the prefix TCGA.
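To make the two identifier styles concrete, the sketch below tells a UUID apart from a TCGA-prefixed entity ID. The function name `classify_identifier` is hypothetical, and I am assuming that the `TCGA` prefix is followed by a hyphen; the rest of the entity ID is not checked.

```python
import uuid


def classify_identifier(identifier):
    """Return 'tcga_entity_id', 'uuid', or 'unknown' for an identifier string."""
    text = identifier.strip()
    # Assumption: entity IDs start with the prefix "TCGA" followed by a hyphen.
    if text.upper().startswith("TCGA-"):
        return "tcga_entity_id"
    try:
        # uuid.UUID raises ValueError if the string is not a well-formed UUID.
        uuid.UUID(text)
    except ValueError:
        return "unknown"
    return "uuid"
```

With made-up inputs, `classify_identifier("123e4567-e89b-12d3-a456-426614174000")` is treated as a UUID, while a string such as `"TCGA-XX-0000"` (a placeholder, not a real case) is treated as a TCGA entity ID.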
The Cancer Genome Atlas program [Internet]. National Cancer Institute. 2018 [cited 2023 Feb 25]. Available from: https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga
Cancer Genome Atlas Research Network, Weinstein JN, Collisson EA, Mills GB, Shaw KR, Ozenberger BA, Ellrott K, Shmulevich I, Sander C, Stuart JM. The Cancer Genome Atlas Pan-Cancer analysis project. Nature Genetics. 2013;45(10):1113–1120. Available from: https://doi.org/10.1038/ng.2764