In this repo we host the code to generate the data and figures for the paper "Reproducible processing of TCGA regulatory networks".
All the data is generated with the tcga-data-nf workflow. This folder holds sample files and analyses that can be run thanks to the pipeline.
.
├── LICENSE
├── README.md
├── config # sample configuration files
├── data
│   ├── conf
│   │   └── coad-subtype/ # configuration files for the COAD subtype application in the paper
│   └── external
│       ├── coad-subtype # subtype assignment for each TCGA-COAD sample
│       └── reactome_slim # reactome SLIM pathways used in the paper
├── envs # conda environments
├── notebooks
│   ├── colon_subtype_dragon.ipynb # DRAGON results in the paper
│   ├── colon_subtype_panda.ipynb # PANDA results in the paper
│   └── src # reusable functions
└── results # folder where all results are generated
First, we ran the full tcga-data-nf workflow with the configuration in coad_subtype.config and the metadata in full_coad_subtypes.json.
$ nextflow run tcga-data-nf -profile conda --pipeline full -c coad_subtype.config
Results are stored in the results/batch-coad-subtype-20240510/ folder, which has the following structure:
├── tcga_coad_cms1
│   ├── analysis
│   │   ├── dragon
│   │   └── panda
│   ├── data_download
│   │   ├── clinical
│   │   ├── cnv
│   │   ├── methylation
│   │   ├── mutations
│   │   └── recount3
│   └── data_prepared
│       ├── methylation
│       └── recount3
├── tcga_coad_cms2
│   ...
├── tcga_coad_cms3
│   ...
└── tcga_coad_cms4
    ...
For each subtype, you'll find the downloaded data (data_download), the prepared data (data_prepared) and the networks (analysis).
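Before opening the notebooks, it can help to confirm that a downloaded or regenerated batch folder has this layout. The snippet below is a minimal sketch, not part of the repository: the function name `missing_folders` is hypothetical, and it assumes that the cms2, cms3 and cms4 folders (elided as `...` in the tree) mirror the sub-folders shown for cms1.

```python
from pathlib import Path

# Sub-folders listed for tcga_coad_cms1 in the tree above.
# Assumption: the other three subtypes use the same layout.
EXPECTED_SUBDIRS = [
    "analysis/dragon",
    "analysis/panda",
    "data_download/clinical",
    "data_download/cnv",
    "data_download/methylation",
    "data_download/mutations",
    "data_download/recount3",
    "data_prepared/methylation",
    "data_prepared/recount3",
]
SUBTYPES = ["tcga_coad_cms1", "tcga_coad_cms2", "tcga_coad_cms3", "tcga_coad_cms4"]


def missing_folders(batch_dir):
    """Return the expected folders (relative paths) that are absent under batch_dir."""
    batch_dir = Path(batch_dir)
    missing = []
    for subtype in SUBTYPES:
        for sub in EXPECTED_SUBDIRS:
            folder = batch_dir / subtype / sub
            if not folder.is_dir():
                missing.append(folder.relative_to(batch_dir).as_posix())
    return missing
```

Calling `missing_folders("results/batch-coad-subtype-20240510")` returns an empty list when every expected folder is present, and otherwise the relative paths that still need to be downloaded or generated.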
The notebooks reproduce the results in the paper. In order to run the code in them, you need to have the pre-processed DRAGON and PANDA networks.
You can either download the batch-coad-subtype-20240510 folder, or run the workflow again to generate all the data.
The data for this repo is available on the Harvard Dataverse: Replication Data for: tcga-data-nf
@data{DVN/MCSSYJ_2024,
author = {Fanfani, Viola},
publisher = {Harvard Dataverse},
title = {{Replication Data for: tcga-data-nf}},
UNF = {UNF:6:TYixGNR1fJyPs/vReFVaPQ==},
year = {2024},
version = {V1},
doi = {10.7910/DVN/MCSSYJ},
url = {https://doi.org/10.7910/DVN/MCSSYJ}
}
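To see which files the Dataverse record contains without using the web interface, the dataset metadata can be queried by DOI. The following is a minimal sketch that assumes the dataset is served from `https://dataverse.harvard.edu` and that the Dataverse native API endpoint `/api/datasets/:persistentId/` returns the file list under `data.latestVersion.files`. The helper names are hypothetical, and the response layout should be checked against the Dataverse API guide.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

DATAVERSE_HOST = "https://dataverse.harvard.edu"  # assumed host of the Harvard Dataverse
DOI = "10.7910/DVN/MCSSYJ"  # DOI from the citation above


def dataset_metadata_url(doi=DOI, host=DATAVERSE_HOST):
    """Build the native API URL that looks a dataset up by its persistent identifier."""
    query = urlencode({"persistentId": f"doi:{doi}"}, safe=":/")
    return f"{host}/api/datasets/:persistentId/?{query}"


def list_dataset_files(doi=DOI, host=DATAVERSE_HOST):
    """Fetch the dataset record and return the file labels of its latest version.

    Assumption: the JSON response stores files under data -> latestVersion -> files.
    """
    with urlopen(dataset_metadata_url(doi, host)) as response:
        record = json.load(response)
    files = record["data"]["latestVersion"]["files"]
    return [entry["label"] for entry in files]
```

`list_dataset_files()` needs network access. `dataset_metadata_url()` only builds the URL, so it can also be pasted into a browser or passed to `curl`.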
Data on AWS: tcga-data-nf-procumputed.
In order to visualize and download this data, you need to have an active AWS account (a free tier one should suffice). For any additional help, please contact [email protected]
We'll keep an updated list of example configuration files inside the config folder.
For the most up-to-date structure of the configuration files, always refer to the tests inside the tcga-data-nf repository.
For examples of configuration files for a full analysis, you can refer to those we used for the colon cancer application:
- Pipeline configurations: data/conf/coad-subtype/coad_subtype.config
- Data configurations: data/conf/coad-subtype/full_coad_subtypes.json
We paste here the configuration files we used to download data from TCGA. These are also available alongside the data on AWS.
First round downloads:
Warning: These configuration files follow an older structure of the metadata, but they still include all the information needed to understand what has been downloaded.
- Clinical data: config/download_clinical_tcgabiolinks_firstround.config
- Gene Expression: config/download_expression_recount3_firstround.config
- Mutations: config/download_mutation_tcgabiolinks_firstround.config
- Methylation: config/download_methylation_firstround.config
Files are at:
New Methylation:
GDC data went through some ID changes/downgrading to legacy, so we re-downloaded and prepared all methylation data:
Configuration file: conf/download_methylation.json
We have pre-processed gene expression data for the following tumor types: BRCA, COAD, DLBC, KIRC, LAML, LIHC, PRAD, PAAD, SKCM, STAD, LUAD, LUSC.
Configuration file (tcga-data-nf (0.0.10)): conf/expression_prepare.conf
Output files are named after the parameters used to generate them, for example:
recount3_tcga_coad_purity06_normlogtpm_mintpm1_fracsamples000001_tissuetumor_batchtcgagdcplatform_adjtcgagdcplatform.txt
For instance, the file above is in logtpm, keeps genes with at least 1 TPM in at least a fraction 0.000001 of the samples (this effectively filters out only 'all-zero' genes), and has been corrected for gdc-platform.
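Because the parameters are encoded in the filename, they can be recovered programmatically when browsing many prepared files. The sketch below is illustrative only: the function name `parse_prepared_name` and the prefix list are assumptions inferred from the single example filename above, not an official specification of the naming scheme. Values are kept as raw strings, since the numeric encoding (for instance how `000001` maps to 0.000001) is not spelled out here.

```python
from pathlib import Path

# Parameter prefixes inferred from the example filename (assumption, not a spec).
PARAM_PREFIXES = ["purity", "norm", "mintpm", "fracsamples", "tissue", "batch", "adj"]


def parse_prepared_name(filename):
    """Split a prepared-expression filename into its raw parameter tokens."""
    tokens = Path(filename).stem.split("_")
    # The first three fields are the data source, the project and the tumor type.
    params = {"source": tokens[0], "project": tokens[1], "tumor": tokens[2]}
    for token in tokens[3:]:
        for prefix in PARAM_PREFIXES:
            if token.startswith(prefix):
                params[prefix] = token[len(prefix):]
                break
        else:
            # No known prefix matched: fail loudly instead of guessing.
            raise ValueError(f"unrecognized token: {token!r}")
    return params
```

For the example file, `parse_prepared_name(...)["norm"]` is `"logtpm"` and `["fracsamples"]` is `"000001"`.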
We have pre-processed methylation data for the following tumor types: BRCA, COAD, DLBC, KIRC, LAML, LIHC, PRAD, PAAD, SKCM, STAD, LUAD, LUSC.
Configuration file (tcga-data-nf (0.0.13)): conf/methylation_prepare.conf
We generated PANDA and LIONESS networks for 10 solid cancers: BRCA, COAD, KIRC, LIHC, LUAD, LUSC, PAAD, PRAD, SKCM, STAD.
We have used the prepared data with:
- purity: 03
- normalization: logcpm
- gene filters: mintpm1, fracsamples01
- tissues: tissueall
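To pick out the prepared expression files that match these settings, one option is to test for the corresponding tokens in each filename. This is a rough sketch with a hypothetical helper name. The tokens `mintpm1`, `fracsamples01` and `tissueall` are taken from the list above, while `purity03` and `normlogcpm` are assumed spellings built by analogy with the prepared-file naming scheme.

```python
# Tokens for the network inputs listed above. "purity03" and "normlogcpm" are
# assumed spellings (prefix + value), not confirmed filenames.
NETWORK_TOKENS = {"purity03", "normlogcpm", "mintpm1", "fracsamples01", "tissueall"}


def matches_network_inputs(filename, required=NETWORK_TOKENS):
    """Return True if every required token is a '_'-separated field of the filename."""
    stem = filename.rsplit(".", 1)[0]  # drop the file extension
    return required.issubset(stem.split("_"))
```

A list of candidate files can then be filtered with `[f for f in names if matches_network_inputs(f)]`.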
- Viola Fanfani, [email protected]