diff --git a/README.Rmd b/README.Rmd index 64d390eb..ffd00009 100755 --- a/README.Rmd +++ b/README.Rmd @@ -18,8 +18,7 @@ knitr::opts_chunk$set( `RNAsum` is an R package that can post-process, summarise and visualise outputs primarily from [DRAGEN RNA][dragen-rna] pipelines. -Its main application is to complement genome-based findings from the -[umccrise][umccrise] pipeline and to provide additional evidence for detected +Its main application is to complement whole-genome based findings and to provide additional evidence for detected alterations. [dragen-rna]: <https://sapac.illumina.com/products/by-type/informatics-products/basespace-sequence-hub/apps/edico-genome-inc-dragen-rna-pipeline.html> @@ -59,7 +58,7 @@ docker pull ghcr.io/umccr/rnasum:latest ## Workflow The pipeline consists of five main components illustrated and briefly -described below. For more details, see [workflow.md](/workflow.md). +described below. For more details, see [workflow.md](./inst/articles/workflow.md). <img src="man/figures/RNAsum_workflow_updated.png" width="100%"> @@ -81,7 +80,7 @@ described below. For more details, see [workflow.md](/workflow.md). potential druggable targets. 5. The final product is an interactive HTML report with searchable tables and plots presenting expression levels of the genes of interest. The report - consists of several sections described [here](./articles/report_structure.md). + consists of several sections described [here](./inst/articles/report_structure.md). ## Reference data @@ -100,10 +99,10 @@ Depending on the tissue from which the patient's sample was taken, one of **33 cancer datasets** from TCGA can be used as a reference cohort for comparing expression changes in genes of interest of the patient. 
Additionally, 10 samples from each of the 33 TCGA datasets were combined to create the -**[Pan-Cancer dataset](./articles/tcga_projects_summary.md#pan-cancer-dataset)**, -and for some cohorts **[extended sets](./articles/tcga_projects_summary.md#extended-datasets)** +**[Pan-Cancer dataset](./inst/articles/tcga_projects_summary.md#pan-cancer-dataset)**, +and for some cohorts **[extended sets](./inst/articles/tcga_projects_summary.md#extended-datasets)** are also available. All available datasets are listed in the -**[TCGA projects summary table](./articles/tcga_projects_summary.md)**. These datasets +**[TCGA projects summary table](./inst/articles/tcga_projects_summary.md)**. These datasets have been processed using methods described in the [TCGA-data-harmonization](https://github.com/umccr/TCGA-data-harmonization/blob/master/expression/README.md#gdc-counts-data) repository. The dataset of interest can be specified by using one of the @@ -119,7 +118,7 @@ analytical pipelines. Moreover, TCGA data may include samples from tissue material of lower quality and cellularity compared to samples processed using local protocols. To address these issues, we have built a high-quality internal reference cohort processed using the same pipelines as input data -(see [data pre-processing](./articles/workflow.md#data-processing)). +(see [data pre-processing](./inst/articles/workflow.md#data-processing)). This internal reference set of **40 pancreatic cancer samples** is based on WTS data generated at **[UMCCR](https://research.unimelb.edu.au/centre-for-cancer-research/our-research/precision-oncology-research-group)** @@ -359,7 +358,7 @@ sections, including: \*\* if genome-based results are available; see `--umccrise` argument Detailed description of the report structure, including result prioritisation -and visualisation is available [here](report_structure.md). +and visualisation is available [here](./inst/articles/report_structure.md). 
#### Results diff --git a/README.md b/README.md index 282f598c..26ee87a9 100755 --- a/README.md +++ b/README.md @@ -19,9 +19,8 @@ `RNAsum` is an R package that can post-process, summarise and visualise outputs primarily from [DRAGEN RNA](https://sapac.illumina.com/products/by-type/informatics-products/basespace-sequence-hub/apps/edico-genome-inc-dragen-rna-pipeline.html) -pipelines. Its main application is to complement genome-based findings -from the [umccrise](https://github.com/umccr/umccrise) pipeline and to -provide additional evidence for detected alterations. +pipelines. Its main application is to complement whole-genome based +findings and to provide additional evidence for detected alterations. **DOCS**: <https://umccr.github.io/RNAsum> @@ -54,7 +53,8 @@ docker pull ghcr.io/umccr/rnasum:latest ## Workflow The pipeline consists of five main components illustrated and briefly -described below. For more details, see [workflow.md](/workflow.md). +described below. For more details, see +[workflow.md](./inst/articles/workflow.md). <img src="man/figures/RNAsum_workflow_updated.png" width="100%"> @@ -80,7 +80,7 @@ described below. For more details, see [workflow.md](/workflow.md). 5. The final product is an interactive HTML report with searchable tables and plots presenting expression levels of the genes of interest. The report consists of several sections described - [here](./articles/report_structure.md). + [here](./inst/articles/report_structure.md). ## Reference data @@ -101,12 +101,12 @@ of **33 cancer datasets** from TCGA can be used as a reference cohort for comparing expression changes in genes of interest of the patient. Additionally, 10 samples from each of the 33 TCGA datasets were combined to create the **[Pan-Cancer -dataset](./articles/tcga_projects_summary.md#pan-cancer-dataset)**, and -for some cohorts **[extended -sets](./articles/tcga_projects_summary.md#extended-datasets)** are also -available. 
All available datasets are listed in the **[TCGA projects -summary table](./articles/tcga_projects_summary.md)**. These datasets -have been processed using methods described in the +dataset](./inst/articles/tcga_projects_summary.md#pan-cancer-dataset)**, +and for some cohorts **[extended +sets](./inst/articles/tcga_projects_summary.md#extended-datasets)** are +also available. All available datasets are listed in the **[TCGA +projects summary table](./inst/articles/tcga_projects_summary.md)**. +These datasets have been processed using methods described in the [TCGA-data-harmonization](https://github.com/umccr/TCGA-data-harmonization/blob/master/expression/README.md#gdc-counts-data) repository. The dataset of interest can be specified by using one of the TCGA project IDs for the `RNAsum` `--dataset` argument (see @@ -122,7 +122,7 @@ may include samples from tissue material of lower quality and cellularity compared to samples processed using local protocols. To address these issues, we have built a high-quality internal reference cohort processed using the same pipelines as input data (see [data -pre-processing](./articles/workflow.md#data-processing)). +pre-processing](./inst/articles/workflow.md#data-processing)). This internal reference set of **40 pancreatic cancer samples** is based on WTS data generated at @@ -170,12 +170,12 @@ quantification file. 
The table below lists all input data accepted in `RNAsum`: -| Input file | Tool | Example | Required | -|----|----|----|----| -| Quantified transcript **abundances** | [salmon](https://salmon.readthedocs.io/en/latest/salmon.html) ([description](https://salmon.readthedocs.io/en/latest/file_formats.html#fileformats)) | [\*.quant.sf](/inst/rawdata/test_data/dragen/TEST.quant.sf) | **Yes** | -| Quantified gene **abundances** | [salmon](https://salmon.readthedocs.io/en/latest/salmon.html) ([description](https://salmon.readthedocs.io/en/latest/file_formats.html#fileformats)) | [\*.quant.gene.sf](/inst/rawdata/test_data/dragen/TEST.quant.gene.sf) | **Yes** | -| **Fusion gene** list | [Arriba](https://arriba.readthedocs.io/en/latest/) | [fusions.tsv](/inst/rawdata/test_data/dragen/test_sample_WTS.fusion_candidates.final) | No | -| **Fusion gene** list | [DRAGEN RNA](https://sapac.illumina.com/products/by-type/informatics-products/basespace-sequence-hub/apps/edico-genome-inc-dragen-rna-pipeline.html) | [\*.fusion_candidates.final](/inst/rawdata/test_data/dragen/test_sample_WTS.fusion_candidates.final) | No | +| Input file | Tool | Example | Required | +|--------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------|----------| +| Quantified transcript **abundances** | [salmon](https://salmon.readthedocs.io/en/latest/salmon.html) ([description](https://salmon.readthedocs.io/en/latest/file_formats.html#fileformats)) | [\*.quant.sf](/inst/rawdata/test_data/dragen/TEST.quant.sf) | **Yes** | +| Quantified gene **abundances** | [salmon](https://salmon.readthedocs.io/en/latest/salmon.html) ([description](https://salmon.readthedocs.io/en/latest/file_formats.html#fileformats)) | [\*.quant.gene.sf](/inst/rawdata/test_data/dragen/TEST.quant.gene.sf) | **Yes** 
| +| **Fusion gene** list | [Arriba](https://arriba.readthedocs.io/en/latest/) | [fusions.tsv](/inst/rawdata/test_data/dragen/test_sample_WTS.fusion_candidates.final) | No | +| **Fusion gene** list | [DRAGEN RNA](https://sapac.illumina.com/products/by-type/informatics-products/basespace-sequence-hub/apps/edico-genome-inc-dragen-rna-pipeline.html) | [\*.fusion_candidates.final](/inst/rawdata/test_data/dragen/test_sample_WTS.fusion_candidates.final) | No | ### WGS @@ -183,11 +183,11 @@ The table below lists all input data accepted in `RNAsum`: The table below lists all input data accepted in `RNAsum`: -| Input file | Tool | Example | Required | -|----|----|----|----| -| **SNVs/Indels** | [PCGR](https://github.com/sigven/pcgr) | [pcgr.snvs_indels.tiers.tsv](/inst/rawdata/test_data/umccrised/test_sample_WGS/small_variants/pcgr.snvs_indels.tiers.tsv) | No | -| **CNVs** | [PURPLE](https://github.com/hartwigmedical/hmftools/tree/master/purple) | [purple.cnv.gene.tsv](/inst/rawdata/test_data/umccrised/test_sample_WGS/purple/purple.gene.cnv) | No | -| **SVs** | [Manta](https://github.com/Illumina/manta) | [sv-prioritize-manta.tsv](/inst/rawdata/test_data/umccrised/test_sample_WGS/structural/sv-prioritize-manta.tsv) | No | +| Input file | Tool | Example | Required | +|-----------------|-------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------|----------| +| **SNVs/Indels** | [PCGR](https://github.com/sigven/pcgr) | [pcgr.snvs_indels.tiers.tsv](/inst/rawdata/test_data/umccrised/test_sample_WGS/small_variants/pcgr.snvs_indels.tiers.tsv) | No | +| **CNVs** | [PURPLE](https://github.com/hartwigmedical/hmftools/tree/master/purple) | [purple.cnv.gene.tsv](/inst/rawdata/test_data/umccrised/test_sample_WGS/purple/purple.gene.cnv) | No | +| **SVs** | [Manta](https://github.com/Illumina/manta) | 
[sv-prioritize-manta.tsv](/inst/rawdata/test_data/umccrised/test_sample_WGS/structural/sv-prioritize-manta.tsv) | No | ## Usage @@ -203,7 +203,7 @@ export PATH="${rnasum_cli}:${PATH}" Usage ===== - /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library/RNAsum/cli/rnasum.R [options] + /Library/Frameworks/R.framework/Versions/4.2/Resources/library/RNAsum/cli/rnasum.R [options] Options @@ -455,7 +455,7 @@ argument Detailed description of the report structure, including result prioritisation and visualisation is available -[here](report_structure.md). +[here](./inst/articles/report_structure.md). #### Results diff --git a/inst/articles/TCGA_projects_summary.md b/inst/articles/TCGA_projects_summary.md new file mode 100644 index 00000000..31e3127f --- /dev/null +++ b/inst/articles/TCGA_projects_summary.md @@ -0,0 +1,84 @@ +# TCGA projects summary + + +The table below summarises [TCGA](https://portal.gdc.cancer.gov/) expression data available for **[33 cancer types](#primary-datasets)**. + +Additionally, for *Bladder Urothelial Carcinoma*, *Pancreatic Adenocarcinoma* and *Lung Adenocarcinoma* cohorts extended sets are available (see [Extended datasets](#extended-datasets) table), including neuroendocrine tumours (NETs), intraductal papillary mucinous neoplasm (IPMNs), acinar cell carcinoma (ACC) samples and large-cell neuroendocrine carcinoma (LCNEC). + +Finally, 10 samples from each of the [33 datasets](#primary-datasets) were combined to create [Pan-Cancer dataset](#pan-cancer-dataset). + +The dataset of interest can be specified by using one of the [TCGA](https://portal.gdc.cancer.gov/) project IDs (`Project` column) for the `--dataset` argument in *[RNAseq_report.R](./rmd_files/RNAseq_report.R)* script (see [Arguments](./README.md#arguments) section). 
+ +###### Note + +To reduce the data processing time and the size of the final HTML-based ***Patient Transcriptome Summary*** **report**, the following datasets were restricted to include expression data from 300 patients: `BRCA`, `THCA`, `HNSC`, +`LGG`, `KIRC`, `LUSC`, `LUAD`, `PRAD`, `STAD` and `LIHC`. + +## Primary datasets + +No | Project | Name | Tissue code\* | Samples no.\** +------------ | ------------ | ------------ | ------------ | ------------ +1 | `BRCA` | Breast Invasive Carcinoma | 1 | **300** +2 | `THCA` | Thyroid Carcinoma | 1 | **300** +3 | `HNSC` | Head and Neck Squamous Cell Carcinoma | 1 | **300** +4 | `LGG` | Brain Lower Grade Glioma | 1 | **300** +5 | `KIRC` | Kidney Renal Clear Cell Carcinoma | 1 | **300** +6 | `LUSC` | Lung Squamous Cell Carcinoma | 1 | **300** +7 | `LUAD` | Lung Adenocarcinoma | 1 | **300** +8 | `PRAD` | Prostate Adenocarcinoma | 1 | **300** +9 | `STAD` | Stomach Adenocarcinoma | 1 | **300** +10 | `LIHC` | Liver Hepatocellular Carcinoma | 1 | **300** +11 | `COAD` | Colon Adenocarcinoma | 1 | **257** +12 | `KIRP` | Kidney Renal Papillary Cell Carcinoma | 1 | **252** +13 | `BLCA` | Bladder Urothelial Carcinoma | 1 | **246** +14 | `OV` | Ovarian Serous Cystadenocarcinoma | 1 | **220** +15 | `SARC` | Sarcoma | 1 | **214** +16 | `PCPG` | Pheochromocytoma and Paraganglioma | 1 | **177** +17 | `CESC` | Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma | 1 | **171** +18 | `UCEC` | Uterine Corpus Endometrial Carcinoma | 1 | **168** +19 | `PAAD` | Pancreatic Adenocarcinoma | 1 | **150** +20 | `TGCT` | Testicular Germ Cell Tumours | 1 | **149** +21 | `LAML` | Acute Myeloid Leukaemia | 3 | **145** +22 | `ESCA` | Esophageal Carcinoma | 1 | **142** +23 | `GBM` | Glioblastoma Multiforme | 1 | **141** +24 | `THYM` | Thymoma | 1 | **118** +25 | `SKCM` | Skin Cutaneous Melanoma | 1 | **100** +26 | `READ` | Rectum Adenocarcinoma | 1 | **87** +27 | `UVM` | Uveal Melanoma | 1 | **80** +28 | `ACC` | Adrenocortical Carcinoma | 1 | 
**78** +29 | `MESO` | Mesothelioma | 1 | **77** +30 | `KICH` | Kidney Chromophobe | 1 | **59** +31 | `UCS` | Uterine Carcinosarcoma | 1 | **56** +32 | `DLBC` | Lymphoid Neoplasm Diffuse Large B-cell Lymphoma | 1 | **47** +33 | `CHOL` | Cholangiocarcinoma | 1 | **34** +<br /> + +## Extended datasets + +No | Project | Name | Tissue code\* | Samples no.\** +------------ | ------------ | ------------ | ------------ | ------------ +1 | `LUAD-LCNEC` | Lung Adenocarcinoma dataset including large-cell neuroendocrine carcinoma (LCNEC, n=14) | 1 | **314** +2 | `BLCA-NET` | Bladder Urothelial Carcinoma dataset including neuroendocrine tumours (NETs, n=2) | 1 | **248** +3 | `PAAD-IPMN` | Pancreatic Adenocarcinoma dataset including intraductal papillary mucinous neoplasm (IPMNs, n=2) | 1 | **152** +4 | `PAAD-NET` | Pancreatic Adenocarcinoma dataset including neuroendocrine tumours (NETs, n=8) | 1 | **158** +5 | `PAAD-ACC` | Pancreatic Adenocarcinoma dataset including acinar cell carcinoma (ACCs, n=1) | 1 | **151** +<br /> + +## Pan-Cancer dataset + +No | Project | Name | Tissue code\* | Samples no.\** +------------ | ------------ | ------------ | ------------ | ------------ +1 | `PANCAN` | Samples from all [33 cancer types](#primary-datasets), 10 samples from each | 1 and 3 (`LAML` samples only) | **330** +<br /> + +\* Tissue codes: + +Tissue code | Letter code | Definition +------------ | ------------ | ------------ +1 | TP | Primary solid Tumour +3 | TB | Primary Blood Derived Cancer - Peripheral Blood +<br /> + +\** Each dataset was cleaned based on the quality metrics provided in the *Merged Sample Quality Annotations* file **[merged_sample_quality_annotations.tsv](http://api.gdc.cancer.gov/data/1a7d7be8-675d-4e60-a105-19d4121bdebf)** from [TCGA PanCanAtlas initiative webpage](https://gdc.cancer.gov/about-data/publications/pancanatlas) (see [TCGA-data-harmonization](https://github.com/umccr/TCGA-data-harmonization/tree/master/expression/README.md#data-clean-up) repository 
for more details). + + \ No newline at end of file diff --git a/inst/articles/report_structure.md b/inst/articles/report_structure.md new file mode 100644 index 00000000..48870211 --- /dev/null +++ b/inst/articles/report_structure.md @@ -0,0 +1,126 @@ +## RNAsum sections + +<!-- vim-markdown-toc GFM --> +* [Input data](#input-data) +* [Clinical information](#clinical-information) +* [Findings summary](#findings-summary) +* [Mutated genes](#mutated-genes) +* [Fusion genes](#fusion-genes) + * [Prioritisation](#prioritisation) + * [Filtering](#filtering) + * [Abundant transcripts](#abundant-transcripts) +* [Structural variants](#structural-variants) +* [CN altered genes](#cn-altered-genes) +* [Immune markers](#immune-markers) +* [HRD genes](#hrd-genes) +* [Cancer genes](#cancer-genes) +* [Drug matching](#drug-matching) +* [Addendum](#addendum) + +<!-- vim-markdown-toc --> + +<br/> + +The **`Mutated genes`**, **`Structural variants`** and **`CN altered genes`** sections will contain information about expression levels of the mutated genes, genes located within detected structural variants (SVs) and copy-number (CN) altered regions, respectively. Genes will be ordered by increasing *variants* `TIER`, *SV* `score` and `CN` *value*, respectively, and then by decreasing absolute values in the `Patient` vs selected `dataset` column. Moreover, gene fusions detected in WTS data and reported in **`Fusion genes`** section will be first ordered based on the evidence from genome-based data (`DNA support (gene A/B)` columns). + +*** + +### Input data + +Summary of the input data + +*** + +### Clinical information + +Treatment regimen information for patients for whom clinical information is available. + +NOTE: for confidentiality reasons, the timeline (x-axis) projecting patient’s treatment regimens (y-axis) is set to start from 1st January 2000, but the treatment lengths are preserved. 
+ +*** + +### Findings summary + +Plot and table summarising altered genes listed across various report sections + +*** + +### Mutated genes + +mRNA expression levels of mutated genes (containing single nucleotide variants (SNVs) or insertions/deletions (indels)) measured in patient's sample and their average mRNA expression in samples from cancer patients (from [TCGA](https://portal.gdc.cancer.gov/)). This section is available only for samples with available *[umccrise](https://github.com/umccr/umccrise) results* + +*** + +### Fusion genes + +Prioritised fusion genes based on [Arriba](https://arriba.readthedocs.io/en/latest/) results and annotated with [FusionGDB](https://ccsm.uth.edu/FusionGDB) database. If WGS results from **[umccrise](https://github.com/umccr/umccrise)** are available then fusion genes in the **`Fusion genes`** report section are ordered based on the evidence from genome-based data. More information about gene fusions and methods for their detection and visualisation can be found [here](./fusions/README.md). + +#### Prioritisation + +Fusion genes detected in transcriptome data are prioritised based on criteria ranked in the following order: + +1. Involvement of fusion gene(s) **detected in genomic data** (if [Structural variants](#structural-variants) results are available) +2. **Detected in transcriptome data** by [Arriba](https://arriba.readthedocs.io/en/latest/) tool +3. **Reported** fusion event according to [FusionGDB](https://ccsm.uth.edu/FusionGDB/) database +4. Decreasing number of **split reads** +5. Decreasing number of **pair reads** +6. Involvement of **cancer gene(s)** (see [Cancer genes](#cancer-genes) section) + +#### Filtering + +Fusion genes detected in transcriptome data are reported if **at least one** of the following criteria is met: + +1. Involvement of fusion gene(s) **detected in genomic data** (if [Structural variants](#structural-variants) results are available) +2. 
**Reported** fusion event according to [FusionGDB](https://ccsm.uth.edu/FusionGDB) database +3. Involvement of **cancer gene(s)** (see [Cancer genes](#cancer-genes) section) +4. **Split reads** > 1 +5. **Pair reads** > 1 and **split reads** > 1 + +*** + +### Structural variants + +Similar to *Mutated genes* analysis but limited to genes located within structural variants (SVs) detected by [MANTA](https://github.com/Illumina/manta) using genomic data. This section is available only for samples with available *[MANTA](https://github.com/Illumina/manta) results*. + +*** + +### CN altered genes + +Section overlaying the mRNA expression data for [cancer genes](#cancer-genes) with per-gene somatic copy-number (CN) data (from [PURPLE](https://anaconda.org/bioconda/hmftools-purple)) and mutation status, if available. + +*** + +### Immune markers + +Similar to *Mutated genes* analysis but limited to genes considered to be immune markers. The immune markers used in the report are listed in PanelApp panel [Immune markers for WTS report](https://panelapp.agha.umccr.org/panels/243/). + +*** + +### HRD genes + +Similar to *Mutated genes* analysis but limited to genes considered to be homologous recombination deficiency (HRD) genes. The HRD genes used in the report are listed in PanelApp panel [Homologous recombination deficiency (HDR) for WTS report](https://panelapp.agha.umccr.org/panels/242/). + +*** + +### Cancer genes + +Similar to analysis above, but limited to *UMCCR cancer genes*. + +*** + +### Drug matching + +List of drugs targeting variants in detected *mutated genes*, *fusion genes*, *structural variants-affected genes*, *CN altered genes*, *HRD genes* and dysregulated *cancer genes*, which can be considered in the treatment decision making process. + +###### Note + +This section is not displayed as default. Set the `--drugs` argument to `TRUE` to present it in the report. 
+ +*** + +### Addendum + +Additional information, including `Parameters`, `Reporter details` and R `Session information`, added at the end of the report. + +<br/> + diff --git a/inst/articles/workflow.md b/inst/articles/workflow.md new file mode 100644 index 00000000..24711bb8 --- /dev/null +++ b/inst/articles/workflow.md @@ -0,0 +1,228 @@ +## RNAsum data processing workflow + +The description of the main workflow components involved in (**1**) *[read counts](./data/test_data/final/test_sample_WTS/kallisto/abundance.tsv)* and *[gene fusions](./data/test_data/final/test_sample_WTS/arriba/fusions.tsv)* data **[collection](#1-data-collection)**, (**2**) *[read counts](./data/test_data/final/test_sample_WTS/kallisto/abundance.tsv)* data **[processing](#2-data-processing)**, (**3**) **[integration](#3-integration-with-wgs-based-results)** with **[WGS](./README.md#wgs)**-based data (processed using *[umccrise](https://github.com/umccr/umccrise)* pipeline), (**4**) results **[annotation](#4-results-annotation)** and (**5**) presentation in the *Patient Transcriptome Summary* **[report](#5-report-generation)**. + +<img src="img/RNAsum_workflow.png" width="100%"> + +<br/> + +## Table of contents + +<!-- vim-markdown-toc GFM --> +* [1. Data collection](#1-data-collection) +* [2. Data processing](#2-data-processing) + * [Counts processing](#counts-processing) + * [Data collection](#data-collection) + * [Transformation](#transformation) + * [Filtering (optional)](#filtering-optional) + * [Normalisation (optional)](#normalisation-optional) + * [Combination](#combination) + * [Batch-effects correction (optional)](#batch-effects-correction-optional) + * [Data scaling](#data-scaling) +* [3. Integration with WGS-based results](#3-integration-with-wgs-based-results) + * [Somatic SNVs and small indels](#somatic-snvs-and-small-indels) + * [Structural variants](#structural-variants) + * [Somatic CNVs](#somatic-cnvs) +* [4. 
Results annotation](#4-results-annotation) + * [Key cancer genes](#key-cancer-genes) + * [OncoKB](#oncokb) + * [VICC](#vicc) + * [CIViC](#civic) + * [CGI](#cgi) + * [FusionGDB](#fusiongdb) +* [5. Report generation](#5-report-generation) + +<!-- vim-markdown-toc --> + +## 1. Data collection + +**[Read counts](./data/test_data/final/test_sample_WTS/kallisto/abundance.tsv)** data from patient sample are collected from *[bcbio-nextgen RNA-seq](https://bcbio-nextgen.readthedocs.io/en/latest/contents/bulk_rnaseq.html)* or *[DRAGEN RNA](https://sapac.illumina.com/products/by-type/informatics-products/basespace-sequence-hub/apps/edico-genome-inc-dragen-rna-pipeline.html)* pipeline. + +## 2. Data processing + +### Counts processing + +The **read count** data (see [Input data](./README.md#input-data) section in the main page) in *[abundance.tsv](./data/test_data/final/test_sample_WTS/kallisto/abundance.tsv)* or *[quant.sf](./data/test_data/stratus/test_sample_WTS/TEST.quant.sf)* quantification files from [kallisto](https://pachterlab.github.io/kallisto/about) or [salmon](https://salmon.readthedocs.io/en/latest/salmon.html), respectively, are processed following steps illustrated in [Figure 1](./img/counts_post-processing_scheme.png) and described below. + +<img src="img/counts_post-processing_scheme.png" width="40%"> + +###### Figure 1 +>Counts processing scheme. + +#### Data collection + +([Figure 1](./img/counts_post-processing_scheme.png)A) + +* Load read count files from the following three sets of data: + + 1. patient **sample** (see [Input data](./README.md#input-data) section in the main page) + 2. **external reference** cohort ([TCGA](https://tcga-data.nci.nih.gov/), available cancer types are listed in [TCGA projects summary table](./TCGA_projects_summary.md)) corresponding to the patient cancer sample + 3. 
UMCCR **internal reference** set of in-house pancreatic cancer samples (regardless of the patient sample origin; see [Input data](./README.md#input-data) section in the main page) + +#### Transformation + +([Figure 1](./img/counts_post-processing_scheme.png)B) + +* Subset datasets to include common genes +* Combine patient **sample** and **internal reference** dataset +* Convert counts to **[CPM](https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/)** (*Counts Per Million*; default) or **[TPM](https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/)** (*Transcripts Per Kilobase Million*) values in: + 1. **sample** + **internal reference** set + 2. **external reference** set + +#### Filtering (optional) + +([Figure 1](./img/counts_post-processing_scheme.png)C) + +* Filter out genes with low counts (CPM or TPM **< 1** in more than 90% of samples) in: + 1. **sample** + **internal reference** set + 2. **external reference** set + +#### Normalisation (optional) + +([Figure 1](./img/counts_post-processing_scheme.png)D) + +* Normalise data (see [Arguments](./README.md#arguments) section in the main page for available options) for sample-specific effects in: + 1. **sample** + **internal reference** set + 2. **external reference** set + +#### Combination + +([Figure 1](./img/counts_post-processing_scheme.png)E) + +* Subset datasets to include common genes +* Combine **sample** + **internal reference** set with **external reference** set + +#### Batch-effects correction (optional) + +([Figure 1](./img/counts_post-processing_scheme.png)F) + +* Consider the patient **sample** + **internal reference** (regardless of the patient sample origin) as one batch (both sets processed with the same pipeline) and corresponding **[TCGA](https://tcga-data.nci.nih.gov/) dataset** as another batch. The objective is to remove data variation due to technical factors. 
+ +#### Data scaling + +The processed count data is scaled to facilitate expression values interpretation. The data is either scaled **[gene-wise](#gene-wise-z-scoreztransformation)** (Z-score transformation, default) or **[group-wise](#group-wise-centering)** (centering). + +##### Gene-wise + +Z-scores are comparable because they measure the observations in multiples of the standard deviation of a given sample. The gene-wise Z-score transformation procedure is illustrated in [Figure 2](./img/Z-score_transformation_gene_wise.png) and is described below. + +<img src="img/Z-score_transformation_gene_wise.png" width="30%"> + +###### Figure 2 +>Gene-wise Z-score transformation scheme. + +* Extract expression values across all samples for a given **gene** ([Figure 2](./img/Z-score_transformation_gene_wise.png)A) +* Compute **Z-scores** for individual samples (see equation in [Figure 2](./img/Z-score_transformation_gene_wise.png)B) +* Compute **median Z-scores** for ([Figure 2](./img/Z-score_transformation_gene_wise.png)C): + 1. **internal reference** set\* + 2. **external reference** set + +* Present patient sample **Z-score** in the context of the reference cohorts' **median Z-scores** ([Figure 2](./img/Z-score_transformation_gene_wise.png)D) + +\* used only for pancreatic cancer patients + +##### Group-wise + +The group-wise centering approach is presented in [Figure 3](./img/centering_group_wise.png) and is described below. + + +<img src="img/centering_group_wise.png" width="30%"> + +###### Figure 3 +>Group-wise centering scheme. + +* Extract expression values for ([Figure 3](./img/centering_group_wise.png)A): + 1. patient **sample** + 2. **internal reference** set\* + 3. **external reference** set + +* For each gene compute **median expression** value in ([Figure 3](./img/centering_group_wise.png)B): + 1. **internal reference** set\* + 2. 
**external reference** set + +* **Center** the median expression values for each gene in individual groups ([Figure 3](./img/centering_group_wise.png)C) +* Present patient sample **centered** expression values in the context of the reference cohorts' **centered** values ([Figure 3](./img/centering_group_wise.png)D) + +\* used only for pancreatic cancer patients + + +## 3. Integration with WGS-based results + +For patients with available [WGS](./README.md#wgs) data processed using *[umccrise](https://github.com/umccr/umccrise)* pipeline (see ```--umccrise``` [argument](README.md/#arguments)) the expression level information for [mutated](#somatic-snvs-and-small-indels) genes or genes located within detected [structural variants](#structural-variants) (SVs) or [copy-number](#somatic-cnvs) (CN) [altered regions](#somatic-cnvs), as well as the genome-based findings are incorporated and used as primary source for expression profiles prioritisation. + +### Somatic SNVs and small indels + +* Check if **[PCGR](https://github.com/sigven/pcgr)** output file (see [example](./data/test_data/umccrised/test_sample_WGS/pcgr/test_sample_WGS-somatic.pcgr.snvs_indels.tiers.tsv)) is available +* **Extract** expression level **information** and genome-based findings for genes with detected genomic variants (use ```--pcgr_tier``` [argument](README.md/#arguments) to define [tier](https://sigven.github.io/pcgr/articles/variant_classification.html) threshold value) +* **Order genes** by increasing variants **[tier](https://sigven.github.io/pcgr/articles/variant_classification.html)** and then by decreasing absolute values representing difference between expression levels in the patient sample and the corresponding reference cohort + +### Structural variants + +* Check if **[Manta](https://github.com/Illumina/manta)** output file 
(see [example](./data/test_data/umccrised/test_sample_WGS/structural/test_sample_WGS-sv-prioritize-manta-pass.tsv)) is available +* **Extract** expression level **information** and genome-based findings for genes located within detected SVs +* **Order genes** by increasing **[SV score](https://github.com/vladsaveliev/simple_sv_annotation)** and then by decreasing absolute difference between expression levels in the patient sample and the corresponding reference cohort +* **Compare** [gene fusions](./fusions) detected in [WTS](./README.md#wts) data ([arriba](https://arriba.readthedocs.io/en/latest/) and [pizzly](https://github.com/pmelsted/pizzly)) and [WGS](./README.md#wgs) data ([Manta](https://github.com/Illumina/manta)) +* **Prioritise** [WGS](./README.md#wgs)-supported [gene fusions](./fusions) + +### Somatic CNVs + +* Check if the **[PURPLE](https://github.com/hartwigmedical/hmftools/blob/master/purple/README.md)** output file (see [example](./data/test_data/umccrised/test_sample_WGS/purple/test_sample_WGS.purple.gene.cnv)) is available +* **Extract** expression level **information** and genome-based findings for genes located within detected CNVs (use the ```--cn_loss``` and ```--cn_gain``` [arguments](README.md/#arguments) to define the CN threshold values used to classify genes within lost and gained regions) +* **Order genes** by increasing (for genes within lost regions) or decreasing (for genes within gained regions) **[CN](https://github.com/umccr/umccrise/blob/master/workflow.md#somatic-cnv)** and then by decreasing absolute difference between expression levels in the patient sample and the corresponding reference cohort + +## 4. Results annotation + +[WTS](./README.md#wts)- and/or [WGS](./README.md#wgs)-based results for the altered genes are collated with **knowledge** derived from in-house resources and public **databases** (listed below) to provide an additional source of evidence for their significance, e.g. 
to flag variants with clinical significance or potentially druggable targets. + +### Key cancer genes + +* [UMCCR key cancer genes set](https://github.com/umccr/NGS_Utils/blob/master/ngs_utils/reference_data/key_genes/make_umccr_cancer_genes.Rmd) built from several sources: + * [Cancermine](http://bionlp.bcgsc.ca/cancermine/) with at least 2 publications with at least 3 citations + * [NCG known cancer genes](http://ncg.kcl.ac.uk/) + * Tier 1 [COSMIC Cancer Gene Census](https://cancer.sanger.ac.uk/census) (CGC) + * [CACAO](https://github.com/sigven/cacao) hotspot genes (curated from [ClinVar](https://www.ncbi.nlm.nih.gov/clinvar/), [CIViC](https://civicdb.org/), [Cancer Hotspots](https://www.cancerhotspots.org/)) + * At least 2 matches in the following 5 sources and 8 clinical panels: + * Cancer predisposition genes ([CPSR](https://github.com/sigven/cpsr) list) + * [COSMIC Cancer Gene Census](https://cancer.sanger.ac.uk/census) (tier 2) + * AstraZeneca 300 (AZ300) + * Familial Cancer + * [OncoKB](https://oncokb.org/) annotated + * MSKC-IMPACT + * MSKC-Heme + * PMCC-CCP + * Illumina-TS500 + * TEMPUS + * Foundation One + * Foundation Heme + * Vogelstein + +* Used for extracting expression levels of cancer genes (presented in the `Cancer genes` report section) +* Used to prioritise candidate [fusion genes](./fusions) + +### OncoKB + +* The [OncoKB](https://oncokb.org/cancerGenes) gene list is used to annotate altered genes across various sections in the report (annotations and URL links in the `External resources` column in the report `Summary tables`) + + +### VICC + +* The [Variant Interpretation for Cancer Consortium](https://cancervariants.org/) (VICC) knowledgebase is used to annotate altered genes across various sections in the report (annotations and URL links in the `External resources` column in the report `Summary tables`) + + +### CIViC + +* The [Clinical Interpretation of Variants in Cancer](https://civicdb.org/) (CIViC) database is used to annotate altered genes across various 
sections in the report (annotations and URL links in the `External resources` column in the report `Summary tables`) +* Used to flag clinically actionable aberrations in the `Drug matching` report section + +### CGI + +* The [Cancer Genome Interpreter](https://www.cancergenomeinterpreter.org/biomarkers) (CGI) database is used to flag genes known to be involved in gene fusions and to prioritise candidate [fusion genes](./fusions) + +### FusionGDB + +* The [FusionGDB](https://ccsm.uth.edu/FusionGDB/) database is used to flag genes known to be involved in gene fusions and to prioritise candidate [gene fusions](./fusions) + +## 5. Report generation + +The final HTML-based ***Patient Transcriptome Summary*** **report** contains searchable tables and interactive plots presenting expression levels of altered genes, as well as links to public resources providing additional sources of evidence for their significance. The individual **[report sections](report_structure.md)**, **[results prioritisation](report_structure.md)** and **[visualisation](report_structure.md)** are described in more detail in [report_structure.md](report_structure.md). +