diff --git a/README.Rmd b/README.Rmd
index 64d390eb..ffd00009 100755
--- a/README.Rmd
+++ b/README.Rmd
@@ -18,8 +18,7 @@ knitr::opts_chunk$set(
 
 `RNAsum` is an R package that can post-process, summarise and visualise
 outputs primarily from [DRAGEN RNA][dragen-rna] pipelines.
-Its main application is to complement genome-based findings from the
-[umccrise][umccrise] pipeline and to provide additional evidence for detected
+Its main application is to complement whole-genome based findings and to provide additional evidence for detected
 alterations.
 
 [dragen-rna]: <https://sapac.illumina.com/products/by-type/informatics-products/basespace-sequence-hub/apps/edico-genome-inc-dragen-rna-pipeline.html>
@@ -59,7 +58,7 @@ docker pull ghcr.io/umccr/rnasum:latest
 ## Workflow
 
 The pipeline consists of five main components illustrated and briefly
-described below. For more details, see [workflow.md](/workflow.md).
+described below. For more details, see [workflow.md](./inst/articles/workflow.md).
 
 <img src="man/figures/RNAsum_workflow_updated.png" width="100%">
 
@@ -81,7 +80,7 @@ described below. For more details, see [workflow.md](/workflow.md).
    potential druggable targets.
 5. The final product is an interactive HTML report with searchable tables and
    plots presenting expression levels of the genes of interest. The report
-   consists of several sections described [here](./articles/report_structure.md).
+   consists of several sections described [here](./inst/articles/report_structure.md).
 
 ## Reference data
 
@@ -100,10 +99,10 @@ Depending on the tissue from which the patient's sample was taken, one of
 **33 cancer datasets** from TCGA can be used as a reference cohort for comparing
 expression changes in genes of interest of the patient. Additionally, 10 samples
 from each of the 33 TCGA datasets were combined to create the
-**[Pan-Cancer dataset](./articles/tcga_projects_summary.md#pan-cancer-dataset)**,
-and for some cohorts **[extended sets](./articles/tcga_projects_summary.md#extended-datasets)**
+**[Pan-Cancer dataset](./inst/articles/tcga_projects_summary.md#pan-cancer-dataset)**,
+and for some cohorts **[extended sets](./inst/articles/tcga_projects_summary.md#extended-datasets)**
 are also available. All available datasets are listed in the
-**[TCGA projects summary table](./articles/tcga_projects_summary.md)**. These datasets
+**[TCGA projects summary table](./inst/articles/tcga_projects_summary.md)**. These datasets
 have been processed using methods described in the
 [TCGA-data-harmonization](https://github.com/umccr/TCGA-data-harmonization/blob/master/expression/README.md#gdc-counts-data)
 repository. The dataset of interest can be specified by using one of the
@@ -119,7 +118,7 @@ analytical pipelines. Moreover, TCGA data may include samples from tissue
 material of lower quality and cellularity compared to samples processed using
 local protocols. To address these issues, we have built a high-quality internal
 reference cohort processed using the same pipelines as input data
-(see [data pre-processing](./articles/workflow.md#data-processing)).
+(see [data pre-processing](./inst/articles/workflow.md#data-processing)).
 
 This internal reference set of **40 pancreatic cancer samples** is based on WTS
 data generated at **[UMCCR](https://research.unimelb.edu.au/centre-for-cancer-research/our-research/precision-oncology-research-group)**
@@ -359,7 +358,7 @@ sections, including:
 \*\* if genome-based results are available; see `--umccrise` argument
 
 Detailed description of the report structure, including result prioritisation
-and visualisation is available [here](report_structure.md).
+and visualisation is available [here](./inst/articles/report_structure.md).
 
 #### Results
 
diff --git a/README.md b/README.md
index 282f598c..26ee87a9 100755
--- a/README.md
+++ b/README.md
@@ -19,9 +19,8 @@
 `RNAsum` is an R package that can post-process, summarise and visualise
 outputs primarily from [DRAGEN
 RNA](https://sapac.illumina.com/products/by-type/informatics-products/basespace-sequence-hub/apps/edico-genome-inc-dragen-rna-pipeline.html)
-pipelines. Its main application is to complement genome-based findings
-from the [umccrise](https://github.com/umccr/umccrise) pipeline and to
-provide additional evidence for detected alterations.
+pipelines. Its main application is to complement whole-genome based
+findings and to provide additional evidence for detected alterations.
 
 **DOCS**: <https://umccr.github.io/RNAsum>
 
@@ -54,7 +53,8 @@ docker pull ghcr.io/umccr/rnasum:latest
 ## Workflow
 
 The pipeline consists of five main components illustrated and briefly
-described below. For more details, see [workflow.md](/workflow.md).
+described below. For more details, see
+[workflow.md](./inst/articles/workflow.md).
 
 <img src="man/figures/RNAsum_workflow_updated.png" width="100%">
 
@@ -80,7 +80,7 @@ described below. For more details, see [workflow.md](/workflow.md).
 5.  The final product is an interactive HTML report with searchable
     tables and plots presenting expression levels of the genes of
     interest. The report consists of several sections described
-    [here](./articles/report_structure.md).
+    [here](./inst/articles/report_structure.md).
 
 ## Reference data
 
@@ -101,12 +101,12 @@ of **33 cancer datasets** from TCGA can be used as a reference cohort
 for comparing expression changes in genes of interest of the patient.
 Additionally, 10 samples from each of the 33 TCGA datasets were combined
 to create the **[Pan-Cancer
-dataset](./articles/tcga_projects_summary.md#pan-cancer-dataset)**, and
-for some cohorts **[extended
-sets](./articles/tcga_projects_summary.md#extended-datasets)** are also
-available. All available datasets are listed in the **[TCGA projects
-summary table](./articles/tcga_projects_summary.md)**. These datasets
-have been processed using methods described in the
+dataset](./inst/articles/tcga_projects_summary.md#pan-cancer-dataset)**,
+and for some cohorts **[extended
+sets](./inst/articles/tcga_projects_summary.md#extended-datasets)** are
+also available. All available datasets are listed in the **[TCGA
+projects summary table](./inst/articles/tcga_projects_summary.md)**.
+These datasets have been processed using methods described in the
 [TCGA-data-harmonization](https://github.com/umccr/TCGA-data-harmonization/blob/master/expression/README.md#gdc-counts-data)
 repository. The dataset of interest can be specified by using one of the
 TCGA project IDs for the `RNAsum` `--dataset` argument (see
@@ -122,7 +122,7 @@ may include samples from tissue material of lower quality and
 cellularity compared to samples processed using local protocols. To
 address these issues, we have built a high-quality internal reference
 cohort processed using the same pipelines as input data (see [data
-pre-processing](./articles/workflow.md#data-processing)).
+pre-processing](./inst/articles/workflow.md#data-processing)).
 
 This internal reference set of **40 pancreatic cancer samples** is based
 on WTS data generated at
@@ -170,12 +170,12 @@ quantification file.
 
 The table below lists all input data accepted in `RNAsum`:
 
-| Input file | Tool | Example | Required |
-|----|----|----|----|
-| Quantified transcript **abundances** | [salmon](https://salmon.readthedocs.io/en/latest/salmon.html) ([description](https://salmon.readthedocs.io/en/latest/file_formats.html#fileformats)) | [\*.quant.sf](/inst/rawdata/test_data/dragen/TEST.quant.sf) | **Yes** |
-| Quantified gene **abundances** | [salmon](https://salmon.readthedocs.io/en/latest/salmon.html) ([description](https://salmon.readthedocs.io/en/latest/file_formats.html#fileformats)) | [\*.quant.gene.sf](/inst/rawdata/test_data/dragen/TEST.quant.gene.sf) | **Yes** |
-| **Fusion gene** list | [Arriba](https://arriba.readthedocs.io/en/latest/) | [fusions.tsv](/inst/rawdata/test_data/dragen/test_sample_WTS.fusion_candidates.final) | No |
-| **Fusion gene** list | [DRAGEN RNA](https://sapac.illumina.com/products/by-type/informatics-products/basespace-sequence-hub/apps/edico-genome-inc-dragen-rna-pipeline.html) | [\*.fusion_candidates.final](/inst/rawdata/test_data/dragen/test_sample_WTS.fusion_candidates.final) | No |
+| Input file                           | Tool                                                                                                                                                 | Example                                                                                              | Required |
+|--------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------|----------|
+| Quantified transcript **abundances** | [salmon](https://salmon.readthedocs.io/en/latest/salmon.html) ([description](https://salmon.readthedocs.io/en/latest/file_formats.html#fileformats)) | [\*.quant.sf](/inst/rawdata/test_data/dragen/TEST.quant.sf)                                          | **Yes**  |
+| Quantified gene **abundances**       | [salmon](https://salmon.readthedocs.io/en/latest/salmon.html) ([description](https://salmon.readthedocs.io/en/latest/file_formats.html#fileformats)) | [\*.quant.gene.sf](/inst/rawdata/test_data/dragen/TEST.quant.gene.sf)                                | **Yes**  |
+| **Fusion gene** list                 | [Arriba](https://arriba.readthedocs.io/en/latest/)                                                                                                   | [fusions.tsv](/inst/rawdata/test_data/dragen/test_sample_WTS.fusion_candidates.final)                | No       |
+| **Fusion gene** list                 | [DRAGEN RNA](https://sapac.illumina.com/products/by-type/informatics-products/basespace-sequence-hub/apps/edico-genome-inc-dragen-rna-pipeline.html) | [\*.fusion_candidates.final](/inst/rawdata/test_data/dragen/test_sample_WTS.fusion_candidates.final) | No       |
 
 ### WGS
 
@@ -183,11 +183,11 @@ The table below lists all input data accepted in `RNAsum`:
 
 The table below lists all input data accepted in `RNAsum`:
 
-| Input file | Tool | Example | Required |
-|----|----|----|----|
-| **SNVs/Indels** | [PCGR](https://github.com/sigven/pcgr) | [pcgr.snvs_indels.tiers.tsv](/inst/rawdata/test_data/umccrised/test_sample_WGS/small_variants/pcgr.snvs_indels.tiers.tsv) | No |
-| **CNVs** | [PURPLE](https://github.com/hartwigmedical/hmftools/tree/master/purple) | [purple.cnv.gene.tsv](/inst/rawdata/test_data/umccrised/test_sample_WGS/purple/purple.gene.cnv) | No |
-| **SVs** | [Manta](https://github.com/Illumina/manta) | [sv-prioritize-manta.tsv](/inst/rawdata/test_data/umccrised/test_sample_WGS/structural/sv-prioritize-manta.tsv) | No |
+| Input file      | Tool                                                                    | Example                                                                                                                   | Required |
+|-----------------|-------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------|----------|
+| **SNVs/Indels** | [PCGR](https://github.com/sigven/pcgr)                                  | [pcgr.snvs_indels.tiers.tsv](/inst/rawdata/test_data/umccrised/test_sample_WGS/small_variants/pcgr.snvs_indels.tiers.tsv) | No       |
+| **CNVs**        | [PURPLE](https://github.com/hartwigmedical/hmftools/tree/master/purple) | [purple.cnv.gene.tsv](/inst/rawdata/test_data/umccrised/test_sample_WGS/purple/purple.gene.cnv)                           | No       |
+| **SVs**         | [Manta](https://github.com/Illumina/manta)                              | [sv-prioritize-manta.tsv](/inst/rawdata/test_data/umccrised/test_sample_WGS/structural/sv-prioritize-manta.tsv)           | No       |
 
 ## Usage
 
@@ -203,7 +203,7 @@ export PATH="${rnasum_cli}:${PATH}"
     Usage
     =====
      
-    /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library/RNAsum/cli/rnasum.R [options]
+    /Library/Frameworks/R.framework/Versions/4.2/Resources/library/RNAsum/cli/rnasum.R [options]
 
 
     Options
@@ -455,7 +455,7 @@ argument
 
 Detailed description of the report structure, including result
 prioritisation and visualisation is available
-[here](report_structure.md).
+[here](./inst/articles/report_structure.md).
 
 #### Results
 
diff --git a/inst/articles/TCGA_projects_summary.md b/inst/articles/TCGA_projects_summary.md
new file mode 100644
index 00000000..31e3127f
--- /dev/null
+++ b/inst/articles/TCGA_projects_summary.md
@@ -0,0 +1,84 @@
+# TCGA projects summary
+
+
+The table below summarises [TCGA](https://portal.gdc.cancer.gov/) expression data available for **[33 cancer types](#primary-datasets)**. 
+
+Additionally, extended sets are available for the *Bladder Urothelial Carcinoma*, *Pancreatic Adenocarcinoma* and *Lung Adenocarcinoma* cohorts (see the [Extended datasets](#extended-datasets) table). These include neuroendocrine tumour (NET), intraductal papillary mucinous neoplasm (IPMN), acinar cell carcinoma (ACC) and large-cell neuroendocrine carcinoma (LCNEC) samples.
+
+Finally, 10 samples from each of the [33 datasets](#primary-datasets) were combined to create the [Pan-Cancer dataset](#pan-cancer-dataset).
+
+The dataset of interest can be specified by using one of the [TCGA](https://portal.gdc.cancer.gov/) project IDs (`Project` column) for the `--dataset` argument in the *[RNAseq_report.R](./rmd_files/RNAseq_report.R)* script (see the [Arguments](./README.md#arguments) section).
+
+###### Note
+
+To reduce the data processing time and the size of the final HTML-based ***Patient Transcriptome Summary*** **report**, the following datasets were restricted to include expression data from 300 patients: `BRCA`, `THCA`, `HNSC`,
+`LGG`, `KIRC`, `LUSC`, `LUAD`, `PRAD`, `STAD` and `LIHC`.
+
+## Primary datasets
+
+No | Project | Name | Tissue code\* | Samples no.\**
+------------ | ------------ | ------------ | ------------ | ------------
+1 | `BRCA`  | Breast Invasive Carcinoma | 1 | **300**
+2 | `THCA`  | Thyroid Carcinoma | 1 | **300**
+3 | `HNSC`  | Head and Neck Squamous Cell Carcinoma | 1 | **300**
+4 | `LGG`   | Brain Lower Grade Glioma | 1 | **300**
+5 | `KIRC`  | Kidney Renal Clear Cell Carcinoma | 1 | **300**
+6 | `LUSC`  | Lung Squamous Cell Carcinoma | 1 | **300**
+7 | `LUAD`  | Lung Adenocarcinoma | 1 | **300**
+8 | `PRAD`  | Prostate Adenocarcinoma | 1 | **300**
+9 | `STAD`  | Stomach Adenocarcinoma | 1 | **300**
+10 | `LIHC`  | Liver Hepatocellular Carcinoma | 1 | **300**
+11 | `COAD`  | Colon Adenocarcinoma | 1 | **257**
+12 | `KIRP`  | Kidney Renal Papillary Cell Carcinoma | 1 | **252**
+13 | `BLCA`  | Bladder Urothelial Carcinoma | 1 | **246**
+14 | `OV`    | Ovarian Serous Cystadenocarcinoma | 1 | **220**
+15 | `SARC`  | Sarcoma | 1 | **214**
+16 | `PCPG`  | Pheochromocytoma and Paraganglioma | 1 | **177**
+17 | `CESC`  | Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma | 1 | **171**
+18 | `UCEC`  | Uterine Corpus Endometrial Carcinoma | 1 | **168**
+19 | `PAAD`  | Pancreatic Adenocarcinoma | 1 | **150**
+20 | `TGCT`  | Testicular Germ Cell Tumours | 1 | **149**
+21 | `LAML`  | Acute Myeloid Leukaemia | 3 | **145**
+22 | `ESCA`  | Esophageal Carcinoma | 1 | **142**
+23 | `GBM`   | Glioblastoma Multiforme | 1 | **141**
+24 | `THYM`  | Thymoma | 1 | **118**
+25 | `SKCM`  | Skin Cutaneous Melanoma | 1 | **100**
+26 | `READ`  | Rectum Adenocarcinoma | 1 | **87**
+27 | `UVM`   | Uveal Melanoma | 1 | **80**
+28 | `ACC`   | Adrenocortical Carcinoma | 1 | **78**
+29 | `MESO`  | Mesothelioma | 1 | **77**
+30 | `KICH`  | Kidney Chromophobe | 1 | **59**
+31 | `UCS`   | Uterine Carcinosarcoma | 1 | **56**
+32 | `DLBC`  | Lymphoid Neoplasm Diffuse Large B-cell Lymphoma | 1 | **47**
+33 | `CHOL`  | Cholangiocarcinoma | 1 | **34**
+<br />
+
+## Extended datasets
+
+No | Project | Name | Tissue code\* | Samples no.\**
+------------ | ------------ | ------------ | ------------ | ------------
+1 | `LUAD-LCNEC`  | Lung Adenocarcinoma dataset including large-cell neuroendocrine carcinoma (LCNEC, n=14) | 1 | **314**
+2 | `BLCA-NET`  | Bladder Urothelial Carcinoma dataset including neuroendocrine tumours (NETs, n=2) | 1 | **248**
+3 | `PAAD-IPMN`  | Pancreatic Adenocarcinoma dataset including intraductal papillary mucinous neoplasm (IPMNs, n=2) | 1 | **152**
+4 | `PAAD-NET`  | Pancreatic Adenocarcinoma dataset including neuroendocrine tumours (NETs, n=8) | 1 | **158**
+5 | `PAAD-ACC`  | Pancreatic Adenocarcinoma dataset including acinar cell carcinoma (ACCs, n=1) | 1 | **151**
+<br />
+
+## Pan-Cancer dataset
+
+No | Project | Name | Tissue code\* | Samples no.\**
+------------ | ------------ | ------------ | ------------ | ------------
+1 | `PANCAN`  | Samples from all [33 cancer types](#primary-datasets), 10 samples from each  | 1 and 3 (`LAML` samples only) | **330**
+<br />
+
+\* Tissue codes:
+
+Tissue code | Letter code | Definition
+------------ | ------------ | ------------
+1 | TP  | Primary solid Tumour
+3 | TB  | Primary Blood Derived Cancer - Peripheral Blood
+<br />
+
+\** Each dataset was cleaned based on the quality metrics provided in the *Merged Sample Quality Annotations* file **[merged_sample_quality_annotations.tsv](http://api.gdc.cancer.gov/data/1a7d7be8-675d-4e60-a105-19d4121bdebf)** from [TCGA PanCanAtlas initiative webpage](https://gdc.cancer.gov/about-data/publications/pancanatlas) (see [TCGA-data-harmonization](https://github.com/umccr/TCGA-data-harmonization/tree/master/expression/README.md#data-clean-up) repository for more details).
+ 
+ 
\ No newline at end of file
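The sample counts in `TCGA_projects_summary.md` above can be cross-checked with a little arithmetic. The following is a minimal Python sketch, not part of `RNAsum`; the dictionary layout and function names are hypothetical, and the numbers are copied from the tables above. It checks that each extended dataset total equals its primary cohort plus the added samples, and that the Pan-Cancer total equals 10 samples from each of the 33 primary datasets.

```python
# Sample counts copied from the "Primary datasets" table.
primary = {"LUAD": 300, "BLCA": 246, "PAAD": 150}

# Extended project ID -> (primary cohort, added samples, total listed in the table).
extended = {
    "LUAD-LCNEC": ("LUAD", 14, 314),
    "BLCA-NET": ("BLCA", 2, 248),
    "PAAD-IPMN": ("PAAD", 2, 152),
    "PAAD-NET": ("PAAD", 8, 158),
    "PAAD-ACC": ("PAAD", 1, 151),
}


def extended_total(base, extra):
    """Expected size of an extended set: primary cohort plus added samples."""
    return primary[base] + extra


def check_extended():
    """Return project IDs whose listed total disagrees with base + added samples."""
    return [
        project
        for project, (base, extra, listed) in extended.items()
        if extended_total(base, extra) != listed
    ]


# Pan-Cancer dataset: 10 samples from each of the 33 primary datasets.
N_PRIMARY_DATASETS = 33
SAMPLES_PER_DATASET = 10
pancan_total = N_PRIMARY_DATASETS * SAMPLES_PER_DATASET

print(check_extended())  # projects with inconsistent totals, if any
print(pancan_total)
```

With the values listed in the tables, no inconsistencies are reported and the Pan-Cancer total matches the 330 samples given for `PANCAN`.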
diff --git a/inst/articles/report_structure.md b/inst/articles/report_structure.md
new file mode 100644
index 00000000..48870211
--- /dev/null
+++ b/inst/articles/report_structure.md
@@ -0,0 +1,126 @@
+## RNAsum sections
+
+<!-- vim-markdown-toc GFM -->
+* [Input data](#input-data)
+* [Clinical information](#clinical-information)
+* [Findings summary](#findings-summary)
+* [Mutated genes](#mutated-genes)
+* [Fusion genes](#fusion-genes)
+  * [Prioritisation](#prioritisation)
+  * [Filtering](#filtering)
+  * [Abundant transcripts](#abundant-transcripts)
+* [Structural variants](#structural-variants)
+* [CN altered genes](#cn-altered-genes)
+* [Immune markers](#immune-markers)
+* [HRD genes](#hrd-genes)
+* [Cancer genes](#cancer-genes)
+* [Drug matching](#drug-matching)
+* [Addendum](#addendum)
+
+<!-- vim-markdown-toc -->
+
+<br/> 
+
+The **`Mutated genes`**, **`Structural variants`** and **`CN altered genes`** sections will contain information about expression levels of the mutated genes, genes located within detected structural variants (SVs) and copy-number (CN) altered regions, respectively. Genes will be ordered by increasing *variants* `TIER`, *SV* `score` and `CN` *value*, respectively, and then by decreasing absolute values in the `Patient` vs selected `dataset` column. Moreover, gene fusions detected in WTS data and reported in the **`Fusion genes`** section will first be ordered based on the evidence from genome-based data (`DNA support (gene A/B)` columns).
+
+***
+
+### Input data
+
+Summary of the input data
+
+***
+
+### Clinical information
+
+Treatment regimen information for patients for whom clinical information is available.
+
+NOTE: for confidentiality reasons, the timeline (x-axis) projecting the patient’s treatment regimens (y-axis) is set to start from 1st January 2000, but the treatment lengths are preserved.
+
+***
+
+### Findings summary
+
+Plot and table summarising altered genes listed across various report sections
+
+***
+
+### Mutated genes
+
+mRNA expression levels of mutated genes (containing single nucleotide variants (SNVs) or insertions/deletions (indels)) measured in the patient's sample and their average mRNA expression in samples from cancer patients (from [TCGA](https://portal.gdc.cancer.gov/)). This section is available only for samples with *[umccrise](https://github.com/umccr/umccrise) results*.
+
+***
+
+### Fusion genes
+
+Prioritised fusion genes based on [Arriba](https://arriba.readthedocs.io/en/latest/) results and annotated with the [FusionGDB](https://ccsm.uth.edu/FusionGDB) database. If WGS results from **[umccrise](https://github.com/umccr/umccrise)** are available, fusion genes in the **`Fusion genes`** report section are ordered based on the evidence from genome-based data. More information about gene fusions and methods for their detection and visualisation can be found [here](./fusions/README.md).
+
+#### Prioritisation
+
+Fusion genes detected in transcriptome data are prioritised based on criteria ranked in the following order:
+
+1. Involvement of fusion gene(s) **detected in genomic data** (if [Structural variants](#structural-variants) results are available)
+2. **Detected in transcriptome data** by [Arriba](https://arriba.readthedocs.io/en/latest/) tool
+3. **Reported** fusion event according to [FusionGDB](https://ccsm.uth.edu/FusionGDB/) database
+4. Decreasing number of **split reads**
+5. Decreasing number of **pair reads**
+6. Involvement of **cancer gene(s)** (see [Cancer genes](#cancer-genes) section)
+
+#### Filtering
+
+Fusion genes detected in transcriptome data are reported if **at least one** of the following criteria is met:
+
+1. Involvement of fusion gene(s) **detected in genomic data** (if [Structural variants](#structural-variants) results are available)
+2. **Reported** fusion event according to [FusionGDB](https://ccsm.uth.edu/FusionGDB) database
+3. Involvement of **cancer gene(s)** (see [Cancer genes](#cancer-genes) section)
+4. **Split reads** > 1
+5. **Pair reads** > 1 and **split reads** > 1
+
+***
+
+### Structural variants
+
+Similar to *Mutated genes* analysis but limited to genes located within structural variants (SVs) detected by [MANTA](https://github.com/Illumina/manta) using genomic data. This section is available only for samples with available *[MANTA](https://github.com/Illumina/manta) results*.
+
+***
+
+### CN altered genes
+
+Section overlaying the mRNA expression data for [cancer genes](#cancer-genes) with per-gene somatic copy-number (CN) data (from [PURPLE](https://anaconda.org/bioconda/hmftools-purple)) and mutation status, if available.
+
+***
+
+### Immune markers
+
+Similar to *Mutated genes* analysis but limited to genes considered to be immune markers. The immune markers used in the report are listed in PanelApp panel [Immune markers for WTS report](https://panelapp.agha.umccr.org/panels/243/).
+
+***
+
+### HRD genes
+
+Similar to *Mutated genes* analysis but limited to genes considered to be homologous recombination deficiency (HRD) genes. The HRD genes used in the report are listed in PanelApp panel [Homologous recombination deficiency (HDR) for WTS report](https://panelapp.agha.umccr.org/panels/242/).
+
+***
+
+### Cancer genes
+
+Similar to analysis above, but limited to *UMCCR cancer genes*.
+
+***
+
+### Drug matching
+
+List of drugs targeting variants in detected *mutated genes*, *fusion genes*, *structural variants-affected genes*, *CN altered genes*, *HRD genes* and dysregulated *cancer genes*, which can be considered in the treatment decision making process.
+
+###### Note
+
+This section is not displayed by default. Set the `--drugs` argument to `TRUE` to present it in the report.
+
+***
+
+### Addendum
+
+Additional information, including `Parameters`, `Reporter details` and R `Session information`, added at the end of the report.
+
+<br/>
+
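The fusion filtering and prioritisation rules listed in `report_structure.md` above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not RNAsum's R implementation: the `Fusion` class, its field names and the example fusion names are hypothetical, and only the criteria themselves come from the text. As listed, filtering criterion 5 (pair reads > 1 and split reads > 1) is already implied by criterion 4 (split reads > 1); the sketch keeps both to mirror the document.

```python
from dataclasses import dataclass


@dataclass
class Fusion:
    """Hypothetical record holding the evidence used by the criteria."""
    name: str
    dna_support: bool            # fusion gene(s) detected in genomic data
    detected_by_arriba: bool     # detected in transcriptome data by Arriba
    reported_in_fusiongdb: bool  # known fusion event according to FusionGDB
    split_reads: int
    pair_reads: int
    involves_cancer_gene: bool


def is_reported(f):
    """Filtering: keep a fusion if at least one criterion is met."""
    return (
        f.dna_support
        or f.reported_in_fusiongdb
        or f.involves_cancer_gene
        or f.split_reads > 1
        or (f.pair_reads > 1 and f.split_reads > 1)
    )


def priority_key(f):
    """Prioritisation: criteria compared in the ranked order from the text.

    False sorts before True, so each boolean is negated to put supported
    fusions first; read counts are negated to sort in decreasing order.
    """
    return (
        not f.dna_support,
        not f.detected_by_arriba,
        not f.reported_in_fusiongdb,
        -f.split_reads,
        -f.pair_reads,
        not f.involves_cancer_gene,
    )


def prioritise(fusions):
    """Apply the filter, then order the remaining fusions by priority."""
    return sorted((f for f in fusions if is_reported(f)), key=priority_key)


# Hypothetical example data.
fusions = [
    Fusion("GENE1--GENE2", False, True, False, 1, 5, False),
    Fusion("GENE3--GENE4", False, True, True, 3, 2, False),
    Fusion("GENE5--GENE6", True, True, False, 0, 0, False),
    Fusion("GENE7--GENE8", False, True, False, 7, 4, True),
]
ranked = [f.name for f in prioritise(fusions)]
print(ranked)
```

In this example the first fusion is dropped because it meets none of the filtering criteria, the DNA-supported fusion is ranked first, and the FusionGDB-reported fusion is ranked ahead of the one with more split reads, because the criteria are applied in ranked order.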
diff --git a/inst/articles/workflow.md b/inst/articles/workflow.md
new file mode 100644
index 00000000..24711bb8
--- /dev/null
+++ b/inst/articles/workflow.md
@@ -0,0 +1,228 @@
+## RNAsum data processing workflow
+
+This document describes the main workflow components involved in (**1**) *[read counts](./data/test_data/final/test_sample_WTS/kallisto/abundance.tsv)* and *[gene fusions](./data/test_data/final/test_sample_WTS/arriba/fusions.tsv)* data **[collection](#1-data-collection)**, (**2**) *[read counts](./data/test_data/final/test_sample_WTS/kallisto/abundance.tsv)* data **[processing](#2-data-processing)**, (**3**) **[integration](#3-integration-with-wgs-based-results)** with **[WGS](./README.md#wgs)**-based data (processed using the *[umccrise](https://github.com/umccr/umccrise)* pipeline), (**4**) results **[annotation](#4-results-annotation)** and (**5**) presentation in the *Patient Transcriptome Summary* **[report](#5-report-generation)**.
+
+<img src="img/RNAsum_workflow.png" width="100%"> 
+
+<br/>
+
+## Table of contents
+
+<!-- vim-markdown-toc GFM -->
+* [1. Data collection](#1-data-collection)
+* [2. Data processing](#2-data-processing)
+    * [Counts processing](#counts-processing)
+    	* [Data collection](#data-collection)
+    	* [Transformation](#transformation)
+    	* [Filtering (optional)](#filtering-optional)
+    	* [Normalisation (optional)](#normalisation-optional)
+    	* [Combination](#combination)
+    	* [Batch-effects correction (optional)](#batch-effects-correction-optional)
+    	* [Data scaling](#data-scaling)
+* [3. Integration with WGS-based results](#3-integration-with-wgs-based-results)
+	* [Somatic SNVs and small indels](#somatic-snvs-and-small-indels)
+	* [Structural variants](#structural-variants)
+	* [Somatic CNVs](#somatic-cnvs)
+* [4. Results annotation](#4-results-annotation)
+	* [Key cancer genes](#key-cancer-genes)
+	* [OncoKB](#oncokb)
+	* [VICC](#vicc)
+	* [CIViC](#civic)
+	* [CGI](#cgi)
+	* [FusionGDB](#fusiongdb)
+* [5. Report generation](#5-report-generation)
+
+<!-- vim-markdown-toc -->
+
+## 1. Data collection
+
+**[Read counts](./data/test_data/final/test_sample_WTS/kallisto/abundance.tsv)** data from patient sample are collected from *[bcbio-nextgen RNA-seq](https://bcbio-nextgen.readthedocs.io/en/latest/contents/bulk_rnaseq.html)* or *[DRAGEN RNA](https://sapac.illumina.com/products/by-type/informatics-products/basespace-sequence-hub/apps/edico-genome-inc-dragen-rna-pipeline.html)* pipeline.
+
+## 2. Data processing 
+
+### Counts processing
+
+The **read count** data (see [Input data](./README.md#input-data) section in the main page) in *[abundance.tsv](./data/test_data/final/test_sample_WTS/kallisto/abundance.tsv)* or *[quant.sf](./data/test_data/stratus/test_sample_WTS/TEST.quant.sf)* quantification files from [kallisto](https://pachterlab.github.io/kallisto/about) or [salmon](https://salmon.readthedocs.io/en/latest/salmon.html), respectively, are processed following steps illustrated in [Figure 1](./img/counts_post-processing_scheme.png) and described below.
+
+<img src="img/counts_post-processing_scheme.png" width="40%"> 
+
+###### Figure 1
+>Counts processing scheme.
+
+#### Data collection
+
+([Figure 1](./img/counts_post-processing_scheme.png)A)
+
+* Load read count files from the following three sets of data:
+
+	1. patient **sample** (see [Input data](./README.md#input-data) section in the main page)
+	2. **external reference** cohort ([TCGA](https://tcga-data.nci.nih.gov/), available cancer types are listed in [TCGA projects summary table](./TCGA_projects_summary.md)) corresponding to the patient cancer sample
+	3. UMCCR **internal reference** set of in-house pancreatic cancer samples (regardless of the patient sample origin; see [Input data](./README.md#input-data) section in the main page)
+
+#### Transformation
+
+([Figure 1](./img/counts_post-processing_scheme.png)B)
+
+* Subset datasets to include common genes
+* Combine patient **sample** and **internal reference** dataset
+* Convert counts to **[CPM](https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/)** (*Counts Per Million*; default) or **[TPM](https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/)** (*Transcripts Per Million*) values in:
+	1. **sample** + **internal reference** set
+	2. **external reference** set
+
+#### Filtering (optional)
+
+([Figure 1](./img/counts_post-processing_scheme.png)C)
+
+* Filter out genes with low counts (CPM or TPM **< 1** in more than 90% of samples) in:
+	1. **sample** + **internal reference** set
+	2. **external reference** set
+
+#### Normalisation (optional)
+
+([Figure 1](./img/counts_post-processing_scheme.png)D)
+
+* Normalise data (see [Arguments](./README.md#arguments) section in the main page for available options) for sample-specific effects in:
+	1. **sample** + **internal reference** set
+	2. **external reference** set
+
+#### Combination
+
+([Figure 1](./img/counts_post-processing_scheme.png)E)
+
+* Subset datasets to include common genes
+* Combine **sample** + **internal reference** set with **external reference** set
+
+#### Batch-effects correction (optional)
+
+([Figure 1](./img/counts_post-processing_scheme.png)F)
+
+* Consider the patient **sample** + **internal reference** (regardless of the patient sample origin) as one batch (both sets processed with the same pipeline) and corresponding **[TCGA](https://tcga-data.nci.nih.gov/) dataset** as another batch. The objective is to remove data variation due to technical factors.
+
+#### Data scaling
+
+The processed count data is scaled to facilitate interpretation of expression values. The data is scaled either **[gene-wise](#gene-wise)** (Z-score transformation, default) or **[group-wise](#group-wise)** (centering).
+
+##### Gene-wise
+
+Z-scores make observations comparable by expressing them in multiples of the standard deviation of the given sample (here, the expression values of a gene across all samples). The gene-wise Z-score transformation procedure is illustrated in [Figure 2](./img/Z-score_transformation_gene_wise.png) and is described below.
+
+<img src="img/Z-score_transformation_gene_wise.png" width="30%"> 
+
+###### Figure 2
+>Gene-wise Z-score transformation scheme.
+
+* Extract expression values across all samples for a given **gene** ([Figure 2](./img/Z-score_transformation_gene_wise.png)A)
+* Compute **Z-scores** for individual samples (see equation in [Figure 2](./img/Z-score_transformation_gene_wise.png)B)
+* Compute **median Z-scores** for ([Figure 2](./img/Z-score_transformation_gene_wise.png)C):
+	1. **internal reference** set\*
+	2.  **external reference** set
+
+* Present patient sample **Z-score** in the context of the reference cohorts' **median Z-scores** ([Figure 2](./img/Z-score_transformation_gene_wise.png)D)
+
+\* used only for pancreatic cancer patients
+
+##### Group-wise
+
+The group-wise centering approach is presented in [Figure 3](./img/centering_group_wise.png) and is described below.
+
+
+<img src="img/centering_group_wise.png" width="30%"> 
+
+###### Figure 3
+>Group-wise centering scheme.
+
+* Extract expression values for ([Figure 3](./img/centering_group_wise.png)A):
+	1. patient **sample**
+	2. **internal reference** set\*
+	3.  **external reference** set
+	
+* For each gene compute **median expression** value in ([Figure 3](./img/centering_group_wise.png)B):
+	1. **internal reference** set\*
+	2.  **external reference** set
+	
+* **Center** the median expression values for each gene in individual groups ([Figure 3](./img/centering_group_wise.png)C)
+* Present patient sample **centered** expression values in the context of the reference cohorts' **centered** values ([Figure 3](./img/centering_group_wise.png)D)
+
+\* used only for pancreatic cancer patients
+
+
+## 3. Integration with WGS-based results
+
+For patients with available [WGS](./README.md#wgs) data processed using the *[umccrise](https://github.com/umccr/umccrise)* pipeline (see the ```--umccrise``` [argument](./README.md#arguments)), the expression level information for [mutated](#somatic-snvs-and-small-indels) genes or genes located within detected [structural variants](#structural-variants) (SVs) or [copy-number](#somatic-cnvs) (CN) [altered regions](#somatic-cnvs), as well as the genome-based findings, are incorporated and used as the primary source for prioritising expression profiles.
+
+### Somatic SNVs and small indels
+
+* Check if **[PCGR](https://github.com/sigven/pcgr)** output file (see [example](./data/test_data/umccrised/test_sample_WGS/pcgr/test_sample_WGS-somatic.pcgr.snvs_indels.tiers.tsv)) is available
+* **Extract** expression level **information** and genome-based findings for genes with detected genomic variants (use ```--pcgr_tier``` [argument](README.md/#arguments) to define the [tier](https://sigven.github.io/pcgr/articles/variant_classification.html) threshold value)
+* **Order genes** by increasing variant **[tier](https://sigven.github.io/pcgr/articles/variant_classification.html)** and then by decreasing absolute difference between expression levels in the patient sample and the corresponding reference cohort
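+
+The filtering and ordering described above can be sketched in a few lines of Python. This is only an illustration, not RNAsum code: the record fields (`gene`, `tier`, `expr_diff`), the `prioritise_by_tier` helper and the toy values are hypothetical, and the sketch assumes that the tier threshold keeps variants whose tier number is less than or equal to the threshold.
+
+```python
+def prioritise_by_tier(records, max_tier):
+    """Keep genes up to the tier threshold, then order them.
+
+    Ordering: increasing tier first, then decreasing absolute
+    expression difference between patient and reference cohort.
+    """
+    kept = [r for r in records if r["tier"] <= max_tier]
+    return sorted(kept, key=lambda r: (r["tier"], -abs(r["expr_diff"])))
+
+
+# Toy records; expr_diff stands for patient minus reference expression.
+variants = [
+    {"gene": "KRAS", "tier": 1, "expr_diff": -0.8},
+    {"gene": "TP53", "tier": 1, "expr_diff": 2.5},
+    {"gene": "BRCA2", "tier": 3, "expr_diff": -3.1},
+    {"gene": "CDKN2A", "tier": 2, "expr_diff": -1.9},
+    {"gene": "MYC", "tier": 4, "expr_diff": 4.0},
+]
+
+ordered = prioritise_by_tier(variants, max_tier=3)
+print([r["gene"] for r in ordered])
+```
+
+Negating the absolute difference in the sort key lets a single ascending sort handle both criteria, so genes within the same tier are ranked by how strongly their expression deviates from the reference in either direction.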
+
+### Structural variants
+
+* Check if **[Manta](https://github.com/Illumina/manta)** output file (see [example](./data/test_data/umccrised/test_sample_WGS/structural/test_sample_WGS-sv-prioritize-manta-pass.tsv)) is available
+* **Extract** expression level **information** and genome-based findings for genes located within detected SVs
+* **Order genes** by increasing **[SV score](https://github.com/vladsaveliev/simple_sv_annotation)** and then by decreasing absolute difference between expression levels in the patient sample and the corresponding reference cohort
+* **Compare** [gene fusions](./fusions) detected in [WTS](./README.md#wts) data ([arriba](https://arriba.readthedocs.io/en/latest/) and [pizzly](https://github.com/pmelsted/pizzly)) and [WGS](./README.md#wgs) data ([Manta](https://github.com/Illumina/manta))
+* **Prioritise** [WGS](./README.md#wgs)-supported [gene fusions](./fusions)
+
+### Somatic CNVs
+
+* Check if **[PURPLE](https://github.com/hartwigmedical/hmftools/blob/master/purple/README.md)** output file (see [example](./data/test_data/umccrised/test_sample_WGS/purple/test_sample_WGS.purple.gene.cnv)) is available
+* **Extract** expression level **information** and genome-based findings for genes located within detected CNVs (use ```--cn_loss``` and ```--cn_gain``` [arguments](README.md/#arguments) to define CN threshold values to classify genes within lost and gained regions)
+* **Order genes** by increasing (for genes within lost regions) or decreasing (for genes within gained regions) **[CN](https://github.com/umccr/umccrise/blob/master/workflow.md#somatic-cnv)** and then by decreasing absolute difference between expression levels in the patient sample and the corresponding reference cohort
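+
+A minimal Python sketch of the classification and ordering described above is shown here for illustration only. The field names (`gene`, `cn`, `expr_diff`), the `split_by_copy_number` helper, the threshold values and the inclusive comparisons are assumptions made for this example. They are not the defaults or the logic of ```--cn_loss``` and ```--cn_gain``` in RNAsum.
+
+```python
+def split_by_copy_number(records, cn_loss, cn_gain):
+    """Classify genes into lost and gained regions and order each group.
+
+    Lost genes: increasing CN, then decreasing absolute expression difference.
+    Gained genes: decreasing CN, then decreasing absolute expression difference.
+    Genes between the two thresholds are left out of both groups.
+    """
+    lost = sorted(
+        (r for r in records if r["cn"] <= cn_loss),
+        key=lambda r: (r["cn"], -abs(r["expr_diff"])),
+    )
+    gained = sorted(
+        (r for r in records if r["cn"] >= cn_gain),
+        key=lambda r: (-r["cn"], -abs(r["expr_diff"])),
+    )
+    return lost, gained
+
+
+# Toy records; the thresholds below are hypothetical values.
+cnv_genes = [
+    {"gene": "PTEN", "cn": 1.0, "expr_diff": -1.0},
+    {"gene": "MYC", "cn": 8.0, "expr_diff": 2.9},
+    {"gene": "CDKN2A", "cn": 0.1, "expr_diff": -3.2},
+    {"gene": "TP53", "cn": 2.0, "expr_diff": 0.3},
+    {"gene": "SMAD4", "cn": 1.0, "expr_diff": -2.4},
+    {"gene": "ERBB2", "cn": 12.0, "expr_diff": 4.1},
+]
+
+lost, gained = split_by_copy_number(cnv_genes, cn_loss=1.5, cn_gain=6.0)
+print([r["gene"] for r in lost], [r["gene"] for r in gained])
+```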
+
+## 4. Results annotation
+
+[WTS](./README.md#wts)- and/or [WGS](./README.md#wgs)-based results for the altered genes are collated with **knowledge** derived from in-house resources and public **databases** (listed below) to provide an additional source of evidence for their significance, e.g. to flag variants with clinical significance or potential druggable targets.
+
+### Key cancer genes
+
+* [UMCCR key cancer genes set](https://github.com/umccr/NGS_Utils/blob/master/ngs_utils/reference_data/key_genes/make_umccr_cancer_genes.Rmd) built from several sources:
+	* [Cancermine](http://bionlp.bcgsc.ca/cancermine/) with at least 2 publications with at least 3 citations
+	* [NCG known cancer genes](http://ncg.kcl.ac.uk/)
+	* Tier 1 [COSMIC Cancer Gene Census](https://cancer.sanger.ac.uk/census) (CGC)
+	* [CACAO](https://github.com/sigven/cacao) hotspot genes (curated from [ClinVar](https://www.ncbi.nlm.nih.gov/clinvar/), [CIViC](https://civicdb.org/), [Cancer Hotspots](https://www.cancerhotspots.org/))
+	* At least 2 matches in the following 5 sources and 8 clinical panels:
+		* Cancer predisposition genes ([CPSR](https://github.com/sigven/cpsr) list)
+		* [COSMIC Cancer Gene Census](https://cancer.sanger.ac.uk/census) (tier 2)
+		* AstraZeneca 300 (AZ300)
+		* Familial Cancer
+		* [OncoKB](https://oncokb.org/) annotated
+		* MSKC-IMPACT
+		* MSKC-Heme
+		* PMCC-CCP
+		* Illumina-TS500
+		* TEMPUS
+		* Foundation One
+		* Foundation Heme
+		* Vogelstein
+
+* Used for extracting expression levels of cancer genes (presented in the `Cancer genes` report section)
+* Used to prioritise candidate [fusion genes](./fusions)
+
+### OncoKB
+
+* The [OncoKB](https://oncokb.org/cancerGenes) gene list is used to annotate altered genes across various sections in the report (annotations and URL links in `External resources` column in report `Summary tables`)
+
+
+### VICC
+
+* The [Variant Interpretation for Cancer Consortium](https://cancervariants.org/) (VICC) knowledgebase is used to annotate altered genes across various sections in the report (annotations and URL links in `External resources` column in report `Summary tables`)
+
+
+### CIViC
+
+* The [Clinical Interpretation of Variants in Cancer](https://civicdb.org/) (CIViC) database is used to annotate altered genes across various sections in the report (annotations and URL links in `External resources` column in report `Summary tables`) 
+* Used to flag clinically actionable aberrations in the `Drug matching` report section
+
+### CGI
+
+* The [Cancer Genome Interpreter](https://www.cancergenomeinterpreter.org/biomarkers) (CGI) database is used to flag genes known to be involved in gene fusions and to prioritise candidate [fusion genes](./fusions)
+ 
+### FusionGDB
+
+* The [FusionGDB](https://ccsm.uth.edu/FusionGDB/) database is used to flag genes known to be involved in gene fusions and to prioritise candidate [gene fusions](./fusions)
+
+## 5. Report generation
+
+The final HTML-based ***Patient Transcriptome Summary*** **report** contains searchable tables and interactive plots presenting expression levels of altered genes, as well as links to public resources providing additional sources of evidence for their significance. The individual **[report sections](report_structure.md)**, **[results prioritisation](report_structure.md)** and **[visualisation](report_structure.md)** are described in more detail in [report_structure.md](report_structure.md).
+