Pre-configured workflows
NGS-pipe provides pre-configured workflows for whole-exome sequencing (WES), whole-genome sequencing (WGS) and RNA-seq analysis. Each workflow is described below in detail, including the tools used and the reasons for choosing them.
The goal of the WES workflow is to obtain somatic single nucleotide variants as well as copy number changes. It combines several tools; an overview is provided in the following figure.
In order to remove adapter contamination we use Trimmomatic (Bolger, 2014). Trimmomatic is a well-known tool that provides a palindrome mode, which efficiently removes adapters from paired-end reads in case of read-through. Afterwards the reads are mapped using BWA-MEM (Li, 2009a, 2013). BWA is widely used and recommended by the GATK Best Practices. The mapped reads are then sorted and possible mate pair artifacts are resolved using Picard tools. Then all reads from each sample are merged and secondary alignments are removed using Picard tools and SAMtools (Li, 2009b). This step is followed by the removal of PCR duplicates with Picard tools. Realignment around indels and base recalibration are then performed using GATK (McKenna, 2010), in accordance with the GATK Best Practices. The resulting BAM files are processed with Mutect2 (part of GATK), VarScan (Koboldt, 2012) and Strelka (Saunders, 2012), and their results are combined with GATK's variant combine tool. We chose this approach in order to make use of the strengths of the different variant callers. At the same time, we require that a mutation is identified by at least two callers in order to achieve high specificity. In the last step, the resulting mutations are annotated with SnpSift (Cingolani, 2012b) and SnpEff (Cingolani, 2012a) using dbSNP (Sherry, 1999, 2001) and COSMIC (Forbes, 2014).
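The requirement that a mutation be identified by at least two callers can be illustrated with a minimal Python sketch. It is not the pipeline's implementation (the workflow combines the call sets with GATK); the function name `consensus_variants`, the tuple representation of a variant and the toy call sets are assumptions made for illustration only.

```python
from collections import Counter
from typing import Dict, Set, Tuple

# Hypothetical representation: (chromosome, position, reference allele, alternate allele).
Variant = Tuple[str, int, str, str]


def consensus_variants(calls_per_caller: Dict[str, Set[Variant]],
                       min_callers: int = 2) -> Set[Variant]:
    """Return the variants reported by at least `min_callers` callers."""
    support = Counter()
    for variants in calls_per_caller.values():
        # Each caller holds a set, so it adds at most one vote per variant.
        support.update(variants)
    return {variant for variant, votes in support.items() if votes >= min_callers}


# Toy call sets (made-up data, not real results).
calls = {
    "mutect2": {("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")},
    "varscan": {("chr1", 100, "A", "T"), ("chr3", 300, "C", "G")},
    "strelka": {("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")},
}

for variant in sorted(consensus_variants(calls)):
    print(variant)
```

In this toy example the variant on chr3 is dropped because only one caller reports it. Raising `min_callers` to 3 would trade sensitivity for even higher specificity.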
In contrast to the WES workflow, whose goal is to identify somatic mutations, including those with low frequency, the goal of the WGS workflow is to obtain the copy number changes between a test sample and a control sample. For WGS data sets the coverage is usually much lower than for WES data sets, so somatic mutation calling is often not possible. However, one can run the WES workflow on WGS data by simply providing a regions file that covers the whole genome. The following describes the workflow for identifying copy number changes in WGS data.
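A regions file covering the whole genome could be derived from the reference's FASTA index, as in the following hedged sketch. It assumes that the regions file is in BED format (0-based, half-open intervals) and that a `.fai` index is available; the exact format NGS-pipe expects should be checked against its configuration. The helper name `fai_to_bed` and the contig names are hypothetical.

```python
def fai_to_bed(fai_lines):
    """Turn lines of a FASTA index (.fai) into whole-contig BED lines.

    A .fai line is tab-separated: name, length, offset, line bases, line width.
    Only the first two columns are needed here.
    """
    bed_lines = []
    for line in fai_lines:
        if not line.strip():
            continue  # skip empty lines
        name, length = line.rstrip("\n").split("\t")[:2]
        # BED intervals start at 0 and end at the contig length (exclusive).
        bed_lines.append(f"{name}\t0\t{int(length)}")
    return bed_lines


# Toy index with two made-up contigs.
toy_fai = [
    "contigA\t1000\t9\t60\t61\n",
    "contigB\t250\t1035\t60\t61\n",
]

for bed_line in fai_to_bed(toy_fai):
    print(bed_line)
```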
In order to remove adapter contamination we use Trimmomatic (Bolger, 2014). Trimmomatic is a well-known tool that provides a palindrome mode, which efficiently removes adapters from paired-end reads in case of read-through. Afterwards the reads are mapped using BWA-MEM (Li, 2009a, 2013). BWA is widely used and recommended by the GATK Best Practices. The mapped reads are then sorted and possible mate pair artifacts are resolved using Picard tools. Then all reads from each sample are merged and secondary alignments are removed using Picard tools and SAMtools (Li, 2009b). This step is followed by the removal of PCR duplicates with Picard tools. We then use BIC-seq2 (Xi, 2016) to call copy number changes and annotate them using ANNOVAR (Wang, 2010).
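To make the removal of secondary alignments concrete, the sketch below filters SAM text records by their FLAG field. In the SAM format, bit 0x100 (decimal 256) marks a secondary alignment. This is only an illustration of the idea: the workflow itself uses Picard tools and SAMtools for this step, and the function name `drop_secondary` and the toy records are assumptions.

```python
SECONDARY_FLAG = 0x100  # SAM FLAG bit that marks a secondary alignment


def drop_secondary(sam_lines):
    """Keep header lines and every alignment whose secondary bit is not set."""
    kept = []
    for line in sam_lines:
        if line.startswith("@"):
            kept.append(line)  # header lines are passed through unchanged
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if not flag & SECONDARY_FLAG:
            kept.append(line)
    return kept


# Toy SAM records (made-up reads); flag 355 = 99 + 256, i.e. a secondary alignment.
toy_sam = [
    "@HD\tVN:1.6\tSO:coordinate",
    "read1\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII",
    "read1\t355\tchr2\t500\t0\t4M\t=\t600\t104\tACGT\tIIII",
    "read2\t0\tchr1\t150\t60\t4M\t*\t0\t0\tTTGA\tIIII",
]

for record in drop_secondary(toy_sam):
    print(record)
```

An exclusion filter on flag 256, for example `samtools view -F 256`, has the same effect on real BAM files; whether the pipeline uses exactly this command is not stated here.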
The goal in the analysis of RNA-seq experiments is to provide reliable information on which genes are expressed in a sample and how often (referred to as gene counting). Statisticians can then use this information, for example, to estimate differential gene expression.
In order to remove adapter contamination we use Trimmomatic (Bolger, 2014). Trimmomatic is a well-known tool that provides a palindrome mode, which efficiently removes adapters from paired-end reads in case of read-through. We furthermore use Trimmomatic to remove low-quality bases from the ends of sequences. The resulting sequences are used as input for the STAR aligner (Dobin, 2013), a high-performance, RNA-specific read mapper. The last step of the analysis is gene counting, for which we use featureCounts (Liao, 2013), a multithreaded tool capable of assigning reads to genes in only a few minutes.
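The idea behind gene counting can be shown with a deliberately simplified Python sketch: each aligned read is assigned to the gene whose interval it overlaps, and the assignments are tallied per gene. This is not how featureCounts is implemented. Real gene counting works on exon annotations and considers strand and read pairing, and skipping reads that overlap more than one gene is a choice made for this sketch. The function name `count_reads_per_gene` and all coordinates are made up.

```python
def count_reads_per_gene(genes, reads):
    """Count reads per gene.

    genes: dict mapping gene name -> (chromosome, start, end)
    reads: list of (chromosome, start, end) alignments
    All intervals are half-open: start is included, end is excluded.
    """
    counts = {name: 0 for name in genes}
    for chrom, start, end in reads:
        hits = [
            name
            for name, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and start < g_end and g_start < end
        ]
        # Count the read only if it can be assigned to exactly one gene.
        if len(hits) == 1:
            counts[hits[0]] += 1
    return counts


# Toy annotation and alignments (hypothetical values).
toy_genes = {
    "geneA": ("chr1", 100, 200),
    "geneB": ("chr1", 150, 300),
}
toy_reads = [
    ("chr1", 110, 140),  # overlaps geneA only
    ("chr1", 160, 190),  # overlaps both genes, so it is skipped
    ("chr1", 250, 280),  # overlaps geneB only
    ("chr2", 110, 140),  # overlaps no gene
]

print(count_reads_per_gene(toy_genes, toy_reads))
```

The resulting table of counts per gene is the kind of input that downstream differential expression analysis starts from.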
Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, 30(15), 2114-2120.
Cingolani, P., Platts, A., Wang, L. L., Coon, M., Nguyen, T., Wang, L., ... & Ruden, D. M. (2012a). A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly, 6(2), 80-92.
Cingolani, P., Patel, V. M., Coon, M., Nguyen, T., Land, S. J., Ruden, D. M., & Lu, X. (2012b). Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift. Toxicogenomics in non-mammalian species, 3, 35.
Dobin, A., Davis, C. A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., ... & Gingeras, T. R. (2013). STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29(1), 15-21.
Forbes, S. A., Beare, D., Gunasekaran, P., Leung, K., Bindal, N., Boutselakis, H., ... & Kok, C. Y. (2014). COSMIC: exploring the world's knowledge of somatic mutations in human cancer. Nucleic acids research, 43(D1), D805-D811.
Koboldt, D. C., Zhang, Q., Larson, D. E., Shen, D., McLellan, M. D., Lin, L., ... & Wilson, R. K. (2012). VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome research, 22(3), 568-576.
Li, H., & Durbin, R. (2009a). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14), 1754-1760.
Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997.
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., ... & Durbin, R. (2009b). The sequence alignment/map format and SAMtools. Bioinformatics, 25(16), 2078-2079.
Liao, Y., Smyth, G. K., & Shi, W. (2013). featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics, 30(7), 923-930.
McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., ... & DePristo, M. A. (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome research, 20(9), 1297-1303.
Saunders, C. T., Wong, W. S., Swamy, S., Becq, J., Murray, L. J., & Cheetham, R. K. (2012). Strelka: accurate somatic small-variant calling from sequenced tumor–normal sample pairs. Bioinformatics, 28(14), 1811-1817.
Sherry, S. T., Ward, M., & Sirotkin, K. (1999). dbSNP—database for single nucleotide polymorphisms and other classes of minor genetic variation. Genome research, 9(8), 677-679.
Sherry, S. T., Ward, M. H., Kholodov, M., Baker, J., Phan, L., Smigielski, E. M., & Sirotkin, K. (2001). dbSNP: the NCBI database of genetic variation. Nucleic acids research, 29(1), 308-311.
Quinlan, A. R., & Hall, I. M. (2010). BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26(6), 841-842.
Wang, K., Li, M., & Hakonarson, H. (2010). ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic acids research, 38(16), e164-e164.
Xi, R., Lee, S., Xia, Y., Kim, T. M., & Park, P. J. (2016). Copy number analysis of whole-genome data using BIC-seq2 and its application to detection of cancer susceptibility variants. Nucleic acids research, 44(13), 6274-6286.