Stephen Fisher edited this page Mar 19, 2014 · 75 revisions

ngs: Next Generation Sequencing Pipeline

ngs is designed to automate the processing of next generation sequencing data. The master script ngs.sh is used to run the various 'modules' (aka 'commands') that perform individual tasks. Each module should perform a single task. The module name is the uppercase form of the command name. For example, the 'ngs_INIT.sh' file contains the code for the 'INIT' module, which performs the 'init' command. While the module files end in '.sh', they are not meant to be run outside of the ngs.sh script.
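The naming convention can be sketched in shell as follows. The function name module_file_for_command is hypothetical and is not taken from ngs.sh; it only illustrates how a lowercase command maps to a module file name. Note that SPAdes is listed in mixed case, so this simple rule may not cover every module.

```shell
# Hypothetical helper (not part of ngs.sh): derive a module file name from
# a command name, following the convention 'init' -> 'ngs_INIT.sh'.
module_file_for_command() {
    # Uppercase the command name with tr, which is portable across shells.
    mod_name=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
    printf 'ngs_%s.sh\n' "$mod_name"
}

module_file_for_command init
module_file_for_command fastqc
```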

Current modules:

  • HELP: expanded help for other modules
  • INIT: prepare read file(s) for processing
  • FASTQC: run FastQC
  • BLAST: run blast on randomly sampled subset of reads
  • TRIM: trim adapter and poly-A/T contamination (requires Python 2.7 or later)
  • RUM: run RUM on trimmed reads
  • RUMSTATUS: get status of RUM run
  • STAR: run STAR on trimmed reads
  • BOWTIE: run bowtie on trimmed reads
  • SNP: perform SNP calling on bowtie BAM file
  • SPAdes: run SPAdes on trimmed reads
  • HTSEQ: run HTSeq on unique mappers from RUM
  • POST: clean up trimmed data
  • BLASTDB: create blast database from reads
  • RSYNC: copy data to analyzed directory
  • STATS: print stats from blast, trimming and STAR
  • PIPELINE: run full pipeline
  • VERSION: print version information
  • TEMPLATE: this is an empty module that can be used as an example to build new modules

Using the ngs.sh script, modules can be run on their own, although modules may depend on the output from other modules. For example, the INIT module places the raw read files (uncompressed) in a directory called 'orig'. The TRIM module, by default, looks for the uncompressed read files in the 'orig' directory and places the trimmed read files in a directory called 'trim'. The STAR module uses the reads from the 'trim' directory and hence is expected to run after TRIM.
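Because modules depend on directories created by earlier modules, it can help to confirm that the expected input directory exists before running a module by hand. The helper below is a minimal sketch; has_module_input is a hypothetical name, and the real modules may perform their own checks.

```shell
# Hypothetical helper (not part of ngs.sh): report whether the input
# directory a module expects is present under a sample directory.
# Directory names follow the text above ('orig' for TRIM, 'trim' for STAR).
has_module_input() {
    sample_dir=$1    # e.g. ./mySample
    input_subdir=$2  # e.g. orig or trim
    if [ -d "$sample_dir/$input_subdir" ]; then
        echo "ok: $sample_dir/$input_subdir exists"
    else
        echo "missing: $sample_dir/$input_subdir (run the module that creates it first)"
        return 1
    fi
}
```

For example, 'has_module_input ./mySample trim' would report whether TRIM output is available before STAR is run.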

The PIPELINE module is effectively a meta-module and runs the following modules: init, fastqc, blast, trim, star, post, blastdb, htseq, rsync
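To review what each step of the pipeline expects, the HELP module can be called once per module in that list. The sketch below only prints the commands rather than running them; the variable and function names are illustrative.

```shell
# Modules run by PIPELINE, in the order listed above.
PIPELINE_MODULES="init fastqc blast trim star post blastdb htseq rsync"

# Print (not run) the help command for each module in the pipeline.
print_pipeline_help_commands() {
    for pipeline_module in $PIPELINE_MODULES; do
        echo "ngs.sh help $pipeline_module"
    done
}

print_pipeline_help_commands
```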


Installation and Running

To install the pipeline, just download the files from GitHub and place them in a directory that is in your executable PATH. Make sure ngs.sh and the *.py files are executable. Also be sure you have the required additional programs installed (see Requirements below).
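A quick way to confirm the installation is to ask the shell whether it can find ngs.sh. This is a minimal sketch; is_on_path is a hypothetical helper name.

```shell
# Hypothetical pre-flight check: is a program reachable via PATH?
is_on_path() {
    command -v "$1" >/dev/null 2>&1
}

if is_on_path ngs.sh; then
    echo "ngs.sh found at: $(command -v ngs.sh)"
else
    echo "ngs.sh not found; add its directory to PATH and run chmod +x on ngs.sh and the *.py files"
fi
```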

The ngs.sh script is the master script. You use this script to run each of the modules. For example, you can run the HELP module to get expanded help on other modules, including the input files, output files, and required programs needed for that module to function. The following command would display documentation on the BLAST module:

ngs.sh help blast

You can either run the modules manually or all at once with the PIPELINE module. To start off manually, you would probably want to run the INIT module, which copies the raw fastq files out of your data repository and into a subdirectory for processing. The following command would run the INIT module, uncompressing (ungzip) the fastq read files from the directory '/lab/repo/E.43/raw/mySample'. The fastq file(s) containing the first reads are expected to have 'R1' in the file name(s). The second-read fastq file(s) must have 'R2' in the file name(s). The uncompressed files would be put in the subdirectory './sampleID/orig'. This subdirectory will be created if necessary.

ngs.sh init -i /lab/repo/E.43/raw mySample

NOTE: there is a space between 'raw' and 'mySample'.
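The R1/R2 naming rule can be illustrated with a small shell function. This is only a sketch of the rule as stated above: read_mate_of is a hypothetical name, the file names are placeholders, and the INIT module itself may match file names differently.

```shell
# Hypothetical helper: classify a fastq file name as first read, second
# read, or unknown, based on whether it contains 'R1' or 'R2'.
read_mate_of() {
    case "$1" in
        *R1*) echo "first" ;;
        *R2*) echo "second" ;;
        *)    echo "unknown" ;;
    esac
}

read_mate_of mySample_R1.fastq.gz
read_mate_of mySample_R2.fastq.gz
```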

As with the other modules, the PIPELINE module is run using the ngs.sh command. The following command would run the PIPELINE module using fastq files from '/lab/repo/E.51/raw/mySample'. When the pipeline is complete, the generated files (e.g. alignment files, trimmed read files, etc.) would be copied to '/lab/repo/E.51/analyzed/mySample'.

ngs.sh pipeline -i /lab/repo/E.51/raw -o /lab/repo/E.51/analyzed -p 8 -s mm10 mySample
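When several samples share the same raw and analyzed directories, the same command can be generated once per sample. The sketch below prints the commands instead of running them, so they can be reviewed first. The function name and the sample names sampleA and sampleB are placeholders, and the '-p 8' and '-s mm10' values are copied unchanged from the example above.

```shell
# Print (not run) one pipeline command per sample, reusing the options
# from the documented example. Arguments: raw dir, output dir, sample IDs.
print_pipeline_commands() {
    raw_dir=$1; shift
    out_dir=$1; shift
    for sample_id in "$@"; do
        echo "ngs.sh pipeline -i $raw_dir -o $out_dir -p 8 -s mm10 $sample_id"
    done
}

print_pipeline_commands /lab/repo/E.51/raw /lab/repo/E.51/analyzed sampleA sampleB
```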

Requirements

System

  • Only tested on Linux OS (RHEL 6.x). Will likely work on a Mac. May work on Windows with Cygwin. System specs are dictated by the modules used.

Note that the "-P" flag in grep is used in some cases. This works on RHEL 6.x (i.e. the GNU version of grep). On Mac 10.9 (i.e. the BSD version of grep) the "-P" flag isn't required.
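To find out how the local grep behaves, a short probe can be run before using the pipeline on a new system. This is a sketch only; grep_supports_P is a hypothetical helper and is not part of the pipeline.

```shell
# Report whether the local grep accepts the -P flag by running a tiny
# match and checking only whether grep succeeded.
grep_supports_P() {
    if printf 'abc\n' | grep -P 'a.c' >/dev/null 2>&1; then
        return 0
    else
        return 1
    fi
}

if grep_supports_P; then
    echo "grep -P is available"
else
    echo "grep -P is not available on this system"
fi
```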

External Programs Required per Module

Note that the Python scripts included here are hardcoded to use /usr/bin/Python. If Python is not installed in /usr/bin, then these files will need to be updated to point to the appropriate Python. Similarly, if /usr/bin/Python is older than 2.7, trimReads.py will need to be manually updated to point to a newer version of Python.
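The version requirement can be checked from the shell before running TRIM. The sketch below is an assumption-laden illustration: version_at_least and PYTHON_BIN are hypothetical names, and the default interpreter path should be adjusted to wherever Python lives on your system.

```shell
# Hypothetical helper: succeed if a dotted version string (e.g. "2.7.5")
# is at least the given major and minor numbers (e.g. 2 7).
version_at_least() {
    ver_major=${1%%.*}
    ver_rest=${1#*.}
    ver_minor=${ver_rest%%.*}
    [ "$ver_major" -gt "$2" ] || { [ "$ver_major" -eq "$2" ] && [ "$ver_minor" -ge "$3" ]; }
}

# Assumption: interpreter path; override PYTHON_BIN to match your system.
PYTHON_BIN=${PYTHON_BIN:-/usr/bin/python}
if [ -x "$PYTHON_BIN" ]; then
    py_version=$("$PYTHON_BIN" -c 'import sys; print("%d.%d" % sys.version_info[:2])')
    if version_at_least "$py_version" 2 7; then
        echo "$PYTHON_BIN is version $py_version (2.7 or later)"
    else
        echo "$PYTHON_BIN is version $py_version (older than 2.7)"
    fi
else
    echo "$PYTHON_BIN not found or not executable"
fi
```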

Resource Locations

The pipeline requires various genome and transcriptome library files. The location of these library files is currently hardcoded in the ngs.sh script. These library locations may need to be adjusted for your environment. If a module is not used, then that library is not required. For example, if the user only plans to run STAR, skipping Bowtie, RUM, and HTSeq, then only the STAR library files are required.

  • BOWTIE_REPO = /lab/repo/resources/bowtie
    • Location of the Bowtie databases.
  • RUM_REPO = /lab/repo/resources/rum2
    • Location of the RUM (version 2) databases.
  • STAR_REPO = /lab/repo/resources/star
    • Location of the STAR databases.
  • HTSEQ_REPO = /lab/repo/resources/htseq
    • Location of the HTSeq databases.
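Before running a module, it may be useful to confirm that the library directories it needs exist on your system. The sketch below uses the default paths listed above and lets them be overridden from the environment for this check only; in the pipeline itself the locations are set inside ngs.sh. The report_repo function is a hypothetical helper.

```shell
# Library locations as listed above; override in the environment if your
# paths differ. This affects only this check, not ngs.sh itself.
BOWTIE_REPO=${BOWTIE_REPO:-/lab/repo/resources/bowtie}
RUM_REPO=${RUM_REPO:-/lab/repo/resources/rum2}
STAR_REPO=${STAR_REPO:-/lab/repo/resources/star}
HTSEQ_REPO=${HTSEQ_REPO:-/lab/repo/resources/htseq}

# Report whether a library directory is present.
report_repo() {
    repo_name=$1
    repo_path=$2
    if [ -d "$repo_path" ]; then
        echo "$repo_name: found ($repo_path)"
    else
        echo "$repo_name: missing ($repo_path)"
    fi
}

report_repo BOWTIE_REPO "$BOWTIE_REPO"
report_repo RUM_REPO "$RUM_REPO"
report_repo STAR_REPO "$STAR_REPO"
report_repo HTSEQ_REPO "$HTSEQ_REPO"
```

A missing directory matters only if the corresponding module will be used.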

Additional Documentation

When many sequencing runs are being managed and processed, storing the data gets complicated. The Data Repository page outlines a simple scheme for storing NGS data that works well with this NGS pipeline.

Ancillary Files

The Ancillary directory contains files that are not required by any of the modules provided herein, although they may be helpful for preparing files for use with the pipeline and/or processing files after the pipeline has run.

Version Tracking

When a module is run, a file ("sampleID.version") is created in the module subdirectory. The sampleID.version file is a tab-delimited list containing the external programs used by that module and their respective version numbers. This file also includes the location of the species library used by that module, if relevant. See Data Repository for an understanding of the expected directory structure.
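Since the version file is tab-delimited, it can be read with a short shell loop. The sketch below assumes each line holds a name and a value separated by a tab; the exact column layout is an assumption and may differ from what the modules write. The function name show_versions is hypothetical.

```shell
# Hypothetical reader for a sampleID.version file. Assumes each line is
# "name<TAB>value"; any further columns stay attached to the value.
show_versions() {
    tab=$(printf '\t')
    while IFS="$tab" read -r field_name field_value; do
        if [ -n "$field_name" ]; then
            echo "$field_name = $field_value"
        fi
    done < "$1"
}
```

It would be called with the path to a version file in a module subdirectory, for example 'show_versions mySample/trim/mySample.version' (that path is illustrative).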


No Warranty

Unless otherwise noted in the individual applications, the following disclaimer applies to all applications provided herein.

There is no warranty to the extent permitted by applicable law. Except when otherwise stated in writing the copyright holders and/or other parties provide these applications "as is" without warranty of any kind, either expressed or implied, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. The entire risk as to the quality and performance of these applications, and data is with you. Should these applications or data prove defective, you assume the cost of all necessary servicing, repair or correction.

In no event unless required by applicable law or agreed to in writing will any copyright holder, or any other party who may modify and/or redistribute these applications as permitted above, be liable to you for damages, including any general, special, incidental or consequential damages arising out of the use or inability to use these applications and data (including but not limited to loss of data or data being rendered inaccurate or losses sustained by you or third parties or a failure of these applications and data to operate with any other programs), even if such holder or other party has been advised of the possibility of such damages.


Credits

The pipeline was developed by Stephen Fisher and Junhyong Kim at the University of Pennsylvania. Licensing information can be found in each file or here: http://kim.bio.upenn.edu/software/LICENSE
