TOPMed RNA-seq pipeline

The TOPMed RNA-seq pipeline was converted to CWL as a deliverable, so that a CWL version of the pipeline is available through a public Tool Registry Service. Specifically, this workflow is available through Dockstore.org.

Workflow description

This document describes team Helium's implementation of the TOPMed RNA-seq pipeline as described in commit b65c22b. The CWL workflow is registered publicly on Dockstore here. This CWL workflow has four components, described below.

A checker workflow registered on Dockstore is also available to verify operation of this pipeline. See information here.

The scripts and settings used for the TOPMed MESA RNA-seq pilot match commit 725a2bc, packaged here.

Intended Audience

The intended audience is any scientist familiar with RNA-seq analysis who wishes to run RNA-seq analysis on the TOPMed public access data.

Quick Start

Run the pipeline locally with small test input files. Creating these sample input files is described here.

  1. Dockstore CLI, CWLTool, Git, Git LFS and Docker should be installed.
  2. Clone this GitHub repository:
    git clone https://github.com/heliumdatacommons/cwl_workflows.git
    
  3. Decompress sample files.
    ./topmed-workflows/TOPMed_RNAseq_pipeline/input-examples/download_examples.sh
    
  4. Use this input file or edit the file paths based on your local machine paths.
  5. Run the workflow with CWLTool.
    cwltool topmed-workflows/TOPMed_RNAseq_pipeline/rnaseq_pipeline_fastq.cwl \
    topmed-workflows/TOPMed_RNAseq_pipeline/input-examples/Dockstore.json
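
Before running these steps, it can help to confirm that the tools from step 1 are on the PATH. The sketch below is not part of this repository: `missing_tools` is a made-up helper name, and the executable names passed to it (`dockstore`, `cwltool`, `git`, `git-lfs`, `docker`) are the usual command names for these tools, which is an assumption about your installation.

```shell
# Print the name of every listed command that is not found on PATH.
# missing_tools is a hypothetical helper, not something shipped with this repository.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example: list any Quick Start prerequisite that still needs installing.
missing_tools dockstore cwltool git git-lfs docker
```

If the last command prints nothing, all five tools were found.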
    

Checker Workflow

A checker workflow for the TOPMed RNA-seq pipeline is published on Dockstore here. It is described in more detail in this README.md.

Sample data sets

The sample data sets intended to be used as input are available through this BioProject.

Creating downsampled datasets for testing is described here.

Pipeline components

The OUTPUT entries below describe the files generated by the TOPMed RNA-seq pipeline for each sample.

  • Alignment: STAR 2.5.3a
    • STAR CWL File
    • Python script run by CWL file in Docker container: run_STAR.py
    • INPUT: STAR Index and sample FASTQ files. See example input file.
      • See here to create STAR Index
    • OUTPUT: Aligned RNA-seq reads in BAM format.
  • Post-processing: Picard 2.9.0 MarkDuplicates
    • Picard MarkDuplicates CWL File
    • Python script run by CWL file in Docker container: run_MarkDuplicates.py
    • INPUT: Aligned BAM file from STAR. See example input
    • OUTPUT: Marked duplicates BAM file.
      • A BAM index file must then be created with Samtools index: CWL File, example input
  • Gene quantification and quality control: RNA-SeQC 1.1.9
    • RNA-SeQC CWL File
    • Python script run by CWL file in Docker container: run_rnaseqc.py
    • INPUT: Genome FASTA, GTF file, Aligned BAM file from STAR. See example input
    • OUTPUT:
      • Gene-level expression quantifications based on a collapsed version of a reference transcript annotation, provided as read counts and TPM.
      • Standard quality control metrics derived from the aligned reads.
  • Transcript quantification: RSEM 1.3.0
    • RSEM CWL File
    • Python script run by CWL file in Docker container: run_RSEM.py
    • INPUT: RSEM reference files, BAM with reads aligned to transcriptome from STAR. See example input
      • See here to create RSEM reference directory.
    • OUTPUT: Transcript-level expression quantifications, provided as TPM, expected read counts, and isoform percentages.
  • Utilities: SAMtools 1.6 and HTSlib 1.6
    • Samtools index is used to create .bai files for input .bam files. CWL File, example input
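
The BAM indexing step mentioned above can also be run by hand. This is a minimal sketch, assuming the pipeline image exposes `samtools` on its PATH (as the `samtools faidx` command later in this document suggests). `index_bam`, the `/data` mount point, and the file names in the usage line are hypothetical.

```shell
# Index a BAM file with the samtools copy inside the pipeline image.
# index_bam is a hypothetical helper: $1 is an absolute host directory,
# $2 is the name of a BAM file inside that directory.
index_bam() {
    docker run --rm -v "$1":/data heliumdatacommons/topmed-rnaseq \
        samtools index "/data/$2"
}
```

For example, `index_bam "$HOME/output_files" sample.md.bam` would write `sample.md.bam.bai` next to the BAM file, since `samtools index` places the index beside its input by default.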

Alternative Approaches

Many other software packages provide functionality similar to this pipeline. For detailed information on RNA-seq analysis steps and other software options, please see A survey of best practices for RNA-seq data analysis.

Docker Image

The GTEx pipeline Docker container is currently republished on Docker Hub.

Obtaining the Docker image:

  1. Docker should be installed. See here if not.
  2. Pull the image from Docker Hub
    docker pull heliumdatacommons/topmed-rnaseq:latest
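
After pulling, one way to confirm the image works is to ask the bundled tools for their versions. This is a sketch under the assumption that `STAR`, `samtools`, and `rsem-calculate-expression` are all on the image's PATH. `show_tool_versions` is a hypothetical helper name.

```shell
# Print the versions of the main tools bundled in the image.
# show_tool_versions is a hypothetical helper; $1 optionally overrides the image name.
show_tool_versions() {
    image="${1:-heliumdatacommons/topmed-rnaseq:latest}"
    docker run --rm "$image" STAR --version
    docker run --rm "$image" samtools --version
    docker run --rm "$image" rsem-calculate-expression --version
}
```

Calling `show_tool_versions` with no arguments checks the image pulled above. The reported versions can then be compared with those listed under Pipeline components.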
    

Create required inputs

The following steps assume:

  1. You have downloaded the following files:

    $ wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_26/gencode.v26.annotation.gtf.gz
    $ gunzip gencode.v26.annotation.gtf.gz
    
    $ wget https://personal.broadinstitute.org/francois/topmed/Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.tar.gz
    $ tar -xzf Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.tar.gz
    
  2. You have obtained the Docker container described here

Create .fai file

Create the index file using samtools faidx.

~/input_files contains the Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.fasta file.

docker run --rm -v ~/input_files:/input_files heliumdatacommons/topmed-rnaseq \
    samtools faidx /input_files/Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.fasta
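
`samtools faidx` writes the index next to the FASTA file, with `.fai` appended to the file name. The index is a tab-separated text file whose first two columns are the sequence name and its length, so a quick look at those columns is a reasonable sanity check. `fai_contigs` below is a hypothetical helper, shown only as a sketch.

```shell
# Print sequence name and length (columns 1 and 2) for each entry in a .fai index.
# fai_contigs is a hypothetical helper name; $1 is the path to the .fai file.
fai_contigs() {
    cut -f1,2 "$1"
}
```

For example, `fai_contigs ~/input_files/Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.fasta.fai | head` lists the first few sequences and their lengths.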

Create .dict file

Create the dictionary file using Picard CreateSequenceDictionary.

~/input_files contains the Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.fasta file.

docker run --rm -v ~/input_files:/input_files heliumdatacommons/topmed-rnaseq \
    java -jar /opt/picard-tools/picard.jar CreateSequenceDictionary \
    R=/input_files/Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.fasta \
    O=/input_files/Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.dict
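
The `.dict` file is a SAM-style header with one `@SQ` line per reference sequence, so it should list as many sequences as the `.fai` index has lines. The sketch below compares the two counts. `dict_matches_fai` is a hypothetical helper and not part of the pipeline.

```shell
# Succeed when the .dict file has one @SQ line per entry in the .fai index.
# dict_matches_fai is a hypothetical helper: $1 is the .dict path, $2 the .fai path.
dict_matches_fai() {
    # grep -c exits non-zero when it counts 0 lines, hence the "|| true".
    dict_count="$(grep -c '^@SQ' "$1" || true)"
    fai_count="$(wc -l < "$2" | tr -d ' ')"
    [ "$dict_count" -eq "$fai_count" ]
}
```

A usage example, assuming both files were created in `~/input_files` as described above: `dict_matches_fai ~/input_files/Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.dict ~/input_files/Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.fasta.fai && echo "counts match"`.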

Create STAR Index

  1. Create .fai and .dict file for Genome FASTA (both described above).
  2. The GTF file, Genome FASTA file, .fai and .dict files should all be in the same directory. Use this directory as a volume mount when running Docker. We used ~/input_files below.
  3. Run the following command:
    docker run --rm -v ~/input_files:/input_files heliumdatacommons/topmed-rnaseq \
        STAR --runMode genomeGenerate \
        --genomeDir /input_files/star_index \
        --genomeFastaFiles /input_files/Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.fasta \
        --sjdbGTFfile /input_files/gencode.v26.annotation.gtf \
        --sjdbOverhang 100 --runThreadN 10
    
  4. Upon completion, your STAR Index will be in the ~/input_files/star_index directory.
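
Once STAR finishes, a quick check that the index directory contains the files STAR normally writes can catch an interrupted run. The sketch below tests for a partial list of those files. `missing_star_files` is a hypothetical helper, and the list is deliberately not exhaustive.

```shell
# Report files that a STAR genome index normally contains but that are absent.
# missing_star_files is a hypothetical helper; $1 is the index directory.
# The names checked are a partial set of what genomeGenerate writes.
missing_star_files() {
    for name in Genome SA SAindex chrName.txt genomeParameters.txt; do
        [ -e "$1/$name" ] || printf '%s\n' "$name"
    done
}
```

Running `missing_star_files ~/input_files/star_index` should print nothing for a complete index. The STAR 2.5 manual also asks for the `--genomeDir` directory to exist before `genomeGenerate` runs, so creating it first with `mkdir -p ~/input_files/star_index` is a reasonable precaution.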

Create RSEM Reference

  1. Create .fai and .dict file for Genome FASTA (both described above).
  2. The GTF file, Genome FASTA file, .fai and .dict files should all be in the same directory. Use this directory as a volume mount when running Docker.
  3. Create RSEM reference using rsem-prepare-reference:
    docker run --rm -v ~/input_files:/input_files heliumdatacommons/topmed-rnaseq:latest \
        rsem-prepare-reference --num-threads 4 \
        --gtf /input_files/gencode.v26.annotation.gtf \
        /input_files/Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.fasta \
        /input_files/rsem_reference
    
  4. Upon completion, the RSEM reference files will be in ~/input_files, each named with the rsem_reference prefix (rsem-prepare-reference treats its last argument as a file name prefix).
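
`rsem-prepare-reference` writes several files whose names start with the reference name given as its last argument. The sketch below checks for a few of them. `missing_rsem_files` is a hypothetical helper, and the extension list is a partial set chosen for illustration.

```shell
# Report RSEM reference files that are expected for a given prefix but absent.
# missing_rsem_files is a hypothetical helper; $1 is the reference prefix,
# for example ~/input_files/rsem_reference. Only a few extensions are checked.
missing_rsem_files() {
    for ext in grp ti seq transcripts.fa; do
        [ -e "$1.$ext" ] || printf '%s\n' "$1.$ext"
    done
}
```

Running `missing_rsem_files ~/input_files/rsem_reference` should print nothing when the reference was built successfully.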