REFERENCE

This document contains more detailed information on the inputs, outputs and the software.

{
    "rna.endedness" : "paired",
    "rna.fastqs_R1" : ["test_data/ENCSR653DFZ_rep1_chr19_10000reads_R1.fastq.gz", "test_data/ENCSR653DFZ_rep2_chr19_10000reads_R1.fastq.gz"],
    "rna.fastqs_R2" : ["test_data/ENCSR653DFZ_rep1_chr19_10000reads_R2.fastq.gz", "test_data/ENCSR653DFZ_rep2_chr19_10000reads_R2.fastq.gz"],
    "rna.aligner" : "star",
    "rna.align_index" : "test_data/GRCh38_v24_ERCC_phiX_starIndex_chr19only.tgz",
    "rna.rsem_index" : "test_data/GRCh38_v24_ERCC_phiX_rsemIndex_chr19only.tgz",
    "rna.bamroot" : "PE_stranded",
    "rna.strandedness" : "stranded",
    "rna.strandedness_direction" : "reverse",
    "rna.chrom_sizes" : "test_data/GRCh38_EBV.chrom.sizes",
    "rna.align_ncpus" : 2,
    "rna.align_ramGB" : 4,
    "rna.kallisto_index" : "test_data/Homo_sapiens.GRCh38.cdna.all.chr19_ERCC_phix_k31_kallisto.idx",
    "rna.kallisto_number_of_threads" : 2,
    "rna.kallisto_ramGB" : 4,
    "rna.rna_qc_tr_id_to_gene_type_tsv" : "transcript_id_to_gene_type_mappings/gencodeV24pri-tRNAs-ERCC-phiX.transcript_id_to_genes.tsv",
    "rna.bam_to_signals_ncpus" : 1,
    "rna.bam_to_signals_ramGB" : 2,
    "rna.rsem_ncpus" : 2,
    "rna.rsem_ramGB" : 4,
    "rna.align_disk" : "local-disk 20 HDD",
    "rna.kallisto_disk" : "local-disk 20 HDD",
    "rna.rna_qc_disk" : "local-disk 20 HDD",
    "rna.mad_qc_disk" : "local-disk 20 HDD",
    "rna.bam_to_signals_disk" : "local-disk 20 HDD",
    "rna.rsem_disk" : "local-disk 20 HDD"
}

The following explains the meaning of each line in the input file.

rna.endedness Indicates whether the experiment is paired or single.
rna.fastqs_R1 List of gzipped fastq files containing the first read of each pair.
rna.fastqs_R2 List of gzipped fastq files containing the second read of each pair.

Example:

Assume you are running a paired-end experiment with 3 replicates. The fastq files from the first replicate are replicate1_read1.fastq.gz and replicate1_read2.fastq.gz, those from the second replicate are replicate2_read1.fastq.gz and replicate2_read2.fastq.gz, and those from the third replicate are replicate3_read1.fastq.gz and replicate3_read2.fastq.gz. In this case the relevant part of the input should be as follows:
"rna.fastqs_R1" : ["replicate1_read1.fastq.gz", "replicate2_read1.fastq.gz", "replicate3_read1.fastq.gz"],
"rna.fastqs_R2" : ["replicate1_read2.fastq.gz", "replicate2_read2.fastq.gz", "replicate3_read2.fastq.gz"]
Note that it is very important that the replicates are in the same order in both lists. This correspondence is used to pair the correct files with each other.
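
Because the pairing relies only on list position, a quick check before launching a run can catch mismatched lists. The following is a minimal sketch, assuming the input JSON has already been loaded into a Python dict; the helper name check_fastq_pairing is hypothetical and not part of the pipeline.

```python
def check_fastq_pairing(inputs):
    """Return (R1, R2) file pairs, one per replicate, paired by list position."""
    r1 = inputs["rna.fastqs_R1"]
    r2 = inputs["rna.fastqs_R2"]
    if len(r1) != len(r2):
        raise ValueError(
            "rna.fastqs_R1 has {} files but rna.fastqs_R2 has {}".format(len(r1), len(r2))
        )
    return list(zip(r1, r2))


# File names taken from the three-replicate example above.
inputs = {
    "rna.fastqs_R1": ["replicate1_read1.fastq.gz", "replicate2_read1.fastq.gz", "replicate3_read1.fastq.gz"],
    "rna.fastqs_R2": ["replicate1_read2.fastq.gz", "replicate2_read2.fastq.gz", "replicate3_read2.fastq.gz"],
}

for number, (read1, read2) in enumerate(check_fastq_pairing(inputs), start=1):
    print("replicate {}: {} + {}".format(number, read1, read2))
```

A length check cannot detect two replicates swapped within one list, so the printed pairs are still worth reading by eye.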

rna.aligner The aligner to use. Currently this is star; other aligners may be supported in the future.
rna.align_index The index for the STAR aligner.
rna.rsem_index The index for the RSEM quantifier.
rna.kallisto_index The index for the Kallisto quantifier.
rna.bamroot This is a prefix that gets added into the output filenames. Additionally the files are prefixed with information of the replicate they originate from.

Example:

Assume rna.bamroot is FOO. Outputs from the first replicate would be prefixed by rep1FOO, outputs from the second replicate by rep2FOO, and so on.
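
The naming rule in this example can be written out as a short sketch. It only mirrors the rep1FOO pattern described above; the function name output_prefix is hypothetical and this is not the pipeline's own code.

```python
def output_prefix(replicate_number, bamroot):
    """Build the output prefix for one replicate: 'rep', the replicate number, then bamroot."""
    return "rep{}{}".format(replicate_number, bamroot)


for replicate_number in (1, 2):
    print(output_prefix(replicate_number, "FOO"))
```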

rna.strandedness Indicates whether the experiment is stranded or unstranded. If this is stranded, then rna.strandedness_direction should be set to forward or reverse.
rna.strandedness_direction Indicates the direction of strandedness. Options are forward, reverse and unstranded.
rna.chrom_sizes The file containing the chromosome sizes. You can find and download these files from the ENCODE portal.
rna.align_ncpus How many CPUs are available for STAR alignment.
rna.align_ramGB How many GBs of memory are available for STAR alignment.
rna.rsem_ncpus How many CPUs are available for RSEM quantification.
rna.rsem_ramGB How many GBs of memory are available for RSEM quantification.
rna.align_disk How much disk space, in GB, is available for the Align task. You can also specify the type of disk: HDD for a spinning disk and SSD for a solid state drive.
rna.kallisto_disk As above, but for Kallisto.
rna.rna_qc_disk As above, but for RNA QC.
rna.bam_to_signals_disk As above, but for bam_to_signals.
rna.mad_qc_disk As above, but for MAD QC.
rna.rsem_disk As above, but for RSEM.
rna.kallisto_number_of_threads How many threads are available for Kallisto quantification.
rna.kallisto_ramGB How many GBs of memory are available for Kallisto quantification.
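
The relationship between rna.strandedness and rna.strandedness_direction described above can be checked before a run. This is a minimal sketch, assuming the inputs are loaded into a dict; check_strandedness is a hypothetical helper, and it enforces only the rule stated in this document (a stranded run needs forward or reverse).

```python
def check_strandedness(inputs):
    """Validate the strandedness settings and return them as a tuple."""
    strandedness = inputs["rna.strandedness"]
    direction = inputs["rna.strandedness_direction"]
    if strandedness not in ("stranded", "unstranded"):
        raise ValueError("unknown rna.strandedness: {!r}".format(strandedness))
    if direction not in ("forward", "reverse", "unstranded"):
        raise ValueError("unknown rna.strandedness_direction: {!r}".format(direction))
    # Rule stated in the text: a stranded run must name a direction.
    if strandedness == "stranded" and direction == "unstranded":
        raise ValueError("a stranded run needs direction forward or reverse")
    return strandedness, direction


print(check_strandedness({
    "rna.strandedness": "stranded",
    "rna.strandedness_direction": "reverse",
}))
```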

Example:

Assume you want to allocate 100 gigabytes of spinning hard drive. In this case you would enter "local-disk 100 HDD". If you want to allocate 111 gigabytes of solid state drive space, enter "local-disk 111 SSD".
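
If the input JSON is generated by a script, the disk strings can be built rather than typed by hand. The sketch below follows the "local-disk SIZE TYPE" form shown in this example; disk_spec is a hypothetical helper name.

```python
def disk_spec(size_gb, disk_type="HDD"):
    """Format a disk request such as 'local-disk 100 HDD'."""
    if disk_type not in ("HDD", "SSD"):
        raise ValueError("disk_type must be HDD or SSD")
    if int(size_gb) <= 0:
        raise ValueError("size_gb must be a positive number of gigabytes")
    return "local-disk {} {}".format(int(size_gb), disk_type)


print(disk_spec(100))         # local-disk 100 HDD
print(disk_spec(111, "SSD"))  # local-disk 111 SSD
```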

rna.rna_qc_tr_id_to_gene_type_tsv The rna_qc task calculates the number of reads by gene type. For this it needs a tsv file that maps transcript IDs to gene types. For GRCh38, hg19, and mm10 with ERCC (ambion 1) and PhiX spikes the tsv is provided in this repo. If you are using some other annotation, you can use code here to build your own.
rna.bam_to_signals_ncpus The number of CPUs given to the bam_to_signals task.
rna.bam_to_signals_ramGB The amount of memory in GB given to the bam_to_signals task.

Additional inputs when running single-ended experiments:

The Kallisto quantifier makes use of the average fragment length and the standard deviation of the fragment lengths. In paired-end experiments these values can be calculated from the data, but in single-ended experiments they must be provided.

rna.kallisto_fragment_length The average fragment length.
rna.kallisto_sd_of_fragment_length The standard deviation of the fragment lengths.
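
As an illustration, the two extra keys can be added to an existing input dict before it is written out as JSON. This is a sketch only: add_single_ended_inputs is a hypothetical helper, the fastq name is made up, and the values 250 and 10 are placeholders rather than recommendations; use the values measured for your own library.

```python
import json


def add_single_ended_inputs(inputs, fragment_length, sd_of_fragment_length):
    """Return a copy of inputs with the single-ended Kallisto settings filled in."""
    if fragment_length <= 0 or sd_of_fragment_length <= 0:
        raise ValueError("fragment length and its standard deviation must be positive")
    updated = dict(inputs)
    updated["rna.endedness"] = "single"
    updated["rna.kallisto_fragment_length"] = fragment_length
    updated["rna.kallisto_sd_of_fragment_length"] = sd_of_fragment_length
    return updated


# Placeholder inputs for illustration only.
base_inputs = {"rna.fastqs_R1": ["replicate1.fastq.gz"]}
single_ended_inputs = add_single_ended_inputs(base_inputs, 250, 10)
print(json.dumps(single_ended_inputs, indent=4, sort_keys=True))
```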

Outputs

DNAnexus: If you use dxWDL and run the pipeline on the DNAnexus platform, the outputs are stored in the specified output directory without any subdirectories.
Cromwell: Cromwell stores the outputs of each task under the directory cromwell-executions/[WORKFLOW_ID]/call-[TASK_NAME]/shard-[IDX]. For all tasks, [IDX] is the zero-based index of the replicate.
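
To locate the outputs of one replicate, the Cromwell path can be assembled from its parts. The sketch below follows the directory pattern exactly as written above; cromwell_shard_dir is a hypothetical helper, and the workflow ID and task name in the example call are placeholders, so check them against your own cromwell-executions directory.

```python
def cromwell_shard_dir(workflow_id, task_name, replicate_number):
    """Return the task output directory for a replicate (replicates counted from 1)."""
    if replicate_number < 1:
        raise ValueError("replicate_number starts at 1")
    shard_index = replicate_number - 1  # shard indexes are zero-based
    return "/".join([
        "cromwell-executions",
        workflow_id,
        "call-{}".format(task_name),
        "shard-{}".format(shard_index),
    ])


# Placeholder workflow ID and task name.
print(cromwell_shard_dir("WORKFLOW_ID", "align", 1))
```

Note the off-by-one: the first replicate lives in shard-0, the second in shard-1.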

Output files

Task Align

Genome bam, file name matches *_genome.bam. Bam of reads aligned to the genome.
Anno bam, file name matches *_anno.bam. Bam of reads aligned to the annotation.
Genome flagstat, file name matches *_genome_flagstat.txt. Samtools flagstat on the genome bam.
Anno flagstat, file name matches *_anno_flagstat.txt. Samtools flagstat on the anno bam.
STAR run log, file name matches *_Log.final.out. STAR run log.
Python log, file name is align.log. This file contains possible additional information on the pipeline step.
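
The file name patterns above can be used to sort the contents of an Align output directory. This is a minimal sketch using shell-style matching from the Python standard library; the labels, the helper name classify_align_output and the example file names are illustrative assumptions.

```python
from fnmatch import fnmatchcase

# Patterns copied from the list of Align outputs above.
ALIGN_OUTPUT_PATTERNS = {
    "genome bam": "*_genome.bam",
    "anno bam": "*_anno.bam",
    "genome flagstat": "*_genome_flagstat.txt",
    "anno flagstat": "*_anno_flagstat.txt",
    "STAR run log": "*_Log.final.out",
    "Python log": "align.log",
}


def classify_align_output(file_name):
    """Return the label of the first matching pattern, or None if nothing matches."""
    for label, pattern in ALIGN_OUTPUT_PATTERNS.items():
        if fnmatchcase(file_name, pattern):
            return label
    return None


# Hypothetical file names for illustration.
for name in ["rep1FOO_genome.bam", "rep1FOO_anno_flagstat.txt", "align.log", "notes.txt"]:
    print(name, "->", classify_align_output(name))
```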

Task Kallisto

Kallisto quants, file name matches *_abundance.tsv. Kallisto quantifications.
Python log file name is kallisto_quant.log. This file contains possible additional information on the pipeline step.

Task Bam to Signals

In the case of a stranded run, the plus and minus strand signal tracks are separated, so there will be four tracks per replicate.

Unique BigWig, file name matches *niq.bw. Contains the signal track of the uniquely mapped reads.
All BigWig, file name matches *ll.bw. Contains the signal track of all reads.
Python log file name is bam_to_signals.log. This file contains possible additional information on the pipeline step.

Task RSEM Quant

Genes results, file name matches *.genes.results. Contains gene quantifications.
Isoforms results, file name matches *.isoforms.results. Contains isoform quantifications.
Number of genes, file name matches *_number_of_genes_detected.json. Contains the number of genes detected, where a gene counts as detected if its TPM value is greater than 1.
Python log file name is rsem_quant.log. This file contains possible additional information on the pipeline step.
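
The detection rule (TPM greater than 1) can be reproduced from a genes results file to sanity-check the reported number. This is a sketch, not the pipeline's own implementation: it assumes the standard tab-separated RSEM genes results layout with a TPM column, and the three rows below are made-up values.

```python
import csv
import io


def count_detected_genes(genes_results, tpm_threshold=1.0):
    """Count rows whose TPM is strictly greater than the threshold."""
    reader = csv.DictReader(genes_results, delimiter="\t")
    return sum(1 for row in reader if float(row["TPM"]) > tpm_threshold)


# Made-up example content in the RSEM genes results column layout.
EXAMPLE_GENES_RESULTS = (
    "gene_id\ttranscript_id(s)\tlength\teffective_length\texpected_count\tTPM\tFPKM\n"
    "geneA\ttxA1\t1000.00\t850.00\t20.00\t5.20\t4.10\n"
    "geneB\ttxB1\t500.00\t350.00\t0.00\t0.00\t0.00\n"
    "geneC\ttxC1\t2000.00\t1850.00\t3.00\t1.00\t0.80\n"
)

print(count_detected_genes(io.StringIO(EXAMPLE_GENES_RESULTS)))
```

Here geneC sits exactly at TPM 1.00 and is not counted, because the comparison is strict. For a real file, pass an open file handle instead of the StringIO object.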

Task Mad QC

This step is run if and only if the number of replicates is 2.

Mad QC plot, file name matches *_mad_plot.png. Contains the MAD QC plot.
Mad QC metrics, file name matches *_mad_qc_metrics.json. Contains MAD QC metrics.
Python log file name is mad_qc.log. This file contains possible additional information on the pipeline step.

Task RNA QC

This step calculates additional metrics. At this time the only metric is reads by gene type. It is very IMPORTANT to look at the Python log of this step to check that the transcriptome bam did not contain any transcripts that are missing from the transcript ID to gene type mapping tsv. If it did, make sure you are using the STAR aligner and RSEM quantifier indexes you think you are using, and that all the other references are correct!

RNA QC, file name matches *_qc.json. Contains additional QC metrics, for now only the reads by gene type.
Python log file name is rna_qc.log. This file contains IMPORTANT information on the pipeline step.

Software

Ubuntu 16.04

Python 3.5.2

R 3.2.3

STAR 2.5.1b

RSEM 1.2.23

Kallisto 0.44.0

Samtools 1.9

bedGraphToBigWig and bedSort
