This workflow performs a differential expression analysis with STAR and edgeR
- Dean Pettinga @deanpettinga
NOTE: this workflow is optimized for HPC3 at the Van Andel Institute.
The following recipe provides established best practices for running and extending this workflow in a reproducible way.
- Fork the repo to your own GitHub account.
- Clone the fork to the desired working directory for your project (e.g. on /secondary).
- Create a new branch (the project-branch) within the clone and switch to it. The branch will contain any project-specific modifications (e.g. to configuration, but also to code).
- Modify the config and any necessary sheets (example snippets follow this list), e.g.:
- bin/units.tsv
- sample - ID of biological sample
- group - comparison group for DE contrast
- unit - description of sequencing unit ("lane1" or "flowcell1", etc.)
- fq1 - name of read1 fastq
- fq2 - name of read2 fastq (only for paired-end libraries)
- strandedness - strandedness of library prep
- typically reverse for Illumina
- used to select the correct counts column from STAR's per-gene counts output.
- bin/contrasts.tsv
- contrast - name for edgeR contrast
- groupRelative - a "group" defined in units.tsv (the group of interest)
- groupBaseline - the other "group" defined in units.tsv (the baseline for the comparison)
- bin/config.yaml
- PE_or_SE - designates the sequencing protocol as paired-end or single-end. (e.g. "PE")
- ref
- index - absolute path to your STAR index directory
- annotation - absolute path to your genome annotation.gtf
- species
- short - e.g. "hsapiens"
- long - e.g. "Homo sapiens"
- annotation - Bioconductor annotation package (e.g. "org.Hs.eg.db")
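For reference, minimal example sheets are shown below. All sample names, group labels, file names, and paths are placeholders, and the exact key nesting in config.yaml should follow the list above; treat these as sketches, not templates to copy verbatim.

```
# bin/units.tsv -- tab-separated in the real file; hypothetical values
sample   group    unit   fq1                        fq2                        strandedness
sampleA  treated  lane1  sampleA_L001_R1.fastq.gz   sampleA_L001_R2.fastq.gz   reverse
sampleB  control  lane1  sampleB_L001_R1.fastq.gz   sampleB_L001_R2.fastq.gz   reverse
```

```
# bin/contrasts.tsv -- tab-separated; groups must match units.tsv
contrast            groupRelative  groupBaseline
treated_vs_control  treated        control
```

```yaml
# bin/config.yaml -- illustrative values only
PE_or_SE: "PE"
ref:
  index: "/path/to/STAR_index"          # absolute path to your STAR index directory
  annotation: "/path/to/annotation.gtf" # absolute path to your genome annotation
species:
  short: "hsapiens"
  long: "Homo sapiens"
  annotation: "org.Hs.eg.db"            # Bioconductor annotation package
```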
- Move your sequencing reads to `raw_data/`.
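Symlinking is one way to stage the reads without duplicating them on /secondary (a sketch; the source path is hypothetical, and the file names must match `fq1`/`fq2` in `bin/units.tsv`):

```
mkdir -p raw_data
ln -s /path/to/sequencing/run/*.fastq.gz raw_data/
```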
- Test your configuration by performing a dry-run via `snakemake --use-conda -np`.
- Execute from within your project directory as a PBS job using the BBC nodes via `qsub -q bbc bin/run_snake.sh`.
This job script will produce a DAG (.txt & .png) and an .html file with run stats for the executed workflow in `logs/runs/bulk_rnaseq-workflow_(TIME)`.
- Review your results, including the run stats (`logs/runs/rnaseq-workflow_(TIME).html`) and the differential expression results (`deliverables/edgeR_longReport.html`).
Now that you've successfully run your analysis, it's time to do some housekeeping.
- Commit any changes you've made to the repo and push the project-branch to your fork on GitHub.
- Optional: Merge back any valuable and generalizable changes to the [upstream repo] via a pull request. This would be greatly appreciated.
- Optional: Push results (plots/tables) to the remote branch on your fork.
- Optional: Create a self-contained workflow archive for publication along with the paper (snakemake --archive).
- Optional: Delete the local clone/workdir to free space.
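A sketch of the housekeeping commands, assuming your fork is the `origin` remote, your project-branch is named `my-project`, and the archive filename is arbitrary:

```
git add bin/config.yaml bin/units.tsv bin/contrasts.tsv
git commit -m "Project-specific configuration"
git push origin my-project

# optional: self-contained archive for publication
snakemake --archive my-workflow.tar.gz
```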
Besides running the differential expression portion of the pipeline, it is also possible to do variant calling based on the GATK "Best Practices". Importantly, the read groups are added post-alignment to the STAR BAM files created in the DE part of the pipeline, so align files separately if you want to assign distinct read groups.
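For context, each aligned unit ends up carrying a SAM read-group header line along these lines, built from the `variant_calling_units.tsv` columns described in the steps below (all values here are illustrative):

```
@RG	ID:sampleA	SM:sampleA	LB:libA	PU:HXXXXXXXX.1.ACGTACGT
```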
1. Create a tab-separated file called `bin/variant_calling_units.tsv` (see the example sheet and config keys after this list). These may consist of all or a subset of the files used for DE analysis. There are four columns:
   i. `unit`: corresponds to the 'sample' prefix in `{sample}.Aligned.sortedByCoord.out.bam` from the STAR alignments and is used to code the `RG:ID` tag.
   ii. `sample`: used to merge same-sample 'units' together at the MarkDuplicates step. Specifically, this indicates the value for the `RG:SM` tag.
   iii. `library`: indicates the value for the `RG:LB` tag. Important for the MarkDuplicates step.
   iv. `platform_unit`: indicates the value for the `RG:PU` tag. According to the GATK docs, this should be `{FLOWCELL_BARCODE}.{LANE}.{SAMPLE_BARCODE}`. This is important for the BQSR step; if fastq files were merged in the DE part of the pipeline, then you cannot correct for potential batch effects due to lane, flow cell, etc.
2. In the `config.yaml` file:
   i. Set `call_variants` to `true`.
   ii. Ensure that valid paths are provided for `known_snps`, `known_indels`, and `fai`. These should be two vcf/vcf.gz files and a .fai file. For hg38, use dbSNP and the "known_indels" file from the GATK resource bundle.
3. Run `python ./group_chroms.py`. This will create `grouped_contigs.tsv`, which is used to run GATK in parallel in a per-contig fashion. For this to work, the .fai file from step 2 must be specified first. You must also ensure that your default python has access to the snakemake library; if running on the VAI HPC, you may use the python installed via venv for the snakemake environment module.
4. Do a dry run to check that the rules are being triggered correctly, then run the pipeline as usual.
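To make the steps above concrete, here is a hypothetical `bin/variant_calling_units.tsv` and the matching `config.yaml` keys. All unit/sample names, barcodes, and paths are placeholders, and the exact nesting of the config keys should follow your existing config.yaml:

```
# bin/variant_calling_units.tsv -- tab-separated in the real file; hypothetical values
unit     sample   library  platform_unit
sampleA  sampleA  libA     HXXXXXXXX.1.ACGTACGT
sampleB  sampleB  libB     HXXXXXXXX.1.TGCATGCA
```

```yaml
# config.yaml keys for variant calling -- paths are placeholders
call_variants: true
known_snps: "/path/to/dbsnp.vcf.gz"
known_indels: "/path/to/known_indels.vcf.gz"
fai: "/path/to/genome.fa.fai"
```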