This is a snakemake workflow for profiling microbial communities from amplicon sequencing data using dada2. DADA2 tutorial can be found from https://benjjneb.github.io/dada2/index.html. The initial code was cloned from https://github.com/SilasK/amplicon-seq-dada2 and modified to make a workflow suitable for our needs.
Input:
- Raw paired-end fastq files
- samples.tsv example
Output:
- Taxonomic assignment tables for specified databases (GTDB, RDP, SILVA, URE).
- ASV abundance table (seqtab_nochimera.rds).
- ASV sequences in a fasta file from seqtab_nochimera.rds (ASV_seq.fasta).
- Summary of reads filtered at each step (Nreads.tsv).
- A phylogenetic tree (ASV_tree.nwk).
Please install the following tools before running this workflow. Please request an interactive session before starting the installation step by running the following command:
salloc --mem=20G --time=05:00:00
conda (miniconda): https://conda.io/projects/conda/en/stable/user-guide/install/linux.html
snakemake: https://snakemake.readthedocs.io/en/stable/getting_started/installation.html
Next we need to set up a few environments to use in different steps of the pipeline.
To install r and dada2:
conda create -n dada2 -c conda-forge -c bioconda -c defaults --override-channels bioconductor-dada2
To activate the environment and install the required packages (dplyr, gridExtra, ggplot2, DECIPHER, Biostrings, limma) locally in R:
conda activate dada2
(dada2) [username@hostname ~]$ R
install.packages("gridExtra")
install.packages("ggplot2")
install.packages("dplyr")
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("DECIPHER")
BiocManager::install("Biostrings")
BiocManager::install("limma")
q() #to quit R
conda deactivate
To install fastqc, multiQC, cutadapt, and seqkit tools for quality control in a new environment:
conda create --name QC
conda activate QC
conda install -c bioconda fastqc==0.11.9
conda install pip
pip install multiqc
pip install pandas==1.5.3
pip install cutadapt
conda install -c bioconda seqkit
conda deactivate
To create an environment for generating a phylogenetic tree and a fasta file of ASVs:
conda create -n fastree_mafft
conda activate fastree_mafft
conda install -c bioconda fasttree
conda deactivate
conda create -n rmd
conda activate rmd
conda install -c conda-forge r-base
conda install -c conda-forge pandoc
(rmd) [username@hostname ~]$ R
install.packages('DT')
install.packages("ggplot2")
install.packages("dplyr")
q() #to quit R
conda deactivate
Then please follow these steps to set up and run the pipeline.
3.1 Make sure that all the environments are set up and required packages are installed. To do so, after installing each package, you can run the tool name and the flag -h (i.g. fastqc -h) to see if it is installed.
3.2 Navigate to your project directory and clone this repository into that directory using the following command:
git clone https://github.com/IMCBioinformatics/dada2_snakemake_workflow.git
3.3 Use prepare.py script to generate the samples.tsv file as an input for this pipeline using the following command:
<DIR>
is the location of the raw fastq files.
python utils/scripts/common/prepare.py <DIR>
- path of the input directory
- name and path of the output directory
- Forward and reverse reads format
- primer sequences (if they are sequenced)
- primer removal and quality trimming values
- DADA2 filter and trim thresholds
- chimera removal method
- number of reads for error rate learning
3.5 Download the taxonomy databases from http://www2.decipher.codes/Downloads.html that you plan to use in utils/databases/ and consequently set the path for them in the config file.
3.6 Once confident with all the parameters first run the snakemake dry run command to make sure that pipeline is working.
snakemake -np
Then snakemake can be executed by the following bash script:
sbatch dada2_sbatch.sh
To make sure that the pipeline is run completely, we need to check the log and output files.
All logs are placed in the logs directory. A copy of all snakemake files and logs will be copied to the output directory (output/snakemake_files/) as well to avoid rewritting them by upcoming re-runs.
- seqtab_nochimeras.rds
- Nreads.tsv
<DATABASE>
.tsv
- ASV_seq.fasta
- ASV_tree.nwk
- qc_report.html