-
Notifications
You must be signed in to change notification settings - Fork 21
Setup and Usage
This is the thorough walk-through. See Quick Start and Full Example for the short version. We here expect basic familiarity with Snakemake - please read the Snakemake Tutorial to get up to speed.
For the pipeline to run, we need
to be available.
We currently require at least snakemake >= 5.7.0, but highly recommend using snakemake >= 5.18.0 and mamba, to be able to use mamba for dependency management via the snakemake --conda-frontend mamba
option when running the pipeline, as this is way faster for installing the tools that grenepipe uses.
In cluster environments, we often find that versions are outdated, or tools not even available. We hence recommend to simply install your own miniconda locally for your user. Use wget
to obtain the binary (probably you want the generic linux one), and execute it to install locally, log out, log in again to activate it, and also get mamba:
cd ~/path/to/my/software
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
sh Miniconda3-latest-Linux-x86_64.sh
conda install mamba -n base -c conda-forge
Then, download grenepipe into the directory from where you want it to run - this can (and should) be a different directory as the one where your data is stored; see below for details. We recommend to download a release version for reproducibility - simply click the "Source code" link below the release description.
Snakemake, conda, python, pandas, and numpy are notorious for causing trouble when mixing their versions. We hence want to make sure to always use the same versions of these tools across all steps of the pipeline. We achieve that by running the main pipeline in an environment of its own with fixed versions of these tools, instead of relying on the local versions of the tools installed on your system.
Currently, we recommend using grenepipe with:
- snakemake == 6.0.5
- python == 3.7.10
- pandas == 1.3.1
- numpy == 1.21.2
as listed in the grenepipe.yaml
conda environment file.
[Expand] A note on these versions.
Note that somewhere between python 3.7.10 and 3.10.4, current versions of bcftools <= 1.13 are not compatible any more, and hence we currently do not use the latest python version. Furthermore, we have tested grenepipe with different versions of Snakemake between 5.32.1 (and probably even earlier, if needed) and 7.6.1, and it generally works. Currently, there is a bug in Snakemake >= 6.1.0 that affects the way we use cluster profiles, and hence use Snakemake 6.0.5 at the moment.
We install and activate that environment, from within the directory that you dowloaded grenepipe into:
cd /path/to/grenepipe
mamba env create -f envs/grenepipe.yaml -n grenepipe
conda activate grenepipe
We recommend to use mamba
, which is way faster than conda
for this step. From here on, we assume that this is the environment from which we run the pipeline. Individual rules create their own environments, but they match the versions of python, pandas, and numpy specified above, and hence hopefully avoid any trouble due to mismatches.
The file config.yaml
in the main grenepipe directory contains all configurations for the data and the tools to use:
- Path to a table of all sample
fastq
files to process, see below for the expected format. - Path to the reference genome
fasta
file. - Settings for which tools to use, and their parameters. There are a lot of tools and options.
This is the main file that needs to be adapted in order to configure a run of the pipeline.
Change the settings in this file to your needs, and see the comments in the config file itself for details on each of the tools and settings. Note that absolute file paths (starting with /
) are needed in the config file!
The recommended way to use this is to copy the config.yaml
file to an empty directory where you want your analysis to run. Then, you specify
snakemake [other options] --directory /path/to/my-analysis
to point snakemake to that directory, and the pipeline output files will then be written to that directory. See Working Directory for details.
The configuration (config.yaml
) expects data: samples-table
to point to a tab-separated
table that lists all sample names and the absolute (!) paths to their fastq
files:
sample unit platform fq1 fq2
A 1 ILLUMINA /path/to/A_R1_001.fastq.gz /path/to/A_R2_001.fastq.gz
B 1 ILLUMINA /path/to/B_R1_001.fastq.gz /path/to/B_R2_001.fastq.gz
B 2 ILLUMINA /path/to/B_002.fastq.gz
The above table contains two samples, A
and B
.
- Sample
A
consists of one unit (coming from one sequencing run), with paired-end reads. - Sample
B
consists of two units (coming e.g., from different sequencing runs or multiple lanes per sample), with the first unit being paired-end, and the second unit being single-end (hence, the last column is simply left empty).
Samples can either be in fastq
format, or in compressed fastq.gz
format (the file extension
itself does not matter though). See also our exemplary table.
Columns have to be labelled as above (that is, simply copy the first row over to your table, making sure that the tabs stay tabs and don't turn into spaces accidentally).
[Expand] The "platform" column and GATK BaseRecalibrator.
The "platforms" column is optional. When the settings: recalibrate-base-qualities
value is set in the config.yaml
, we run GATK BaseRecalibrator, which needs to know the sequencing instrument/platform for each sample. This can be provided via the "platform" column in the table, as shown above, for each sample individually. Alternatively, the params: gatk: platform
setting in the config.yaml
can be used to set the platform for all samples.
We also provide a script located at tools/generate-table.py
that automatically generates the samples table by (recursively) scanning a directory of fastq files. It scans for all files ending in .fastq
, .fa
, .fq
, .fastq.gz
, .fa.gz
, and .fq.gz
, and matches pair-end read files by looking for file names that only differ in a 1
and 2
(with a preference for R1
and R2
, in case of multiple matches) for forward and backward files, e.g.:
S1_R1_001.fq.gz
S1_R2_001.fq.gz
The script prints the matched pairs that it determined. Everything before the matching 1
and 2
is considered to be the sample name (except for R
if present, and any dashes or underscores), and everything after is used to determine the unit of the sample (identical text means same unit). For example, if two further files were present
S1_R1_002.fq.gz
S1_R2_002.fq.gz
the script would consider them as a second unit (second run of a sequencing machine on the same biological sample). These four example files hence would yield a table with two lines:
S1 1 S1_R1_001.fq.gz S1_R2_001.fq.gz
S1 2 S1_R1_002.fq.gz S1_R2_002.fq.gz
(leaving out platform here for simplicity)
Of course, the unit information does not have to be present. For example, file names such as S1_R1.fq.gz
work fine, and would just use the S1
part to determine the sample name, and always be single-unit, as there is nothing after the R1
to distinguish between units.
Lastly, files for which no match is found are considered to be unpaired single end read files. It is also possible to provide the option --single
to the script, in which case no attempt is made at matching paired end files, and all files are treated as single-end instead.
Usage: It only needs the directory to scan as input, and optionally takes the output file name for the table, which by default is samples.tsv
:
./generate-table.py [--single] <directory-with-fastq-files> [<output-file-name=samples.tsv>]
Arguments in square brackets are optional (so, type them without the brackets), and arguments in angle brackets are paths (so, type them without the angle brackets). If --single
is provided, the script does not try to match paired end reads, and instead uses each fastq
file as an individual sample (i.e., it will be on its own line in the output table). If your file naming fits the above scheme, you can use the script to generate the table for you. If not, please let us know, so that we can adapt the script to other file naming schemes as well!
Also note that we recommend to ensure file names that only consist of alpha-numerical characters, dots, dashes, and underscores. Almost all other characters are special in some contexts, and might hence cause trouble when running the pipeline. See the tools/copy-samples.py
script for a tool to create a copy (or symlink) of the fastq files that cleans the names.
Simply run
snakemake --cores <n> --use-conda --directory /path/to/my-analysis [other options]
from the directory where you downloaded grenepipe into, to execute the whole pipeline on your computer, using the config.yaml
that you stored in /path/to/my-analysis
.
Important: We always want to call snakemake
from within the grenepipe directory (where the code is), and not from the analysis directory (where the config.yaml
is). The analysis directory (where the config.yaml
is) is always specified via the --directory
option of snakemake.
Of course, any additional snakemake options can be set as needed:
- See Advanced Usage for details on the working directory, and for an important conda setting to save time. This also contains a section about how to run only parts of the pipeline, e.g., to just run the mapping, but not the calling.
- See our Cluster and Profiles documentation for cluster environments.
- Before running a large analysis, we recommend a "dry run" that lists all the steps snakemake wants to execute, without actually executing them, to check that everything is as expected. To do this, add the flags
-nq
to any snakemake call. - For debugging or gaining understanding of why snakemake wants to execute certain steps, use
-n --reason
, which is a dry run that lists the reason for each step being executed.
The snakemake command line interface offers a lot of useful options, see there for more details.
Once the pipeline is finished, there are several files of interest:
-
genotyped/all.vcf.gz
: The raw variant call result. -
filtered/all.vcf.gz
: The filtered variant call result. -
annotated/[snpeff|vep].vcf.gz
: If SnpEff and/or VEP are activated in theconfig.yaml
(neither are by default), these are the annotated versions of the above variant call result. See Variant Annotations for details on SnpEff and VEP in grenepipe. -
qc/multiqc.html
: The MultiQC report. See also Report and Quality Control.
The pipeline also keeps most intermediate files by default, in case they are needed for further analyses.
Lastly, we provide statistics on the reference genome (number of sequences and their lenghts), using SeqKit. The output file with the statistics is stored next to the reference genome file, with the additional suffix .seqkit
.