Quick Start and Full Example
This page is meant for the impatient.
Simply download grenepipe to a location of your choice. We recommend using a release version, so that you have a stable point of reference. No installation is needed; just extract the files.
You will furthermore need one of conda / miniconda / anaconda / mamba / micromamba. At the moment, we recommend micromamba, as that seems to be the easiest to install. However, the whole conda ecosystem is rather fast moving, fragile, and unpredictable, so just use whatever is working at the moment. On computer clusters, this might already be available as a module, or you can install it locally for your user. We highly recommend using mamba (or micromamba) instead of conda, for speed. We will then use that to install and run Snakemake, which is the backend that grenepipe runs in.
Snakemake, conda/mamba, python, pandas, and numpy are notorious for causing trouble when mixing their versions. We make sure to always use the same versions of these tools by running the main pipeline in an environment of its own, instead of using your local versions of the tools.
First, install micromamba locally or on your cluster. Then, use that to install and activate the grenepipe environment, from within the main grenepipe directory:
# Create and activate a conda environment for running snakemake.
cd /path/to/grenepipe
micromamba env create -f workflow/envs/grenepipe.yaml
micromamba activate grenepipe
Instead of micromamba, you can also use mamba or conda, depending on which one you decided to use.
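If you are unsure which of these tools is available on your machine or cluster, a small check along the following lines may help. This is only a sketch: the function name find_conda_frontend is hypothetical and not part of grenepipe, and the preference order simply mirrors the recommendation above.

```shell
# Print the first conda-compatible tool found on PATH.
# Order of preference: micromamba, then mamba, then conda.
# Hypothetical helper, not shipped with grenepipe.
find_conda_frontend() {
    for tool in micromamba mamba conda; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            return 0
        fi
    done
    echo "No conda-compatible tool found on PATH" >&2
    return 1
}
```

The printed name can then be used in place of micromamba in the commands above, for instance by storing it in a variable first.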
We provide a small test/exemplary data set at grenepipe/example.
This contains the files minimally needed to run the pipeline:
- samples.tsv: table listing all input fastq files.
- samples/*.fastq.gz: actual sequence data, referenced from the table.
- TAIR10_chr_all.fa.gz: reference genome (here, Arabidopsis thaliana).
- known-variants.vcf.gz: to constrain the variant calling process. This file is based on the 1001 Genomes dataset, and imputed and subset to serve for exemplary and test purposes.
- Lastly, a config.yaml file is needed to set up which input files, tools, and settings we are using. The main grenepipe directory contains the base config file, which we will use.
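To confirm that the download is complete, one could check for the files listed above. The sketch below assumes they sit directly in the example directory under the names given in the list. The function name check_example_inputs is hypothetical. The config.yaml is deliberately not checked, since for the example it is only put in place by the preparation step.

```shell
# Check that a run directory holds the input files listed above.
# Assumes the file names of the bundled example; hypothetical helper.
# Returns non-zero if anything is missing.
check_example_inputs() {
    dir=${1:-example}
    missing=0
    for f in samples.tsv TAIR10_chr_all.fa.gz known-variants.vcf.gz; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    # At least one fastq file is expected under samples/.
    if ! ls "$dir"/samples/*.fastq.gz >/dev/null 2>&1; then
        echo "missing: $dir/samples/*.fastq.gz" >&2
        missing=1
    fi
    return "$missing"
}
```

Called without arguments from the main grenepipe directory, it inspects example/ and reports each missing file on standard error.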
We now need to prepare the config.yaml file for the example, by adjusting the file paths to the fastq files in the samples.tsv, which need to fit with where grenepipe is located.
Simply call
# Prepare the config.yaml and samples.tsv as described above.
./example/prepare.sh
which copies the config/config.yaml to the example directory, and adjusts the paths in the two files as needed. All other settings are left at their defaults.
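After this step, it can be worth verifying that the paths in the sample table point to existing files. The following is a minimal sketch, not part of grenepipe. It assumes that samples.tsv is tab-separated with one header line, and it treats any field ending in .fastq.gz or .fq.gz as a file path, so it does not depend on particular column names. The function name missing_fastq_paths is hypothetical.

```shell
# Print every fastq path referenced in a samples table that does not exist.
# Assumes a tab-separated table with a single header line; hypothetical helper.
missing_fastq_paths() {
    table=${1:-example/samples.tsv}
    awk -F '\t' 'NR > 1 {
        for (i = 1; i <= NF; i++)
            if ($i ~ /\.(fastq|fq)\.gz$/) print $i
    }' "$table" |
    while IFS= read -r path; do
        [ -f "$path" ] || printf '%s\n' "$path"
    done
}
```

Empty output means that every referenced file was found; any path that is printed still needs to be corrected in the table.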
NB: We are using Arabidopsis thaliana as a small exemplary genome here; the pipeline is however agnostic to the species under study.
The data can then be fully analyzed by running the following command from the main grenepipe directory:
# Run the pipeline!
snakemake --use-conda --directory example/
That's it.
A note on grenepipe < v0.13.0
In grenepipe v0.13.0, we upgraded from Snakemake v6 to Snakemake v8, which now by default uses --conda-frontend mamba and also sets the number of compute cores to use by default, instead of having to specify, e.g., --cores 4. If you are still using an older grenepipe before v0.13.0, you will have to add both options to the above command.
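To make the difference concrete, the sketch below prints the command line for a given grenepipe version. The function name grenepipe_command and the default of 4 cores are illustrative assumptions; only the options themselves are taken from the note above.

```shell
# Print the Snakemake command line for a given grenepipe release.
# Releases before v0.13.0 need the conda frontend and core count spelled out.
# Hypothetical helper; the default of 4 cores is only an example value.
grenepipe_command() {
    version=$1            # e.g. "0.12.2" or "0.13.0", without leading "v"
    cores=${2:-4}
    major=${version%%.*}  # text before the first dot
    rest=${version#*.}    # text after the first dot
    minor=${rest%%.*}
    cmd="snakemake --use-conda --directory example/"
    if [ "$major" -eq 0 ] && [ "$minor" -lt 13 ]; then
        cmd="$cmd --conda-frontend mamba --cores $cores"
    fi
    printf '%s\n' "$cmd"
}
```

For example, grenepipe_command 0.12.2 prints snakemake --use-conda --directory example/ --conda-frontend mamba --cores 4, whereas grenepipe_command 0.13.0 prints the shorter command shown above.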
Note: Snakemake always needs to be run from within the directory where you downloaded grenepipe to; you then always specify where your config file is (and hence, where the output files are produced) via the --directory option.
The most important outputs of this are:
- The calling/filtered-all.vcf.gz final variant call file (excluding SnpEff and VEP annotations).
- The qc/multiqc.html MultiQC quality control statistics report.
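As a quick sanity check on the first of these outputs, one can count the variant records in the call file. This is a minimal sketch using only gzip and grep. The function name count_variants is hypothetical, and the default path assumes the example run above, with outputs written below example/.

```shell
# Count variant records (all lines not starting with "#") in a compressed VCF.
# Hypothetical helper; the default path assumes the example run.
count_variants() {
    vcf=${1:-example/calling/filtered-all.vcf.gz}
    gzip -dc "$vcf" | grep -vc '^#'
}
```

A count of zero after a successful run would suggest that the filter settings or the input data deserve a closer look.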
See Setup and Usage for more details.