
Advanced Usage

Lucas Czech edited this page Aug 13, 2022 · 31 revisions

Working Directory

We call snakemake from within the grenepipe directory, because this is where it looks for the code to run the pipeline. However, that is typically not the directory where you want to store your data. Hence, we need to tell snakemake where to find the config.yaml file of your run.

To this end, we use the snakemake --directory option, which specifies the directory where your config.yaml file is, and to which output files will be written:

snakemake [other options] --directory /path/to/my-analysis

That is, we are still calling snakemake from the main grenepipe directory, but this time we specify a different --directory for the config and output files.

Our typical (recommended) setup looks as follows:

  • For a new project/analysis, create a new directory. In that directory:
  • Create a samples.tsv table, listing all fastq file paths (which can be located wherever, depending on your preferences). Note that your actual sequence data (fastq) files do not need to be stored there as well; we even recommend not storing them with your analysis. They can be in some safe storage on your cluster, and only need to be referenced from the samples table.
  • Copy the config.yaml into it, and edit as needed. In particular, edit the path to the samples.tsv table, and to the reference genome (which also can be located wherever).
  • Run the pipeline by calling snakemake from the main grenepipe directory, and specify the --directory to point to your newly created directory with the samples.tsv and config.yaml in it.

This puts all your results and outputs in that new directory, which also includes the configuration (good for reproducibility).

Using a different --directory for each run also makes it easy to repeat analyses with different tools and parameters, for example to explore your data. You only need to create a new config.yaml file per run, each in a separate directory, specify your settings, and start the pipeline.
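The setup described above could be scripted roughly as follows. This is a minimal sketch, not part of grenepipe: the helper names `prepare_run_dir` and `print_snakemake_call`, the placeholder paths, and the assumed location of the config template are all hypothetical. The script only prints the snakemake call instead of running it, so that you can still create your samples.tsv and edit the copied config.yaml first.

```shell
#!/bin/sh
# Sketch only: prepare a new analysis directory and print the snakemake call for it.
# GRENEPIPE_DIR, RUN_DIR, and CONFIG_TEMPLATE are hypothetical placeholders.
set -eu

GRENEPIPE_DIR="${GRENEPIPE_DIR:-/path/to/grenepipe}"
RUN_DIR="${RUN_DIR:-./my-analysis}"
CONFIG_TEMPLATE="${CONFIG_TEMPLATE:-$GRENEPIPE_DIR/config.yaml}"   # assumed location

# Create the run directory and copy the config template into it, if the template exists.
# An existing config.yaml in the run directory is never overwritten.
prepare_run_dir() {
    mkdir -p "$1"
    if [ -f "$2" ] && [ ! -e "$1/config.yaml" ]; then
        cp "$2" "$1/config.yaml"
    fi
}

# Print the command line; snakemake is still called from the grenepipe directory.
print_snakemake_call() {
    printf '(cd "%s" && snakemake --directory "%s")\n' "$1" "$2"
}

prepare_run_dir "$RUN_DIR" "$CONFIG_TEMPLATE"
# Resolve to an absolute path, because snakemake is started from a different directory.
RUN_DIR_ABS="$(cd "$RUN_DIR" && pwd)"
print_snakemake_call "$GRENEPIPE_DIR" "$RUN_DIR_ABS"
```

Any other snakemake options that you normally use would be added to the printed command before running it.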

Conda Environments

With the typical setup, snakemake unfortunately stores all conda environments inside the specified --directory. That means that all conda environments are downloaded again for each --directory (each run) that we use. This is of course not desirable (and it is mysterious why snakemake behaves that way), so let's avoid that by setting

snakemake [other options] --conda-prefix /path/to/some/stable/conda/directory

This stores all conda environments in the specified directory. This can be in your home directory (~/conda-envs) for example. Just make sure to use the same prefix for every run.
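One way to make sure that the same prefix is used for every run is a small wrapper function, for example in your shell startup file. The following is a sketch under assumptions: `grenepipe_run`, `SHARED_ENVS`, and `SNAKEMAKE_BIN` are hypothetical names chosen for illustration, and are not provided by grenepipe or snakemake.

```shell
#!/bin/sh
# Sketch only: a wrapper that always passes the same --conda-prefix to snakemake.
# SHARED_ENVS, SNAKEMAKE_BIN, and grenepipe_run are hypothetical names.

SHARED_ENVS="${SHARED_ENVS:-$HOME/conda-envs}"   # one stable directory for all runs
SNAKEMAKE_BIN="${SNAKEMAKE_BIN:-snakemake}"      # can be overridden, e.g. for testing

# Forward all given arguments to snakemake, and add the shared conda prefix.
grenepipe_run() {
    "$SNAKEMAKE_BIN" --conda-prefix "$SHARED_ENVS" "$@"
}
```

It would then be called from the main grenepipe directory like snakemake itself, for instance as `grenepipe_run [other options] --directory /path/to/my-analysis`.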

Running only parts of the pipeline

Sometimes, you do not want to run all steps of the pipeline. To this end, we offer some shortcuts:

  • Only run the reference genome preparation step.

    snakemake [other options] all_prep
    

    This can be useful if you want to start several runs of grenepipe with the same reference genome, but different config files (e.g., for exploring the effects of different tools and parameters). In that case, if you started all runs at the same time, they would all try to process the reference genome simultaneously, which might lead to corrupt files. Running the preparation step once beforehand makes sure that all later runs already have the necessary index files etc., and do not try to create them again, thus avoiding clashes.

  • Only run quality control, see also this page.

    snakemake [other options] all_qc
    

    This will produce all quality control statistics, including the MultiQC report. Note that this might still need to run the whole variant calling if you activated SnpEff or VEP, as those will be included in the MultiQC report, and they depend on the variants. Deactivate them in the config to avoid this.

  • Only run the mapping, to get a set of bam files.

    snakemake [other options] all_bams
    

    This will yield all bam files that are requested in the config, i.e., just the sorted bams, the samtools filtered bams (e.g., for ancient DNA), the duplicate-marked bams, or the base quality recalibrated bams, in their respective output directories.

  • Only run the mapping, and mpileup creation.

    snakemake [other options] all_pileups
    

    This is the same as the above all_bams step, but additionally creates the mpileup files as specified in the config "settings:pileups" list. Note that in order for this target to do anything at all, at least one of the pileup options has to be activated in the config file.

Furthermore, snakemake can run just certain rules, and has some other tricks up its sleeve; see its command line interface for details.
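As an illustration of the all_prep use case described above, the following sketch first prepares the reference genome once, and only then starts the actual runs. It rests on assumptions: `prep_then_run`, `SNAKEMAKE_BIN`, and `SNAKEMAKE_OPTS` are hypothetical names, and all given run directories are assumed to reference the same reference genome in their config.yaml.

```shell
#!/bin/sh
# Sketch only: prepare the reference genome once, then start the actual runs.
# prep_then_run, SNAKEMAKE_BIN, and SNAKEMAKE_OPTS are hypothetical names.

SNAKEMAKE_BIN="${SNAKEMAKE_BIN:-snakemake}"
SNAKEMAKE_OPTS="${SNAKEMAKE_OPTS:-}"   # your usual "[other options]", split on whitespace

# Usage: prep_then_run /path/to/run-a /path/to/run-b ...
prep_then_run() {
    first_dir="$1"
    # Step 1: create the reference genome index files once; stop if this fails.
    "$SNAKEMAKE_BIN" $SNAKEMAKE_OPTS --directory "$first_dir" all_prep || return 1
    # Step 2: run the pipeline for each directory, here simply one after the other.
    for run_dir in "$@"; do
        "$SNAKEMAKE_BIN" $SNAKEMAKE_OPTS --directory "$run_dir" || return 1
    done
}
```

The runs in the second step are sequential here only to keep the sketch short. Once the preparation step has finished, they could equally be started at the same time.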

Un-assembled reference genomes with many contigs/scaffolds

For some reference genomes, not all chromosomes/contigs have been fully assembled yet, and instead the reference genome consists of many small contigs/scaffolds. However, some of the steps in the workflow parallelize over contigs (for speed), which can lead to a large number of jobs being created, and in particular can cause issues when running in cluster environments. This slows down the snakemake execution itself, and might also start hundreds of thousands of jobs, which is rarely a good idea.

To solve this issue, use the setting contig-group-size in the config.yaml. See there for more details and an explanation of how this feature works. In short, it runs the computation for several contigs in a single job, without affecting the produced output.
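To get a feeling for whether this applies to your reference genome, you can count its contigs/scaffolds. The following is a minimal sketch, assuming an uncompressed FASTA file in which each sequence starts with a header line beginning with `>`; the function name `count_contigs` is hypothetical.

```shell
#!/bin/sh
# Sketch only: count the sequences (contigs/scaffolds) in an uncompressed FASTA file.
# Each sequence is introduced by a header line that starts with ">".
count_contigs() {
    grep -c '^>' "$1"
}
```

For example, `count_contigs /path/to/reference.fasta` prints the number of sequences, which corresponds to the number of contigs that the workflow could parallelize over.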
