Skip to content

Advanced Usage

Lucas Czech edited this page Jul 14, 2024 · 31 revisions

Working directory

We always call snakemake from within the grenepipe directory, because this is where it is looking for the code run the pipeline. However, that typically is not the directory where you want to store your data. Hence, we need to tell snakemake where to find the config.yaml file of your run.

To this end, we use the the snakemake --directory option, which specifies the directory where your config.yaml file is, and to which output files will be written:

snakemake [other options] --directory /path/to/my-analysis

That is, we are still calling snakemake from the directory where you downloaded grenepipe to, but we specify a --directory, telling Snakemake where to find the the config file and where to put the output files.

Our typical (recommended) setup looks as follows:

  • For a new project/analysis, create a new directory. In that directory:
  • Create a samples.tsv table, listing all fastq file paths (which can be located wherever, depending on your preferences), for example by using the generate-table.py script. Note that your actual sequence data (fastq) files do not need to be stored there as well - and we even recommend to not have them stored with your analysis. They can be in some safe storage on your cluster, and only need to be referenced from the samples table.
  • Copy the config.yaml into that directory as well, and edit as needed. In particular, edit the path to the samples.tsv table, and to the reference genome (which also can be located wherever).
  • Run the pipeline by calling snakemake from the directory where you downloaded grenepipe to, and specify the --directory to point to our newly created directory (the one with the samples.tsv and config.yaml in it).

This puts all your results and outputs in that new directory, which also includes the configuration (good for reproducibility).

Using a different --directory for each run will also easily allow you to repeat analyses with different tools and parameters, which is useful for data exploration. You will only need to create a new config.yaml file per run, each in a separate directory, specify your settings, and start the pipeline.

Processing of the mapped reads

We offer many optional steps to process the mapped reads, before the variant calling. Here, we explore their order.

  1. First, reads are mapped, using the selected mapping-tool of the config file.
  2. Reads are then sorted, and the units (different sequencing runs of the same biological sample/library) are merged.
  3. Next, optionally, if filter-mapped-reads is set, samtools view is used for filtering.
  4. Next, optionally, if clip-read-overlaps is set, bam clipOverlap is used to clip paired end reads.
  5. Now, optionally, duplicates can be removed, depending on remove-duplicates and duplicates-tool, using either Picard MarkDuplicates or dedup.
  6. Finally, optionally, if recalibrate-base-qualities is set, the nucleotide base quality scores of the reads are recalibrated, using GATK BaseRecalibrator.

Whichever steps are activated, the respective next in the list uses the output of the corresponding previous one as its input. The resulting bam files are then used for all downstream processing, such as variant calling. This is also the case when only the mapping is being run (see above, all_bams).

Some tools such as the bam QC tools (samtools stats, samtools flagstat, Picard CollectMultipleMetrics, qualimap) offer to instead run on the mapped, sorted, and merged reads only, that is, as they are produced by step 2 in the above list. This is offered in case users want to see the "raw" mapping QC, instead of the one after the other processing steps as listed above.

[Expand] A note on the behavior before grenepipe v0.12.0.

The order of steps was different prior to grenepipe v0.12.0, where we worked on un-merged bam files, one per unit per sample, instead. That is, the "merging" of different units of the same sample did not happen on the bam files, but instead we forwarded all bam files of a sample to the selected variant caller. This usually works, as the bam files contain Read Group information that identifies them as the same sample. But still, this approach was not the most efficient way, and might have confused some downstream tools. So all releases since v0.12.0 merge the bam files per sample (all its units) first.

One down side of this old approach was that samples where the same biological material was sequenced multiple times could contain duplicate reads in its units (i.e., its re-sequencing runs) that were marked and removed as duplicates within each bam file (by step 5 of the above process), but were still duplicates between the samples (i.e., two bam files containing the same read). Those would not be marked/removed, and hence might have influenced downstream steps. With the new approach, duplicates are marked and removed after merging, so that those between-units duplicates are also found.

Running only parts of the pipeline

Sometimes, you do not want to run all steps of the pipeline. To this end, we offer some shortcuts to stop the run early:

  • Only run the reference genome preparation step.

    snakemake [other options] all_prep
    

    This can be useful if you want to start several runs of grenepipe with the same reference genome, but different config files (e.g., for exploring the effects of different tools and parameters). In that case, if you started all runs at the same time, they would all be trying to process the reference genome simultaneously, which might lead to corrupt files. Running the preparation step once before makes sure that all later runs already have the necessary index files etc, and do not try to create them again, thus avoiding clashes.

  • Only run quality control, see also this page.

    snakemake [other options] all_qc
    

    This will produce all quality control statistics, including the MultiQC report. Note that this might still need to run the whole variant calling if you activated SnpEff or VEP, or bcftools stats, as those will be included in the MultiQC report, and they depend on the variants to be called. Deactivate them in the config to avoid this.

  • Only run the mapping, to get a set of bam files.

    snakemake [other options] all_bams
    

    This will yield all bam files that are requested in the config, i.e., just the sorted and merged bams, the samtools filtered bams (e.g., for ancient DNA), the clipped bams, the duplicate-marked bams, or the base quality recalibrated bams, in their respective output directories, depending on whether these options are activated in the config, as explained above. These bam files are all based on the mapped files, sorted by coordinate, and merged by sample, that is, combining all units (independent sequencing runs) per sample into one bam file. That means, the respective last bam files as listed represent the same data that would also be used for variant calling downstream (if that part of the pipeline is being run).

    This step also outputs a list of symlink files in mapping/final that link to the final bam files that the pipeline produces after all the above steps in the mapping. These are meant as a user convenience, to make it easier to decipher which bam files are the ones being used for the variant calling.

    With these final mapped files, it is then also possible to start the pipeline, for instance to combine multiple runs. See below for details.

  • Only run the mapping and mpileup creation.

    snakemake [other options] all_pileups
    

    This is the same as the above all_bams step, but additionally creates the mpileup files as specified in the config "settings:pileups" list. Note that in order for this target to do anything at all, at least one of the pileup options has to be activated in the config file.

Furthermore, snakemake offers to just run certain rules, and has some other tricks up its sleeve, see their command line interface for details.

Running the variant calling from a set of bam files

In certain circumstances it might be useful to skip the initial fastq-based steps (trimming, mapping, read filtering), and instead start the pipeline from a set of bam files. This can help exploring the right parameters, or to combine distinct runs (as long as they use the same reference genome). TODO: This feature is work in progress. More on this later.

Ancient DNA (aDNA)

A more advanced use case is that of ancient DNA. In that case, similar to the above section, we usually do not want to run the whole pipeline - variant calling might not be of interest. Instead, typically, we want to run all steps prior to the usual calling step, that is, trimming, mapping, duplicate removal, and furthermore the damage control tools.

To achieve that, we can use the same above methods of running only parts of the pipeline. Two approaches are to run either all_bams or all_qc, as explained in the section above. The former will only run everything up until (and including) the mapping, with all its associated steps (clipping, duplicate removal, base recalibration) as far as those steps are activated in the config.yaml. The latter will additionally run the QC steps, which also include the two damage profiling tools (mapdamage and damageprofiler) if those are activated in the config.yaml. The bam files resulting from these steps can then be used for downstream analyses with external tools as needed.

Note however that by default (as of now), the optional bcftools-stats step is activated in the config.yaml, which requires the filtered VCF as input. Hence, if you want to run the QC steps, but without any VCF-related steps, you need to deactivate bcftools-stats. Before submitting any large jobs to the cluster, we recommend to run snakemake [other options] -nq, which is a dry run summary that lists all steps that would be executed - and use this to check that only the intended steps are being listed. You can further use snakemake [other options] -n --reason to list the reason for every step being run, if you need to investigate why certain steps are listed.

Frequency calling from pool-seq data with HAF-pipe

Although grenepipe is to a large extend about calling variants from sequenced individuals, we also offer tools that are useful for pool-sequencing data, where each sample is obtained from genetic material of a whole pool of individuals (of the same species). This can be a cost effective way to assess diversity and changes in allele frequencies, for example in Evolve and Resequence experiments. See for example here for some details and experiments with that approach.

For obtaining allele frequencies from pool-seq data, we typically do not need to execute the variant calling step. Instead, we use the mapped reads, and work from there. For each sample - representing a pool of individuals - we could just count the bases at each position, and compute the frequency of any alternative base compared to the reference base (based on the reference genome). See again here for comparisons of different approaches and their upsides and downsides.

However, in many types of experiments where populations are sequenced over multiple generations, we often have a founder generation of individuals that we can use to improve the statistical quality of the frequencies. In grenepipe, we offer to run HAF-pipe, which is a tool to calculate haplotype-inferred allele frequencies from pool-seq data and founder SNP profiles.

After setting the necessary files and parameters in the hafpipe section of the config.yaml, run

snakemake [other options] all_hafpipe

to obtain the frequency files. We improve upon the original HAF-pipe, which outputs individual files per sample and per chromosome, by creating a combined table containing all frequencies in the file "hafpipe/all.csv".

See the hafpipe section in the config.yaml for more details and information.

Un-assembled reference genomes with many contigs/scaffolds

For some reference genomes, not all chromosomes/contigs have been fully assembled yet, and instead the reference genome consists of many small contigs/scaffolds. As some of the steps in the workflow however parallelize over contigs (for speed), this can lead to a large number of jobs being created, which in particular can cause issues when running in cluster environments. It will slow down the snakemake execution itself, but also might start hundreds of thousands of jobs, which is rarely a good idea.

To solve this issue, use the setting contig-group-size in the config.yaml. See there for more details and an explanation of how this feature works. In short, it runs the computation for several contigs in a single job, without however affecting the produced output.

Re-using conda environments

With the typical setup, snakemake unfortunately stores all conda environments inside the specified --directory. That means, all conda environments are downloaded again and again for each --directory (i.e., each run of the pipeline) that we use. This is of course not desirable (and it is mysterious why snakemake behaves that way), so let's avoid that by setting

snakemake [other options] --conda-prefix ~/path/to/my/software/conda-envs

This stores all conda environments in the specified directory. This can be in your home directory (~/conda-envs) for example. Just make sure to use the same prefix for every run.

Cluster environments with different file system partitions

Many clusters have different file system partitions, intended for different use cases. Often, there is long-term storage (typically backed-up), and so-called scratch space, which is meant for faster file access. This temporary storage is often recommended for fast access of large files - such as what we need in grenepipe.

Hence, when your cluster offers such a scratch partition, we recommend to copy your input fastq files onto this partition (typically, you want to copy instead of move, to ensure that a backed-up version of the files remains). To this end, you can use the tools/copy-samples.py script. The script also allows to clean up sample and file names with invalid or difficult characters, which might be useful as well.

Furthermore, it is recommended to use a working directory for running grenepipe that is also located on that scratch space, for the same reasons. See above for details on this. Furthermore, see our Cluster and Profiles page for more details on running grenepipe on clusters.

Clone this wiki locally