Cluster and Profiles

Lucas Czech edited this page May 12, 2022 · 55 revisions

To fully leverage cluster environments, rules can be executed as individual jobs, running in parallel. See the Snakemake Cluster Execution page for details. Snakemake provides so-called profiles to wrap the configurations necessary for each cluster environment.

In grenepipe, we provide a starting point for Snakemake profiles, which might come in handy when running the pipeline locally or in a cluster environment. Our profiles are located in grenepipe/profiles. Currently, we provide profiles for the slurm workload manager. See the Snakemake-Profiles repository for templates for other cluster environments.

Cluster environments

Many computer clusters come with a module system, where you can module load certain tools. In our experience, that rarely works with snakemake, as versions of tools will often mismatch. We therefore use conda throughout the pipeline, both for installing snakemake and for all its dependencies (including python, pandas, numpy, etc). You thus only need to make sure that conda (or better: mamba) works on your cluster. This can either be via a module (if your cluster provides conda/mamba as a module), or simply via a user-local installation of Miniconda.

Note that we again assume a consistent conda environment for running snakemake, as described in Setup and Usage; see there how to install the necessary grenepipe conda environment. This also applies here, i.e., activate that environment before the below steps!

When you run multiple analyses, also make sure to specify a --conda-prefix, so that conda environments are not re-created every time, as described in Setup and Usage.

Rule of thumb: Clusters often have hiccups - some temporary network issues, some node having a problem, or jobs failing for no apparent reason, etc. No worries, just wait until snakemake/grenepipe stops running and until all currently running jobs in the queue are done (or kill them manually), and start the pipeline again, with the same command as before. It will continue where it left off. Only if the error is persistent or apparently caused by some bug is it worth looking into it in more detail (as described below).

Usage

Typically, cluster environments provide a "login node" from which you submit jobs to the "compute nodes" of the cluster. In such a setup, you want to make sure that Snakemake keeps running and submitting jobs for you, even when you close your terminal session on the login node. To this end, we recommend using Unix/Linux tools such as tmux or screen, which enable you to detach your session, and return to it later.

That is, on the login node, start a screen or tmux session. In that session, start Snakemake with grenepipe as described below. Do not submit this main Snakemake command as a job itself - we do not want it to stop once that job runs out of time or the like. This main Snakemake call needs only a few resources, as it merely submits and checks the actual compute jobs, so it should be okay to run this on the login node; you can also check with your cluster admins to make sure. With the cluster profiles set up as described below, the actual compute jobs are then automatically submitted to the cluster compute nodes by Snakemake.

For this, you need to tell Snakemake to actually use the cluster profile, so that jobs are not accidentally run on the login node.

Example:

snakemake [other options] --profile profiles/slurm

for our default slurm profile. Add or edit profiles as needed for your cluster environment.

Alternatively, Snakemake looks for profiles in ~/.config/snakemake. Hence, you can also copy the contents of the local or the slurm subdirectory to that location on your system, and then do not need to specify --profile each time when calling Snakemake.
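
The detach-and-return workflow described above can be sketched with a small shell helper. This is only an illustration, not part of grenepipe: the function name grenepipe_session and the default session name are made up here, and the sketch assumes that tmux is installed on the login node and uses its default key bindings.

```shell
# Hypothetical helper (not part of grenepipe): open the named tmux session,
# creating it first if it does not exist yet.
grenepipe_session() {
    session_name="${1:-my-grenepipe-run}"
    # -A makes new-session attach to an existing session of that name
    # instead of failing; -s sets the session name.
    tmux new-session -A -s "$session_name"
}

# Typical use on the login node:
#   grenepipe_session my-grenepipe-run   # first call creates the session
#   conda activate grenepipe
#   snakemake [other options] --profile profiles/slurm
#   (detach with Ctrl-b d, log out, come back later)
#   grenepipe_session my-grenepipe-run   # a later call re-attaches
```

Using the same call for creating and re-attaching avoids accidentally starting a second session, and with it a second Snakemake instance, for the same run.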

TL;DR (aka Summary)

Install the conda environment that grenepipe needs:

cd /path/to/grenepipe
mamba env create -f envs/grenepipe.yaml

Then start the pipeline within a tmux session, so that it keeps submitting jobs to the cluster, and make sure that conda environments are re-used across runs:

tmux new-session -s my-grenepipe-run
conda activate grenepipe
snakemake --conda-frontend mamba --conda-prefix ~/conda-envs --profile profiles/slurm/ --directory /path/to/data

This should work for most slurm-based cluster systems out of the box, at least for smaller datasets. For larger datasets, more memory or wall time might be needed, as described below.

Troubleshooting

Finding errors and debugging in a cluster environment can be a bit tricky. Potential sources of error are snakemake, python, conda, slurm (or your cluster submission / workload manager system), all the tools being run, the grenepipe code itself, and the computer cluster (broken nodes, internet down, etc), and each of them manifests in a different way. That is an unfortunate part of any sufficiently complicated bioinformatics setup though. Often, some digging is needed, and we have to follow traces of log files.

For example, depending on the dataset, the default cluster settings might not work well. The most common issues are

  • an issue with the input data,
  • a cluster issue (e.g., a node failing) or a network issue (e.g., could not load a conda environment),
  • running out of (wall) time, and
  • running out of memory.

In these cases, slurm (and other workload managers) will abort the job, and Snakemake will print an error message such as

[Sat May 29 12:44:21 2021]
Error in rule trim_reads_pe:
    jobid: 1244
    output: trimmed/S1-1.1.fastq.gz, trimmed/S1-1.2.fastq.gz, trimmed/S1-1-pe-fastp.html, trimmed/S1-1-pe-fastp.json
    log: logs/fastp/S1-1.log (check log file(s) for error message)
    conda-env: /...
    cluster_jobid: 8814415

Error executing rule trim_reads_pe on cluster (jobid: 1244, external: 8814415, jobscript: /.../snakejob.trim_reads_pe.1244.sh). 
For error details see the cluster log and the log files of the involved rule(s).
Trying to restart job 1244.

This error message can be used to investigate the error and look at further log files with more detail:

  1. First, the log file of the tool being run should be checked, which in the above case is logs/fastp/S1-1.log. If there was an issue with the input data, the cluster itself, or conda and the tool being run, this usually shows up here. In that case, you will have to fix the issue with the data or the tool, and run the pipeline again.

  2. If, however, the problem is too little memory or time, this log file might be empty. In that case, check the slurm log files for this job, which are located in your analysis directory under slurm-logs (if you are using our slurm profile, see below). The above Snakemake error message contains the external slurm job ID (cluster_jobid: 8814415, or external: 8814415); use for example find slurm-logs/ -name "*8814415*" to find the corresponding slurm log files. Of course, for other workload managers, and depending on where you obtained the --profile from, this might differ.

  3. For example, you will find

    slurm-logs/trim_reads_pe/snakejob.trim_reads_pe.sample=S1.unit=1.8814415.out
    slurm-logs/trim_reads_pe/snakejob.trim_reads_pe.sample=S1.unit=1.8814415.err
    

    Typically, the .err file will contain a description of why slurm stopped the job.

  4. That file, usually at its end, might contain the hint that we are looking for:

    slurmstepd: error: Detected 1 oom-kill event(s) in step 8814415.batch cgroup. 
    Some of your processes may have been killed by the cgroup out-of-memory handler.
    

    So here, we ran out of allocated memory.
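
Steps 2 to 4 above can be combined into a small shell helper. The following is a sketch under several assumptions, not a grenepipe feature: the function name diagnose_slurm_job and its output labels are made up for illustration; it assumes that the slurm log files are collected under slurm-logs with the external job ID in their file name, as with our slurm profile; and it assumes that a time-out shows up as a cancellation message containing "DUE TO TIME LIMIT", which might differ on your cluster.

```shell
# Hypothetical helper: guess why slurm stopped a job, based on the
# slurm log files whose names contain the external job ID.
# Usage: diagnose_slurm_job <cluster_jobid> [log directory]
diagnose_slurm_job() {
    jobid="$1"
    logdir="${2:-slurm-logs}"

    # Concatenate the contents of all log files matching the job ID
    # (same pattern as in step 2 above).
    logs=$(find "$logdir" -type f -name "*${jobid}*" -exec cat {} + 2>/dev/null)

    if [ -z "$logs" ]; then
        # Either no matching files, or the files are empty.
        echo "no-logs"
        return 1
    fi

    case "$logs" in
        *oom-kill*)            echo "out-of-memory" ;;
        *"DUE TO TIME LIMIT"*) echo "time-limit" ;;
        *)                     echo "unknown" ;;
    esac
}
```

For the example above, diagnose_slurm_job 8814415 would report out-of-memory. Note that the job ID is matched as a substring of the file name, so a short ID can also match the log files of other jobs; for anything but "unknown", it is still worth reading the .err file itself.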

Using this information, you can then edit your slurm config file (e.g., profiles/slurm/cluster_config.yaml, or the cluster_config.yaml file in the --profile directory that you used). In the above example, trim_reads_pe ran out of memory, so we want to edit

trim_reads_pe:
  mem: 1G

to

trim_reads_pe:
  mem: 5G

to give it more memory. If time was the issue, change the line time: x in the respective entry instead (or add that line if it is not present). If there is no entry for the particular job step, you can add one in the same manner as the other job step entries in that file, and simply add a line for mem or time as needed. All other properties are inherited from the __default__ entry at the top of the file. See the Snakemake Profiles documentation, and if you are using slurm, see the Snakemake slurm profile template documentation for details.
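
Before editing, it can help to check whether a rule already has its own entry in the config file, or only inherits from __default__. The helper below is a minimal sketch for that; the function name show_rule_entry is hypothetical, and it assumes that the file is laid out like the snippets above, with each rule name as an unindented key on its own line and its settings indented below it.

```shell
# Hypothetical helper: print the entry of one rule from a cluster config.
# Usage: show_rule_entry <rule name> [path to cluster_config.yaml]
show_rule_entry() {
    rule="$1"
    config="${2:-profiles/slurm/cluster_config.yaml}"
    awk -v key="${rule}:" '
        # An unindented, non-comment line starts a new top-level entry;
        # we are inside the wanted entry only if it is exactly "<rule>:".
        /^[^ #]/ { inblock = ($0 == key) }
        # Print the non-empty lines of the wanted entry.
        inblock && NF { print }
    ' "$config"
}
```

For example, show_rule_entry trim_reads_pe prints the trim_reads_pe entry with its mem line, and prints nothing if the rule has no entry yet, in which case a new entry needs to be added as described above.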

Profiles

Local

The local profile is a convenience that saves you from setting all necessary options by hand each time Snakemake is called. It is not strictly needed, but if you find yourself using a lot of individual Snakemake settings, it might come in handy.

Slurm

The slurm profile is based on the Snakemake Profiles cookiecutter template, and is a general starting point for running grenepipe on slurm-based clusters that should work out of the box on most systems.

We also extended the original template as follows:

  • Slurm log files are collected in a subdirectory, instead of cluttering the main directory.
  • We write some more debugging log files that list how jobs are submitted to the cluster.
  • Some more quality of life additions and fixes, nicer output, etc.

There are two files of interest for customization:

  • config.yaml: General Snakemake configuration, similar to what the above "local" profile offers. It also sets the submission scripts for slurm as needed.
  • cluster_config.yaml: This file contains per-rule customization, for example, execution times, memory limits, partitions and user names to use for the submission, etc.

The latter file is also the place to set further slurm attributes that would normally go directly into the submission script (via #SBATCH lines for example). Simply add those lines as key-value pairs similar to the ones already present in our cluster_config.yaml - that is, either to the __default__ category for all jobs, or in sub-categories for individual rules only.

Memex

The memex profile is a specialization of the above slurm profile for the Carnegie Science cluster "memex". It serves here as an example of how to customize the profiles further.

Note that we re-used the submission scripts via symlinks, and only copied the two config files to the memex profile directory.
