Cluster and Profiles
To fully leverage cluster environments, independent steps of the pipeline can be executed as individual jobs, running in parallel. See the Snakemake Cluster Execution page for details.
As different cluster systems work differently, Snakemake provides so-called profiles to wrap the configurations necessary for each cluster environment. See the Profiles section below for details.
This is a summary, showing how we recommend using the pipeline in a cluster environment. See Setup and Usage for details on the general setup, and see below for details on running grenepipe on a cluster.
Preparing the environment: As before, install the proper conda environment that we want for grenepipe:
cd /path/to/grenepipe
mamba env create -f envs/grenepipe.yaml
Running the pipeline: For each analysis you want to conduct, start the pipeline within a tmux session, so that it keeps submitting jobs to the cluster, and make sure that conda environments are re-used across runs:
tmux new-session -s my-grenepipe-run
conda activate grenepipe
snakemake \
--conda-frontend mamba \
--conda-prefix ~/conda-envs \
--profile profiles/slurm/ \
--directory /path/to/data
This should work for most slurm-based cluster systems out of the box, at least for smaller datasets. For larger datasets, more memory or wall time might be needed, as described below.
After this, press control + b, and then d to detach from the tmux session (when using screen instead, the commands differ). Now you can close your terminal.
Next time you want to check in on your analysis, log onto the cluster, and re-attach to the tmux session:
tmux attach-session -t my-grenepipe-run
Now you should see the Snakemake output again.
Note: Large clusters tend to have multiple login nodes, to distribute load between them. Your tmux session lives only on the node where it was started. Hence, take note of the name of that node; you might need to ssh login-xyz to that node if your initial login put you on a different login node.
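To make the note about login nodes a bit more concrete, here is a minimal sketch of how one could keep track of which login node hosts which tmux session. The two function names and the notes file ~/.grenepipe-tmux-hosts are hypothetical and not part of grenepipe; the sketch also assumes that your home directory is shared between the login nodes.

```shell
# Hypothetical helpers (not part of grenepipe): remember on which login node
# a tmux session was started. The default notes file is an arbitrary choice.

record_session_host() {
    local session_name="$1"
    local notes_file="${2:-$HOME/.grenepipe-tmux-hosts}"
    # Append "<session name> <host name>", so that several runs can be tracked.
    printf '%s %s\n' "$session_name" "$(hostname)" >> "$notes_file"
}

lookup_session_host() {
    local session_name="$1"
    local notes_file="${2:-$HOME/.grenepipe-tmux-hosts}"
    [ -f "$notes_file" ] || return 1
    # Print the most recently recorded host for this session name, if any.
    awk -v name="$session_name" \
        '$1 == name { host = $2 } END { if (host != "") print host }' \
        "$notes_file"
}
```

For example, you could call record_session_host my-grenepipe-run right after starting the tmux session, and later use lookup_session_host my-grenepipe-run to see which node to ssh to before re-attaching.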
We here assume basic familiarity with unix-like terminals, command line interfaces, and SSH to access remote computers.
Many computer clusters come with a module system, where you can module load certain tools. In our experience, that rarely works with snakemake, as versions of tools will often mismatch and clash. We hence want to use conda throughout the pipeline, both for installing snakemake and for all its dependencies (including python, pandas, numpy, etc). Hence, you only need to make sure that conda (or better: mamba) works on your cluster. This can either be via a module (if your cluster provides conda/mamba as a module), or simply a user-local installation of Miniconda.
Note that we again assume a consistent conda environment for running snakemake, as described in Setup and Usage; see there how to install the necessary grenepipe conda environment. This also applies here, i.e., activate that environment before the below steps!
When you run multiple analyses, also make sure to specify a --conda-prefix in order to not re-create conda environments every time, as described in Advanced Usage.
Rule of thumb: Clusters often have hiccups - some temporary network issues, some node having a problem, or jobs failing for no apparent reason, etc. No worries, just wait until snakemake/grenepipe stops running and until all currently running jobs in the queue are done (or kill them manually), and start the pipeline again, with the same command as before. It will continue where it left off. Only if the error is persistent or apparently caused by some bug is it worth looking into in more detail (as described below).
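As a small illustration of the "wait until the queue is empty" step on a slurm cluster, the following sketch counts your own jobs that are still queued or running. The function name is hypothetical; squeue is the standard slurm command for listing jobs, and the sketch simply gives up if it is not available.

```shell
# Hypothetical helper: count the jobs of the current user that slurm still
# lists as pending or running. Assumes a slurm cluster where squeue exists.
count_my_queued_jobs() {
    if ! command -v squeue >/dev/null 2>&1; then
        echo "squeue not found; is slurm available on this machine?" >&2
        return 1
    fi
    # -h: do not print the header line, -u: only list jobs of this user
    squeue -h -u "${USER:-$(id -un)}" | wc -l
}
```

Once this reports 0 (or after cancelling leftover jobs by hand, e.g., with scancel and the job ID), the same snakemake command as before can be started again.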
Typically, cluster environments provide a "login node" from which you submit jobs to the "compute nodes" of the cluster. Think of a "node" as one physical computer sitting somewhere on a rack in a data center, where each node offers some compute power (number of cores/CPUs, amount of memory, etc).
In such a setup, you want to make sure that Snakemake keeps running and submitting jobs for you, even when you close your console (terminal) session on the login node. To this end, we recommend using Unix/Linux tools such as tmux or screen, which enable you to detach from your session, and return to it later.
That is, on the login node, start a screen or tmux session. In that session, start Snakemake with grenepipe as described below. Do not submit this main Snakemake command as a job itself - we do not want it to stop once that job runs out of allocated time. This main Snakemake call usually needs only a few resources, as it merely submits and checks the actual compute jobs, so it should be okay to run this on the login node; you can also check with your cluster admins to make sure. With the cluster profiles set up as described below, the actual compute jobs are then automatically submitted to the cluster compute nodes by Snakemake.
For this, you need to tell Snakemake to actually use the cluster profile, so that jobs are not accidentally run on the login node.
Example:
snakemake [other options] --profile profiles/slurm
for our default slurm profile (see below for details). Add or edit the profile files as needed for your cluster environment.
Alternatively, Snakemake looks for profiles in ~/.config/snakemake. Hence, you can also copy the contents of the slurm subdirectory to that location on your system, and then do not need to specify --profile each time when calling Snakemake.
With that, Snakemake will start submitting jobs, and keep submitting until everything is done - hopefully.
Snakemake offers so-called Profiles to adapt to specific cluster environments. In the academic world, the slurm workload manager seems to be the most commonly used cluster system.
In grenepipe, we provide a starting point for Snakemake profiles, which might come in handy when running the pipeline locally or in a cluster environment. Our profiles are located in grenepipe/profiles. Currently, we provide profiles for the slurm workload manager. See the Snakemake-Profiles repository for templates for other cluster environments.
The slurm profile is based on the Snakemake Profiles cookiecutter template, and is a general starting point for running grenepipe on slurm-based clusters that should work out of the box on most systems.
We also extended the original template as follows:
- Slurm log files are collected in a subdirectory, instead of cluttering the main directory.
- We write some more debugging log files that list how jobs are submitted to the cluster.
- Some more quality of life additions and fixes, nicer output, etc.
There are two files of interest for customization:
- config.yaml: General Snakemake configuration, similar to what the "local" profile (described below) offers. It also sets the submission scripts for slurm as needed.
- cluster_config.yaml: This file contains per-rule customization, for example, execution times, memory limits, partitions and user names to use for the submission, etc.
The latter file is also important to set further slurm attributes that would normally go directly into the submission script (via #SBATCH lines for example). Simply add those lines as key-value pairs similar to the ones already present in our cluster_config.yaml - that is, either to the __default__ category for all jobs, or in sub-categories for individual rules only.
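If you already have a submission script with #SBATCH lines, a rough sketch like the following can help to turn them into key-value pairs. The function name is hypothetical, and the sketch assumes that the keys in cluster_config.yaml correspond to the long sbatch option names (as in --time or --mem); please compare the result with the entries already present in the file before pasting anything in.

```shell
# Hypothetical helper: read a submission script on stdin and print a
# "    key: value" line for each "#SBATCH --key=value" line.
# Assumption: the keys in cluster_config.yaml match the long sbatch option names.
sbatch_to_key_values() {
    sed -n 's/^#SBATCH[[:space:]]*--\([A-Za-z0-9-]*\)=\(.*\)$/    \1: \2/p'
}
```

For instance, sbatch_to_key_values < my_old_job.sh (with my_old_job.sh being a placeholder for your own script) prints indented lines that can be reviewed and then copied under __default__ or under an individual rule. Options given in short form (such as -t) or without an equals sign are deliberately ignored by this sketch.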
The local profile is meant as a helper, so that you do not have to set all necessary settings by hand each time Snakemake is called. It is not really needed, but if you find yourself using a lot of individual Snakemake settings, it might come in handy.
The memex profile is a specialization of the above slurm profile for the Carnegie Science cluster "memex". It serves here as an example of how to customize the profiles even more.
Note that we re-used the submission scripts via symlinks, and only copied the two config files to the memex profile directory.
Finding errors and debugging in a cluster environment can be a bit tricky. Potential sources of error are snakemake, python, conda, slurm (or your cluster submission / workload manager system), all the tools being run, the grenepipe code itself, and the computer cluster (broken nodes, internet down, etc), and each of them manifests in a different way. That is an unfortunate part of any sufficiently complicated bioinformatics setup though. Often, some digging is needed, and we have to follow traces of log files.
For example, depending on the dataset, the default cluster settings might not work well. The most common issues are
- an issue with the input data,
- a cluster issue (e.g., a node failing) or a network issue (e.g., could not load a conda environment),
- running out of (wall) time, and
- running out of memory.
In these cases, slurm (and other workload managers) will abort the job, and Snakemake will print an error message such as
[Sat May 29 12:44:21 2021]
Error in rule trim_reads_pe:
jobid: 1244
output: trimmed/S1-1.1.fastq.gz, trimmed/S1-1.2.fastq.gz, trimmed/S1-1-pe-fastp.html, trimmed/S1-1-pe-fastp.json
log: logs/fastp/S1-1.log (check log file(s) for error message)
conda-env: /...
cluster_jobid: 8814415
Error executing rule trim_reads_pe on cluster (jobid: 1244, external: 8814415, jobscript: /.../snakejob.trim_reads_pe.1244.sh).
For error details see the cluster log and the log files of the involved rule(s).
Trying to restart job 1244.
If you cannot access this on-screen output, for example because you already closed your tmux session, you can also find the full Snakemake log output of each Snakemake run in the hidden directory .snakemake/log/ in the analysis directory (where the config.yaml is), sorted by time.
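To quickly get at the most recent of these log files, something along the following lines might help. This is a sketch with a hypothetical function name; it assumes that the log files in .snakemake/log/ end in .log and picks the one that was modified last.

```shell
# Hypothetical helper: print the path of the most recently modified
# Snakemake log file of an analysis directory (default: current directory).
latest_snakemake_log() {
    local analysis_dir="${1:-.}"
    # ls -t sorts by modification time, newest first; errors are silenced
    # so that nothing is printed if there are no log files yet.
    ls -t "$analysis_dir"/.snakemake/log/*.log 2>/dev/null | head -n 1
}
```

For example, less "$(latest_snakemake_log /path/to/data)" opens the log of the last run, in which you can then search for lines starting with "Error in rule".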
This error message can be used to investigate the error and look at further log files with more detail:
- First, the log file of the tool being run should be checked, which in the above case is logs/fastp/S1-1.log. If there was an issue with the input data, the cluster itself, or conda and the tool being run, this usually shows up here. In that case, you will have to fix the issue with the data or the tool, and run the pipeline again.
- If the problem is however related to cluster settings such as too little memory or time, this log file might be empty. In that case, check the slurm log files for this job, which are located in your analysis directory under slurm-logs (if you are using our slurm profile, see above). The above Snakemake error message contains the external slurm job ID (cluster_jobid: 8814415, or external: 8814415); use for example find slurm-logs/ -name "*8814415*" to find the corresponding slurm log files. Of course, for other workload managers, and depending on where you obtained the --profile from, this might differ.
- For example, you will find
  slurm-logs/trim_reads_pe/snakejob.trim_reads_pe.sample=S1.unit=1.8814415.out
  slurm-logs/trim_reads_pe/snakejob.trim_reads_pe.sample=S1.unit=1.8814415.err
  Typically, the .err file will contain a description of why slurm stopped the job.
- That file, usually at its end, might contain the hint that we are looking for:
  slurmstepd: error: Detected 1 oom-kill event(s) in step 8814415.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
  So here, we ran out of allocated memory (oom = out of memory). When in doubt, just google the error message ;-)
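The last steps of this search can be sketched as a small shell function. The function name is hypothetical, the slurm-logs default follows our slurm profile, and the two search patterns are assumptions: "oom-kill" is taken from the message above, while "DUE TO TIME LIMIT" is the wording that slurm typically uses when a job exceeds its wall time, which may differ on your system.

```shell
# Hypothetical helper: search the slurm .err log files of one job for typical
# hints on why slurm stopped it. Assumes the slurm-logs layout described above.
find_slurm_failure_hints() {
    local job_id="$1"
    local log_dir="${2:-slurm-logs}"
    # Find all .err files with the job ID in their name, and print matching
    # lines, prefixed with the file name (-H).
    find "$log_dir" -type f -name "*${job_id}*.err" \
        -exec grep -H -E "oom-kill|DUE TO TIME LIMIT" {} +
}
```

Called in the analysis directory as find_slurm_failure_hints 8814415, this prints the matching lines of the .err file, if there are any. No output means that neither of the two patterns was found, in which case it is worth reading the whole file.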
Using this information, you can then edit your slurm config file (e.g., profiles/slurm/cluster_config.yaml, or the cluster_config.yaml file in the --profile directory that you used). In the above example, the task trim_reads_pe ran out of memory, so we want to edit
trim_reads_pe:
  mem: 1G
to
trim_reads_pe:
  mem: 5G
to give it more memory. If time was the issue, change the line time: x in the respective entry (or add that line if not present). If there is no entry for the particular job step, you can add one in the same manner as the other job step entries in that file, and simply add a line for mem or time as needed. All other properties are inherited from the __default__ entry at the top of the file. See the Snakemake Profiles documentation, and if you are using slurm, see the Snakemake slurm profile template documentation for details.
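Before and after editing, it can be handy to check what is currently set for a rule. The following is a minimal sketch with a hypothetical function name; it assumes the simple layout shown above, where each rule name is a top-level key on its own line, followed by indented key-value pairs.

```shell
# Hypothetical helper: print the entry of one rule (or of __default__)
# from a cluster_config.yaml file with the simple layout shown above.
show_rule_entry() {
    local rule="$1"
    local config_file="${2:-profiles/slurm/cluster_config.yaml}"
    awk -v rule="$rule" '
        # Each non-indented, non-comment line is a top-level key and starts a new entry.
        /^[^[:space:]#]/ { in_entry = ($0 == rule ":") }
        in_entry { print }
    ' "$config_file"
}
```

For example, show_rule_entry trim_reads_pe prints the current settings of that rule, and show_rule_entry __default__ shows the values that a rule falls back to. If nothing is printed, there is no entry for that rule yet, and the defaults apply.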