Cluster and Profiles
To fully leverage cluster environments, independent steps of the pipeline can be executed as individual jobs, running in parallel. To this end, Snakemake provides so-called Profiles to configure the specifics of the (cluster) environment. Furthermore, Snakemake now provides a SLURM Plugin if you want to run grenepipe on a cluster that uses slurm for job scheduling. Please familiarize yourself with these resources before reading on. Also, see Setup and Usage for details on the general setup, and see below for details on running grenepipe on a cluster.
A note on the behavior before grenepipe v0.13.0, using Snakemake v6.0.5.
With grenepipe v0.13.0, we switched from using Snakemake v6.0.5 to the more recent version v8.15.2. In between these versions, the whole approach to cluster environments has changed in Snakemake, and so our previous approach is not working any more. Here it is for reference, if you are still working with an older grenepipe/snakemake version (not recommended though!):
To fully leverage cluster environments, independent steps of the pipeline can be executed as individual jobs, running in parallel. See the Snakemake Cluster Execution page for details. Also, see Setup and Usage for details on the general setup, and see below for details on running grenepipe on a cluster.
As different cluster systems work differently, Snakemake provides so-called profiles to wrap the configurations necessary for each cluster environment. See the Profiles section below for details.
We here assume basic familiarity with unix-like terminals, command line interfaces, and SSH to access remote computers.
Many computer clusters come with a module system, where you can module load certain tools. In our experience, that rarely works with snakemake, as versions of tools will often mismatch and clash. We hence want to use conda throughout the pipeline, both for installing snakemake and for all its dependencies (including python, pandas, numpy, etc). Hence, you only need to make sure that conda (or better: mamba/micromamba) works on your cluster. This can either be via a module (if your cluster provides conda/mamba as a module), or simply a user-local installation of Miniconda or Micromamba.
At the moment, we recommend the latter: Simply install Micromamba locally in your user directory on the cluster, and use this to install and activate the grenepipe environment as explained in our Setup and Usage page. Then, within the pipeline, this will use mamba to install all rule-specific conda environments.
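As a quick sanity check before setting up the environment, the following sketch reports which of these package managers is available in the current session. The function name first_available is a hypothetical helper made up for this page; it is not part of grenepipe, conda, or mamba.

```shell
# Hypothetical helper: print the first of the given commands that exists
# in the current shell session, or return non-zero if none is found.
first_available() {
    for tool in "$@"; do
        if command -v "$tool" > /dev/null 2>&1; then
            echo "$tool"
            return 0
        fi
    done
    return 1
}

# Prefer micromamba, then mamba, then conda.
first_available micromamba mamba conda || echo "no conda-like tool found"
```

If nothing is found, install Micromamba in your user directory as recommended above.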
Generally, we recommend relying fully on conda/mamba for package management, instead of, for example, local installations of the tools, or module load commands for tools that are available in cluster environments. By using only conda/mamba, we make sure to use compatible versions of all tools, without conflicts. Hence, do not module load any of the bioinformatics tools in the cluster session that you are running grenepipe in.
Note that we again assume a consistent conda environment for running snakemake, as described in Setup and Usage; see there how to install the necessary grenepipe conda environment. This also applies here, i.e., activate that environment before the below steps!
When you run multiple analyses, also make sure to specify a --conda-prefix so that conda environments are not re-created every time, as described in Advanced Usage.
Rule of thumb: Clusters often have hiccups - some temporary network issues, some node having a problem, or jobs failing for no apparent reason, etc. No worries, just wait until snakemake/grenepipe stops running and until all currently running jobs in the queue are done (or kill them manually), and start the pipeline again, with the same command as before. It will continue where it left off. Only if the error persists or is apparently caused by a bug is it worth looking into in more detail (as described below).
Typically, cluster environments provide a "login node" from which you submit jobs to the "compute nodes" of the cluster. Think of a "node" as one physical computer sitting somewhere on a rack in a data center, where each node offers some compute power (number of cores/CPUs, amount of memory, etc).
In such a setup, you want to make sure that Snakemake keeps running and submitting jobs for you, even when you close your console (terminal) session on the login node. To this end, we recommend using Unix/Linux tools such as tmux or screen, which enable you to detach from your session, and return to it later.
That is, on the login node, start a screen or tmux session. In that session, start Snakemake with grenepipe as described below. Do not submit this main Snakemake command as a job itself - we do not want it to stop once that job runs out of allocated time. This main Snakemake call usually needs only a few resources, as it merely submits and checks the actual compute jobs, so it should be okay to run this on the login node; you can also check with your cluster admins to make sure. With the cluster profiles set up as described below, the actual compute jobs are then automatically submitted to the cluster compute nodes by Snakemake.
For this, you need to tell Snakemake to actually use the cluster profile, so that jobs are not accidentally run on the login node.
Example:
snakemake [other options] --profile /path/to/grenepipe/workflow/profiles/slurm
for our default slurm profile (see below for details). Add or edit the profile files as needed for your cluster environment. In particular, you need to edit the slurm_account and slurm_partition entries in that config file for your cluster.
Alternatively, Snakemake looks for profiles in ~/.config/snakemake (see the Snakemake documentation on profiles). Hence, you can also copy the contents of the slurm subdirectory to that location on your system, and then do not need to specify --profile each time when calling Snakemake.
With that, Snakemake will start submitting jobs, and keep submitting until everything is done - hopefully.
This is a summary of the above, showing how we recommend using the pipeline in a slurm cluster environment.
Preparing the environment: As before, install the proper conda environment that we want for grenepipe:
cd /path/to/grenepipe
micromamba env create -f envs/grenepipe.yaml
(or mamba, if you are using that instead)
Edit the config file: The cluster config file grenepipe/workflow/profiles/slurm/config.yaml needs to be adapted to your cluster. In particular, the following entries have to be set:
slurm_account: "your_account"
slurm_partition: "your_partition"
This is also the file that contains the runtime and memory requirements for the jobs. Should any job fail due to lack of resources, this file needs to be edited accordingly.
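If you prefer to script this edit rather than open the file in an editor, a minimal sketch could look as follows. The function set_slurm_entries is a hypothetical helper, not part of grenepipe, and it assumes that each of the two keys sits on its own line in the form shown above (possibly indented), and that the values contain no | or & characters.

```shell
# Hypothetical helper: set slurm_account and slurm_partition in a profile
# config file. Assumes one `key: "value"` entry per line, possibly indented.
set_slurm_entries() {
    config="$1"
    account="$2"
    partition="$3"
    # Keep a backup so that the original file can be restored.
    cp "$config" "$config.bak"
    # Replace everything after the key, keeping the original indentation.
    sed -E \
        -e "s|^([[:space:]]*slurm_account:).*|\1 \"$account\"|" \
        -e "s|^([[:space:]]*slurm_partition:).*|\1 \"$partition\"|" \
        "$config.bak" > "$config"
}
```

For example, set_slurm_entries /path/to/grenepipe/workflow/profiles/slurm/config.yaml your_account your_partition updates both entries and leaves a config.yaml.bak copy next to the file.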
Running the pipeline: For each analysis you want to conduct, start the pipeline within a tmux session, so that it keeps submitting jobs to the cluster, and make sure that conda environments are re-used across runs:
tmux new-session -s my-grenepipe-run
micromamba activate grenepipe
snakemake \
--conda-prefix ~/conda-envs \
--profile /path/to/grenepipe/workflow/profiles/slurm \
--directory /path/to/data
This should work for most slurm-based cluster systems out of the box, at least for smaller datasets. For larger datasets, more memory or wall time might be needed for some of the jobs, as described below. You will notice this by seeing jobs fail; see also below for hints on Troubleshooting.
After starting Snakemake in tmux as shown above, press Ctrl+b, and then d to detach from the tmux session (when using screen instead, the commands differ). Now you can close your terminal.
Next time you want to check in on your analysis, log onto the cluster, and re-attach to the tmux session:
tmux attach-session -t my-grenepipe-run
Now you should see the Snakemake output again.
Note: Large clusters tend to have multiple login nodes, to distribute load between them. Your tmux session lives only on the node where it was started. Hence, take note of that node's name; you might need to ssh login-xyz to that node if your initial login put you on a different login node.
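One way to keep track of the node is to write its name to a file before starting tmux. The following is only a sketch: the helper names and the file location are arbitrary choices for illustration, and it assumes that your home directory is shared between the login nodes.

```shell
# Hypothetical helpers: remember which login node hosts the tmux session.
# The default file location is an arbitrary choice for this sketch.
NODE_FILE="${NODE_FILE:-$HOME/.grenepipe-login-node}"

remember_login_node() {
    # uname -n prints the network node name of the current machine.
    uname -n > "$NODE_FILE"
}

recall_login_node() {
    cat "$NODE_FILE"
}
```

Call remember_login_node on the login node right before tmux new-session. Later, from any login node, ssh "$(recall_login_node)" brings you back to the node where tmux attach-session will find the session.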
Note on a current Snakemake bug: Currently (Dec 2024), there is a bug in the snakemake slurm submission, where the number of cores on the login node (i.e., the node where snakemake runs in the tmux session, and from which the slurm jobs are hence submitted) is used as an upper limit for the number of cores a job can request, in order to avoid over-allocation. On many clusters however, the login node has far fewer cores than the compute nodes, and so this bug prevents us from submitting jobs that need more cores than the login node has. See the bug report for details. The workaround is to run snakemake with --cores 1024 or some other large number. Those cores might then be used for some local rules, which however should usually not exhaust the login node, as the local rules are rather small.
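Combined with the call from the summary above, the workaround could be wrapped as in the following sketch. The function name run_grenepipe and the DRY_RUN switch are hypothetical conveniences of this example, not Snakemake or grenepipe features; the paths are the same placeholders as above.

```shell
# Sketch: the snakemake call from the summary above, plus the --cores
# workaround. With DRY_RUN=1, the command is only printed, not executed.
run_grenepipe() {
    if [ "${DRY_RUN:-0}" = "1" ]; then runner="echo"; else runner=""; fi
    # $runner is intentionally unquoted: when empty, it expands to nothing.
    $runner snakemake \
        --cores 1024 \
        --conda-prefix "$HOME/conda-envs" \
        --profile /path/to/grenepipe/workflow/profiles/slurm \
        --directory /path/to/data
}
```

Running DRY_RUN=1 run_grenepipe first lets you check the assembled command before starting the actual run inside tmux.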
Finding errors and debugging in a cluster environment can be a bit tricky. Potential sources of error are snakemake, python, conda, slurm (or your cluster submission / workload manager system), all the tools being run, the grenepipe code itself, and the computer cluster (broken nodes, internet down, etc), and each of them manifests in a different way. That is an unfortunate part of any sufficiently complicated bioinformatics setup though. Often, some digging is needed, and we have to follow traces of log files.
For example, depending on the dataset, the default cluster settings might not work well. The most common issues are
- an issue with the input data,
- a cluster issue (e.g., a node failing) or a network issue (e.g., could not load a conda environment),
- running out of (wall) time, and
- running out of memory.
In these cases, slurm (and other workload managers) will abort the job, and Snakemake will print an error message such as
[Sat May 29 12:44:21 2021]
Error in rule trim_reads_pe:
jobid: 1244
output: trimmed/S1-1.1.fastq.gz, trimmed/S1-1.2.fastq.gz, trimmed/S1-1-pe-fastp.html, trimmed/S1-1-pe-fastp.json
log: logs/fastp/S1-1.log (check log file(s) for error message)
conda-env: /...
cluster_jobid: 8814415
Error executing rule trim_reads_pe on cluster (jobid: 1244, external: 8814415, jobscript: /.../snakejob.trim_reads_pe.1244.sh).
For error details see the cluster log and the log files of the involved rule(s).
Trying to restart job 1244.
If you cannot access this on-screen output, for example because you already closed your tmux session, you can also find the full Snakemake log output of each time that Snakemake was started in the directory
logs/snakemake/
in the analysis directory (where the config.yaml is), named and sorted by the time when you started the run. Note: Snakemake itself also writes these logs to a hidden directory .snakemake/log/, but our alternative also logs some grenepipe internals, which might be helpful for debugging. Hence, we recommend inspecting our improved log in logs/snakemake/.
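To jump straight to the log of the most recent run, a small helper along these lines can be used. This is a sketch: latest_snakemake_log is a hypothetical name, and it picks the file by modification time rather than by parsing the file names.

```shell
# Hypothetical helper: print the path of the most recently modified log file
# in logs/snakemake/ of an analysis directory.
latest_snakemake_log() {
    # $1: analysis directory (defaults to the current directory).
    log_dir="${1:-.}/logs/snakemake"
    # ls -t sorts by modification time, newest first.
    newest="$(ls -t "$log_dir" | head -n 1)"
    [ -n "$newest" ] && printf '%s/%s\n' "$log_dir" "$newest"
}
```

For example, less "$(latest_snakemake_log /path/to/data)" opens the log of the last run in that analysis directory.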
This error message can be used to investigate the error and look at further log files with more detail:
- In addition to the general log file above, when using slurm on a cluster, Snakemake will also start instances of itself for each slurm job that it submits, that is, for each rule that is executed on a compute node on the cluster. We log those files as well, in
  logs/snakemake-jobs/
  If there was an error, it might show up there. Try
  grep -rni "error" logs/snakemake-jobs/
  from your analysis directory to find any log files there that contain the word "error". This might help, but might also not show up anything interesting.
- As the most common clue, the log file of the tool being run should be checked, which in the above case is logs/fastp/S1-1.log. If there was an issue with the input data or the tool being run, this usually shows up here. In that case, you will have to fix the issue with the data or the tool, and run the pipeline again.
- If the problem is however related to cluster settings such as too little memory or time given to a job, this log file might be empty. In that case, check the slurm log files for this job, which are located in a hidden directory in your analysis directory under
  .snakemake/slurm_logs/
  (if you are using --executor slurm, see above).
  Note: Since grenepipe v0.13.0, we are using a more recent version of Snakemake (v8.15.2), for which the above applies. Before that, Snakemake interacted with slurm in a different way, for which we previously provided custom scripts to make our lives easier. If you are using grenepipe < v0.13.0 with our custom slurm profile, you will instead find the slurm logs in
  slurm-logs
  in the analysis directory.
  The above Snakemake error message contains the external slurm job ID (cluster_jobid: 8814415, or external: 8814415); use for example
  find .snakemake/slurm_logs/ -name "*8814415*"
  to find the corresponding slurm log files.
- For example, you will find
  .snakemake/slurm_logs/rule_trim_reads_pe/S1_1/8814415.log
  That file, usually at its end, might contain the hint that we are looking for:
  slurmstepd: error: Detected 1 oom-kill event(s) in step 8814415.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
  So here, we ran out of allocated memory (oom = out of memory). When in doubt, just google the error message ;-)
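The last two steps, finding the slurm log for a job ID and scanning it for the usual hints, can be combined in a small helper. The following is a sketch under assumptions: diagnose_slurm_job is a hypothetical name, the oom-kill pattern is taken from the example message above, and the "DUE TO TIME LIMIT" pattern is what slurm typically writes when a job exceeds its wall time, which may differ on your cluster.

```shell
# Hypothetical helper: locate the slurm log(s) of a job and report the most
# common resource problems.
diagnose_slurm_job() {
    # $1: external slurm job ID; $2: analysis directory (default: current).
    job_id="$1"
    log_root="${2:-.}/.snakemake/slurm_logs"
    find "$log_root" -type f -name "*${job_id}*" | while IFS= read -r log_file; do
        if grep -q "oom-kill" "$log_file"; then
            echo "$log_file: out of memory"
        elif grep -q "DUE TO TIME LIMIT" "$log_file"; then
            # Assumed wording of the slurm time-out message.
            echo "$log_file: time limit reached"
        else
            echo "$log_file: no known pattern; last lines:"
            tail -n 5 "$log_file"
        fi
    done
}
```

For the example above, diagnose_slurm_job 8814415 run from the analysis directory would report the log file as "out of memory".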
Using this information, you can then edit your slurm config file (e.g., workflow/profiles/slurm/cluster_config.yaml, or the cluster_config.yaml file in the --profile directory that you used). In the above example, the task trim_reads_pe ran out of memory, so we want to edit
trim_reads_pe:
  mem_mb: 5000
to
trim_reads_pe:
  mem_mb: 10000
to give it more memory. If time was the issue, change the line runtime: x in the respective entry (or add that line if not present). If there is no entry for the particular job step, you can add one in the same manner as the other job step entries in that file, and simply add a line for mem_mb or runtime as needed. All other properties are inherited from the default-resources entry in the file. See the Snakemake Profiles documentation, and if you are using slurm, see the Snakemake slurm plugin documentation for details.
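After editing, it can help to print the entry of a rule to confirm what will be used. The following sketch assumes that the file is indented with spaces and that the properties of a rule are the more deeply indented lines directly below its name; show_rule_resources is a hypothetical helper, not part of grenepipe.

```shell
# Hypothetical helper: print a rule's entry from a YAML config file, i.e., the
# line "<rule>:" plus all following lines that are indented more deeply.
show_rule_resources() {
    # $1: rule name; $2: config file
    awk -v rule="$1" '
        # Number of leading spaces of a line (-1 for an empty line).
        function indent(s) { match(s, /[^ ]/); return RSTART - 1 }
        # Leave the entry at the first line that is not indented more deeply.
        inside && (NF == 0 || indent($0) <= depth) { inside = 0 }
        $1 == rule ":" { inside = 1; depth = indent($0); print; next }
        inside { print }
    ' "$2"
}
```

For example, show_rule_resources trim_reads_pe followed by the path to your config file prints the mem_mb and runtime lines currently set for that rule, or nothing if the rule has no entry yet.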