Add extra troubleshooting for failed jobs and retry failures up to 5 times

To handle potential failures arising from bad luck in a distributed computing environment,
this PR tries to inform users which types of errors might be retriable, and attempts to retry
all errors on the user's behalf up to 5 times. The hope is that this will obscure most of
these transient errors from the user's view.
jhiemstrawisc committed Oct 14, 2024
1 parent b4d9d27 commit 3f3fdc8
Showing 2 changed files with 12 additions and 1 deletion.
9 changes: 8 additions & 1 deletion docker-wrappers/SPRAS/README.md
@@ -122,7 +122,7 @@ The second option is to let HTCondor manage the Snakemake process, which allows
./snakemake_long.py --profile spras_profile --htcondor-jobdir <path/to/logging/directory>
```

When run in this mode, all log files for the workflow will be placed into the path you provided for the logging directory. In particular, Snakemake's outputs with job progress can be found split between `<logdir>/snakemake-long.err` and `<logdir>/snakemake-long.out`.
When run in this mode, all log files for the workflow will be placed into the path you provided for the logging directory. In particular, Snakemake's outputs with job progress can be found split between `<logdir>/snakemake-long.err` and `<logdir>/snakemake-long.out`. These will also log each rule and what HTCondor job ID was submitted for that rule (see the [troubleshooting section](#troubleshooting) for information on how to use these extra log files).

### Adjusting Resources

@@ -152,6 +152,13 @@ contain useful debugging clues about what may have gone wrong.
the version of SPRAS you want to test, and push the image to your image repository. To use that container in the workflow, change the `container_image` line of
`spras.sub` to point to the new image.

### Troubleshooting
Some errors Snakemake encounters while executing workflow rules come down to bad luck in a distributed, heterogeneous computational environment, and some of them can be resolved simply by rerunning. If you encounter a Snakemake error, try restarting the workflow to see whether the same error occurs in the same rule a second time -- repeatable, identical failures are more likely to indicate a fundamental issue that requires user intervention to fix.

To investigate issues, start by referring to your logging directory. Each Snakemake rule submitted to HTCondor will log a corresponding HTCondor job ID in the Snakemake standard out/error. You can use this job ID to check the standard out, standard error, and HTCondor job log for that specific rule. In some cases the error will indicate a user-solvable issue, e.g. "input file not found" might point to a typo in some part of your workflow. In other cases, errors might be solved by retrying the workflow, which causes Snakemake to pick up where it left off.

If the same error persists after multiple consecutive retries and prevents your workflow from completing, user or developer intervention is likely required. If you choose to open a GitHub issue, please include a description of the error(s) and the troubleshooting steps you've already taken.
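
As a starting point for the log inspection described above, the following is a minimal sketch that searches the two Snakemake log files for a string such as a rule name or an HTCondor job ID. The helper name `find_log_lines` is hypothetical and not part of SPRAS; the sketch makes no assumption about the log line format and only does a plain substring match.

```python
from pathlib import Path

# Log file names as given in this README; adjust them if your setup differs.
LOG_NAMES = ("snakemake-long.err", "snakemake-long.out")


def find_log_lines(logdir, needle):
    """Return (file name, line number, line) for every log line containing `needle`."""
    matches = []
    for name in LOG_NAMES:
        path = Path(logdir) / name
        if not path.is_file():
            continue  # A log may be missing if the workflow has not started yet.
        with path.open(errors="replace") as handle:
            for lineno, line in enumerate(handle, start=1):
                if needle in line:
                    matches.append((name, lineno, line.rstrip("\n")))
    return matches
```

Calling `find_log_lines("<path/to/logging/directory>", "<rule name or job ID>")` returns the matching lines, which can then be used to locate the per-job standard out, standard error, and HTCondor job log for that rule.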

## Versions:

The versions of this image match the version of the spras package within it.
4 changes: 4 additions & 0 deletions docker-wrappers/SPRAS/spras_profile/config.yaml
@@ -8,6 +8,10 @@ configfile: example_config.yaml
# Indicate to the plugin that jobs running on various EPs do not share a filesystem with
# each other, or with the AP.
shared-fs-usage: none
# Distributed, heterogeneous computational environments are a wild place where strange things
# can happen. If something goes wrong, try again up to 5 times. After that, we assume there's
# a real error that requires user/admin intervention.
retries: 5

# Default resources will apply to all workflow steps. If a single workflow step fails due
# to insufficient resources, it can be re-run with modified values. Snakemake will handle
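
To illustrate what the `retries: 5` setting in this hunk is meant to achieve, here is a conceptual sketch of retry-then-give-up logic in Python. It is not Snakemake's implementation; the function name `run_with_retries` is hypothetical, and the sketch assumes that `retries` counts re-executions after the first attempt, so a job runs at most six times in total.

```python
def run_with_retries(task, retries=5):
    """Call `task` once, then up to `retries` more times if it raises."""
    last_error = None
    for _attempt in range(retries + 1):
        try:
            return task()
        except Exception as error:  # Treat every failure as possibly transient.
            last_error = error
    # All attempts failed: surface the last error for user/admin intervention.
    raise RuntimeError(f"task failed after {retries + 1} attempts") from last_error
```

A transient failure is absorbed silently as long as one attempt succeeds, while a persistent failure is reported only after the retry budget is exhausted.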