Enable long Snakemake workflows in HTCondor #183
Conversation
To avoid tying Snakemake execution to the current terminal (which is bad if you ever want to log out of the AP or let your computer fall asleep), the new script `snakemake_long.py` wraps Snakemake execution in an HTCondor local universe job. This job is submitted like a regular HTCondor job, but it runs on the AP in the context of the submit directory.

Usage:
`snakemake_long.py --snakefile (OPTIONAL) </path/to/snakefile> --profile (REQUIRED) </path/to/profile> --htcondor-jobdir (OPTIONAL) </path/to/logs>`

The only CLI option that might feel new here is `htcondor-jobdir`. It is the same option used by the HTCondor executor, and it specifies the directory in which logs are placed. I chose to keep the names the same in this script so it feels more familiar to Snakemake users.
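For readers unfamiliar with local universe jobs, a minimal sketch of such a submission via the HTCondor Python bindings might look like the following; the attribute values and log file names here are illustrative, not copied from `snakemake_long.py`.

```python
import os

import htcondor  # HTCondor Python bindings


def submit_local_universe(executable: str, arguments: str):
    """Submit `executable` as a local universe job that runs on the AP."""
    submit_description = htcondor.Submit({
        # Local universe jobs execute on the AP itself rather than on an EP,
        # so the wrapped Snakemake process survives the user logging out.
        "universe": "local",
        "executable": executable,
        "arguments": arguments,
        # Run in the directory the user submitted from, as described above.
        "initialdir": os.getcwd(),
        "output": "snakemake_long.out",  # illustrative log file names
        "error": "snakemake_long.err",
        "log": "snakemake_long.log",
    })
    return htcondor.Schedd().submit(submit_description)
```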
My initial testing started off well, but then all 6 of my jobs were put on hold. `condor_q -l` showed:

HoldReason = "Transfer input files failure at access point ap2001 while sending files to execution point [email protected]. Details: reading from file /var/lib/condor/execute/slot1/dir_20697/_condor_stdout: (errno 2) No such file or directory"
HoldReasonCode = 13
HoldReasonSubCode = 2
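(As an aside, the same hold diagnostics can be pulled programmatically with the HTCondor Python bindings; a quick sketch, not part of this PR:)

```python
import htcondor

# JobStatus == 5 means the job is held; print the hold diagnostics for each.
schedd = htcondor.Schedd()
held = schedd.query(
    constraint="JobStatus == 5",
    projection=["ClusterId", "ProcId", "HoldReason", "HoldReasonCode", "HoldReasonSubCode"],
)
for ad in held:
    print(f"{ad['ClusterId']}.{ad['ProcId']}: {ad['HoldReason']} "
          f"(code {ad['HoldReasonCode']}, subcode {ad['HoldReasonSubCode']})")
```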
submit_description = htcondor.Submit({
    "executable": script_location,
    "arguments": f"long --snakefile {snakefile} --profile {profile} --htcondor-jobdir {htcondor_jobdir}",
Does `long` here imply there is a max runtime on the local universe job?
Nope, it's my way of using the same executable to serve two purposes -- `snakemake_long.py` is first run by the user with their args to submit the local universe job. Then, the local universe job runs `snakemake_long.py long <user args>` to indicate to itself that it's time to start the long-running Snakemake process instead of submitting another local universe job.
I'll add a comment for now to better explain the behavior, and if you don't like the name I can change it altogether.
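A minimal sketch of that self-dispatch pattern (only the `long` argument and the local universe submission come from this PR; the helper logic and argument handling are illustrative):

```python
import subprocess
import sys

import htcondor  # HTCondor Python bindings


def main(argv):
    if argv and argv[0] == "long":
        # Second invocation: we are already inside the local universe job,
        # so run the long-lived Snakemake process directly.
        return subprocess.call(["snakemake", *argv[1:]])

    # First invocation: re-submit this same script as a local universe job,
    # prepending "long" so that the next run takes the branch above.
    submit_description = htcondor.Submit({
        "universe": "local",
        "executable": sys.executable,
        "arguments": " ".join([sys.argv[0], "long", *argv]),
    })
    htcondor.Schedd().submit(submit_description)
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
```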
    # modules we've installed in the submission environment (notably spras).
    "getenv": "true",

    "JobBatchName": f"spras-long-{time.strftime('%Y%m%d-%H%M%S')}",
Same comment here, calling this `long` may not be intuitive. Should it be `local`? `manage` to correspond with the print statement below?
Since this is all spras-specific (at this point), how about just calling it `spras-<timestamp>`?
print(f"Error: The Snakefile {args.snakefile} does not exist.") | ||
return 1 |
This approach of returning error codes is a bit atypical to me. Is this better than raising errors?
This is mostly to play better with HTCondor, which expects job executables to return 0 for success, and something else for failure. I guess raising an error also returns non-zero, but this felt more explicit to me. I'll let you make the call.
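For illustration, the exit-code pattern being described boils down to something like this sketch (the argument handling here is simplified and not taken from the PR):

```python
import argparse
import os
import sys


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--snakefile", default="Snakefile")
    args = parser.parse_args()

    # HTCondor decides success/failure from the process exit code, so the
    # script returns status codes instead of letting exceptions propagate.
    if not os.path.isfile(args.snakefile):
        print(f"Error: The Snakefile {args.snakefile} does not exist.")
        return 1

    # ... submit the local universe job or run Snakemake here ...
    return 0


if __name__ == "__main__":
    # The return value becomes the exit code HTCondor records for the job.
    sys.exit(main())
```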
Hmm, I'm not totally sure what to make of that error at first glance. It seems to imply that there was a failure sending files from the AP to the EP, but the file it's looking to send seems to come from the EP... Can you share the content of the generated log files?
I tried running this again and encountered the same error:
497726.0 is an example of a held job. I left it in the queue for now.
We discovered some of my errors were because my .sif file was named incorrectly. I returned to that submission later, and condor_q shows no more running jobs. However, the output dir does not show all expected files. In
Adding my thoughts from our Slack conversation for git posterity -- I believe the errors in your run were related to the luck of the draw in a distributed, heterogeneous computing environment. When I investigated, everything succeeded with a retry that allowed Snakemake to pick up where it left off. I was even able to generate a few new errors related to a pokey HTCondor SchedD, but again a retry picked things back up and carried the workflow to completion. It's less clear to me how to communicate this to users in the docs -- some errors are definitely not retryable, but enumerating all the cases for "rerun if you encounter error XXX but don't rerun if you encounter error YYY" isn't feasible due to the unknowably-many error modes. Maybe we add a snippet to the README here saying something like "Some errors Snakemake might encounter while executing rules in the workflow boil down to bad luck in a distributed, heterogeneous computational environment, and it's expected that some errors can be solved simply by rerunning. If you encounter a Snakemake error, try restarting the workflow to see if the same error is generated in the same rule a second time -- repeatable, identical failures are more likely to indicate a more fundamental issue that might require user intervention to fix."
I tested again, and this time it ran to completion, confirming what you said above. Adding a message similar to
does seem valuable because I initially thought something was more fundamentally broken. I saw in the snakemake long err log
Does that mean something we should pay attention to?
…times

To handle potential failures arising from bad luck in a distributed computing environment, this PR tries to inform users which types of errors might be retriable, and attempts to retry all errors on the user's behalf up to 5 times. The hope is that this will obscure most of these transient errors from the user's view.
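One way such a retry loop could look (purely illustrative; the PR may implement the retries differently, e.g. via Snakemake's own retry options):

```python
import subprocess
import sys

MAX_ATTEMPTS = 5  # mirrors the "up to 5 times" described above


def run_with_retries(snakemake_args):
    # Rerun Snakemake on failure; already-completed rules are skipped on each
    # retry, so many transient, distributed-computing errors are absorbed.
    returncode = 1
    for attempt in range(1, MAX_ATTEMPTS + 1):
        returncode = subprocess.call(["snakemake", *snakemake_args])
        if returncode == 0:
            return 0
        print(f"Snakemake attempt {attempt}/{MAX_ATTEMPTS} failed; retrying...")
    return returncode


if __name__ == "__main__":
    sys.exit(run_with_retries(sys.argv[1:]))
```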