Enable long Snakemake workflows in HTCondor #183
Conversation
To avoid tying Snakemake execution to the current terminal (which is bad if you ever want to log out of the AP or let your computer fall asleep), the new script `snakemake_long.py` wraps Snakemake execution in an HTCondor local universe job. This job is submitted like a regular HTCondor job, but it runs on the AP in the context of the submit directory.

Usage:
`snakemake_long.py --snakefile (OPTIONAL) </path/to/snakefile> --profile (REQUIRED) </path/to/profile> --htcondor-jobdir (OPTIONAL) </path/to/logs>`

The only CLI option that might feel new here is `htcondor-jobdir`. It is the same option used by the HTCondor executor, and it specifies the directory in which logs are placed. I chose to keep the names the same in this script so it feels more familiar to Snakemake users.
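For readers unfamiliar with local universe jobs, a minimal sketch of such a submission via the HTCondor Python bindings might look like the following; the attribute values and log file names here are illustrative, not copied from `snakemake_long.py`.

```python
import os

import htcondor  # HTCondor Python bindings


def submit_local_universe(executable: str, arguments: str):
    """Submit `executable` as a local universe job that runs on the AP."""
    submit_description = htcondor.Submit({
        # Local universe jobs execute on the AP itself rather than on an EP,
        # so the wrapped Snakemake process survives the user logging out.
        "universe": "local",
        "executable": executable,
        "arguments": arguments,
        # Run in the directory the user submitted from, as described above.
        "initialdir": os.getcwd(),
        "output": "snakemake_long.out",  # illustrative log file names
        "error": "snakemake_long.err",
        "log": "snakemake_long.log",
    })
    return htcondor.Schedd().submit(submit_description)
```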
My initial testing started off well, but then all 6 of my jobs were put on hold. `condor_q -l` showed:

HoldReason = "Transfer input files failure at access point ap2001 while sending files to execution point [email protected]. Details: reading from file /var/lib/condor/execute/slot1/dir_20697/_condor_stdout: (errno 2) No such file or directory"
HoldReasonCode = 13
HoldReasonSubCode = 2
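(As an aside, the same hold diagnostics can be pulled programmatically with the HTCondor Python bindings; a quick sketch, not part of this PR:)

```python
import htcondor

# JobStatus == 5 means the job is held; print the hold diagnostics for each.
schedd = htcondor.Schedd()
held = schedd.query(
    constraint="JobStatus == 5",
    projection=["ClusterId", "ProcId", "HoldReason", "HoldReasonCode", "HoldReasonSubCode"],
)
for ad in held:
    print(f"{ad['ClusterId']}.{ad['ProcId']}: {ad['HoldReason']} "
          f"(code {ad['HoldReasonCode']}, subcode {ad['HoldReasonSubCode']})")
```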
submit_description = htcondor.Submit({
    "executable": script_location,
    "arguments": f"long --snakefile {snakefile} --profile {profile} --htcondor-jobdir {htcondor_jobdir}",
Does `long` here imply there is a max runtime on the local universe job?
Nope, it's my way of using the same executable to serve two purposes -- `snakemake_long.py` is first run by the user with their args to submit the local universe job. Then, the local universe job runs `snakemake_long.py long <user args>` to indicate to itself that it's time to start the long-running Snakemake process instead of submitting another local universe job.
I'll add a comment for now to better explain the behavior, and if you don't like the name I can change it altogether.
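A minimal sketch of that self-dispatch pattern (only the `long` argument and the local universe submission come from this PR; the helper logic and argument handling are illustrative):

```python
import subprocess
import sys

import htcondor  # HTCondor Python bindings


def main(argv):
    if argv and argv[0] == "long":
        # Second invocation: we are already inside the local universe job,
        # so run the long-lived Snakemake process directly.
        return subprocess.call(["snakemake", *argv[1:]])

    # First invocation: re-submit this same script as a local universe job,
    # prepending "long" so that the next run takes the branch above.
    submit_description = htcondor.Submit({
        "universe": "local",
        "executable": sys.executable,
        "arguments": " ".join([sys.argv[0], "long", *argv]),
    })
    htcondor.Schedd().submit(submit_description)
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
```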
    # modules we've installed in the submission environment (notably spras).
    "getenv": "true",

    "JobBatchName": f"spras-long-{time.strftime('%Y%m%d-%H%M%S')}",
Same comment here, calling this `long` may not be intuitive. Should it be `local`? `manage` to correspond with the print statement below?
Since this is all spras-specific (at this point), how about just calling it `spras-<timestamp>`?
print(f"Error: The Snakefile {args.snakefile} does not exist.") | ||
return 1 |
This approach of returning error codes is a bit atypical to me. Is this better than raising errors?
This is mostly to play better with HTCondor, which expects job executables to return 0 for success, and something else for failure. I guess raising an error also returns non-zero, but this felt more explicit to me. I'll let you make the call.
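For illustration, the exit-code pattern being described boils down to something like this sketch (the argument handling here is simplified and not taken from the PR):

```python
import argparse
import os
import sys


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--snakefile", default="Snakefile")
    args = parser.parse_args()

    # HTCondor decides success/failure from the process exit code, so the
    # script returns status codes instead of letting exceptions propagate.
    if not os.path.isfile(args.snakefile):
        print(f"Error: The Snakefile {args.snakefile} does not exist.")
        return 1

    # ... submit the local universe job or run Snakemake here ...
    return 0


if __name__ == "__main__":
    # The return value becomes the exit code HTCondor records for the job.
    sys.exit(main())
```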
Hmm, I'm not totally sure what to make of that error at first glance. It seems to imply that there was a failure sending files from the AP to the EP, but the file it's looking to send seems to come from the EP... Can you share the content of the generated log files?
I tried running this again and encountered the same error:
497726.0 is an example of a held job. I left it in the queue for now.
We discovered some of my errors were because my .sif file was named incorrectly. I returned to that submission later, and condor_q shows no more running jobs. However, the output dir does not show all expected files. In
Adding my thoughts from our Slack conversation for git posterity -- I believe the errors in your run were related to the luck of the draw in a distributed, heterogeneous computing environment. When I investigated, everything succeeded with a retry that allowed Snakemake to pick up where it left off. I was even able to generate a few new errors related to a pokey HTCondor SchedD, but again a retry picked things back up and carried the workflow to completion. It's less clear to me how to communicate this to users in the docs -- some errors are definitely not retryable, but enumerating all the cases for "rerun if you encounter error XXX but don't rerun if you encounter error YYY" isn't feasible due to the unknowably-many error modes. Maybe we add a snippet to the README here saying something like "Some errors Snakemake might encounter while executing rules in the workflow boil down to bad luck in a distributed, heterogeneous computational environment, and it's expected that some errors can be solved simply by rerunning. If you encounter a Snakemake error, try restarting the workflow to see if the same error is generated in the same rule a second time -- repeatable, identical failures are more likely to indicate a more fundamental issue that might require user intervention to fix."
I tested again, and this time it ran to completion, confirming what you said above. Adding a message similar to
does seem valuable because I initially thought something was more fundamentally broken. I saw in the snakemake long err log
Does that mean something we should pay attention to?
…times

To handle potential failures arising from bad luck in a distributed computing environment, this PR tries to inform users which types of errors might be retriable, and attempts to retry all errors on the user's behalf up to 5 times. The hope is that this will obscure most of these transient errors from the user's view.
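One way such a retry loop could look (purely illustrative; the PR may implement the retries differently, e.g. via Snakemake's own retry options):

```python
import subprocess
import sys

MAX_ATTEMPTS = 5  # mirrors the "up to 5 times" described above


def run_with_retries(snakemake_args):
    # Rerun Snakemake on failure; already-completed rules are skipped on each
    # retry, so many transient, distributed-computing errors are absorbed.
    returncode = 1
    for attempt in range(1, MAX_ATTEMPTS + 1):
        returncode = subprocess.call(["snakemake", *snakemake_args])
        if returncode == 0:
            return 0
        print(f"Snakemake attempt {attempt}/{MAX_ATTEMPTS} failed; retrying...")
    return returncode


if __name__ == "__main__":
    sys.exit(run_with_retries(sys.argv[1:]))
```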