Enable long Snakemake workflows in HTCondor #183

Merged · 4 commits · Oct 15, 2024

Conversation

@jhiemstrawisc (Collaborator)

To avoid tying Snakemake execution to the current terminal (which is bad if you ever want to log out of the AP or let your computer fall asleep), the new script `snakemake_long.py` wraps Snakemake execution in an HTCondor local universe job. This job is submitted like a regular HTCondor job, but it runs on the AP in the context of the submit directory.

Usage:
`snakemake_long.py --profile </path/to/profile> [--snakefile </path/to/snakefile>] [--htcondor-jobdir </path/to/logs>]`

The only CLI option that might feel new here is `htcondor-jobdir`. It is the same option name used by the HTCondor executor, and it specifies the directory in which logs are placed. I kept the name the same in this script so it feels more familiar to Snakemake users.
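For readers less familiar with local universe jobs, the sketch below shows roughly how a script can submit itself as a local universe job through the HTCondor Python bindings. This is illustrative only -- the actual submit description in `snakemake_long.py` differs, and the paths and file names here are placeholders.

```python
import htcondor  # HTCondor Python bindings, available on the AP

# Minimal, illustrative local universe submit description.
# "universe": "local" keeps the job on the access point rather than
# matching it to an execution point, so it keeps running after logout.
submit_description = htcondor.Submit({
    "universe": "local",
    "executable": "snakemake_long.py",          # placeholder path
    "arguments": "long --profile /path/to/profile",
    "output": "snakemake-long.out",
    "error": "snakemake-long.err",
    "log": "snakemake-long.log",
})

schedd = htcondor.Schedd()
result = schedd.submit(submit_description)
print(f"Submitted local universe job with cluster ID {result.cluster()}")
```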

@jhiemstrawisc (Collaborator Author)

@agitter, this branch is based on the other CHTC executor work in #172. That should be merged first (and doing so will dramatically reduce the noise in this PR by hiding most of the commits that came along with it from the other branch).

@jhiemstrawisc changed the title from "Snakemake long" to "Enable long Snakemake workflows in HTCondor" on Sep 4, 2024
@agitter (Collaborator) left a comment:
My initial testing started off well, but then all 6 of my jobs were put on hold. `condor_q -l` showed:

HoldReason = "Transfer input files failure at access point ap2001 while sending files to execution point [email protected]. Details: reading from file /var/lib/condor/execute/slot1/dir_20697/_condor_stdout: (errno 2) No such file or directory"
HoldReasonCode = 13
HoldReasonSubCode = 2


```python
submit_description = htcondor.Submit({
    "executable": script_location,
    "arguments": f"long --snakefile {snakefile} --profile {profile} --htcondor-jobdir {htcondor_jobdir}",
```
Collaborator:
Does long here imply there is a max runtime on the local universe job?

Collaborator Author:

Nope, it's my way of using the same executable to serve two purposes -- snakemake_long.py is first run by the user with their args to submit the local universe job. Then, the local universe job runs snakemake_long.py long <user args> to indicate to itself that it's time to submit the long-running Snakemake process instead of submitting another local universe job.

I'll add a comment for now to better explain the behavior, and if you don't like the name I can change it altogether.
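For readers following along, here is a rough sketch of that two-mode dispatch (the helper functions are hypothetical and only stand in for what `snakemake_long.py` actually does):

```python
import subprocess
import sys

def submit_local_universe_job(user_args):
    # Hypothetical helper: build a local universe submit description with
    # "long" prepended to the user's arguments and hand it to the Schedd.
    # Submission details are omitted for brevity.
    print(f"Would submit local universe job with arguments: long {' '.join(user_args)}")
    return 0

def run_snakemake(user_args):
    # Hypothetical helper: launch the long-running Snakemake process on the AP.
    return subprocess.run(["snakemake", *user_args]).returncode

def main(argv):
    if argv and argv[0] == "long":
        # Second invocation: we are already inside the local universe job,
        # so start Snakemake directly instead of submitting another job.
        return run_snakemake(argv[1:])
    # First invocation by the user: wrap ourselves in a local universe job.
    return submit_local_universe_job(argv)

if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
```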

```python
    # modules we've installed in the submission environment (notably spras).
    "getenv": "true",

    "JobBatchName": f"spras-long-{time.strftime('%Y%m%d-%H%M%S')}",
```
Collaborator:

Same comment here: calling this `long` may not be intuitive. Should it be `local`? Or `manage`, to correspond with the print statement below?

Collaborator Author:

Since this is all spras-specific (at this point), how about just calling it `spras-<timestamp>`?

Comment on lines 81 to 82
print(f"Error: The Snakefile {args.snakefile} does not exist.")
return 1
Collaborator:

This approach of returning error codes is a bit atypical to me. Is this better than raising errors?

Collaborator Author:

This is mostly to play better with HTCondor, which expects job executables to return 0 for success and something else for failure. An uncaught exception also produces a non-zero exit status, but the explicit return code felt clearer to me. I'll let you make the call.
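For context, the pattern being discussed looks roughly like this (a sketch, not the literal code in the script); the integer returned from `main` becomes the job's exit status, which HTCondor records as the ExitCode in the job ad:

```python
import os
import sys

def main():
    snakefile = "Snakefile"  # illustrative path
    if not os.path.exists(snakefile):
        # A non-zero return marks the HTCondor job as failed without
        # dumping a Python traceback into the error log.
        print(f"Error: The Snakefile {snakefile} does not exist.")
        return 1
    return 0

if __name__ == "__main__":
    # An uncaught exception would also produce a non-zero exit status,
    # but the explicit return code makes the contract with HTCondor obvious.
    sys.exit(main())
```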

@jhiemstrawisc (Collaborator Author)

> My initial testing started off well, but then all 6 of my jobs were put on hold. `condor_q -l` showed:
>
> HoldReason = "Transfer input files failure at access point ap2001 while sending files to execution point [email protected]. Details: reading from file /var/lib/condor/execute/slot1/dir_20697/_condor_stdout: (errno 2) No such file or directory"
> HoldReasonCode = 13
> HoldReasonSubCode = 2

Hmm, I'm not totally sure what to make of that error at first glance. It seems to imply that there was a failure sending files from the AP to the EP, but the file it's looking to send seems to come from the EP...

Can you share the content of the generated log files snakemake-long.out/err?

@agitter (Collaborator) commented Sep 22, 2024

I tried running this again and encountered the same error:

HoldReason = "Transfer input files failure at access point ap2001 while sending files to execution point [email protected]. Details: reading from file /var/lib/condor/execute/slot2/dir_530230/_condor_stdout: (errno 2) No such file or directory"

497726.0 is an example of a held job. I left it in the queue for now.

@agitter (Collaborator) commented Sep 24, 2024

We discovered that some of my errors were because my .sif file was named incorrectly. I returned to that submission later, and `condor_q` shows no more running jobs. However, the output directory does not contain all of the expected files.

In `snakemake-long.err` I see:

Error in rule reconstruct:
    message: Job 18 with HTCondor Cluster ID 498174 has  status Completed, but failed with ExitCode 1.For further error details see the cluster/cloud log and the log files of the involved rule(s).
    jobid: 18
    input: output/prepared/data0-meo-inputs/sources.txt, output/prepared/data0-meo-inputs/targets.txt, output/prepared/data0-meo-inputs/edges.txt
    output: output/data0-meo-params-GKEDDFZ/raw-pathway.txt
    external_jobid: 498174

[Mon Sep 23 14:58:24 2024]
Error in rule reconstruct:
    message: Job 3 with HTCondor Cluster ID 498179 has  status Completed, but failed with ExitCode 1.For further error details see the cluster/cloud log and the log files of the involved rule(s).
    jobid: 3
    input: output/prepared/data0-omicsintegrator1-inputs/prizes.txt, output/prepared/data0-omicsintegrator1-inputs/edges.txt
    output: output/data0-omicsintegrator1-params-PU62FNV/raw-pathway.txt
    external_jobid: 498179

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-09-23T141203.356730.snakemake.log
WorkflowError:
At least one job did not complete successfully.

@jhiemstrawisc (Collaborator Author)

Adding my thoughts from our Slack conversation for git posterity -- I believe the errors in your run were related to the luck of the draw in a distributed, heterogeneous computing environment. When I investigated, everything succeeded with a retry that allowed Snakemake to pick up where it left off. I was even able to generate a few new errors related to a pokey HTCondor SchedD, but again a retry picked things back up and carried the workflow to completion.

It's less clear to me how to communicate this to users in the docs -- some errors are definitely not retryable, but enumerating all the cases for "rerun if you encounter error XXX but don't rerun if you encounter error YYY" isn't feasible due to the unknowably-many error modes. Maybe we add a snippet to the README here saying something like "Some errors Snakemake might encounter while executing rules in the workflow boil down to bad luck in a distributed, heterogeneous computational environment, and it's expected that some errors can be solved simply by rerunning. If you encounter a Snakemake error, try restarting the workflow to see if the same error is generated in the same rule a second time -- repeatable, identical failures are more likely to indicate a more fundamental issue that might require user intervention to fix."

@agitter (Collaborator) commented Oct 12, 2024

I tested again, and this time it ran to completion, confirming what you said above. Adding a message similar to

> Some errors Snakemake might encounter while executing rules in the workflow boil down to bad luck in a distributed, heterogeneous computational environment, and it's expected that some errors can be solved simply by rerunning. If you encounter a Snakemake error, try restarting the workflow to see if the same error is generated in the same rule a second time -- repeatable, identical failures are more likely to indicate a more fundamental issue that might require user intervention to fix.

does seem valuable because I initially thought something was more fundamentally broken.

I saw in the `snakemake-long.err` log:

The job argument contains a single quote. Removing it to avoid issues with HTCondor.

Does that mean something we should pay attention to?

…times

To handle potential failures arising from bad luck in a distributed computing environment, this PR tries to inform users which types of errors might be retriable, and it attempts to retry all errors on the user's behalf up to 5 times. The hope is that this will hide most of these transient errors from the user's view.
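A minimal sketch of what that retry behavior could look like, assuming it is implemented by passing Snakemake's `--retries` flag when the wrapper launches the workflow (the exact mechanism in this PR may differ, and the profile path is a placeholder):

```python
import subprocess

# Illustrative only: ask Snakemake to re-run failing jobs up to 5 times so
# that transient infrastructure errors are retried without user intervention.
cmd = [
    "snakemake",
    "--profile", "/path/to/profile",
    "--retries", "5",
]
subprocess.run(cmd, check=True)
```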
@agitter merged commit a26f4d0 into Reed-CompBio:master on Oct 15, 2024
5 checks passed