-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow certain exit codes at user's discretion #67
Comments
Interesting. I could see the use case although I personally haven't run into this myself because I am almost always running quantum chemistry codes with Python wrappers that automatically capture the exit code and then decide if an error should be raised (so, one abstraction layer higher if that makes sense). In the end, if the user is launching the executable with a Python subprocess, this never really has to be an issue. But if they are calling the executable directly in the Slurm script I could see it coming up. |
Where I ran into this is using ASE with quantum espresso (although I have seen it in other codes also). Typically, in ASE, one would have an environment variable set: os.environ["ASE_ESPRESSO_COMMAND"] = "srun pw.x -npool %d -ndiag 1 -input PREFIX.pwi > PREFIX.pwo" % num_nodes Finishing this command with an exit code other then 0 will then cause Covalent to crash out. You can get around this by doing the below instead: script = r'''bash -c '(srun pw.x -npool %d -ndiag 1 -input PREFIX.pwi 2>stderr.log 1>PREFIX.pwo || exitcode=$?) && [[ $exitcode -eq 3 ]] && exitcode=0
if [[ $exitcode -eq 3 ]]; then
>&2 cat stderr.log
fi
exit $exitcode'
''' % (num_nodes)
os.environ["ASE_ESPRESSO_COMMAND"] = script but it's a a bit of a hack. The error codes could be handled in the ASE internals but the standard user isn't going to want to do that. From a UX standpoint, we may wish to do something about all of this. |
Ah! You said the magic words. I totally get what you mean now! I'm 100% on board. |
As a side-comment: does that exit code issue not make the ASE calculator not useful as-is? Perhaps put another way, would things work fine without Covalent but not with Covalent? I'm just curious. |
The default behaviour of ASE is to allow non-zero exit codes for the wrapped DFT binaries. So, it works fine without covalent but not with! |
Ah, I realized another related scenario where this comes up (for me at least). A lot of the workflows I run have some sort of error-handling routine associated with them, such that if the executable errors out, the error-handling routine will try to fix it and re-launch the executable rather than cancel the Slurm job entirely. That's the basis behind the Custodian package we use with VASP. For the kinds of calculations I do, I don't want Covalent to care about the exit code in a subprocess. |
What should we add?
For a number of reasons, some code you are running with
srun
may exit with an exit code other than 0 (0 indicates "success", non-zero indicates certain categories of failure) although from the standpoint of the user, nothing actually went wrong, and usable output was produced. This can happen, for example, in quantum espresso (pw.x) if self-consistency isn't met. In some select situations, we may not care about this and wish to carry on anyways.Presently, however, this non-zero exit code is caught by the slurm executor from STDERR and the calculation crashes.
I suggest we implement "allowed exit codes" other than just 0. This is intended for advanced users and we should consider throwing a "are you sure you know what you're doing?" warning if this input variable is set as it can lead to unexpected beahviour.
In bash, to ignore a certain exit code (say exit code 3), we can use:
where the above example is on Perlmutter running the GPU version of quantum expresso (pw.x).
We can implement something along these lines for srun options.
One thing that isn't clear to me is whether or not this is a slurm executor addition or a core covalent addition.
Describe alternatives you've considered.
You can set
use_srun=False
and manually define yoursrun
behaviour as the above script.The text was updated successfully, but these errors were encountered: