Stop daemon on compute nodes after jobtap prolog completes #166
Comments
"reducing some of the pressure on k8s is one of the things the HPE team is trying to accomplish by having Flux start and stop the daemons" We want to stop the daemons while the compute job is running, to avoid introducing jitter on the compute node. |
We should get more opinions, but I don't think the performance impact of reading the eventlog will be significant. IIRC, each broker has its own kvs-watch module and local KVS cache, so I think the load will be mostly distributed...
I think that should be OK! A quick test might be a good idea though. We've been surprised on El Cap before :-)
I noticed that if a job is canceled during a prolog, the epilog does not run. This can mean that the nnf daemons are never started again, so the next rabbit job that comes in will end up stuck: it will wait for the file systems to mount, and the daemons that are supposed to handle the mounting are not running. One solution would be to change flux-core to always run the epilog; see flux-framework/flux-core#6055. Another would be to change the rabbit prolog above to start the daemons just in case, something like the sketch below.
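(A rough sketch of that defensive change, assuming the daemon is the `nnf-clientmount` systemd unit discussed later in this thread; the unit name and script structure are assumptions, not the actual site prolog:)

```sh
#!/bin/sh
# Sketch: start the daemons defensively at the top of the rabbit prolog, in
# case the previous job was canceled and its epilog never ran. This is
# harmless if the unit is already active.
systemctl start nnf-clientmount

# ... existing prolog logic (wait for the mounts, then stop the daemons) ...
```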
This is now in place and working with flux-core <= 0.63.0. However, the new housekeeping service in 0.64.0 breaks the epilog. The problem is that the housekeeping service runs after the dws jobtap epilog action, not at the same time, as the old epilog infrastructure did. Since the dws jobtap action cannot complete until the daemons are running again to unmount the file systems, and housekeeping will not start them until that action completes, the two deadlock. @garlick do you see any ways around this? Is there a way we can use the old epilog infrastructure for this one script?
We could configure epilog as it was before to run just that script.
Excellent, any pointers on how I could configure that? Add the `[job-manager.epilog]` table to the system config?
Yes. The other component is the IMP. The epilog/housekeeping support in flux-core expects that the entry points for prolog, epilog, and housekeeping are scripts of the same name in the system configuration directory.
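(For concreteness, a sketch of the two pieces of configuration, based on the `[job-manager.epilog]` table documented in flux-config-job-manager(5) and the IMP's `[run]` tables in flux-security; the file paths here are assumptions:)

```toml
# Flux system config (path assumed), telling the job manager to run the
# epilog through the IMP:
[job-manager.epilog]
command = [ "flux-imp", "run", "epilog" ]

# IMP config (path assumed, e.g. /etc/flux/imp/conf.d/imp.toml), allowing
# the flux user to run the epilog script with privilege:
[run.epilog]
allowed-users = [ "flux" ]
path = "/etc/flux/system/epilog"
```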
The epilog I want to run is just a `systemctl start` of the daemons, like the sketch below. If the old epilog infrastructure can be pointed at that one script, that should be all I need.
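(A minimal sketch of such an epilog script; the unit name is an assumption based on the `nnf-clientmount` daemon discussed later in this thread:)

```sh
#!/bin/sh
# Sketch: restart the rabbit daemons at job end so they can unmount the
# rabbit file systems before the next job arrives.
systemctl start nnf-clientmount
```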
James, the other way around (minor nit, to reduce the apiserver load before adding more): stop the one daemon before starting the other. And at the prolog, stop one before starting the other as well.
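(In other words, an ordering like the following, with purely hypothetical unit names:)

```sh
# Hypothetical ordering: stop one daemon before starting the other, so the
# number of daemons registered with the k8s apiserver never grows.
systemctl stop nnf-daemon-one
systemctl start nnf-daemon-two
```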
Good point, will fix.
Works for me!
Be careful not to activate the epilog in the configuration for systems that don't have rabbits.
This is now configured in ansible on all the rabbit systems and seems to be working.
We had a chat about some aspects of this after the meeting today, and I was wondering why nnf-clientmount needs to be started at the end of a job like this. Can it not be socket-triggered in the systemd unit, or otherwise triggered when needed? I'm trying to figure that out, and why these need to happen before the housekeeping phase.
At the end of a job that uses the rabbits, nnf-clientmount needs to run to unmount the rabbit file systems. That unmounting needs to happen before the housekeeping phase, because there is currently a jobtap epilog added by a jobtap plugin in this repo that is only released when all the rabbit resources have been cleaned up, which includes having the compute nodes unmounted. Housekeeping runs after the epilog completes, which would be too late. I would be very happy to work to trigger it another way, but it needs to be triggered by the end of the job, before the jobtap epilog can complete.
It's one of several trigger methods, but socket triggering is usually used for things like ssh or other servers where you want the daemon to be started when a client connects to a specific port. The thought was that if this is a service that gets a connection from dws, or from somewhere, when it needs to perform an action, then we could set it up so it gets launched as a direct result of that connection being made, then shut it down after it's done.
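(For reference, socket activation in systemd looks roughly like this; the unit name and port are purely hypothetical, not taken from the nnf software:)

```ini
# nnf-clientmount.socket (hypothetical): with a matching
# nnf-clientmount.service, systemd listens on the port and starts the
# service on the first incoming connection.
[Unit]
Description=Socket activation for the clientmount daemon

[Socket]
ListenStream=12345

[Install]
WantedBy=sockets.target
```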
There are clientmount daemons running on every compute node to handle the mounting and unmounting of rabbit file systems. The daemons produce noise, and there have been some investigations lately into how to reduce it. In theory the daemons only need to be running when there are file systems to mount or unmount, at the beginning and end of jobs.
The HPE rabbit team would like Flux to start the daemons when a job finishes and stop them right before the job starts to run, so that the daemons are never running on a node at the same time as a job. The daemons are to be stopped by executing a `systemctl stop` and started with a `systemctl start`.

The logic to start the daemons could easily go in the administrative epilog. However, the daemons must not be stopped until they have finished mounting their file systems, which will happen some time after the RUN state is reached, so the command to stop them cannot be issued arbitrarily by the administrative prolog. The daemons are only guaranteed to have finished their work when the job's k8s Workflow resource goes to `PreRun: Ready: True`. That corresponds to the `dws-prolog` jobtap prolog action completing and the `dws_environment` event being posted to the job's eventlog.

One solution would be to add a final bit of logic to the administrative prolog, something like the sketch below
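(A rough reconstruction of that logic; the `flux job` commands are real, but the jobspec check and unit name are assumptions:)

```sh
#!/bin/sh
# Final administrative prolog step (sketch). Only jobs with DW directives in
# their jobspec ever post the dws_environment event, so check for them first
# to avoid waiting on jobs that don't use rabbits.
if flux job info "$FLUX_JOB_ID" jobspec | grep -q '"dw"'; then
    flux job wait-event "$FLUX_JOB_ID" dws_environment
    systemctl stop nnf-clientmount
fi
```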
(Since if the job doesn't use rabbits, the `dws_environment` event will not be posted.)

Thoughts @grondo? Would there be performance issues from having every node read the eventlog? (I hope not, because I think some coral2 plugins do this elsewhere.)
In any event, we cannot check the k8s Workflow resource from the administrative prolog, because reducing some of the pressure on k8s is one of the things the HPE team is trying to accomplish by having Flux start and stop the daemons.