Stop daemon on compute nodes after jobtap prolog completes #166

Open · jameshcorbett opened this issue Jun 14, 2024 · 16 comments

@jameshcorbett (Member) commented Jun 14, 2024

There are clientmount daemons running on every compute node to handle the mounting and unmounting of rabbit file systems. The daemons produce noise, and there have been some investigations lately into how to reduce it. In theory the daemons only need to be running when there are file systems to mount or unmount, at the beginning and end of jobs.

The HPE rabbit team would like Flux to start the daemons when a job finishes and stop them right before the job starts to run, so that the daemons are never running on a node at the same time as a job. The daemons are to be stopped by executing a systemctl stop and started with a systemctl start.

The logic to start the daemons could easily go in the administrative epilog. However, the daemons must not be stopped until they have finished mounting their file systems, which happens some time after the job's RUN state is reached, so the command to stop them cannot be issued at an arbitrary point in the administrative prolog. The daemons are only guaranteed to have finished their work when the job's k8s Workflow resource reaches PreRun: Ready: True. That corresponds to the dws-prolog jobtap prolog action completing and the dws_environment event being posted to the job's eventlog.

One solution would be to add a final bit of logic to the administrative prolog, something like

if job_uses_rabbits; then    # placeholder for a real check
    # wait for the dws-prolog jobtap action to finish before stopping the daemon
    flux job wait-event ${FLUX_JOB_ID} dws_environment    # FLUX_JOB_ID from the prolog environment
    systemctl stop nnf-clientmount
else
    systemctl stop nnf-clientmount
fi

(since if the job doesn't use rabbits, the dws_environment event will not be posted).

Thoughts @grondo? Would there be performance issues from having every node read the eventlog? (I hope not, because I think some coral2 plugins already do this elsewhere.)

In any event, we cannot check the k8s Workflow resource from the administrative prolog, because reducing some of the pressure on k8s is one of the things the HPE team is trying to accomplish by having Flux start and stop the daemons.

@roehrich-hpe (Collaborator) commented:

"reducing some of the pressure on k8s is one of the things the HPE team is trying to accomplish by having Flux start and stop the daemons"

We want to stop the daemons while the compute job is running, to avoid introducing jitter on the compute node.

@grondo (Contributor) commented Jun 14, 2024

"Thoughts @grondo? Would there be performance issues from having every node read the eventlog? (I hope not, because I think some coral2 plugins already do this elsewhere.)"

We should get more opinions, but I don't think the performance impact of wait-event should be too bad. It would be easy enough to test a worst case scenario.
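
For example, a quick worst-case test might be sketched roughly like this (the node count and event name below are only illustrative):

JOBID=$(flux submit -N 1024 sleep 60)
# have every broker rank wait on the same job's eventlog entry at once
flux exec -r all flux job wait-event ${JOBID} start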

IIRC, each broker has its own kvs-watch module and local KVS cache, so I think the load will be mostly distributed...

@chu11 @garlick - any concerns?

@garlick (Member) commented Jun 14, 2024

I think that should be OK! A quick test might be a good idea though. We've been surprised on el cap before :-)

@jameshcorbett (Member, Author) commented:

I noticed that if a job is canceled during a prolog, the epilog does not run. This can mean that the nnf daemons are never started again. So the next rabbit job that comes in will end up stuck, because it will wait for the file systems to mount and the daemons that are supposed to handle the mounting are not running.

One solution would be to change flux-core to always run the epilog. See flux-framework/flux-core#6055

Another would be to change the rabbit prolog above to start the daemons just in case, something like

# just in case they aren't already running
systemctl start nnf-clientmount

if job_uses_rabbits; then    # placeholder for a real check
    flux job wait-event ${FLUX_JOB_ID} dws_environment
    systemctl stop nnf-clientmount
else
    systemctl stop nnf-clientmount
fi

@jameshcorbett (Member, Author) commented:

This is now in place and working with flux-core <= 0.63.0. However, the new housekeeping service in 0.64.0 breaks the epilog.

The problem is that the housekeeping service runs after the dws jobtap epilog action rather than concurrently with it, as the old epilog infrastructure did. Since the dws jobtap action cannot complete until the start_nnf_services housekeeping script runs (which starts the nnf-clientmount service that the dws-epilog action indirectly depends on), this leads to a job deadlock, with housekeeping waiting on the dws-epilog and vice versa.

@garlick do you see any ways around this? Is there a way we can use the old epilog infrastructure for this one script?

@garlick (Member) commented Jul 30, 2024

We could configure epilog as it was before to run just that script.

@jameshcorbett (Member, Author) commented:

Excellent, any pointers on how I could configure that? Add the job-manager.epilog section back to job-manager.toml, add a new directory somewhere with just that script, and point the epilog at it?

@garlick (Member) commented Jul 30, 2024

Yes. The other component is the IMP [run.epilog] table that points to the /etc/flux/system/epilog script (still in place I think).
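
A rough sketch of what that might look like (the table and key names here are assumptions; check flux-config-job-manager(5) and the flux-security imp documentation rather than copying this):

# job-manager.toml: bring the per-job epilog back for just this script
[job-manager.epilog]
command = [ "flux", "perilog-run", "epilog" ]   # hypothetical command
timeout = "1h"

# imp.toml: the [run.epilog] table, pointing at the system entry point
[run.epilog]
path = "/etc/flux/system/epilog"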

The epilog/housekeeping scripts in /etc/flux/system are a little confusing on our systems because housekeeping runs epilog.real, as does epilog, which is currently unused. epilog should be modified to run a new directory of scripts. However, that's going to make things really confusing since we'll have two directories with epilog in the name. This might need a little thought by the sys admins about how they want to organize things.

Flux expects that the entry points for prolog, epilog, and housekeeping are the scripts of the same name in /etc/flux/system. I would suggest not changing that. If we enable running prolog and epilog under systemd, the unit scripts expect those paths.

@jameshcorbett (Member, Author) commented Jul 30, 2024

The epilog I want to run is just

systemctl start nnf-clientmount
systemctl stop nnf-dm

If /etc/flux/system/epilog is currently unused, maybe I could just change its contents to those two lines, rather than adding a new directory?

@roehrich-hpe (Collaborator) commented:

James, the other way around (minor nit: reduce the apiserver load before adding more to it):

systemctl stop nnf-dm
systemctl start nnf-clientmount

And at the prolog, stop one before starting the other as well.

@jameshcorbett (Member, Author) commented:

Good point, will fix.

@garlick (Member) commented Jul 30, 2024

Works for me!

epilog is controlled by the sys admins + ansible on our systems, so check in with them.

Be careful not to activate the epilog in the job-manager config before the script's contents are updated though, or we'll have all that slow gunk running twice, and nobody wants that.

@jameshcorbett (Member, Author) commented:

This is now configured in ansible on all the rabbit systems and seems to be working.

@trws (Member) commented Aug 8, 2024

We had a chat about some aspects of this after the meeting today, and I was wondering why nnf-clientmount needs to be started at the end of a job like this. Can it not be socket-triggered in the systemd unit or otherwise triggered when needed? I'm trying to figure that out, and also why these steps need to happen before the housekeeping phase.

@jameshcorbett (Member, Author) commented:

At the end of a job that uses the rabbits, nnf-clientmount needs to run to unmount the rabbit file systems. That unmounting needs to happen before the housekeeping phase because there is currently a jobtap epilog added by a jobtap plugin in this repo that is only released when all the rabbit resources have been cleaned up, which includes having the compute nodes unmounted. Housekeeping runs after the epilog completes, which would be too late.

I would be very happy to trigger it another way, but it needs to be triggered by the job's finish event. How does socket triggering of systemd units work?

@jameshcorbett reopened this Aug 8, 2024
@trws (Member) commented Aug 8, 2024

It's one of several trigger methods, but socket triggering is usually used for things like ssh or other servers where you want the daemon to be started when a client connects to a specific port. The thought was that if this is a service that gets a connection from dws, or from somewhere else, when it needs to perform an action, then we could set it up so it gets launched as a direct result of that connection being made and shut down after it's done.
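
As a rough illustration of the mechanism (not a claim that nnf-clientmount supports this today; the unit name and port are hypothetical), a .socket unit is paired with a .service of the same name, and systemd starts the service only when a client connects:

# nnf-clientmount.socket (hypothetical)
[Socket]
ListenStream=4500    # illustrative port; systemd listens here on behalf of the service
Accept=no            # a single service instance handles all connections

[Install]
WantedBy=sockets.target

# The matching nnf-clientmount.service is then started on the first
# connection and receives the listening socket (e.g. via sd_listen_fds());
# it can be stopped or exit on its own once the work is done.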
