idea: log job and node events directly to journald #6624
Comments
A worry is that dumping cluster RAS/job data into the systemd journal on a management node might fill up the limited journal storage allotment with Flux data and push out other useful things. The end goal presumably wouldn't be to do queries on the journal directly anyway, but to just get it out of there and into something scalable and site-specific. So isn't the Python API that we've already developed, which lets sites do their own thing, arguably the better solution?
Your argument does make sense; I don't recall the exact arguments for the journald approach. I had just agreed to open an issue describing the idea. If the worry is that the backend database may spend significant time down, and thus the Python-based consumers might have to cache large numbers of events to avoid loss of data due to eventlog truncation or job purges, then nothing is stopping the Python consumer from using the journal as a local store. (Though the same caveat about filling the journal applies.)
The main thing I had in mind was to offload all of the work of creating reliable, resumable journal semantics to journald. Less for flux to implement, and perhaps less custom API since journald is well known (if journald really offers sufficient semantics to meet our needs). Journald allows setting individual log limits, does it not? I'm not suggesting buffering things forever. If flux is going to make these journals reliable across flux and node restarts, it needs to use up disk space too.
AFAIK it only allows an overall limit to be set. https://www.man7.org/linux/man-pages/man5/journald.conf.5.html although I'm by no means an expert.
It looks like they have "namespaces" (see same man page). So a flux namespace with its own independent limits could be used. It looks like the journalctl command takes a "--namespace" option for reading, but I haven't checked the C API to see what its namespace support looks like.
Seems like that's a systemd
Yeah, that figures.
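To make the "resumable" part of the namespace idea above concrete, here is a minimal sketch of how a consumer might pick up where it left off using journalctl cursors. It assumes a systemd new enough to support journal namespaces and a namespace named `flux`; the namespace name and both helper functions are hypothetical and not part of Flux. It only builds the command line and tracks the last cursor seen, and leaves running journalctl to the caller.

```python
import json


def journalctl_argv(namespace=None, after_cursor=None, follow=False):
    """Build a journalctl command line that emits one JSON object per entry."""
    argv = ["journalctl", "--output=json", "--no-pager"]
    if namespace:
        # Read from a dedicated journal namespace instead of the default one
        argv.append(f"--namespace={namespace}")
    if after_cursor:
        # Resume strictly after the last entry the consumer already handled
        argv.append(f"--after-cursor={after_cursor}")
    if follow:
        argv.append("--follow")
    return argv


def last_cursor(json_lines, previous=None):
    """Return the cursor of the last entry seen, or `previous` if there were none."""
    cursor = previous
    for line in json_lines:
        # JSON output carries each entry's cursor in the __CURSOR field
        cursor = json.loads(line).get("__CURSOR", cursor)
    return cursor
```

A consumer would persist the value returned by `last_cursor()` after each batch and pass it back as `after_cursor` on restart. Whether that is sufficient for Flux's needs is exactly the open question in this thread.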
@morrone stopped by for a chat today and suggested an idea for handling streaming events from Flux, i.e. those currently supported by the JournalConsumer interfaces for the job manager and resource module journals.
Chris proposed sending events directly to the systemd journal, presumably using the native protocol. Consumers can then use journald APIs and commands to grab events without connecting to Flux, and the persistence problem is (presumably) foisted off to journald.
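For reference, a rough sketch of what "using the native protocol" could look like from Python: fields are serialized as `KEY=value` lines (or a length-prefixed form when the value contains a newline) and sent as a single datagram to journald's socket. The field names `FLUX_EVENT` and `FLUX_JOB_ID` are made up for illustration, and this sketch does not handle entries too large for one datagram.

```python
import socket
import struct

# Default journald datagram socket for the native protocol
JOURNAL_SOCKET = "/run/systemd/journal/socket"


def encode_entry(fields):
    """Serialize a dict of fields into one native-protocol journal entry."""
    out = bytearray()
    for key, value in fields.items():
        data = value if isinstance(value, bytes) else str(value).encode("utf-8")
        name = key.encode("ascii")
        if b"\n" in data:
            # Binary-safe form: name, newline, 64-bit little-endian length,
            # the raw data, then a trailing newline
            out += name + b"\n" + struct.pack("<Q", len(data)) + data + b"\n"
        else:
            # Simple form for single-line values
            out += name + b"=" + data + b"\n"
    return bytes(out)


def send_entry(fields, path=JOURNAL_SOCKET):
    """Send one entry to journald; requires a running journald on this host."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(encode_entry(fields), path)
```

Usage might look like `send_entry({"MESSAGE": "job start", "FLUX_EVENT": "start", "FLUX_JOB_ID": "1234"})`, after which something like `journalctl FLUX_EVENT=start` could match on the custom field without any connection to Flux.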