How to understand task progression? #3262
-
Hi Flux, I am trying to understand what happens to a job of mine. I apparently submit it with a resource request which cannot possibly be allocated (more cores per rank than exist on a node), and I rightly see the job state progression like this:
Now, in this case I happen to know why the job cannot run - but assuming I don't: is it in any way possible to learn from Flux why a specific job could not be scheduled or executed, so that the application layer can be informed? Not having that information makes error tracing a bit... tedious... The only trace I see right now is:
I have a number of job specs which share the fate of the above one, but which seem to me to be valid job specs. I am a bit hesitant to open an issue for each of them, because I don't think that they indicate actual bugs but rather reflect my limited understanding of the job spec... Many thanks!
Replies: 4 comments 1 reply
-
Sorry for the delayed reply @andre-merzky. To get more detailed "events" for a job, you can view the job's "eventlog", which has more fine-grained events than the bulk job state events. For example, this eventlog will contain any job exceptions raised against the job, and these exceptions usually give some detail on the reason for a failed job. For example, in the case where I submit a job with an unsatisfiable request, I will see from the `flux jobs` listing that the job failed.
Since there is no runtime or assigned ranks, it is likely this job never reached the RUN state. To get more information, I can dump the eventlog for the job with `flux job eventlog`:
This shows that the job received an "alloc" (allocation) exception due to an "unsatisfiable request". Attaching to the job will also display any exceptions or output received by the job:
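If you want to pick the exception reason out of an eventlog programmatically, the eventlog is a series of lines, each a JSON object with a `timestamp`, a `name`, and an optional `context` (this is the format defined by Flux RFC 18). A minimal sketch, assuming that format; the sample lines below are illustrative, not exact Flux output:

```python
import json

def find_exceptions(eventlog_text):
    """Scan a job eventlog (one JSON object per line, per Flux RFC 18)
    and return the context of any 'exception' events."""
    exceptions = []
    for line in eventlog_text.splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("name") == "exception":
            exceptions.append(event.get("context", {}))
    return exceptions

# Illustrative eventlog lines (shape per RFC 18; the values are made up):
sample = """\
{"timestamp": 1553792.0, "name": "submit", "context": {"userid": 1000}}
{"timestamp": 1553793.0, "name": "exception", "context": {"type": "alloc", "severity": 0, "note": "unsatisfiable request"}}
"""

for exc in find_exceptions(sample):
    print(f"{exc.get('type')} exception: {exc.get('note')}")
```

For the job above, this would report the `alloc` exception with the note `unsatisfiable request`, which is exactly the information the scheduler attached when it rejected the request.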
To get more detail from `flux job attach`, you can increase verbosity.
(That will display the main job eventlog and the execution eventlog, along with any standard output from the job, so it is a good debugging aid.) For access to these interfaces from Python, we have some bindings for the job eventlogs in the `flux.job` module.
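Whether the events come from the Python bindings or from parsing `flux job eventlog` output yourself, a small helper can render them with timestamps relative to the first event, similar to a human-readable eventlog listing. A sketch; the event dicts below are illustrative, not real Flux output:

```python
def format_events(events):
    """Render parsed eventlog entries with timestamps relative to the
    first event, one line per event, appending any context as key=value."""
    if not events:
        return []
    t0 = events[0]["timestamp"]
    lines = []
    for ev in events:
        offset = ev["timestamp"] - t0
        context = ev.get("context", {})
        detail = " ".join(f"{k}={v}" for k, v in context.items())
        lines.append(f"{offset:.3f} {ev['name']} {detail}".rstrip())
    return lines

# Illustrative parsed events (made-up timestamps and context):
events = [
    {"timestamp": 100.0, "name": "submit", "context": {"userid": 1000}},
    {"timestamp": 100.5, "name": "depend"},
    {"timestamp": 101.25, "name": "exception",
     "context": {"type": "alloc", "note": "unsatisfiable request"}},
]
for line in format_events(events):
    print(line)
# 0.000 submit userid=1000
# 0.500 depend
# 1.250 exception type=alloc note=unsatisfiable request
```

Relative offsets make it easy to see how quickly a job moved through its states before the exception was raised.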
-
To give you an idea of the level of detail in the job eventlogs, here is an example run:
-
Great @grondo, this is what I was looking for! I have this working in Python now and can get the job events.
-
Those events come from an eventlog that is stored in the Flux KVS, so they do not expire (or at least are not removed) unless something removes them explicitly. Nothing in Flux does that today, though that could change in the future.