How to understand task progression? #3262
-
Hi Flux, I am trying to understand what happens to a job of mine. I apparently submit it with a resource request which cannot possibly be allocated (more cores per rank than exist on a node), and I rightly see the job state progression like this:
Now, in this case I happen to know why the job cannot run - but assuming I don't: is it in any way possible to learn from Flux why a specific job could not be scheduled or executed, so that the application layer can be informed? Not having that information makes error tracing a bit... tedious... The only trace I see right now is:
I have a number of job specs which share the fate of the above one, but which seem to me to be valid job specs. I am a bit hesitant to open an issue for each of them, because I don't think that they indicate actual bugs but rather reflect my limited understanding of the job spec... Many thanks!
Replies: 4 comments 1 reply
-
Sorry for the delayed reply @andre-merzky. To get more detailed "events" for a job, you can view the job's "eventlog", which has more fine-grained events than the bulk job state events. For example, this eventlog will contain any job exceptions raised against the job, and these exceptions usually give some detail on the reason for a failed job. For example, in the case where I submit a job with an unsatisfiable request, I will see from the `flux jobs` listing that the job failed.
Since there is no runtime or assigned ranks, it is likely this job never reached the RUN state. To get more information, I can dump the eventlog for the job with `flux job eventlog`:
This shows that the job received an "alloc" (allocation) exception due to an "unsatisfiable request". Attaching to the job will also display any exceptions or output received by the job:
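If you want to pick the exception reason out of an eventlog programmatically, the eventlog is a series of lines, each a JSON object with a `timestamp`, a `name`, and an optional `context` (this is the format defined by Flux RFC 18). A minimal sketch, assuming that format; the sample lines below are illustrative, not exact Flux output:

```python
import json

def find_exceptions(eventlog_text):
    """Scan a job eventlog (one JSON object per line, per Flux RFC 18)
    and return the context of any 'exception' events."""
    exceptions = []
    for line in eventlog_text.splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("name") == "exception":
            exceptions.append(event.get("context", {}))
    return exceptions

# Illustrative eventlog lines (shape per RFC 18; the values are made up):
sample = """\
{"timestamp": 1553792.0, "name": "submit", "context": {"userid": 1000}}
{"timestamp": 1553793.0, "name": "exception", "context": {"type": "alloc", "severity": 0, "note": "unsatisfiable request"}}
"""

for exc in find_exceptions(sample):
    print(f"{exc.get('type')} exception: {exc.get('note')}")
```

For the job above, this would report the `alloc` exception with the note `unsatisfiable request`, which is exactly the information the scheduler attached when it rejected the request.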
To get more detail from `flux job attach`, you can increase verbosity.
(That will display the main job eventlog and the execution eventlog, along with any standard output from the job, so it is a good debugging aid.) For access to these interfaces from Python, we have some bindings for the job eventlogs in the `flux.job` module.
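Whether the events come from the Python bindings or from parsing `flux job eventlog` output yourself, a small helper can render them with timestamps relative to the first event, similar to a human-readable eventlog listing. A sketch; the event dicts below are illustrative, not real Flux output:

```python
def format_events(events):
    """Render parsed eventlog entries with timestamps relative to the
    first event, one line per event, appending any context as key=value."""
    if not events:
        return []
    t0 = events[0]["timestamp"]
    lines = []
    for ev in events:
        offset = ev["timestamp"] - t0
        context = ev.get("context", {})
        detail = " ".join(f"{k}={v}" for k, v in context.items())
        lines.append(f"{offset:.3f} {ev['name']} {detail}".rstrip())
    return lines

# Illustrative parsed events (made-up timestamps and context):
events = [
    {"timestamp": 100.0, "name": "submit", "context": {"userid": 1000}},
    {"timestamp": 100.5, "name": "depend"},
    {"timestamp": 101.25, "name": "exception",
     "context": {"type": "alloc", "note": "unsatisfiable request"}},
]
for line in format_events(events):
    print(line)
# 0.000 submit userid=1000
# 0.500 depend
# 1.250 exception type=alloc note=unsatisfiable request
```

Relative offsets make it easy to see how quickly a job moved through its states before the exception was raised.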
-
To give you an idea of the level of detail in the job eventlogs, here is an example run:
-
Great @grondo, this is what I was looking for! I have this working in Python now and can get the job events.
-
Those events come from an eventlog that is stored in the Flux KVS, so they do not expire (or at least are not removed) unless something removes them explicitly. Nothing in Flux does that today, though that could change in the future.