waiting-for-jobs: add new guide #220

Open
wants to merge 1 commit into
base: master

Conversation

Member

@chu11 chu11 commented Mar 28, 2023

Add a new guide on how to wait for jobs to complete.


.. code-block:: console

$ flux submit --wait -n1 bash -c "sleep 30; /bin/false"
Member

I did not know this! TIL

$ echo $?
1

The above command submits a job that simply sleeps for 30 seconds on one processor (``-n1``) and then runs ``/bin/false``. The :ref:`jobid <fluid>` is immediately output, but the command won't return until the 30 second job has completed.
Member

Suggested change
- The above command submits a job that simply sleeps for 30 seconds on one processor (``-n1``) and then runs ``/bin/false``. The :ref:`jobid <fluid>` is immediately output, but the command won't return until the 30 second job has completed.
+ The above command submits a job that simply sleeps for 30 seconds on one processor (``-n1``) and then runs ``/bin/false``. The :ref:`jobid <fluxid>` is immediately output, but the command won't return until the 30 second job has completed.

Member Author

did you typo something here? When I grep I don't see a reference to "fluxid".

Member

I assumed "fluid" should be fluxid, but if "fluid" is correct my mistake!


After the command has finished we print the exit code from ``flux submit``. You'll notice the exit code is ``1``, which is the final exit code of the job, which in this case was ``1`` because we ran ``/bin/false``.
Member

Suggested change
- After the command has finished we print the exit code from ``flux submit``. You'll notice the exit code is ``1``, which is the final exit code of the job, which in this case was ``1`` because we ran ``/bin/false``.
+ After the command has finished we print the exit code from ``flux submit``, which is ``1``, because we ran ``/bin/false``.

Flux Job Status
---------------

In most cases, you do not want to sit and wait for the current job submission to complete. You would like to do other things, such as submit more jobs, and then wait for those specific jobs to complete.
Member

Indeed I don't! I have avocados to eat! Mountains to climb!


The ``flux job status`` command is the most basic way to wait for a specific job, based on jobid, to complete. Pass it one or more jobids to wait on, and ``flux job status`` will return once all of the jobs have completed. It will exit with the largest exit code from any of the jobids specified. If the job(s) have already completed, ``flux job status`` returns immediately. It can be run as many times as the user would like against the same jobid.
Member

Is the context here that I've submitted a bunch, and then (after that) I want to wait for a specific job?

Member Author

Yes, think I should mention something to that effect?

Member

Yes exactly - you read between the lines.
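
For reference, the ``flux job status`` workflow discussed in this thread (submit a bunch of jobs, then wait on specific jobids) could be sketched in shell roughly as follows. ``submit_and_wait`` is a hypothetical helper name, not a Flux command, and the sketch assumes ``flux submit`` prints only the jobid on stdout, as in the guide's examples.

```shell
# Hypothetical helper, not part of Flux: submit each argument as its own
# one-task job, then block until all of them have finished.
submit_and_wait() {
    ids=""
    for cmd in "$@"; do
        # flux submit prints the jobid and returns without waiting for the job.
        id=$(flux submit -n1 bash -c "$cmd") || return 1
        ids="$ids $id"
    done
    # Word splitting of $ids is intentional: one argument per jobid.
    # Per the guide, the exit status is the largest exit code among the jobs.
    flux job status $ids
}
```

Calling ``submit_and_wait "sleep 60; /bin/true" "sleep 30; /bin/false"`` would, going by the guide's description, return only after both jobs finish and leave a non-zero status in ``$?`` because one job ran ``/bin/false``.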

$ flux job wait
flux-job: there are no more waitable jobs

In the above example, we submit three jobs that sleep for 60, 45, and 30 seconds respectively before running ``/bin/true``. We then run ``flux job wait`` without any arguments. You'll notice the jobids for the ``sleep 30`` job, then the ``sleep 45`` job, then the ``sleep 60`` job are returned in that order. Finally, with no jobs carrying the ``waitable`` flag left to reap, ``flux job wait`` indicates there are no more waitable jobs.
Member

So it doesn't wait for all of them to complete (like the multiple one on the same line?) What is the use case for this if I have to run it a gazillion times?

Member Author

@chu11 chu11 Mar 28, 2023

I believe the typical use case is a user wants to know when a job has finished and can do some type of post-processing on its results while the other jobs keep on running. They don't care which one finishes first/next, they just need to know that one has finished (and which one).

(Hopefully this use case might explain other questions you had above/below).

Member Author

probably good to stick one sentence in there to note this common use case.

Member

So you couldn't use flux job status for that?

Member Author

flux job status requires you to input all of the jobids and doesn't exit until all of the jobs finish, thus more inconvenient.
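
The post-processing use case described in this thread might look something like the sketch below. ``reap_each`` is a hypothetical helper, not a Flux command. It assumes the caller knows how many waitable jobs were submitted and that ``flux job wait`` prints the reaped jobid on stdout, as in the guide's example; how a failed job is reported is not shown here, so the ``else`` branch is deliberately vague.

```shell
# Hypothetical helper, not part of Flux: handle `count` waitable jobs one at a
# time, in whatever order they finish.
reap_each() {
    count=$1
    i=0
    while [ "$i" -lt "$count" ]; do
        # With no jobid, flux job wait blocks until any waitable job finishes.
        # Assumption: the jobid arrives on stdout, as in the guide's example.
        if jobid=$(flux job wait); then
            echo "post-process $jobid"
        else
            echo "a job failed, or nothing was left to wait on" >&2
        fi
        i=$((i + 1))
    done
}
```

Because each call reaps exactly one job, the loop never sees the same jobid twice, and no jobids have to be passed in.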

ƒ4YPufmCjq
$ flux submit --flags waitable -n1 bash -c "sleep 30; /bin/false"
ƒ4YSVQWfZq
$ flux job wait --all --verbose
Member

ohh this one makes sense! But what is the use case for without --all?


This example is similar to the above, except one of the jobs runs ``/bin/false`` instead of ``/bin/true``. When ``flux job wait --all`` is executed, it prints a message indicating that one job has failed (the one that ran ``/bin/false``). As with ``flux job status``, the exit code is ``1``, the highest exit code among the jobs.

The biggest disadvantage of ``flux job wait`` compared to ``flux job status`` is that jobs can only waited on once.
Member

Suggested change
- The biggest disadvantage of ``flux job wait`` compared to ``flux job status`` is that jobs can only waited on once.
+ The biggest disadvantage of ``flux job wait`` compared to ``flux job status`` is that jobs can only be waited on once.

Contributor

Only being able to wait on a job once is not necessarily a disadvantage; without it you would not be able to run flux job wait in a loop (you'd just keep getting the same jobid). So there is a purpose here, and each interface satisfies different use cases. Instead of calling this a disadvantage, maybe the guide should discuss the use cases for which each interface is designed?


$ flux submit --flags waitable -n1 bash -c "sleep 30; /bin/true"
ƒBbk3qrdro
$ flux job wait ƒBbk3qrdro
Member

Why would I put the jobid at all? Wouldn't I just run flux job wait without any args like shown in the example above?

Member Author

@chu11 chu11 Mar 28, 2023

ahh you're correct for this specific case, they wouldn't need to. Would it be clearer to not put in the jobid in this case? (Edit: I see your comment below, probably should remove it)

Contributor

what if there was a previously submitted waitable job that was not yet reaped? flux job wait doesn't necessarily only wait for the last submitted job...

Pros:

- ``flux job wait`` is more efficient when waiting for a set of jobs
- Jobids do not need to be specified to ``flux job wait``
Member

So maybe just take that part of the tutorial out - don't show giving a job id to flux job wait if that shouldn't be learned.

Member Author

I see your point. I'll mention it, but definitely stress it less.

Member Author

chu11 commented Mar 28, 2023

pushed a fixup, tweaking a few things, adding a sentence here and there given comments above

Contributor

grondo commented Mar 28, 2023

Perhaps something should be said in here to the effect of: flux job wait implements semantics similar to wait(2) and waitpid(2) system calls or the wait POSIX shell command. It is much more efficient than flux job status, but can only be called once per waitable job since the wait status is "reaped", and it requires instance owner privileges.

If you need to wait for thousands of jobs efficiently, or need to wait for single jobs as they complete, then flux job wait with waitable jobs is probably the best solution.
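
As a rough illustration of the bulk case, the sketch below submits every job with the ``waitable`` flag and then reaps the whole set with one ``flux job wait --all``, using only the options shown in the guide's examples. ``run_waitable_batch`` is a hypothetical helper name, not a Flux command.

```shell
# Hypothetical helper, not part of Flux: submit every argument as a waitable
# one-task job, then reap the whole set with a single command.
run_waitable_batch() {
    for cmd in "$@"; do
        # The waitable flag is what lets flux job wait reap this job later.
        flux submit --flags waitable -n1 bash -c "$cmd" >/dev/null || return 1
    done
    # No jobids needed; returns once every waitable job has been reaped.
    flux job wait --all
}
```

The jobids are discarded on purpose, since nothing downstream needs them. As noted above, this only works for the instance owner, and each job's wait status can be collected once.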

Member Author

chu11 commented Mar 29, 2023

re-pushed. taking into account several of the comments above, re-worked the flow of the flux job wait section a bit.

Member Author

chu11 commented Apr 1, 2023

re-pushed, updating example script given completion of flux-framework/flux-core#5033

Add a new guide on how to wait for jobs to complete.
@chu11 chu11 force-pushed the how_to_wait_jobs branch from e9d5654 to bbd2e27 Compare June 8, 2023 15:37
3 participants