waiting-for-jobs: add new guide #220

chu11 · 2023-03-28T04:10:55Z

Add a new guide on how to wait for jobs to complete.

vsoch · 2023-03-28T04:12:04Z

jobs/waiting-for-jobs.rst

+
+.. code-block:: console
+
+    $ flux submit --wait -n1 bash -c "sleep 30; /bin/false"


I did not know this! TIL

vsoch · 2023-03-28T04:12:23Z

jobs/waiting-for-jobs.rst

+    $ echo $?
+    1
+
+The above command submits a job that simply sleeps for 30 seconds on one processor (``-n1``) and then runs ``/bin/false``.  The :ref:`jobid <fluid>` is immediately output, but the command won't return until the 30 second job has completed.


Suggested change

The above command submits a job that simply sleeps for 30 seconds on one processor (``-n1``) and then runs ``/bin/false``. The :ref:`jobid <fluid>` is immediately output, but the command won't return until the 30 second job has completed.

The above command submits a job that simply sleeps for 30 seconds on one processor (``-n1``) and then runs ``/bin/false``. The :ref:`jobid <fluxid>` is immediately output, but the command won't return until the 30 second job has completed.

Did you typo something here? When I grep, I don't see a reference to "fluxid".

I assumed "fluid" should be "fluxid", but if "fluid" is correct, my mistake!

vsoch · 2023-03-28T04:13:18Z

jobs/waiting-for-jobs.rst

+
+The above command submits a job that simply sleeps for 30 seconds on one processor (``-n1``) and then runs ``/bin/false``.  The :ref:`jobid <fluid>` is immediately output, but the command won't return until the 30 second job has completed.
+
+After the command has finished we print the exit code from ``flux submit``.  You'll notice the exit code is ``1``, which is the final exit code of the job, which in this case was ``1`` because we ran ``/bin/false``.


Suggested change

After the command has finished we print the exit code from ``flux submit``. You'll notice the exit code is ``1``, which is the final exit code of the job, which in this case was ``1`` because we ran ``/bin/false``.

After the command has finished we print the exit code from ``flux submit``, which is ``1``, because we ran ``/bin/false``.
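
For reference, a minimal sketch of how a script could branch on that exit code. The names `run_and_report` and `last_rc` are hypothetical and not part of Flux, and a plain `sh -c` command stands in for the real submission so the snippet runs without a Flux instance.

```shell
# Hypothetical helper: run a blocking command and branch on its exit code.
# The exit code is also stored in the global variable last_rc.
run_and_report() {
    rc=0
    "$@" || rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "job succeeded"
    else
        echo "job failed with exit code $rc"
    fi
    last_rc=$rc
}

# With Flux available, the call from the guide would be:
#   run_and_report flux submit --wait -n1 bash -c "sleep 30; /bin/false"
# Stand-in that needs no Flux instance (shorter sleep, same failing exit):
run_and_report sh -c "sleep 1; false"
```

Because `flux submit --wait` is described above as returning the job's exit code, the same wrapper should apply unchanged; only the stand-in command differs.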

vsoch · 2023-03-28T04:14:03Z

jobs/waiting-for-jobs.rst

+Flux Job Status
+---------------
+
+In most cases, you do not want to sit and wait for the current job submission to complete.  You would like to do other things, such as submit more jobs, and then wait for those specific jobs to complete.


Indeed I don't! I have avocados to eat! Mountains to climb!

vsoch · 2023-03-28T04:14:33Z

jobs/waiting-for-jobs.rst

+
+In most cases, you do not want to sit and wait for the current job submission to complete.  You would like to do other things, such as submit more jobs, and then wait for those specific jobs to complete.
+
+The ``flux job status`` command is the most basic way to wait for a specific job, based on jobid, to complete.  Pass it one or more jobids to wait on, and ``flux job status`` will return once all of the jobs have completed.  It will exit with largest exit code from any of the jobids specified.  If the job(s) have already completed, ``flux job status`` returns immediately.  It can be run as many times as the user would like against the same jobid.


Is the context here that I've submitted a bunch, and then (after that) I want to wait for a specific job?

Yes, think I should mention something to that effect?

Yes exactly - you read between the lines.
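
The "largest exit code" behavior described for `flux job status` can be sketched with ordinary background processes. This is only a plain-shell analogy under the assumption that the semantics match the description above; `wait_all_max` and `max_rc` are hypothetical names, and no Flux commands are involved.

```shell
# Plain-shell analogy only; no Flux commands are used here.
# wait_all_max is a hypothetical helper: it waits for every PID given
# and stores the largest exit code in the global variable max_rc.
wait_all_max() {
    max_rc=0
    for pid in "$@"; do
        rc=0
        wait "$pid" || rc=$?
        if [ "$rc" -gt "$max_rc" ]; then
            max_rc=$rc
        fi
    done
}

# Three background "jobs" with different exit codes.
sh -c "exit 0" &
p1=$!
sh -c "sleep 1; exit 2" &
p2=$!
sh -c "exit 1" &
p3=$!

# Returns only after all three have finished, like passing several jobids.
wait_all_max "$p1" "$p2" "$p3"
echo "largest exit code: $max_rc"
```

As with passing several jobids to `flux job status`, the caller has to name every process up front and gets control back only once all of them are done.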

vsoch · 2023-03-28T04:18:31Z

jobs/waiting-for-jobs.rst

+    $ flux job wait
+    flux-job: there are no more waitable jobs
+
+In this above example, we submit three jobs, sleeping for 60, 45, and 30 seconds respectively before running ``/bin/true``.  We then run ``flux job wait`` without any inputs.  You'll notice the jobids for the ``sleep 30`` job, then ``sleep 45`` job, then ``sleep 60`` job are returned in that order.  Finally, without any jobs left running with the ``waitable`` flag, ``flux job wait`` indicates there are no more waitable jobs.


So it doesn't wait for all of them to complete (like the multiple one on the same line?) What is the use case for this if I have to run it a gazillion times?

I believe the typical use case is a user wants to know when a job has finished and can do some type of post-processing on its results while the other jobs keep on running. They don't care which one finishes first/next, they just need to know that one has finished (and which one).

(Hopefully this use case might explain other questions you had above/below).

probably good to stick one sentence in there to note this common use case.

So you couldn't use flux job status for that?

flux job status requires you to input all of the jobids and doesn't exit until all of the jobs finish, thus more inconvenient.

vsoch · 2023-03-28T04:19:07Z

jobs/waiting-for-jobs.rst

+    ƒ4YPufmCjq
+    $ flux submit --flags waitable -n1 bash -c "sleep 30; /bin/false"
+    ƒ4YSVQWfZq
+    $ flux job wait --all --verbose


ohh this one makes sense! But what is the use case for without --all?

vsoch · 2023-03-28T04:19:32Z

jobs/waiting-for-jobs.rst

+
+This example is similar to the above, except one of the jobs runs ``/bin/false`` instead of ``/bin/true``.  When ``flux job wait --all`` is executed, you'll notice a message output indicating that one job has failed (the one that ran ``/bin/false``).  And similar to ``flux job status``, the exit code of ``1`` is returned due to the highest exit code of all the jobs.
+
+The biggest disadvantage of ``flux job wait`` compared to ``flux job status`` is that jobs can only waited on once.


Suggested change

The biggest disadvantage of ``flux job wait`` compared to ``flux job status`` is that jobs can only waited on once.

The biggest disadvantage of ``flux job wait`` compared to ``flux job status`` is that jobs can only be waited on once.

Only being able to wait on a job once is not necessarily a disadvantage: without it, you would not be able to run flux job wait in a loop (you'd just keep getting the same jobid continually). So there is a purpose here, and each interface satisfies different use cases. Instead of calling this a disadvantage, maybe the guide should discuss the use cases for which each interface is designed?

vsoch · 2023-03-28T04:20:36Z

jobs/waiting-for-jobs.rst

+
+    $ flux submit --flags waitable -n1 bash -c "sleep 30; /bin/true"
+    ƒBbk3qrdro
+    $ flux job wait ƒBbk3qrdro


Why would I put the jobid at all? Wouldn't I just run flux job wait without any args like shown in the example above?

Ahh, you're correct for this specific case, they wouldn't need to. Would it be clearer to not put in the jobid in this case? (Edit: I see your comment below, probably should remove it.)

what if there was a previously submitted waitable job that was not yet reaped? flux job wait doesn't necessarily only wait for the last submitted job...

vsoch · 2023-03-28T04:21:17Z

jobs/waiting-for-jobs.rst

+Pros:
+
+- ``flux job wait`` more efficient when waiting for a set of jobs
+- Jobids do not need to be specified to ``flux job wait``


So maybe just take that part of the tutorial out - don't show giving a job id to flux job wait if that shouldn't be learned.

I see your point. I'll mention it, but definitely stress it less.

chu11 · 2023-03-28T05:01:27Z

pushed a fixup, tweaking a few things, adding a sentence here and there given comments above

grondo · 2023-03-28T14:28:29Z

Perhaps something should be said in here to the effect of: flux job wait implements semantics similar to wait(2) and waitpid(2) system calls or the wait POSIX shell command. It is much more efficient than flux job status, but can only be called once per waitable job since the wait status is "reaped", and it requires instance owner privileges.

If you need to wait for thousands of jobs efficiently, or need to wait for single jobs as they complete, then flux job wait with waitable jobs is probably the best solution.

chu11 · 2023-03-29T05:13:00Z

re-pushed. taking into account several of the comments above, re-worked the flow of the flux job wait section a bit.

chu11 · 2023-04-01T13:04:46Z

re-pushed, updating example script given completion of flux-framework/flux-core#5033


vsoch reviewed Mar 28, 2023


chu11 mentioned this pull request Mar 29, 2023

flux job wait: special exit code when no more waitable jobs flux-framework/flux-core#5033

Closed

chu11 force-pushed the how_to_wait_jobs branch from 3360364 to 77543eb Compare March 29, 2023 05:12

chu11 mentioned this pull request Mar 29, 2023

flux-job(1): note that flux job wait works on jobs that specify waitable flag flux-framework/flux-core#5038

Closed

chu11 force-pushed the how_to_wait_jobs branch from 77543eb to e9d5654 Compare April 1, 2023 13:03

waiting-for-jobs: add new guide

bbd2e27

Add a new guide on how to wait for jobs to complete.

chu11 force-pushed the how_to_wait_jobs branch from e9d5654 to bbd2e27 Compare June 8, 2023 15:37



