In Flux, a batch job is a script submitted along with a request for resources. When resources become available, the script is run as the initial program of a new single-user instance of Flux, with the batch user as instance owner. Thus, in contrast to a traditional batch scheduler, a Flux batch job has access to a fully featured resource manager instance: it can not only execute work serially on the allocated resources, but also load custom modules or customize the parameters of existing modules, monitor and update the status of allocated resources, and even submit more batch jobs to run further sub-instances of Flux.
Note
In the following, batch jobs are described using a single-user Flux instance, but the same commands and techniques also work at the system level.
For this demonstration, we first launch Flux under SLURM and get an interactive shell.
$ salloc -N4 -ppdebug
salloc: Granted job allocation 5620626
$ srun -N ${SLURM_NNODES} -n ${SLURM_NNODES} --pty --mpi=none --mpibind=off flux start
Note
If launching under SLURM is not possible or convenient, a single-node, single-user instance of Flux can be started with flux start -s 4.
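For example, a throwaway 4-broker instance can be started on the local node and handed an initial program directly (a minimal sketch; the hostname command here is just a stand-in for real work):
$ flux start -s 4 flux mini run -n 4 hostname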
The high-level Flux command that users can use to submit a batch script is flux mini batch.
sleep_batch.sh:
#!/bin/bash
echo "Starting my batch job"
echo "Print the resources allocated to this batch job"
flux hwloc info
echo "Use sleep to emulate a parallel program"
echo "Run the program at a total of 4 processes each requiring"
echo "1 core. These processes are equally spread across 2 nodes."
flux mini run -N 2 -n 4 sleep 120
flux mini run -N 2 -n 4 sleep 120
The name of your batch script is sleep_batch.sh.
As you can see from the flux mini run lines in this script, the overall resource requirement is 4 cores equally spread across two nodes, since the default number of cores assigned to each task (tasks being specified by the -n option) is just one.
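Equivalently, the default of one core per task could be spelled out explicitly with the -c (cores-per-task) option; the line below is shown only for illustration:
flux mini run -N 2 -n 4 -c 1 sleep 120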
We now submit this script interactively three times using flux mini batch.
$ flux mini batch --nslots=2 --cores-per-slot=2 --nodes=2 ./sleep_batch.sh
$ flux mini batch --nslots=2 --cores-per-slot=2 --nodes=2 ./sleep_batch.sh
$ flux mini batch --nslots=2 --cores-per-slot=2 --nodes=2 ./sleep_batch.sh
Users must specify the overall resource requirement of each job using --nslots, which is the only required option, along with other optional settings describing the resource shape of each slot and how the slots are distributed across nodes.
In this example, each job requests two slots, each with 2 cores (--cores-per-slot=2), and these slots must be equally spread across two nodes (--nodes=2).
Currently, if you want a script to be exclusively allocated to a set of nodes, the recommended option set is --nslots=<NUM OF NODES>, --cores-per-slot=<MAX NUM OF CORES PER NODE>, and --nodes=<NUM OF NODES>.
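For instance, on a hypothetical cluster with 36 cores per node, exclusively allocating two whole nodes to the script might look like the following (the core count is an assumption for illustration only):
$ flux mini batch --nslots=2 --cores-per-slot=36 --nodes=2 ./sleep_batch.sh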
Note
Internally, Flux will create a nested Flux instance allocated to the requested resources of each batch job and run the batch script inside that nested instance. While a batch script is expected to launch parallel jobs using flux mini run or flux mini submit at this level, nothing prevents the script from further batching other sub-batch jobs using the flux mini batch interface, if desired.
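As a sketch, a hypothetical nested batch script could look like the following; the inner resource shape is chosen only for illustration:
#!/bin/bash
# Submit two sub-batch jobs, each requesting one slot with 2 cores
flux mini batch --nslots=1 --cores-per-slot=2 ./sleep_batch.sh
flux mini batch --nslots=1 --cores-per-slot=2 ./sleep_batch.sh
# Wait for the sub-batch jobs to complete before this batch job exits
flux queue drain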
You can check the status of these batch jobs:
$ flux jobs -a
JOBID USER NAME ST NTASKS NNODES RUNTIME RANKS
ƒWZEQa8X dahn sleep_batc R 2 2 1.931s [0-1]
ƒW8eCV2o dahn sleep_batc R 2 2 2.896s [0-1]
ƒVhYLeJB dahn sleep_batc R 2 2 3.878s [0-1]
By default, the stdout/stderr of each batch job will be redirected to the flux-${JOBID}.out file, and you can easily change the name of this file by passing --output=<FILE NAME1> and --error=<FILE NAME2>.
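For example, to give a batch job distinct output and error files (the file names here are illustrative):
$ flux mini batch --nslots=2 --cores-per-slot=2 --nodes=2 --output=sleep_batch.out --error=sleep_batch.err ./sleep_batch.sh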
Checking the output file of one of the batch jobs:
$ flux job attach ƒWZEQa8X
0: stdout redirected to flux-ƒWZEQa8X.out
0: stderr redirected to flux-ƒWZEQa8X.out
$ cat flux-ƒWZEQa8X.out
Print the resources allocated to this batch job
2 Machines, 4 Cores, 8 PUs
Use sleep to emulate a parallel program
Run the program at a total of 4 processes each requiring
1 core. These processes are equally spread across 2 nodes.
Users may want to wrap the above procedure in a script submitted to another resource manager such as SLURM.
An example sbatch script:
#!/bin/sh
#SBATCH -N 4
srun -N ${SLURM_NNODES} -n ${SLURM_NNODES} --mpi=none --mpibind=off flux start flux_batch.sh
Note
--pty
is not used in this case because this option
is known to produce a side effect in a non-interactive batch
environment.
flux_batch.sh:
#!/bin/sh
flux mini batch --nslots=2 --cores-per-slot=2 --nodes=2 ./sleep_batch.sh
flux mini batch --nslots=2 --cores-per-slot=2 --nodes=2 ./sleep_batch.sh
flux mini batch --nslots=2 --cores-per-slot=2 --nodes=2 ./sleep_batch.sh
flux queue drain
It is important to note that some of the Flux commands used above are blocking and some are not.
Both flux mini submit and flux mini batch have submit semantics: they submit a parallel program or batch script and return shortly afterwards.
To keep the containing script from exiting before the submitted jobs finish, you can use flux queue drain, which drains the queue so that no further jobs can be submitted and then waits until all submitted jobs complete; for this to work, it should not be run in the background.
By contrast, flux mini run blocks until the target program completes.
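The difference can be summarized in a short sketch (the sleep commands are stand-ins for real work):
#!/bin/sh
# Non-blocking: returns as soon as the job is enqueued
flux mini submit -n 4 sleep 30
# Blocking: returns only after the program itself completes
flux mini run -n 4 sleep 30
# Stop accepting new jobs and wait for all submitted jobs to finish
flux queue drain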
With our Fluxion graph-based scheduler, users can easily specialize scheduling behavior to suit the characteristics of their workloads.
As an example, we describe how to set the queue and backfill policy of the submitting Flux instance to a simple policy named EASY, while keeping the default policy, First Come First Served (FCFS), for the nested Flux instances.
$ salloc -N4 -ppdebug
salloc: Granted job allocation 5620626
$ cat sched-fluxion-qmanager.toml
[sched-fluxion-qmanager]
queue-policy = "easy"
$ srun -N ${SLURM_NNODES} -n ${SLURM_NNODES} --pty --mpi=none --mpibind=off flux start -o,--config-path=./
sched-fluxion-qmanager is one of the modules from Fluxion, and sched-fluxion-qmanager.toml in the current working directory is our TOML configuration file, which changes the queue/backfilling policy to EASY backfilling.
This backfill-capable queue policy can significantly improve (that is, shorten) the overall makespan of a set of batch jobs.
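For comparison, omitting the file entirely, or setting the policy explicitly to the default, keeps FCFS scheduling (a minimal sketch):
[sched-fluxion-qmanager]
queue-policy = "fcfs"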
Note
Note that we pass the current working directory to -o,--config-path so that Fluxion can use this TOML file to customize its scheduling.
This file will not affect any other nested Flux instances unless they are also started with the same -o,--config-path option.
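If you also want a particular nested batch instance to pick up the same configuration, one way might be to forward the broker option at submission time (assuming your version of flux mini batch supports --broker-opts; shown only as a sketch):
$ flux mini batch --nslots=2 --cores-per-slot=2 --nodes=2 --broker-opts=--config-path=$(pwd) ./sleep_batch.sh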