Update docs/aurora/running-jobs-aurora.md Fix #308
saforem2 committed Dec 15, 2023
1 parent 385126c commit e227c37
Showing 1 changed file with 19 additions and 13 deletions.

# Running Jobs on Aurora

## <a name="Aurora-Queues"></a>Queues

There is a single routing queue in place called `EarlyAppAccess`, which currently has a node count of 2,844; we recommend a maximum job size of 2048 or 2560 nodes. This queue will be replaced by new queues during an upcoming preventive maintenance (PM).

For example, a one-node interactive job can be requested for 30 minutes with the following command, where `[your_ProjectName]` is replaced with an appropriate project name.

```bash
qsub -l select=1 -l walltime=30:00 -A [your_ProjectName] -q EarlyAppAccess -I
```

Recommended PBSPro options follow.

```bash
#!/bin/sh
#PBS -A [your_ProjectName]
#PBS -N [your_JobName]
# ... (additional recommended options elided)
```

We recommend against using `-W tolerate_node_failures=all` in your qsub command. If you do use it, handle node failures as follows:
1. Start your interactive job
2. When the job transitions to the Running state, run `pbsnodes -l | grep <jobid>`
3. Manually REMOVE all nodes identified in that output from inclusion in your `mpiexec` command, as shown below:

    ```bash
    $ cat $PBS_NODEFILE > local.hostfile
    # edit local.hostfile to remove problem nodes
    $ mpiexec --hostfile local.hostfile [other mpiexec arguments]
    ```

4. Continue to execute
5. If other nodes go down during your job, the job will not be killed, and you can further exclude those nodes from your `mpiexec` command as needed

GPU-enabled applications will similarly run on the compute nodes using the examples above.
- To run an application on a specific GPU or a subset of GPUs and/or tiles, use the `ZE_AFFINITY_MASK` environment variable. For example, to restrict an application to the first two GPUs on a node, set `ZE_AFFINITY_MASK=0,1`.
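
As a minimal sketch (the application name `./myapp` and the rank counts are illustrative assumptions):

```bash
# expose only GPUs 0 and 1 to the application
export ZE_AFFINITY_MASK=0,1
mpiexec -n 2 --ppn 2 ./myapp
```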

### <a name="Binding-MPI-ranks-to-GPUs"></a>Binding MPI ranks to GPUs
Support in MPICH on Aurora for binding MPI ranks to GPUs is currently a work in progress. Applications that need this support can instead use a small helper script that appropriately sets `ZE_AFFINITY_MASK` for each MPI rank. Users are encouraged to use the `/soft/tools/mpi_wrapper_utils/gpu_tile_compact.sh` script when each MPI rank is to be bound to a single GPU tile in a round-robin assignment.

This script can be placed just before the executable in an `mpiexec` command like so.

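A representative invocation (a sketch: the rank counts and the application name `./myapp` are illustrative assumptions, not the original example):

```bash
# the wrapper sets ZE_AFFINITY_MASK per rank, then execs the application
mpiexec -n 12 --ppn 12 /soft/tools/mpi_wrapper_utils/gpu_tile_compact.sh ./myapp
```
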
Users with different MPI-GPU affinity needs, such as assigning multiple GPUs/tiles per MPI rank, can copy and modify this script to suit their needs.

## <a name="Interactive-Jobs-on-Compute-Nodes"></a>Interactive Jobs on Compute Nodes

Here is how to submit an interactive job to, for example, edit/build/test an application on Aurora compute nodes:

```bash
qsub -I -l select=1,walltime=1:00:00,place=scatter -A MYPROJECT -q workq
```

This command requests 1 node for a period of 1 hour in the `workq` queue. After waiting in the queue for a node to become available, a shell prompt on a compute node will appear. You may then start building applications and testing GPU affinity scripts on the compute node.

**NOTE:** If you want to `ssh` or `scp` to one of your assigned compute nodes, you will need to make sure your `$HOME` directory and your `$HOME/.ssh` directory permissions are both set to `700`.

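These permissions can be set with `chmod`, for example:

```bash
chmod 700 $HOME
chmod 700 $HOME/.ssh
```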

## <a name="Running-Multiple-MPI-Applications-on-a-node"></a>Running Multiple MPI Applications on a node

Multiple applications can be run simultaneously on a node by launching several `mpiexec` commands and backgrounding them. For performance, it will likely be necessary to ensure that each application runs on a distinct set of CPU resources and/or targets specific GPUs and tiles. One can provide a list of CPUs using the `--cpu-bind` option, which, when combined with `ZE_AFFINITY_MASK`, lets a user specify exactly which CPU and GPU resources each application runs on. In the simple example below, twelve instances of the application run simultaneously on a single node. In the first instance, the application spawns MPI ranks 0-3 on CPU cores 0-3 and uses GPU 0 tile 0.

```bash
# ... (other mpiexec instances elided)
mpiexec -n 4 --ppn 4 --cpu-bind list:40:41:42:43 ./hello_affinity &
wait
```

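A minimal sketch of the per-instance pattern (the `GPU.TILE` syntax for `ZE_AFFINITY_MASK` and the core lists are illustrative assumptions):

```bash
# instance 1: four ranks on cores 0-3, GPU 0 tile 0
ZE_AFFINITY_MASK=0.0 mpiexec -n 4 --ppn 4 --cpu-bind list:0:1:2:3 ./hello_affinity &
# instance 2: four ranks on cores 4-7, GPU 0 tile 1
ZE_AFFINITY_MASK=0.1 mpiexec -n 4 --ppn 4 --cpu-bind list:4:5:6:7 ./hello_affinity &
wait   # block until both backgrounded instances finish
```
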
Users will likely find it beneficial to launch processes across CPU cores in both sockets of a node.
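
For instance, a hedged sketch of one way to span sockets (the per-socket core numbering is an assumption; verify with `lscpu` on a compute node):

```bash
# bind four ranks to two cores on each socket (assumes socket 1 starts at core 52)
mpiexec -n 4 --ppn 4 --cpu-bind list:0:1:52:53 ./hello_affinity
```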

## <a name="Compute-Node-Access-to-the-Internet"></a>Compute Node Access to the Internet
Compute nodes can reach the internet through the ALCF proxy; set proxy environment variables such as the following:

```bash
export https_proxy="http://proxy.alcf.anl.gov:3128"
export ftp_proxy="http://proxy.alcf.anl.gov:3128"
```

In the future, though we don't have a timeline on this because it depends on future features in Slingshot and internal software development, we intend to make public IP addresses a schedulable resource. For instance, if only your head node needed public access, your select statement might look something like: `-l select=1:pubnet=True+63`.

## <a name="Controlling-Where-Your-Job-Runs"></a>Controlling Where Your Job Runs
If you wish to have your job run on specific nodes, form your select statement like this: `-l select=1:vnode=<node name1>+1:vnode=<node name2>...`. Obviously, that gets tedious for large jobs.
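
For illustration, a sketch of an interactive request pinned to two specific nodes (the node names are placeholders, reusing the queue from the example above):

```bash
qsub -I -l walltime=30:00 -l select=1:vnode=<node_name1>+1:vnode=<node_name2> -A [your_ProjectName] -q EarlyAppAccess
```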
