Merge pull request #382 from argonne-lcf/aurora
Aurora & Other things
cms21 authored Nov 29, 2023
2 parents 40688ec + cc930a5 commit f0bd159
Showing 16 changed files with 75 additions and 16 deletions.
21 changes: 21 additions & 0 deletions balsam/config/defaults/alcf_aurora/job-template.sh
@@ -0,0 +1,21 @@
#!/bin/bash
#PBS -l select={{ num_nodes }}:system=aurora,place=scatter
#PBS -l walltime={{ wall_time_min//60 | int }}:{{ wall_time_min | int }}:00
#PBS -l filesystems=home
#PBS -A {{ project }}
#PBS -q {{ queue }}

export HTTP_PROXY=http://proxy.alcf.anl.gov:3128
export HTTPS_PROXY=http://proxy.alcf.anl.gov:3128
export http_proxy=http://proxy.alcf.anl.gov:3128
export https_proxy=http://proxy.alcf.anl.gov:3128

#remove export PMI_NO_FORK=1
export BALSAM_SITE_PATH={{balsam_site_path}}
cd $BALSAM_SITE_PATH

echo "Starting balsam launcher at $(date)"
{{launcher_cmd}} -j {{job_mode}} -t {{wall_time_min - 2}} \
{% for k, v in filter_tags.items() %} --tag {{k}}={{v}} {% endfor %} \
{{partitions}}
echo "Balsam launcher done at $(date)"
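The `walltime` directive in the template above is built from two expressions on `wall_time_min`: the hours field uses `wall_time_min//60` and the minutes field uses the full `wall_time_min`. The sketch below mirrors that arithmetic in plain Python, assuming `wall_time_min` is an integer number of minutes. `template_walltime` and `hms_walltime` are hypothetical helper names, and the second is a conventional hours/minutes split shown only for comparison. How PBS treats a minutes field above 59 is not covered here.

```python
def template_walltime(wall_time_min: int) -> str:
    # Mirrors the template: whole hours, then the full minute count, then ":00".
    return f"{wall_time_min // 60}:{wall_time_min}:00"


def hms_walltime(wall_time_min: int) -> str:
    # Conventional HH:MM:SS split, for comparison only (not what the template does).
    hours, minutes = divmod(wall_time_min, 60)
    return f"{hours:02d}:{minutes:02d}:00"


for minutes in (30, 90, 240):
    print(minutes, template_walltime(minutes), hms_walltime(minutes))
```

For a 90-minute request, the first form yields `1:90:00` and the second `01:30:00`.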
21 changes: 21 additions & 0 deletions balsam/config/defaults/alcf_aurora/settings.yml
@@ -0,0 +1,21 @@
title: "Aurora (ALCF)"

compute_node: balsam.platform.compute_node.AuroraNode
mpi_app_launcher: balsam.platform.app_run.AuroraRun
local_app_launcher: balsam.platform.app_run.LocalAppRun
mpirun_allows_node_packing: true

serial_mode_startup_params:
  cpu_affinity: none

scheduler_class: balsam.platform.scheduler.PBSScheduler
allowed_queues:
  workq:
    max_nodes: 128
    max_queued_jobs: 1
    max_walltime: 240

allowed_projects:
  - Aurora_deployment

optional_batch_job_params: {}
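As an illustration of how the `allowed_queues` limits above might be applied, here is a minimal sketch that checks a request against them. It is not Balsam's validation code: `ALLOWED_QUEUES` and `check_request` are hypothetical names, and it assumes `max_walltime` is expressed in minutes, like `wall_time_min`.

```python
from typing import Dict, List

# Hypothetical in-memory copy of the `allowed_queues` block above.
ALLOWED_QUEUES: Dict[str, Dict[str, int]] = {
    "workq": {"max_nodes": 128, "max_queued_jobs": 1, "max_walltime": 240},
}


def check_request(queue: str, num_nodes: int, wall_time_min: int) -> List[str]:
    """Return a list of limit violations; an empty list means the request fits."""
    limits = ALLOWED_QUEUES.get(queue)
    if limits is None:
        return [f"queue {queue!r} is not in allowed_queues"]
    problems = []
    if num_nodes > limits["max_nodes"]:
        problems.append(f"num_nodes {num_nodes} exceeds max_nodes {limits['max_nodes']}")
    # Assumption: max_walltime is in minutes, the same unit as wall_time_min.
    if wall_time_min > limits["max_walltime"]:
        problems.append(f"wall_time_min {wall_time_min} exceeds max_walltime {limits['max_walltime']}")
    return problems


print(check_request("workq", num_nodes=64, wall_time_min=120))
print(check_request("workq", num_nodes=256, wall_time_min=300))
```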
4 changes: 2 additions & 2 deletions balsam/config/defaults/alcf_sunspot/settings.yml
@@ -1,7 +1,7 @@
 title: "Sunspot (ALCF)"
 
-compute_node: balsam.platform.compute_node.SunspotNode
-mpi_app_launcher: balsam.platform.app_run.SunspotRun
+compute_node: balsam.platform.compute_node.AuroraNode
+mpi_app_launcher: balsam.platform.app_run.AuroraRun
 local_app_launcher: balsam.platform.app_run.LocalAppRun
 mpirun_allows_node_packing: true
 
3 changes: 2 additions & 1 deletion balsam/config/defaults/settings.yml.j2
@@ -31,6 +31,7 @@ launcher:
 local_app_launcher: {{ local_app_launcher }}
 mpirun_allows_node_packing: {{ mpirun_allows_node_packing }} # mpi_app_launcher supports multiple concurrent runs per node
 serial_mode_prefetch_per_rank: 64 # How many jobs to prefetch from API in serial mode
+# sort_by: long_large_first # Enable this option to run jobs with longest wall_time_min first, followed by jobs with largest num_nodes
 
 # Pass-through parameters to mpirun when starting the serial mode launcher:
 serial_mode_startup_params: {{ {} if not serial_mode_startup_params }}
@@ -137,4 +138,4 @@ queue_maintainer: null
 file_cleaner: null
 # file_cleaner:
 # cleanup_batch_size: 180 # Clean up to this many Job workdirs at a time
-# service_period: 30 # Cleanup files every `service_period` seconds
+# service_period: 30 # Cleanup files every `service_period` seconds
4 changes: 2 additions & 2 deletions balsam/platform/app_run/__init__.py
@@ -1,11 +1,11 @@
 from .app_run import AppRun, LocalAppRun
+from .aurora import AuroraRun
 from .mpich import MPICHRun
 from .openmpi import OpenMPIRun
 from .perlmutter import PerlmutterRun
 from .polaris import PolarisRun
 from .slurm import SlurmRun
 from .summit import SummitJsrun
-from .sunspot import SunspotRun
 from .theta import ThetaAprun
 from .theta_gpu import ThetaGPURun
 
@@ -19,6 +19,6 @@
     "ThetaGPURun",
     "MPICHRun",
     "SummitJsrun",
-    "SunspotRun",
+    "AuroraRun",
     "PerlmutterRun",
 ]
@@ -3,7 +3,7 @@
 from .app_run import SubprocessAppRun
 
 
-class SunspotRun(SubprocessAppRun):
+class AuroraRun(SubprocessAppRun):
     """
     https://www.open-mpi.org/doc/v3.0/man1/mpiexec.1.php
     """
@@ -29,7 +29,7 @@ def _build_cmdline(self) -> str:
         ]
         return " ".join(str(arg) for arg in args)
 
-    # Overide default because sunspot does not use CUDA
+    # Override default because aurora/sunspot does not use CUDA
     def _set_envs(self) -> None:
         envs = os.environ.copy()
         envs.update(self._envs)
8 changes: 2 additions & 6 deletions balsam/platform/compute_node/__init__.py
@@ -1,12 +1,10 @@
+from .alcf_aurora_node import AuroraNode
 from .alcf_cooley_node import CooleyNode
 from .alcf_polaris_node import PolarisNode
-from .alcf_sunspot_node import SunspotNode
 from .alcf_thetagpu_node import ThetaGPUNode
 from .alcf_thetaknl_node import ThetaKNLNode
 from .compute_node import ComputeNode
 from .default import DefaultNode
-from .nersc_corihas_node import CoriHaswellNode
-from .nersc_coriknl_node import CoriKNLNode
 from .nersc_perlmutter import PerlmutterNode
 from .summit_node import SummitNode
 
@@ -16,10 +14,8 @@
     "SummitNode",
     "ThetaGPUNode",
     "CooleyNode",
-    "CoriHaswellNode",
-    "CoriKNLNode",
     "PerlmutterNode",
     "PolarisNode",
-    "SunspotNode",
+    "AuroraNode",
     "ComputeNode",
 ]
@@ -8,7 +8,7 @@
 IntStr = Union[int, str]
 
 
-class SunspotNode(ComputeNode):
+class AuroraNode(ComputeNode):
     cpu_ids = list(range(104))
     gpu_ids: List[IntStr]
 
@@ -18,7 +18,7 @@ class SunspotNode(ComputeNode):
             gpu_ids.append(str(gid) + "." + str(tid))
 
     @classmethod
-    def get_job_nodelist(cls) -> List["SunspotNode"]:
+    def get_job_nodelist(cls) -> List["AuroraNode"]:
         """
         Get all compute nodes allocated in the current job context
         """
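The `gpu_ids.append(str(gid) + "." + str(tid))` line in the hunk above produces string IDs of the form `<gpu>.<tile>`. The loop bounds are not visible in this diff, so the sketch below only illustrates the ID format: `tile_ids` is a hypothetical helper, and the counts of 6 GPUs with 2 tiles each are assumed values, not taken from the commit.

```python
from typing import List, Union

IntStr = Union[int, str]


def tile_ids(num_gpus: int, tiles_per_gpu: int) -> List[IntStr]:
    # Build "<gpu>.<tile>" strings in the same format as
    # str(gid) + "." + str(tid) in the hunk above.
    return [f"{gid}.{tid}" for gid in range(num_gpus) for tid in range(tiles_per_gpu)]


# Assumed values for illustration only: 6 GPUs with 2 tiles each.
gpu_ids = tile_ids(6, 2)
print(gpu_ids)
```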
2 changes: 1 addition & 1 deletion docs/tutorials/theta-quickstart.md
@@ -9,9 +9,9 @@ the available default site setups:
 - Theta-GPU
 - Theta-KNL
 - Cooley
-- Cori (Haswell or KNL partitions)
 - Perlmutter
 - Summit
+- Aurora (coming soon)
 
 ## Install
 
20 changes: 20 additions & 0 deletions docs/user-guide/batchjob.md
@@ -76,6 +76,26 @@ multiple runs per node.
smaller BatchJobs can get through the queues faster and improve overall
throughput.

## Ordering Job Execution

By default, Balsam sorts jobs that are ready to run first by `num_nodes`
in ascending order, then by `node_packing_count` in descending order, and finally
by `wall_time_min` in descending order. With this default, the smallest jobs
by node count start first.

An alternative sorting model, when enabled, sorts jobs first
by `wall_time_min` in descending order, then by `num_nodes` in descending order,
and finally by `node_packing_count` in descending order. This alternative
starts the longest-running jobs, as estimated by
`wall_time_min`, first. If jobs have no `wall_time_min` set, it starts
the largest jobs by node count first. To enable this sorting model
for a site, edit the site's `settings.yml` configuration file.
Under `launcher`, add this option:
```yaml
sort_by: long_large_first # enables the alternative sorting model that starts the longest-running and largest-node-count jobs first
```
Restart the site after changing `settings.yml` for the changes to take effect.
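
The two orderings described above can be sketched as Python sort keys. This is a minimal illustration, not Balsam's launcher code: the `Job` dataclass and both function names are hypothetical, and a job with no `wall_time_min` set is assumed to carry the value `0`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    # Hypothetical stand-in for a Balsam job; only the three sort fields are modeled.
    name: str
    num_nodes: int
    node_packing_count: int
    wall_time_min: int


def sort_default(jobs: List[Job]) -> List[Job]:
    # num_nodes ascending, then node_packing_count and wall_time_min descending.
    return sorted(jobs, key=lambda j: (j.num_nodes, -j.node_packing_count, -j.wall_time_min))


def sort_long_large_first(jobs: List[Job]) -> List[Job]:
    # wall_time_min, num_nodes, and node_packing_count, all descending.
    return sorted(jobs, key=lambda j: (-j.wall_time_min, -j.num_nodes, -j.node_packing_count))


jobs = [
    Job("a", num_nodes=4, node_packing_count=1, wall_time_min=30),
    Job("b", num_nodes=1, node_packing_count=8, wall_time_min=10),
    Job("c", num_nodes=1, node_packing_count=1, wall_time_min=120),
    Job("d", num_nodes=2, node_packing_count=1, wall_time_min=0),
]

print([j.name for j in sort_default(jobs)])
print([j.name for j in sort_long_large_first(jobs)])
```

With these sample jobs, the default ordering starts the single-node jobs first, while the alternative starts the 120-minute job first and leaves the job with no wall time estimate for last.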

## Using the API

A unique capability of the [Balsam Python API](./api.md) is that it allows us