For Jet, change APRUN for various tasks from mpirun to srun since that is much faster; fix path in WE2E run script where the external model files are located. (#599)

## DESCRIPTION OF CHANGES: 
1. Update location of external model files on Jet in run_experiments.sh (this location was changed due to an administrative change in a user group name).
2. Change `APRUN` for various tasks from `mpirun` to `srun`, since `srun` is much faster (10 to 50 times faster).
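
As a rough illustration of the `APRUN` change, the sketch below mimics the per-machine `case` block found in the `exregional_*.sh` scripts. The function name `select_aprun` and the fallback branch for other machines are assumptions made for this example and are not in the repository. The underlying idea, as far as Slurm's documented behavior goes, is that `srun` launched inside a job allocation takes its task count from that allocation, so the explicit `-np` count no longer has to be built on Jet.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): echo the launcher
# command for a machine, given the node count and processes per node.
select_aprun() {
  local machine="$1" nnodes="$2" ppn="$3"
  case "$machine" in
    "JET")
      # Form used after this change: no -np argument. srun is expected
      # to pick up the task count from the Slurm job allocation.
      echo "srun"
      ;;
    *)
      # Assumed fallback for illustration, mirroring the old Jet form.
      local nprocs=$(( nnodes * ppn ))
      echo "mpirun -np ${nprocs}"
      ;;
  esac
}

APRUN="$(select_aprun "JET" 2 8)"
echo "APRUN=${APRUN}"
```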

## TESTS CONDUCTED: 
The WE2E tests for the release in `tests/testlist.release_public_v1.txt` were run on various Jet partitions.  The tests are:

- GST_release_public_v1
- grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
- grid_RRFS_CONUS_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1alpha
- grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
- grid_RRFS_CONUS_13km_ics_HRRR_lbcs_RAP_suite_RRFS_v1alpha
- grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
- grid_RRFS_CONUS_3km_ics_HRRR_lbcs_RAP_suite_RRFS_v1alpha

The partitions on which these were run are sjet, vjet, xjet, and kjet.  Note that for each partition, all tasks except `run_fcst` were run on that partition, but `run_fcst` was always run on xjet (because that was the required configuration for the test).

All combinations of tests and partitions succeeded, with one exception: the `run_fcst` task failed when the other tasks were also run on xjet.  See the image below for detailed test results.  It is not clear why `run_fcst` fails in that case, since it succeeds when the other tasks are run on the other partitions (with `run_fcst` still running on xjet).

Note also that the following code snippets need to be appended to the WE2E configuration files in `tests/baseline_configs` for each partition (to customize the experiment configuration to that partition).
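
One way to do the appending is sketched here, under stated assumptions: the snippet for a given partition has been saved to its own file outside the configuration directory, and every regular file in `tests/baseline_configs` is a configuration that should be extended. The function name `append_partition_snippet` and the snippet file name are hypothetical and not part of the repository.

```bash
#!/usr/bin/env bash
# Hypothetical helper: append the contents of a partition-specific
# snippet file to every regular file in a configuration directory.
# The snippet file must live outside that directory.
append_partition_snippet() {
  local snippet_fp="$1" configs_dir="$2"
  local cfg
  for cfg in "${configs_dir}"/*; do
    [ -f "${cfg}" ] || continue   # skip subdirectories and unmatched globs
    printf '\n' >> "${cfg}"       # blank line between old content and snippet
    cat "${snippet_fp}" >> "${cfg}"
  done
}

# Example call (both paths are assumptions):
# append_partition_snippet "./sjet_snippet.sh" "tests/baseline_configs"
```

Since the function modifies the files in place, it is probably safest to try it on a copy of the directory first.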

### For running on sjet:
```
PARTITION_FCST=${PARTITION_FCST:-"xjet"}
PPN_RUN_FCST="24"

PARTITION_DEFAULT=${PARTITION_DEFAULT:-"sjet"}
PPN_MAKE_GRID="16"
PPN_MAKE_OROG="${PPN_MAKE_GRID}"
PPN_MAKE_SFC_CLIMO="${PPN_MAKE_GRID}"
PPN_GET_EXTRN_ICS="1"
PPN_GET_EXTRN_LBCS="1"
PPN_MAKE_ICS=$(( ${PPN_MAKE_GRID}/2 ))
PPN_MAKE_LBCS=$(( ${PPN_MAKE_GRID}/2 ))
PPN_RUN_POST="${PPN_MAKE_GRID}"

# To enable use of "debug" QOS for the run_fcst task, set the wall time
# for this task to 30 minutes (or less).  All the WE2E tests for the
# release should complete within this time on the Jet partitions sjet,
# vjet, xjet, and kjet.
WTIME_RUN_FCST="00:30:00"
```
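
To make the shell mechanics in this snippet concrete, the following sketch evaluates two of its idioms with the sjet values. `${VAR:-default}` supplies the default only when the variable is unset or empty, and `$(( ... ))` performs integer arithmetic. The pre-set value `kjet` for `PARTITION_DEFAULT` is made up purely for the demonstration.

```bash
#!/usr/bin/env bash
# Demonstration only; the pre-set value below is an assumption.
unset PARTITION_FCST
PARTITION_DEFAULT="kjet"    # pretend the caller already chose a partition

PARTITION_FCST=${PARTITION_FCST:-"xjet"}        # unset, so the default applies
PARTITION_DEFAULT=${PARTITION_DEFAULT:-"sjet"}  # already set, so it is kept

PPN_MAKE_GRID="16"
PPN_MAKE_ICS=$(( ${PPN_MAKE_GRID}/2 ))          # integer division: 16/2

echo "PARTITION_FCST=${PARTITION_FCST}"         # prints PARTITION_FCST=xjet
echo "PARTITION_DEFAULT=${PARTITION_DEFAULT}"   # prints PARTITION_DEFAULT=kjet
echo "PPN_MAKE_ICS=${PPN_MAKE_ICS}"             # prints PPN_MAKE_ICS=8
```

In other words, a partition chosen earlier in the configuration is not overwritten by the appended snippet.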

### For running on vjet:
```
PARTITION_FCST=${PARTITION_FCST:-"xjet"}
PPN_RUN_FCST="24"

PARTITION_DEFAULT=${PARTITION_DEFAULT:-"vjet"}
PPN_MAKE_GRID="16"
PPN_MAKE_OROG="${PPN_MAKE_GRID}"
PPN_MAKE_SFC_CLIMO="${PPN_MAKE_GRID}"
PPN_GET_EXTRN_ICS="1"
PPN_GET_EXTRN_LBCS="1"
PPN_MAKE_ICS=$(( ${PPN_MAKE_GRID}/2 ))
PPN_MAKE_LBCS=$(( ${PPN_MAKE_GRID}/2 ))
PPN_RUN_POST="${PPN_MAKE_GRID}"

# To enable use of "debug" QOS for the run_fcst task, set the wall time
# for this task to 30 minutes (or less).  All the WE2E tests for the
# release should complete within this time on the Jet partitions sjet,
# vjet, xjet, and kjet.
WTIME_RUN_FCST="00:30:00"
```

### For running on xjet:
```
PARTITION_FCST=${PARTITION_FCST:-"xjet"}
PPN_RUN_FCST="24"

PARTITION_DEFAULT=${PARTITION_DEFAULT:-"xjet"}
PPN_MAKE_GRID="24"
PPN_MAKE_OROG="${PPN_MAKE_GRID}"
PPN_MAKE_SFC_CLIMO="${PPN_MAKE_GRID}"
PPN_GET_EXTRN_ICS="1"
PPN_GET_EXTRN_LBCS="1"
PPN_MAKE_ICS=$(( ${PPN_MAKE_GRID}/2 ))
PPN_MAKE_LBCS=$(( ${PPN_MAKE_GRID}/2 ))
PPN_RUN_POST="${PPN_MAKE_GRID}"

# To enable use of "debug" QOS for the run_fcst task, set the wall time
# for this task to 30 minutes (or less).  All the WE2E tests for the
# release should complete within this time on the Jet partitions sjet,
# vjet, xjet, and kjet.
WTIME_RUN_FCST="00:30:00"
```

### For running on kjet:
```
PARTITION_FCST=${PARTITION_FCST:-"xjet"}
PPN_RUN_FCST="24"

PARTITION_DEFAULT=${PARTITION_DEFAULT:-"kjet"}
PPN_MAKE_GRID="40"
PPN_MAKE_OROG="${PPN_MAKE_GRID}"
PPN_MAKE_SFC_CLIMO="${PPN_MAKE_GRID}"
PPN_GET_EXTRN_ICS="1"
PPN_GET_EXTRN_LBCS="1"
PPN_MAKE_ICS=$(( ${PPN_MAKE_GRID}/2 ))
PPN_MAKE_LBCS=$(( ${PPN_MAKE_GRID}/2 ))
PPN_RUN_POST="${PPN_MAKE_GRID}"

# The following are needed in order for the make_sfc_climo, make_ics,
# make_lbcs, and run_post (meta)tasks on the 25km grid to not fail
# due to too many MPI processes (which is an ESMF error).
NNODES_MAKE_SFC_CLIMO="1"
NNODES_MAKE_ICS="2"
NNODES_MAKE_LBCS="2"
NNODES_RUN_POST="1"

# To enable use of "debug" QOS for the run_fcst task, set the wall time
# for this task to 30 minutes (or less).  All the WE2E tests for the
# release should complete within this time on the Jet partitions sjet,
# vjet, xjet, and kjet.
WTIME_RUN_FCST="00:30:00"
```
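
The kjet node counts can be checked with a little arithmetic. Assuming the total number of MPI processes for a task is `NNODES_<TASK> * PPN_<TASK>` (the product that the scripts computed as `nprocs` before this change), the sketch below prints that total for each affected task using the kjet values. The helper `total_procs` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: total MPI processes = nodes * processes per node.
total_procs() { echo $(( $1 * $2 )); }

# kjet values from the snippet above.
PPN_MAKE_GRID="40"
PPN_MAKE_SFC_CLIMO="${PPN_MAKE_GRID}"
PPN_MAKE_ICS=$(( ${PPN_MAKE_GRID}/2 ))
PPN_MAKE_LBCS=$(( ${PPN_MAKE_GRID}/2 ))
PPN_RUN_POST="${PPN_MAKE_GRID}"

NNODES_MAKE_SFC_CLIMO="1"
NNODES_MAKE_ICS="2"
NNODES_MAKE_LBCS="2"
NNODES_RUN_POST="1"

echo "make_sfc_climo: $(total_procs "${NNODES_MAKE_SFC_CLIMO}" "${PPN_MAKE_SFC_CLIMO}")"
echo "make_ics:       $(total_procs "${NNODES_MAKE_ICS}" "${PPN_MAKE_ICS}")"
echo "make_lbcs:      $(total_procs "${NNODES_MAKE_LBCS}" "${PPN_MAKE_LBCS}")"
echo "run_post:       $(total_procs "${NNODES_RUN_POST}" "${PPN_RUN_POST}")"
```

Under that assumption, each of the four tasks ends up with 40 processes on kjet.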

![Jet_partition_testing_summary](https://user-images.githubusercontent.com/31046882/133491326-fa4004f2-6f13-4296-a078-53fed6c1e3f5.jpg)
gsketefian authored Sep 15, 2021
1 parent 99a2c47 commit d5aa4d1
Showing 6 changed files with 6 additions and 11 deletions.
3 changes: 1 addition & 2 deletions scripts/exregional_make_ics.sh
@@ -100,8 +100,7 @@ case "$MACHINE" in

 "JET")
 ulimit -s unlimited
-nprocs=$(( NNODES_MAKE_ICS*PPN_MAKE_ICS ))
-APRUN="mpirun -np $nprocs"
+APRUN="srun"
 ;;

 "GAEA")
3 changes: 1 addition & 2 deletions scripts/exregional_make_lbcs.sh
@@ -100,8 +100,7 @@ case "$MACHINE" in

 "JET")
 ulimit -s unlimited
-nprocs=$(( NNODES_MAKE_LBCS*PPN_MAKE_LBCS ))
-APRUN="mpirun -np $nprocs"
+APRUN="srun"
 ;;

 "GAEA")
3 changes: 1 addition & 2 deletions scripts/exregional_make_sfc_climo.sh
@@ -152,8 +152,7 @@ case $MACHINE in
 ;;

 "JET")
-nprocs=$(( NNODES_MAKE_SFC_CLIMO*PPN_MAKE_SFC_CLIMO ))
-APRUN="mpirun -np $nprocs"
+APRUN="srun"
 ;;

 "GAEA")
3 changes: 1 addition & 2 deletions scripts/exregional_run_fcst.sh
@@ -122,8 +122,7 @@ case $MACHINE in
 "JET")
 ulimit -s unlimited
 ulimit -a
-nprocs=$(( NNODES_RUN_FCST*PPN_RUN_FCST ))
-APRUN="mpirun -np $nprocs"
+APRUN="srun"
 OMP_NUM_THREADS=4
 ;;

3 changes: 1 addition & 2 deletions scripts/exregional_run_post.sh
@@ -117,8 +117,7 @@ case $MACHINE in
 ;;

 "JET")
-nprocs=$(( NNODES_RUN_POST*PPN_RUN_POST ))
-APRUN="mpirun -np $nprocs"
+APRUN="srun"
 ;;

 "GAEA")
2 changes: 1 addition & 1 deletion tests/run_experiments.sh
@@ -709,7 +709,7 @@ PTMP=\"${PTMP}\""
 if [ "$MACHINE" = "HERA" ]; then
 extrn_mdl_source_basedir="/scratch2/BMC/det/Gerard.Ketefian/UFS_CAM/staged_extrn_mdl_files"
 elif [ "$MACHINE" = "JET" ]; then
-extrn_mdl_source_basedir="/mnt/lfs1/BMC/fim/Gerard.Ketefian/UFS_CAM/staged_extrn_mdl_files"
+extrn_mdl_source_basedir="/mnt/lfs1/BMC/gsd-fv3/Gerard.Ketefian/UFS_CAM/staged_extrn_mdl_files"
 elif [ "$MACHINE" = "CHEYENNE" ]; then
 extrn_mdl_source_basedir="/glade/p/ral/jntp/UFS_SRW_app/staged_extrn_mdl_files"
 elif [ "$MACHINE" = "ORION" ]; then
