For Jet, change APRUN for various tasks from mpirun to srun since that is much faster; fix path in WE2E run script where the external model files are located. (#599)

## DESCRIPTION OF CHANGES:

1. Update the location of the external model files on Jet in `run_experiments.sh` (this location changed because of an administrative change to a user group name).
2. Change `APRUN` for various tasks from `mpirun` to `srun`, since `srun` is much faster (10 to 50 times faster).

## TESTS CONDUCTED:

The WE2E tests for the release in `tests/testlist.release_public_v1.txt` were run on various Jet partitions. The tests are:

- GST_release_public_v1
- grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
- grid_RRFS_CONUS_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1alpha
- grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
- grid_RRFS_CONUS_13km_ics_HRRR_lbcs_RAP_suite_RRFS_v1alpha
- grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
- grid_RRFS_CONUS_3km_ics_HRRR_lbcs_RAP_suite_RRFS_v1alpha

The tests were run on the sjet, vjet, xjet, and kjet partitions. For each partition, all tasks except `run_fcst` were run on that partition, while `run_fcst` was always run on xjet (because that was the required configuration for the test). All combinations of tests and partitions succeeded except one: when the other tasks were run on xjet, the `run_fcst` task failed. See the image below for detailed test results. It is not clear why `run_fcst` fails in this case, since it succeeds when the other tasks run on any of the other partitions (with `run_fcst` itself still running on xjet).

Note also that, to customize the experiment configuration for each partition, the code snippets below must be appended to the WE2E configuration files in `tests/baseline_configs`.
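The per-partition snippets differ only in `PARTITION_DEFAULT`, `PPN_MAKE_GRID` (16 on sjet and vjet, 24 on xjet, 40 on kjet), and the extra `NNODES_*` settings for kjet. The hypothetical helper below is a minimal sketch, not part of this PR, that prints the settings for one partition and appends them to the baseline configurations. The function names `jet_partition_snippet` and `append_to_baseline_configs`, and the assumption that the files match `config.*.sh`, are illustrative only; the comments from the original snippets are omitted.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the Jet partition-specific WE2E settings
# from this PR for one partition. Values are copied from the snippets;
# the function itself is an illustrative sketch.
jet_partition_snippet() {
  local partition="$1"
  local ppn_make_grid
  case "${partition}" in
    sjet|vjet) ppn_make_grid=16 ;;
    xjet)      ppn_make_grid=24 ;;
    kjet)      ppn_make_grid=40 ;;
    *) echo "unknown Jet partition: ${partition}" >&2; return 1 ;;
  esac

  # Unquoted heredoc: ${partition} and ${ppn_make_grid} are expanded,
  # while the escaped \${...} and \$(( )) stay literal so they are
  # evaluated later, when the experiment config is sourced.
  cat <<EOF
PARTITION_FCST=\${PARTITION_FCST:-"xjet"}
PPN_RUN_FCST="24"
PARTITION_DEFAULT=\${PARTITION_DEFAULT:-"${partition}"}
PPN_MAKE_GRID="${ppn_make_grid}"
PPN_MAKE_OROG="\${PPN_MAKE_GRID}"
PPN_MAKE_SFC_CLIMO="\${PPN_MAKE_GRID}"
PPN_GET_EXTRN_ICS="1"
PPN_GET_EXTRN_LBCS="1"
PPN_MAKE_ICS=\$(( \${PPN_MAKE_GRID}/2 ))
PPN_MAKE_LBCS=\$(( \${PPN_MAKE_GRID}/2 ))
PPN_RUN_POST="\${PPN_MAKE_GRID}"
EOF

  # Only kjet needs the node-count overrides (ESMF "too many MPI
  # processes" error on the 25km grid).
  if [ "${partition}" = "kjet" ]; then
    cat <<'EOF'
NNODES_MAKE_SFC_CLIMO="1"
NNODES_MAKE_ICS="2"
NNODES_MAKE_LBCS="2"
NNODES_RUN_POST="1"
EOF
  fi

  echo 'WTIME_RUN_FCST="00:30:00"'
}

# Append the settings for one partition to every baseline config.
# The config.*.sh file-name pattern is an assumption.
append_to_baseline_configs() {
  local partition="$1" cfg
  for cfg in tests/baseline_configs/config.*.sh; do
    [ -e "${cfg}" ] || continue   # glob matched no files
    jet_partition_snippet "${partition}" >> "${cfg}" || return 1
  done
}
```

For example, running `append_to_baseline_configs kjet` from the directory that contains `tests/` would append the kjet settings to each matching file. Running it twice appends the settings twice, so it is meant for a clean checkout.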
### For running on sjet:

```
PARTITION_FCST=${PARTITION_FCST:-"xjet"}
PPN_RUN_FCST="24"

PARTITION_DEFAULT=${PARTITION_DEFAULT:-"sjet"}
PPN_MAKE_GRID="16"
PPN_MAKE_OROG="${PPN_MAKE_GRID}"
PPN_MAKE_SFC_CLIMO="${PPN_MAKE_GRID}"
PPN_GET_EXTRN_ICS="1"
PPN_GET_EXTRN_LBCS="1"
PPN_MAKE_ICS=$(( ${PPN_MAKE_GRID}/2 ))
PPN_MAKE_LBCS=$(( ${PPN_MAKE_GRID}/2 ))
PPN_RUN_POST="${PPN_MAKE_GRID}"

# To enable use of "debug" QOS for the run_fcst task, set the wall time
# for this task to 30 minutes (or less). All the WE2E tests for the
# release should complete within this time on the Jet partitions sjet,
# vjet, xjet, and kjet.
WTIME_RUN_FCST="00:30:00"
```

### For running on vjet:

```
PARTITION_FCST=${PARTITION_FCST:-"xjet"}
PPN_RUN_FCST="24"

PARTITION_DEFAULT=${PARTITION_DEFAULT:-"vjet"}
PPN_MAKE_GRID="16"
PPN_MAKE_OROG="${PPN_MAKE_GRID}"
PPN_MAKE_SFC_CLIMO="${PPN_MAKE_GRID}"
PPN_GET_EXTRN_ICS="1"
PPN_GET_EXTRN_LBCS="1"
PPN_MAKE_ICS=$(( ${PPN_MAKE_GRID}/2 ))
PPN_MAKE_LBCS=$(( ${PPN_MAKE_GRID}/2 ))
PPN_RUN_POST="${PPN_MAKE_GRID}"

# To enable use of "debug" QOS for the run_fcst task, set the wall time
# for this task to 30 minutes (or less). All the WE2E tests for the
# release should complete within this time on the Jet partitions sjet,
# vjet, xjet, and kjet.
WTIME_RUN_FCST="00:30:00"
```

### For running on xjet:

```
PARTITION_FCST=${PARTITION_FCST:-"xjet"}
PPN_RUN_FCST="24"

PARTITION_DEFAULT=${PARTITION_DEFAULT:-"xjet"}
PPN_MAKE_GRID="24"
PPN_MAKE_OROG="${PPN_MAKE_GRID}"
PPN_MAKE_SFC_CLIMO="${PPN_MAKE_GRID}"
PPN_GET_EXTRN_ICS="1"
PPN_GET_EXTRN_LBCS="1"
PPN_MAKE_ICS=$(( ${PPN_MAKE_GRID}/2 ))
PPN_MAKE_LBCS=$(( ${PPN_MAKE_GRID}/2 ))
PPN_RUN_POST="${PPN_MAKE_GRID}"

# To enable use of "debug" QOS for the run_fcst task, set the wall time
# for this task to 30 minutes (or less). All the WE2E tests for the
# release should complete within this time on the Jet partitions sjet,
# vjet, xjet, and kjet.
WTIME_RUN_FCST="00:30:00"
```

### For running on kjet:

```
PARTITION_FCST=${PARTITION_FCST:-"xjet"}
PPN_RUN_FCST="24"

PARTITION_DEFAULT=${PARTITION_DEFAULT:-"kjet"}
PPN_MAKE_GRID="40"
PPN_MAKE_OROG="${PPN_MAKE_GRID}"
PPN_MAKE_SFC_CLIMO="${PPN_MAKE_GRID}"
PPN_GET_EXTRN_ICS="1"
PPN_GET_EXTRN_LBCS="1"
PPN_MAKE_ICS=$(( ${PPN_MAKE_GRID}/2 ))
PPN_MAKE_LBCS=$(( ${PPN_MAKE_GRID}/2 ))
PPN_RUN_POST="${PPN_MAKE_GRID}"

# The following are needed in order for the make_sfc_climo, make_ics,
# make_lbcs, and run_post (meta)tasks on the 25km grid to not fail
# due to too many MPI processes (which is an ESMF error).
NNODES_MAKE_SFC_CLIMO="1"
NNODES_MAKE_ICS="2"
NNODES_MAKE_LBCS="2"
NNODES_RUN_POST="1"

# To enable use of "debug" QOS for the run_fcst task, set the wall time
# for this task to 30 minutes (or less). All the WE2E tests for the
# release should complete within this time on the Jet partitions sjet,
# vjet, xjet, and kjet.
WTIME_RUN_FCST="00:30:00"
```

![Jet_partition_testing_summary](https://user-images.githubusercontent.com/31046882/133491326-fa4004f2-6f13-4296-a078-53fed6c1e3f5.jpg)
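The sketch below relates the switch to `srun` to the node and per-node process counts set in the snippets. It is a hypothetical illustration, not the workflow's actual task scripts: the variable `MACHINE`, the function names `set_aprun` and `total_mpi_procs`, and the decision to cover only Jet are assumptions. It relies on the standard Slurm behavior that a bare `srun` inside a batch job uses the job's allocation to decide how many processes to start. The second function shows the arithmetic behind the kjet overrides. For example, `make_ics` gets `NNODES_MAKE_ICS=2` nodes with `PPN_MAKE_ICS=40/2=20` processes per node, which gives 40 MPI processes.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: choose the MPI launcher for a task on Jet.
# MACHINE-style values and the function names are illustrative.
set_aprun() {
  case "$1" in
    JET)
      # Inside a Slurm batch job, a bare "srun" launches one process per
      # task in the job's allocation, so no process count is given here.
      APRUN="srun"
      ;;
    *)
      # Other machines are outside the scope of this sketch.
      echo "set_aprun: only JET is covered in this sketch" >&2
      return 1
      ;;
  esac
}

# Total MPI processes for a task: nodes times processes per node (PPN).
total_mpi_procs() {
  local nnodes="$1" ppn="$2"
  echo $(( nnodes * ppn ))
}
```

A task script could then call `set_aprun "${MACHINE}"` and launch its executable as `${APRUN} ./executable`. The executable name here is a placeholder.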