Enable asynchronous jobs (#54)
* GitHub Action: Apply Pep8-formatting

* Small fix

* GitHub Action: Apply Pep8-formatting

* Format duration

* Remove --wait from submit function

* Let geosp be a separate job

* GitHub Action: Apply Pep8-formatting

* Add run_chain argument to process specified chunk

* GitHub Action: Apply Pep8-formatting

* Fix datetime comparison in int2lm

* Submit all jobs as sbatch

* GitHub Action: Apply Pep8-formatting

* Use existing submit() function and only return job_id

* GitHub Action: Apply Pep8-formatting

* Fix prepare_data

* Disable logging for individual jobs

* GitHub Action: Apply Pep8-formatting

* Write casename into logfile

* Fix timers for jobs

* GitHub Action: Apply Pep8-formatting

* Use UTC times

* GitHub Action: Apply Pep8-formatting

* More async jobs and workflows

* GitHub Action: Apply Pep8-formatting

* Add logging to new jobs

* GitHub Action: Apply Pep8-formatting

* Remove unused imports

* Remove global icon-art from prepare_data

* GitHub Action: Apply Pep8-formatting

* Replace os.path with Pathlib

* GitHub Action: Apply Pep8-formatting

* Fix for PosixPaths

* ADD: convenience function for slurm job info

* REF: move get_job_info method to the end of the class definition

* GitHub Action: Apply Pep8-formatting

* Add some docstrings

* Fix quotes for f-strings

* Add imports for art-global job

* Add dependencies for all icon workflows

* Fix pathname

* GitHub Action: Apply Pep8-formatting

* Fix chunk calculation

* Add icontools job

* GitHub Action: Apply Pep8-formatting

* Add icontools to dependencies

* Modify docstrings

* Some code cleanup

* fix(icontools): remove unused packages + fix dependency logic

- The `icontools` job already depends on `prepare_data`, no need to
  add a `copy_id` in the dependencies (which was undefined in this
  scope anyway).

- The name of the first argument of `cfg.submit()` must be 'icontools'
  so that all the sub job ids are associated with the icontools job
  and are taken into account for other jobs depending on it (icon).
  Otherwise, icon will find no dependency in
  `cfg.job_ids['current']['icontools']`.

- `cfg.get_job_info()` should be used later in `run_chain.py` to
  monitor time for all async jobs. Currently the reported time is
  only the submission time.

* Make prepare_art_global dependent on previous icon

* Fix import

* Explicitly set async variable

* Add logfile for icontools

* Set variables for icontools

* Revert "Explicitly set async variable"

This reverts commit 99b96cf.

* Define sequential workflow in icon-seq-test

* Make cosmo-ghg workflow async

* Add info about (a)sync mode

* fix: also launch waiting job when some workflow jobs failed

* add: other keys to default job info dict

* fix: empty current job ids before each chunk

* GitHub Action: Apply Pep8-formatting

* Split prepare_data into cosmo and icon

* Remove unused imports

* GitHub Action: Apply Pep8-formatting

* Rename prepare_data job in workflows

* Rename prepare_data in icon-seq case

* Set additional cfg variables in prepare_icon

* GitHub Action: Apply Pep8-formatting

* Time logging for all jobs

* Merge geosp into prepare_art

* GitHub Action: Apply Pep8-formatting

* Include oem in icon-art-oem workflow

* Fix icon job

* Move geosp to icontools job

* GitHub Action: Apply Pep8-formatting

* add: placeholder for monitoring async jobs

* Run geosp after icontools

* Remove oem job from icon-art-oem case

* Change to current logfile within jobs

* GitHub Action: Apply Pep8-formatting

* Add missing log inits

* Configure root logger

* set logger

* Fix logger

* Format logging output

* Introduce BASIC_PYTHON_JOB option to call jobs directly in async mode

* GitHub Action: Apply Pep8-formatting

* add: only submit basic python jobs through a nested run_chain

* GitHub Action: Apply Pep8-formatting

* fix: loop over jobs in run_chunk

* ref: job_id becomes chunk_id

reflects reality and avoids confusion with actual jobs id

* GitHub Action: Apply Pep8-formatting

* fix: leftover `job_id` -> `chunk_id`

* GitHub Action: Apply Pep8-formatting

* add(untested): Slurm monitoring

* GitHub Action: Apply Pep8-formatting

* Add BASIC_PYTHON_JOB to missing jobs

* Fix function arguments

* Small fix and hint to KeyError

* GitHub Action: Apply Pep8-formatting

* ref(slurm summary): streamline code a bit

* GitHub Action: Apply Pep8-formatting

* Some settings for cosmo-ghg case

* afterany -> afterok for wait job

* Comment some function calls

* add:ref: remove unused `info_requests` + print failing jobs

* fix: job summary for previous chunk, not current

* fix: reactivate empty current job ids at beginning of chunk

* fix: only wait and monitor if not very first chunk

* GitHub Action: Apply Pep8-formatting

* ref: move icon-art error handling in slurm job itself

* fix: remove BASIC_PYTHON_JOB workaround for icon

* fix(icon.py): escape curly brackets for string formatting

* fix(icon): escape curly brackets in python way

* Add walltime and remove conda activation

* GitHub Action: Apply Pep8-formatting

* Set walltime correctly

* Fix call to handle_error function

* Set walltimes for icon workflows

* Debug

* Remove job_ids override

* ref(config): clean up

* Add smaller walltime to wait job

* Remove prints

* Some fixes for cosmo case

* Remove model check

* Don't pass logfile anymore

* GitHub Action: Apply Pep8-formatting

* Submit int2lm and cosmo jobs correctly

* GitHub Action: Apply Pep8-formatting

* Small fixes for int2lm and cosmo

* Store job scripts in separate directory

* Add BASIC_PYTHON_JOB to int2lm

* GitHub Action: Apply Pep8-formatting

* Add post_cosmo dependency

* Bugfixes for cosmo jobs

* Further fixes

* Make post_cosmo a submit job

* Fix post_cosmo

* Fix post_cosmo

* ref: refactor cycling and monitoring

- Regroup waiting, monitoring and cycling in a single `Config.cycle()`
  method.
- write chunk monitoring info into chain log file

* GitHub Action: Apply Pep8-formatting

* Fix config variables

* Fix log file output

* Formatting

* Remove time logging from jobs

* GitHub Action: Apply Pep8-formatting

* Fix for icon job

* GitHub Action: Apply Pep8-formatting

* ref: exception handling in Config.submit()

* Change table cell widths

* Remove old way of logging

* Cleanup

* NNodes -> N

* GitHub Action: Apply Pep8-formatting

* Fix for N/NNodes

* GitHub Action: Apply Pep8-formatting

* Yet another fix for N/NNodes

* Simplify job names

* Change missing job names

* Fix syntax error from commit 8810dcf

* GitHub Action: Apply Pep8-formatting

* Just jobname for jobs

* Cleaner console output

* GitHub Action: Apply Pep8-formatting

* Unify restart and spinup runs

* GitHub Action: Apply Pep8-formatting

* Complete config file for spinup test

* Fix for spinup

* GitHub Action: Apply Pep8-formatting

* Fix 2 for spinup

* Fancy formatting

* Define separate spinup workflow

* Fix formatting (hopefully)

* Aligning case + workflow

* Custom workflow_name and improve check

* Remove workflow_name check in jobs

* GitHub Action: Apply Pep8-formatting

* Bugfix in cosmo job

* Remove restart info

* Compute chunks separately

* GitHub Action: Apply Pep8-formatting

* Add function to get previous chunk ID

* GitHub Action: Apply Pep8-formatting

* Directly get previous chunk id

* Further cleanup and refactoring

* GitHub Action: Apply Pep8-formatting

* Fix for chunk_id_prev

* Fix

* Save total chunk list

* Fix restart variables

* Don't print chunk list twice

* Fix for cosmo_restart_out

* Remove sequential part

* GitHub Action: Apply Pep8-formatting

* Remove sequential case

* Add dependencies to all workflows

* Remove is_async config variable

* Incorporate review

* GitHub Action: Apply Pep8-formatting

* Fix for basename

* Add --wait again for seq. jobs in nested run_chain

* Fixes for spinup

* GitHub Action: Apply Pep8-formatting

* ref: remove unnecessary if levels

* del: remove unused imports

* ref(basic python jobs): merge generation of script and submission

The create_sbatch_script method is used nowhere else => No need to
separate generation of job script from its submission

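The dependency notes in the message above (sub-job ids collected under `cfg.job_ids['current']['icontools']`, and `afterany -> afterok` for the wait job) can be illustrated with a minimal sketch. The `Config` class itself is not shown on this page, so the `job_ids` dictionary layout and the helper names `register` and `dependency_flag` below are assumptions made for illustration only; the `--dependency=afterok:<id>[:<id>...]` option is standard `sbatch` syntax.

```python
from collections import defaultdict

# Hypothetical stand-in for the bookkeeping described in the commit message;
# the real Config class may store this differently.
job_ids = {"previous": {}, "current": defaultdict(list)}


def register(job_name, slurm_id):
    """Record a submitted Slurm job id under the workflow job it belongs to."""
    job_ids["current"][job_name].append(slurm_id)


def dependency_flag(dep_names, kind="afterok"):
    """Build an sbatch --dependency option from the ids of the named jobs."""
    ids = [
        str(slurm_id)
        for name in dep_names
        for slurm_id in job_ids["current"].get(name, [])
    ]
    if not ids:
        return ""  # nothing to wait for
    return f"--dependency={kind}:" + ":".join(ids)


# Three icontools sub-jobs, all filed under the same 'icontools' key so that
# a later job (e.g. icon) sees every one of them as a dependency.
for slurm_id in (101, 102, 103):
    register("icontools", slurm_id)

flag = dependency_flag(["icontools"])
print(flag)  # --dependency=afterok:101:102:103
```

If the sub-jobs were registered under a different key, `dependency_flag(["icontools"])` would return an empty string, which mirrors the failure mode the message describes for the `icon` job.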
---------

Co-authored-by: github-actions <[email protected]>
Co-authored-by: Matthieu <[email protected]>
3 people authored Feb 7, 2024
1 parent 45d6c01 commit 008bf17
Showing 51 changed files with 2,062 additions and 1,680 deletions.
4 changes: 2 additions & 2 deletions .gitignore
@@ -7,6 +7,6 @@ input_processing-chain.tgz
 input/
 output/
 work/
-src/*/
+ext/*/
 *.code-workspace
-.vscode/
+.vscode/
4 changes: 3 additions & 1 deletion cases/cosmo-ghg-spinup-test/config.yaml
@@ -1,7 +1,9 @@
# Configuration file for the 'cosmo-ghg-spinup-test' case with COSMO-GHG

model: cosmo-ghg
workflow: cosmo-ghg-spinup
constraint: gpu
run_on: gpu
compute_queue: normal
ntasks_per_node: 12
restart_step: PT6H
spinup: 3
4 changes: 2 additions & 2 deletions cases/cosmo-ghg-spinup-test/cosmo_runjob.cfg
@@ -1,5 +1,5 @@
 #!/bin/bash -l
-#SBATCH --job-name="cosmo_{cfg.startdate_sim_yyyymmddhh}_{cfg.forecasttime}"
+#SBATCH --job-name=cosmo
 #SBATCH --account={cfg.compute_account}
 #SBATCH --time={walltime}
 #SBATCH --nodes={np_tot}
@@ -34,7 +34,7 @@ echo "============== StartTime: `date +%s` s"
 echo "============== StartTime: `date`"
 echo "====================================================="

-srun -u ./{execname} >> {logfile} 2>&1
+srun -u ./{cfg.cosmo_execname} >> {logfile} 2>&1
 pid=$?

 echo "====================================================="
2 changes: 1 addition & 1 deletion cases/cosmo-ghg-spinup-test/int2lm_runjob.cfg
@@ -1,5 +1,5 @@
 #!/bin/bash -l
-#SBATCH --job-name=int2lm_{cfg.startdate_sim_yyyymmddhh}_{cfg.enddate_sim_yyyymmddhh}
+#SBATCH --job-name=int2lm
 #SBATCH --account={cfg.compute_account}
 #SBATCH --time={walltime}
 #SBATCH --nodes={nodes}
4 changes: 3 additions & 1 deletion cases/cosmo-ghg-test/config.yaml
@@ -1,7 +1,9 @@
# Configuration file for the 'cosmo-ghg-test' case with COSMO-GHG

model: cosmo-ghg
workflow: cosmo-ghg
constraint: gpu
run_on: gpu
compute_queue: normal
ntasks_per_node: 12
restart_step: PT6H
startdate: 2015-01-01T00:00:00Z
4 changes: 2 additions & 2 deletions cases/cosmo-ghg-test/cosmo_runjob.cfg
@@ -1,5 +1,5 @@
 #!/bin/bash -l
-#SBATCH --job-name="cosmo_{cfg.startdate_sim_yyyymmddhh}_{cfg.forecasttime}"
+#SBATCH --job-name=cosmo
 #SBATCH --account={cfg.compute_account}
 #SBATCH --time={walltime}
 #SBATCH --nodes={np_tot}
@@ -34,7 +34,7 @@ echo "============== StartTime: `date +%s` s"
 echo "============== StartTime: `date`"
 echo "====================================================="

-srun -u ./{execname} >> {logfile} 2>&1
+srun -u ./{cfg.cosmo_execname} >> {logfile} 2>&1
 pid=$?

 echo "====================================================="
2 changes: 1 addition & 1 deletion cases/cosmo-ghg-test/int2lm_runjob.cfg
@@ -1,5 +1,5 @@
 #!/bin/bash -l
-#SBATCH --job-name=int2lm_{cfg.startdate_sim_yyyymmddhh}_{cfg.enddate_sim_yyyymmddhh}
+#SBATCH --job-name=int2lm
 #SBATCH --account={cfg.compute_account}
 #SBATCH --time={walltime}
 #SBATCH --nodes={nodes}
8 changes: 6 additions & 2 deletions cases/icon-art-global-test/config.yaml
@@ -1,6 +1,6 @@
 # Configuration file for the 'icon-art-global-test' case with ICON

-model: icon-art-global
+workflow: icon-art-global
 constraint: gpu
 run_on: cpu
 compute_queue: normal
@@ -28,6 +28,11 @@ species_global_nudging: False
 species2nudge: []
 nudging_step: 6

+walltime:
+    prepare_icon: '00:15:00'
+    prepare_art_global: '00:10:00'
+    icon: '00:05:00'
+
 era5:
     inicond: False
     global_nudging: False
@@ -67,7 +72,6 @@ icon:
     species_nudgingjob: icon_species_nudging.sh
     output_writing_step: 6
     compute_queue: normal
-    walltime: '00:10:00'
     np_tot: 4
     np_io: 1
     np_restart: 1
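The config change above moves the per-job walltime into a top-level `walltime:` mapping, while the run-job templates in this commit refer to `{cfg.walltime_icon}` and `{cfg.icon_np_tot}`. One way such names could arise is by flattening one level of nesting into `<section>_<key>` attributes. The sketch below illustrates that idea only; `flatten_config` is a hypothetical helper, not the project's loader, and the dictionary stands in for the parsed YAML.

```python
def flatten_config(raw):
    """Flatten one level: {'walltime': {'icon': x}} -> {'walltime_icon': x}."""
    flat = {}
    for key, value in raw.items():
        if isinstance(value, dict):
            # Nested section: prefix every entry with the section name
            for subkey, subvalue in value.items():
                flat[f"{key}_{subkey}"] = subvalue
        else:
            flat[key] = value
    return flat


# Values copied from the icon-art-global-test config shown above
raw = {
    "workflow": "icon-art-global",
    "walltime": {
        "prepare_icon": "00:15:00",
        "prepare_art_global": "00:10:00",
        "icon": "00:05:00",
    },
    "icon": {"np_tot": 4},
}

flat = flatten_config(raw)
print(flat["walltime_icon"], flat["icon_np_tot"])  # 00:05:00 4
```

Under this assumed scheme the old `icon: walltime:` entry would have produced `icon_walltime`, which matches the name the templates used before this commit.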
34 changes: 12 additions & 22 deletions cases/icon-art-global-test/icon_runjob.cfg
@@ -1,13 +1,13 @@
 #!/usr/bin/env bash
-#SBATCH --job-name="{cfg.casename}_{cfg.startdate_sim_yyyymmddhh}_{cfg.forecasttime}"
+#SBATCH --job-name=icon
 #SBATCH --account={cfg.compute_account}
-#SBATCH --time={cfg.icon_walltime}
+#SBATCH --time={cfg.walltime_icon}
 #SBATCH --nodes={cfg.icon_np_tot}
 #SBATCH --ntasks-per-node={cfg.ntasks_per_node}
 #SBATCH --partition={cfg.compute_queue}
 #SBATCH --constraint={cfg.constraint}
 #SBATCH --hint=nomultithread
-#SBATCH --output={logfile}
+#SBATCH --output={cfg.logfile}
 #SBATCH --open-mode=append
 #SBATCH --chdir={cfg.icon_work}

@@ -388,22 +388,12 @@ EOF
 # ----------------------------------------------------------------------
 # run the model!
 # ----------------------------------------------------------------------
-srun ./icon.exe
-
-
-
-# ! output_nml: specifies an output stream --------------------------------------
-# &output_nml
-# filetype = 4 ! output format: 2=GRIB2, 4=NETCDFv2
-# dom = -1 ! write all domains
-# output_bounds = 0., 2678400., 3600. ! start, end, increment
-# steps_per_file = 1 ! number of steps per file
-# mode = 1 ! 1: forecast mode (relative t-axis), 2: climate mode (absolute t-axis)
-# include_last = .TRUE.
-# output_filename = 'ICON-ART'
-# filename_format = '{cfg.icon_output}/<output_filename>_latlon_<datetime2>' ! file name base
-# remap = 1 ! 1: remap to lat-lon grid
-# reg_lon_def = -179.,2,179
-# reg_lat_def = 90.,-1,-90.
-# ml_varlist = 'z_ifc','z_mc','pres','pres_sfc','qc','rh','rho','temp','u','v','w','group:ART_CHEMISTRY',
-# /
+handle_error(){{
+    # Check for invalid pointer error at the end of icon-art
+    if grep -q "free(): invalid pointer" {cfg.logfile} && grep -q "clean-up finished" {cfg.logfile}; then
+        exit 0
+    else
+        exit 1
+    fi
+}}
+srun ./{cfg.icon_execname} || handle_error
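The doubled braces around the `handle_error` body tie in with the "escape curly brackets in python way" entry in the commit message: if the template is rendered with Python's `str.format`, a literal `{` or `}` has to be written as `{{` or `}}`, whereas `{cfg.logfile}` is substituted. The sketch below demonstrates that behaviour, assuming the template is rendered with a single `cfg` object; the `SimpleNamespace` stand-in and its values are hypothetical.

```python
from types import SimpleNamespace

# Hypothetical minimal config object with only the attributes used below
cfg = SimpleNamespace(logfile="icon.log", icon_execname="icon.exe")

# Same shape as the run-job template: {{ and }} become literal braces,
# while {cfg.<name>} fields are filled in by str.format.
template = """handle_error(){{
    # Check for invalid pointer error at the end of icon-art
    if grep -q "free(): invalid pointer" {cfg.logfile} && grep -q "clean-up finished" {cfg.logfile}; then
        exit 0
    else
        exit 1
    fi
}}
srun ./{cfg.icon_execname} || handle_error
"""

script = template.format(cfg=cfg)
print(script)
```

With single braces around the function body, `str.format` would try to parse the body as a replacement field and raise an error instead of producing a script.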
9 changes: 8 additions & 1 deletion cases/icon-art-oem-test/config.yaml
@@ -1,6 +1,6 @@
 # Configuration file for the 'icon-art-oem-test' case with ICON

-model: icon-art-oem
+workflow: icon-art-oem
 constraint: gpu
 run_on: cpu
 compute_queue: normal
@@ -21,6 +21,13 @@ filename_format: <output_filename>_DOM<physdom>_<ddhhmmss>
 lateral_boundary_grid_order: lateral_boundary
 art_input_folder: ./input/icon-art-oem/ART

+walltime:
+    prepare_icon: '00:10:00'
+    icontools: '00:30:00'
+    prepare_art: '00:10:00'
+    prepare_art_oem: '00:10:00'
+    icon: '00:30:00'
+
 meteo:
     dir: ./input/meteo
     prefix: ifs_
16 changes: 12 additions & 4 deletions cases/icon-art-oem-test/icon_runjob.cfg
@@ -1,13 +1,13 @@
 #!/usr/bin/env bash
-#SBATCH --job-name="{cfg.casename}_{cfg.startdate_sim_yyyymmddhh}_{cfg.forecasttime}"
+#SBATCH --job-name=icon
 #SBATCH --account={cfg.compute_account}
-#SBATCH --time={cfg.icon_walltime}
+#SBATCH --time={cfg.walltime_icon}
 #SBATCH --nodes={cfg.icon_np_tot}
 #SBATCH --ntasks-per-node={cfg.ntasks_per_node}
 #SBATCH --partition={cfg.compute_queue}
 #SBATCH --constraint={cfg.constraint}
 #SBATCH --hint=nomultithread
-#SBATCH --output={logfile}
+#SBATCH --output={cfg.logfile}
 #SBATCH --open-mode=append
 #SBATCH --chdir={cfg.icon_work}

@@ -368,4 +368,12 @@ EOF
 # ----------------------------------------------------------------------
 # run the model!
 # ----------------------------------------------------------------------
-srun ./icon.exe
+handle_error(){{
+    # Check for invalid pointer error at the end of icon-art
+    if grep -q "free(): invalid pointer" {cfg.logfile} && grep -q "clean-up finished" {cfg.logfile}; then
+        exit 0
+    else
+        exit 1
+    fi
+}}
+srun ./{cfg.icon_execname} || handle_error
2 changes: 1 addition & 1 deletion cases/icon-art-oem-test/icontools_remap_00_lbc_runjob.cfg
@@ -1,5 +1,5 @@
 #!/usr/bin/env bash
-#SBATCH --job-name="iconsub_{cfg.startdate_sim_yyyymmddhh}"
+#SBATCH --job-name=iconsub
 #SBATCH --account={cfg.compute_account}
 #SBATCH --chdir={cfg.icon_work}
 #SBATCH --partition={cfg.compute_queue}
2 changes: 1 addition & 1 deletion cases/icon-art-oem-test/icontools_remap_ic_chem_runjob.cfg
@@ -1,5 +1,5 @@
 #!/usr/bin/env bash
-#SBATCH --job-name="{cfg.casename}_{cfg.startdate_sim_yyyymmddhh}_{cfg.forecasttime}"
+#SBATCH --job-name=iconremap_ic_chem
 #SBATCH --account={cfg.compute_account}
 #SBATCH --chdir={cfg.icon_work}
 #SBATCH --partition={cfg.compute_queue}
2 changes: 1 addition & 1 deletion cases/icon-art-oem-test/icontools_remap_ic_runjob.cfg
@@ -1,5 +1,5 @@
 #!/usr/bin/env bash
-#SBATCH --job-name="iconremap_{cfg.startdate_sim_yyyymmddhh}"
+#SBATCH --job-name=iconremap_ic
 #SBATCH --account={cfg.compute_account}
 #SBATCH --chdir={cfg.icon_work}
 #SBATCH --partition={cfg.compute_queue}
@@ -1,5 +1,5 @@
 #!/usr/bin/env bash
-#SBATCH --job-name="{cfg.casename}_{cfg.startdate_sim_yyyymmddhh}_{cfg.forecasttime}"
+#SBATCH --job-name=iconremap_lbc
 #SBATCH --account={cfg.compute_account}
 #SBATCH --chdir={cfg.icon_work}
 #SBATCH --partition={cfg.compute_queue}
@@ -1,5 +1,5 @@
 #!/usr/bin/env bash
-#SBATCH --job-name="iconremap_lbc_{cfg.startdate_sim_yyyymmddhh}"
+#SBATCH --job-name=iconremap_lbc
 #SBATCH --account={cfg.compute_account}
 #SBATCH --chdir={cfg.icon_work}
 #SBATCH --partition={cfg.compute_queue}
10 changes: 7 additions & 3 deletions cases/icon-test/config.yaml
@@ -1,6 +1,6 @@
-# Configuration file for the 'icon-test' case with ICON
+# Configuration file for the 'icon-async-test' case with ICON

-model: icon
+workflow: icon
 constraint: gpu
 run_on: cpu
 compute_queue: normal
@@ -18,6 +18,11 @@ output_filename: NWP_LAM
 filename_format: <output_filename>_DOM<physdom>_<ddhhmmss>
 lateral_boundary_grid_order: lateral_boundary

+walltime:
+    prepare_icon: '00:10:00'
+    icontools: '00:30:00'
+    icon: '00:30:00'
+
 meteo:
     dir: ./input/meteo
     prefix: ifs_
@@ -44,7 +49,6 @@ icon:
     binary_file: ./ext/icon/bin/icon
     runjob_filename: icon_runjob.cfg
     compute_queue: normal
-    walltime: '00:10:00'
     np_tot: 8
     np_io: 1
     np_restart: 1
8 changes: 4 additions & 4 deletions cases/icon-test/icon_runjob.cfg
@@ -1,13 +1,13 @@
 #!/usr/bin/env bash
-#SBATCH --job-name="{cfg.casename}_{cfg.startdate_sim_yyyymmddhh}_{cfg.enddate_sim_yyyymmddhh}"
+#SBATCH --job-name=icon
 #SBATCH --account={cfg.compute_account}
-#SBATCH --time={cfg.icon_walltime}
+#SBATCH --time={cfg.walltime_icon}
 #SBATCH --nodes={cfg.icon_np_tot}
 #SBATCH --ntasks-per-node={cfg.ntasks_per_node}
 #SBATCH --partition={cfg.compute_queue}
 #SBATCH --constraint={cfg.constraint}
 #SBATCH --hint=nomultithread
-#SBATCH --output={logfile}
+#SBATCH --output={cfg.logfile}
 #SBATCH --open-mode=append
 #SBATCH --chdir={cfg.icon_work}

@@ -342,4 +342,4 @@ EOF
 # ----------------------------------------------------------------------
 # run the model!
 # ----------------------------------------------------------------------
-srun ./icon.exe
+srun ./{cfg.icon_execname} || handle_error
2 changes: 1 addition & 1 deletion cases/icon-test/icontools_remap_00_lbc_runjob.cfg
@@ -1,5 +1,5 @@
 #!/usr/bin/env bash
-#SBATCH --job-name="iconsub_{cfg.startdate_sim_yyyymmddhh}"
+#SBATCH --job-name=iconsub
 #SBATCH --account={cfg.compute_account}
 #SBATCH --chdir={cfg.icon_work}
 #SBATCH --partition={cfg.compute_queue}
2 changes: 1 addition & 1 deletion cases/icon-test/icontools_remap_ic_runjob.cfg
@@ -1,5 +1,5 @@
 #!/usr/bin/env bash
-#SBATCH --job-name="iconremap_{cfg.startdate_sim_yyyymmddhh}"
+#SBATCH --job-name=iconremap_ic
 #SBATCH --account={cfg.compute_account}
 #SBATCH --chdir={cfg.icon_work}
 #SBATCH --partition={cfg.compute_queue}
2 changes: 1 addition & 1 deletion cases/icon-test/icontools_remap_lbc_rest_runjob.cfg
@@ -1,5 +1,5 @@
 #!/usr/bin/env bash
-#SBATCH --job-name="iconremap_lbc_{cfg.startdate_sim_yyyymmddhh}"
+#SBATCH --job-name=iconremap_lbc
 #SBATCH --account={cfg.compute_account}
 #SBATCH --chdir={cfg.icon_work}
 #SBATCH --partition={cfg.compute_queue}