
clean up stable_driver.sh #1434

Merged: 5 commits into develop from feature/improve_stable_driver on Jan 10, 2025

Conversation

@RussTreadon-NOAA (Contributor) commented Jan 8, 2025

Description

This PR modifies ci/stable_driver.sh to allow return code checking on individual command lines.
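
For context, a minimal sketch of the per-command pattern this enables (illustrative only; the command and messages are examples, not the actual stable_driver.sh code):

    # check the return code of each command individually instead of
    # relying on the exit status of the whole script
    git submodule update --init --recursive
    rc=$?
    if (( rc != 0 )); then
        echo "git submodule update failed with return code ${rc}"
        exit "${rc}"
    fi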

Companion PRs

none

Issues

Resolves #1433

Automated CI tests to run in Global Workflow

  • atm_jjob
  • C96C48_ufs_hybatmDA
  • C96C48_hybatmaerosnowDA
  • C48mx500_3DVarAOWCDA
  • C48mx500_hybAOWCDA
  • C96C48_hybatmDA

@RussTreadon-NOAA self-assigned this Jan 8, 2025
@RussTreadon-NOAA (Contributor, Author)

This PR is in draft mode while I ask some questions.

@DavidNew-NOAA, @CoryMartin-NOAA, @guillaumevernieres, and @danholdaway:
stable_driver.sh emails the addresses on the PEOPLE line in the script. Currently this line includes David, Cory, Guillaume, and me. For testing I trimmed PEOPLE down to myself. Who do we want on the PEOPLE line moving forward? Just David and me? Who else?

@DavidNew-NOAA:
I commented out the git stash and git stash pop lines. stable_driver.sh has been running OK the past two weeks without these lines. Given this, I'm inclined to completely remove them and the scripting around them. Is there a reason we want to retain the git stash and git stash pop lines?

@DavidNew-NOAA (Collaborator)

@RussTreadon-NOAA Feel free to delete those lines. I don't see a reason for them given that no local changes are being made other than updating the submodules.

@CoryMartin-NOAA (Contributor)

I'm fine with still being included on the email list. What's one more email from the dozens I get daily?

@RussTreadon-NOAA changed the title from "improved error checking for stable_driver (#1433)" to "clean up stable_driver.sh" on Jan 8, 2025
@RussTreadon-NOAA (Contributor, Author)

Thank you @CoryMartin-NOAA and @DavidNew-NOAA for your replies. Changes committed at 1365231.

@RussTreadon-NOAA (Contributor, Author) commented Jan 8, 2025

I'll let the changes in this PR run in the daily automated feature/stable-nightly check on Hera.

Heads up @CoryMartin-NOAA and @DavidNew-NOAA: you will receive an email from Darth Vader tonight.

If the stable-nightly check runs OK tonight, I'll mark this PR as Ready for review.

@RussTreadon-NOAA (Contributor, Author)

The 20250109 run of stable-nightly completed successfully using the updated ci/stable_driver.sh. Branch feature/stable-nightly was updated with the most recent JEDI hashes. A success email was sent to @CoryMartin-NOAA, @DavidNew-NOAA, and @RussTreadon-NOAA.

This PR is ready for review.

@DavidNew-NOAA previously approved these changes Jan 10, 2025
@RussTreadon-NOAA (Contributor, Author)

Thank you @CoryMartin-NOAA and @DavidNew-NOAA for your approvals.

As noted in PR #1435, we cannot use role-da on Orion or Hercules to run automated CI because role-da does not belong to the stmp group.

MSU ticket #2025010854000354 was submitted requesting that role-da be added to stmp. I do not know when this request will be acted upon.

Therefore, I am currently testing logic in a working copy of ci/run_ci.sh to change the STMP and PTMP paths on the fly in $HOMEgfs/workflow/hosts/${TARGET}.yaml.

if [[ "${TARGET}" = "orion" || "${TARGET}" = "hercules" ]]; then
    sed -i "s|/noaa/stmp|/noaa/da|g" "${workflow_dir}/workflow/hosts/${TARGET}.yaml"
fi
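
A quick way to spot-check the substitution (a hypothetical verification step, not part of run_ci.sh) is:

    grep -E "STMP|PTMP" "${workflow_dir}/workflow/hosts/${TARGET}.yaml"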

If tests prove successful, I would like to add the modified ci/run_ci.sh to this PR. Doing so will enable role-da to run GDASApp CI on Orion and Hercules.

What do you think?

@CoryMartin-NOAA (Contributor)

This seems like a good enough temporary solution; it is probably preferable to waiting on MSU, especially given how big our da disk allocation is on those machines. But we should open an issue noting that this needs to be fixed in the future.

@RussTreadon-NOAA (Contributor, Author)

Good point. I'll open an issue to note this patch. We want to remove the patch once role-da is added to the stmp group.

@RussTreadon-NOAA (Contributor, Author)

The test of the modified ci/run_ci.sh in PR #1435 returned success on Orion. I will now test the updated ci/run_ci.sh on Hercules using this PR.

@RussTreadon-NOAA added the hercules-GW-RT (Queue for automated testing with global-workflow on Hercules) label on Jan 10, 2025
@emcbot added the hercules-GW-RT-Running (Automated testing with global-workflow running on Hercules) label and removed the hercules-GW-RT label on Jan 10, 2025
@RussTreadon-NOAA (Contributor, Author)

@CoryMartin-NOAA and @DavidNew-NOAA: Testing of the modified run_ci.sh on Orion in PR #1435 was successful.

I am now testing the modified run_ci.sh on Hercules using this PR. I'd like to merge this PR into develop as soon as the Hercules tests complete. Unfortunately, Hercules throughput is slower than Orion's, so all tests may not complete until later tonight.

You already approved the changes in this PR once. Would you mind doing so again? This will allow me to merge this PR into develop as soon as the Hercules test passes.

@RussTreadon-NOAA (Contributor, Author)

Thank you @CoryMartin-NOAA and @DavidNew-NOAA for the approvals. The Hercules test has completed 69 of 134 tests; 49 jobs are pending in the queue. Enough tests have run to confirm that the revised run_ci.sh is working on Hercules.

@RussTreadon-NOAA merged commit 15113ad into develop on Jan 10, 2025
9 checks passed
@RussTreadon-NOAA deleted the feature/improve_stable_driver branch on January 10, 2025 at 18:53
@emcbot commented Jan 10, 2025

Automated GW-GDASApp Testing Results:
Machine: hercules

Start: Fri Jan 10 16:56:49 UTC 2025 on hercules-login-4.hpc.msstate.edu
---------------------------------------------------
Build:                                 *SUCCESS*
Build: Completed at Fri Jan 10 17:31:54 UTC 2025
---------------------------------------------------
Tests: ctest -j12 -R gdasapp
Tests:                                  *Failed*
Tests: Failed at Fri Jan 10 22:57:57 UTC 2025
Tests: 93% tests passed, 9 tests failed out of 134
	2025 - test_gdasapp_C96C48_hybatmaerosnowDA_gdas_sfcanl_202112201800 (Failed)
	2026 - test_gdasapp_C96C48_hybatmaerosnowDA_gdas_analcalc_202112201800 (Failed)
	2027 - test_gdasapp_C96C48_hybatmaerosnowDA_gdas_fcst_202112201800 (Failed)
	2031 - test_gdasapp_C96C48_hybatmaerosnowDA_enkfgdas_ecmn_202112201800 (Failed)
	2033 - test_gdasapp_C96C48_hybatmaerosnowDA_enkfgdas_esfc_202112201800 (Failed)
	2034 - test_gdasapp_C96C48_hybatmaerosnowDA_enkfgdas_fcst_202112201800 (Failed)
	2050 - test_gdasapp_C48mx500_hybAOWCDA_gdas_marineanlletkf_202103250000 (Failed)
	1975 - test_gdasapp_C96C48_hybatmDA_gdas_anal_202112210000 (Timeout)
	2020 - test_gdasapp_C96C48_hybatmaerosnowDA_gdas_anal_202112201800 (Timeout)
Tests: see output at /work2/noaa/da/role-da/CI/hercules/GDASApp/workflow/PR/1434/global-workflow/sorc/gdas.cd/build/log.ctest

@emcbot added the hercules-GW-RT-Failed (Automated testing with global-workflow failed on Hercules) label and removed the hercules-GW-RT-Running label on Jan 10, 2025
@RussTreadon-NOAA (Contributor, Author)

Hercules CI failures

run_ci.sh runs ctest with --timeout 7200. Queue wait times on Hercules are extremely long, and two jobs waited in the queue longer than the specified 2-hour (7200-second) limit:

  • test_gdasapp_C96C48_hybatmDA_gdas_anal_202112210000
  • test_gdasapp_C96C48_hybatmaerosnowDA_gdas_anal_202112201800

The hybatmaerosnowDA timeout resulted in the failure of downstream jobs.

After test_gdasapp_C96C48_hybatmaerosnowDA_gdas_anal_202112201800 timed out, the subsequent jobs in the hybatmaerosnowDA suite were submitted sequentially. Jobs that depend on output from gdas_anal failed because the expected output was not present, and jobs downstream of those failed jobs failed in turn. The gdas_anal job eventually ran and passed, but only after all the downstream jobs had failed, as seen in a time-sorted listing of the hybatmaerosnowDA log files for 20211220 18Z:

hercules-login-2:/work2/noaa/da/role-da/CI/hercules/GDASApp/workflow/PR/1434/global-workflow/sorc/gdas.cd/build/gdas/test/gw-ci/C96C48_hybatmaerosnowDA/COMROOT/C96C48_hybatmaerosnowDA/logs/2021122018$ ls -lt
total 26388
-rw-r----- 1 role-da da   539204 Jan 11 00:17 gdas_anal.log
-rw-r----- 1 role-da da   183899 Jan 10 22:18 enkfgdas_fcst_mem002.log
-rw-r----- 1 role-da da   184030 Jan 10 22:13 enkfgdas_fcst_mem001.log
-rw-r----- 1 role-da da   203173 Jan 10 21:43 enkfgdas_ecen002.log
-rw-r----- 1 role-da da   160357 Jan 10 21:42 enkfgdas_esfc.log
-rw-r----- 1 role-da da   203312 Jan 10 21:38 enkfgdas_ecen000.log
-rw-r----- 1 role-da da   200500 Jan 10 21:38 enkfgdas_ecen001.log
-rw-r----- 1 role-da da   170516 Jan 10 21:33 gdas_fcst_seg0.log
-rw-r----- 1 role-da da    89756 Jan 10 21:33 gdas_analcalc.log
-rw-r----- 1 role-da da    60678 Jan 10 21:27 gdas_sfcanl.log
-rw-r----- 1 role-da da   393664 Jan 10 19:58 gdas_aeroanlfinal.log
-rw-r----- 1 role-da da  1634757 Jan 10 19:54 enkfgdas_esnowanl.log
-rw-r----- 1 role-da da    83240 Jan 10 19:49 enkfgdas_eupd.log
-rw-r----- 1 role-da da  1258139 Jan 10 19:48 gdas_aeroanlvar.log
-rw-r----- 1 role-da da   244340 Jan 10 19:44 enkfgdas_ediag.log
-rw-r----- 1 role-da da  1179960 Jan 10 19:44 gdas_snowanl.log
-rw-r----- 1 role-da da   624130 Jan 10 19:29 enkfgdas_eobs.log
-rw-r----- 1 role-da da   979293 Jan 10 19:23 gdas_aeroanlinit.log
-rw-r----- 1 role-da da 18554995 Jan 10 19:14 gdas_prep.log

Job test_gdasapp_C96C48_hybatmDA_gdas_anal_202112210000 also waited in the queue beyond the specified 2-hour limit. It eventually ran and completed successfully. As shown by log.ctest and the log files below, jobs downstream of gdas_anal ran after gdas_anal finished, so the downstream jobs passed:

hercules-login-2:/work2/noaa/da/role-da/CI/hercules/GDASApp/workflow/PR/1434/global-workflow/sorc/gdas.cd/build/gdas/test/gw-ci/C96C48_hybatmDA/COMROOT/C96C48_hybatmDA/logs/2021122100$ ls -lt
total 51640
-rw-r----- 1 role-da da   699871 Jan 10 22:57 enkfgdas_fcst_mem002.log
-rw-r----- 1 role-da da   699871 Jan 10 22:57 enkfgdas_fcst_mem001.log
-rw-r----- 1 role-da da   722839 Jan 10 22:04 enkfgdas_esfc.log
-rw-r----- 1 role-da da    88287 Jan 10 21:53 enkfgdas_ecen001.log
-rw-r----- 1 role-da da    88238 Jan 10 21:53 enkfgdas_ecen002.log
-rw-r----- 1 role-da da    88322 Jan 10 21:48 enkfgdas_ecen000.log
-rw-r----- 1 role-da da   686434 Jan 10 21:44 gdas_fcst_seg0.log
-rw-r----- 1 role-da da   161217 Jan 10 21:38 gdas_analcalc.log
-rw-r----- 1 role-da da   290934 Jan 10 21:28 gdas_sfcanl.log
-rw-r----- 1 role-da da   535279 Jan 10 21:23 gdas_anal.log
-rw-r----- 1 role-da da    82236 Jan 10 19:54 enkfgdas_eupd.log
-rw-r----- 1 role-da da   240979 Jan 10 19:44 enkfgdas_ediag.log
-rw-r----- 1 role-da da   420849 Jan 10 19:24 enkfgdas_eobs.log
-rw-r----- 1 role-da da 48018615 Jan 10 19:14 gdas_prep.log

The test_gdasapp_C48mx500_hybAOWCDA_gdas_marineanlletkf_202103250000 failure appears to be valid. The job log file shows that the job failed due to a missing input file:

 4:
 4: FATAL from PE     4: fms_io(restore_state_all): unable to find any restart files specified by ../ensdata/ens/ocean.1.nc
 4:
 4: Image              PC                Routine            Line        Source
 4: libsoca.so         000014AAD6962959  mpp_mod_mp_mpp_er          72  mpp_util_mpi.inc
 4: libsoca.so         000014AAD6836EA8  fms_io_mod_mp_res        4094  fms_io.F90
 4: libsoca.so         000014AAD66A1FD2  Unknown               Unknown  Unknown

File ../ensdata/ens/ocean.1.nc is created by gdas_marinebmat. That job ran after gdas_marineanlletkf, as shown by the log files:

hercules-login-2:/work2/noaa/da/role-da/CI/hercules/GDASApp/workflow/PR/1434/global-workflow/sorc/gdas.cd/build/gdas/test/gw-ci/C48mx500_hybAOWCDA/COMROOT/C48mx500_hybAOWCDA/logs/2021032500$ ls -lt
total 6948
-rw-r----- 1 role-da da  424031 Jan 10 19:58 gdas_marineanlfinal.log
-rw-r----- 1 role-da da  573104 Jan 10 19:47 gdas_marineanlchkpt.log
-rw-r----- 1 role-da da  515964 Jan 10 19:44 gdas_marineanlvar.log
-rw-r----- 1 role-da da  639341 Jan 10 19:22 gdas_marineanlinit.log
-rw-r----- 1 role-da da 3440276 Jan 10 19:14 gdas_marinebmat.log
-rw-r----- 1 role-da da  319349 Jan 10 18:59 gdas_marineanlletkf.log
-rw-r----- 1 role-da da  319419 Jan 10 18:34 gdas_marineanlletkf.log.0
-rw-r----- 1 role-da da  860660 Jan 10 17:50 gdas_prepoceanobs.log

A check of C48mx500_hybAOWCDA.xml shows that gdas_marineanlletkf lists only the following dependencies:

        <dependency>
                <and>
                        <metataskdep metatask="enkfgdas_fcst" cycle_offset="-06:00:00"/>
                        <taskdep task="gdas_prepoceanobs"/>
                </and>
        </dependency>

gdas_marinebmat is not listed as a dependency even though gdas_marineanlletkf uses output from gdas_marinebmat; a possible fix is sketched below.
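
If the dependency is confirmed, the corrected block might look like the following (a sketch only, reusing the rocoto taskdep syntax shown above; the actual g-w change would need review):

        <dependency>
                <and>
                        <metataskdep metatask="enkfgdas_fcst" cycle_offset="-06:00:00"/>
                        <taskdep task="gdas_prepoceanobs"/>
                        <taskdep task="gdas_marinebmat"/>
                </and>
        </dependency>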

Summary

  1. @DavidNew-NOAA: If we want to run GDASApp CI on Hercules, we need to increase the ctest timeout; 7200 seconds is not enough. Would 10800 seconds (3 hours) suffice, or is an even larger --timeout needed? (See the sketch after this list.)
  2. @guillaumevernieres and @AndrewEichmann-NOAA: Does gdas_marineanlletkf depend on output from gdas_marinebmat? If yes, we need to update g-w to add this dependency to the rocoto xml.
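
For reference, a minimal sketch of the adjusted invocation (the same flags run_ci.sh uses above, with a hypothetical 3-hour limit):

    ctest -j12 -R gdasapp --timeout 10800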

@DavidNew-NOAA (Collaborator)

@RussTreadon-NOAA I don't see a real problem with setting the timeout arbitrarily large.

@RussTreadon-NOAA restored the feature/improve_stable_driver branch on January 13, 2025 at 16:57
Labels
hercules-GW-RT-Failed (Automated testing with global-workflow failed on Hercules)

Development
Successfully merging this pull request may close: improve stable_driver.sh error checking (#1433)

4 participants