Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
* Enable CI testing for DistConv. Added DistConv CI tests Added Corona DistConv test and disabled FFT on ROCm Ensure that DistConv tests keep error signals Enable NVSHMEM on Lassen Added a multi-stage pipeline for Lassen Fixed a typo and disabled other tests. Added spack environment Added check stage for the catch tests Added the definition of the RESULTS_DIR environment variable Added release notes. Fixed the launcher for catch tests Changed the batch launch commands to be interactive to block completion. Added a wrapper shell script for launching the unit tests Added the number of nodes for the unit test. Cleaning up launching paths Added execute permissions for unit test script. Ingest the Spack dependent environment information. Fixing launch command and exclusion of externallayer Bugfix python Added number of tasks per node Added integration tests. Set some NVSHMEM runtime variables Uniquify the CI JOB_NAME fields for DistConv tests. * Re-introduced the WITH_CLEAN_BUILD flag. * Adapting new tests to the new build script framework with modules. * Increased the time limit for the build on Lassen. Code cleanup. * Removed duplicate get_distconv_environment function in the ci_test common python tools. Switched all tests to using the standard contrib args version. * Changed the default behavior on MIOpen systems to use a local cache for JIT state. * Added back note about existing issue in DiHydrogen. * Enable CI runs to specific a subset of unit tests to run. * Tweaking the allowed runtimes for tests. * Debuging the test selection. Increasing some test time limits. * Added test filter flags to all systems. * Increasing time limits * Added flags to skip integration tests on distconv CI runs. * Bumped up pooling time limit. * Testing out setting a set of MIOpen dB cache directories for CI testing, both for normal users and lbannusr. * Adding caching options for Corona and changed how the username is queried. * Updated CI tests to use common MIOpen caches. Split user and custom cache paths. * Fix the lassen multi-stage pipeline to record the spack architecture. * Increase the build time limit on Lassen. * Fixed the new lassen build to avoid installing pytest through spack. * Added the clean build flags into the multi-stage pipeline. * Skip failing tests in distconv. * Change the test utils to not set cluster value to unset, but rather None if it is not set. * Added support for passing in the system cluster name by default if it is known. * Cleanup the paths for the MIOpen caches. * Added a guard to skip inplace test if DistConv is disabled. * Removing unnecessary variable definitions. * ResNet tests should run on Corona. * Added support in the data coordinator for explicitly recording the dimensions of each field. This is then compared to the dimensions reported from the data reader, or if there are no valid data readers, it can be substituted. Note that for the time being this is redundant information, but it allows the catch2 test to properly construct a model without data readers. This fixes a bug seen on Corona where the MPI catch2 tests were failing because they allocated a set of buffers with a size of -1. Cleaned up the way in which the data coordinator checks for linearized size to reduce code duplication. Switch the data type of the data field dimensions to use El::Int rather than int values. Added a utility function to type cast between two vectors. * Force lassen to clean build. * Fixed the legacy HDF5 data reader. * Increased the timeout for the lassen build and test. * Bumped up the time limit on the catch tests for ROCm systems. * Increase the catch test sizes. * Trying to avoid forcing static linking when using NVSHMEM. * Changed the run catch tests script for flux to use a flux proxy. * Export the lbann setup for Lassen unit and integration tests. * Minimize what is saved from the catch2 unit tests. * Cleaning up the environment variables. * Added a flag to extend the spack env name. * Tweaking the flux proxy. * Change how the NVSHMEM variables are setup so that the .before_script sections do not collide. * Removed the -o cpu-affinity=per-task flag from the flux run commands on the catch2 tests because it is causing a hang on Corona. Removed the nested flux proxy commands in the flux catch2 tests since they should be unnecessary due to the flux proxy command that invokes the script. * Tweak the flux commands to resovle hang on Corona catch tests. * Cleaning up the flux launch commands on Tioga and Corona to help avoid a hang. * Added a job name suffix variable. * Ensure that the spack environment names are unique. * Tightened up the inclusion of the LBANN Python packages to avoid conflicts when using the test_compiler script to build LBANN. * Added support to Pip install into the lbann build directory. Removed setting the --test=root command from the extra root packages to avoid triggering spack build failures on Power systems. * Updated the baseline modules used on Corona and package versions on Lassen. * Fixing the allocation flux command for Tioga. * Changing it so that only Corona addes the -o pmi=pmix flags to flux. * Enable module generation for multiple core compilers. * Making the flux commands consistent. * Applied clang format. * Fixed the compiler path on Pasacl. * Reenable lassen multi-stage distconv test pipeline. * Fixed how the new Lassen distconv tests are invoked and avoid erronously resetting up the spack environment. Changed the saved spack environment name to SPACK_ENV_NAME. Cleaned up some dead code. * Added a second if clause to the integration tests so that there is always at least one true clause, so the stage will schedule. Fixed the regex so that the distconv substring doesn't have to come at the start of the string. * Consolidated the rules clause into a common one. * Fix the rules regex. * Added corona numbers for resnet. * Tweaking the CI rules to avoid integrations on distconv builds. * Tweaking how the lassen unit tests are called. * Disable nvshmem build on Lassen. Code cleanup and adding suggestions. * Changed the guard in resnet 50 test * Disable NVSHMEM environemnt variables. * Disabled Lassen DistConv unit tests. * Apply suggestions from code review Co-authored-by: Tom Benson <[email protected]> --------- Co-authored-by: Tom Benson <[email protected]>
- Loading branch information