Ci enable distconv (#2235)
* Enable CI testing for DistConv.

Added DistConv CI tests

Added Corona DistConv test and disabled FFT on ROCm

Ensure that DistConv tests keep error signals

Enable NVSHMEM on Lassen

Added a multi-stage pipeline for Lassen

Fixed a typo and disabled other tests.

Added spack environment

Added check stage for the catch tests

Added the definition of the RESULTS_DIR environment variable

Added release notes.

Fixed the launcher for catch tests

Changed the batch launch commands to be interactive to block completion.

Added a wrapper shell script for launching the unit tests

Added the number of nodes for the unit test.

Cleaning up launching paths

Added execute permissions for unit test script.

Ingest the Spack dependent environment information.

Fixing launch command and exclusion of externallayer

Python bugfix

Added number of tasks per node

Added integration tests. Set some NVSHMEM runtime variables.

Uniquify the CI JOB_NAME fields for DistConv tests.

* Re-introduced the WITH_CLEAN_BUILD flag.

* Adapting new tests to the new build script framework with modules.

* Increased the time limit for the build on Lassen.  Code cleanup.

* Removed duplicate get_distconv_environment function in the ci_test
common python tools.  Switched all tests to using the standard contrib
args version.

* Changed the default behavior on MIOpen systems to use a local cache
for JIT state.
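As a minimal sketch of what such a per-user local cache could look like (the path layout and values here are assumptions, not taken from this commit; `MIOPEN_USER_DB_PATH` and `MIOPEN_CUSTOM_CACHE_DIR` are the MIOpen knobs typically used for this):

```shell
# Assumed layout: keep MIOpen's JIT/user databases on node-local storage
# so concurrent CI jobs (normal users and lbannusr) do not contend on a
# shared home-directory cache.
MIOPEN_CACHE_BASE="/tmp/${USER}/miopen"
mkdir -p "${MIOPEN_CACHE_BASE}/userdb" "${MIOPEN_CACHE_BASE}/bincache"
export MIOPEN_USER_DB_PATH="${MIOPEN_CACHE_BASE}/userdb"       # tuning/JIT user DB
export MIOPEN_CUSTOM_CACHE_DIR="${MIOPEN_CACHE_BASE}/bincache" # compiled kernel cache
echo "MIOpen user DB: ${MIOPEN_USER_DB_PATH}"
```

Using a fresh node-local path also makes it cheap for CI to wipe the cache between clean builds.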

* Added back note about existing issue in DiHydrogen.

* Enable CI runs to specify a subset of unit tests to run.

* Tweaking the allowed runtimes for tests.

* Debugging the test selection. Increasing some test time limits.

* Added test filter flags to all systems.

* Increasing time limits

* Added flags to skip integration tests on distconv CI runs.

* Bumped up pooling time limit.

* Testing out a set of MIOpen DB cache directories for CI
testing, both for normal users and lbannusr.

* Adding caching options for Corona and changed how the username is queried.

* Updated CI tests to use common MIOpen caches.  Split user and custom
cache paths.

* Fix the lassen multi-stage pipeline to record the spack architecture.

* Increase the build time limit on Lassen.

* Fixed the new lassen build to avoid installing pytest through spack.

* Added the clean build flags into the multi-stage pipeline.

* Skip failing tests in distconv.

* Changed the test utils to report None, rather than the string "unset", when the cluster value is not set.

* Added support for passing in the system cluster name by default if it is known.
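A sketch of the kind of default cluster detection described above (the hostnames, variable name, and fallback behavior here are assumptions for illustration, not code from this commit):

```shell
# Hypothetical fallback: derive the cluster name from the hostname when
# the harness is not told it explicitly; leave it empty when unknown so
# the Python test utils can report None instead of a bogus string.
cluster="${LBANN_CLUSTER:-}"
if [ -z "${cluster}" ]; then
  case "$(hostname)" in
    corona*) cluster=corona ;;
    lassen*) cluster=lassen ;;
    tioga*)  cluster=tioga ;;
    pascal*) cluster=pascal ;;
    *)       cluster= ;;  # unknown system: stays empty
  esac
fi
echo "cluster=${cluster:-None}"
```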

* Cleanup the paths for the MIOpen caches.

* Added a guard to skip inplace test if DistConv is disabled.

* Removing unnecessary variable definitions.

* ResNet tests should run on Corona.

* Added support in the data coordinator for explicitly recording the
dimensions of each field.  This is then compared to the dimensions
reported from the data reader, or if there are no valid data readers,
it can be substituted.  Note that for the time being this is redundant
information, but it allows the catch2 test to properly construct a
model without data readers.  This fixes a bug seen on Corona where the
MPI catch2 tests were failing because they allocated a set of buffers
with a size of -1.

Cleaned up the way in which the data coordinator checks for linearized
size to reduce code duplication.

Switch the data type of the data field dimensions to use El::Int
rather than int values.

Added a utility function to type cast between two vectors.

* Force lassen to clean build.

* Fixed the legacy HDF5 data reader.

* Increased the timeout for the lassen build and test.

* Bumped up the time limit on the catch tests for ROCm systems.

* Increase the catch test sizes.

* Trying to avoid forcing static linking when using NVSHMEM.

* Changed the run catch tests script for flux to use a flux proxy.

* Export the lbann setup for Lassen unit and integration tests.

* Minimize what is saved from the catch2 unit tests.

* Cleaning up the environment variables.

* Added a flag to extend the spack env name.

* Tweaking the flux proxy.

* Change how the NVSHMEM variables are setup so that the .before_script
sections do not collide.

* Removed the -o cpu-affinity=per-task flag from the flux run commands
on the catch2 tests because it is causing a hang on Corona.

Removed the nested flux proxy commands in the flux catch2 tests since
they should be unnecessary due to the flux proxy command that invokes
the script.

* Tweak the flux commands to resolve hang on Corona catch tests.

* Cleaning up the flux launch commands on Tioga and Corona to help avoid a hang.

* Added a job name suffix variable.

* Ensure that the spack environment names are unique.

* Tightened up the inclusion of the LBANN Python packages to avoid
conflicts when using the test_compiler script to build LBANN.

* Added support to Pip install into the lbann build directory.  Removed
setting the --test=root command from the extra root packages to avoid
triggering spack build failures on Power systems.

* Updated the baseline modules used on Corona and package versions on
Lassen.

* Fixing the allocation flux command for Tioga.

* Changing it so that only Corona adds the -o pmi=pmix flag to flux.

* Enable module generation for multiple core compilers.

* Making the flux commands consistent.

* Applied clang format.

* Fixed the compiler path on Pascal.

* Reenable lassen multi-stage distconv test pipeline.

* Fixed how the new Lassen distconv tests are invoked and avoided
erroneously re-initializing the spack environment.  Changed the saved
spack environment name to SPACK_ENV_NAME.  Cleaned up some dead code.

* Added a second if clause to the integration tests so that at least
one clause is always true and the stage will schedule.  Fixed the
regex so that the distconv substring doesn't have to come at the
start of the string.
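The regex fix can be illustrated with a small sketch (the anchored pattern and suffix value are assumptions used to show the behavior, not the literal rule from this commit):

```shell
# Suffixes look like "_distconv", so an anchored pattern never matches;
# dropping the anchor lets the substring match anywhere in the name.
suffix="_distconv"
if echo "${suffix}" | grep -qE '^distconv'; then
  echo "anchored regex: matched"
else
  echo "anchored regex: no match"
fi
if echo "${suffix}" | grep -qE 'distconv'; then
  echo "unanchored regex: matched"
fi
```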

* Consolidated the rules clause into a common one.

* Fix the rules regex.

* Added corona numbers for resnet.

* Tweaking the CI rules to avoid integrations on distconv builds.

* Tweaking how the lassen unit tests are called.

* Disable NVSHMEM build on Lassen.  Code cleanup; applied review suggestions.

* Changed the guard in resnet 50 test

* Disable NVSHMEM environment variables.

* Disabled Lassen DistConv unit tests.

* Apply suggestions from code review

Co-authored-by: Tom Benson <[email protected]>

---------

Co-authored-by: Tom Benson <[email protected]>
bvanessen and benson31 authored Sep 22, 2023
1 parent 73ef72b commit b75a718
Showing 69 changed files with 678 additions and 231 deletions.
53 changes: 53 additions & 0 deletions .gitlab-ci.yml
Original file line number Diff line number Diff line change
@@ -40,6 +40,19 @@ corona testing:
strategy: depend
include: .gitlab/corona/pipeline.yml

corona distconv testing:
stage: run-all-clusters
variables:
JOB_NAME_SUFFIX: _distconv
SPACK_ENV_BASE_NAME_MODIFIER: "-distconv"
SPACK_SPECS: "+rocm +distconv"
WITH_WEEKLY: "${LBANN_CI_RUN_WEEKLY}"
WITH_CLEAN_BUILD: "${LBANN_CI_CLEAN_BUILD}"
TEST_FLAG: "test_*_distconv.py"
trigger:
strategy: depend
include: .gitlab/corona/pipeline.yml

lassen testing:
stage: run-all-clusters
variables:
@@ -49,6 +62,20 @@ lassen testing:
strategy: depend
include: .gitlab/lassen/pipeline.yml

lassen distconv testing:
stage: run-all-clusters
variables:
JOB_NAME_SUFFIX: _distconv
SPACK_ENV_BASE_NAME_MODIFIER: "-multi-stage-distconv"
SPACK_SPECS: "+cuda +distconv +fft"
# SPACK_SPECS: "+cuda +distconv +nvshmem +fft"
WITH_WEEKLY: "${LBANN_CI_RUN_WEEKLY}"
WITH_CLEAN_BUILD: "${LBANN_CI_CLEAN_BUILD}"
TEST_FLAG: "test_*_distconv.py"
trigger:
strategy: depend
include: .gitlab/lassen/multi_stage_pipeline.yml

pascal testing:
stage: run-all-clusters
variables:
@@ -68,6 +95,19 @@ pascal compiler testing:
strategy: depend
include: .gitlab/pascal/pipeline_compiler_tests.yml

pascal distconv testing:
stage: run-all-clusters
variables:
JOB_NAME_SUFFIX: _distconv
SPACK_SPECS: "%[email protected] +cuda +distconv +fft"
BUILD_SCRIPT_OPTIONS: "--no-default-mirrors"
WITH_WEEKLY: "${LBANN_CI_RUN_WEEKLY}"
WITH_CLEAN_BUILD: "${LBANN_CI_CLEAN_BUILD}"
TEST_FLAG: "test_*_distconv.py"
trigger:
strategy: depend
include: .gitlab/pascal/pipeline.yml

tioga testing:
stage: run-all-clusters
variables:
@@ -76,3 +116,16 @@ tioga testing:
trigger:
strategy: depend
include: .gitlab/tioga/pipeline.yml

tioga distconv testing:
stage: run-all-clusters
variables:
JOB_NAME_SUFFIX: _distconv
SPACK_ENV_BASE_NAME_MODIFIER: "-distconv"
SPACK_SPECS: "+rocm +distconv"
WITH_WEEKLY: "${LBANN_CI_RUN_WEEKLY}"
WITH_CLEAN_BUILD: "${LBANN_CI_CLEAN_BUILD}"
TEST_FLAG: "test_*_distconv.py"
trigger:
strategy: depend
include: .gitlab/tioga/pipeline.yml
14 changes: 10 additions & 4 deletions .gitlab/common/common.yml
@@ -29,11 +29,11 @@
variables:
# This is based on the assumption that each runner will only ever
# be able to run one pipeline on a given cluster at one time.
SPACK_ENV_BASE_NAME: gitlab-${CI_COMMIT_BRANCH}-${GITLAB_USER_LOGIN}-${SYSTEM_NAME}${SPACK_ENV_BASE_NAME_EXTENSION}-${CI_RUNNER_SHORT_TOKEN}
SPACK_ENV_BASE_NAME: gitlab${SPACK_ENV_BASE_NAME_MODIFIER}-${CI_COMMIT_BRANCH}-${GITLAB_USER_LOGIN}-${SYSTEM_NAME}${SPACK_ENV_BASE_NAME_EXTENSION}-${CI_RUNNER_SHORT_TOKEN}

# This variable is the name used to identify the job in the Slurm
# queue. We need this to be able to access the correct jobid.
JOB_NAME: ${CI_PROJECT_NAME}_${CI_PIPELINE_ID}
JOB_NAME: ${CI_PROJECT_NAME}_${CI_PIPELINE_ID}${JOB_NAME_SUFFIX}

# This is needed to ensure that we run as lbannusr.
LLNL_SERVICE_USER: lbannusr
@@ -105,7 +105,7 @@
- ml use ${LBANN_MODFILES_DIR}
- ml load lbann
- echo "Using LBANN binary $(which lbann)"
- echo "export SPACK_DEP_ENV_NAME=${SPACK_ENV_NAME}" > spack-ci-env-name.sh
- echo "export SPACK_ENV_NAME=${SPACK_ENV_NAME}" > spack-ci-env-name.sh
- echo "export SPACK_ARCH=${SPACK_ARCH}" >> spack-ci-env-name.sh
- echo "export SPACK_ARCH_TARGET=${SPACK_ARCH_TARGET}" >> spack-ci-env-name.sh
- echo "export LBANN_BUILD_PARENT_DIR=${LBANN_BUILD_PARENT_DIR}" >> spack-ci-env-name.sh
Expand Down Expand Up @@ -137,7 +137,13 @@
- builds/lbann_${SYSTEM_NAME}_${SPACK_ENV_BASE_NAME}-*${CI_CONCURRENT_ID}-*/*.cmake
- builds/lbann_${SYSTEM_NAME}_${SPACK_ENV_BASE_NAME}-*${CI_CONCURRENT_ID}-*/build/CMakeCache.txt
- builds/lbann_${SYSTEM_NAME}_${SPACK_ENV_BASE_NAME}-*${CI_CONCURRENT_ID}-*/build/build.ninja
- builds/lbann_${SYSTEM_NAME}_${SPACK_ENV_BASE_NAME}-*${CI_CONCURRENT_ID}-*/build/unit_test/*
- ${RESULTS_DIR}/*
exclude:
- builds/lbann_${SYSTEM_NAME}_${SPACK_ENV_BASE_NAME}-*${CI_CONCURRENT_ID}-*/build/**/*.o
- builds/lbann_${SYSTEM_NAME}_${SPACK_ENV_BASE_NAME}-*${CI_CONCURRENT_ID}-*/build/unit_test/*

.lbann-test-rules:
rules:
- if: $JOB_NAME_SUFFIX == "_distconv"
when: never
- if: $CI_MERGE_REQUEST_SOURCE_BRANCH_NAME == $CI_MERGE_REQUEST_SOURCE_BRANCH_NAME
11 changes: 3 additions & 8 deletions .gitlab/common/run-catch-tests-flux.sh
@@ -52,14 +52,9 @@ export LD_LIBRARY_PATH=${ROCM_PATH}/lib:${LD_LIBRARY_PATH}

cd ${LBANN_BUILD_DIR}


flux run --label-io -n4 -N2 -g 1 -o cpu-affinity=per-task -o gpu-affinity=per-task sh -c 'taskset -cp $$; printenv | grep VISIBLE' | sort

flux run --label-io -n4 -N2 -g 1 -o cpu-affinity=off -o gpu-affinity=per-task sh -c 'taskset -cp $$; printenv | grep VISIBLE' | sort

echo "Running sequential catch tests"

flux run -N 1 -n 1 -g 1 -t 5m \
flux run -N 1 -n 1 --exclusive -o nosetpgrp ${EXTRA_FLUX_ARGS} -t 5m \
./unit_test/seq-catch-tests \
-r JUnit \
-o ${OUTPUT_DIR}/seq-catch-results.xml
@@ -71,7 +66,7 @@ echo "Running MPI catch tests with ${LBANN_NNODES} nodes and ${TEST_TASKS_PER_NO

flux run \
-N ${LBANN_NNODES} -n $((${TEST_TASKS_PER_NODE} * ${LBANN_NNODES})) \
-g 1 -t 5m -o gpu-affinity=per-task -o cpu-affinity=per-task -o mpibind=off \
-t 5m --exclusive -o nosetpgrp ${EXTRA_FLUX_ARGS} \
./unit_test/mpi-catch-tests "exclude:[random]" "exclude:[filesystem]"\
-r JUnit \
-o "${OUTPUT_DIR}/mpi-catch-results-rank=%r-size=%s.xml"
@@ -83,7 +78,7 @@ echo "Running MPI filesystem catch tests"

flux run \
-N ${LBANN_NNODES} -n $((${TEST_TASKS_PER_NODE} * ${LBANN_NNODES})) \
-g 1 -t 5m -o gpu-affinity=per-task -o cpu-affinity=per-task -o mpibind=off \
-t 5m --exclusive -o nosetpgrp ${EXTRA_FLUX_ARGS} \
./unit_test/mpi-catch-tests -s "[filesystem]" \
-r JUnit \
-o "${OUTPUT_DIR}/mpi-catch-filesystem-results-rank=%r-size=%s.xml"
78 changes: 78 additions & 0 deletions .gitlab/common/run-catch-tests-lsf.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,78 @@
################################################################################
## Copyright (c) 2014-2023, Lawrence Livermore National Security, LLC.
## Produced at the Lawrence Livermore National Laboratory.
## Written by the LBANN Research Team (B. Van Essen, et al.) listed in
## the CONTRIBUTORS file. <[email protected]>
##
## LLNL-CODE-697807.
## All rights reserved.
##
## This file is part of LBANN: Livermore Big Artificial Neural Network
## Toolkit. For details, see http://software.llnl.gov/LBANN or
## https://github.com/LLNL/LBANN.
##
## Licensed under the Apache License, Version 2.0 (the "License"); you
## may not use this file except in compliance with the License. You may
## obtain a copy of the License at:
##
## http://www.apache.org/licenses/LICENSE-2.0
##
## Unless required by applicable law or agreed to in writing, software
## distributed under the License is distributed on an "AS IS" BASIS,
## WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
## implied. See the License for the specific language governing
## permissions and limitations under the License.
################################################################################

#!/bin/bash
cd ${LBANN_BUILD_DIR}

# Configure the output directory
OUTPUT_DIR=${CI_PROJECT_DIR}/${RESULTS_DIR}
if [[ -d ${OUTPUT_DIR} ]];
then
rm -rf ${OUTPUT_DIR}
fi
mkdir -p ${OUTPUT_DIR}

FAILED_JOBS=""

lrun -N 1 -n 1 -W 5 \
./unit_test/seq-catch-tests \
-r JUnit \
-o ${OUTPUT_DIR}/seq-catch-results.xml
if [[ $? -ne 0 ]]; then
FAILED_JOBS+=" seq"
fi

lrun -N ${LBANN_NNODES} -n $(($TEST_TASKS_PER_NODE * ${LBANN_NNODES})) \
-T $TEST_TASKS_PER_NODE \
-W 5 ${TEST_MPIBIND_FLAG} \
./unit_test/mpi-catch-tests "exclude:[externallayer]" "exclude:[filesystem]" \
-r JUnit \
-o "${OUTPUT_DIR}/mpi-catch-results-rank=%r-size=%s.xml"
if [[ $? -ne 0 ]]; then
FAILED_JOBS+=" mpi"
fi

lrun -N ${LBANN_NNODES} -n $(($TEST_TASKS_PER_NODE * ${LBANN_NNODES})) \
-T $TEST_TASKS_PER_NODE \
-W 5 ${TEST_MPIBIND_FLAG} \
./unit_test/mpi-catch-tests "[filesystem]" \
-r JUnit \
-o "${OUTPUT_DIR}/mpi-catch-filesystem-results-rank=%r-size=%s.xml"
if [[ $? -ne 0 ]];
then
FAILED_JOBS+=" mpi-filesystem"
fi

# Try to write a semi-useful message to this file since it's being
# saved as an artifact. It's not completely outside the realm that
# someone would look at it.
if [[ -n "${FAILED_JOBS}" ]];
then
echo "Some Catch2 tests failed:${FAILED_JOBS}" > ${OUTPUT_DIR}/catch-tests-failed.txt
fi

# Return "success" so that the pytest-based testing can run.
exit 0
13 changes: 9 additions & 4 deletions .gitlab/corona/pipeline.yml
@@ -55,7 +55,7 @@ allocate lc resources:
- export TEST_TIME=$([[ -n "${WITH_WEEKLY}" ]] && echo "150m" || echo "120m")
- export LBANN_NNODES=$([[ -n "${WITH_WEEKLY}" ]] && echo "4" || echo "2")
- export FLUX_F58_FORCE_ASCII=t
- jobid=$(flux --parent alloc -N ${LBANN_NNODES} -g 1 -t ${TEST_TIME} --job-name=${JOB_NAME} --bg)
- jobid=$(flux --parent alloc -N ${LBANN_NNODES} --exclusive -t ${TEST_TIME} --job-name=${JOB_NAME} --bg)
- export JOB_ID=$jobid
timeout: 6h

@@ -79,6 +79,7 @@ build and install:
- export TEST_MPIBIND_FLAG="--mpibind=off"
- export SPACK_ARCH=$(flux proxy ${JOB_ID} flux mini run -N 1 spack arch)
- export SPACK_ARCH_TARGET=$(flux proxy ${JOB_ID} flux mini run -N 1 spack arch -t)
- export EXTRA_FLUX_ARGS="-o pmi=pmix"
- !reference [.setup_lbann, script]
- flux proxy ${JOB_ID} .gitlab/common/run-catch-tests-flux.sh

@@ -97,7 +98,8 @@ unit tests:
- export OMP_NUM_THREADS=10
- "export FLUX_JOB_ID=$(flux jobs -no {id}:{name} | grep ${JOB_NAME} | awk -F: '{print $1}')"
- cd ci_test/unit_tests
- flux proxy ${FLUX_JOB_ID} lbann_pfe.sh -m pytest -s -vv --durations=0 --junitxml=results.xml
# - echo "Running unit tests with file pattern: ${TEST_FLAG}"
- flux proxy ${FLUX_JOB_ID} python3 -m pytest -s -vv --durations=0 --junitxml=results.xml ${TEST_FLAG}
artifacts:
when: always
paths:
@@ -114,15 +116,18 @@ integration tests:
stage: test
dependencies:
- build and install
rules:
- !reference [.lbann-test-rules, rules]
script:
- echo "== RUNNING PYTHON-BASED INTEGRATION TESTS =="
- echo "Testing $(which lbann)"
- export OMP_NUM_THREADS=10
- "export FLUX_JOB_ID=$(flux jobs -no {id}:{name} | grep ${JOB_NAME} | awk -F: '{print $1}')"
- cd ci_test/integration_tests
- export WEEKLY_FLAG=${WITH_WEEKLY:+--weekly}
- echo "python3 -m pytest -s -vv --durations=0 ${WEEKLY_FLAG} --junitxml=results.xml"
- flux proxy ${FLUX_JOB_ID} lbann_pfe.sh -m pytest -s -vv --durations=0 ${WEEKLY_FLAG} --junitxml=results.xml
# - echo "Running integration tests with file pattern: ${TEST_FLAG}"
# - echo "python3 -m pytest -s -vv --durations=0 ${WEEKLY_FLAG} --junitxml=results.xml ${TEST_FLAG}"
- flux proxy ${FLUX_JOB_ID} python3 -m pytest -s -vv --durations=0 ${WEEKLY_FLAG} --junitxml=results.xml ${TEST_FLAG}
artifacts:
when: always
paths: