Ci enable distconv (#2235)
* Enable CI testing for DistConv.

Added DistConv CI tests

Added Corona DistConv test and disabled FFT on ROCm

Ensure that DistConv tests keep error signals

Enable NVSHMEM on Lassen

Added a multi-stage pipeline for Lassen

Fixed a typo and disabled other tests.

Added spack environment

Added check stage for the catch tests

Added the definition of the RESULTS_DIR environment variable

Added release notes.

Fixed the launcher for catch tests

Changed the batch launch commands to be interactive to block completion.

Added a wrapper shell script for launching the unit tests

Added the number of nodes for the unit test.

Cleaning up launching paths

Added execute permissions for unit test script.

Ingest the Spack dependent environment information.

Fixing launch command and exclusion of externallayer

Python bugfix

Added number of tasks per node

Added integration tests. Set some NVSHMEM runtime variables.

Uniquify the CI JOB_NAME fields for DistConv tests.

* Re-introduced the WITH_CLEAN_BUILD flag.

* Adapting new tests to the new build script framework with modules.

* Increased the time limit for the build on Lassen.  Code cleanup.

* Removed duplicate get_distconv_environment function in the ci_test
common python tools.  Switched all tests to using the standard contrib
args version.

* Changed the default behavior on MIOpen systems to use a local cache
for JIT state.
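As a minimal sketch of what such a per-user local cache could look like (the path layout and values here are assumptions, not taken from this commit; `MIOPEN_USER_DB_PATH` and `MIOPEN_CUSTOM_CACHE_DIR` are the MIOpen knobs typically used for this):

```shell
# Assumed layout: keep MIOpen's JIT/user databases on node-local storage
# so concurrent CI jobs (normal users and lbannusr) do not contend on a
# shared home-directory cache.
MIOPEN_CACHE_BASE="/tmp/${USER}/miopen"
mkdir -p "${MIOPEN_CACHE_BASE}/userdb" "${MIOPEN_CACHE_BASE}/bincache"
export MIOPEN_USER_DB_PATH="${MIOPEN_CACHE_BASE}/userdb"       # tuning/JIT user DB
export MIOPEN_CUSTOM_CACHE_DIR="${MIOPEN_CACHE_BASE}/bincache" # compiled kernel cache
echo "MIOpen user DB: ${MIOPEN_USER_DB_PATH}"
```

Using a fresh node-local path also makes it cheap for CI to wipe the cache between clean builds.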

* Added back note about existing issue in DiHydrogen.

* Enable CI runs to specify a subset of unit tests to run.

* Tweaking the allowed runtimes for tests.

* Debugging the test selection. Increasing some test time limits.

* Added test filter flags to all systems.

* Increasing time limits

* Added flags to skip integration tests on distconv CI runs.

* Bumped up pooling time limit.

* Testing out a set of MIOpen DB cache directories for CI
testing, both for normal users and lbannusr.

* Adding caching options for Corona and changed how the username is queried.

* Updated CI tests to use common MIOpen caches.  Split user and custom
cache paths.

* Fix the lassen multi-stage pipeline to record the spack architecture.

* Increase the build time limit on Lassen.

* Fixed the new lassen build to avoid installing pytest through spack.

* Added the clean build flags into the multi-stage pipeline.

* Skip failing tests in distconv.

* Changed the test utils to report None, rather than the string "unset", when the cluster value is not set.

* Added support for passing in the system cluster name by default if it is known.
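A sketch of the kind of default cluster detection described above (the hostnames, variable name, and fallback behavior here are assumptions for illustration, not code from this commit):

```shell
# Hypothetical fallback: derive the cluster name from the hostname when
# the harness is not told it explicitly; leave it empty when unknown so
# the Python test utils can report None instead of a bogus string.
cluster="${LBANN_CLUSTER:-}"
if [ -z "${cluster}" ]; then
  case "$(hostname)" in
    corona*) cluster=corona ;;
    lassen*) cluster=lassen ;;
    tioga*)  cluster=tioga ;;
    pascal*) cluster=pascal ;;
    *)       cluster= ;;  # unknown system: stays empty
  esac
fi
echo "cluster=${cluster:-None}"
```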

* Cleanup the paths for the MIOpen caches.

* Added a guard to skip inplace test if DistConv is disabled.

* Removing unnecessary variable definitions.

* ResNet tests should run on Corona.

* Added support in the data coordinator for explicitly recording the
dimensions of each field.  This is then compared to the dimensions
reported from the data reader, or if there are no valid data readers,
it can be substituted.  Note that for the time being this is redundant
information, but it allows the catch2 test to properly construct a
model without data readers.  This fixes a bug seen on Corona where the
MPI catch2 tests were failing because they allocated a set of buffers
with a size of -1.

Cleaned up the way in which the data coordinator checks for linearized
size to reduce code duplication.

Switch the data type of the data field dimensions to use El::Int
rather than int values.

Added a utility function to type cast between two vectors.

* Force lassen to clean build.

* Fixed the legacy HDF5 data reader.

* Increased the timeout for the lassen build and test.

* Bumped up the time limit on the catch tests for ROCm systems.

* Increase the catch test sizes.

* Trying to avoid forcing static linking when using NVSHMEM.

* Changed the run catch tests script for flux to use a flux proxy.

* Export the lbann setup for Lassen unit and integration tests.

* Minimize what is saved from the catch2 unit tests.

* Cleaning up the environment variables.

* Added a flag to extend the spack env name.

* Tweaking the flux proxy.

* Change how the NVSHMEM variables are setup so that the .before_script
sections do not collide.

* Removed the -o cpu-affinity=per-task flag from the flux run commands
on the catch2 tests because it is causing a hang on Corona.

Removed the nested flux proxy commands in the flux catch2 tests since
they should be unnecessary due to the flux proxy command that invokes
the script.

* Tweak the flux commands to resolve hang on Corona catch tests.

* Cleaning up the flux launch commands on Tioga and Corona to help avoid a hang.

* Added a job name suffix variable.

* Ensure that the spack environment names are unique.

* Tightened up the inclusion of the LBANN Python packages to avoid
conflicts when using the test_compiler script to build LBANN.

* Added support to Pip install into the lbann build directory.  Removed
setting the --test=root command from the extra root packages to avoid
triggering spack build failures on Power systems.

* Updated the baseline modules used on Corona and package versions on
Lassen.

* Fixing the allocation flux command for Tioga.

* Changing it so that only Corona adds the -o pmi=pmix flag to flux.

* Enable module generation for multiple core compilers.

* Making the flux commands consistent.

* Applied clang format.

* Fixed the compiler path on Pascal.

* Reenable lassen multi-stage distconv test pipeline.

* Fixed how the new Lassen distconv tests are invoked and avoided
erroneously re-initializing the spack environment.  Changed the saved
spack environment name to SPACK_ENV_NAME.  Cleaned up some dead code.

* Added a second if clause to the integration tests so that at least
one clause is always true and the stage will schedule.  Fixed the
regex so that the distconv substring doesn't have to come at the
start of the string.
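The regex fix can be illustrated with a small sketch (the anchored pattern and suffix value are assumptions used to show the behavior, not the literal rule from this commit):

```shell
# Suffixes look like "_distconv", so an anchored pattern never matches;
# dropping the anchor lets the substring match anywhere in the name.
suffix="_distconv"
if echo "${suffix}" | grep -qE '^distconv'; then
  echo "anchored regex: matched"
else
  echo "anchored regex: no match"
fi
if echo "${suffix}" | grep -qE 'distconv'; then
  echo "unanchored regex: matched"
fi
```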

* Consolidated the rules clause into a common one.

* Fix the rules regex.

* Added corona numbers for resnet.

* Tweaking the CI rules to avoid integrations on distconv builds.

* Tweaking how the lassen unit tests are called.

* Disable NVSHMEM build on Lassen.  Code cleanup; applied review suggestions.

* Changed the guard in resnet 50 test

* Disable NVSHMEM environment variables.

* Disabled Lassen DistConv unit tests.

* Apply suggestions from code review

Co-authored-by: Tom Benson <[email protected]>

---------

Co-authored-by: Tom Benson <[email protected]>
bvanessen and benson31 authored Sep 22, 2023
1 parent 73ef72b commit b75a718
Showing 69 changed files with 678 additions and 231 deletions.
53 changes: 53 additions & 0 deletions .gitlab-ci.yml
Original file line number Diff line number Diff line change
@@ -40,6 +40,19 @@ corona testing:
strategy: depend
include: .gitlab/corona/pipeline.yml

corona distconv testing:
stage: run-all-clusters
variables:
JOB_NAME_SUFFIX: _distconv
SPACK_ENV_BASE_NAME_MODIFIER: "-distconv"
SPACK_SPECS: "+rocm +distconv"
WITH_WEEKLY: "${LBANN_CI_RUN_WEEKLY}"
WITH_CLEAN_BUILD: "${LBANN_CI_CLEAN_BUILD}"
TEST_FLAG: "test_*_distconv.py"
trigger:
strategy: depend
include: .gitlab/corona/pipeline.yml

lassen testing:
stage: run-all-clusters
variables:
@@ -49,6 +62,20 @@ lassen testing:
strategy: depend
include: .gitlab/lassen/pipeline.yml

lassen distconv testing:
stage: run-all-clusters
variables:
JOB_NAME_SUFFIX: _distconv
SPACK_ENV_BASE_NAME_MODIFIER: "-multi-stage-distconv"
SPACK_SPECS: "+cuda +distconv +fft"
# SPACK_SPECS: "+cuda +distconv +nvshmem +fft"
WITH_WEEKLY: "${LBANN_CI_RUN_WEEKLY}"
WITH_CLEAN_BUILD: "${LBANN_CI_CLEAN_BUILD}"
TEST_FLAG: "test_*_distconv.py"
trigger:
strategy: depend
include: .gitlab/lassen/multi_stage_pipeline.yml

pascal testing:
stage: run-all-clusters
variables:
@@ -68,6 +95,19 @@ pascal compiler testing:
strategy: depend
include: .gitlab/pascal/pipeline_compiler_tests.yml

pascal distconv testing:
stage: run-all-clusters
variables:
JOB_NAME_SUFFIX: _distconv
SPACK_SPECS: "%[email protected] +cuda +distconv +fft"
BUILD_SCRIPT_OPTIONS: "--no-default-mirrors"
WITH_WEEKLY: "${LBANN_CI_RUN_WEEKLY}"
WITH_CLEAN_BUILD: "${LBANN_CI_CLEAN_BUILD}"
TEST_FLAG: "test_*_distconv.py"
trigger:
strategy: depend
include: .gitlab/pascal/pipeline.yml

tioga testing:
stage: run-all-clusters
variables:
@@ -76,3 +116,16 @@ tioga testing:
trigger:
strategy: depend
include: .gitlab/tioga/pipeline.yml

tioga distconv testing:
stage: run-all-clusters
variables:
JOB_NAME_SUFFIX: _distconv
SPACK_ENV_BASE_NAME_MODIFIER: "-distconv"
SPACK_SPECS: "+rocm +distconv"
WITH_WEEKLY: "${LBANN_CI_RUN_WEEKLY}"
WITH_CLEAN_BUILD: "${LBANN_CI_CLEAN_BUILD}"
TEST_FLAG: "test_*_distconv.py"
trigger:
strategy: depend
include: .gitlab/tioga/pipeline.yml
14 changes: 10 additions & 4 deletions .gitlab/common/common.yml
@@ -29,11 +29,11 @@
variables:
# This is based on the assumption that each runner will only ever
# be able to run one pipeline on a given cluster at one time.
SPACK_ENV_BASE_NAME: gitlab-${CI_COMMIT_BRANCH}-${GITLAB_USER_LOGIN}-${SYSTEM_NAME}${SPACK_ENV_BASE_NAME_EXTENSION}-${CI_RUNNER_SHORT_TOKEN}
SPACK_ENV_BASE_NAME: gitlab${SPACK_ENV_BASE_NAME_MODIFIER}-${CI_COMMIT_BRANCH}-${GITLAB_USER_LOGIN}-${SYSTEM_NAME}${SPACK_ENV_BASE_NAME_EXTENSION}-${CI_RUNNER_SHORT_TOKEN}

# This variable is the name used to identify the job in the Slurm
# queue. We need this to be able to access the correct jobid.
JOB_NAME: ${CI_PROJECT_NAME}_${CI_PIPELINE_ID}
JOB_NAME: ${CI_PROJECT_NAME}_${CI_PIPELINE_ID}${JOB_NAME_SUFFIX}

# This is needed to ensure that we run as lbannusr.
LLNL_SERVICE_USER: lbannusr
@@ -105,7 +105,7 @@
- ml use ${LBANN_MODFILES_DIR}
- ml load lbann
- echo "Using LBANN binary $(which lbann)"
- echo "export SPACK_DEP_ENV_NAME=${SPACK_ENV_NAME}" > spack-ci-env-name.sh
- echo "export SPACK_ENV_NAME=${SPACK_ENV_NAME}" > spack-ci-env-name.sh
- echo "export SPACK_ARCH=${SPACK_ARCH}" >> spack-ci-env-name.sh
- echo "export SPACK_ARCH_TARGET=${SPACK_ARCH_TARGET}" >> spack-ci-env-name.sh
- echo "export LBANN_BUILD_PARENT_DIR=${LBANN_BUILD_PARENT_DIR}" >> spack-ci-env-name.sh
Expand Down Expand Up @@ -137,7 +137,13 @@
- builds/lbann_${SYSTEM_NAME}_${SPACK_ENV_BASE_NAME}-*${CI_CONCURRENT_ID}-*/*.cmake
- builds/lbann_${SYSTEM_NAME}_${SPACK_ENV_BASE_NAME}-*${CI_CONCURRENT_ID}-*/build/CMakeCache.txt
- builds/lbann_${SYSTEM_NAME}_${SPACK_ENV_BASE_NAME}-*${CI_CONCURRENT_ID}-*/build/build.ninja
- builds/lbann_${SYSTEM_NAME}_${SPACK_ENV_BASE_NAME}-*${CI_CONCURRENT_ID}-*/build/unit_test/*
- ${RESULTS_DIR}/*
exclude:
- builds/lbann_${SYSTEM_NAME}_${SPACK_ENV_BASE_NAME}-*${CI_CONCURRENT_ID}-*/build/**/*.o
- builds/lbann_${SYSTEM_NAME}_${SPACK_ENV_BASE_NAME}-*${CI_CONCURRENT_ID}-*/build/unit_test/*

.lbann-test-rules:
rules:
- if: $JOB_NAME_SUFFIX == "_distconv"
when: never
- if: $CI_MERGE_REQUEST_SOURCE_BRANCH_NAME == $CI_MERGE_REQUEST_SOURCE_BRANCH_NAME
11 changes: 3 additions & 8 deletions .gitlab/common/run-catch-tests-flux.sh
@@ -52,14 +52,9 @@ export LD_LIBRARY_PATH=${ROCM_PATH}/lib:${LD_LIBRARY_PATH}

cd ${LBANN_BUILD_DIR}


flux run --label-io -n4 -N2 -g 1 -o cpu-affinity=per-task -o gpu-affinity=per-task sh -c 'taskset -cp $$; printenv | grep VISIBLE' | sort

flux run --label-io -n4 -N2 -g 1 -o cpu-affinity=off -o gpu-affinity=per-task sh -c 'taskset -cp $$; printenv | grep VISIBLE' | sort

echo "Running sequential catch tests"

flux run -N 1 -n 1 -g 1 -t 5m \
flux run -N 1 -n 1 --exclusive -o nosetpgrp ${EXTRA_FLUX_ARGS} -t 5m \
./unit_test/seq-catch-tests \
-r JUnit \
-o ${OUTPUT_DIR}/seq-catch-results.xml
@@ -71,7 +66,7 @@ echo "Running MPI catch tests with ${LBANN_NNODES} nodes and ${TEST_TASKS_PER_NO

flux run \
-N ${LBANN_NNODES} -n $((${TEST_TASKS_PER_NODE} * ${LBANN_NNODES})) \
-g 1 -t 5m -o gpu-affinity=per-task -o cpu-affinity=per-task -o mpibind=off \
-t 5m --exclusive -o nosetpgrp ${EXTRA_FLUX_ARGS} \
./unit_test/mpi-catch-tests "exclude:[random]" "exclude:[filesystem]"\
-r JUnit \
-o "${OUTPUT_DIR}/mpi-catch-results-rank=%r-size=%s.xml"
@@ -83,7 +78,7 @@ echo "Running MPI filesystem catch tests"

flux run \
-N ${LBANN_NNODES} -n $((${TEST_TASKS_PER_NODE} * ${LBANN_NNODES})) \
-g 1 -t 5m -o gpu-affinity=per-task -o cpu-affinity=per-task -o mpibind=off \
-t 5m --exclusive -o nosetpgrp ${EXTRA_FLUX_ARGS} \
./unit_test/mpi-catch-tests -s "[filesystem]" \
-r JUnit \
-o "${OUTPUT_DIR}/mpi-catch-filesystem-results-rank=%r-size=%s.xml"
78 changes: 78 additions & 0 deletions .gitlab/common/run-catch-tests-lsf.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,78 @@
################################################################################
## Copyright (c) 2014-2023, Lawrence Livermore National Security, LLC.
## Produced at the Lawrence Livermore National Laboratory.
## Written by the LBANN Research Team (B. Van Essen, et al.) listed in
## the CONTRIBUTORS file. <[email protected]>
##
## LLNL-CODE-697807.
## All rights reserved.
##
## This file is part of LBANN: Livermore Big Artificial Neural Network
## Toolkit. For details, see http://software.llnl.gov/LBANN or
## https://github.com/LLNL/LBANN.
##
## Licensed under the Apache License, Version 2.0 (the "License"); you
## may not use this file except in compliance with the License. You may
## obtain a copy of the License at:
##
## http://www.apache.org/licenses/LICENSE-2.0
##
## Unless required by applicable law or agreed to in writing, software
## distributed under the License is distributed on an "AS IS" BASIS,
## WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
## implied. See the License for the specific language governing
## permissions and limitations under the License.
################################################################################

#!/bin/bash
cd ${LBANN_BUILD_DIR}

# Configure the output directory
OUTPUT_DIR=${CI_PROJECT_DIR}/${RESULTS_DIR}
if [[ -d ${OUTPUT_DIR} ]];
then
rm -rf ${OUTPUT_DIR}
fi
mkdir -p ${OUTPUT_DIR}

FAILED_JOBS=""

lrun -N 1 -n 1 -W 5 \
./unit_test/seq-catch-tests \
-r JUnit \
-o ${OUTPUT_DIR}/seq-catch-results.xml
if [[ $? -ne 0 ]]; then
FAILED_JOBS+=" seq"
fi

lrun -N ${LBANN_NNODES} -n $(($TEST_TASKS_PER_NODE * ${LBANN_NNODES})) \
-T $TEST_TASKS_PER_NODE \
-W 5 ${TEST_MPIBIND_FLAG} \
./unit_test/mpi-catch-tests "exclude:[externallayer]" "exclude:[filesystem]" \
-r JUnit \
-o "${OUTPUT_DIR}/mpi-catch-results-rank=%r-size=%s.xml"
if [[ $? -ne 0 ]]; then
FAILED_JOBS+=" mpi"
fi

lrun -N ${LBANN_NNODES} -n $(($TEST_TASKS_PER_NODE * ${LBANN_NNODES})) \
-T $TEST_TASKS_PER_NODE \
-W 5 ${TEST_MPIBIND_FLAG} \
./unit_test/mpi-catch-tests "[filesystem]" \
-r JUnit \
-o "${OUTPUT_DIR}/mpi-catch-filesystem-results-rank=%r-size=%s.xml"
if [[ $? -ne 0 ]];
then
FAILED_JOBS+=" mpi-filesystem"
fi

# Try to write a semi-useful message to this file since it's being
# saved as an artifact. It's not completely outside the realm that
# someone would look at it.
if [[ -n "${FAILED_JOBS}" ]];
then
echo "Some Catch2 tests failed:${FAILED_JOBS}" > ${OUTPUT_DIR}/catch-tests-failed.txt
fi

# Return "success" so that the pytest-based testing can run.
exit 0
13 changes: 9 additions & 4 deletions .gitlab/corona/pipeline.yml
@@ -55,7 +55,7 @@ allocate lc resources:
- export TEST_TIME=$([[ -n "${WITH_WEEKLY}" ]] && echo "150m" || echo "120m")
- export LBANN_NNODES=$([[ -n "${WITH_WEEKLY}" ]] && echo "4" || echo "2")
- export FLUX_F58_FORCE_ASCII=t
- jobid=$(flux --parent alloc -N ${LBANN_NNODES} -g 1 -t ${TEST_TIME} --job-name=${JOB_NAME} --bg)
- jobid=$(flux --parent alloc -N ${LBANN_NNODES} --exclusive -t ${TEST_TIME} --job-name=${JOB_NAME} --bg)
- export JOB_ID=$jobid
timeout: 6h

@@ -79,6 +79,7 @@ build and install:
- export TEST_MPIBIND_FLAG="--mpibind=off"
- export SPACK_ARCH=$(flux proxy ${JOB_ID} flux mini run -N 1 spack arch)
- export SPACK_ARCH_TARGET=$(flux proxy ${JOB_ID} flux mini run -N 1 spack arch -t)
- export EXTRA_FLUX_ARGS="-o pmi=pmix"
- !reference [.setup_lbann, script]
- flux proxy ${JOB_ID} .gitlab/common/run-catch-tests-flux.sh

@@ -97,7 +98,8 @@ unit tests:
- export OMP_NUM_THREADS=10
- "export FLUX_JOB_ID=$(flux jobs -no {id}:{name} | grep ${JOB_NAME} | awk -F: '{print $1}')"
- cd ci_test/unit_tests
- flux proxy ${FLUX_JOB_ID} lbann_pfe.sh -m pytest -s -vv --durations=0 --junitxml=results.xml
# - echo "Running unit tests with file pattern: ${TEST_FLAG}"
- flux proxy ${FLUX_JOB_ID} python3 -m pytest -s -vv --durations=0 --junitxml=results.xml ${TEST_FLAG}
artifacts:
when: always
paths:
@@ -114,15 +116,18 @@ integration tests:
stage: test
dependencies:
- build and install
rules:
- !reference [.lbann-test-rules, rules]
script:
- echo "== RUNNING PYTHON-BASED INTEGRATION TESTS =="
- echo "Testing $(which lbann)"
- export OMP_NUM_THREADS=10
- "export FLUX_JOB_ID=$(flux jobs -no {id}:{name} | grep ${JOB_NAME} | awk -F: '{print $1}')"
- cd ci_test/integration_tests
- export WEEKLY_FLAG=${WITH_WEEKLY:+--weekly}
- echo "python3 -m pytest -s -vv --durations=0 ${WEEKLY_FLAG} --junitxml=results.xml"
- flux proxy ${FLUX_JOB_ID} lbann_pfe.sh -m pytest -s -vv --durations=0 ${WEEKLY_FLAG} --junitxml=results.xml
# - echo "Running integration tests with file pattern: ${TEST_FLAG}"
# - echo "python3 -m pytest -s -vv --durations=0 ${WEEKLY_FLAG} --junitxml=results.xml ${TEST_FLAG}"
- flux proxy ${FLUX_JOB_ID} python3 -m pytest -s -vv --durations=0 ${WEEKLY_FLAG} --junitxml=results.xml ${TEST_FLAG}
artifacts:
when: always
paths: