Commit b75a718

bvanessen and benson31 authored

Ci enable distconv (#2235)
* Enable CI testing for DistConv:
  * Added DistConv CI tests.
  * Added Corona DistConv test and disabled FFT on ROCm.
  * Ensure that DistConv tests keep error signals.
  * Enable NVSHMEM on Lassen.
  * Added a multi-stage pipeline for Lassen.
  * Fixed a typo and disabled other tests.
  * Added Spack environment.
  * Added check stage for the Catch2 tests.
  * Added the definition of the RESULTS_DIR environment variable.
  * Added release notes.
  * Fixed the launcher for Catch2 tests.
  * Changed the batch launch commands to be interactive to block completion.
  * Added a wrapper shell script for launching the unit tests.
  * Added the number of nodes for the unit test.
  * Cleaned up launching paths.
  * Added execute permissions for the unit test script.
  * Ingest the Spack dependent environment information.
  * Fixed the launch command and the exclusion of externallayer.
  * Bugfix for Python.
  * Added number of tasks per node.
  * Added integration tests.
  * Set some NVSHMEM runtime variables.
  * Uniquify the CI JOB_NAME fields for DistConv tests.
* Re-introduced the WITH_CLEAN_BUILD flag.
* Adapting new tests to the new build script framework with modules.
* Increased the time limit for the build on Lassen. Code cleanup.
* Removed the duplicate get_distconv_environment function in the ci_test common Python tools. Switched all tests to using the standard contrib args version.
* Changed the default behavior on MIOpen systems to use a local cache for JIT state.
* Added back note about existing issue in DiHydrogen.
* Enable CI runs to specify a subset of unit tests to run.
* Tweaking the allowed runtimes for tests.
* Debugging the test selection. Increasing some test time limits.
* Added test filter flags to all systems.
* Increasing time limits.
* Added flags to skip integration tests on DistConv CI runs.
* Bumped up the pooling time limit.
* Testing out setting a set of MIOpen DB cache directories for CI testing, both for normal users and lbannusr.
* Adding caching options for Corona and changed how the username is queried.
* Updated CI tests to use common MIOpen caches. Split user and custom cache paths.
* Fix the Lassen multi-stage pipeline to record the Spack architecture.
* Increase the build time limit on Lassen.
* Fixed the new Lassen build to avoid installing pytest through Spack.
* Added the clean build flags into the multi-stage pipeline.
* Skip failing tests in DistConv.
* Change the test utils to set the cluster value to None, rather than "unset", if it is not set.
* Added support for passing in the system cluster name by default if it is known.
* Cleanup of the paths for the MIOpen caches.
* Added a guard to skip the in-place test if DistConv is disabled.
* Removing unnecessary variable definitions.
* ResNet tests should run on Corona.
* Added support in the data coordinator for explicitly recording the dimensions of each field. This is then compared to the dimensions reported from the data reader, or, if there are no valid data readers, it can be substituted. Note that for the time being this is redundant information, but it allows the Catch2 test to properly construct a model without data readers. This fixes a bug seen on Corona where the MPI Catch2 tests were failing because they allocated a set of buffers with a size of -1. Cleaned up the way in which the data coordinator checks for linearized size to reduce code duplication. Switched the data type of the data field dimensions to use El::Int rather than int values. Added a utility function to type cast between two vectors.
* Force Lassen to clean build.
* Fixed the legacy HDF5 data reader.
* Increased the timeout for the Lassen build and test.
* Bumped up the time limit on the Catch2 tests for ROCm systems.
* Increase the Catch2 test sizes.
* Trying to avoid forcing static linking when using NVSHMEM.
* Changed the run-catch-tests script for Flux to use a flux proxy.
* Export the LBANN setup for Lassen unit and integration tests.
* Minimize what is saved from the Catch2 unit tests.
* Cleaning up the environment variables.
* Added a flag to extend the Spack env name.
* Tweaking the flux proxy.
* Change how the NVSHMEM variables are set up so that the .before_script sections do not collide.
* Removed the -o cpu-affinity=per-task flag from the flux run commands on the Catch2 tests because it was causing a hang on Corona. Removed the nested flux proxy commands in the Flux Catch2 tests, since they should be unnecessary due to the flux proxy command that invokes the script.
* Tweak the flux commands to resolve a hang on Corona Catch2 tests.
* Cleaning up the flux launch commands on Tioga and Corona to help avoid a hang.
* Added a job name suffix variable.
* Ensure that the Spack environment names are unique.
* Tightened up the inclusion of the LBANN Python packages to avoid conflicts when using the test_compiler script to build LBANN.
* Added support to pip install into the LBANN build directory. Removed setting the --test=root command from the extra root packages to avoid triggering Spack build failures on Power systems.
* Updated the baseline modules used on Corona and package versions on Lassen.
* Fixing the allocation flux command for Tioga.
* Changing it so that only Corona adds the -o pmi=pmix flags to flux.
* Enable module generation for multiple core compilers.
* Making the flux commands consistent.
* Applied clang-format.
* Fixed the compiler path on Pascal.
* Re-enable the Lassen multi-stage DistConv test pipeline.
* Fixed how the new Lassen DistConv tests are invoked and avoid erroneously re-setting up the Spack environment. Changed the saved Spack environment name to SPACK_ENV_NAME. Cleaned up some dead code.
* Added a second if clause to the integration tests so that there is always at least one true clause, so the stage will schedule. Fixed the regex so that the distconv substring doesn't have to come at the start of the string.
* Consolidated the rules clause into a common one.
* Fix the rules regex.
* Added Corona numbers for ResNet.
* Tweaking the CI rules to avoid integration tests on DistConv builds.
* Tweaking how the Lassen unit tests are called.
* Disable NVSHMEM build on Lassen. Code cleanup and adding suggestions.
* Changed the guard in the ResNet-50 test.
* Disable NVSHMEM environment variables.
* Disabled Lassen DistConv unit tests.
* Apply suggestions from code review.

Co-authored-by: Tom Benson <[email protected]>
1 parent 73ef72b commit b75a718

File tree

69 files changed (+678, -231 lines)


.gitlab-ci.yml

Lines changed: 53 additions & 0 deletions

@@ -40,6 +40,19 @@ corona testing:
     strategy: depend
     include: .gitlab/corona/pipeline.yml
 
+corona distconv testing:
+  stage: run-all-clusters
+  variables:
+    JOB_NAME_SUFFIX: _distconv
+    SPACK_ENV_BASE_NAME_MODIFIER: "-distconv"
+    SPACK_SPECS: "+rocm +distconv"
+    WITH_WEEKLY: "${LBANN_CI_RUN_WEEKLY}"
+    WITH_CLEAN_BUILD: "${LBANN_CI_CLEAN_BUILD}"
+    TEST_FLAG: "test_*_distconv.py"
+  trigger:
+    strategy: depend
+    include: .gitlab/corona/pipeline.yml
+
 lassen testing:
   stage: run-all-clusters
   variables:
@@ -49,6 +62,20 @@ lassen testing:
     strategy: depend
     include: .gitlab/lassen/pipeline.yml
 
+lassen distconv testing:
+  stage: run-all-clusters
+  variables:
+    JOB_NAME_SUFFIX: _distconv
+    SPACK_ENV_BASE_NAME_MODIFIER: "-multi-stage-distconv"
+    SPACK_SPECS: "+cuda +distconv +fft"
+    # SPACK_SPECS: "+cuda +distconv +nvshmem +fft"
+    WITH_WEEKLY: "${LBANN_CI_RUN_WEEKLY}"
+    WITH_CLEAN_BUILD: "${LBANN_CI_CLEAN_BUILD}"
+    TEST_FLAG: "test_*_distconv.py"
+  trigger:
+    strategy: depend
+    include: .gitlab/lassen/multi_stage_pipeline.yml
+
 pascal testing:
   stage: run-all-clusters
   variables:
@@ -68,6 +95,19 @@ pascal compiler testing:
     strategy: depend
     include: .gitlab/pascal/pipeline_compiler_tests.yml
 
+pascal distconv testing:
+  stage: run-all-clusters
+  variables:
+    JOB_NAME_SUFFIX: _distconv
+    SPACK_SPECS: "%[email protected] +cuda +distconv +fft"
+    BUILD_SCRIPT_OPTIONS: "--no-default-mirrors"
+    WITH_WEEKLY: "${LBANN_CI_RUN_WEEKLY}"
+    WITH_CLEAN_BUILD: "${LBANN_CI_CLEAN_BUILD}"
+    TEST_FLAG: "test_*_distconv.py"
+  trigger:
+    strategy: depend
+    include: .gitlab/pascal/pipeline.yml
+
 tioga testing:
   stage: run-all-clusters
   variables:
@@ -76,3 +116,16 @@ tioga testing:
   trigger:
     strategy: depend
     include: .gitlab/tioga/pipeline.yml
+
+tioga distconv testing:
+  stage: run-all-clusters
+  variables:
+    JOB_NAME_SUFFIX: _distconv
+    SPACK_ENV_BASE_NAME_MODIFIER: "-distconv"
+    SPACK_SPECS: "+rocm +distconv"
+    WITH_WEEKLY: "${LBANN_CI_RUN_WEEKLY}"
+    WITH_CLEAN_BUILD: "${LBANN_CI_CLEAN_BUILD}"
+    TEST_FLAG: "test_*_distconv.py"
+  trigger:
+    strategy: depend
+    include: .gitlab/tioga/pipeline.yml
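Each new `* distconv testing` job sets `TEST_FLAG: "test_*_distconv.py"`, which the downstream pipelines pass straight to pytest so that only DistConv test files are collected. The glob's effect can be sketched in plain shell (the file names below are made up):

```shell
# Scratch directory with a mix of hypothetical CI test files.
tmp=$(mktemp -d)
touch "${tmp}/test_unit_layer_pooling.py" \
      "${tmp}/test_unit_layer_pooling_distconv.py" \
      "${tmp}/test_integration_resnet_distconv.py"
cd "${tmp}"

TEST_FLAG="test_*_distconv.py"
# Unquoted expansion is what pytest would receive as its file arguments.
selected=$(echo ${TEST_FLAG})
echo "${selected}"
```

Jobs that do not set `TEST_FLAG` expand it to an empty string, so pytest falls back to collecting every test.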

.gitlab/common/common.yml

Lines changed: 10 additions & 4 deletions

@@ -29,11 +29,11 @@
   variables:
     # This is based on the assumption that each runner will only ever
     # be able to run one pipeline on a given cluster at one time.
-    SPACK_ENV_BASE_NAME: gitlab-${CI_COMMIT_BRANCH}-${GITLAB_USER_LOGIN}-${SYSTEM_NAME}${SPACK_ENV_BASE_NAME_EXTENSION}-${CI_RUNNER_SHORT_TOKEN}
+    SPACK_ENV_BASE_NAME: gitlab${SPACK_ENV_BASE_NAME_MODIFIER}-${CI_COMMIT_BRANCH}-${GITLAB_USER_LOGIN}-${SYSTEM_NAME}${SPACK_ENV_BASE_NAME_EXTENSION}-${CI_RUNNER_SHORT_TOKEN}
 
     # This variable is the name used to identify the job in the Slurm
     # queue. We need this to be able to access the correct jobid.
-    JOB_NAME: ${CI_PROJECT_NAME}_${CI_PIPELINE_ID}
+    JOB_NAME: ${CI_PROJECT_NAME}_${CI_PIPELINE_ID}${JOB_NAME_SUFFIX}
 
     # This is needed to ensure that we run as lbannusr.
     LLNL_SERVICE_USER: lbannusr
@@ -105,7 +105,7 @@
     - ml use ${LBANN_MODFILES_DIR}
     - ml load lbann
     - echo "Using LBANN binary $(which lbann)"
-    - echo "export SPACK_DEP_ENV_NAME=${SPACK_ENV_NAME}" > spack-ci-env-name.sh
+    - echo "export SPACK_ENV_NAME=${SPACK_ENV_NAME}" > spack-ci-env-name.sh
     - echo "export SPACK_ARCH=${SPACK_ARCH}" >> spack-ci-env-name.sh
     - echo "export SPACK_ARCH_TARGET=${SPACK_ARCH_TARGET}" >> spack-ci-env-name.sh
     - echo "export LBANN_BUILD_PARENT_DIR=${LBANN_BUILD_PARENT_DIR}" >> spack-ci-env-name.sh
@@ -137,7 +137,13 @@
       - builds/lbann_${SYSTEM_NAME}_${SPACK_ENV_BASE_NAME}-*${CI_CONCURRENT_ID}-*/*.cmake
       - builds/lbann_${SYSTEM_NAME}_${SPACK_ENV_BASE_NAME}-*${CI_CONCURRENT_ID}-*/build/CMakeCache.txt
       - builds/lbann_${SYSTEM_NAME}_${SPACK_ENV_BASE_NAME}-*${CI_CONCURRENT_ID}-*/build/build.ninja
-      - builds/lbann_${SYSTEM_NAME}_${SPACK_ENV_BASE_NAME}-*${CI_CONCURRENT_ID}-*/build/unit_test/*
       - ${RESULTS_DIR}/*
     exclude:
       - builds/lbann_${SYSTEM_NAME}_${SPACK_ENV_BASE_NAME}-*${CI_CONCURRENT_ID}-*/build/**/*.o
+      - builds/lbann_${SYSTEM_NAME}_${SPACK_ENV_BASE_NAME}-*${CI_CONCURRENT_ID}-*/build/unit_test/*
+
+.lbann-test-rules:
+  rules:
+    - if: $JOB_NAME_SUFFIX == "_distconv"
+      when: never
+    - if: $CI_MERGE_REQUEST_SOURCE_BRANCH_NAME == $CI_MERGE_REQUEST_SOURCE_BRANCH_NAME
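The new `.lbann-test-rules` anchor gates a stage off on DistConv pipelines; its second clause compares a variable to itself, so it is always true and guarantees the stage still schedules on every other pipeline. A job opts in with a `!reference`, as the Corona integration tests do. A hypothetical consumer would look like this (the job name and script are illustrative):

```yaml
# Hypothetical job reusing the shared rules anchor.
example integration tests:
  stage: test
  rules:
    - !reference [.lbann-test-rules, rules]  # skipped when JOB_NAME_SUFFIX is "_distconv"
  script:
    - python3 -m pytest -s -vv --junitxml=results.xml ${TEST_FLAG}
```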

.gitlab/common/run-catch-tests-flux.sh

Lines changed: 3 additions & 8 deletions

@@ -52,14 +52,9 @@ export LD_LIBRARY_PATH=${ROCM_PATH}/lib:${LD_LIBRARY_PATH}
 
 cd ${LBANN_BUILD_DIR}
 
-flux run --label-io -n4 -N2 -g 1 -o cpu-affinity=per-task -o gpu-affinity=per-task sh -c 'taskset -cp $$; printenv | grep VISIBLE' | sort
-
-flux run --label-io -n4 -N2 -g 1 -o cpu-affinity=off -o gpu-affinity=per-task sh -c 'taskset -cp $$; printenv | grep VISIBLE' | sort
-
 echo "Running sequential catch tests"
 
-flux run -N 1 -n 1 -g 1 -t 5m \
+flux run -N 1 -n 1 --exclusive -o nosetpgrp ${EXTRA_FLUX_ARGS} -t 5m \
     ./unit_test/seq-catch-tests \
     -r JUnit \
     -o ${OUTPUT_DIR}/seq-catch-results.xml
@@ -71,7 +66,7 @@ echo "Running MPI catch tests with ${LBANN_NNODES} nodes and ${TEST_TASKS_PER_NODE} tasks per node"
 
 flux run \
     -N ${LBANN_NNODES} -n $((${TEST_TASKS_PER_NODE} * ${LBANN_NNODES})) \
-    -g 1 -t 5m -o gpu-affinity=per-task -o cpu-affinity=per-task -o mpibind=off \
+    -t 5m --exclusive -o nosetpgrp ${EXTRA_FLUX_ARGS} \
     ./unit_test/mpi-catch-tests "exclude:[random]" "exclude:[filesystem]" \
     -r JUnit \
     -o "${OUTPUT_DIR}/mpi-catch-results-rank=%r-size=%s.xml"
@@ -83,7 +78,7 @@ echo "Running MPI filesystem catch tests"
 
 flux run \
     -N ${LBANN_NNODES} -n $((${TEST_TASKS_PER_NODE} * ${LBANN_NNODES})) \
-    -g 1 -t 5m -o gpu-affinity=per-task -o cpu-affinity=per-task -o mpibind=off \
+    -t 5m --exclusive -o nosetpgrp ${EXTRA_FLUX_ARGS} \
     ./unit_test/mpi-catch-tests -s "[filesystem]" \
     -r JUnit \
     -o "${OUTPUT_DIR}/mpi-catch-filesystem-results-rank=%r-size=%s.xml"
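The `-n` task count in these `flux run` invocations is plain shell arithmetic over the node count; for example, with the non-weekly defaults used elsewhere in this commit:

```shell
# Reproduce the task-count arithmetic from the flux run commands.
LBANN_NNODES=2          # non-weekly default node count
TEST_TASKS_PER_NODE=4   # example value; set per system in CI
ntasks=$((${TEST_TASKS_PER_NODE} * ${LBANN_NNODES}))
echo "${ntasks}"   # 8
```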
(new file; name not shown)

Lines changed: 78 additions & 0 deletions

+################################################################################
+## Copyright (c) 2014-2023, Lawrence Livermore National Security, LLC.
+## Produced at the Lawrence Livermore National Laboratory.
+## Written by the LBANN Research Team (B. Van Essen, et al.) listed in
+## the CONTRIBUTORS file. <[email protected]>
+##
+## LLNL-CODE-697807.
+## All rights reserved.
+##
+## This file is part of LBANN: Livermore Big Artificial Neural Network
+## Toolkit. For details, see http://software.llnl.gov/LBANN or
+## https://github.com/LLNL/LBANN.
+##
+## Licensed under the Apache License, Version 2.0 (the "Licensee"); you
+## may not use this file except in compliance with the License. You may
+## obtain a copy of the License at:
+##
+## http://www.apache.org/licenses/LICENSE-2.0
+##
+## Unless required by applicable law or agreed to in writing, software
+## distributed under the License is distributed on an "AS IS" BASIS,
+## WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+## implied. See the License for the specific language governing
+## permissions and limitations under the license.
+################################################################################
+
+#!/bin/bash
+cd ${LBANN_BUILD_DIR}
+
+# Configure the output directory
+OUTPUT_DIR=${CI_PROJECT_DIR}/${RESULTS_DIR}
+if [[ -d ${OUTPUT_DIR} ]];
+then
+    rm -rf ${OUTPUT_DIR}
+fi
+mkdir -p ${OUTPUT_DIR}
+
+FAILED_JOBS=""
+
+lrun -N 1 -n 1 -W 5 \
+    ./unit_test/seq-catch-tests \
+    -r JUnit \
+    -o ${OUTPUT_DIR}/seq-catch-results.xml
+if [[ $? -ne 0 ]]; then
+    FAILED_JOBS+=" seq"
+fi
+
+lrun -N ${LBANN_NNODES} -n $(($TEST_TASKS_PER_NODE * ${LBANN_NNODES})) \
+    -T $TEST_TASKS_PER_NODE \
+    -W 5 ${TEST_MPIBIND_FLAG} \
+    ./unit_test/mpi-catch-tests "exclude:[externallayer]" "exclude:[filesystem]" \
+    -r JUnit \
+    -o "${OUTPUT_DIR}/mpi-catch-results-rank=%r-size=%s.xml"
+if [[ $? -ne 0 ]]; then
+    FAILED_JOBS+=" mpi"
+fi
+
+lrun -N ${LBANN_NNODES} -n $(($TEST_TASKS_PER_NODE * ${LBANN_NNODES})) \
+    -T $TEST_TASKS_PER_NODE \
+    -W 5 ${TEST_MPIBIND_FLAG} \
+    ./unit_test/mpi-catch-tests "[filesystem]" \
+    -r JUnit \
+    -o "${OUTPUT_DIR}/mpi-catch-filesystem-results-rank=%r-size=%s.xml"
+if [[ $? -ne 0 ]];
+then
+    FAILED_JOBS+=" mpi-filesystem"
+fi
+
+# Try to write a semi-useful message to this file since it's being
+# saved as an artifact. It's not completely outside the realm that
+# someone would look at it.
+if [[ -n "${FAILED_JOBS}" ]];
+then
+    echo "Some Catch2 tests failed:${FAILED_JOBS}" > ${OUTPUT_DIR}/catch-tests-failed.txt
+fi
+
+# Return "success" so that the pytest-based testing can run.
+exit 0
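The script above deliberately accumulates failures instead of exiting at the first non-zero status, so every Catch2 group runs and the trailing pytest stages still get scheduled. The same pattern in isolation (group names and test commands are placeholders):

```shell
# Accumulate the names of failing test groups rather than exiting early.
FAILED_JOBS=""

run_group() {
  local name="$1"; shift
  "$@"                      # run the group's test command
  if [[ $? -ne 0 ]]; then   # on failure, record the name and keep going
    FAILED_JOBS+=" ${name}"
  fi
}

run_group seq true              # placeholder command that succeeds
run_group mpi false             # placeholder command that fails
run_group mpi-filesystem true   # placeholder command that succeeds

if [[ -n "${FAILED_JOBS}" ]]; then
  echo "Some Catch2 tests failed:${FAILED_JOBS}"
fi
# The real script then exits 0 so the pytest-based testing can still run.
```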

.gitlab/corona/pipeline.yml

Lines changed: 9 additions & 4 deletions

@@ -55,7 +55,7 @@ allocate lc resources:
     - export TEST_TIME=$([[ -n "${WITH_WEEKLY}" ]] && echo "150m" || echo "120m")
     - export LBANN_NNODES=$([[ -n "${WITH_WEEKLY}" ]] && echo "4" || echo "2")
     - export FLUX_F58_FORCE_ASCII=t
-    - jobid=$(flux --parent alloc -N ${LBANN_NNODES} -g 1 -t ${TEST_TIME} --job-name=${JOB_NAME} --bg)
+    - jobid=$(flux --parent alloc -N ${LBANN_NNODES} --exclusive -t ${TEST_TIME} --job-name=${JOB_NAME} --bg)
     - export JOB_ID=$jobid
   timeout: 6h
 
@@ -79,6 +79,7 @@ build and install:
     - export TEST_MPIBIND_FLAG="--mpibind=off"
     - export SPACK_ARCH=$(flux proxy ${JOB_ID} flux mini run -N 1 spack arch)
     - export SPACK_ARCH_TARGET=$(flux proxy ${JOB_ID} flux mini run -N 1 spack arch -t)
+    - export EXTRA_FLUX_ARGS="-o pmi=pmix"
     - !reference [.setup_lbann, script]
     - flux proxy ${JOB_ID} .gitlab/common/run-catch-tests-flux.sh
 
@@ -97,7 +98,8 @@ unit tests:
     - export OMP_NUM_THREADS=10
     - "export FLUX_JOB_ID=$(flux jobs -no {id}:{name} | grep ${JOB_NAME} | awk -F: '{print $1}')"
     - cd ci_test/unit_tests
-    - flux proxy ${FLUX_JOB_ID} lbann_pfe.sh -m pytest -s -vv --durations=0 --junitxml=results.xml
+    # - echo "Running unit tests with file pattern: ${TEST_FLAG}"
+    - flux proxy ${FLUX_JOB_ID} python3 -m pytest -s -vv --durations=0 --junitxml=results.xml ${TEST_FLAG}
   artifacts:
     when: always
     paths:
@@ -114,15 +116,18 @@ integration tests:
   stage: test
   dependencies:
     - build and install
+  rules:
+    - !reference [.lbann-test-rules, rules]
   script:
     - echo "== RUNNING PYTHON-BASED INTEGRATION TESTS =="
     - echo "Testing $(which lbann)"
     - export OMP_NUM_THREADS=10
     - "export FLUX_JOB_ID=$(flux jobs -no {id}:{name} | grep ${JOB_NAME} | awk -F: '{print $1}')"
     - cd ci_test/integration_tests
     - export WEEKLY_FLAG=${WITH_WEEKLY:+--weekly}
-    - echo "python3 -m pytest -s -vv --durations=0 ${WEEKLY_FLAG} --junitxml=results.xml"
-    - flux proxy ${FLUX_JOB_ID} lbann_pfe.sh -m pytest -s -vv --durations=0 ${WEEKLY_FLAG} --junitxml=results.xml
+    # - echo "Running integration tests with file pattern: ${TEST_FLAG}"
+    # - echo "python3 -m pytest -s -vv --durations=0 ${WEEKLY_FLAG} --junitxml=results.xml ${TEST_FLAG}"
+    - flux proxy ${FLUX_JOB_ID} python3 -m pytest -s -vv --durations=0 ${WEEKLY_FLAG} --junitxml=results.xml ${TEST_FLAG}
   artifacts:
     when: always
     paths:
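Both test stages recover the background allocation's ID by listing Flux jobs as `id:name` pairs and grepping for `JOB_NAME` — which is why the new `JOB_NAME_SUFFIX` matters: without it, a plain and a `_distconv` pipeline on the same cluster would match the same name. The parsing step can be sketched with a fabricated job listing:

```shell
# Fabricated stand-in for `flux jobs -no {id}:{name}` output.
job_list='f1234:lbann_9902
f5678:lbann_9902_distconv'

JOB_NAME="lbann_9902_distconv"
# Same grep/awk pipeline the CI scripts use to extract the job ID.
FLUX_JOB_ID=$(printf '%s\n' "${job_list}" | grep "${JOB_NAME}" | awk -F: '{print $1}')
echo "${FLUX_JOB_ID}"   # f5678
```

Note the converse case: grepping for the unsuffixed name `lbann_9902` would match both lines and yield two IDs, which is exactly the ambiguity the suffix avoids.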
