Commit 9da46f3

Bugfix flux hangs on Corona and Lassen CI jobs. (#2398)
* Bugfix time limits on Corona CI jobs.
* Removed quotes from the lassen catch2 filesystem tests argument.
* Fix the invocation of the unit and integration tests in the lassen distconv CI pipeline.
* Fixed the LC contrib launcher to properly set the correct version of the PMI (pmix) field for the Corona system. This resolves hangs in flux as jobs are finalizing MPI.
* Reset time limits now that the Corona hang has been resolved.
* Renamed the cosmoflow integration test to have a _distconv.py suffix so that it is picked up by CI. Removed dead code.
1 parent 9e8057b commit 9da46f3

6 files changed: +6 -5 lines changed

.gitlab/common/common.yml

Lines changed: 1 addition & 0 deletions
@@ -116,6 +116,7 @@
 rules:
 - exists:
 - spack-ci-env-name.sh
+- ${LBANN_BUILD_PARENT_DIR}/build
 script:
 - echo "== CLEAN BUILD DIRECTORY =="
 - source spack-ci-env-name.sh

.gitlab/lassen/run_integration_tests.sh

Lines changed: 1 addition & 2 deletions
@@ -29,8 +29,7 @@ echo "Task: Intergation Tests"
 cd ci_test/integration_tests
 echo "Running integration tests with file pattern: ${TEST_FLAG}"
 export OMP_NUM_THREADS=10
-lrun -N ${LBANN_NNODES} -T $TEST_TASKS_PER_NODE lbann_pfe.sh -m pytest -s -vv --durations=0 ${WEEKLY_FLAG} --junitxml=results.xml ${TEST_FLAG}
-#lrun -N ${LBANN_NNODES} -T $TEST_TASKS_PER_NODE python3 -m pytest -s -vv --durations=0 ${WEEKLY_FLAG} --junitxml=results.xml ${TEST_FLAG}
+lbann_pfe.sh -m pytest -s -vv --durations=0 ${WEEKLY_FLAG} --junitxml=results.xml ${TEST_FLAG}
 status=$(($status + $?))

 echo "Task: Finished"

.gitlab/lassen/run_unit_tests.sh

Lines changed: 1 addition & 2 deletions
@@ -28,8 +28,7 @@ echo "Task: Unit Tests"
 cd ci_test/unit_tests
 echo "Running unit tests with file pattern: ${TEST_FLAG}"
 export OMP_NUM_THREADS=10
-lrun -N ${LBANN_NNODES} -T $TEST_TASKS_PER_NODE lbann_pfe.sh -m pytest -s -vv --durations=0 --junitxml=results.xml ${TEST_FLAG}
-#lrun -N ${LBANN_NNODES} -T $TEST_TASKS_PER_NODE python3 -m pytest -s -vv --durations=0 --junitxml=results.xml ${TEST_FLAG}
+lbann_pfe.sh -m pytest -s -vv --durations=0 --junitxml=results.xml ${TEST_FLAG}
 status=$(($status + $?))

 echo "Task: Finished"

ci_test/unit_tests/test_catch2_unit_tests.py

Lines changed: 1 addition & 1 deletion
@@ -93,7 +93,7 @@ def test_run_parallel_filesystem_catch_tests(cluster, dirname):
 mpi_launch = get_system_mpi_launch(cluster)
 mpi_output_file_name = 'mpi_filesystem_catch_tests_output-%s-rank=%%r-size=%%s.xml' % (cluster)
 mpi_output_file = os.path.join(output_dir, mpi_output_file_name)
-mpi_catch_args = [mpi_catch_exe, '"[filesystem]"', '-r', 'junit', '-o', mpi_output_file]
+mpi_catch_args = [mpi_catch_exe, '[filesystem]', '-r', 'junit', '-o', mpi_output_file]
 output = sp.run(mpi_launch + mpi_catch_args, cwd=build_dir)
 if output.returncode != 0:
 raise AssertionError('return_code={%d}' % output.returncode)
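
The quoting fix in this file matters because `sp.run` is given an argument list and no shell is involved, so quote characters inside a list element reach the child process verbatim. The following is a minimal sketch using only the standard library; the helper `echo_first_arg` is hypothetical and merely stands in for the Catch2 executable to show what the child actually receives.

```python
import subprocess as sp
import sys


def echo_first_arg(arg):
    """Run a child Python process that writes its first argument back.

    Hypothetical stand-in for the Catch2 test executable: it only shows
    what the child process sees in its argv.
    """
    child_code = "import sys; sys.stdout.write(sys.argv[1])"
    result = sp.run([sys.executable, "-c", child_code, arg],
                    capture_output=True, text=True, check=True)
    return result.stdout


# No shell strips the quotes, so they arrive as part of the argument.
quoted = echo_first_arg('"[filesystem]"')
unquoted = echo_first_arg('[filesystem]')
```

In the quoted case the executable is handed the quote characters as part of the test spec rather than the bare tag `[filesystem]`, which is presumably why the commit drops them.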

python/lbann/contrib/lc/launcher.py

Lines changed: 2 additions & 0 deletions
@@ -132,6 +132,8 @@ def prepend_environment_path(key, prefix):
 set_environment('MIOPEN_CUSTOM_CACHE_DIR', f'{tmpdir}/MIOpen_custom_cache')
 if os.getenv('ROCM_PATH') is not None:
 prepend_environment_path('LD_LIBRARY_PATH', os.path.join(os.getenv('ROCM_PATH'), 'llvm', 'lib'))
+# Ensure that the correct PMI is set to avoid hangs, etc.
+launcher_args.append('-o pmi=pmix')

 # Optimizations for Sierra-like systems
 if system in ('sierra', 'lassen', 'rzansel'):
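
The launcher change can be pictured as a per-system adjustment to the launcher argument list. The sketch below is illustrative only: `build_launcher_args`, the system-name check, and the `<existing-arg>` placeholder are assumptions and do not reflect the actual structure of `launcher.py`. Only the `-o pmi=pmix` argument and its association with the Corona system come from the commit.

```python
def build_launcher_args(system, base_args=None):
    """Return launcher arguments for a system (hypothetical helper)."""
    launcher_args = list(base_args or [])
    # Assumption: the system is identified by a lowercase name string.
    if system == 'corona':
        # From the commit: set the PMI to pmix to avoid hangs
        # while jobs are finalizing MPI.
        launcher_args.append('-o pmi=pmix')
    return launcher_args


# '<existing-arg>' is a placeholder, not a real launcher flag.
corona_args = build_launcher_args('corona', ['<existing-arg>'])
lassen_args = build_launcher_args('lassen', ['<existing-arg>'])
```

Copying `base_args` before appending keeps the caller's list unchanged, so the same base arguments can be reused across systems.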
