Framework/CI: AT2/cuda12 and AT2/cuda12-uvm hanging during build #14641

@brian-kelley

Description

In the AT2 testing for #14616, the build steps are hanging near the end. For example, this was cuda12:

2025-10-30T19:23:50.7457118Z [25548/26301] Linking CXX executable packages/panzer/disc-fe/example/SacadoExample/PanzerDiscFE_SacadoExample.exe
2025-10-30T19:23:51.0768242Z [25549/26301] Linking CXX executable packages/adelus/test/vector_random/Adelus_vector_random.exe
2025-10-30T19:23:51.1183653Z [25550/26301] Linking CXX executable packages/adelus/test/vector_random_fs/Adelus_vector_random_fs.exe
2025-10-30T19:23:51.2428391Z [25551/26301] Linking CXX executable packages/adelus/test/vector_random_mc/Adelus_vector_random_mc.exe
2025-10-30T19:23:51.3134471Z [25552/26301] Linking CXX executable packages/panzer/disc-fe/test/evaluator_tests/PanzerDiscFE_IntegratorScalar.exe
2025-10-30T19:23:51.3311657Z [25553/26301] Linking CXX executable packages/panzer/disc-fe/test/evaluator_tests/PanzerDiscFE_GatherCoordinates.exe
2025-10-30T19:23:51.4403498Z [25554/26301] Linking CXX executable packages/panzer/disc-fe/test/evaluator_tests/PanzerDiscFE_DOF_PointFields.exe
2025-10-30T19:23:51.7232788Z [25555/26301] Linking CXX executable packages/panzer/disc-fe/test/evaluator_tests/PanzerDiscFE_DOF_BasisToBasis.exe
2025-10-30T19:23:51.9332544Z [25556/26301] Linking CXX executable packages/panzer/disc-fe/test/local_mesh/PanzerDiscFE_LocalMesh_Tests.exe
2025-10-30T19:24:12.8767536Z [25557/26301] Building CXX object packages/panzer/adapters-stk/test/projection/CMakeFiles/PanzerAdaptersSTK_projection.dir/projection.cpp.o
2025-10-30T19:24:53.3823564Z [25558/26301] Building CXX object packages/panzer/adapters-stk/test/projection/CMakeFiles/PanzerAdaptersSTK_ProjectField.dir/project_field.cpp.o
2025-10-31T00:07:18.1604377Z ##[error]The operation was canceled.
2025-10-31T00:07:18.1700774Z Post job cleanup.
2025-10-31T00:07:18.2454609Z [command]/usr/bin/git version
2025-10-31T00:07:18.2493644Z git version 2.43.7
2025-10-31T00:07:18.2530699Z Temporarily overriding HOME='/home/runner/_work/_temp/6808d949-ff85-47f0-a2fb-3a803ddf3ea2' before making global git config change

and this was cuda12-uvm:

2025-10-30T18:44:27.9149814Z [6598/6648] Linking CXX static library packages/panzer/disc-fe/src/libpanzer-disc-fe.a
2025-10-30T18:44:29.6362349Z [6599/6648] Building CXX object packages/percept/CMakeFiles/perceptX.dir/src/adapt/main/MeshAdapt.cpp.o
2025-10-30T18:44:29.6364461Z ../../runner/_work/Trilinos/Trilinos/packages/stk/stk_expreval/stk_expreval/Function.hpp(286): warning #20013-D: calling a constexpr __host__ function("min") from a __host__ __device__ function("time_space_normal") is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this.
2025-10-30T18:44:29.6368770Z     static const double epsilon = std::numeric_limits<double>::min();
2025-10-30T18:44:29.6370957Z                                   ^
2025-10-30T18:44:29.6371465Z 
2025-10-30T18:44:29.6371729Z Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
2025-10-30T18:44:29.6372142Z 
2025-10-30T18:44:31.3587213Z [6600/6648] Linking CXX executable packages/percept/perceptX
2025-10-30T18:45:13.6964299Z [6601/6648] Building CXX object packages/panzer/adapters-stk/src/CMakeFiles/panzer-stk.dir/evaluators/Panzer_STK_ProjectField.cpp.o
2025-10-30T18:48:23.5545935Z [6602/6648] Building CXX object packages/stokhos/src/CMakeFiles/stokhos_muelu_mp_16_cuda.dir/MueLu_ScalarDroppingDistanceLaplacian_MP_Vector_16_Cuda.cpp.o
... some more warnings are emitted...
2025-10-30T18:53:09.3125001Z 
2025-10-31T00:11:12.0857979Z ##[error]The operation was canceled.
2025-10-31T00:11:12.0950966Z Post job cleanup.
2025-10-31T00:11:12.1693800Z [command]/usr/bin/git version
2025-10-31T00:11:12.1736439Z git version 2.43.7

Looking at the timestamps of the output, both jobs produce no new output for about 5 hours, and then the 6-hour timeout kills them. This has happened reliably for 3 or 4 testing attempts, and each time the jobs get stuck close to the end of the build.
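For anyone who wants to confirm the stall from a downloaded raw log rather than by eye, here is a minimal sketch in Python. It assumes each log line starts with the ISO-style timestamp seen in the excerpts above; the function name `largest_gap` and the `sample` list are hypothetical, and the sample lines are copied from the cuda12 excerpt.

```python
import re
from datetime import datetime, timedelta

# Leading timestamp as it appears in the log excerpts, e.g.
# 2025-10-30T19:24:53.3823564Z (fractional seconds are ignored).
TS = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.\d+Z")

def largest_gap(lines):
    """Return (gap, line_before, line_after) for the longest silence in the log."""
    best = (timedelta(0), None, None)
    prev_time, prev_line = None, None
    for line in lines:
        m = TS.match(line)
        if not m:
            continue  # skip lines that do not start with a timestamp
        t = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S")
        if prev_time is not None and t - prev_time > best[0]:
            best = (t - prev_time, prev_line, line)
        prev_time, prev_line = t, line
    return best

# Three consecutive lines from the cuda12 log above.
sample = [
    "2025-10-30T19:24:12.8767536Z [25557/26301] Building CXX object packages/panzer/adapters-stk/test/projection/CMakeFiles/PanzerAdaptersSTK_projection.dir/projection.cpp.o",
    "2025-10-30T19:24:53.3823564Z [25558/26301] Building CXX object packages/panzer/adapters-stk/test/projection/CMakeFiles/PanzerAdaptersSTK_ProjectField.dir/project_field.cpp.o",
    "2025-10-31T00:07:18.1604377Z ##[error]The operation was canceled.",
]

gap, before, after = largest_gap(sample)
print("longest silence:", gap)
print("last line before it:", before)
```

The line reported as `before` is the last build step that printed anything before the stall. With a parallel build, the step that is actually stuck may be a different one that was still running at that point, so this only narrows the search.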

Metadata

Assignees: No one assigned

Labels: PA: Framework, autotester