Enable more jobs relying on the server's spack and module #1841

Merged
yhmtsai merged 13 commits into develop from spack_job on Sep 11, 2025

@yhmtsai yhmtsai commented May 12, 2025

Member

This PR enables more jobs to run on our server. They reuse the system's packages and spack installation rather than building everything through spack.

  • before_script: sets up the environment to use the system packages.
  • script: additionally accepts the environment variables MODULE_LOAD and SPACK_LOAD for loading packages. When non-empty, they are simply expanded to module load ${MODULE_LOAD} and spack load ${SPACK_LOAD}. The cuda module from spack does not set LD_LIBRARY_PATH, so the script extends LD_LIBRARY_PATH before test_install. RPATH_USE_LINK only helps the pure CPU build, not the HIP part.
  • image-tags: tags containing "s", such as nvidia-gpus, are the new setting. Do we have, or do we need, any additional tags specific to the server? -> using tum
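
The script behaviour described in the list above can be sketched roughly as follows. This is a minimal illustration, not the template from this PR: the job name `.tum_job_sketch`, the spack path `/opt/spack`, the `build` directory, and the use of `CUDA_HOME` to locate the CUDA libraries are all assumptions.

```yaml
# Hypothetical sketch only; names and paths are assumptions, not the PR's template.
.tum_job_sketch:
  variables:
    MODULE_LOAD: ""   # e.g. "cmake/3.18.6 cuda/12.0.1 gcc/12.4.0 openmpi/4.1.8"
    SPACK_LOAD: ""    # left empty when no spack packages are needed
  before_script:
    # Assumed location of the system spack; the real path depends on the server.
    - source /opt/spack/share/spack/setup-env.sh
  script:
    - |
      # Expand the variables only when they are non-empty.
      if [ -n "${MODULE_LOAD}" ]; then module load ${MODULE_LOAD}; fi
      if [ -n "${SPACK_LOAD}" ]; then spack load ${SPACK_LOAD}; fi
    # Assumes the project was already configured and built in ./build.
    - cd build
    - |
      # Assumption: CUDA_HOME points at the loaded CUDA installation.
      if [ -n "${CUDA_HOME}" ]; then
        export LD_LIBRARY_PATH="${CUDA_HOME}/lib64:${LD_LIBRARY_PATH}"
      fi
    - make test_install
```

A concrete job would then only override MODULE_LOAD or SPACK_LOAD in its own variables block, which keeps the per-job definitions short.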

This PR only extends the jobs that rely on the same Linux version as the server and on the system's packages.
Loading the system packages on a different Linux version or distribution might still work, but that is not the purpose of this PR.

TODO:

  • move rocky_tum to some registry? It is currently a local image for apptainer. It uses the same Linux version as the server's system, with the necessary components.
  • move the jobs to the appropriate place (for now they are shown together under the quick condition for easy checking)
  • add more versions from the current package sets

@yhmtsai yhmtsai requested review from a team, MarcelKoch, pratikvn and upsj May 12, 2025 13:46
@yhmtsai yhmtsai self-assigned this May 12, 2025
@yhmtsai yhmtsai added the labels reg:ci-cd (This is related to the continuous integration system.), 1:ST:need-feedback (The PR is somewhat ready but feedback on a blocking topic is required before a proper review.), and 1:ST:no-changelog-entry (Skip the wiki check for changelog update) on May 12, 2025
@yhmtsai yhmtsai force-pushed the spack_job branch 13 times, most recently from a3d12fd to 58c1e15 Compare May 15, 2025 15:43
yhmtsai added a commit that referenced this pull request May 17, 2025
…ion, workspace reallocation

This moves CI jobs and fixes cuda12.2 cusparse matrix, coo exception, workspace reallocation found from #1841

Related PR: #1843
@yhmtsai yhmtsai added the 1:ST:ready-for-review This PR is ready for review label May 30, 2025
@yhmtsai yhmtsai force-pushed the spack_job branch 2 times, most recently from faeb610 to 083f458 Compare August 27, 2025 11:32

@pratikvn pratikvn left a comment

Member

It's a bit weird that the OSX jobs are failing.

Comment thread .gitlab-ci.yml Outdated
yhmtsai commented Aug 28, 2025

Member Author

@pratikvn The OSX issue is tracked in #1924

@pratikvn pratikvn left a comment

Member

Mostly LGTM! I am a little concerned that so many jobs might overload the TUM system.

Comment thread .gitlab-ci.yml
Comment on lines +164 to +192
build/cuda120/openmpi/gcc/cuda/release/static:
  extends:
    - .build_and_test_tum_template
    - .default_variables
    - .full_test_condition
    - .use_tum-nvidia
  variables:
    BUILD_CUDA: "ON"
    BUILD_HWLOC: "OFF"
    ENABLE_HALF: "ON"
    BUILD_MPI: "ON"
    BUILD_SHARED_LIBS: "OFF"
    BUILD_TYPE: "Release"
    MODULE_LOAD: "cmake/3.18.6 cuda/12.0.1 gcc/12.4.0 openmpi/4.1.8"

build/cuda122/openmpi/gcc/cuda/release/shared:
  extends:
    - .build_and_test_tum_template
    - .default_variables
    - .full_test_condition
    - .use_tum-nvidia
  variables:
    BUILD_CUDA: "ON"
    BUILD_HWLOC: "OFF"
    ENABLE_HALF: "ON"
    BUILD_TYPE: "Release"
    MODULE_LOAD: "cmake/3.20.6 cuda/12.2.2 gcc/12.4.0 openmpi/5.0.7"

build/cuda124/mpich/gcc/cuda/release/shared:

Member

I feel that so many jobs might overload the TUM system. Maybe we can:

  1. Switch off MPI or CUDA wherever possible?
  2. Avoid testing multiple versions within the same CUDA major version?
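
As a rough illustration of the first option, a CPU-only variant of one of the jobs in this diff might look like the sketch below. The job name is hypothetical and not part of this PR, and the runner selection is left out because it is unclear which tag template would fit a job that does not need `.use_tum-nvidia`. The other template names and the module versions are copied from the jobs in the diff.

```yaml
# Hypothetical job for illustration; not part of this PR.
build/nocuda/nompi/gcc/release/static:
  extends:
    - .build_and_test_tum_template
    - .default_variables
    - .full_test_condition
    # Runner tag template omitted: which one fits a CPU-only job is an open question.
  variables:
    BUILD_CUDA: "OFF"
    BUILD_MPI: "OFF"
    BUILD_HWLOC: "OFF"
    ENABLE_HALF: "ON"
    BUILD_SHARED_LIBS: "OFF"
    BUILD_TYPE: "Release"
    MODULE_LOAD: "cmake/3.18.6 gcc/12.4.0"
```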

Member Author

That is outside the scope of this PR.
We will allow a single GPU to serve multiple instances.
Also, we are working on another project to make the setup cover more configurations with limited resources.

@yhmtsai yhmtsai added the label 1:ST:ready-to-merge (This PR is ready to merge.) and removed the label 1:ST:ready-for-review (This PR is ready for review) on Sep 11, 2025
@yhmtsai yhmtsai merged commit 36688a5 into develop Sep 11, 2025
19 of 21 checks passed
@yhmtsai yhmtsai deleted the spack_job branch September 11, 2025 16:12

Labels

  • 1:ST:no-changelog-entry (Skip the wiki check for changelog update)
  • 1:ST:ready-to-merge (This PR is ready to merge.)
  • 1:ST:run-full-test
  • reg:ci-cd (This is related to the continuous integration system.)

2 participants