
CI execution over Slurm managed clusters#1231

Open
MrBr-github wants to merge 7 commits into openucx:master from MrBr-github:master

Conversation

@MrBr-github

What

This PR adds execution over scctl-managed Slurm clusters.

slurm_cmd.sh - provides an abstraction for executing Slurm-based commands
run_slurm_tests_ucc.sh - basic test file
fix_enroot.sh - mitigates an enroot issue with errors on anonymous image access

Issue: HPCINFRA-3983

Lays the groundwork for further changes.

Issue: HPCINFRA-3983

Signed-off-by: Michael Braverman <michaelbr@nvidia.com>
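The abstraction described above can be sketched roughly as follows. This is an illustrative guess at the shape of slurm_cmd.sh, not the PR's actual contents; `slurm_build_cmd`, `SLURM_PARTITION`, and `SLURM_NODES` are hypothetical names:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a slurm_cmd.sh-style abstraction: build the srun
# invocation from environment knobs instead of hard-coding it in each test.
# SLURM_PARTITION, SLURM_NODES, and slurm_build_cmd are illustrative names,
# not necessarily the interface used in the PR.
set -euo pipefail

SLURM_PARTITION="${SLURM_PARTITION:-batch}"
SLURM_NODES="${SLURM_NODES:-2}"

# Compose (but do not execute) the srun command line for a given test script.
slurm_build_cmd() {
    local test_script="$1"
    echo "srun -p ${SLURM_PARTITION} -N ${SLURM_NODES} ${test_script}"
}

slurm_build_cmd ./run_slurm_tests_ucc.sh
```

Keeping the command construction in one place lets the test files stay agnostic to which cluster or partition they are running on.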
@greptile-apps
Contributor

greptile-apps bot commented Nov 30, 2025

Skipped: This PR does not contain any of your configured labels: (Ready-For-Review)

Add support for slurm clusters managed by scctl

slurm_cmd.sh - provides abstraction to execute slurm based commands
run_slurm_tests_ucc.sh - basic test file
fix_enroot.sh - mitigates an enroot issue with errors on anonymous image access

Issue: HPCINFRA-3983

Signed-off-by: Michael Braverman <michaelbr@nvidia.com>

@dpressle dpressle left a comment


The overall design is unclear: the main goal here is to run NVLS tests, not Slurm tests. Slurm is just a way to run these tests on specific HW/clusters, yet it is treated as the main subject in the code.

Collaborator


Use the existing JJB YAML; we don't want to duplicate it, as many variables are common to all jobs.

@mike-dubman
Contributor

Very cool.
IMHO it should use an opportunistic approach, i.e. try to use scctl to submit (check if the desired number of nodes is available, like a dry run) using the salloc --immediate=sec flag, so that if salloc waits for nodes longer than sec, it bails out and goes for the fallback.
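The opportunistic flow suggested here could look roughly like this. `try_alloc` and `fallback_run` are hypothetical hooks standing in for the real salloc call and the static-server path; the stubs simulate the case where no nodes are free:

```shell
#!/usr/bin/env bash
# Sketch of the opportunistic approach: try to grab nodes with
# salloc --immediate=<sec> (salloc bails out if the allocation is not
# granted within <sec> seconds), otherwise fall back.
# try_alloc and fallback_run are hypothetical hooks, not part of the PR.
set -uo pipefail

# Real flow would be roughly: try_alloc() { salloc --immediate=60 -N 2 "$@"; }
try_alloc()    { return 1; }            # stub: pretend no nodes are free
fallback_run() { echo "static: $*"; }   # stub: run on today's static servers

opportunistic_run() {
    if try_alloc "$@"; then
        echo "slurm: $*"
    else
        fallback_run "$@"
    fi
}

opportunistic_run ./run_slurm_tests_ucc.sh
```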

@MrBr-github
Author

very cool imho - it should use opportunistic approach, i,e. - try to use scctl to submit (check if desired amount of nodes available like dry run) use salloc --immediate=sec flag that if salloc waits for nodes more than sec - it will bail and go for fallback

@mike-dubman
Opportunism (like in clusterminder) is a great idea.
However, I don't quite understand what you mean by fallback?
The only alternative I see right now for opportunism is to approach multiple partitions with salloc -p jazz.rock -N 2.
In later stages, when we'll have multiple managed Slurm clusters with suitable HW (not partitions), we can loop through them (like the hpchead partition and scctl).
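Looping through partitions, as mentioned above, could be sketched like this. `try_partition` and `pick_partition` are hypothetical names, and the stub simulates availability; the real check would be a salloc/scctl dry run against each candidate:

```shell
#!/usr/bin/env bash
# Sketch of trying several partitions in turn until one has free nodes.
# The real version would attempt something like:
#   salloc -p "$part" -N 2 --immediate=60 ...
# try_partition is a hypothetical hook; the stub marks only hpchead as free.
set -uo pipefail

PARTITIONS="jazz.rock hpchead"

try_partition() { [ "$1" = "hpchead" ]; }  # stub availability check

pick_partition() {
    local part
    for part in $PARTITIONS; do
        if try_partition "$part"; then
            echo "$part"
            return 0
        fi
    done
    return 1
}

pick_partition
```

The same loop generalizes naturally from partitions to whole managed clusters once more than one is available.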

@mike-dubman
Contributor

very cool imho - it should use opportunistic approach, i,e. - try to use scctl to submit (check if desired amount of nodes available like dry run) use salloc --immediate=sec flag that if salloc waits for nodes more than sec - it will bail and go for fallback

@mike-dubman Opportunism (like in clusterminder) is a great idea However I don't quite understand what do you mean by fallback? Only alternative I see right now for opportunism is to approach multiple partitions with salloc -p jazz.rock -N 2 In later stages, when we'll have multiple managed slurm clusters with suitable HW (not partitions), we can loop through them (like hpchead partition and scctl)

fallback to static servers you have today?

@MrBr-github
Author

very cool imho - it should use opportunistic approach, i,e. - try to use scctl to submit (check if desired amount of nodes available like dry run) use salloc --immediate=sec flag that if salloc waits for nodes more than sec - it will bail and go for fallback

@mike-dubman Opportunism (like in clusterminder) is a great idea However I don't quite understand what do you mean by fallback? Only alternative I see right now for opportunism is to approach multiple partitions with salloc -p jazz.rock -N 2 In later stages, when we'll have multiple managed slurm clusters with suitable HW (not partitions), we can loop through them (like hpchead partition and scctl)

fallback to static servers you have today?

Oh, got you.
The main idea of this PR was to execute tests over HW not available on the static servers, in particular the NVLS test.
But the opportunistic approach opens new use cases for the same tests.
Will discuss the possibilities internally.

Also remove stale variables

Signed-off-by: Michael Braverman <michaelbr@nvidia.com>
Signed-off-by: Michael Braverman <michaelbr@nvidia.com>
Move initialization and allocation to slurm_allocate.sh
Move slurm cleanup logic to slurm_release.sh
Move slurm test execution logic to run_slurm_tests_ucc.sh

Signed-off-by: Michael Braverman <michaelbr@nvidia.com>
Signed-off-by: Michael Braverman <michaelbr@nvidia.com>
Signed-off-by: Michael Braverman <michaelbr@nvidia.com>