
CI execution over Slurm managed clusters#1231

Open
MrBr-github wants to merge 7 commits into openucx:master from MrBr-github:master

Conversation

@MrBr-github

What

This PR adds execution over scctl-managed Slurm clusters.

slurm_cmd.sh - provides an abstraction for executing Slurm-based commands
run_slurm_tests_ucc.sh - basic test file
fix_enroot.sh - mitigates an enroot issue with errors on anonymous image access

Issue: HPCINFRA-3983

Lays the groundwork for further changes.

Issue: HPCINFRA-3983

Signed-off-by: Michael Braverman <michaelbr@nvidia.com>
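The abstraction described above can be sketched roughly as follows. This is an illustrative guess at the shape of slurm_cmd.sh, not the PR's actual contents; `slurm_build_cmd`, `SLURM_PARTITION`, and `SLURM_NODES` are hypothetical names:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a slurm_cmd.sh-style abstraction: build the srun
# invocation from environment knobs instead of hard-coding it in each test.
# SLURM_PARTITION, SLURM_NODES, and slurm_build_cmd are illustrative names,
# not necessarily the interface used in the PR.
set -euo pipefail

SLURM_PARTITION="${SLURM_PARTITION:-batch}"
SLURM_NODES="${SLURM_NODES:-2}"

# Compose (but do not execute) the srun command line for a given test script.
slurm_build_cmd() {
    local test_script="$1"
    echo "srun -p ${SLURM_PARTITION} -N ${SLURM_NODES} ${test_script}"
}

slurm_build_cmd ./run_slurm_tests_ucc.sh
```

Keeping the command construction in one place lets the test files stay agnostic to which cluster or partition they are running on.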
@greptile-apps
Contributor

greptile-apps bot commented Nov 30, 2025

Skipped: This PR does not contain any of your configured labels: (Ready-For-Review)

Add support for slurm clusters managed by scctl

slurm_cmd.sh - provides abstraction to execute slurm based commands
run_slurm_tests_ucc.sh - basic test file
fix_enroot.sh - mitigates an enroot issue with errors on anonymous image access

Issue: HPCINFRA-3983

Signed-off-by: Michael Braverman <michaelbr@nvidia.com>

@dpressle dpressle left a comment


The overall design is unclear: the main goal here is to run NVLS tests, not Slurm tests. Slurm is just a way to run these tests on specific HW/clusters, yet it is treated as the main subject in the code.

Collaborator


Use the existing JJB YAML; we don't want to duplicate it, as many variables are common to all jobs.

@mike-dubman
Contributor

Very cool.
IMHO it should use an opportunistic approach, i.e. try to use scctl to submit (check if the desired number of nodes is available, like a dry run) using the salloc --immediate=sec flag, so that if salloc waits for nodes longer than sec, it bails out and goes for the fallback.
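The opportunistic flow suggested here could look roughly like this. `try_alloc` and `fallback_run` are hypothetical hooks standing in for the real salloc call and the static-server path; the stubs simulate the case where no nodes are free:

```shell
#!/usr/bin/env bash
# Sketch of the opportunistic approach: try to grab nodes with
# salloc --immediate=<sec> (salloc bails out if the allocation is not
# granted within <sec> seconds), otherwise fall back.
# try_alloc and fallback_run are hypothetical hooks, not part of the PR.
set -uo pipefail

# Real flow would be roughly: try_alloc() { salloc --immediate=60 -N 2 "$@"; }
try_alloc()    { return 1; }            # stub: pretend no nodes are free
fallback_run() { echo "static: $*"; }   # stub: run on today's static servers

opportunistic_run() {
    if try_alloc "$@"; then
        echo "slurm: $*"
    else
        fallback_run "$@"
    fi
}

opportunistic_run ./run_slurm_tests_ucc.sh
```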

@MrBr-github
Author

very cool imho - it should use opportunistic approach, i,e. - try to use scctl to submit (check if desired amount of nodes available like dry run) use salloc --immediate=sec flag that if salloc waits for nodes more than sec - it will bail and go for fallback

@mike-dubman
Opportunism (like in clusterminder) is a great idea.
However, I don't quite understand what you mean by fallback?
The only alternative I see right now for opportunism is to approach multiple partitions with salloc -p jazz.rock -N 2.
In later stages, when we'll have multiple managed Slurm clusters with suitable HW (not partitions), we can loop through them (like the hpchead partition and scctl).
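Looping through partitions, as mentioned above, could be sketched like this. `try_partition` and `pick_partition` are hypothetical names, and the stub simulates availability; the real check would be a salloc/scctl dry run against each candidate:

```shell
#!/usr/bin/env bash
# Sketch of trying several partitions in turn until one has free nodes.
# The real version would attempt something like:
#   salloc -p "$part" -N 2 --immediate=60 ...
# try_partition is a hypothetical hook; the stub marks only hpchead as free.
set -uo pipefail

PARTITIONS="jazz.rock hpchead"

try_partition() { [ "$1" = "hpchead" ]; }  # stub availability check

pick_partition() {
    local part
    for part in $PARTITIONS; do
        if try_partition "$part"; then
            echo "$part"
            return 0
        fi
    done
    return 1
}

pick_partition
```

The same loop generalizes naturally from partitions to whole managed clusters once more than one is available.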

@mike-dubman
Contributor

very cool imho - it should use opportunistic approach, i,e. - try to use scctl to submit (check if desired amount of nodes available like dry run) use salloc --immediate=sec flag that if salloc waits for nodes more than sec - it will bail and go for fallback

@mike-dubman Opportunism (like in clusterminder) is a great idea However I don't quite understand what do you mean by fallback? Only alternative I see right now for opportunism is to approach multiple partitions with salloc -p jazz.rock -N 2 In later stages, when we'll have multiple managed slurm clusters with suitable HW (not partitions), we can loop through them (like hpchead partition and scctl)

fallback to static servers you have today?

@MrBr-github
Author

very cool imho - it should use opportunistic approach, i,e. - try to use scctl to submit (check if desired amount of nodes available like dry run) use salloc --immediate=sec flag that if salloc waits for nodes more than sec - it will bail and go for fallback

@mike-dubman Opportunism (like in clusterminder) is a great idea However I don't quite understand what do you mean by fallback? Only alternative I see right now for opportunism is to approach multiple partitions with salloc -p jazz.rock -N 2 In later stages, when we'll have multiple managed slurm clusters with suitable HW (not partitions), we can loop through them (like hpchead partition and scctl)

fallback to static servers you have today?

Oh, got you.
The main idea of this PR was to execute tests over HW not available on the static servers, in particular the NVLS test.
But the opportunistic approach opens new use cases for the same tests.
Will discuss the possibilities internally.

Also remove stale variables

Signed-off-by: Michael Braverman <michaelbr@nvidia.com>
Signed-off-by: Michael Braverman <michaelbr@nvidia.com>
Move initialization and allocation to slurm_allocate.sh
Move slurm cleanup logic to slurm_release.sh
Move slurm test execution logic to run_slurm_tests_ucc.sh

Signed-off-by: Michael Braverman <michaelbr@nvidia.com>
Signed-off-by: Michael Braverman <michaelbr@nvidia.com>
Signed-off-by: Michael Braverman <michaelbr@nvidia.com>