CI execution over Slurm-managed clusters #1231
MrBr-github wants to merge 7 commits into openucx:master
Comes to show further changes
Issue: HPCINFRA-3983
Signed-off-by: Michael Braverman <michaelbr@nvidia.com>
Add support for Slurm clusters managed by scctl
slurm_cmd.sh - provides an abstraction for executing Slurm-based commands
run_slurm_tests_ucc.sh - basic test file
fix_enroot.sh - mitigates an enroot issue with error handling for anonymous image access
Issue: HPCINFRA-3983
Signed-off-by: Michael Braverman <michaelbr@nvidia.com>
dpressle left a comment
The overall design is not clear. The main goal here is to run NVLS tests, not Slurm tests. Slurm is just a way to run these tests on specific HW/clusters, but the code treats it as the main subject.
Use the existing jjb yaml. We don't want to duplicate it, as many variables are common to all jobs.
very cool
@mike-dubman
Fall back to the static servers you have today?
Oh, got you
Also remove stale variables
Signed-off-by: Michael Braverman <michaelbr@nvidia.com>
Signed-off-by: Michael Braverman <michaelbr@nvidia.com>
Move initialization and allocation to slurm_allocate.sh
Move slurm cleanup logic to slurm_release.sh
Move slurm test execution logic to run_slurm_tests_ucc.sh
Signed-off-by: Michael Braverman <michaelbr@nvidia.com>
Signed-off-by: Michael Braverman <michaelbr@nvidia.com>
Signed-off-by: Michael Braverman <michaelbr@nvidia.com>
What
This PR adds CI execution over scctl-managed Slurm clusters.
slurm_cmd.sh - provides an abstraction for executing Slurm-based commands
run_slurm_tests_ucc.sh - basic test file
fix_enroot.sh - mitigates an enroot issue with error handling for anonymous image access
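For readers unfamiliar with the idea of a command abstraction over Slurm, here is a minimal sketch of what such a wrapper could look like. It is not the contents of slurm_cmd.sh from this PR: the function name `slurm_cmd`, the subcommand names `allocate`, `run` and `release`, and the `SLURM_CMD_DRY_RUN` variable are all hypothetical, and scctl is deliberately left out because its interface is not described here. Only the standard Slurm tools `salloc`, `srun` and `scancel` are assumed.

```shell
# Hypothetical sketch only; names below are illustrative assumptions,
# not the actual interface of slurm_cmd.sh in this PR.

# slurm_cmd <allocate|run|release> [args...]
# Maps a small set of verbs onto standard Slurm commands so callers do not
# need to know which Slurm tool backs each step.
# When SLURM_CMD_DRY_RUN=1, the command is printed instead of executed,
# which lets the wrapper be checked on a host without Slurm installed.
slurm_cmd() {
    if [ "$#" -eq 0 ]; then
        echo "usage: slurm_cmd <allocate|run|release> [args...]" >&2
        return 2
    fi

    _slurm_sub=$1
    shift

    # Rebuild the positional parameters as the full command line.
    case "$_slurm_sub" in
        allocate) set -- salloc --no-shell "$@" ;;  # reserve nodes, run nothing yet
        run)      set -- srun "$@" ;;               # run a step, e.g. inside a job
        release)  set -- scancel "$@" ;;            # cancel the job(s) given by ID
        *)
            echo "slurm_cmd: unknown subcommand '$_slurm_sub'" >&2
            return 2
            ;;
    esac

    if [ "${SLURM_CMD_DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

With `SLURM_CMD_DRY_RUN=1` set, `slurm_cmd run --jobid=42 hostname` prints `srun --jobid=42 hostname` rather than contacting a cluster. Keeping allocation, execution and release behind separate verbs also matches the later split into slurm_allocate.sh, run_slurm_tests_ucc.sh and slurm_release.sh, though how those scripts actually call into slurm_cmd.sh is not shown in this conversation.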
Issue: HPCINFRA-3983