Add SLURM script for launching multi-node Ray clusters with Singularity #1269
Conversation
…ty/Apptainer Signed-off-by: Federico D'Ambrosio <[email protected]>
Greptile Summary
Adds a SLURM batch script and comprehensive documentation for launching multi-node Ray clusters using Singularity/Apptainer on HPC systems. The script orchestrates Ray head and worker nodes across SLURM-allocated compute nodes, handles container bind mounts, propagates environment variables properly, and includes cleanup handling. The documentation thoroughly explains all configuration options, environment variables, and provides practical usage examples for air-gapped clusters.
Key features:
Confidence Score: 5/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant User
participant SLURM
participant HeadNode
participant WorkerNodes
participant Container
participant Ray
User->>SLURM: Submit job with RUN_COMMAND
SLURM->>SLURM: Allocate nodes based on --nodes, --gres
SLURM->>HeadNode: Identify first node as head
SLURM->>WorkerNodes: Identify remaining nodes as workers
HeadNode->>HeadNode: Create temp directories (RAY_TMP, RAY_SPILL_DIR)
HeadNode->>HeadNode: Setup environment variables (HF_HOME, PYTHONPATH, etc.)
HeadNode->>Container: Launch Singularity/Apptainer with --bind mounts
Container->>Ray: Start Ray head with --head --block
Note over Ray: GCS on port 6379<br/>Dashboard on port 8265<br/>Client on port 10001
par Start Workers
loop For each worker node
WorkerNodes->>WorkerNodes: Create worker temp directory
WorkerNodes->>Container: Launch container with bind mounts
Container->>Ray: Start Ray worker with --address HEAD_NODE_IP:6379
end
end
Note over HeadNode,WorkerNodes: Wait for cluster to stabilize
HeadNode->>Container: Execute RUN_COMMAND inside container
Container->>Ray: User script connects to Ray cluster
Ray->>Ray: Distribute work across head and workers
Container-->>User: Return execution results
Note over SLURM: Job ends
HeadNode->>HeadNode: Cleanup trap removes temp directories
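As a rough illustration of the flow in the diagram, here is a minimal sketch of how a head/worker launch of this kind is commonly wired up in an sbatch script. It is not the script added by this PR; the image variable, bind mounts, and `--nv` flag are assumptions, while the ports match those in the diagram.

```bash
#!/bin/bash
# Minimal sketch of the launch flow above; not the actual ray-singularity-sbatch.sh.
nodes=($(scontrol show hostnames "${SLURM_JOB_NODELIST}"))   # nodes allocated by SLURM
head_node="${nodes[0]}"
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "${head_node}" hostname --ip-address)

# Start the Ray head inside the container on the first node (GCS on 6379, dashboard on 8265).
srun --nodes=1 --ntasks=1 -w "${head_node}" \
  singularity exec --nv --bind "${PWD}:${PWD}" "${IMAGE}" \
  ray start --head --port=6379 --dashboard-port=8265 --block &

# Start one Ray worker per remaining node, pointing at the head's GCS address.
for node in "${nodes[@]:1}"; do
  srun --nodes=1 --ntasks=1 -w "${node}" \
    singularity exec --nv --bind "${PWD}:${PWD}" "${IMAGE}" \
    ray start --address="${head_node_ip}:6379" --block &
done

sleep 30  # give the cluster time to stabilize before running the user command
```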
2 files reviewed, 2 comments
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: federico-dambrosio <[email protected]>
2 files reviewed, no comments
Additional Comments (2)
- tutorials/deployment/slurm/ray-singularity-sbatch.sh, lines 171-175 (link): syntax: ${CONTAINER_CMD} and ${IMAGE} should be quoted to prevent word splitting if paths contain spaces.
- tutorials/deployment/slurm/ray-singularity-sbatch.sh, lines 216-220 (link): syntax: ${CONTAINER_CMD} and ${IMAGE} should be quoted to prevent word splitting if paths contain spaces.
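For reference, a hedged sketch of the change those two comments ask for; the surrounding line contents are illustrative rather than copied from the script:

```bash
# Unquoted: word splitting breaks the command if CONTAINER_CMD or IMAGE contain spaces.
${CONTAINER_CMD} exec --nv ${IMAGE} ray start --head --block

# Quoted: each variable expands to exactly one word, so paths with spaces survive.
"${CONTAINER_CMD}" exec --nv "${IMAGE}" ray start --head --block
```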
Signed-off-by: Federico D'Ambrosio <[email protected]>
2 files reviewed, 1 comment
########################################################
export PYTHONNOUSERSITE=1
export PYTORCH_CUDA_ALLOC_CONF="${PYTORCH_CUDA_ALLOC_CONF:-expandable_segments:True}"
export PYTHONPATH="${PYTHONPATH:-"$(pwd)"}"
logic: Incorrect quoting syntax - $(pwd) will be evaluated on the host, not inside the container. The PYTHONPATH variable assignment should use \$(pwd) to delay evaluation until inside the container, or omit quotes entirely to let the shell expand it in the container context.
Suggested change:
- export PYTHONPATH="${PYTHONPATH:-"$(pwd)"}"
+ export PYTHONPATH="${PYTHONPATH:-$(pwd)}"
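To illustrate the host-versus-container evaluation point, a sketch only; the PR's script may structure the container invocation differently, and the image and script names here are hypothetical:

```bash
# Host-time expansion: the double-quoted string is expanded by the submitting shell,
# so $(pwd) runs on the host before the container starts.
singularity exec image.sif bash -c "export PYTHONPATH=\"${PYTHONPATH:-$(pwd)}\"; python job.py"

# Container-time expansion: the single-quoted string is passed literally,
# so bash inside the container evaluates $(pwd) there instead.
singularity exec image.sif bash -c 'export PYTHONPATH="${PYTHONPATH:-$(pwd)}"; python job.py'
```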
Description
This PR adds a sample SLURM script and accompanying documentation for running NeMo Curator pipelines on a multi-node Ray cluster using Singularity / Apptainer.
Specifically, it adds:
- ray-singularity-sbatch.sh, a generic SLURM batch script that:
  - starts a Ray head node and worker nodes inside Singularity/Apptainer containers across the SLURM-allocated nodes, with the container runtime selected via the CONTAINER_CMD knob (a hedged configuration sketch follows below).
  - supports air-gapped compute nodes, e.g. by running Hugging Face libraries offline with HF_HUB_OFFLINE=1.
- accompanying documentation covering the configuration options, environment variables, and usage examples.

No existing code paths are modified; this is an example script plus documentation intended to make it easier for users to run NeMo Curator on SLURM-based HPC systems.
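As referenced in the list above, a hedged sketch of what such configuration knobs typically look like near the top of a script like this; apart from CONTAINER_CMD, HF_HUB_OFFLINE, and HF_HOME (which appear in this PR), the variable names and paths are assumptions:

```bash
# Pick the container runtime available on the cluster.
CONTAINER_CMD="${CONTAINER_CMD:-singularity}"        # or "apptainer"

# Hypothetical image location; point this at the .sif built for NeMo Curator.
IMAGE="${IMAGE:-/path/to/nemo-curator.sif}"

# Air-gapped compute nodes: make Hugging Face libraries use local caches only.
export HF_HUB_OFFLINE=1
export HF_HOME="${HF_HOME:-$HOME/.cache/huggingface}"
```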
Similar to #1168 but for Slurm clusters with Singularity and no internet connection on compute nodes.
Usage
On the SLURM side, the corresponding submission looks like:
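The original snippet is not reproduced here; purely as an illustration (the script path comes from this PR, while the node count, GPU request, and RUN_COMMAND value are assumptions), a submission might look like:

```bash
# sbatch exports the submitting shell's environment by default (--export=ALL),
# so RUN_COMMAND set here is visible to the batch script on the compute nodes.
RUN_COMMAND="python my_pipeline.py" \
  sbatch --nodes=2 --gres=gpu:8 tutorials/deployment/slurm/ray-singularity-sbatch.sh
```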
Checklist