
Conversation

@federico-dambrosio

Description

This PR adds a sample SLURM script and accompanying documentation for running NeMo Curator pipelines on a multi-node Ray cluster using Singularity / Apptainer.

Specifically, it:

  • Adds ray-singularity-sbatch.sh, a generic SLURM batch script (see the sketch after this list) that:
    • Starts a Ray head on the first SLURM node and Ray workers on the remaining nodes.
    • Runs a user-provided Python command inside a NeMo Curator container on the head node.
    • Supports both Singularity and Apptainer via a CONTAINER_CMD knob.
    • Is safe for air-gapped clusters by default via HF_HUB_OFFLINE=1.
  • Adds a README documenting:
    • Prerequisites (NeMo Curator container, SLURM, Singularity/Apptainer).
    • How the script works and how to customize SBATCH directives.
    • All relevant environment knobs (ports, HF cache, scratch paths, mounts, etc.).
    • Example usage patterns for NeMo Curator pipelines.
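
For orientation, the core launch pattern can be sketched roughly as follows. This is a condensed, hypothetical approximation: the actual ray-singularity-sbatch.sh in this PR additionally handles bind mounts, scratch and spill directories, port configuration, and cleanup, and its variable names may differ.

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --gres=gpu:4

set -euo pipefail

CONTAINER_CMD="${CONTAINER_CMD:-singularity}"   # or "apptainer"
IMAGE="${IMAGE:?set IMAGE to the NeMo Curator .sif path}"
export HF_HUB_OFFLINE="${HF_HUB_OFFLINE:-1}"    # stay offline on air-gapped clusters

# The first allocated node becomes the Ray head, the rest become workers.
nodes=($(scontrol show hostnames "${SLURM_JOB_NODELIST}"))
head_node="${nodes[0]}"
head_ip=$(srun --nodes=1 --ntasks=1 -w "${head_node}" hostname --ip-address)

# Ray head, inside the container, on the first node.
srun --nodes=1 --ntasks=1 -w "${head_node}" \
    "${CONTAINER_CMD}" exec --nv "${IMAGE}" \
    ray start --head --port=6379 --block &

# Ray workers, inside the container, on the remaining nodes.
for node in "${nodes[@]:1}"; do
    srun --nodes=1 --ntasks=1 -w "${node}" \
        "${CONTAINER_CMD}" exec --nv "${IMAGE}" \
        ray start --address="${head_ip}:6379" --block &
done

sleep 30  # give the cluster a moment to come up

# Run the user-provided command inside the container on the node executing this script.
"${CONTAINER_CMD}" exec --nv "${IMAGE}" bash -c "${RUN_COMMAND:?set RUN_COMMAND}"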

No existing code paths are modified; this is an example script + documentation intended to make it easier for users to run NeMo Curator on SLURM-based HPC systems.

Similar to #1168, but for SLURM clusters that use Singularity and have no internet connection on the compute nodes.

Usage

from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna.executor import XennaExecutor

# Define your pipeline
pipeline = Pipeline(...)
pipeline.add_stage(...)

# Use the XennaExecutor to run on the Ray cluster started by the sbatch script
executor = XennaExecutor()
results = pipeline.run(executor=executor)

On the SLURM side, the corresponding submission looks like:

export IMAGE=/path/to/nemo-curator_25.09.sif

RUN_COMMAND="python curator_pipeline.py" \
sbatch --nodes=2 --gres=gpu:4 ray-singularity-sbatch.sh
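
On a cluster that provides Apptainer instead of Singularity, the submission only needs the CONTAINER_CMD knob mentioned above; the value "apptainer" below assumes the knob takes the container binary name, as documented in the added README:

export IMAGE=/path/to/nemo-curator_25.09.sif

CONTAINER_CMD=apptainer RUN_COMMAND="python curator_pipeline.py" \
sbatch --nodes=2 --gres=gpu:4 ray-singularity-sbatch.sh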

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

@copy-pr-bot

copy-pr-bot bot commented Nov 24, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps

greptile-apps bot commented Nov 24, 2025

Greptile Summary

Adds a SLURM batch script and comprehensive documentation for launching multi-node Ray clusters using Singularity/Apptainer on HPC systems. The script orchestrates Ray head and worker nodes across SLURM-allocated compute nodes, handles container bind mounts, propagates environment variables properly, and includes cleanup handling. The documentation thoroughly explains all configuration options and environment variables, and provides practical usage examples for air-gapped clusters.

Key features:

  • Properly quotes variables to handle paths with spaces (addressed in commit 1853800)
  • Uses set -euo pipefail for error handling
  • Implements a cleanup trap for temporary directories (see the sketch below)
  • Supports both Singularity and Apptainer via CONTAINER_CMD
  • Defaults to air-gapped mode with HF_HUB_OFFLINE=1
  • Auto-detects resources from SLURM environment variables
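
As an illustration of the error-handling and cleanup pattern referred to above (a minimal sketch; RAY_TMP and RAY_SPILL_DIR are taken from the sequence diagram below, but how the real script creates and removes them may differ):

set -euo pipefail

# Scratch locations for Ray's temp data and object spilling.
RAY_TMP="$(mktemp -d "/tmp/ray-tmp-${SLURM_JOB_ID:-manual}.XXXXXX")"
RAY_SPILL_DIR="$(mktemp -d "/tmp/ray-spill-${SLURM_JOB_ID:-manual}.XXXXXX")"

cleanup() {
    rm -rf "${RAY_TMP}" "${RAY_SPILL_DIR}"
}
trap cleanup EXIT   # runs on normal exit and on errors surfaced by set -e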

Confidence Score: 5/5

  • This PR is safe to merge
  • The PR adds new deployment infrastructure without modifying existing code. The script follows bash best practices with proper error handling (set -euo pipefail), variable quoting, cleanup traps, and comprehensive documentation. Previous syntax issues were already resolved in commit 1853800. The code properly handles environment variable propagation into containers and includes sensible defaults for air-gapped HPC environments.
  • No files require special attention

Important Files Changed

  • tutorials/deployment/slurm/ray-singularity-sbatch.sh: Well-structured SLURM script with proper error handling and environment propagation; properly quotes variables to handle spaces
  • tutorials/deployment/slurm/README.md: Comprehensive documentation with clear examples and environment variable explanations

Sequence Diagram

sequenceDiagram
    participant User
    participant SLURM
    participant HeadNode
    participant WorkerNodes
    participant Container
    participant Ray

    User->>SLURM: Submit job with RUN_COMMAND
    SLURM->>SLURM: Allocate nodes based on --nodes, --gres
    SLURM->>HeadNode: Identify first node as head
    SLURM->>WorkerNodes: Identify remaining nodes as workers
    
    HeadNode->>HeadNode: Create temp directories (RAY_TMP, RAY_SPILL_DIR)
    HeadNode->>HeadNode: Setup environment variables (HF_HOME, PYTHONPATH, etc.)
    
    HeadNode->>Container: Launch Singularity/Apptainer with --bind mounts
    Container->>Ray: Start Ray head with --head --block
    Note over Ray: GCS on port 6379<br/>Dashboard on port 8265<br/>Client on port 10001
    
    par Start Workers
        loop For each worker node
            WorkerNodes->>WorkerNodes: Create worker temp directory
            WorkerNodes->>Container: Launch container with bind mounts
            Container->>Ray: Start Ray worker with --address HEAD_NODE_IP:6379
        end
    end
    
    Note over HeadNode,WorkerNodes: Wait for cluster to stabilize
    
    HeadNode->>Container: Execute RUN_COMMAND inside container
    Container->>Ray: User script connects to Ray cluster
    Ray->>Ray: Distribute work across head and workers
    Container-->>User: Return execution results
    
    Note over SLURM: Job ends
    HeadNode->>HeadNode: Cleanup trap removes temp directories

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 2 comments


@federico-dambrosio changed the title from "Add SLURM script for launching multi-node Ray clusters with Singulari…" to "Add SLURM script for launching multi-node Ray clusters with Singularity" on Nov 24, 2025

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: federico-dambrosio <[email protected]>

@greptile-apps greptile-apps bot left a comment

2 files reviewed, no comments


@greptile-apps greptile-apps bot left a comment

Additional Comments (2)

  1. tutorials/deployment/slurm/ray-singularity-sbatch.sh, line 171-175 (link)

    syntax: ${CONTAINER_CMD} and ${IMAGE} should be quoted to prevent word splitting if paths contain spaces

  2. tutorials/deployment/slurm/ray-singularity-sbatch.sh, line 216-220 (link)

    syntax: ${CONTAINER_CMD} and ${IMAGE} should be quoted to prevent word splitting if paths contain spaces (illustrated below)
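
For reference, the difference the two comments point at is just word splitting on unquoted expansions; a minimal illustration, not the script's actual lines:

IMAGE="/scratch/images/path with spaces/nemo-curator_25.09.sif"
CONTAINER_CMD="singularity"

# Unquoted: the path splits into several arguments and the exec fails.
#   ${CONTAINER_CMD} exec --nv ${IMAGE} ray start --head --block
# Quoted: each expansion is passed as a single argument.
"${CONTAINER_CMD}" exec --nv "${IMAGE}" ray start --head --block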

2 files reviewed, 2 comments


@greptile-apps greptile-apps bot left a comment

2 files reviewed, no comments


@greptile-apps greptile-apps bot left a comment

Additional Comments (2)

  1. tutorials/deployment/slurm/ray-singularity-sbatch.sh, line 63 (link)

    syntax: Quote $(pwd) to prevent word splitting if path contains spaces

  2. tutorials/deployment/slurm/ray-singularity-sbatch.sh, line 137 (link)

    syntax: Quote $(pwd) to prevent word splitting if path contains spaces

2 files reviewed, 2 comments


@chtruong814 added the needs-follow-up label (Issue needs follow-up) on Jan 11, 2026

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 1 comment


########################################################
export PYTHONNOUSERSITE=1
export PYTORCH_CUDA_ALLOC_CONF="${PYTORCH_CUDA_ALLOC_CONF:-expandable_segments:True}"
export PYTHONPATH="${PYTHONPATH:-"$(pwd)"}"

logic: Incorrect quoting syntax - $(pwd) will be evaluated on the host, not inside the container. The PYTHONPATH variable assignment should use \$(pwd) to delay evaluation until inside the container, or omit quotes entirely to let the shell expand it in the container context.

Suggested change:

- export PYTHONPATH="${PYTHONPATH:-"$(pwd)"}"
+ export PYTHONPATH="${PYTHONPATH:-$(pwd)}"
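
The timing distinction the comment draws (host-side vs in-container expansion of $(pwd)) can be shown in isolation. This is a hypothetical snippet, not the script's actual code, assuming IMAGE is set as in the examples above:

# $(pwd) expands immediately, on the submitting host, when the string is built:
HOST_EVAL="export PYTHONPATH=$(pwd); env | grep ^PYTHONPATH="
# \$(pwd) stays literal in the string and only expands in the container's shell:
DEFERRED_EVAL="export PYTHONPATH=\$(pwd); env | grep ^PYTHONPATH="

singularity exec "${IMAGE}" bash -c "${HOST_EVAL}"
singularity exec "${IMAGE}" bash -c "${DEFERRED_EVAL}"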
