
Execution through NeMo-Run

Nemotron recipes use NeMo-Run for job orchestration. NeMo-Run is an NVIDIA tool that streamlines configuration, execution, and management of ML experiments across computing environments.

Note: This release has been tested primarily with Slurm execution. Support for additional executors is planned.

What is NeMo-Run?

NeMo-Run decouples what you want to run from where you run it:

%%{init: {'theme': 'base', 'themeVariables': { 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'primaryTextColor': '#333333', 'clusterBkg': '#ffffff', 'clusterBorder': '#333333'}}}%%
flowchart LR
    subgraph config["Configuration"]
        Task["Training Task"]
    end

    subgraph nemorun["NeMo-Run"]
        Packager["Packager"]
        Executor["Executor"]
    end

    subgraph targets["Execution Targets"]
        Local["Local"]
        Docker["Docker"]
        Slurm["Slurm"]
        Cloud["Cloud"]
    end

    Task --> Packager
    Packager --> Executor
    Executor --> Local
    Executor --> Docker
    Executor --> Slurm
    Executor --> Cloud

Key concepts:

  • Executor: Where to run (local, Docker, Slurm, cloud)
  • Packager: How to package code for the executor
  • Launcher: How to launch the process (torchrun, direct)

For full documentation, see the NeMo-Run GitHub repository.
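
In plain NeMo-Run, these concepts map directly onto Python objects. A minimal sketch, assuming a train.py script at the repository root (the account and partition names are placeholders; check the NeMo-Run docs for the full API of your installed version):

import nemo_run as run

# What to run: a script, shipped via the default git-archive packager.
task = run.Script("train.py")

# Where to run it: a Slurm executor. The packager (and launcher) are
# attributes of the executor.
executor = run.SlurmExecutor(
    account="YOUR-ACCOUNT",
    partition="gpu",
    nodes=1,
    ntasks_per_node=8,
    packager=run.GitArchivePackager(),
)

run.run(task, executor=executor)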

Quick Start

Add --run <profile> to any recipe command:

# Execute on a Slurm cluster
uv run nemotron nano3 pretrain -c tiny --run YOUR-CLUSTER

# Detached execution (submit and exit)
uv run nemotron nano3 pretrain -c tiny --batch YOUR-CLUSTER

# Preview without executing
uv run nemotron nano3 pretrain -c tiny --run YOUR-CLUSTER --dry-run

Execution Profiles

Nemotron Kit provides an env.toml configuration layer on top of NeMo-Run, enabling declarative execution profiles that integrate natively with the CLI. This is a Nemotron-specific feature—standard NeMo-Run requires programmatic configuration.

Create an env.toml in your project root. Each section defines a named execution profile that can be referenced via --run <profile> or --batch <profile>:

# env.toml

[wandb]
project = "nemotron"
entity = "YOUR-TEAM"

[YOUR-CLUSTER]
executor = "slurm"
account = "YOUR-ACCOUNT"
partition = "batch"
nodes = 2
ntasks_per_node = 8
gpus_per_node = 8
mounts = ["/lustre:/lustre"]

When you run uv run nemotron nano3 pretrain --run YOUR-CLUSTER, the kit reads this profile, builds the appropriate NeMo-Run executor, and submits your job.
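
Conceptually, that lookup is just a TOML read followed by executor construction. A minimal sketch of the flow, using a hypothetical load_profile helper (the kit's internals differ):

import tomllib  # Python 3.11+ standard library

def load_profile(name: str, path: str = "env.toml") -> dict:
    """Look up a named execution profile from env.toml."""
    with open(path, "rb") as f:
        profiles = tomllib.load(f)
    return profiles[name]

profile = load_profile("YOUR-CLUSTER")
assert profile["executor"] == "slurm"
# The kit maps these keys onto a NeMo-Run executor and submits the job.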

Profile Inheritance

Profiles can extend other profiles to reduce duplication:

[base-slurm]
executor = "slurm"
account = "my-account"
partition = "gpu"
time = "04:00:00"

[YOUR-CLUSTER]
extends = "base-slurm"
nodes = 4
ntasks_per_node = 8
gpus_per_node = 8

[YOUR-CLUSTER-large]
extends = "YOUR-CLUSTER"
nodes = 16
time = "08:00:00"
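
Inheritance resolves to a flat dictionary, with child keys overriding parent keys along the extends chain. A sketch of the merge semantics (illustrative, not the kit's actual code):

def resolve_profile(name: str, profiles: dict) -> dict:
    """Flatten an `extends` chain; child values win over parent values."""
    profile = dict(profiles[name])
    parent = profile.pop("extends", None)
    if parent is None:
        return profile
    merged = resolve_profile(parent, profiles)
    merged.update(profile)
    return merged

# YOUR-CLUSTER-large resolves to: executor="slurm", account="my-account",
# partition="gpu", nodes=16, ntasks_per_node=8, gpus_per_node=8, time="08:00:00"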

Executors

Slurm

Submit jobs to a Slurm cluster. Container execution requires Pyxis, NVIDIA's container plugin for Slurm.

[YOUR-CLUSTER]
executor = "slurm"
account = "my-account"
partition = "gpu"
nodes = 4
ntasks_per_node = 8
gpus_per_node = 8
time = "04:00:00"
mounts = ["/data:/data"]
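
For reference, this profile corresponds roughly to the following programmatic NeMo-Run executor (a sketch; consult the NeMo-Run docs for the full field list of your installed version):

import nemo_run as run

executor = run.SlurmExecutor(
    account="my-account",
    partition="gpu",
    nodes=4,
    ntasks_per_node=8,
    gpus_per_node=8,
    time="04:00:00",
    container_mounts=["/data:/data"],
)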

SSH Tunnel Submission

Submit from a remote machine via SSH:

[YOUR-CLUSTER]
executor = "slurm"
account = "my-account"
partition = "gpu"
nodes = 4
tunnel = "ssh"
host = "cluster.example.com"
user = "username"
identity = "~/.ssh/id_rsa"
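
Programmatically, this corresponds to attaching an SSH tunnel to the Slurm executor, roughly as below (a sketch; job_dir is a placeholder for the remote staging directory):

import nemo_run as run

tunnel = run.SSHTunnel(
    host="cluster.example.com",
    user="username",
    identity="~/.ssh/id_rsa",
    job_dir="/home/username/nemo-run",  # placeholder remote staging dir
)
executor = run.SlurmExecutor(
    account="my-account",
    partition="gpu",
    nodes=4,
    tunnel=tunnel,
)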

Partition Overrides

Specify different partitions for --run vs --batch:

[YOUR-CLUSTER]
executor = "slurm"
partition = "batch"           # Default
run_partition = "interactive" # For --run (attached)
batch_partition = "backfill"  # For --batch (detached)
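
The selection rule: --batch prefers batch_partition, --run prefers run_partition, and both fall back to partition. As a sketch (illustrative, not the kit's actual code):

def pick_partition(profile: dict, detached: bool) -> str:
    """Choose the Slurm partition for attached (--run) vs detached (--batch)."""
    key = "batch_partition" if detached else "run_partition"
    return profile.get(key) or profile["partition"]

# pick_partition({"partition": "batch", "run_partition": "interactive"}, False)
# -> "interactive"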

Other Executors

NeMo-Run supports additional executors:

| Executor | Description | Status |
|----------|-------------|--------|
| local | Local execution with torchrun | Planned |
| docker | Docker container with GPU support | Planned |
| skypilot | Cloud instances (AWS, GCP, Azure) | Planned |
| dgxcloud | NVIDIA DGX Cloud | Planned |

Packagers

Packagers determine how your code is bundled and transferred to the execution target. NeMo-Run provides several packagers for different workflows.

GitArchivePackager (Default)

Packages code from a git repository using git archive. Only includes committed files.

[YOUR-CLUSTER]
executor = "slurm"
packager = "git"
# ... other settings

How it works:

  1. Runs git archive --format=tar.gz from repository root
  2. Includes only committed files (uncommitted changes are excluded)
  3. Optionally includes git submodules
  4. Transfers archive to execution target

Key options:

| Option | Default | Description |
|--------|---------|-------------|
| subpath | - | Package only a subdirectory of the repo |
| include_submodules | true | Include git submodules in archive |
| include_pattern | - | Glob pattern for additional uncommitted files |

Best for: Production runs with version-controlled code.
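
Programmatically, the same packager with the options above looks roughly like this (a sketch based on the option table; verify the names against your NeMo-Run version):

import nemo_run as run

packager = run.GitArchivePackager(
    subpath="src",                     # package only src/ from the repo
    include_pattern="configs/*.yaml",  # also pull in these uncommitted files
)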

PatternPackager

Packages files matching glob patterns. Useful for code not under version control.

[YOUR-CLUSTER]
executor = "slurm"
packager = "pattern"
packager_include_pattern = "src/**/*.py"

How it works:

  1. Finds files matching the glob pattern
  2. Creates tar archive of matched files
  3. Preserves directory structure relative to pattern base

Key options:

| Option | Description |
|--------|-------------|
| include_pattern | Glob pattern(s) for files to include |
| relative_path | Base path for pattern matching |

Best for: Quick iterations, non-git projects, or including generated files.
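
The programmatic equivalent, as a sketch (assumes the current working directory is the project root):

import os
import nemo_run as run

packager = run.PatternPackager(
    include_pattern="src/**/*.py",  # glob of files to ship
    relative_path=os.getcwd(),      # base dir the pattern is resolved against
)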

HybridPackager

Combines multiple packagers into a single archive. Useful for complex projects.

[YOUR-CLUSTER]
executor = "slurm"
packager = "hybrid"
# Configuration via code (see below)

How it works:

  1. Runs each sub-packager independently
  2. Extracts outputs to temporary directories
  3. Merges all into final archive with folder organization

Example in code:

import nemo_run as run

hybrid = run.HybridPackager(
    sub_packagers={
        "code": run.GitArchivePackager(subpath="src"),
        "configs": run.PatternPackager(include_pattern="configs/*.yaml"),
        "data": run.PatternPackager(include_pattern="data/*.json"),
    }
)

Best for: Projects with mixed version-controlled and generated content.

Passthrough Packager

No packaging—assumes code is already available on the target (e.g., in a container image or shared filesystem).

[YOUR-CLUSTER]
executor = "slurm"
packager = "none"

Best for: Pre-built containers, shared filesystems like /lustre.

CLI Options

--run vs --batch

| Option | Behavior | Use Case |
|--------|----------|----------|
| --run | Attached: waits for completion, streams logs | Interactive development |
| --batch | Detached: submits and exits immediately | Long-running jobs |

Config Overrides

Override config values using Hydra-style syntax:

# Override training iterations
uv run nemotron nano3 pretrain --run YOUR-CLUSTER train.train_iters=5000

# Override nodes
uv run nemotron nano3 pretrain --run YOUR-CLUSTER run.nodes=8
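
These follow the standard dotted-key convention: each a.b=c assignment is merged into the loaded config. For intuition, here is how the same syntax behaves with OmegaConf's dotlist merging (an analogy for how such overrides resolve; the kit's actual parsing may differ):

from omegaconf import OmegaConf

cfg = OmegaConf.create({"train": {"train_iters": 1000}, "run": {"nodes": 2}})
cfg.merge_with_dotlist(["train.train_iters=5000", "run.nodes=8"])
assert cfg.train.train_iters == 5000 and cfg.run.nodes == 8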

Profile Reference

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| executor | str | "local" | Backend: local, docker, slurm, skypilot |
| packager | str | "git" | Packager: git, pattern, hybrid, none |
| nproc_per_node | int | 8 | GPUs per node |
| nodes | int | 1 | Number of nodes |
| container_image | str | - | Container image (from recipe config) |
| mounts | list | [] | Mount points (/host:/container) |
| account | str | - | Slurm account |
| partition | str | - | Slurm partition |
| run_partition | str | - | Partition for --run |
| batch_partition | str | - | Partition for --batch |
| time | str | "04:00:00" | Job time limit |
| tunnel | str | "local" | Slurm tunnel: local or ssh |
| host | str | - | SSH host |
| user | str | - | SSH user |
| env_vars | list | [] | Environment variables |
| dry_run | bool | false | Preview without executing |
| detach | bool | false | Submit and exit |

Ray-Enabled Recipes

Some recipes use Ray for distributed execution. When you run a Ray-enabled recipe with --run, the Ray cluster is set up automatically:

# Data prep uses Ray
uv run nemotron nano3 data prep pretrain --run YOUR-CLUSTER

# RL training uses Ray
uv run nemotron nano3 rl -c tiny --run YOUR-CLUSTER

Further Reading