
Execution through NeMo-Run

Nemotron recipes use NeMo-Run for job orchestration. NeMo-Run is an NVIDIA tool that streamlines configuration, execution, and management of ML experiments across computing environments.

Note: This release has been tested primarily with Slurm execution. Support for additional executors is planned.

What is NeMo-Run?

NeMo-Run decouples what you want to run from where you run it:

%%{init: {'theme': 'base', 'themeVariables': { 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'primaryTextColor': '#333333', 'clusterBkg': '#ffffff', 'clusterBorder': '#333333'}}}%%
flowchart LR
    subgraph config["Configuration"]
        Task["Training Task"]
    end

    subgraph nemorun["NeMo-Run"]
        Packager["Packager"]
        Executor["Executor"]
    end

    subgraph targets["Execution Targets"]
        Local["Local"]
        Docker["Docker"]
        Slurm["Slurm"]
        Cloud["Cloud"]
    end

    Task --> Packager
    Packager --> Executor
    Executor --> Local
    Executor --> Docker
    Executor --> Slurm
    Executor --> Cloud

Key concepts:

  • Executor: Where to run (local, Docker, Slurm, cloud)
  • Packager: How to package code for the executor
  • Launcher: How to launch the process (torchrun, direct)

For full documentation, see the NeMo-Run GitHub repository.
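
In plain NeMo-Run, these concepts map directly onto Python objects. A minimal sketch, assuming a train.py script at the repository root (the account and partition names are placeholders; check the NeMo-Run docs for the full API of your installed version):

import nemo_run as run

# What to run: a script, shipped via the default git-archive packager.
task = run.Script("train.py")

# Where to run it: a Slurm executor. The packager (and launcher) are
# attributes of the executor.
executor = run.SlurmExecutor(
    account="YOUR-ACCOUNT",
    partition="gpu",
    nodes=1,
    ntasks_per_node=8,
    packager=run.GitArchivePackager(),
)

run.run(task, executor=executor)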

Quick Start

Add --run <profile> to any recipe command:

# Execute on a Slurm cluster
uv run nemotron nano3 pretrain -c tiny --run YOUR-CLUSTER

# Detached execution (submit and exit)
uv run nemotron nano3 pretrain -c tiny --batch YOUR-CLUSTER

# Preview without executing
uv run nemotron nano3 pretrain -c tiny --run YOUR-CLUSTER --dry-run

Execution Profiles

Nemotron Kit provides an env.toml configuration layer on top of NeMo-Run, enabling declarative execution profiles that integrate natively with the CLI. This is a Nemotron-specific feature—standard NeMo-Run requires programmatic configuration.

Create an env.toml in your project root. Each section defines a named execution profile that can be referenced via --run <profile> or --batch <profile>:

# env.toml

[wandb]
project = "nemotron"
entity = "YOUR-TEAM"

[YOUR-CLUSTER]
executor = "slurm"
account = "YOUR-ACCOUNT"
partition = "batch"
nodes = 2
ntasks_per_node = 8
gpus_per_node = 8
mounts = ["/lustre:/lustre"]

When you run uv run nemotron nano3 pretrain --run YOUR-CLUSTER, the kit reads this profile, builds the appropriate NeMo-Run executor, and submits your job.
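
Conceptually, that lookup is just a TOML read followed by executor construction. A minimal sketch of the flow, using a hypothetical load_profile helper (the kit's internals differ):

import tomllib  # Python 3.11+ standard library

def load_profile(name: str, path: str = "env.toml") -> dict:
    """Look up a named execution profile from env.toml."""
    with open(path, "rb") as f:
        profiles = tomllib.load(f)
    return profiles[name]

profile = load_profile("YOUR-CLUSTER")
assert profile["executor"] == "slurm"
# The kit maps these keys onto a NeMo-Run executor and submits the job.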

Profile Inheritance

Profiles can extend other profiles to reduce duplication:

[base-slurm]
executor = "slurm"
account = "my-account"
partition = "gpu"
time = "04:00:00"

[YOUR-CLUSTER]
extends = "base-slurm"
nodes = 4
ntasks_per_node = 8
gpus_per_node = 8

[YOUR-CLUSTER-large]
extends = "YOUR-CLUSTER"
nodes = 16
time = "08:00:00"
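
Inheritance resolves to a flat dictionary, with child keys overriding parent keys along the extends chain. A sketch of the merge semantics (illustrative, not the kit's actual code):

def resolve_profile(name: str, profiles: dict) -> dict:
    """Flatten an `extends` chain; child values win over parent values."""
    profile = dict(profiles[name])
    parent = profile.pop("extends", None)
    if parent is None:
        return profile
    merged = resolve_profile(parent, profiles)
    merged.update(profile)
    return merged

# YOUR-CLUSTER-large resolves to: executor="slurm", account="my-account",
# partition="gpu", nodes=16, ntasks_per_node=8, gpus_per_node=8, time="08:00:00"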

Executors

Slurm

Submit jobs to a Slurm cluster. Container execution requires Pyxis, NVIDIA's container plugin for Slurm.

[YOUR-CLUSTER]
executor = "slurm"
account = "my-account"
partition = "gpu"
nodes = 4
ntasks_per_node = 8
gpus_per_node = 8
time = "04:00:00"
mounts = ["/data:/data"]
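
For reference, this profile corresponds roughly to the following programmatic NeMo-Run executor (a sketch; consult the NeMo-Run docs for the full field list of your installed version):

import nemo_run as run

executor = run.SlurmExecutor(
    account="my-account",
    partition="gpu",
    nodes=4,
    ntasks_per_node=8,
    gpus_per_node=8,
    time="04:00:00",
    container_mounts=["/data:/data"],
)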

SSH Tunnel Submission

Submit from a remote machine via SSH:

[YOUR-CLUSTER]
executor = "slurm"
account = "my-account"
partition = "gpu"
nodes = 4
tunnel = "ssh"
host = "cluster.example.com"
user = "username"
identity = "~/.ssh/id_rsa"
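
Programmatically, this corresponds to attaching an SSH tunnel to the Slurm executor, roughly as below (a sketch; job_dir is a placeholder for the remote staging directory):

import nemo_run as run

tunnel = run.SSHTunnel(
    host="cluster.example.com",
    user="username",
    identity="~/.ssh/id_rsa",
    job_dir="/home/username/nemo-run",  # placeholder remote staging dir
)
executor = run.SlurmExecutor(
    account="my-account",
    partition="gpu",
    nodes=4,
    tunnel=tunnel,
)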

Partition Overrides

Specify different partitions for --run vs --batch:

[YOUR-CLUSTER]
executor = "slurm"
partition = "batch"           # Default
run_partition = "interactive" # For --run (attached)
batch_partition = "backfill"  # For --batch (detached)
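
The selection rule: --batch prefers batch_partition, --run prefers run_partition, and both fall back to partition. As a sketch (illustrative, not the kit's actual code):

def pick_partition(profile: dict, detached: bool) -> str:
    """Choose the Slurm partition for attached (--run) vs detached (--batch)."""
    key = "batch_partition" if detached else "run_partition"
    return profile.get(key) or profile["partition"]

# pick_partition({"partition": "batch", "run_partition": "interactive"}, False)
# -> "interactive"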

Other Executors

NeMo-Run supports additional executors:

| Executor | Description | Status |
|----------|-------------|--------|
| local | Local execution with torchrun | Planned |
| docker | Docker container with GPU support | Planned |
| skypilot | Cloud instances (AWS, GCP, Azure) | Planned |
| dgxcloud | NVIDIA DGX Cloud | Planned |

Packagers

Packagers determine how your code is bundled and transferred to the execution target. NeMo-Run provides several packagers for different workflows.

GitArchivePackager (Default)

Packages code from a git repository using git archive. Only includes committed files.

[YOUR-CLUSTER]
executor = "slurm"
packager = "git"
# ... other settings

How it works:

  1. Runs git archive --format=tar.gz from repository root
  2. Includes only committed files (uncommitted changes are excluded)
  3. Optionally includes git submodules
  4. Transfers archive to execution target

Key options:

| Option | Default | Description |
|--------|---------|-------------|
| subpath | - | Package only a subdirectory of the repo |
| include_submodules | true | Include git submodules in archive |
| include_pattern | - | Glob pattern for additional uncommitted files |

Best for: Production runs with version-controlled code.
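
Programmatically, the same packager with the options above looks roughly like this (a sketch based on the option table; verify the names against your NeMo-Run version):

import nemo_run as run

packager = run.GitArchivePackager(
    subpath="src",                     # package only src/ from the repo
    include_pattern="configs/*.yaml",  # also pull in these uncommitted files
)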

PatternPackager

Packages files matching glob patterns. Useful for code not under version control.

[YOUR-CLUSTER]
executor = "slurm"
packager = "pattern"
packager_include_pattern = "src/**/*.py"

How it works:

  1. Finds files matching the glob pattern
  2. Creates tar archive of matched files
  3. Preserves directory structure relative to pattern base

Key options:

| Option | Description |
|--------|-------------|
| include_pattern | Glob pattern(s) for files to include |
| relative_path | Base path for pattern matching |

Best for: Quick iterations, non-git projects, or including generated files.
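
The programmatic equivalent, as a sketch (assumes the current working directory is the project root):

import os
import nemo_run as run

packager = run.PatternPackager(
    include_pattern="src/**/*.py",  # glob of files to ship
    relative_path=os.getcwd(),      # base dir the pattern is resolved against
)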

HybridPackager

Combines multiple packagers into a single archive. Useful for complex projects.

[YOUR-CLUSTER]
executor = "slurm"
packager = "hybrid"
# Configuration via code (see below)

How it works:

  1. Runs each sub-packager independently
  2. Extracts outputs to temporary directories
  3. Merges all into final archive with folder organization

Example in code:

import nemo_run as run

hybrid = run.HybridPackager(
    sub_packagers={
        "code": run.GitArchivePackager(subpath="src"),
        "configs": run.PatternPackager(include_pattern="configs/*.yaml"),
        "data": run.PatternPackager(include_pattern="data/*.json"),
    }
)

Best for: Projects with mixed version-controlled and generated content.

Passthrough Packager

No packaging—assumes code is already available on the target (e.g., in a container image or shared filesystem).

[YOUR-CLUSTER]
executor = "slurm"
packager = "none"

Best for: Pre-built containers, shared filesystems like /lustre.

CLI Options

--run vs --batch

| Option | Behavior | Use Case |
|--------|----------|----------|
| --run | Attached: waits for completion, streams logs | Interactive development |
| --batch | Detached: submits and exits immediately | Long-running jobs |

Config Overrides

Override config values using Hydra-style syntax:

# Override training iterations
uv run nemotron nano3 pretrain --run YOUR-CLUSTER train.train_iters=5000

# Override nodes
uv run nemotron nano3 pretrain --run YOUR-CLUSTER run.nodes=8
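
These follow the standard dotted-key convention: each a.b=c assignment is merged into the loaded config. For intuition, here is how the same syntax behaves with OmegaConf's dotlist merging (an analogy for how such overrides resolve; the kit's actual parsing may differ):

from omegaconf import OmegaConf

cfg = OmegaConf.create({"train": {"train_iters": 1000}, "run": {"nodes": 2}})
cfg.merge_with_dotlist(["train.train_iters=5000", "run.nodes=8"])
assert cfg.train.train_iters == 5000 and cfg.run.nodes == 8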

Profile Reference

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| executor | str | "local" | Backend: local, docker, slurm, skypilot |
| packager | str | "git" | Packager: git, pattern, hybrid, none |
| nproc_per_node | int | 8 | GPUs per node |
| nodes | int | 1 | Number of nodes |
| container_image | str | - | Container image (from recipe config) |
| mounts | list | [] | Mount points (/host:/container) |
| account | str | - | Slurm account |
| partition | str | - | Slurm partition |
| run_partition | str | - | Partition for --run |
| batch_partition | str | - | Partition for --batch |
| time | str | "04:00:00" | Job time limit |
| tunnel | str | "local" | Slurm tunnel: local or ssh |
| host | str | - | SSH host |
| user | str | - | SSH user |
| env_vars | list | [] | Environment variables |
| dry_run | bool | false | Preview without executing |
| detach | bool | false | Submit and exit |

Ray-Enabled Recipes

Some recipes use Ray for distributed execution. When you run a Ray-enabled recipe with --run, the Ray cluster is set up automatically:

# Data prep uses Ray
uv run nemotron nano3 data prep pretrain --run YOUR-CLUSTER

# RL training uses Ray
uv run nemotron nano3 rl -c tiny --run YOUR-CLUSTER

Further Reading