Nemotron recipes use NeMo-Run for job orchestration. NeMo-Run is an NVIDIA tool that streamlines configuration, execution, and management of ML experiments across computing environments.
Note: This release has been tested primarily with Slurm execution. Support for additional executors is planned.
NeMo-Run decouples what you want to run from where you run it:
```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'primaryTextColor': '#333333', 'clusterBkg': '#ffffff', 'clusterBorder': '#333333'}}}%%
flowchart LR
    subgraph config["Configuration"]
        Task["Training Task"]
    end
    subgraph nemorun["NeMo-Run"]
        Packager["Packager"]
        Executor["Executor"]
    end
    subgraph targets["Execution Targets"]
        Local["Local"]
        Docker["Docker"]
        Slurm["Slurm"]
        Cloud["Cloud"]
    end
    Task --> Packager
    Packager --> Executor
    Executor --> Local
    Executor --> Docker
    Executor --> Slurm
    Executor --> Cloud
```
Key concepts:
- Executor: Where to run (local, Docker, Slurm, cloud)
- Packager: How to package code for the executor
- Launcher: How to launch the process (torchrun, direct)
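A minimal sketch of how these pieces compose when using NeMo-Run directly (hypothetical values; exact signatures may vary by NeMo-Run version):

```python
import nemo_run as run

# Task: what to run (a raw script here; run.Partial wraps Python functions)
task = run.Script("train.py")

# Executor: where to run it; swapping executors leaves the task untouched
executor = run.LocalExecutor(ntasks_per_node=8, launcher="torchrun")

# Packager: how the code reaches the execution target
executor.packager = run.GitArchivePackager()

run.run(task, executor=executor)
```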
For full documentation, see the NeMo-Run GitHub repository.
Add `--run <profile>` to any recipe command:

```bash
# Execute on a Slurm cluster
uv run nemotron nano3 pretrain -c tiny --run YOUR-CLUSTER

# Detached execution (submit and exit)
uv run nemotron nano3 pretrain -c tiny --batch YOUR-CLUSTER

# Preview without executing
uv run nemotron nano3 pretrain -c tiny --run YOUR-CLUSTER --dry-run
```

Nemotron Kit provides an `env.toml` configuration layer on top of NeMo-Run, enabling declarative execution profiles that integrate natively with the CLI. This is a Nemotron-specific feature; standard NeMo-Run requires programmatic configuration.
Create an env.toml in your project root. Each section defines a named execution profile that can be referenced via --run <profile> or --batch <profile>:
```toml
# env.toml

[wandb]
project = "nemotron"
entity = "YOUR-TEAM"

[YOUR-CLUSTER]
executor = "slurm"
account = "YOUR-ACCOUNT"
partition = "batch"
nodes = 2
ntasks_per_node = 8
gpus_per_node = 8
mounts = ["/lustre:/lustre"]
```

When you run `uv run nemotron nano3 pretrain --run YOUR-CLUSTER`, the kit reads this profile, builds the appropriate NeMo-Run executor, and submits your job.
Profiles can extend other profiles to reduce duplication:
```toml
[base-slurm]
executor = "slurm"
account = "my-account"
partition = "gpu"
time = "04:00:00"

[YOUR-CLUSTER]
extends = "base-slurm"
nodes = 4
ntasks_per_node = 8
gpus_per_node = 8

[YOUR-CLUSTER-large]
extends = "YOUR-CLUSTER"
nodes = 16
time = "08:00:00"
```

Submit jobs to a Slurm cluster. Container execution requires Pyxis, NVIDIA's container plugin for Slurm.
```toml
[YOUR-CLUSTER]
executor = "slurm"
account = "my-account"
partition = "gpu"
nodes = 4
ntasks_per_node = 8
gpus_per_node = 8
time = "04:00:00"
mounts = ["/data:/data"]
```

Submit from a remote machine via SSH:
```toml
[YOUR-CLUSTER]
executor = "slurm"
account = "my-account"
partition = "gpu"
nodes = 4
tunnel = "ssh"
host = "cluster.example.com"
user = "username"
identity = "~/.ssh/id_rsa"
```
Specify different partitions for `--run` vs `--batch`:

```toml
[YOUR-CLUSTER]
executor = "slurm"
partition = "batch"           # Default
run_partition = "interactive" # For --run (attached)
batch_partition = "backfill"  # For --batch (detached)
```

NeMo-Run supports additional executors:
| Executor | Description | Status |
|---|---|---|
| `local` | Local execution with torchrun | Planned |
| `docker` | Docker container with GPU support | Planned |
| `skypilot` | Cloud instances (AWS, GCP, Azure) | Planned |
| `dgxcloud` | NVIDIA DGX Cloud | Planned |
Packagers determine how your code is bundled and transferred to the execution target. NeMo-Run provides several packagers for different workflows.
Packages code from a git repository using `git archive`. Only committed files are included.
```toml
[YOUR-CLUSTER]
executor = "slurm"
packager = "git"
# ... other settings
```

How it works:
- Runs `git archive --format=tar.gz` from the repository root
- Includes only committed files (uncommitted changes are excluded)
- Optionally includes git submodules
- Transfers the archive to the execution target
Key options:
| Option | Default | Description |
|---|---|---|
| `subpath` | - | Package only a subdirectory of the repo |
| `include_submodules` | `true` | Include git submodules in archive |
| `include_pattern` | - | Glob pattern for additional uncommitted files |
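If you construct the packager in code rather than through `env.toml`, these options map onto `GitArchivePackager` fields. A sketch with illustrative values:

```python
import nemo_run as run

# Package only src/ from the repo, plus uncommitted YAML configs
packager = run.GitArchivePackager(
    subpath="src",
    include_pattern="configs/*.yaml",
)
```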
Best for: Production runs with version-controlled code.
Packages files matching glob patterns. Useful for code not under version control.
```toml
[YOUR-CLUSTER]
executor = "slurm"
packager = "pattern"
packager_include_pattern = "src/**/*.py"
```

How it works:
- Finds files matching the glob pattern
- Creates a tar archive of the matched files
- Preserves directory structure relative to the pattern base
Key options:
| Option | Description |
|---|---|
| `include_pattern` | Glob pattern(s) for files to include |
| `relative_path` | Base path for pattern matching |
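The programmatic equivalent, mirroring the toml example above (`relative_path` is assumed to be the project root here):

```python
import os
import nemo_run as run

# Match Python sources under src/, resolved against the current directory
packager = run.PatternPackager(
    include_pattern="src/**/*.py",
    relative_path=os.getcwd(),
)
```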
Best for: Quick iterations, non-git projects, or including generated files.
Combines multiple packagers into a single archive. Useful for complex projects.
```toml
[YOUR-CLUSTER]
executor = "slurm"
packager = "hybrid"
# Configuration via code (see below)
```

How it works:
- Runs each sub-packager independently
- Extracts outputs to temporary directories
- Merges everything into a single archive, with each sub-packager's output under its own folder
Example in code:
```python
import os
import nemo_run as run

# relative_path gives PatternPackager a base directory to resolve patterns against
hybrid = run.HybridPackager(
    sub_packagers={
        "code": run.GitArchivePackager(subpath="src"),
        "configs": run.PatternPackager(include_pattern="configs/*.yaml", relative_path=os.getcwd()),
        "data": run.PatternPackager(include_pattern="data/*.json", relative_path=os.getcwd()),
    }
)
```

Best for: Projects with mixed version-controlled and generated content.
No packaging—assumes code is already available on the target (e.g., in a container image or shared filesystem).
```toml
[YOUR-CLUSTER]
executor = "slurm"
packager = "none"
```

Best for: Pre-built containers, shared filesystems like `/lustre`.
| Option | Behavior | Use Case |
|---|---|---|
| `--run` | Attached: waits for completion, streams logs | Interactive development |
| `--batch` | Detached: submits and exits immediately | Long-running jobs |
Override config values using Hydra-style syntax:
```bash
# Override training iterations
uv run nemotron nano3 pretrain --run YOUR-CLUSTER train.train_iters=5000

# Override nodes
uv run nemotron nano3 pretrain --run YOUR-CLUSTER run.nodes=8
```

Profile fields recognized in `env.toml`:

| Field | Type | Default | Description |
|---|---|---|---|
| `executor` | str | `"local"` | Backend: `local`, `docker`, `slurm`, `skypilot` |
| `packager` | str | `"git"` | Packager: `git`, `pattern`, `hybrid`, `none` |
| `nproc_per_node` | int | `8` | GPUs per node |
| `nodes` | int | `1` | Number of nodes |
| `container_image` | str | - | Container image (from recipe config) |
| `mounts` | list | `[]` | Mount points (`/host:/container`) |
| `account` | str | - | Slurm account |
| `partition` | str | - | Slurm partition |
| `run_partition` | str | - | Partition for `--run` |
| `batch_partition` | str | - | Partition for `--batch` |
| `time` | str | `"04:00:00"` | Job time limit |
| `tunnel` | str | `"local"` | Slurm tunnel: `local` or `ssh` |
| `host` | str | - | SSH host |
| `user` | str | - | SSH user |
| `env_vars` | list | `[]` | Environment variables |
| `dry_run` | bool | `false` | Preview without executing |
| `detach` | bool | `false` | Submit and exit |
Some recipes use Ray for distributed execution. When you run a Ray-enabled recipe with `--run`, the Ray cluster is set up automatically:

```bash
# Data prep uses Ray
uv run nemotron nano3 data prep pretrain --run YOUR-CLUSTER

# RL training uses Ray
uv run nemotron nano3 rl -c tiny --run YOUR-CLUSTER
```

- NeMo-Run GitHub — Full documentation
- W&B Integration — Automatic credential handling
- Nemotron Kit — Framework overview
- CLI Framework — Building recipe CLIs