This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
NeMo Run is a tool for configuring, executing, and managing ML experiments across various computing environments. Its three core pillars are:
- Configuration - Python-native config using Google's Fiddle library
- Execution - Running tasks on local machines, SLURM clusters, Docker, cloud (SkyPilot, DGX Cloud, Lepton)
- Management - Tracking experiment metadata locally in NEMORUN_HOME (default: ~/.nemo_run)
# Install for development
uv sync --extra skypilot
# Run tests (slow tests are skipped by default)
uv run -- pytest test/
# Run a single test
uv run -- pytest test/test_config.py::TestClass::test_method
# Run including slow tests
uv run -- pytest -m "" test/
# Lint
uv run --group lint -- ruff check
# Format
uv run --group lint -- ruff format
# Run with coverage
uv run -- coverage run --branch --source=nemo_run -a -m pytest test/
uv run -- coverage report -m

Line length is 100 (configured in pyproject.toml under [tool.ruff]).
Config[T] / Partial[T] (nemo_run/config.py): Built on Fiddle. Config instantiates the target directly when built; Partial creates a functools.partial. Script wraps shell commands. These are the primary user-facing types.
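To make the build-time difference concrete, here is a minimal stdlib-only sketch. ToyConfig and ToyPartial are hypothetical stand-ins, not NeMo Run classes. They mimic only the behaviour described above (calling the target on build versus returning a functools.partial) and leave out everything Fiddle provides.

```python
import functools
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ToyConfig:
    """Hypothetical stand-in: building calls the target immediately."""

    target: Callable[..., Any]
    kwargs: dict = field(default_factory=dict)

    def build(self) -> Any:
        return self.target(**self.kwargs)


@dataclass
class ToyPartial:
    """Hypothetical stand-in: building returns a functools.partial."""

    target: Callable[..., Any]
    kwargs: dict = field(default_factory=dict)

    def build(self) -> functools.partial:
        return functools.partial(self.target, **self.kwargs)


def scale(value: float, factor: float = 2.0) -> float:
    """Example target function (illustrative only)."""
    return value * factor


# Config-style: all arguments are bound and the result is the return value.
built = ToyConfig(scale, {"value": 3.0, "factor": 4.0}).build()

# Partial-style: some arguments are bound now, the call happens later.
deferred = ToyPartial(scale, {"factor": 4.0}).build()
```

Here `built` is already the result of `scale`, while `deferred` is a callable that still expects `value`. That is why a Partial suits tasks an executor will invoke later.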
Executor (nemo_run/core/execution/base.py): Abstract base for all execution environments. Key fields: packager, launcher, env_vars, retries. Implementations:
- LocalExecutor - direct local execution
- DockerExecutor - via Docker
- SlurmExecutor - HPC via SLURM + SSH tunnel
- SkypilotExecutor / SkypilotJobsExecutor - multi-cloud via SkyPilot
- DGXCloudExecutor - NVIDIA DGX Cloud
- LeptonExecutor - Lepton AI
Experiment (nemo_run/run/experiment.py): Context manager that groups multiple tasks/jobs, handles parallel execution, log syncing, state tracking, and plugin hooks. Uses TorchX (torchx>=0.7.0) as the distributed execution backend.
Packager (nemo_run/core/packaging/): Strategies to bundle code for remote execution:
- GitArchivePackager - packages via git archive
- PatternPackager - file glob patterns
- HybridPackager - combines strategies
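As a rough illustration of the glob-pattern strategy, the sketch below bundles matching files into a tarball using only the standard library. The function name package_by_pattern and the archive layout are assumptions made for this example. The real PatternPackager implementation is not shown in this file and may differ.

```python
import tarfile
import tempfile
from pathlib import Path


def package_by_pattern(base_dir: Path, pattern: str, out_path: Path) -> list[str]:
    """Hypothetical helper: tar.gz every file under base_dir matching pattern."""
    matched = sorted(p for p in base_dir.glob(pattern) if p.is_file())
    with tarfile.open(out_path, "w:gz") as tar:
        for path in matched:
            # Store paths relative to base_dir so the archive is relocatable.
            tar.add(path, arcname=path.relative_to(base_dir).as_posix())
    return [path.relative_to(base_dir).as_posix() for path in matched]


with tempfile.TemporaryDirectory() as tmp:
    # Build a throwaway project tree to package.
    root = Path(tmp) / "project"
    (root / "src").mkdir(parents=True)
    (root / "src" / "train.py").write_text("print('train')\n")
    (root / "src" / "notes.txt").write_text("not packaged\n")

    archive = Path(tmp) / "code.tar.gz"
    packaged = package_by_pattern(root, "src/*.py", archive)

    # Read the archive back to confirm what was bundled.
    with tarfile.open(archive) as tar:
        archived_names = tar.getnames()
```

Only `src/train.py` matches the pattern, so the text file stays out of the archive.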
Launcher (nemo_run/core/execution/launcher.py): Controls how tasks are launched within an executor. Options: Torchrun, FaultTolerance (NVIDIA), SlurmRay, SlurmTemplate.
Tunnels (nemo_run/core/tunnel/): SSHTunnel for remote cluster access with rsync for file syncing.
- User defines a function/class and wraps it in run.Config or run.Partial
- An Executor is configured (with Packager + optional Launcher)
- run.run(task, executor) or run.Experiment is used to execute
- TorchX schedulers (registered as entry points in pyproject.toml) dispatch work
- Metadata stored in ~/.nemo_run/ for experiment tracking
Entry points nemorun / nemo (via Typer) provide experiment management and configuration inspection. The CLI uses lazy imports (nemo_run/cli/lazy.py) for fast startup. Extensible via nemo_run.cli.entrypoints namespace.
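For readers unfamiliar with the technique, lazy importing can be sketched with the standard library's importlib.util.LazyLoader. This is a generic illustration, not the contents of nemo_run/cli/lazy.py, which are not shown here. The helper name lazy_import is hypothetical.

```python
import importlib.util
import sys
from types import ModuleType


def lazy_import(name: str) -> ModuleType:
    """Hypothetical helper: return a module that executes on first attribute access."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot find module {name!r}")
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    loader.exec_module(module)  # Sets up deferral; the module body has not run yet.
    return module


# The module body executes only when an attribute is first touched.
colorsys = lazy_import("colorsys")
hue, saturation, value = colorsys.rgb_to_hsv(1.0, 0.0, 0.0)
```

Deferring heavy imports this way keeps commands that never touch them (for example, printing help text) from paying their startup cost.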
Configurations can be serialized to YAML (nemo_run/core/serialization/yaml.py) or compressed JSON (zlib_json.py) for persistence.
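The compressed-JSON idea can be sketched as a round trip through json and zlib. The base64 wrapping, function names, and example payload below are assumptions for illustration. The actual format used by zlib_json.py is not described in this file.

```python
import base64
import json
import zlib
from typing import Any


def dumps_compressed(obj: Any) -> str:
    """Hypothetical helper: JSON-encode, zlib-compress, then base64 to get text."""
    raw = json.dumps(obj, sort_keys=True).encode("utf-8")
    return base64.urlsafe_b64encode(zlib.compress(raw)).decode("ascii")


def loads_compressed(blob: str) -> Any:
    """Inverse of dumps_compressed."""
    raw = zlib.decompress(base64.urlsafe_b64decode(blob.encode("ascii")))
    return json.loads(raw.decode("utf-8"))


# Illustrative payload only; not a real NeMo Run config schema.
config = {"target": "my_pkg.train", "kwargs": {"lr": 0.001, "epochs": 3}}
blob = dumps_compressed(config)
restored = loads_compressed(blob)
```

The text-safe blob can be passed through environment variables or command-line arguments, which is one plausible reason to prefer it over raw YAML for transport.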
ExperimentPlugin (nemo_run/run/plugin.py) provides hooks into the experiment lifecycle.
- nemo_run/api.py - all public exports
- nemo_run/config.py - Config, Partial, Script classes
- nemo_run/run/experiment.py - Experiment context manager
- nemo_run/core/execution/base.py - Executor base class
- nemo_run/core/execution/slurm.py - most complex executor (SLURM + SSH)
- test/conftest.py - shared fixtures
- Pytest marker slow is skipped by default (addopts = -m "not slow" in pyproject.toml)
- INCLUDE_WORKSPACE_FILE env var controls workspace-related test behavior
- Test directory is added to PYTHONPATH via add_test_to_pythonpath fixture