Weights & Biases Integration

Nemotron Kit configures Weights & Biases (W&B) automatically: it detects your local credentials and project settings and passes them to containers launched via nemo-run. You do not need to manage credentials by hand across local, Docker, Slurm, and cloud executors.

Note: The artifact system currently requires W&B. Backend-agnostic artifact tracking is in development.

Configuration

env.toml Setup

Add a [wandb] section to your env.toml:

[wandb]
project = "nemotron"
entity = "YOUR-TEAM"

Field	Description
`project`	W&B project name (required to enable tracking)
`entity`	W&B team/entity name

Authentication

Authenticate locally before running jobs:

wandb login

Your API key is stored in ~/.netrc and automatically detected by the kit.

Automatic Environment Variables

When you run jobs via nemo-run, the kit automatically detects your W&B configuration and passes it to the container as environment variables:

Variable	Source	Description
`WANDB_API_KEY`	`wandb.api.api_key`	API key from local wandb login
`WANDB_PROJECT`	`env.toml [wandb]`	Project name
`WANDB_ENTITY`	`env.toml [wandb]`	Team/entity name

This works across all executor types:

Local — Environment variables set directly
Docker — Passed via container env vars
Slurm — Included in job submission
SkyPilot — Set in cloud instance environment
Ray — Passed via runtime_env.env_vars

How It Works

The build_executor() function in nemotron.kit.run handles automatic detection:

# Auto-detect W&B API key from local login
if "WANDB_API_KEY" not in merged_env:
    import wandb
    api_key = wandb.api.api_key
    if api_key:
        merged_env["WANDB_API_KEY"] = api_key

# Load project/entity from env.toml [wandb] section
wandb_config = load_wandb_config()
if wandb_config is not None:
    if wandb_config.project:
        merged_env["WANDB_PROJECT"] = wandb_config.project
    if wandb_config.entity:
        merged_env["WANDB_ENTITY"] = wandb_config.entity
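
The same precedence rules can be tried in isolation. The sketch below is not the kit's code: `WandbSettings` and `merge_wandb_env` are hypothetical stand-ins that take the API key and settings as arguments, so the logic runs without `wandb` installed.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class WandbSettings:
    """Hypothetical stand-in for the object returned by load_wandb_config()."""

    project: str | None = None
    entity: str | None = None


def merge_wandb_env(
    env: dict[str, str], api_key: str | None, settings: WandbSettings | None
) -> dict[str, str]:
    """Return a copy of `env` with W&B variables filled in, mirroring the excerpt above."""
    merged = dict(env)
    # An API key that is already present is never overwritten.
    if "WANDB_API_KEY" not in merged and api_key:
        merged["WANDB_API_KEY"] = api_key
    # Project and entity from env.toml are applied whenever they are set.
    if settings is not None:
        if settings.project:
            merged["WANDB_PROJECT"] = settings.project
        if settings.entity:
            merged["WANDB_ENTITY"] = settings.entity
    return merged


env = merge_wandb_env({"WANDB_API_KEY": "explicit"}, "from-login", WandbSettings("nemotron"))
print(env)
```

Note the asymmetry: an explicitly supplied `WANDB_API_KEY` wins over the local login, while `project` and `entity` from env.toml replace any existing values.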

Using W&B in Training Scripts

Initialization from Environment

Training scripts running inside containers can initialize W&B from environment variables:

from nemotron.kit.train_script import init_wandb_from_env

# Reads WANDB_PROJECT and WANDB_ENTITY from environment
init_wandb_from_env()

Conditional Initialization

For scripts that support optional W&B tracking:

from nemotron.kit import init_wandb_if_configured
from nemotron.kit.wandb import WandbConfig

# Initialize only if WandbConfig is provided and has a project set
wandb_config = WandbConfig(project="nemotron", entity="my-team")
init_wandb_if_configured(wandb_config, job_type="training")

WandbConfig Dataclass

The WandbConfig dataclass provides typed configuration:

from nemotron.kit.wandb import WandbConfig

config = WandbConfig(
    project="nemotron",           # Required to enable tracking
    entity="my-team",             # Team/entity name
    run_name="experiment-001",    # Optional run name
    tags=("pretrain", "nano3"),   # Tags for filtering
    notes="First pretrain run",   # Run description
)

# Check if tracking is enabled
if config.enabled:
    print(f"Logging to {config.entity}/{config.project}")
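
The `enabled` check can be pictured with a stand-in dataclass. This is a sketch only: `SketchWandbConfig` is hypothetical, its fields are copied from the example above, and the rule "enabled means `project` is set" is inferred from the field descriptions rather than taken from the kit's source.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SketchWandbConfig:
    """Hypothetical stand-in for nemotron.kit.wandb.WandbConfig."""

    project: str | None = None
    entity: str | None = None
    run_name: str | None = None
    tags: tuple[str, ...] = ()
    notes: str | None = None

    @property
    def enabled(self) -> bool:
        # Assumption: tracking is on exactly when a non-empty project is set.
        return bool(self.project)


print(SketchWandbConfig(project="nemotron", entity="my-team").enabled)  # True
print(SketchWandbConfig(entity="my-team").enabled)  # False
```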

Artifact Lineage

W&B artifacts provide full lineage tracking. See Artifact Lineage for details on:

End-to-end lineage from raw data to final model
Semantic URIs for artifact references
Viewing lineage in the W&B UI

Advanced Features

Checkpoint Logging

The kit can patch checkpoint saving so that each checkpoint is logged to W&B as an artifact. Call the patch function to enable this:

from nemotron.kit.wandb import patch_wandb_checkpoint_logging

# Patch Megatron-Bridge checkpoint saving
patch_wandb_checkpoint_logging()

This enables:

Automatic artifact creation for each checkpoint
Lineage links to training data artifacts
Version tracking with step numbers

NeMo-RL Checkpoint Logging

For reinforcement learning with NeMo-RL:

from nemotron.kit.wandb import patch_nemo_rl_checkpoint_logging

# Patch NeMo-RL checkpoint saving
patch_nemo_rl_checkpoint_logging()

Seeded Random Fix

When using seeded random states (common in RL), W&B's default run ID generation can fail. The kit provides a patch:

from nemotron.kit.wandb import patch_wandb_runid_for_seeded_random

# Fix "Invalid Client ID digest" errors
patch_wandb_runid_for_seeded_random()
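
The general hazard behind this class of error is easy to reproduce in plain Python: any identifier drawn from the global `random` module repeats once the seed is fixed. The sketch below illustrates only that principle. It is not W&B's ID generator and not what the patch does internally; `seeded_id` and `independent_id` are hypothetical names.

```python
import random
import secrets
import string

ALPHABET = string.ascii_lowercase + string.digits


def seeded_id(length: int = 16) -> str:
    """Draws from the global `random` state, so a fixed seed fixes the ID."""
    return "".join(random.choice(ALPHABET) for _ in range(length))


def independent_id(length: int = 16) -> str:
    """Draws from the OS entropy source, unaffected by random.seed()."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


random.seed(1234)
first = seeded_id()
random.seed(1234)
second = seeded_id()
print(first == second)  # True: two processes with the same seed collide

random.seed(1234)
print(independent_id() == independent_id())
```

Generating IDs from a source that ignores the global seed keeps training reproducible while letting each run or client receive a distinct identifier.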

Troubleshooting

"WANDB_API_KEY not found"

Ensure you're logged in locally:

wandb login

"Project not found"

Verify the project exists in your W&B workspace, or let W&B create it automatically on first run.

Environment variables not passed to container

Check that your env.toml has a [wandb] section with at least project set:

[wandb]
project = "nemotron"
entity = "YOUR-TEAM"

Ray workers missing credentials

For Ray data prep jobs, credentials are passed via runtime_env.env_vars. Ensure your local wandb login is active before submitting the job.
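
To see what a worker would receive, you can build the `runtime_env` dictionary by hand and inspect it. This is a sketch, not the kit's submission code: `build_runtime_env` is a hypothetical helper, and only the `env_vars` key of Ray's `runtime_env` is used.

```python
import os
from collections.abc import Mapping

WANDB_VARS = ("WANDB_API_KEY", "WANDB_PROJECT", "WANDB_ENTITY")


def build_runtime_env(environ: Mapping[str, str] = os.environ) -> dict[str, dict[str, str]]:
    """Hypothetical helper: collect the W&B variables that are set into a Ray runtime_env."""
    env_vars = {name: environ[name] for name in WANDB_VARS if environ.get(name)}
    return {"env_vars": env_vars}


runtime_env = build_runtime_env({"WANDB_API_KEY": "dummy-key", "WANDB_PROJECT": "nemotron"})
# Print only the variable names so the key itself never reaches the logs.
print(sorted(runtime_env["env_vars"]))  # ['WANDB_API_KEY', 'WANDB_PROJECT']
# With Ray installed: import ray; ray.init(runtime_env=runtime_env)
```

If `WANDB_API_KEY` is absent from the printed names, the local login was not detected at submission time.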

API Reference

wandb.py Exports

Export	Description
`WandbConfig`	Configuration dataclass
`init_wandb_if_configured()`	Conditional W&B initialization
`patch_wandb_checkpoint_logging()`	Enable Megatron-Bridge checkpoint artifacts
`patch_nemo_rl_checkpoint_logging()`	Enable NeMo-RL checkpoint artifacts
`patch_wandb_runid_for_seeded_random()`	Fix seeded random ID generation

run.py Exports

Export	Description
`load_wandb_config()`	Load `WandbConfig` from env.toml
`build_executor()`	Build executor with auto W&B env vars
