Skip to content

Latest commit

 

History

History
219 lines (150 loc) · 6 KB

File metadata and controls

219 lines (150 loc) · 6 KB

Weights & Biases Integration

Nemotron Kit provides automatic W&B configuration that seamlessly passes credentials and settings to containers running via nemo-run. This eliminates manual credential management across local, Docker, Slurm, and cloud executors.

Note: The artifact system currently requires W&B. Backend-agnostic artifact tracking is in development.

Configuration

env.toml Setup

Add a [wandb] section to your env.toml:

[wandb]
project = "nemotron"
entity = "YOUR-TEAM"
Field Description
project W&B project name (required to enable tracking)
entity W&B team/entity name

Authentication

Authenticate locally before running jobs:

wandb login

Your API key is stored in ~/.netrc and automatically detected by the kit.

Automatic Environment Variables

When you run jobs via nemo-run, the kit automatically detects your W&B configuration and passes it to the container as environment variables:

Variable Source Description
WANDB_API_KEY wandb.api.api_key API key from local wandb login
WANDB_PROJECT env.toml [wandb] Project name
WANDB_ENTITY env.toml [wandb] Team/entity name

This works across all executor types:

  • Local — Environment variables set directly
  • Docker — Passed via container env vars
  • Slurm — Included in job submission
  • SkyPilot — Set in cloud instance environment
  • Ray — Passed via runtime_env.env_vars

How It Works

The build_executor() function in nemotron.kit.run handles automatic detection:

# Auto-detect W&B API key from local login
if "WANDB_API_KEY" not in merged_env:
    import wandb
    api_key = wandb.api.api_key
    if api_key:
        merged_env["WANDB_API_KEY"] = api_key

# Load project/entity from env.toml [wandb] section
wandb_config = load_wandb_config()
if wandb_config is not None:
    if wandb_config.project:
        merged_env["WANDB_PROJECT"] = wandb_config.project
    if wandb_config.entity:
        merged_env["WANDB_ENTITY"] = wandb_config.entity

Using W&B in Training Scripts

Initialization from Environment

Training scripts running inside containers can initialize W&B from environment variables:

from nemotron.kit.train_script import init_wandb_from_env

# Reads WANDB_PROJECT and WANDB_ENTITY from environment
init_wandb_from_env()

Conditional Initialization

For scripts that support optional W&B tracking:

from nemotron.kit import init_wandb_if_configured
from nemotron.kit.wandb import WandbConfig

# Initialize only if WandbConfig is provided and has a project set
wandb_config = WandbConfig(project="nemotron", entity="my-team")
init_wandb_if_configured(wandb_config, job_type="training")

WandbConfig Dataclass

The WandbConfig dataclass provides typed configuration:

from nemotron.kit.wandb import WandbConfig

config = WandbConfig(
    project="nemotron",           # Required to enable tracking
    entity="my-team",             # Team/entity name
    run_name="experiment-001",    # Optional run name
    tags=("pretrain", "nano3"),   # Tags for filtering
    notes="First pretrain run",   # Run description
)

# Check if tracking is enabled
if config.enabled:
    print(f"Logging to {config.entity}/{config.project}")

Artifact Lineage

W&B artifacts provide full lineage tracking. See Artifact Lineage for details on:

  • End-to-end lineage from raw data to final model
  • Semantic URIs for artifact references
  • Viewing lineage in the W&B UI

Advanced Features

Checkpoint Logging

The kit automatically patches checkpoint saving to log artifacts to W&B:

from nemotron.kit.wandb import patch_wandb_checkpoint_logging

# Patch Megatron-Bridge checkpoint saving
patch_wandb_checkpoint_logging()

This enables:

  • Automatic artifact creation for each checkpoint
  • Lineage links to training data artifacts
  • Version tracking with step numbers

NeMo-RL Checkpoint Logging

For reinforcement learning with NeMo-RL:

from nemotron.kit.wandb import patch_nemo_rl_checkpoint_logging

# Patch NeMo-RL checkpoint saving
patch_nemo_rl_checkpoint_logging()

Seeded Random Fix

When using seeded random states (common in RL), W&B's default run ID generation can fail. The kit provides a patch:

from nemotron.kit.wandb import patch_wandb_runid_for_seeded_random

# Fix "Invalid Client ID digest" errors
patch_wandb_runid_for_seeded_random()

Troubleshooting

"WANDB_API_KEY not found"

Ensure you're logged in locally:

wandb login

"Project not found"

Verify the project exists in your W&B workspace, or let W&B create it automatically on first run.

Environment variables not passed to container

Check that your env.toml has a [wandb] section:

[wandb]
project = "nemotron"
entity = "YOUR-TEAM"

Ray workers missing credentials

For Ray data prep jobs, credentials are passed via runtime_env.env_vars. Ensure your local wandb login is active before submitting the job.

API Reference

wandb.py Exports

Export Description
WandbConfig Configuration dataclass
init_wandb_if_configured() Conditional W&B initialization
patch_wandb_checkpoint_logging() Enable Megatron-Bridge checkpoint artifacts
patch_nemo_rl_checkpoint_logging() Enable NeMo-RL checkpoint artifacts
patch_wandb_runid_for_seeded_random() Fix seeded random ID generation

run.py Exports

Export Description
load_wandb_config() Load WandbConfig from env.toml
build_executor() Build executor with auto W&B env vars

Further Reading