feat: Add validation loss tracking, early stopping, and checkpoint cleanup #2633

NotNANtoN · 2025-12-12T13:58:49Z

Summary

Adds optional validation loss tracking during training using a separate validation episode split.

Features

Validation split: validation_fraction config option to split episodes into train/val sets
Validation loss: Computed using select_action inference for model-agnostic metrics (L1/L2)
Early stopping: Stop training when validation loss or eval success stops improving
Checkpoint cleanup: keep_last_n_checkpoints to automatically remove old checkpoints
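As a rough illustration of the validation split, the sketch below partitions episode indices into train and validation sets from a validation_fraction value. The function name split_episodes, the seeded shuffle, and the rounding rule are assumptions made for this example, not code taken from the PR.

```python
import random


def split_episodes(
    num_episodes: int, validation_fraction: float, seed: int = 0
) -> tuple[list[int], list[int]]:
    """Split episode indices into (train, val) lists.

    Hypothetical helper: the PR may select validation episodes differently
    (for example, by taking the last episodes instead of shuffling).
    """
    if not 0.0 <= validation_fraction < 1.0:
        raise ValueError("validation_fraction must be in [0.0, 1.0)")

    indices = list(range(num_episodes))
    # 0.0 means "no validation split": every episode is used for training.
    if validation_fraction == 0.0 or num_episodes < 2:
        return indices, []

    # Shuffle with a fixed seed so the split is reproducible across runs.
    rng = random.Random(seed)
    rng.shuffle(indices)

    # Keep at least one validation episode and at least one training episode.
    num_val = max(1, round(num_episodes * validation_fraction))
    num_val = min(num_val, num_episodes - 1)

    val_episodes = sorted(indices[:num_val])
    train_episodes = sorted(indices[num_val:])
    return train_episodes, val_episodes
```

Splitting by episode rather than by frame keeps frames from the same trajectory out of both sets at once, which is presumably why the option operates on episodes.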

Design Decisions

Uses select_action for validation rather than modifying individual policies - this makes validation policy-agnostic
Validation dataset is created without augmentations for clean evaluation
All features are opt-in with sensible defaults (no breaking changes)
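To make the policy-agnostic idea concrete, here is a minimal sketch of computing L1 and L2 metrics from nothing but an action-selection callable. It uses plain Python sequences so it runs without any dependencies; a real implementation would most likely work on batched tensors. The function name, the metric key names after the val/ prefix, and the choice of mean squared error for L2 are assumptions for illustration.

```python
from typing import Callable, Iterable, Sequence

Action = Sequence[float]


def validation_losses(
    select_action: Callable[[dict], Action],
    samples: Iterable[tuple[dict, Action]],
) -> dict[str, float]:
    """Average L1 and L2 errors between predicted and recorded actions.

    Only the select_action callable is required, so any policy exposing
    such a method can be validated without touching its training loss.
    """
    l1_sum = 0.0
    l2_sum = 0.0
    count = 0
    for observation, target in samples:
        predicted = select_action(observation)
        if len(predicted) != len(target):
            raise ValueError("predicted and target actions differ in size")
        for p, t in zip(predicted, target):
            diff = p - t
            l1_sum += abs(diff)  # absolute error per action dimension
            l2_sum += diff * diff  # squared error per action dimension
            count += 1
    if count == 0:
        raise ValueError("no validation samples provided")
    # Key names are hypothetical; the PR only states a "val/" prefix.
    return {"val/l1_loss": l1_sum / count, "val/l2_loss": l2_sum / count}
```

Because the metrics compare actions directly, values are comparable across policy types even when their internal training losses are not.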

Config Options

validation_fraction: float = 0.0 # 0.1 = 10% for validation
early_stopping.enable: bool = False
early_stopping.patience_steps: int = 10000
early_stopping.monitor: str = "val_loss" # or "eval_success"
keep_last_n_checkpoints: int = 0 # 0 = keep all
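The early stopping options above could drive logic along the following lines. This is a sketch under assumptions: the class names EarlyStoppingConfig and EarlyStopper, and the rule that "val_loss" is minimized while "eval_success" is maximized, are illustrative and not taken from the PR's code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EarlyStoppingConfig:
    # Field names and defaults mirror the options listed in the PR description.
    enable: bool = False
    patience_steps: int = 10000
    monitor: str = "val_loss"  # or "eval_success"


class EarlyStopper:
    """Hypothetical tracker that signals when the monitored metric stalls."""

    def __init__(self, cfg: EarlyStoppingConfig) -> None:
        if cfg.monitor not in ("val_loss", "eval_success"):
            raise ValueError(f"unsupported monitor: {cfg.monitor}")
        self.cfg = cfg
        # Assumption: a loss should go down, a success rate should go up.
        self.minimize = cfg.monitor == "val_loss"
        self.best_value: Optional[float] = None
        self.best_step = 0

    def update(self, step: int, value: float) -> bool:
        """Record a new metric value; return True if training should stop."""
        if not self.cfg.enable:
            return False
        if self.best_value is None:
            improved = True
        elif self.minimize:
            improved = value < self.best_value
        else:
            improved = value > self.best_value
        if improved:
            self.best_value = value
            self.best_step = step
            return False
        # Stop once patience_steps have passed since the last improvement.
        return step - self.best_step >= self.cfg.patience_steps
```

In a training loop, update would be called each time the monitored metric is computed, and the loop would break when it returns True.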

Testing

Tested with ACT policy
Tested with SmolVLA policy

This PR adds the ability to track validation loss during training:

Features:
- validation_fraction config option to split episodes into train/val sets
- Validation loss computed using inference (select_action) for model-agnostic metrics
- L1 and L2 loss metrics logged to wandb under val/ prefix
- Early stopping based on validation loss or eval success rate
- keep_last_n_checkpoints option to automatically cleanup old checkpoints

The validation uses a separate dataset copy without augmentations for clean evaluation. Uses select_action for inference-based validation, making it policy-agnostic. Backward compatible - defaults maintain existing behavior (no validation split).

Config options:
- validation_fraction: 0.0-1.0 (default 0.0, no validation)
- early_stopping.enable: bool (default False)
- early_stopping.patience_steps: int (default 10000)
- early_stopping.monitor: 'val_loss' or 'eval_success'
- keep_last_n_checkpoints: int (default 0, keep all)
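For the keep_last_n_checkpoints option, a cleanup step might look roughly like the sketch below. It assumes each checkpoint lives in its own directory named by its (possibly zero-padded) step number; the function name and that directory layout are assumptions for this example, so the PR's actual implementation may differ.

```python
import shutil
from pathlib import Path


def cleanup_checkpoints(checkpoints_dir: Path, keep_last_n: int) -> list[Path]:
    """Delete all but the newest keep_last_n step directories.

    Returns the list of removed paths. A value of 0 keeps everything,
    matching the default described in the PR.
    """
    if keep_last_n <= 0:
        return []

    # Only consider directories whose names are step numbers (assumed layout);
    # anything else, such as a "last" pointer, is left untouched.
    step_dirs = sorted(
        (p for p in checkpoints_dir.iterdir() if p.is_dir() and p.name.isdigit()),
        key=lambda p: int(p.name),
    )

    # Everything except the final keep_last_n entries is stale.
    stale = step_dirs[:-keep_last_n]
    for path in stale:
        shutil.rmtree(path)
    return stale
```

Calling this right after each checkpoint save keeps disk usage bounded during long runs.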
