Add checkpoint callback support to NLE.fit() by lauradriscoll · Pull Request #57 · dirmeier/sbijax

lauradriscoll · 2026-01-14T01:02:18Z

Summary

Adds an optional checkpoint_callback parameter to NLE.fit() that enables periodic checkpointing and real-time logging during training.

Motivation

Long training runs need checkpointing for crash recovery
Real-time monitoring with tools like Weights & Biases requires periodic callbacks
Users want to save optimizer state to resume training exactly

Changes

Added checkpoint_callback parameter (optional callable) to fit() method
Added checkpoint_every parameter (default: 100 iterations)
Callback receives: iteration, params, train_loss, val_loss, state
Fully backward compatible (checkpoint_callback defaults to None, so existing calls are unaffected)
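
To illustrate the contract described in the list above, here is a minimal sketch of how a training loop could invoke the callback. It is not the sbijax implementation: toy_fit, its fake parameter update, and the choice of 1-based iteration counting are assumptions made for illustration. Only the callback argument order (iteration, params, train_loss, val_loss, state) and the two parameter names come from this PR.

```python
def toy_fit(n_iter, checkpoint_callback=None, checkpoint_every=100):
    """Hypothetical stand-in for a fit() loop; NOT the actual NLE.fit() code."""
    params = {"w": 0.0}       # placeholder for model parameters
    state = {"step": 0}       # placeholder for optimizer state
    losses = []
    for iteration in range(1, n_iter + 1):  # 1-based counting is an assumption
        # Fake "training step": nudge the parameter and record dummy losses.
        params = {"w": params["w"] + 0.1}
        state = {"step": iteration}
        train_loss = 1.0 / iteration
        val_loss = 1.5 / iteration
        losses.append((train_loss, val_loss))

        # Invoke the callback only if one was given, every checkpoint_every steps.
        if checkpoint_callback is not None and iteration % checkpoint_every == 0:
            checkpoint_callback(iteration, params, train_loss, val_loss, state)
    return params, losses


# Record the iterations at which the callback fires.
seen = []
toy_fit(
    1000,
    checkpoint_callback=lambda it, p, tl, vl, s: seen.append(it),
    checkpoint_every=250,
)
```

With checkpoint_callback left at None, the loop never touches the callback path, which is what keeps existing calls working unchanged.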

Usage Example

import pickle
from pathlib import Path

def save_checkpoint(iteration, params, train_loss, val_loss, state):
    checkpoint = {
        'iteration': iteration,
        'params': params,
        'train_loss': train_loss,
        'val_loss': val_loss,
        'state': state
    }
    Path("checkpoints").mkdir(exist_ok=True)
    with open(f"checkpoints/ckpt_{iteration}.pkl", 'wb') as f:
        pickle.dump(checkpoint, f)
    print(f"Saved checkpoint at iteration {iteration}")

# Use with NLE
params, losses = model.fit(
    rng_key,
    data,
    checkpoint_callback=save_checkpoint,
    checkpoint_every=500
)
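
For crash recovery, the files written by save_checkpoint above can be read back with a small helper. The following is a sketch under stated assumptions: load_latest_checkpoint is a hypothetical helper that is not part of this PR or of sbijax, and it relies only on the checkpoints/ckpt_{iteration}.pkl naming used in the example. How the restored params and state are passed back into training is not specified in this PR, so that step is left out.

```python
import pickle
from pathlib import Path


def load_latest_checkpoint(directory="checkpoints"):
    """Hypothetical helper: return the checkpoint dict with the highest iteration.

    Returns None if the directory contains no ckpt_*.pkl files.
    """
    paths = list(Path(directory).glob("ckpt_*.pkl"))
    if not paths:
        return None
    # Compare iteration numbers as integers so ckpt_1000 sorts after ckpt_500.
    latest = max(paths, key=lambda p: int(p.stem.split("_")[1]))
    with open(latest, "rb") as f:
        return pickle.load(f)


checkpoint = load_latest_checkpoint()
if checkpoint is not None:
    print(f"Latest checkpoint is from iteration {checkpoint['iteration']}")
```

Only unpickle checkpoint files from a trusted source, since pickle.load can execute arbitrary code.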

Testing

Backward compatible (existing code works without changes)
Callback is optional (defaults to None)
Add unit test (named nle_test_checkpoint.py)

lauradriscoll added 6 commits November 15, 2025 10:46

from jax.tree_util import tree_map compatible with new JAX

b7c845d

Add checkpoint callback support to NLE.fit()

9c2efc5

add checkpoint test

039a9ff

Remove test script from PR

3436047

Apply pre-commit formatting

ca3d4af

Merge branch 'main' into add-checkpoint-callback

e5dd8e5

lauradriscoll wants to merge 6 commits into dirmeier:main from lauradriscoll:add-checkpoint-callback
