
Conversation

@baileykuehl baileykuehl (Contributor) commented Jan 21, 2026

Added a model merging callback that averages model weights over the last N steps of training and saves the result as a stepXXXX-merged directory

Key features:

  • Disabled by default (enabled=False) because it appears to somewhat affect performance at merge steps on larger models (measuring this impact was out of scope for now; I will follow up with experiments if this gets adopted)
  • For WSD-S schedulers, automatically merges and evaluates right before each decay phase begins. This overlaps with the pre-decay eval in the old ladder.
  • Saves accumulator state to checkpoints, so interrupted runs can resume mid-merge-window without losing progress
  • Added an integration test that trains a small model for 5 steps (averaging the last 3)
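
A minimal usage sketch (the import path, constructor signature, and `TrainerConfig`-style callbacks dict are assumptions; the field names come from this PR's description and commits):

```python
# Hypothetical usage sketch -- field names (enabled, merge_step, merge_last_n_steps,
# eval_merged) come from this PR; the import path and exact signature are assumptions.
from olmo_core.train.callbacks import ModelMergeCallback  # assumed import path

# trainer_config: an existing olmo_core TrainerConfig with a callbacks dict
trainer_config.callbacks["model_merge"] = ModelMergeCallback(
    enabled=True,                # disabled by default
    merge_step=[5_000, 10_000],  # an int or a list of steps; defaults to trainer.max_steps
    merge_last_n_steps=100,      # average the final 100 steps before each merge step
    eval_merged=True,            # evaluate merged models with a "merged-" metric prefix
)
```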

A few examples of recent runs with this callback:

@baileykuehl baileykuehl changed the title Bk/model merging Add model merging callback and integration test Jan 21, 2026
@baileykuehl baileykuehl changed the title Add model merging callback and integration test [DRAFT] Add model merging callback and integration test Jan 21, 2026
baileykuehl and others added 22 commits January 21, 2026 11:05
Adds a trainer callback that maintains a running average of model weights
during training. At merge_step, it saves the averaged model as a new checkpoint.

This provides an alternative to learning rate annealing by averaging weights
from multiple training steps.
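
A minimal sketch of the running-average idea (not the PR's actual implementation; sharding and checkpointing concerns are ignored here):

```python
# Minimal sketch: keep a running sum of parameters over the merge window and
# divide by the count at merge time. Not the PR's actual code.
import torch

class WeightAverager:
    def __init__(self) -> None:
        self._sums: dict[str, torch.Tensor] = {}
        self._n_accumulated = 0

    @torch.no_grad()
    def accumulate(self, model: torch.nn.Module) -> None:
        for name, param in model.named_parameters():
            p = param.detach().to("cpu", dtype=torch.float32)
            if name in self._sums:
                self._sums[name] += p
            else:
                self._sums[name] = p.clone()
        self._n_accumulated += 1

    def merged_state_dict(self) -> dict[str, torch.Tensor]:
        # Average of all accumulated steps, ready to save as a stepXXXX-merged checkpoint.
        return {k: v / self._n_accumulated for k, v in self._sums.items()}
```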
- merge_step now defaults to trainer.max_steps if not explicitly set
- Fixed off-by-one: merge_last_n_steps=100 now gives exactly 100 steps
- Updated docstring with clearer examples
… support

- Support list of merge steps (merge_step can be int or List[int])
- Auto-detect decay phase from scheduler and set merge_step accordingly
- Warn if merge steps fall within the decay phase
- Add evaluation support for merged models with "merged-" metric prefix
- Auto-discover evaluators from EvaluatorCallback instances
- Add eval_merged flag to enable/disable merged model evaluation
- Add `prefix` parameter to EvaluatorCallback._perform_eval() to support
  custom metric prefixes (defaults to "eval")
- Simplify ModelMergeCallback._evaluate_merged() to call existing
  _perform_eval(prefix="eval/merged") instead of duplicating eval logic
- Add `validate` flag to ModelMergeCallback for testing weight averaging
  correctness by comparing running sum vs stack-and-mean methods
- Remove unused `evaluators` and `eval_duration` fields from ModelMergeCallback
- Add integration test that trains a small 30M model for 5 steps with
  weight averaging validation enabled
- Add GitHub Actions job to run integration tests on GPU
- Move import pytest to top of file (lint fix)
- Use torchrun instead of pytest in CI for the integration test
  since FSDP requires distributed environment initialization

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove FSDP and bfloat16 (not needed for testing callback logic)
- Move from GPU checks to regular CPU checks in CI
- Simplify test file (remove torchrun, main function)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Keep CPU test for basic callback logic
- Add GPU test with FSDP to verify sharded checkpoint handling
- GPU test runs with torchrun and 2 GPUs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
ModelMergeCallback now auto-detects WSDS schedulers and computes
merge steps before each decay phase. For a ladder run with N periods,
it will automatically set up N merge steps (one per anneal).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
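
A hypothetical sketch of the per-period merge-step computation described in the commit above (the inputs here are stand-ins, not the scheduler's real attribute names):

```python
# Hypothetical sketch: one merge step per WSD-S period, placed right before each
# decay phase begins. `period_end_steps` and `decay_steps` are assumed inputs.
def compute_merge_steps(period_end_steps: list[int], decay_steps: list[int]) -> list[int]:
    return [end - decay for end, decay in zip(period_end_steps, decay_steps)]

# e.g. two periods ending at steps 10_000 and 20_000, each with a 1_000-step decay:
# merge/eval at steps 9_000 and 19_000, just before each anneal.
assert compute_merge_steps([10_000, 20_000], [1_000, 1_000]) == [9_000, 19_000]
```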
Match existing test patterns - use run_distributed_test which handles
spawning processes internally, instead of torchrun wrapper.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Move accumulator and captured weights to CPU during accumulation
- Explicitly move weights back to correct device when evaluating merged model
- Avoids keeping extra copy of model weights on GPU during training

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Document tradeoffs for 70B+ models:
- Per-step state_dict() overhead during merge window
- Checkpoint size increase when saving accumulator

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Use public scheduler attributes instead of private _resolve_decay()
- Rename batch_size to tokens_per_step to clarify units
- Add documentation about assumption that global_batch_size is in tokens
- Log tokens/step in info message so users can verify
- Add debug logging for each period's computed values

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Ensures all ranks finish writing before any rank proceeds to
evaluation or next merge step.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
baileykuehl and others added 5 commits January 21, 2026 16:44
Add -m "not gpu" to skip GPU-marked tests in the CPU CI job.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Use get_model_state_dict/set_model_state_dict for proper FSDP support
- Use full_state_dict=True in accumulation to ensure correct saving
- Remove broken LMEvaluatorCallbackConfig from integration test (missing label metadata)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
baileykuehl and others added 15 commits January 23, 2026 16:45
Adds a ladder script that includes ModelMergeCallback for weight averaging experiments.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add __post_init__ validation for merge_last_n_steps and merge_step values
- Fix resume handling: save accumulated weights when past merge step
- Use olmo_core.io utilities for remote path support (S3/GCS)
- Check for .metadata file to detect complete checkpoints (vs partial)
- Add overlap detection for merge windows with a helpful error message (sketched below)
- Add warning for truncated accumulation windows
- Fix decay phase warning to apply when user specifies merge_step on WSDS
- Switch to sharded checkpoint approach (full_state_dict=False)
- Add unit tests for validation logic

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
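
A hypothetical sketch of the merge-window overlap check mentioned above (not the PR's actual code):

```python
# Hypothetical sketch: each merge window covers
# [merge_step - merge_last_n_steps + 1, merge_step]; adjacent windows must not overlap.
def check_no_overlap(merge_steps: list[int], merge_last_n_steps: int) -> None:
    steps = sorted(merge_steps)
    for prev, curr in zip(steps, steps[1:]):
        window_start = curr - merge_last_n_steps + 1
        if window_start <= prev:
            raise ValueError(
                f"Merge window for step {curr} starts at step {window_start}, which "
                f"overlaps the earlier merge step {prev}; reduce merge_last_n_steps "
                f"or space the merge steps further apart."
            )
```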
- Add a `continue` after handling past merge steps so remaining merge steps are still checked
- Add barriers around merged model evaluation to synchronize all ranks

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Change `enabled` default from True to False so callback must be
  explicitly enabled
- Remove olmo3_model_merging_ladder.py (test script, not needed in PR)
@baileykuehl baileykuehl changed the title [DRAFT] Add model merging callback and integration test Add model merging callback with integration test Jan 30, 2026
@baileykuehl baileykuehl requested a review from dirkgr February 2, 2026 18:09
- Making accumulator checkpointing configurable
"""
state = {
"n_accumulated": self._n_accumulated,
Member

The way the trainer saves this state is not ideal for model state, especially when it's distributed/sharded. So we'll probably need to rethink this.

Here's one alternative:

Say we're merging the last 100 steps. We enforce saving a full training checkpoint right when we start to accumulate, i.e. 100 steps before we want to evaluate the merged model, and we don't checkpoint again until after the merged model is evaluated. That way we wouldn't need to save the state of this callback, because it would always be recomputed.
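
For example, something like this (a sketch only; `trainer.save_checkpoint()` and the accumulation helper are assumed/hypothetical):

```python
# Sketch of the alternative: pin a full checkpoint when the merge window opens and
# skip further checkpoints until the merged model is evaluated, so the accumulator
# never needs to be saved and is simply recomputed on resume.
def on_step_start(step: int, merge_step: int, merge_last_n_steps: int, trainer) -> None:
    window_start = merge_step - merge_last_n_steps + 1
    if step == window_start:
        trainer.save_checkpoint()  # assumed trainer API; pins the resume point
    if window_start <= step <= merge_step:
        accumulate_weights(trainer.train_module.model)  # hypothetical accumulation helper
```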

@baileykuehl baileykuehl (Contributor Author) Feb 3, 2026

That makes a lot of sense! So the trade-off would just be that if we get interrupted during the merge window, we would lose at most merge_last_n_steps - 1 steps of work? That seems reasonable while we're using fairly low values.

Comment on lines 404 to 407
sd_options = dist_cp_sd.StateDictOptions(full_state_dict=False, cpu_offload=True)
model_state = dist_cp_sd.get_model_state_dict(
self.trainer.train_module.model, options=sd_options
)
Member

Using dist_cp* functions comes with additional overhead that I think we can safely avoid by just collecting model parameters directly, e.g. something like model_state = {k: get_local_tensor(p.data.detach()) for k, p in model.named_parameters()}.

There's still the overhead of copying those params to CPU... One option is to keep them on GPU if we can afford it. But for larger models we're typically memory-starved, so that doesn't scale. Alternatively, we could still copy to CPU but try to hide the overhead of the copy by doing it asynchronously.

For example,

  1. Trigger an async copy as early as possible, like in pre_load_batch(). You can do an async copy with olmo_core.utils.move_to_device(tensor, torch.device("cpu"), non_blocking=True). Now you can't do anything else with the CPU tensor until you've triggered a host-device sync, so for now you just wait.
  2. Right before the optimizer step you force a host-device sync (call torch.cuda.synchronize()) to ensure the copy to CPU has finished before params are modified.
  3. Then accumulate those weights.

This means the accumulated weights are a step behind (since they're updated pre-optim step), but I think we can adjust the bookkeeping accordingly.
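
Putting that together, a rough sketch (the hook names and the `get_local_tensor` import path are assumptions):

```python
# Rough sketch of the async-copy scheme. Assumptions: the hook names
# (pre_load_batch / pre_optim_step) and the import path for get_local_tensor.
import torch
from olmo_core.utils import move_to_device
from olmo_core.distributed.utils import get_local_tensor  # assumed import path

class AsyncWeightCapture:
    def __init__(self) -> None:
        self._pending: dict[str, torch.Tensor] = {}

    def pre_load_batch(self, model: torch.nn.Module) -> None:
        # 1. Kick off non-blocking device-to-CPU copies as early as possible.
        self._pending = {
            k: move_to_device(get_local_tensor(p.data.detach()), torch.device("cpu"), non_blocking=True)
            for k, p in model.named_parameters()
        }

    def pre_optim_step(self, accumulator: dict[str, torch.Tensor]) -> None:
        if not self._pending:
            return
        # 2. Force a sync so the copies have landed before params are modified.
        torch.cuda.synchronize()
        # 3. Accumulate. The accumulated weights lag the optimizer by one step,
        #    so the merge-window bookkeeping has to shift accordingly.
        for k, t in self._pending.items():
            if k in accumulator:
                accumulator[k] += t.float()
            else:
                accumulator[k] = t.float().clone()
```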
