Commit a5b2b51
[levanter] Wait for final checkpoint async save before process exit
The final checkpoint save in Levanter is async: the GCS write runs in a
background thread while the process continues. In train_dpo.py, the
post-training code attempted to wait via `wait_until_finished()`, but
called it on a *newly created* Checkpointer instance (which had nothing
to wait for) instead of the original one that holds the pending save.
train_lm.py had no wait at all.
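The wrong-instance wait described above can be reproduced with a toy model. This is a minimal sketch, not Levanter's code: `ToyCheckpointer`, `save_async`, and the sleep standing in for the GCS write are all hypothetical, and only the method name `wait_until_finished()` comes from the commit message.

```python
import threading
import time


class ToyCheckpointer:
    """Hypothetical stand-in for an async checkpointer (not Levanter's class)."""

    def __init__(self):
        self._pending = None  # background save thread, if one is in flight
        self.saved_steps = []

    def save_async(self, step, write_seconds=0.2):
        def _write():
            time.sleep(write_seconds)  # simulates a slow remote write
            self.saved_steps.append(step)

        self._pending = threading.Thread(target=_write)
        self._pending.start()

    def wait_until_finished(self):
        # Only blocks on saves started through *this* instance.
        if self._pending is not None:
            self._pending.join()
            self._pending = None


original = ToyCheckpointer()
original.save_async(step=9917)

# Buggy pattern: a fresh instance has no pending save, so this returns at once.
ToyCheckpointer().wait_until_finished()
done_after_wrong_wait = 9917 in original.saved_steps

# Correct pattern: wait on the instance that started the save.
original.wait_until_finished()
done_after_right_wait = 9917 in original.saved_steps
```

In this sketch the first wait returns while the write is still sleeping, so the save is not yet recorded. Only the second wait, on the original instance, guarantees it.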
If preemption (or a normal exit) kills the process while the async write
is still in flight, the only checkpoint at the target training step is
silently lost. This is especially damaging when the target step is not a
multiple of the permanent checkpoint interval (e.g. step 9917 with
`keep=[dict(every=10000)]`), since every checkpoint below 10000 is
temporary and subject to deletion on restart.
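The arithmetic behind the step 9917 example can be checked directly. The sketch below assumes that `every=10000` marks a checkpoint as permanent only when the step is a positive multiple of 10000. The helper `is_permanent` is hypothetical and Levanter's actual retention logic may differ in detail.

```python
def is_permanent(step: int, every: int) -> bool:
    """Assumed rule: a checkpoint is permanent only at positive multiples of `every`."""
    return step > 0 and step % every == 0


target_step = 9917
every = 10000

# The final step is not a multiple of the interval, so it would be temporary.
final_is_permanent = is_permanent(target_step, every)

# No step up to the target qualifies as permanent under this rule.
permanent_steps = [s for s in range(1, target_step + 1) if is_permanent(s, every)]
```

Under that assumption a 9917-step run produces no permanent checkpoint at all, which is why losing the final save is so costly.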
Fix:
- Store the Checkpointer as `self._checkpointer` on the Trainer so it is
accessible after training.
- train_lm.py: add `trainer._checkpointer.wait_until_finished()` after
training completes.
- train_dpo.py: replace the broken new-instance wait with
`trainer._checkpointer.wait_until_finished()`.
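The shape of the fix can be sketched as follows. This is an illustration under stated assumptions rather than the actual diff: `ToyTrainer`, `ToyCheckpointer`, `save_async`, and `train` are hypothetical names, while `_checkpointer` and `wait_until_finished()` are taken from the commit message.

```python
from concurrent.futures import ThreadPoolExecutor
import time


class ToyCheckpointer:
    """Hypothetical async checkpointer; not Levanter's implementation."""

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._futures = []
        self.saved_steps = []

    def save_async(self, step):
        self._futures.append(self._pool.submit(self._write, step))

    def _write(self, step):
        time.sleep(0.1)  # simulates a slow remote write
        self.saved_steps.append(step)

    def wait_until_finished(self):
        for future in self._futures:
            future.result()  # blocks; re-raises any error from the write
        self._futures.clear()


class ToyTrainer:
    def __init__(self):
        # Keep the checkpointer on the trainer so callers can reach it later.
        self._checkpointer = ToyCheckpointer()

    def train(self, num_steps):
        # Training loop elided; only the final async save matters here.
        self._checkpointer.save_async(num_steps)


trainer = ToyTrainer()
trainer.train(num_steps=9917)
trainer._checkpointer.wait_until_finished()  # block before the script returns
saved = trainer._checkpointer.saved_steps
```

Because the script waits on the same instance that issued the save, the final step is recorded by the time the wait returns.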
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent c2df1e7 commit a5b2b51
3 files changed
Lines changed: 15 additions & 6 deletions