
Fix: resume_training: True and EMACallback active #372

Merged: dtronmans merged 9 commits into main from fix/resume-training-ema on Apr 24, 2026
Conversation


@dtronmans dtronmans commented Apr 14, 2026

Contributor

Purpose

  • Previous problem: with EMACallback enabled, training for a few epochs and then starting a second run with resume_training: True in the config, passing the previous checkpoint through the weights argument, raised an error.

Specification

  • The error was the following:
│ ❱  70 │   │   │   for ema_v, model_v in zip(                                                     │
│    71 │   │   │   │   self.state_dict_ema.values(),                                              │
│    72 │   │   │   │   model.state_dict().values(),                                               │
│    73 │   │   │   │   strict=True,                                                               │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: zip() argument 2 is longer than argument 1
  • Loading the state dict non-strictly (load the matching keys and log the keys that exist only in the EMA state dict or only in the .ckpt file) showed that the live EMA state contains duplicate _node keys, which are filtered out of the saved checkpoint.
  • Fix: the EMA update no longer uses zip(..., strict=True); entries are matched by key and asymmetric entries are logged.
  • Asymmetries caused only by _node aliases are ignored during EMA loading, since these keys are filtered out of saved checkpoints.

Dependencies & Potential Impact

Deployment Plan

Testing & Validation

  • test_resume_training_with_ema_does_not_crash: trains for 1 epoch with EMA enabled and overfit_batches=1, starts a new run with resume_training=True, passes the previous checkpoint through train(weights=...), and asserts that the run does not crash.
  • Manual validation: [image: manual_validation]
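The shape of such a regression test can be sketched with a toy stand-in that does not train a model. Everything here is hypothetical: the "._node." marker used to recognise alias keys, the JSON file standing in for a .ckpt file, and the helper names save_checkpoint and load_ema_state are assumptions for illustration, not the project's API.

```python
import json
import tempfile
from pathlib import Path

ALIAS_MARKER = "._node."  # assumed naming pattern for alias keys


def is_alias_key(key):
    return ALIAS_MARKER in key


def save_checkpoint(path, ema_state):
    """Write the EMA state without alias keys, mimicking the filtered checkpoint."""
    filtered = {k: v for k, v in ema_state.items() if not is_alias_key(k)}
    path.write_text(json.dumps(filtered))


def load_ema_state(path, live_state):
    """Load matching keys; return non-alias keys missing from either side."""
    saved = json.loads(path.read_text())
    for key in saved.keys() & live_state.keys():
        live_state[key] = saved[key]
    missing = sorted(k for k in live_state.keys() - saved.keys() if not is_alias_key(k))
    unexpected = sorted(k for k in saved.keys() - live_state.keys() if not is_alias_key(k))
    return missing, unexpected


def test_resume_with_filtered_alias_keys():
    # First "run": the live EMA state contains an alias entry.
    first_run = {"head.weight": 3.0, "head._node.weight": 3.0}
    with tempfile.TemporaryDirectory() as tmp:
        ckpt = Path(tmp) / "ema.json"
        save_checkpoint(ckpt, first_run)
        # Second "run": a fresh live state that still exposes the alias key.
        resumed = {"head.weight": 0.0, "head._node.weight": 0.0}
        missing, unexpected = load_ema_state(ckpt, resumed)
    # The alias-only asymmetry is ignored, so nothing is reported as missing.
    assert missing == [] and unexpected == []
    assert resumed["head.weight"] == 3.0
    return resumed


resumed_state = test_resume_with_filtered_alias_keys()
```

In this toy the alias entry keeps its initial value because plain floats cannot share storage; the sketch only shows that a checkpoint lacking alias keys loads without raising.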

@dtronmans dtronmans requested a review from a team as a code owner April 14, 2026 12:54
@dtronmans dtronmans requested review from conorsim, klemen1999, kozlov721 and tersekmatija and removed request for a team April 14, 2026 12:54
@github-actions github-actions Bot added the fix Fixing a bug label Apr 14, 2026
@kozlov721 kozlov721 force-pushed the fix/resume-training-ema branch from 317f6a6 to 5107566 Compare April 15, 2026 13:22
Comment thread luxonis_train/callbacks/ema.py Outdated
@dtronmans dtronmans merged commit 6548c2d into main Apr 24, 2026
11 checks passed
@dtronmans dtronmans deleted the fix/resume-training-ema branch April 24, 2026 08:01
