Open
Description
Problem
The sync score previously relied on signed gradients and the learning rate (LR) to estimate the number of steps behind. It is now based on average parameter changes, which may not be a reliable signal. We also see model weights diverge after several windows, even though gather success rates remain stable.
Proposal
- Explore alternative sync metrics (e.g. comparing logits on a fixed evaluation set).
- Investigate weight divergence, focusing on parameter differences introduced during the outer step (SGD only) rather than on optimizer state drift.
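
One possible shape for a logits-based metric, as a minimal sketch only: compare the output distributions of a reference model and a peer model on the same fixed evaluation set, and average the KL divergence per example. The function names (`logits_sync_score`, `kl_divergence`) and the choice of KL divergence are assumptions made for illustration, not something this project already implements. Logits are plain lists of floats here so the example has no dependencies.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(ref_logits, peer_logits):
    """KL(ref || peer) between the softmax distributions of two logit vectors."""
    ref_lp = log_softmax(ref_logits)
    peer_lp = log_softmax(peer_logits)
    return sum(math.exp(r) * (r - p) for r, p in zip(ref_lp, peer_lp))


def logits_sync_score(ref_batch, peer_batch):
    """Mean KL divergence over a fixed evaluation set (hypothetical metric).

    ref_batch / peer_batch: one logit vector per evaluation example,
    produced by the reference and peer models on identical inputs.
    0.0 means identical output distributions; larger means further apart.
    """
    if not ref_batch or len(ref_batch) != len(peer_batch):
        raise ValueError("batches must be non-empty and the same length")
    total = sum(kl_divergence(r, p) for r, p in zip(ref_batch, peer_batch))
    return total / len(ref_batch)
```

Because the score depends only on model outputs, it is unaffected by a constant shift in the logits and does not need the learning rate or gradient signs. Whether it tracks "steps behind" well enough would still have to be checked empirically.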
Next Steps
- Prototype a logits-based sync score.
- Add logging/checkpoints to trace when weight drift starts.
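
To find when drift starts, the logging could record one scalar per window and then look for the first window that crosses a threshold. The following is a rough sketch under stated assumptions: parameters are represented as a dict mapping a name to a flat list of floats, and `relative_l2_drift`, `first_drift_window`, and the threshold value are hypothetical names and choices, not existing code in this repository.

```python
import math


def relative_l2_drift(ref_params, peer_params):
    """Relative L2 distance between two parameter sets.

    Both arguments map parameter name -> flat list of floats (an assumed
    representation). Returns ||ref - peer|| / ||ref||.
    """
    diff_sq = 0.0
    ref_sq = 0.0
    for name, ref_values in ref_params.items():
        peer_values = peer_params[name]
        diff_sq += sum((r - p) ** 2 for r, p in zip(ref_values, peer_values))
        ref_sq += sum(r * r for r in ref_values)
    # Small epsilon avoids division by zero for an all-zero reference.
    return math.sqrt(diff_sq) / (math.sqrt(ref_sq) + 1e-12)


def first_drift_window(drift_by_window, threshold):
    """Return the first window whose logged drift exceeds the threshold.

    drift_by_window: iterable of (window_index, drift) pairs in window order.
    Returns None if no window exceeds the threshold.
    """
    for window, drift in drift_by_window:
        if drift > threshold:
            return window
    return None
```

Logging this value immediately before and after each outer step would help separate drift introduced by the outer step from drift accumulated elsewhere.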