 
Description
Describe the bug
- Load an optimizer state from an early version of Megatron. This creates two chained optimizers, but they aren't properly deduplicated. As a result, once a checkpoint is saved, any attempt to reload it fails this assert:
 
        assert len(steps) <= 1, f"steps: {steps}"
To give a real example: if the original optimizer state was resumed at step 1300, the assertion fails with steps: [2250, 1300].
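The failure mode can be illustrated with a minimal, self-contained sketch. This is not Megatron's actual code: `collect_steps`, `check_single_step`, and the `param_groups`/`step` layout of the state dictionaries are hypothetical stand-ins, assumed only to show how two non-deduplicated optimizer states carrying different step counters would trip the assert.

```python
# Hypothetical stand-in for the step check; NOT Megatron's actual implementation.
# The state layout ({"param_groups": [{"step": ...}]}) is an assumption.

def collect_steps(optimizer_states):
    """Return the distinct step counters found across chained optimizer states."""
    steps = set()
    for state in optimizer_states:
        for param_group in state.get("param_groups", []):
            if "step" in param_group:
                steps.add(param_group["step"])
    # Sort descending so the newest step is listed first.
    return sorted(steps, reverse=True)


def check_single_step(optimizer_states):
    """Mirror the assert from the report: all states must agree on one step."""
    steps = collect_steps(optimizer_states)
    assert len(steps) <= 1, f"steps: {steps}"
    return steps[0] if steps else None


# Two chained optimizer states that should have been deduplicated:
# one carries the current step count, the other the stale resumed one.
fresh = {"param_groups": [{"step": 2250}]}
stale = {"param_groups": [{"step": 1300}]}

try:
    check_single_step([fresh, stale])
    failure_message = None
except AssertionError as err:
    failure_message = str(err)

print(failure_message)
```

If the two states were deduplicated, or agreed on the step, `check_single_step` would return that single step instead of raising.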
Steps/Code to reproduce bug
Load an optimizer state from a pre014 checkpoint, save a new checkpoint with it, and then try to load that new checkpoint. The new checkpoint is invalid because additional optimizer state is written to it.
Expected behavior
A clear and concise description of what you expected to happen.
Additional context
Add any other context about the problem here.