
Loading a legacy pre014 optimizer state corrupts any future optimizer states #2016

@Skylion007

Describe the bug


  • Load an optimizer state from an early version of Megatron (a legacy pre014 checkpoint). Loading creates two chained optimizers, but they aren't properly deduplicated. As a result, once a new checkpoint is saved, any attempt to reload it fails this assert:
        assert len(steps) <= 1, f"steps: {steps}"

To give a real example: if the original optimizer state was resumed at step 1300, the assertion fails with steps: [2250, 1300].
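To make the failure mode concrete, here is a minimal sketch of how two chained optimizer states that disagree on their step count would trip the assertion quoted above. It is not Megatron's actual code: plain dicts stand in for the optimizer state dicts, and `collect_steps`, `check_single_step`, and the `param_groups`/`step` layout are hypothetical names chosen for illustration.

```python
# Hypothetical sketch, not Megatron's implementation. Plain dicts stand in
# for the state dicts of the chained optimizers.

def collect_steps(optimizer_state_dicts):
    """Gather the distinct `step` values found across all param groups."""
    steps = []
    for state_dict in optimizer_state_dicts:
        for group in state_dict["param_groups"]:
            step = group.get("step")
            if step is not None and step not in steps:
                steps.append(step)
    return steps


def check_single_step(optimizer_state_dicts):
    """Return the single shared step, or fail like the assert in the report."""
    steps = collect_steps(optimizer_state_dicts)
    # Same assertion as quoted in the report.
    assert len(steps) <= 1, f"steps: {steps}"
    return steps[0] if steps else None


# Assumed healthy case: both chained optimizers agree on the step.
healthy = [
    {"param_groups": [{"step": 2250}]},
    {"param_groups": [{"step": 2250}]},
]

# Assumed corrupted case: a stale copy from the legacy resume still says 1300.
corrupted = [
    {"param_groups": [{"step": 2250}]},
    {"param_groups": [{"step": 1300}]},
]

print(check_single_step(healthy))
try:
    check_single_step(corrupted)
except AssertionError as err:
    print(f"AssertionError: {err}")
```

Under these assumptions, the stale second state is what a missing deduplication step would leave behind, which matches the two distinct values reported in the failure message.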

Steps/Code to reproduce bug

Load an optimizer state from a pre014 checkpoint, save a new checkpoint with it, then try to load that new checkpoint. The new checkpoint is invalid because additional optimizer state has been written to it.


Labels: bug
