Mamba3-SISO exploding gradients on B200 #938

@NadavSc

Hi!
I ran into exploding gradients while training Mamba3-SISO on a B200.
As you can see, training starts out the same on both GPUs, but after a while the B200 run explodes, whereas the L40s run remains stable.
I trained the model on the-pile with 1 GPU and ~60k tokens per step, using the AdamW optimizer with a weight_decay of 0.1.

[Two images attached: the training plots referenced above]
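One way to pin down where the two runs part ways is to compare the logged curves step by step. Below is a minimal sketch, assuming the per-step loss (or gradient-norm) values from both runs have been exported as plain Python lists. The function name `first_divergence_step` and the `rel_tol` threshold are hypothetical and not part of any Mamba tooling.

```python
import math


def first_divergence_step(ref, other, rel_tol=0.5):
    """Return the first step index at which `other` departs from `ref`.

    A step counts as divergent when the value in `other` is non-finite
    (NaN or inf) or differs from `ref` by more than `rel_tol` relative
    to `ref`. Returns None if the curves never diverge over the steps
    they have in common.
    """
    for step, (a, b) in enumerate(zip(ref, other)):
        if not math.isfinite(b):
            return step
        if abs(b - a) > rel_tol * max(abs(a), 1e-12):
            return step
    return None


# Hypothetical values for illustration only, not taken from the runs above.
stable_run = [1.0, 0.9, 0.8, 0.7]
unstable_run = [1.0, 0.9, 0.85, 5.0]
divergence = first_divergence_step(stable_run, unstable_run)
```

Knowing the step at which the B200 curve leaves the L40s curve could make it easier to inspect the batch and optimizer state around that point.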
