Mamba3-SISO exploding gradients on B200

Hi!
I'm seeing exploding gradients when training on a B200.
As the plots below show, the B200 run initially tracks the L40s run, but after a while its gradients explode, whereas the L40s run remains stable.
I trained the model on the-pile with 1 GPU and ~60k tokens per step, using the AdamW optimizer with a weight_decay of 0.1.

<img width="368" height="272" alt="Image" src="https://github.com/user-attachments/assets/4ab3079d-5dcd-403d-8fd8-f353b146d611" />

<img width="370" height="264" alt="Image" src="https://github.com/user-attachments/assets/7a51b61d-0edb-44b1-b9da-9252715097ad" />
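
To pin down the step at which the two runs part ways, the logged gradient norms could be compared with something like the sketch below. The function name `first_divergence_step` and the ratio threshold of 10 are assumptions made for illustration, and the numbers are synthetic, not values from the runs above.

```python
import math

def first_divergence_step(ref_norms, test_norms, ratio_threshold=10.0):
    """Return the first step where test_norms blows up relative to ref_norms.

    A step counts as diverged if the test value is non-finite (NaN/inf) or
    exceeds ratio_threshold times the reference value. Returns None if no
    such step exists within the overlapping range of the two logs.
    """
    for step, (ref, test) in enumerate(zip(ref_norms, test_norms)):
        if not math.isfinite(test):
            return step
        if ref > 0 and test / ref > ratio_threshold:
            return step
    return None

# Synthetic gradient norms for illustration only (hypothetical values).
l40s_norms = [1.2, 1.0, 0.9, 0.8, 0.8]
b200_norms = [1.2, 1.0, 0.9, 12.5, float("nan")]

diverged_at = first_divergence_step(l40s_norms, b200_norms)
print(diverged_at)
```

Knowing the first diverging step makes it easier to inspect the batch, learning rate, and activations around that point on both GPUs.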

