Hi!
I'm seeing exploding gradients while training on a B200.
As you can see, training starts out the same on both GPUs, but after a while the B200 run explodes, whereas the L40S run remains stable.
I trained the model on the-pile with 1 GPU and ~60k tokens per step, using the AdamW optimizer with a weight_decay of 0.1.
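
In case it helps with comparing the two runs, here is a minimal sketch of how the global gradient norm could be logged per step and checked for a sudden jump. It is plain Python for illustration only: `global_grad_norm`, `is_spike`, and the `factor` threshold are hypothetical names and values, not taken from my actual training code.

```python
import math


def global_grad_norm(grads):
    """Return the L2 norm over all gradient values.

    `grads` is a list of per-parameter gradient lists (hypothetical layout).
    """
    return math.sqrt(sum(g * g for param in grads for g in param))


def is_spike(norm, history, factor=10.0):
    """Flag a step whose norm exceeds `factor` times the mean of past norms.

    `factor` is an arbitrary illustrative threshold, not a recommended value.
    """
    if not history:
        return False
    return norm > factor * (sum(history) / len(history))


if __name__ == "__main__":
    history = []
    # Toy gradients standing in for real per-step gradients.
    for step, grads in enumerate([[[0.3], [0.4]], [[0.3], [0.4]], [[30.0], [40.0]]]):
        norm = global_grad_norm(grads)
        print(step, norm, is_spike(norm, history))
        history.append(norm)
```

Logging this value on both GPUs with the same seed and data order should show the step at which the two runs start to diverge.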
