Hi there,
We have been trying the kernel(s) in a longer training run and observe the following behavior after a few tens of thousands of steps (a minimal sketch of the three setups follows the list):
- when everything is in fp32 (NATTEN and the rest of the model, including normalization, etc.), everything is fine
- when NATTEN is in bf16 and everything else is in fp32, the system runs, but the metrics start diverging and getting worse
- when everything is in bf16, we run into NaN issues
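For reference, here is roughly how the three setups differ. This is a minimal, self-contained sketch, not our actual training code: the `Block` module, the `attn_bf16` flag, and the plain `nn.Linear` standing in for the real NATTEN attention layer are all illustrative.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Stand-in block: `attn` marks where the NATTEN layer sits; a plain
    nn.Linear is used here so the sketch runs without natten installed."""

    def __init__(self, dim: int = 64, attn_bf16: bool = False):
        super().__init__()
        self.attn_bf16 = attn_bf16
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.Linear(dim, dim)   # <- NATTEN attention in our model
        self.mlp = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)                  # norm stays in the outer dtype
        if self.attn_bf16:
            # Config 2: only the attention path runs in bf16. We train with
            # device_type="cuda"; "cpu" is used here so the sketch runs anywhere.
            with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
                h = self.attn(h)
            h = h.float()
        else:
            h = self.attn(h)
        return x + self.mlp(h)

x = torch.randn(2, 16, 64)

out = Block()(x)                          # config 1: all fp32 -> stable
out = Block(attn_bf16=True)(x)            # config 2: metrics slowly diverge
blk = Block().to(torch.bfloat16)          # config 3: all bf16 -> NaNs
out = blk(x.to(torch.bfloat16))
```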
Have you observed anything like this before? Do you know whether there might be a bug in the backward pass or gradient accumulation/handling, or something in the dtype-specific versions/accumulation?
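In case it helps narrow things down, this is the kind of check we can run on our side to catch the first non-finite gradient in the backward pass. It is a generic PyTorch sketch; the function name is illustrative:

```python
import torch

def install_nan_grad_hooks(model: torch.nn.Module) -> None:
    """Register tensor hooks that raise on the first parameter whose
    gradient contains NaN/Inf, to localize where backward blows up."""
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue

        def check(grad: torch.Tensor, name: str = name) -> None:
            if not torch.isfinite(grad).all():
                raise RuntimeError(f"non-finite gradient in {name}")

        param.register_hook(check)

# usage: install_nan_grad_hooks(model); then loss.backward() raises
# at the first parameter that receives a NaN/Inf gradient.
```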
Thank you!