Hi,
I am currently experimenting with blockwise FP8 training in TE (a DeepSeek-V3-like recipe) on an internal company dataset. Since I do not currently have access to SM10x-architecture devices, I cannot use the MXFP8 format for these runs.
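For context, my recipe selection looks roughly like the sketch below. This assumes a recent TE build that ships a blockwise-scaling recipe class (I use `Float8BlockScaling` here; the exact class and argument names may differ across versions, so treat this as a sketch rather than my exact setup):

```python
# Sketch of recipe selection for blockwise FP8 training in TE.
# Assumes a recent TE release with a blockwise recipe (Float8BlockScaling);
# names may differ in other versions.
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import Format, Float8BlockScaling

# Format.HYBRID = E4M3 in the forward pass, E5M2 for gradients in backward
recipe = Float8BlockScaling(fp8_format=Format.HYBRID)

with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    out = model(inp)  # model built from te.Linear / te.TransformerLayer modules
```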
During these experiments, I observed two somewhat counter-intuitive phenomena compared to the BF16 baseline. I would like to ask whether these are expected behaviors under this specific training regime, or whether they might indicate flaws in my scaling/quantization implementation.
Observation 1: Hybrid format (Fwd E4M3 + Bwd E5M2) aligns much better with BF16 than full-E4M3
When using the e4m3 format for both the forward and backward passes, the loss curve tracks the BF16 baseline poorly. However, switching to a hybrid approach (e4m3 for forward activations/weights, e5m2 for gradients in the backward pass) yields much closer alignment with BF16.
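One way to see why E5M2 tends to help gradients is the dynamic-range gap between the two formats: E4M3 spans roughly [2^-9, 448] while E5M2 spans roughly [2^-16, 57344], at the cost of one mantissa bit. The toy quantizer below (my own pure-Python sketch of round-to-nearest with saturation, not TE's actual kernel) shows small gradient values underflowing to zero in E4M3 while surviving in E5M2, and large spikes saturating at E4M3's maximum:

```python
import math

def quantize_fp8(x, exp_bits, man_bits, max_normal):
    """Round-to-nearest quantization to an FP8-like format (toy sketch)."""
    if x == 0.0:
        return 0.0
    sign = math.copysign(1.0, x)
    a = abs(x)
    if a >= max_normal:
        return sign * max_normal                 # saturate on overflow
    bias = 2 ** (exp_bits - 1) - 1
    e = max(math.floor(math.log2(a)), 1 - bias)  # clamp into subnormal range
    scale = 2.0 ** (e - man_bits)                # spacing of representable values
    return sign * min(round(a / scale) * scale, max_normal)

E4M3 = dict(exp_bits=4, man_bits=3, max_normal=448.0)    # range ~[2^-9, 448]
E5M2 = dict(exp_bits=5, man_bits=2, max_normal=57344.0)  # range ~[2^-16, 57344]

# A small gradient underflows to zero in E4M3 but survives in E5M2:
print(quantize_fp8(1e-5, **E4M3))    # 0.0
print(quantize_fp8(1e-5, **E5M2))    # ~1.53e-05 (2**-16)
# A large gradient spike saturates in E4M3 but not in E5M2:
print(quantize_fp8(1000.0, **E4M3))  # 448.0
print(quantize_fp8(1000.0, **E5M2))  # 1024.0
```

Per-block scaling mitigates this by rescaling each block into range, but within a block that contains both large and tiny gradient elements, E5M2's wider exponent range still loses less information than E4M3.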
Observation 2: FP8 slightly but stably outperforms BF16 in late-stage evaluation
In the final ~30% of training steps, the eval metrics for the FP8 run consistently, though only slightly, surpass the BF16 baseline.
Are these observations common and theoretically sound when using TE with fine-grained FP8 recipes? Any insights or references would be greatly appreciated.