[Question] Expected behavior for blockwise FP8? Hybrid E4M3/E5M2 format & eval metrics outperforming BF16 #2754

@Mnb66

Description

Hi,

I am currently experimenting with TE for blockwise FP8 training (a DeepSeek-V3-like recipe) on an internal company dataset. Since I do not have access to SM10x-architecture devices, I am unable to use the MXFP8 format for these runs.
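For context, here is a minimal pure-Python sketch of the per-block scaling idea behind blockwise FP8 recipes: each small block of a tensor gets its own scale so that the block's absolute maximum maps to the FP8 format's largest representable value. The block size, rounding, and helper names here are illustrative assumptions, not TE's actual kernels.

```python
# Illustrative per-block FP8-style scaling (NOT Transformer Engine's real
# implementation). Each 1x128 block is scaled so its absmax maps to the
# format maximum, cast (here: simple rounding stands in for the FP8 cast),
# and later dequantized with the stored per-block scale.

E4M3_MAX = 448.0  # largest finite value of FP8 E4M3 (FN variant)

def quantize_block(block, fmax=E4M3_MAX):
    """Return (quantized ints, scale) for one block."""
    amax = max(abs(x) for x in block) or 1.0  # guard against all-zero blocks
    scale = fmax / amax
    q = [round(x * scale) for x in block]  # rounding stands in for FP8 cast
    return q, scale

def dequantize_block(q, scale):
    return [x / scale for x in q]

# One 128-element block of small values, as gradients often are.
values = [0.001 * i for i in range(128)]
q, s = quantize_block(values)
restored = dequantize_block(q, s)
err = max(abs(a - b) for a, b in zip(values, restored))
print(err < 1e-3)  # prints True: per-block scale keeps the error small
```

Because the scale is chosen per block rather than per tensor, a single large outlier only degrades precision within its own block, which is the main motivation for fine-grained recipes.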

During these experiments, I observed two somewhat counter-intuitive phenomena relative to the BF16 baseline. I would like to ask whether these are expected behaviors under this training regime, or whether they might indicate flaws in my scaling/quantization implementation.

Observation 1: Hybrid format (Fwd E4M3 + Bwd E5M2) aligns much better with BF16 than full-E4M3
When using the default e4m3 format for both the forward and backward passes, the loss alignment with the BF16 baseline is suboptimal. However, switching to a hybrid approach (e4m3 for forward activations/weights, e5m2 for backward gradients) yields much closer alignment with BF16.
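The usual explanation is the trade-off between the two formats' bit layouts: E4M3 spends bits on mantissa (precision), E5M2 on exponent (dynamic range), and gradients tend to have a much wider magnitude spread than activations. A small self-contained check of the standard OCP FP8 format constants, assuming an illustrative gradient magnitude:

```python
# Dynamic-range limits of the two FP8 formats (standard OCP FP8 constants,
# E4M3 in its FN variant with max 448). This illustrates raw format range;
# per-block amax rescaling mitigates underflow, but values far below a
# block's amax are still pushed toward the format's subnormal floor.

def fp8_limits(exp_bits, man_bits, max_val):
    """Return (max, smallest normal, smallest subnormal) for an FP8 format."""
    bias = 2 ** (exp_bits - 1) - 1
    min_normal = 2.0 ** (1 - bias)
    min_subnormal = min_normal * 2.0 ** (-man_bits)
    return max_val, min_normal, min_subnormal

e4m3 = fp8_limits(4, 3, 448.0)    # (448.0, 2**-6, 2**-9)
e5m2 = fp8_limits(5, 2, 57344.0)  # (57344.0, 2**-14, 2**-16)

grad = 3e-5  # an assumed tiny gradient magnitude, for illustration
print(grad >= e5m2[2])  # True: representable (as a subnormal) in E5M2
print(grad >= e4m3[2])  # False: flushes to zero in E4M3 at this scale
```

So when a block mixes large and tiny gradient components, E5M2's roughly 2^7-times wider reach below the maximum preserves small components that E4M3 would flush to zero, which is consistent with the hybrid recipe tracking BF16 more closely.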

Observation 2: FP8 slightly but consistently outperforms BF16 in late-stage evaluation
In the final 30% of the training steps, the eval metrics for the FP8 run slightly but consistently surpass the BF16 baseline.

Are these observations common and theoretically sound when using TE with fine-grained FP8 recipes? Any insights or references would be greatly appreciated.
