Hi,
I am currently experimenting with blockwise FP8 training in TE (a DeepSeek-V3-like recipe) on an internal company dataset. Since I do not currently have access to SM10x-architecture devices, I cannot use the MXFP8 format for these runs.
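For context, my recipe selection looks roughly like the sketch below. This assumes a recent TE build that ships a blockwise-scaling recipe class (I use `Float8BlockScaling` here; the exact class and argument names may differ across versions, so treat this as a sketch rather than my exact setup):

```python
# Sketch of recipe selection for blockwise FP8 training in TE.
# Assumes a recent TE release with a blockwise recipe (Float8BlockScaling);
# names may differ in other versions.
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import Format, Float8BlockScaling

# Format.HYBRID = E4M3 in the forward pass, E5M2 for gradients in backward
recipe = Float8BlockScaling(fp8_format=Format.HYBRID)

with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    out = model(inp)  # model built from te.Linear / te.TransformerLayer modules
```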
During these experiments, I observed two somewhat counter-intuitive phenomena compared to the BF16 baseline. I would like to ask whether these are expected behaviors under this specific training regime, or whether they might indicate flaws in my scaling/quantization implementation.
Observation 1: Hybrid format (Fwd E4M3 + Bwd E5M2) aligns much better with BF16 than full-E4M3
When using the e4m3 format for both the forward and backward passes, the loss curve tracks the BF16 baseline poorly. However, switching to a hybrid approach (e4m3 for forward activations/weights, e5m2 for gradients in the backward pass) yields much closer alignment with BF16.
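One way to see why E5M2 tends to help gradients is the dynamic-range gap between the two formats: E4M3 spans roughly [2^-9, 448] while E5M2 spans roughly [2^-16, 57344], at the cost of one mantissa bit. The toy quantizer below (my own pure-Python sketch of round-to-nearest with saturation, not TE's actual kernel) shows small gradient values underflowing to zero in E4M3 while surviving in E5M2, and large spikes saturating at E4M3's maximum:

```python
import math

def quantize_fp8(x, exp_bits, man_bits, max_normal):
    """Round-to-nearest quantization to an FP8-like format (toy sketch)."""
    if x == 0.0:
        return 0.0
    sign = math.copysign(1.0, x)
    a = abs(x)
    if a >= max_normal:
        return sign * max_normal                 # saturate on overflow
    bias = 2 ** (exp_bits - 1) - 1
    e = max(math.floor(math.log2(a)), 1 - bias)  # clamp into subnormal range
    scale = 2.0 ** (e - man_bits)                # spacing of representable values
    return sign * min(round(a / scale) * scale, max_normal)

E4M3 = dict(exp_bits=4, man_bits=3, max_normal=448.0)    # range ~[2^-9, 448]
E5M2 = dict(exp_bits=5, man_bits=2, max_normal=57344.0)  # range ~[2^-16, 57344]

# A small gradient underflows to zero in E4M3 but survives in E5M2:
print(quantize_fp8(1e-5, **E4M3))    # 0.0
print(quantize_fp8(1e-5, **E5M2))    # ~1.53e-05 (2**-16)
# A large gradient spike saturates in E4M3 but not in E5M2:
print(quantize_fp8(1000.0, **E4M3))  # 448.0
print(quantize_fp8(1000.0, **E5M2))  # 1024.0
```

Per-block scaling mitigates this by rescaling each block into range, but within a block that contains both large and tiny gradient elements, E5M2's wider exponent range still loses less information than E4M3.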
Observation 2: FP8 slightly but stably outperforms BF16 in late-stage evaluation
In the final ~30% of training steps, the eval metrics for the FP8 run consistently, though only slightly, surpass the BF16 baseline.
Are these observations common and theoretically sound when using TE with fine-grained FP8 recipes? Any insights or references would be greatly appreciated.