Description
Hello, I would like to quantize from the FP16 data type to the FP8 E4M3 data type. I am following the method at https://github.com/pytorch/FBGEMM/blob/main/fbgemm_gpu/experimental/gen_ai/src/quantize/quantize.cu#L629, but I have a question: why is min_scaling_factor computed by dividing by (FP8_E4M3_MAX::value * 512.f)? Could you please explain the basis for choosing 512.f? Thanks.
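
For context, here is how I understand the scaling logic in question, written as a minimal host-side C++ sketch rather than the actual FBGEMM CUDA kernel. The constant 448 (the largest finite FP8 E4M3 value) and the 512.f factor follow the linked code; the names compute_row_scale and quantize_row, the row-wise loop, and the clamping are my own illustration and may differ from the real implementation:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Largest finite FP8 E4M3 value (corresponds to FP8_E4M3_MAX::value in FBGEMM).
constexpr float kFp8E4m3Max = 448.0f;

// Row-wise scale selection as I understand it:
//   scale = max(|row|) / FP8_E4M3_MAX, but never below min_scaling_factor,
//   where min_scaling_factor = 1.0f / (FP8_E4M3_MAX * 512.f).
float compute_row_scale(const std::vector<float>& row) {
    float row_max = 0.0f;
    for (float v : row) {
        row_max = std::max(row_max, std::fabs(v));
    }
    const float min_scaling_factor = 1.0f / (kFp8E4m3Max * 512.0f);
    return std::max(row_max / kFp8E4m3Max, min_scaling_factor);
}

// Quantize one row: divide by the scale and clamp into the FP8 E4M3 range.
// (Actual FP8 bit conversion and rounding are omitted; values stay as float.)
std::vector<float> quantize_row(const std::vector<float>& row, float scale) {
    std::vector<float> out;
    out.reserve(row.size());
    for (float v : row) {
        const float q = v / scale;
        out.push_back(std::min(std::max(q, -kFp8E4m3Max), kFp8E4m3Max));
    }
    return out;
}
```

As far as I can tell, the max with min_scaling_factor keeps the scale from collapsing to zero for all-zero or near-zero rows, but I do not see why 512 in particular was chosen as the extra factor.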