e8m0 serves as the common scale type for MX (microscaling) formats. However, different libraries have adopted different rounding behaviors, as summarized in this table.
Although round-to-nearest is the common behavior for other floating-point types, rounding up with saturation has been shown to be the most beneficial for training accuracy when e8m0 is used as the scale in MX formats: https://arxiv.org/abs/2506.08027. As a result, it has been chosen as the default in the CUDA spec.
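To make the difference concrete, here is a minimal pure-Python sketch of the two behaviors. It is only an illustration, not the ml-dtypes or CUDA implementation. The helper names `encode_e8m0` and `decode_e8m0` are hypothetical, and the handling of ties (rounded upward), infinity (saturated), and non-positive inputs (rejected) is an assumption made for simplicity. It assumes the usual e8m0 layout: 8 exponent bits, bias 127, no sign or mantissa, and `0xFF` reserved for NaN.

```python
import math

E8M0_BIAS = 127
E8M0_MIN_EXP = -127   # byte 0x00 encodes 2**-127
E8M0_MAX_EXP = 127    # byte 0xFE encodes 2**127
E8M0_NAN = 0xFF       # byte 0xFF is reserved for NaN


def encode_e8m0(x: float, mode: str = "round_up") -> int:
    """Encode a positive scale as an e8m0 byte (hypothetical helper).

    mode="round_up": smallest power of two >= x, saturating at the range ends.
    mode="nearest":  closest power of two in value; ties go up (assumption).
    """
    if mode not in ("round_up", "nearest"):
        raise ValueError(f"unknown mode: {mode!r}")
    if math.isnan(x):
        return E8M0_NAN
    if x <= 0:
        # e8m0 has no sign bit and no zero; rejecting is a choice of this sketch.
        raise ValueError("e8m0 can only represent positive powers of two")
    if math.isinf(x):
        return E8M0_MAX_EXP + E8M0_BIAS  # saturate (assumption)

    # frexp gives x = m * 2**e with 0.5 <= m < 1, so floor(log2(x)) == e - 1.
    m, e = math.frexp(x)
    if mode == "round_up":
        exp = e - 1 if m == 0.5 else e      # exact powers of two stay put
    else:
        exp = e if m >= 0.75 else e - 1     # 0.75 * 2**e is the midpoint

    # Saturate instead of overflowing into the NaN encoding or underflowing.
    exp = max(E8M0_MIN_EXP, min(E8M0_MAX_EXP, exp))
    return exp + E8M0_BIAS


def decode_e8m0(byte: int) -> float:
    """Decode an e8m0 byte back to a Python float."""
    if byte == E8M0_NAN:
        return math.nan
    return 2.0 ** (byte - E8M0_BIAS)


for value in (1.0, 1.25, 1.75, 3.0):
    nearest = decode_e8m0(encode_e8m0(value, "nearest"))
    rounded_up = decode_e8m0(encode_e8m0(value, "round_up"))
    print(f"{value}: nearest={nearest}, round_up={rounded_up}")
```

For an input such as 1.25, the nearest mode yields a scale of 1.0 while the round-up mode yields 2.0. A scale that is rounded down is smaller than the requested one, so dividing the block's largest element by it can push that element past the range of the element format; rounding up avoids that case at the cost of some resolution.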
I'm wondering if ml-dtypes has plans to update the rounding behavior or expose different rounding modes?
Thanks.
