Rounding behavior of float8_e8m0fnu #298

e8m0 serves as the common scale type for MX (microscaling) formats. However, different libraries have adopted different rounding behaviors, as summarized in this table:

![Image](https://github.com/user-attachments/assets/cfe69e1d-0f93-4eb5-9684-d364191b7b2e)

Although round-to-nearest is the common behavior for other floating-point types, rounding up with saturation has been shown to be the most beneficial for training accuracy when e8m0 is used as the scale in MX formats: https://arxiv.org/abs/2506.08027. As a result, it has been chosen as the default in the CUDA spec.
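To make the difference concrete, here is a minimal pure-Python sketch of the two behaviors. It is not ml-dtypes code and does not reproduce any library's exact conversion. The function names are hypothetical, it assumes the usual e8m0 layout (bias 127, `0xFF` reserved for NaN, so representable exponents run from -127 to 127), and resolving ties upward in the nearest mode is an assumption made only for this illustration.

```python
import math

# Assumed e8m0 layout: value = 2**(byte - 127), byte in 0x00..0xFE, 0xFF is NaN.
E8M0_BIAS = 127
E8M0_MIN_EXP = -127
E8M0_MAX_EXP = 127


def _split(x: float) -> tuple[float, int]:
    """Return (m, e) with x == m * 2**e and 0.5 <= m < 1 for finite x > 0."""
    if not math.isfinite(x) or x <= 0.0:
        raise ValueError("this sketch only handles finite positive inputs")
    return math.frexp(x)


def _saturate(exp: int) -> int:
    """Clamp an unbiased exponent into the representable e8m0 range."""
    return max(E8M0_MIN_EXP, min(E8M0_MAX_EXP, exp))


def e8m0_exp_round_up(x: float) -> int:
    """Unbiased exponent of the smallest power of two >= x, saturated."""
    m, e = _split(x)
    # m == 0.5 means x is already an exact power of two, namely 2**(e - 1).
    return _saturate(e - 1 if m == 0.5 else e)


def e8m0_exp_round_nearest(x: float) -> int:
    """Unbiased exponent of the nearest power of two, saturated.

    The midpoint between 2**(e-1) and 2**e is 1.5 * 2**(e-1), i.e. m == 0.75.
    Ties go up here, which is an assumption of this sketch.
    """
    m, e = _split(x)
    return _saturate(e if m >= 0.75 else e - 1)


def e8m0_encode(exp: int) -> int:
    """Biased byte encoding for an unbiased exponent."""
    return exp + E8M0_BIAS
```

For example, `2.5` lies between `2` and `4`: `e8m0_exp_round_nearest(2.5)` returns `1` (scale `2.0`), while `e8m0_exp_round_up(2.5)` returns `2` (scale `4.0`). With rounding up, the resulting scale is never smaller than the input unless it saturates at the top of the range.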

I'm wondering whether ml-dtypes has plans to update the rounding behavior or to expose different rounding modes.

Thanks.
