Rounding behavior of float8_e8m0fnu #298

@yuanyao-nv

e8m0 serves as the common scale type for MX (microscaling) formats. However, different libraries have adopted different rounding behaviors, as summarized in this table:

[Image: table summarizing the e8m0 rounding behavior of different libraries]

Although round-to-nearest is the common behavior for other floating-point types, it has been shown that when e8m0 is used as the scale type in MX formats, rounding up with saturation is the most beneficial for training accuracy: https://arxiv.org/abs/2506.08027. As a result, this has been chosen as the default in the CUDA spec.
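
To make the difference concrete, here is a minimal pure-Python sketch of the two rounding modes for a positive finite input. It is not the ml-dtypes or CUDA implementation: the function names are hypothetical, ties in the round-to-nearest variant are sent upward as a simplifying assumption, and the encoding assumed is the usual e8m0 layout (8 exponent bits, bias 127, code 0xFF reserved for NaN, no sign and no zero).

```python
import math

E8M0_BIAS = 127
E8M0_MIN_EXP = -127   # code 0x00
E8M0_MAX_EXP = 127    # code 0xFE; 0xFF is reserved for NaN


def _clamp_to_code(exp: int) -> int:
    """Saturate the unbiased exponent to the finite range and add the bias."""
    exp = max(E8M0_MIN_EXP, min(E8M0_MAX_EXP, exp))
    return exp + E8M0_BIAS


def e8m0_round_up(x: float) -> int:
    """Hypothetical helper: smallest power of two >= x, saturating."""
    if not (x > 0.0) or math.isinf(x):
        raise ValueError("sketch only handles positive finite inputs")
    m, e = math.frexp(x)          # x == m * 2**e with 0.5 <= m < 1
    # x is an exact power of two only when m == 0.5; otherwise go up.
    exp = e - 1 if m == 0.5 else e
    return _clamp_to_code(exp)


def e8m0_round_nearest(x: float) -> int:
    """Hypothetical helper: nearest power of two, ties upward (assumption)."""
    if not (x > 0.0) or math.isinf(x):
        raise ValueError("sketch only handles positive finite inputs")
    m, e = math.frexp(x)
    # The midpoint between 2**(e-1) and 2**e is 1.5 * 2**(e-1), i.e. m == 0.75.
    exp = e - 1 if m < 0.75 else e
    return _clamp_to_code(exp)


def e8m0_decode(code: int) -> float:
    """Value represented by a finite e8m0 code (0..254)."""
    return 2.0 ** (code - E8M0_BIAS)


if __name__ == "__main__":
    for x in (1.0, 1.25, 1.5, 3.0):
        up = e8m0_decode(e8m0_round_up(x))
        near = e8m0_decode(e8m0_round_nearest(x))
        print(f"x={x}: round-up -> {up}, round-to-nearest -> {near}")
```

With rounding up, the decoded scale is never smaller than the input unless the input exceeds the largest finite e8m0 value, whereas round-to-nearest can return a scale below the input (for example, 1.25 maps to 1.0).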

I'm wondering if ml-dtypes has plans to update the rounding behavior or expose different rounding modes?

Thanks.
