Is NaN Propagation Necessary in ONNX Runtime? (Architectural Differences e.g. RISC-V vs x86/ARM) #24589

qiujiandong · 2025-04-29T07:32:28Z

Hi all,

I've been exploring floating-point NaN propagation in ONNX Runtime and noticed that the specification and implementation may implicitly assume that NaN payloads propagate across operations (i.e., a NaN input yields a NaN output that retains the input's bit pattern).

However, this behavior is not consistent across architectures.

For example, on RISC-V, the floating-point units (as defined by the F and D extensions) do not propagate input NaN payloads. Instead, an arithmetic operation whose result is NaN produces a fixed, canonical NaN (0x7fc00000 for float32). This is compliant with IEEE 754, which recommends but does not require payload propagation, but it means that no NaN payload is preserved.

By contrast, on some other architectures (like x86 or ARM), the hardware may propagate NaN payloads, and in some cases, even preserve the first encountered quiet NaN’s payload. This could cause subtle differences in outputs, particularly in deep learning pipelines that may not expect bitwise equality but still rely on NaN tracking behavior (e.g., for debugging, tracing invalid values, etc.).
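To make the difference concrete, here is a minimal C++ sketch for inspecting NaN bit patterns. The helper names (`BitsOf`, `FloatFromBits`, `NaNPayload`, `DemoNaNArithmetic`) are made up for illustration and are not part of ONNX Runtime or MLAS. The input pattern 0x7ff00002 is the one that appears in the failing test output.

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Reinterpret a float as its raw 32-bit pattern (memcpy avoids aliasing UB).
inline std::uint32_t BitsOf(float f) {
  std::uint32_t u;
  std::memcpy(&u, &f, sizeof(u));
  return u;
}

// Build a float from a raw 32-bit pattern.
inline float FloatFromBits(std::uint32_t u) {
  float f;
  std::memcpy(&f, &u, sizeof(f));
  return f;
}

// Canonical quiet NaN for float32, as produced by RISC-V F-extension arithmetic.
constexpr std::uint32_t kCanonicalNaN = 0x7fc00000u;

// Payload here means the 22 fraction bits below the quiet bit (bit 22).
inline std::uint32_t NaNPayload(std::uint32_t bits) {
  return bits & 0x003fffffu;
}

// Hypothetical demo: push a quiet NaN with a payload through one addition.
// The printed "out" value is architecture dependent, so none is asserted here.
inline void DemoNaNArithmetic() {
  volatile float in = FloatFromBits(0x7ff00002u);  // volatile blocks constant folding
  volatile float one = 1.0f;
  float out = in + one;
  std::printf("in=%08x out=%08x payload_kept=%d\n",
              static_cast<unsigned>(BitsOf(in)),
              static_cast<unsigned>(BitsOf(out)),
              static_cast<int>(NaNPayload(BitsOf(out)) == NaNPayload(BitsOf(in))));
}
```

Calling `DemoNaNArithmetic()` from `main` on each target should show whether the payload survives a single addition. Based on the behavior described above, I would expect `out=7fc00000` on RISC-V, while x86 and ARM in their default modes may keep the payload.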

For example, running the MLAS Activation test on RISC-V (under QEMU) fails:

$ qemu-riscv64 -cpu rv64,v=true  onnxruntime_mlas_test --gtest_filter=Activation.ShortExecute
-------------------------------------------------------
----Running normal quick check mode. To enable more complete test,
----  run with '--long' as first argument!
----Total 7508 tests registered programmably!
-------------------------------------------------------
Note: Google Test filter = Activation.ShortExecute
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from Activation
[ RUN      ] Activation.ShortExecute
/home/jdqiu/onnx/onnxruntime/onnxruntime/test/mlas/unittest/test_activation.cpp:250: Failure
Value of: Buffer[i].u == TestData[i][kind].u || Buffer[i].f == TestData[i][kind].f || error < 0.000001f
  Actual: false
Expected: true
, Scalar Activation Kind:2, i=2, value:7fc00000, expecting:7ff00002

/home/jdqiu/onnx/onnxruntime/onnxruntime/test/mlas/unittest/test_activation.cpp:250: Failure
Value of: Buffer[i].u == TestData[i][kind].u || Buffer[i].f == TestData[i][kind].f || error < 0.000001f
  Actual: false
Expected: true
, Scalar Activation Kind:2, i=3, value:7fc00000, expecting:fff00002

/home/jdqiu/onnx/onnxruntime/onnxruntime/test/mlas/unittest/test_activation.cpp:250: Failure
Value of: Buffer[i].u == TestData[i][kind].u || Buffer[i].f == TestData[i][kind].f || error < 0.000001f
  Actual: false
Expected: true
, Scalar Activation Kind:3, i=2, value:7fc00000, expecting:7ff00002

/home/jdqiu/onnx/onnxruntime/onnxruntime/test/mlas/unittest/test_activation.cpp:250: Failure
Value of: Buffer[i].u == TestData[i][kind].u || Buffer[i].f == TestData[i][kind].f || error < 0.000001f
  Actual: false
Expected: true
, Scalar Activation Kind:3, i=3, value:7fc00000, expecting:fff00002

/home/jdqiu/onnx/onnxruntime/onnxruntime/test/mlas/unittest/test_activation.cpp:250: Failure
Value of: Buffer[i].u == TestData[i][kind].u || Buffer[i].f == TestData[i][kind].f || error < 0.000001f
  Actual: false
Expected: true
, Scalar Activation Kind:4, i=2, value:7fc00000, expecting:7ff00002

/home/jdqiu/onnx/onnxruntime/onnxruntime/test/mlas/unittest/test_activation.cpp:250: Failure
Value of: Buffer[i].u == TestData[i][kind].u || Buffer[i].f == TestData[i][kind].f || error < 0.000001f
  Actual: false
Expected: true
, Scalar Activation Kind:4, i=3, value:7fc00000, expecting:fff00002

/home/jdqiu/onnx/onnxruntime/onnxruntime/test/mlas/unittest/test_activation.cpp:250: Failure
Value of: Buffer[i].u == TestData[i][kind].u || Buffer[i].f == TestData[i][kind].f || error < 0.000001f
  Actual: false
Expected: true
, Scalar Activation Kind:6, i=2, value:7fc00000, expecting:7ff00002

/home/jdqiu/onnx/onnxruntime/onnxruntime/test/mlas/unittest/test_activation.cpp:250: Failure
Value of: Buffer[i].u == TestData[i][kind].u || Buffer[i].f == TestData[i][kind].f || error < 0.000001f
  Actual: false
Expected: true
, Scalar Activation Kind:6, i=3, value:7fc00000, expecting:fff00002

[  FAILED  ] Activation.ShortExecute (8 ms)
[----------] 1 test from Activation (10 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (14 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] Activation.ShortExecute

 1 FAILED TEST
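Every failure above compares a canonical NaN (7fc00000) against an expected NaN that carries a payload (7ff00002 or fff00002). One possible relaxation of the check, sketched below, is to accept any NaN when a NaN is expected. `MatchesAllowingAnyNaN` and `FloatWithBits` are hypothetical helpers, not existing MLAS test code, and I am assuming that `error` in the failing assertion is an absolute difference.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Build a float from a raw 32-bit pattern (hypothetical helper).
inline float FloatWithBits(std::uint32_t u) {
  float f;
  std::memcpy(&f, &u, sizeof(f));
  return f;
}

// Hypothetical comparison: NaN matches NaN regardless of sign or payload;
// everything else falls back to equality or an absolute tolerance.
inline bool MatchesAllowingAnyNaN(float actual, float expected,
                                  float tolerance = 0.000001f) {
  if (std::isnan(actual) || std::isnan(expected)) {
    // Both must be NaN; the bit patterns are deliberately ignored.
    return std::isnan(actual) && std::isnan(expected);
  }
  if (actual == expected) {
    return true;  // also covers equal infinities and +0 versus -0
  }
  return std::fabs(actual - expected) < tolerance;
}
```

With a check like this, `MatchesAllowingAnyNaN(FloatWithBits(0x7fc00000u), FloatWithBits(0x7ff00002u))` returns true, so the cases above would pass on RISC-V while a non-NaN result where a NaN is expected would still fail.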

❓ Questions for discussion:

Should ONNX Runtime enforce or define consistent NaN propagation behavior across platforms?

For example: always propagate the first-encountered NaN payload, or always canonicalize?

Is it acceptable for NaN propagation behavior to vary by backend or architecture?

Should ONNX model authors assume anything about NaN bit patterns being preserved?

Would it make sense to explicitly document this in ONNX Runtime’s operator behavior or backend compliance expectations?
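To illustrate what the two candidate policies would mean if implemented in software, here is a rough C++ sketch. All names (`ToBits`, `ToFloat`, `Canonicalize`, `AddPropagatingFirstNaN`) are hypothetical and addition is only a stand-in for an arbitrary operator; this is not how ONNX Runtime currently behaves.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

inline std::uint32_t ToBits(float f) {
  std::uint32_t u;
  std::memcpy(&u, &f, sizeof(u));
  return u;
}

inline float ToFloat(std::uint32_t u) {
  float f;
  std::memcpy(&f, &u, sizeof(f));
  return f;
}

constexpr std::uint32_t kQuietBit = 0x00400000u;      // bit 22 of a float32
constexpr std::uint32_t kCanonicalBits = 0x7fc00000u; // canonical quiet NaN

// Policy A: always canonicalize. Any NaN result becomes 0x7fc00000.
inline float Canonicalize(float x) {
  return std::isnan(x) ? ToFloat(kCanonicalBits) : x;
}

// Policy B: propagate the first NaN operand's payload (quieted), emulated
// in software so the result does not depend on the hardware's NaN handling.
inline float AddPropagatingFirstNaN(float a, float b) {
  if (std::isnan(a)) return ToFloat(ToBits(a) | kQuietBit);
  if (std::isnan(b)) return ToFloat(ToBits(b) | kQuietBit);
  // No NaN operand: a freshly generated NaN (e.g. inf + -inf) has no
  // payload to preserve, so it is canonicalized.
  return Canonicalize(a + b);
}
```

Policy A costs one check on the output and matches what RISC-V hardware does anyway. Policy B needs a check on every operand of every operation, which is the kind of software emulation the motivation section asks about.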

⚙️ Motivation:

Understanding this could help backend developers (especially on RISC-V and other minimal/flexible hardware targets) know whether they must implement software-emulated NaN propagation for compliance, or if canonical NaN is acceptable.

Looking forward to your thoughts!

Answered by snnn

snnn · 2025-06-16T05:41:55Z

I think it depends on the model. Most models are not trained on RISC-V. When we run them on RISC-V, what matters in practice is getting good enough accuracy. MLPerf has ImageNet benchmarks: if we run the same model on RISC-V (with any backend) over the full ImageNet validation dataset and the accuracy still makes sense, the remaining differences are not important.
