Is NaN Propagation Necessary in ONNX Runtime? (Architectural Differences e.g. RISC-V vs x86/ARM) #24589
Hi all,

I've been exploring floating-point NaN propagation in ONNX Runtime and noticed that the specification and implementation may implicitly assume that NaN payloads propagate across operations (i.e., a NaN input produces a NaN output that retains its bit pattern). However, this behavior is not consistent across architectures.

For example, on RISC-V the floating-point units (as defined by the F and D extensions) do not propagate input NaN payloads. Instead, any arithmetic operation whose result is a NaN produces a fixed, canonical NaN (0x7fc00000 for float32). This conforms to IEEE 754, which recommends but does not require payload propagation, but it rules out any form of NaN payload preservation.

By contrast, on some other architectures (like x86 or ARM), the hardware may propagate NaN payloads, in some cases preserving the payload of the first quiet NaN operand. This can cause subtle differences in outputs, particularly in deep learning pipelines that do not expect bitwise equality but still rely on NaN tracking behavior (e.g., for debugging or tracing invalid values).

For example, running the MLAS Activation test on RISC-V under QEMU:

$ qemu-riscv64 -cpu rv64,v=true onnxruntime_mlas_test --gtest_filter=Activation.ShortExecute
-------------------------------------------------------
----Running normal quick check mode. To enable more complete test,
---- run with '--long' as first argument!
----Total 7508 tests registered programmably!
-------------------------------------------------------
Note: Google Test filter = Activation.ShortExecute
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from Activation
[ RUN ] Activation.ShortExecute
/home/jdqiu/onnx/onnxruntime/onnxruntime/test/mlas/unittest/test_activation.cpp:250: Failure
Value of: Buffer[i].u == TestData[i][kind].u || Buffer[i].f == TestData[i][kind].f || error < 0.000001f
Actual: false
Expected: true
, Scalar Activation Kind:2, i=2, value:7fc00000, expecting:7ff00002
/home/jdqiu/onnx/onnxruntime/onnxruntime/test/mlas/unittest/test_activation.cpp:250: Failure
Value of: Buffer[i].u == TestData[i][kind].u || Buffer[i].f == TestData[i][kind].f || error < 0.000001f
Actual: false
Expected: true
, Scalar Activation Kind:2, i=3, value:7fc00000, expecting:fff00002
/home/jdqiu/onnx/onnxruntime/onnxruntime/test/mlas/unittest/test_activation.cpp:250: Failure
Value of: Buffer[i].u == TestData[i][kind].u || Buffer[i].f == TestData[i][kind].f || error < 0.000001f
Actual: false
Expected: true
, Scalar Activation Kind:3, i=2, value:7fc00000, expecting:7ff00002
/home/jdqiu/onnx/onnxruntime/onnxruntime/test/mlas/unittest/test_activation.cpp:250: Failure
Value of: Buffer[i].u == TestData[i][kind].u || Buffer[i].f == TestData[i][kind].f || error < 0.000001f
Actual: false
Expected: true
, Scalar Activation Kind:3, i=3, value:7fc00000, expecting:fff00002
/home/jdqiu/onnx/onnxruntime/onnxruntime/test/mlas/unittest/test_activation.cpp:250: Failure
Value of: Buffer[i].u == TestData[i][kind].u || Buffer[i].f == TestData[i][kind].f || error < 0.000001f
Actual: false
Expected: true
, Scalar Activation Kind:4, i=2, value:7fc00000, expecting:7ff00002
/home/jdqiu/onnx/onnxruntime/onnxruntime/test/mlas/unittest/test_activation.cpp:250: Failure
Value of: Buffer[i].u == TestData[i][kind].u || Buffer[i].f == TestData[i][kind].f || error < 0.000001f
Actual: false
Expected: true
, Scalar Activation Kind:4, i=3, value:7fc00000, expecting:fff00002
/home/jdqiu/onnx/onnxruntime/onnxruntime/test/mlas/unittest/test_activation.cpp:250: Failure
Value of: Buffer[i].u == TestData[i][kind].u || Buffer[i].f == TestData[i][kind].f || error < 0.000001f
Actual: false
Expected: true
, Scalar Activation Kind:6, i=2, value:7fc00000, expecting:7ff00002
/home/jdqiu/onnx/onnxruntime/onnxruntime/test/mlas/unittest/test_activation.cpp:250: Failure
Value of: Buffer[i].u == TestData[i][kind].u || Buffer[i].f == TestData[i][kind].f || error < 0.000001f
Actual: false
Expected: true
, Scalar Activation Kind:6, i=3, value:7fc00000, expecting:fff00002
[ FAILED ] Activation.ShortExecute (8 ms)
[----------] 1 test from Activation (10 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (14 ms total)
[ PASSED ] 0 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] Activation.ShortExecute
1 FAILED TEST

In every failure the RISC-V run returns the canonical NaN 0x7fc00000, while the test data expects the input payload to come back unchanged (0x7ff00002 or 0xfff00002).

❓ Questions for discussion:

- Should ONNX Runtime enforce or define consistent NaN propagation behavior across platforms? For example, always propagate the first-encountered NaN payload, or always canonicalize?
- Is it acceptable for NaN propagation behavior to vary by backend or architecture?
- Should ONNX model authors assume anything about NaN bit patterns being preserved?
- Would it make sense to explicitly document this in ONNX Runtime's operator behavior or backend compliance expectations?

⚙️ Motivation: Understanding this would help backend developers (especially on RISC-V and other minimal/flexible hardware targets) know whether they must implement software-emulated NaN propagation for compliance, or whether a canonical NaN is acceptable.

Looking forward to your thoughts!
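
To make the difference concrete, here is a minimal sketch of how one might observe it and what a payload-agnostic comparison could look like. It is not taken from MLAS: `FromBits`, `ToBits`, `AddZeroBits`, `EquivalentIgnoringNaNPayload`, and `RunDemo` are hypothetical names, the addition of 0.0f is only a stand-in for whatever arithmetic an activation kernel performs, and the 1e-6 tolerance simply mirrors the one in the failing assertion.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Reinterpret bits via memcpy to avoid undefined behaviour from type punning.
float FromBits(uint32_t u) {
  float f;
  std::memcpy(&f, &u, sizeof f);
  return f;
}

uint32_t ToBits(float f) {
  uint32_t u;
  std::memcpy(&u, &f, sizeof u);
  return u;
}

// Returns the bit pattern produced by adding 0.0f to the given input bits.
// volatile keeps the compiler from folding the addition at compile time,
// so the result reflects what the target's FPU (or emulator) does.
uint32_t AddZeroBits(uint32_t input_bits) {
  volatile float in = FromBits(input_bits);
  float out = in + 0.0f;
  return ToBits(out);
}

// Hypothetical comparison (an assumption, not the MLAS test's logic):
// two NaNs are equivalent regardless of sign and payload; otherwise the
// values must be bitwise equal or within an absolute tolerance.
bool EquivalentIgnoringNaNPayload(float actual, float expected,
                                  float tol = 1e-6f) {
  if (std::isnan(actual) || std::isnan(expected)) {
    return std::isnan(actual) && std::isnan(expected);
  }
  return ToBits(actual) == ToBits(expected) ||
         std::fabs(actual - expected) < tol;
}

// Prints input and output bit patterns for the two NaNs seen in the log.
void RunDemo() {
  const uint32_t inputs[] = {0x7ff00002u, 0xfff00002u};
  for (uint32_t bits : inputs) {
    const uint32_t out = AddZeroBits(bits);
    std::printf("in=%08x out=%08x\n", static_cast<unsigned>(bits),
                static_cast<unsigned>(out));
    // The result being a NaN is portable; its payload is not.
    assert(std::isnan(FromBits(out)));
    assert(EquivalentIgnoringNaNPayload(FromBits(out), FromBits(bits)));
  }
}
```

Calling `RunDemo()` from `main` would typically print the input payload back on hardware that propagates payloads, and 7fc00000 on RISC-V. The output bits are deliberately not asserted, since they are exactly what varies between targets.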
Replies: 1 comment
I think it depends on the model. Most models are not trained on RISC-V. When we run them on RISC-V, what matters in practice is getting good enough accuracy. MLPerf has ImageNet benchmarks: if we run the same model on RISC-V (with any backend) with the full ImageNet validation dataset as the input, then as long as the accuracy on the images still makes sense, the other things are not important.