[CPU] Fix u8 Subtract to use wrap-around instead of saturation#33453

Open
Nishant-ZFYII wants to merge 8 commits into openvinotoolkit:master from
Nishant-ZFYII:fix/33164-u8-subtract-wrap-around

@Nishant-ZFYII
Contributor

[CPU] Fix u8 Subtract to use wrap-around instead of saturation

Fixes #33164

This fixes a bug where u8 subtraction saturated to 0 instead of wrapping around as NumPy does.

For example, uint8(3) - uint8(4) returned 0 but should return 255, that is, (3 - 4) mod 256.
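
To make the two behaviors concrete, here is a minimal C++ sketch of wrap-around versus saturating u8 subtraction. The function names are illustrative and are not taken from the plugin.

```cpp
#include <cstdint>

// Wrap-around (modular) subtraction: both operands are promoted to int,
// subtracted, and the narrowing cast reduces the result modulo 256.
constexpr std::uint8_t sub_u8_wrap(std::uint8_t a, std::uint8_t b) {
    return static_cast<std::uint8_t>(a - b);
}

// Saturating subtraction: any result below 0 is clamped to 0.
constexpr std::uint8_t sub_u8_saturate(std::uint8_t a, std::uint8_t b) {
    return a > b ? static_cast<std::uint8_t>(a - b) : std::uint8_t{0};
}
```

With these definitions, sub_u8_wrap(3, 4) gives 255 (the NumPy-like result), while sub_u8_saturate(3, 4) gives 0 (the behavior reported in the issue).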

What Changed

I found the bug was happening in two places:

ARM (ACL backend) - The subtract operation was hardcoded to use ConvertPolicy::SATURATE. I changed it to check the output type and use ConvertPolicy::WRAP when working with u8.

x64 (JIT backend) - The JIT emitter didn't support u8 precision at all for subtraction, so it was falling back to float operations and then saturating when converting back. I added u8 to the supported precisions and implemented it using the vpsubb instruction, which automatically does wrap-around.
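
As a rough model of the x64 description above, the sketch below contrasts a "compute in float, then saturate on conversion" path with a byte-wise integer subtract of the kind a single vpsubb lane performs. It is scalar, portable C++ rather than JIT code, all names are hypothetical, and rounding is ignored because the inputs are integers.

```cpp
#include <cstdint>

// Convert a float back to u8 with saturation to the range [0, 255].
constexpr std::uint8_t saturate_to_u8(float v) {
    return v <= 0.0f ? std::uint8_t{0}
                     : (v >= 255.0f ? std::uint8_t{255} : static_cast<std::uint8_t>(v));
}

// Model of the pre-fix path: subtract in float, saturate when converting back.
constexpr std::uint8_t sub_via_float_saturating(std::uint8_t a, std::uint8_t b) {
    return saturate_to_u8(static_cast<float>(a) - static_cast<float>(b));
}

// Model of one lane of a byte-wise integer subtract: the result wraps modulo 256.
constexpr std::uint8_t sub_bytewise_lane(std::uint8_t a, std::uint8_t b) {
    return static_cast<std::uint8_t>(a - b);
}
```

The two models agree whenever a >= b and differ only on underflow, which is exactly the case the issue reports.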

I also added tests to make sure this doesn't break again. The tests cover basic cases like 3 - 4 = 255, larger vectors, and 4D tensors.

Files modified

  • src/plugins/intel_cpu/src/nodes/executors/acl/acl_eltwise.cpp
  • src/plugins/intel_cpu/src/emitters/plugin/x64/jit_eltwise_emitters.cpp
  • src/plugins/intel_cpu/tests/.../subtract_u8_wrap.cpp (new test file)

Closes #33164

Fixes openvinotoolkit#33164

- Changed ACL executor to use ConvertPolicy::WRAP for u8 subtract
- Added u8 support to x64 JIT subtract emitter using vpsubb instruction
- Added regression tests for u8 subtract wrap-around behavior
@Nishant-ZFYII
Contributor Author

Hi maintainers — request for review/CI.

This PR fixes u8 Subtract wrap-around semantics in the CPU plugin (Fixes #33164). The issue reporter tested/reviewed and confirmed it solves their problem (they closed the issue after validating).

Changes summary:

  • ACL executor: use ConvertPolicy::WRAP for u8 subtract (instead of saturate)
  • x64 JIT subtract emitter: add u8 support via vpsubb (wrap-around behavior)
  • Added regression tests for u8 subtract wrap-around

Review request:

  • @openvinotoolkit/openvino-ie-cpu-maintainers (CODEOWNERS for src/plugins/intel_cpu/...)
  • @zhihaoxu1325 (tagged in the issue thread)

@maxnick maxnick assigned maxnick and EgorDuplensky and unassigned maxnick Jan 5, 2026
@maxnick
Contributor

maxnick commented Jan 5, 2026

@EgorDuplensky, could you please review?

Gate u8 subtract execution to only u8->u8 operations. This ensures
wrap-around behavior (e.g., 3 - 4 = 255) for pure u8 arithmetic while
preventing u8 execution for dequantization patterns (u8 input, f32/i32
output) where wrap-around would corrupt the math.

Changes:
- Modified get_supported_precisions() to conditionally enable u8 support
  only when both inputs AND output are u8
- Added defensive assertion in emit_isa() u8 case
- Removed [[maybe_unused]] attribute as node parameter is now used

Fixes openvinotoolkit#33164
Gate u8 subtract execution to only pure u8->u8 operations. This ensures
wrap-around behavior (e.g., 3 - 4 = 255) for unsigned arithmetic while
preventing u8 execution for dequantization patterns (u8 input, f32/i32
output) where wrap-around would corrupt the math.

Changes:
- JIT: Modified get_supported_precisions() to enable u8 only when both
  inputs AND output are u8
- ACL: Added same u8->u8 gating for ConvertPolicy::WRAP
- Tests: Added TypeRelaxed regression tests to catch LPT/dequant failures

Fixes openvinotoolkit#33164
@Nishant-ZFYII
Contributor Author

I investigated the CI failures and narrowed them down to overly-broad u8 enablement in the x64 JIT subtract path.

Root cause

The previous change advertised {u8, u8} unconditionally in jit_subtract_emitter::get_supported_precisions().
That allowed kernel selection to pick the u8 JIT implementation in Q/DQ / dequantization patterns where inputs are u8, but the subtraction is semantically part of dequant and the output is f32/i32.

This led to:

  • Crash: store_vector / store path doesn’t support emitting a u8 source into a non-u8 destination in that configuration (unsupported src_prc: u8).
  • Wrong results: wrap-around arithmetic was applied where signed/expanded arithmetic is required (e.g., 100u8 - 128 should become -28, not 228).
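
The wrong-results case can be reproduced with plain integer arithmetic. The sketch below is only an illustration with hypothetical function names: it compares a widened i32 subtraction, which a dequantization pattern needs, with a u8 wrap-around result that is widened afterwards.

```cpp
#include <cstdint>

// Widened subtraction for a dequantization-style pattern: u8 inputs, i32 output.
constexpr std::int32_t sub_widened_i32(std::uint8_t a, std::uint8_t b) {
    return static_cast<std::int32_t>(a) - static_cast<std::int32_t>(b);
}

// What a u8 wrap-around kernel would produce before the value reaches the i32 output:
// the difference is first reduced modulo 256 and only then widened.
constexpr std::int32_t sub_wrapped_then_widened(std::uint8_t a, std::uint8_t b) {
    return static_cast<std::int32_t>(static_cast<std::uint8_t>(a - b));
}
```

For the inputs 100 and 128, the widened version yields -28 and the wrapped version yields 228, matching the example in the bullet above.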

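Fix

  • JIT: get_supported_precisions() now advertises {u8, u8} only when both inputs and the output are u8.
  • ACL: ConvertPolicy::WRAP is applied under the same u8 → u8 gating.

Tests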

Extended subtract_u8_wrap.cpp with additional coverage for the failure mode:

  • u8 inputs with overridden f32/i32 outputs (TypeRelaxed) → verifies no wrap-around and prevents regression of the crash/wrong-results behavior.

Kept existing tests that validate wrap-around for pure u8 - u8 → u8.

Key point: wrap-around is only correct when the result type is also u8; for u8 - u8 → f32/i32 (typical dequant), modular arithmetic is incorrect.
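
The gating rule stated in the key point can be sketched as a small predicate. The Precision and Policy enums below are hypothetical stand-ins (Policy is loosely modeled on ACL's ConvertPolicy); the real plugin types and call sites differ.

```cpp
// Hypothetical stand-ins for the plugin's element types and ACL's ConvertPolicy.
enum class Precision { u8, i32, f32 };
enum class Policy { Wrap, Saturate };

// Wrap-around is only valid for pure u8 - u8 -> u8.
constexpr bool is_pure_u8(Precision in0, Precision in1, Precision out) {
    return in0 == Precision::u8 && in1 == Precision::u8 && out == Precision::u8;
}

// Every other combination, such as u8 inputs with an f32/i32 output,
// stays on the existing saturating path.
constexpr Policy select_policy(Precision in0, Precision in1, Precision out) {
    return is_pure_u8(in0, in1, out) ? Policy::Wrap : Policy::Saturate;
}
```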

@EgorDuplensky Could you please take another look at this updated approach?

@Nishant-ZFYII
Contributor Author

Thanks for the review. I will make the changes.

Thanks!

- Replace subtract_u8_wrap.cpp with proper eltwise_overflow test class
- Test both UNDERFLOW (subtract) and OVERFLOW (add) using CompareWithRefs
- Use all_of() utility instead of chained && comparisons
- Use OPENVINO_ASSERT instead of if-check for node null
- Remove issue ticket references from comments
@Nishant-ZFYII
Contributor Author

@EgorDuplensky Thanks for the review — I’ve pushed an update that addresses all the notes:

  • Removed issue/ticket links from the code comments and kept only the behavior description.
  • Switched the u8 type checks to ov::intel_cpu::all_of(...) in both the ACL executor and the JIT gating.
  • Replaced the if (node) guard with OPENVINO_ASSERT(node, ...) in jit_subtract_emitter::get_supported_precisions().
  • Reworked the regression coverage to use the existing CompareWithRefs flow (CPU plugin vs reference): removed subtract_u8_wrap.cpp and added eltwise_overflow tests parameterized by UNDERFLOW/OVERFLOW, covering Subtract underflow and Add overflow for u8.
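
For readers unfamiliar with the all_of style of check, here is a self-contained sketch of such a helper. It assumes the semantics "true when every listed item equals the first argument"; the name all_equal_to and the Precision enum are hypothetical, and the actual ov::intel_cpu::all_of utility may be declared differently.

```cpp
// Hypothetical stand-in for the plugin's element types.
enum class Precision { u8, i32, f32 };

// Base case: nothing left to compare, so the check holds.
template <typename T>
constexpr bool all_equal_to(const T&) {
    return true;
}

// Recursive case: the first item must equal `value`, and so must the rest.
template <typename T, typename U, typename... Rest>
constexpr bool all_equal_to(const T& value, const U& first, const Rest&... rest) {
    return first == value && all_equal_to(value, rest...);
}
```

A call such as all_equal_to(Precision::u8, in0, in1, out) then replaces a chain of three == comparisons joined with &&.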

Let me know if you’d prefer different shapes or a narrower/wider test scope.

@Nishant-ZFYII
Contributor Author

@EgorDuplensky, could you please take a look at the recent updates?

Thanks and regards.

… comments

- Added u8 wrap-around support for jit_add_emitter (x64 JIT)
- Added ConvertPolicy::WRAP for EltwiseAdd in ACL executor (ARM)
- Changed test inputs to hardcoded values that guarantee overflow/underflow
- Fixed test to use ov::Model directly instead of makeNgraphFunction
@Nishant-ZFYII
Contributor Author

Hi @EgorDuplensky — thanks again for the detailed review and for your patience and guidance while I worked through this.

I’ve pushed an update that addresses the latest comments:

  • Updated the regression test to use hardcoded u8 inputs (shape-independent) to deterministically trigger underflow/overflow.
  • Fixed u8 Add overflow handling by using wrap-around for pure u8→u8 in both ACL and the x64 JIT path, while keeping QDQ/mixed-precision cases on the widened/saturating path.
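
The Add case mirrors Subtract. As a minimal sketch with illustrative function names, wrap-around and saturating u8 addition differ only when the true sum exceeds 255, which is why hardcoded inputs such as 255 + 1 trigger the difference deterministically:

```cpp
#include <cstdint>

// Wrap-around addition: the int sum is reduced modulo 256 by the narrowing cast.
constexpr std::uint8_t add_u8_wrap(std::uint8_t a, std::uint8_t b) {
    return static_cast<std::uint8_t>(a + b);
}

// Saturating addition: sums above 255 are clamped to 255.
constexpr std::uint8_t add_u8_saturate(std::uint8_t a, std::uint8_t b) {
    return (a + b) > 255 ? std::uint8_t{255} : static_cast<std::uint8_t>(a + b);
}
```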

Whenever you have a moment, could you please take another look? Thanks!

I am happy to make further corrections if required.

Contributor

@EgorDuplensky EgorDuplensky left a comment

@Nishant-ZFYII Could you please double-check that the new tests fail without the changes?

@EgorDuplensky
Contributor

@Nishant-ZFYII Many tests failed, please check the logs.

@Nishant-ZFYII
Contributor Author

Nishant-ZFYII commented Feb 18, 2026

Hi @EgorDuplensky,

Pushed a fix for the 52 CI failures.

Root cause: get_supported_precisions() is called without a node argument by the SupportedPrecisions functor (in jit_uni_eltwise_generic.cpp), which means node defaults to nullptr. The OPENVINO_ASSERT(node, ...) I added was treating that as an error, but it's actually a valid code path — it's a general query for the base set of supported precisions, not tied to any specific node.

Fix: Replaced OPENVINO_ASSERT(node, ...) with if (node && ov::intel_cpu::all_of(...)) in both jit_add_emitter::get_supported_precisions and jit_subtract_emitter::get_supported_precisions. When node is nullptr, the method now returns the default precision set ({f32, f32}, {i32, i32}). When a concrete node is available and all its inputs/outputs are u8, it additionally includes {u8, u8}.
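
The pattern described in the fix can be sketched as follows. This is a simplified model under stated assumptions, not the plugin's code: NodeInfo, SupportedSet, and the signature of get_supported_precisions here are hypothetical, and the real method returns a set of precision lists rather than three flags.

```cpp
// Hypothetical stand-ins for the plugin's element types and graph node.
enum class Precision { u8, i32, f32 };
struct NodeInfo {
    Precision in0;
    Precision in1;
    Precision out;
};

// Simplified result: one flag per advertised precision pair.
struct SupportedSet {
    bool f32;  // {f32, f32}
    bool i32;  // {i32, i32}
    bool u8;   // {u8, u8}
};

constexpr bool is_pure_u8(const NodeInfo& n) {
    return n.in0 == Precision::u8 && n.in1 == Precision::u8 && n.out == Precision::u8;
}

// A null node is a valid "general query": return only the base set.
// With a concrete node, {u8, u8} is added only for pure u8 -> u8.
constexpr SupportedSet get_supported_precisions(const NodeInfo* node = nullptr) {
    return SupportedSet{true, true, node != nullptr && is_pure_u8(*node)};
}
```

The short-circuit && is what makes the null case safe: the node is dereferenced only after the pointer check has passed, so no assertion is needed.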

This was my oversight — I should have traced how get_supported_precisions is invoked across the codebase before adding the assert.

I also verified locally that the tests catch the bug on unpatched code:

  • With fix applied: all 6 smoke_EltwiseOverflowU8 tests pass
  • Without fix (reverted jit_eltwise_emitters.cpp and acl_eltwise.cpp to master while keeping the test files): all 6 tests fail with Expected: 255 Actual: 0 for underflow and Expected: 0 Actual: 255 for overflow

I want to make sure I'm not missing anything — does this if (node && ov::intel_cpu::all_of(...)) approach look correct to you, or would you prefer a different pattern here?

Comment on lines +1 to +3
// Copyright (C) 2025 Intel Corporation
// SPDX-License-Identifier: Apache-2.0
//
Contributor

Suggested change
// Copyright (C) 2025 Intel Corporation
// SPDX-License-Identifier: Apache-2.0
//
// Copyright (C) 2018-2026 Intel Corporation
// SPDX-License-Identifier: Apache-2.0
//

Comment on lines +1 to +3
// Copyright (C) 2025 Intel Corporation
// SPDX-License-Identifier: Apache-2.0
//
Contributor

Suggested change
// Copyright (C) 2025 Intel Corporation
// SPDX-License-Identifier: Apache-2.0
//
// Copyright (C) 2018-2026 Intel Corporation
// SPDX-License-Identifier: Apache-2.0
//

Comment on lines +1 to +3
// Copyright (C) 2025 Intel Corporation
// SPDX-License-Identifier: Apache-2.0
//
Contributor

Suggested change
// Copyright (C) 2025 Intel Corporation
// SPDX-License-Identifier: Apache-2.0
//
// Copyright (C) 2018-2026 Intel Corporation
// SPDX-License-Identifier: Apache-2.0
//

@praasz praasz added this to the 2026.1 milestone Feb 23, 2026
@EgorDuplensky
Contributor

@v-Golubev Could you please clarify why the get_supported_precisions() function has a 'Node' parameter at all? It is almost never used, but maybe I have missed something.

@Nishant-ZFYII
Contributor Author

Good day @v-Golubev, @EgorDuplensky,

Is there anything I can do on my end to help with this?

Thanks and regards,
Nishant.

Labels

category: CPU (OpenVINO CPU plugin), ExternalPR (External contributor)

Projects

None yet

5 participants