Fix silent data corruption in JIT eltwise kernel for i8/u8 bitwise ops with broadcast #34639

Open
goyaladitya05 wants to merge 2 commits into openvinotoolkit:master from goyaladitya05:fix/jit_eltwise_bitwise_i8_broadcast

@goyaladitya05 goyaladitya05 commented Mar 11, 2026

Fixed a data corruption bug in load_vector() where broadcasting i8/u8 values during bitwise operations produced incorrect results.

Cause

load_vector() has two broadcast paths based on whether src_prc == dst_prc:

  • src_prc != dst_prc: calls load_scalar to widen the value to 32 bits first, then broadcasts with uni_vbroadcastss - correct, the value is already 32-bit by the time it is broadcast.
  • src_prc == dst_prc: also called uni_vbroadcastss unconditionally - wrong for 8-bit types.

vbroadcastss copies 4 bytes at a time. For an i8 value, only byte 0 of each 4-byte lane gets the scalar; the other 3 bytes are zeroed. In a 256-bit register that means 8 correct bytes and 24 zeros, so any bitwise AND/OR/XOR operating on those lanes silently produces wrong results.
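The byte pattern described above can be reproduced without any SIMD hardware. The following is a minimal, portable sketch that models a 256-bit register as a 32-byte array; `Reg256`, `broadcast_dword`, `broadcast_byte`, `bitwise_and`, and `count_equal` are hypothetical names used only for illustration and are not part of the plugin.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Stand-in for one 256-bit vector register (illustration only).
using Reg256 = std::array<std::uint8_t, 32>;

// Models a 32-bit broadcast of a zero-extended 8-bit scalar:
// byte 0 of every 4-byte lane holds the scalar, the other 3 bytes stay zero.
Reg256 broadcast_dword(std::uint8_t scalar) {
    Reg256 r{};  // value-initialized: all bytes are zero
    for (std::size_t i = 0; i < r.size(); i += 4) {
        r[i] = scalar;
    }
    return r;
}

// Models a byte-wise broadcast: every byte lane holds the scalar.
Reg256 broadcast_byte(std::uint8_t scalar) {
    Reg256 r{};
    r.fill(scalar);
    return r;
}

// Element-wise AND over all 32 byte lanes.
Reg256 bitwise_and(const Reg256& a, const Reg256& b) {
    Reg256 r{};
    for (std::size_t i = 0; i < r.size(); ++i) {
        r[i] = static_cast<std::uint8_t>(a[i] & b[i]);
    }
    return r;
}

// Counts how many byte lanes equal `value`.
std::size_t count_equal(const Reg256& r, std::uint8_t value) {
    std::size_t n = 0;
    for (std::uint8_t b : r) {
        n += (b == value) ? 1 : 0;
    }
    return n;
}
```

With 32 data bytes of 0xFF and a scalar of 0x0F, the dword-style broadcast leaves only 8 lanes equal to 0x0F after the AND, while the byte-style broadcast yields 0x0F in all 32 lanes.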

Fix

In the src_prc == dst_prc branch, dispatch on src_prc.size() instead of always calling uni_vbroadcastss:

  • 1 byte (i8/u8)

    • AVX2+: vpbroadcastb - fills all byte lanes directly.
    • SSE4.1: punpcklbw + punpcklbw + pshufd 0 - SSE has no byte-broadcast instruction; two unpacks interleave the byte with itself, then pshufd splats it across all dword lanes.
  • 2 bytes

    • AVX2+: vpbroadcastw.
    • SSE4.1: punpcklwd + pshufd 0.
  • 4 bytes (i32/f32): uni_vbroadcastss is unchanged.
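To see why the SSE4.1 sequences in the list above produce a full splat, here is a portable sketch that emulates the three instructions on a 16-byte array. It is not the JIT code from the PR: `Xmm`, `pshufd_0`, and `broadcast_sse` are hypothetical helpers, and the sketch assumes the scalar already sits in the low bytes of the register with all upper bytes zero.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Stand-in for one 128-bit XMM register (illustration only).
using Xmm = std::array<std::uint8_t, 16>;

// Emulates punpcklbw dst, src: interleave the low 8 bytes of dst and src.
Xmm punpcklbw(const Xmm& dst, const Xmm& src) {
    Xmm r{};
    for (std::size_t i = 0; i < 8; ++i) {
        r[2 * i] = dst[i];
        r[2 * i + 1] = src[i];
    }
    return r;
}

// Emulates punpcklwd dst, src: interleave the low four 16-bit words of dst and src.
Xmm punpcklwd(const Xmm& dst, const Xmm& src) {
    Xmm r{};
    for (std::size_t w = 0; w < 4; ++w) {
        for (std::size_t b = 0; b < 2; ++b) {
            r[4 * w + b] = dst[2 * w + b];      // dst word w -> result word 2w
            r[4 * w + 2 + b] = src[2 * w + b];  // src word w -> result word 2w+1
        }
    }
    return r;
}

// Emulates pshufd dst, src, 0: copy dword 0 of src into all four dword lanes.
Xmm pshufd_0(const Xmm& src) {
    Xmm r{};
    for (std::size_t i = 0; i < r.size(); ++i) {
        r[i] = src[i % 4];
    }
    return r;
}

// Hypothetical dispatch on element size, following the SSE4.1 sequences in the list.
Xmm broadcast_sse(Xmm x, std::size_t elem_size) {
    switch (elem_size) {
    case 1:
        x = punpcklbw(x, x);  // bytes 0..1 now hold the scalar
        x = punpcklbw(x, x);  // bytes 0..3 now hold the scalar
        return pshufd_0(x);   // splat dword 0 across the register
    case 2:
        x = punpcklwd(x, x);  // words 0..1 now hold the scalar
        return pshufd_0(x);
    case 4:
        return pshufd_0(x);   // a 4-byte element is already one full dword
    default:
        throw std::invalid_argument("unsupported element size");
    }
}
```

Starting from a register whose byte 0 is 0xAB and whose other bytes are zero, `broadcast_sse(x, 1)` fills all 16 bytes with 0xAB. The AVX2 path needs none of this because `vpbroadcastb` and `vpbroadcastw` perform the splat in one instruction.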

Tests

Added smoke_CompareWithRefs_2D_Bitwise_i8u8_Broadcast to eltwise.cpp.

  • 24 test cases: AND / OR / XOR × i8 / u8 × CONSTANT / PARAMETER secondary input.
  • Shapes: two pairs - {1,64} vs {1,1} and {32,256} vs {1,1} - from the bug report. Each pair runs inference twice (full shape, then the {1,1} broadcast operand) to exercise the fixed path.
  • 2D only, no format constraints: unlike the existing 4D bitwise suite, which tests nhwc/nchw layout permutations, 2D tensors have no channel-last layout, so no CPUSpecificParams format is set. This keeps the test focused purely on broadcast correctness.

Closes #34638

AI Assistance:

  • AI assistance used: yes
  • If yes, summarize how AI was used and what human validation was performed (build/tests/manual checks): Used Claude Sonnet 4.6 to help pinpoint the location of the bug and work out the fix.
    Built it locally and verified that everything works.

@github-actions github-actions bot added the category: CPU OpenVINO CPU plugin label Mar 11, 2026
@sys-openvino-ci sys-openvino-ci added the ExternalPR External contributor label Mar 11, 2026
@goyaladitya05 goyaladitya05 marked this pull request as ready for review March 12, 2026 07:23
@goyaladitya05 goyaladitya05 requested review from a team as code owners March 12, 2026 07:23
Copilot AI review requested due to automatic review settings March 12, 2026 07:23
Contributor

Copilot AI left a comment

Pull request overview

Fixes a silent data corruption issue in the Intel CPU plugin’s x64 JIT eltwise kernel when broadcasting i8/u8 scalars for bitwise ops, and adds a focused regression test to cover the broadcast scenario from the reported bug.

Changes:

  • Update jit_uni_eltwise_generic::load_vector() to use byte/word-aware broadcast for src_prc == dst_prc (avoids vbroadcastss for 8-bit types).
  • Add a new 2D bitwise broadcast instantiation to validate i8/u8 AND/OR/XOR correctness when one operand is {1,1}-broadcast.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/plugins/intel_cpu/src/nodes/kernels/x64/jit_uni_eltwise_generic.cpp | Fixes scalar broadcast emission for 8-bit element types in the JIT load path used by bitwise ops. |
| src/plugins/intel_cpu/tests/functional/custom/single_layer_tests/instances/common/eltwise.cpp | Adds a regression test suite covering i8/u8 bitwise ops with {1,1} broadcast in 2D shapes. |

Comment on lines +562 to +569
    case 2:
        if (isa == x64::sse41) {
            punpcklwd(xmm_src, xmm_src);
            pshufd(xmm_src, xmm_src, 0);
        } else {
            vpbroadcastw(vmm_src, xmm_src);
        }
        break;
Copilot AI Mar 12, 2026

[MEDIUM] load_vector() adds a 2-byte broadcast path (case 2), but this code calls load_scalar() first, and load_scalar() currently throws for src_prc == dst_prc with src_prc.size() == 2 (it only supports sizes 1 and 4 in that branch). As a result, the new 2-byte broadcast logic is effectively unreachable and any future attempt to broadcast u16/i16 without type conversion will still fail at runtime. Either add 2-byte support to load_scalar() for the src_prc == dst_prc case (load 16 bits and clear upper bits) or remove the case 2 handling here to avoid implying support that isn't actually implemented.

Suggested change
    case 2:
        if (isa == x64::sse41) {
            punpcklwd(xmm_src, xmm_src);
            pshufd(xmm_src, xmm_src, 0);
        } else {
            vpbroadcastw(vmm_src, xmm_src);
        }
        break;
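
The first alternative in the review comment ("load 16 bits and clear upper bits") can be illustrated with a small portable sketch. This is not the plugin's `load_scalar()`; `Xmm` and `load_scalar_same_prc` are hypothetical names, and the sketch only models the register contents a same-precision scalar load would need to produce for 1-, 2-, and 4-byte elements.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>

// Stand-in for one 128-bit XMM register (illustration only).
using Xmm = std::array<std::uint8_t, 16>;

// Hypothetical model of a same-precision scalar load: copy `size` bytes into
// the low end of the register and leave every upper byte zero.
Xmm load_scalar_same_prc(const void* src, std::size_t size) {
    if (size != 1 && size != 2 && size != 4) {
        throw std::invalid_argument("unsupported element size");
    }
    Xmm r{};                           // upper bytes are cleared
    std::memcpy(r.data(), src, size);  // low bytes hold the scalar
    return r;
}
```

With the 2-byte case accepted here, a subsequent word broadcast would receive a register whose low word is the scalar and whose remaining bytes are zero, which is the precondition the `case 2` branch relies on.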

@maxnick maxnick added this to the 2026.1 milestone Mar 12, 2026
Contributor

maxnick commented Mar 12, 2026

build_jenkins

Labels

category: CPU OpenVINO CPU plugin ExternalPR External contributor

Development

Successfully merging this pull request may close these issues.

[Bug]: bitwise_and / bitwise_or / bitwise_xor return incorrect values for int8/uint8 when broadcasting

4 participants