Fix silent data corruption in JIT eltwise kernel for i8/u8 bitwise ops with broadcast#34639
Conversation
There was a problem hiding this comment.
Pull request overview
Fixes a silent data corruption issue in the Intel CPU plugin’s x64 JIT eltwise kernel when broadcasting i8/u8 scalars for bitwise ops, and adds a focused regression test to cover the broadcast scenario from the reported bug.
Changes:
- Update
jit_uni_eltwise_generic::load_vector()to use byte/word-aware broadcast forsrc_prc == dst_prc(avoidsvbroadcastssfor 8-bit types). - Add a new 2D bitwise broadcast instantiation to validate i8/u8 AND/OR/XOR correctness when one operand is
{1,1}-broadcast.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
src/plugins/intel_cpu/src/nodes/kernels/x64/jit_uni_eltwise_generic.cpp |
Fixes scalar broadcast emission for 8-bit element types in the JIT load path used by bitwise ops. |
src/plugins/intel_cpu/tests/functional/custom/single_layer_tests/instances/common/eltwise.cpp |
Adds a regression test suite covering i8/u8 bitwise ops with {1,1} broadcast in 2D shapes. |
You can also share your feedback on Copilot code review. Take the survey.
| case 2: | ||
| if (isa == x64::sse41) { | ||
| punpcklwd(xmm_src, xmm_src); | ||
| pshufd(xmm_src, xmm_src, 0); | ||
| } else { | ||
| vpbroadcastw(vmm_src, xmm_src); | ||
| } | ||
| break; |
There was a problem hiding this comment.
[MEDIUM] load_vector() adds a 2-byte broadcast path (case 2), but this code calls load_scalar() first, and load_scalar() currently throws for src_prc == dst_prc with src_prc.size() == 2 (it only supports sizes 1 and 4 in that branch). As a result, the new 2-byte broadcast logic is effectively unreachable and any future attempt to broadcast u16/i16 without type conversion will still fail at runtime. Either add 2-byte support to load_scalar() for the src_prc == dst_prc case (load 16 bits and clear upper bits) or remove the case 2 handling here to avoid implying support that isn't actually implemented.
| case 2: | |
| if (isa == x64::sse41) { | |
| punpcklwd(xmm_src, xmm_src); | |
| pshufd(xmm_src, xmm_src, 0); | |
| } else { | |
| vpbroadcastw(vmm_src, xmm_src); | |
| } | |
| break; |
|
build_jenkins |
Fixed a data corruption bug in
load_vector()where broadcasting i8/u8 values during bitwise operations produced incorrect results.Cause
load_vector()has two broadcast paths based on whethersrc_prc == dst_prc:src_prc != dst_prc: callsload_scalarto widen the value to 32 bits first, then broadcasts withuni_vbroadcastss- correct, the value is already 32-bit by the time it is broadcast.src_prc == dst_prc: also calleduni_vbroadcastssunconditionally - wrong for 8-bit types.vbroadcastsscopies 4 bytes at a time. For an i8 value, only byte 0 of each 4-byte lane gets the scalar; the other 3 bytes are zeroed. In a 256-bit register that means 8 correct bytes and 24 zeros, so any bitwise AND/OR/XOR operating on those lanes silently produces wrong results.Fix
In the
src_prc == dst_prcbranch, dispatch onsrc_prc.size()instead of always callinguni_vbroadcastss:1 byte (i8/u8)
vpbroadcastb- fills all byte lanes directly.punpcklbw+punpcklbw+pshufd 0- SSE has no byte-broadcast instruction; two unpacks interleave the byte with itself, thenpshufdsplats it across all dword lanes.2 bytes
vpbroadcastw.punpcklwd+pshufd 0.4 bytes (i32/f32):
uni_vbroadcastssis unchanged.Tests
Added
smoke_CompareWithRefs_2D_Bitwise_i8u8_Broadcasttoeltwise.cpp.{1,64}vs{1,1}and{32,256}vs{1,1}- from the bug report. Each pair runs inference twice (full shape, then the{1,1}broadcast operand) to exercise the fixed path.CPUSpecificParamsformat is set and keeps the test focused purely on broadcast correctness.Closes #34638
AI Assistance:
Built it locally, and verified everything works.