
[Bug]: Segmentation fault when running inference with oneDNN forced to SSE4.1 (AVX disabled) – possible bug in jit_uni_reorder.cpp for SSE4.1 path #35445

@Arshianb

Description

Environment

  • OpenVINO version: latest master
  • oneDNN version: v3.10.2
  • Host OS: Linux (Ubuntu 22.04)
  • Target ISA: SSE4.1 only (no AVX, AVX2, AVX512)
  • Compiler: GCC 11.4 / 12.x
  • Model: YOLOv8 with binary convolution (XNOR) layers

Device used for inference

CPU

Framework

PyTorch

Model used

custom YOLOv8

Issue description

I am trying to use OpenVINO with binary neural network (XNOR) layers on embedded boards that do not support AVX (no AVX2, AVX512F, BMI2). To test the behaviour, I built OpenVINO from source on my development machine (which does support AVX) with all AVX options disabled via CMake. The build succeeded, and a simple C++ inference program worked fine on my machine.

However, when I forced oneDNN to use SSE4.1 only (by setting ONEDNN_MAX_CPU_ISA=SSE41 and ONEDNN_VERBOSE=1), the same program crashes with a segmentation fault during execution. The oneDNN logs show that the ISA is correctly set to SSE4.1 and the crash occurs inside a reorder operation.

I suspect a bug in oneDNN’s SSE4.1 fallback path, most likely in jit_uni_reorder.cpp, related to memory alignment or instruction generation.

Step-by-step reproduction

1- Build OpenVINO with AVX fully disabled using the following CMake command (inside build directory):

cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DENABLE_INTEL_CPU=ON \
  -DENABLE_INTEL_GPU=OFF \
  -DENABLE_INTEL_NPU=OFF \
  -DENABLE_AVX2=OFF \
  -DENABLE_AVX512F=OFF \
  -DENABLE_SSE42=OFF \
  -DENABLE_CPU_DISPATCHER=OFF \
  -DCMAKE_C_FLAGS="-msse4.1 -mno-avx -mno-avx2 -mno-avx512f -mno-bmi2 -mfpmath=sse" \
  -DCMAKE_CXX_FLAGS="-msse4.1 -mno-avx -mno-avx2 -mno-avx512f -mno-bmi2 -mfpmath=sse" \
  -DDNNL_CPU_RUNTIME=SEQ \
  -DDNNL_ARCH_OPT_FLAGS="-msse4.1" \
  -DDNNL_ENABLE_JIT_PROFILING=OFF \
  -DDNNL_JIT=OFF \
  -DENABLE_OV_PADDLE_FRONTEND=OFF \
  -DENABLE_OV_JAX_FRONTEND=OFF \
  -DENABLE_OPENCV=OFF

2- Build OpenVINO as usual.
3- Write a minimal C++ inference program (e.g., run_model.cpp):

#include <iostream>
#include <vector>
#include <openvino/openvino.hpp>

int main(int argc, char* argv[]) {
    if (argc != 2) {
        std::cerr << "Usage: " << argv[0] << " <path_to_model.xml>" << std::endl;
        return 1;
    }

    std::string model_path = argv[1];

    try {
        ov::Core core;
        std::cout << "OpenVINO Core initialized." << std::endl;

        std::shared_ptr<ov::Model> model = core.read_model(model_path);
        std::cout << "Model loaded: " << model_path << std::endl;

        ov::CompiledModel compiled_model = core.compile_model(model, "CPU");
        std::cout << "Model compiled for CPU." << std::endl;

        ov::InferRequest infer_request = compiled_model.create_infer_request();

        ov::Tensor input_tensor = infer_request.get_input_tensor();
        auto* input_data = input_tensor.data<float>();
        for (size_t i = 0; i < input_tensor.get_size(); ++i) {
            input_data[i] = 0.0f;
        }

        infer_request.infer();
        std::cout << "Inference executed successfully." << std::endl;

        ov::Tensor output_tensor = infer_request.get_output_tensor();
        auto* output_data = output_tensor.data<float>();

        std::cout << "First 5 output values: ";
        for (size_t i = 0; i < std::min(output_tensor.get_size(), (size_t)5); ++i) {
            std::cout << output_data[i] << " ";
        }
        std::cout << std::endl;

    } catch (const std::exception& ex) {
        std::cerr << "Error: " << ex.what() << std::endl;
        return 1;
    }

    return 0;
}

4- Compile the program against the built OpenVINO.

g++ -O3 run_model.cpp -o run_model \
  -I/MountPoint/openvino/src/inference/include \
  -I/MountPoint/openvino/src/core/include \
  -L/MountPoint/openvino/bin/intel64/Release \
  -lopenvino

5- Run the program with a model that contains binary convolution layers (e.g., a YOLOv8 BNN model). On a machine with AVX support it works:

./run_model ./yolov8_bnn.xml

6- Force oneDNN to SSE4.1 and enable verbose logging:

export ONEDNN_MAX_CPU_ISA=SSE41
export ONEDNN_VERBOSE=1
./run_model ./yolov8_bnn.xml

Observed Behaviour

The program crashes with Segmentation fault (core dumped).
The oneDNN verbose output shows:

OpenVINO Core initialized.
Model loaded: yolov8_bnn.xml
onednn_verbose,v1,info,oneDNN v3.10.2 (commit 87f65fdd1927b1d0cbdf0ea37728146abfbffb52)
onednn_verbose,v1,info,cpu,runtime:threadpool,nthr:10
onednn_verbose,v1,info,cpu,isa:Intel SSE4.1
onednn_verbose,v1,info,gpu,runtime:none
onednn_verbose,v1,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:abcd::f0 dst:f32::blocked:Acdb8a::f0,,,32x1x3x3,0.0100098
onednn_verbose,v1,primitive,exec,cpu,reorder,simple:any,undef,src:bin::blocked:abcd::f0 dst:bin::blocked:ABcd8a32b::f0,,,64x32x3x3,0.166016
onednn_verbose,v1,primitive,exec,cpu,reorder,simple:any,undef,src:bin::blocked:abcd::f0 dst:bin::blocked:ABcd8a32b::f0,,,64x64x3x3,0.092041
onednn_verbose,v1,primitive,exec,cpu,reorder,simple:any,undef,src:bin::blocked:abcd::f0 dst:bin::blocked:ABcd8a32b::f0,,,128x64x3x3,0.164062
onednn_verbose,v1,primitive,exec,cpu,reorder,simple:any,undef,src:bin::blocked:abcd::f0 dst:bin::blocked:ABcd8a32b::f0,,,128x128x3x3,0.181885
onednn_verbose,v1,primitive,exec,cpu,reorder,simple:any,undef,src:bin::blocked:abcd::f0 dst:bin::blocked:ABcd8a32b::f0,,,256x128x3x3,0.468018
onednn_verbose,v1,primitive,exec,cpu,reorder,simple:any,undef,src:bin::blocked:abcd::f0 dst:bin::blocked:ABcd8a32b::f0,,,256x256x3x3,0.679932
onednn_verbose,v1,primitive,exec,cpu,reorder,simple:any,undef,src:bin::blocked:abcd::f0 dst:bin::blocked:ABcd8a32b::f0,,,512x256x3x3,1.04199
onednn_verbose,v1,primitive,exec,cpu,reorder,simple:any,undef,src:bin::blocked:abcd::f0 dst:bin::blocked:ABcd8a32b::f0,,,512x512x3x3,2.18701
onednn_verbose,v1,primitive,exec,cpu,reorder,simple:any,undef,src:bin::blocked:abcd::f0 dst:bin::blocked:ABcd8a32b::f0,,,256x512x1x1,0.314941
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:abcd::f0 dst:f32:p:blocked:ABcd8b8a::f0,,,70x256x1x1,0.291016
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:abcd::f0 dst:f32::blocked:ABcd8b8a::f0,,,256x256x1x1,0.436035
onednn_verbose,v1,primitive,exec,cpu,reorder,simple:any,undef,src:bin::blocked:abcd::f0 dst:bin::blocked:ABcd8a32b::f0,,,256x512x3x3,1.073
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:abcd::f0 dst:f32:p:blocked:ABcd8b8a::f0,,,70x256x1x1,0.286133
Model compiled for CPU.
onednn_verbose,v1,primitive,exec,cpu,convolution,jit:sse41,forward_inference,src:f32::blocked:abcd::f0 wei:f32:a:blocked:Acdb8a::f0 bia:undef::undef::: dst:f32::blocked:aBcd8b::f0,attr-scratchpad:user,alg:convolution_direct,mb1_ic1oc32_ih256oh128kh3sh2dh0ph1_iw480ow240kw3sw2dw0pw1,1.37598
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:aBcd8b::f0 dst:f32::blocked:acdb::f0,,,1x32x128x240,1.24316
Segmentation fault (core dumped)

The crash occurs during the reorder from aBcd8b to acdb.
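For what it's worth, this is how I plan to capture a backtrace of the crash (hypothetical commands on my setup, assuming a build with debug symbols):

```shell
# Re-run the failing case under gdb and dump a backtrace at the segfault
# (assumes run_model was built with -g; paths are from the repro above)
ulimit -c unlimited
export ONEDNN_MAX_CPU_ISA=SSE41
export ONEDNN_VERBOSE=1
gdb --batch -ex run -ex bt -ex "info registers" \
    --args ./run_model ./yolov8_bnn.xml
```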

Additional Information

  • I also rebuilt OpenVINO entirely on a system without any AVX support (same CMake flags) – the segmentation fault still occurs.
  • The problem does not happen when AVX is enabled (either via hardware or by not setting ONEDNN_MAX_CPU_ISA).
  • The model uses binary (XNOR) layers, and the memory layout changes from AVX‑optimized (e.g., Acdb16a) to SSE‑optimized (aBcd8b) when AVX is unavailable.
  • I suspect the SSE4.1 reorder JIT code (likely in src/plugins/intel_cpu/thirdparty/onednn/src/cpu/x64/jit_uni_reorder.cpp) has a bug, possibly an unaligned memory access or incorrect instruction generation for the aBcd8b → acdb transformation.
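If it helps triage, I can try to exercise the failing reorder in isolation with a minimal oneDNN-only program. This is a sketch, assuming the dnnl.hpp headers from the OpenVINO third-party build can be used directly; the shape 1x32x128x240 is taken from the verbose log above:

```cpp
// Minimal sketch: run only the suspect reorder (aBcd8b -> acdb) with the
// shape from the verbose log. Build against the oneDNN in the OpenVINO
// tree and run with ONEDNN_MAX_CPU_ISA=SSE41 to hit the SSE4.1 path.
#include <iostream>
#include <vector>
#include <dnnl.hpp>

int main() {
    dnnl::engine eng(dnnl::engine::kind::cpu, 0);
    dnnl::stream strm(eng);

    // Shape of the crashing reorder: 1x32x128x240, f32
    dnnl::memory::dims dims = {1, 32, 128, 240};
    auto src_md = dnnl::memory::desc(dims, dnnl::memory::data_type::f32,
                                     dnnl::memory::format_tag::aBcd8b);
    auto dst_md = dnnl::memory::desc(dims, dnnl::memory::data_type::f32,
                                     dnnl::memory::format_tag::acdb);

    std::vector<float> src_buf(src_md.get_size() / sizeof(float), 1.0f);
    std::vector<float> dst_buf(dst_md.get_size() / sizeof(float), 0.0f);
    dnnl::memory src(src_md, eng, src_buf.data());
    dnnl::memory dst(dst_md, eng, dst_buf.data());

    // Convenience reorder constructor picks the implementation,
    // mirroring what the plugin does internally.
    dnnl::reorder(src, dst).execute(strm, src, dst);
    strm.wait();

    std::cout << "reorder finished, dst[0] = " << dst_buf[0] << std::endl;
    return 0;
}
```

If this standalone program also segfaults under ONEDNN_MAX_CPU_ISA=SSE41, that would confirm the problem is in oneDNN rather than in the OpenVINO plugin.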

Request
I need guidance on how to successfully run OpenVINO with binary neural network models on embedded CPUs that lack any AVX support (only SSE4.1). Is there a workaround (e.g., additional build flags, different memory layout, disabling certain JIT kernels) or a known fix for this reorder crash? Any help would be greatly appreciated.

Possible Root Cause (as per my analysis)
On CPUs without AVX2, oneDNN falls back to SSE4.1 and uses 128‑bit registers. The reorder from aBcd8b (SSE‑specific blocked layout) to acdb (plain layout) is handled by a JIT kernel. This specific kernel may contain a bug – either unaligned memory access or incorrect register allocation – leading to a segfault.
Please let me know if any additional logs or debug information would help. Thank you!

Issue submission checklist

  • I'm reporting an issue. It's not a question.
  • I checked the problem with the documentation, FAQ, open issues, Stack Overflow, etc., and have not found a solution.
  • There is reproducer code and related data files such as images, videos, models, etc.
