
[Bug]: Segmentation fault when running inference with oneDNN forced to SSE4.1 (AVX disabled) – possible bug in jit_uni_reorder.cpp for SSE4.1 path #35445

@Arshianb

Description

Environment

  • OpenVINO version: latest master
  • oneDNN version: v3.10.2
  • Host OS: Linux (Ubuntu 22.04)
  • Target ISA: SSE4.1 only (no AVX, AVX2, AVX512)
  • Compiler: GCC 11.4 / 12.x
  • Model: YOLOv8 with binary convolution (XNOR) layers

Device used for inference

CPU

Framework

PyTorch

Model used

custom YOLOv8

Issue description

I am trying to use OpenVINO with binary neural network (XNOR) layers on embedded boards that do not support AVX (no AVX2, AVX512F, BMI2). To test the behaviour, I built OpenVINO from source on my development machine (which does support AVX) with all AVX options disabled via CMake. The build succeeded, and a simple C++ inference program worked fine on my machine.

However, when I forced oneDNN to use SSE4.1 only (by setting ONEDNN_MAX_CPU_ISA=SSE41 and ONEDNN_VERBOSE=1), the same program crashes with a segmentation fault during execution. The oneDNN logs show that the ISA is correctly set to SSE4.1 and the crash occurs inside a reorder operation.

I suspect a bug in oneDNN’s SSE4.1 fallback path, most likely in jit_uni_reorder.cpp, related to memory alignment or instruction generation.

Step-by-step reproduction

1- Build OpenVINO with AVX fully disabled using the following CMake command (inside build directory):

cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DENABLE_INTEL_CPU=ON \
  -DENABLE_INTEL_GPU=OFF \
  -DENABLE_INTEL_NPU=OFF \
  -DENABLE_AVX2=OFF \
  -DENABLE_AVX512F=OFF \
  -DENABLE_SSE42=OFF \
  -DENABLE_CPU_DISPATCHER=OFF \
  -DCMAKE_C_FLAGS="-msse4.1 -mno-avx -mno-avx2 -mno-avx512f -mno-bmi2 -mfpmath=sse" \
  -DCMAKE_CXX_FLAGS="-msse4.1 -mno-avx -mno-avx2 -mno-avx512f -mno-bmi2 -mfpmath=sse" \
  -DDNNL_CPU_RUNTIME=SEQ \
  -DDNNL_ARCH_OPT_FLAGS="-msse4.1" \
  -DDNNL_ENABLE_JIT_PROFILING=OFF \
  -DDNNL_JIT=OFF \
  -DENABLE_OV_PADDLE_FRONTEND=OFF \
  -DENABLE_OV_JAX_FRONTEND=OFF \
  -DENABLE_OPENCV=OFF

2- Build OpenVINO as usual.
3- Write a minimal C++ inference program (e.g., run_model.cpp):

#include <iostream>
#include <vector>
#include <openvino/openvino.hpp>

int main(int argc, char* argv[]) {
    if (argc != 2) {
        std::cerr << "Usage: " << argv[0] << " <path_to_model.xml>" << std::endl;
        return 1;
    }

    std::string model_path = argv[1];

    try {
        ov::Core core;
        std::cout << "OpenVINO Core initialized." << std::endl;

        std::shared_ptr<ov::Model> model = core.read_model(model_path);
        std::cout << "Model loaded: " << model_path << std::endl;

        ov::CompiledModel compiled_model = core.compile_model(model, "CPU");
        std::cout << "Model compiled for CPU." << std::endl;

        ov::InferRequest infer_request = compiled_model.create_infer_request();

        ov::Tensor input_tensor = infer_request.get_input_tensor();
        auto* input_data = input_tensor.data<float>();
        for (size_t i = 0; i < input_tensor.get_size(); ++i) {
            input_data[i] = 0.0f;
        }

        infer_request.infer();
        std::cout << "Inference executed successfully." << std::endl;

        ov::Tensor output_tensor = infer_request.get_output_tensor();
        auto* output_data = output_tensor.data<float>();

        std::cout << "First 5 output values: ";
        for (size_t i = 0; i < std::min(output_tensor.get_size(), (size_t)5); ++i) {
            std::cout << output_data[i] << " ";
        }
        std::cout << std::endl;

    } catch (const std::exception& ex) {
        std::cerr << "Error: " << ex.what() << std::endl;
        return 1;
    }

    return 0;
}

4- Compile the program against the built OpenVINO.

g++ -O3 run_model.cpp -o run_model \
  -I/MountPoint/openvino/src/inference/include \
  -I/MountPoint/openvino/src/core/include \
  -L/MountPoint/openvino/bin/intel64/Release \
  -lopenvino

5- Run the program with a model that contains binary convolution layers (e.g., a YOLOv8 BNN model). On a machine with AVX support it works:

./run_model ./yolov8_bnn.xml

6- Force oneDNN to SSE4.1 and enable verbose logging:

export ONEDNN_MAX_CPU_ISA=SSE41
export ONEDNN_VERBOSE=1
./run_model ./yolov8_bnn.xml

Observed Behaviour

The program crashes with Segmentation fault (core dumped).
The oneDNN verbose output shows:

OpenVINO Core initialized.
Model loaded: yolov8_bnn.xml
onednn_verbose,v1,info,oneDNN v3.10.2 (commit 87f65fdd1927b1d0cbdf0ea37728146abfbffb52)
onednn_verbose,v1,info,cpu,runtime:threadpool,nthr:10
onednn_verbose,v1,info,cpu,isa:Intel SSE4.1
onednn_verbose,v1,info,gpu,runtime:none
onednn_verbose,v1,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:abcd::f0 dst:f32::blocked:Acdb8a::f0,,,32x1x3x3,0.0100098
onednn_verbose,v1,primitive,exec,cpu,reorder,simple:any,undef,src:bin::blocked:abcd::f0 dst:bin::blocked:ABcd8a32b::f0,,,64x32x3x3,0.166016
onednn_verbose,v1,primitive,exec,cpu,reorder,simple:any,undef,src:bin::blocked:abcd::f0 dst:bin::blocked:ABcd8a32b::f0,,,64x64x3x3,0.092041
onednn_verbose,v1,primitive,exec,cpu,reorder,simple:any,undef,src:bin::blocked:abcd::f0 dst:bin::blocked:ABcd8a32b::f0,,,128x64x3x3,0.164062
onednn_verbose,v1,primitive,exec,cpu,reorder,simple:any,undef,src:bin::blocked:abcd::f0 dst:bin::blocked:ABcd8a32b::f0,,,128x128x3x3,0.181885
onednn_verbose,v1,primitive,exec,cpu,reorder,simple:any,undef,src:bin::blocked:abcd::f0 dst:bin::blocked:ABcd8a32b::f0,,,256x128x3x3,0.468018
onednn_verbose,v1,primitive,exec,cpu,reorder,simple:any,undef,src:bin::blocked:abcd::f0 dst:bin::blocked:ABcd8a32b::f0,,,256x256x3x3,0.679932
onednn_verbose,v1,primitive,exec,cpu,reorder,simple:any,undef,src:bin::blocked:abcd::f0 dst:bin::blocked:ABcd8a32b::f0,,,512x256x3x3,1.04199
onednn_verbose,v1,primitive,exec,cpu,reorder,simple:any,undef,src:bin::blocked:abcd::f0 dst:bin::blocked:ABcd8a32b::f0,,,512x512x3x3,2.18701
onednn_verbose,v1,primitive,exec,cpu,reorder,simple:any,undef,src:bin::blocked:abcd::f0 dst:bin::blocked:ABcd8a32b::f0,,,256x512x1x1,0.314941
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:abcd::f0 dst:f32:p:blocked:ABcd8b8a::f0,,,70x256x1x1,0.291016
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:abcd::f0 dst:f32::blocked:ABcd8b8a::f0,,,256x256x1x1,0.436035
onednn_verbose,v1,primitive,exec,cpu,reorder,simple:any,undef,src:bin::blocked:abcd::f0 dst:bin::blocked:ABcd8a32b::f0,,,256x512x3x3,1.073
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:abcd::f0 dst:f32:p:blocked:ABcd8b8a::f0,,,70x256x1x1,0.286133
Model compiled for CPU.
onednn_verbose,v1,primitive,exec,cpu,convolution,jit:sse41,forward_inference,src:f32::blocked:abcd::f0 wei:f32:a:blocked:Acdb8a::f0 bia:undef::undef::: dst:f32::blocked:aBcd8b::f0,attr-scratchpad:user,alg:convolution_direct,mb1_ic1oc32_ih256oh128kh3sh2dh0ph1_iw480ow240kw3sw2dw0pw1,1.37598
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:aBcd8b::f0 dst:f32::blocked:acdb::f0,,,1x32x128x240,1.24316
Segmentation fault (core dumped)

The crash occurs during the reorder from aBcd8b to acdb.
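For what it's worth, this is how I plan to capture a backtrace of the crash (hypothetical commands on my setup, assuming a build with debug symbols):

```shell
# Re-run the failing case under gdb and dump a backtrace at the segfault
# (assumes run_model was built with -g; paths are from the repro above)
ulimit -c unlimited
export ONEDNN_MAX_CPU_ISA=SSE41
export ONEDNN_VERBOSE=1
gdb --batch -ex run -ex bt -ex "info registers" \
    --args ./run_model ./yolov8_bnn.xml
```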

Additional Information

  • I also rebuilt OpenVINO entirely on a system without any AVX support (same CMake flags) – the segmentation fault still occurs.
  • The problem does not happen when AVX is enabled (either via hardware or by not setting ONEDNN_MAX_CPU_ISA).
  • The model uses binary (XNOR) layers, and the memory layout changes from AVX‑optimized (e.g., Acdb16a) to SSE‑optimized (aBcd8b) when AVX is unavailable.
  • I suspect the SSE4.1 reorder JIT code (likely in src/plugins/intel_cpu/thirdparty/onednn/src/cpu/x64/jit_uni_reorder.cpp) has a bug, possibly an unaligned memory access or incorrect instruction generation for the aBcd8b → acdb transformation.
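If it helps triage, I can try to exercise the failing reorder in isolation with a minimal oneDNN-only program. This is a sketch, assuming the dnnl.hpp headers from the OpenVINO third-party build can be used directly; the shape 1x32x128x240 is taken from the verbose log above:

```cpp
// Minimal sketch: run only the suspect reorder (aBcd8b -> acdb) with the
// shape from the verbose log. Build against the oneDNN in the OpenVINO
// tree and run with ONEDNN_MAX_CPU_ISA=SSE41 to hit the SSE4.1 path.
#include <iostream>
#include <vector>
#include <dnnl.hpp>

int main() {
    dnnl::engine eng(dnnl::engine::kind::cpu, 0);
    dnnl::stream strm(eng);

    // Shape of the crashing reorder: 1x32x128x240, f32
    dnnl::memory::dims dims = {1, 32, 128, 240};
    auto src_md = dnnl::memory::desc(dims, dnnl::memory::data_type::f32,
                                     dnnl::memory::format_tag::aBcd8b);
    auto dst_md = dnnl::memory::desc(dims, dnnl::memory::data_type::f32,
                                     dnnl::memory::format_tag::acdb);

    std::vector<float> src_buf(src_md.get_size() / sizeof(float), 1.0f);
    std::vector<float> dst_buf(dst_md.get_size() / sizeof(float), 0.0f);
    dnnl::memory src(src_md, eng, src_buf.data());
    dnnl::memory dst(dst_md, eng, dst_buf.data());

    // Convenience reorder constructor picks the implementation,
    // mirroring what the plugin does internally.
    dnnl::reorder(src, dst).execute(strm, src, dst);
    strm.wait();

    std::cout << "reorder finished, dst[0] = " << dst_buf[0] << std::endl;
    return 0;
}
```

If this standalone program also segfaults under ONEDNN_MAX_CPU_ISA=SSE41, that would confirm the problem is in oneDNN rather than in the OpenVINO plugin.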

Request
I need guidance on how to successfully run OpenVINO with binary neural network models on embedded CPUs that lack any AVX support (only SSE4.1). Is there a workaround (e.g., additional build flags, different memory layout, disabling certain JIT kernels) or a known fix for this reorder crash? Any help would be greatly appreciated.

Possible Root Cause (as per my analysis)
On CPUs without AVX2, oneDNN falls back to SSE4.1 and uses 128‑bit registers. The reorder from aBcd8b (SSE‑specific blocked layout) to acdb (plain layout) is handled by a JIT kernel. This specific kernel may contain a bug – either unaligned memory access or incorrect register allocation – leading to a segfault.
Please let me know if any additional logs or debug information would help. Thank you!

Issue submission checklist

  • I'm reporting an issue. It's not a question.
  • I checked the problem with the documentation, FAQ, open issues, Stack Overflow, etc., and have not found a solution.
  • There is reproducer code and related data files such as images, videos, models, etc.
