
Got Caught exception: bad_function_call during inference on GPU #1516

@hermeschen1116

Description

Hello, I'm trying to run a self-converted version of Qwen2-VL 2B on an Intel GPU, but I keep getting the exception below, while the same code works fine on CPU. I've tried tracing the inputs at each step on the Python side but can't figure out what differs between inference on CPU and on GPU. Sometimes it works after I recreate my venv, but then another failure shows up on the next run.

Here's my code (the relevant parts of the class that handles inference):

# load
from openvino import Core
from optimum.intel.openvino import OVModelForVisualCausalLM
from transformers import AutoProcessor

device = kwargs.get("device", "CPU")
ov_core = Core()
# "EXECUTION_MODE_HINT" is the hint property name; "EXECUTION_MODE" is not recognized
ov_core.set_property(device, {"EXECUTION_MODE_HINT": "PERFORMANCE", "CACHE_DIR": ""})

self.processor = AutoProcessor.from_pretrained(self._path_or_hf_repo, use_fast=True, trust_remote_code=True)
self.model = OVModelForVisualCausalLM.from_pretrained(
    self._path_or_hf_repo,
    ov_config=kwargs.get("ov_config", {}),
    use_cache=kwargs.get("use_cache", True),
    trust_remote_code=True,
    device=device.lower(),
    export=False,
    compile=True,
)
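
A side note on the set_property call: optimum-intel appears to compile the model with its own internal Core instance, so properties set on a separately constructed Core() may never reach the compiled model. The alternative is to route the properties through ov_config, sketched here with illustrative values:

# Sketch (assumption: a standalone Core() does not share properties with
# the Core that optimum-intel compiles with), so pass them via ov_config:
model = OVModelForVisualCausalLM.from_pretrained(
    self._path_or_hf_repo,
    ov_config={"PERFORMANCE_HINT": "LATENCY", "CACHE_DIR": ""},
    trust_remote_code=True,
    device=device.lower(),
    export=False,
    compile=True,
)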

# inference
prompt = self.processor.apply_chat_template(messages, add_generation_prompt=True, continue_final_message=False)
inputs = self.processor(text=prompt, images=image, return_tensors="pt")
generated_ids = self.model.generate(**inputs)
generated_texts = self.processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)

with this ov_config:

[load_config.ov_config]
KV_CACHE_PRECISION = "u8"
DYNAMIC_QUANTIZATION_GROUP_SIZE = "128" # commented out when testing on GPU
PERFORMANCE_HINT = "LATENCY"
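
For reference, this is roughly how that TOML table becomes the ov_config dict passed to from_pretrained (the file name is a placeholder; the table path follows the header above):

import tomllib  # stdlib since Python 3.11

# Hypothetical loader matching the [load_config.ov_config] table above
with open("config.toml", "rb") as f:
    config = tomllib.load(f)
ov_config = config["load_config"]["ov_config"]
# dropped when testing on GPU, per the comment above
ov_config.pop("DYNAMIC_QUANTIZATION_GROUP_SIZE", None)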

And I'm running this on Python 3.13.9 with the packages below. I also tried running the code with different combinations of installed packages (including torch with a different backend) and with a different model, but nothing changed.

nncf                      2.18.0
openvino                  2025.3.0
openvino-telemetry        2025.2.0
openvino-tokenizers       2025.3.0.0
optimum                   2.0.0
optimum-intel             1.26.1
optimum-onnx              0.0.3
pytorch-triton-xpu        3.5.0
torch                     2.9.0+xpu
torchvision               0.24.0+xpu
transformers              4.55.4
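
To rule out a broken GPU plugin in this environment, a quick sanity check like the following (standard openvino API; the printed values are examples):

from openvino import Core

core = Core()
print(core.available_devices)                        # expect something like ['CPU', 'GPU']
print(core.get_property("GPU", "FULL_DEVICE_NAME"))  # name of the Intel GPU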

(BTW, openvino 2025.2.0 sometimes throws another error, which occurred on both CPU and GPU:)

Caught exception: Exception from src/plugins/intel_gpu/src/runtime/ocl/ocl_stream.cpp:272:
E       [GPU] [CL_EXT] setArgUsm in KernelIntel failed, error code: -49 CL_INVALID_ARG_INDEX

And here's the main error with the full traceback:

src/vlm_server/internal/Model.py:285: in _response
    generated_ids = self.model.generate(**inputs)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.13/site-packages/torch/utils/_contextlib.py:123: in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.13/site-packages/transformers/generation/utils.py:2618: in generate
    result = self._sample(
.venv/lib/python3.13/site-packages/transformers/generation/utils.py:3602: in _sample
    outputs = self(**model_inputs, return_dict=True)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.13/site-packages/optimum/modeling_base.py:113: in __call__
    return self.forward(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.13/site-packages/optimum/intel/openvino/modeling_visual_language.py:2914: in forward
    result = super().forward(
.venv/lib/python3.13/site-packages/optimum/intel/openvino/modeling_visual_language.py:791: in forward
    return self.language_model.forward(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <optimum.intel.openvino.modeling_visual_language.OVModelWithEmbedForCausalLM object at 0x7caf96699fd0>, input_ids = None
attention_mask = tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1... 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1]])
past_key_values = None
position_ids = tensor([[[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
          17, 18, 19, 20, 21, 22, 23, 24...9, 40, 41, 36, 37, 38, 39, 40, 41, 36, 37, 38, 39, 40, 41, 36, 37,
          38, 39, 40, 41, 42, 43, 44, 45, 46, 47]]])
inputs_embeds = tensor([[[-0.0075,  0.0098,  0.0053,  ..., -0.0015,  0.0090, -0.0060],
         [ 0.0023,  0.0172,  0.0163,  ...,  0.0...42,  0.0131,  ...,  0.0146,  0.0284, -0.0102],
         [ 0.0338, -0.0124,  0.0132,  ...,  0.0008, -0.0107,  0.0165]]])
kwargs = {'cache_position': tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
        18, 19, 20,..., 67, 68, 69, 70, 71,
        72, 73, 74, 75, 76, 77]), 'return_dict': True, 'token_type_ids': None, 'use_cache': True}
inputs = {'attention_mask': array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1,...9, 40, 41, 36, 37, 38, 39, 40, 41, 36, 37, 38, 39,
         40, 41, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47]]])}

    def forward(
        self,
        input_ids: torch.LongTensor,
        attention_mask: Optional[torch.LongTensor] = None,
        past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
        position_ids: Optional[torch.LongTensor] = None,
        inputs_embeds: Optional[torch.LongTensor] = None,
        **kwargs,
    ):
        self.compile()

        inputs = self.prepare_inputs(
            input_ids=input_ids,
            attention_mask=attention_mask,
            past_key_values=past_key_values,
            position_ids=position_ids,
            inputs_embeds=inputs_embeds,
            **kwargs,
        )
        # Run inference
        self.request.start_async(inputs, share_inputs=True)
>       self.request.wait()
E       RuntimeError: Exception from src/inference/src/cpp/infer_request.cpp:245:
E       Exception from src/bindings/python/src/pyopenvino/core/infer_request.hpp:54:
E       Caught exception: bad_function_call

.venv/lib/python3.13/site-packages/optimum/intel/openvino/modeling_visual_language.py:223: RuntimeError
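
A minimal sketch that isolates the device dependence, reusing the load/inference snippets above (max_new_tokens is arbitrary); per the behavior described, the CPU run completes while the GPU run raises the exception:

# Sketch: same inputs on two devices (identifiers reuse the snippets above)
for dev in ("CPU", "GPU"):
    m = OVModelForVisualCausalLM.from_pretrained(
        self._path_or_hf_repo, trust_remote_code=True,
        device=dev.lower(), export=False, compile=True,
    )
    out = m.generate(**inputs, max_new_tokens=16)
    print(dev, self.processor.batch_decode(out, skip_special_tokens=True))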
