Description
Hello, I'm trying to run a self-converted Qwen2-VL 2B on an Intel GPU, but I keep getting the exception below, while the same code works fine on CPU. I've tried tracing the inputs at each step on the Python side, but I can't figure out what differs between CPU and GPU inference. Sometimes it works after I recreate my venv, then the failure comes back the next time.
Here's my code (part of the class that handles inference):

```python
# load
from transformers import AutoProcessor
from openvino import Core
from optimum.intel.openvino import OVModelForVisualCausalLM

device = kwargs.get("device", "CPU")
ov_core = Core()
ov_core.set_property(device, {"EXECUTION_MODE": "PERFORMANCE", "CACHE_DIR": ""})
self.processor = AutoProcessor.from_pretrained(self._path_or_hf_repo, use_fast=True, trust_remote_code=True)
self.model = OVModelForVisualCausalLM.from_pretrained(
    self._path_or_hf_repo,
    ov_config=kwargs.get("ov_config", {}),
    use_cache=kwargs.get("use_cache", True),
    trust_remote_code=True,
    device=device.lower(),
    export=False,
    compile=True,
)
# inference
prompt = self.processor.apply_chat_template(messages, add_generation_prompt=True, continue_final_message=False)
inputs = self.processor(text=prompt, images=image, return_tensors="pt")
generated_ids = self.model.generate(**inputs)
generated_texts = self.processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
```
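For reference, here is roughly the same flow as a standalone script outside my class (the model path and test image are placeholders, not my actual setup):

```python
from PIL import Image
from transformers import AutoProcessor
from optimum.intel.openvino import OVModelForVisualCausalLM

MODEL_DIR = "./qwen2-vl-2b-ov"  # placeholder for my locally converted model

processor = AutoProcessor.from_pretrained(MODEL_DIR, use_fast=True, trust_remote_code=True)
model = OVModelForVisualCausalLM.from_pretrained(MODEL_DIR, device="gpu", trust_remote_code=True)

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
image = Image.open("test.jpg")  # placeholder image
inputs = processor(text=prompt, images=image, return_tensors="pt")

generated_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(generated_ids, skip_special_tokens=True))
```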
The model is loaded with this `ov_config`:

```toml
[load_config.ov_config]
KV_CACHE_PRECISION = "u8"
DYNAMIC_QUANTIZATION_GROUP_SIZE = "128" # commented out when testing on GPU
PERFORMANCE_HINT = "LATENCY"
```

I'm running this on Python 3.13.9 with the packages below. I also tried different combinations of installed packages (including torch with a different backend) and a different model, but nothing changed.

```
nncf 2.18.0
openvino 2025.3.0
openvino-telemetry 2025.2.0
openvino-tokenizers 2025.3.0.0
optimum 2.0.0
optimum-intel 1.26.1
optimum-onnx 0.0.3
pytorch-triton-xpu 3.5.0
torch 2.9.0+xpu
torchvision 0.24.0+xpu
transformers 4.55.4
```
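(Versions above were collected with a quick snippet like this, using the stdlib:)

```python
from importlib.metadata import version

for pkg in ("nncf", "openvino", "optimum", "optimum-intel", "torch", "transformers"):
    print(f"{pkg:<20} {version(pkg)}")
```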
(BTW, openvino 2025.2.0 sometimes throws another error, which occurred on both CPU and GPU:)

```
Caught exception: Exception from src/plugins/intel_gpu/src/runtime/ocl/ocl_stream.cpp:272:
E [GPU] [CL_EXT] setArgUsm in KernelIntel failed, error code: -49 CL_INVALID_ARG_INDEX
```

And the full error:

```
src/vlm_server/internal/Model.py:285: in _response
generated_ids = self.model.generate(**inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.13/site-packages/torch/utils/_contextlib.py:123: in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.13/site-packages/transformers/generation/utils.py:2618: in generate
result = self._sample(
.venv/lib/python3.13/site-packages/transformers/generation/utils.py:3602: in _sample
outputs = self(**model_inputs, return_dict=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.13/site-packages/optimum/modeling_base.py:113: in __call__
return self.forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.13/site-packages/optimum/intel/openvino/modeling_visual_language.py:2914: in forward
result = super().forward(
.venv/lib/python3.13/site-packages/optimum/intel/openvino/modeling_visual_language.py:791: in forward
return self.language_model.forward(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <optimum.intel.openvino.modeling_visual_language.OVModelWithEmbedForCausalLM object at 0x7caf96699fd0>, input_ids = None
attention_mask = tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1... 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1]])
past_key_values = None
position_ids = tensor([[[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24...9, 40, 41, 36, 37, 38, 39, 40, 41, 36, 37, 38, 39, 40, 41, 36, 37,
38, 39, 40, 41, 42, 43, 44, 45, 46, 47]]])
inputs_embeds = tensor([[[-0.0075, 0.0098, 0.0053, ..., -0.0015, 0.0090, -0.0060],
[ 0.0023, 0.0172, 0.0163, ..., 0.0...42, 0.0131, ..., 0.0146, 0.0284, -0.0102],
[ 0.0338, -0.0124, 0.0132, ..., 0.0008, -0.0107, 0.0165]]])
kwargs = {'cache_position': tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20,..., 67, 68, 69, 70, 71,
72, 73, 74, 75, 76, 77]), 'return_dict': True, 'token_type_ids': None, 'use_cache': True}
inputs = {'attention_mask': array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1,...9, 40, 41, 36, 37, 38, 39, 40, 41, 36, 37, 38, 39,
40, 41, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47]]])}
def forward(
self,
input_ids: torch.LongTensor,
attention_mask: Optional[torch.LongTensor] = None,
past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
position_ids: Optional[torch.LongTensor] = None,
inputs_embeds: Optional[torch.LongTensor] = None,
**kwargs,
):
self.compile()
inputs = self.prepare_inputs(
input_ids=input_ids,
attention_mask=attention_mask,
past_key_values=past_key_values,
position_ids=position_ids,
inputs_embeds=inputs_embeds,
**kwargs,
)
# Run inference
self.request.start_async(inputs, share_inputs=True)
> self.request.wait()
E RuntimeError: Exception from src/inference/src/cpp/infer_request.cpp:245:
E Exception from src/bindings/python/src/pyopenvino/core/infer_request.hpp:54:
E Caught exception: bad_function_call
.venv/lib/python3.13/site-packages/optimum/intel/openvino/modeling_visual_language.py:223: RuntimeError
```
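For what it's worth, the CPU-vs-GPU tracing I mentioned was essentially a check like this (sketch only; `MODEL_DIR` and `inputs` as in the standalone script above):

```python
import torch

# One deterministic forward pass per device on identical inputs, then
# compare the logits; on my machine the GPU pass is where the failure
# above shows up, so the comparison never completes.
logits = {}
for device in ("cpu", "gpu"):
    m = OVModelForVisualCausalLM.from_pretrained(MODEL_DIR, device=device, trust_remote_code=True)
    with torch.no_grad():
        logits[device] = m(**inputs).logits

print("max |CPU - GPU| diff:", (logits["cpu"] - logits["gpu"]).abs().max().item())
```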