Description
Hello, I'm trying to run a self-converted Qwen2-VL 2B on an Intel GPU, but I keep getting the exception below, while the same code works fine on CPU. I've tried tracing the inputs at each step on the Python side, but I can't figure out what differs between CPU and GPU inference. Sometimes it works after I recreate my venv, then the failure comes back the next time.
Here's my code (part of the class that handles inference):

```python
# load
from transformers import AutoProcessor
from openvino import Core
from optimum.intel.openvino import OVModelForVisualCausalLM

device = kwargs.get("device", "CPU")
ov_core = Core()
ov_core.set_property(device, {"EXECUTION_MODE": "PERFORMANCE", "CACHE_DIR": ""})
self.processor = AutoProcessor.from_pretrained(self._path_or_hf_repo, use_fast=True, trust_remote_code=True)
self.model = OVModelForVisualCausalLM.from_pretrained(
    self._path_or_hf_repo,
    ov_config=kwargs.get("ov_config", {}),
    use_cache=kwargs.get("use_cache", True),
    trust_remote_code=True,
    device=device.lower(),
    export=False,
    compile=True,
)
# inference
prompt = self.processor.apply_chat_template(messages, add_generation_prompt=True, continue_final_message=False)
inputs = self.processor(text=prompt, images=image, return_tensors="pt")
generated_ids = self.model.generate(**inputs)
generated_texts = self.processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
```
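For reference, here is roughly the same flow as a standalone script outside my class (the model path and test image are placeholders, not my actual setup):

```python
from PIL import Image
from transformers import AutoProcessor
from optimum.intel.openvino import OVModelForVisualCausalLM

MODEL_DIR = "./qwen2-vl-2b-ov"  # placeholder for my locally converted model

processor = AutoProcessor.from_pretrained(MODEL_DIR, use_fast=True, trust_remote_code=True)
model = OVModelForVisualCausalLM.from_pretrained(MODEL_DIR, device="gpu", trust_remote_code=True)

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
image = Image.open("test.jpg")  # placeholder image
inputs = processor(text=prompt, images=image, return_tensors="pt")

generated_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(generated_ids, skip_special_tokens=True))
```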
The model is loaded with this `ov_config`:

```toml
[load_config.ov_config]
KV_CACHE_PRECISION = "u8"
DYNAMIC_QUANTIZATION_GROUP_SIZE = "128" # commented out when testing on GPU
PERFORMANCE_HINT = "LATENCY"
```

I'm running this on Python 3.13.9 with the packages below. I also tried different combinations of installed packages (including torch with a different backend) and a different model, but nothing changed.

```
nncf 2.18.0
openvino 2025.3.0
openvino-telemetry 2025.2.0
openvino-tokenizers 2025.3.0.0
optimum 2.0.0
optimum-intel 1.26.1
optimum-onnx 0.0.3
pytorch-triton-xpu 3.5.0
torch 2.9.0+xpu
torchvision 0.24.0+xpu
transformers 4.55.4
```
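(Versions above were collected with a quick snippet like this, using the stdlib:)

```python
from importlib.metadata import version

for pkg in ("nncf", "openvino", "optimum", "optimum-intel", "torch", "transformers"):
    print(f"{pkg:<20} {version(pkg)}")
```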
(BTW, openvino 2025.2.0 sometimes throws another error, which occurred on both CPU and GPU:)

```
Caught exception: Exception from src/plugins/intel_gpu/src/runtime/ocl/ocl_stream.cpp:272:
E [GPU] [CL_EXT] setArgUsm in KernelIntel failed, error code: -49 CL_INVALID_ARG_INDEX
```

And the full error:

```
src/vlm_server/internal/Model.py:285: in _response
generated_ids = self.model.generate(**inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.13/site-packages/torch/utils/_contextlib.py:123: in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.13/site-packages/transformers/generation/utils.py:2618: in generate
result = self._sample(
.venv/lib/python3.13/site-packages/transformers/generation/utils.py:3602: in _sample
outputs = self(**model_inputs, return_dict=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.13/site-packages/optimum/modeling_base.py:113: in __call__
return self.forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.13/site-packages/optimum/intel/openvino/modeling_visual_language.py:2914: in forward
result = super().forward(
.venv/lib/python3.13/site-packages/optimum/intel/openvino/modeling_visual_language.py:791: in forward
return self.language_model.forward(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <optimum.intel.openvino.modeling_visual_language.OVModelWithEmbedForCausalLM object at 0x7caf96699fd0>, input_ids = None
attention_mask = tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1... 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1]])
past_key_values = None
position_ids = tensor([[[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24...9, 40, 41, 36, 37, 38, 39, 40, 41, 36, 37, 38, 39, 40, 41, 36, 37,
38, 39, 40, 41, 42, 43, 44, 45, 46, 47]]])
inputs_embeds = tensor([[[-0.0075, 0.0098, 0.0053, ..., -0.0015, 0.0090, -0.0060],
[ 0.0023, 0.0172, 0.0163, ..., 0.0...42, 0.0131, ..., 0.0146, 0.0284, -0.0102],
[ 0.0338, -0.0124, 0.0132, ..., 0.0008, -0.0107, 0.0165]]])
kwargs = {'cache_position': tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20,..., 67, 68, 69, 70, 71,
72, 73, 74, 75, 76, 77]), 'return_dict': True, 'token_type_ids': None, 'use_cache': True}
inputs = {'attention_mask': array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1,...9, 40, 41, 36, 37, 38, 39, 40, 41, 36, 37, 38, 39,
40, 41, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47]]])}
def forward(
self,
input_ids: torch.LongTensor,
attention_mask: Optional[torch.LongTensor] = None,
past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
position_ids: Optional[torch.LongTensor] = None,
inputs_embeds: Optional[torch.LongTensor] = None,
**kwargs,
):
self.compile()
inputs = self.prepare_inputs(
input_ids=input_ids,
attention_mask=attention_mask,
past_key_values=past_key_values,
position_ids=position_ids,
inputs_embeds=inputs_embeds,
**kwargs,
)
# Run inference
self.request.start_async(inputs, share_inputs=True)
> self.request.wait()
E RuntimeError: Exception from src/inference/src/cpp/infer_request.cpp:245:
E Exception from src/bindings/python/src/pyopenvino/core/infer_request.hpp:54:
E Caught exception: bad_function_call
.venv/lib/python3.13/site-packages/optimum/intel/openvino/modeling_visual_language.py:223: RuntimeError
```
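For what it's worth, the CPU-vs-GPU tracing I mentioned was essentially a check like this (sketch only; `MODEL_DIR` and `inputs` as in the standalone script above):

```python
import torch

# One deterministic forward pass per device on identical inputs, then
# compare the logits; on my machine the GPU pass is where the failure
# above shows up, so the comparison never completes.
logits = {}
for device in ("cpu", "gpu"):
    m = OVModelForVisualCausalLM.from_pretrained(MODEL_DIR, device=device, trust_remote_code=True)
    with torch.no_grad():
        logits[device] = m(**inputs).logits

print("max |CPU - GPU| diff:", (logits["cpu"] - logits["gpu"]).abs().max().item())
```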