Labels: bug (Something isn't working)
Description
Your current environment
The output of `python collect_env.py`:

```
vllm==0.11.0
```
🐛 Describe the bug
When initializing GLM-4.1V with vLLM (even for text-only/image-only inference), the engine core fails to start due to a constraint mismatch in dummy video input generation:
- vLLM automatically generates dummy video inputs during encoder budget calculation (even when no video is provided by the user).
- The fallback logic in `get_num_frames_with_most_features` sets `num_frames=1` (the minimum value) when the video token allocation is insufficient.
- GLM-4.1V's `smart_resize` function enforces a strict constraint: `num_frames > temporal_factor=2` (not `>=`), triggering a `ValueError`.
- This leads to a fatal engine core initialization failure, blocking all inference for GLM-4.1V.
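The mismatch can be illustrated with a minimal sketch. This is not the actual transformers implementation (the exact comparison in `smart_resize` may differ); the error message and `temporal_factor=2` are taken from the traceback below:

```python
# Illustrative sketch of the constraint that trips during dummy-input
# profiling (hypothetical helper, not the real transformers code).
TEMPORAL_FACTOR = 2  # from the traceback: "temporal_factor:2"

def check_num_frames(num_frames: int) -> None:
    # Reject frame counts below the temporal factor, as GLM-4.1V's
    # video processor does for the profiler's num_frames=1 fallback.
    if num_frames < TEMPORAL_FACTOR:
        raise ValueError(
            f"t:{num_frames} must be larger than "
            f"temporal_factor:{TEMPORAL_FACTOR}"
        )

# vLLM's dummy-input fallback uses num_frames=1, so profiling fails:
try:
    check_num_frames(1)
except ValueError as e:
    print(e)  # t:1 must be larger than temporal_factor:2

# num_frames=2 would pass this check:
check_num_frames(2)
```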
Key Context
- The error occurs before any user prompts are processed (during engine initialization).
- No explicit video input is provided—this is an internal vLLM dummy input generation issue.
Minimal reproducible example
```python
from vllm import LLM, SamplingParams

# Initialize GLM-4.1V (triggers dummy video input generation)
llm = LLM(
    model="THUDM/glm-4.1v",
    tensor_parallel_size=2,
    dtype="bfloat16",
    enforce_eager=True,
    max_model_len=4096,
    disable_custom_all_reduce=True,  # NPU-specific
)

sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=32,
)

# Text-only prompt (error occurs during llm.generate call)
outputs = llm.generate(
    ["Describe the image."],
    sampling_params,
)
for output in outputs:
    print(output.outputs[0].text)
```

Observed behavior
The engine core crashes during initialization with the following critical error:

```
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {'EngineCore_DP0': 1}
```

Full critical error traceback
```
ERROR 12-14 18:32:20 [core.py:708] Traceback (most recent call last):
  File "/home/silas/vllm/vllm/multimodal/processing.py", line 1057, in call_hf_processor
    output = hf_processor(**data,
  File "/home/silas/transformers/src/transformers/models/glm4v/processing_glm4v.py", line 150, in __call__
    videos_inputs = self.video_processor(videos=videos, **output_kwargs["videos_kwargs"])
  File "/home/silas/transformers/src/transformers/video_processing_utils.py", line 206, in __call__
    return self.preprocess(videos, **kwargs)
  File "/home/silas/transformers/src/transformers/video_processing_utils.py", line 387, in preprocess
    preprocessed_videos = self._preprocess(videos=videos, **kwargs)
  File "/home/silas/transformers/src/transformers/models/glm4v/video_processing_glm4v.py", line 177, in _preprocess
    resized_height, resized_width = smart_resize(
  File "/home/silas/transformers/src/transformers/models/glm4v/image_processing_glm4v.py", line 59, in smart_resize
    raise ValueError(f"t:{num_frames} must be larger than temporal_factor:{temporal_factor}")
ValueError: t:1 must be larger than temporal_factor:2

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/silas/vllm/vllm/v1/engine/core.py", line 699, in run_engine_core
    engine_core = EngineCoreProc(*args, **kwargs)
  File "/home/silas/vllm/vllm/v1/engine/core.py", line 498, in __init__
    super().__init__(vllm_config, executor_class, log_stats,
  File "/home/silas/vllm/vllm/v1/engine/core.py", line 124, in __init__
    self.scheduler: SchedulerInterface = Scheduler(
  File "/home/silas/vllm/vllm/v1/core/sched/scheduler.py", line 142, in __init__
    encoder_compute_budget, encoder_cache_size = compute_encoder_budget(
  File "/home/silas/vllm/vllm/v1/core/encoder_cache_manager.py", line 264, in compute_encoder_budget
    .get_max_tokens_per_item_by_nonzero_modality(model_config)
  File "/home/silas/vllm/vllm/multimodal/registry.py", line 167, in get_max_tokens_per_item_by_nonzero_modality
    max_tokens_per_item = self.get_max_tokens_per_item_by_modality(
  File "/home/silas/vllm/vllm/multimodal/registry.py", line 143, in get_max_tokens_per_item_by_modality
    return profiler.get_mm_max_contiguous_tokens(
  File "/home/silas/vllm/vllm/multimodal/profiling.py", line 282, in get_mm_max_contiguous_tokens
    return self._get_mm_max_tokens(seq_len,
  File "/home/silas/vllm/vllm/multimodal/profiling.py", line 262, in _get_mm_max_tokens
    mm_inputs = self._get_dummy_mm_inputs(seq_len, mm_counts)
  File "/home/silas/vllm/vllm/multimodal/profiling.py", line 173, in _get_dummy_mm_inputs
    return self.processor.apply(
  File "/home/silas/vllm/vllm/multimodal/processing.py", line 1080, in call_hf_processor
    raise ValueError(msg) from exc
ValueError: Failed to apply Glm4vProcessor on data={...}

RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {'EngineCore_DP0': 1}
```
Expected behavior
- vLLM should respect GLM-4.1V's temporal factor constraint (set the fallback to `num_frames=2` instead of `1`).
- Dummy video input generation should not fail for text-only/image-only requests.
- The engine core should initialize successfully without a manual workaround.