
Commit fc34b87

janhilgard and claude committed
fix: replace hard prefill token limit with warning
The ValueError on exceeding prefill_step_size caused infinite retry loops in _process_loop when long prompts were stuck in the queue. Replace it with a warning log so long prompts are processed normally.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 8b5cfe7 commit fc34b87

File tree

1 file changed: +6 −6 lines

vllm_mlx/mllm_batch_generator.py

Lines changed: 6 additions & 6 deletions
@@ -659,14 +659,14 @@ def _process_prompts(self, requests: List[MLLMBatchRequest]) -> MLLMBatch:
         )
         self._stats.prompt_tokens += total_prompt_tokens
 
-        # Guard against excessive memory usage during cache merge.
-        # Each token in the batch requires KV entries across all layers.
+        # Log large prompts for monitoring (was previously a hard check that
+        # caused infinite retry loops when requests exceeded the limit).
         max_batch_tokens = self.prefill_step_size * len(requests)
         if total_prompt_tokens > max_batch_tokens:
-            raise ValueError(
-                f"Total prompt tokens ({total_prompt_tokens}) exceeds safe limit "
-                f"({max_batch_tokens}) for {len(requests)} requests. "
-                f"Reduce prompt length or batch size."
+            logger.warning(
+                f"Large batch prefill: {total_prompt_tokens} tokens "
+                f"(step_size={self.prefill_step_size}, requests={len(requests)}). "
+                f"Processing may be slow."
             )
 
         # Run vision encoding for each request with its own KVCache.
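The failure mode this commit fixes can be illustrated with a minimal sketch. The queue-and-retry structure below is an assumption for illustration, not the actual `_process_loop` implementation: any loop that re-queues a batch on exception will spin forever when the exception is deterministic (the same oversized batch fails the same check every pass), whereas a warning lets the batch complete and leave the queue.

```python
import logging
from collections import deque

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("prefill")

PREFILL_STEP_SIZE = 512  # hypothetical value standing in for prefill_step_size


def process_batch(prompt_tokens: int, num_requests: int) -> str:
    # Post-fix behavior: log a warning instead of raising, so the
    # batch is still processed to completion.
    max_batch_tokens = PREFILL_STEP_SIZE * num_requests
    if prompt_tokens > max_batch_tokens:
        logger.warning(
            "Large batch prefill: %d tokens (step_size=%d, requests=%d). "
            "Processing may be slow.",
            prompt_tokens, PREFILL_STEP_SIZE, num_requests,
        )
    return "done"


def process_loop(queue: deque, max_iters: int = 10) -> int:
    # Simplified stand-in for a processing loop that retries a batch
    # on failure. With the old deterministic ValueError, the same batch
    # would be retried until max_iters; with a warning, the queue drains
    # in a single pass.
    iters = 0
    while queue and iters < max_iters:
        iters += 1
        tokens, n = queue[0]
        try:
            process_batch(tokens, n)
            queue.popleft()   # success: batch leaves the queue
        except ValueError:
            pass              # old behavior: same batch stays at the head
    return iters


q = deque([(10_000, 2)])      # 10,000 tokens far exceeds 512 * 2
iterations = process_loop(q)  # drains in one iteration after the fix
```

With the pre-fix `raise ValueError`, `process_loop` would hit `max_iters` without ever emptying the queue, which matches the stuck long-prompt symptom described in the commit message.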
