Environment
- OS: Windows 11 Pro 24H2
- Hardware: Intel Core Ultra 7 155H (Meteor Lake)
- NPU Driver: Intel NPU Driver 32.0.100.4512
- OpenVINO: 2025.4.1
- openvino-genai: 2025.4.1.0
- Python: 3.11.9
Description
When using ov_genai.LLMPipeline with the NPU device, prompts that exceed approximately 2,000 tokens produce garbled/nonsensical output instead of raising an error or returning a meaningful failure status.
The pipeline accepts the prompt, appears to process it successfully, and returns tokens — but the generated text is incoherent (e.g., "pester. A TA, PT A PTma.").
This behavior occurs even when MAX_PROMPT_LEN is configured to a higher value (e.g., 4096).
Expected Behavior
When a prompt exceeds the NPU's practical processing capacity, the pipeline should do at least one of the following:
- Raise an exception with a clear error message indicating the limit was exceeded
- Return an error status in the generation result
- Log a warning before generation begins if the prompt is near/over limits
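For illustration, any of the above would let a caller handle the condition explicitly instead of silently receiving garbage. A hypothetical sketch of the desired behavior (`cpu_pipe` stands for an assumed fallback pipeline compiled for CPU; this is not what openvino-genai does today):

```python
# Hypothetical sketch of the desired behavior - not current openvino-genai behavior
try:
    result = pipe.generate(long_prompt, max_new_tokens=30)
except RuntimeError:
    # Desired: a clear error when the prompt exceeds the NPU's practical limit,
    # so the application can fall back instead of receiving garbled text
    result = cpu_pipe.generate(long_prompt, max_new_tokens=30)
```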
Actual Behavior
- Pipeline accepts the prompt without warning
- Generation proceeds and returns tokens
- Output is garbled/nonsensical
- No error or warning is logged
- Application has no way to detect the failure programmatically
Reproduction Steps
1. Export a model for NPU:

```sh
optimum-cli export openvino -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 --weight-format int4 ~/.cache/models/tinyllama-1.1b-chat-int4-ov
```

2. Create a test script:

```python
import os

import openvino_genai as ov_genai

model_path = os.path.expanduser("~/.cache/models/tinyllama-1.1b-chat-int4-ov")

# Load pipeline with extended context config (using kwargs, not the deprecated dict)
pipe = ov_genai.LLMPipeline(model_path, "NPU", MAX_PROMPT_LEN=4096, MIN_RESPONSE_LEN=150)

# Build a long prompt (~2,300 tokens) - this produces garbled output
base = "You are an expert AI assistant. "
long_prompt = base * 290 + "\n\nQuestion: What is 7+3? Answer with just the number."
# This is ~9,300 chars, ~2,325 tokens

result = pipe.generate(long_prompt, max_new_tokens=30)
print(result)
```

3. Observe the output.
At ~2,100 tokens, math errors appear:
7+3 is 14
At ~2,200 tokens, repetition begins:
Based on. Based on. Based.
At ~2,300+ tokens, output is garbled:
AAA. A, A, A, the A. A, AAA. AT the A. A. A. A. A
Or:
pester. A TA, PT A PTma.
Prompt Length Testing
Systematic testing with TinyLlama 1.1B INT4, using MAX_PROMPT_LEN=4096:
| Prompt Length (chars) | Estimated Tokens | Output | Status |
|---|---|---|---|
| 2,037 | ~509 | "7 + 3 = 10" | ✅ Correct |
| 4,053 | ~1,013 | "7 + 3 = 10" | ✅ Correct |
| 6,037 | ~1,509 | "7 + 3 = 10" | ✅ Correct |
| 8,053 | ~2,013 | "The answer is 10." | ✅ Correct |
| 8,245 | ~2,061 | "The number 7+3 is 10." | ✅ Correct |
| 8,437 | ~2,109 | "7+3 is 14" | ❌ Wrong answer |
| 8,629 | ~2,157 | "7+3 is 15" | ❌ Wrong answer |
| 8,853 | ~2,213 | "Based on. Based on. Based." | ❌ Repetition |
| 9,045 | ~2,261 | (empty string) | ❌ Empty |
| 9,525 | ~2,381 | "AAA. A, A, A, the A..." | ❌ Garbled |
| 10,037 | ~2,509 | "pester. A TA, PT A PTma." | ❌ Garbled |
Key observation: Degradation begins at ~2,100 tokens (wrong answers), becomes severe at ~2,200 tokens (repetition/empty), and completely fails at ~2,300+ tokens (garbled nonsense).
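For reference, a sweep of this shape can be reproduced with a small loop over repeat counts. This is a sketch that reuses `pipe` and `base` from the reproduction script above; the character counts will not match the table exactly, and token counts are still estimated as chars/4:

```python
question = "\n\nQuestion: What is 7+3? Answer with just the number."

# Roughly 2,000 -> 10,000 characters in ~200-char steps (base is 32 chars long)
for repeats in range(62, 315, 6):
    prompt = base * repeats + question
    result = pipe.generate(prompt, max_new_tokens=30)
    print(f"{len(prompt):>6} chars (~{len(prompt) // 4} tokens): {result!r}")
```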
Pipeline Architecture
```python
# Standard stateful LLM pipeline for NPU
pipe = ov_genai.LLMPipeline(model_path, "NPU", MAX_PROMPT_LEN=4096, MIN_RESPONSE_LEN=150)

# Using the default generation config
result = pipe.generate(prompt, max_new_tokens=30)
```

Analysis
The documentation states:
"By default, the LLM pipeline supports input prompts up to 1024 tokens in length."
It also notes that `MAX_PROMPT_LEN` can be increased. However, increasing this value appears only to affect the compile-time shape allocation, not the actual NPU memory constraints.
The garbled output suggests the KV-cache or attention computation is silently failing/overflowing rather than being bounds-checked at runtime.
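A quick way to check whether the failure is specific to the NPU path (rather than the model or tokenization) is to run the identical prompt through a CPU-compiled pipeline. A minimal A/B sketch reusing `model_path` and `long_prompt` from the reproduction script; this comparison was not part of the numbers above:

```python
# Same model, same prompt: NPU vs CPU
npu_pipe = ov_genai.LLMPipeline(model_path, "NPU", MAX_PROMPT_LEN=4096, MIN_RESPONSE_LEN=150)
cpu_pipe = ov_genai.LLMPipeline(model_path, "CPU")

for name, p in (("NPU", npu_pipe), ("CPU", cpu_pipe)):
    print(name, repr(p.generate(long_prompt, max_new_tokens=30)))
```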
Suggested Fix
- Runtime validation: Before generation, estimate the token count and compare it against known hardware limits
- Explicit error: Raise a `RuntimeError` if the prompt exceeds practical NPU capacity
- Generation result status: Add a `status` or `error` field to generation results
- Documentation update: Clarify that `MAX_PROMPT_LEN` has hardware-dependent upper bounds
Related
- Documentation: https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide/genai-guide-npu.html
- Error location (when the hard limit is exceeded): `src/cpp/src/llm/pipeline_stateful.cpp:309`
Workaround
Currently, applications must implement their own token counting and reject/route prompts before calling the pipeline:
```python
def safe_generate(pipe, prompt, max_tokens=100):
    estimated_tokens = len(prompt) // 4  # Rough estimate (~4 chars per token)
    if estimated_tokens > 1800:
        raise ValueError(f"Prompt too long for NPU: ~{estimated_tokens} tokens")
    return pipe.generate(prompt, max_new_tokens=max_tokens)
```
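For an exact count instead of the chars/4 estimate, the pipeline's own tokenizer can be queried first. A sketch assuming `LLMPipeline.get_tokenizer()` and `Tokenizer.encode()` behave as documented; the 1,800-token threshold is an empirical guess based on the table above:

```python
def safe_generate_exact(pipe, prompt, max_tokens=100, token_limit=1800):
    # Count tokens with the model's own tokenizer instead of estimating from characters
    tokenized = pipe.get_tokenizer().encode(prompt)
    n_tokens = tokenized.input_ids.shape[1]  # input_ids tensor is [batch, seq_len]
    if n_tokens > token_limit:
        raise ValueError(f"Prompt too long for NPU: {n_tokens} tokens (limit {token_limit})")
    return pipe.generate(prompt, max_new_tokens=max_tokens)
```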