NPU LLM Pipeline produces garbled output instead of error when prompt exceeds practical context limits #3255

@MrFixit96

Environment

  • OS: Windows 11 Pro 24H2
  • Hardware: Intel Core Ultra 7 155H (Meteor Lake)
  • NPU Driver: Intel NPU Driver 32.0.100.4512
  • OpenVINO: 2025.4.1
  • openvino-genai: 2025.4.1.0
  • Python: 3.11.9

Description

When using ov_genai.LLMPipeline with the NPU device, prompts that exceed approximately 2,000 tokens produce garbled/nonsensical output instead of raising an error or returning a meaningful failure status.

The pipeline accepts the prompt, appears to process it successfully, and returns tokens — but the generated text is incoherent (e.g., "pester. A TA, PT A PTma.").

This behavior occurs even when MAX_PROMPT_LEN is configured to a higher value (e.g., 4096).

Expected Behavior

When a prompt exceeds the NPU's practical processing capacity, the pipeline should either:

  1. Raise an exception with a clear error message indicating the limit was exceeded (see the caller-side sketch after this list)
  2. Return an error status in the generation result
  3. Log a warning before generation begins if the prompt is near/over limits
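
As a caller-side illustration of options 1 and 2, the snippet below shows how an application could catch such a failure explicitly. The RuntimeError here is the requested behavior, not something the pipeline raises today:

import os
import openvino_genai as ov_genai

model_path = os.path.expanduser("~/.cache/models/tinyllama-1.1b-chat-int4-ov")
long_prompt = "You are an expert AI assistant. " * 290 + "\n\nQuestion: What is 7+3?"

pipe = ov_genai.LLMPipeline(model_path, "NPU", MAX_PROMPT_LEN=4096, MIN_RESPONSE_LEN=150)

try:
    result = pipe.generate(long_prompt, max_new_tokens=30)
except RuntimeError as err:
    # Requested behavior: an oversized prompt surfaces here with a clear message.
    # Today this branch is never reached; garbled text is returned instead.
    print(f"Prompt rejected by NPU pipeline: {err}")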

Actual Behavior

  • Pipeline accepts the prompt without warning
  • Generation proceeds and returns tokens
  • Output is garbled/nonsensical
  • No error or warning is logged
  • Application has no way to detect the failure programmatically

Reproduction Steps

1. Export a model for NPU

optimum-cli export openvino -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 --weight-format int4 ~/.cache/models/tinyllama-1.1b-chat-int4-ov

2. Create test script

import openvino_genai as ov_genai
import os

model_path = os.path.expanduser("~/.cache/models/tinyllama-1.1b-chat-int4-ov")

# Load pipeline with extended context config (using kwargs, not deprecated dict)
pipe = ov_genai.LLMPipeline(model_path, "NPU", MAX_PROMPT_LEN=4096, MIN_RESPONSE_LEN=150)

# Build a long prompt (~2,300 tokens) - this length produces garbled output
base = "You are an expert AI assistant. "
long_prompt = base * 290 + "\n\nQuestion: What is 7+3? Answer with just the number."
# This is ~9300 chars, ~2325 tokens

result = pipe.generate(long_prompt, max_new_tokens=30)
print(result)

3. Observe output

At ~2,100 tokens, math errors appear:

7+3 is 14

At ~2,200 tokens, repetition begins:

Based on. Based on. Based.

At ~2,300+ tokens, output is garbled:

AAA. A, A, A, the A. A, AAA. AT the A. A. A. A. A

Or:

pester. A TA, PT A PTma.

Prompt Length Testing

Systematic testing with TinyLlama 1.1B INT4, using MAX_PROMPT_LEN=4096:

| Prompt Length (chars) | Estimated Tokens | Output | Status |
|---|---|---|---|
| 2,037 | ~509 | "7 + 3 = 10" | ✅ Correct |
| 4,053 | ~1,013 | "7 + 3 = 10" | ✅ Correct |
| 6,037 | ~1,509 | "7 + 3 = 10" | ✅ Correct |
| 8,053 | ~2,013 | "The answer is 10." | ✅ Correct |
| 8,245 | ~2,061 | "The number 7+3 is 10." | ✅ Correct |
| 8,437 | ~2,109 | "7+3 is 14" | ⚠️ Wrong math |
| 8,629 | ~2,157 | "7+3 is 15" | ⚠️ Wrong math |
| 8,853 | ~2,213 | "Based on. Based on. Based." | ❌ Repetition |
| 9,045 | ~2,261 | (empty string) | ❌ Empty |
| 9,525 | ~2,381 | "AAA. A, A, A, the A..." | ❌ Garbled |
| 10,037 | ~2,509 | "pester. A TA, PT A PTma." | ❌ Garbled |

Key observation: Degradation begins at ~2,100 tokens (wrong answers), becomes severe at ~2,200 tokens (repetition/empty), and completely fails at ~2,300+ tokens (garbled nonsense).
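
For reference, a sweep along these lines can be scripted as shown below. The prompt construction and the 4-chars-per-token estimate are assumptions for illustration, not the exact harness used to produce the table above:

import os
import openvino_genai as ov_genai

model_path = os.path.expanduser("~/.cache/models/tinyllama-1.1b-chat-int4-ov")
pipe = ov_genai.LLMPipeline(model_path, "NPU", MAX_PROMPT_LEN=4096, MIN_RESPONSE_LEN=150)

base = "You are an expert AI assistant. "  # 32 chars of filler per repetition
question = "\n\nQuestion: What is 7+3? Answer with just the number."

# Sweep the filler count so the prompt crosses the ~2,000-token region
for repeats in range(60, 320, 20):
    prompt = base * repeats + question
    est_tokens = len(prompt) // 4  # rough 4-chars-per-token estimate
    result = pipe.generate(prompt, max_new_tokens=30)
    print(f"{len(prompt):>6} chars  ~{est_tokens:>5} tokens  ->  {str(result)!r}")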

Pipeline Architecture

# Standard stateful LLM pipeline for NPU
pipe = ov_genai.LLMPipeline(model_path, "NPU", MAX_PROMPT_LEN=4096, MIN_RESPONSE_LEN=150)

# Using default generation config
result = pipe.generate(prompt, max_new_tokens=30)

Analysis

The documentation states:

"By default, the LLM pipeline supports input prompts up to 1024 tokens in length."

It also notes that MAX_PROMPT_LEN can be increased. However, increasing this value appears to affect only the compile-time shape allocation, not the NPU's actual runtime memory constraints.

The garbled output suggests the KV-cache or attention computation is silently failing/overflowing rather than being bounds-checked at runtime.

Suggested Fix

  1. Runtime validation: Before generation, estimate the token count and compare it against known hardware limits
  2. Explicit error: Raise a RuntimeError if the prompt exceeds practical NPU capacity
  3. Generation result status: Add a status or error field to generation results (see the hypothetical sketch after this list)
  4. Documentation update: Clarify that MAX_PROMPT_LEN has hardware-dependent upper bounds
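
To make the third item concrete, the snippet below shows what a caller-side check could look like. The status field does not exist in the current API; getattr() is used so the sketch still runs today, where the check is simply a no-op until such a field is added:

import os
import openvino_genai as ov_genai

model_path = os.path.expanduser("~/.cache/models/tinyllama-1.1b-chat-int4-ov")
pipe = ov_genai.LLMPipeline(model_path, "NPU", MAX_PROMPT_LEN=4096, MIN_RESPONSE_LEN=150)

result = pipe.generate("What is 7+3?", max_new_tokens=30)

# Hypothetical surface for a result status: neither `result.status` nor a
# PROMPT_TOO_LONG value exists today, so this check currently does nothing.
status = getattr(result, "status", None)
if status is not None and status != "OK":
    raise RuntimeError(f"NPU generation failed with status {status}")
print(result)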

Workaround

Currently, applications must implement their own token counting and reject/route prompts before calling the pipeline:

def safe_generate(pipe, prompt, max_tokens=100):
    estimated_tokens = len(prompt) // 4  # rough 4-chars-per-token estimate
    if estimated_tokens > 1800:  # margin below the ~2,100-token degradation point
        raise ValueError(f"Prompt too long for NPU: ~{estimated_tokens} tokens")
    return pipe.generate(prompt, max_new_tokens=max_tokens)
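
If the chars/4 heuristic is too coarse, the pipeline's own tokenizer can provide an exact count. This is a sketch assuming get_tokenizer() and Tokenizer.encode() behave as in current openvino-genai releases; the 2,000-token threshold is taken from the measurements above, not from any documented limit:

def safe_generate_exact(pipe, prompt, max_tokens=100, token_limit=2000):
    # Count real tokens with the pipeline's own tokenizer instead of estimating
    tokenizer = pipe.get_tokenizer()
    num_tokens = tokenizer.encode(prompt).input_ids.shape[1]
    if num_tokens > token_limit:
        raise ValueError(f"Prompt too long for NPU: {num_tokens} tokens")
    return pipe.generate(prompt, max_new_tokens=max_tokens)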
