NPU LLM Pipeline produces garbled output instead of error when prompt exceeds practical context limits #3255

@MrFixit96

Environment

  • OS: Windows 11 Pro 24H2
  • Hardware: Intel Core Ultra 7 155H (Meteor Lake)
  • NPU Driver: Intel NPU Driver 32.0.100.4512
  • OpenVINO: 2025.4.1
  • openvino-genai: 2025.4.1.0
  • Python: 3.11.9

Description

When using ov_genai.LLMPipeline with the NPU device, prompts that exceed approximately 2,000 tokens produce garbled/nonsensical output instead of raising an error or returning a meaningful failure status.

The pipeline accepts the prompt, appears to process it successfully, and returns tokens — but the generated text is incoherent (e.g., "pester. A TA, PT A PTma.").

This behavior occurs even when MAX_PROMPT_LEN is configured to a higher value (e.g., 4096).

Expected Behavior

When a prompt exceeds the NPU's practical processing capacity, the pipeline should either:

  1. Raise an exception with a clear error message indicating the limit was exceeded (see the caller-side sketch after this list)
  2. Return an error status in the generation result
  3. Log a warning before generation begins if the prompt is near/over limits
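
As a caller-side illustration of options 1 and 2, the snippet below shows how an application could catch such a failure explicitly. The RuntimeError here is the requested behavior, not something the pipeline raises today:

import os
import openvino_genai as ov_genai

model_path = os.path.expanduser("~/.cache/models/tinyllama-1.1b-chat-int4-ov")
long_prompt = "You are an expert AI assistant. " * 290 + "\n\nQuestion: What is 7+3?"

pipe = ov_genai.LLMPipeline(model_path, "NPU", MAX_PROMPT_LEN=4096, MIN_RESPONSE_LEN=150)

try:
    result = pipe.generate(long_prompt, max_new_tokens=30)
except RuntimeError as err:
    # Requested behavior: an oversized prompt surfaces here with a clear message.
    # Today this branch is never reached; garbled text is returned instead.
    print(f"Prompt rejected by NPU pipeline: {err}")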

Actual Behavior

  • Pipeline accepts the prompt without warning
  • Generation proceeds and returns tokens
  • Output is garbled/nonsensical
  • No error or warning is logged
  • Application has no way to detect the failure programmatically

Reproduction Steps

1. Export a model for NPU

optimum-cli export openvino -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 --weight-format int4 ~/.cache/models/tinyllama-1.1b-chat-int4-ov

2. Create test script

import openvino_genai as ov_genai
import os

model_path = os.path.expanduser("~/.cache/models/tinyllama-1.1b-chat-int4-ov")

# Load pipeline with extended context config (using kwargs, not deprecated dict)
pipe = ov_genai.LLMPipeline(model_path, "NPU", MAX_PROMPT_LEN=4096, MIN_RESPONSE_LEN=150)

# Build a long prompt (~2,300 tokens) - this length produces garbled output
base = "You are an expert AI assistant. "
long_prompt = base * 290 + "\n\nQuestion: What is 7+3? Answer with just the number."
# This is ~9300 chars, ~2325 tokens

result = pipe.generate(long_prompt, max_new_tokens=30)
print(result)

3. Observe output

At ~2,100 tokens, math errors appear:

7+3 is 14

At ~2,200 tokens, repetition begins:

Based on. Based on. Based.

At ~2,300+ tokens, output is garbled:

AAA. A, A, A, the A. A, AAA. AT the A. A. A. A. A

Or:

pester. A TA, PT A PTma.

Prompt Length Testing

Systematic testing with TinyLlama 1.1B INT4, using MAX_PROMPT_LEN=4096:

| Prompt Length (chars) | Estimated Tokens | Output | Status |
|---|---|---|---|
| 2,037 | ~509 | "7 + 3 = 10" | ✅ Correct |
| 4,053 | ~1,013 | "7 + 3 = 10" | ✅ Correct |
| 6,037 | ~1,509 | "7 + 3 = 10" | ✅ Correct |
| 8,053 | ~2,013 | "The answer is 10." | ✅ Correct |
| 8,245 | ~2,061 | "The number 7+3 is 10." | ✅ Correct |
| 8,437 | ~2,109 | "7+3 is 14" | ⚠️ Wrong math |
| 8,629 | ~2,157 | "7+3 is 15" | ⚠️ Wrong math |
| 8,853 | ~2,213 | "Based on. Based on. Based." | ❌ Repetition |
| 9,045 | ~2,261 | (empty string) | ❌ Empty |
| 9,525 | ~2,381 | "AAA. A, A, A, the A..." | ❌ Garbled |
| 10,037 | ~2,509 | "pester. A TA, PT A PTma." | ❌ Garbled |

Key observation: Degradation begins at ~2,100 tokens (wrong answers), becomes severe at ~2,200 tokens (repetition/empty), and completely fails at ~2,300+ tokens (garbled nonsense).
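
For reference, a sweep along these lines can be scripted as shown below. The prompt construction and the 4-chars-per-token estimate are assumptions for illustration, not the exact harness used to produce the table above:

import os
import openvino_genai as ov_genai

model_path = os.path.expanduser("~/.cache/models/tinyllama-1.1b-chat-int4-ov")
pipe = ov_genai.LLMPipeline(model_path, "NPU", MAX_PROMPT_LEN=4096, MIN_RESPONSE_LEN=150)

base = "You are an expert AI assistant. "  # 32 chars of filler per repetition
question = "\n\nQuestion: What is 7+3? Answer with just the number."

# Sweep the filler count so the prompt crosses the ~2,000-token region
for repeats in range(60, 320, 20):
    prompt = base * repeats + question
    est_tokens = len(prompt) // 4  # rough 4-chars-per-token estimate
    result = pipe.generate(prompt, max_new_tokens=30)
    print(f"{len(prompt):>6} chars  ~{est_tokens:>5} tokens  ->  {str(result)!r}")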

Pipeline Architecture

# Standard stateful LLM pipeline for NPU
pipe = ov_genai.LLMPipeline(model_path, "NPU", MAX_PROMPT_LEN=4096, MIN_RESPONSE_LEN=150)

# Using default generation config
result = pipe.generate(prompt, max_new_tokens=30)

Analysis

The documentation states:

"By default, the LLM pipeline supports input prompts up to 1024 tokens in length."

It also notes that MAX_PROMPT_LEN can be increased. However, increasing this value appears to affect only the compile-time shape allocation, not the NPU's actual runtime memory constraints.

The garbled output suggests the KV-cache or attention computation is silently failing/overflowing rather than being bounds-checked at runtime.

Suggested Fix

  1. Runtime validation: Before generation, estimate the token count and compare it against known hardware limits
  2. Explicit error: Raise a RuntimeError if the prompt exceeds practical NPU capacity
  3. Generation result status: Add a status or error field to generation results (see the hypothetical sketch after this list)
  4. Documentation update: Clarify that MAX_PROMPT_LEN has hardware-dependent upper bounds
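
To make the third item concrete, the snippet below shows what a caller-side check could look like. The status field does not exist in the current API; getattr() is used so the sketch still runs today, where the check is simply a no-op until such a field is added:

import os
import openvino_genai as ov_genai

model_path = os.path.expanduser("~/.cache/models/tinyllama-1.1b-chat-int4-ov")
pipe = ov_genai.LLMPipeline(model_path, "NPU", MAX_PROMPT_LEN=4096, MIN_RESPONSE_LEN=150)

result = pipe.generate("What is 7+3?", max_new_tokens=30)

# Hypothetical surface for a result status: neither `result.status` nor a
# PROMPT_TOO_LONG value exists today, so this check currently does nothing.
status = getattr(result, "status", None)
if status is not None and status != "OK":
    raise RuntimeError(f"NPU generation failed with status {status}")
print(result)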

Workaround

Currently, applications must implement their own token counting and reject/route prompts before calling the pipeline:

def safe_generate(pipe, prompt, max_tokens=100):
    estimated_tokens = len(prompt) // 4  # rough 4-chars-per-token estimate
    if estimated_tokens > 1800:  # margin below the ~2,100-token degradation point
        raise ValueError(f"Prompt too long for NPU: ~{estimated_tokens} tokens")
    return pipe.generate(prompt, max_new_tokens=max_tokens)
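
If the chars/4 heuristic is too coarse, the pipeline's own tokenizer can provide an exact count. This is a sketch assuming get_tokenizer() and Tokenizer.encode() behave as in current openvino-genai releases; the 2,000-token threshold is taken from the measurements above, not from any documented limit:

def safe_generate_exact(pipe, prompt, max_tokens=100, token_limit=2000):
    # Count real tokens with the pipeline's own tokenizer instead of estimating
    tokenizer = pipe.get_tokenizer()
    num_tokens = tokenizer.encode(prompt).input_ids.shape[1]
    if num_tokens > token_limit:
        raise ValueError(f"Prompt too long for NPU: {num_tokens} tokens")
    return pipe.generate(prompt, max_new_tokens=max_tokens)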
