# Vulkan: Performance degradation when context increases a small amount (#24005)

### Name and Version

Windows 11
llama.cpp Vulkan
build: 55ac0909e (9458) & build: 210a6570c (9466)
built with Clang 19.1.5 for Windows x86_64


### Operating systems

Windows

### Which llama.cpp modules do you know to be affected?

llama-server

### Command line

```shell
set GGML_VK_FORCE_MAX_ALLOCATION=1
set OMP_NUM_THREADS=8
set GGML_VK_DISABLE_COOPMAT=1

C:\llama.cpp-prod\llama-bench.exe --model C:\llama.cpp-prod\models\Qwen3.6-27B-Q4_K_M-MTP.gguf --mmap 0 --repetitions 1 --n-prompt 512,1024,2048,4096,8192,16384,32768 --n-gen 512,1024,2048,4096
C:\llama.cpp-prod\llama-bench.exe --model C:\llama.cpp-prod\models\gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf --mmap 0 --repetitions 1 --n-prompt 512,1024,2048,4096,8192,16384,32768 --n-gen 512,1024,2048,4096
C:\llama.cpp-prod\llama-bench.exe --model C:\llama.cpp-prod\models\Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-Q4_K_M.gguf --mmap 0 --repetitions 1 --n-prompt 512,1024,2048,4096,8192,16384,32768 --n-gen 512,1024,2048,4096
C:\llama.cpp-prod\llama-bench.exe --model C:\llama.cpp-prod\models\Qwen3.6-27B-UD-Q4_K_XL.gguf --mmap 0 --repetitions 1 --n-prompt 512,1024,2048,4096,8192,16384,32768 --n-gen 512,1024,2048,4096
C:\llama.cpp-prod\llama-bench.exe --model C:\llama.cpp-prod\models\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf --mmap 0 --repetitions 1 --n-prompt 512,1024,2048,4096,8192,16384,32768 --n-gen 512,1024,2048,4096
C:\llama.cpp-prod\llama-bench.exe --model C:\llama.cpp-prod\models\Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf --mmap 0 --repetitions 1 --n-prompt 512,1024,2048,4096,8192,16384,32768 --n-gen 512,1024,2048,4096

C:\llama.cpp-prod\llama-bench.exe --model C:\llama.cpp-prod\models\Qwen3.6-27B-Q4_K_M-MTP.gguf --mmap 0 --repetitions 1 --n-prompt 0 --n-gen 128 --n-depth 0,4096,8192,16384,32768
C:\llama.cpp-prod\llama-bench.exe --model C:\llama.cpp-prod\models\Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-Q4_K_M.gguf --mmap 0 --repetitions 1 --n-prompt 0 --n-gen 128 --n-depth 0,4096,8192,16384,32768
C:\llama.cpp-prod\llama-bench.exe --model C:\llama.cpp-prod\models\Qwen3.6-27B-UD-Q4_K_XL.gguf --mmap 0 --repetitions 1 --n-prompt 0 --n-gen 128 --n-depth 0,4096,8192,16384,32768
C:\llama.cpp-prod\llama-bench.exe --model C:\llama.cpp-prod\models\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf --mmap 0 --repetitions 1 --n-prompt 0 --n-gen 128 --n-depth 0,4096,8192,16384,32768
C:\llama.cpp-prod\llama-bench.exe --model C:\llama.cpp-prod\models\Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf --mmap 0 --repetitions 1 --n-prompt 0 --n-gen 128 --n-depth 0,4096,8192,16384,32768
C:\llama.cpp-prod\llama-bench.exe --model C:\llama.cpp-prod\models\Qwen3.6-27B-Q4_K_M-MTP.gguf --mmap 0 --repetitions 1 --n-prompt 0 --n-gen 128 --n-depth 0,4096,8192,16384,32768

C:\llama.cpp-prod\llama-bench.exe --model C:\llama.cpp-prod\models\Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf --mmap 0 --cache-type-k q4_1 --cache-type-v q4_1 --repetitions 1 --n-prompt 0 --n-gen 128 --n-depth 0,4096,8192,16384,32768
C:\llama.cpp-prod\llama-bench.exe --model C:\llama.cpp-prod\models\Qwen3.6-27B-Q4_K_M-MTP.gguf --mmap 0 --cache-type-k q8_0 --cache-type-v q8_0 --repetitions 1 --n-prompt 0 --n-gen 128 --n-depth 0,4096,8192,16384,32768
```
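
The command list above repeats the same flags for every model, so it can be generated rather than typed. The following is a minimal sketch, assuming Python is available on the test machine; the helper names (`bench_command`, `depth_sweep_commands`) are hypothetical, and only paths and flags that already appear in this report are used.

```python
from pathlib import PureWindowsPath

# Paths and model file names are copied from the commands in this report.
BENCH = PureWindowsPath(r"C:\llama.cpp-prod\llama-bench.exe")
MODEL_DIR = PureWindowsPath(r"C:\llama.cpp-prod\models")
MODELS = [
    "Qwen3.6-27B-Q4_K_M-MTP.gguf",
    "Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf",
]

# Flags for the depth sweep, exactly as used in the report.
DEPTH_SWEEP = ["--n-prompt", "0", "--n-gen", "128",
               "--n-depth", "0,4096,8192,16384,32768"]


def bench_command(model, extra):
    """Build one llama-bench command line (hypothetical helper)."""
    common = ["--model", str(MODEL_DIR / model),
              "--mmap", "0", "--repetitions", "1"]
    return " ".join([str(BENCH), *common, *extra])


def depth_sweep_commands(models):
    """Return one depth-sweep command line per model file name."""
    return [bench_command(m, DEPTH_SWEEP) for m in models]


if __name__ == "__main__":
    for line in depth_sweep_commands(MODELS):
        print(line)
```

The printed lines can be pasted into the same `cmd` session after the `set` statements.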

### Problem description & steps to reproduce

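As the context grows, throughput (tokens/s) for both prompt processing (PP) and token generation (TG) falls, and each doubling of the context costs a larger percentage than the doubling before it.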

Prompt Processing (pp) Performance:
| Context Size | Tokens/s | Change from Previous | % Change |
|--------------|----------|----------------------|----------|
| 512          | 392.48   | -                    | -        |
| 1,024        | 391.77   | -0.71                | -0.18%   |
| 2,048        | 385.26   | -6.51                | -1.66%   |
| 4,096        | 364.78   | -20.48               | -5.32%   |
| 8,192        | 330.55   | -34.23               | -9.38%   |
| 16,384       | 278.57   | -51.98               | -15.73%  |
| 32,768       | 211.34   | -67.23               | -24.13%  |

Token Generation (tg) Performance:
| Context Size | Tokens/s | Change from Previous | % Change |
|--------------|----------|----------------------|----------|
| 128          | 26.76    | -                    | -        |
| 4,096        | 22.72    | -4.04                | -15.10%  |
| 8,192        | 19.57    | -3.15                | -13.86%  |
| 16,384       | 15.25    | -4.32                | -22.07%  |
| 32,768       | 10.57    | -4.68                | -30.69%  |
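
The "Change from Previous" and "% Change" columns can be recomputed from the raw tokens/s figures. A minimal sketch, assuming each percentage is taken relative to the previous row; the function name `step_changes` is hypothetical and the inputs are the values from the two tables above.

```python
def step_changes(rows):
    """rows: list of (context_size, tokens_per_s) in increasing context order.

    Returns (context, tps, delta, pct) tuples, where delta and pct are
    measured against the previous row and are None for the first row.
    """
    out = []
    prev = None
    for ctx, tps in rows:
        if prev is None:
            out.append((ctx, tps, None, None))
        else:
            delta = tps - prev
            out.append((ctx, tps, round(delta, 2),
                        round(100.0 * delta / prev, 2)))
        prev = tps
    return out


# Raw measurements from the report.
PP = [(512, 392.48), (1024, 391.77), (2048, 385.26), (4096, 364.78),
      (8192, 330.55), (16384, 278.57), (32768, 211.34)]
TG = [(128, 26.76), (4096, 22.72), (8192, 19.57),
      (16384, 15.25), (32768, 10.57)]

if __name__ == "__main__":
    for name, rows in (("pp", PP), ("tg", TG)):
        for ctx, tps, delta, pct in step_changes(rows):
            print(name, ctx, tps, delta, pct)
```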

Direct Comparison at Matching Context Points:
| Context Size | PP (tokens/s) | TG (tokens/s) | PP/TG Ratio |
|--------------|---------------|---------------|-------------|
| 4,096        | 364.78        | 22.72         | 16.06x      |
| 8,192        | 330.55        | 19.57         | 16.89x      |
| 16,384       | 278.57        | 15.25         | 18.27x      |
| 32,768       | 211.34        | 10.57         | 19.99x      |
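
The PP/TG ratio column is simply the prompt-processing rate divided by the generation rate at the same context size. A small sketch of that calculation, with a hypothetical helper `pp_tg_ratio` and the figures from the table above:

```python
# Tokens/s at the context sizes where both measurements exist.
PP = {4096: 364.78, 8192: 330.55, 16384: 278.57, 32768: 211.34}
TG = {4096: 22.72, 8192: 19.57, 16384: 15.25, 32768: 10.57}


def pp_tg_ratio(pp, tg):
    """Return {context: pp / tg} for context sizes present in both series."""
    return {ctx: pp[ctx] / tg[ctx] for ctx in sorted(pp.keys() & tg.keys())}


if __name__ == "__main__":
    for ctx, ratio in pp_tg_ratio(PP, TG).items():
        print(f"{ctx:>6}  {ratio:.2f}x")
```

A rising ratio means TG throughput is falling faster than PP throughput over the same range.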

Rate of Decline Analysis:
Prompt Processing Decline (4096→32768):
364.78 → 211.34 = -153.44 tokens/s (-42.06%)

Token Generation Decline (4096→32768):
22.72 → 10.57 = -12.15 tokens/s (-53.48%)
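
The endpoint comparison can be reproduced with a few lines, and the same figures can also be read as average time per token, which some readers find easier to compare across context sizes. This is a sketch with hypothetical helper names; the time-per-token view is an alternative reading of the reported numbers, not an additional measurement.

```python
def decline(start_tps, end_tps):
    """Absolute and percentage change in tokens/s between two measurements."""
    delta = end_tps - start_tps
    return delta, 100.0 * delta / start_tps


def ms_per_token(tps):
    """Convert throughput in tokens/s to average milliseconds per token."""
    return 1000.0 / tps


if __name__ == "__main__":
    # Values at context 4096 and 32768, taken from the tables in this report.
    for name, start, end in (("pp", 364.78, 211.34), ("tg", 22.72, 10.57)):
        delta, pct = decline(start, end)
        print(f"{name}: {delta:.2f} tokens/s ({pct:.2f}%), "
              f"{ms_per_token(start):.2f} ms/token -> "
              f"{ms_per_token(end):.2f} ms/token")
```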


Models tested:
gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf
Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-Q4_K_M.gguf
Qwen3.6-27B-UD-Q4_K_XL.gguf
Qwen3.6-35B-A3B-UD-Q4_K_M.gguf
Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
Qwen3.6-27B-Q4_K_M-MTP.gguf

[test3.txt](https://github.com/user-attachments/files/28489577/test3.txt)
[Verbose.txt](https://github.com/user-attachments/files/28489578/Verbose.txt)

[VP_VULKANINFO_Intel(R)_Arc(TM)_Pro_B70_Graphics_101_8517.json](https://github.com/user-attachments/files/28489590/VP_VULKANINFO_Intel.R._Arc.TM._Pro_B70_Graphics_101_8517.json)

### First Bad Commit

_No response_

### Relevant log output

<details>
<summary>Logs</summary>


```console

```
</details>
