Name and Version
version: 9459 (95b8b8e)
built with GNU 14.2.0 for Linux x86_64
CUDA 13.0
Operating systems
Linux
GGML backends
CUDA
Hardware
2x L4-24Q vGPUs
Models
Latest gemma4-26b-a4b quants from bartowski and unsloth, Q4_K_M or Q5_K_L (all behave the same)
Problem description & steps to reproduce
With --swa-full and no cache quantization, the VRAM is reserved in full at startup. You know right away whether the setup is stable, and the output quality is best, so I prefer using --swa-full.
Q5_K_L with 80K context uses about 41-42 GB of VRAM with the default f16 KV cache.
With -ctk and -ctv set to q8_0 or even q4_0, VRAM usage directly after llama-server starts is much lower, as expected. But once the model is given long contexts (with the same maximum context size), VRAM usage grows until it is as large as with the default f16 KV cache, which makes KV cache quantization useless.
This happens with both -sm layer and -sm tensor. I ran into this bug weeks ago. After I saw the commit about tensor KV cache quantization compatibility, I checked again to see whether it had been ironed out in the meantime, but the behavior is still the same. So I don't think this is a new bug caused by that commit.
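To make the growth easier to track over time, the VRAM readings could be logged while the context fills up. The following is a minimal sketch, not part of llama.cpp: the function names `parse_gpu_memory`, `read_gpu_memory` and `watch` are made up for illustration, and it assumes `nvidia-smi` is on the PATH. Note that `memory.used` is reported per GPU, not per process, so other processes on the same GPU would be included.

```python
import subprocess
import time


def parse_gpu_memory(csv_text):
    """Parse 'index, memory.used' CSV rows (MiB) into {gpu_index: used_mib}."""
    usage = {}
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, used = (field.strip() for field in line.split(","))
        usage[int(index)] = int(used)
    return usage


def read_gpu_memory():
    """Query per-GPU used memory in MiB via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_memory(out)


def watch(interval_s=5.0, samples=3):
    """Print a timestamped reading every interval_s seconds."""
    for _ in range(samples):
        usage = read_gpu_memory()
        print(time.strftime("%H:%M:%S"), usage,
              "total:", sum(usage.values()), "MiB")
        time.sleep(interval_s)
```

Calling `watch(interval_s=5, samples=100)` in a second terminal while sending long prompts to llama-server should show whether the usage climbs gradually or in steps.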
First Bad Commit
No response
Relevant log output
-ctk q4_0 -ctv q4_0 with --swa-full, directly after startup:
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 261017 C ...ma.cpp/build/bin/llama-server 13300MiB |
| 1 N/A N/A 261017 C ...ma.cpp/build/bin/llama-server 12492MiB |
+-----------------------------------------------------------------------------------------+
after ~71k Context:
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 260928 C ...ma.cpp/build/bin/llama-server 21072MiB |
| 1 N/A N/A 260928 C ...ma.cpp/build/bin/llama-server 20044MiB |
+-----------------------------------------------------------------------------------------+
meanwhile default f16 with --swa-full:
directly after startup:
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 261224 C ...ma.cpp/build/bin/llama-server 20042MiB |
| 1 N/A N/A 261224 C ...ma.cpp/build/bin/llama-server 18110MiB |
+-----------------------------------------------------------------------------------------+
after the same ~71k context:
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 261224 C ...ma.cpp/build/bin/llama-server 20084MiB |
| 1 N/A N/A 261224 C ...ma.cpp/build/bin/llama-server 18152MiB |
+-----------------------------------------------------------------------------------------+
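For a quick comparison, the per-GPU readings above can be summed across both GPUs. This is only arithmetic on the numbers already pasted in this report; the dictionary layout and variable names are illustrative. One caveat: the two q4_0 snapshots show different PIDs (261017 and 260928), so they appear to come from two separate server runs.

```python
# Per-GPU memory in MiB (GPU 0, GPU 1), copied from the nvidia-smi output above.
readings = {
    ("q4_0", "startup"): (13300, 12492),
    ("q4_0", "~71k"): (21072, 20044),
    ("f16", "startup"): (20042, 18110),
    ("f16", "~71k"): (20084, 18152),
}

# Total across both GPUs for each snapshot.
totals = {key: sum(per_gpu) for key, per_gpu in readings.items()}

# Growth between startup and ~71k context for each cache type.
growth = {
    cache: totals[(cache, "~71k")] - totals[(cache, "startup")]
    for cache in ("q4_0", "f16")
}

for (cache, stage), total in totals.items():
    print(f"{cache:5s} {stage:8s} {total} MiB")
for cache, delta in growth.items():
    print(f"{cache:5s} growth   {delta} MiB")
```

By these numbers the q4_0 run grows by 15324 MiB (25792 to 41116 MiB), while the f16 run grows by only 84 MiB (38152 to 38236 MiB). At ~71k context the q4_0 run therefore ends up above the f16 run.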