Eval bug: --swa-full incompatible with cache quantization (on gemma4 at least): VRAM usage expands heavily with use

### Name and Version

version: 9459 (95b8b8ec1)
built with GNU 14.2.0 for Linux x86_64

CUDA 13.0

### Operating systems

Linux

### GGML backends

CUDA

### Hardware

2x L4-24Q vGPUs

### Models

Latest gemma4-26b-a4b quants from bartowski and unsloth, Q4_K_M or Q5_K_L (all behave the same)

### Problem description & steps to reproduce

When using --swa-full without cache quantization, the VRAM is reserved immediately on startup. You know right away whether it's stable, and the quality is top. Because of this, I prefer using --swa-full.

Q5_K_L with 80K context uses about 41-42 GB of VRAM at the default f16 precision.


When using -ctk and -ctv with q8_0 or even q4_0 quantization, the VRAM usage directly after starting llama-server is, as expected, much lower. But once you feed long contexts to the model (actually using the same max context size), the VRAM usage expands until it is just as big as with the default f16 KV cache, rendering KV cache quantization useless.


This happens both with -sm layer and with tensor. I ran into this bug weeks ago, and after I saw the commit about tensor KV cache quantization compatibility, I checked again to see whether it had been ironed out over time, but nope, still the same. So I don't think it's a new bug caused by the new commit.
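To track the growth over time instead of comparing two manual nvidia-smi snapshots, something like the following could be used. It is only a rough sketch: the function names are made up for illustration, and it assumes `nvidia-smi` is on PATH and reports per-process memory via `--query-compute-apps` (which may not hold on every vGPU setup).

```python
import subprocess

# Real nvidia-smi query options; prints one CSV row per (GPU, process) pair.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text, name_filter="llama-server"):
    """Sum used GPU memory (MiB) across all GPUs, per PID, for matching processes."""
    totals = {}
    for line in csv_text.strip().splitlines():
        if line.count(",") < 2:
            continue  # not a "pid, name, memory" row
        pid, rest = line.split(",", 1)
        name, used = rest.rsplit(",", 1)
        if name_filter not in name:
            continue
        try:
            mib = int(used.strip())
        except ValueError:
            continue  # e.g. "[N/A]" when the driver does not report usage
        key = int(pid.strip())
        totals[key] = totals.get(key, 0) + mib
    return totals


def sample_vram():
    """Run nvidia-smi once and return {pid: total MiB} for llama-server processes."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_compute_apps(out)
```

Calling `sample_vram()` once right after startup and again after each long request should show whether the total for the llama-server PID keeps climbing.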

### First Bad Commit

_No response_

### Relevant log output

-ctk / -ctv q4_0 with --swa-full, directly after startup:
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A          261017      C   ...ma.cpp/build/bin/llama-server      13300MiB |
|    1   N/A  N/A          261017      C   ...ma.cpp/build/bin/llama-server      12492MiB |
+-----------------------------------------------------------------------------------------+

after ~71k Context:
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A          260928      C   ...ma.cpp/build/bin/llama-server      21072MiB |
|    1   N/A  N/A          260928      C   ...ma.cpp/build/bin/llama-server      20044MiB |
+-----------------------------------------------------------------------------------------+




Meanwhile, default f16 with --swa-full, directly after startup:
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A          261224      C   ...ma.cpp/build/bin/llama-server      20042MiB |
|    1   N/A  N/A          261224      C   ...ma.cpp/build/bin/llama-server      18110MiB |
+-----------------------------------------------------------------------------------------+


after same ~71k Context:
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A          261224      C   ...ma.cpp/build/bin/llama-server      20084MiB |
|    1   N/A  N/A          261224      C   ...ma.cpp/build/bin/llama-server      18152MiB |
+-----------------------------------------------------------------------------------------+
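For a quick comparison, the per-GPU numbers above can be summed and diffed. This is just arithmetic on the values copied from the tables, with illustrative names. Note that the two q4_0 snapshots show different PIDs, so they appear to come from separate server runs.

```python
# Per-GPU usage in MiB, copied from the nvidia-smi tables above: (GPU 0, GPU 1).
readings = {
    ("q4_0", "startup"): (13300, 12492),
    ("q4_0", "after ~71k"): (21072, 20044),
    ("f16", "startup"): (20042, 18110),
    ("f16", "after ~71k"): (20084, 18152),
}


def total(cache_type, phase):
    """Total MiB across both GPUs for one snapshot."""
    return sum(readings[(cache_type, phase)])


def growth(cache_type):
    """MiB gained between startup and the ~71k-context snapshot."""
    return total(cache_type, "after ~71k") - total(cache_type, "startup")


for cache_type in ("q4_0", "f16"):
    print(
        cache_type,
        "startup:", total(cache_type, "startup"),
        "after:", total(cache_type, "after ~71k"),
        "growth:", growth(cache_type),
    )
```

With these numbers, q4_0 grows by 15324 MiB (25792 to 41116), while f16 grows by only 84 MiB (38152 to 38236), so the quantized run ends up using more VRAM than the f16 run.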
