Eval bug: --swa-full incompatible with cache quantization (on gemma4 at least): VRAM usage expands heavily with use #23978

@TomTheWise

Name and Version

version: 9459 (95b8b8e)
built with GNU 14.2.0 for Linux x86_64

CUDA 13.0

Operating systems

Linux

GGML backends

CUDA

Hardware

2x L4-24Q vGPUs

Models

Latest gemma4-26b-a4b quants from bartowski and unsloth, Q4_K_M or Q5_K_L (all behave the same)

Problem description & steps to reproduce

When using --swa-full without cache quantization, the VRAM is reserved immediately on startup. You know right away whether the setup is stable, and the quality is top, which is why I prefer using --swa-full.

Q5_K_L with 80K context uses about 41-42 GB of VRAM at the default f16 precision.

When using -ctk and -ctv with q8_0 or even q4_0 quantization, the VRAM usage directly after starting llama-server is, as expected, much lower. But once you feed long contexts to the model (actually filling the same maximum context size), the VRAM usage grows until it is just as large as with the default f16 KV cache, rendering KV cache quantization useless.
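For a rough sense of what quantization should save, the sketch below estimates the KV cache size per cache type. It assumes every layer caches the full context (which is what --swa-full is understood to request) and uses the ggml block layouts for q8_0 (34 bytes per 32 values) and q4_0 (18 bytes per 32 values). The layer count, KV head count and head size are hypothetical placeholders, not the real gemma4-26b-a4b dimensions.

```python
# ggml storage cost per cached value: f16 uses 2 bytes; q8_0 and q4_0 pack
# 32 values into blocks of 34 and 18 bytes respectively.
BYTES_PER_VALUE = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_mib(n_layers, n_kv_heads, head_dim, n_ctx, cache_type):
    """Estimate K+V cache size in MiB when every layer caches n_ctx tokens."""
    values = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # factor 2: K and V
    return values * BYTES_PER_VALUE[cache_type] / (1024 * 1024)


if __name__ == "__main__":
    # Hypothetical model dimensions, for illustration only.
    dims = dict(n_layers=30, n_kv_heads=8, head_dim=256, n_ctx=80_000)
    for cache_type in BYTES_PER_VALUE:
        size = kv_cache_mib(cache_type=cache_type, **dims)
        print(f"{cache_type}: {size:.0f} MiB")
```

Whatever the real dimensions are, q8_0 should come to about 53% and q4_0 to about 28% of the f16 figure under these assumptions, so a quantized cache growing to the f16 size would not be expected from the cache tensors alone.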

This happens with both -sm layer and -sm tensor. I ran into this bug weeks ago, and after I saw the commit about tensor KV cache quantization compatibility, I checked again to see whether it had been ironed out over time, but no, it is still the same. So I don't think it is a new bug caused by that commit.
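The configurations described above can be enumerated as a small test matrix. This is only a sketch: the model path is a hypothetical placeholder, 80000 is an assumed value for the "80K" context, and the report does not show the full command line, so other flags may have been in use.

```python
from itertools import product


def server_args(model_path, ctx, cache_type, split_mode):
    """Build a llama-server argument list for one configuration."""
    args = ["llama-server", "-m", model_path, "-c", str(ctx),
            "--swa-full", "-sm", split_mode]
    if cache_type != "f16":
        # f16 is the default cache type, so -ctk/-ctv are only added
        # for the quantized runs.
        args += ["-ctk", cache_type, "-ctv", cache_type]
    return args


def repro_matrix(model_path, ctx=80_000):
    """All cache type and split mode combinations mentioned in the report."""
    cache_types = ("f16", "q8_0", "q4_0")
    split_modes = ("layer", "tensor")
    return [server_args(model_path, ctx, ct, sm)
            for ct, sm in product(cache_types, split_modes)]


if __name__ == "__main__":
    # "model.gguf" is a placeholder path, not a file named in the report.
    for args in repro_matrix("model.gguf"):
        print(" ".join(args))
```

Running each printed command, filling the context, and comparing VRAM before and after should show whether only the quantized rows grow.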

First Bad Commit

No response

Relevant log output

ctk ctv q4_0 after startup with --swa-full:
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A          261017      C   ...ma.cpp/build/bin/llama-server      13300MiB |
|    1   N/A  N/A          261017      C   ...ma.cpp/build/bin/llama-server      12492MiB |
+-----------------------------------------------------------------------------------------+

after ~71k Context:
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A          260928      C   ...ma.cpp/build/bin/llama-server      21072MiB |
|    1   N/A  N/A          260928      C   ...ma.cpp/build/bin/llama-server      20044MiB |
+-----------------------------------------------------------------------------------------+

meanwhile default f16 with --swa-full:
directly after startup:
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A          261224      C   ...ma.cpp/build/bin/llama-server      20042MiB |
|    1   N/A  N/A          261224      C   ...ma.cpp/build/bin/llama-server      18110MiB |
+-----------------------------------------------------------------------------------------+

after same ~71k Context:
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A          261224      C   ...ma.cpp/build/bin/llama-server      20084MiB |
|    1   N/A  N/A          261224      C   ...ma.cpp/build/bin/llama-server      18152MiB |
+-----------------------------------------------------------------------------------------+
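
To compare the runs, the per-GPU figures in the tables above can be totalled. The sketch below parses process rows of that table shape and sums the memory column. The row pattern is an assumption based on the layout shown in these logs, and the `row` helper is a hypothetical convenience for rebuilding the rows from the reported numbers.

```python
import re

# Matches one process row: GPU index, GI, CI, PID, type, process name,
# and the memory figure in MiB. Header and border lines do not match.
ROW = re.compile(
    r"^\|\s*(\d+)\s+\S+\s+\S+\s+(\d+)\s+\S+\s+\S+\s+(\d+)MiB\s*\|\s*$"
)


def total_mib(table_text):
    """Sum the GPU memory column over all process rows in the table text."""
    total = 0
    for line in table_text.splitlines():
        match = ROW.match(line)
        if match:
            total += int(match.group(3))
    return total


def row(gpu, pid, mib):
    """Rebuild one process row in the same shape as the logs."""
    return f"| {gpu} N/A N/A {pid} C ...ma.cpp/build/bin/llama-server {mib}MiB |"


# Figures copied from the four tables in this report.
snapshots = {
    "q4_0 startup": row(0, 261017, 13300) + "\n" + row(1, 261017, 12492),
    "q4_0 ~71k": row(0, 260928, 21072) + "\n" + row(1, 260928, 20044),
    "f16 startup": row(0, 261224, 20042) + "\n" + row(1, 261224, 18110),
    "f16 ~71k": row(0, 261224, 20084) + "\n" + row(1, 261224, 18152),
}
totals = {name: total_mib(text) for name, text in snapshots.items()}

if __name__ == "__main__":
    for name, mib in totals.items():
        print(f"{name}: {mib} MiB")
```

On these figures the q4_0 run grows from 25792 MiB to 41116 MiB across both GPUs, while the f16 run stays nearly flat at 38152 MiB to 38236 MiB.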
