server: add KV cache metrics by lvsijian8 · Pull Request #24010 · ggml-org/llama.cpp

lvsijian8 · 2026-06-02T06:41:23Z

Fixes #23632.

The Prometheus /metrics endpoint exposes request and throughput metrics, but it does not report KV cache pressure.

- Add a read-only KV cache cell usage query on llama memory.
- Report KV cache used cells, total cells, and usage ratio from the server metrics task.
- Document the new Prometheus gauges.
- Cover the /metrics output in the server basic tests.
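
As a rough illustration of how a client might consume these gauges, the sketch below parses a Prometheus text-format payload and recomputes the usage ratio from the used and total cell counts. The gauge names (`llamacpp:kv_cache_used_cells`, `llamacpp:kv_cache_total_cells`, `llamacpp:kv_cache_usage_ratio`) and the sample values are assumptions made for the example, not names confirmed by this description; the helper functions are hypothetical and not part of the PR.

```python
# Hypothetical sample of /metrics output. The gauge names and values below
# are assumptions for illustration; check the server docs for the real names.
SAMPLE = """\
# HELP llamacpp:kv_cache_used_cells KV cache cells currently in use.
# TYPE llamacpp:kv_cache_used_cells gauge
llamacpp:kv_cache_used_cells 128
# HELP llamacpp:kv_cache_total_cells Total KV cache cells.
# TYPE llamacpp:kv_cache_total_cells gauge
llamacpp:kv_cache_total_cells 512
# HELP llamacpp:kv_cache_usage_ratio Used cells divided by total cells.
# TYPE llamacpp:kv_cache_usage_ratio gauge
llamacpp:kv_cache_usage_ratio 0.25
"""


def parse_gauges(text: str) -> dict[str, float]:
    """Parse unlabelled samples (no timestamps) from Prometheus text format."""
    gauges: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and HELP/TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        gauges[name] = float(value)
    return gauges


def kv_usage_ratio(used: float, total: float) -> float:
    """Return used/total, guarding against an empty (zero-cell) cache."""
    return used / total if total > 0 else 0.0


gauges = parse_gauges(SAMPLE)
ratio = kv_usage_ratio(
    gauges["llamacpp:kv_cache_used_cells"],
    gauges["llamacpp:kv_cache_total_cells"],
)
print(f"KV cache usage: {ratio:.2%}")
```

Recomputing the ratio on the client side is a cheap consistency check against the reported ratio gauge, and the zero-total guard avoids a division error if the cache has not been allocated yet.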

.venv/bin/cmake -B build
.venv/bin/cmake --build build --target llama-server -j
PATH="$PWD/.venv/bin:$PATH" LLAMA_CACHE="$PWD/tools/server/tests/tmp" PORT=18080 LLAMA_SERVER_BIN_PATH="$PWD/build/bin/llama-server" ./tools/server/tests/tests.sh unit/test_basic.py::test_server_metrics_kv_cache -q

server: add KV cache metrics

ab354c6

lvsijian8 requested review from a team and ggerganov as code owners June 2, 2026 06:41
