server: add KV cache metrics #24010

Open
lvsijian8 wants to merge 1 commit into ggml-org:master from lvsijian8:fix/kv-cache-metrics

@lvsijian8

Fixes #23632.

Problem

The Prometheus /metrics endpoint exposes request and throughput metrics, but it does not report KV cache pressure.

Changes

  • add a read-only KV cache cell usage query on llama memory
  • report KV cache used cells, total cells, and usage ratio from the server metrics task
  • document the new Prometheus gauges
  • cover the /metrics output in the server basic tests
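For illustration only, here is a minimal Python sketch of how the three gauges relate to each other in Prometheus text format. It is not the code in this PR: the metric names (`llamacpp:kv_cache_used_cells` and so on) and the function name are hypothetical placeholders, and the zero-capacity guard is an assumption about how an empty cache might be handled.

```python
def render_kv_cache_gauges(used_cells: int, total_cells: int) -> str:
    """Render KV cache usage as Prometheus text-format gauges."""
    if not 0 <= used_cells <= total_cells:
        raise ValueError("expected 0 <= used_cells <= total_cells")

    # Assumption: report a ratio of 0.0 when the cache has no cells,
    # so the gauge is always defined.
    ratio = used_cells / total_cells if total_cells > 0 else 0.0

    # Hypothetical metric names; the real names are defined by the PR.
    gauges = [
        ("llamacpp:kv_cache_used_cells", "KV cache cells currently in use.", used_cells),
        ("llamacpp:kv_cache_total_cells", "Total KV cache cells available.", total_cells),
        ("llamacpp:kv_cache_usage_ratio", "Used cells divided by total cells.", ratio),
    ]

    lines = []
    for name, help_text, value in gauges:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_kv_cache_gauges(used_cells=256, total_cells=1024), end="")
```

Exposing the ratio alongside the raw counts lets a dashboard alert on cache pressure without knowing the configured context size.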

Tests

  • .venv/bin/cmake -B build
  • .venv/bin/cmake --build build --target llama-server -j
  • PATH="$PWD/.venv/bin:$PATH" LLAMA_CACHE="$PWD/tools/server/tests/tmp" PORT=18080 LLAMA_SERVER_BIN_PATH="$PWD/build/bin/llama-server" ./tools/server/tests/tests.sh unit/test_basic.py::test_server_metrics_kv_cache -q
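As a rough sketch of the kind of check such a test could make, the Python below parses a `/metrics` response body and verifies that the three KV cache gauges agree with each other. The metric names, the `SAMPLE` body, and both helper functions are hypothetical and are not taken from `test_basic.py`. The parser assumes samples carry no labels or timestamps.

```python
import math

# Hypothetical /metrics excerpt; the real metric names may differ.
SAMPLE = """\
# HELP llamacpp:kv_cache_used_cells KV cache cells currently in use.
# TYPE llamacpp:kv_cache_used_cells gauge
llamacpp:kv_cache_used_cells 256
# HELP llamacpp:kv_cache_total_cells Total KV cache cells available.
# TYPE llamacpp:kv_cache_total_cells gauge
llamacpp:kv_cache_total_cells 1024
# HELP llamacpp:kv_cache_usage_ratio Used cells divided by total cells.
# TYPE llamacpp:kv_cache_usage_ratio gauge
llamacpp:kv_cache_usage_ratio 0.25
"""


def parse_samples(text: str) -> dict[str, float]:
    """Map each metric name to its value, skipping comments and blank lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Assumes "name value" with no labels and no trailing timestamp.
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


def check_kv_cache_samples(samples: dict[str, float],
                           prefix: str = "llamacpp:kv_cache_") -> None:
    """Assert that used, total, and ratio are present and consistent."""
    used = samples[prefix + "used_cells"]
    total = samples[prefix + "total_cells"]
    ratio = samples[prefix + "usage_ratio"]
    assert 0 <= used <= total
    expected = used / total if total > 0 else 0.0
    assert math.isclose(ratio, expected, rel_tol=1e-6, abs_tol=1e-9)


if __name__ == "__main__":
    check_kv_cache_samples(parse_samples(SAMPLE))
    print("KV cache gauges are consistent")
```

In a real test, the sample text would be replaced by the body returned from the server's `/metrics` endpoint.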

lvsijian8 requested review from a team and ggerganov as code owners on June 2, 2026 06:41

Development

Successfully merging this pull request may close these issues.

Feature Request: expose KV cache utilization for /metrics endpoint
