Improve llm-d-kv-cache Observability for Metrics, Tracing Coverage, Alerting, and Dashboards #617

@gyliu513

This issue uses llm-d/llm-d-workload-variant-autoscaler#911 as a reference template.

@vMaroon can you please comment? Thanks

Summary

The kv-cache library currently emits 9 Prometheus metrics covering the index Add/Evict/Lookup paths (kvcache_index_*) and tokenization latency (kvcache_tokenization_*). The kvevents/ZMQ subscriber layer, storage backend connectivity, and error conditions have no metrics at all. OTel tracing covers Lookup, ScoreTokens, and Scorer.Score, but Add, Evict, and the entire ZMQ event-processing path remain untraced. There are no alerting rules and no operational dashboard for the core index library.

This epic defines a phased improvement across metrics, tracing, alerting, and dashboards, broken into independently deliverable sub-issues.

Current State

| Area | Status | Notes |
| --- | --- | --- |
| Prometheus metrics | 9 collectors | Index: admissions_total, evictions_total, lookup_*; tokenization: 3 metrics (when pool enabled). Gated by enableMetrics. |
| Error metrics | None | Lookup / tokenization / backend errors not counted |
| kvevents layer | None | ZMQ subscriber and pool operate silently; no metrics, no traces |
| Backend metrics | None | Redis / Valkey connectivity unmonitored |
| OpenTelemetry tracing | Partial | Covers Lookup, ScoreTokens, Scorer.Score; Add, Evict, ZMQ path not traced |
| HTTP endpoints | None in library | Host application exposes /metrics; core library has no default HTTP listener, no /healthz or /readyz |
| Alerting rules | None | No PrometheusRule CRDs |
| Dashboards | llmd_fs_backend connector only | No operational dashboard for the core index library |

Components with No Observability Today

These files operate silently (no metrics, no traces):

  • pkg/kvevents/subscriber_manager.go: no metrics for active subscriber count or reconnection events
  • pkg/kvevents/zmq_subscriber.go: no metrics for message receive rate or ZMQ errors
  • pkg/kvevents/pool.go: no metrics for pool utilization or capacity
  • pkg/kvcache/backend.go: no metrics for backend health or connection errors
  • pkg/kvcache/kvblock/redis.go / in_memory.go / cost_aware_memory.go: no per-backend metrics for cache size or eviction rates
  • pkg/kvcache/kvblock/traced_index.go: Add and Evict paths are not traced

Deployment Boundary

| Scenario | Who owns it |
| --- | --- |
| Embedded in EPP (precise-prefix-cache-scorer, enableMetrics: true) | kvcache_* metrics appear on EPP's /metrics; no duplicate work needed here |
| Standalone service | This epic: library-level metrics, health probes, scrape docs |
| EPP-side scrape configuration | Stays in the llm-d-router observability issue |

When embedded in EPP, kvcache_* metrics share the same controller-runtime Registry and appear at EPP's :9090/metrics endpoint unchanged. This epic does not add a second scrape endpoint for the embedded case.

Sub-Issues

  • Sub-issue 1: kvevents Subscriber Health Metrics: active subscriber gauge, message receive rate, ZMQ error counter, reconnection counter per pod_identifier
  • Sub-issue 2: Error Tracking Metrics: lookup error counter, tokenization error counter, backend read/write error counter by backend_type
  • Sub-issue 3: Cache Efficiency Metrics Enhancement: lookup_hits_total / lookup_requests_total already exist; add an explicit hit_rate gauge, an index entry count gauge, and per-backend utilization labels to existing eviction/admission counters
  • Sub-issue 4: Backend Connectivity Metrics: backend scrape latency histogram, error rate by backend_type, connection pool size gauge; covers Redis, Valkey, in-memory, cost-aware-memory
  • Sub-issue 5: Extend OTel Tracing Coverage: Add and Evict in the traced index wrapper, ZMQ event-processing spans, stronger tokenization span propagation; follow the existing tracedIndex wrapper pattern
  • Sub-issue 6: PrometheusRule Alerting Rules: high lookup error rate, ZMQ subscriber down, abnormal eviction spike, backend unreachable; targeting kvcache_* core library metrics
  • Sub-issue 7: Operational Grafana Dashboard for Core Library: index throughput, hit rate, tokenization latency, subscriber health, backend health; separate from the existing llmd_fs_backend connector dashboard
  • Sub-issue 8: Health / Readiness Endpoints for Standalone Deployments: HTTP /healthz and /readyz for processes that embed the library directly; not needed for the EPP-embedded case
  • Sub-issue 9: Observability Documentation: metric catalog, tracing setup guide, scrape config example for standalone and EPP-embedded deployments; EPP-side scrape config links to llm-d-router docs
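
As a rough illustration of the hit-rate gauge in Sub-issue 3, the value can be derived from two samples of the existing hit and request counters. The sketch below is standard-library only; the sample type and function names are hypothetical, and how the gauge would actually be updated (on each lookup, or on a timer) is not decided in this issue.

```go
package main

import "fmt"

// sample is a hypothetical snapshot of the two existing lookup counters.
type sample struct {
	hits     uint64
	requests uint64
}

// hitRate returns hits/requests in [0, 1], or 0 when there were no requests.
func hitRate(hits, requests uint64) float64 {
	if requests == 0 {
		return 0
	}
	return float64(hits) / float64(requests)
}

// windowHitRate computes the hit rate over the interval between two samples.
// If a counter went backwards (process restart), it falls back to the
// cumulative rate of the current sample.
func windowHitRate(prev, cur sample) float64 {
	if cur.requests < prev.requests || cur.hits < prev.hits {
		return hitRate(cur.hits, cur.requests)
	}
	return hitRate(cur.hits-prev.hits, cur.requests-prev.requests)
}

func main() {
	prev := sample{hits: 100, requests: 200}
	cur := sample{hits: 130, requests: 240}
	fmt.Println(windowHitRate(prev, cur)) // 30 hits over 40 requests
}
```

The same ratio can also be computed at query time from the two counters with a ratio of rate() expressions, so the explicit gauge is mainly a convenience for dashboards and alerts.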

Implementation Order

  • Phase 1 (P0): Sub-issues 1, 2, 3 — Core metrics gaps (can be done in parallel)
  • Phase 2 (P1): Sub-issues 4, 5, 8 — Backend metrics, tracing coverage, health probes
  • Phase 3 (P2): Sub-issues 6, 7, 9 — Alerts, dashboard, docs

Implementation Notes

  • New metrics must follow the existing kvcache_<subsystem>_<name> naming convention (subsystems: index, tokenization) and register through pkg/kvcache/metrics/ via metrics.Register() / sync.Once using the controller-runtime metrics.Registry
  • When embedded in EPP the same Registry is used; metric names are unchanged and no second registration is needed
  • Histogram buckets for sub-millisecond index ops: prometheus.DefBuckets; for ZMQ message processing latency: {0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0}
  • Tracing instrumentation must follow the tracedIndex wrapper pattern; instrument at the interface boundary, do not inline spans into business logic
  • Each sub-issue should be a separate PR for reviewability
  • Sub-issues within the same phase can be developed in parallel
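
One point on the bucket note above may be worth checking during review: prometheus.DefBuckets starts at 5 ms, so sub-millisecond index operations would all land in its first bucket. The standard-library sketch below only illustrates that effect. defBuckets mirrors the values of prometheus.DefBuckets, zmqBuckets is the list proposed above, and bucketFor is a hypothetical helper, not part of any library.

```go
package main

import (
	"fmt"
	"sort"
)

// defBuckets mirrors prometheus.DefBuckets from client_golang (in seconds).
var defBuckets = []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}

// zmqBuckets is the bucket list proposed for ZMQ message processing latency.
var zmqBuckets = []float64{0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0}

// bucketFor returns the upper bound of the first bucket that contains v
// (bounds are inclusive, like Prometheus "le" labels). The boolean is false
// when v exceeds every bound and would only be counted in +Inf.
func bucketFor(buckets []float64, v float64) (float64, bool) {
	i := sort.SearchFloat64s(buckets, v)
	if i == len(buckets) {
		return 0, false
	}
	return buckets[i], true
}

func main() {
	// Example latencies in seconds: 50µs, 300µs, 2ms.
	for _, latency := range []float64{0.00005, 0.0003, 0.002} {
		d, _ := bucketFor(defBuckets, latency)
		z, _ := bucketFor(zmqBuckets, latency)
		fmt.Printf("latency=%gs default_le=%g proposed_le=%g\n", latency, d, z)
	}
}
```

With the default buckets all three example latencies map to le=0.005, while the finer list separates the two sub-millisecond values. Whether the index histograms should adopt finer buckets is a decision for the maintainers.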
