Improve llm-d-kv-cache Observability for Metrics, Tracing Coverage, Alerting, and Dashboards

This issue uses https://github.com/llm-d/llm-d-workload-variant-autoscaler/issues/911 as a reference template.

@vMaroon can you please comment? Thanks

## Summary

The kv-cache library currently emits **9 Prometheus metrics** covering the index Add/Evict/Lookup paths (`kvcache_index_*`) and tokenization latency (`kvcache_tokenization_*`). Three areas have no metrics at all: the **kvevents/ZMQ subscriber layer**, **storage backend connectivity**, and **error conditions**. OTel tracing covers `Lookup`, `ScoreTokens`, and `Scorer.Score`, but `Add`, `Evict`, and the entire ZMQ event-processing path remain untraced. There are no alerting rules and no operational dashboard for the core index library.

This epic defines a phased improvement across metrics, tracing, alerting, and dashboards, broken into independently deliverable sub-issues.

## Current State

| Area | Status | Notes |
|------|--------|-------|
| Prometheus metrics | 9 collectors | Index: `admissions_total`, `evictions_total`, `lookup_*`; tokenization: 3 metrics (when pool enabled). Gated by `enableMetrics`. |
| Error metrics | None | Lookup / tokenization / backend errors not counted |
| kvevents layer | None | ZMQ subscriber and pool operate silently; no metrics, no traces |
| Backend metrics | None | Redis / Valkey connectivity unmonitored |
| OpenTelemetry tracing | Partial | Covers `Lookup`, `ScoreTokens`, `Scorer.Score`; `Add`, `Evict`, ZMQ path not traced |
| HTTP endpoints | None in library | Host application exposes `/metrics`; core library has no default HTTP listener, no `/healthz` or `/readyz` |
| Alerting rules | None | No PrometheusRule CRDs |
| Dashboards | `llmd_fs_backend` connector only | No operational dashboard for the core index library |

## Components with No Observability Today

These files operate silently (no metrics, no traces):

- `pkg/kvevents/subscriber_manager.go` — active subscriber count, reconnection events: zero metrics
- `pkg/kvevents/zmq_subscriber.go` — message receive rate, ZMQ errors: zero metrics
- `pkg/kvevents/pool.go` — pool utilization and capacity: zero metrics
- `pkg/kvcache/backend.go` — backend health, connection errors: zero metrics
- `pkg/kvcache/kvblock/redis.go` / `in_memory.go` / `cost_aware_memory.go` — per-backend cache size and eviction rates: zero metrics
- `pkg/kvcache/kvblock/traced_index.go` — `Add` and `Evict` paths are not traced

## Deployment Boundary

| Scenario | Who owns it |
|----------|-------------|
| Embedded in EPP (`precise-prefix-cache-scorer`, `enableMetrics: true`) | `kvcache_*` metrics appear on EPP's `/metrics`; no duplicate work needed here |
| Standalone service | This epic: library-level metrics, health probes, scrape docs |
| EPP-side scrape configuration | Stays in the llm-d-router observability issue |

When embedded in EPP, `kvcache_*` metrics share the same controller-runtime Registry and appear at EPP's `:9090/metrics` endpoint unchanged. This epic does not add a second scrape endpoint for the embedded case.

## Sub-Issues

- **Sub-issue 1**: kvevents Subscriber Health Metrics: active subscriber gauge, message receive rate, ZMQ error counter, reconnection counter per `pod_identifier`
- **Sub-issue 2**: Error Tracking Metrics: lookup error counter, tokenization error counter, backend read/write error counter by `backend_type`
- **Sub-issue 3**: Cache Efficiency Metrics Enhancement: `lookup_hits_total` / `lookup_requests_total` already exist; add an explicit `hit_rate` gauge, an index entry count gauge, and per-backend utilization labels to existing eviction/admission counters
- **Sub-issue 4**: Backend Connectivity Metrics: backend scrape latency histogram, error rate by `backend_type`, connection pool size gauge; covers Redis, Valkey, in-memory, cost-aware-memory
- **Sub-issue 5**: Extend OTel Tracing Coverage: `Add` and `Evict` in the traced index wrapper, ZMQ event-processing spans, stronger tokenization span propagation; follow the existing `tracedIndex` wrapper pattern
- **Sub-issue 6**: PrometheusRule Alerting Rules: high lookup error rate, ZMQ subscriber down, abnormal eviction spike, backend unreachable; targeting `kvcache_*` core library metrics
- **Sub-issue 7**: Operational Grafana Dashboard for Core Library: index throughput, hit rate, tokenization latency, subscriber health, backend health; separate from the existing `llmd_fs_backend` connector dashboard
- **Sub-issue 8**: Health / Readiness Endpoints for Standalone Deployments: HTTP `/healthz` and `/readyz` for processes that embed the library directly; not needed for the EPP-embedded case
- **Sub-issue 9**: Observability Documentation: metric catalog, tracing setup guide, scrape config example for standalone and EPP-embedded deployments; EPP-side scrape config links to llm-d-router docs
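
As an illustration of Sub-issue 8, the following is a minimal sketch of `/healthz` and `/readyz` handlers for a standalone process, using only the Go standard library. The `Health` type, its `SetReady` method, and the `statusOf` helper are hypothetical names introduced here; what "ready" should mean (for example, at least one connected subscriber or a reachable backend) is left open and would be decided in the sub-issue.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// Health is a hypothetical helper that tracks whether the process is ready.
type Health struct {
	ready atomic.Bool
}

// SetReady flips readiness, e.g. once the index and subscribers are initialized.
func (h *Health) SetReady(v bool) { h.ready.Store(v) }

// Handler serves /healthz (liveness) and /readyz (readiness).
func (h *Health) Handler() http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		// Liveness: the process is up and able to serve HTTP.
		fmt.Fprintln(w, "ok")
	})
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, _ *http.Request) {
		// Readiness: report 503 until the owner marks the process ready.
		if !h.ready.Load() {
			http.Error(w, "not ready", http.StatusServiceUnavailable)
			return
		}
		fmt.Fprintln(w, "ok")
	})
	return mux
}

// statusOf issues an in-process GET against the handler and returns the status code.
func statusOf(h http.Handler, path string) int {
	rr := httptest.NewRecorder()
	h.ServeHTTP(rr, httptest.NewRequest(http.MethodGet, path, nil))
	return rr.Code
}

func main() {
	h := &Health{}
	handler := h.Handler()
	fmt.Println(statusOf(handler, "/healthz"), statusOf(handler, "/readyz")) // 200 503
	h.SetReady(true)
	fmt.Println(statusOf(handler, "/readyz")) // 200
}
```

In a real deployment the handler would be mounted on the host application's HTTP server; the library itself has no default listener today.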

## Implementation Order

- Phase 1 (P0): Sub-issues 1, 2, 3    — Core metrics gaps (can be done in parallel)
- Phase 2 (P1): Sub-issues 4, 5, 8    — Backend metrics, tracing coverage, health probes
- Phase 3 (P2): Sub-issues 6, 7, 9    — Alerts, dashboard, docs

## Implementation Notes

- New metrics must follow the existing `kvcache_<subsystem>_<name>` naming convention (subsystems: `index`, `tokenization`) and register through `pkg/kvcache/metrics/` via `metrics.Register()` / `sync.Once` using the controller-runtime `metrics.Registry`
- When embedded in EPP the same Registry is used; metric names are unchanged and no second registration is needed
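- Histogram buckets for index ops: `prometheus.DefBuckets`, noting that its smallest bucket is 5 ms, so sub-millisecond operations all land in the first bucket and finer buckets may be needed to resolve them; for ZMQ message processing latency: `{0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0}`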
- Tracing instrumentation must follow the `tracedIndex` wrapper pattern; instrument at the interface boundary, do not inline spans into business logic
- Each sub-issue should be a separate PR for reviewability
- Sub-issues within the same phase can be developed in parallel
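
To help evaluate the bucket choices in the notes above, this small sketch compares where sample latencies land in each layout. It does not import the Prometheus client: `defBuckets` copies the values of `prometheus.DefBuckets` (in seconds), `zmqBuckets` is the ZMQ layout listed above, and `bucketFor` plus the sample latencies are hypothetical, chosen only for illustration.

```go
package main

import "fmt"

// defBuckets mirrors the values of prometheus.DefBuckets, in seconds.
var defBuckets = []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}

// zmqBuckets is the layout proposed for ZMQ message processing latency.
var zmqBuckets = []float64{0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0}

// bucketFor returns the upper bound ("le") of the first bucket that contains v.
// ok is false when v exceeds every bound and only the implicit +Inf bucket matches.
func bucketFor(buckets []float64, v float64) (le float64, ok bool) {
	for _, ub := range buckets {
		if v <= ub {
			return ub, true
		}
	}
	return 0, false
}

func main() {
	// Hypothetical latencies: 50 µs, 200 µs, 800 µs, 3 ms.
	samples := []float64{0.00005, 0.0002, 0.0008, 0.003}
	for _, s := range samples {
		d, _ := bucketFor(defBuckets, s)
		z, _ := bucketFor(zmqBuckets, s)
		fmt.Printf("latency=%gs default le=%g fine-grained le=%g\n", s, d, z)
	}
}
```

With the default layout all four samples share the 5 ms bucket, while the finer layout separates them into four distinct buckets, which is what percentile queries over sub-millisecond operations would need.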


Improve llm-d-kv-cache Observability for Metrics, Tracing Coverage, Alerting, and Dashboards #617