docs: DYN-1967 update metrics docs after kvstats removal (#5704)

keivenchang · web-flow · commit 8e72fb69cf05 · 2026-01-27T17:24:12.000-08:00
Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com>
diff --git a/docs/kubernetes/autoscaling.md b/docs/kubernetes/autoscaling.md
@@ -227,7 +227,6 @@ Dynamo exports several metrics useful for autoscaling. These are available at th
 | `dynamo_frontend_time_to_first_token_seconds` | Histogram | TTFT latency | ✅ Workers |
 | `dynamo_frontend_inter_token_latency_seconds` | Histogram | ITL latency | ✅ Decode |
 | `dynamo_frontend_request_duration_seconds` | Histogram | Total request duration | ⚠️ General |
-| `kvstats_gpu_cache_usage_percent` | Gauge | GPU KV cache usage (0-1) | ✅ Decode |
 
 #### Metric Labels
 
@@ -641,7 +640,7 @@ Avoid configuring multiple autoscalers for the same service:
 |--------------|---------------------|---------------|
 | Frontend | CPU utilization, request rate | `dynamo_frontend_requests_total` |
 | Prefill | Queue depth, TTFT | `dynamo_frontend_queued_requests`, `dynamo_frontend_time_to_first_token_seconds` |
-| Decode | KV cache utilization, ITL | `kvstats_gpu_cache_usage_percent`, `dynamo_frontend_inter_token_latency_seconds` |
+| Decode | ITL | `dynamo_frontend_inter_token_latency_seconds` |
 
 ### 3. Configure Stabilization Windows
 
diff --git a/docs/observability/metrics.md b/docs/observability/metrics.md
@@ -123,19 +123,6 @@ DYN_SYSTEM_PORT=8081 python -m dynamo.vllm --model <model>
 curl http://localhost:8081/metrics
 ```
 
-### KV Router Statistics (kvstats)
-
-KV router statistics are automatically exposed by LLM workers and KV router components on the backend system status port (port 8081) with the `dynamo_component_kvstats_*` prefix. These metrics provide insights into GPU memory usage and cache efficiency:
-
-- `dynamo_component_kvstats_active_blocks`: Number of active KV cache blocks currently in use (gauge)
-- `dynamo_component_kvstats_total_blocks`: Total number of KV cache blocks available (gauge)
-- `dynamo_component_kvstats_gpu_cache_usage_percent`: GPU cache usage as a percentage (0.0-1.0) (gauge)
-- `dynamo_component_kvstats_gpu_prefix_cache_hit_rate`: GPU prefix cache hit rate as a percentage (0.0-1.0) (gauge)
-
-These metrics are published by:
-- **LLM Workers**: vLLM and TRT-LLM backends publish these metrics through their respective publishers
-- **KV Router**: The KV router component aggregates and exposes these metrics for load balancing decisions
-
 ### Specialized Component Metrics
 
 Some components expose additional metrics specific to their functionality:
diff --git a/fern/pages/kubernetes/autoscaling.md b/fern/pages/kubernetes/autoscaling.md
@@ -233,7 +233,6 @@ Dynamo exports several metrics useful for autoscaling. These are available at th
 | `dynamo_frontend_time_to_first_token_seconds` | Histogram | TTFT latency | ✅ Workers |
 | `dynamo_frontend_inter_token_latency_seconds` | Histogram | ITL latency | ✅ Decode |
 | `dynamo_frontend_request_duration_seconds` | Histogram | Total request duration | ⚠️ General |
-| `kvstats_gpu_cache_usage_percent` | Gauge | GPU KV cache usage (0-1) | ✅ Decode |
 
 #### Metric Labels
 
@@ -647,7 +646,7 @@ Avoid configuring multiple autoscalers for the same service:
 |--------------|---------------------|---------------|
 | Frontend | CPU utilization, request rate | `dynamo_frontend_requests_total` |
 | Prefill | Queue depth, TTFT | `dynamo_frontend_queued_requests`, `dynamo_frontend_time_to_first_token_seconds` |
-| Decode | KV cache utilization, ITL | `kvstats_gpu_cache_usage_percent`, `dynamo_frontend_inter_token_latency_seconds` |
+| Decode | ITL | `dynamo_frontend_inter_token_latency_seconds` |
 
 ### 3. Configure Stabilization Windows
 
diff --git a/fern/pages/observability/metrics.md b/fern/pages/observability/metrics.md
@@ -122,19 +122,6 @@ DYN_SYSTEM_PORT=8081 python -m dynamo.vllm --model <model>
 curl http://localhost:8081/metrics
 ```
 
-### KV Router Statistics (kvstats)
-
-KV router statistics are automatically exposed by LLM workers and KV router components on the backend system status port (port 8081) with the `dynamo_component_kvstats_*` prefix. These metrics provide insights into GPU memory usage and cache efficiency:
-
-- `dynamo_component_kvstats_active_blocks`: Number of active KV cache blocks currently in use (gauge)
-- `dynamo_component_kvstats_total_blocks`: Total number of KV cache blocks available (gauge)
-- `dynamo_component_kvstats_gpu_cache_usage_percent`: GPU cache usage as a percentage (0.0-1.0) (gauge)
-- `dynamo_component_kvstats_gpu_prefix_cache_hit_rate`: GPU prefix cache hit rate as a percentage (0.0-1.0) (gauge)
-
-These metrics are published by:
-- **LLM Workers**: vLLM and TRT-LLM backends publish these metrics through their respective publishers
-- **KV Router**: The KV router component aggregates and exposes these metrics for load balancing decisions
-
 ### Specialized Component Metrics
 
 Some components expose additional metrics specific to their functionality:
diff --git a/lib/bindings/python/codegen/README.md b/lib/bindings/python/codegen/README.md
@@ -16,21 +16,20 @@ cargo run -p dynamo-codegen --bin gen-python-prometheus-names
 
 - Parses Rust AST from `lib/runtime/src/metrics/prometheus_names.rs`
 - Generates Python classes with constants at `lib/bindings/python/src/dynamo/prometheus_names.py`
-- Handles macro-generated constants (e.g., `kvstats_name!("active_blocks")` → `"kvstats_active_blocks"`)
 
 ### Example
 
 **Rust input:**
 ```rust
-pub mod kvstats {
-    pub const ACTIVE_BLOCKS: &str = kvstats_name!("active_blocks");
+pub mod kvrouter {
+    pub const KV_CACHE_EVENTS_APPLIED: &str = "kv_cache_events_applied";
 }
 ```
 
 **Python output:**
 ```python
-class kvstats:
-    ACTIVE_BLOCKS = "kvstats_active_blocks"
+class kvrouter:
+    KV_CACHE_EVENTS_APPLIED = "kv_cache_events_applied"
 ```
 
 ### When to run
diff --git a/lib/bindings/python/codegen/src/gen_python_prometheus_names.rs b/lib/bindings/python/codegen/src/gen_python_prometheus_names.rs
@@ -196,7 +196,7 @@ Parses lib/runtime/src/metrics/prometheus_names.rs and generates a pure Python
 module with 1:1 constant mappings at lib/bindings/python/src/dynamo/prometheus_names.py
 
 This allows Python code to import Prometheus metric constants without Rust bindings:
-    from dynamo.prometheus_names import frontend_service, kvstats
+    from dynamo.prometheus_names import frontend_service
 
 OPTIONS:
     --source PATH    Path to Rust source file