Commit 8e72fb6

docs: DYN-1967 update metrics docs after kvstats removal (#5704)
Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com> Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com>
1 parent e555780 commit 8e72fb6

6 files changed: +7 -36 lines changed

docs/kubernetes/autoscaling.md

Lines changed: 1 addition & 2 deletions
@@ -227,7 +227,6 @@ Dynamo exports several metrics useful for autoscaling. These are available at th
 | `dynamo_frontend_time_to_first_token_seconds` | Histogram | TTFT latency | ✅ Workers |
 | `dynamo_frontend_inter_token_latency_seconds` | Histogram | ITL latency | ✅ Decode |
 | `dynamo_frontend_request_duration_seconds` | Histogram | Total request duration | ⚠️ General |
-| `kvstats_gpu_cache_usage_percent` | Gauge | GPU KV cache usage (0-1) | ✅ Decode |

 #### Metric Labels

@@ -641,7 +640,7 @@ Avoid configuring multiple autoscalers for the same service:
 |--------------|---------------------|---------------|
 | Frontend | CPU utilization, request rate | `dynamo_frontend_requests_total` |
 | Prefill | Queue depth, TTFT | `dynamo_frontend_queued_requests`, `dynamo_frontend_time_to_first_token_seconds` |
-| Decode | KV cache utilization, ITL | `kvstats_gpu_cache_usage_percent`, `dynamo_frontend_inter_token_latency_seconds` |
+| Decode | ITL | `dynamo_frontend_inter_token_latency_seconds` |

 ### 3. Configure Stabilization Windows

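Reviewer note: with this change the Decode row relies on the `dynamo_frontend_inter_token_latency_seconds` histogram alone. As a rough illustration of how a latency percentile can be estimated from cumulative histogram buckets, here is a minimal Python sketch. It uses linear interpolation within a bucket, similar in spirit to Prometheus's `histogram_quantile`. The function name `estimate_quantile` and the sample bucket values are hypothetical and do not come from Dynamo.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs and must
    include a final bucket whose upper bound is math.inf.
    This helper is illustrative only and is not part of Dynamo.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    last_bound, total = buckets[-1]
    if last_bound != math.inf or total == 0:
        return math.nan  # no +Inf bucket or no observations

    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == math.inf:
                # The quantile falls in the overflow bucket: report the
                # highest finite upper bound seen so far.
                return prev_bound
            if count == prev_count:
                return bound  # empty bucket, nothing to interpolate
            # Linear interpolation inside the matching bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical ITL buckets in seconds: (upper bound, cumulative count).
sample_buckets = [(0.01, 10), (0.05, 60), (0.1, 90), (math.inf, 100)]
median_itl = estimate_quantile(0.5, sample_buckets)
```

In a real deployment this calculation would normally be done by the monitoring system rather than by hand. The sketch only shows why a histogram metric is enough to drive an ITL-based scaling threshold.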
docs/observability/metrics.md

Lines changed: 0 additions & 13 deletions
@@ -123,19 +123,6 @@ DYN_SYSTEM_PORT=8081 python -m dynamo.vllm --model <model>
 curl http://localhost:8081/metrics
 ```

-### KV Router Statistics (kvstats)
-
-KV router statistics are automatically exposed by LLM workers and KV router components on the backend system status port (port 8081) with the `dynamo_component_kvstats_*` prefix. These metrics provide insights into GPU memory usage and cache efficiency:
-
-- `dynamo_component_kvstats_active_blocks`: Number of active KV cache blocks currently in use (gauge)
-- `dynamo_component_kvstats_total_blocks`: Total number of KV cache blocks available (gauge)
-- `dynamo_component_kvstats_gpu_cache_usage_percent`: GPU cache usage as a percentage (0.0-1.0) (gauge)
-- `dynamo_component_kvstats_gpu_prefix_cache_hit_rate`: GPU prefix cache hit rate as a percentage (0.0-1.0) (gauge)
-
-These metrics are published by:
-- **LLM Workers**: vLLM and TRT-LLM backends publish these metrics through their respective publishers
-- **KV Router**: The KV router component aggregates and exposes these metrics for load balancing decisions
-
 ### Specialized Component Metrics

 Some components expose additional metrics specific to their functionality:

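Reviewer note: since the section above is being deleted, a quick way to check whether a worker still exposes any `dynamo_component_kvstats_*` series is to scan the text returned by the `/metrics` endpoint. The following Python sketch does that on a string in Prometheus text exposition format. The helper name `find_metric_names` and the sample payload (including its label and values) are hypothetical, made up for illustration.

```python
REMOVED_PREFIX = "dynamo_component_kvstats_"


def find_metric_names(exposition_text, prefix):
    """Return the sorted metric names in a Prometheus text payload that
    start with `prefix`. Comment lines (# HELP, # TYPE) are ignored."""
    names = set()
    for raw_line in exposition_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first '{' (labels) or whitespace.
        parts = line.split("{", 1)[0].split()
        if not parts:
            continue
        if parts[0].startswith(prefix):
            names.add(parts[0])
    return sorted(names)


# Hypothetical payload, e.g. saved from `curl http://localhost:8081/metrics`.
sample_payload = """\
# HELP dynamo_component_kvstats_active_blocks Active KV cache blocks
# TYPE dynamo_component_kvstats_active_blocks gauge
dynamo_component_kvstats_active_blocks{example_label="x"} 12
dynamo_frontend_requests_total 5
"""

leftover = find_metric_names(sample_payload, REMOVED_PREFIX)
```

An empty result would suggest that the running build no longer publishes the removed metrics, assuming the endpoint was reachable and returned a complete payload.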
fern/pages/kubernetes/autoscaling.md

Lines changed: 1 addition & 2 deletions
@@ -233,7 +233,6 @@ Dynamo exports several metrics useful for autoscaling. These are available at th
 | `dynamo_frontend_time_to_first_token_seconds` | Histogram | TTFT latency | ✅ Workers |
 | `dynamo_frontend_inter_token_latency_seconds` | Histogram | ITL latency | ✅ Decode |
 | `dynamo_frontend_request_duration_seconds` | Histogram | Total request duration | ⚠️ General |
-| `kvstats_gpu_cache_usage_percent` | Gauge | GPU KV cache usage (0-1) | ✅ Decode |

 #### Metric Labels

@@ -647,7 +646,7 @@ Avoid configuring multiple autoscalers for the same service:
 |--------------|---------------------|---------------|
 | Frontend | CPU utilization, request rate | `dynamo_frontend_requests_total` |
 | Prefill | Queue depth, TTFT | `dynamo_frontend_queued_requests`, `dynamo_frontend_time_to_first_token_seconds` |
-| Decode | KV cache utilization, ITL | `kvstats_gpu_cache_usage_percent`, `dynamo_frontend_inter_token_latency_seconds` |
+| Decode | ITL | `dynamo_frontend_inter_token_latency_seconds` |

 ### 3. Configure Stabilization Windows

fern/pages/observability/metrics.md

Lines changed: 0 additions & 13 deletions
@@ -122,19 +122,6 @@ DYN_SYSTEM_PORT=8081 python -m dynamo.vllm --model <model>
 curl http://localhost:8081/metrics
 ```

-### KV Router Statistics (kvstats)
-
-KV router statistics are automatically exposed by LLM workers and KV router components on the backend system status port (port 8081) with the `dynamo_component_kvstats_*` prefix. These metrics provide insights into GPU memory usage and cache efficiency:
-
-- `dynamo_component_kvstats_active_blocks`: Number of active KV cache blocks currently in use (gauge)
-- `dynamo_component_kvstats_total_blocks`: Total number of KV cache blocks available (gauge)
-- `dynamo_component_kvstats_gpu_cache_usage_percent`: GPU cache usage as a percentage (0.0-1.0) (gauge)
-- `dynamo_component_kvstats_gpu_prefix_cache_hit_rate`: GPU prefix cache hit rate as a percentage (0.0-1.0) (gauge)
-
-These metrics are published by:
-- **LLM Workers**: vLLM and TRT-LLM backends publish these metrics through their respective publishers
-- **KV Router**: The KV router component aggregates and exposes these metrics for load balancing decisions
-
 ### Specialized Component Metrics

 Some components expose additional metrics specific to their functionality:

lib/bindings/python/codegen/README.md

Lines changed: 4 additions & 5 deletions
@@ -16,21 +16,20 @@ cargo run -p dynamo-codegen --bin gen-python-prometheus-names

 - Parses Rust AST from `lib/runtime/src/metrics/prometheus_names.rs`
 - Generates Python classes with constants at `lib/bindings/python/src/dynamo/prometheus_names.py`
-- Handles macro-generated constants (e.g., `kvstats_name!("active_blocks")` → `"kvstats_active_blocks"`)

 ### Example

 **Rust input:**
 ```rust
-pub mod kvstats {
-    pub const ACTIVE_BLOCKS: &str = kvstats_name!("active_blocks");
+pub mod kvrouter {
+    pub const KV_CACHE_EVENTS_APPLIED: &str = "kv_cache_events_applied";
 }
 ```

 **Python output:**
 ```python
-class kvstats:
-    ACTIVE_BLOCKS = "kvstats_active_blocks"
+class kvrouter:
+    KV_CACHE_EVENTS_APPLIED = "kv_cache_events_applied"
 ```

 ### When to run

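Reviewer note: to make the Rust-to-Python mapping in the README example concrete, here is a small Python sketch that turns `pub const NAME: &str = "value";` entries inside a `pub mod` block into a Python class with matching constants. This is not the real generator, which according to the README parses the Rust AST. The sketch uses regular expressions as a deliberate simplification and ignores macros, nested modules, and braces inside string literals. The function name `rust_consts_to_python` is hypothetical.

```python
import re

# Matches a flat `pub mod name { ... }` block (no nested braces).
MOD_RE = re.compile(r"pub\s+mod\s+(\w+)\s*\{(.*?)\}", re.DOTALL)
# Matches `pub const NAME: &str = "value";` with a plain string literal.
CONST_RE = re.compile(r'pub\s+const\s+(\w+)\s*:\s*&str\s*=\s*"([^"]*)"\s*;')


def rust_consts_to_python(rust_source):
    """Emit Python class definitions mirroring simple Rust string constants.

    Illustrative only: a regex-based stand-in for real AST parsing.
    """
    lines = []
    for mod_name, body in MOD_RE.findall(rust_source):
        constants = CONST_RE.findall(body)
        if not constants:
            continue  # skip modules with no plain string constants
        lines.append(f"class {mod_name}:")
        for const_name, value in constants:
            lines.append(f'    {const_name} = "{value}"')
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Input taken from the updated README example.
RUST_INPUT = """
pub mod kvrouter {
    pub const KV_CACHE_EVENTS_APPLIED: &str = "kv_cache_events_applied";
}
"""

generated = rust_consts_to_python(RUST_INPUT)
```

The removed `kvstats_name!(...)` form would not match `CONST_RE`, which is one reason a real generator needs macro handling. With the macro-based constants gone from the example, the plain-literal case is the one the README now documents.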
lib/bindings/python/codegen/src/gen_python_prometheus_names.rs

Lines changed: 1 addition & 1 deletion
@@ -196,7 +196,7 @@ Parses lib/runtime/src/metrics/prometheus_names.rs and generates a pure Python
 module with 1:1 constant mappings at lib/bindings/python/src/dynamo/prometheus_names.py

 This allows Python code to import Prometheus metric constants without Rust bindings:
-from dynamo.prometheus_names import frontend_service, kvstats
+from dynamo.prometheus_names import frontend_service

 OPTIONS:
 --source PATH Path to Rust source file
