Commit a264d06

docs: enhance MetricsAvailable condition documentation
- Add MetricsAvailable condition fix to CHANGELOG v0.5.0
- Enhance CRD reference with detailed condition documentation
- Add Operations & Monitoring section to main README
- Include examples of condition usage and kubectl commands
- Link to comprehensive metrics health monitoring guide

This update ensures the MetricsAvailable condition feature (PR #567) is properly documented across all relevant guides.
1 parent 2963cc7 commit a264d06

3 files changed

Lines changed: 107 additions & 1 deletion

README.md

Lines changed: 4 additions & 1 deletion
@@ -64,6 +64,10 @@ See the [Installation Guide](docs/user-guide/installation.md) for detailed instr
6464
- [CRD Reference](docs/user-guide/crd-reference.md)
6565
- [Multi-Controller Isolation](docs/user-guide/multi-controller-isolation.md)
6666

67+
### Operations & Monitoring
68+
- [Metrics Health Monitoring](docs/metrics-health-monitoring.md) - Status conditions and troubleshooting
69+
- [Prometheus Metrics](docs/integrations/prometheus.md)
70+
6771
<!--
6872
6973
### Tutorials
@@ -74,7 +78,6 @@ See the [Installation Guide](docs/user-guide/installation.md) for detailed instr
7478
### Integrations
7579
- [HPA Integration](docs/integrations/hpa-integration.md)
7680
- [KEDA Integration](docs/integrations/keda-integration.md)
77-
- [Prometheus Metrics](docs/integrations/prometheus.md)
7881

7982
<!--
8083

docs/CHANGELOG-v0.5.0.md

Lines changed: 36 additions & 0 deletions
@@ -135,6 +135,40 @@ query := fmt.Sprintf(`vllm_kv_cache_usage{namespace="%s"}`, escapedNamespace)
135135
**Documentation:**
136136
- [Prometheus Integration - PromQL Injection Prevention](integrations/prometheus.md#promql-injection-prevention)
137137

138+
## Bug Fixes
139+
140+
### MetricsAvailable Condition Always Set in Status (PR #567)
141+
142+
**Problem:**
143+
The `MetricsAvailable` condition was not consistently appearing in VariantAutoscaling status, making it difficult for operators to diagnose metrics collection issues.
144+
145+
**Root Cause:**
146+
The condition was being set on a local VA object that was never persisted to the API server. The condition needed to flow through the DecisionCache to reach the controller.
147+
148+
**Solution:**
149+
- Added `MetricsAvailable`, `MetricsReason`, and `MetricsMessage` fields to `VariantDecision` struct
150+
- Engine populates these fields in the decision cache based on whether metrics data is available
151+
- Controller reads from cache and sets the condition on VA status
152+
- Condition is now set even when pods aren't ready yet (MetricsAvailable=False with helpful message)
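
The flow in the bullets above can be sketched in Go. This is a minimal, standard-library-only illustration and not the project's code: the `VariantDecision` field names and the constant names come from this changelog, while the simplified `Condition` type, the map used as a cache, the function names, and the exact "unavailable" message text are assumptions made for the example.

```go
package main

import "fmt"

// Reason and message constants. The reason strings and the "available"
// message follow the CRD reference; the "unavailable" message is assumed.
const (
	MetricsReasonAvailable    = "MetricsAvailable"
	MetricsReasonUnavailable  = "MetricsUnavailable"
	MetricsMessageAvailable   = "Saturation metrics data is available for scaling decisions"
	MetricsMessageUnavailable = "No saturation metrics available"
)

// VariantDecision is a trimmed-down stand-in for the engine's decision record.
type VariantDecision struct {
	MetricsAvailable bool
	MetricsReason    string
	MetricsMessage   string
}

// Condition is a simplified stand-in for a Kubernetes status condition.
type Condition struct {
	Type    string
	Status  string // "True" or "False"
	Reason  string
	Message string
}

// recordDecision plays the engine's role (hypothetical name): it always
// writes a decision to the cache, even when no metrics were collected.
func recordDecision(cache map[string]VariantDecision, variant string, haveMetrics bool) {
	d := VariantDecision{MetricsAvailable: haveMetrics}
	if haveMetrics {
		d.MetricsReason, d.MetricsMessage = MetricsReasonAvailable, MetricsMessageAvailable
	} else {
		d.MetricsReason, d.MetricsMessage = MetricsReasonUnavailable, MetricsMessageUnavailable
	}
	cache[variant] = d
}

// conditionFromDecision plays the controller's role (hypothetical name):
// it turns the cached decision into the condition written to VA status.
func conditionFromDecision(d VariantDecision) Condition {
	status := "False"
	if d.MetricsAvailable {
		status = "True"
	}
	return Condition{
		Type:    "MetricsAvailable",
		Status:  status,
		Reason:  d.MetricsReason,
		Message: d.MetricsMessage,
	}
}

func main() {
	cache := map[string]VariantDecision{}
	recordDecision(cache, "variant-a", false) // pods not ready yet
	c := conditionFromDecision(cache["variant-a"])
	fmt.Printf("%s=%s (%s): %s\n", c.Type, c.Status, c.Reason, c.Message)
}
```

The point of the sketch is that the condition is derived from data the controller actually reads, so it reaches the status on every path, including the one where no metrics exist yet.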
153+
154+
**Behavior:**
155+
- **MetricsAvailable=True**: Metrics data is available (allocation from metrics collection OR decision from saturation analysis)
156+
- **MetricsAvailable=False**: No metrics are available; pods may not be ready, or metrics may not have been scraped yet
157+
158+
**Constants Extracted:**
159+
- `MetricsReasonAvailable` / `MetricsReasonUnavailable`
160+
- `MetricsMessageAvailable` / `MetricsMessageUnavailable`
161+
162+
**Benefits:**
163+
- ✅ Operators can always see metrics availability status
164+
- ✅ Clear diagnostic messages for troubleshooting
165+
- ✅ Consistent condition reporting across all scenarios
166+
- ✅ Better visibility into controller state
167+
168+
**Documentation:**
169+
- [Metrics Health Monitoring](metrics-health-monitoring.md)
170+
- [CRD Reference - Conditions](user-guide/crd-reference.md)
171+
138172
## Minor Improvements
139173

140174
### Helper Functions
@@ -205,7 +239,9 @@ E2E tests updated:
205239
## References
206240

207241
- PR #549: https://github.com/llm-d-incubation/workload-variant-autoscaler/pull/549
242+
- PR #567: https://github.com/llm-d-incubation/workload-variant-autoscaler/pull/567
208243
- Commit: 14e2bd88 - fix: pending-aware scaling and E2E test stability improvements
244+
- Commit: 2963cc73 - fix: always set MetricsAvailable condition in VA status
209245

210246
---
211247

docs/user-guide/crd-reference.md

Lines changed: 67 additions & 0 deletions
@@ -128,3 +128,70 @@ _Appears in:_
128128
| `conditions` _[Condition](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.32/#condition-v1-meta) array_ | Conditions represent the latest available observations of the VariantAutoscaling's state | | Optional: \{\} <br /> |
129129

130130

131+
## Status Conditions
132+
133+
WVA uses standard Kubernetes conditions to report the health and state of each VariantAutoscaling resource.
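
For readers unfamiliar with condition semantics, the sketch below re-implements the usual update rule with the Go standard library only: a condition is matched by `type`, and `lastTransitionTime` moves only when `status` changes. This follows the Kubernetes convention used by helpers such as `meta.SetStatusCondition` in `k8s.io/apimachinery`; whether WVA calls that helper is not stated here, and the `Condition` type and `setCondition` function are hypothetical stand-ins.

```go
package main

import (
	"fmt"
	"time"
)

// Condition is a simplified stand-in for metav1.Condition.
type Condition struct {
	Type               string
	Status             string
	Reason             string
	Message            string
	LastTransitionTime time.Time
}

// setCondition inserts or updates a condition by Type. Reason and Message
// are always refreshed; LastTransitionTime changes only with Status.
func setCondition(conds []Condition, next Condition, now time.Time) []Condition {
	for i := range conds {
		if conds[i].Type != next.Type {
			continue
		}
		if conds[i].Status != next.Status {
			conds[i].Status = next.Status
			conds[i].LastTransitionTime = now
		}
		conds[i].Reason = next.Reason
		conds[i].Message = next.Message
		return conds
	}
	// First time this condition type is reported.
	next.LastTransitionTime = now
	return append(conds, next)
}

func main() {
	t0 := time.Date(2026, 1, 9, 22, 0, 0, 0, time.UTC)
	var conds []Condition
	conds = setCondition(conds, Condition{Type: "MetricsAvailable", Status: "False", Reason: "MetricsUnavailable"}, t0)
	conds = setCondition(conds, Condition{Type: "MetricsAvailable", Status: "True", Reason: "MetricsAvailable"}, t0.Add(time.Minute))
	for _, c := range conds {
		fmt.Printf("%s=%s since %s\n", c.Type, c.Status, c.LastTransitionTime.Format(time.RFC3339))
	}
}
```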
134+
135+
### Condition Types
136+
137+
#### MetricsAvailable
138+
139+
Indicates whether vLLM metrics are available from Prometheus for this variant.
140+
141+
**Status Values:**
142+
- `True`: Metrics data is available
143+
- `False`: No metrics available (pods not ready or metrics not scraped)
144+
145+
**Common Reasons:**
146+
- `MetricsAvailable`: Saturation metrics data is available for scaling decisions
147+
- `MetricsUnavailable`: No saturation metrics available
148+
149+
**Example:**
150+
```yaml
151+
conditions:
152+
- type: MetricsAvailable
153+
status: "True"
154+
reason: MetricsAvailable
155+
message: "Saturation metrics data is available for scaling decisions"
156+
lastTransitionTime: "2026-01-09T22:00:00Z"
157+
```
158+
159+
#### OptimizationReady
160+
161+
Indicates whether the optimization engine ran successfully.
162+
163+
**Status Values:**
164+
- `True`: Optimization completed
165+
- `False`: Optimization failed or cannot run
166+
167+
**Common Reasons:**
168+
- `OptimizationSucceeded`: Optimization completed successfully
169+
- `SaturationOnlyMode`: Operating in saturation-only mode
170+
- `SaturationSafetyOverride`: Saturation safety override applied
171+
172+
**Example:**
173+
```yaml
174+
conditions:
175+
- type: OptimizationReady
176+
status: "True"
177+
reason: OptimizationSucceeded
178+
message: "Hybrid mode: scale-up decision (target: 3 replicas)"
179+
lastTransitionTime: "2026-01-09T22:00:00Z"
180+
```
181+
182+
### Viewing Conditions
183+
184+
```bash
185+
# Quick view with MetricsReady column
186+
kubectl get variantautoscaling -A
187+
188+
# Detailed condition information
189+
kubectl describe variantautoscaling <name> -n <namespace>
190+
191+
# Extract conditions as JSON
192+
kubectl get variantautoscaling <name> -n <namespace> -o jsonpath='{.status.conditions}' | jq
193+
```
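
The JSON emitted by the last command can also be checked programmatically. The following is a small sketch, assuming the output is a JSON array of condition objects with the `type`, `status`, `reason`, and `message` keys shown in the examples above; the `Condition` type and `findCondition` function are illustrative names, not part of WVA, and the sample payload is hard-coded so the program runs without a cluster.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Condition holds the subset of condition fields used in this example.
type Condition struct {
	Type    string `json:"type"`
	Status  string `json:"status"`
	Reason  string `json:"reason"`
	Message string `json:"message"`
}

// findCondition decodes a JSON array of conditions and returns the entry
// whose type matches condType. The bool reports whether it was found.
func findCondition(raw []byte, condType string) (Condition, bool, error) {
	var conds []Condition
	if err := json.Unmarshal(raw, &conds); err != nil {
		return Condition{}, false, err
	}
	for _, c := range conds {
		if c.Type == condType {
			return c, true, nil
		}
	}
	return Condition{}, false, nil
}

func main() {
	// Sample payload shaped like the MetricsAvailable example above.
	sample := []byte(`[{"type":"MetricsAvailable","status":"True",` +
		`"reason":"MetricsAvailable",` +
		`"message":"Saturation metrics data is available for scaling decisions"}]`)

	c, ok, err := findCondition(sample, "MetricsAvailable")
	switch {
	case err != nil:
		fmt.Println("could not parse conditions:", err)
	case !ok:
		fmt.Println("MetricsAvailable condition not reported")
	default:
		fmt.Printf("%s=%s (%s): %s\n", c.Type, c.Status, c.Reason, c.Message)
	}
}
```

A resource with no conditions yet produces empty output from the `jsonpath` query, which this sketch reports as a parse error rather than as a missing condition.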
194+
195+
For comprehensive information on metrics health monitoring and troubleshooting, see [Metrics Health Monitoring](../metrics-health-monitoring.md).
196+
197+
