Add optimization loop performance metrics #981
Conversation
Add two Prometheus metrics to track optimization loop performance:

- `wva_optimization_duration_seconds`: histogram tracking the duration of each optimization cycle, labeled by status (success/error)
- `wva_models_processed_total`: counter tracking total models processed

Implementation:

- `internal/constants/metrics.go`: add metric name and label constants
- `internal/metrics/metrics.go`: register the histogram and counter, add `ObserveOptimizationDuration` and `IncrModelsProcessed` helper methods
- `internal/engines/saturation/engine.go`: instrument `optimize()` with a deferred timing observation and model count tracking
- `internal/metrics/optimization_metrics_test.go`: unit tests verifying histogram observation, counter increment, and nil safety

Closes llm-d#914
/ok-to-test

🚀 Kind E2E (full) triggered by

🚀 OpenShift E2E — approve and run
ev-shindin left a comment
Please address review comments
```go
// optimize performs the optimization logic.
func (e *Engine) optimize(ctx context.Context) error {
	start := time.Now()
	var optimizeErr error
```
The manual `optimizeErr = err` pattern before each `return err` (lines 226, 241, 329) is fragile — any new return path added later will silently miss the metric. The idiomatic Go approach is a named return value, which automatically captures all return paths:
```go
func (e *Engine) optimize(ctx context.Context) (retErr error) {
	start := time.Now()
	var modelsProcessed int
	defer func() {
		status := "success"
		if retErr != nil {
			status = "error"
		}
		e.metricsEmitter.ObserveOptimizationDuration(time.Since(start).Seconds(), status)
		if modelsProcessed > 0 {
			e.metricsEmitter.IncrModelsProcessed(modelsProcessed)
		}
	}()
	// ... existing code unchanged; every "return err" is captured via retErr
}
```

This eliminates all three `optimizeErr = err` assignments.
Good call — switched to the named return `(retErr error)` and removed all the manual `optimizeErr = err` assignments.
internal/metrics/metrics.go
Outdated
```go
		Help:    "Duration of optimization loop cycles in seconds",
		Buckets: []float64{0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10},
	},
	[]string{constants.LabelStatus},
```
All existing WVA metrics conditionally add a controller_instance label when the CONTROLLER_INSTANCE env var is set (see InitMetrics lines 45-48 where baseLabels and scalingLabels get the extra label appended). The new optimizationDuration histogram only has LabelStatus and modelsProcessedTotal has no labels at all. In HA deployments with multiple controller instances, these metrics will collide.
Good catch, missed this pattern. Added `controller_instance` conditionally to the histogram labels, same as the existing metrics. The gauge (`modelsProcessedGauge`) is a singleton (no labels), so it won't collide in HA — it reflects the local controller's last cycle.
```go
	optimizer pipeline.ScalingOptimizer

	// metricsEmitter emits optimization loop performance metrics (duration, models processed).
	metricsEmitter *metrics.MetricsEmitter
```
The Engine struct didn't previously have a metricsEmitter — the existing MetricsEmitter is used downstream in applySaturationDecisions. Adding a second instance creates two separate entry points for metric emission. Since MetricsEmitter is a stateless empty struct backed by package-level vars, this works but is a bit confusing. Consider either:

- Package-level helper functions (no emitter instance needed), or
- Receiving the emitter as a `NewEngine` parameter so there's a single instance
Switched to the package-level functions `metrics.ObserveOptimizationDuration()` and `metrics.SetModelsProcessed()`.
```go
	WVAOptimizationDurationSeconds = "wva_optimization_duration_seconds"

	// WVAModelsProcessedTotal is a counter that tracks the total number of models processed across optimization cycles.
	WVAModelsProcessedTotal = "wva_models_processed_total"
```
A monotonic counter of "total models processed" requires rate() to be useful, which gives models/sec — not a very meaningful signal for this use case. A gauge ("models in last cycle") would be more directly dashboardable and useful for alerting (e.g., "model count dropped to 0"). Worth considering whether this counter will actually drive alerts or dashboards as-is.
Makes sense — changed to a gauge. `SetModelsProcessed(n)` now reflects the last cycle directly, no `rate()` needed.
```go
		return err
	}

	modelsProcessed = len(modelGroups)
```
modelsProcessed is set after applySaturationDecisions succeeds, but modelGroups is computed much earlier (line 228). This means a failure in applySaturationDecisions reports modelsProcessed=0 even though models were analyzed — only the apply step failed. Consider moving this assignment to right after modelGroups is computed, so the metric accurately reflects how many models entered the optimization pipeline regardless of apply outcome.
Moved the assignment to right after `modelGroups` is computed.
internal/metrics/metrics.go
Outdated
```go
}

// ObserveOptimizationDuration records the duration of an optimization cycle with the given status.
// Status should be one of: "success", "error", "partial".
```
The comment documents partial as a valid status, but the engine code only ever emits success or error. Either remove partial from the comment until it's implemented, or add a note that it's reserved for future use — otherwise someone might grep for "partial" and be confused when it's never emitted.
Removed `partial` from the comment. It can be added later when a partial completion path actually exists.
Changes based on review by @ev-shindin:

1. Use a named return value (`retErr`) instead of manual `optimizeErr` assignments — idiomatic Go, captures all return paths automatically
2. Add the `controller_instance` label to the optimization duration histogram for HA deployment compatibility
3. Replace `MetricsEmitter` methods with package-level functions (`ObserveOptimizationDuration`, `SetModelsProcessed`) — avoids adding a second emitter instance on the Engine struct
4. Change models-processed from counter to gauge — directly shows models in the last cycle, more useful for dashboards and alerting
5. Move the `modelsProcessed` assignment to right after `modelGroups` is computed — accurately reflects models entering the pipeline regardless of apply outcome
6. Remove "partial" from the status label documentation — not currently emitted, can be added when a partial completion path exists
Summary
Add timing and throughput metrics for the optimization loop. Closes #914.
New Metrics
- `wva_optimization_duration_seconds` — labeled by `status`
- `wva_models_processed_total`

Histogram buckets: `{0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}` seconds
Status labels: `success`, `error`

Implementation
- `internal/constants/metrics.go` — New metric name and label constants
- `internal/metrics/metrics.go` — Register histogram + counter in `InitMetrics()`, add `ObserveOptimizationDuration()` and `IncrModelsProcessed()` methods on `MetricsEmitter`
- `internal/engines/saturation/engine.go` — Instrument `optimize()` with a deferred timing observation that captures duration and status (success/error), and tracks models processed per cycle
- `internal/metrics/optimization_metrics_test.go` — Unit tests

Tests
Acceptance Criteria (from issue)