
Add optimization loop performance metrics#981

Open
jia-gao wants to merge 2 commits into llm-d:main from jia-gao:feat/optimization-loop-metrics-914

Conversation

jia-gao commented Apr 5, 2026

Summary

Add timing and throughput metrics for the optimization loop. Closes #914.

New Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| wva_optimization_duration_seconds | Histogram | status | Duration of each optimization cycle |
| wva_models_processed_total | Counter | (none) | Total models processed across cycles |

Histogram buckets: {0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10} seconds

Status labels: success, error
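As a rough stdlib-only illustration of how Prometheus assigns an observation to these cumulative buckets (the real code uses the Prometheus client library; bucketFor is a hypothetical helper, not part of this PR):

```go
package main

import (
	"fmt"
	"sort"
)

// Upper bounds of the histogram buckets proposed in this PR.
var buckets = []float64{0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}

// bucketFor returns the upper bound of the first bucket an observed
// duration falls into; ok=false means it would land in the implicit
// +Inf bucket that Prometheus always adds.
func bucketFor(seconds float64) (bound float64, ok bool) {
	i := sort.SearchFloat64s(buckets, seconds)
	if i == len(buckets) {
		return 0, false
	}
	return buckets[i], true
}

func main() {
	b, _ := bucketFor(0.3) // a 300ms optimization cycle
	fmt.Println(b)         // 0.5
}
```

Because Prometheus buckets are cumulative, a 300ms cycle is counted in the 0.5s bucket and every larger one.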

Implementation

  • internal/constants/metrics.go — New metric name and label constants
  • internal/metrics/metrics.go — Register histogram + counter in InitMetrics(), add ObserveOptimizationDuration() and IncrModelsProcessed() methods on MetricsEmitter
  • internal/engines/saturation/engine.go — Instrument optimize() with a deferred timing observation that captures duration and status (success/error), and tracks models processed per cycle
  • internal/metrics/optimization_metrics_test.go — Unit tests
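The deferred timing observation described above can be sketched in isolation (stdlib only; the stub emitter and the fail flag stand in for the real MetricsEmitter and error paths):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// lastStatus records what the stub emitter saw; in the real code this is
// a histogram observation on the Prometheus client.
var lastStatus string

func observeOptimizationDuration(seconds float64, status string) {
	lastStatus = status
}

// optimize shows the instrumentation shape: a deferred closure runs on
// every return path and derives the status label from the final error.
func optimize(fail bool) (retErr error) {
	start := time.Now()
	defer func() {
		status := "success"
		if retErr != nil {
			status = "error"
		}
		observeOptimizationDuration(time.Since(start).Seconds(), status)
	}()
	if fail {
		return errors.New("optimization failed") // captured via retErr
	}
	return nil
}

func main() {
	_ = optimize(false)
	fmt.Println(lastStatus) // success
	_ = optimize(true)
	fmt.Println(lastStatus) // error
}
```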

Tests

=== RUN   TestObserveOptimizationDuration
--- PASS: TestObserveOptimizationDuration (0.00s)
=== RUN   TestIncrModelsProcessed
--- PASS: TestIncrModelsProcessed (0.00s)
=== RUN   TestObserveOptimizationDuration_NilSafety
--- PASS: TestObserveOptimizationDuration_NilSafety (0.00s)
PASS

Acceptance Criteria (from issue)

  • Duration histogram emitted per optimization cycle
  • Models-processed counter incremented correctly
  • Unit tests verify histogram observation and counter increment

Add two Prometheus metrics to track optimization loop performance:
- wva_optimization_duration_seconds: histogram tracking duration of each
  optimization cycle, labeled by status (success/error)
- wva_models_processed_total: counter tracking total models processed

Implementation:
- internal/constants/metrics.go: add metric name and label constants
- internal/metrics/metrics.go: register histogram and counter, add
  ObserveOptimizationDuration and IncrModelsProcessed helper methods
- internal/engines/saturation/engine.go: instrument optimize() with
  deferred timing observation and model count tracking
- internal/metrics/optimization_metrics_test.go: unit tests verifying
  histogram observation, counter increment, and nil safety

Closes llm-d#914

jia-gao commented Apr 5, 2026

Note on the status label: The issue spec mentions partial as a possible status value. The current implementation only emits success or error because optimize() either completes fully or returns an error — there's no partial completion path today.

If a partial status is needed in the future (e.g., when some model groups succeed but others fail within a single cycle), it can be added without breaking the histogram schema since it's just a new label value.

ev-shindin (Collaborator):

/ok-to-test


github-actions bot commented Apr 5, 2026

🚀 Kind E2E (full) triggered by /ok-to-test

View the Kind E2E workflow run


github-actions bot commented Apr 5, 2026

🚀 OpenShift E2E — approve and run (/ok-to-test)

View the OpenShift E2E workflow run

ev-shindin (Collaborator) left a comment

Please address review comments

// optimize performs the optimization logic.
func (e *Engine) optimize(ctx context.Context) error {
start := time.Now()
var optimizeErr error
ev-shindin (Collaborator):

The manual optimizeErr = err pattern before each return err (lines 226, 241, 329) is fragile — any new return path added later will silently miss the metric. The idiomatic Go approach uses a named return value, which automatically captures all return paths:

func (e *Engine) optimize(ctx context.Context) (retErr error) {
    start := time.Now()
    var modelsProcessed int
    defer func() {
        status := "success"
        if retErr != nil {
            status = "error"
        }
        e.metricsEmitter.ObserveOptimizationDuration(time.Since(start).Seconds(), status)
        if modelsProcessed > 0 {
            e.metricsEmitter.IncrModelsProcessed(modelsProcessed)
        }
    }()
    // ... existing code unchanged, every "return err" is captured via retErr
}

This eliminates all three optimizeErr = err assignments.

jia-gao (Author):

Good call — switched to a named return (retErr error) and removed all the manual optimizeErr = err assignments.

Help: "Duration of optimization loop cycles in seconds",
Buckets: []float64{0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10},
},
[]string{constants.LabelStatus},
ev-shindin (Collaborator):

All existing WVA metrics conditionally add a controller_instance label when the CONTROLLER_INSTANCE env var is set (see InitMetrics lines 45-48 where baseLabels and scalingLabels get the extra label appended). The new optimizationDuration histogram only has LabelStatus and modelsProcessedTotal has no labels at all. In HA deployments with multiple controller instances, these metrics will collide.

jia-gao (Author):

Good catch, missed this pattern. Added controller_instance conditionally to the histogram labels, same as the existing metrics. The gauge (modelsProcessedGauge) is a singleton (no labels), so it won't collide in HA — it reflects the local controller's last cycle.

optimizer pipeline.ScalingOptimizer

// metricsEmitter emits optimization loop performance metrics (duration, models processed).
metricsEmitter *metrics.MetricsEmitter
ev-shindin (Collaborator):

The Engine struct didn't previously have a metricsEmitter — the existing MetricsEmitter is used downstream in applySaturationDecisions. Adding a second instance creates two separate entry points for metric emission. Since MetricsEmitter is a stateless empty struct backed by package-level vars, this works but is a bit confusing. Consider either:

  • Package-level helper functions (no emitter instance needed), or
  • Receiving the emitter as a NewEngine parameter so there's a single instance

jia-gao (Author):

Switched to package-level functions metrics.ObserveOptimizationDuration() and metrics.SetModelsProcessed().
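The package-level helper shape, including the nil-safety the unit tests exercise, might look roughly like this (histogramStub is a stand-in for the real Prometheus histogram vector; this is a sketch, not the PR's code):

```go
package main

import "fmt"

// histogramStub stands in for the real Prometheus histogram vector.
type histogramStub struct{ observations int }

func (h *histogramStub) observe(float64, string) { h.observations++ }

// optimizationDuration stays nil until InitMetrics registers the metric.
var optimizationDuration *histogramStub

// ObserveOptimizationDuration is package-level: callers need no emitter
// instance, and a nil (unregistered) metric makes the call a safe no-op,
// which is what the *_NilSafety test checks.
func ObserveOptimizationDuration(seconds float64, status string) {
	if optimizationDuration == nil {
		return
	}
	optimizationDuration.observe(seconds, status)
}

func main() {
	ObserveOptimizationDuration(0.2, "success") // no-op before registration
	optimizationDuration = &histogramStub{}
	ObserveOptimizationDuration(0.2, "success")
	fmt.Println(optimizationDuration.observations) // 1
}
```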

WVAOptimizationDurationSeconds = "wva_optimization_duration_seconds"

// WVAModelsProcessedTotal is a counter that tracks the total number of models processed across optimization cycles.
WVAModelsProcessedTotal = "wva_models_processed_total"
ev-shindin (Collaborator):

A monotonic counter of "total models processed" requires rate() to be useful, which gives models/sec — not a very meaningful signal for this use case. A gauge ("models in last cycle") would be more directly dashboardable and useful for alerting (e.g., "model count dropped to 0"). Worth considering whether this counter will actually drive alerts or dashboards as-is.

jia-gao (Author):

Makes sense — changed to a gauge. SetModelsProcessed(n) now reflects the last cycle directly, no rate() needed.

return err
}

modelsProcessed = len(modelGroups)
ev-shindin (Collaborator):

modelsProcessed is set after applySaturationDecisions succeeds, but modelGroups is computed much earlier (line 228). This means a failure in applySaturationDecisions reports modelsProcessed=0 even though models were analyzed — only the apply step failed. Consider moving this assignment to right after modelGroups is computed, so the metric accurately reflects how many models entered the optimization pipeline regardless of apply outcome.

jia-gao (Author):

Moved the assignment to right after modelGroups is computed.

}

// ObserveOptimizationDuration records the duration of an optimization cycle with the given status.
// Status should be one of: "success", "error", "partial".
ev-shindin (Collaborator):

The comment documents partial as a valid status, but the engine code only ever emits success or error. Either remove partial from the comment until it's implemented, or add a note that it's reserved for future use — otherwise someone might grep for "partial" and be confused when it's never emitted.

jia-gao (Author):

Removed partial from the comment. It can be added later when a partial completion path actually exists.

Changes based on review by @ev-shindin:

1. Use named return value (retErr) instead of manual optimizeErr
   assignments — idiomatic Go, captures all return paths automatically
2. Add controller_instance label to optimization duration histogram
   for HA deployment compatibility
3. Replace MetricsEmitter methods with package-level functions
   (ObserveOptimizationDuration, SetModelsProcessed) — avoids adding
   a second emitter instance on the Engine struct
4. Change models-processed from counter to gauge — directly shows
   models in last cycle, more useful for dashboards and alerting
5. Move modelsProcessed assignment to right after modelGroups is
   computed — accurately reflects models entering the pipeline
   regardless of apply outcome
6. Remove "partial" from status label documentation — not currently
   emitted, can be added when a partial completion path exists
