
Add optimization loop performance metrics#981

Open
jia-gao wants to merge 2 commits into llm-d:main from jia-gao:feat/optimization-loop-metrics-914

Conversation

jia-gao commented Apr 5, 2026

Summary

Add timing and throughput metrics for the optimization loop. Closes #914.

New Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| wva_optimization_duration_seconds | Histogram | status | Duration of each optimization cycle |
| wva_models_processed_total | Counter | (none) | Total models processed across cycles |

Histogram buckets: {0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10} seconds

Status labels: success, error
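As a rough stdlib-only illustration of how Prometheus assigns an observation to these cumulative buckets (the real code uses the Prometheus client library; bucketFor is a hypothetical helper, not part of this PR):

```go
package main

import (
	"fmt"
	"sort"
)

// Upper bounds of the histogram buckets proposed in this PR.
var buckets = []float64{0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}

// bucketFor returns the upper bound of the first bucket an observed
// duration falls into; ok=false means it would land in the implicit
// +Inf bucket that Prometheus always adds.
func bucketFor(seconds float64) (bound float64, ok bool) {
	i := sort.SearchFloat64s(buckets, seconds)
	if i == len(buckets) {
		return 0, false
	}
	return buckets[i], true
}

func main() {
	b, _ := bucketFor(0.3) // a 300ms optimization cycle
	fmt.Println(b)         // 0.5
}
```

Because Prometheus buckets are cumulative, a 300ms cycle is counted in the 0.5s bucket and every larger one.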

Implementation

  • internal/constants/metrics.go — New metric name and label constants
  • internal/metrics/metrics.go — Register histogram + counter in InitMetrics(), add ObserveOptimizationDuration() and IncrModelsProcessed() methods on MetricsEmitter
  • internal/engines/saturation/engine.go — Instrument optimize() with a deferred timing observation that captures duration and status (success/error), and tracks models processed per cycle
  • internal/metrics/optimization_metrics_test.go — Unit tests
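The deferred timing observation described above can be sketched in isolation (stdlib only; the stub emitter and the fail flag stand in for the real MetricsEmitter and error paths):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// lastStatus records what the stub emitter saw; in the real code this is
// a histogram observation on the Prometheus client.
var lastStatus string

func observeOptimizationDuration(seconds float64, status string) {
	lastStatus = status
}

// optimize shows the instrumentation shape: a deferred closure runs on
// every return path and derives the status label from the final error.
func optimize(fail bool) (retErr error) {
	start := time.Now()
	defer func() {
		status := "success"
		if retErr != nil {
			status = "error"
		}
		observeOptimizationDuration(time.Since(start).Seconds(), status)
	}()
	if fail {
		return errors.New("optimization failed") // captured via retErr
	}
	return nil
}

func main() {
	_ = optimize(false)
	fmt.Println(lastStatus) // success
	_ = optimize(true)
	fmt.Println(lastStatus) // error
}
```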

Tests

=== RUN   TestObserveOptimizationDuration
--- PASS: TestObserveOptimizationDuration (0.00s)
=== RUN   TestIncrModelsProcessed
--- PASS: TestIncrModelsProcessed (0.00s)
=== RUN   TestObserveOptimizationDuration_NilSafety
--- PASS: TestObserveOptimizationDuration_NilSafety (0.00s)
PASS

Acceptance Criteria (from issue)

  • Duration histogram emitted per optimization cycle
  • Models-processed counter incremented correctly
  • Unit tests verify histogram observation and counter increment

Add two Prometheus metrics to track optimization loop performance:
- wva_optimization_duration_seconds: histogram tracking duration of each
  optimization cycle, labeled by status (success/error)
- wva_models_processed_total: counter tracking total models processed

Implementation:
- internal/constants/metrics.go: add metric name and label constants
- internal/metrics/metrics.go: register histogram and counter, add
  ObserveOptimizationDuration and IncrModelsProcessed helper methods
- internal/engines/saturation/engine.go: instrument optimize() with
  deferred timing observation and model count tracking
- internal/metrics/optimization_metrics_test.go: unit tests verifying
  histogram observation, counter increment, and nil safety

Closes llm-d#914

jia-gao commented Apr 5, 2026

Note on the status label: The issue spec mentions partial as a possible status value. The current implementation only emits success or error because optimize() either completes fully or returns an error — there's no partial completion path today.

If a partial status is needed in the future (e.g., when some model groups succeed but others fail within a single cycle), it can be added without breaking the histogram schema since it's just a new label value.

ev-shindin (Collaborator):

/ok-to-test


github-actions bot commented Apr 5, 2026

🚀 Kind E2E (full) triggered by /ok-to-test

View the Kind E2E workflow run


github-actions bot commented Apr 5, 2026

🚀 OpenShift E2E — approve and run (/ok-to-test)

View the OpenShift E2E workflow run

ev-shindin (Collaborator) left a comment

Please address review comments

// optimize performs the optimization logic.
func (e *Engine) optimize(ctx context.Context) error {
start := time.Now()
var optimizeErr error
ev-shindin (Collaborator):

The manual optimizeErr = err pattern before each return err (lines 226, 241, 329) is fragile — any new return path added later will silently miss the metric. The idiomatic Go approach uses a named return value, which automatically captures all return paths:

func (e *Engine) optimize(ctx context.Context) (retErr error) {
    start := time.Now()
    var modelsProcessed int
    defer func() {
        status := "success"
        if retErr != nil {
            status = "error"
        }
        e.metricsEmitter.ObserveOptimizationDuration(time.Since(start).Seconds(), status)
        if modelsProcessed > 0 {
            e.metricsEmitter.IncrModelsProcessed(modelsProcessed)
        }
    }()
    // ... existing code unchanged, every "return err" is captured via retErr
}

This eliminates all three optimizeErr = err assignments.

jia-gao (Author):

Good call — switched to a named return (retErr error) and removed all the manual optimizeErr = err assignments.

Help: "Duration of optimization loop cycles in seconds",
Buckets: []float64{0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10},
},
[]string{constants.LabelStatus},
ev-shindin (Collaborator):

All existing WVA metrics conditionally add a controller_instance label when the CONTROLLER_INSTANCE env var is set (see InitMetrics lines 45-48 where baseLabels and scalingLabels get the extra label appended). The new optimizationDuration histogram only has LabelStatus and modelsProcessedTotal has no labels at all. In HA deployments with multiple controller instances, these metrics will collide.

jia-gao (Author):

Good catch, missed this pattern. Added controller_instance conditionally to the histogram labels, same as the existing metrics. The gauge (modelsProcessedGauge) is a singleton (no labels), so it won't collide in HA — it reflects the local controller's last cycle.

optimizer pipeline.ScalingOptimizer

// metricsEmitter emits optimization loop performance metrics (duration, models processed).
metricsEmitter *metrics.MetricsEmitter
ev-shindin (Collaborator):

The Engine struct didn't previously have a metricsEmitter — the existing MetricsEmitter is used downstream in applySaturationDecisions. Adding a second instance creates two separate entry points for metric emission. Since MetricsEmitter is a stateless empty struct backed by package-level vars, this works but is a bit confusing. Consider either:

  • Package-level helper functions (no emitter instance needed), or
  • Receiving the emitter as a NewEngine parameter so there's a single instance

jia-gao (Author):

Switched to package-level functions metrics.ObserveOptimizationDuration() and metrics.SetModelsProcessed().
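The package-level helper shape, including the nil-safety the unit tests exercise, might look roughly like this (histogramStub is a stand-in for the real Prometheus histogram vector; this is a sketch, not the PR's code):

```go
package main

import "fmt"

// histogramStub stands in for the real Prometheus histogram vector.
type histogramStub struct{ observations int }

func (h *histogramStub) observe(float64, string) { h.observations++ }

// optimizationDuration stays nil until InitMetrics registers the metric.
var optimizationDuration *histogramStub

// ObserveOptimizationDuration is package-level: callers need no emitter
// instance, and a nil (unregistered) metric makes the call a safe no-op,
// which is what the *_NilSafety test checks.
func ObserveOptimizationDuration(seconds float64, status string) {
	if optimizationDuration == nil {
		return
	}
	optimizationDuration.observe(seconds, status)
}

func main() {
	ObserveOptimizationDuration(0.2, "success") // no-op before registration
	optimizationDuration = &histogramStub{}
	ObserveOptimizationDuration(0.2, "success")
	fmt.Println(optimizationDuration.observations) // 1
}
```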

WVAOptimizationDurationSeconds = "wva_optimization_duration_seconds"

// WVAModelsProcessedTotal is a counter that tracks the total number of models processed across optimization cycles.
WVAModelsProcessedTotal = "wva_models_processed_total"
ev-shindin (Collaborator):

A monotonic counter of "total models processed" requires rate() to be useful, which gives models/sec — not a very meaningful signal for this use case. A gauge ("models in last cycle") would be more directly dashboardable and useful for alerting (e.g., "model count dropped to 0"). Worth considering whether this counter will actually drive alerts or dashboards as-is.

jia-gao (Author):

Makes sense — changed to a gauge. SetModelsProcessed(n) now reflects the last cycle directly, no rate() needed.

return err
}

modelsProcessed = len(modelGroups)
ev-shindin (Collaborator):

modelsProcessed is set after applySaturationDecisions succeeds, but modelGroups is computed much earlier (line 228). This means a failure in applySaturationDecisions reports modelsProcessed=0 even though models were analyzed — only the apply step failed. Consider moving this assignment to right after modelGroups is computed, so the metric accurately reflects how many models entered the optimization pipeline regardless of apply outcome.

jia-gao (Author):

Moved the assignment to right after modelGroups is computed.

}

// ObserveOptimizationDuration records the duration of an optimization cycle with the given status.
// Status should be one of: "success", "error", "partial".
ev-shindin (Collaborator):

The comment documents partial as a valid status, but the engine code only ever emits success or error. Either remove partial from the comment until it's implemented, or add a note that it's reserved for future use — otherwise someone might grep for "partial" and be confused when it's never emitted.

jia-gao (Author):

Removed partial from the comment. It can be added later when a partial completion path actually exists.

Changes based on review by @ev-shindin:

1. Use named return value (retErr) instead of manual optimizeErr
   assignments — idiomatic Go, captures all return paths automatically
2. Add controller_instance label to optimization duration histogram
   for HA deployment compatibility
3. Replace MetricsEmitter methods with package-level functions
   (ObserveOptimizationDuration, SetModelsProcessed) — avoids adding
   a second emitter instance on the Engine struct
4. Change models-processed from counter to gauge — directly shows
   models in last cycle, more useful for dashboards and alerting
5. Move modelsProcessed assignment to right after modelGroups is
   computed — accurately reflects models entering the pipeline
   regardless of apply outcome
6. Remove "partial" from status label documentation — not currently
   emitted, can be added when a partial completion path exists
