feat(operator): add controller-level Prometheus metrics and ServiceMonitor#3433
feat(operator): add controller-level Prometheus metrics and ServiceMonitor#34331Ayush-Petwal wants to merge 3 commits intokubeflow:masterfrom
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
🎉 Welcome to the Kubeflow Trainer! 🎉 Thanks for opening your first PR! We're happy to have you as part of our community 🚀 Here's what happens next:
Join the community:
Feel free to ask questions in the comments if you need any help or clarification! |
There was a problem hiding this comment.
Pull request overview
Adds first-class, controller-level Prometheus metrics for Kubeflow Trainer and ships ServiceMonitor resources to enable Prometheus Operator scraping of the controller-manager’s TLS-secured /metrics endpoint.
Changes:
- Introduces
pkg/metricswithkubeflow_trainer_*counters/gauges/histograms plus unit tests. - Instruments the TrainJob controller, runtime framework plugins, and TrainJob validating webhook to emit metrics.
- Adds ServiceMonitor support via Helm (optional) and kustomize (static manifest).
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/webhooks/trainjob_webhook.go | Increments webhook validation counters for create/update outcomes. |
| pkg/version/version.go | Adds build-time version variables used for build_info labels. |
| pkg/runtime/framework/core/framework.go | Observes plugin execution duration and error counters across phases. |
| pkg/runtime/core/core.go | Records “runtimes registered” gauge values during runtime initialization. |
| pkg/metrics/metrics.go | Defines and registers all Trainer Prometheus metrics plus helper recording funcs. |
| pkg/metrics/metrics_test.go | Adds unit tests for metric helpers. |
| pkg/controller/trainjob_controller.go | Instruments reconcile duration and TrainJob lifecycle transition metrics. |
| cmd/trainer-controller-manager/main.go | Registers metrics and publishes a build_info series at startup. |
| manifests/base/manager/service_monitor.yaml | Adds a kustomize ServiceMonitor manifest for scraping /metrics. |
| manifests/base/manager/kustomization.yaml | Includes the new ServiceMonitor in the base manager kustomization. |
| charts/kubeflow-trainer/values.yaml | Adds Helm values for optional ServiceMonitor configuration. |
| charts/kubeflow-trainer/templates/manager/service-monitor.yaml | Adds Helm template for a ServiceMonitor gated by values/CRD presence. |
| go.mod | Adds prometheus/client_golang as a direct dependency. |
Comments suppressed due to low confidence (1)
pkg/controller/trainjob_controller.go:109
- The reconcile "result" label can be incorrect because
reconcileResultis only updated from the localerrvariable and misses early returns (e.g.,Geterrors) anddeadlineErr; consider using named return values and setting the label in the deferred func based on the final returned error.
func (r *TrainJobReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
start := time.Now()
reconcileResult := "success"
defer func() { metrics.ObserveReconcile("trainjob_controller", reconcileResult, time.Since(start)) }()
var trainJob trainer.TrainJob
if err := r.client.Get(ctx, req.NamespacedName, &trainJob); err != nil {
return ctrl.Result{}, client.IgnoreNotFound(err)
}
…nitor Signed-off-by: Ayush Petwal <ayushpetwal.0105@gmail.com>
Signed-off-by: Ayush Petwal <ayushpetwal.0105@gmail.com>
Signed-off-by: Ayush Petwal <ayushpetwal.0105@gmail.com>
49f7b30 to
9295812
Compare
What this PR does / why we need it:
kubeflow_trainer_*Prometheus metrics (counters, gauges, histograms) to the controller-manager:build_infogauge.ServiceMonitor(disabled by default) and a static Kustomize ServiceMonitor manifest for thekubeflownamespace.Which issue(s) this PR fixes :
Fixes #3429
Checklist: