
feat(operator): add controller-level Prometheus metrics and ServiceMonitor #3433

Open
1Ayush-Petwal wants to merge 3 commits into kubeflow:master from 1Ayush-Petwal:feat/controller-metrics

Conversation

@1Ayush-Petwal

What this PR does / why we need it:

  • Adds 14 kubeflow_trainer_* Prometheus metrics (counters, gauges, histograms) to the controller-manager: TrainJob lifecycle events (created/completed/failed/suspended/deleted), reconcile loop duration, plugin execution timing and errors, webhook validation outcomes, an active job gauge, and a build_info gauge.
  • Adds a Helm-managed ServiceMonitor (disabled by default) and a static Kustomize ServiceMonitor manifest for the kubeflow namespace.

Which issue(s) this PR fixes:
Fixes #3429

Checklist:

  • Docs included if any changes are user facing

Copilot AI review requested due to automatic review settings April 16, 2026 14:38
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jeffwan for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
  • Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

@1Ayush-Petwal changed the title from "feat(metrics): add controller-level Prometheus metrics and ServiceMonitor" to "feat(operator): add controller-level Prometheus metrics and ServiceMonitor" on Apr 16, 2026
Contributor

Copilot AI left a comment

Pull request overview

Adds first-class, controller-level Prometheus metrics for Kubeflow Trainer and ships ServiceMonitor resources to enable Prometheus Operator scraping of the controller-manager’s TLS-secured /metrics endpoint.

Changes:

  • Introduces pkg/metrics with kubeflow_trainer_* counters/gauges/histograms plus unit tests.
  • Instruments the TrainJob controller, runtime framework plugins, and TrainJob validating webhook to emit metrics.
  • Adds ServiceMonitor support via Helm (optional) and kustomize (static manifest).

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| pkg/webhooks/trainjob_webhook.go | Increments webhook validation counters for create/update outcomes. |
| pkg/version/version.go | Adds build-time version variables used for build_info labels. |
| pkg/runtime/framework/core/framework.go | Observes plugin execution duration and error counters across phases. |
| pkg/runtime/core/core.go | Records “runtimes registered” gauge values during runtime initialization. |
| pkg/metrics/metrics.go | Defines and registers all Trainer Prometheus metrics plus helper recording funcs. |
| pkg/metrics/metrics_test.go | Adds unit tests for metric helpers. |
| pkg/controller/trainjob_controller.go | Instruments reconcile duration and TrainJob lifecycle transition metrics. |
| cmd/trainer-controller-manager/main.go | Registers metrics and publishes a build_info series at startup. |
| manifests/base/manager/service_monitor.yaml | Adds a kustomize ServiceMonitor manifest for scraping /metrics. |
| manifests/base/manager/kustomization.yaml | Includes the new ServiceMonitor in the base manager kustomization. |
| charts/kubeflow-trainer/values.yaml | Adds Helm values for optional ServiceMonitor configuration. |
| charts/kubeflow-trainer/templates/manager/service-monitor.yaml | Adds Helm template for a ServiceMonitor gated by values/CRD presence. |
| go.mod | Adds prometheus/client_golang as a direct dependency. |

Comments suppressed due to low confidence (1)

pkg/controller/trainjob_controller.go:109

  • The reconcile "result" label can be incorrect because reconcileResult is only updated from the local err variable and misses early returns (e.g., Get errors) and deadlineErr; consider using named return values and setting the label in the deferred func based on the final returned error.
func (r *TrainJobReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	start := time.Now()
	reconcileResult := "success"
	defer func() { metrics.ObserveReconcile("trainjob_controller", reconcileResult, time.Since(start)) }()

	var trainJob trainer.TrainJob
	if err := r.client.Get(ctx, req.NamespacedName, &trainJob); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

Comment thread pkg/controller/trainjob_controller.go Outdated
Comment thread pkg/runtime/core/core.go Outdated
Comment thread manifests/overlays/monitoring/service_monitor.yaml
Comment thread manifests/base/manager/kustomization.yaml Outdated
Comment thread pkg/metrics/metrics_test.go Outdated
Comment thread manifests/overlays/monitoring/service_monitor.yaml Outdated
…nitor

Signed-off-by: Ayush Petwal <ayushpetwal.0105@gmail.com>
Signed-off-by: Ayush Petwal <ayushpetwal.0105@gmail.com>
Signed-off-by: Ayush Petwal <ayushpetwal.0105@gmail.com>
@1Ayush-Petwal force-pushed the feat/controller-metrics branch from 49f7b30 to 9295812 on April 17, 2026 19:38
Development

Successfully merging this pull request may close these issues.

Add controller-level Prometheus metrics and ServiceMonitor

2 participants