feat: Ship optional default Grafana dashboard via Helm #3445

Open
sameerdattav wants to merge 1 commit into kubeflow:master from sameerdattav:grafana-dashboard

@sameerdattav (Contributor) commented Apr 21, 2026

Summary

Adds an optional Grafana dashboard for Kubeflow Trainer, delivered via Helm as a ConfigMap. Disabled by default, so clusters without Grafana remain unaffected.

Motivation

Provide quick, standardized visibility into controller health and TrainJob activity without requiring custom dashboards.

Changes

  • Added dashboard JSON (kubeflow-trainer-dashboard.json)

  • Added gated ConfigMap template using .Files.Get

  • Introduced Helm values:

    • grafanaDashboard.enabled (default: false)
    • grafanaDashboard.labels
    • grafanaDashboard.annotations
  • Updated chart documentation
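A minimal sketch of how the new values block might look in values.yaml. Only the three keys and the `enabled: false` default come from this PR; the empty-map defaults for `labels` and `annotations` and the comments are assumptions for illustration.

```yaml
grafanaDashboard:
  # Render the dashboard ConfigMap only when true (default per this PR: false).
  enabled: false
  # Extra labels added to the ConfigMap (empty default is an assumption).
  labels: {}
  # Extra annotations added to the ConfigMap (empty default is an assumption).
  annotations: {}
```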

Enable

helm install kubeflow-trainer oci://ghcr.io/kubeflow/charts/kubeflow-trainer \
  --version 2.1.0 \
  --set grafanaDashboard.enabled=true
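The same setting can be kept in a values file, which is also a convenient place for the label a Grafana dashboard sidecar watches for. The sketch below assumes a sidecar configured to pick up ConfigMaps labelled `grafana_dashboard: "1"` (a common kube-prometheus-stack setup); the file name and the label value are assumptions, so adjust them to your Grafana installation.

```yaml
# custom-values.yaml (hypothetical file name)
grafanaDashboard:
  enabled: true
  labels:
    # Assumed sidecar discovery label; match it to your Grafana sidecar config.
    grafana_dashboard: "1"
```

Pass the file with `-f custom-values.yaml` in place of the `--set` flag.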

Coverage

  • Controller health: scrape status, goroutines, memory
  • Reconcile metrics: rate, errors, latency (p95)
  • Workqueue: depth, retries
  • TrainJob activity (via controller metrics proxy)
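For a concrete picture of the list above, here is a rough sketch of the kind of PromQL such panels are typically built on, written as a Prometheus recording-rule file only to keep the example in YAML. It assumes the controller exposes the standard controller-runtime metrics (`controller_runtime_reconcile_total`, `controller_runtime_reconcile_errors_total`, `controller_runtime_reconcile_time_seconds_bucket`, `workqueue_depth`, `workqueue_retries_total`). The group and rule names are hypothetical, and the actual queries in kubeflow-trainer-dashboard.json may differ.

```yaml
groups:
  - name: kubeflow-trainer-dashboard-sketch  # hypothetical group name
    rules:
      # Reconcile rate per controller over the last 5 minutes.
      - record: trainer:reconcile:rate5m
        expr: sum by (controller) (rate(controller_runtime_reconcile_total[5m]))
      # Reconcile error rate per controller.
      - record: trainer:reconcile_errors:rate5m
        expr: sum by (controller) (rate(controller_runtime_reconcile_errors_total[5m]))
      # p95 reconcile latency from the duration histogram.
      - record: trainer:reconcile_latency:p95
        expr: >-
          histogram_quantile(0.95,
            sum by (le, controller) (rate(controller_runtime_reconcile_time_seconds_bucket[5m])))
      # Current workqueue depth and retry rate per queue.
      - record: trainer:workqueue_depth
        expr: sum by (name) (workqueue_depth)
      - record: trainer:workqueue_retries:rate5m
        expr: sum by (name) (rate(workqueue_retries_total[5m]))
```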

Limitations

Uses only existing metrics. Native TrainJob lifecycle metrics can be added in a follow-up.

Testing

The ConfigMap is gated behind grafanaDashboard.enabled. Template rendering is expected to be validated in CI.

Fixes: #3430

cc: @andreyvelich, @abhijeet-dhumal

Signed-off-by: Surya Sameer Datta Vaddadi <f20220373@goa.bits-pilani.ac.in>
Copilot AI review requested due to automatic review settings April 21, 2026 19:00
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign johnugeorge for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
  • Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏


Copilot AI left a comment

Pull request overview

Adds an optional, Helm-gated Grafana dashboard for Kubeflow Trainer by packaging a dashboard JSON into a ConfigMap (disabled by default).

Changes:

  • Introduces grafanaDashboard.* Helm values to gate/label/annotate a dashboard ConfigMap.
  • Adds a grafana-dashboard-configmap.yaml template that embeds a dashboard JSON via .Files.Get.
  • Updates chart docs to describe enabling the dashboard.
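A minimal sketch of what such a gated template could look like. Only the `if` guard, the `ConfigMap` kind, and the `name` line are visible in this review; the label and annotation handling, the data key, and the `dashboards/kubeflow-trainer-dashboard.json` path passed to `.Files.Get` are assumptions based on the file list, not the PR's actual template.

```yaml
{{- if .Values.grafanaDashboard.enabled }}
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ include "trainer.fullname" . }}-grafana-dashboard
  {{- with .Values.grafanaDashboard.labels }}
  labels:
    {{- toYaml . | nindent 4 }}
  {{- end }}
  {{- with .Values.grafanaDashboard.annotations }}
  annotations:
    {{- toYaml . | nindent 4 }}
  {{- end }}
data:
  # Assumed data key; the JSON is read verbatim from the chart's files.
  kubeflow-trainer-dashboard.json: |-
    {{- .Files.Get "dashboards/kubeflow-trainer-dashboard.json" | nindent 4 }}
{{- end }}
```

`.Files.Get` returns the file contents without running them through the template engine, so any `{{ }}` placeholders inside the dashboard JSON (for example in legend formats) are left untouched.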

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file

| File | Description |
| --- | --- |
| charts/kubeflow-trainer/values.yaml | Adds grafanaDashboard values to control/label/annotate optional dashboard installation. |
| charts/kubeflow-trainer/templates/grafana-dashboard-configmap.yaml | New conditional ConfigMap template that loads the dashboard JSON from the chart files. |
| charts/kubeflow-trainer/dashboards/kubeflow-trainer-dashboard.json | Adds the default Grafana dashboard definition (PromQL panels + variables). |
| charts/kubeflow-trainer/README.md.gotmpl | Documents the optional dashboard and how to enable it. |
| charts/kubeflow-trainer/README.md | Generated README update reflecting the new values and enablement docs. |

Comment on lines +118 to +121
```bash
helm install kubeflow-trainer oci://ghcr.io/kubeflow/charts/kubeflow-trainer \
  --version 2.1.0 \
  --set grafanaDashboard.enabled=true
```

Copilot AI Apr 21, 2026

The Helm install snippet pins --version 2.1.0, but the chart version in this repo is 2.2.0, which will mislead users. Either omit --version or regenerate README.md from the templated README source.

Comment on lines +1 to +5
{{- if .Values.grafanaDashboard.enabled }}
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ include "trainer.fullname" . }}-grafana-dashboard

Copilot AI Apr 21, 2026

Add helm-unittest coverage for this new conditional template (at least: not rendered when grafanaDashboard.enabled=false, rendered when true, and that custom labels/annotations are applied) since the chart already uses helm-unittest tests for other templates.

Copilot generated this review using guidance from repository custom instructions.
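The requested coverage could be sketched as a helm-unittest suite along these lines. The suite name, the test file location (for example `tests/grafana-dashboard-configmap_test.yaml`), and the sample label and annotation values are assumptions; the assertions (`hasDocuments`, `isKind`, `isSubset`) are standard helm-unittest assertion types.

```yaml
suite: grafana dashboard configmap  # hypothetical suite name
templates:
  - templates/grafana-dashboard-configmap.yaml
tests:
  - it: should not render when disabled
    set:
      grafanaDashboard.enabled: false
    asserts:
      - hasDocuments:
          count: 0

  - it: should render a ConfigMap when enabled
    set:
      grafanaDashboard.enabled: true
    asserts:
      - hasDocuments:
          count: 1
      - isKind:
          of: ConfigMap

  - it: should apply custom labels and annotations
    set:
      grafanaDashboard:
        enabled: true
        labels:
          grafana_dashboard: "1"          # sample value, not from the PR
        annotations:
          example.com/folder: Kubeflow    # sample value, not from the PR
    asserts:
      - isSubset:
          path: metadata.labels
          content:
            grafana_dashboard: "1"
      - isSubset:
          path: metadata.annotations
          content:
            example.com/folder: Kubeflow
```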

```bash
helm install kubeflow-trainer oci://ghcr.io/kubeflow/charts/kubeflow-trainer \
  --version 2.1.0 \
```

Copilot AI Apr 21, 2026

The Helm install snippet pins --version 2.1.0, but the chart version in this repo is 2.2.0, which will mislead users. Either omit --version or template it from the chart version and regenerate README.md.

Suggested change
- --version 2.1.0 \
+ --version {{ .Version }} \

Development

Successfully merging this pull request may close these issues.

Ship a default Grafana dashboard for Kubeflow Trainer
