Ship a default Grafana dashboard for Kubeflow Trainer #3430

@abhijeet-dhumal

What you would like to be added?

Ship a default Grafana dashboard as part of the Helm chart that provides out-of-the-box visibility
into controller health and TrainJob lifecycle.

This revives #1376, filed in 2021 for V1 but
closed as stale.

Challenges

  • The dashboard depends on which metrics are available: controller-runtime defaults today,
    custom metrics once the companion metrics issue lands
  • It needs to cover two personas: platform operators (controller health, reconcile backlogs)
    and ML engineers (TrainJob lifecycle, time-to-running, failure rates)
  • It should be gated behind a Helm value so it is not installed on clusters without Grafana
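
To make the gating point concrete, here is a minimal Python sketch of the decision a chart template would need to make. It is only an illustration, not a proposed implementation: the values key `dashboard.enabled`, the ConfigMap name, and the file name are all hypothetical, and the `grafana_dashboard: "1"` label is the convention commonly used with the Grafana dashboard sidecar, assumed here rather than confirmed for any particular cluster.

```python
import json


def render_dashboard_configmap(values, dashboard):
    """Return a ConfigMap manifest as a dict, or None when the dashboard is disabled.

    `values` mimics parsed Helm values; the `dashboard.enabled` key is hypothetical.
    """
    cfg = values.get("dashboard", {})
    # Default to disabled so clusters without Grafana get nothing extra installed.
    if not cfg.get("enabled", False):
        return None
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            # Hypothetical resource name.
            "name": "kubeflow-trainer-dashboard",
            # Assumed sidecar discovery label; the real label is deployment-specific.
            "labels": {"grafana_dashboard": "1"},
        },
        # The dashboard JSON is stored as a string under a hypothetical file name.
        "data": {"kubeflow-trainer.json": json.dumps(dashboard, indent=2)},
    }
```

Calling `render_dashboard_configmap({}, {"title": "Kubeflow Trainer"})` yields `None`, while passing `{"dashboard": {"enabled": True}}` yields a ConfigMap dict. Defaulting to disabled is one possible design choice; an opt-out default would also be defensible.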

Why is this needed?

Current State

  • No dashboard ships with Trainer
  • Operators must manually discover available metrics, write PromQL, and build panels
  • Peer projects (Kueue, Argo Workflows) bundle dashboards; Trainer does not
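
As a rough illustration of the manual work described above, the following Python sketch assembles a small Grafana dashboard model for the operator persona using only controller-runtime's default metrics. The controller name passed in, the dashboard title, the `schemaVersion` value, and the panel layout are assumptions for illustration. Panels for ML engineers are left out because they would depend on custom metrics that do not exist yet.

```python
import json

# (panel title, PromQL template); %s is replaced by the controller name.
# Metric names are controller-runtime defaults; the label values are assumptions.
QUERIES = [
    ("Reconcile rate by result",
     'sum by (result) (rate(controller_runtime_reconcile_total{controller="%s"}[5m]))'),
    ("Reconcile error rate",
     'sum(rate(controller_runtime_reconcile_errors_total{controller="%s"}[5m]))'),
    ("Reconcile latency p99",
     'histogram_quantile(0.99, sum by (le) '
     '(rate(controller_runtime_reconcile_time_seconds_bucket{controller="%s"}[5m])))'),
    ("Workqueue depth",
     'sum(workqueue_depth{name="%s"})'),
]


def build_dashboard(controller):
    """Build a minimal Grafana dashboard dict with one time series panel per query."""
    panels = []
    for index, (title, template) in enumerate(QUERIES):
        panels.append({
            "id": index + 1,
            "type": "timeseries",
            "title": title,
            # Stack full-width panels vertically, 8 grid rows each.
            "gridPos": {"h": 8, "w": 24, "x": 0, "y": index * 8},
            "targets": [{"refId": "A", "expr": template % controller}],
        })
    return {
        "title": "Kubeflow Trainer (sketch)",  # hypothetical title
        "schemaVersion": 39,                   # assumed; depends on the Grafana version
        "panels": panels,
    }


if __name__ == "__main__":
    # "trainjob" is a hypothetical controller name, not confirmed for Trainer.
    print(json.dumps(build_dashboard("trainjob"), indent=2))
```

The printed JSON could be imported into Grafana by hand to check which panels return data before anything is committed to the chart.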

Related Issues

Love this feature?

Give it a 👍. We prioritize the features with the most 👍.
