What you would like to be added?
Ship a default Grafana dashboard as part of the Helm chart that provides out-of-the-box visibility
into controller health and TrainJob lifecycle.
This revives #1376, filed in 2021 for V1 but
closed as stale.
Challenges
- Dashboard depends on which metrics are available (controller-runtime defaults today, custom
metrics once the companion metrics issue lands)
- Needs to cover two personas: platform operators (controller health, reconcile backlogs)
and ML engineers (TrainJob lifecycle, time-to-running, failure rates)
- Should be gated in Helm values to avoid installing for clusters without Grafana
Why is this needed?
Current State
- No dashboard shipped with Trainer
- Operators must manually discover available metrics, write PromQL, and build panels
- Peer projects (Kueue, Argo Workflows) bundle dashboards - Trainer is a gap
Related Issues
Love this feature?
Give it a 👍 We prioritize the features with most 👍
What you would like to be added?
Ship a default Grafana dashboard as part of the Helm chart that provides out-of-the-box visibility
into controller health and TrainJob lifecycle.
This revives #1376, filed in 2021 for V1 but
closed as stale.
Challenges
metrics once the companion metrics issue lands)
and ML engineers (TrainJob lifecycle, time-to-running, failure rates)
Why is this needed?
Current State
Related Issues
Love this feature?
Give it a 👍 We prioritize the features with most 👍