feat: Add TrainerStatus Collector to surface TrainJob progress and persist convergence history by sameerdattav · Pull Request #2643 · kubeflow/katib

sameerdattav · 2026-03-22T17:17:21Z

Overview

This PR integrates Katib with Kubeflow Trainer’s progress tracking (TrainJob.status.trainerStatus, see kubeflow/trainer#3227) to give real-time visibility into trial progress and convergence.

Trainer acts as the publisher of live training progress
Katib becomes a consumer, surfacing progress and persisting metric history

This removes the need for log scraping / sidecars and aligns with a push-based metrics model for HPO workflows.

What’s included

- Added a new metrics collector: TrainerStatus
- During Trial reconciliation:
  - Read TrainJob.status.trainerStatus
  - Mirror latest snapshot into Trial.status.trainerStatus
  - Persist metrics as time-series (MetricLog) in Katib DB-manager
- Added .status.trialsProgress in Experiment for lightweight per-trial progress
- Disabled sidecar injection for this collector
- Validation restricts usage to Trainer TrainJob templates
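
The reconciliation steps listed above can be sketched roughly as follows. This is an illustrative Python model only, not the code in this PR (the Katib controller is written in Go). The snapshot field names `lastUpdatedTime`, `progressPercentage`, and `metrics`, the `MetricLog` dataclass, and the helper names are all assumptions made for the example; the actual schema is defined in kubeflow/trainer#3227.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class MetricLog:
    """Illustrative stand-in for one time-series point (not Katib's real type)."""
    trial_name: str
    name: str
    value: str
    timestamp: str


def read_trainer_status(trainjob: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return TrainJob.status.trainerStatus, or None if nothing was published."""
    return (trainjob.get("status") or {}).get("trainerStatus")


def reconcile_trial(trial: dict[str, Any], trainjob: dict[str, Any]) -> list[MetricLog]:
    """Mirror the latest snapshot into the Trial and return points to persist."""
    snapshot = read_trainer_status(trainjob)
    if snapshot is None:
        # Nothing published yet (for example, the feature gate is disabled).
        return []

    # Step 2: mirror the latest snapshot into Trial.status.trainerStatus.
    trial.setdefault("status", {})["trainerStatus"] = dict(snapshot)

    # Step 3: turn each metric in the snapshot into a time-series point.
    # "lastUpdatedTime" and "metrics" are assumed field names.
    timestamp = snapshot.get("lastUpdatedTime", "")
    return [
        MetricLog(trial["metadata"]["name"], m["name"], str(m["value"]), timestamp)
        for m in snapshot.get("metrics", [])
    ]
```

In this sketch a missing `trainerStatus` simply yields no points, which matches the note that nothing is collected when Trainer does not publish progress.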

Metrics flow (single source of truth)

TrainJob.status.trainerStatus is treated as the real-time source of truth.

Katib:

- Reads the latest snapshot from TrainJob status
- Persists history to DB-manager for:
  - convergence visualization
  - future scheduler / early stopping logic
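
To illustrate how a snapshot-only source can feed a history used for convergence plots, here is a minimal Python sketch. It is not the DB-manager API: `MetricHistory`, `record`, `series`, and `best_so_far` are hypothetical names, and deduplicating on (metric name, timestamp) is an assumption made for the example. The point is that a reconcile loop can observe the same snapshot more than once, so writes to the archive should be idempotent.

```python
from dataclasses import dataclass, field


@dataclass
class MetricHistory:
    """In-memory stand-in for the archival layer (hypothetical)."""
    # Maps (metric name, ISO-8601 timestamp) to the observed value.
    points: dict = field(default_factory=dict)

    def record(self, name: str, timestamp: str, value: float) -> bool:
        """Store a point once; return False if this snapshot was already seen."""
        key = (name, timestamp)
        if key in self.points:
            return False
        self.points[key] = value
        return True

    def series(self, name: str) -> list:
        """Return the values for one metric, ordered by timestamp."""
        # Same-format ISO-8601 UTC timestamps sort correctly as strings.
        return [v for (n, _ts), v in sorted(self.points.items()) if n == name]


def best_so_far(values: list, minimize: bool = True) -> list:
    """Running best value, the usual shape of a convergence curve."""
    pick = min if minimize else max
    curve = []
    for v in values:
        curve.append(v if not curve else pick(curve[-1], v))
    return curve
```

A convergence-aware scheduler or a UI graph could then read `best_so_far(history.series("loss"))` instead of re-deriving history from live status.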

Design clarification (metrics pipeline)

Current design discussions mix two push pipelines:

Training → Katib DB-manager (SDK/gRPC)
Training → Trainer → TrainJob.status.trainerStatus

This creates two independent metric streams, leading to:

duplicate / inconsistent data
unclear source of truth
added complexity in merging/deduplication

This PR adopts a clearer model:

TrainerStatus = source of truth (real-time)
Katib DB = archival layer (history)

This aligns better with the proposed OptimizationJob CRD, where:

Trainer owns execution + progress
Katib focuses on orchestration, scheduling, and history

Scope

This PR delivers a minimal vertical slice:

Read TrainerStatus
Surface real-time progress
Persist metric history

Follow-ups will include:

Expanded testing (envtest, edge cases)
Convergence-aware schedulers (median, Hyperband, etc.)
UI support for convergence graphs

Notes

Requires enabling the Trainer TrainJobStatus feature gate
If disabled, no progress will be emitted or collected

@andreyvelich, @abhijeet-dhumal, @akshaychitneni
Addresses issue #2637

Signed-off-by: Surya Sameer Datta Vaddadi <f20220373@goa.bits-pilani.ac.in>

github-actions · 2026-03-22T17:17:30Z

🎉 Welcome to the Kubeflow Katib repo! 🎉

Thanks for opening your first PR! We're excited to have you onboard 🚀

Next steps:

Our team will review your PR soon! cc @kubeflow/wg-automl-leads
Check out the Contributing Guide and the Kubeflow Contributor Guide
Join the Kubeflow Slack channels: https://www.kubeflow.org/docs/about/community/#kubeflow-slack-channels
Join the AutoML & Training WG meetings: https://bit.ly/2PWVCkV

Feel free to ask questions in the comments. Thanks again for contributing! 🙏

google-oss-prow · 2026-03-22T17:17:39Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign gaocegege for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Init for metrics pusher

f904da9

Signed-off-by: Surya Sameer Datta Vaddadi <f20220373@goa.bits-pilani.ac.in>

google-oss-prow Bot added the size/L label Mar 22, 2026

google-oss-prow Bot requested review from andreyvelich, anencore94 and johnugeorge March 22, 2026 17:17

sameerdattav mentioned this pull request Mar 22, 2026

TrainerStatus Collector: Real-time HPO trial convergence tracking via TrainJob trainerStatus #2637

Open

sameerdattav wants to merge 1 commit into kubeflow:master from sameerdattav:trainerstatus-collector
