feat: Add TrainerStatus Collector to surface TrainJob progress and persist convergence history #2643

Open
sameerdattav wants to merge 1 commit into kubeflow:master from sameerdattav:trainerstatus-collector

@sameerdattav

Overview

This PR integrates Katib with Kubeflow Trainer’s progress tracking (TrainJob.status.trainerStatus, introduced in kubeflow/trainer#3227) to enable real-time visibility into trial progress and convergence.

  • Trainer acts as the publisher of live training progress
  • Katib becomes a consumer, surfacing progress and persisting metric history

This removes the need for log scraping / sidecars and aligns with a push-based metrics model for HPO workflows.


What’s included

  • Added a new metrics collector: TrainerStatus
  • During Trial reconciliation:
    • Read TrainJob.status.trainerStatus
    • Mirror latest snapshot into Trial.status.trainerStatus
    • Persist metrics as time-series (MetricLog) in Katib DB-manager
  • Added .status.trialsProgress in Experiment for lightweight per-trial progress
  • Disabled sidecar injection for this collector
  • Validation restricts usage to Trainer TrainJob templates
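
The reconciliation steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `TrainerStatus`, `Metric`, `MetricLog`, `Trial`, and `MetricStore` types are simplified, hypothetical stand-ins, and the real `trainerStatus` schema is defined by Kubeflow Trainer and may differ. `ValidateCollector` and `SyncTrainerStatus` are likewise illustrative names.

```go
package main

import "errors"

// Metric is a hypothetical name/value pair; the real trainerStatus schema
// is defined by Kubeflow Trainer and may differ.
type Metric struct {
	Name  string
	Value string
}

// TrainerStatus stands in for TrainJob.status.trainerStatus (assumed shape).
type TrainerStatus struct {
	LastUpdatedUnix int64
	Metrics         []Metric
}

// MetricLog is a simplified time-series entry destined for the DB-manager.
type MetricLog struct {
	TrialName string
	UnixTime  int64
	Metric    Metric
}

// Trial holds only the fields relevant to this sketch.
type Trial struct {
	Name          string
	TrainerStatus *TrainerStatus // mirror of the latest snapshot
}

// MetricStore abstracts the DB-manager write path (hypothetical interface).
type MetricStore interface {
	Report(logs []MetricLog) error
}

// MemoryStore is an in-memory MetricStore used for illustration.
type MemoryStore struct{ Logs []MetricLog }

func (s *MemoryStore) Report(logs []MetricLog) error {
	s.Logs = append(s.Logs, logs...)
	return nil
}

var ErrUnsupportedTemplate = errors.New("TrainerStatus collector requires a TrainJob trial template")

// ValidateCollector rejects the TrainerStatus collector for non-TrainJob templates.
func ValidateCollector(collectorKind, templateKind string) error {
	if collectorKind == "TrainerStatus" && templateKind != "TrainJob" {
		return ErrUnsupportedTemplate
	}
	return nil
}

// SyncTrainerStatus mirrors the snapshot into the Trial and persists its
// metrics as time-series entries. A nil snapshot (nothing published yet,
// or the feature gate is off) is a no-op.
func SyncTrainerStatus(trial *Trial, snapshot *TrainerStatus, store MetricStore) error {
	if snapshot == nil {
		return nil
	}
	// Copy so later changes to the source do not alias the Trial's mirror.
	mirrored := *snapshot
	mirrored.Metrics = append([]Metric(nil), snapshot.Metrics...)
	trial.TrainerStatus = &mirrored

	logs := make([]MetricLog, 0, len(snapshot.Metrics))
	for _, m := range snapshot.Metrics {
		logs = append(logs, MetricLog{TrialName: trial.Name, UnixTime: snapshot.LastUpdatedUnix, Metric: m})
	}
	if len(logs) == 0 {
		return nil
	}
	return store.Report(logs)
}
```

In this sketch the Trial keeps only the latest snapshot, while every call with a non-empty snapshot appends to the store, which matches the split between live status and persisted history described in this PR.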

Metrics flow (single source of truth)

TrainJob.status.trainerStatus is treated as the real-time source of truth.

Katib:

  • Reads latest snapshot from TrainJob status
  • Persists history to DB-manager for:
    • convergence visualization
    • future scheduler / early stopping logic
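
Because reconciliation can observe the same snapshot more than once, persisting history likely needs to be idempotent. The PR description does not say how duplicates are handled, so the sketch below is only one possible approach, with hypothetical names (`History`, `Observation`): it keys each observation by metric name and timestamp and skips repeats, then returns a time-ordered series suitable for a convergence plot.

```go
package main

import "sort"

// Observation is a hypothetical single data point of one metric.
type Observation struct {
	Metric   string
	UnixTime int64
	Value    float64
}

// obsKey identifies an observation for deduplication purposes.
type obsKey struct {
	metric   string
	unixTime int64
}

// History is an in-memory, append-only store that ignores duplicates.
type History struct {
	seen map[obsKey]bool
	obs  []Observation
}

func NewHistory() *History {
	return &History{seen: make(map[obsKey]bool)}
}

// Append stores o unless an observation with the same metric name and
// timestamp was already recorded. It reports whether o was stored.
func (h *History) Append(o Observation) bool {
	k := obsKey{metric: o.Metric, unixTime: o.UnixTime}
	if h.seen[k] {
		return false
	}
	h.seen[k] = true
	h.obs = append(h.obs, o)
	return true
}

// Series returns the values of one metric ordered by timestamp.
func (h *History) Series(metric string) []float64 {
	var pts []Observation
	for _, o := range h.obs {
		if o.Metric == metric {
			pts = append(pts, o)
		}
	}
	sort.Slice(pts, func(i, j int) bool { return pts[i].UnixTime < pts[j].UnixTime })
	vals := make([]float64, len(pts))
	for i, p := range pts {
		vals[i] = p.Value
	}
	return vals
}
```

Keying on the snapshot timestamp assumes the trainer updates it whenever metrics change; if that does not hold, a step counter or resource version would be a safer key.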

Design clarification (metrics pipeline)

Current design discussions mix two push pipelines:

  1. Training → Katib DB-manager (SDK/gRPC)
  2. Training → Trainer → TrainJob.status.trainerStatus

This creates two independent metric streams, leading to:

  • duplicate / inconsistent data
  • unclear source of truth
  • added complexity in merging/deduplication

This PR adopts a clearer model:

TrainerStatus = source of truth (real-time)
Katib DB = archival layer (history)

This aligns better with the proposed OptimizationJob CRD, where:

  • Trainer owns execution + progress
  • Katib focuses on orchestration, scheduling, and history

Scope

This PR delivers a minimal vertical slice:

  • Read TrainerStatus
  • Surface real-time progress
  • Persist metric history
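
As a rough illustration of surfacing per-trial progress at the Experiment level, the sketch below builds a compact summary from per-trial percentages. The `TrialProgress` type, its fields, and `SummarizeProgress` are hypothetical; the actual shape of `.status.trialsProgress` is defined by the PR's API changes and is not reproduced here.

```go
package main

import "sort"

// TrialProgress is a hypothetical lightweight per-trial entry.
type TrialProgress struct {
	TrialName       string
	ProgressPercent int
}

// SummarizeProgress converts a trial-name-to-percent map into a list
// sorted by trial name, clamping each value to the range 0..100.
// Sorting keeps the output deterministic, so an unchanged set of trials
// yields an identical summary on every reconcile.
func SummarizeProgress(percentByTrial map[string]int) []TrialProgress {
	out := make([]TrialProgress, 0, len(percentByTrial))
	for name, pct := range percentByTrial {
		if pct < 0 {
			pct = 0
		}
		if pct > 100 {
			pct = 100
		}
		out = append(out, TrialProgress{TrialName: name, ProgressPercent: pct})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].TrialName < out[j].TrialName })
	return out
}
```

A deterministic order matters in a controller because Go map iteration order is randomized, and a summary that reorders itself would look like a status change on every pass.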

Follow-ups will include:

  • Expanded testing (envtest, edge cases)
  • Convergence-aware schedulers (median, Hyperband, etc.)
  • UI support for convergence graphs

Notes

  • Requires enabling the Trainer TrainJobStatus feature gate
  • If disabled, no progress will be emitted or collected

@andreyvelich, @abhijeet-dhumal, @akshaychitneni
Addresses issue #2637

Signed-off-by: Surya Sameer Datta Vaddadi <f20220373@goa.bits-pilani.ac.in>
@github-actions

🎉 Welcome to the Kubeflow Katib repo! 🎉

Thanks for opening your first PR! We're excited to have you onboard 🚀

Feel free to ask questions in the comments. Thanks again for contributing! 🙏

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign gaocegege for approval. For more information see the Kubernetes Code Review Process.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
