KEP-3310: Shared initializer support for train jobs #3311

Open
akshaychitneni wants to merge 1 commit into kubeflow:master from akshaychitneni:sharedinit

@akshaychitneni
Contributor

What this PR does / why we need it:
This KEP proposes a SharedInitializer plugin that lets multiple TrainJobs share a single initialized data source. The first TrainJob to run initializes the data; subsequent TrainJobs automatically detect and reuse it, skipping initialization entirely. Both disk-based (PVC) and cache-based (LeaderWorkerSet) sharing are supported.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #3310

Checklist:

  • Docs included if any changes are user facing

Copilot AI review requested due to automatic review settings March 11, 2026 18:42
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign gaocegege for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: Akshay Chitneni <achitneni@apple.com>
Contributor

Copilot AI left a comment

Pull request overview

This PR adds a new KEP describing a proposed SharedInitializer plugin for the Kubeflow Trainer framework, enabling multiple TrainJobs to reuse shared, pre-initialized dataset/model artifacts (PVC-based or cache-service-based) instead of re-initializing per job.

Changes:

  • Introduces a label-based design for discovering and reusing shared PVCs and cache Services across TrainJobs.
  • Specifies durability/readiness tracking via an initialization-status PVC label (pending/complete/failed) and requeue behavior.
  • Proposes new API fields (initializer.sharedSource) and outlines required RBAC and related JobSet considerations.
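
As a rough illustration of the status-label flow summarized above, the following Go sketch maps the label on a discovered shared PVC to the next step for a TrainJob. The label key `trainer.kubeflow.org/initialization-status`, the `Action` names, and the handling of `failed` and unrecognized values are assumptions made for illustration only; the KEP text defines the actual behavior.

```go
package main

import "fmt"

// statusLabel is a hypothetical label key. The PR overview only says that an
// "initialization-status" label is stamped on the PVC.
const statusLabel = "trainer.kubeflow.org/initialization-status"

// Action is an illustrative name for what the controller does next.
type Action string

const (
	RunInitializer  Action = "run-initializer"  // no shared PVC yet: this job initializes
	SkipInitializer Action = "skip-initializer" // data is ready: reuse it
	Requeue         Action = "requeue"          // another job is still initializing
	Fail            Action = "fail"             // assumption: surface a failed initialization
)

// nextAction decides what a TrainJob should do given whether a PVC carrying
// the shared-source label was found and, if so, which labels it has.
func nextAction(found bool, pvcLabels map[string]string) Action {
	if !found {
		return RunInitializer
	}
	switch pvcLabels[statusLabel] {
	case "complete":
		return SkipInitializer
	case "failed":
		return Fail
	case "pending":
		return Requeue
	default:
		// Assumption: a missing or unknown status is treated as "not ready yet".
		return Requeue
	}
}

func main() {
	fmt.Println(nextAction(false, nil))
	fmt.Println(nextAction(true, map[string]string{statusLabel: "pending"}))
	fmt.Println(nextAction(true, map[string]string{statusLabel: "complete"}))
}
```

Keeping the decision in a pure function like this makes the requeue path easy to unit-test separately from any Kubernetes client calls.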

Comment on lines +658 to +659
and only needs `list`. See [Proposed JobSet Enhancements](#proposed-jobset-enhancements) for
upstream ownerReference propagation that would achieve the same GC benefit.

Copilot AI Mar 11, 2026

The Alternatives section says the label-based approach “only needs list” RBAC, but the RBAC section (and the proposed design) requires updating PVC labels to stamp initialization-status; please reconcile this to avoid under-scoping required permissions.

Suggested change
- and only needs `list`. See [Proposed JobSet Enhancements](#proposed-jobset-enhancements) for
- upstream ownerReference propagation that would achieve the same GC benefit.
+ and instead requires `list` and `update` RBAC on PVCs for label-based initialization tracking. See
+ [Proposed JobSet Enhancements](#proposed-jobset-enhancements) for upstream ownerReference propagation
+ that would achieve the same GC benefit without direct PVC creation.

Copilot uses AI. Check for mistakes.
- name: TRAINJOB_NAME
  valueFrom:
    fieldRef:
      fieldPath: metadata.labels['trainer.kubeflow.org/trainjob-name']

Copilot AI Mar 11, 2026

The example derives TRAINJOB_NAME from metadata.labels['trainer.kubeflow.org/trainjob-name'], but that label doesn’t appear to be defined/added anywhere in the current codebase; consider using an existing Downward API field (e.g., metadata.name) or explicitly documenting/adding the label injection so the example works as written.

Suggested change
- fieldPath: metadata.labels['trainer.kubeflow.org/trainjob-name']
+ fieldPath: metadata.name
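
Whichever field path is chosen, the consuming container can guard against the variable being absent or empty, which is the failure mode this comment is concerned about. The sketch below is a minimal Go example of such a guard; the helper name `trainJobName` and the injectable `lookup` parameter are hypothetical and not part of the PR.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// trainJobNameEnv is the variable name used in the quoted example.
const trainJobNameEnv = "TRAINJOB_NAME"

// trainJobName returns the value injected through the Downward API.
// lookup is injectable so the function can be exercised without a real Pod.
func trainJobName(lookup func(string) (string, bool)) (string, error) {
	name, ok := lookup(trainJobNameEnv)
	if !ok || name == "" {
		// An unset or empty value suggests the referenced field or label
		// was not present on the Pod.
		return "", errors.New(trainJobNameEnv + " is unset or empty; check the fieldRef in the Pod spec")
	}
	return name, nil
}

func main() {
	name, err := trainJobName(os.LookupEnv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("running initializer for TrainJob", name)
}
```

Failing fast here turns a silently missing label into an explicit startup error rather than an initializer that runs with an empty name.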

Comment on lines +406 to +412
	// name is the shared-source label value used to discover PVCs.
	// Added as label trainer.kubeflow.org/shared-source-name on the PVC.
	// +kubebuilder:validation:MinLength=1
	// +kubebuilder:validation:Pattern=`^[a-z]([-a-z0-9]*[a-z0-9])?$`
	// +required
	Name string `json:"name"`
}

Copilot AI Mar 11, 2026

The proposed validation pattern for SharedVolumeRef.Name/SharedDataCacheRef.Name enforces DNS-label characters but not the 63-char Kubernetes label-value length limit; since these names are used as label values, add a max-length validation (and mirror it on both fields) to prevent creating invalid resources at runtime.
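
To make the gap concrete, here is a minimal Go sketch that applies the quoted pattern together with the 63-character label-value limit. The helper `validateSharedSourceName` is hypothetical and only mirrors the checks at runtime; in the API types themselves the declarative equivalent would presumably be an added `+kubebuilder:validation:MaxLength=63` marker on both fields.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// maxLabelValueLen is the Kubernetes limit on the length of a label value.
const maxLabelValueLen = 63

// namePattern is copied from the Pattern marker quoted in this thread.
var namePattern = regexp.MustCompile(`^[a-z]([-a-z0-9]*[a-z0-9])?$`)

// validateSharedSourceName is a hypothetical helper that checks a name the
// way the combined MinLength, MaxLength, and Pattern markers would.
func validateSharedSourceName(name string) error {
	if name == "" {
		return fmt.Errorf("name must not be empty")
	}
	if len(name) > maxLabelValueLen {
		return fmt.Errorf("name is %d characters; label values are limited to %d",
			len(name), maxLabelValueLen)
	}
	if !namePattern.MatchString(name) {
		return fmt.Errorf("name %q does not match %s", name, namePattern)
	}
	return nil
}

func main() {
	candidates := []string{"imagenet-v1", strings.Repeat("a", 64), "Bad_Name"}
	for _, c := range candidates {
		fmt.Printf("%d chars: %v\n", len(c), validateSharedSourceName(c))
	}
}
```

Note that the 64-character name passes the pattern on its own, which is exactly the case the current markers would let through.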
