KEP-3310: Shared initializer support for train jobs by akshaychitneni · Pull Request #3311 · kubeflow/trainer

akshaychitneni · 2026-03-11T18:42:49Z

What this PR does / why we need it:
This KEP proposes a SharedInitializer plugin that lets multiple TrainJobs share a single initialized data source. The first TrainJob to run initializes the data; subsequent TrainJobs automatically detect and reuse it, skipping initialization entirely. Both disk-based (PVC) and cache-based (LeaderWorkerSet) sharing are supported.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #3310

Checklist:

Docs included if any changes are user facing

google-oss-prow · 2026-03-11T18:42:56Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign gaocegege for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: Akshay Chitneni <achitneni@apple.com>

Copilot

Pull request overview

This PR adds a new KEP describing a proposed SharedInitializer plugin for the Kubeflow Trainer framework, enabling multiple TrainJobs to reuse shared, pre-initialized dataset/model artifacts (PVC-based or cache-service-based) instead of re-initializing per job.

Changes:

Introduces a label-based design for discovering and reusing shared PVCs and cache Services across TrainJobs.
Specifies durability/readiness tracking via an initialization-status PVC label (pending/complete/failed) and requeue behavior.
Proposes new API fields (initializer.sharedSource) and outlines required RBAC and related JobSet considerations.
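The status-driven reuse flow summarized above can be sketched as a small decision function. This is only an illustration under stated assumptions, not the KEP's implementation: the full key of the status label (`trainer.kubeflow.org/initialization-status`), the `Action` names, and the handling of `failed` and unknown values are all hypothetical. Only the `pending`/`complete`/`failed` values, the requeue behavior, and the `trainer.kubeflow.org/shared-source-name` label come from the PR.

```go
package main

import "fmt"

const (
	// Label used to discover shared PVCs (quoted in the proposal).
	sharedSourceLabel = "trainer.kubeflow.org/shared-source-name"
	// Assumed full key for the initialization-status label; the review
	// only names the label "initialization-status".
	statusLabel = "trainer.kubeflow.org/initialization-status"
)

// Action is a hypothetical name for what the controller does next.
type Action string

const (
	ActionInitialize Action = "initialize" // no shared PVC yet: this TrainJob initializes it
	ActionRequeue    Action = "requeue"    // initialization in progress: check again later
	ActionReuse      Action = "reuse"      // data is ready: skip initialization
	ActionFail       Action = "fail"       // assumed handling of a failed initialization
)

// decide maps the labels of a discovered PVC to the next step.
// A nil map means no PVC matched the shared-source label.
func decide(pvcLabels map[string]string) Action {
	if pvcLabels == nil {
		return ActionInitialize
	}
	switch pvcLabels[statusLabel] {
	case "complete":
		return ActionReuse
	case "pending":
		return ActionRequeue
	case "failed":
		return ActionFail
	default:
		// Assumption: a missing or unknown status is treated as not ready.
		return ActionRequeue
	}
}

func main() {
	fmt.Println(decide(nil))
	for _, status := range []string{"pending", "complete", "failed"} {
		labels := map[string]string{
			sharedSourceLabel: "imagenet", // example value, not from the PR
			statusLabel:       status,
		}
		fmt.Println(status, "->", decide(labels))
	}
}
```

Keeping the decision a pure function of the PVC labels makes it easy to unit-test separately from the reconcile loop.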

Copilot · 2026-03-11T18:46:40Z

docs/proposals/3310-shared-initializer/README.md

+and only needs `list`. See [Proposed JobSet Enhancements](#proposed-jobset-enhancements) for
+upstream ownerReference propagation that would achieve the same GC benefit.


The Alternatives section says the label-based approach “only needs list” RBAC, but the RBAC section (and the proposed design) requires updating PVC labels to stamp initialization-status; please reconcile this to avoid under-scoping required permissions.

Suggested change

-and only needs `list`. See [Proposed JobSet Enhancements](#proposed-jobset-enhancements) for
-upstream ownerReference propagation that would achieve the same GC benefit.
+and instead requires `list` and `update` RBAC on PVCs for label-based initialization tracking. See
+[Proposed JobSet Enhancements](#proposed-jobset-enhancements) for upstream ownerReference propagation
+that would achieve the same GC benefit without direct PVC creation.
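To make the permission gap in this comment concrete, the following sketch checks a set of rules against the verbs the label-based design needs on PVCs. `policyRule` is a local stand-in for the Kubernetes RBAC rule type, not the real API, and the check ignores API groups and wildcards for brevity. The marker in the leading comment uses controller-gen syntax; whether the Trainer controller generates its RBAC this way is an assumption.

```go
package main

import "fmt"

// If RBAC is generated with controller-gen (an assumption), the rule would be:
// +kubebuilder:rbac:groups="",resources=persistentvolumeclaims,verbs=list;update

// policyRule is a simplified, hypothetical stand-in for an RBAC rule.
type policyRule struct {
	Resources []string
	Verbs     []string
}

func contains(items []string, want string) bool {
	for _, item := range items {
		if item == want {
			return true
		}
	}
	return false
}

// missingVerbs returns the required verbs that no rule grants on resource.
func missingVerbs(rules []policyRule, resource string, required []string) []string {
	var missing []string
	for _, verb := range required {
		granted := false
		for _, rule := range rules {
			if contains(rule.Resources, resource) && contains(rule.Verbs, verb) {
				granted = true
				break
			}
		}
		if !granted {
			missing = append(missing, verb)
		}
	}
	return missing
}

func main() {
	// Discovery needs list; stamping the status label needs update.
	required := []string{"list", "update"}

	listOnly := []policyRule{{
		Resources: []string{"persistentvolumeclaims"},
		Verbs:     []string{"list"},
	}}
	fmt.Println("missing with list only:", missingVerbs(listOnly, "persistentvolumeclaims", required))

	listAndUpdate := []policyRule{{
		Resources: []string{"persistentvolumeclaims"},
		Verbs:     []string{"list", "update"},
	}}
	fmt.Println("missing with list+update:", missingVerbs(listAndUpdate, "persistentvolumeclaims", required))
}
```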

Copilot · 2026-03-11T18:46:41Z

docs/proposals/3310-shared-initializer/README.md

+                        - name: TRAINJOB_NAME
+                          valueFrom:
+                            fieldRef:
+                              fieldPath: metadata.labels['trainer.kubeflow.org/trainjob-name']


The example derives TRAINJOB_NAME from metadata.labels['trainer.kubeflow.org/trainjob-name'], but that label doesn’t appear to be defined/added anywhere in the current codebase; consider using an existing Downward API field (e.g., metadata.name) or explicitly documenting/adding the label injection so the example works as written.

Suggested change

-fieldPath: metadata.labels['trainer.kubeflow.org/trainjob-name']
+fieldPath: metadata.name

Copilot · 2026-03-11T18:46:41Z

docs/proposals/3310-shared-initializer/README.md

+	// name is the shared-source label value used to discover PVCs.
+	// Added as label trainer.kubeflow.org/shared-source-name on the PVC.
+	// +kubebuilder:validation:MinLength=1
+	// +kubebuilder:validation:Pattern=`^[a-z]([-a-z0-9]*[a-z0-9])?$`
+	// +required
+	Name string `json:"name"`
+}


The proposed validation pattern for SharedVolumeRef.Name/SharedDataCacheRef.Name enforces DNS-label characters but not the 63-char Kubernetes label-value length limit; since these names are used as label values, add a max-length validation (and mirror it on both fields) to prevent creating invalid resources at runtime.

Copilot AI review requested due to automatic review settings March 11, 2026 18:42

google-oss-prow bot requested review from jinchihe and kuizhiqing March 11, 2026 18:42

google-oss-prow bot added the size/XL label Mar 11, 2026

adding kep for shared initializer support

4c74df4

Signed-off-by: Akshay Chitneni <achitneni@apple.com>

akshaychitneni force-pushed the sharedinit branch from 5f7e2a2 to 4c74df4 Compare March 11, 2026 18:43

Copilot started reviewing on behalf of akshaychitneni March 11, 2026 18:43

Copilot AI reviewed Mar 11, 2026

akshaychitneni wants to merge 1 commit into kubeflow:master from akshaychitneni:sharedinit
