KEP-3310: Shared initializer support for train jobs #3311
akshaychitneni wants to merge 1 commit into kubeflow:master
Signed-off-by: Akshay Chitneni <achitneni@apple.com>
5f7e2a2 to 4c74df4
Pull request overview
This PR adds a new KEP describing a proposed SharedInitializer plugin for the Kubeflow Trainer framework, enabling multiple TrainJobs to reuse shared, pre-initialized dataset/model artifacts (PVC-based or cache-service-based) instead of re-initializing per job.
Changes:
- Introduces a label-based design for discovering and reusing shared PVCs and cache Services across TrainJobs.
- Specifies durability/readiness tracking via an `initialization-status` PVC label (pending/complete/failed) and requeue behavior.
- Proposes new API fields (`initializer.sharedSource`) and outlines required RBAC and related JobSet considerations.
and only needs `list`. See [Proposed JobSet Enhancements](#proposed-jobset-enhancements) for
upstream ownerReference propagation that would achieve the same GC benefit.
The Alternatives section says the label-based approach “only needs list” RBAC, but the RBAC section (and the proposed design) requires updating PVC labels to stamp initialization-status; please reconcile this to avoid under-scoping required permissions.
Suggested change:
- and only needs `list`. See [Proposed JobSet Enhancements](#proposed-jobset-enhancements) for
- upstream ownerReference propagation that would achieve the same GC benefit.
+ and instead requires `list` and `update` RBAC on PVCs for label-based initialization tracking. See
+ [Proposed JobSet Enhancements](#proposed-jobset-enhancements) for upstream ownerReference propagation
+ that would achieve the same GC benefit without direct PVC creation.
- name: TRAINJOB_NAME
  valueFrom:
    fieldRef:
      fieldPath: metadata.labels['trainer.kubeflow.org/trainjob-name']
The example derives TRAINJOB_NAME from metadata.labels['trainer.kubeflow.org/trainjob-name'], but that label doesn’t appear to be defined/added anywhere in the current codebase; consider using an existing Downward API field (e.g., metadata.name) or explicitly documenting/adding the label injection so the example works as written.
Suggested change:
- fieldPath: metadata.labels['trainer.kubeflow.org/trainjob-name']
+ fieldPath: metadata.name
	// name is the shared-source label value used to discover PVCs.
	// Added as label trainer.kubeflow.org/shared-source-name on the PVC.
	// +kubebuilder:validation:MinLength=1
	// +kubebuilder:validation:Pattern=`^[a-z]([-a-z0-9]*[a-z0-9])?$`
	// +required
	Name string `json:"name"`
}
The proposed validation pattern for SharedVolumeRef.Name/SharedDataCacheRef.Name enforces DNS-label characters but not the 63-char Kubernetes label-value length limit; since these names are used as label values, add a max-length validation (and mirror it on both fields) to prevent creating invalid resources at runtime.
What this PR does / why we need it:
This issue proposes a SharedInitializer plugin that lets multiple TrainJobs share a single initialized data source. The first TrainJob to run initializes the data; subsequent TrainJobs automatically detect and reuse it, skipping initialization entirely. Both disk-based (PVC) and cache-based (LeaderWorkerSet) sharing are supported.
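The reuse flow described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the KEP's implementation: the full label key `trainer.kubeflow.org/initialization-status`, the `Action` type, and the `decide` function are all hypothetical, and treating a `failed` status as a terminal error is a guess. Only the status values (pending/complete/failed) and the idea of requeueing come from the PR overview.

```go
package main

import "fmt"

// statusLabel is an assumed full key; the PR only names "initialization-status".
const statusLabel = "trainer.kubeflow.org/initialization-status"

// Action is a hypothetical type naming what a TrainJob does next.
type Action string

const (
	ActionInitialize Action = "initialize" // no shared PVC yet: this job initializes it
	ActionReuse      Action = "reuse"      // data is ready: skip initialization
	ActionRequeue    Action = "requeue"    // another job is initializing: check again later
	ActionFail       Action = "fail"       // assumption: surface a failed initialization
)

// decide maps the labels of a discovered shared PVC to the next step.
// found is false when no PVC with the shared-source label exists.
func decide(pvcLabels map[string]string, found bool) Action {
	if !found {
		return ActionInitialize
	}
	switch pvcLabels[statusLabel] {
	case "complete":
		return ActionReuse
	case "failed":
		return ActionFail
	default:
		// "pending" or a not-yet-stamped label: wait rather than race the initializer.
		return ActionRequeue
	}
}

func main() {
	fmt.Println(decide(nil, false))
	fmt.Println(decide(map[string]string{statusLabel: "pending"}, true))
	fmt.Println(decide(map[string]string{statusLabel: "complete"}, true))
}
```

Keeping the decision a pure function of the PVC labels means it can be unit-tested without a cluster; the actual controller would still need to handle concurrent creation by two TrainJobs.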
Which issue(s) this PR fixes (optional, in `Fixes #<issue number>, #<issue number>, ...` format, will close the issue(s) when PR gets merged):
Fixes #3310
Checklist: