Description
Problem
The overlay tree creates parallel branches for training vs inference, duplicating shared config at every level:
eks-training eks-inference
├─ h100-eks-training ├─ h100-eks-inference ← ~80% identical
│ └─ h100-eks-ubuntu-training │ └─ h100-eks-ubuntu-inference ← ~90% identical
├─ gb200-eks-training ├─ gb200-eks-inference ← ~80% identical
│ └─ gb200-eks-ubuntu-training │ └─ gb200-eks-ubuntu-inference ← ~90% identical
Key duplications (~150-200 lines):
- K8s version constraint (>= 1.32.4) repeated in 16 files
- GPU operator overrides (cdi/gdrcopy) identical across 4 files
- Ubuntu OS constraints repeated in 6 files
- Dynamo components identical between H100 and GB200
What Actually Differs Between Training & Inference
| Field | Training | Inference |
|---|---|---|
| criteria.intent | training | inference |
| Skyhook intent | multiNodeTraining | inference |
| Extra components | — | kgateway-crds, kgateway |
| Validation checks | gang-scheduling, cluster-autoscaling | secure-accelerator-access, inference-gateway |
Everything else is shared.
Proposed Refactoring
Phase 1: Extract accelerator-agnostic base (quick win)
Introduce h100-eks.yaml and gb200-eks.yaml that hold shared GPU operator config and K8s constraints. Training/inference variants inherit from these instead of duplicating:
Current: Proposed:
eks-training eks
├─ h100-eks-training ├─ h100-eks (NEW: gpu-operator, constraints)
eks-inference │ ├─ h100-eks-training (intent + skyhook intent only)
├─ h100-eks-inference │ └─ h100-eks-inference (intent + kgateway + skyhook intent)
├─ gb200-eks (NEW)
│ ├─ gb200-eks-training
│ └─ gb200-eks-inference
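To make the Phase 1 inheritance concrete, here is a minimal Python sketch of single-parent chain resolution with a deep merge. It is not the project's Go implementation: the `base` and `spec` keys, the field names inside each overlay, and the `resolve` helper are all assumptions made for illustration.

```python
from copy import deepcopy


def deep_merge(base, override):
    """Return a new dict; override wins, nested dicts merge recursively."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical overlay contents; only the overlay names come from the proposal.
OVERLAYS = {
    "eks": {"base": None, "spec": {}},
    "h100-eks": {
        "base": "eks",
        "spec": {
            "constraints": {"k8s": ">= 1.32.4"},
            "gpu-operator": {"cdi": True, "gdrcopy": True},
        },
    },
    "h100-eks-training": {
        "base": "h100-eks",
        "spec": {
            "criteria": {"intent": "training"},
            "skyhook": {"intent": "multiNodeTraining"},
        },
    },
    "h100-eks-inference": {
        "base": "h100-eks",
        "spec": {
            "criteria": {"intent": "inference"},
            "skyhook": {"intent": "inference"},
            "components": ["kgateway-crds", "kgateway"],
        },
    },
}


def resolve(name, overlays=OVERLAYS):
    """Walk the parent chain from leaf to root, then merge root-first."""
    chain, seen = [], set()
    while name is not None:
        if name in seen:
            raise ValueError(f"cycle in overlay chain at {name!r}")
        seen.add(name)
        chain.append(name)
        name = overlays[name]["base"]
    result = {}
    for overlay_name in reversed(chain):
        result = deep_merge(result, overlays[overlay_name]["spec"])
    return result
```

With this layout the training and inference leaves carry only the fields that differ, and `resolve("h100-eks-training")` still yields the shared K8s constraint and GPU operator settings from `h100-eks`.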
Phase 2: Extract OS overlay
Create eks-ubuntu.yaml with Ubuntu constraints. Ubuntu-specific overlays inherit from it, eliminating 6× duplication of OS constraints.
Phase 3: Orthogonal composition (longer term)
Instead of deep single-parent chains, support composing independent overlays:
- accelerator-h100.yaml: GPU config
- intent-training.yaml: training validation checks
- os-ubuntu.yaml: OS constraints
- platform-kubeflow.yaml: Kubeflow components
A recipe matching {h100, training, ubuntu, kubeflow} merges all four. This would reduce 23 files to ~10.
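One way criteria-driven composition could work is sketched below in Python. It only illustrates the idea and is not a design for the builder: the `criteria` and `spec` keys, the criteria dimension names, the `compose` helper, and the Kubeflow component name are hypothetical.

```python
from copy import deepcopy


def deep_merge(base, override):
    """Return a new dict; override wins, nested dicts merge recursively."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Independent overlays, each keyed on a single criteria dimension (assumed schema).
FRAGMENTS = [
    {
        "name": "accelerator-h100.yaml",
        "criteria": {"accelerator": "h100"},
        "spec": {"gpu-operator": {"cdi": True, "gdrcopy": True}},
    },
    {
        "name": "intent-training.yaml",
        "criteria": {"intent": "training"},
        "spec": {"validation": ["gang-scheduling", "cluster-autoscaling"]},
    },
    {
        "name": "intent-inference.yaml",
        "criteria": {"intent": "inference"},
        "spec": {"validation": ["secure-accelerator-access", "inference-gateway"]},
    },
    {
        "name": "os-ubuntu.yaml",
        "criteria": {"os": "ubuntu"},
        "spec": {"constraints": {"os": "Ubuntu 24.04", "kernel": ">= 6.8"}},
    },
    {
        "name": "platform-kubeflow.yaml",
        "criteria": {"platform": "kubeflow"},
        "spec": {"components": ["kubeflow-placeholder"]},  # hypothetical name
    },
]


def compose(recipe, fragments=FRAGMENTS):
    """Select every fragment whose criteria all match the recipe, then merge in order."""
    selected = [
        frag for frag in fragments
        if all(recipe.get(key) == value for key, value in frag["criteria"].items())
    ]
    result = {}
    for frag in selected:
        result = deep_merge(result, frag["spec"])
    return [frag["name"] for frag in selected], result
```

In this sketch lists are replaced rather than concatenated, so two fragments that both set `components` would conflict. A real implementation would need an explicit rule for that case.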
Feasibility
The existing mergeOverlayChains() in pkg/recipe/metadata_store.go already supports chain resolution, so Phases 1 and 2 need no code changes, only overlay restructuring. Phase 3 would need the builder to support multi-base or criteria-driven composition.
Files Affected by Redundancy
| Category | Count | Redundant Fields |
|---|---|---|
| K8s version constraints | 16 | >= 1.32.4 repeated identically |
| GPU operator overrides | 4 | cdi/gdrcopy enabled |
| Skyhook dependencies | 4 | Same cert-manager, kube-prometheus-stack, skyhook-operator |
| OS constraints | 6 | Ubuntu 24.04, kernel >= 6.8 |
| Dynamo components | 4 | Identical source/version/values |
| Validation checks | 6 | gang-scheduling, pod-autoscaling repeated |
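Counts like those in the table could be re-checked after each phase with a small script. The Python sketch below flattens each overlay into (path, value) pairs and reports pairs that appear in more than one file. It assumes the overlays are already loaded as nested dicts, and the helper names and sample data are hypothetical.

```python
from collections import Counter


def flatten(node, prefix=""):
    """Yield (dotted.path, repr(value)) for every leaf in a nested dict."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from flatten(value, f"{prefix}.{key}" if prefix else key)
    else:
        yield prefix, repr(node)


def duplicated_fields(overlays, min_files=2):
    """Map each (path, value) pair to the number of overlays that contain it."""
    counts = Counter()
    for spec in overlays.values():
        counts.update(set(flatten(spec)))  # count each pair once per file
    return {pair: n for pair, n in counts.items() if n >= min_files}


# Illustrative input only; real overlays would be parsed from the YAML files.
sample = {
    "h100-eks-training": {"constraints": {"k8s": ">= 1.32.4"}, "criteria": {"intent": "training"}},
    "h100-eks-inference": {"constraints": {"k8s": ">= 1.32.4"}, "criteria": {"intent": "inference"}},
    "gb200-eks-training": {"constraints": {"k8s": ">= 1.32.4"}, "criteria": {"intent": "training"}},
}

report = duplicated_fields(sample)
```

Here `report` flags `constraints.k8s` as shared by all three sample overlays, which is the kind of field Phase 1 would hoist into a common parent.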