refactor(overlay): system to reduce training/inference redundancy #305

@yuanchen8911

Problem

The overlay tree creates parallel branches for training vs inference, duplicating shared config at every level:

eks-training                    eks-inference
├─ h100-eks-training            ├─ h100-eks-inference        ← ~80% identical
│  └─ h100-eks-ubuntu-training  │  └─ h100-eks-ubuntu-inference  ← ~90% identical
├─ gb200-eks-training           ├─ gb200-eks-inference       ← ~80% identical
│  └─ gb200-eks-ubuntu-training │  └─ gb200-eks-ubuntu-inference ← ~90% identical

Key duplications (~150-200 lines):

  • K8s version constraint (>= 1.32.4) repeated in 16 files
  • GPU operator overrides (cdi/gdrcopy) identical across 4 files
  • Ubuntu OS constraints repeated in 6 files
  • Dynamo components identical between H100 and GB200

What Actually Differs Between Training & Inference

| Field | Training | Inference |
|---|---|---|
| criteria.intent | training | inference |
| Skyhook intent | multiNodeTraining | inference |
| Extra components | (none) | kgateway-crds, kgateway |
| Validation checks | gang-scheduling, cluster-autoscaling | secure-accelerator-access, inference-gateway |

Everything else is shared.

Proposed Refactoring

Phase 1: Extract intent-agnostic accelerator base (quick win)

Introduce h100-eks.yaml and gb200-eks.yaml that hold shared GPU operator config and K8s constraints. Training/inference variants inherit from these instead of duplicating:

Current:                          Proposed:
eks-training                      eks
├─ h100-eks-training              ├─ h100-eks  (NEW: gpu-operator, constraints)
eks-inference                     │  ├─ h100-eks-training  (intent + skyhook intent only)
├─ h100-eks-inference             │  └─ h100-eks-inference (intent + kgateway + skyhook intent)
                                  ├─ gb200-eks (NEW)
                                  │  ├─ gb200-eks-training
                                  │  └─ gb200-eks-inference
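
To make the Phase 1 layout concrete, here is a minimal Python sketch of single-parent chain resolution. It is not the project's mergeOverlayChains() implementation. The `base` and `spec` keys, the in-memory `OVERLAYS` table, and the `deep_merge` and `resolve` helpers are all hypothetical and chosen only for illustration. The values (K8s >= 1.32.4, cdi/gdrcopy, kgateway) are taken from this issue.

```python
from copy import deepcopy

# Hypothetical in-memory overlays. Key names ("base", "spec") are illustrative,
# not the real overlay schema.
OVERLAYS = {
    "eks": {"base": None, "spec": {}},
    "h100-eks": {  # NEW shared base: gpu-operator overrides + K8s constraint
        "base": "eks",
        "spec": {
            "constraints": {"k8s": ">= 1.32.4"},
            "components": {"gpu-operator": {"cdi": True, "gdrcopy": True}},
        },
    },
    "h100-eks-training": {  # only the intent-specific delta remains
        "base": "h100-eks",
        "spec": {"criteria": {"intent": "training"}},
    },
    "h100-eks-inference": {
        "base": "h100-eks",
        "spec": {
            "criteria": {"intent": "inference"},
            "components": {"kgateway-crds": {}, "kgateway": {}},
        },
    },
}


def deep_merge(parent, child):
    """Return a new dict where child values override parent values, recursing into dicts."""
    merged = deepcopy(parent)
    for key, value in child.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def resolve(name, overlays=OVERLAYS):
    """Walk the base chain up to the root, then merge specs root-first."""
    chain, seen = [], set()
    while name is not None:
        if name in seen:
            raise ValueError(f"cycle in overlay chain at {name!r}")
        seen.add(name)
        chain.append(name)
        name = overlays[name]["base"]
    result = {}
    for overlay_name in reversed(chain):
        result = deep_merge(result, overlays[overlay_name]["spec"])
    return result
```

Under these assumptions, `resolve("h100-eks-training")` and `resolve("h100-eks-inference")` both pick up the gpu-operator overrides and the K8s constraint from the shared base, while each leaf file carries only its intent-specific fields.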

Phase 2: Extract OS overlay

Create eks-ubuntu.yaml with Ubuntu constraints. Ubuntu-specific overlays inherit from it, eliminating 6× duplication of OS constraints.

Phase 3: Orthogonal composition (longer term)

Instead of deep single-parent chains, support composing independent overlays:

  • accelerator-h100.yaml — GPU config
  • intent-training.yaml — training validation checks
  • os-ubuntu.yaml — OS constraints
  • platform-kubeflow.yaml — Kubeflow components

A recipe matching {h100, training, ubuntu, kubeflow} merges all four. This would reduce 23 files to ~10.
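
One possible shape for the Phase 3 composition is sketched below in Python, assuming each fragment declares the single criterion it applies to. The `when` selector, the `FRAGMENTS` list, the `compose` helper, and the `kubeflow` component placeholder are hypothetical. The builder does not support this today, and the real design may differ.

```python
from copy import deepcopy

# Hypothetical fragments; each applies when one criterion matches.
FRAGMENTS = [
    {"name": "accelerator-h100", "when": ("accelerator", "h100"),
     "spec": {"components": {"gpu-operator": {"cdi": True, "gdrcopy": True}}}},
    {"name": "intent-training", "when": ("intent", "training"),
     "spec": {"validation": ["gang-scheduling", "cluster-autoscaling"]}},
    {"name": "intent-inference", "when": ("intent", "inference"),
     "spec": {"components": {"kgateway-crds": {}, "kgateway": {}}}},
    {"name": "os-ubuntu", "when": ("os", "ubuntu"),
     "spec": {"constraints": {"os": "Ubuntu 24.04", "kernel": ">= 6.8"}}},
    {"name": "platform-kubeflow", "when": ("platform", "kubeflow"),
     "spec": {"components": {"kubeflow": {}}}},  # placeholder component name
]


def deep_merge(parent, child):
    """Return a new dict where child values override parent values, recursing into dicts."""
    merged = deepcopy(parent)
    for key, value in child.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def compose(criteria, fragments=FRAGMENTS):
    """Merge every fragment whose selector matches the recipe criteria, in list order."""
    applied, result = [], {}
    for fragment in fragments:
        key, value = fragment["when"]
        if criteria.get(key) == value:
            result = deep_merge(result, fragment["spec"])
            applied.append(fragment["name"])
    return applied, result
```

Calling `compose({"accelerator": "h100", "intent": "training", "os": "ubuntu", "platform": "kubeflow"})` applies the four matching fragments and skips `intent-inference`. In this sketch the merge order is simply list order. A real implementation would need an explicit precedence rule for fragments that set the same key.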

Feasibility

The existing mergeOverlayChains() in pkg/recipe/metadata_store.go already supports chain resolution — Phase 1-2 need no code changes, just overlay restructuring. Phase 3 would need the builder to support multi-base or criteria-driven composition.

Files Affected by Redundancy

| Category | Count | Redundant Fields |
|---|---|---|
| K8s version constraints | 16 | >= 1.32.4 repeated identically |
| GPU operator overrides | 4 | cdi/gdrcopy enabled |
| Skyhook dependencies | 4 | Same cert-manager, kube-prometheus-stack, skyhook-operator |
| OS constraints | 6 | Ubuntu 24.04, kernel >= 6.8 |
| Dynamo components | 4 | Identical source/version/values |
| Validation checks | 6 | gang-scheduling, pod-autoscaling repeated |
