refactor(overlay): restructure system to reduce training/inference redundancy #305

## Problem

The overlay tree creates **parallel branches** for training vs inference, duplicating shared config at every level:

```
eks-training                    eks-inference
├─ h100-eks-training            ├─ h100-eks-inference        ← ~80% identical
│  └─ h100-eks-ubuntu-training  │  └─ h100-eks-ubuntu-inference  ← ~90% identical
├─ gb200-eks-training           ├─ gb200-eks-inference       ← ~80% identical
│  └─ gb200-eks-ubuntu-training │  └─ gb200-eks-ubuntu-inference ← ~90% identical
```

**Key duplications (~150-200 lines):**
- K8s version constraint (`>= 1.32.4`) repeated in **16 files**
- GPU operator overrides (cdi/gdrcopy) identical across 4 files
- Ubuntu OS constraints repeated in 6 files
- Dynamo components identical between H100 and GB200

## What Actually Differs Between Training & Inference

| Field | Training | Inference |
|-------|----------|-----------|
| `criteria.intent` | training | inference |
| Skyhook intent | multiNodeTraining | inference |
| Extra components | — | kgateway-crds, kgateway |
| Validation checks | gang-scheduling, cluster-autoscaling | secure-accelerator-access, inference-gateway |

Everything else is shared.

## Proposed Refactoring

### Phase 1: Extract intent-agnostic accelerator base (quick win)

Introduce `h100-eks.yaml` and `gb200-eks.yaml` to hold the GPU operator config and K8s constraints shared by both intents. The training and inference variants inherit from these bases instead of duplicating them:

```
Current:                          Proposed:
eks-training                      eks
├─ h100-eks-training              ├─ h100-eks  (NEW: gpu-operator, constraints)
eks-inference                     │  ├─ h100-eks-training  (intent + skyhook intent only)
├─ h100-eks-inference             │  └─ h100-eks-inference (intent + kgateway + skyhook intent)
                                  ├─ gb200-eks (NEW)
                                  │  ├─ gb200-eks-training
                                  │  └─ gb200-eks-inference
```
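To make the inheritance concrete, here is a minimal Python sketch of single-parent chain resolution under the proposed layout. It is not the project's implementation: the overlay field names (`base`, `constraints`, `componentOverrides`, `criteria`, `components`) and the `deep_merge`/`resolve` helpers are hypothetical, and the real merge semantics may differ.

```python
from copy import deepcopy

# Hypothetical in-memory overlays; field names are illustrative, not the real schema.
OVERLAYS = {
    "eks": {"base": None},
    "h100-eks": {
        "base": "eks",
        "constraints": {"k8s": ">= 1.32.4"},
        "componentOverrides": {"gpu-operator": {"cdi": True, "gdrcopy": True}},
    },
    "h100-eks-training": {
        "base": "h100-eks",
        "criteria": {"intent": "training"},
        "componentOverrides": {"skyhook": {"intent": "multiNodeTraining"}},
    },
    "h100-eks-inference": {
        "base": "h100-eks",
        "criteria": {"intent": "inference"},
        "componentOverrides": {"skyhook": {"intent": "inference"}},
        "components": ["kgateway-crds", "kgateway"],
    },
}


def deep_merge(parent, child):
    """Return a new dict where child values override parent values; nested dicts recurse."""
    merged = deepcopy(parent)
    for key, value in child.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def resolve(name, overlays=OVERLAYS):
    """Collect the chain from leaf to root, then merge it root-first."""
    chain, seen = [], set()
    while name is not None:
        if name in seen:
            raise ValueError(f"cycle in overlay chain at {name!r}")
        seen.add(name)
        chain.append(name)
        name = overlays[name].get("base")
    result = {}
    for overlay_name in reversed(chain):
        body = {k: v for k, v in overlays[overlay_name].items() if k != "base"}
        result = deep_merge(result, body)
    return result


training = resolve("h100-eks-training")
inference = resolve("h100-eks-inference")
```

In this sketch both leaves pick up the K8s constraint and GPU operator overrides from `h100-eks`, so each leaf file only carries the fields that differ by intent.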

### Phase 2: Extract OS overlay

Create `eks-ubuntu.yaml` to hold the Ubuntu constraints (Ubuntu 24.04, kernel >= 6.8). Ubuntu-specific overlays inherit from it, which removes the copies of the OS constraints currently repeated in 6 files.

### Phase 3: Orthogonal composition (longer term)

Instead of deep single-parent chains, support composing independent overlays:
- `accelerator-h100.yaml` — GPU config
- `intent-training.yaml` — training validation checks
- `os-ubuntu.yaml` — OS constraints
- `platform-kubeflow.yaml` — Kubeflow components

A recipe matching `{h100, training, ubuntu, kubeflow}` merges all four. This would reduce 23 files to ~10.
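One possible shape for criteria-driven composition is sketched below in Python. Everything here is an assumption for illustration: the fragment contents, the `FRAGMENTS` lookup keyed by dimension and value, the fixed `DIMENSION_ORDER`, the `kubeflow` component name, and the list-concatenation merge rule are all hypothetical and do not describe the existing builder.

```python
from copy import deepcopy

# Hypothetical fragments standing in for accelerator-h100.yaml, intent-training.yaml, etc.
FRAGMENTS = {
    ("accelerator", "h100"): {
        "componentOverrides": {"gpu-operator": {"cdi": True, "gdrcopy": True}},
    },
    ("intent", "training"): {
        "validation": ["gang-scheduling", "cluster-autoscaling"],
    },
    ("intent", "inference"): {
        "validation": ["secure-accelerator-access", "inference-gateway"],
        "components": ["kgateway-crds", "kgateway"],
    },
    ("os", "ubuntu"): {
        "constraints": {"os": "Ubuntu 24.04", "kernel": ">= 6.8"},
    },
    # "kubeflow" is a placeholder component name, not taken from the real overlays.
    ("platform", "kubeflow"): {"components": ["kubeflow"]},
}

# Fixed merge order so the result does not depend on how the criteria are written.
DIMENSION_ORDER = ["accelerator", "os", "intent", "platform"]


def merge(base, extra):
    """Merge extra into a copy of base: dicts recurse, lists append new items, scalars override."""
    merged = deepcopy(base)
    for key, value in extra.items():
        current = merged.get(key)
        if isinstance(current, dict) and isinstance(value, dict):
            merged[key] = merge(current, value)
        elif isinstance(current, list) and isinstance(value, list):
            merged[key] = current + [item for item in value if item not in current]
        else:
            merged[key] = deepcopy(value)
    return merged


def compose(criteria):
    """Select one fragment per requested dimension and merge them in DIMENSION_ORDER."""
    result = {}
    for dimension in DIMENSION_ORDER:
        value = criteria.get(dimension)
        if value is None:
            continue  # dimension not requested, nothing to merge
        fragment = FRAGMENTS.get((dimension, value))
        if fragment is None:
            raise KeyError(f"no overlay fragment for {dimension}={value}")
        result = merge(result, fragment)
    return result


recipe = compose(
    {"accelerator": "h100", "intent": "training", "os": "ubuntu", "platform": "kubeflow"}
)
inference_recipe = compose(
    {"accelerator": "h100", "intent": "inference", "platform": "kubeflow"}
)
```

The main design question this raises is conflict handling: with independent fragments, two dimensions may set the same field, so a real implementation would need a defined precedence order or an explicit error on conflicting scalars.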

## Feasibility

The existing `mergeOverlayChains()` in `pkg/recipe/metadata_store.go` already supports chain resolution, so Phases 1 and 2 need no code changes, only overlay restructuring. Phase 3 would require the builder to support multi-base or criteria-driven composition.

## Files Affected by Redundancy

| Category | Files | Redundant Fields |
|----------|-------|------------------|
| K8s version constraints | 16 | `>= 1.32.4` repeated identically |
| GPU operator overrides | 4 | cdi/gdrcopy enabled |
| Skyhook dependencies | 4 | Same cert-manager, kube-prometheus-stack, skyhook-operator |
| OS constraints | 6 | Ubuntu 24.04, kernel >= 6.8 |
| Dynamo components | 4 | Identical source/version/values |
| Validation checks | 6 | gang-scheduling, pod-autoscaling repeated |
