KWOK (Kubernetes WithOut Kubelet) tests AICR bundles against simulated GPU clusters without real hardware.
make build # Build aicr binary
make kwok-test-all # Test all recipes (serial)
make kwok-test-all-parallel # Test all recipes (parallel, faster)
make kwok-e2e RECIPE=h100-eks-ubuntu-training-kubeflow # Test single recipe
Cluster configuration is inferred from recipe overlays - no separate config files needed.
flowchart LR
A[Recipe Overlay] --> B[Node Profile]
B --> C[KWOK Nodes]
A --> D[Bundle Generation]
C --> E[Schedule Test]
D --> E
Components:
| Component | Location | Purpose |
|---|---|---|
| Recipe Overlays | recipes/overlays/*.yaml | Define cluster criteria (service, accelerator) |
| Node Profiles | kwok/profiles/{provider}/*.yaml | Define hardware specs per instance type |
| Scripts | kwok/scripts/ | Create nodes, validate scheduling |
| CI Workflow | .github/workflows/kwok-recipes.yaml | Auto-discover and test recipes |
The script reads recipe criteria and selects matching profiles:
| Service | Accelerator | GPU Profile |
|---|---|---|
| eks | h100 (default) | eks/p5-h100.yaml |
| eks | gb200 | eks/p6-gb200.yaml |
| gke | any | eks/p5-h100.yaml (fallback) |
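The mapping in the table can be sketched as a small shell function. This is only an illustration of the lookup, not the repository's actual code: the function name `select_gpu_profile` is hypothetical, and treating every unlisted service/accelerator pair as the H100 fallback is an assumption.

```shell
# Hypothetical helper that mirrors the table above.
# Arguments: service, accelerator (accelerator defaults to h100).
select_gpu_profile() {
  service="$1"
  accelerator="${2:-h100}"

  case "${service}:${accelerator}" in
    eks:gb200) echo "eks/p6-gb200.yaml" ;;
    eks:h100)  echo "eks/p5-h100.yaml" ;;
    # Assumption: anything else (for example gke with any accelerator)
    # falls back to the EKS H100 profile.
    *)         echo "eks/p5-h100.yaml" ;;
  esac
}

select_gpu_profile eks gb200
select_gpu_profile gke h100
```

Keeping the fallback as the last `case` arm means a new service still gets a usable profile until a dedicated one is added.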
- System nodes: 2
- GPU nodes: 4 (32 GPUs total)
- Kubernetes: v1.33.5
- Region: us-east-1
| Target | Description |
|---|---|
| make kwok-test-all | Test all recipes in shared cluster (serial) |
| make kwok-test-all-parallel | Test all recipes in parallel clusters (faster) |
| make kwok-e2e RECIPE=<name> | Full e2e: cluster, nodes, validate |
| make kwok-cluster | Create Kind cluster with KWOK |
| make kwok-nodes RECIPE=<name> | Create simulated nodes |
| make kwok-test RECIPE=<name> | Validate scheduling only |
| make kwok-status | Show cluster and node status |
| make kwok-cluster-delete | Delete cluster |
The make kwok-test-all-parallel target runs tests in parallel across multiple Kind clusters:
# Auto-detect parallelism (CPUs / 2, min 2, max 8)
make kwok-test-all-parallel
# Specify number of parallel clusters
PARALLEL=4 make kwok-test-all-parallel
# Keep clusters after tests for inspection
KEEP_CLUSTERS=true make kwok-test-all-parallel
# Reduce parallelism if cluster creation fails
PARALLEL=2 make kwok-test-all-parallel
How it works:
- Processes recipes in batches (batch size = PARALLEL value)
- For each batch:
  - Creates dedicated Kind clusters (one per recipe)
  - Installs KWOK controller in each cluster
  - Creates recipe-specific KWOK nodes in each cluster
  - Runs tests in parallel (each recipe on its dedicated cluster)
  - Collects results
  - Deletes batch clusters before starting next batch
- Reports final pass/fail summary
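The auto-detect rule (CPUs / 2, min 2, max 8) and the batching step can be sketched as follows. Both function names are hypothetical and the recipe names in the usage lines are placeholders; the real script may compute and group things differently.

```shell
# Clamp CPUs / 2 into the range [2, 8], per the documented auto-detect rule.
compute_parallel() {
  cpus="$1"
  n=$((cpus / 2))
  if [ "$n" -lt 2 ]; then n=2; fi
  if [ "$n" -gt 8 ]; then n=8; fi
  echo "$n"
}

# Print recipes grouped into batches, one batch per line.
# Arguments: batch size, then the recipe names.
print_batches() {
  size="$1"
  shift
  count=0
  line=""
  for recipe in "$@"; do
    line="${line:+$line }$recipe"
    count=$((count + 1))
    if [ "$count" -eq "$size" ]; then
      echo "$line"
      count=0
      line=""
    fi
  done
  # Emit the final, possibly smaller, batch.
  if [ -n "$line" ]; then echo "$line"; fi
}

compute_parallel 12
print_batches 2 recipe-a recipe-b recipe-c
```

Each printed line corresponds to one batch: its clusters would be created, tested, and deleted before the next line is processed.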
Benefits:
- Faster: ~3-5x faster than serial testing depending on CPU cores
- Isolated: Each recipe runs in its own dedicated cluster with matching hardware
- Resource-efficient: Processes in batches to avoid overwhelming the system
- Correct: Hardware configuration matches recipe requirements exactly
Troubleshooting Parallel Tests:
- If cluster creation fails, logs are preserved in /tmp/tmp.XXXXXX/
- Try reducing parallelism: PARALLEL=2 make kwok-test-all-parallel
- Clusters are staggered by 2s to reduce resource contention
- KWOK controller timeout is 5 minutes (increased for parallel creation)
A recipe is auto-discovered for KWOK testing if it has spec.criteria.service defined.
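As a rough illustration of the discovery rule, the check below treats an empty, `null`, or `any` service value as "not testable". The function names are hypothetical, and the loop assumes `yq` is on the PATH and that it runs from the repository root; the actual discovery logic in CI may differ.

```shell
# Decide whether a service value (as printed by yq) qualifies for KWOK testing.
is_kwok_testable() {
  case "$1" in
    ""|null|any) return 1 ;;
    *)           return 0 ;;
  esac
}

# Hypothetical discovery loop: print every overlay whose service is testable.
discover_recipes() {
  for overlay in recipes/overlays/*.yaml; do
    [ -e "$overlay" ] || continue
    service="$(yq eval '.spec.criteria.service' "$overlay")"
    if is_kwok_testable "$service"; then
      echo "$overlay"
    fi
  done
}
```

Calling `discover_recipes` lists the overlay files that would be picked up under this rule.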
Create recipes/overlays/your-recipe.yaml:
kind: recipeMetadata
apiVersion: aicr.nvidia.com/v1alpha1
metadata:
  name: your-recipe-name
spec:
  base: eks-training # Optional: inherit from existing
  criteria:
    service: eks # Required for KWOK testing
    accelerator: h100 # Optional: h100, gb200
    intent: training # Optional
  componentRefs:
    - name: gpu-operator
      type: Helm
      valuesFile: components/gpu-operator/values-eks-training.yaml
Test it:
make build
make kwok-e2e RECIPE=your-recipe-name
To add a new GPU type, copy an existing profile and modify:
cp kwok/profiles/eks/p5-h100.yaml kwok/profiles/eks/p5-a100.yaml
# Edit the new file with A100 specs
Then update kwok/scripts/apply-nodes.sh to map your accelerator to the new profile (see the get_profiles() function around line 64).
Copy the kwok/profiles/eks/ directory structure for your provider and update the mapping in apply-nodes.sh. See existing profiles for the expected format.
The workflow .github/workflows/kwok-recipes.yaml calls run-all-recipes.sh — the same script used by make kwok-test-all. CI uses a single shared Kind cluster and tests recipes sequentially with cleanup between each, ensuring local and CI code paths are identical.
Manual trigger:
gh workflow run kwok-recipes.yaml -f recipe=your-recipe-name
Check tolerations match KWOK nodes:
kubectl describe pod <pod-name> -n aicr-kwok-test
GPU pods need tolerations for kwok.x-k8s.io/node=fake:NoSchedule and node selector nvidia.com/gpu.present: "true".
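For comparison with the output of `kubectl describe pod`, here is a minimal sketch of a pod that carries the toleration and node selector described above. The function name, pod name, container name, and image are illustrative assumptions and not taken from any AICR bundle.

```shell
# Print a hypothetical GPU pod manifest that should schedule onto KWOK GPU nodes.
print_kwok_gpu_pod() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-schedule-check   # hypothetical name
  namespace: aicr-kwok-test
spec:
  nodeSelector:
    nvidia.com/gpu.present: "true"
  tolerations:
    - key: kwok.x-k8s.io/node
      operator: Equal
      value: fake
      effect: NoSchedule
  containers:
    - name: placeholder
      image: registry.k8s.io/pause:3.9   # placeholder; KWOK does not execute containers
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
}
```

Piping it into the cluster with `print_kwok_gpu_pod | kubectl apply -f -` gives a known-good reference: if this pod schedules but a bundle's pod does not, the difference is likely in the tolerations, node selector, or resource requests.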
kubectl logs -n kube-system deployment/kwok-controller
Verify the recipe has spec.criteria.service defined (not any or null):
yq eval '.spec.criteria.service' recipes/overlays/your-recipe.yaml
KWOK validates scheduling, not runtime:
- ✅ Node selectors, tolerations, resource requests
- ✅ Pod scheduling decisions
- ✅ Helm chart generation
- ❌ Container execution
- ❌ GPU functionality
- ❌ Network connectivity
For runtime testing, use Tilt (make dev-env) or a real cluster.