feat(examples): Add kubectl-friendly YAML examples for TrainJob and TrainingRuntime #2925
+1,180
−1
Description
This PR adds comprehensive kubectl-friendly YAML examples for Kubeflow Trainer, addressing the need for simple, out-of-the-box examples that can be applied directly with `kubectl`.

Fixes #2770
Motivation
As noted in #2770, while Jupyter notebook examples are great for AI practitioners, platform administrators and users who prefer working with kubectl need ready-to-use YAML examples. The existing operator guides have embedded YAML snippets, but users were looking for standalone examples in the repository that:
- `kubectl apply -f`

This is especially important for "GPU-poor" users who want to explore and learn Kubeflow Trainer.
Changes
Added a new `examples/yaml/` directory with 8 ready-to-use examples organized by complexity:

Basic Examples (`examples/yaml/basic/`)
- `01-hello-world.yaml` - Simplest TrainJob with echo command (no GPU)
- `02-multi-node.yaml` - Multi-node distributed training simulation (no GPU)
- `03-with-runtime.yaml` - Custom namespace-scoped TrainingRuntime creation and usage (no GPU)
- `04-pytorch-simple.yaml` - Real PyTorch MNIST training (GPU optional, uses public image)

Advanced Examples (`examples/yaml/advanced/`)
- `01-podspec-overrides.yaml` - Comprehensive pod customization with `podTemplateOverrides`
- `02-kueue-integration.yaml` - Queue-based job scheduling with Kueue
- `03-volcano-integration.yaml` - Gang scheduling for distributed training with Volcano
- `04-multi-step.yaml` - Multi-step pipeline with dataset initialization

Documentation
- `examples/yaml/README.md` - Main guide with quick start, common commands, troubleshooting
- `examples/yaml/basic/README.md` - Detailed guide for basic examples
- `examples/yaml/advanced/README.md` - Production best practices and advanced patterns
- `examples/README.md` - Added clear separation between YAML (Platform Admins) and SDK (AI Practitioners) examples

Key Features
✅ All examples verified against actual CRD schemas (not just documentation)
- `podTemplateOverrides` API structure with `targetJobs` array
- `initializer.dataset` API with `storageUri` field

✅ GPU-poor friendly

✅ Production-ready
- `docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727`)

✅ Complete kubectl workflows
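
For readers who have not seen a TrainJob manifest, here is a rough sketch of what a CPU-only "hello world" example of the kind described above might look like. It is not copied from the files in this PR. The job name, the echoed message, and the reference to a `torch-distributed` ClusterTrainingRuntime are assumptions made for illustration, and the field names follow my understanding of the `trainer.kubeflow.org/v1alpha1` API, so they should be checked against the CRDs installed in your cluster.

```yaml
# Hypothetical minimal TrainJob; an illustration, not a file from this PR.
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: hello-world            # illustrative name
spec:
  runtimeRef:
    name: torch-distributed    # assumes this ClusterTrainingRuntime exists in the cluster
  trainer:
    image: python:3.11-slim    # public image listed as verified in this PR
    numNodes: 1                # single node, no GPU requested
    command:
      - echo
    args:
      - "Hello from Kubeflow Trainer"
```

Saved under a name such as `hello-world.yaml` (also hypothetical), a manifest like this would be submitted with `kubectl apply -f hello-world.yaml` and checked with `kubectl get trainjob hello-world`.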
Testing
All YAML files have been validated against:
- `trainer.kubeflow.org/v1alpha1`)

Images verified as publicly available:
- `docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727` ✅
- `python:3.11-slim` ✅

Notes
Related Issues
cc @kannon92 @andreyvelich @kramaranya @astefanutti @xigang