feat(examples): Add kubectl-friendly YAML examples for TrainJob and TrainingRuntime #2925

NarayanaSabari · 2025-11-06T06:30:25Z

Description

This PR adds comprehensive kubectl-friendly YAML examples for Kubeflow Trainer, addressing the need for simple, out-of-the-box examples that can be applied directly with kubectl.

Fixes #2770

Motivation

As noted in #2770, while Jupyter notebook examples are great for AI practitioners, platform administrators and users who prefer working with kubectl need ready-to-use YAML examples. The existing operator guides have embedded YAML snippets, but users were looking for standalone examples in the repository that:

Work out of the box
Don't require GPU resources
Show both basic and advanced configurations
Can be applied directly with kubectl apply -f

This is especially important for "GPU-poor" users who want to explore and learn Kubeflow Trainer.

Changes

Added a new examples/yaml/ directory with 8 ready-to-use examples organized by complexity:

Basic Examples (`examples/yaml/basic/`)

01-hello-world.yaml - Simplest TrainJob with echo command (no GPU)
02-multi-node.yaml - Multi-node distributed training simulation (no GPU)
03-with-runtime.yaml - Custom namespace-scoped TrainingRuntime creation and usage (no GPU)
04-pytorch-simple.yaml - Real PyTorch MNIST training (GPU optional, uses public image)
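
For orientation, a hello-world TrainJob along these lines might look like the sketch below. This is not the file from this PR: the name `hello-world` is illustrative, and it assumes the `torch-distributed` ClusterTrainingRuntime from the default Trainer manifests is installed in the cluster.

```yaml
# Sketch only; names are illustrative and not taken from this PR's files.
# Apply:  kubectl apply -f 01-hello-world.yaml
# Check:  kubectl get trainjob hello-world
# Delete: kubectl delete trainjob hello-world
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: hello-world
spec:
  runtimeRef:
    name: torch-distributed   # assumed to be installed in the cluster
  trainer:
    numNodes: 1               # raise this to simulate multi-node training
    image: python:3.11-slim   # public CPU-only image, no GPU requested
    command:
      - echo
      - "Hello from Kubeflow Trainer"
```

Because no `resourcesPerNode` is set, the job requests no GPU and should schedule on any node.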

Advanced Examples (`examples/yaml/advanced/`)

01-podspec-overrides.yaml - Comprehensive pod customization with podTemplateOverrides
02-kueue-integration.yaml - Queue-based job scheduling with Kueue
03-volcano-integration.yaml - Gang scheduling for distributed training with Volcano
04-multi-step.yaml - Multi-step pipeline with dataset initialization

Documentation

examples/yaml/README.md - Main guide with quick start, common commands, troubleshooting
examples/yaml/basic/README.md - Detailed guide for basic examples
examples/yaml/advanced/README.md - Production best practices and advanced patterns
Updated examples/README.md - Added clear separation between YAML (Platform Admins) and SDK (AI Practitioners) examples

Key Features

✅ All examples verified against actual CRD schemas (not just documentation)

Uses correct podTemplateOverrides API structure with targetJobs array
Uses correct initializer.dataset API with storageUri field
No deprecated or non-existent fields
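
A minimal sketch of the `podTemplateOverrides` shape with a `targetJobs` array is shown below. The job name `node`, the TrainJob name, and the node selector are assumptions for illustration. The field name can differ between Trainer releases, so checking it with `kubectl explain trainjob.spec` against the installed CRD is advisable.

```yaml
# Sketch only; verify field names against the CRD installed in your cluster.
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: overrides-sketch          # hypothetical name
spec:
  runtimeRef:
    name: torch-distributed       # assumed to be installed
  podTemplateOverrides:
    - targetJobs:
        - name: node              # assumed name of the trainer job in the runtime
      spec:
        nodeSelector:
          kubernetes.io/arch: amd64   # illustrative override
```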

✅ GPU-poor friendly

3 out of 4 basic examples never use a GPU
The fourth, the PyTorch example, works with or without a GPU (auto-detects)

✅ Production-ready

Uses publicly available images (e.g., docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727)
Includes best practices for resource limits, security, scheduling
Shows integration with Kueue and Volcano

✅ Complete kubectl workflows

Every example includes apply/check/logs/delete commands
Comprehensive troubleshooting section
Common debugging patterns

Testing

All YAML files have been validated against:

TrainJob CRD schema (trainer.kubeflow.org/v1alpha1)
TrainingRuntime CRD schema (namespace-scoped)
ClusterTrainingRuntime CRD schema (cluster-scoped)
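
To illustrate the difference in scope, the sketch below pairs a namespace-scoped TrainingRuntime with a TrainJob that references it. It is modeled on the layout of the built-in runtimes rather than on this PR's files; the names `my-runtime` and `runtime-sketch` and the `default` namespace are hypothetical.

```yaml
# Sketch only; resource names and namespace are hypothetical.
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainingRuntime
metadata:
  name: my-runtime
  namespace: default
spec:
  mlPolicy:
    numNodes: 1
  template:
    spec:
      replicatedJobs:
        - name: node
          template:
            metadata:
              labels:
                # Marks this job as the trainer step.
                trainer.kubeflow.org/trainjob-ancestor-step: trainer
            spec:
              template:
                spec:
                  containers:
                    - name: node
                      image: python:3.11-slim
                      command: ["python", "-c", "print('hello from my-runtime')"]
---
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: runtime-sketch
  namespace: default              # must match the runtime's namespace
spec:
  runtimeRef:
    apiGroup: trainer.kubeflow.org
    kind: TrainingRuntime         # the default kind is ClusterTrainingRuntime
    name: my-runtime
```

A ClusterTrainingRuntime has the same `spec` layout but no namespace, so TrainJobs in any namespace can reference it.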

Images verified as publicly available:

docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727 ✅
python:3.11-slim ✅

Notes

These examples complement (not replace) the Python SDK examples
They're specifically designed for platform administrators and kubectl users
All examples can be applied immediately without building custom images
Documentation clearly separates use cases: YAML for Platform Admins, SDK for AI Practitioners

Related Issues

Addresses feedback from @kannon92 about wanting simple hello-world examples without requiring GPU
Implements suggestions from @andreyvelich about showing PodSpecOverrides usage
Follows @kramaranya's idea of complete kubectl workflows for Platform Admins

cc @kannon92 @andreyvelich @kramaranya @astefanutti @xigang

google-oss-prow · 2025-11-06T06:30:31Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign johnugeorge for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


kannon92 · 2025-11-07T22:07:25Z

This is exactly what I wanted!

I haven't verified whether these actually work, though. It would be worth doing that just to make sure.

examples/yaml/advanced/02-kueue-integration.yaml


NarayanaSabari · 2025-11-12T05:28:49Z

@kannon92 You're absolutely right! Thank you for catching that.

I've updated the Kueue example to use the correct approach:

Moved the kueue.x-k8s.io/queue-name label to the TrainJob metadata level
Removed the unnecessary podTemplateOverrides for basic Kueue integration
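
As a rough sketch of that change, the queue label sits on the TrainJob's own metadata. The LocalQueue name `user-queue` and the job name are hypothetical, and Kueue is assumed to be installed with its TrainJob integration enabled.

```yaml
# Sketch only; the queue and job names are hypothetical.
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: kueue-sketch
  labels:
    kueue.x-k8s.io/queue-name: user-queue   # LocalQueue in the same namespace
spec:
  runtimeRef:
    name: torch-distributed                 # assumed to be installed
  trainer:
    image: python:3.11-slim
    command: ["echo", "queued by Kueue"]
```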

I also took the opportunity to:

Fix the Volcano example to use the official podGroupPolicy API
Fix the multi-step example where initializer was incorrectly nested under trainer
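
For the multi-step fix, the corrected layout places `initializer` next to `trainer` under `spec`, roughly as sketched here. The runtime name and the `storageUri` value are placeholders, and the referenced runtime is assumed to define a dataset initializer step.

```yaml
# Sketch only; runtime name and storageUri are placeholders.
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: multi-step-sketch
spec:
  runtimeRef:
    name: my-runtime-with-initializer   # hypothetical runtime with a dataset initializer step
  initializer:                          # sibling of trainer, not nested under it
    dataset:
      storageUri: hf://<org>/<dataset>  # placeholder; replace with a real dataset URI
  trainer:
    image: python:3.11-slim
    command: ["echo", "dataset initialized"]
```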

All examples now match the actual CRD API definitions. Thanks for the detailed feedback!

google-oss-prow bot requested review from jinchihe and kuizhiqing November 6, 2025 06:30

google-oss-prow bot added the size/XXL label Nov 6, 2025

NarayanaSabari changed the title ~~Add kubectl-friendly YAML examples for TrainJob and TrainingRuntime~~ feat(examples): Add kubectl-friendly YAML examples for TrainJob and TrainingRuntime Nov 6, 2025

added yaml examples for trainer

5d92bf6

Signed-off-by: narayanasabari <[email protected]>

NarayanaSabari force-pushed the yaml-examples branch from 63adef0 to 5d92bf6 Compare November 6, 2025 06:33

kannon92 reviewed Nov 7, 2025

Commented on examples/yaml/advanced/02-kueue-integration.yaml (resolved)

Examples fixed

3e4f3cc

Signed-off-by: narayanasabari <[email protected]>
