Skip to content

Conversation

@NarayanaSabari
Copy link

Description

This PR adds comprehensive kubectl-friendly YAML examples for Kubeflow Trainer, addressing the need for simple, out-of-the-box examples that can be applied directly with kubectl.

Fixes #2770

Motivation

As noted in #2770, while Jupyter notebook examples are great for AI practitioners, platform administrators and users who prefer working with kubectl need ready-to-use YAML examples. The existing operator guides have embedded YAML snippets, but users were looking for standalone examples in the repository that:

  • Work out of the box
  • Don't require GPU resources
  • Show both basic and advanced configurations
  • Can be applied directly with kubectl apply -f

This is especially important for "GPU-poor" users who want to explore and learn Kubeflow Trainer.

Changes

Added a new examples/yaml/ directory with 8 ready-to-use examples organized by complexity:

Basic Examples (examples/yaml/basic/)

  1. 01-hello-world.yaml - Simplest TrainJob with echo command (no GPU)
  2. 02-multi-node.yaml - Multi-node distributed training simulation (no GPU)
  3. 03-with-runtime.yaml - Custom namespace-scoped TrainingRuntime creation and usage (no GPU)
  4. 04-pytorch-simple.yaml - Real PyTorch MNIST training (GPU optional, uses public image)

Advanced Examples (examples/yaml/advanced/)

  1. 01-podspec-overrides.yaml - Comprehensive pod customization with podTemplateOverrides
  2. 02-kueue-integration.yaml - Queue-based job scheduling with Kueue
  3. 03-volcano-integration.yaml - Gang scheduling for distributed training with Volcano
  4. 04-multi-step.yaml - Multi-step pipeline with dataset initialization

Documentation

  • examples/yaml/README.md - Main guide with quick start, common commands, troubleshooting
  • examples/yaml/basic/README.md - Detailed guide for basic examples
  • examples/yaml/advanced/README.md - Production best practices and advanced patterns
  • Updated examples/README.md - Added clear separation between YAML (Platform Admins) and SDK (AI Practitioners) examples

Key Features

All examples verified against actual CRD schemas (not just documentation)

  • Uses correct podTemplateOverrides API structure with targetJobs array
  • Uses correct initializer.dataset API with storageUri field
  • No deprecated or non-existent fields

GPU-poor friendly

  • 3 out of 4 basic examples require NO GPU
  • PyTorch example works with or without GPU (auto-detects)

Production-ready

  • Uses publicly available images (e.g., docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727)
  • Includes best practices for resource limits, security, scheduling
  • Shows integration with Kueue and Volcano

Complete kubectl workflows

  • Every example includes apply/check/logs/delete commands
  • Comprehensive troubleshooting section
  • Common debugging patterns

Testing

All YAML files have been validated against:

  • TrainJob CRD schema (trainer.kubeflow.org/v1alpha1)
  • TrainingRuntime CRD schema (namespace-scoped)
  • ClusterTrainingRuntime CRD schema (cluster-scoped)

Images verified as publicly available:

  • docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
  • python:3.11-slim

Notes

  • These examples complement (not replace) the Python SDK examples
  • They're specifically designed for platform administrators and kubectl users
  • All examples can be applied immediately without building custom images
  • Documentation clearly separates use cases: YAML for Platform Admins, SDK for AI Practitioners

Related Issues

  • Addresses feedback from @kannon92 about wanting simple hello-world examples without requiring GPU
  • Implements suggestions from @andreyvelich about showing PodSpecOverrides usage
  • Follows @kramaranya's idea of complete kubectl workflows for Platform Admins

cc @kannon92 @andreyvelich @kramaranya @astefanutti @xigang

@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign johnugeorge for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@NarayanaSabari NarayanaSabari changed the title Add kubectl-friendly YAML examples for TrainJob and TrainingRuntime feat(examples): Add kubectl-friendly YAML examples for TrainJob and TrainingRuntime Nov 6, 2025
Signed-off-by: narayanasabari <[email protected]>
@kannon92
Copy link
Contributor

kannon92 commented Nov 7, 2025

This is exactly what I wanted!

I haven't verified if this actually work though. It would be worth doing that just to make sure.

Signed-off-by: narayanasabari <[email protected]>
@NarayanaSabari
Copy link
Author

@kannon92 You're absolutely right! Thank you for catching that.

I've updated the Kueue example to use the correct approach:

  • Moved the kueue.x-k8s.io/queue-name label to the TrainJob metadata level
  • Removed the unnecessary podTemplateOverrides for basic Kueue integration

I also took the opportunity to:

  • Fix the Volcano example to use the official podGroupPolicy API
  • Fix the multi-step example where initializer was incorrectly nested under trainer

All examples now match the actual CRD API definitions. Thanks for the detailed feedback!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add yaml examples for trainer

2 participants