
Conversation

@wcarrollrai

Description

This PR provides comprehensive documentation for batch and gang scheduling in KAI-Scheduler, expanding the minimal existing documentation (24 lines) to cover all 9 supported workload types with practical, production-ready examples (640 lines).

What Changed

Documentation Enhancements:

  • Expanded docs/batch/README.md with clear definitions of batch vs gang scheduling
  • Added 9 workload type examples in new docs/batch/examples/ directory
  • Included external requirements with operator install commands
  • Added cross-links to topology-aware scheduling and pod-grouper documentation

New Example Files (9 workload types):

  1. job.yaml - Standard Kubernetes Job (batch scheduling)
  2. pytorchjob.yaml - PyTorch distributed training with MNIST (sketched after this list)
  3. mpijob.yaml - MPI distributed computing (pi calculation)
  4. tfjob.yaml - TensorFlow distributed training with MNIST
  5. xgboostjob.yaml - XGBoost distributed training on Iris dataset
  6. jaxjob.yaml - JAX distributed training on MNIST
  7. rayjob.yaml - Ray distributed computing cluster
  8. jobset.yaml - Multi-job workflow coordination
  9. sparkapplication.yaml - Apache Spark data processing
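For a sense of the gang-scheduled manifests, here is a minimal sketch in the spirit of examples/pytorchjob.yaml (field values are illustrative, and the kai.scheduler/queue label key and test queue name are assumptions drawn from the KAI-Scheduler quick-start conventions; see the actual example file for the exact manifest):

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-dist-mnist
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          labels:
            kai.scheduler/queue: test   # assumed queue label key and value
        spec:
          schedulerName: kai-scheduler  # route scheduling through KAI-Scheduler
          containers:
            - name: pytorch
              image: ghcr.io/kubeflow/training-v1/pytorch-dist-mnist:latest
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        metadata:
          labels:
            kai.scheduler/queue: test
        spec:
          schedulerName: kai-scheduler
          containers:
            - name: pytorch
              image: ghcr.io/kubeflow/training-v1/pytorch-dist-mnist:latest
```

Because the Master and Worker pods form a single pod group, the scheduler places all three pods together or none at all, which is the gang-scheduling behavior the README contrasts with per-pod batch scheduling.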

Key Features:

  • All examples use schedulerName: kai-scheduler with proper queue labels (see the Job sketch after this list)
  • Only non-gcr.io container images (all validated and pullable)
  • Practical, runnable examples (not placeholder sleep commands)
  • Copyright headers on all files
  • kubectl validation performed on all manifests
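The batch-scheduling counterpart on a native Job shows the two fields every example sets (again a sketch; the queue label follows the same assumed convention):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-batch-job
spec:
  completions: 3
  parallelism: 3
  template:
    metadata:
      labels:
        kai.scheduler/queue: test   # assumed queue label key and value
    spec:
      schedulerName: kai-scheduler  # each pod is scheduled independently (batch)
      restartPolicy: Never
      containers:
        - name: main
          image: ubuntu:latest
          command: ["bash", "-c", "echo scheduled by kai-scheduler"]
```

Here each of the three pods is scheduled on its own, so some can start while others wait for capacity.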

Cleanup:

  • Removed legacy batch-job.yaml (replaced by examples/job.yaml)
  • Moved and enhanced pytorch-job.yaml to examples/pytorchjob.yaml

Why This Change

The existing batch documentation was minimal, covering only 2 workload types with placeholder examples. This made it difficult for users to:

  • Understand the difference between batch and gang scheduling
  • Know which workload types are supported
  • Get started quickly with practical examples
  • Understand external requirements for third-party CRDs

This PR addresses these gaps by providing comprehensive, copy/paste-ready documentation.

Related Issues

N/A - Documentation enhancement

Checklist

  • Self-reviewed
  • Added/updated tests (if needed) - N/A for documentation
  • Updated documentation (if needed) - This PR IS documentation

Breaking Changes

None. This is a documentation-only change that enhances existing content without modifying any code or APIs.

Additional Notes

Validation Performed

kubectl dry-run validation (example commands follow the list):

  • ✅ Native K8s resources (Job) pass validation
  • ✅ Third-party CRD manifests produce the expected "CRD not installed" errors, which still confirms the YAML syntax is valid
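The commands were along these lines (a sketch with illustrative file paths, not a transcript of the exact invocation):

```sh
# Client-side dry-run: validates YAML syntax and, for native kinds, the schema
kubectl apply --dry-run=client -f docs/batch/examples/job.yaml

# For third-party kinds, the same command against a cluster without the
# operator fails with a "no matches for kind" (CRD not installed) error,
# which still confirms the manifest parses as valid YAML
kubectl apply --dry-run=client -f docs/batch/examples/pytorchjob.yaml
```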

Image validation:
All container images verified as non-gcr.io, publicly accessible, and manifest-inspectable (see the sketch after this list):

  • ubuntu:latest (Docker Hub)
  • ghcr.io/kubeflow/training-v1/pytorch-dist-mnist:latest (GitHub Container Registry)
  • mpioperator/mpi-pi:openmpi (Docker Hub)
  • kubeflow/tf-mnist-with-summaries:latest (Docker Hub)
  • docker.io/kubeflow/xgboost-dist-iris:latest (Docker Hub)
  • docker.io/kubeflow/jax-mnist:latest (Docker Hub)
  • rayproject/ray:2.46.0 (Docker Hub)
  • busybox:1.36 (Docker Hub)
  • apache/spark:3.5.0 (Docker Hub)
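The "manifest-inspectable" check can be reproduced with commands of this flavor (a sketch; crane is one possible tool, not necessarily the one used here):

```sh
# Confirm each image is publicly resolvable without pulling its layers
docker manifest inspect rayproject/ray:2.46.0
docker manifest inspect apache/spark:3.5.0

# Equivalent check with crane (go-containerregistry), no Docker daemon needed
crane manifest ghcr.io/kubeflow/training-v1/pytorch-dist-mnist:latest
```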

Reviewer Guidance

What to Review:

  • README structure and clarity
  • Example completeness (all 9 supported workload types covered)
  • Accuracy of operator install commands
  • Appropriateness of selected container images

Before & After

Before: 24 lines, 2 workload types, sleep-based placeholders
After: 640 lines, 9 workload types, practical runnable examples

The new documentation provides:

  • Clear batch vs gang scheduling definitions
  • Comprehensive coverage with inline YAML + raw files
  • External requirements with install commands
  • Cross-links to related documentation

- Expand README with clear definitions of batch vs gang scheduling
- Add 9 workload type examples with inline YAML and raw files
  - Job, PyTorchJob, MPIJob, TFJob, XGBoostJob, JAXJob
  - RayJob, JobSet, SparkApplication
- Include external requirements with operator install commands
- Add practical, runnable examples using non-gcr.io images
- Link to topology-aware scheduling for distributed workloads
- Reorganize examples into docs/batch/examples/ directory
- Remove legacy sleep-based examples

All examples validated with kubectl dry-run and use kai-scheduler
with proper queue labels. Examples include copyright headers and
practical workloads (e.g., PyTorch MNIST, MPI pi calculation).

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
@coderabbitai

coderabbitai bot commented Feb 10, 2026

Review skipped

Auto reviews are disabled on this repository. To trigger a single review, invoke the @coderabbitai review command.

@enoodle (Collaborator) commented Feb 10, 2026

@wcarrollrai Thanks,
Maybe it will be easier to read through if you link from the README to the example files instead of including the whole example YAMLs there?

- Remove all inline YAML examples from README (640 lines → 198 lines)
- Add quick reference table with links to all 9 workload types
- Link to example files in docs/batch/examples/ instead of inline YAML
- Replace hardcoded installation commands with links to official docs
- Preserve all descriptions, scheduling behavior, and external requirements
- Keep topology annotation snippet (not a workload example)
- Update PR description to address reviewer feedback

This change improves readability and reduces maintenance burden by:
1. Keeping the README concise and scannable
2. Maintaining all example files in docs/batch/examples/ for reference
3. Linking to upstream installation docs instead of hardcoding commands

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
@wcarrollrai (Author)

Updates Based on Reviewer Feedback

Thanks for the feedback! I've refactored the README to improve readability and reduce maintenance burden.

Changes Made

Improved Readability:

  • Reduced README from 640 lines to 198 lines (69% reduction)
  • Removed all inline YAML examples in favor of links to example files
  • Added a quick reference table at the top for easy navigation
  • Reformatted each workload section with bullet lists for better scannability

Reduced Maintenance:

  • Replaced hardcoded installation commands with links to official documentation
  • This ensures users always get the latest installation instructions from upstream
  • No need to update this repo when operators change their installation procedures

What Stayed the Same

  • All 9 example YAML files remain unchanged in docs/batch/examples/
  • All descriptions, scheduling behaviors, and external requirements preserved
  • Topology-aware scheduling section maintained

Before & After

Before:

  • 640-line README with 9 inline YAML blocks
  • Hardcoded kubectl apply and helm install commands

After:

  • 198-line README with links to examples
  • Quick reference table
  • Bullet list format for each workload type
  • Links to official installation guides

The documentation is now more maintainable and easier to scan while preserving all essential information.

@enoodle (Collaborator) commented Feb 11, 2026

@wcarrollrai Can this also somehow absorb this example, with its extra documentation? https://github.com/NVIDIA/KAI-Scheduler/tree/main/examples/ray

@gshaibi (Collaborator) left a comment

@wcarrollrai thanks for contributing!!!


```md
### Prerequisites
This requires the [kubeflow-training-operator-v1](https://www.kubeflow.org/docs/components/trainer/legacy-v1/) to be installed in the cluster.
### Batch Scheduling
```

I'm not familiar with this term in the context of independent scheduling.
What do you think about renaming it to "independent scheduling", "pod-by-pod scheduling", or something similar?

@wcarrollrai (Author)

@enoodle - Sure. Let me see what we can do to best handle this.
