docs: comprehensive batch and gang scheduling documentation #986
- Expand README with clear definitions of batch vs gang scheduling
- Add 9 workload type examples with inline YAML and raw files
  - Job, PyTorchJob, MPIJob, TFJob, XGBoostJob, JAXJob
  - RayJob, JobSet, SparkApplication
- Include external requirements with operator install commands
- Add practical, runnable examples using non-gcr.io images
- Link to topology-aware scheduling for distributed workloads
- Reorganize examples into docs/batch/examples/ directory
- Remove legacy sleep-based examples

All examples validated with kubectl dry-run and use kai-scheduler with proper queue labels. Examples include copyright headers and practical workloads (e.g., PyTorch MNIST, MPI pi calculation).

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
@wcarrollrai Thanks,
- Remove all inline YAML examples from README (640 lines → 198 lines)
- Add quick reference table with links to all 9 workload types
- Link to example files in docs/batch/examples/ instead of inline YAML
- Replace hardcoded installation commands with links to official docs
- Preserve all descriptions, scheduling behavior, and external requirements
- Keep topology annotation snippet (not a workload example)
- Update PR description to address reviewer feedback

This change improves readability and reduces maintenance burden by:
1. Keeping the README concise and scannable
2. Maintaining all example files in docs/batch/examples/ for reference
3. Linking to upstream installation docs instead of hardcoding commands

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Updates Based on Reviewer Feedback
Thanks for the feedback! I've refactored the README to improve readability and reduce maintenance burden.
Changes Made
Improved Readability:
Reduced Maintenance:
What Stayed the Same
Before & After
Before:
After:
The documentation is now more maintainable and easier to scan while preserving all essential information.
@wcarrollrai Can this also somehow absorb this example with its extra documentation: https://github.com/NVIDIA/KAI-Scheduler/tree/main/examples/ray
gshaibi left a comment
@wcarrollrai thanks for contributing!!!
### Prerequisites
This requires the [kubeflow-training-operator-v1](https://www.kubeflow.org/docs/components/trainer/legacy-v1/) to be installed in the cluster.
### Batch Scheduling
I'm not familiar with this term in the context of independent scheduling.
What do you think about renaming it to "independent scheduling", "pod-by-pod scheduling", or something similar?
@enoodle - Sure. Let me see what we can do to best handle this.
Description
This PR provides comprehensive documentation for batch and gang scheduling in KAI-Scheduler, expanding the minimal existing documentation (24 lines) to cover all 9 supported workload types with practical, production-ready examples (640 lines).
What Changed
Documentation Enhancements:
- docs/batch/README.md with clear definitions of batch vs gang scheduling
- docs/batch/examples/ directory

New Example Files (9 workload types):
- job.yaml - Standard Kubernetes Job (batch scheduling)
- pytorchjob.yaml - PyTorch distributed training with MNIST
- mpijob.yaml - MPI distributed computing (pi calculation)
- tfjob.yaml - TensorFlow distributed training with MNIST
- xgboostjob.yaml - XGBoost distributed training on Iris dataset
- jaxjob.yaml - JAX distributed training on MNIST
- rayjob.yaml - Ray distributed computing cluster
- jobset.yaml - Multi-job workflow coordination
- sparkapplication.yaml - Apache Spark data processing

Key Features:
- schedulerName: kai-scheduler with proper queue labels

Cleanup:
- batch-job.yaml (replaced by examples/job.yaml)
- pytorch-job.yaml to examples/pytorchjob.yaml

Why This Change
The existing batch documentation was minimal and only covered 2 workload types with placeholder examples. This made it difficult for users to:
This PR addresses these gaps by providing comprehensive, copy/paste-ready documentation.
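For orientation, a manifest of the kind described above might look like the following minimal sketch of a standard Kubernetes Job. Only `schedulerName: kai-scheduler` and the `ubuntu:latest` image come from this PR description. The Job name, the queue name `test`, and the label key `kai.scheduler/queue` are assumptions for illustration and should be checked against the files in docs/batch/examples/ and the queues defined in your cluster.

```yaml
# Hypothetical sketch, not a copy of docs/batch/examples/job.yaml.
apiVersion: batch/v1
kind: Job
metadata:
  name: example-batch-job          # hypothetical name
  labels:
    kai.scheduler/queue: test      # assumed label key and queue name
spec:
  parallelism: 2                   # pods are independent of one another
  completions: 2
  template:
    metadata:
      labels:
        kai.scheduler/queue: test  # assumed label key and queue name
    spec:
      schedulerName: kai-scheduler # hands the pods to KAI-Scheduler
      restartPolicy: Never
      containers:
        - name: main
          image: ubuntu:latest
          command: ["bash", "-c", "echo hello from $(hostname)"]
```

A manifest like this can be checked without creating any resources by running `kubectl apply --dry-run=client -f <file>`, which is one way to perform the kind of dry-run validation mentioned in this PR.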
Related Issues
N/A - Documentation enhancement
Checklist
Breaking Changes
None. This is a documentation-only change that enhances existing content without modifying any code or APIs.
Additional Notes
Validation Performed
kubectl dry-run validation:
Image validation:
All container images verified as non-gcr.io, publicly accessible, and manifest-inspectable:
- ubuntu:latest (Docker Hub)
- ghcr.io/kubeflow/training-v1/pytorch-dist-mnist:latest (GitHub Container Registry)
- mpioperator/mpi-pi:openmpi (Docker Hub)
- kubeflow/tf-mnist-with-summaries:latest (Docker Hub)
- docker.io/kubeflow/xgboost-dist-iris:latest (Docker Hub)
- docker.io/kubeflow/jax-mnist:latest (Docker Hub)
- rayproject/ray:2.46.0 (Docker Hub)
- busybox:1.36 (Docker Hub)
- apache/spark:3.5.0 (Docker Hub)

Reviewer Guidance
What to Review:
Related Documentation:
Before & After
Before: 24 lines, 2 workload types, sleep-based placeholders
After: 640 lines, 9 workload types, practical runnable examples
The new documentation provides: