- What work did the WG do this year that should be highlighted?
See 2024 Highlights.
- Are there any areas and/or subprojects that your group needs help with (e.g. fewer than 2 active OWNERS)?
Yes, JobSet has 1 active owner at the moment.
We will break down our highlights into subprojects, KEPs, talks, and community adoption.
Kueue has had 5 minor releases in 2024.
In 2024, the Kueue community would like to highlight topology-aware scheduling, MultiKueue, the Kueue dashboard, KueueCtl, Deployment/StatefulSet integration for serving, and fair sharing.
Topology-aware scheduling takes data center topology into account when scheduling workloads. Workloads benefit from using interconnects that are physically close together.
MultiKueue provides a way of dispatching batch workloads to worker clusters. Kueue provides multicluster dispatching for popular batch workloads such as Ray, Job, Kubeflow and JobSet. This feature went beta in 0.9.
A Kueue dashboard has been a popular request. Users would like a visual representation of queueing, and we are happy to announce that a dashboard has been created for Kueue. It was added to Kueue in late 2024, and a big focus of 2025 will be hardening it for production.
KueueCtl provides a CLI for creating Kueue objects. The plugin is hosted in Krew and is easily installed as a kubectl plugin.
Deployment and StatefulSet integration makes Kueue usable for serving workloads. Serving creates a need for sharing and preemption among model servers that may use accelerators. Kueue integrates with the popular ways of deploying services (Deployment/StatefulSet).
JobSet has had 4 minor releases in 2024.
A major achievement of JobSet has been the adoption of JobSet as a component for Kubeflow Trainer V2, the next generation of the Kubeflow Training Operator project.
Metaflow has adopted the use of JobSet for distributed ML training.
KJob was started to provide a CLI-friendly way for users to submit batch jobs. The HPC/ML community tends to prefer a CLI over YAML, so the focus was on providing a templated solution for submitting batch jobs. Another focus of this project is to provide a smooth transition for Slurm users.
WG-Batch delivered a series of Kubernetes enhancements that improved the experience of running batch workloads on Kubernetes. In 2024, this group proposed or implemented the following KEPs.
-
- Promoted to beta.
-
- Promoted to beta.
-
- Promoted to stable.
-
- Promoted to stable.
-
- Promoted to stable.
-
Democratizing AI Model Training on Kubernetes with Kubeflow TrainJob and JobSet
- Speakers: Andrey Velichkevich and Yuki Iwai
- KubeCon NA, Salt Lake City
- Recording
-
WG-Batch Update at KubeCon
- Speakers: Kevin Hannon and Marcin Wielgus
- KubeCon NA, Salt Lake City
- Recording
-
Keynote: MultiCluster Batch Jobs Dispatching with Kueue at CERN
- Speakers: Ricardo Rocha and Marcin Wielgus
- KubeCon NA, Salt Lake City
- Recording
-
Multitenancy and Fairness at Scale with Kueue: A Case Study
- Speakers: Aldo Culquicondor and Rajat Phull
- KubeCon NA, Salt Lake City
- Recording
-
Advanced Resource Management for Running AI/ML Workloads with Kueue
- Speakers: Michał Woźniak and Yuki Iwai
- KubeCon EU, Paris
- Recording
-
Scale Your Batch / Big Data / AI Workloads Beyond the Kubernetes Scheduler
- Speakers: Antonin Stefanutti and Anish Asthana
- KubeCon EU, Paris
- Recording
-
WG-Batch Update
- Speakers: Michał Woźniak and Yuki Iwai
- KubeCon EU, Paris
- Recording
-
How the Kubernetes Community is Improving Kubernetes for HPC/AI/ML Workloads
- Author: Kevin Hannon
- FOSDEM 2024
- Recording
-
Kubeflow Trainer v2 will be using JobSet as a critical component for distributed training and LLMs fine-tuning.
-
Metaflow supports JobSet for distributed training.
-
Airflow has built an integration with Kueue.
Operational tasks in [wg-governance.md]:
- [README.md] reviewed for accuracy and updated if needed
- WG leaders in [sigs.yaml] are accurate and active, and updated if needed
- Meeting notes and recordings for 2024 are linked from [README.md] and updated/uploaded if needed
- Updates provided to sponsoring SIGs in 2024
  - WG-Batch Updates at KubeCon EU 2024
  - WG-Batch Updates at KubeCon NA 2024

[wg-governance.md]: https://git.k8s.io/community/committee-steering/governance/wg-governance.md
[README.md]: https://git.k8s.io/community/wg-batch/README.md
[sigs.yaml]: https://git.k8s.io/community/sigs.yaml