2024 Annual Report: WG Batch

Current initiatives and Project Health

  1. What work did the WG do this year that should be highlighted?

See 2024 Highlights.

  2. Are there any areas and/or subprojects that your group needs help with (e.g. fewer than 2 active OWNERS)?

Yes, JobSet has 1 active owner at the moment.

2024 Highlights

We break down our highlights into subprojects, KEPs, talks, and community adoption.

Sub Projects

Kueue

Kueue has had 5 minor releases in 2024.

In 2024, the Kueue community would like to highlight topology-aware scheduling, MultiKueue, the Kueue dashboard, KueueCtl, Deployment/StatefulSet integration for serving, and fair sharing.

Topology-aware scheduling lets Kueue take data center topology into account when placing workloads. Workloads benefit when their pods land on nodes that are physically close together and share fast interconnects.

MultiKueue provides a way of dispatching batch workloads to worker clusters. Kueue provides multicluster dispatching for popular batch workloads such as Ray, Job, Kubeflow and JobSet. This feature went beta in 0.9.

A dashboard has been a popular request for Kueue. Users want a visual representation of queueing, and we are happy to announce that a dashboard has been created for Kueue. It was merged into Kueue in late 2024, and a big focus of 2025 will be hardening it for production.

KueueCtl provides a CLI for creating Kueue objects. It is hosted in krew and is easily installed as a kubectl plugin.

Deployment and StatefulSet integration provides an avenue for the usage of Kueue for serving workloads. Serving leads to a need for sharing/preemption of model servers that may leverage accelerators. Kueue provides an integration with popular methods of deploying services (Deployment/StatefulSet).
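As a rough illustration of what this integration looks like from the user's side, the sketch below builds a Deployment manifest that points at a Kueue LocalQueue through the `kueue.x-k8s.io/queue-name` label. The helper `deployment_for_queue`, the names `model-server` and `user-queue`, and the image are hypothetical; label placement and required feature settings should be checked against the Kueue documentation for the version in use.

```python
import json

# Label that Kueue uses to associate a workload with a LocalQueue.
QUEUE_LABEL = "kueue.x-k8s.io/queue-name"


def deployment_for_queue(name, image, queue, replicas=1, gpus=1):
    """Build a Deployment manifest (as a dict) that targets a LocalQueue.

    All argument values are illustrative; nothing here talks to a cluster.
    """
    pod_labels = {"app": name}
    gpu_count = str(gpus)
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": {QUEUE_LABEL: queue}},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": pod_labels},
            "template": {
                "metadata": {"labels": pod_labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            # Accelerator request that would count against queue quota.
                            "resources": {
                                "requests": {"nvidia.com/gpu": gpu_count},
                                "limits": {"nvidia.com/gpu": gpu_count},
                            },
                        }
                    ]
                },
            },
        },
    }


if __name__ == "__main__":
    manifest = deployment_for_queue(
        "model-server", "example.com/model-server:latest", "user-queue"
    )
    # JSON is valid YAML, so the output can be piped to `kubectl apply -f -`.
    print(json.dumps(manifest, indent=2))
```

The only Kueue-specific part is the queue label; the rest is an ordinary Deployment, which is what makes the integration low-friction for existing serving workloads.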

JobSet

JobSet has had 4 minor releases in 2024.

A major achievement was JobSet's adoption as a component of Kubeflow Trainer V2, the next generation of the Kubeflow Training Operator project.

Metaflow has adopted the use of JobSet for distributed ML training.

KJob

KJob was started to provide a CLI-friendly way for users to submit batch jobs. The HPC/ML community tends to prefer a CLI over YAML, so the focus was on providing a templated solution for submitting batch jobs. Another focus of this project is to provide a smooth transition for Slurm users.

KEPs

WG-Batch contributed a series of Kubernetes enhancements that improved the experience of running batch workloads on Kubernetes. In 2024, this group proposed or implemented the following KEPs.

Talks

  • Democratizing AI Model Training on Kubernetes with Kubeflow TrainJob and JobSet

    • Speakers: Andrey Velichkevich and Yuki Iwai
    • KubeCon NA, Salt Lake City
    • Recording
  • WG-Batch Update at KubeCon

    • Speakers: Kevin Hannon and Marcin Wielgus
    • KubeCon NA, Salt Lake City
    • Recording
  • Keynote: MultiCluster Batch Jobs Dispatching with Kueue at CERN

    • Speakers: Ricardo Rocha and Marcin Wielgus
    • KubeCon NA, Salt Lake City
    • Recording
  • Multitenancy and Fairness at Scale with Kueue: A Case Study

    • Speakers: Aldo Culquicondor and Rajat Phull
    • KubeCon NA, Salt Lake City
    • Recording
  • Advanced Resource Management for Running AI/ML Workloads with Kueue

    • Speakers: Michał Woźniak and Yuki Iwai
    • KubeCon EU, Paris
    • Recording
  • Scale Your Batch / Big Data / AI Workloads Beyond the Kubernetes Scheduler

    • Speakers: Antonin Stefanutti and Anish Asthana
    • KubeCon EU, Paris
    • Recording
  • WG-Batch Update

    • Speakers: Michał Woźniak and Yuki Iwai
    • KubeCon EU, Paris
    • Recording
  • How the Kubernetes Community is Improving Kubernetes for HPC/AI/ML Workloads

Community adoption

Operational

Operational tasks in [wg-governance.md]: