Commit 6a52b2a

WG Batch: add 2025 annual report
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
1 parent 08815d0 commit 6a52b2a

1 file changed

+128
-17
lines changed

wg-batch/annual-report-2025.md

Lines changed: 128 additions & 17 deletions
@@ -2,32 +2,143 @@
22

33
## Current initiatives and Project Health
44

5-
65
1. What work did the WG do this year that should be highlighted?
76

8-
<!--
9-
Some example items that might be worth highlighting:
10-
- artifacts
11-
- reports
12-
- white papers
13-
- work not tracked in KEPs
14-
-->
7+
See [2025 Highlights](#2025-highlights).
158

169
2. Are there any areas and/or subprojects that your group needs help with (e.g. fewer than 2 active OWNERS)?
1710

11+
No, all subprojects have sufficient active owners.
12+
13+
### 2025 Highlights
14+
15+
We break down our highlights into subprojects, KEPs, talks, and community adoption.
16+
17+
#### Subprojects
18+
19+
##### Kueue
20+
21+
Kueue has had 5 minor releases in 2025.
22+
23+
- [Release 0.11](https://github.com/kubernetes-sigs/kueue/releases/tag/v0.11.0)
24+
25+
- [Release 0.12](https://github.com/kubernetes-sigs/kueue/releases/tag/v0.12.0)
26+
27+
- [Release 0.13](https://github.com/kubernetes-sigs/kueue/releases/tag/v0.13.0)
28+
29+
- [Release 0.14](https://github.com/kubernetes-sigs/kueue/releases/tag/v0.14.0)
30+
31+
- [Release 0.15](https://github.com/kubernetes-sigs/kueue/releases/tag/v0.15.0)
32+
33+
In 2025, the Kueue community would like to highlight Topology Aware Scheduling, MultiKueue, Admission Fair Sharing, Elastic Jobs, DRA Integration, the v1beta2 API, and the KueueViz Dashboard.
34+
35+
[Topology Aware Scheduling](https://kueue.sigs.k8s.io/docs/concepts/topology_aware_scheduling/) matured from alpha to beta (enabled by default in 0.14), enabling workload scheduling that takes data center topology into account.
36+
Workloads benefit from running on nodes whose interconnects are physically close together, which is critical for AI/ML training.
37+
38+
[MultiKueue](https://kueue.sigs.k8s.io/docs/concepts/multikueue/) expanded to support RayCluster, RayJob, and Pods.
39+
An external dispatcher API was introduced in 0.13 for nominating worker clusters.
40+
Security hardening with kubeconfig validation was added in 0.15.
41+
42+
[Admission Fair Sharing](https://kueue.sigs.k8s.io/docs/concepts/admission_fair_sharing/) progressed from alpha (0.12) to beta (0.15).
43+
This feature orders workloads based on recent LocalQueue usage rather than just priority, preventing queue manipulation and ensuring fair resource distribution.
44+
45+
[Elastic Jobs](https://kueue.sigs.k8s.io/docs/concepts/workload/#elastic-jobs) via WorkloadSlices was introduced in 0.13 as an alpha feature.
46+
This enables dynamic job resizing without suspension or requeueing.
47+
48+
[DRA Integration](https://kueue.sigs.k8s.io/docs/concepts/dynamic_resource_allocation/) was introduced in 0.14 as an alpha feature, providing Dynamic Resource Allocation support for specialized hardware.
49+
50+
[v1beta2 API](https://kueue.sigs.k8s.io/docs/reference/kueue-v1beta2/) was introduced in 0.15, representing API maturation toward stability.
51+
52+
[KueueViz Dashboard](https://kueue.sigs.k8s.io/docs/reference/kueue-viz/) was hardened for production with Helm charts for installation and rebranded with the CNCF logo.
53+
54+
##### JobSet
55+
56+
JobSet has had 3 minor releases in 2025.
57+
58+
- [Release 0.8](https://github.com/kubernetes-sigs/jobset/releases/tag/v0.8.0)
59+
60+
- [Release 0.9](https://github.com/kubernetes-sigs/jobset/releases/tag/v0.9.0)
61+
62+
- [Release 0.10](https://github.com/kubernetes-sigs/jobset/releases/tag/v0.10.0)
63+
64+
A major achievement of JobSet has been the official Kubernetes blog post [Introducing JobSet](https://kubernetes.io/blog/2025/03/23/introducing-jobset/) published in March 2025, which significantly increased the project's visibility.
65+
66+
Key features introduced include VolumeClaimPolicies for managing persistent volume claims, InPlaceRestart for faster failure recovery, DependsOn API for defining execution dependencies between replicated jobs, Failure Policy for configuring different behavior for different types of errors, and the Coordinator field for defining a global coordinator pod for distributed ML/HPC workloads.
67+
68+
##### KJob
69+
70+
[KJob](https://github.com/kubernetes-sigs/kjob) had its first release (v0.1.0) in 2025, providing the base functionality for CLI-friendly batch job submission.
71+
KJob provides template-based job execution with built-in Slurm support and kubectl plugin integration.
72+
The HPC/ML community tends to prefer a CLI over YAML, so the focus was on providing a templated solution for submitting batch jobs and a smooth transition for Slurm users.
73+
74+
#### KEPs
75+
76+
WG-Batch drove a series of Kubernetes enhancements that improve the experience of running batch workloads on Kubernetes. In 2025, the group proposed or implemented the following KEPs:
77+
78+
- [Job Success Policy](https://github.com/kubernetes/enhancements/issues/3998)
79+
- Promoted to stable.
80+
81+
- [Backoff Limit Per Index](https://github.com/kubernetes/enhancements/issues/3850)
82+
- Promoted to stable.
83+
84+
- [Pod Replacement Policy](https://github.com/kubernetes/enhancements/issues/3939)
85+
- Promoted to stable.
86+
87+
- [Job Managed By](https://github.com/kubernetes/enhancements/issues/4368)
88+
- Promoted to stable.
89+
90+
- [Gang Scheduling / Workload API](https://github.com/kubernetes/enhancements/issues/4671)
91+
- Introduced as alpha.
92+
93+
- [Mutable Container Resources for Suspended Jobs](https://github.com/kubernetes/enhancements/issues/5440)
94+
- Introduced as alpha.
95+
96+
#### Talks
97+
98+
- Accelerate Your AI/ML Workloads With Topology-Aware Scheduling in Kueue
99+
- Speakers: Michal Wozniak and Yuki Iwai
100+
- KubeCon EU, London
101+
- [Recording](https://www.youtube.com/watch?v=F55pFM1M1bU)
102+
103+
- Tutorial: Build, Operate, and Use a Multi-Tenant AI Cluster Based Entirely on Open Source
104+
- Speakers: Claudia Misale, Olivier Tardieu, and David Grove
105+
- KubeCon EU, London
106+
- [Recording](https://www.youtube.com/watch?v=Ab7mRoJYsMo)
107+
108+
- Kueue: Save Some QPS for the Rest of Us! How To Manage 100k Updates Per Second
109+
- Speaker: Patryk Bundyra
110+
- KubeCon EU, London
111+
- [Recording](https://www.youtube.com/watch?v=njNXlZNT3dw)
112+
113+
- WG-Batch Updates: What's New and What Is Next?
114+
- Speaker: Marcin Wielgus
115+
- KubeCon EU, London
116+
- [Recording](https://www.youtube.com/watch?v=aWxuaEFSarU)
117+
118+
- Resource Fairness and Utilization for Heterogeneous Batch/ML Platforms With Kueue
119+
- Speakers: Yuki Iwai and Gabe Saba
120+
- KubeCon NA, Atlanta
121+
- [Recording](https://www.youtube.com/watch?v=dKhF-hZi7CI)
122+
123+
- WG-Batch Updates: What's New and What Is Next?
124+
- Speakers: Michal Wozniak and Yuki Iwai
125+
- KubeCon Japan, Tokyo
126+
- [Recording](https://www.youtube.com/watch?v=jeRhDmp_i2M)
127+
128+
#### Community adoption
129+
130+
- [CNCF Kubernetes AI Conformance Program](https://www.cncf.io/announcements/2025/11/11/cncf-launches-certified-kubernetes-ai-conformance-program-to-standardize-ai-workloads-on-kubernetes/) was launched in November 2025 to standardize AI workloads on Kubernetes, with Kueue as a key component in the ecosystem.
131+
18132
## Operational
19133

20134
Operational tasks in [wg-governance.md]:
21135

22-
- [ ] [README.md] reviewed for accuracy and updated if needed
23-
- [ ] WG leaders in [sigs.yaml] are accurate and active, and updated if needed
24-
- [ ] Meeting notes and recordings for 2025 are linked from [README.md] and updated/uploaded if needed
25-
- [ ] Updates provided to sponsoring SIGs in 2025
26-
- [$sig-name](https://git.k8s.io/community/$sig-id/)
27-
- links to email, meeting notes, slides, or recordings, etc
28-
- [$sig-name](https://git.k8s.io/community/$sig-id/)
29-
- links to email, meeting notes, slides, or recordings, etc
30-
-
136+
- [x] [README.md] reviewed for accuracy and updated if needed
137+
- [x] WG leaders in [sigs.yaml] are accurate and active, and updated if needed
138+
- [x] Meeting notes and recordings for 2025 are linked from [README.md] and updated/uploaded if needed
139+
- [x] Updates provided to sponsoring SIGs in 2025
140+
- [WG-Batch Updates at KubeCon EU 2025]()
141+
- [WG-Batch Updates at KubeCon Japan 2025]()
31142

32143
[wg-governance.md]: https://git.k8s.io/community/committee-steering/governance/wg-governance.md
33144
[README.md]: https://git.k8s.io/community/wg-batch/README.md
