KubeCon NA 2024 talks:
Kubernetes Podcast:
- Navigating Failures in Pods with Devices: Challenges and Solutions
- Optimizing LLM Performance in Kubernetes with OpenTelemetry
- Solving the Kubernetes Networking API Rubik's Cube
- Distributed Multi-Node Model Inference Using the LeaderWorkerSet API
- Engaging the KServe Community, the Impact of Integrating a Solution with Standardized CNCF Projects
- Optimizing Load Balancing and Autoscaling for Large Language Model (LLM) Inference on Kubernetes
- Unlocking Potential of Large Models in Production
- Best Practices for Deploying LLM Inference, RAG and Fine Tuning Pipelines
- Advancing Cloud Native AI Innovation Through Open Collaboration
- Kubernetes WG Device Management - Advancing K8s Support for GPUs
- Better Together! GPU, TPU and NIC Topological Alignment with DRA
- Incremental GPU Slicing in Action
- A Tale of 2 Drivers: GPU Configuration on the Fly Using DRA
- Which GPU Sharing Strategy Is Right for You? A Comprehensive Benchmark Study Using DRA
We created a new GenAI benchmarking tool called inference-perf to consolidate and standardize on a single benchmarking-as-code tool. It can benchmark different GenAI workloads running on Kubernetes or elsewhere, in a model-server- and infrastructure-agnostic way.
- The approved subproject proposal can be found here.
- General design for the tool can be found here.
- So far we have had contributions from companies including Google, Red Hat, IBM, and Capital One, with NVIDIA and others looking to contribute as well.
- Dependency management is set up for the tool, and it can be shipped as a Python package - PR #13.
- GitHub Actions are set up to run formatting, linting, and other checks on PR submissions.
- Support for a constant-time load generator and a load generator that sends requests following a Poisson distribution is in progress - PR #9; see the sketch after this list.
- Support for model server clients (PR #5) and metrics collection (PR #7) is in progress as well.
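The Poisson option mentioned above can be illustrated with a minimal sketch (not the inference-perf implementation): drawing exponentially distributed inter-arrival gaps yields Poisson arrivals at a target rate, whereas a constant-time generator instead emits requests at fixed intervals.

```python
import random

def poisson_arrival_times(rate_rps: float, duration_s: float, seed: int = 0) -> list[float]:
    """Generate request start offsets (seconds) for a Poisson arrival process."""
    rng = random.Random(seed)
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate_rps)  # exponential inter-arrival gap <=> Poisson arrivals
        if t >= duration_s:
            return times
        times.append(t)

# Example: schedule roughly 10 requests per second for 30 seconds.
schedule = poisson_arrival_times(rate_rps=10.0, duration_s=30.0)
print(f"{len(schedule)} requests scheduled, first at t={schedule[0]:.3f}s")
```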
The Gateway Inference Extension (GIE) has been developing rapidly, with our first release (v0.1.0) recently published. With this first release we have already seen a 15%-60% reduction in output token latency when the KV cache is close to saturation.
Development will continue rapidly, with a focus on:
- Productionalization
- Adoption of the latest developments (for example, prefix caching)
- Driving improvement and development in other areas of inference routing (multi-LoRA, disaggregated serving pools, etc.)
GIE has established adoption patterns for both the Model Server and the Gateway interfaces. With regard to model servers, GIE already integrates well with vLLM and will soon support JetStream; other model servers need only implement the protocol to integrate cleanly into GIE. As part of sig-net, GIE seeks to build strong partnerships with the gateways in the space, and integration efforts from these organizations are already underway.
GIE was developed on top of ext-proc and Envoy Proxy, so any proxy that supports ext-proc can support the GIE protocol.
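To make the routing idea concrete, below is an illustrative, heavily simplified sketch of an endpoint-picking heuristic in the spirit of what an ext-proc based extension could do: prefer replicas whose KV-cache utilization is below a saturation threshold and that already have the requested LoRA adapter loaded. This is not the actual GIE algorithm or protocol; all names, metrics, and thresholds are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Endpoint:
    address: str
    kv_cache_utilization: float            # 0.0-1.0, scraped from model server metrics
    active_loras: set[str] = field(default_factory=set)
    queue_depth: int = 0

def pick_endpoint(endpoints: list[Endpoint], requested_lora: str | None,
                  saturation_threshold: float = 0.8) -> Endpoint:
    """Pick a backend replica for one request (illustrative heuristic only)."""
    # Prefer replicas whose KV cache is not near saturation; fall back to all.
    unsaturated = [e for e in endpoints
                   if e.kv_cache_utilization < saturation_threshold] or endpoints
    def score(e: Endpoint) -> tuple:
        lora_miss = 0 if (requested_lora is None or requested_lora in e.active_loras) else 1
        return (lora_miss, e.queue_depth, e.kv_cache_utilization)
    return min(unsaturated, key=score)

replicas = [
    Endpoint("10.0.0.1:8000", kv_cache_utilization=0.95, active_loras={"sql-adapter"}, queue_depth=12),
    Endpoint("10.0.0.2:8000", kv_cache_utilization=0.40, queue_depth=3),
]
print(pick_endpoint(replicas, requested_lora="sql-adapter").address)  # 10.0.0.2:8000
```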
The GIE team is excited about the year ahead! With many integrations upcoming and new partners such as KServe, we look forward to where we are headed in 2025.
The Serving Catalog provides a repository of example K8s templates for deploying popular inference workloads. The Kustomized Blueprints - Serving Catalog proposal described an approach for the LLM Serving Catalog that uses Kustomize overlays and components to provide a framework for extensible templates.
The current support matrix is available here, including support for:
- Single-host Inference using Deployments for vLLM and JetStream
- Multi-host Inference using LeaderWorkerSet for vLLM
- Components for HPA stubs for token-latency
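As a rough illustration of the overlays-plus-components pattern described above, an overlay's kustomization might be shaped like the structure below, expressed here as a Python dict for brevity; the paths, component names, and prefix are hypothetical and not the catalog's actual layout.

```python
import json

# Hypothetical overlay for a single-host vLLM blueprint: it pulls in a shared
# base and opts into a component (e.g., an HPA stub keyed on token latency).
kustomization = {
    "apiVersion": "kustomize.config.k8s.io/v1beta1",
    "kind": "Kustomization",
    "namePrefix": "llama-8b-",                              # per-model overlay prefix
    "resources": ["../../base/vllm"],                       # shared Deployment/Service templates
    "components": ["../../components/hpa-token-latency"],   # optional, opt-in pieces
}
print(json.dumps(kustomization, indent=2))
```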
Orchestration: Progress on various initiatives and relevant projects such as the GIE project, Serving Catalog, and KServe. Please refer to the previous section for more details.
Autoscaling: Ongoing efforts to integrate custom metrics for autoscaling AI workloads.
One direction for autoscaling is the unification of model weight distribution formats. There is no single distribution mechanism today, and WG Serving believes that container images are the best distribution format. WG Serving identified some problems with support for large OCI images and sponsored the Kubernetes Image VolumeSource KEP work.
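For context, a pod could mount model weights packaged as an OCI image via the image volume source roughly as sketched below (expressed as a Python dict mirroring the manifest). This is based on the alpha Image Volume KEP; the artifact reference is hypothetical, and field names may change as the feature matures.

```python
import json

# Sketch of a pod that mounts model weights from an OCI image via the
# (alpha) image volume source; the registry reference is a placeholder.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "vllm-with-image-weights"},
    "spec": {
        "containers": [{
            "name": "vllm",
            "image": "vllm/vllm-openai:latest",
            "args": ["--model", "/models/llama"],
            "volumeMounts": [{"name": "weights", "mountPath": "/models/llama", "readOnly": True}],
        }],
        "volumes": [{
            "name": "weights",
            # The kubelet pulls the OCI artifact and exposes it as a read-only volume.
            "image": {"reference": "registry.example.com/models/llama-3.1:weights",
                      "pullPolicy": "IfNotPresent"},
        }],
    },
}
print(json.dumps(pod, indent=2))
```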
Multi-Host Serving: Improvements in distributed inference across nodes in vLLM, LeaderWorkerSet, and KServe.
LeaderWorkerSet (LWS) continues to evolve as a key component for multi-host inference, addressing the challenges of deploying large-scale AI/ML models across multiple nodes. The v0.3.0 release introduced subgroup support for disaggregated serving, a new start policy API, and improved inter-container communication through leader address injection. It also added a multi-node serving example for LLaMA 70B on GPUs using vLLM. Building on these capabilities, v0.4.0 & v0.5.0 introduced network configuration support, group size as an environment variable, and expanded multi-host inference examples, including llama.cpp for distributed inference and an updated vLLM example for Llama 3.1-405B. These enhancements reinforce LWS’s flexibility in orchestrating increasingly larger models on Kubernetes.
At the same time, WG-Serving is working closely with vLLM developers on the latest P&D disaggregation feature progress, actively testing the upstream 1P1D functionality to better understand evolving orchestration requirements. This collaboration aims to drive improvements in xPyD capabilities, further unlocking disaggregated serving on Kubernetes by optimizing workload placement and execution strategies. By refining these mechanisms, we aim to enhance inference performance, ensuring more efficient resource utilization and scalability for large-scale AI workloads.
With these iterative improvements, LWS and vLLM continue to refine multi-host inference, making large-scale distributed model deployments on Kubernetes more reliable, efficient, and adaptable.
In addition, KServe added multi-host serving capability via the vLLM serving runtime.
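To illustrate the grouping model LWS provides for multi-host inference, here is a minimal sketch of a LeaderWorkerSet spec, again expressed as a Python dict; the images and commands are placeholders, and field details should be checked against the LWS API reference.

```python
# Minimal LeaderWorkerSet sketch: each replica is one inference group of
# `size` pods (1 leader + size-1 workers) scheduled and scaled together.
lws = {
    "apiVersion": "leaderworkerset.x-k8s.io/v1",
    "kind": "LeaderWorkerSet",
    "metadata": {"name": "vllm-multihost"},
    "spec": {
        "replicas": 2,  # two independent serving groups
        "leaderWorkerTemplate": {
            "size": 4,  # pods per group: 1 leader + 3 workers
            "leaderTemplate": {"spec": {"containers": [
                {"name": "leader", "image": "vllm-openai:placeholder",
                 "command": ["/start-leader.sh"]}]}},
            "workerTemplate": {"spec": {"containers": [
                {"name": "worker", "image": "vllm-openai:placeholder",
                 "command": ["/start-worker.sh"]}]}},
        },
    },
}
```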
DRA (Dynamic Resource Allocation): Enhancing GPU/accelerator allocation, structured parameters, and resource claim standardization.
The DRA long-term vision will enable many serving-related scenarios in the future. In 2024, most of the effort was spent on adjusting plans and designs to ensure timely GA of the feature and a smooth migration from the device plugin architecture. We are working on prioritizing the DRA features needed for serving workloads; however, the major push in the first half of 2025 will still be GA-related activities. WG Serving prepared a document listing scenarios and requirements for DRA, with the hope of starting work on some of them in 2025.
Another topic under active discussion is device failure handling and managing workloads affected by those failures.
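As a hedged illustration of the resource claim model DRA introduces, the sketch below shows a ResourceClaim requesting one device via a device class and a pod fragment referencing it. The API group/version, device class name, and field shapes here are assumptions and vary across Kubernetes releases as DRA moves toward GA.

```python
# Illustrative DRA ResourceClaim: request one device from a (hypothetical)
# device class; API version and class name are release-dependent assumptions.
resource_claim = {
    "apiVersion": "resource.k8s.io/v1beta1",
    "kind": "ResourceClaim",
    "metadata": {"name": "single-gpu"},
    "spec": {"devices": {"requests": [
        {"name": "gpu", "deviceClassName": "gpu.example.com"},
    ]}},
}

# A pod then references the claim and attaches it to a container.
pod_fragment = {
    "spec": {
        "resourceClaims": [{"name": "gpu", "resourceClaimName": "single-gpu"}],
        "containers": [{"name": "inference",
                        "resources": {"claims": [{"name": "gpu"}]}}],
    },
}
```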
2. Are there any areas and/or subprojects that your group needs help with (e.g. fewer than 2 active OWNERS)?
Not at this point.
Operational tasks in wg-governance.md:
- README.md reviewed for accuracy and updated if needed
- WG leaders in sigs.yaml are accurate and active, and updated if needed
- Meeting notes and recordings for 2024 are linked from README.md and updated/uploaded if needed
- Updates provided to sponsoring SIGs in 2024
  - $sig-name - links to email, meeting notes, slides, or recordings, etc.
  - $sig-name - links to email, meeting notes, slides, or recordings, etc.