KubeCon NA 2024 talks:
Kubernetes Podcast:
- Navigating Failures in Pods with Devices: Challenges and Solutions
- Optimizing LLM Performance in Kubernetes with OpenTelemetry
- Solving the Kubernetes Networking API Rubik's Cube
- Distributed Multi-Node Model Inference Using the LeaderWorkerSet API
- Engaging the KServe Community, the Impact of Integrating a Solution with Standardized CNCF Projects
- Optimizing Load Balancing and Autoscaling for Large Language Model (LLM) Inference on Kubernetes
- Unlocking Potential of Large Models in Production
- Best Practices for Deploying LLM Inference, RAG and Fine Tuning Pipelines
- Advancing Cloud Native AI Innovation Through Open Collaboration
- Kubernetes WG Device Management - Advancing K8s Support for GPUs
- Better Together! GPU, TPU and NIC Topological Alignment with DRA
- Incremental GPU Slicing in Action
- A Tale of 2 Drivers: GPU Configuration on the Fly Using DRA
- Which GPU Sharing Strategy Is Right for You? A Comprehensive Benchmark Study Using DRA
We created a new GenAI benchmarking tool called inference-perf to consolidate and standardize on a single benchmarking-as-code tool. It can benchmark different GenAI workloads running on Kubernetes or elsewhere, in a model-server- and infrastructure-agnostic way.
- The approved subproject proposal can be found here.
- General design for the tool can be found here.
- So far we have had contributions from companies including Google, Red Hat, IBM, and Capital One, with NVIDIA and others looking to contribute as well.
- Dependency management is set up for the tool, and it can be shipped as a Python package - PR #13.
- GitHub Actions are set up to run formatting, linting, and other checks on PR submissions.
- Support for a constant-time load generator and a load generator that sends requests following a Poisson distribution is in progress - PR #9; see the sketch after this list.
- Support for model server clients (PR #5) and metrics collection (PR #7) is in progress as well.
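The Poisson option mentioned above can be illustrated with a minimal sketch (not the inference-perf implementation): drawing exponentially distributed inter-arrival gaps yields Poisson arrivals at a target rate, whereas a constant-time generator instead emits requests at fixed intervals.

```python
import random

def poisson_arrival_times(rate_rps: float, duration_s: float, seed: int = 0) -> list[float]:
    """Generate request start offsets (seconds) for a Poisson arrival process."""
    rng = random.Random(seed)
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate_rps)  # exponential inter-arrival gap <=> Poisson arrivals
        if t >= duration_s:
            return times
        times.append(t)

# Example: schedule roughly 10 requests per second for 30 seconds.
schedule = poisson_arrival_times(rate_rps=10.0, duration_s=30.0)
print(f"{len(schedule)} requests scheduled, first at t={schedule[0]:.3f}s")
```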
The Gateway Inference Extension (GIE) has been developing rapidly, with our first release (v0.1.0) recently published. With this first release we have already seen a 15%-60% reduction in output token latency when the KV cache is close to saturation.
Development will continue rapidly, with a focus on:
- Productionalization
- Adoption of the latest developments (for example, prefix caching)
- Driving improvement and development in other areas of inference routing (multi-LoRA, disaggregated serving pools, etc.)
GIE has established adoption patterns for both the Model Server and the Gateway interfaces. With regard to model servers, GIE already integrates well with vLLM and will soon support JetStream; other model servers need only implement the protocol to integrate cleanly into GIE. As part of sig-net, GIE seeks to build strong partnerships with the gateways in the space, and integration efforts from these organizations are already underway.
GIE was developed on top of ext-proc and Envoy Proxy, so any proxy that supports ext-proc can support the GIE protocol.
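To make the routing idea concrete, below is an illustrative, heavily simplified sketch of an endpoint-picking heuristic in the spirit of what an ext-proc based extension could do: prefer replicas whose KV-cache utilization is below a saturation threshold and that already have the requested LoRA adapter loaded. This is not the actual GIE algorithm or protocol; all names, metrics, and thresholds are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Endpoint:
    address: str
    kv_cache_utilization: float            # 0.0-1.0, scraped from model server metrics
    active_loras: set[str] = field(default_factory=set)
    queue_depth: int = 0

def pick_endpoint(endpoints: list[Endpoint], requested_lora: str | None,
                  saturation_threshold: float = 0.8) -> Endpoint:
    """Pick a backend replica for one request (illustrative heuristic only)."""
    # Prefer replicas whose KV cache is not near saturation; fall back to all.
    unsaturated = [e for e in endpoints
                   if e.kv_cache_utilization < saturation_threshold] or endpoints
    def score(e: Endpoint) -> tuple:
        lora_miss = 0 if (requested_lora is None or requested_lora in e.active_loras) else 1
        return (lora_miss, e.queue_depth, e.kv_cache_utilization)
    return min(unsaturated, key=score)

replicas = [
    Endpoint("10.0.0.1:8000", kv_cache_utilization=0.95, active_loras={"sql-adapter"}, queue_depth=12),
    Endpoint("10.0.0.2:8000", kv_cache_utilization=0.40, queue_depth=3),
]
print(pick_endpoint(replicas, requested_lora="sql-adapter").address)  # 10.0.0.2:8000
```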
The GIE team is excited about the year ahead! With many integrations upcoming and new partners such as KServe, we look forward to where we are headed in 2025.
The Serving Catalog provides a repository of example K8s templates for deploying popular inference workloads. The Kustomized Blueprints - Serving Catalog proposal described an approach for the LLM Serving Catalog that uses Kustomize overlays and components to provide a framework for extensible templates.
The current support matrix is available here, including support for:
- Single-host Inference using Deployments for vLLM and JetStream
- Multi-host Inference using LeaderWorkerSet for vLLM
- Components for HPA stubs for token-latency
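As a rough illustration of the overlays-plus-components pattern described above, an overlay's kustomization might be shaped like the structure below, expressed here as a Python dict for brevity; the paths, component names, and prefix are hypothetical and not the catalog's actual layout.

```python
import json

# Hypothetical overlay for a single-host vLLM blueprint: it pulls in a shared
# base and opts into a component (e.g., an HPA stub keyed on token latency).
kustomization = {
    "apiVersion": "kustomize.config.k8s.io/v1beta1",
    "kind": "Kustomization",
    "namePrefix": "llama-8b-",                              # per-model overlay prefix
    "resources": ["../../base/vllm"],                       # shared Deployment/Service templates
    "components": ["../../components/hpa-token-latency"],   # optional, opt-in pieces
}
print(json.dumps(kustomization, indent=2))
```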
Orchestration: Progress on various initiatives and relevant projects such as the GIE project, Serving Catalog, and KServe. Please refer to the previous section for more details.
Autoscaling: Ongoing efforts to integrate custom metrics for autoscaling AI workloads.
One direction for autoscaling is the unification of model weight distribution formats. There is no single distribution mechanism today, and WG Serving believes that container images are the best distribution format. WG Serving identified some problems with support for large OCI images and sponsored the Kubernetes Image VolumeSource KEP work.
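For context, a pod could mount model weights packaged as an OCI image via the image volume source roughly as sketched below (expressed as a Python dict mirroring the manifest). This is based on the alpha Image Volume KEP; the artifact reference is hypothetical, and field names may change as the feature matures.

```python
import json

# Sketch of a pod that mounts model weights from an OCI image via the
# (alpha) image volume source; the registry reference is a placeholder.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "vllm-with-image-weights"},
    "spec": {
        "containers": [{
            "name": "vllm",
            "image": "vllm/vllm-openai:latest",
            "args": ["--model", "/models/llama"],
            "volumeMounts": [{"name": "weights", "mountPath": "/models/llama", "readOnly": True}],
        }],
        "volumes": [{
            "name": "weights",
            # The kubelet pulls the OCI artifact and exposes it as a read-only volume.
            "image": {"reference": "registry.example.com/models/llama-3.1:weights",
                      "pullPolicy": "IfNotPresent"},
        }],
    },
}
print(json.dumps(pod, indent=2))
```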
Multi-Host Serving: Improvements in distributed inference across nodes in vLLM, LeaderWorkerSet, and KServe.
LeaderWorkerSet (LWS) continues to evolve as a key component for multi-host inference, addressing the challenges of deploying large-scale AI/ML models across multiple nodes. The v0.3.0 release introduced subgroup support for disaggregated serving, a new start policy API, and improved inter-container communication through leader address injection. It also added a multi-node serving example for LLaMA 70B on GPUs using vLLM. Building on these capabilities, v0.4.0 & v0.5.0 introduced network configuration support, group size as an environment variable, and expanded multi-host inference examples, including llama.cpp for distributed inference and an updated vLLM example for Llama 3.1-405B. These enhancements reinforce LWS’s flexibility in orchestrating increasingly larger models on Kubernetes.
At the same time, WG-Serving is working closely with vLLM developers on the latest P&D disaggregation feature progress, actively testing the upstream 1P1D functionality to better understand evolving orchestration requirements. This collaboration aims to drive improvements in xPyD capabilities, further unlocking disaggregated serving on Kubernetes by optimizing workload placement and execution strategies. By refining these mechanisms, we aim to enhance inference performance, ensuring more efficient resource utilization and scalability for large-scale AI workloads.
With these iterative improvements, LWS and vLLM continue to refine multi-host inference, making large-scale distributed model deployments on Kubernetes more reliable, efficient, and adaptable.
In addition, KServe added multi-host serving capability via the vLLM serving runtime.
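To illustrate the grouping model LWS provides for multi-host inference, here is a minimal sketch of a LeaderWorkerSet spec, again expressed as a Python dict; the images and commands are placeholders, and field details should be checked against the LWS API reference.

```python
# Minimal LeaderWorkerSet sketch: each replica is one inference group of
# `size` pods (1 leader + size-1 workers) scheduled and scaled together.
lws = {
    "apiVersion": "leaderworkerset.x-k8s.io/v1",
    "kind": "LeaderWorkerSet",
    "metadata": {"name": "vllm-multihost"},
    "spec": {
        "replicas": 2,  # two independent serving groups
        "leaderWorkerTemplate": {
            "size": 4,  # pods per group: 1 leader + 3 workers
            "leaderTemplate": {"spec": {"containers": [
                {"name": "leader", "image": "vllm-openai:placeholder",
                 "command": ["/start-leader.sh"]}]}},
            "workerTemplate": {"spec": {"containers": [
                {"name": "worker", "image": "vllm-openai:placeholder",
                 "command": ["/start-worker.sh"]}]}},
        },
    },
}
```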
DRA (Dynamic Resource Allocation): Enhancing GPU/accelerator allocation, structured parameters, and resource claim standardization.
The DRA long-term vision will enable many serving-related scenarios in the future. In 2024, most of the effort was spent on adjusting plans and designs to ensure timely GA of the feature and a smooth migration from the device plugin architecture. We are working on prioritizing the DRA features needed for serving workloads; however, the major push in the first half of 2025 will still be GA-related activities. WG Serving prepared a document listing scenarios and requirements for DRA, with the hope of starting work on some of them in 2025.
Another topic under active discussion is device failure handling and managing workloads affected by those failures.
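As a hedged illustration of the resource claim model DRA introduces, the sketch below shows a ResourceClaim requesting one device via a device class and a pod fragment referencing it. The API group/version, device class name, and field shapes here are assumptions and vary across Kubernetes releases as DRA moves toward GA.

```python
# Illustrative DRA ResourceClaim: request one device from a (hypothetical)
# device class; API version and class name are release-dependent assumptions.
resource_claim = {
    "apiVersion": "resource.k8s.io/v1beta1",
    "kind": "ResourceClaim",
    "metadata": {"name": "single-gpu"},
    "spec": {"devices": {"requests": [
        {"name": "gpu", "deviceClassName": "gpu.example.com"},
    ]}},
}

# A pod then references the claim and attaches it to a container.
pod_fragment = {
    "spec": {
        "resourceClaims": [{"name": "gpu", "resourceClaimName": "single-gpu"}],
        "containers": [{"name": "inference",
                        "resources": {"claims": [{"name": "gpu"}]}}],
    },
}
```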
2. Are there any areas and/or subprojects that your group needs help with (e.g. fewer than 2 active OWNERS)?
Not at this point.
Operational tasks in wg-governance.md:
- README.md reviewed for accuracy and updated if needed
- WG leaders in sigs.yaml are accurate and active, and updated if needed
- Meeting notes and recordings for 2024 are linked from README.md and updated/uploaded if needed
- Updates provided to sponsoring SIGs in 2024
  - $sig-name - links to email, meeting notes, slides, or recordings, etc.
  - $sig-name - links to email, meeting notes, slides, or recordings, etc.