Commit 0e247cd

Add autoscale-inference-workloads-with-kaito blog (#5507)

1 parent 962d036 commit 0e247cd

3 files changed: 283 additions & 0 deletions
---
title: "Autoscale KAITO inference workloads on AKS using KEDA"
date: "2026-02-03"
description: "Learn how to autoscale KAITO inference workloads on AKS with KEDA to handle varying requests and optimize GPU utilization for AI models at scale."
authors: ["andy-zhang", "sachi-desai"]
tags: ["ai", "kaito"]
---

[Kubernetes AI Toolchain Operator](https://github.com/Azure/kaito) (KAITO) is an operator that simplifies and automates AI/ML model inference, tuning, and RAG in a Kubernetes cluster. With the recent [v0.8.0 release](https://github.com/Azure/kaito/releases/tag/v0.8.0), KAITO has introduced intelligent autoscaling for inference workloads as an alpha feature! In this blog, we'll guide you through setting up event-driven autoscaling for vLLM inference workloads.

<!-- truncate -->

## Introduction

The LLM inference service is a core and widely used feature in KAITO. As the number of waiting inference requests grows, more inference instances should be scaled out to prevent requests from blocking; conversely, inference instances should be scaled in when requests decline to improve GPU resource utilization. Kubernetes Event Driven Autoscaling (KEDA) is well suited for inference pod autoscaling: it enables event-driven, fine-grained scaling based on external metrics and triggers, and supports a wide range of event sources (including custom metrics), allowing pods to scale precisely in response to workload demand. This flexibility and extensibility make KEDA a good fit for dynamic, cloud-native applications that require responsive and efficient autoscaling.

To enable intelligent autoscaling for KAITO inference workloads using service monitoring metrics, the following components and features are used:

- [Kubernetes Event Driven Autoscaling (KEDA)](https://github.com/kedacore/keda)

- **[KEDA KAITO Scaler](https://github.com/kaito-project/keda-kaito-scaler)**: A dedicated KEDA external scaler, eliminating the need for external dependencies such as Prometheus.

- **KAITO `InferenceSet` CustomResourceDefinition (CRD) and controller**: A new CRD and controller built on top of the KAITO workspace for intelligent autoscaling, introduced as an alpha feature in KAITO version `v0.8.0`.

### Architecture

The following diagram shows how KEDA KAITO Scaler integrates KAITO InferenceSet with KEDA to autoscale inference workloads on AKS:

![Architecture diagram showing KEDA KAITO Scaler integrating KAITO InferenceSet with KEDA to autoscale inference workloads on AKS](keda-kaito-scaler-arch.png)

## Getting started

### Create an AKS cluster with GPU auto-provisioning capabilities for KAITO

Refer to the instructions on [how to create an AKS cluster with GPU auto-provisioning capabilities for KAITO](https://kaito-project.github.io/kaito/docs/azure).

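If you just want a quick start, the following is a minimal sketch of creating a cluster with node auto-provisioning enabled. The resource group, region, and cluster name are placeholders, and exact flags can vary by Azure CLI version; the linked KAITO documentation remains the authoritative reference.

```bash
# Minimal sketch; names and region are placeholders, adjust for your environment.
export RESOURCE_GROUP=kaito-rg
export CLUSTER_NAME=kaito
export LOCATION=eastus

az group create --name "$RESOURCE_GROUP" --location "$LOCATION"

# Node auto-provisioning lets AKS create GPU nodes on demand for KAITO workloads.
az aks create \
  --resource-group "$RESOURCE_GROUP" \
  --name "$CLUSTER_NAME" \
  --node-provisioning-mode Auto \
  --network-plugin azure \
  --network-plugin-mode overlay \
  --network-dataplane cilium \
  --generate-ssh-keys

az aks get-credentials --resource-group "$RESOURCE_GROUP" --name "$CLUSTER_NAME"
```
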
### Enable InferenceSet controller in KAITO

The InferenceSet CRD and controller were introduced as an **alpha** feature in KAITO version `v0.8.0`. Built on top of the KAITO workspace, InferenceSet supports the scale subresource API for intelligent autoscaling. To use InferenceSet, the InferenceSet controller must be enabled during the KAITO installation.

```bash
export CLUSTER_NAME=kaito

helm repo add kaito https://kaito-project.github.io/kaito/charts/kaito
helm repo update
helm upgrade --install kaito-workspace kaito/workspace \
  --namespace kaito-workspace \
  --create-namespace \
  --set clusterName="$CLUSTER_NAME" \
  --set featureGates.enableInferenceSetController=true \
  --wait
```
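
Before moving on, you can optionally confirm that the workspace controller is up and the alpha CRD is registered. The CRD name below is an assumption based on the `kaito.sh` API group:

```bash
# Confirm the KAITO workspace controller pods are running.
kubectl get pods -n kaito-workspace

# Confirm the alpha InferenceSet CRD was registered by the feature gate
# (CRD name assumed from the kaito.sh API group).
kubectl get crd inferencesets.kaito.sh
```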

### Install KEDA

- **Option 1**: Enable the managed KEDA add-on

  For instructions, refer to [Install the KEDA add-on on AKS](https://learn.microsoft.com/azure/aks/keda-deploy-add-on-cli).

- **Option 2**: Install KEDA using the Helm chart

> The following example demonstrates how to install KEDA 2.x using the Helm chart. For instructions on installing KEDA through other methods, refer to the [KEDA deployment documentation](https://github.com/kedacore/keda#deploying-keda).

```bash
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace kube-system
```
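
With either option, you can verify the KEDA operator is running before continuing. The deployment names below are the defaults from the KEDA Helm chart:

```bash
# Verify the KEDA operator and its metrics API server are up.
kubectl get deployment keda-operator keda-operator-metrics-apiserver -n kube-system
```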

## Example Scenarios

### Time-Based KEDA Scaler

The KEDA cron scaler scales workloads according to time-based schedules, which is especially useful for workloads with predictable traffic patterns. It is a good fit when peak hours are known ahead of time, allowing you to proactively adjust resources before demand rises. For more details about time-based scalers, refer to [Scale applications based on a cron schedule](https://keda.sh/docs/2.18/scalers/cron/).

#### Example: Business Hours Scaling

- Create a KAITO InferenceSet for running inference workloads

The following example creates an InferenceSet for the phi-4-mini model:

```bash
cat <<EOF | kubectl apply -f -
apiVersion: kaito.sh/v1alpha1
kind: InferenceSet
metadata:
  name: phi-4-mini
  namespace: default
spec:
  labelSelector:
    matchLabels:
      apps: phi-4-mini
  replicas: 1
  template:
    inference:
      preset:
        accessMode: public
        name: phi-4-mini-instruct
    resource:
      instanceType: Standard_NC24ads_A100_v4
EOF
```
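
GPU node provisioning and model download can take several minutes. A quick way to watch progress, assuming the inference pods inherit the labels from `spec.labelSelector`:

```bash
# Watch the InferenceSet and its inference pods come up.
kubectl get inferenceset phi-4-mini -n default
kubectl get pods -n default -l apps=phi-4-mini -w
```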

- Create a KEDA ScaledObject

Below is an example of creating a `ScaledObject` that scales a KAITO InferenceSet based on business hours:

- **Scale up to 5 replicas** from 6:00 AM to 8:00 PM (peak hours)

- **Scale down to 1 replica** otherwise (off-peak hours)

```bash
cat <<EOF | kubectl apply -f -
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kaito-business-hours-scaler
  namespace: default
spec:
  # Target KAITO InferenceSet to scale
  scaleTargetRef:
    apiVersion: kaito.sh/v1alpha1
    kind: InferenceSet
    name: phi-4-mini
  # Scaling boundaries
  minReplicaCount: 1
  maxReplicaCount: 5
  # Cron-based triggers for time-based scaling
  triggers:
    # Scale up to 5 replicas at 6:00 AM (start of business hours)
    - type: cron
      metadata:
        timezone: "America/New_York"  # Adjust timezone as needed
        start: "0 6 * * 1-5"          # 6:00 AM Monday to Friday
        end: "0 20 * * 1-5"           # 8:00 PM Monday to Friday
        desiredReplicas: "5"          # Scale to 5 replicas during business hours
    # Scale down to 1 replica at 8:00 PM (end of business hours)
    - type: cron
      metadata:
        timezone: "America/New_York"  # Adjust timezone as needed
        start: "0 20 * * 1-5"         # 8:00 PM Monday to Friday
        end: "0 6 * * 1-5"            # 6:00 AM Monday to Friday (next day)
        desiredReplicas: "1"          # Scale to 1 replica during off-hours
EOF
```
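
You can confirm that KEDA accepted the ScaledObject and then watch the replica count change at the window boundaries:

```bash
# Check that the ScaledObject is ready and the backing HPA was created.
kubectl get scaledobject kaito-business-hours-scaler -n default
kubectl get hpa -n default

# Watch replicas flip at 6:00 AM / 8:00 PM America/New_York time.
kubectl get inferenceset phi-4-mini -n default -w
```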

### Metric-Based KEDA Scaler

#### Install KEDA KAITO Scaler

> This component is required only when using the metric-based KEDA scaler. Ensure that the KEDA KAITO Scaler is installed in the same namespace as KEDA.

```bash
helm repo add keda-kaito-scaler https://kaito-project.github.io/keda-kaito-scaler/charts/kaito-project
helm upgrade --install keda-kaito-scaler -n kube-system keda-kaito-scaler/keda-kaito-scaler
```

After a few seconds, the `keda-kaito-scaler` deployment starts.

```bash
# kubectl get deployment keda-kaito-scaler -n kube-system
NAME                READY   UP-TO-DATE   AVAILABLE   AGE
keda-kaito-scaler   1/1     1            1           28h
```

The `keda-kaito-scaler` provides a simplified configuration interface for scaling vLLM inference workloads. It scrapes metrics directly from inference pods, eliminating the need for a separate monitoring stack.

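If you're curious what the scaler sees, vLLM serves Prometheus-format metrics over HTTP. The sketch below assumes vLLM's default serving port of 8000 and looks up the pod by the label used earlier; adjust both for your environment:

```bash
# Inspect the raw vLLM metrics the scaler scrapes
# (vLLM exposes /metrics on its serving port, 8000 by default).
POD=$(kubectl get pods -n default -l apps=phi-4-mini -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward "pod/$POD" 8000:8000 -n default &
curl -s http://localhost:8000/metrics | grep vllm:num_requests_waiting
```
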
#### Example: Create a KAITO InferenceSet with annotations for running inference workloads

The following example creates an InferenceSet for the phi-4-mini model, using annotations with the prefix `scaledobject.kaito.sh/` to supply parameter inputs for the KEDA KAITO scaler:

- `scaledobject.kaito.sh/auto-provision`
  - required; when set to `true`, the KEDA KAITO scaler automatically provisions a ScaledObject based on the `InferenceSet` object
- `scaledobject.kaito.sh/max-replicas`
  - required; the maximum number of replicas for the target InferenceSet
- `scaledobject.kaito.sh/metricName`
  - optional; the name of the metric collected from the vLLM pod, used for monitoring and triggering the scaling operation. Defaults to `vllm:num_requests_waiting`; all vLLM metrics are listed in [vLLM Production Metrics](https://docs.vllm.ai/en/stable/usage/metrics/#general-metrics)
- `scaledobject.kaito.sh/threshold`
  - required; the threshold for the monitored metric that triggers the scaling operation

```bash
cat <<EOF | kubectl apply -f -
apiVersion: kaito.sh/v1alpha1
kind: InferenceSet
metadata:
  annotations:
    scaledobject.kaito.sh/auto-provision: "true"
    scaledobject.kaito.sh/max-replicas: "5"
    scaledobject.kaito.sh/metricName: "vllm:num_requests_waiting"
    scaledobject.kaito.sh/threshold: "10"
  name: phi-4-mini
  namespace: default
spec:
  labelSelector:
    matchLabels:
      apps: phi-4-mini
  replicas: 1
  template:
    inference:
      preset:
        accessMode: public
        name: phi-4-mini-instruct
    resource:
      instanceType: Standard_NC24ads_A100_v4
EOF
```

In just a few seconds, the KEDA KAITO scaler automatically creates the `scaledobject` and `hpa` objects. After a few minutes, once the inference pod is running, the KEDA KAITO scaler begins scraping [metric values](https://docs.vllm.ai/en/stable/usage/metrics/#general-metrics) from the inference pod, and the status of the `scaledobject` and `hpa` objects is marked as ready.

```bash
# kubectl get scaledobject
NAME         SCALETARGETKIND                  SCALETARGETNAME   MIN   MAX   READY   ACTIVE   FALLBACK   PAUSED   TRIGGERS   AUTHENTICATIONS           AGE
phi-4-mini   kaito.sh/v1alpha1.InferenceSet   phi-4-mini        1     5     True    True     False      False    external   keda-kaito-scaler-creds   10m

# kubectl get hpa
NAME                  REFERENCE                 TARGETS      MINPODS   MAXPODS   REPLICAS   AGE
keda-hpa-phi-4-mini   InferenceSet/phi-4-mini   0/10 (avg)   1         5         1          11m
```

That's it! Your KAITO workloads will now automatically scale based on the average number of waiting inference requests (`vllm:num_requests_waiting`) across all workloads associated with `InferenceSet/phi-4-mini` in the cluster.

In the example below, if `vllm:num_requests_waiting` exceeds the threshold (10) for over 60 seconds, KEDA scales up by adding a new replica to `InferenceSet/phi-4-mini`. Conversely, if `vllm:num_requests_waiting` remains below the threshold (10) for more than 300 seconds, KEDA scales down the number of replicas.

```yaml
Every 2.0s: kubectl describe hpa

Name:               keda-hpa-phi-4-mini
Namespace:          default
Labels:             app.kubernetes.io/managed-by=keda-operator
                    app.kubernetes.io/name=keda-hpa-phi-4-mini
                    app.kubernetes.io/part-of=phi-4-mini
                    app.kubernetes.io/version=2.18.1
                    scaledobject.keda.sh/name=phi-4-mini
Annotations:        scaledobject.kaito.sh/managed-by: keda-kaito-scaler
CreationTimestamp:  Tue, 09 Dec 2025 03:35:09 +0000
Reference:          InferenceSet/phi-4-mini
Metrics:            ( current / target )
  "s0-vllm:num_requests_waiting" (target average value):  58 / 10
Min replicas:       1
Max replicas:       5
Behavior:
  Scale Up:
    Stabilization Window: 60 seconds
    Select Policy: Max
    Policies:
      - Type: Pods  Value: 1  Period: 300 seconds
  Scale Down:
    Stabilization Window: 300 seconds
    Select Policy: Max
    Policies:
      - Type: Pods  Value: 1  Period: 600 seconds
InferenceSet pods:  2 current / 2 desired
Conditions:
  Type            Status  Reason            Message
  ----            ------  ------            -------
  AbleToScale     True    ReadyForNewScale  recommended size matches current size
  ScalingActive   True    ValidMetricFound  the HPA was able to successfully calculate a replica count from external metric s0-vllm:num_requests_waiting(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: phi-4-mini,},MatchExpressions:[]LabelSelectorRequirement{},})
  ScalingLimited  True    ScaleUpLimit      the desired replica count is increasing faster than the maximum scale rate
Events:
  Type    Reason             Age  From                       Message
  ----    ------             ---- ----                       -------
  Normal  SuccessfulRescale  33s  horizontal-pod-autoscaler  New size: 2; reason: external metric s0-vllm:num_requests_waiting(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: phi-4-mini,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
```
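
To reproduce a scale-up event like the one above, you can push sustained concurrent load at the OpenAI-compatible endpoint. The service name and port in this sketch are assumptions; adjust them for your cluster:

```bash
# Hypothetical load generator; service name and port are assumptions.
kubectl port-forward svc/phi-4-mini 8000:80 -n default &

# Fire 50 concurrent completion requests so vllm:num_requests_waiting
# climbs past the threshold of 10 for longer than 60 seconds.
for i in $(seq 1 50); do
  curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "phi-4-mini-instruct", "prompt": "Explain Kubernetes autoscaling.", "max_tokens": 512}' \
    > /dev/null &
done
wait
```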

## Summary

KAITO's LLM inference service must scale inference instances dynamically to handle varying numbers of waiting requests: scaling up to prevent blocking when requests increase, and scaling down to optimize GPU usage when requests decrease. With the newly introduced InferenceSet CRD and the KEDA KAITO scaler, configuring this behavior in KAITO has become much simpler.

We're just getting started and would love your feedback. To learn more about KAITO inference workload autoscaling and AI model deployment on AKS, check out the following links:

## Resources

- [KEDA Auto-Scaler for inference workloads](https://kaito-project.github.io/kaito/docs/keda-autoscaler-inference)
- [KAITO InferenceSet](https://github.com/kaito-project/kaito/blob/main/docs/proposals/20250918-introduce_inferenceset_autoscaling.md)
- [vLLM Production Metrics](https://docs.vllm.ai/en/stable/usage/metrics/#general-metrics)
keda-kaito-scaler-arch.png (178 KB)

website/blog/authors.yml

Lines changed: 9 additions & 0 deletions

```yaml
andy-zhang:
  name: Andy Zhang
  title: Principal Software Engineer for the Azure Kubernetes Service
  url: https://www.linkedin.com/in/andy-zhang-a7bb9676/
  image_url: https://avatars.githubusercontent.com/andyzhangx
  socials:
    linkedin: andy-zhang-a7bb9676
    github: andyzhangx

ahmed-sabbour:
  name: Ahmed Sabbour
  title: Principal PM Lead for the Azure Kubernetes Service
```