docs: add autoscale-inference-workloads-with-kaito blog #5507
Merged · 21 commits
- 303ea6e docs: add autoscale-inference-workloads-with-kaito blog (andyzhangx)
- cb0e2d2 fix indent (andyzhangx)
- aabee61 fix (andyzhangx)
- 36ab2c0 fix (andyzhangx)
- aba1d15 fix (andyzhangx)
- 7ba066d fix (andyzhangx)
- cfb74bc fix comments (andyzhangx)
- d60ad9b fix Markdown (andyzhangx)
- 3b77708 fix (andyzhangx)
- c44f47c add more examples (andyzhangx)
- c7bbd18 fix Markdown (andyzhangx)
- 35e5b7b fix copilot comments (andyzhangx)
- 2467e42 fix date (andyzhangx)
- b792548 fix copilot comments #2 (andyzhangx)
- 95facc7 fix copilot comments (andyzhangx)
- 4e4ccad fix (andyzhangx)
- c4a17da fix (andyzhangx)
- 3c83e18 fix comments (andyzhangx)
- e02273e move folder name (andyzhangx)
- 8d0a408 move to 2026-02-03 (andyzhangx)
- d26fe65 fix comments (andyzhangx)
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Some comments aren't visible on the classic Files Changed page.
There are no files selected for viewing
website/blog/2025-12-11-autoscale-inference-workloads-with-kaito/index.md (149 additions, 0 deletions)
---
title: "Autoscale KAITO inference workloads on AKS using KEDA"
date: "2025-12-11"
description: "Autoscale your KAITO inference workloads using KEDA"
authors: ["andy-zhang"]
tags: ["ai", "inference", "keda", "kaito"]
---

[Kubernetes AI Toolchain Operator](https://github.com/Azure/kaito/tree/main) (KAITO) is an operator that automates AI/ML model inference and tuning workloads in a Kubernetes cluster. With the [v0.8.0 release](https://github.com/Azure/kaito/releases/tag/v0.8.0), KAITO introduced intelligent autoscaling for inference workloads as an alpha feature.
## Overview

This blog outlines the steps to enable intelligent autoscaling of KAITO inference workloads based on service monitoring metrics, using the following components and features:

- [KEDA](https://github.com/kedacore/keda)
  - The Kubernetes-based Event Driven Autoscaling component.
- [keda-kaito-scaler](https://github.com/kaito-project/keda-kaito-scaler)
  - A dedicated KEDA external scaler that eliminates the need for external dependencies such as Prometheus.
- KAITO `InferenceSet` CRD and controller
  - A new CRD and controller built on top of the KAITO Workspace for intelligent autoscaling, introduced as an alpha feature in KAITO `v0.8.0`.

### Architecture

![keda-kaito-scaler architecture](keda-kaito-scaler-arch.png)
## Prerequisites

- Install KEDA

  > The following example installs KEDA using its Helm chart. For other installation methods, refer to the [KEDA deployment guide](https://github.com/kedacore/keda#deploying-keda).

  ```bash
  helm repo add kedacore https://kedacore.github.io/charts
  helm install keda kedacore/keda --namespace keda --create-namespace
  ```

- Install keda-kaito-scaler

  ```bash
  helm repo add keda-kaito-scaler https://kaito-project.github.io/keda-kaito-scaler/charts/kaito-project
  helm upgrade --install keda-kaito-scaler -n kaito-workspace keda-kaito-scaler/keda-kaito-scaler --create-namespace
  ```
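Before moving on, it can help to confirm that both components are up. A quick check, using the same namespaces as the install commands above:

```shell
# Confirm the KEDA operator pods are running in the keda namespace
kubectl get pods -n keda

# Confirm the keda-kaito-scaler pod is running in the kaito-workspace namespace
kubectl get pods -n kaito-workspace
```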
## Enable this feature on KAITO

This feature is available starting from KAITO `v0.8.0`, and the InferenceSet controller must be enabled during KAITO installation:

```bash
export CLUSTER_NAME=kaito

helm repo add kaito https://kaito-project.github.io/kaito/charts/kaito
helm repo update
helm upgrade --install kaito-workspace kaito/workspace \
  --namespace kaito-workspace \
  --create-namespace \
  --set clusterName="$CLUSTER_NAME" \
  --set featureGates.enableInferenceSetController=true \
  --wait
```
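One way to confirm the controller is enabled is to check that an `InferenceSet` CRD has been registered. The exact CRD name is not stated in this post (the `kaito.sh/v1alpha1` API group used below suggests it), so the grep here deliberately matches broadly:

```shell
# List installed CRDs; an InferenceSet entry under the kaito.sh group
# should appear when the InferenceSet controller is enabled
kubectl get crds | grep -i kaito
```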
## Quickstart

### Create a KAITO InferenceSet for running inference workloads

The following example creates an InferenceSet for the phi-4-mini model, using annotations with the prefix `scaledobject.kaito.sh/` to supply parameter inputs for keda-kaito-scaler:

- `scaledobject.kaito.sh/auto-provision`
  - Required. Specifies whether keda-kaito-scaler automatically provisions a `ScaledObject` based on the `InferenceSet` object.
- `scaledobject.kaito.sh/metricName`
  - Optional. Specifies the metric collected from the vLLM pod that is monitored to trigger the scaling operation. Defaults to `vllm:num_requests_waiting`.
- `scaledobject.kaito.sh/threshold`
  - Required. Specifies the threshold for the monitored metric that triggers the scaling operation.
```bash
cat <<EOF | kubectl apply -f -
apiVersion: kaito.sh/v1alpha1
kind: InferenceSet
metadata:
  annotations:
    scaledobject.kaito.sh/auto-provision: "true"
    scaledobject.kaito.sh/metricName: "vllm:num_requests_waiting"
    scaledobject.kaito.sh/threshold: "10"
  name: phi-4-mini
  namespace: default
spec:
  labelSelector:
    matchLabels:
      apps: phi-4-mini
  replicas: 1
  nodeCountLimit: 5
  template:
    inference:
      preset:
        accessMode: public
        name: phi-4-mini-instruct
    resource:
      instanceType: Standard_NC24ads_A100_v4
EOF
```
Within a few seconds, keda-kaito-scaler automatically creates the `ScaledObject` and `HPA` objects. After a few minutes, once the inference pod is running, keda-kaito-scaler begins scraping metric values from the inference pod, and the `ScaledObject` and `HPA` objects are marked as ready.

```bash
# kubectl get scaledobject
NAME         SCALETARGETKIND                  SCALETARGETNAME   MIN   MAX   READY   ACTIVE   FALLBACK   PAUSED   TRIGGERS   AUTHENTICATIONS           AGE
phi-4-mini   kaito.sh/v1alpha1.InferenceSet   phi-4-mini        1     5     True    True     False      False    external   keda-kaito-scaler-creds   10m

# kubectl get hpa
NAME                  REFERENCE                 TARGETS      MINPODS   MAXPODS   REPLICAS   AGE
keda-hpa-phi-4-mini   InferenceSet/phi-4-mini   0/10 (avg)   1         5         1          11m
```
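To see scaling in action, you can send a burst of concurrent requests at the inference endpoint to queue work on the vLLM server. The sketch below assumes the InferenceSet is exposed through a Kubernetes service named `phi-4-mini` in the `default` namespace serving vLLM's OpenAI-compatible API on port 80; the service name, port, and model name are assumptions, so adjust them to match your deployment:

```shell
# Port-forward the (assumed) phi-4-mini service to localhost
kubectl port-forward svc/phi-4-mini 8080:80 &

# Fire 50 concurrent completion requests; queued requests drive up
# the vllm:num_requests_waiting metric that the scaler monitors
for i in $(seq 1 50); do
  curl -s http://localhost:8080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "phi-4-mini-instruct", "prompt": "Hello", "max_tokens": 256}' &
done
wait
```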
That's it! Your KAITO workloads will now automatically scale based on the number of waiting inference requests (`vllm:num_requests_waiting`).

In the example below, when `vllm:num_requests_waiting` exceeds the threshold (10) for more than 60 seconds, KEDA scales up a new `InferenceSet/phi-4-mini` replica.
```console
Every 2.0s: kubectl describe hpa

Name:                 keda-hpa-phi-4-mini
Namespace:            default
Labels:               app.kubernetes.io/managed-by=keda-operator
                      app.kubernetes.io/name=keda-hpa-phi-4-mini
                      app.kubernetes.io/part-of=phi-4-mini
                      app.kubernetes.io/version=2.18.1
                      scaledobject.keda.sh/name=phi-4-mini
Annotations:          scaledobject.kaito.sh/managed-by: keda-kaito-scaler
CreationTimestamp:    Tue, 09 Dec 2025 03:35:09 +0000
Reference:            InferenceSet/phi-4-mini
Metrics:              ( current / target )
  "s0-vllm:num_requests_waiting" (target average value):  58 / 10
Min replicas:         1
Max replicas:         5
Behavior:
  Scale Up:
    Stabilization Window: 60 seconds
    Select Policy: Max
    Policies:
      - Type: Pods  Value: 1  Period: 300 seconds
  Scale Down:
    Stabilization Window: 300 seconds
    Select Policy: Max
    Policies:
      - Type: Pods  Value: 1  Period: 600 seconds
InferenceSet pods:    2 current / 2 desired
Conditions:
  Type            Status  Reason            Message
  ----            ------  ------            -------
  AbleToScale     True    ReadyForNewScale  recommended size matches current size
  ScalingActive   True    ValidMetricFound  the HPA was able to successfully calculate a replica count from external metric s0-vllm:num_requests_waiting(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: phi-4-mini,},MatchExpressions:[]LabelSelectorRequirement{},})
  ScalingLimited  True    ScaleUpLimit      the desired replica count is increasing faster than the maximum scale rate
Events:
  Type    Reason             Age  From                       Message
  ----    ------             ---- ----                       -------
  Normal  SuccessfulRescale  33s  horizontal-pod-autoscaler  New size: 2; reason: external metric s0-vllm:num_requests_waiting(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: phi-4-mini,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
```
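For context on the `ScaleUpLimit` condition above: the HPA's core formula is `desiredReplicas = ceil(currentReplicas × currentMetricValue / targetValue)`, clamped to the min/max replica range, after which the scale-up policy (here, at most 1 pod per 300 seconds) caps how fast the HPA can get there. A minimal sketch of that arithmetic, with illustrative function names that are not part of any KEDA or Kubernetes API:

```python
import math

def hpa_desired_replicas(current_replicas: int, current_avg: float,
                         target_avg: float, min_replicas: int,
                         max_replicas: int) -> int:
    """Core HPA formula: ceil(current * metric / target), clamped to [min, max]."""
    raw = math.ceil(current_replicas * current_avg / target_avg)
    return max(min_replicas, min(max_replicas, raw))

def apply_scale_up_policy(current_replicas: int, desired: int,
                          pods_per_period: int) -> int:
    """A 'Pods' scale-up policy allows at most N new pods per policy period."""
    return min(desired, current_replicas + pods_per_period)

# Values from the HPA output above: average of 58 waiting requests against a
# target of 10, currently 2 replicas, maxReplicas=5, policy of 1 pod per period
desired = hpa_desired_replicas(2, 58, 10, 1, 5)  # ceil(11.6) = 12, clamped to 5
step = apply_scale_up_policy(2, desired, 1)      # policy allows only +1 this period
print(desired, step)  # 5 3
```

Because the raw recommendation (5) exceeds what the policy permits in one period (3), the HPA reports `ScaleUpLimit` and grows one replica at a time.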
Binary file added (+178 KB): .../2025-12-11-autoscale-inference-workloads-with-kaito/keda-kaito-scaler-arch.png