website/blog/2025-12-11-autoscale-inference-workloads-with-kaito/index.md
## Introduction
LLM inference serving is a basic and widely used feature in KAITO. As the number of waiting inference requests increases, it becomes necessary to scale out to more inference instances to keep requests from being blocked. Conversely, if the number of waiting inference requests declines, consider reducing inference instances to improve GPU resource utilization. Kubernetes Event-driven Autoscaling (KEDA) is well-suited for inference pod autoscaling: it enables event-driven, fine-grained scaling based on external metrics and triggers, and it supports a wide range of event sources (such as custom metrics), allowing pods to scale precisely in response to workload demand. This flexibility and extensibility make KEDA ideal for dynamic, cloud-native applications that require responsive and efficient autoscaling.
To enable intelligent autoscaling for KAITO inference workloads using service monitoring metrics, use the following components and features:
### Architecture
The following diagram shows how keda-kaito-scaler integrates KAITO InferenceSet with KEDA to autoscale inference workloads on AKS:

- Create a KEDA ScaledObject
Below is an example of creating a `ScaledObject` that scales a KAITO InferenceSet based on business hours:
- **Scale up to 5 replicas** from 6:00 AM to 8:00 PM (peak hours)
```yaml
metadata:
  name: kaito-business-hours-scaler
  namespace: default
spec:
  # Target KAITO InferenceSet to scale
  scaleTargetRef:
    apiVersion: kaito.sh/v1alpha1
    kind: InferenceSet
```
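
The `triggers` section of this `ScaledObject` is not reproduced in the excerpt above. As a rough sketch only (the timezone, cron schedule, and replica count here are assumptions, not values from the original post), a business-hours schedule like the one described can be expressed with KEDA's built-in cron scaler:

```yaml
  # Assumed illustration: KEDA cron trigger for the 6:00 AM - 8:00 PM peak window
  triggers:
    - type: cron
      metadata:
        timezone: Etc/UTC        # assumed timezone
        start: 0 6 * * *         # start of the peak window (6:00 AM)
        end: 0 20 * * *          # end of the peak window (8:00 PM)
        desiredReplicas: "5"     # replicas held during the window
```

Outside that window, KEDA scales the target back toward the ScaledObject's `minReplicaCount`.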
The `keda-kaito-scaler` provides a simplified configuration interface for scaling through the following annotations:

- `scaledobject.kaito.sh/auto-provision`
  - required; if it's `true`, the KEDA KAITO scaler will automatically provision a ScaledObject based on the `InferenceSet` object
- `scaledobject.kaito.sh/max-replicas`
  - required; the maximum number of replicas for the target InferenceSet
- `scaledobject.kaito.sh/metricName`
  - optional; specifies the metric name collected from the vLLM pod, which is used for monitoring and triggering the scaling operation. The default is `vllm:num_requests_waiting`; all available vLLM metrics are listed in [vLLM Production Metrics](https://docs.vllm.ai/en/stable/usage/metrics/#general-metrics)
- `scaledobject.kaito.sh/threshold`
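
Putting these annotations together, a hypothetical InferenceSet manifest might look like the sketch below. The annotation keys come from the list above; the name, namespace, threshold value, and the omitted spec fields are assumptions rather than content from the original walkthrough:

```yaml
# Hypothetical example of an InferenceSet annotated for auto-provisioned scaling
apiVersion: kaito.sh/v1alpha1
kind: InferenceSet
metadata:
  name: llm-inference                            # assumed name
  namespace: default
  annotations:
    scaledobject.kaito.sh/auto-provision: "true"
    scaledobject.kaito.sh/max-replicas: "5"
    scaledobject.kaito.sh/metricName: "vllm:num_requests_waiting"
    scaledobject.kaito.sh/threshold: "10"        # assumed target metric value per replica
spec: {}                                         # InferenceSet spec omitted; see the KAITO documentation
```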
In just a few seconds, the KEDA KAITO scaler automatically creates the `scaledobject` and `hpa` objects. After a few minutes, once the inference pod runs, the KEDA KAITO scaler begins scraping [metric values](https://docs.vllm.ai/en/stable/usage/metrics/#general-metrics) from the inference pod. The system then marks the status of the `scaledobject` and `hpa` objects as ready.
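
As a quick way to verify this (a generic check, not a step from the original post), list both objects and confirm they report ready/active status:

```bash
# The ScaledObject is created by the KEDA KAITO scaler; KEDA then creates the HPA from it
kubectl get scaledobject,hpa -n default
```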