
Commit c4a17da

fix
1 parent 4e4ccad commit c4a17da

1 file changed: 6 additions & 6 deletions

File tree

  • website/blog/2025-12-11-autoscale-inference-workloads-with-kaito

website/blog/2025-12-11-autoscale-inference-workloads-with-kaito/index.md

@@ -12,7 +12,7 @@ tags: ["ai", "kaito"]
 
 ## Introduction
 
-LLM inference service is a basic and widely-used feature in KAITO, as the number of waiting inference requests increases, it is necessary to scale more inference instances in order to prevent blocking inference requests. On the other hand, if the number of waiting inference requests declines, we should consider reducing inference instances to improve GPU resource utilization. Kubernetes Event-driven Autoscaling (KEDA) is a good fit for inference pod autoscaling since it enables event-driven, fine-grained scaling based on external metrics and triggers, it supports a wide range of event sources (like custom metrics), allowing pods to scale precisely in response to workload demand. This flexibility and extensibility make KEDA ideal for dynamic, cloud-native applications that require responsive and efficient autoscaling.
+LLM inference service is a basic and widely used feature in KAITO. As the number of waiting inference requests increases, it's necessary to scale more inference instances to prevent blocking inference requests. Conversely, if the number of waiting inference requests declines, consider reducing inference instances to improve GPU resource utilization. Kubernetes Event-driven Autoscaling (KEDA) is well-suited for inference pod autoscaling. It enables event-driven, fine-grained scaling based on external metrics and triggers. KEDA supports a wide range of event sources (like custom metrics), allowing pods to scale precisely in response to workload demand. This flexibility and extensibility make KEDA ideal for dynamic, cloud-native applications that require responsive and efficient autoscaling.
 
 To enable intelligent autoscaling for KAITO inference workloads using service.monitoring metrics, use the following components and features:
 
@@ -24,7 +24,7 @@ To enable intelligent autoscaling for KAITO inference workloads using service.monitoring metrics, use the following components and features:
 
 ### Architecture
 
-Following diagram shows how keda-kaito-scaler integrates KAITO InferenceSet with KEDA to autoscale inference workloads on AKS:
+The following diagram shows how keda-kaito-scaler integrates KAITO InferenceSet with KEDA to autoscale inference workloads on AKS:
 
 ![Architecture diagram showing keda-kaito-scaler integrating KAITO InferenceSet with KEDA to autoscale inference workloads on AKS](keda-kaito-scaler-arch.png)
 
@@ -101,7 +101,7 @@ EOF
 
 - Create a KEDA ScaledObject
 
-Below is an example of creating a `ScaledObject` that scales a Kaito InferenceSet based on business hours:
+Below is an example of creating a `ScaledObject` that scales a KAITO InferenceSet based on business hours:
 
 - **Scale up to 5 replicas** from 6:00 AM to 8:00 PM (peak hours)
 
@@ -115,7 +115,7 @@ metadata:
   name: kaito-business-hours-scaler
   namespace: default
 spec:
-  # Target Kaito InferenceSet to scale
+  # Target KAITO InferenceSet to scale
   scaleTargetRef:
     apiVersion: kaito.sh/v1alpha1
     kind: InferenceSet
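
For context, the full manifest this hunk is drawn from would look roughly like the sketch below. This is a minimal sketch assuming KEDA's built-in `cron` trigger; the InferenceSet name (`llama-inference`), the timezone, and the off-peak floor of 1 replica are illustrative, since the diff shows only the peak-hours figure.

```bash
kubectl apply -f - <<EOF
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kaito-business-hours-scaler
  namespace: default
spec:
  # Target KAITO InferenceSet to scale
  scaleTargetRef:
    apiVersion: kaito.sh/v1alpha1
    kind: InferenceSet
    name: llama-inference    # hypothetical name, not shown in the diff
  minReplicaCount: 1         # assumed off-peak floor
  maxReplicaCount: 5
  triggers:
    - type: cron             # KEDA built-in cron scaler
      metadata:
        timezone: Etc/UTC    # illustrative; use your local timezone
        start: 0 6 * * *     # peak window opens at 6:00 AM
        end: 0 20 * * *      # peak window closes at 8:00 PM
        desiredReplicas: "5" # scale up to 5 replicas during the window
EOF
```

Outside the `start`/`end` window, KEDA falls back to `minReplicaCount`, which is how the off-peak scale-down happens.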
@@ -170,7 +170,7 @@ The `keda-kaito-scaler` provides a simplified configuration interface for scaling
 - `scaledobject.kaito.sh/auto-provision`
   - required, if it's `true`, KEDA KAITO scaler will automatically provision a ScaledObject based on the `InferenceSet` object
 - `scaledobject.kaito.sh/max-replicas`
-  - required, maximum replica number of target InferenceSet
+  - required, maximum number of replicas for the target InferenceSet
 - `scaledobject.kaito.sh/metricName`
   - optional, specifies the metric name collected from the vLLM pod, which is used for monitoring and triggering the scaling operation, default is `vllm:num_requests_waiting`, find all vllm metrics in [vLLM Production Metrics](https://docs.vllm.ai/en/stable/usage/metrics/#general-metrics)
 - `scaledobject.kaito.sh/threshold`
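
Given those annotations, the auto-provisioning flow can be exercised without writing a ScaledObject by hand. A minimal sketch, assuming a hypothetical InferenceSet named `llama-inference` and an illustrative threshold of `10` (the threshold description is cut off in this hunk):

```bash
# Annotation keys and the default metric name come from the list above;
# the resource name and threshold value are illustrative.
kubectl annotate inferenceset llama-inference \
  scaledobject.kaito.sh/auto-provision="true" \
  scaledobject.kaito.sh/max-replicas="5" \
  scaledobject.kaito.sh/metricName="vllm:num_requests_waiting" \
  scaledobject.kaito.sh/threshold="10"
```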
@@ -203,7 +203,7 @@ spec:
 EOF
 ```
 
-In just a few seconds, the KEDA KAITO scaler automatically creates the `scaledobject` and `hpa` objects. After a few minutes, once the inference pod runs, the KEDA KAITO scaler begins scraping [metric values](https://docs.vllm.ai/en/stable/usage/metrics/#general-metrics) from the inference pod, and the system marks the status of the `scaledobject` and `hpa` objects as ready.
+In just a few seconds, the KEDA KAITO scaler automatically creates the `scaledobject` and `hpa` objects. After a few minutes, once the inference pod runs, the KEDA KAITO scaler begins scraping [metric values](https://docs.vllm.ai/en/stable/usage/metrics/#general-metrics) from the inference pod. The system then marks the status of the `scaledobject` and `hpa` objects as ready.
 
 ```bash
 # kubectl get scaledobject
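# --- Illustrative continuation (not shown in the commit) ---
# Once the scaler has provisioned the objects, both should be listed
# and eventually report a ready status:
kubectl get scaledobject,hpa -n default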
