* add GIE queuing for scale-from-zero e2es
Signed-off-by: Mohammed Abdi <mohammed.munir.abdi@ibm.com>
* makes sure leftover scale-from-zero resources are cleaned up
Signed-off-by: Mohammed Abdi <mohammed.munir.abdi@ibm.com>
* autodetect inference_pool_api_group
Signed-off-by: Mohammed Abdi <mohammed.munir.abdi@ibm.com>
* doc nits and clarifications
Signed-off-by: Mohammed Abdi <mohammed.munir.abdi@ibm.com>
---------
Signed-off-by: Mohammed Abdi <mohammed.munir.abdi@ibm.com>
**docs/developer-guide/testing.md** (2 additions, 0 deletions)

````diff
@@ -159,6 +159,8 @@ This deploys:
 - Prometheus stack and Prometheus Adapter (or KEDA when `SCALER_BACKEND=keda`)
 - **No** VariantAutoscaling, HPA, or model services (tests create these)
 
+When `E2E_TESTS_ENABLED=true` (or `ENABLE_SCALE_TO_ZERO=true`), the deploy script also enables **GIE queuing** so scale-from-zero tests can run: it patches the EPP with `ENABLE_EXPERIMENTAL_FLOW_CONTROL_LAYER=true` and applies an **InferenceObjective** (`e2e-default`) that references the default InferencePool. This ensures the metric `inference_extension_flow_control_queue_size` is populated when requests hit the gateway.
+
 Alternatively, use the Makefile to deploy infra and run tests in one go:
````
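For reference, the enablement described in the added paragraph roughly amounts to the two steps sketched below. This is a minimal, assumption-laden sketch, not the script's actual code: the deployment name (`gaie-epp`), namespace (`llm-d`), and pool name (`default-pool`) are illustrative placeholders, and the manifest follows the Gateway API Inference Extension `v1alpha2` schema as commonly published; the authoritative versions are `deploy/install.sh` and `deploy/inference-objective-e2e.yaml`.

```bash
# Sketch only; names below are assumptions, see deploy/install.sh for the
# real resource names and namespace.

# 1. Enable the experimental flow-control (queuing) layer on the EPP:
kubectl -n llm-d set env deployment/gaie-epp \
  ENABLE_EXPERIMENTAL_FLOW_CONTROL_LAYER=true

# 2. Apply an InferenceObjective named e2e-default that references the
#    default InferencePool, so queued requests are tracked and
#    inference_extension_flow_control_queue_size is populated:
kubectl apply -f - <<'EOF'
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceObjective
metadata:
  name: e2e-default
  namespace: llm-d        # assumed namespace
spec:
  priority: 1             # QoS priority; the value here is illustrative
  poolRef:
    name: default-pool    # assumed InferencePool name
EOF
```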
**docs/developer-guide/troubleshooting.md** (7 additions, 1 deletion)

````diff
@@ -13,7 +13,9 @@
    kubectl get inferencepool
    ```
 
-**Solution**: Ensure InferencePool is created and reconciled before creating VariantAutoscaling.
+WVA watches a single InferencePool API group (`inference.networking.k8s.io` or `inference.networking.x-k8s.io`). If the cluster's pools use the other group, the datastore stays empty and scale-from-zero never gets a recommendation.
+
+**Solution**: Ensure InferencePool is created and reconciled before creating VariantAutoscaling. When using `deploy/install.sh` with llm-d (e.g. kind-emulator or CI), the script auto-detects the pool API group after the llm-d deploy and upgrades WVA with the correct `wva.poolGroup`, so both local and CI runs work regardless of the llm-d version.
 
 2. **Labels mismatch**:
    ```bash
@@ -54,6 +56,10 @@
 
 **Solution**: Verify requests are being sent to the correct model endpoint.
 
+### E2E and infra-only deploys
+
+For e2e and infra-only deploys, the install script enables EPP flow control and optionally applies an InferenceObjective when `E2E_TESTS_ENABLED=true` or `ENABLE_SCALE_TO_ZERO=true`. See [deploy/install.sh](https://github.com/llm-d/llm-d-workload-variant-autoscaler/blob/main/deploy/install.sh) and [deploy/inference-objective-e2e.yaml](https://github.com/llm-d/llm-d-workload-variant-autoscaler/blob/main/deploy/inference-objective-e2e.yaml).
+
 ## Slow Scale-Up Response
 
 **Symptom**: Deployment takes too long to scale up from zero.
````
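As a rough illustration of the auto-detection step described in the new **Solution** text above (a sketch under stated assumptions, not the script's actual code): the idea is to check which InferencePool CRD group is registered in the cluster and pass it to WVA via `wva.poolGroup`. The Helm release name and chart path below are hypothetical.

```bash
# Hypothetical sketch of the pool API group auto-detection; the real logic
# lives in deploy/install.sh and may differ in detail.
# Prefer the GA group when its InferencePool CRD exists, else fall back.
if kubectl get crd inferencepools.inference.networking.k8s.io >/dev/null 2>&1; then
  POOL_GROUP="inference.networking.k8s.io"
else
  POOL_GROUP="inference.networking.x-k8s.io"
fi

# Re-render WVA with the detected group (release/chart names are assumptions):
helm upgrade --install wva ./charts/workload-variant-autoscaler \
  --reuse-values \
  --set wva.poolGroup="${POOL_GROUP}"
```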
Additional file (name not captured in this view):

````diff
@@ ... @@
 - WVA and llm-d installed and running - deployment options available for [kind](https://github.com/llm-d/llm-d-workload-variant-autoscaler/blob/main/deploy/kind-emulator/README.md), [OpenShift](https://github.com/llm-d/llm-d-workload-variant-autoscaler/blob/main/deploy/openshift/README.md) and [Kubernetes](https://github.com/llm-d/llm-d-workload-variant-autoscaler/blob/main/deploy/kubernetes/README.md)
-- EndpointPicker (EPP) configured with flowcontrol enabled - required for queue metrics collection (set EPP env variable `ENABLE_EXPERIMENTAL_FLOW_CONTROL_LAYER`)
+- **EPP flow control**: EndpointPicker (EPP) with flow control enabled (set EPP env `ENABLE_EXPERIMENTAL_FLOW_CONTROL_LAYER=true`) so the queue metric `inference_extension_flow_control_queue_size` is collected. InferenceObjective is not required to enable this metric; it is a QoS policy for priority-based scheduling and optional for scale-from-zero.
@@ ... @@
+- Requires EPP flow control enabled so the metric `inference_extension_flow_control_queue_size` is populated (InferenceObjective is not required for this metric). When deploying infra with `E2E_TESTS_ENABLED=true` (or `ENABLE_SCALE_TO_ZERO=true`), the install script enables flow control on the EPP and optionally applies an InferenceObjective for e2e.
 - Create HPA (or KEDA ScaledObject) with minReplicas=0
 - Verify deployment scales to 0 when idle
 - Generate first request, verify scale-up from 0 → 1
````
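If the scale-from-zero test steps above stall, one quick check is whether the queue metric is actually being emitted while requests are in flight. A sketch, assuming the EPP serves Prometheus metrics on port 9090 and the illustrative names from earlier; adjust to your install.

```bash
# Assumption: EPP deployment gaie-epp in namespace llm-d exposes metrics
# on port 9090; these names are placeholders, not the documented defaults.
kubectl -n llm-d port-forward deployment/gaie-epp 9090:9090 &
curl -s http://localhost:9090/metrics \
  | grep inference_extension_flow_control_queue_size

# In Prometheus, a nonzero queue while the deployment sits at 0 replicas is
# what the scale-from-zero path relies on (per the docs changes above):
#   sum(inference_extension_flow_control_queue_size) > 0
```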