|
| 1 | +# Workaround: Azure Load Balancer Health Probe for Istio Gateway |
| 2 | + |
| 3 | +## Problem |
| 4 | + |
| 5 | +When deploying KServe with Istio Gateway API on AKS, external traffic to the inference gateway on port 80 times out, even though the gateway pod is running and works fine from inside the cluster. |
| 6 | + |
| 7 | +Port 15021 (Istio health port) works externally, but port 80 does not. |
| 8 | + |
| 9 | +## Root Cause |
| 10 | + |
| 11 | +AKS automatically creates an **HTTP health probe** for LoadBalancer service ports that have `appProtocol: http` set. The Istio Gateway controller sets `appProtocol: http` on port 80 by default. |
| 12 | + |
| 13 | +The HTTP health probe sends `GET /` to the nodePort backing port 80. Since no HTTPRoute matches `/`, Istio returns **404**. The Azure Load Balancer treats this as unhealthy and **stops forwarding all traffic** to port 80. |
| 14 | + |
| 15 | +Port 15021 works because its health probe uses **TCP** (just checks if the port is open). |
| 16 | + |
| 17 | +```text |
| 18 | +Azure LB health probe → HTTP GET / → nodePort → Istio port 80 → 404 → backend marked unhealthy → all traffic dropped |
| 19 | +``` |
| 20 | + |
| 21 | +## Why deploying an HTTPRoute doesn't fix it |
| 22 | + |
| 23 | +Deploying an HTTPRoute for your model (e.g., `/llm-inference/qwen2-7b-instruct/...`) does not fix this because the health probe hits `/`, not your model path. Unless you have a route that explicitly matches `/` and returns 200, the probe will continue to fail. |
| 24 | + |
| 25 | +## Fix |
| 26 | + |
| 27 | +Annotate the `inference-gateway-istio` service to use a **TCP health probe** for the affected port instead of HTTP. |
| 28 | + |
| 29 | +> **Note:** The port number in the annotation (`port_80`) must match the Gateway listener port. Port 80 is used here because that is what `setup-gateway.sh` configures in the Gateway's `listeners` spec. If your Gateway uses a different port, update the annotation key accordingly (e.g., `port_8080_health-probe_protocol`). |
| 30 | +
|
| 31 | +```bash |
| 32 | +kubectl annotate svc inference-gateway-istio -n opendatahub \ |
| 33 | + "service.beta.kubernetes.io/port_80_health-probe_protocol=tcp" \ |
| 34 | + --overwrite |
| 35 | +``` |
| 36 | + |
| 37 | +This annotation is applied automatically on AKS when using `setup-gateway.sh`. The manual command above is only needed if you recreate the Gateway without re-running the setup script. |
| 38 | + |
| 39 | +### Verify the probe changed |
| 40 | + |
| 41 | +```bash |
| 42 | +# Find the MC resource group |
| 43 | +NODE_RG=$(az aks show --resource-group <rg> --name <cluster> --query nodeResourceGroup -o tsv) |
| 44 | + |
| 45 | +# Check probes |
| 46 | +az network lb probe list --resource-group "$NODE_RG" --lb-name kubernetes -o table |
| 47 | +``` |
| 48 | + |
| 49 | +The port 80 probe should now show `Protocol: Tcp` instead of `Http`. |
| 50 | + |
| 51 | +## How to diagnose this issue |
| 52 | + |
| 53 | +1. Verify the gateway works from inside the cluster (bypasses Azure LB): |
| 54 | + ```bash |
| 55 | + kubectl run curl-test --rm -i --restart=Never --image=curlimages/curl \ |
| 56 | + -- curl -s -o /dev/null -w "HTTP %{http_code}" \ |
| 57 | + http://inference-gateway-istio.opendatahub.svc.cluster.local:80/ |
| 58 | + ``` |
| 59 | + If this returns 404 but external access times out, the LB health probe is the issue. |
| 60 | + |
| 61 | +2. Check the Azure LB health probe configuration: |
| 62 | + ```bash |
| 63 | + NODE_RG=$(az aks show --resource-group <rg> --name <cluster> --query nodeResourceGroup -o tsv) |
| 64 | + az network lb probe list --resource-group "$NODE_RG" --lb-name kubernetes -o table |
| 65 | + ``` |
| 66 | + If the port 80 probe shows `Protocol: Http` and `RequestPath: /`, that confirms the problem. |
| 67 | + |
| 68 | + > **Note:** If the `inference-gateway-istio` service is annotated with `service.beta.kubernetes.io/azure-load-balancer-internal: "true"`, use `--lb-name kubernetes-internal` instead. |
| 69 | +
|
| 70 | +## Notes |
| 71 | + |
| 72 | +- This is an AKS-specific issue. AWS and GCP load balancers default to TCP health checks. |
| 73 | +- On AKS clusters v1.24+, `spec.ports.appProtocol` is used as the health probe protocol with `/` as the default request path. Since the Istio Gateway controller sets `appProtocol: http` on port 80, AKS creates an HTTP probe by default. |
| 74 | +- The annotation `service.beta.kubernetes.io/port_80_health-probe_protocol` is a per-port override. The generic `service.beta.kubernetes.io/azure-load-balancer-health-probe-protocol` applies to all ports but may not take effect if the gateway controller reconciles the service. |
| 75 | +- The Istio gateway service is managed by the Gateway controller and has no annotations by default. |
| 76 | + |
| 77 | +## References |
| 78 | + |
| 79 | +- [Configure a Public Standard Load Balancer in AKS](https://learn.microsoft.com/en-us/azure/aks/configure-load-balancer-standard) — Microsoft documentation on per-port health probe annotation overrides and default probe behavior. |
| 80 | +- [Troubleshoot AKS Health Probe Mode](https://learn.microsoft.com/en-us/troubleshoot/azure/azure-kubernetes/availability-performance/cluster-service-health-probe-mode-issues) — Troubleshooting guide for health probe issues. |
0 commit comments