Reduce API errors during rolling deployments #559
Description
Problem
When the Milo API server is updated, clients occasionally receive brief 502 errors. This happens because the gateway continues sending requests to pods that are shutting down before it realizes they're no longer available.
The issue resolves itself within minutes, but it creates noisy alerts and a degraded experience for anyone making API calls during that window.
Expected Behavior
Deployments should be seamless: clients should never see errors while the API server is being updated.
Proposed Solution
Add an Envoy Gateway BackendTrafficPolicy to the existing gateway-api kustomize component alongside the HTTPRoute. This gives operators a safe operational baseline out of the box without needing to configure it themselves.
The policy should include:
- Retries on gateway errors so failed requests are transparently re-sent to a healthy pod
- Passive health checking to quickly eject unhealthy pods from the load balancer pool
- Active health checking so the gateway detects pod readiness faster than the default Kubernetes endpoint propagation
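As a starting point for discussion, the three behaviors above could be expressed roughly as follows. This is a minimal sketch, not a tested configuration: the resource name, the HTTPRoute name `milo-apiserver`, the `/readyz` probe path, and every threshold and interval are assumptions chosen for illustration. Field names follow the Envoy Gateway `gateway.envoyproxy.io/v1alpha1` API as I understand it and should be checked against the installed Envoy Gateway version (older releases use a singular `targetRef`).

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: milo-apiserver            # hypothetical name
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: milo-apiserver        # assumed name of the existing HTTPRoute
  # Retry requests that fail with a gateway error so they land on a healthy pod.
  retry:
    numRetries: 2                 # illustrative value
    retryOn:
      triggers:
        - connect-failure
        - reset
        - gateway-error           # 502, 503, 504 from the upstream
    perRetry:
      backOff:
        baseInterval: 100ms
        maxInterval: 1s
  healthCheck:
    # Passive: eject a pod after consecutive gateway errors on live traffic.
    passive:
      consecutiveGatewayErrors: 2 # illustrative value
      interval: 2s
      baseEjectionTime: 30s
      maxEjectionPercent: 50      # never eject more than half the pool
    # Active: probe each pod directly instead of waiting for endpoint updates.
    active:
      type: HTTP
      interval: 3s
      timeout: 1s
      unhealthyThreshold: 2
      healthyThreshold: 1
      http:
        path: /readyz             # assumed readiness path
        expectedStatuses:
          - 200
```

Retrying on `gateway-error` is only safe for requests that can be replayed, so the retry scope may need narrowing if non-idempotent calls pass through the same route.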
Additionally, add a PodDisruptionBudget to ensure a minimum number of API server pods remain available during voluntary disruptions (node drains, cluster upgrades).
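A PodDisruptionBudget for this could look roughly like the sketch below. The resource name, the label selector, and the `minAvailable` value are assumptions; the selector must match whatever labels the API server Deployment actually sets on its pods.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: milo-apiserver                          # hypothetical name
spec:
  # Keep at least one API server pod running during node drains and upgrades.
  minAvailable: 1                               # illustrative value
  selector:
    matchLabels:
      app.kubernetes.io/name: milo-apiserver    # assumed pod label
```

Note that `minAvailable: 1` on a single-replica Deployment blocks node drains entirely, so this budget assumes the API server runs with at least two replicas.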
Context
This is a recurring issue observed in staging. The most recent occurrence was on 2026-04-02, when a single 502 was returned to a client calling the OpenAPI discovery endpoint during a routine image update rollout.