Reduce API errors during rolling deployments #559

@scotwells

Description

Problem

When the Milo API server is updated, clients occasionally receive brief 502 errors. This happens because the gateway continues sending requests to pods that are shutting down before it realizes they're no longer available.

The issue resolves itself within minutes, but it creates noisy alerts and a degraded experience for anyone making API calls during that window.

Expected Behavior

Deployments should be seamless: clients should never see errors while the API server is being updated.

Proposed Solution

Add an Envoy Gateway BackendTrafficPolicy to the existing gateway-api kustomize component alongside the HTTPRoute. This gives operators a safe operational baseline out of the box without needing to configure it themselves.
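Roughly, the component's kustomization would just gain one more resource (file names and layout here are assumptions about the existing component, not a prescription):

```yaml
# gateway-api/kustomization.yaml (sketch)
apiVersion: kustomize.config.k8s.io/v1alpha1
kind: Component
resources:
  - httproute.yaml              # existing route
  - backendtrafficpolicy.yaml   # new policy proposed here
```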

The policy should include:

  • Retries on gateway errors so failed requests are transparently re-sent to a healthy pod
  • Passive health checking to quickly eject unhealthy pods from the load balancer pool
  • Active health checking so the gateway detects pod readiness faster than the default Kubernetes endpoint propagation
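A rough sketch of what such a policy could look like. Field names follow the Envoy Gateway `gateway.envoyproxy.io/v1alpha1` API but should be verified against the version in use; the route name, health-check path, thresholds, and intervals are all illustrative assumptions:

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: milo-apiserver
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: milo-apiserver          # assumed name of the existing HTTPRoute
  retry:
    numRetries: 3
    retryOn:
      triggers: ["connect-failure", "retriable-status-codes"]
      httpStatusCodes: [502, 503]
  healthCheck:
    active:                         # probe pods directly from the gateway
      type: HTTP
      interval: 3s
      timeout: 1s
      unhealthyThreshold: 3
      healthyThreshold: 1
      http:
        path: /readyz               # assumed readiness endpoint
    passive:                        # eject pods that start returning gateway errors
      consecutiveGatewayErrors: 2
      baseEjectionTime: 10s
      interval: 2s
      maxEjectionPercent: 100
```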

Additionally, add a PodDisruptionBudget to ensure a minimum number of API server pods remain available during voluntary disruptions (node drains, cluster upgrades).
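A minimal PDB sketch (the pod label and `minAvailable` value are assumptions to be adjusted to the actual deployment):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: milo-apiserver
spec:
  minAvailable: 1                 # keep at least one pod up during voluntary disruptions
  selector:
    matchLabels:
      app: milo-apiserver         # assumed pod label
```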

Context

This is a recurring issue observed in staging. Most recently, on 2026-04-02, a single 502 was returned to a client calling the OpenAPI discovery endpoint during a routine image-update rollout.
