Reduce API errors during rolling deployments #559

@scotwells

Description

Problem

When the Milo API server is updated, clients occasionally receive brief 502 errors. This happens because the gateway continues sending requests to pods that are shutting down before it realizes they're no longer available.

The issue resolves itself within minutes, but it creates noisy alerts and a degraded experience for anyone making API calls during that window.

Expected Behavior

Deployments should be seamless: clients should never see errors while the API server is being updated.

Proposed Solution

Add an Envoy Gateway BackendTrafficPolicy to the existing gateway-api kustomize component alongside the HTTPRoute. This gives operators a safe operational baseline out of the box without needing to configure it themselves.
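Roughly, the component's kustomization would just gain one more resource (file names and layout here are assumptions about the existing component, not a prescription):

```yaml
# gateway-api/kustomization.yaml (sketch)
apiVersion: kustomize.config.k8s.io/v1alpha1
kind: Component
resources:
  - httproute.yaml              # existing route
  - backendtrafficpolicy.yaml   # new policy proposed here
```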

The policy should include:

  • Retries on gateway errors so failed requests are transparently re-sent to a healthy pod
  • Passive health checking to quickly eject unhealthy pods from the load balancer pool
  • Active health checking so the gateway detects pod readiness faster than the default Kubernetes endpoint propagation
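A rough sketch of what such a policy could look like. Field names follow the Envoy Gateway `gateway.envoyproxy.io/v1alpha1` API but should be verified against the version in use; the route name, health-check path, thresholds, and intervals are all illustrative assumptions:

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: milo-apiserver
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: milo-apiserver          # assumed name of the existing HTTPRoute
  retry:
    numRetries: 3
    retryOn:
      triggers: ["connect-failure", "retriable-status-codes"]
      httpStatusCodes: [502, 503]
  healthCheck:
    active:                         # probe pods directly from the gateway
      type: HTTP
      interval: 3s
      timeout: 1s
      unhealthyThreshold: 3
      healthyThreshold: 1
      http:
        path: /readyz               # assumed readiness endpoint
    passive:                        # eject pods that start returning gateway errors
      consecutiveGatewayErrors: 2
      baseEjectionTime: 10s
      interval: 2s
      maxEjectionPercent: 100
```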

Additionally, add a PodDisruptionBudget to ensure a minimum number of API server pods remain available during voluntary disruptions (node drains, cluster upgrades).
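A minimal PDB sketch (the pod label and `minAvailable` value are assumptions to be adjusted to the actual deployment):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: milo-apiserver
spec:
  minAvailable: 1                 # keep at least one pod up during voluntary disruptions
  selector:
    matchLabels:
      app: milo-apiserver         # assumed pod label
```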

Context

This is a recurring issue observed in staging. Most recently, on 2026-04-02, a single 502 was returned to a client calling the OpenAPI discovery endpoint during a routine image-update rollout.
