
fix(controller): use noCacheReader in webhook to avoid extProc injection cache race#1981

Open
KaveeshKhattar wants to merge 2 commits into envoyproxy:main from KaveeshKhattar:fix/admission-cache-race-extproc-injection
@KaveeshKhattar
Contributor

Description

Adds a noCacheReader client.Reader field to gatewayMutator wired from mgr.GetAPIReader() at construction time.

Route lookups are extracted into two helper methods:

  • listAIGatewayRoutesForGateway
  • listMCPRoutesForGateway

Each follows a cache-first, fallback-second pattern: it tries the cached client with a MatchingFields index lookup first and, if the cache returns no routes, falls back to noCacheReader.

The no-cache path cannot use MatchingFields because it has no access to the in-memory indexes; these exist only in the controller's cache, not on the API server.
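For reviewers, the cache-first, fallback-second lookup can be sketched roughly as follows. This is a minimal stdlib-only illustration, not the code in this PR: Route, indexedCache, apiReader, and listRoutesForGateway are hypothetical stand-ins for the real route types, the cached client, noCacheReader, and the two helper methods.

```go
package main

import "fmt"

// Route is a simplified, hypothetical stand-in for an AIGatewayRoute or MCPRoute.
type Route struct {
	Name      string
	Namespace string
	// Gateways holds "namespace/name" keys of the parent Gateways.
	Gateways []string
}

// indexedCache mimics the cached client: it answers by-gateway lookups from an
// in-memory index, but it may lag behind the API server.
type indexedCache struct {
	byGateway map[string][]Route
}

func (c *indexedCache) listByGateway(key string) []Route { return c.byGateway[key] }

// apiReader mimics the uncached reader: it sees current objects but has no
// index, so it can only list everything.
type apiReader struct {
	routes []Route
}

func (r *apiReader) listAll() []Route { return r.routes }

// listRoutesForGateway tries the index first and falls back to a full list
// plus manual filtering only when the cache returns nothing.
func listRoutesForGateway(cache *indexedCache, direct *apiReader, key string) []Route {
	if routes := cache.listByGateway(key); len(routes) > 0 {
		return routes
	}
	var matched []Route
	for _, rt := range direct.listAll() {
		for _, gw := range rt.Gateways {
			if gw == key {
				matched = append(matched, rt)
				break
			}
		}
	}
	return matched
}

func main() {
	stale := &indexedCache{byGateway: map[string][]Route{}}
	direct := &apiReader{routes: []Route{
		{Name: "route-a", Namespace: "default", Gateways: []string{"default/test-gateway"}},
		{Name: "route-b", Namespace: "default", Gateways: []string{"default/other-gateway"}},
	}}
	for _, rt := range listRoutesForGateway(stale, direct, "default/test-gateway") {
		fmt.Println(rt.Namespace + "/" + rt.Name)
	}
}
```

The fallback costs a full list plus a linear filter, which is why it runs only when the indexed lookup comes back empty.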

Manual filtering via parentRefsMatchGateway replicates the same namespace resolution logic as the index functions. A comment marks this coupling explicitly so future changes to index logic are not missed.
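To make the namespace resolution concrete, here is a small sketch of the kind of check the manual filter performs. It assumes the usual Gateway API rule that a parentRef without a namespace refers to the route's own namespace; ParentRef and parentRefsMatch are simplified, hypothetical names, and the real parentRefsMatchGateway may differ in signature and detail.

```go
package main

import "fmt"

// ParentRef is a simplified, hypothetical stand-in for a Gateway API parent reference.
type ParentRef struct {
	Name      string
	Namespace *string // nil means "same namespace as the route" (assumed default)
}

// parentRefsMatch reports whether any parentRef points at the given Gateway,
// resolving a missing namespace to the route's namespace.
func parentRefsMatch(routeNamespace string, refs []ParentRef, gwName, gwNamespace string) bool {
	for _, ref := range refs {
		ns := routeNamespace
		if ref.Namespace != nil {
			ns = *ref.Namespace
		}
		if ref.Name == gwName && ns == gwNamespace {
			return true
		}
	}
	return false
}

func main() {
	infra := "infra"
	refs := []ParentRef{
		{Name: "test-gateway"},                    // resolves to the route's namespace
		{Name: "test-gateway", Namespace: &infra}, // explicit namespace
	}
	fmt.Println(parentRefsMatch("default", refs, "test-gateway", "default"))
	fmt.Println(parentRefsMatch("default", refs, "test-gateway", "infra"))
	fmt.Println(parentRefsMatch("default", refs, "test-gateway", "prod"))
}
```

If the index functions ever change how they resolve the namespace, a filter like this has to change with them, which is the coupling the comment in the code calls out.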

Related Issues/PRs (if applicable)

Fixes #1495
Related PR: #1789

Validation

Three tests were added to gateway_mutator_test.go. Each uses two separate fake client instances, one empty (simulating a stale cache) and one populated (simulating a direct API server read), to exercise the fallback path:

  • TestGatewayMutator_mutatePod_UsesNoCacheReader: route exists only in noCacheReader, verifies sidecar is still injected
  • TestGatewayMutator_listAIGatewayRoutesForGateway_NoCacheReaderFallback: verifies fallback filtering returns only matching routes
  • TestGatewayMutator_listMCPRoutesForGateway_NoCacheReaderFallback: same for MCP routes
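The two-fake setup used by these tests can be illustrated with hand-rolled fakes that also record how often they are called. This is only a sketch of the testing pattern: lister, fakeLister, and listWithFallback are hypothetical names, and the actual tests use fake client instances rather than these types.

```go
package main

import "fmt"

// lister is a hypothetical minimal read interface.
type lister interface {
	List() []string
}

// fakeLister is a hand-rolled fake that returns fixed items and counts calls.
type fakeLister struct {
	items []string
	calls int
}

func (f *fakeLister) List() []string {
	f.calls++
	return f.items
}

// listWithFallback is the function under test in this sketch: it consults the
// cache first and the direct reader only if the cache is empty.
func listWithFallback(cache, direct lister) []string {
	if items := cache.List(); len(items) > 0 {
		return items
	}
	return direct.List()
}

func main() {
	cache := &fakeLister{}                             // empty: simulates a stale cache
	direct := &fakeLister{items: []string{"route-a"}} // populated: simulates the API server
	got := listWithFallback(cache, direct)
	fmt.Println(got, cache.calls, direct.calls)
}
```

Because the two readers are separate objects, the stale-cache condition is set up explicitly and the test does not depend on timing.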

make precommit test passed locally.

Additional Context

The Gateway controller has a corrective rollout mechanism: if it detects pods without the sidecar while effective routes exist, it triggers a rolling update to re-invoke the webhook.
However, that mechanism also uses the cached client for route lookups, so it is subject to the same cache race. This fix addresses the root cause at the webhook level, making the corrective rollout unnecessary for this scenario.
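As a rough illustration of why the corrective rollout inherits the race, the decision can be modelled like this. The sketch is not the controller's actual code: Pod, podsMissingSidecar, and needsCorrectiveRollout are hypothetical, and the sidecar name is taken from the initContainers check mentioned in the validation below.

```go
package main

import "fmt"

// extProcContainerName is the sidecar name checked in initContainers.
const extProcContainerName = "ai-gateway-extproc"

// Pod is a simplified, hypothetical stand-in for a Kubernetes pod.
type Pod struct {
	Name           string
	InitContainers []string
}

// podsMissingSidecar returns the names of pods without the extProc sidecar.
func podsMissingSidecar(pods []Pod) []string {
	var missing []string
	for _, p := range pods {
		found := false
		for _, c := range p.InitContainers {
			if c == extProcContainerName {
				found = true
				break
			}
		}
		if !found {
			missing = append(missing, p.Name)
		}
	}
	return missing
}

// needsCorrectiveRollout fires only when routes are visible AND some pod lacks
// the sidecar. If the route count comes from a stale cache, it can be 0 and the
// rollout is skipped even though a pod is missing the sidecar.
func needsCorrectiveRollout(pods []Pod, effectiveRoutes int) bool {
	return effectiveRoutes > 0 && len(podsMissingSidecar(pods)) > 0
}

func main() {
	pods := []Pod{
		{Name: "gw-1", InitContainers: []string{extProcContainerName}},
		{Name: "gw-2"},
	}
	fmt.Println("routes visible:", needsCorrectiveRollout(pods, 1))
	fmt.Println("stale cache:   ", needsCorrectiveRollout(pods, 0))
}
```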

…ion cache race

Signed-off-by: Kaveesh Khattar <kaveeshkhattar@gmail.com>
@KaveeshKhattar KaveeshKhattar requested a review from a team as a code owner March 23, 2026 07:24
@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Mar 23, 2026
@codecov-commenter

codecov-commenter commented Mar 23, 2026

Codecov Report

❌ Patch coverage is 81.13208% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.32%. Comparing base (dfe41c3) to head (8d70ed3).

Files with missing lines Patch % Lines
internal/controller/gateway_mutator.go 81.13% 5 Missing and 5 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1981      +/-   ##
==========================================
- Coverage   84.33%   84.32%   -0.01%     
==========================================
  Files         130      130              
  Lines       18022    18067      +45     
==========================================
+ Hits        15198    15235      +37     
- Misses       1879     1883       +4     
- Partials      945      949       +4     


@johnugeorge
Contributor

Were you able to reproduce this bug? Can you provide the steps?

@KaveeshKhattar
Contributor Author

KaveeshKhattar commented Mar 25, 2026

Were you able to reproduce this bug? Can you provide the steps?

Yes, I was able to reproduce the bug in a local cluster.

The race happens when a pod is admitted during the window between a route being written to the API server and the informer cache reflecting it. The webhook sees an empty cache, finds no routes, and skips extProc injection silently.
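The window can be modelled deterministically with a toy API server and a cache that only catches up on an explicit sync. This is a simplified simulation of the ordering problem, not the controller code: apiServer, informerCache, and admit are hypothetical names.

```go
package main

import "fmt"

// apiServer is a toy source of truth for routes.
type apiServer struct {
	routes []string
}

// informerCache is a toy cache that reflects the API server only after sync.
type informerCache struct {
	routes []string
}

// sync copies the current API server state into the cache.
func (c *informerCache) sync(s *apiServer) {
	c.routes = append([]string(nil), s.routes...)
}

// admit models the webhook decision: inject extProc only if routes are visible.
func admit(visibleRoutes []string) bool {
	return len(visibleRoutes) > 0
}

func main() {
	server := &apiServer{}
	cache := &informerCache{}

	// 1. The route is written to the API server.
	server.routes = append(server.routes, "route-a")

	// 2. A pod is admitted before the cache has observed the route.
	fmt.Println("cached read injects:", admit(cache.routes))
	fmt.Println("direct read injects:", admit(server.routes))

	// 3. The cache catches up, but the pod has already been admitted.
	cache.sync(server)
	fmt.Println("cached read after sync injects:", admit(cache.routes))
}
```

Step 2 is the window described above: the cached read finds nothing while a direct read would already have found the route.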

To trigger it, I ran a loop that applies a route in the background and creates a pod immediately, forcing the webhook to fire while the cache is still stale. On the unfixed controller the race appeared within 5 iterations.

The controller logs confirmed it: controller.gateway-mutator logged "no AIGatewayRoutes or MCPRoutes found for gateway" at the same timestamp at which the route was being reconciled, which shows that the route existed on the API server but was invisible to the cache at webhook time.

After applying this fix, the same loop ran cleanly. Every webhook call logged "found routes for gateway" and every pod had the extProc sidecar in its initContainers.

The three unit tests in gateway_mutator_test.go cover this deterministically. Each uses two separate fake clients, one empty (stale cache) and one populated (API server), to exercise the fallback path without relying on timing.

Steps

  • In a local kind cluster, install Envoy Gateway, then AI Gateway from main.

Bug reproduction (main branch)

  • Create a Gateway, pod template, and the filter config secret.
  • Run a loop: apply a route in the background, create a pod immediately.
  • The webhook fired before the cache had synced the route, and "no AIGatewayRoutes or MCPRoutes found for gateway" was logged at admission time.
  • The pod started without the extProc sidecar.

2026-03-25T09:09:49Z INFO controller.gateway-mutator mutating gateway pod {"gateway_name": "test-gateway", ...}
2026-03-25T09:09:49Z INFO controller.gateway-mutator no AIGatewayRoutes or MCPRoutes found for gateway {"name": "test-gateway", "namespace": "envoy-ai-gateway-system"}

Fix verification (this branch)

  • Apply the fix (build and load the controller image from the PR branch, upgrade the Helm release).
  • Run the same loop again to verify.
  • The same script ran for 50+ iterations with zero failures. Every pod had ai-gateway-extproc in initContainers.
  • Controller logs consistently showed "found routes for gateway" at webhook time.

2026-03-25T10:01:17Z INFO controller.gateway-mutator mutating gateway pod {"gateway_name": "test-gateway", ...}
2026-03-25T10:01:17Z INFO controller.gateway-mutator found routes for gateway {"aigatewayroute_count": 1, "mcpgatewayroute_count": 0}

Setup: kind cluster, Envoy Gateway v1.5.4, AI Gateway built from each branch with make docker-build, loaded into kind with kind load docker-image.

Development

Successfully merging this pull request may close these issues.

Race Condition During Deployment: extProc sidecar not injected

3 participants