fix(controller): use noCacheReader in webhook to avoid extProc injection cache race#1981
Conversation
…ion cache race Signed-off-by: Kaveesh Khattar <kaveeshkhattar@gmail.com>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1981 +/- ##
==========================================
- Coverage 84.33% 84.32% -0.01%
==========================================
Files 130 130
Lines 18022 18067 +45
==========================================
+ Hits 15198 15235 +37
- Misses 1879 1883 +4
- Partials 945 949 +4 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Were you able to reproduce this bug? Can you provide the steps? |
Yes, was able to reproduce the bug in a local cluster. The race happens when a pod is admitted during the window between a route being written to the API server and the informer cache reflecting it. The webhook sees an empty cache, finds no routes, and skips extProc injection silently. To trigger it, I ran a loop that applies a route in the background and creates a pod immediately, forcing the webhook to fire while the cache is still stale. On the unfixed controller the race appeared within 5 iterations. The controller logs confirmed it: After applying this fix, the same loop ran cleanly. Every webhook call logged The three unit tests in Steps
Bug reproduction (main branch)
Fix verification (this branch)
Setup: kind cluster, Envoy Gateway v1.5.4, AI Gateway built from each branch with |
…-race-extproc-injection
Description
Adds a
noCacheReader client.Readerfield togatewayMutatorwired frommgr.GetAPIReader()at construction time.Route lookups are extracted into two helper methods -
listAIGatewayRoutesForGatewayandlistMCPRoutesForGateway,Extract route lookups into two helper methods:
listAIGatewayRoutesForGatewaylistMCPRoutesForGatewayEach following a cache-first, fallback-second pattern, trying the cached client with MatchingFields index lookup first; if the cache returns empty, fall back to noCacheReader.
The no-cache path cannot use
MatchingFieldssince it has no access to in-memory indexes these exist only in the controller's cache, not on the API server.Manual filtering via
parentRefsMatchGatewayreplicates the same namespace resolution logic as the index functions. A comment marks this coupling explicitly so future changes to index logic are not missed.Related Issues/PRs (if applicable)
Fixes #1495
Related PR: #1789
Validation
3 tests added to
gateway_mutator_test.go, each using two separate fake client instances.One empty (simulating stale cache) and one populated (simulating direct API server) to exercise the fallback path:
TestGatewayMutator_mutatePod_UsesNoCacheReader: route exists only in noCacheReader, verifies sidecar is still injectedTestGatewayMutator_listAIGatewayRoutesForGateway_NoCacheReaderFallback: verifies fallback filtering returns only matching routesTestGatewayMutator_listMCPRoutesForGateway_NoCacheReaderFallback: same for MCP routesmake precommit testpassed locally.Additional Context
The Gateway controller has a corrective rollout mechanism where if it detects pods without the sidecar while effective routes exist, it triggers a rolling update to re-invoke the webhook.
However, this mechanism also uses the cached client for route lookups and is subject to the same cache race. This fix addresses the root cause directly at the webhook level, making the corrective rollout unnecessary for this scenario.