fix(router): clean pattern_router state on upsert/delete#29601
Conversation
PatternMatchRouter.add_pattern was append-only, and neither Router.upsert_deployment nor Router.delete_deployment removed the existing entry. Rotated-out api_keys stayed in the routing rotation for wildcard deployments (model_name with `*`) until proxy restart, silently defeating key rotation as an admin operation. The same leak applied to provider_default_deployment_ids and per-team pattern routers, and the patterns list grew unboundedly on every edit
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
Greptile SummaryThis PR fixes a stale-state bug in wildcard routing where
Confidence Score: 4/5Safe to merge; the fix is narrow, well-tested, and targets a clear state-management gap with no risk of breaking existing non-wildcard routing paths. The core change is correct and the test suite is thorough. Two small gaps exist: the The
|
| Filename | Overview |
|---|---|
| litellm/router_utils/pattern_match_deployments.py | Adds remove_deployment(model_id) to PatternMatchRouter; logic is correct but the type annotation accepts str while the implementation (and test) also handles None |
| litellm/router.py | Adds _remove_deployment_from_wildcard_state and wires it correctly into both upsert_deployment and delete_deployment; cleanup of team_pattern_routers and provider_default_deployment_ids is complete |
| tests/local_testing/test_router_pattern_matching.py | Six new unit tests covering remove_deployment (single id, empty regex cleanup, multi-regex span, noop for unknown, falsy guard, missing model_info tolerance); all in-memory, no network calls |
| tests/test_litellm/test_router.py | Six new integration tests for upsert/delete on wildcard deployments including api_key rotation regression, idempotency, team-scoped wildcards, and empty team-router cleanup; all use internal state checks with no real API calls |
Comments Outside Diff (1)
-
litellm/router.py, line 8679-8691 (link)Stale wildcard state when
deployment_idis not in the fast-mapping index_remove_deployment_from_wildcard_stateis only called insideif removal_idx is not None:, which is itself nested insideif deployment_id in deployment_fast_mapping:. If a wildcard deployment exists on the router (_deployment_on_router is not None) but is somehow absent frommodel_id_to_deployment_index_map(e.g. after index corruption or a partially-failed prior upsert), the oldpattern_routerentry is never cleaned up beforeadd_deploymentappends the new one — reproducing the exact stale-credential accumulation this PR aims to fix. Moving_remove_deployment_from_wildcard_stateone level up (alongside the outer_deployment_on_router is not Nonecheck) would close this gap.
Reviews (1): Last reviewed commit: "fix(router): clean pattern_router state ..." | Re-trigger Greptile
| self.patterns[regex] = [] | ||
| self.patterns[regex].append(llm_deployment) | ||
|
|
||
| def remove_deployment(self, model_id: str) -> int: |
There was a problem hiding this comment.
The type annotation says
model_id: str but the method also handles None (the falsy guard, and the test exercises it directly with None). This will cause a mypy error at the call site in test_remove_deployment_with_falsy_id_is_noop_even_when_entries_have_no_id. Widening to Optional[str] matches the actual contract.
| def remove_deployment(self, model_id: str) -> int: | |
| def remove_deployment(self, model_id: Optional[str]) -> int: |
…state router_code_coverage.py greps test files for AST Call nodes and flagged the helper as untested because the existing coverage only exercised it transitively through upsert/delete. Adds two direct tests that pin the helper's contract (cleans across global pattern router, per-team routers with empty-router pop, and provider_default_deployment_ids; noop on falsy model_id)
Widen PatternMatchRouter.remove_deployment annotation to Optional[str]; the implementation already handles None via the falsy guard and the unit test exercises it directly. Move _remove_deployment_from_wildcard_state up one level in upsert_deployment so it runs whenever the prior deployment is on the router, not only when the model_id is present in the fast-mapping index. The scenario is currently unreachable (get_deployment shares the same index), but the cleanup is idempotent so this is defensive against any future divergence between those code paths.
Relevant issues
Linear ticket
Pre-Submission checklist
Please complete all items before asking a LiteLLM maintainer to review your PR
make test-unit@greptileaiand received a Confidence Score of at least 4/5 before requesting a maintainer reviewDelays in PR merge?
If you're seeing a delay in your PR being merged, ping the LiteLLM Team on Slack (#pr-review).
CI (LiteLLM team)
Branch creation CI run
Link:
CI run for the last commit
Link:
Merge / cherry-pick CI run
Links:
Screenshots / Proof of Fix
The bug lives in internal router state, so the user-visible symptom is "the api_key I just rotated still works for my wildcard models." The runbook below reproduces it on
litellm_internal_stagingand shows it gone on this branch. It needs two real OpenAI keys, because the only honest way to prove the old key is out of rotation is to revoke it upstream and watch for 401s.Save two keys locally; OLD_KEY is the one you'll revoke later
export OLD_KEY=sk-...
export NEW_KEY=sk-...
Drop a wildcard model into
litellm/proxy/dev_config.yamlpointing at OLD_KEYpython litellm/proxy/proxy_cli.py --config litellm/proxy/dev_config.yaml --detailed_debug --reload --use_v2_migration_resolver 2>&1 | tee litellm.log
curl -sS -X POST http://localhost:4000/v1/chat/completions
-H "Authorization: Bearer sk-1234"
-H "Content-Type: application/json"
-d '{"model":"openai/gpt-4o-mini","messages":[{"role":"user","content":"hi"}]}'
curl -sS http://localhost:4000/model/info -H "Authorization: Bearer sk-1234"
| jq -r '.data[] | select(.model_name == "openai/*") | .model_info.id'
curl -sS -X POST http://localhost:4000/model/update
-H "Authorization: Bearer sk-1234"
-H "Content-Type: application/json"
-d "{"model_id":"<id from step 5>","litellm_params":{"model":"openai/*","api_key":"$NEW_KEY"}}"
Revoke OLD_KEY on the OpenAI dashboard so it can no longer authenticate
Fire 20 requests through the wildcard and tally the status codes
for i in $(seq 1 20); do
curl -sS -o /dev/null -w "%{http_code}\n" -X POST http://localhost:4000/v1/chat/completions
-H "Authorization: Bearer sk-1234"
-H "Content-Type: application/json"
-d '{"model":"openai/gpt-4o-mini","messages":[{"role":"user","content":"hi"}]}'
done | sort | uniq -c
On litellm_internal_staging, roughly half of those 20 responses come back as 401 because the rotated-out OLD_KEY is still living inside pattern_router.patterns and the load balancer keeps round-robining onto it. On this branch every response should be a 200, and grepping litellm.log for OLD_KEY after step 6 should turn up nothing
Type
🐛 Bug Fix
Changes
PatternMatchRouter.add_patternwas append-only, and neitherRouter.upsert_deploymentnordelete_deploymentever removed the existing entry. So every time an admin edited or deleted a wildcard model, the old deployment dict (including its oldapi_key) just sat there inpattern_router.patterns, and the load balancer kept round-robining onto it until proxy restart. The same leak hitprovider_default_deployment_idsand the per-teamteam_pattern_routersAdded
PatternMatchRouter.remove_deployment(model_id)and a privateRouter._remove_deployment_from_wildcard_state(model_id)that cleans up across all three. Wired intoupsert_deploymentanddelete_deploymentright alongside the existing index-map cleanup so the change stays narrowSix unit tests in
tests/local_testing/test_router_pattern_matching.pypin the new method's contract, and six integration tests intests/test_litellm/test_router.pycover the actual upsert/delete paths, including team-scoped wildcards and api_key rotation as the regression test