fix(router): clean pattern_router state on upsert/delete #29601

Open
Aarkin7 wants to merge 3 commits into BerriAI:litellm_internal_staging from Aarkin7:litellm_fix_pattern_router_leak

Aarkin7 commented Jun 3, 2026

Relevant issues

Linear ticket

Pre-Submission checklist

Please complete all items before asking a LiteLLM maintainer to review your PR

  • I have added meaningful tests
  • My PR passes all unit tests on make test-unit
  • My PR's scope is as isolated as possible; it only solves 1 specific problem
  • I have requested a Greptile review by commenting @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review

Delays in PR merge?

If you're seeing a delay in your PR being merged, ping the LiteLLM Team on Slack (#pr-review).

CI (LiteLLM team)

CI status guideline:

  • 50-55 passing tests: main is stable with minor issues.
  • 45-49 passing tests: acceptable but needs attention
  • <= 40 passing tests: unstable; be careful with your merges and assess the risk.
  • Branch creation CI run
    Link:

  • CI run for the last commit
    Link:

  • Merge / cherry-pick CI run
    Links:

Screenshots / Proof of Fix

The bug lives in internal router state, so the user-visible symptom is "the api_key I just rotated still works for my wildcard models." The runbook below reproduces it on litellm_internal_staging and shows it gone on this branch. It needs two real OpenAI keys, because the only honest way to prove the old key is out of rotation is to revoke it upstream and watch for 401s.

  1. Save two keys locally; OLD_KEY is the one you'll revoke later

    export OLD_KEY=sk-...
    export NEW_KEY=sk-...

  2. Drop a wildcard model into litellm/proxy/dev_config.yaml pointing at OLD_KEY

model_list:
  - model_name: openai/*
    litellm_params:
      model: openai/*
      api_key: os.environ/OLD_KEY

  3. Start the proxy

python litellm/proxy/proxy_cli.py --config litellm/proxy/dev_config.yaml --detailed_debug --reload --use_v2_migration_resolver 2>&1 | tee litellm.log

  4. Sanity check that OLD_KEY works through the wildcard route

curl -sS -X POST http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer sk-1234" \
  -H "Content-Type: application/json" \
  -d '{"model":"openai/gpt-4o-mini","messages":[{"role":"user","content":"hi"}]}'

  5. Grab the deployment's model_id so you can target it for the rotation

curl -sS http://localhost:4000/model/info -H "Authorization: Bearer sk-1234" \
  | jq -r '.data[] | select(.model_name == "openai/*") | .model_info.id'

  6. Rotate the key via the admin endpoint, pasting the id from step 5

curl -sS -X POST http://localhost:4000/model/update \
  -H "Authorization: Bearer sk-1234" \
  -H "Content-Type: application/json" \
  -d "{\"model_id\":\"<id from step 5>\",\"litellm_params\":{\"model\":\"openai/*\",\"api_key\":\"$NEW_KEY\"}}"

  7. Revoke OLD_KEY on the OpenAI dashboard so it can no longer authenticate

  8. Fire 20 requests through the wildcard and tally the status codes

for i in $(seq 1 20); do
  curl -sS -o /dev/null -w "%{http_code}\n" -X POST http://localhost:4000/v1/chat/completions \
    -H "Authorization: Bearer sk-1234" \
    -H "Content-Type: application/json" \
    -d '{"model":"openai/gpt-4o-mini","messages":[{"role":"user","content":"hi"}]}'
done | sort | uniq -c

On litellm_internal_staging, roughly half of those 20 responses come back as 401 because the rotated-out OLD_KEY is still living inside pattern_router.patterns and the load balancer keeps round-robining onto it. On this branch every response should be a 200, and grepping litellm.log for OLD_KEY after step 6 should turn up nothing.
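The "roughly half" figure follows from simple arithmetic if two entries sit under the same pattern and requests alternate between them. The simulation below is a toy model, not LiteLLM code: `simulate_rotation`, the key labels, and the strict round-robin are all assumptions made for illustration, and the real load balancer may distribute requests less evenly.

```python
from collections import Counter
from itertools import cycle


def simulate_rotation(entries, revoked_keys, n_requests=20):
    """Tally HTTP-like status codes when requests rotate across `entries`.

    Hypothetical model: each entry is just an api_key label, and a request
    returns 401 if its key has been revoked upstream, otherwise 200.
    """
    rotation = cycle(entries)  # idealized round-robin, assumed for clarity
    codes = Counter()
    for _ in range(n_requests):
        key = next(rotation)
        codes[401 if key in revoked_keys else 200] += 1
    return dict(codes)


# Without cleanup: the stale OLD_KEY entry still sits next to NEW_KEY.
leaky = simulate_rotation(["OLD_KEY", "NEW_KEY"], revoked_keys={"OLD_KEY"})
# With cleanup: the old entry was removed before the new one was added.
fixed = simulate_rotation(["NEW_KEY"], revoked_keys={"OLD_KEY"})
print(leaky, fixed)
```

Under these assumptions the leaky case splits 10/10 between 200 and 401, while the fixed case returns 200 for all 20 requests.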

Type

🐛 Bug Fix

Changes

PatternMatchRouter.add_pattern was append-only, and neither Router.upsert_deployment nor delete_deployment ever removed the existing entry. So every time an admin edited or deleted a wildcard model, the old deployment dict (including its old api_key) just sat there in pattern_router.patterns, and the load balancer kept round-robining onto it until proxy restart. The same leak hit provider_default_deployment_ids and the per-team team_pattern_routers.

Added PatternMatchRouter.remove_deployment(model_id) and a private Router._remove_deployment_from_wildcard_state(model_id) that cleans up all three structures (pattern_router.patterns, provider_default_deployment_ids, and team_pattern_routers). Both upsert_deployment and delete_deployment now call it right alongside the existing index-map cleanup, so the change stays narrow.
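To make the leak and the fix concrete, here is a minimal sketch of an append-only pattern store with a `remove_deployment` method. `ToyPatternRouter` is a hypothetical stand-in, not the LiteLLM class; the wildcard-to-regex conversion and the deployment dict shape (`model_info.id`, `litellm_params.api_key`) are assumptions chosen to mirror the description above.

```python
import re
from typing import Dict, List, Optional


class ToyPatternRouter:
    """Illustrative stand-in for a pattern router; not the LiteLLM class."""

    def __init__(self) -> None:
        # regex string -> deployments registered under that pattern
        self.patterns: Dict[str, List[dict]] = {}

    def add_pattern(self, pattern: str, deployment: dict) -> None:
        # Append-only: calling this twice for the same id leaves two entries.
        regex = "^" + re.escape(pattern).replace(r"\*", "(.*)") + "$"
        self.patterns.setdefault(regex, []).append(deployment)

    def remove_deployment(self, model_id: Optional[str]) -> int:
        """Drop every entry whose model_info.id equals model_id; return the count."""
        if not model_id:  # falsy guard: None or "" removes nothing
            return 0
        removed = 0
        for regex in list(self.patterns):  # copy keys so deletion is safe
            kept = [
                d for d in self.patterns[regex]
                if (d.get("model_info") or {}).get("id") != model_id
            ]
            removed += len(self.patterns[regex]) - len(kept)
            if kept:
                self.patterns[regex] = kept
            else:
                del self.patterns[regex]  # do not leave empty regex keys behind
        return removed


router = ToyPatternRouter()
old = {"model_info": {"id": "m1"}, "litellm_params": {"api_key": "OLD"}}
new = {"model_info": {"id": "m1"}, "litellm_params": {"api_key": "NEW"}}

router.add_pattern("openai/*", old)
router.add_pattern("openai/*", new)  # edit without cleanup: both entries remain
leaked_count = sum(len(v) for v in router.patterns.values())

router.remove_deployment("m1")       # edit with cleanup: purge, then re-add
router.add_pattern("openai/*", new)
clean_count = sum(len(v) for v in router.patterns.values())
print(leaked_count, clean_count)
```

The sketch keys removal on the deployment id rather than on the pattern string, so an id registered under several regexes is purged from all of them in one call.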

Six unit tests in tests/local_testing/test_router_pattern_matching.py pin the new method's contract, and six integration tests in tests/test_litellm/test_router.py cover the actual upsert/delete paths, including team-scoped wildcards and api_key rotation as the regression test.

PatternMatchRouter.add_pattern was append-only, and neither Router.upsert_deployment nor Router.delete_deployment removed the existing entry. Rotated-out api_keys stayed in the routing rotation for wildcard deployments (model_name with `*`) until proxy restart, silently defeating key rotation as an admin operation. The same leak applied to provider_default_deployment_ids and per-team pattern routers, and the patterns list grew unboundedly on every edit.

codecov Bot commented Jun 3, 2026

Codecov Report

❌ Patch coverage is 90.00000% with 3 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| litellm/router_utils/pattern_match_deployments.py | 80.00% | 3 Missing ⚠️ |

greptile-apps Bot commented Jun 3, 2026

Greptile Summary

This PR fixes a stale-state bug in wildcard routing where PatternMatchRouter was append-only: editing or deleting a wildcard deployment via upsert_deployment / delete_deployment left the old entry (and its api_key) sitting in pattern_router.patterns, causing the load balancer to keep round-robining onto revoked credentials until proxy restart.

  • Adds PatternMatchRouter.remove_deployment(model_id), which purges all pattern entries matching the given id and drops now-empty regex keys; wires it into Router._remove_deployment_from_wildcard_state, which also cleans team_pattern_routers (removing empty team routers entirely) and provider_default_deployment_ids.
  • Hooks _remove_deployment_from_wildcard_state into both upsert_deployment and delete_deployment, exactly alongside the existing index-map cleanup.
  • Adds 12 tests (6 unit, 6 integration) covering key rotation, idempotent upserts, multi-regex span, team-scoped wildcards, and empty-router cleanup, all in-memory with no network calls.
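The router-level cleanup described in the summary can be sketched as one helper that touches all three pieces of state. This is an illustration under assumed data shapes (a dict of team id to router, and a plain list of default ids); `MiniPatternRouter` and `remove_from_wildcard_state` are hypothetical names, not the functions added by this PR.

```python
from typing import Dict, List, Optional


class MiniPatternRouter:
    """Hypothetical minimal pattern store: regex -> list of deployment dicts."""

    def __init__(self) -> None:
        self.patterns: Dict[str, List[dict]] = {}

    def remove_deployment(self, model_id: Optional[str]) -> int:
        removed = 0
        for regex in list(self.patterns):
            kept = [d for d in self.patterns[regex]
                    if (d.get("model_info") or {}).get("id") != model_id]
            removed += len(self.patterns[regex]) - len(kept)
            if kept:
                self.patterns[regex] = kept
            else:
                del self.patterns[regex]
        return removed


def remove_from_wildcard_state(
    pattern_router: MiniPatternRouter,
    team_pattern_routers: Dict[str, MiniPatternRouter],
    provider_default_deployment_ids: List[str],
    model_id: Optional[str],
) -> None:
    """Purge model_id from all three pieces of wildcard state (assumed shapes)."""
    if not model_id:
        return  # nothing to clean up for a falsy id
    pattern_router.remove_deployment(model_id)
    for team_id in list(team_pattern_routers):  # copy keys so deletion is safe
        team_router = team_pattern_routers[team_id]
        team_router.remove_deployment(model_id)
        if not team_router.patterns:
            del team_pattern_routers[team_id]  # drop routers left empty
    # Mutate in place so other holders of the list see the change.
    provider_default_deployment_ids[:] = [
        i for i in provider_default_deployment_ids if i != model_id
    ]


global_router = MiniPatternRouter()
global_router.patterns["^openai/(.*)$"] = [{"model_info": {"id": "m1"}}]
team_router = MiniPatternRouter()
team_router.patterns["^openai/(.*)$"] = [{"model_info": {"id": "m1"}}]
team_routers = {"team-a": team_router}
default_ids = ["m1", "m2"]

remove_from_wildcard_state(global_router, team_routers, default_ids, "m1")
print(global_router.patterns, team_routers, default_ids)
```

Because every step is a filter or a conditional delete, calling the helper a second time with the same id changes nothing, which is what makes it safe to call from both the upsert and delete paths.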

Confidence Score: 4/5

Safe to merge; the fix is narrow, well-tested, and targets a clear state-management gap with no risk of breaking existing non-wildcard routing paths.

The core change is correct and the test suite is thorough. Two small gaps exist: the remove_deployment type annotation accepts None at runtime but declares str, which will fail static type checking; and the wildcard cleanup is skipped when deployment_id is present on the router but absent from the fast-mapping index, which could reproduce the stale-credential accumulation in an inconsistent-state scenario. Neither affects the happy path.

The upsert_deployment block in litellm/router.py (lines 8679-8691) is worth a second look for the nested-guard edge case. The type annotation in litellm/router_utils/pattern_match_deployments.py line 78 is a straightforward one-word fix.

Important Files Changed

| Filename | Overview |
|---|---|
| litellm/router_utils/pattern_match_deployments.py | Adds remove_deployment(model_id) to PatternMatchRouter; logic is correct but the type annotation accepts str while the implementation (and test) also handles None |
| litellm/router.py | Adds _remove_deployment_from_wildcard_state and wires it correctly into both upsert_deployment and delete_deployment; cleanup of team_pattern_routers and provider_default_deployment_ids is complete |
| tests/local_testing/test_router_pattern_matching.py | Six new unit tests covering remove_deployment (single id, empty regex cleanup, multi-regex span, noop for unknown, falsy guard, missing model_info tolerance); all in-memory, no network calls |
| tests/test_litellm/test_router.py | Six new integration tests for upsert/delete on wildcard deployments including api_key rotation regression, idempotency, team-scoped wildcards, and empty team-router cleanup; all use internal state checks with no real API calls |

Comments Outside Diff (1)

  1. litellm/router.py, lines 8679-8691

    P2 Stale wildcard state when deployment_id is not in the fast-mapping index

    _remove_deployment_from_wildcard_state is only called inside if removal_idx is not None:, which is itself nested inside if deployment_id in deployment_fast_mapping:. If a wildcard deployment exists on the router (_deployment_on_router is not None) but is somehow absent from model_id_to_deployment_index_map (e.g. after index corruption or a partially-failed prior upsert), the old pattern_router entry is never cleaned up before add_deployment appends the new one — reproducing the exact stale-credential accumulation this PR aims to fix. Moving _remove_deployment_from_wildcard_state one level up (alongside the outer _deployment_on_router is not None check) would close this gap.
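The nesting concern can be shown with two toy control-flow variants. Both functions below are hypothetical simplifications, not the real `upsert_deployment`: `on_router`, `index_map`, and the `cleanup` callback stand in for the conditions and helper named in the comment above.

```python
from typing import Callable, Dict, List


def upsert_nested(on_router: bool, index_map: Dict[str, int], model_id: str,
                  cleanup: Callable[[str], None]) -> None:
    """Cleanup placed inside the index guards, as the comment describes."""
    if on_router:
        if model_id in index_map:
            removal_idx = index_map.pop(model_id)
            if removal_idx is not None:
                cleanup(model_id)  # skipped whenever the index lacks model_id


def upsert_hoisted(on_router: bool, index_map: Dict[str, int], model_id: str,
                   cleanup: Callable[[str], None]) -> None:
    """Cleanup moved one level up, next to the on-router check."""
    if on_router:
        cleanup(model_id)  # idempotent, so running it unconditionally is safe
        if model_id in index_map:
            index_map.pop(model_id)


# Inconsistent state: deployment is on the router but missing from the index.
nested_calls: List[str] = []
hoisted_calls: List[str] = []
upsert_nested(True, {}, "m1", nested_calls.append)
upsert_hoisted(True, {}, "m1", hoisted_calls.append)
print(nested_calls, hoisted_calls)
```

In the inconsistent-state case the nested variant never invokes the cleanup, while the hoisted variant does; when the index does contain the id, both variants behave the same.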

Reviews (1): Last reviewed commit: "fix(router): clean pattern_router state ..."

self.patterns[regex] = []
self.patterns[regex].append(llm_deployment)

def remove_deployment(self, model_id: str) -> int:

P2 The type annotation says model_id: str but the method also handles None (the falsy guard, and the test exercises it directly with None). This will cause a mypy error at the call site in test_remove_deployment_with_falsy_id_is_noop_even_when_entries_have_no_id. Widening to Optional[str] matches the actual contract.

Suggested change
- def remove_deployment(self, model_id: str) -> int:
+ def remove_deployment(self, model_id: Optional[str]) -> int:

Aarkin7 added 2 commits June 3, 2026 23:03
…state

router_code_coverage.py greps test files for AST Call nodes and flagged
the helper as untested because the existing coverage only exercised it
transitively through upsert/delete. Adds two direct tests that pin the
helper's contract (cleans across global pattern router, per-team
routers with empty-router pop, and provider_default_deployment_ids;
noop on falsy model_id).

Widen PatternMatchRouter.remove_deployment annotation to Optional[str];
the implementation already handles None via the falsy guard and the
unit test exercises it directly.

Move _remove_deployment_from_wildcard_state up one level in
upsert_deployment so it runs whenever the prior deployment is on the
router, not only when the model_id is present in the fast-mapping
index. The scenario is currently unreachable (get_deployment shares
the same index), but the cleanup is idempotent so this is defensive
against any future divergence between those code paths.
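The first follow-up commit above notes that the coverage script looks for AST Call nodes, which explains why transitive coverage was not enough. The sketch below shows one way such a check could work with Python's standard `ast` module; it is not the actual router_code_coverage.py, and `called_names` is a hypothetical helper.

```python
import ast
from typing import Set


def called_names(source: str) -> Set[str]:
    """Return names that appear as the callee of an ast.Call node in `source`."""
    names: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Attribute):  # obj.method(...)
                names.add(func.attr)
            elif isinstance(func, ast.Name):     # function(...)
                names.add(func.id)
    return names


helper = "_remove_deployment_from_wildcard_state"

# A test that only reaches the helper through upsert_deployment.
transitive_test = "router.upsert_deployment(deployment)"
# A test that calls the helper by name.
direct_test = "router._remove_deployment_from_wildcard_state(model_id='m1')"

seen_transitively = helper in called_names(transitive_test)
seen_directly = helper in called_names(direct_test)
print(seen_transitively, seen_directly)
```

A static check of this kind only sees call sites written in the test source, so a helper exercised purely through another method looks untested until a test names it directly.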