Health checker recreates deleted etcd keys due to race condition with circuit deletion in Traefik mode #10656
Description
Problem
Health checker periodically overwrites etcd service keys using a stale in-memory circuit.route_info snapshot. When routes change between the DB read and etcd write (session termination, scale-in, endpoint deletion), stale backend URLs persist in Traefik's weighted service pool.
*Reported symptom:* A request to Model A's endpoint returns Model B's response. The "Sync Route Info" button fixes it only temporarily.
*Mechanism:* Model A's weighted service (bai_service_{circuit_id}) contains a stale backend URL from a removed session. After the container at that address is freed and reassigned to Model B, Traefik load-balances some of Model A's traffic to Model B's container.
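The mechanism can be illustrated with a small simulation. This is only a sketch: the addresses, the model labels, and the plain round-robin selection are assumptions made for illustration, not Traefik's actual balancing code or real data from the report.

```python
from itertools import cycle

# Hypothetical contents of Model A's weighted service pool.
# 10.0.0.7:8000 belonged to a Model A session that has since been removed.
model_a_pool = ["http://10.0.0.5:8000", "http://10.0.0.7:8000"]

# Hypothetical view of who listens on each address right now.
current_owner = {
    "http://10.0.0.5:8000": "model-a",
    "http://10.0.0.7:8000": "model-b",  # address reassigned to Model B
}


def simulate_requests(pool, owners, n):
    """Send n requests round-robin over the pool and report who answers."""
    backends = cycle(pool)
    return [owners[next(backends)] for _ in range(n)]


responses = simulate_requests(model_a_pool, current_owner, 4)
print(responses)
```

With equal weights, every second request addressed to Model A lands on the container that now serves Model B, which matches the reported symptom.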
Traefik mode only (enable_traefik = true). Python frontends unaffected.
Root Cause
propagate_route_updates_to_workers() writes etcd service keys from an in-memory circuit object without:
- Verifying circuit still exists in DB
- Re-reading fresh route_info before writing
- Any concurrency control (no Lock on CircuitManager)
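The race described above can be reproduced in miniature. The following is a simplified sketch, not the real coordinator code: `db` and `etcd` are plain dicts standing in for the database and etcd, and `health_check_cycle` and `delete_circuit` are hypothetical stand-ins for the health checker and the circuit deletion path.

```python
import asyncio

db = {"c1": ["http://10.0.0.5:8000"]}  # hypothetical DB: circuit_id -> route URLs
etcd = {}                              # hypothetical etcd: service key -> backend URLs


async def health_check_cycle(circuit_id):
    snapshot = list(db[circuit_id])    # DB read; this copy goes stale below
    await asyncio.sleep(0.01)          # probes run, other tasks get scheduled
    # Unconditional write: no existence check, no re-read, no lock.
    etcd[f"bai_service_{circuit_id}"] = snapshot


async def delete_circuit(circuit_id):
    db.pop(circuit_id, None)
    etcd.pop(f"bai_service_{circuit_id}", None)


async def main():
    etcd["bai_service_c1"] = list(db["c1"])
    checker = asyncio.create_task(health_check_cycle("c1"))
    await asyncio.sleep(0)             # let the checker take its snapshot
    await delete_circuit("c1")         # deletion lands mid-cycle
    await checker                      # checker now writes the stale snapshot
    return dict(etcd)


final_etcd = asyncio.run(main())
print(final_etcd)
```

After the run, the circuit is gone from `db` but its service key is back in `etcd`, pointing at a backend that no longer belongs to it.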
Trigger Scenarios
- Scale-in: removed replica's service key recreated by health checker
- Session crash/kill: dead session's backend URL persists in weighted service
- Endpoint deletion: circuit's service keys recreated after unload
- Health state transition during concurrent route update: stale route set overwrites correct one
Not limited to scale-in/out: any route_info change during a health check cycle can trigger the race.
Affected Files
- appproxy/coordinator/health_checker.py:434-507 - propagate_route_updates_to_workers()
- appproxy/coordinator/types.py:80-85 - CircuitManager (no Lock)
- appproxy/coordinator/types.py:163-206 - update_traefik_circuit_routes()
- appproxy/coordinator/server.py:495-578 - on_route_update_event()
Success Criteria
- propagate_route_updates_to_workers() re-reads circuit from DB before etcd write; skips if deleted
- CircuitManager uses per-circuit asyncio.Lock to serialize update_circuit_routes() and unload_circuits()
- Scale-in does not leave stale session service keys in etcd
- Session crash does not leave stale backend URLs in weighted service pool
- pants test passes for affected packages
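One possible shape for the first two criteria is sketched below. It is a minimal illustration under stated assumptions, not the actual patch: `CircuitManagerSketch` is a hypothetical class, `db` and `etcd` are plain dicts standing in for the real stores, and the method bodies are reduced to the read, check, and write steps.

```python
import asyncio
from collections import defaultdict


class CircuitManagerSketch:
    """Hypothetical stand-in for CircuitManager with per-circuit locks."""

    def __init__(self, db, etcd):
        self.db = db        # dict standing in for the database
        self.etcd = etcd    # dict standing in for etcd
        # Locks are created lazily, one per circuit id; this sketch never
        # evicts them.
        self._locks = defaultdict(asyncio.Lock)

    async def update_circuit_routes(self, circuit_id):
        async with self._locks[circuit_id]:
            routes = self.db.get(circuit_id)  # fresh read under the lock
            if routes is None:
                return False                  # circuit deleted: skip the write
            self.etcd[f"bai_service_{circuit_id}"] = list(routes)
            return True

    async def unload_circuits(self, circuit_ids):
        for cid in circuit_ids:
            async with self._locks[cid]:
                self.db.pop(cid, None)
                self.etcd.pop(f"bai_service_{cid}", None)


async def main():
    db = {"c1": ["http://10.0.0.5:8000"]}
    etcd = {"bai_service_c1": ["http://10.0.0.5:8000"]}
    mgr = CircuitManagerSketch(db, etcd)

    async def health_check_cycle(cid):
        await asyncio.sleep(0.01)             # probes run while deletion happens
        return await mgr.update_circuit_routes(cid)

    checker = asyncio.create_task(health_check_cycle("c1"))
    await asyncio.sleep(0)
    await mgr.unload_circuits(["c1"])
    wrote = await checker
    return wrote, etcd


wrote, final_etcd = asyncio.run(main())
print(wrote, final_etcd)
```

Because the health checker re-reads the circuit while holding the same lock that deletion uses, it either writes before the unload (and the unload then removes the key) or finds the circuit missing and skips the write. Neither ordering leaves a recreated key behind.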
JIRA Issue: BA-5499