Health checker recreates deleted etcd keys due to race condition with circuit deletion in Traefik mode #10656

@seedspirit

Problem

Health checker periodically overwrites etcd service keys using a stale in-memory circuit.route_info snapshot. When routes change between the DB read and etcd write (session termination, scale-in, endpoint deletion), stale backend URLs persist in Traefik's weighted service pool.

*Reported symptom:* Request to Model A's endpoint returns Model B's response. "Sync Route Info" button fixes it temporarily.

*Mechanism:* Model A's weighted service (bai_service_{circuit_id}) contains a stale backend URL from a removed session. After the container at that address is freed and reassigned to Model B, Traefik load-balances some of Model A's traffic to Model B's container.

Traefik mode only (enable_traefik = true). Python frontends unaffected.

Root Cause

propagate_route_updates_to_workers() writes etcd service keys from an in-memory circuit object without:

  1. Verifying circuit still exists in DB
  2. Re-reading fresh route_info before writing
  3. Any concurrency control (no Lock on CircuitManager)
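The race can be reproduced in miniature. The sketch below is only an illustration under stated assumptions: `db` and `etcd` are plain dicts standing in for the real stores, and `health_check_cycle()` and `delete_circuit()` are hypothetical names, not the actual appproxy functions. It shows a snapshot taken before a deletion being written back after it.

```python
import asyncio

# Hypothetical in-memory stand-ins for the DB and etcd (not the real stores).
db = {"c1": ["http://10.0.0.1:8000", "http://10.0.0.2:8000"]}
etcd = {}


async def health_check_cycle(circuit_id: str) -> None:
    # Take the route_info snapshot at the start of the cycle (the "DB read").
    snapshot = list(db[circuit_id])
    # Probing backends takes time; other coroutines run in the meantime.
    await asyncio.sleep(0.01)
    # Write from the stale snapshot without re-checking the DB.
    etcd[f"bai_service_{circuit_id}"] = snapshot


async def delete_circuit(circuit_id: str) -> None:
    # Endpoint deletion removes the circuit and its service key.
    db.pop(circuit_id, None)
    etcd.pop(f"bai_service_{circuit_id}", None)


async def main() -> dict:
    checker = asyncio.create_task(health_check_cycle("c1"))
    await asyncio.sleep(0)  # let the checker take its snapshot first
    await delete_circuit("c1")
    await checker  # the checker now writes the deleted circuit's routes
    return dict(etcd)


result = asyncio.run(main())
print(result)
```

After the run, the circuit is gone from `db` but its service key is back in `etcd`, which mirrors the recreated-key symptom described above.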

Trigger Scenarios

  • Scale-in: removed replica's service key recreated by health checker
  • Session crash/kill: dead session's backend URL persists in weighted service
  • Endpoint deletion: circuit's service keys recreated after unload
  • Health state transition during concurrent route update: stale route set overwrites correct one

Not limited to scale-in/out: any route_info change during a health check cycle can trigger the race.

Affected Files

  • appproxy/coordinator/health_checker.py:434-507: propagate_route_updates_to_workers()
  • appproxy/coordinator/types.py:80-85: CircuitManager (no Lock)
  • appproxy/coordinator/types.py:163-206: update_traefik_circuit_routes()
  • appproxy/coordinator/server.py:495-578: on_route_update_event()

Success Criteria

  • propagate_route_updates_to_workers() re-reads circuit from DB before etcd write; skips if deleted
  • CircuitManager uses per-circuit asyncio.Lock to serialize update_circuit_routes() and unload_circuits()
  • Scale-in does not leave stale session service keys in etcd
  • Session crash does not leave stale backend URLs in weighted service pool
  • pants test passes for affected packages
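One possible shape for the first two criteria is sketched below. It is a simplification under assumptions, not the actual implementation: this `CircuitManager` is a toy class whose `db` and `etcd` attributes are plain dicts, and only the method names `update_circuit_routes()` and `unload_circuits()` come from the issue text. In the real code the re-read would be an async DB query.

```python
import asyncio
from collections import defaultdict


class CircuitManager:
    """Toy model; the real class lives in appproxy/coordinator/types.py."""

    def __init__(self, db: dict, etcd: dict) -> None:
        self.db = db      # hypothetical stand-in for the circuit table
        self.etcd = etcd  # hypothetical stand-in for the etcd keyspace
        # One lock per circuit, created lazily on first use.
        self._locks = defaultdict(asyncio.Lock)

    async def update_circuit_routes(self, circuit_id: str) -> bool:
        async with self._locks[circuit_id]:
            # Re-read fresh routes under the lock instead of using a snapshot.
            routes = self.db.get(circuit_id)
            if routes is None:
                return False  # circuit was deleted; skip the etcd write
            self.etcd[f"bai_service_{circuit_id}"] = list(routes)
            return True

    async def unload_circuits(self, circuit_ids: list) -> None:
        for circuit_id in circuit_ids:
            # Same lock as the writer, so delete and write cannot interleave.
            async with self._locks[circuit_id]:
                self.db.pop(circuit_id, None)
                self.etcd.pop(f"bai_service_{circuit_id}", None)


async def main():
    db = {"c1": ["http://10.0.0.1:8000"]}
    etcd = {}
    manager = CircuitManager(db, etcd)
    first = await manager.update_circuit_routes("c1")
    await manager.unload_circuits(["c1"])
    # A late health-check write arrives after the unload.
    second = await manager.update_circuit_routes("c1")
    return first, second, dict(etcd)


first, second, etcd_state = asyncio.run(main())
print(first, second, etcd_state)
```

The late write is skipped because the circuit no longer exists when it is re-read under the lock. A real implementation would also need to decide when to discard lock objects for unloaded circuits, which this sketch does not address.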

JIRA Issue: BA-5499
