
Fix: detect prolonged broker unavailability and crash for restart #1082

Merged
peter-quix merged 6 commits into main from fix/broker-unavailable-zombie-state on Feb 13, 2026

peter-quix (Collaborator) commented Feb 12, 2026

Summary

Closes #1081

When all Kafka brokers become unavailable (e.g., during a cluster restart), the SDK could enter a zombie state — logging _ALL_BROKERS_DOWN indefinitely but never reconnecting or raising an error. In one observed case, the application was stuck for 20 days.

This PR adds a configurable broker_availability_timeout (default 120s) that detects prolonged broker unavailability and raises KafkaBrokerUnavailableError, allowing the container orchestrator to restart the application with fresh connections.
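
The detection described above can be sketched as a small timer. This is a minimal illustration, not the SDK's code: the class name `BrokerAvailabilityTimer`, the `on_*` method names, and the injectable `clock` are hypothetical, and the exception here is a local stand-in for the SDK's `KafkaBrokerUnavailableError`. Keeping the first timestamp across repeated errors is an assumption.

```python
import time
from typing import Callable, Optional


class KafkaBrokerUnavailableError(Exception):
    """Local stand-in for the SDK exception of the same name."""


class BrokerAvailabilityTimer:
    """Hypothetical helper; the PR keeps this state on Producer and BaseConsumer."""

    def __init__(
        self,
        timeout: float = 120.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        if timeout < 0:
            raise ValueError("broker_availability_timeout must be >= 0")
        self._timeout = timeout
        self._clock = clock
        self._broker_unavailable_since: Optional[float] = None

    def on_all_brokers_down(self) -> None:
        # Assumption: keep the first timestamp so repeated errors
        # do not restart the timer.
        if self._broker_unavailable_since is None:
            self._broker_unavailable_since = self._clock()

    def on_success(self) -> None:
        # Any successful delivery or consumption clears the outage marker.
        self._broker_unavailable_since = None

    def raise_if_broker_unavailable(self) -> None:
        # A timeout of 0 disables the check entirely.
        if self._timeout == 0 or self._broker_unavailable_since is None:
            return
        elapsed = self._clock() - self._broker_unavailable_since
        if elapsed > self._timeout:
            raise KafkaBrokerUnavailableError(
                f"All brokers unavailable for {elapsed:.0f}s "
                f"(broker_availability_timeout={self._timeout}); "
                "set it to 0 to disable this check."
            )
```

Raising instead of retrying in-process is the point of the design: the exception ends the process so the orchestrator can restart it with fresh connections.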

Key changes

  • Track _ALL_BROKERS_DOWN errors on both Producer and BaseConsumer via _broker_unavailable_since timestamp
  • Active metadata probe (list_topics(timeout=5.0)) before raising to avoid false positives from Azure's connections.max.idle.ms=180000 idle timeout
  • Check in all app phases: main processing loop, sources loop, and state recovery loop
  • Reset timer on successful message delivery, message processing, and changelog consumption during recovery
  • Custom error callbacks are composed (wrapped), so broker tracking always runs regardless of user-provided callbacks
  • New parameter: Application(broker_availability_timeout=120.0) — set to 0 to disable
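
Two of the items above, callback composition and the probe-before-raise step, can be illustrated in isolation. This is a hedged sketch: `compose_error_callback` and `confirmed_unavailable` are hypothetical names, error codes are modelled as plain strings rather than the Kafka client's error objects, and `probe` is any callable that raises on failure (the PR uses `list_topics(timeout=5.0)` for this).

```python
from typing import Callable, Optional

ALL_BROKERS_DOWN = "_ALL_BROKERS_DOWN"  # string stand-in for the client error code


def compose_error_callback(
    track: Callable[[str], None],
    user_cb: Optional[Callable[[str], None]] = None,
) -> Callable[[str], None]:
    """Return a callback that always runs `track`, then the optional user callback."""

    def composed(error_code: str) -> None:
        track(error_code)  # broker tracking can never be silently disabled
        if user_cb is not None:
            user_cb(error_code)

    return composed


def confirmed_unavailable(
    elapsed: float,
    timeout: float,
    probe: Callable[[], object],
) -> bool:
    """True only if the timeout has elapsed AND an active probe also fails."""
    if timeout == 0 or elapsed <= timeout:
        return False
    try:
        probe()  # e.g. a metadata request with a short timeout
    except Exception:
        return True
    # The probe succeeded: the brokers are reachable, so the stale
    # error was a false positive (for example an idle connection reaped).
    return False
```

Running the probe only after the timeout has elapsed keeps the common path cheap: a healthy application never pays for the extra metadata request.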

Breaking change

The default 120s timeout is a behavior change on upgrade. Existing applications that survived broker outages longer than 2 minutes by waiting indefinitely will now crash with KafkaBrokerUnavailableError. Users can set broker_availability_timeout=0 to restore the previous behavior.

When all Kafka brokers are unreachable for longer than
broker_availability_timeout (default 300s in this commit; lowered to 120s in a later commit), the Application raises
KafkaBrokerUnavailableError so the orchestrator can restart it with
fresh connections. This prevents the zombie state where an app silently
logs "all brokers down" indefinitely without recovering.

- Track _ALL_BROKERS_DOWN errors on Producer via _error_cb
- Reset timer on successful message consumption and delivery callbacks
- Active metadata probe before raising to avoid false positives on idle apps
- Compose tracking with custom error callbacks (never silently disabled)
- Add check to both _run_dataframe and _run_sources loops
- Validate broker_availability_timeout is non-negative
- Actionable error message with parameter name and disable instructions
- Add broker unavailability detection to BaseConsumer (mirrors Producer)
- Wire consumer checks into both _run_dataframe and _run_sources loops
- Reset consumer timer on successful message consumption
- Remove topic arg from list_topics() probe to avoid ACL/auto-creation issues
- Make _broker_available() private on Producer, InternalProducer, and Consumer

The recovery loop in RecoveryManager._recovery_loop() could block
indefinitely if brokers went down during changelog replay. Now checks
consumer broker availability each iteration when enabled.

- Add broker_availability_timeout param to RecoveryManager
- Check consumer.raise_if_broker_unavailable() in _recovery_loop()
- Wire timeout through Application → RecoveryManager
- Wire timeout through Application → SourceManager → SourceProcess

During recovery, successfully consuming changelog messages now resets
the broker unavailability timer, preventing false positives when
_ALL_BROKERS_DOWN fired before recovery started but brokers have since
recovered. Without this, the active metadata probe would fire every
iteration adding unnecessary latency.
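
The recovery-loop behavior described in the two commits above can be sketched as follows. This is illustrative only: `FakeConsumer`, `reset_broker_timer`, and the tick-based timeout are hypothetical stand-ins (the real check is `consumer.raise_if_broker_unavailable()` inside `RecoveryManager._recovery_loop()`), and the metadata probe is left out for brevity.

```python
from typing import Iterable, List, Optional


class KafkaBrokerUnavailableError(Exception):
    """Local stand-in for the SDK exception of the same name."""


class FakeConsumer:
    """Hypothetical consumer that measures the outage in poll ticks, not seconds."""

    def __init__(self, polls: Iterable[Optional[str]], timeout_ticks: int) -> None:
        self._polls = iter(polls)
        self._timeout_ticks = timeout_ticks
        self._ticks_down: Optional[int] = None  # None means "not marked down"

    def mark_all_brokers_down(self) -> None:
        self._ticks_down = 0

    def reset_broker_timer(self) -> None:
        self._ticks_down = None

    def poll(self) -> Optional[str]:
        if self._ticks_down is not None:
            self._ticks_down += 1
        return next(self._polls, None)  # None once the scripted polls run out

    def raise_if_broker_unavailable(self) -> None:
        if self._ticks_down is not None and self._ticks_down > self._timeout_ticks:
            raise KafkaBrokerUnavailableError("brokers down during recovery")


def recovery_loop(consumer: FakeConsumer, expected: int) -> List[str]:
    """Replay `expected` changelog messages, checking availability each iteration."""
    recovered: List[str] = []
    while len(recovered) < expected:
        consumer.raise_if_broker_unavailable()
        msg = consumer.poll()
        if msg is None:
            continue
        # A consumed changelog message proves the brokers are reachable,
        # so a stale "all brokers down" marker is cleared.
        consumer.reset_broker_timer()
        recovered.append(msg)
    return recovered
```

Without the reset, a marker set before recovery started would stay in place for the whole replay even though messages are flowing.
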
…sion to 3.23.3

120s (2 minutes) is sufficient given the active metadata probe prevents
false positives. 300s was overly conservative for most deployments.

@peter-quix peter-quix merged commit b785f42 into main Feb 13, 2026
4 checks passed
@peter-quix peter-quix deleted the fix/broker-unavailable-zombie-state branch February 13, 2026 13:58
Successfully merging this pull request may close these issues.

Application silently enters zombie state when all brokers go down — never reconnects or raises
