Fix: detect prolonged broker unavailability and crash for restart by peter-quix · Pull Request #1082 · quixio/quix-streams

peter-quix · 2026-02-12T17:09:45Z

Summary

When all Kafka brokers become unavailable (e.g., during a cluster restart), the SDK could enter a zombie state — logging _ALL_BROKERS_DOWN indefinitely but never reconnecting or raising an error. In one observed case, the application was stuck for 20 days.

This PR adds a configurable broker_availability_timeout (default 120s) that detects prolonged broker unavailability and raises KafkaBrokerUnavailableError, allowing the container orchestrator to restart the application with fresh connections.

Key changes

- Track _ALL_BROKERS_DOWN errors on both Producer and BaseConsumer via a _broker_unavailable_since timestamp
- Active metadata probe (list_topics(timeout=5.0)) before raising, to avoid false positives from Azure's connections.max.idle.ms=180000 idle timeout
- Check in all app phases: main processing loop, sources loop, and state recovery loop
- Reset timer on successful message delivery, message processing, and changelog consumption during recovery
- Custom error callbacks are composed (wrapped), so broker tracking always runs regardless of user-provided callbacks
- New parameter: Application(broker_availability_timeout=120.0); set to 0 to disable
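
The mechanism in the list above can be sketched roughly as follows. This is a simplified, self-contained illustration, not the code from this PR: the class name BrokerAvailabilityTracker, the injected probe and clock, and the helper compose_error_callback are hypothetical. In the PR the state lives on Producer and BaseConsumer, and the probe is a list_topics(timeout=5.0) call. Treating "timed out" as strictly longer than the timeout is also an assumption.

```python
import time
from typing import Any, Callable, Optional


class KafkaBrokerUnavailableError(Exception):
    """Stand-in for the exception named in this PR."""


class BrokerAvailabilityTracker:
    """Hypothetical helper; the PR keeps this state on Producer/BaseConsumer."""

    def __init__(
        self,
        timeout: float,
        probe: Callable[[], Any],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        if timeout < 0:
            raise ValueError("broker_availability_timeout must be >= 0")
        self._timeout = timeout
        self._probe = probe  # e.g. lambda: client.list_topics(timeout=5.0)
        self._clock = clock
        self._broker_unavailable_since: Optional[float] = None

    def on_all_brokers_down(self) -> None:
        # Keep the first timestamp so repeated errors do not restart the timer.
        if self._broker_unavailable_since is None:
            self._broker_unavailable_since = self._clock()

    def reset(self) -> None:
        # Called after a successful delivery, processed message, or changelog read.
        self._broker_unavailable_since = None

    def raise_if_broker_unavailable(self) -> None:
        if self._timeout == 0 or self._broker_unavailable_since is None:
            return  # check disabled, or no outage observed
        elapsed = self._clock() - self._broker_unavailable_since
        if elapsed <= self._timeout:
            return
        try:
            # Active probe: an idle app may have seen a stale "all brokers down".
            self._probe()
        except Exception as exc:
            raise KafkaBrokerUnavailableError(
                f"All brokers unavailable for {elapsed:.0f}s; "
                "set broker_availability_timeout=0 to disable this check"
            ) from exc
        self.reset()  # probe succeeded, so the brokers are reachable again


def compose_error_callback(
    tracker: BrokerAvailabilityTracker,
    is_all_brokers_down: Callable[[Any], bool],
    user_callback: Optional[Callable[[Any], None]] = None,
) -> Callable[[Any], None]:
    """Wrap a user callback so tracking always runs first."""

    def _error_cb(error: Any) -> None:
        if is_all_brokers_down(error):
            tracker.on_all_brokers_down()
        if user_callback is not None:
            user_callback(error)

    return _error_cb
```

With confluent-kafka, the predicate passed as is_all_brokers_down would typically compare error.code() against KafkaError._ALL_BROKERS_DOWN. Injecting the clock and probe keeps the sketch testable without a broker.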

Breaking change

Default 120s timeout is a behavior change on upgrade. Existing applications that survived >2-minute broker outages by waiting indefinitely will now crash with KafkaBrokerUnavailableError. Users can set broker_availability_timeout=0 to restore the previous behavior.
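
To make the upgrade impact concrete, here is a small sketch of the timeout semantics described above. The function outage_exceeds_timeout is hypothetical and only models the timer. The real check also runs a metadata probe before raising, and the strict "longer than" comparison is an assumption.

```python
from typing import Optional


def outage_exceeds_timeout(
    unavailable_since: Optional[float],
    now: float,
    broker_availability_timeout: float = 120.0,
) -> bool:
    """Return True if the outage has lasted longer than the timeout.

    Hypothetical helper: 0 disables the check, negative values are rejected.
    """
    if broker_availability_timeout < 0:
        raise ValueError("broker_availability_timeout must be non-negative")
    if broker_availability_timeout == 0 or unavailable_since is None:
        return False
    return (now - unavailable_since) > broker_availability_timeout


# A 150-second outage that older versions would simply have waited out:
outage_seconds = 150.0
crashes_by_default = outage_exceeds_timeout(0.0, outage_seconds)
crashes_when_disabled = outage_exceeds_timeout(0.0, outage_seconds, 0)
crashes_with_longer_timeout = outage_exceeds_timeout(0.0, outage_seconds, 300.0)
```

Under these assumptions, only the default setting ends the process for this outage. Passing broker_availability_timeout=0 keeps the old wait-forever behavior, and a larger value tolerates longer outages.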

When all Kafka brokers are unreachable for longer than broker_availability_timeout (default 300s), the Application raises KafkaBrokerUnavailableError so the orchestrator can restart it with fresh connections. This prevents the zombie state where an app silently logs "all brokers down" indefinitely without recovering.

- Track _ALL_BROKERS_DOWN errors on Producer via _error_cb
- Reset timer on successful message consumption and delivery callbacks
- Active metadata probe before raising to avoid false positives on idle apps
- Compose tracking with custom error callbacks (never silently disabled)
- Add check to both _run_dataframe and _run_sources loops
- Validate broker_availability_timeout is non-negative
- Actionable error message with parameter name and disable instructions

- Add broker unavailability detection to BaseConsumer (mirrors Producer)
- Wire consumer checks into both _run_dataframe and _run_sources loops
- Reset consumer timer on successful message consumption
- Remove topic arg from list_topics() probe to avoid ACL/auto-creation issues
- Make _broker_available() private on Producer, InternalProducer, and Consumer

The recovery loop in RecoveryManager._recovery_loop() could block indefinitely if brokers went down during changelog replay. Now checks consumer broker availability each iteration when enabled.

- Add broker_availability_timeout param to RecoveryManager
- Check consumer.raise_if_broker_unavailable() in _recovery_loop()
- Wire timeout through Application → RecoveryManager
- Wire timeout through Application → SourceManager → SourceProcess

During recovery, successfully consuming changelog messages now resets the broker unavailability timer, preventing false positives when _ALL_BROKERS_DOWN fired before recovery started but brokers have since recovered. Without this, the active metadata probe would fire on every iteration, adding unnecessary latency.
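
The recovery-loop behavior described in the two commit messages above might look roughly like this. It is a sketch under stated assumptions: recovery_loop and its callable parameters are hypothetical stand-ins for RecoveryManager._recovery_loop() and the consumer methods, and the exact position of the check within each iteration is a guess.

```python
from typing import Any, Callable, Optional


class KafkaBrokerUnavailableError(Exception):
    """Stand-in for the exception named in this PR."""


def recovery_loop(
    poll: Callable[[], Optional[Any]],
    is_recovered: Callable[[], bool],
    apply_change: Callable[[Any], None],
    raise_if_broker_unavailable: Callable[[], None],
    reset_timer: Callable[[], None],
) -> None:
    """Replay changelog messages until state is recovered.

    Hypothetical sketch: every argument is injected so it runs without Kafka.
    """
    while not is_recovered():
        # Checked on every iteration, so a broker outage during replay
        # raises instead of blocking indefinitely.
        raise_if_broker_unavailable()
        msg = poll()
        if msg is None:
            continue  # empty poll: nothing consumed, timer keeps running
        apply_change(msg)
        # A consumed changelog message proves the brokers are reachable,
        # so clear any stale "all brokers down" timestamp.
        reset_timer()
```

Resetting the timer after each consumed message means a stale timestamp from before recovery is cleared by the first successful read, so later iterations return from the check immediately instead of invoking the metadata probe.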

peter-quix added 6 commits February 12, 2026 16:32

chore: bump version to 3.23.3a1 for test PyPI release

c02719f

chore: reduce default broker_availability_timeout to 120s and set ver…

48ba4af

…sion to 3.23.3 120s (2 minutes) is sufficient given the active metadata probe prevents false positives. 300s was overly conservative for most deployments.

SteveRosam approved these changes Feb 13, 2026

peter-quix merged commit b785f42 into main Feb 13, 2026
4 checks passed

peter-quix deleted the fix/broker-unavailable-zombie-state branch February 13, 2026 13:58
