Skip to content

fix(cluster): Cluster reconnect sharded subscribers v4 (#2060)#2065

Open
PavelPashov wants to merge 1 commit intoredis:v4from
PavelPashov:fix/cluster-reconnect-on-refresh-failure-v4
Open

fix(cluster): Cluster reconnect sharded subscribers v4 (#2060)#2065
PavelPashov wants to merge 1 commit intoredis:v4from
PavelPashov:fix/cluster-reconnect-on-refresh-failure-v4

Conversation

@PavelPashov
Copy link
Contributor

@PavelPashov PavelPashov commented Jan 15, 2026

When all sharded subscriber connections fail and the subsequent slots cache refresh returns ClusterAllFailedError, the cluster now properly enters reconnecting state instead of becoming zombied. This occurs when the cluster topology changes and all nodes are replaced with new IPs - the subscriber connections fail, triggering a slots refresh via the "-node" event handler. If this refresh fails (e.g., the duplicated connection for CLUSTER SLOTS times out or closes), the cluster becomes stuck in "ready" state with no working connections because normal pool connections use lazyConnect: true and never emit "end" events to trigger the drain->close->reconnect cycle. Now subscriber-triggered refreshSlotsCache() calls use a dedicated callback that detects ClusterAllFailedError and calls disconnect(true) to force reconnection, preventing the zombie state.

* fix: trigger reconnect when sharded subscriber slots refresh fails

When all sharded subscriber connections fail and the subsequent slots cache
refresh returns ClusterAllFailedError, the cluster now properly enters
reconnecting state instead of becoming zombied. This occurs when the cluster
topology changes and all nodes are replaced with new IPs - the subscriber
connections fail, triggering a slots refresh via the "-node" event handler.
If this refresh fails (e.g., the duplicated connection for CLUSTER SLOTS
times out or closes), the cluster becomes stuck in "ready" state with no
working connections because normal pool connections use lazyConnect: true
and never emit "end" events to trigger the drain->close->reconnect cycle.
Now subscriber-triggered refreshSlotsCache() calls use a dedicated callback
that detects ClusterAllFailedError and calls disconnect(true) to force
reconnection, preventing the zombie state.

* test: ensure reconnect after sharded subscriber failure
@jit-ci
Copy link

jit-ci bot commented Jan 15, 2026

❌ Security scan failed

Security scan failed: Branch fix/cluster-reconnect-on-refresh-failure-v4 does not exist in the remote repository


💡 Need to bypass this check? Comment @sera bypass to override.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant