Skip to content

fix: validate connection health and handle missing slots in cluster refresh#944

Open
billshen99 wants to merge 1 commit intoredis:mainfrom
billshen99:conn_reuse_and_slot_missing_fix_clean
Open

fix: validate connection health and handle missing slots in cluster refresh#944
billshen99 wants to merge 1 commit intoredis:mainfrom
billshen99:conn_reuse_and_slot_missing_fix_clean

Conversation

@billshen99
Copy link

Summary

Fixes critical bug where broken connections were blindly reused during cluster
topology refresh, causing persistent "wrong shard/slot" errors requiring pod restarts.

Root Cause

During _refresh(), connections were reused without health validation. Network
issues could break connections, which would persist in error state indefinitely
across refresh cycles.

Changes

  1. Critical, Connection health validation and reuse:
    Validate connection health with Error() before reusing
  • The cluster client now checks the health of existing connections before reusing them during a topology refresh. Connections reporting errors are replaced with new ones, preventing broken connections from being reused.
  1. Slot mapping resilience:
  • When the cluster topology information is incomplete (e.g., missing slots), the client preserves old slot-to-connection mappings for slots not covered by the new topology, ensuring no slot is left without a connection.
  1. Replica slot fallback improvements:
    Add replica-to-master fallback for missing replica slots
  • In scenarios where replica slot information is missing, the client now falls back to using the master connection for those slots, ensuring continued availability. [1] [2] [3]

Impact

  • Broken connections now auto-detected and replaced within seconds
  • No more manual pod restarts needed for this issue
  • Graceful degradation during incomplete topology updates
    This pull request improves the reliability of the Redis cluster client by ensuring that unhealthy (broken) connections are not reused after a topology refresh, and by making slot mapping more robust when the cluster topology is incomplete. It also adds comprehensive tests to validate these behaviors.

Testing enhancements:

  • Added tests to verify that broken connections are not reused after a refresh, and that all slots are properly updated to use healthy connections, ensuring robust failover and recovery.

@jit-ci
Copy link

jit-ci bot commented Jan 29, 2026

Hi, I’m Jit, a friendly security platform designed to help developers build secure applications from day zero with an MVS (Minimal viable security) mindset.

In case there are security findings, they will be communicated to you as a comment inside the PR.

Hope you’ll enjoy using Jit.

Questions? Comments? Want to learn more? Get in touch with us.

conns[addr] = fresh
// Validate connection health before reusing
// If connection has error, it will be replaced with a new one
if cc.conn.Error() == nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mux handles reconnections internally, so no Error() check at the cluster client is okay.

Comment on lines +332 to +334
// Preserve old slot mappings for any slots not covered by the new topology
// This prevents nil connections when CLUSTER SLOTS returns incomplete data
for i := 0; i < 16384; i++ {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not do this. CLUSTER SLOTS should be the only truth.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants