Skip to content

fix(connection): retry entire connect+handshake to fix PubSub reconnect#304

Open
makubo-aws wants to merge 1 commit intovalkey-io:mainfrom
makubo-aws:fix/pubsub-retry-connect-handshake
Open

fix(connection): retry entire connect+handshake to fix PubSub reconnect#304
makubo-aws wants to merge 1 commit intovalkey-io:mainfrom
makubo-aws:fix/pubsub-retry-connect-handshake

Conversation

@makubo-aws
Copy link
Copy Markdown

Summary

Fixes #169
Related: #225

Problem

The retry logic in AbstractConnection.connect() only wrapped the socket-level _connect() call. The on_connect() handshake (authentication, protocol negotiation, CLIENT SETNAME, etc.) ran outside the retry scope.

When a PubSub connection drops and the client tries to reconnect, on_connect() can raise a ConnectionError (e.g. Connection reset by peer — errno 104) if the server isn't fully ready yet. Because this error escapes the retry block, the configured retry policy is never applied and the PubSub thread crashes instead of recovering.

This is also the root cause of the idle-to-burst ConnectionError bursts seen on ElastiCache Serverless (issue #225).

Fix

Extract a _connect_with_handshake() helper that combines the socket connect and on_connect() handshake into a single retryable unit. The retry block in connect() now covers the full connection establishment flow. On failure, disconnect() is called as the error handler to ensure a clean state before the next attempt.

The same fix is applied to both:

  • valkey/connection.py (sync)
  • valkey/asyncio/connection.py (async)

Prior art

This mirrors the fix applied to redis-py in PR #3863, which resolved the equivalent issues in that library.

Changes

  • valkey/connection.py: connect() now calls _connect_with_handshake() inside the retry block; new _connect_with_handshake() method combines socket connect + handshake
  • valkey/asyncio/connection.py: same changes for the async path

Previously, only the socket-level _connect() was wrapped in the retry
block. The on_connect() handshake (auth, protocol negotiation, etc.)
ran outside the retry scope, so a ConnectionError raised during
reconnection (e.g. 'Connection reset by peer' in PubSub) would
propagate directly to the caller instead of being retried.

This is the root cause of the PubSub retry bug reported in issue valkey-io#169
and the idle-to-burst ConnectionError bursts on ElastiCache Serverless
reported in issue valkey-io#225.

Fix: extract _connect_with_handshake() that combines the socket connect
and on_connect() into a single retryable unit. The retry block now
covers the full connection establishment flow. On failure, disconnect()
is called as the error handler to ensure a clean state before the next
attempt.

The same fix is applied to both the sync (valkey/connection.py) and
async (valkey/asyncio/connection.py) AbstractConnection classes.

This mirrors the fix applied to redis-py in PR #3863.

Fixes valkey-io#169
Related: valkey-io#225
@makubo-aws
Copy link
Copy Markdown
Author

vibe coded based on existing fix for this in redis-py

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 29, 2026

Codecov Report

❌ Patch coverage is 75.00000% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.23%. Comparing base (29039f6) to head (ce63ba9).

Files with missing lines Patch % Lines
valkey/asyncio/connection.py 50.00% 4 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (29039f6) and HEAD (ce63ba9). Click for more details.

HEAD has 512 uploads less than BASE
Flag BASE (29039f6) HEAD (ce63ba9)
567 55
Additional details and impacted files
@@             Coverage Diff             @@
##             main     #304       +/-   ##
===========================================
- Coverage   76.72%   62.23%   -14.50%     
===========================================
  Files         129      129               
  Lines       34146    34150        +4     
===========================================
- Hits        26199    21253     -4946     
- Misses       7947    12897     +4950     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR expands connection retry coverage to include the full connect + handshake sequence (auth/protocol negotiation/client setup), addressing PubSub reconnect failures where handshake-time ConnectionErrors previously escaped the retry policy.

Changes:

  • Wrap socket connect and on_connect() handshake into a single retryable unit via a new _connect_with_handshake() helper (sync + async).
  • Update connect() to retry _connect_with_handshake() and run connect callbacks after a successful connection.
  • Ensure cleanup between attempts by calling disconnect() on failures.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File Description
valkey/connection.py Wrap sync connect + handshake inside retry; add _connect_with_handshake() and move callbacks post-connect.
valkey/asyncio/connection.py Mirror the same retry + handshake restructuring for asyncio connections.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 314 to 317
await self.retry.call_with_retry(
lambda: self._connect(), lambda error: self.disconnect()
lambda: self._connect_with_handshake(),
lambda error: self.disconnect(),
)
Comment thread valkey/connection.py
Comment on lines +321 to 324
self.retry.call_with_retry(
lambda: self._connect_with_handshake(),
lambda error: self.disconnect(error),
)
Comment thread valkey/connection.py
Comment on lines +360 to 362
# clean up after any error in on_connect so that the next
# retry attempt starts from a clean state
self.disconnect()
Comment thread valkey/connection.py
Comment on lines +321 to 324
self.retry.call_with_retry(
lambda: self._connect_with_handshake(),
lambda error: self.disconnect(error),
)
Comment on lines 314 to 317
await self.retry.call_with_retry(
lambda: self._connect(), lambda error: self.disconnect()
lambda: self._connect_with_handshake(),
lambda error: self.disconnect(),
)
Comment on lines 360 to 364
except ValkeyError:
# clean up after any error in on_connect
# clean up after any error in on_connect so that the next
# retry attempt starts from a clean state
await self.disconnect()
raise
@mkmkme
Copy link
Copy Markdown
Collaborator

mkmkme commented May 1, 2026

Could you have a look at:

  1. DCO failure (we use commit --signoff in the repo)
  2. Copilot comments (take them with a grain of salt though, copilot is dumb)
  3. Tests failing

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug in retry-handling in the PubSub

4 participants