fix(connection): retry entire connect+handshake to fix PubSub reconnect by makubo-aws · Pull Request #304 · valkey-io/valkey-py

makubo-aws · 2026-04-29T22:38:17Z

Summary

Fixes #169
Related: #225

Problem

The retry logic in AbstractConnection.connect() only wrapped the socket-level _connect() call. The on_connect() handshake (authentication, protocol negotiation, CLIENT SETNAME, etc.) ran outside the retry scope.

When a PubSub connection drops and the client tries to reconnect, on_connect() can raise a ConnectionError (e.g. Connection reset by peer — errno 104) if the server isn't fully ready yet. Because this error escapes the retry block, the configured retry policy is never applied and the PubSub thread crashes instead of recovering.

This is also the root cause of the idle-to-burst ConnectionError bursts seen on ElastiCache Serverless (issue #225).

Fix

Extract a _connect_with_handshake() helper that combines the socket connect and on_connect() handshake into a single retryable unit. The retry block in connect() now covers the full connection establishment flow. On failure, disconnect() is called as the error handler to ensure a clean state before the next attempt.

The same fix is applied to both:

valkey/connection.py (sync)
valkey/asyncio/connection.py (async)

Prior art

This mirrors the fix applied to redis-py in PR #3863, which resolved the equivalent issues in that library.

Changes

valkey/connection.py: connect() now calls _connect_with_handshake() inside the retry block; new _connect_with_handshake() method combines socket connect + handshake
valkey/asyncio/connection.py: same changes for the async path

Previously, only the socket-level _connect() was wrapped in the retry block. The on_connect() handshake (auth, protocol negotiation, etc.) ran outside the retry scope, so a ConnectionError raised during reconnection (e.g. 'Connection reset by peer' in PubSub) would propagate directly to the caller instead of being retried. This is the root cause of the PubSub retry bug reported in issue valkey-io#169 and the idle-to-burst ConnectionError bursts on ElastiCache Serverless reported in issue valkey-io#225. Fix: extract _connect_with_handshake() that combines the socket connect and on_connect() into a single retryable unit. The retry block now covers the full connection establishment flow. On failure, disconnect() is called as the error handler to ensure a clean state before the next attempt. The same fix is applied to both the sync (valkey/connection.py) and async (valkey/asyncio/connection.py) AbstractConnection classes. This mirrors the fix applied to redis-py in PR #3863. Fixes valkey-io#169 Related: valkey-io#225

makubo-aws · 2026-04-29T22:39:14Z

vibe coded based on existing fix for this in redis-py

codecov-commenter · 2026-04-29T22:42:59Z

Codecov Report

❌ Patch coverage is 75.00000% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.23%. Comparing base (29039f6) to head (ce63ba9).

Files with missing lines	Patch %	Lines
valkey/asyncio/connection.py	50.00%	4 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (29039f6) and HEAD (ce63ba9). Click for more details.

HEAD has 512 uploads less than BASE

Flag BASE (29039f6) HEAD (ce63ba9)

567 55

Additional details and impacted files

@@             Coverage Diff             @@
##             main     #304       +/-   ##
===========================================
- Coverage   76.72%   62.23%   -14.50%     
===========================================
  Files         129      129               
  Lines       34146    34150        +4     
===========================================
- Hits        26199    21253     -4946     
- Misses       7947    12897     +4950

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copilot

Pull request overview

This PR expands connection retry coverage to include the full connect + handshake sequence (auth/protocol negotiation/client setup), addressing PubSub reconnect failures where handshake-time ConnectionErrors previously escaped the retry policy.

Changes:

Wrap socket connect and on_connect() handshake into a single retryable unit via a new _connect_with_handshake() helper (sync + async).
Update connect() to retry _connect_with_handshake() and run connect callbacks after a successful connection.
Ensure cleanup between attempts by calling disconnect() on failures.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File	Description
`valkey/connection.py`	Wrap sync connect + handshake inside retry; add `_connect_with_handshake()` and move callbacks post-connect.
`valkey/asyncio/connection.py`	Mirror the same retry + handshake restructuring for asyncio connections.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

            await self.retry.call_with_retry(
-                lambda: self._connect(), lambda error: self.disconnect()
+                lambda: self._connect_with_handshake(),
+                lambda error: self.disconnect(),
            )


+            self.retry.call_with_retry(
+                lambda: self._connect_with_handshake(),
+                lambda error: self.disconnect(error),
            )


+            # clean up after any error in on_connect so that the next
+            # retry attempt starts from a clean state
            self.disconnect()


+            self.retry.call_with_retry(
+                lambda: self._connect_with_handshake(),
+                lambda error: self.disconnect(error),
            )


            await self.retry.call_with_retry(
-                lambda: self._connect(), lambda error: self.disconnect()
+                lambda: self._connect_with_handshake(),
+                lambda error: self.disconnect(),
            )


        except ValkeyError:
-            # clean up after any error in on_connect
+            # clean up after any error in on_connect so that the next
+            # retry attempt starts from a clean state
            await self.disconnect()
            raise


mkmkme · 2026-05-01T17:50:35Z

Could you have a look at:

DCO failure (we use commit --signoff in the repo)
Copilot comments (take them with a grain of salt though, copilot is dumb)
Tests failing

Thanks!

makubo-aws requested review from ahmedsobeh, bogdanp05, mkmkme and smeso as code owners April 29, 2026 22:38

mkmkme requested a review from Copilot May 1, 2026 17:41

Copilot started reviewing on behalf of mkmkme May 1, 2026 17:42 View session

Copilot AI reviewed May 1, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(connection): retry entire connect+handshake to fix PubSub reconnect#304

fix(connection): retry entire connect+handshake to fix PubSub reconnect#304
makubo-aws wants to merge 1 commit intovalkey-io:mainfrom
makubo-aws:fix/pubsub-retry-connect-handshake

makubo-aws commented Apr 29, 2026

Uh oh!

makubo-aws commented Apr 29, 2026

Uh oh!

codecov-commenter commented Apr 29, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

mkmkme commented May 1, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

makubo-aws commented Apr 29, 2026

Summary

Problem

Fix

Prior art

Changes

Uh oh!

makubo-aws commented Apr 29, 2026

Uh oh!

codecov-commenter commented Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

mkmkme commented May 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

codecov-commenter commented Apr 29, 2026 •

edited

Loading

mkmkme commented May 1, 2026 •

edited

Loading