Skip to content

fix: prevent "Cannot invoke RPC on closed channel" after connection recovery#3326

Open
XuanYang-cn wants to merge 3 commits intomilvus-io:masterfrom
XuanYang-cn:fix-recover-closed-channel
Open

fix: prevent "Cannot invoke RPC on closed channel" after connection recovery#3326
XuanYang-cn wants to merge 3 commits intomilvus-io:masterfrom
XuanYang-cn:fix-recover-closed-channel

Conversation

@XuanYang-cn
Copy link
Contributor

Summary

  • Fix "Cannot invoke RPC on closed channel" error caused by stale handler references after ConnectionManager._recover() — a regression from the ConnectionManager introduced in 2.6.10 (feat: add ConnectionManager to replace connections singleton for MilvusClient #3283)
  • _recover() previously created a new handler object, but MilvusClient._handler and in-flight retry loops still referenced the old (closed) handler
  • Fix: add reconnect() to GrpcHandler/AsyncGrpcHandler that resets the gRPC channel in-place (same object, new channel), so all references remain valid
  • Add recovery_gen counter to ManagedConnection for concurrent recovery guard
  • Add get_recovery_address() to strategies so GlobalStrategy passes new primary endpoint
  • Add warning log when global topology is unavailable during recovery
  • Add concurrency unit tests proving lock + gen guard prevent double reconnect

🤖 Generated with Claude Code

…ecovery

ConnectionManager._recover() previously created a new handler object and
stored it in managed.handler, but MilvusClient._handler and in-flight
retry loops still referenced the old (closed) handler. Any subsequent
RPC on the stale reference raised "Cannot invoke RPC on closed channel".

Fix: add reconnect() to GrpcHandler/AsyncGrpcHandler that resets the
gRPC channel in-place (close + re-setup on the same object). _recover()
now calls handler.reconnect() instead of creating a new handler, so all
existing references remain valid.

Additional changes:
- Add recovery_gen counter to ManagedConnection for concurrent recovery
  guard (replaces handler identity check that no longer applies)
- Add get_recovery_address() to strategies so GlobalStrategy can pass
  the new primary endpoint to reconnect()
- Add concurrency unit tests proving the lock + gen guard prevent
  double reconnect under thread contention

Signed-off-by: xuan.yang <xuan.yang@zilliz.com>
- Fix stale docstring in AsyncConnectionManager._register_error_callback
  that described old handler-replacement mechanism; updated to describe
  in-place reconnect and recovery_gen guard
- Remove no-op recovery_gen guard from async handle_error (asyncio lock
  is held throughout, so gen can never change between read and check)
- Add warning log in GlobalStrategy.get_recovery_address when topology
  is unavailable, so silent fallback to current address is observable
- Add test for the topology-unavailable warning scenario

Signed-off-by: xuan.yang <xuan.yang@zilliz.com>
@sre-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: XuanYang-cn

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@codecov
Copy link

codecov bot commented Mar 19, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.06%. Comparing base (9c29ad1) to head (0c1a07c).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3326      +/-   ##
==========================================
+ Coverage   90.04%   90.06%   +0.01%     
==========================================
  Files          64       64              
  Lines       13638    13656      +18     
==========================================
+ Hits        12281    12299      +18     
  Misses       1357     1357              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…rage

Add TestHandlerReconnect class exercising the real GrpcHandler.reconnect()
and AsyncGrpcHandler.reconnect() implementations with parametrized tests
covering address handling, close-failure recovery, and timeout forwarding.

Signed-off-by: xuan.yang <xuan.yang@zilliz.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants