[#31420] tserver: Fix kNormal pool self-deadlock in TriggerRelcacheInitConnection #31421
ellabaron-code wants to merge 2 commits
Conversation
Summary: This fix is for PgSingleTServerTest.ManyYsqlConnections, which fails on both Linux and macOS.

PgClientService::TriggerRelcacheInitConnection was a synchronous handler. It scheduled MakeRelcacheInitConnection on the messenger's scheduler, then called future::wait_for on the in-flight promise, parking the kNormal worker. MakeRelcacheInitConnection itself was dispatched onto kNormal, and in turn opened a libpq connection to the local PG listener, whose backend issues PgClient RPCs that also land on kNormal. Under enough concurrent relcache-init callers (e.g. PgSingleTServerTest.ManyYsqlConnections with 64 backends and a 32-thread kNormal pool), every kNormal worker would park in wait_for, and the only task that could call promise.set_value could never get a worker: a classic self-deadlock.

This change makes the handler async:
- Move TriggerRelcacheInitConnection from YB_PG_CLIENT_METHODS to YB_PG_CLIENT_ASYNC_METHODS in pg_client_service.h. The handler now takes RpcContext by value and responds from a callback.
- Change TabletServerIf::TriggerRelcacheInitConnection from Status-returning to void-returning with a StdStatusCallback. Concurrent callers for the same database register a callback under lock_ and return; once MakeRelcacheInitConnection finishes, RelcacheInitConnectionDone fans out to all registered callbacks. The worker is freed before any wait, so the pool is never parked on its own future.
- Update Master / MasterTabletServer / TabletServiceImpl signatures to match.

The kNormal worker now returns to the pool immediately after registering its callback, so MakeRelcacheInitConnection (and any inner PgClient RPCs spawned by the resulting PG backend) can always run.
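The parking pattern described above can be reproduced in miniature. The sketch below is illustrative, not YugabyteDB code: a one-thread "pool" runs a task that submits a second task to the same pool and then blocks in future::wait_for on it; the inner task can never run because the only worker is the one waiting. (The real bug had no timeout, so the hang was indefinite; here wait_for bounds the wait so the demonstration terminates.)

```cpp
// Minimal, hypothetical reproduction of a thread-pool self-deadlock:
// the pool's only worker blocks waiting on work queued to the same pool.
#include <chrono>
#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <thread>

class OneThreadPool {
 public:
  OneThreadPool() : worker_([this] { Run(); }) {}
  ~OneThreadPool() {
    { std::lock_guard<std::mutex> l(m_); stop_ = true; }
    cv_.notify_all();
    worker_.join();
  }
  void Submit(std::function<void()> f) {
    { std::lock_guard<std::mutex> l(m_); q_.push_back(std::move(f)); }
    cv_.notify_all();
  }

 private:
  void Run() {
    for (;;) {
      std::function<void()> f;
      {
        std::unique_lock<std::mutex> l(m_);
        cv_.wait(l, [this] { return stop_ || !q_.empty(); });
        if (stop_) return;
        f = std::move(q_.front());
        q_.pop_front();
      }
      f();  // While this runs, no other queued task can make progress.
    }
  }
  std::mutex m_;
  std::condition_variable cv_;
  std::deque<std::function<void()>> q_;
  bool stop_ = false;
  std::thread worker_;  // Declared last so the queue exists before Run() starts.
};

// Returns true iff the outer task timed out waiting on the inner one,
// i.e. the pool self-deadlocked (bounded here by wait_for, unlike the bug).
bool DemonstrateSelfDeadlock() {
  OneThreadPool pool;
  std::promise<bool> outer;
  pool.Submit([&pool, &outer] {
    // shared_ptr keeps the promise alive for the never-run inner task.
    auto inner = std::make_shared<std::promise<void>>();
    auto fut = inner->get_future();
    // The inner task needs the same (fully occupied) pool to run.
    pool.Submit([inner] { inner->set_value(); });
    outer.set_value(fut.wait_for(std::chrono::milliseconds(200)) ==
                    std::future_status::timeout);
  });
  return outer.get_future().get();
}
```

With a 32-thread pool the shape is the same; it just takes 32 concurrent outer tasks (e.g. 64 backends) to occupy every worker before any inner task can be picked up.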
Test Plan:
yb_build.sh --cxx-test pgwrapper_pg_single_tserver-test \
  --gtest_filter PgSingleTServerTest.ManyYsqlConnections

For details about the fix, refer to this document: https://docs.google.com/document/d/1wh_jsWeoOx0Jr8N6ld4j8wG1FCuCwnjjI1lVMnCi1F0/edit?tab=t.0 It includes the deadlock walkthrough and stack traces from the live hang.

This fix removes the failure on Linux, but on macOS it uncovered another problem that will be fixed in a separate PR.
Code Review
This pull request refactors the TriggerRelcacheInitConnection method across the master and tablet server components to be asynchronous. The implementation replaces blocking logic and the use of std::shared_future with a callback-based mechanism (StdStatusCallback). In the TabletServer, in-flight requests for the same database are now managed via a map of callback vectors, allowing multiple concurrent callers to be notified upon the completion of a single underlying operation without blocking their respective threads. I have no feedback to provide as the review comments were either validating the implementation or discussing hypothetical scenarios without providing actionable improvements.
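The map-of-callback-vectors mechanism described in the review comment above can be sketched roughly as follows. All names here (RelcacheInitTracker, Register, Done) are illustrative, and Status is a stand-in type, not YugabyteDB's yb::Status; only StdStatusCallback's std::function shape is taken from the PR description.

```cpp
// Hypothetical sketch: concurrent callers for the same database register a
// callback under a lock and return immediately; completion fans out to all.
#include <cstdint>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Status = std::string;  // Stand-in: empty string means OK.
using StdStatusCallback = std::function<void(const Status&)>;

class RelcacheInitTracker {
 public:
  // Returns true if the caller is the first for this database and should kick
  // off the real work (MakeRelcacheInitConnection in the PR); later callers
  // just park their callback and return, freeing their worker thread.
  bool Register(uint32_t db_oid, StdStatusCallback cb) {
    std::lock_guard<std::mutex> l(lock_);
    auto& cbs = in_flight_[db_oid];
    cbs.push_back(std::move(cb));
    return cbs.size() == 1;
  }

  // Called when the init connection finishes: fan the result out to every
  // registered callback, invoking them outside the lock to avoid re-entrancy.
  void Done(uint32_t db_oid, const Status& s) {
    std::vector<StdStatusCallback> cbs;
    {
      std::lock_guard<std::mutex> l(lock_);
      auto it = in_flight_.find(db_oid);
      if (it == in_flight_.end()) return;
      cbs = std::move(it->second);
      in_flight_.erase(it);
    }
    for (auto& cb : cbs) cb(s);
  }

 private:
  std::mutex lock_;
  std::unordered_map<uint32_t, std::vector<StdStatusCallback>> in_flight_;
};
```

The key property is that Register never blocks: the RPC worker returns to the pool right away, so the pool can never be fully parked waiting on its own completion.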
Possibly a duplicate: #27639
trigger jenkins |
Jenkins build has been triggered. Results will be posted once it completes. (JenkinsBot)
trigger jenkins |
Jenkins build already in progress. (JenkinsBot)
❌ Jenkins build failed. Exceptions:
- Checking test failure count per build versus limit of 20 (0 on mac).
- Checking for number of tests planned versus executed.
🔨 DB Build/Test Job Summary (JenkinsBot)
Resolves #31420.