[#31420] tserver: Fix kNormal pool self-deadlock in TriggerRelcacheInitConnection #31421

Open
ellabaron-code wants to merge 2 commits into yugabyte:master from Shopify:fix-trigger-relcache-init-deadlock

@ellabaron-code
Collaborator

@ellabaron-code ellabaron-code commented May 4, 2026

Summary

Fixes a self-deadlock in the kNormal PgClient RPC pool that hangs
PgSingleTServerTest.ManyYsqlConnections on both Linux and macOS.

PgClientService::TriggerRelcacheInitConnection was a synchronous handler. It scheduled
MakeRelcacheInitConnection on the messenger's scheduler and then called future::wait_for on the
in-flight promise, parking the kNormal worker. MakeRelcacheInitConnection itself ran on
kNormal, and opened a libpq connection to the local PG listener whose backend issues PgClient
RPCs that also land on kNormal. Under enough concurrent relcache-init callers (e.g.
ManyYsqlConnections with 64 backends and a 32-thread kNormal pool), every kNormal worker
would park in wait_for and the only thing that could call promise.set_value would never get a
worker.
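The blocking pattern described above can be reproduced in miniature. The following is only a sketch under stated assumptions: `TinyPool` and `BlockingHandlerTimesOut` are hypothetical names, not YugabyteDB code, and a short `wait_for` timeout is used so the example terminates instead of hanging. With a single worker, the handler occupies the only thread while the task that would fulfil its promise sits in the queue behind it.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical fixed-size pool standing in for the kNormal RPC pool.
class TinyPool {
 public:
  explicit TinyPool(std::size_t num_workers) {
    assert(num_workers > 0);
    for (std::size_t i = 0; i < num_workers; ++i) {
      workers_.emplace_back([this] { Run(); });
    }
  }

  ~TinyPool() {
    {
      std::lock_guard<std::mutex> guard(mu_);
      stop_ = true;
    }
    cv_.notify_all();
    for (auto& worker : workers_) worker.join();
  }

  void Submit(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> guard(mu_);
      tasks_.push(std::move(task));
    }
    cv_.notify_one();
  }

 private:
  void Run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
        if (tasks_.empty()) return;  // Stopping and the queue is drained.
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task();
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  bool stop_ = false;
  std::vector<std::thread> workers_;
};

// Mimics a synchronous handler: it queues follow-up work on the SAME pool and
// then blocks its own worker until that work completes. Returns true if the
// wait timed out, i.e. no worker was free to run the follow-up work.
bool BlockingHandlerTimesOut(std::size_t num_workers) {
  TinyPool pool(num_workers);
  auto outcome = std::make_shared<std::promise<bool>>();
  std::future<bool> outcome_future = outcome->get_future();
  pool.Submit([&pool, outcome] {
    auto done = std::make_shared<std::promise<void>>();
    std::future<void> done_future = done->get_future();
    pool.Submit([done] { done->set_value(); });  // Needs a free worker.
    const bool timed_out =
        done_future.wait_for(std::chrono::milliseconds(300)) ==
        std::future_status::timeout;
    outcome->set_value(timed_out);
  });
  return outcome_future.get();
}
```

In this sketch `BlockingHandlerTimesOut(1)` reports a timeout, while a pool with a spare worker completes normally. The scenario in the description is the same situation scaled up: enough concurrent callers to occupy every worker.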

This change makes the handler async:

  • Move TriggerRelcacheInitConnection from YB_PG_CLIENT_METHODS to YB_PG_CLIENT_ASYNC_METHODS
    in pg_client_service.h. The handler now takes RpcContext by value and responds from a
    callback.
  • Change TabletServerIf::TriggerRelcacheInitConnection from Status-returning to void-returning
    with an StdStatusCallback. Concurrent callers for the same database register a callback under
    lock_ and return; once MakeRelcacheInitConnection finishes, RelcacheInitConnectionDone fans
    out to all registered callbacks. The worker is freed before any wait, so the pool is never parked
    on its own future.
  • Update Master / MasterTabletServer / TabletServiceImpl signatures to match.

The kNormal worker now returns to the pool immediately after registering its callback, so
MakeRelcacheInitConnection (and any inner PgClient RPCs spawned by the resulting PG backend) can
always run.
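The register-then-fan-out idea can be sketched as follows. This is an illustration under assumptions, not the code in this PR: `RelcacheInitRegistry`, `Register`, `Done`, and the string-based `Status` are hypothetical stand-ins for the real `lock_`-protected state, `RelcacheInitConnectionDone`, and `StdStatusCallback`.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins: an empty string means OK.
using Status = std::string;
using StatusCallback = std::function<void(const Status&)>;

class RelcacheInitRegistry {
 public:
  // Registers a callback for db_oid and returns immediately. Returns true if
  // this caller is the first for that database and should start the work.
  bool Register(std::uint32_t db_oid, StatusCallback callback) {
    assert(callback != nullptr);
    std::lock_guard<std::mutex> guard(lock_);
    const bool first = in_flight_.find(db_oid) == in_flight_.end();
    in_flight_[db_oid].push_back(std::move(callback));
    return first;
  }

  // Called once when the work for db_oid finishes; notifies every waiter.
  void Done(std::uint32_t db_oid, const Status& status) {
    std::vector<StatusCallback> callbacks;
    {
      std::lock_guard<std::mutex> guard(lock_);
      auto it = in_flight_.find(db_oid);
      if (it == in_flight_.end()) return;
      callbacks = std::move(it->second);
      in_flight_.erase(it);
    }
    // Run callbacks outside the lock so they can safely re-enter Register.
    for (auto& callback : callbacks) callback(status);
  }

 private:
  std::mutex lock_;
  std::map<std::uint32_t, std::vector<StatusCallback>> in_flight_;
};
```

No caller blocks in this shape: each one hands over a callback and returns its thread to the pool, and a single completion notifies all of them.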

This fix removes the failure on Linux. On macOS it uncovered a separate problem that will be fixed
in a follow-up PR.

Resolves #31420.

For details about the fix, refer to this document:
https://docs.google.com/document/d/1wh_jsWeoOx0Jr8N6ld4j8wG1FCuCwnjjI1lVMnCi1F0/edit?tab=t.0


CSI

Summary:

This fix is for PgSingleTServerTest.ManyYsqlConnections which fails on
both Linux and macOS.

PgClientService::TriggerRelcacheInitConnection was a synchronous handler.
It scheduled MakeRelcacheInitConnection on the messenger's scheduler, then
called future::wait_for on the in-flight promise, parking the kNormal
worker. MakeRelcacheInitConnection itself was dispatched onto kNormal, and
in turn opened a libpq connection to the local PG listener, whose backend
issues PgClient RPCs that also land on kNormal. Under enough concurrent
relcache-init callers (e.g. PgSingleTServerTest.ManyYsqlConnections with
64 backends and a 32-thread kNormal pool), every kNormal worker would park
in wait_for and the only thing that could call promise.set_value would
never get a worker. Classic self-deadlock.

This change makes the handler async:

- Move TriggerRelcacheInitConnection from YB_PG_CLIENT_METHODS to
  YB_PG_CLIENT_ASYNC_METHODS in pg_client_service.h. The handler now takes
  RpcContext by value and responds from a callback.
- Change TabletServerIf::TriggerRelcacheInitConnection from
  Status-returning to void-returning with an StdStatusCallback. Concurrent
  callers for the same database register a callback under lock_ and
  return; once MakeRelcacheInitConnection finishes,
  RelcacheInitConnectionDone fans out to all registered callbacks. The
  worker is freed before any wait, so the pool is never parked on its own
  future.
- Update Master / MasterTabletServer / TabletServiceImpl signatures to
  match.

The kNormal worker now returns to the pool immediately after registering
its callback, so MakeRelcacheInitConnection (and any inner PgClient RPCs
spawned by the resulting PG backend) can always run.

Test Plan:
yb_build.sh --cxx-test pgwrapper_pg_single_tserver-test \
  --gtest_filter PgSingleTServerTest.ManyYsqlConnections

For details about the fix refer to this document:
https://docs.google.com/document/d/1wh_jsWeoOx0Jr8N6ld4j8wG1FCuCwnjjI1lVMnCi1F0/edit?tab=t.0

The document includes deadlock walkthrough and stack traces from the
live hang.

This fix removes the failure on Linux, but on macOS it uncovered another
problem that will be fixed in another PR.
@netlify

netlify Bot commented May 4, 2026

Deploy Preview for infallible-bardeen-164bc9 ready!

Built without sensitive environment variables

Name Link
🔨 Latest commit b0ccf9c
🔍 Latest deploy log https://app.netlify.com/projects/infallible-bardeen-164bc9/deploys/69f9f1e19c73f1000853e5af
😎 Deploy Preview https://deploy-preview-31421--infallible-bardeen-164bc9.netlify.app

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request refactors the TriggerRelcacheInitConnection method across the master and tablet server components to be asynchronous. The implementation replaces blocking logic and the use of std::shared_future with a callback-based mechanism (StdStatusCallback). In the TabletServer, in-flight requests for the same database are now managed via a map of callback vectors, allowing multiple concurrent callers to be notified upon the completion of a single underlying operation without blocking their respective threads. I have no feedback to provide as the review comments were either validating the implementation or discussing hypothetical scenarios without providing actionable improvements.

Contributor

@myang2021 myang2021 left a comment

Thanks for the fix!

Comment thread src/yb/tserver/tablet_server.cc
@OlegLoginov
Contributor

Possibly a duplicate: #27639

@ellabaron-code ellabaron-code requested a review from myang2021 May 5, 2026 18:53
@rahulb-yb
Contributor

trigger jenkins

@hari90
Contributor

hari90 commented May 5, 2026

Jenkins build has been triggered. Results will be posted once it completes. CSI


JenkinsBot

@myang2021
Contributor

trigger jenkins

@hari90
Contributor

hari90 commented May 5, 2026

Jenkins already in progress for b0ccf9c6.


JenkinsBot

@hari90
Contributor

hari90 commented May 6, 2026

Jenkins build for commit b0ccf9c6: Fail
CSI
Reason: CSI status: WARNING

Exceptions:

Checking test failure count per build versus limit of 20 (0 on mac).

| Build | Failures | Status |
| --- | --- | --- |
| PR31421-arm-alma8-clang21-release #1 | 3 | Okay |
| PR31421-alma8-clang21-release #1 | 2 | Okay |
| PR31421-arm-mac14-clang21-release #1 | 0 | Okay |
| PR31421-ubuntu22.04-clang21-debug #1 | 0 | Okay |
| PR31421-alma8-clang21-tsan #1 | 2 | Okay |
| PR31421-alma8-gcc12-fastdebug #1 | 22 | FAILURE |
| PR31421-mac14-clang21-release #1 | 0 | Okay |
| PR31421-alma9-clang21-asan #1 | 8 | Okay |

Checking for number of tests planned versus executed.

| Type | C++ Plan | Java Plan | Planned | Executed | Status |
| --- | --- | --- | --- | --- | --- |
| PR31421-arm-alma8-clang21-release #1 | 6140 | 3259 | 9399 | 9399 | Okay |
| PR31421-alma8-clang21-release #1 | 6140 | 3262 | 9402 | 67 | FAILURE |
| PR31421-arm-mac14-clang21-release #1 | 11 | 4 | 15 | 15 | Okay |
| PR31421-ubuntu22.04-clang21-debug #1 | 0 | 0 | 0 | 0 | Okay |
| PR31421-alma8-clang21-tsan #1 | 6154 | 3061 | 9215 | 47 | FAILURE |
| PR31421-alma8-gcc12-fastdebug #1 | 6155 | 3262 | 9417 | 133 | FAILURE |
| PR31421-mac14-clang21-release #1 | 0 | 0 | 0 | 0 | Okay |
| PR31421-alma9-clang21-asan #1 | 6154 | 3155 | 9309 | 9309 | Okay |

🔨 DB Build/Test Job Summary

| Build | Total | Passed | Failed | Failed After Retries |
| --- | --- | --- | --- | --- |
| PR31421-arm-alma8-clang21-release | 9401 | 9027 | 8 | 3 |
| PR31421-alma8-clang21-release | 79 | 26 | 2 | 2 |
| PR31421-arm-mac14-clang21-release | 17 | 17 | 0 | 0 |
| PR31421-ubuntu22.04-clang21-debug | 2 | 2 | 0 | 0 |
| PR31421-alma8-clang21-tsan | 59 | 5 | 2 | 2 |
| PR31421-alma8-gcc12-fastdebug | 144 | 65 | 22 | 22 |
| PR31421-mac14-clang21-release | 2 | 2 | 0 | 0 |
| PR31421-alma9-clang21-asan | 9311 | 8607 | 15 | 8 |

JenkinsBot

@myang2021
Contributor

trigger jenkins

@hari90
Contributor

hari90 commented May 6, 2026

Jenkins build for commit b0ccf9c6: Fail
CSI
Reason: CSI status: WARNING

Exceptions:

Checking test failure count per build versus limit of 20 (0 on mac).

| Build | Failures | Status |
| --- | --- | --- |
| PR31421-arm-alma8-clang21-release #1 | 3 | Okay |
| PR31421-alma8-clang21-release #1 | 2 | Okay |
| PR31421-arm-mac14-clang21-release #1 | 0 | Okay |
| PR31421-ubuntu22.04-clang21-debug #1 | 0 | Okay |
| PR31421-alma8-clang21-tsan #1 | 2 | Okay |
| PR31421-alma8-gcc12-fastdebug #1 | 22 | FAILURE |
| PR31421-mac14-clang21-release #1 | 0 | Okay |
| PR31421-alma9-clang21-asan #1 | 8 | Okay |

🔨 DB Build/Test Job Summary

| Build | Total | Passed | Failed | Failed After Retries |
| --- | --- | --- | --- | --- |
| PR31421-arm-alma8-clang21-release (In Progress) | 9401 | 9027 | 0 | 0 |
| PR31421-alma8-clang21-release (In Progress) | 79 | 26 | 0 | 0 |
| PR31421-arm-mac14-clang21-release (In Progress) | 17 | 17 | 0 | 0 |
| PR31421-ubuntu22.04-clang21-debug (In Progress) | 1 | 1 | 0 | 0 |
| PR31421-alma8-clang21-tsan (In Progress) | 59 | 5 | 0 | 0 |
| PR31421-alma8-gcc12-fastdebug (In Progress) | 144 | 65 | 0 | 0 |
| PR31421-mac14-clang21-release (In Progress) | 1 | 1 | 0 | 0 |
| PR31421-alma9-clang21-asan (In Progress) | 9311 | 8607 | 0 | 0 |

JenkinsBot

@rahulb-yb
Contributor

trigger jenkins

@hari90
Contributor

hari90 commented May 6, 2026

Jenkins build for commit b0ccf9c6: Fail
CSI
Reason: CSI status: WARNING

Exceptions:

Checking test failure count per build versus limit of 20 (0 on mac).

| Build | Failures | Status |
| --- | --- | --- |
| PR31421-arm-alma8-clang21-release #1 | 3 | Okay |
| PR31421-alma8-clang21-release #1 | 2 | Okay |
| PR31421-arm-mac14-clang21-release #1 | 0 | Okay |
| PR31421-ubuntu22.04-clang21-debug #1 | 0 | Okay |
| PR31421-alma8-clang21-tsan #1 | 2 | Okay |
| PR31421-alma8-gcc12-fastdebug #1 | 22 | FAILURE |
| PR31421-mac14-clang21-release #1 | 0 | Okay |
| PR31421-alma9-clang21-asan #1 | 7 | Okay |

Checking for number of tests planned versus executed.

| Type | C++ Plan | Java Plan | Planned | Executed | Status |
| --- | --- | --- | --- | --- | --- |
| PR31421-arm-alma8-clang21-release #1 | 6140 | 3259 | 9399 | 9399 | Okay |
| PR31421-alma8-clang21-release #1 | 6140 | 3262 | 9402 | 67 | FAILURE |
| PR31421-arm-mac14-clang21-release #1 | 11 | 4 | 15 | 15 | Okay |
| PR31421-ubuntu22.04-clang21-debug #1 | 0 | 0 | 0 | -1 | FAILURE |
| PR31421-alma8-clang21-tsan #1 | 6154 | 3061 | 9215 | 47 | FAILURE |
| PR31421-alma8-gcc12-fastdebug #1 | 6155 | 3262 | 9417 | 133 | FAILURE |
| PR31421-mac14-clang21-release #1 | 0 | 0 | 0 | -1 | FAILURE |
| PR31421-alma9-clang21-asan #1 | 6154 | 3155 | 9309 | 9309 | Okay |

🔨 DB Build/Test Job Summary

| Build | Total | Passed | Failed | Failed After Retries |
| --- | --- | --- | --- | --- |
| PR31421-arm-alma8-clang21-release | 9401 | 9027 | 8 | 3 |
| PR31421-alma8-clang21-release | 79 | 26 | 2 | 2 |
| PR31421-arm-mac14-clang21-release | 17 | 17 | 0 | 0 |
| PR31421-ubuntu22.04-clang21-debug | 2 | 1 | 0 | 0 |
| PR31421-alma8-clang21-tsan | 59 | 5 | 2 | 2 |
| PR31421-alma8-gcc12-fastdebug | 144 | 65 | 22 | 22 |
| PR31421-mac14-clang21-release | 2 | 1 | 0 | 0 |
| PR31421-alma9-clang21-asan | 9311 | 8607 | 15 | 7 |

JenkinsBot

@myang2021
Contributor

trigger jenkins

@hari90
Contributor

hari90 commented May 8, 2026

Jenkins build has been triggered. Results will be posted once it completes. CSI


JenkinsBot

@hari90
Contributor

hari90 commented May 9, 2026

Jenkins build for commit b0ccf9c6: Fail
CSI
Reason: CSI status: WARNING

Exceptions:

Checking test failure count per build versus limit of 20 (0 on mac).

| Build | Failures | Status |
| --- | --- | --- |
| PR31421-arm-alma8-clang21-release #1 | 9 | Okay |
| PR31421-alma8-clang21-release #1 | 2 | Okay |
| PR31421-arm-mac14-clang21-release #1 | 0 | Okay |
| PR31421-ubuntu22.04-clang21-debug #1 | 0 | Okay |
| PR31421-alma8-clang21-tsan #1 | 20 | Okay |
| PR31421-alma8-gcc12-fastdebug #1 | 53 | FAILURE |
| PR31421-mac14-clang21-release #1 | 0 | Okay |
| PR31421-alma9-clang21-asan #1 | 7 | Okay |

Checking for number of tests planned versus executed.

| Type | C++ Plan | Java Plan | Planned | Executed | Status |
| --- | --- | --- | --- | --- | --- |
| PR31421-arm-alma8-clang21-release #1 | 6140 | 3259 | 9399 | 9399 | Okay |
| PR31421-alma8-clang21-release #1 | 6140 | 3262 | 9402 | 67 | FAILURE |
| PR31421-arm-mac14-clang21-release #1 | 11 | 4 | 15 | 15 | Okay |
| PR31421-ubuntu22.04-clang21-debug #1 | 0 | 0 | 0 | 0 | Okay |
| PR31421-alma8-clang21-tsan #1 | 6154 | 3061 | 9215 | 9214 | FAILURE |
| PR31421-alma8-gcc12-fastdebug #1 | 6155 | 3262 | 9417 | 9417 | Okay |
| PR31421-mac14-clang21-release #1 | 0 | 0 | 0 | 0 | Okay |
| PR31421-alma9-clang21-asan #1 | 6154 | 3155 | 9309 | 9309 | Okay |

🔨 DB Build/Test Job Summary

| Build | Total | Passed | Failed | Failed After Retries |
| --- | --- | --- | --- | --- |
| PR31421-arm-alma8-clang21-release | 9401 | 9027 | 15 | 9 |
| PR31421-alma8-clang21-release | 81 | 26 | 2 | 2 |
| PR31421-arm-mac14-clang21-release | 17 | 17 | 0 | 0 |
| PR31421-ubuntu22.04-clang21-debug | 2 | 2 | 0 | 0 |
| PR31421-alma8-clang21-tsan | 9217 | 7569 | 21 | 20 |
| PR31421-alma8-gcc12-fastdebug | 9419 | 8935 | 75 | 53 |
| PR31421-mac14-clang21-release | 2 | 2 | 0 | 0 |
| PR31421-alma9-clang21-asan | 9311 | 8607 | 15 | 7 |

JenkinsBot

Development

Successfully merging this pull request may close these issues.

[DocDB] PgSingleTServerTest.ManyYsqlConnections deadlocks via TriggerRelcacheInitConnection on kNormal pool

6 participants