Skip to content

[#31522] tserver: Fix VerifyPgClientServiceCleanupQueue on macOS#31523

Open
ellabaron-code wants to merge 1 commit into
yugabyte:masterfrom
Shopify:fix-pg-client-cleanup-queue-mac
Open

[#31522] tserver: Fix VerifyPgClientServiceCleanupQueue on macOS#31523
ellabaron-code wants to merge 1 commit into
yugabyte:masterfrom
Shopify:fix-pg-client-cleanup-queue-mac

Conversation

@ellabaron-code
Copy link
Copy Markdown
Collaborator

@ellabaron-code ellabaron-code commented May 8, 2026

Summary

Removes the macOS-specific 250ms branch in PgClientService's session
cleanup delay so all platforms use 50ms. The macOS delay caused
VerifyPgClientServiceCleanupQueue to fail because the internal libpq
backend spawned by TriggerRelcacheInitConnection on the first
Connect() was still in sessions_ when the assertion ran (~230ms
after libpq close).

The 250ms macOS branch was added in a6c5ea1 to fix
PgSharedMemTest.ConnectionShutdown, which still passes with the
unified 50ms delay.

Fixes #31522.

Test plan

  • On macOS arm64: ./yb_build.sh release --cxx-test pg_client-test --gtest_filter PgClientTest.VerifyPgClientServiceCleanupQueue passes.
  • On macOS arm64: PgSharedMemTest.ConnectionShutdown still passes with the unified 50ms delay (regression check for the original a6c5ea1 motivation).
  • On Linux: both tests continue to pass.

CSI

The test opens 30 user connections and asserts that PgClientService has
exactly 31 sessions (30 + 1 yb_ash bgworker). On macOS the count was 32.

The extra session is the internal libpq backend spawned by
TriggerRelcacheInitConnection. The first user Connect() triggers it
because no per-DB relcache init file exists on a fresh cluster
(relcache.c:6742). The internal backend exits promptly, but
PgClientService waits a hardcoded delay after the backend's RPC channel
shuts down before calling getsid(pid) to confirm the process is gone
and removing the session. That delay was 250ms on macOS vs. 50ms on
Linux release — and the test asserts ~230ms after the libpq close, so
on macOS the session is still in sessions_ when the assertion runs.

The 250ms macOS branch was added in commit a6c5ea1 to fix
PgSharedMemTest.ConnectionShutdown. That test still passes with the
unified 50ms delay, so remove the macOS-specific branch.
@netlify
Copy link
Copy Markdown

netlify Bot commented May 8, 2026

Deploy Preview for infallible-bardeen-164bc9 ready!

Built without sensitive environment variables

Name Link
🔨 Latest commit 68042c5
🔍 Latest deploy log https://app.netlify.com/projects/infallible-bardeen-164bc9/deploys/69fe14c1b01d5500094d0554
😎 Deploy Preview https://deploy-preview-31523--infallible-bardeen-164bc9.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request removes the platform-specific delay for macOS in the ListenConnectionShutdown callback within PgClientServiceImpl, standardizing the delay using RegularBuildVsSanitizers(50ms, 1000ms) across all platforms. I have no feedback to provide.

@rahulb-yb
Copy link
Copy Markdown
Contributor

trigger jenkins

@hari90
Copy link
Copy Markdown
Contributor

hari90 commented May 8, 2026

Jenkins build has been triggered. Results will be posted once it completes. CSI


JenkinsBot

@hari90
Copy link
Copy Markdown
Contributor

hari90 commented May 9, 2026

Jenkins build for commit 68042c50: Pass
CSI

Passed: 12


🔨 DB Build/Test Job Summary

Build Total Passed Failed Failed After Retries
PR31523-arm-mac14-clang21-release 17 17 0 0
PR31523-arm-alma8-clang21-release 9394 9021 6 3
PR31523-alma9-clang21-asan 9304 8599 12 7
PR31523-ubuntu22.04-clang21-debug 2 2 0 0
PR31523-alma8-clang21-tsan 9210 7583 14 7
PR31523-alma8-gcc12-fastdebug 9412 8978 10 5
PR31523-alma8-clang21-release 9397 9027 5 1
PR31523-mac14-clang21-release 2 2 0 0

JenkinsBot

@spolitov
Copy link
Copy Markdown
Contributor

The PR summary does not explain why it fixes this test.
VerifyPgClientServiceCleanupQueue waits 4 minutes until number of sessions becomes equals to expected one.
Session timeout is 1 minute. So it gives 3 minutes to every closed session to cleanup.

the internal libpq backend spawned by TriggerRelcacheInitConnection on the first Connect() was still in sessions_ when the assertion ran (~230ms after libpq close).
We wait 3 minutes after libpq close, why does 230ms is important here?

Also please note that a6c5ea1 was fixing the test in Jenkins where tests were we used overloaded machines to run tests. As result we use only minimal subset of tests on Mac in Jenkins now.

@ellabaron-code
Copy link
Copy Markdown
Collaborator Author

The WaitFor at line 261 isn't where this fails — the failure is the immediate ASSERT_EQ at line 258 right after the connection loop.

Timeline from the test log:

08:41:20.845679 — relcache init connection request to "yugabyte" starts (fires once on first Connect to a fresh DB)
08:41:21.025385 — relcache init succeeds; the libpq client closes
08:41:21.246174 — last user connection acquired; ASSERT_EQ runs right after
So the assertion runs ~221 ms after the relcache-init backend closed. With the macOS 250 ms ListenConnectionShutdown delay, the cleanup timer hasn't fired yet, and the test sees the lingering session (32 instead of the expected 31). With the Linux 50 ms delay it fires at ~08:41:21.075, well before the assertion — which is why the test has always passed on Linux.

I'd prefer to keep the fix as-is. Reliable macOS test runs matter to us because we do most of our development on macOS, and aligning the delay with what every other platform already uses removes a long-standing source of flake without affecting anything else. Happy to add a code comment explaining the timing if that helps.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[DocDB] VerifyPgClientServiceCleanupQueue fails on macOS

5 participants