Skip to content

[BACKPORT 2025.1][#31870] DocDB: PgResponseCache entries become invalid after the originating RPC's deadline#32070

Open
kai-franz wants to merge 1 commit into
yugabyte:2025.1from
kai-franz:backport-36b344fe3-2025.1
Open

[BACKPORT 2025.1][#31870] DocDB: PgResponseCache entries become invalid after the originating RPC's deadline#32070
kai-franz wants to merge 1 commit into
yugabyte:2025.1from
kai-franz:backport-36b344fe3-2025.1

Conversation

@kai-franz

@kai-franz kai-franz commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

DO NOT MERGE!! CSI

Summary:
PgResponseCache::Data::IsValid unconditionally checked
now < readiness_deadline_, where readiness_deadline_ is set once at
entry creation from the originating RPC's wall-clock deadline (the YSQL
client RW timeout, ~10 min by default). This caused every cache entry --
even one with a successful, populated, version-current response --
to become "invalid" once wall-clock crossed that deadline, forcing a
full rebuild every few minutes. The rebuilt RPC's response carries a fresh
catalog_read_time that propagates into subsequent paged RPCs' cache
keys, so one expiry cascades into ~5 misses per preload sequence on
every new connection's prefetch.

This was a regression introduced by #27412 (commit f960a4a, "Async
wait on pg response cache"). Pre-refactor, the deadline was only
consulted while the response was still in-flight:

return version == version_ &&
(IsReady(future_) ? IsOk(future_.get()) : now < readiness_deadline_);

Restore the pre-refactor semantics: the deadline remains a watchdog on
a stalled fetch, but a populated successful response stays valid until
catalog-version invalidation or LRU eviction supersedes it. The same
change was reverted on the 2024.2.4 backport branch (commit
413142a); this fix restores correct behavior on master.

Original commit: 36b344f / D53645

Test Plan:
New test:
./yb_build.sh release --cxx-test pg_catalog_perf-test
--gtest_filter PgCatalogPerfTest.ResponseCacheValidPastReadinessDeadline

Regression check (all response-cache tests):
./yb_build.sh release --cxx-test pg_catalog_perf-test
--gtest_filter 'ResponseCache'

Verified locally that:

  • The new test fails before the IsValid fix (cache.hits = 0 instead of 5)
    when the warm-up entry's readiness_deadline has elapsed.
  • The new test passes with the fix.
  • All 10 existing ResponseCache tests still pass.

Reviewers: dmitry

Reviewed By: dmitry

Subscribers: svc_phabricator, yql, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D53645

…me invalid after the originating RPC's deadline

Summary:
`PgResponseCache::Data::IsValid` unconditionally checked
`now < readiness_deadline_`, where `readiness_deadline_` is set once at
entry creation from the originating RPC's wall-clock deadline (the YSQL
client RW timeout, ~10 min by default). This caused every cache entry --
even one with a successful, populated, version-current response --
to become "invalid" once wall-clock crossed that deadline, forcing a
full rebuild every few minutes. The rebuilt RPC's response carries a fresh
`catalog_read_time` that propagates into subsequent paged RPCs' cache
keys, so one expiry cascades into ~5 misses per preload sequence on
every new connection's prefetch.

This was a regression introduced by yugabyte#27412 (commit f960a4a, "Async
wait on pg response cache"). Pre-refactor, the deadline was only
consulted while the response was still in-flight:

  return version == version_ &&
         (IsReady(future_) ? IsOk(future_.get()) : now < readiness_deadline_);

Restore the pre-refactor semantics: the deadline remains a watchdog on
a stalled fetch, but a populated successful response stays valid until
catalog-version invalidation or LRU eviction supersedes it. The same
change was reverted on the 2024.2.4 backport branch (commit
413142a); this fix restores correct behavior on master.

Original commit: 36b344f / D53645

Test Plan:
New test:
  ./yb_build.sh release --cxx-test pg_catalog_perf-test \
    --gtest_filter PgCatalogPerfTest.ResponseCacheValidPastReadinessDeadline

Regression check (all response-cache tests):
  ./yb_build.sh release --cxx-test pg_catalog_perf-test \
    --gtest_filter '*ResponseCache*'

Verified locally that:
  - The new test fails before the IsValid fix (cache.hits = 0 instead of 5)
    when the warm-up entry's readiness_deadline has elapsed.
  - The new test passes with the fix.
  - All 10 existing *ResponseCache* tests still pass.

Reviewers: dmitry

Reviewed By: dmitry

Subscribers: svc_phabricator, yql, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D53645

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the response cache in pg_response_cache.cc by replacing the failed_ boolean flag with a DataState enum (kNotReady, kReadyOK, kReadyFailure). This allows the cache to remain valid and usable even after the readiness deadline has passed, provided the data has been successfully loaded (kReadyOK). A new integration test ResponseCacheValidPastReadinessDeadline has been added to verify this behavior under short RPC deadlines. I have no feedback to provide as the changes look correct and are well-tested.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

@kai-franz

Copy link
Copy Markdown
Contributor Author

Trigger Jenkins


automated · Claude Code (Fable 5)

@yb-agentk-dev

yb-agentk-dev Bot commented Jun 12, 2026

Copy link
Copy Markdown

Jenkins build has been triggered. Results will be available in the CI checks. CSI

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant