[BACKPORT 2025.1][#31870] DocDB: PgResponseCache entries become invalid after the originating RPC's deadline#32070
Conversation
…me invalid after the originating RPC's deadline Summary: `PgResponseCache::Data::IsValid` unconditionally checked `now < readiness_deadline_`, where `readiness_deadline_` is set once at entry creation from the originating RPC's wall-clock deadline (the YSQL client RW timeout, ~10 min by default). This caused every cache entry -- even one with a successful, populated, version-current response -- to become "invalid" once wall-clock crossed that deadline, forcing a full rebuild every few minutes. The rebuilt RPC's response carries a fresh `catalog_read_time` that propagates into subsequent paged RPCs' cache keys, so one expiry cascades into ~5 misses per preload sequence on every new connection's prefetch. This was a regression introduced by yugabyte#27412 (commit f960a4a, "Async wait on pg response cache"). Pre-refactor, the deadline was only consulted while the response was still in-flight: return version == version_ && (IsReady(future_) ? IsOk(future_.get()) : now < readiness_deadline_); Restore the pre-refactor semantics: the deadline remains a watchdog on a stalled fetch, but a populated successful response stays valid until catalog-version invalidation or LRU eviction supersedes it. The same change was reverted on the 2024.2.4 backport branch (commit 413142a); this fix restores correct behavior on master. Original commit: 36b344f / D53645 Test Plan: New test: ./yb_build.sh release --cxx-test pg_catalog_perf-test \ --gtest_filter PgCatalogPerfTest.ResponseCacheValidPastReadinessDeadline Regression check (all response-cache tests): ./yb_build.sh release --cxx-test pg_catalog_perf-test \ --gtest_filter '*ResponseCache*' Verified locally that: - The new test fails before the IsValid fix (cache.hits = 0 instead of 5) when the warm-up entry's readiness_deadline has elapsed. - The new test passes with the fix. - All 10 existing *ResponseCache* tests still pass. Reviewers: dmitry Reviewed By: dmitry Subscribers: svc_phabricator, yql, ybase Differential Revision: https://phorge.dev.yugabyte.com/D53645
There was a problem hiding this comment.
Code Review
This pull request refactors the response cache in pg_response_cache.cc by replacing the failed_ boolean flag with a DataState enum (kNotReady, kReadyOK, kReadyFailure). This allows the cache to remain valid and usable even after the readiness deadline has passed, provided the data has been successfully loaded (kReadyOK). A new integration test ResponseCacheValidPastReadinessDeadline has been added to verify this behavior under short RPC deadlines. I have no feedback to provide as the changes look correct and are well-tested.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
Trigger Jenkins automated · Claude Code (Fable 5) |
|
Jenkins build has been triggered. Results will be available in the CI checks. CSI |
DO NOT MERGE!! CSI ⏳
Summary:
PgResponseCache::Data::IsValidunconditionally checkednow < readiness_deadline_, wherereadiness_deadline_is set once atentry creation from the originating RPC's wall-clock deadline (the YSQL
client RW timeout, ~10 min by default). This caused every cache entry --
even one with a successful, populated, version-current response --
to become "invalid" once wall-clock crossed that deadline, forcing a
full rebuild every few minutes. The rebuilt RPC's response carries a fresh
catalog_read_timethat propagates into subsequent paged RPCs' cachekeys, so one expiry cascades into ~5 misses per preload sequence on
every new connection's prefetch.
This was a regression introduced by #27412 (commit f960a4a, "Async
wait on pg response cache"). Pre-refactor, the deadline was only
consulted while the response was still in-flight:
return version == version_ &&
(IsReady(future_) ? IsOk(future_.get()) : now < readiness_deadline_);
Restore the pre-refactor semantics: the deadline remains a watchdog on
a stalled fetch, but a populated successful response stays valid until
catalog-version invalidation or LRU eviction supersedes it. The same
change was reverted on the 2024.2.4 backport branch (commit
413142a); this fix restores correct behavior on master.
Original commit: 36b344f / D53645
Test Plan:
New test:
./yb_build.sh release --cxx-test pg_catalog_perf-test
--gtest_filter PgCatalogPerfTest.ResponseCacheValidPastReadinessDeadline
Regression check (all response-cache tests):
./yb_build.sh release --cxx-test pg_catalog_perf-test
--gtest_filter 'ResponseCache'
Verified locally that:
when the warm-up entry's readiness_deadline has elapsed.
Reviewers: dmitry
Reviewed By: dmitry
Subscribers: svc_phabricator, yql, ybase
Differential Revision: https://phorge.dev.yugabyte.com/D53645