Skip to content

Use double‑checked locking to eliminate redundant metadata lookups #5538

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

DomGarguilo
Copy link
Member

Fixes #5535

Swap out the Timer‑based check for a double‑checked‑locking pattern on the lock, and only short‑circuit when the cached tablet actually has a valid location.

Also made the failing IT case more concurrently strenuous to better exercise this new logic.

@DomGarguilo DomGarguilo added this to the 4.0.0 milestone May 6, 2025
@DomGarguilo DomGarguilo self-assigned this May 6, 2025
@dlmarion dlmarion requested a review from keith-turner May 7, 2025 11:49

CachedTablet now = findTabletInCache(row);
if (now != null) {
if (now.getTserverLocation().isPresent() && lcSession.checkLock(now) != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In support of ondemand tablets, tablets w/o a location can be cached in 4.0 (was not the case in 2.1). So these checks that look for the presence of a location do not seem correct.

try (var unused = lookupLocks.lock(ptl.getExtent())) {
// See if entry was added to cache by another thread while we were waiting on the lock
var cached = findTabletInCache(row);
if (cached != null && cached.getCreationTimer().startedAfter(timer)) {
Copy link
Contributor

@keith-turner keith-turner May 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What made you suspect the timer here? I looked into the specifics of the timer and found how its implemented here https://github.com/openjdk/jdk/blob/b6b5ac1ef9042ed62a8358aa6943b8dc87dcf0ab/src/hotspot/os/posix/os_posix.cpp#L1474 . Looking at the docs for that system call it does mention that it will not go backwards but also there is no gaurantee it wil go forward between two calls. So its possible two threads got the exact same nanoTime causing the entry to look stale, that is probably ok. We could make the time comparison >= instead of > if that is the case.

Copy link
Contributor

@keith-turner keith-turner May 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing the > to a >= is iffy and probably not good so please ignore that suggestion. One way we could avoid comparing the time or anything else is to just look for change in the object reference. So maybe could do the following?

   CachedTablet before = findTabletInCache(row);
  try (var unused = lookupLocks.lock(ptl.getExtent())) {
       CachedTablet after = findTabletInCache(row);
       if(after != before && after != null){
           // another thread probably did the lookup while we were waiting on the lock, so let use that
           return;
       }
   

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Broken or Flaky test: ClientTabletCacheImplTest.testMultithreadedLookups
2 participants