Attempt to fix flaky test by allenss-amazon · Pull Request #1004 · valkey-io/valkey-search

allenss-amazon · 2026-05-01T00:01:51Z

The leaks are caused by a race in the cancellation logic: the background thread had not yet completed when the foreground generated the timeout, which led to intermittent leaks.

Signed-off-by: Allen Samuels <allenss@amazon.com>

greptile-apps · 2026-05-11T18:31:04Z

Greptile Summary

Adds a post-pausepoint cleanup block to test_timeoutCMD: drops both HNSW and FLAT indexes, issues a ping(), and sleeps 500 ms to let the reader-pool worker's queued ResolveContent continuation drain from the main-thread event loop before test teardown, eliminating the intermittent LSan leak.
The fix is targeted and well-commented; the only concern is the hardcoded time.sleep(0.5), which could still race on a heavily loaded CI runner. A condition-based wait or a longer duration would be more robust.
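
One way to read the "longer duration" suggestion is to keep the sleep but make its length tunable per environment. The following is a minimal sketch under assumptions: the environment variable name `TEST_CANCEL_SETTLE_SECONDS` and the helpers `settle_seconds` and `settle` are hypothetical and do not appear in the PR; only the 0.5 s default comes from the change under review.

```python
import math
import os
import time

# Hypothetical variable name; nothing in the PR defines it.
SETTLE_ENV_VAR = "TEST_CANCEL_SETTLE_SECONDS"
# The value hardcoded in the PR.
DEFAULT_SETTLE_SECONDS = 0.5


def settle_seconds(environ=os.environ) -> float:
    """Return the post-cleanup wait in seconds.

    Falls back to the default when the variable is unset, unparsable,
    negative, or not finite.
    """
    raw = environ.get(SETTLE_ENV_VAR)
    if raw is None:
        return DEFAULT_SETTLE_SECONDS
    try:
        value = float(raw)
    except ValueError:
        return DEFAULT_SETTLE_SECONDS
    if math.isfinite(value) and value >= 0:
        return value
    return DEFAULT_SETTLE_SECONDS


def settle(environ=os.environ) -> float:
    """Sleep for the configured duration and return it."""
    duration = settle_seconds(environ)
    time.sleep(duration)
    return duration
```

A slow CI runner could then export a larger value without editing the test. This only widens the window; it does not remove the race.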

Confidence Score: 4/5

Safe to merge; only a test file is modified and the race it targets is clearly explained.

Single test-file change with a clear, well-documented rationale. The only finding is P2: the hardcoded 0.5 s sleep could theoretically still be racy on slow CI, but it is far better than the status quo (no wait at all). No production code is touched.

integration/test_cancel.py — specifically the hardcoded sleep duration on line 287.

Important Files Changed

| Filename | Overview |
| --- | --- |
| integration/test_cancel.py | Adds post-pausepoint cleanup (index drops + ping + 0.5 s sleep) at the end of `test_timeoutCMD` to drain the background worker's queued continuation before teardown, fixing an LSan leak race; the hardcoded sleep duration is the only concern. |

Sequence Diagram

sequenceDiagram
    participant Test as Test Thread
    participant Main as Main Thread (event loop)
    participant Worker as Reader-Pool Worker

    Test->>Main: FT._DEBUG PAUSEPOINT SET Cancel
    Test->>Main: FT.SEARCH (returns timeout error)
    Main-->>Test: ResponseError: timeout
    Test->>Main: "wait_for_true(PAUSEPOINT TEST Cancel > 0)"
    Worker-->>Main: hits Cancel pausepoint
    Main-->>Test: pausepoint hit confirmed
    Test->>Main: FT._DEBUG PAUSEPOINT RESET Cancel
    Main-->>Worker: resume from pausepoint
    Worker->>Main: RunByMain(ResolveContent continuation)
    Note over Worker,Main: Worker dispatches AnyInvocable to event loop (SearchParameters / HNSW index still pinned)
    Test->>Main: hnsw_index.drop() + flat_index.drop()
    Main-->>Test: drops acknowledged
    Test->>Main: ping()
    Main->>Main: drains queued continuation
    Main-->>Test: PONG
    Test->>Test: time.sleep(0.5) [extra buffer]
    Note over Test: Test teardown - no leaked continuations
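
The ordering in the diagram can be modeled with plain Python threads to show why a single event-loop iteration may not be enough. This is a toy simulation, not valkey-search code: the queue stands in for the main-thread event loop, `drain_once` for the iteration a ping() triggers, and the `may_post` event is an artificial hook that holds the worker between the pausepoint release and its post, mimicking a preempted thread. All names are illustrative assumptions.

```python
import queue
import threading


def drain_once(main_queue):
    """One simulated event-loop iteration: run whatever is queued right now."""
    count = 0
    while True:
        try:
            task = main_queue.get_nowait()
        except queue.Empty:
            return count
        task()
        count += 1


def run_demo():
    main_queue = queue.Queue()      # stands in for the main-thread event loop
    pausepoint = threading.Event()  # set by the simulated PAUSEPOINT RESET
    resumed = threading.Event()     # worker has left the pausepoint
    may_post = threading.Event()    # artificial hook simulating preemption
    posted = threading.Event()      # worker has queued its continuation
    ran = []

    def worker():
        pausepoint.wait()
        resumed.set()
        may_post.wait()  # "preempted" before posting to the main thread
        main_queue.put(lambda: ran.append("ResolveContent"))
        posted.set()

    thread = threading.Thread(target=worker)
    thread.start()

    pausepoint.set()                # release the pausepoint
    resumed.wait(timeout=5)
    early = drain_once(main_queue)  # the "ping": nothing is queued yet

    may_post.set()                  # worker finally posts
    posted.wait(timeout=5)
    late = drain_once(main_queue)   # a later iteration runs the continuation

    thread.join()
    return early, late, ran


if __name__ == "__main__":
    print(run_demo())
```

In this model the early drain finds nothing because the worker has not posted yet, which is the window the fixed sleep is meant to cover.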

Reviews (1): Last reviewed commit: "Merge branch 'main' into shutdown-timeou..."

greptile-apps · 2026-05-11T18:31:08Z

+        hnsw_index.drop(client)
+        flat_index.drop(client)
+        client.ping()
+        time.sleep(0.5)


Hardcoded sleep may still be fragile on slow CI

time.sleep(0.5) is a fixed-duration guard against the background worker racing ahead of test teardown. The comment explains that ping() gives the event loop one iteration, but the background thread could dispatch its ResolveContent continuation after the ping response is sent (e.g., if the thread is preempted between the pausepoint release and the RunByMain post). On a loaded CI runner, 500 ms may not be enough. A condition-based wait (e.g., polling an LSan-visible counter or using waiters.wait_for_true) would be more reliable, even if it requires a new debug counter. If no observable state is exposed, consider at least making the duration configurable via an env var or bumping it to 1–2 s.
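
A condition-based wait of the kind suggested here could look roughly like the sketch below. It is a generic polling helper written for illustration; it is not the repository's `waiters.wait_for_true`, whose signature is not shown in this thread. The `FakeCounter` class is a stand-in for the debug counter the comment says might have to be added; no such counter is known to exist.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.05):
    """Poll predicate() until it returns True or the timeout expires.

    Returns True if the condition was met, False if the deadline passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


class FakeCounter:
    """Stand-in for a hypothetical 'pending continuations' debug counter."""

    def __init__(self, pending):
        self.pending = pending

    def read(self):
        # Each read reports the current value, then simulates one
        # continuation finishing on the main thread.
        value = self.pending
        self.pending = max(0, self.pending - 1)
        return value


if __name__ == "__main__":
    counter = FakeCounter(3)
    drained = wait_until(lambda: counter.read() == 0, timeout=2.0, interval=0.01)
    print("drained:", drained)
```

Unlike a fixed sleep, this returns as soon as the condition holds and fails explicitly when it never does, so a slow runner costs time rather than correctness.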

KarthikSubbarao · May 12, 2026

+1. Should we replace the time.sleep with a wait on some metric, or is it better to keep time.sleep with a higher value?

allenss-amazon · 2026-05-12T22:19:03Z

Claude says that PR #1029 is a better fix for this problem.

Attempt to fix flaky test

fd8db3f

Signed-off-by: Allen Samuels <allenss@amazon.com>

KarthikSubbarao added the test label May 1, 2026

github-actions Bot assigned allenss-amazon May 1, 2026

allenss-amazon added 2 commits May 1, 2026 07:29

Merge branch 'main' into shutdown-timeout-leak

368c712

Merge branch 'main' into shutdown-timeout-leak

ac49c2c

allenss-amazon closed this May 2, 2026

allenss-amazon reopened this May 2, 2026

greptile-apps Bot reviewed May 11, 2026

allenss-amazon closed this May 12, 2026

allenss-amazon wants to merge 3 commits into valkey-io:main from allenss-amazon:shutdown-timeout-leak
