Skip to content

Attempt to fix flaky test#1004

Closed
allenss-amazon wants to merge 3 commits into
valkey-io:mainfrom
allenss-amazon:shutdown-timeout-leak
Closed

Attempt to fix flaky test#1004
allenss-amazon wants to merge 3 commits into
valkey-io:mainfrom
allenss-amazon:shutdown-timeout-leak

Conversation

@allenss-amazon

Copy link
Copy Markdown
Member

Leaks caused by race in cancellation logic. Background thread hadn't completed after timeout generated on foreground, leading to intermittent leaks.

Signed-off-by: Allen Samuels <allenss@amazon.com>
@greptile-apps

greptile-apps Bot commented May 11, 2026

Copy link
Copy Markdown

Greptile Summary

  • Adds a post-pausepoint cleanup block to test_timeoutCMD: drops both HNSW and FLAT indexes, issues a ping(), and sleeps 500 ms to let the reader-pool worker's queued ResolveContent continuation drain from the main-thread event loop before test teardown, eliminating the intermittent LSan leak.
  • The fix is targeted and well-commented; the only concern is the hardcoded time.sleep(0.5) which could still race on a heavily loaded CI runner — a condition-based wait or a longer duration would be more robust.

Confidence Score: 4/5

Safe to merge; only a test file is modified and the race it targets is clearly explained.

Single test-file change with a clear, well-documented rationale. The only finding is P2: the hardcoded 0.5 s sleep could theoretically still be racy on slow CI, but it is far better than the status quo (no wait at all). No production code is touched.

integration/test_cancel.py — specifically the hardcoded sleep duration on line 287.

Important Files Changed

Filename Overview
integration/test_cancel.py Adds post-pausepoint cleanup (index drops + ping + 0.5 s sleep) at the end of test_timeoutCMD to drain the background worker's queued continuation before teardown, fixing an LSan leak race; the hardcoded sleep duration is the only concern.

Sequence Diagram

sequenceDiagram
    participant Test as Test Thread
    participant Main as Main Thread (event loop)
    participant Worker as Reader-Pool Worker

    Test->>Main: FT._DEBUG PAUSEPOINT SET Cancel
    Test->>Main: FT.SEARCH (returns timeout error)
    Main-->>Test: ResponseError: timeout
    Test->>Main: "wait_for_true(PAUSEPOINT TEST Cancel > 0)"
    Worker-->>Main: hits Cancel pausepoint
    Main-->>Test: pausepoint hit confirmed
    Test->>Main: FT._DEBUG PAUSEPOINT RESET Cancel
    Main-->>Worker: resume from pausepoint
    Worker->>Main: RunByMain(ResolveContent continuation)
    Note over Worker,Main: Worker dispatches AnyInvocable to event loop (SearchParameters / HNSW index still pinned)
    Test->>Main: hnsw_index.drop() + flat_index.drop()
    Main-->>Test: drops acknowledged
    Test->>Main: ping()
    Main->>Main: drains queued continuation
    Main-->>Test: PONG
    Test->>Test: time.sleep(0.5) [extra buffer]
    Note over Test: Test teardown - no leaked continuations
Loading

Fix All in Claude Code

Reviews (1): Last reviewed commit: "Merge branch 'main' into shutdown-timeou..." | Re-trigger Greptile

hnsw_index.drop(client)
flat_index.drop(client)
client.ping()
time.sleep(0.5)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Hardcoded sleep may still be fragile on slow CI

time.sleep(0.5) is a fixed-duration guard against the background worker racing ahead of test teardown. The comment explains that ping() gives the event loop one iteration, but the background thread could dispatch its ResolveContent continuation after the ping response is sent (e.g., if the thread is preempted between the pausepoint release and the RunByMain post). On a loaded CI runner, 500 ms may not be enough. A condition-based wait (e.g., polling an LSan-visible counter or using waiters.wait_for_true) would be more reliable, even if it requires a new debug counter. If no observable state is exposed, consider at least making the duration configurable via an env var or bumping it to 1–2 s.

Fix in Claude Code

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 Should we swap the time.sleep with a wait for X metric? Or is it better to use time.sleep with a higher value?

@allenss-amazon

Copy link
Copy Markdown
Member Author

Claude says that PR #1029 is a better fix for this problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants