Attempt to fix flaky test #1004
Signed-off-by: Allen Samuels <allenss@amazon.com>
Greptile Summary
Confidence Score: 4/5
Safe to merge; only a test file is modified and the race it targets is clearly explained.
Single test-file change with a clear, well-documented rationale. The only finding is P2: the hardcoded 0.5 s sleep could theoretically still be racy on slow CI, but it is far better than the status quo (no wait at all). No production code is touched.
integration/test_cancel.py, specifically the hardcoded sleep duration on line 287.
Sequence Diagram
sequenceDiagram
participant Test as Test Thread
participant Main as Main Thread (event loop)
participant Worker as Reader-Pool Worker
Test->>Main: FT._DEBUG PAUSEPOINT SET Cancel
Test->>Main: FT.SEARCH (returns timeout error)
Main-->>Test: ResponseError: timeout
Test->>Main: "wait_for_true(PAUSEPOINT TEST Cancel > 0)"
Worker-->>Main: hits Cancel pausepoint
Main-->>Test: pausepoint hit confirmed
Test->>Main: FT._DEBUG PAUSEPOINT RESET Cancel
Main-->>Worker: resume from pausepoint
Worker->>Main: RunByMain(ResolveContent continuation)
Note over Worker,Main: Worker dispatches AnyInvocable to event loop (SearchParameters / HNSW index still pinned)
Test->>Main: hnsw_index.drop() + flat_index.drop()
Main-->>Test: drops acknowledged
Test->>Main: ping()
Main->>Main: drains queued continuation
Main-->>Test: PONG
Test->>Test: time.sleep(0.5) [extra buffer]
Note over Test: Test teardown - no leaked continuations
Reviews (1): Last reviewed commit: "Merge branch 'main' into shutdown-timeou..."
hnsw_index.drop(client)
flat_index.drop(client)
client.ping()
time.sleep(0.5)
Hardcoded sleep may still be fragile on slow CI
time.sleep(0.5) is a fixed-duration guard against the background worker racing ahead of test teardown. The comment explains that ping() gives the event loop one iteration, but the background thread could dispatch its ResolveContent continuation after the ping response is sent (e.g., if the thread is preempted between the pausepoint release and the RunByMain post). On a loaded CI runner, 500 ms may not be enough. A condition-based wait (e.g., polling an LSan-visible counter or using waiters.wait_for_true) would be more reliable, even if it requires a new debug counter. If no observable state is exposed, consider at least making the duration configurable via an env var or bumping it to 1–2 s.
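For illustration, a minimal sketch of the two alternatives suggested above: a condition-based wait and an env-var-configurable upper bound. Everything here is an assumption, not the repository's code. `wait_until` is a hypothetical stand-in (the existing `waiters.wait_for_true` helper may have a different signature), `TEST_SETTLE_TIMEOUT_S` is a made-up variable name, and the debug counter that would be polled does not exist yet.

```python
import os
import time


def wait_until(predicate, timeout_s=5.0, interval_s=0.05):
    """Poll `predicate` until it returns True or `timeout_s` elapses.

    Hypothetical helper for illustration only. Returns True if the
    condition was observed, False if the deadline passed first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


def settle_timeout_s(default=2.0):
    """Read the upper bound for the wait from the environment.

    TEST_SETTLE_TIMEOUT_S is an assumed name, not an existing setting.
    """
    return float(os.environ.get("TEST_SETTLE_TIMEOUT_S", default))
```

In the test, the fixed sleep could then become something like `assert wait_until(lambda: pending_continuations(client) == 0, settle_timeout_s())`, where `pending_continuations` is a hypothetical accessor for a new debug counter. The wait returns as soon as the condition holds, so a generous timeout costs nothing on fast machines.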
+1 Should we swap the time.sleep with a wait for X metric? Or is it better to use time.sleep with a higher value?
Claude says that PR #1029 is a better fix for this problem.
Leaks are caused by a race in the cancellation logic. The background thread had not yet completed when the timeout was generated on the foreground, leading to intermittent leaks.
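The shape of this race can be reproduced in a few lines of plain Python. This is a toy model under stated assumptions, not the module's code: the `worker` function stands in for the reader-pool thread, the `threading.Event` stands in for the continuation it eventually completes, and the timings are arbitrary.

```python
import threading
import time


def run_search(timeout_s, work_s):
    """Foreground waits `timeout_s` for a worker that needs `work_s` seconds."""
    done = threading.Event()

    def worker():
        time.sleep(work_s)  # stands in for the background search work
        done.set()          # stands in for the continuation completing

    thread = threading.Thread(target=worker)
    thread.start()
    # The foreground gives up after timeout_s, like the timed-out FT.SEARCH.
    timed_out = not done.wait(timeout_s)
    return timed_out, done, thread


timed_out, done, thread = run_search(timeout_s=0.01, work_s=0.2)

# The foreground has already reported a timeout, but the worker is still
# running. Tearing down at this point is what would leak its state.
still_running = timed_out and not done.is_set()

# Waiting for the worker before teardown closes the window.
thread.join()
finished_after_join = done.is_set()
```

The point of the sketch is that the timeout only ends the foreground's wait. Teardown is safe only after something has confirmed that the background side finished.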