fix(test): make test_schedstat_fd_closed_on_thread_exit not flaky #398

Merged: rcoh merged 2 commits into main from fix-schedstat-fd-test-flake on May 13, 2026

@rcoh rcoh commented May 13, 2026

Summary

test_schedstat_fd_closed_on_thread_exit (added in 5e905e8) was flaky under the parallel test suite — failing roughly 3/10 runs of cargo test -p dial9-tokio-telemetry --lib, but passing every time when run in isolation.

Root cause. The test spawned a thread, called SchedStat::read_current() (which opens /proc/self/task/<tid>/schedstat and stores the OwnedFd in a thread-local), captured the raw fd number, joined, and then asserted fcntl(fd, F_GETFD) != -1. The kernel readily recycles fd numbers, and other tests in the same binary open many files concurrently — including their own per-thread schedstat fds opened from tokio worker poll callbacks (runtime_context.rs). If a sibling test opened any file in the small window after our spawned thread's OwnedFd was dropped, the fd number got reused and fcntl(F_GETFD) returned success on a totally unrelated fd, falsely reporting a leak.
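The recycling described above can be reproduced without any concurrency. The following is a minimal, Linux-only sketch, not the code from this PR: `fd_number_in_use` and `close_then_reopen` are hypothetical helper names, and the `/proc/self/fd/<fd>` lookup stands in for `fcntl(fd, F_GETFD)` so the example needs only the standard library. Both checks answer the same question, namely whether some open file occupies that fd number, and neither says which file it is.

```rust
use std::fs::{self, File};
use std::io;
use std::os::unix::io::{AsRawFd, RawFd};

/// Liveness-only check, analogous to `fcntl(fd, F_GETFD) != -1`:
/// true if *some* open file currently occupies this fd number.
/// (Hypothetical helper; assumes a Linux /proc filesystem.)
fn fd_number_in_use(fd: RawFd) -> bool {
    fs::read_link(format!("/proc/self/fd/{fd}")).is_ok()
}

/// Opens `first`, closes it, then opens `second`.
/// Returns the fd number `first` used, plus the still-open second file.
fn close_then_reopen(first: &str, second: &str) -> io::Result<(RawFd, File)> {
    let a = File::open(first)?;
    let old_fd = a.as_raw_fd();
    drop(a); // closes the fd; the kernel is now free to hand the number out again
    let b = File::open(second)?;
    Ok((old_fd, b))
}
```

If the second open happens to receive the old number (the kernel hands out the lowest free one, so this is common when nothing else opens files in between), `fd_number_in_use(old_fd)` reports `true` even though the first file was closed correctly. In the parallel suite the "second open" comes from a sibling test, which is why the failure was intermittent.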

Fix. Inside the spawned thread, also record what the fd points at via readlink("/proc/self/fd/<fd>"). After join, check whether the fd is either closed (readlink fails) or now points at a different file (slot was reused for an unrelated open — which itself proves our OwnedFd was dropped). The captured path /proc/self/task/<spawned_tid>/schedstat is unique to the spawned thread's open: each thread opens schedstat for its own tid, so no concurrent test can possibly hold an open fd for that exact path. If the fd is still open AND still points at it, that's a real leak — the assertion still fires.
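The hardened check described above can be sketched as follows. This is an illustrative, Linux-only version using only the standard library; `fd_target`, `capture`, and `is_leaked` are hypothetical names and are not taken from the PR's diff.

```rust
use std::fs::{self, File};
use std::os::unix::io::{AsRawFd, RawFd};
use std::path::{Path, PathBuf};

/// What /proc/self/fd/<fd> currently resolves to, or None if the fd is closed.
fn fd_target(fd: RawFd) -> Option<PathBuf> {
    fs::read_link(format!("/proc/self/fd/{fd}")).ok()
}

/// Records the fd number and what it points at while the file is still open.
fn capture(file: &File) -> (RawFd, Option<PathBuf>) {
    let fd = file.as_raw_fd();
    (fd, fd_target(fd))
}

/// A leak is reported only if the fd is still open AND still points at the
/// path captured earlier. A closed fd, or a number reused for a different
/// file, both mean the original handle was dropped.
fn is_leaked(fd: RawFd, original: &Path) -> bool {
    match fd_target(fd) {
        None => false,                           // readlink failed: fd is closed
        Some(now) => now.as_path() == original,  // same target: handle never closed
    }
}
```

The comparison is only as strong as the uniqueness of the captured path. In the PR that uniqueness comes from the spawned thread's tid being part of the schedstat path; a check like this against a path that other threads may also open could still report a false leak.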

The schedstat fd-cleanup invariant is unchanged; only the way the test observes it is hardened.

Test plan

  • Reproduce the flake on main: 3/10 full-suite runs failed.
  • Apply the fix and run the full suite 13 times in a row — all 228 tests pass every time.
  • Verify the test still catches a real leak by temporarily replacing *cell.borrow_mut() = Some(owned) with std::mem::forget(owned) in the production path: test fails with schedstat fd 3 leaked after thread exit (still points at "/proc/.../task/.../schedstat"). Reverted before commit.
  • cargo clippy -p dial9-tokio-telemetry --lib --tests clean.
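The leak-injection step in the test plan can be illustrated with a small stand-alone sketch. It is an assumption-laden model rather than the production code: `CELL` is a hypothetical thread-local standing in for the one that stores the `OwnedFd`, a plain `File` is used in place of the schedstat handle, and `open_on_thread` and `still_points_at` are invented names. It assumes Linux.

```rust
use std::cell::RefCell;
use std::fs::{self, File};
use std::mem;
use std::os::unix::io::{AsRawFd, RawFd};
use std::path::{Path, PathBuf};
use std::thread;

thread_local! {
    // Hypothetical stand-in for the production thread-local that owns the fd.
    static CELL: RefCell<Option<File>> = RefCell::new(None);
}

/// Opens `path` on a fresh thread and either stores the handle in the
/// thread-local (dropped when the thread exits) or forgets it (leaked).
/// Returns the fd number and the target captured inside the thread.
fn open_on_thread(path: PathBuf, leak: bool) -> (RawFd, PathBuf) {
    thread::spawn(move || {
        let file = File::open(&path).expect("open failed");
        let fd = file.as_raw_fd();
        let target = fs::read_link(format!("/proc/self/fd/{fd}")).expect("readlink failed");
        if leak {
            mem::forget(file); // destructor never runs, so the fd stays open
        } else {
            CELL.with(|c| *c.borrow_mut() = Some(file));
        }
        (fd, target)
    })
    .join()
    .expect("thread panicked")
}

/// True if `fd` is still open and still resolves to `target`.
fn still_points_at(fd: RawFd, target: &Path) -> bool {
    fs::read_link(format!("/proc/self/fd/{fd}"))
        .map(|now| now.as_path() == target)
        .unwrap_or(false)
}
```

With `leak = false`, the thread-local's destructor closes the file at thread exit, so `still_points_at` returns `false` after the join. With `leak = true`, the fd survives the thread and the check returns `true`, which mirrors the failure the injected `std::mem::forget` is expected to produce.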

The previous assertion checked fcntl(fd, F_GETFD) != -1 after the
spawned thread exited. Linux readily recycles fd numbers, and other
tests in the same suite open many files concurrently (including their
own per-thread schedstat fds via tokio worker poll callbacks). When a
sibling test opened a file in the small window after our spawned thread
closed its schedstat fd, the fd number got reused and the check
incorrectly reported a leak.

Capture the path the fd points to via /proc/self/fd/<fd> inside the
spawned thread, and after join verify that the fd is either closed or
now points at a different file. Each thread opens schedstat for its own
tid, so /proc/self/task/<spawned_tid>/schedstat is unique to the
spawned thread's open and no concurrent test can collide on it.

Verified by running the full cargo test -p dial9-tokio-telemetry --lib
suite 13 times in a row with zero failures (previously failed roughly
3/10 runs), and confirmed the new test still catches a real leak by
temporarily mem::forget'ing the OwnedFd in the production path.
@rcoh rcoh force-pushed the fix-schedstat-fd-test-flake branch from 7c0b3e5 to c1de8ce Compare May 13, 2026 01:00
@rcoh rcoh added this pull request to the merge queue May 13, 2026
Merged via the queue into main with commit a62a184 May 13, 2026
20 checks passed
