Commit 78b4077

docs: add root cause analysis for rerunfailures timeout
server.py sets socket.setdefaulttimeout(5), which propagates to accepted connections in pytest-rerunfailures' crash recovery server. The listening socket overrides this with setblocking(1), but accept() creates new sockets that inherit the 5s global default, causing TimeoutError on recv(1).
1 parent de5ae7d commit 78b4077

2 files changed: +12 -2 lines changed


docs/notes.md

Lines changed: 5 additions & 1 deletion
@@ -27,7 +27,11 @@ Scattering imports inside every method of a class (like the `Browser` class in s

 ## pytest-rerunfailures crash recovery under xdist

-The experiment branch was correct. The `run_connection` threads get `TimeoutError: timed out` on `conn.recv(1)`: the timeout is set on the accepted connection socket, not the listening socket (which is why reading the `__init__` code alone was misleading). During heavy xdist startup, workers take too long to send data, the connection threads die, and crash recovery breaks. Setting `HAS_PYTEST_HANDLECRASHITEM = False` disables crash recovery mode. Normal `--reruns` still works (each worker retries locally).
+**Root cause:** `sentry/conf/server.py:40` sets `socket.setdefaulttimeout(5)`. The `pytest-rerunfailures` `ServerStatusDB` creates a listening socket with `setblocking(1)` (no timeout), but `socket.accept()` returns a *new* socket whose timeout comes from `socket.getdefaulttimeout()` (5s), not from the listening socket. Each `run_connection` thread blocks on `conn.recv(1)` and dies after 5s of inactivity.
+
+Verified empirically: `socket.setdefaulttimeout(5)` → listening socket `.gettimeout()` is `None` (overridden by `setblocking(1)`) → accepted socket `.gettimeout()` is `5.0` (from global default).
+
+This is a subtle interaction: `pytest-rerunfailures` assumes sockets block indefinitely, but Sentry's global timeout silently changes that contract on accepted connections only.

 ## Why hash-based sharding beats algorithmic LPT

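The empirical check described in the `docs/notes.md` hunk can be reproduced with the standard library alone. The following is a minimal sketch, not code from the commit or from `pytest-rerunfailures`: the helper name `accepted_timeouts` and the loopback setup are assumptions made for illustration.

```python
import socket


def accepted_timeouts(default_timeout):
    """Return (listener_timeout, accepted_timeout) under a given global default.

    Hypothetical helper: it mimics a server that calls setblocking(1) on its
    listening socket while a process-wide default timeout is in effect.
    """
    previous = socket.getdefaulttimeout()
    socket.setdefaulttimeout(default_timeout)
    try:
        listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        listener.setblocking(True)  # same effect as setblocking(1): timeout becomes None
        listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
        listener.listen(1)
        client = socket.create_connection(listener.getsockname())
        conn, _ = listener.accept()  # a new socket object is created here
        try:
            return listener.gettimeout(), conn.gettimeout()
        finally:
            conn.close()
            client.close()
            listener.close()
    finally:
        # Restore the global default so the check has no side effects.
        socket.setdefaulttimeout(previous)


if __name__ == "__main__":
    print(accepted_timeouts(5))
    print(accepted_timeouts(None))
```

With a global default of 5 seconds the listening socket reports no timeout while the accepted socket reports `5.0`, which matches the observation recorded in the notes.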

docs/tiered-xdist-changes.md

Lines changed: 7 additions & 1 deletion
@@ -77,7 +77,13 @@ Region name RNG is seeded with `PYTEST_XDIST_TESTRUNUID` so all workers generate

 **Modified:** `tests/conftest.py`

-Sets `pytest_rerunfailures.HAS_PYTEST_HANDLECRASHITEM = False`. The socket-based crash recovery threads get `TimeoutError` on `recv(1)` during heavy xdist startup. Normal `--reruns` still works.
+Sets `pytest_rerunfailures.HAS_PYTEST_HANDLECRASHITEM = False`.
+
+**Root cause:** `sentry/conf/server.py` line 40 sets `socket.setdefaulttimeout(5)`, a global default for all newly created sockets. When `pytest-rerunfailures` creates its crash recovery server, it calls `setblocking(1)` on the listening socket (overriding the default to no timeout). But `socket.accept()` creates a *new* socket for each accepted connection, and Python docs specify that new sockets inherit their timeout from `socket.getdefaulttimeout()`, not from the listening socket. So each accepted connection has a 5-second timeout.
+
+Under xdist, the server spawns a `run_connection` thread per worker. Each thread blocks on `conn.recv(1)` waiting for the next message. If no message arrives within 5 seconds (e.g., during heavy Django/plugin initialization or between test batches), `recv` raises `TimeoutError`, the thread dies, and crash recovery for that worker is lost. With 3 workers, all 3 threads die during startup, producing the `Exception in thread Thread-N (run_connection)` errors.
+
+Normal `--reruns` is unaffected: each worker retries failed tests locally via `StatusDB` (in-memory, no sockets). Only segfault crash recovery (reassigning a dead worker's test to another worker) is lost, which is a rare edge case.

 ### 2g. Snowflake test fix

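The failure mode described in the `docs/tiered-xdist-changes.md` hunk (an idle peer causing `recv(1)` to time out on the accepted connection) can be sketched in isolation. This is an illustrative reproduction under stated assumptions, not the `pytest-rerunfailures` implementation: `recv_after_idle` is a hypothetical helper, the timings are shortened from 5 seconds to fractions of a second, and the `reset_timeout` branch shows one possible alternative (clearing the timeout on the accepted socket) rather than what the commit does, which is to disable crash recovery.

```python
import socket
import threading


def recv_after_idle(default_timeout, send_delay, reset_timeout):
    """Wait for one byte from a peer that stays silent for send_delay seconds.

    Hypothetical helper. Returns the received byte, or the string "timed out"
    if the accepted socket's inherited timeout fires first.
    """
    previous = socket.getdefaulttimeout()
    socket.setdefaulttimeout(default_timeout)
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
            listener.setblocking(True)  # the listener itself never times out
            listener.bind(("127.0.0.1", 0))
            listener.listen(1)
            with socket.create_connection(listener.getsockname()) as client:
                conn, _ = listener.accept()  # inherits the global default timeout
                with conn:
                    if reset_timeout:
                        conn.settimeout(None)  # restore fully blocking behaviour
                    # Simulate a worker that is busy starting up before it sends.
                    timer = threading.Timer(send_delay, client.sendall, args=(b"x",))
                    timer.start()
                    try:
                        return conn.recv(1)
                    except socket.timeout:  # alias of TimeoutError on Python 3.10+
                        return "timed out"
                    finally:
                        timer.join()  # let the delayed send finish before closing
    finally:
        socket.setdefaulttimeout(previous)


if __name__ == "__main__":
    print(recv_after_idle(0.2, 0.5, reset_timeout=False))
    print(recv_after_idle(0.2, 0.5, reset_timeout=True))
```

When the peer stays silent longer than the inherited timeout, the first call gives up, while the second call waits and receives the byte. In a server thread with no exception handling around `recv`, the first case is what ends the thread.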
