docs: add root cause analysis for rerunfailures timeout

mchen-sentry · mchen-sentry · commit 78b40772f296 · 2026-02-26T15:07:18.000-08:00
server.py sets socket.setdefaulttimeout(5), which propagates to accepted
connections in pytest-rerunfailures' crash recovery server. The listening
socket overrides this with setblocking(1), but accept() creates new sockets
that inherit the 5s global default, causing TimeoutError on recv(1).
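
The inheritance described above can be reproduced outside Sentry with the
standard library alone. This is a minimal sketch, not code from
pytest-rerunfailures or Sentry: the helper name `accepted_timeouts` and the
loopback address are assumptions made for illustration.

```python
import socket


def accepted_timeouts(default_timeout=5.0):
    """Return (listener timeout, accepted-connection timeout).

    Hypothetical helper: mimics a global setdefaulttimeout() call followed by
    a listening socket that is explicitly switched to blocking mode.
    """
    previous = socket.getdefaulttimeout()
    socket.setdefaulttimeout(default_timeout)  # global default, as in server.py
    try:
        listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        listener.setblocking(1)  # listener itself now has no timeout
        listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        listener.listen(1)

        # Connect first so accept() returns immediately from the backlog.
        client = socket.create_connection(listener.getsockname())
        conn, _ = listener.accept()  # new socket object, created by accept()
        try:
            return listener.gettimeout(), conn.gettimeout()
        finally:
            conn.close()
            client.close()
            listener.close()
    finally:
        socket.setdefaulttimeout(previous)  # restore the process-wide default


if __name__ == "__main__":
    listener_timeout, conn_timeout = accepted_timeouts(5.0)
    print("listener:", listener_timeout)
    print("accepted:", conn_timeout)
```

The listener reports no timeout while the accepted connection reports the
global default, which matches the empirical check recorded in docs/notes.md.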
diff --git a/docs/notes.md b/docs/notes.md
@@ -27,7 +27,11 @@ Scattering imports inside every method of a class (like the `Browser` class in s
 
 ## pytest-rerunfailures crash recovery under xdist
 
-The experiment branch was correct. The `run_connection` threads get `TimeoutError: timed out` on `conn.recv(1)` — the timeout is set on the accepted connection socket, not the listening socket (which is why reading the `__init__` code alone was misleading). During heavy xdist startup, workers take too long to send data, the connection threads die, and crash recovery breaks. Setting `HAS_PYTEST_HANDLECRASHITEM = False` disables crash recovery mode. Normal `--reruns` still works (each worker retries locally).
+**Root cause:** `sentry/conf/server.py:40` sets `socket.setdefaulttimeout(5)`. The `pytest-rerunfailures` `ServerStatusDB` creates a listening socket with `setblocking(1)` (no timeout), but `socket.accept()` returns a *new* socket whose timeout comes from `socket.getdefaulttimeout()` (5s), not from the listening socket. Each `run_connection` thread blocks on `conn.recv(1)` and dies after 5s of inactivity.
+
+Verified empirically: `socket.setdefaulttimeout(5)` → listening socket `.gettimeout()` is `None` (overridden by `setblocking(1)`) → accepted socket `.gettimeout()` is `5.0` (from global default).
+
+This is a subtle interaction: `pytest-rerunfailures` assumes sockets block indefinitely, but Sentry's global timeout silently changes that contract on accepted connections only.
 
 ## Why hash-based sharding beats algorithmic LPT
 
diff --git a/docs/tiered-xdist-changes.md b/docs/tiered-xdist-changes.md
@@ -77,7 +77,13 @@ Region name RNG is seeded with `PYTEST_XDIST_TESTRUNUID` so all workers generate
 
 **Modified:** `tests/conftest.py`
 
-Sets `pytest_rerunfailures.HAS_PYTEST_HANDLECRASHITEM = False`. The socket-based crash recovery threads get `TimeoutError` on `recv(1)` during heavy xdist startup. Normal `--reruns` still works.
+Sets `pytest_rerunfailures.HAS_PYTEST_HANDLECRASHITEM = False`.
+
+**Root cause:** `sentry/conf/server.py` line 40 sets `socket.setdefaulttimeout(5)`, a global default for all newly created sockets. When `pytest-rerunfailures` creates its crash recovery server, it calls `setblocking(1)` on the listening socket, which overrides the default and leaves that socket with no timeout. But `socket.accept()` creates a *new* socket for each accepted connection, and the Python docs specify that when a default timeout is set, accepted sockets inherit it from `socket.getdefaulttimeout()`, not from the listening socket. So each accepted connection has a 5-second timeout.
+
+Under xdist, the server spawns a `run_connection` thread per worker. Each thread blocks on `conn.recv(1)` waiting for the next message. If no message arrives within 5 seconds (e.g., during heavy Django/plugin initialization or between test batches), `recv` raises `TimeoutError`, the thread dies, and crash recovery for that worker is lost. With 3 workers, all 3 threads die during startup, producing the `Exception in thread Thread-N (run_connection)` errors.
+
+Normal `--reruns` is unaffected: each worker retries failed tests locally via `StatusDB` (in-memory, no sockets). Only segfault crash recovery (reassigning a dead worker's test to another worker) is lost, which is a rare edge case.
 
 ### 2g. Snowflake test fix