
feat: replace Werkzeug dev server with waitress for high-concurrency adapter #896

Merged
AdamRajfer merged 3 commits into main from
agronskiy/chore/increase-adapter-concurrency
Apr 8, 2026

@agronskiy
Collaborator

Summary

  • Replace Werkzeug's run_simple(threaded=True) with waitress, a production WSGI server with a fixed thread pool and proper connection queuing
  • Raise file descriptor limit (RLIMIT_NOFILE) to 65536 in the adapter subprocess
  • Add env vars ADAPTER_SERVER_THREADS (default 2048) and ADAPTER_SERVER_CONNECTION_LIMIT (default 16384) for tunability
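
The file descriptor change in the second bullet could look roughly like the sketch below. It is not the PR's actual code: the helper name `raise_nofile_limit` is hypothetical, and clamping to the hard limit is an assumption about how an unprivileged subprocess would have to behave.

```python
import resource


def raise_nofile_limit(target: int = 65536) -> int:
    """Raise the soft RLIMIT_NOFILE toward `target` and return the resulting soft limit.

    Hypothetical helper: an unprivileged process cannot exceed the hard limit,
    so the request is clamped to it.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    unlimited = resource.RLIM_INFINITY

    # Never ask for more than the hard limit allows.
    wanted = target if hard == unlimited else min(target, hard)

    # Nothing to do if the soft limit is already high enough.
    if soft == unlimited or soft >= wanted:
        return soft

    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
    except (ValueError, OSError):
        # Some platforms reject large values; keep the current limit.
        pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```

Calling `raise_nofile_limit()` early in the adapter subprocess, before the server starts accepting connections, would keep the change local to that process.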

Context

At 11k+ concurrent connections the Werkzeug dev server causes "connection reset by peer" errors due to:

  1. Socket backlog of 5 — connections beyond that are dropped
  2. Unbounded thread-per-request — tries to spawn 11k OS threads
  3. Default fd limit of 1024 — exhausted by open sockets

Waitress uses a fixed thread pool with channel-based I/O, so excess connections queue instead of being reset. The single-process architecture is preserved, so every threading.Lock() in the interceptors (caching, progress tracking, stats) remains valid.
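
A minimal sketch of how the two env vars might feed waitress, assuming the adapter exposes a WSGI app. The function names `adapter_server_settings` and `run_adapter` are hypothetical; `threads` and `connection_limit` are standard waitress arguments, and the defaults are the ones stated in this PR.

```python
import os


def adapter_server_settings(environ=None) -> dict:
    """Build waitress keyword arguments from the env vars named in this PR."""
    env = os.environ if environ is None else environ
    return {
        # Size of the fixed worker thread pool.
        "threads": int(env.get("ADAPTER_SERVER_THREADS", "2048")),
        # Maximum number of connections waitress keeps open at once.
        "connection_limit": int(env.get("ADAPTER_SERVER_CONNECTION_LIMIT", "16384")),
    }


def run_adapter(app, host: str, port: int) -> None:
    """Serve a WSGI app with waitress (hypothetical entry point)."""
    # Imported lazily so the settings helper works without waitress installed.
    from waitress import serve

    serve(app, host=host, port=port, **adapter_server_settings())
```

Because waitress stays in a single process, module-level locks shared by the interceptors keep working the same way they did under the threaded Werkzeug server.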

Test plan

  • All 21 adapter server unit tests pass
  • Verify at scale (11k concurrency) on cluster

🤖 Generated with Claude Code

…adapter

Werkzeug's `run_simple(threaded=True)` spawns one thread per request
with no upper bound and uses a socket backlog of 5, causing "connection
reset by peer" errors at scale (11k+ concurrent connections). Replace it
with waitress, a production WSGI server that uses a fixed thread pool
(default 2048) with proper connection queuing (default 16k connection limit).

Also raises the file descriptor limit (RLIMIT_NOFILE) to 65536 in the
adapter subprocess to prevent fd exhaustion under high concurrency.

Configurable via env vars:
- ADAPTER_SERVER_THREADS (default: 2048)
- ADAPTER_SERVER_CONNECTION_LIMIT (default: 16384)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Alex Gronskiy <agronskiy@nvidia.com>
@agronskiy agronskiy requested review from a team as code owners April 5, 2026 19:27
@copy-pr-bot

copy-pr-bot Bot commented Apr 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@agronskiy agronskiy self-assigned this Apr 7, 2026
@AdamRajfer
Contributor

/ok to test 9162ca1

@AdamRajfer
Contributor

/ok to test 19bb754

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
@AdamRajfer
Contributor

/ok to test c8ab793

@AdamRajfer AdamRajfer merged commit 3dc2bfc into main Apr 8, 2026
48 checks passed
@AdamRajfer AdamRajfer deleted the agronskiy/chore/increase-adapter-concurrency branch April 8, 2026 16:08
agronskiy added a commit that referenced this pull request Apr 13, 2026
…ainer errors

Fixes two issues introduced in #896:
- select.select() breaks with fds > 1023 after raising RLIMIT_NOFILE
- localhost may resolve to ::1 in IPv6-disabled containers (EADDRNOTAVAIL)

Signed-off-by: Alex Gronskiy <agronskiy@nvidia.com>
agronskiy added a commit that referenced this pull request Apr 13, 2026
Fixes two issues from #896:

- **`filedescriptor out of range in select()`**: `select.select()` can't
handle fds > 1023. After raising RLIMIT_NOFILE to 65536, this is easily
hit. `asyncore_use_poll=True` switches to `select.poll()` which has no
fd limit.
- **`EADDRNOTAVAIL` in containers**: `localhost` resolves to `::1` in
IPv6-disabled environments. `ipv6=False` forces IPv4-only binding.

Reproduced the select() issue in a core-evals container
(`terminal-bench:2026-02-09T10-38-9216d4e3-amd64`): default soft=1024,
hard=524288; after raising and opening 1100 sockets, `select.select()`
fails exactly as reported.
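
The reproduction described above can be approximated with the standard library alone. This is a sketch under assumptions, not the script that was actually run: the function name is hypothetical, it targets Unix-like systems, and it returns `None` when the soft limit cannot be raised far enough or sockets cannot be created.

```python
import resource
import select
import socket


def reproduce_select_limit(n_sockets: int = 1100):
    """Return (select_failed, poll_ok), or None if the environment blocks the test."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    unlimited = resource.RLIM_INFINITY
    wanted = n_sockets + 256
    if hard != unlimited:
        wanted = min(wanted, hard)
    if soft != unlimited and soft < wanted:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
        except (ValueError, OSError):
            return None
        soft = wanted
    if soft != unlimited and soft < n_sockets + 64:
        return None  # not enough headroom to open the sockets

    socks = []
    try:
        try:
            for _ in range(n_sockets):
                socks.append(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
        except OSError:
            return None
        # With more than 1024 descriptors open, the highest one exceeds 1023.
        high = max(socks, key=lambda s: s.fileno())

        try:
            select.select([high], [], [], 0)
            select_failed = False
        except ValueError:
            # "filedescriptor out of range in select()"
            select_failed = True

        # poll() takes a list of descriptors, so it has no FD_SETSIZE ceiling.
        poller = select.poll()
        poller.register(high, select.POLLIN)
        poller.poll(0)
        return select_failed, True
    finally:
        for s in socks:
            s.close()
```

This mirrors why `asyncore_use_poll=True` resolves the error: the descriptor that `select.select()` rejects is accepted by `select.poll()`.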

## Test plan
- [x] 21/21 adapter server unit tests pass

Signed-off-by: Alex Gronskiy <agronskiy@nvidia.com>