
overlapping_query_spike: clients refused with connect timeouts — YSQL Connection Manager listen() backlog (128) appears to overflow under burst load #31558

@yugabyte-ci

Description


During the overlapping_query_spike test (3 client systems × 3 workloads, each firing 10 burst-read threads at 5 s intervals, each thread spawning 333–500 sub-threads running SELECT 1; pg_sleep(20); per connection), clients intermittently fail with PSQLException: Connection to {ip}:5433 refused (caused by java.net.ConnectException: Connection timed out). The evidence points to TCP accept-queue overflow in front of the YSQL Connection Manager: Odyssey calls listen() with backlog=128 per ysql_conn_mgr.conf, and during the spike completed handshakes pile up in the accept queue faster than Odyssey drains them. The kernel then drops (or resets) new connections, which is exactly what TcpExtListenOverflows / TcpExtListenDrops are counting on this host. net.core.somaxconn is 4096, so the kernel would allow a far larger queue; the bottleneck is the backlog value Odyssey passes to listen().
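For context on why tuning net.core.somaxconn alone cannot help here: the kernel clamps each listener's accept-queue limit to min(backlog, somaxconn), so with the values observed on this host the application-side 128 is the binding constraint. A minimal sketch of that arithmetic, using the values from the evidence table:

```shell
# The kernel caps each listener's accept queue at min(backlog, somaxconn).
# Values from this host: Odyssey passes backlog=128, net.core.somaxconn=4096.
backlog=128
somaxconn=4096
effective=$(( backlog < somaxconn ? backlog : somaxconn ))
echo "effective accept-queue limit: $effective"   # -> 128
```

Raising somaxconn further changes nothing in this situation; only a larger backlog argument to listen() moves the limit.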

Evidence (captured a few minutes into the run)
| Setting / counter | Value |
| --- | --- |
| ysql_conn_mgr.conf backlog | 128 |
| net.core.somaxconn | 4096 |
| ss Recv-Q / Send-Q on :5433 | 130 / 128 |
| TcpExtListenOverflows | 277,876 |
| TcpExtListenDrops | 277,876 |
| net.ipv4.tcp_syncookies | 1 |

The Recv-Q pinned at the configured backlog (for a listening socket, ss reports the current accept-queue length in Recv-Q and the backlog limit in Send-Q), together with the matching overflow/drop counts, is the smoking gun.
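For anyone re-running this capture, the snapshot above comes from standard tools (port 5433 assumed, as on this cluster; the `|| true` guards only keep the snapshot going on hosts missing one of the utilities):

```shell
# Listener state on the conn-mgr port: for LISTEN sockets, ss reports the
# current accept-queue length in Recv-Q and the configured backlog in Send-Q.
ss -ltn 'sport = :5433' || true
# Cumulative accept-queue overflow/drop counters since boot.
nstat -az TcpExtListenOverflows TcpExtListenDrops || true
# Kernel-side cap on any listen() backlog.
sysctl net.core.somaxconn || true
```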

Error / stack trace

```
ERROR AppBase - Going to retrieve connection to 172.151.29.106 again: Connection to 172.151.29.106:5433 refused. ...
com.yugabyte.util.PSQLException: Connection to 172.151.29.106:5433 refused.
    at com.yugabyte.core.v3.ConnectionFactoryImpl.openConnectionImpl(...)
    at com.yugabyte.sample.apps.AppBase.getPostgresConnection(...)
    at com.yugabyte.sample.apps.SqlConnectionsBurst.lambda$burstSelect$0(...)
    ...
Caused by: java.net.ConnectException: Connection timed out
    at sun.nio.ch.Net.pollConnect(Native Method)
    at sun.nio.ch.NioSocketImpl.timedFinishConnect(...)
```

Suggested fixes (to evaluate)

  • Raise Odyssey's listen() backlog (e.g. 4096), and consider scaling with ysql_max_client_connections.
  • Run a dedicated accept coroutine independent of the worker pool so clients can be queued internally even when workers are backed up.
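As a concrete starting point for the first bullet: Odyssey reads the backlog from the listen section of its config, so the change would land in the generated ysql_conn_mgr.conf. A sketch, assuming stock Odyssey listen-block syntax (the exact template YugabyteDB generates may differ, and host/port values here are illustrative):

```
listen {
    host "*"
    port 5433
    backlog 4096   # was 128; the kernel still clamps this to net.core.somaxconn (4096 here)
}
```

If the backlog is to scale with ysql_max_client_connections, it would have to be recomputed wherever that flag feeds the config template.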

Jira Link: DB-21351
