Clients refused with connect timeouts — YSQL Conn Mgr listen() backlog (128) appears to overflow under burst load
During the overlapping_query_spike test (3 client systems × 3 workloads, each firing 10 burst-read threads at 5 s intervals, each thread spawning 333–500 sub-threads that run SELECT 1; pg_sleep(20); on their own connections), clients intermittently fail with PSQLException: Connection to {ip}:5433 refused (caused by java.net.ConnectException: Connection timed out). The evidence points to TCP accept-queue overflow in front of the YSQL Connection Manager: Odyssey calls listen() with backlog=128 per ysql_conn_mgr.conf, and during the spike completed handshakes pile up in the accept queue faster than Odyssey drains them. The kernel then drops new connection attempts (the clients see connect timeouts), which is what TcpExtListenOverflows / TcpExtListenDrops are counting on this host. net.core.somaxconn is 4096, so the kernel would allow more; the bottleneck is the backlog value Odyssey passes to listen().
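The failure mode above can be reproduced in miniature without YugabyteDB at all: open a listener with a small backlog, never call accept(), and flood it with connects. A minimal sketch (assumes Linux loopback semantics; the backlog of 4 stands in for Odyssey's 128):

```python
import socket

BACKLOG = 4  # stand-in for Odyssey's listen(backlog=128)

# Listener that never accepts, so the accept queue fills up.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(BACKLOG)
port = srv.getsockname()[1]

clients, ok, failed = [], 0, 0
for _ in range(BACKLOG * 4):
    c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    c.settimeout(0.25)  # shorter than the kernel's SYN retransmit interval
    clients.append(c)
    try:
        # Once the accept queue is full, the kernel silently drops the SYN
        # and this connect() times out -- the same symptom the test clients see.
        c.connect(("127.0.0.1", port))
        ok += 1
    except OSError:
        failed += 1

for c in clients:
    c.close()
srv.close()
print(f"connected={ok} timed_out={failed}")
```

The first few connects (roughly backlog + 1 on Linux) complete; every later one times out, and TcpExtListenOverflows/TcpExtListenDrops tick up for each drop.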
Evidence (captured a few minutes into the run)
ysql_conn_mgr.conf backlog | 128
net.core.somaxconn | 4096
ss Recv-Q / Send-Q on :5433 | 130 / 128
TcpExtListenOverflows | 277,876
TcpExtListenDrops | 277,876
net.ipv4.tcp_syncookies | 1
For a listening socket, ss reports the current accept-queue depth in Recv-Q and the configured backlog in Send-Q; Recv-Q pinned at (slightly above) the configured backlog, together with the matching overflow and drop counters, is the smoking gun.
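One way to verify the counters are advancing during the burst (rather than being stale residue of an earlier run) is to sample the kernel's TcpExt counters before and after a spike. A minimal parser for the Linux /proc/net/netstat format (header/value line pairs) might look like:

```python
def tcpext_counters(path="/proc/net/netstat"):
    """Return the TcpExt counters (e.g. ListenOverflows, ListenDrops) as a dict.

    /proc/net/netstat is laid out as alternating header/value lines,
    e.g. "TcpExt: SyncookiesSent ..." followed by "TcpExt: 0 ...".
    Linux-only assumption.
    """
    with open(path) as f:
        lines = f.read().splitlines()
    for hdr, vals in zip(lines[::2], lines[1::2]):
        if hdr.startswith("TcpExt:"):
            return dict(zip(hdr.split()[1:], map(int, vals.split()[1:])))
    return {}

before = tcpext_counters()
# ... run the burst workload here ...
after = tcpext_counters()
overflow_delta = (after.get("ListenOverflows", 0)
                  - before.get("ListenOverflows", 0))
```

A nonzero overflow_delta across the spike window ties the client-side timeouts directly to accept-queue overflow on this listener's host.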
Error / stack trace
ERROR AppBase - Going to retrieve connection to 172.151.29.106 again: Connection to 172.151.29.106:5433 refused. ...
com.yugabyte.util.PSQLException: Connection to 172.151.29.106:5433 refused.
at com.yugabyte.core.v3.ConnectionFactoryImpl.openConnectionImpl(...)
at com.yugabyte.sample.apps.AppBase.getPostgresConnection(...)
at com.yugabyte.sample.apps.SqlConnectionsBurst.lambda$burstSelect$0(...)
...
Caused by: java.net.ConnectException: Connection timed out
at sun.nio.ch.Net.pollConnect(Native Method)
at sun.nio.ch.NioSocketImpl.timedFinishConnect(...)
Suggested fixes (to evaluate)
- Raise Odyssey's listen() backlog (e.g. to 4096, matching net.core.somaxconn), and consider scaling it with ysql_max_client_connections.
- Run a dedicated accept coroutine, independent of the worker pool, so that new clients can be queued internally even when the workers are backed up.
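Note that raising the backlog alone only helps up to the kernel cap: listen() silently clamps its argument to net.core.somaxconn, so on a host where somaxconn is lower than the new backlog both knobs have to move together. A small sketch of the effective value (Linux sysctl path assumed):

```python
def effective_backlog(requested: int) -> int:
    """listen(requested) is silently capped at net.core.somaxconn on Linux,
    so the effective accept-queue limit is the smaller of the two."""
    with open("/proc/sys/net/core/somaxconn") as f:
        somaxconn = int(f.read().strip())
    return min(requested, somaxconn)
```

On the host above (somaxconn = 4096), effective_backlog(4096) would be the full 4096, whereas today effective_backlog(128) is just 128.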
Jira Link: DB-21351