
overlapping_query_spike: clients refused with connect timeouts — YSQL Connection Manager listen() backlog (128) appears to overflow under burst load #31558

@yugabyte-ci

Description


During the overlapping_query_spike test (3 client systems × 3 workloads, each firing 10 burst-read threads at 5 s intervals, each thread spawning 333–500 sub-threads running SELECT 1; pg_sleep(20); per connection), clients intermittently fail with PSQLException: Connection to {ip}:5433 refused (caused by java.net.ConnectException: Connection timed out). The evidence points to TCP accept-queue overflow in front of the YSQL Connection Manager: Odyssey calls listen() with backlog=128 per ysql_conn_mgr.conf, and during the spike completed handshakes pile up in the accept queue faster than Odyssey drains them. The kernel then drops (or resets) new connections, which is exactly what TcpExtListenOverflows / TcpExtListenDrops are counting on this host. net.core.somaxconn is 4096, so the kernel would allow a far larger queue; the bottleneck is the backlog value Odyssey passes to listen().
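For context on why tuning net.core.somaxconn alone cannot help here: the kernel clamps each listener's accept-queue limit to min(backlog, somaxconn), so with the values observed on this host the application-side 128 is the binding constraint. A minimal sketch of that arithmetic, using the values from the evidence table:

```shell
# The kernel caps each listener's accept queue at min(backlog, somaxconn).
# Values from this host: Odyssey passes backlog=128, net.core.somaxconn=4096.
backlog=128
somaxconn=4096
effective=$(( backlog < somaxconn ? backlog : somaxconn ))
echo "effective accept-queue limit: $effective"   # -> 128
```

Raising somaxconn further changes nothing in this situation; only a larger backlog argument to listen() moves the limit.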

Evidence (captured a few minutes into the run)
| Setting / counter | Value |
| --- | --- |
| ysql_conn_mgr.conf backlog | 128 |
| net.core.somaxconn | 4096 |
| ss Recv-Q / Send-Q on :5433 | 130 / 128 |
| TcpExtListenOverflows | 277,876 |
| TcpExtListenDrops | 277,876 |
| net.ipv4.tcp_syncookies | 1 |

The Recv-Q pinned at the configured backlog (for a listening socket, ss reports the current accept-queue length in Recv-Q and the backlog limit in Send-Q), together with the matching overflow/drop counts, is the smoking gun.
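For anyone re-running this capture, the snapshot above comes from standard tools (port 5433 assumed, as on this cluster; the `|| true` guards only keep the snapshot going on hosts missing one of the utilities):

```shell
# Listener state on the conn-mgr port: for LISTEN sockets, ss reports the
# current accept-queue length in Recv-Q and the configured backlog in Send-Q.
ss -ltn 'sport = :5433' || true
# Cumulative accept-queue overflow/drop counters since boot.
nstat -az TcpExtListenOverflows TcpExtListenDrops || true
# Kernel-side cap on any listen() backlog.
sysctl net.core.somaxconn || true
```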

Error / stack trace

```
ERROR AppBase - Going to retrieve connection to 172.151.29.106 again: Connection to 172.151.29.106:5433 refused. ...
com.yugabyte.util.PSQLException: Connection to 172.151.29.106:5433 refused.
    at com.yugabyte.core.v3.ConnectionFactoryImpl.openConnectionImpl(...)
    at com.yugabyte.sample.apps.AppBase.getPostgresConnection(...)
    at com.yugabyte.sample.apps.SqlConnectionsBurst.lambda$burstSelect$0(...)
    ...
Caused by: java.net.ConnectException: Connection timed out
    at sun.nio.ch.Net.pollConnect(Native Method)
    at sun.nio.ch.NioSocketImpl.timedFinishConnect(...)
```

Suggested fixes (to evaluate)

  • Raise Odyssey's listen() backlog (e.g. 4096), and consider scaling with ysql_max_client_connections.
  • Run a dedicated accept coroutine independent of the worker pool so clients can be queued internally even when workers are backed up.
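As a concrete starting point for the first bullet: Odyssey reads the backlog from the listen section of its config, so the change would land in the generated ysql_conn_mgr.conf. A sketch, assuming stock Odyssey listen-block syntax (the exact template YugabyteDB generates may differ, and host/port values here are illustrative):

```
listen {
    host "*"
    port 5433
    backlog 4096   # was 128; the kernel still clamps this to net.core.somaxconn (4096 here)
}
```

If the backlog is to scale with ysql_max_client_connections, it would have to be recomputed wherever that flag feeds the config template.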

Jira Link: DB-21351
