Description
What problem are you trying to solve?
The Linkerd proxy's inbound listeners currently use a fixed TCP accept backlog (observed as 128 via ss -ltnp) that operators cannot configure.
In high-traffic environments, especially during Kubernetes rollouts where many sidecars simultaneously establish new outbound connections to newly-ready pods, this fixed backlog can become a limiting factor. When a connection burst exceeds the proxy’s accept queue capacity, incoming connections are temporarily dropped or delayed at the TCP level, leading to short-lived connection failures such as:
{"timestamp":"2025-12-12T19:55:11.333411Z","level":"WARN","fields":{"message":"Failed to connect","error":"connect timed out after 1s"},"target":"linkerd_reconnect","threadId":"ThreadId(1)"}
Because the proxy backlog is not configurable or documented, operators have no direct way to tune Linkerd for services that experience high fan-in or connection storms (for example during rollouts, autoscaling events, or traffic rebalancing).
How should the problem be solved?
Linkerd should allow operators to configure the TCP listen backlog for the proxy’s inbound listeners by exposing a proxy configuration option.
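As a rough sketch of the knob being requested (not the proxy's actual implementation): the backlog is fixed when listen() is called, so a configurable value only needs to be read once when the inbound listener is bound. The example below uses the socket2 crate and a hypothetical LINKERD2_PROXY_INBOUND_LISTEN_BACKLOG environment variable; the variable name, default, and port are illustrative only, and on Linux the effective backlog is still capped by net.core.somaxconn.

```rust
// Sketch only: one possible way a configurable accept backlog could be applied
// when binding an inbound listener. LINKERD2_PROXY_INBOUND_LISTEN_BACKLOG is a
// hypothetical setting, not an existing proxy option.
use socket2::{Domain, Protocol, Socket, Type};
use std::net::{SocketAddr, TcpListener};

fn bind_inbound(addr: SocketAddr) -> std::io::Result<TcpListener> {
    // Fall back to 128, matching the backlog currently observed via `ss -ltnp`.
    let backlog: i32 = std::env::var("LINKERD2_PROXY_INBOUND_LISTEN_BACKLOG")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(128);

    let socket = Socket::new(Domain::for_address(addr), Type::STREAM, Some(Protocol::TCP))?;
    socket.set_reuse_address(true)?;
    socket.bind(&addr.into())?;
    // The accept backlog is set here, at listen() time; this is the lever the
    // issue asks to expose. The kernel caps it at net.core.somaxconn on Linux.
    socket.listen(backlog)?;
    Ok(socket.into())
}

fn main() -> std::io::Result<()> {
    // Illustrative address; 4143 is the proxy's inbound port.
    let listener = bind_inbound("0.0.0.0:4143".parse().unwrap())?;
    println!("listening on {}", listener.local_addr()?);
    Ok(())
}
```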
Any alternatives you've considered?
- Introducing artificial traffic ramp-up or rollout delays at the application or deployment level, which adds operational complexity and slows rollouts.
- Increasing the Linkerd client proxy's outbound connect timeout, which masks the symptom but does not address the underlying accept queue saturation.
How would users interact with this feature?
This would give operators a standard, well-understood TCP tuning lever (similar to listen backlog in nginx, for example) and avoid the need for rollout workarounds or indirect traffic shaping.
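For example (purely illustrative, none of these exist today), the backlog could surface as a Helm value or a pod annotation such as config.linkerd.io/proxy-inbound-listen-backlog, mapped to an environment variable on the proxy container along the lines of the sketch above.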
Would you like to work on this feature?
maybe