[client] Bound read timeout on streaming API requests to detect dead connections#9990
[client] Bound read timeout on streaming API requests to detect dead connections#9990zpoint wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Code Review
This pull request adds default timeout handling to the request_without_retry function in sky/server/rest.py to prevent requests from blocking indefinitely. The feedback suggests refactoring the timeout logic to avoid duplicate magic numbers and wrapping the comment lines to comply with PEP 8 line length limits.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| # `requests` has no default timeout, so if the connection stalls during the | ||
| # TLS handshake or the response is dropped mid-flight (e.g. a proxy/CDN | ||
| # silently drops a long-lived connection), the call blocks forever. Set a | ||
| # connect timeout always. Only bound the read for non-streaming requests -- | ||
| # streaming endpoints (e.g. log following) may legitimately stay idle for a | ||
| # long time, so they keep an unbounded read. | ||
| if 'timeout' not in kwargs: | ||
| kwargs['timeout'] = (10, None) if kwargs.get('stream') else (10, 60) |
There was a problem hiding this comment.
The current implementation uses hardcoded magic numbers (10, 60) and repeats the connection timeout value (10) twice. Additionally, several comment lines exceed the PEP 8 recommended limit of 79 characters (and 72 characters for comments/docstrings).\n\nWe can improve readability, maintainability, and PEP 8 compliance by wrapping the comments to shorter lines and refactoring the timeout logic to use descriptive local variables that avoid duplication.
| # `requests` has no default timeout, so if the connection stalls during the | |
| # TLS handshake or the response is dropped mid-flight (e.g. a proxy/CDN | |
| # silently drops a long-lived connection), the call blocks forever. Set a | |
| # connect timeout always. Only bound the read for non-streaming requests -- | |
| # streaming endpoints (e.g. log following) may legitimately stay idle for a | |
| # long time, so they keep an unbounded read. | |
| if 'timeout' not in kwargs: | |
| kwargs['timeout'] = (10, None) if kwargs.get('stream') else (10, 60) | |
| # `requests` has no default timeout, so if the connection stalls\n # during the TLS handshake or the response is dropped mid-flight\n # (e.g. a proxy/CDN silently drops a long-lived connection), the\n # call blocks forever. Set a connect timeout always. Only bound\n # the read for non-streaming requests -- streaming endpoints\n # (e.g. log following) may legitimately stay idle for a long\n # time, so they keep an unbounded read.\n if 'timeout' not in kwargs:\n connect_timeout = 10\n read_timeout = None if kwargs.get('stream') else 60\n kwargs['timeout'] = (connect_timeout, read_timeout) |
References
- PEP 8 recommends limiting all lines to a maximum of 79 characters, and limiting flowing long blocks of text (comments or docstrings) to 72 characters. (link)
f4280e6 to
30ac534
Compare
30ac534 to
4079a7b
Compare
|
/smoke-test --kubernetes --no-resource-heavy |
4079a7b to
c942a6b
Compare
…connections `sky status`/`sky logs`/`sky jobs logs` stream their results from the API server's `/api/stream` (and `/logs`, `/provision_logs`, `/jobs/logs`) endpoints. The client made these streaming reads with no read timeout (`timeout=(connect, None)`), so when a streaming connection silently dies -- a proxy/CDN drops the long-lived connection, or the peer goes away mid-stream -- the client blocks in recv() forever. We observed `sky status` wedged ~20 min in a single call (kernel stack stuck in `tcp_recvmsg`), while a fresh `sky status` returned instantly, confirming a dead connection rather than a slow server. The server already sends a heartbeat every 30s on these streams (`stream_utils.py:_HEARTBEAT_INTERVAL`, skypilot-org#5750) to keep them busy through idle-timeout proxies. This adds the missing client-side counterpart: a read timeout of 4x the heartbeat (120s). A healthy stream is reset by each 30s heartbeat so the timeout never fires; a dead stream raises `ReadTimeout`, which these readers already retry (`retry_transient_errors`) on a fresh connection. Scope: the retry-wrapped sync console-log streams -- `tail_logs`, `tail_provision_logs`, `stream_and_get` (sky/client/sdk.py) and `jobs.tail_logs` (sky/jobs/client/sdk.py). Left for follow-up (each needs its own retry wrapper first, else a timeout would surface instead of retry): `serve.tail_logs`, the async client streams, and `jobs.download_logs_streaming`. Heartbeat-less endpoints (`/api/get` long-poll, `format=plain` log streams) keep an unbounded read, since they may legitimately block with no intermediate output. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
c942a6b to
3399d8e
Compare
Problem
sky status,sky logs, andsky jobs logsstream their results from the APIserver's
/api/stream(and/logs,/provision_logs,/jobs/logs) endpoints.The client makes these streaming reads with no read timeout:
requests/aiohttpwith an unbounded read wait forever for the next byte.So if a streaming connection silently dies, the client blocks in
recv()indefinitely instead of erroring or retrying — the command hangs with no output
until Ctrl-C.
Why it hangs, and why load makes it worse
The stream stays healthy only while bytes keep arriving. The server sends a
heartbeat every 30s (
stream_utils.py:_HEARTBEAT_INTERVAL, added in #5750)specifically to keep bytes flowing, so that reverse proxies / load balancers /
CDNs — which drop idle connections (commonly after ~60–100s of silence) —
don't cut the connection. That heartbeat is the safety mechanism.
If the stream goes silent longer than the intermediary's idle limit, the
connection is dropped; and if that drop isn't cleanly signaled to the client
(a half-close, or silently dropped packets), the client's socket still looks
connected and it blocks in
recv()forever — because there is no read timeout.High load amplifies this two ways:
server's event loop; under heavy concurrency the loop is contended and the
heartbeat can be delayed past the proxy's idle limit, so the mechanism that
keeps the connection alive fails and the "idle" connection gets dropped.
the proxy/CDN, so statistically more get hit by a transient drop/reset.
Both are amplifiers of the same root cause: no client read timeout → any
silent drop = infinite hang.
Who hits it
common hosted/team setup):
sky logs,sky jobs logs,sky status,sky launchall stream — if the server is busy or the network idle-drops thestream, the CLI hangs until Ctrl-C.
sky logs --followon a quiet job — the stream depends on heartbeats; aslipped heartbeat under load → idle-drop → hang.
the TCP connection without a clean close → the client never notices → hang.
Local / direct API servers (no proxy, low latency) rarely hit it.
Evidence
sky statuswas caught wedged ~20 min in a single call; kernel stack(
/proc/<pid>/stack) stuck intcp_recvmsg→ blocked reading the responsesocket, with
ps etime~1229s.connection returned in <2s. So it was a single half-dead connection, not a slow
server or a real problem with the query.
Fix
The server already sends the 30s heartbeat; the missing piece is the
client-side counterpart — a read timeout above the heartbeat interval. This
sets the streaming read timeout to 4× the heartbeat (120s):
fires — no regression to live streaming.
RequestException(surfaces as
ConnectionError("Read timed out")during iteration), which thestreaming SDK functions already treat as transient (
retry_transient_errors)and retry on a fresh connection — which, as shown above, succeeds.
/api/get) are left unbounded, sincethey may legitimately block on a slow request with no intermediate output.
Reproduction (deterministic; included as a unit test)
tests/unit_tests/test_sky/server/test_rest_stream_timeout.pystarts a stubserver that sends
200+ one chunk then holds the socket open and silent(mimicking a dropped-but-not-closed stream), and drives sky's real request
session at it:
Plus an invariant test asserting the client read timeout stays above the
server heartbeat interval (so a healthy stream never trips it).
Why this layer (vs. alternatives)
long-poll
/api/getand any slow synchronous endpoint.SO_KEEPALIVE) — keepalive only detects a TCP-deadpeer, not a proxy that holds the socket open while forwarding no data; a read
timeout catches both, and it composes with the heartbeat the server already
sends. (HTTP keep-alive / connection reuse is already on and is unrelated — it
does not detect dead connections.)
Scope
Applied to the retry-wrapped sync console-log streams (a
ReadTimeoutthereis retried on a fresh connection):
tail_logs,tail_provision_logs,stream_and_get(sky/client/sdk.py) andjobs.tail_logs(
sky/jobs/client/sdk.py).Left for follow-up (same heartbeat streams, but not retry-wrapped today, so a
bare read timeout would surface instead of retrying):
serve.tail_logs, theasync client streams (
sky/client/sdk_async.py), andjobs.download_logs_streaming. Each needs its own retry/re-attach first.Untouched on purpose:
/api/getlong-poll andformat=plainlog streams emitno heartbeat, so a bounded read timeout there would false-fire.
Open questions for reviewers
→ false timeout + retry; too loose → slower detection)?
timeout is provably coupled to it?
serve.tail_logs+ the async client acceptable, or should thisPR add the retry wrappers they need?
Test plan
would block forever without it.
/smoke-test --kubernetes --no-resource-heavy) green,including
sky logs/sky status/sky jobs logsstreaming.sky logs --followstays connected while idle (heartbeatkeeps the 120s timer reset).