[client] Bound read timeout on streaming API requests to detect dead connections by zpoint · Pull Request #9990 · skypilot-org/skypilot

zpoint · 2026-06-30T11:09:00Z

Problem

sky status, sky logs, and sky jobs logs stream their results from the API
server's /api/stream (and /logs, /provision_logs, /jobs/logs) endpoints.
The client makes these streaming reads with no read timeout:

timeout=(client_common.API_SERVER_REQUEST_CONNECTION_TIMEOUT_SECONDS, None)
#                                                                      ^^^^ read = unbounded

With an unbounded read timeout, requests and aiohttp wait forever for the next
byte. So if a streaming connection silently dies, the client blocks in recv()
indefinitely instead of erroring or retrying, and the command hangs with no
output until Ctrl-C.

Why it hangs, and why load makes it worse

The stream stays healthy only while bytes keep arriving. The server sends a
heartbeat every 30s (stream_utils.py:_HEARTBEAT_INTERVAL, added in #5750)
specifically to keep bytes flowing, so that reverse proxies / load balancers /
CDNs — which drop idle connections (commonly after ~60–100s of silence) —
don't cut the connection. That heartbeat is the safety mechanism.
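
As a rough illustration of the heartbeat mechanism described above, the sketch below wraps a queue of log chunks in an async generator that emits a filler payload whenever the queue stays quiet for one interval. The names `with_heartbeat` and `HEARTBEAT`, and the newline payload, are assumptions made for illustration; the server's actual implementation lives in stream_utils.py and is not reproduced here.

```python
import asyncio
from typing import AsyncIterator, Optional

HEARTBEAT_INTERVAL_SECONDS = 30  # interval stated in the PR text
HEARTBEAT = b'\n'  # hypothetical payload; the real wire format is not shown


async def with_heartbeat(
        chunks: 'asyncio.Queue[Optional[bytes]]',
        interval: float = HEARTBEAT_INTERVAL_SECONDS) -> AsyncIterator[bytes]:
    """Yield queued chunks, or a heartbeat after each silent `interval`."""
    while True:
        try:
            # Wait for real data, but no longer than one heartbeat interval.
            chunk = await asyncio.wait_for(chunks.get(), timeout=interval)
        except asyncio.TimeoutError:
            # Nothing arrived in time: send filler so the connection is
            # never idle from the point of view of a proxy in between.
            yield HEARTBEAT
            continue
        if chunk is None:  # sentinel: the producer has finished
            return
        yield chunk
```

Because the heartbeat in this sketch is produced by a coroutine on the same event loop as the rest of the work, a contended loop delays the `wait_for` wake-up, which is one way a nominal 30s interval can stretch in practice.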

If the stream goes silent longer than the intermediary's idle limit, the
connection is dropped; and if that drop isn't cleanly signaled to the client
(a half-close, or silently dropped packets), the client's socket still looks
connected and it blocks in recv() forever — because there is no read timeout.

High load amplifies this two ways:

1. Heartbeat slippage: the 30s heartbeat is an async coroutine on the
   server's event loop; under heavy concurrency the loop is contended and the
   heartbeat can be delayed past the proxy's idle limit, so the mechanism that
   keeps the connection alive fails and the "idle" connection gets dropped.
2. More exposure: more and longer-lived concurrent streams are open through
   the proxy/CDN, so statistically more get hit by a transient drop/reset.

Both are amplifiers of the same root cause: no client read timeout → any
silent drop = infinite hang.

Who hits it

- Anyone on a remote API server behind a reverse proxy / LB / CDN (the
  common hosted/team setup): sky logs, sky jobs logs, sky status, and
  sky launch all stream. If the server is busy or the network idle-drops the
  stream, the CLI hangs until Ctrl-C.
- sky logs --follow on a quiet job: the stream depends on heartbeats; a
  slipped heartbeat under load → idle-drop → hang.
- Flaky / changing networks: laptop sleep/wake or a VPN or wifi handoff kills
  the TCP connection without a clean close → the client never notices → hang.
- Server rollout / restart while a stream is open.

Local / direct API servers (no proxy, low latency) rarely hit it.

Evidence

- A real sky status was caught wedged ~20 min in a single call; kernel stack
  (/proc/<pid>/stack) stuck in tcp_recvmsg → blocked reading the response
  socket, with ps etime ~1229s.
- The server/target were healthy: running the same command again on a fresh
  connection returned in <2s. So it was a single half-dead connection, not a
  slow server or a real problem with the query.
- Deterministic repro (now a unit test; see below).

Fix

The server already sends the 30s heartbeat; the missing piece is the
client-side counterpart — a read timeout above the heartbeat interval. This
sets the streaming read timeout to 4× the heartbeat (120s):

- On a healthy stream, each 30s heartbeat resets the read timer, so the 120s
  timeout never fires: no regression to live streaming.
- A dead/stalled stream (no heartbeat for 120s) raises a RequestException
  (surfaces as ConnectionError("Read timed out") during iteration), which the
  streaming SDK functions already treat as transient (retry_transient_errors)
  and retry on a fresh connection, which, as shown above, succeeds.
- Heartbeat-less long-poll endpoints (/api/get) are left unbounded, since
  they may legitimately block on a slow request with no intermediate output.
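
The relationship between the heartbeat, the read timeout, and the retry can be sketched as follows. This is a minimal illustration under stated assumptions, not the SDK's code: the connect timeout of 10s is a placeholder, `stream_with_retry` is a hypothetical helper standing in for the existing retry_transient_errors wrapper, and the built-in `TimeoutError` stands in for the exception that requests raises on a read timeout.

```python
import time
from typing import Callable, Iterator, Tuple

HEARTBEAT_INTERVAL_SECONDS = 30  # server heartbeat, per the PR text
CONNECT_TIMEOUT_SECONDS = 10  # placeholder; the client uses its own constant
STREAM_READ_TIMEOUT_SECONDS = 4 * HEARTBEAT_INTERVAL_SECONDS  # 120s

# (connect, read) tuple in the shape `requests` accepts for `timeout=`.
STREAM_TIMEOUT: Tuple[float, float] = (CONNECT_TIMEOUT_SECONDS,
                                       STREAM_READ_TIMEOUT_SECONDS)


def stream_with_retry(open_stream: Callable[[], Iterator[bytes]],
                      max_attempts: int = 3,
                      backoff_seconds: float = 1.0) -> Iterator[bytes]:
    """Re-open the stream when a read times out (hypothetical helper)."""
    for attempt in range(1, max_attempts + 1):
        try:
            # Each call to open_stream() represents a fresh connection.
            yield from open_stream()
            return
        except TimeoutError:
            # Stand-in for the read-timeout error; give up on the last try.
            if attempt == max_attempts:
                raise
            time.sleep(backoff_seconds)
```

This sketch simply restarts the stream from the beginning after a timeout, so a caller may see repeated output; how the real streams re-attach is not covered here.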

Reproduction (deterministic; included as a unit test)

tests/unit_tests/test_sky/server/test_rest_stream_timeout.py starts a stub
server that sends 200 + one chunk then holds the socket open and silent
(mimicking a dropped-but-not-closed stream), and drives sky's real request
session at it:

read timeout = None  ->  blocks forever  (test asserts still blocked after 4s)  = the bug
read timeout = 2s    ->  raises "Read timed out" in ~2s (a retryable RequestException) = the fix

Plus an invariant test asserting the client read timeout stays above the
server heartbeat interval (so a healthy stream never trips it).
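
The shape of that reproduction can be approximated with the standard library alone. The sketch below is not the unit test from the PR: it uses raw sockets instead of sky's request session, and the helper names `start_silent_server` and `read_until_stall` are made up for illustration. It shows a server that sends a response head and one chunk and then goes silent, and a client whose per-read timeout turns the stall into an error instead of a hang.

```python
import socket
import threading
from typing import Tuple


def start_silent_server() -> Tuple[int, threading.Event, socket.socket]:
    """Serve one response head plus one chunk, then stay silent."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(('127.0.0.1', 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    stop = threading.Event()

    def serve() -> None:
        conn, _ = listener.accept()
        with conn:
            conn.recv(65536)  # read and ignore the request
            conn.sendall(b'HTTP/1.1 200 OK\r\n'
                         b'Transfer-Encoding: chunked\r\n\r\n'
                         b'5\r\nhello\r\n')
            stop.wait()  # hold the socket open without sending anything

    threading.Thread(target=serve, daemon=True).start()
    return listener.getsockname()[1], stop, listener


def read_until_stall(port: int, read_timeout: float) -> Tuple[bytes, bool]:
    """Return (bytes received, whether the read timeout fired)."""
    received = b''
    timed_out = False
    with socket.create_connection(('127.0.0.1', port), timeout=5) as sock:
        sock.sendall(b'GET /api/stream HTTP/1.1\r\nHost: localhost\r\n\r\n')
        sock.settimeout(read_timeout)  # bounds each recv() call
        try:
            while True:
                data = sock.recv(4096)
                if not data:  # peer closed cleanly
                    break
                received += data
        except socket.timeout:
            timed_out = True
    return received, timed_out
```

With `sock.settimeout(None)` in place of the bounded value, the same loop would sit in `recv()` until the process is interrupted, which corresponds to the unbounded case above.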

Why this layer (vs. alternatives)

- Not a blanket default read timeout on all requests: it would break the
  long-poll /api/get and any slow synchronous endpoint.
- Not TCP keepalive (SO_KEEPALIVE): keepalive only detects a TCP-dead
  peer, not a proxy that holds the socket open while forwarding no data; a read
  timeout catches both, and it composes with the heartbeat the server already
  sends. (HTTP keep-alive / connection reuse is already on and is unrelated; it
  does not detect dead connections.)

Scope

Applied to the retry-wrapped sync console-log streams (a ReadTimeout there
is retried on a fresh connection): tail_logs, tail_provision_logs,
stream_and_get (sky/client/sdk.py) and jobs.tail_logs
(sky/jobs/client/sdk.py).

Left for follow-up (same heartbeat streams, but not retry-wrapped today, so a
bare read timeout would surface instead of retrying): serve.tail_logs, the
async client streams (sky/client/sdk_async.py), and
jobs.download_logs_streaming. Each needs its own retry/re-attach first.

Untouched on purpose: /api/get long-poll and format=plain log streams emit
no heartbeat, so a bounded read timeout there would false-fire.
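
One way to picture the scoping rule is a small lookup that bounds the read only where a heartbeat is expected. This is purely a hypothetical sketch: the PR applies the timeout per SDK function rather than per path, `read_timeout_for` does not exist in the codebase, and the follow-up streams listed above would still need their own retry handling before being added to such a table.

```python
from typing import Optional

HEARTBEAT_INTERVAL_SECONDS = 30  # server heartbeat, per the PR text
STREAM_READ_TIMEOUT_SECONDS = 4 * HEARTBEAT_INTERVAL_SECONDS

# Endpoints the PR text names as heartbeat-backed streams.
HEARTBEAT_STREAM_PATHS = frozenset(
    {'/api/stream', '/logs', '/provision_logs', '/jobs/logs'})


def read_timeout_for(path: str,
                     log_format: Optional[str] = None) -> Optional[float]:
    """Return a bounded read timeout only where heartbeats are expected."""
    if log_format == 'plain':
        # format=plain log streams emit no heartbeat: leave unbounded.
        return None
    if path in HEARTBEAT_STREAM_PATHS:
        return STREAM_READ_TIMEOUT_SECONDS
    # Everything else, including the /api/get long-poll, stays unbounded.
    return None
```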

Open questions for reviewers

- Is 4× the heartbeat (120s) the right margin under extreme load (too tight
  → false timeout + retry; too loose → slower detection)?
- Should the heartbeat interval be a shared constant so the client read
  timeout is provably coupled to it?
- Is deferring serve.tail_logs + the async client acceptable, or should this
  PR add the retry wrappers they need?

Test plan

- Unit test: a silent stream raises within the timeout instead of hanging, and
  would block forever without it.
- Full k8s smoke suite (/smoke-test --kubernetes --no-resource-heavy) green,
  including sky logs/sky status/sky jobs logs streaming.
- Manual: confirm sky logs --follow stays connected while idle (heartbeat
  keeps the 120s timer reset).

gemini-code-assist

Code Review

This pull request adds default timeout handling to the request_without_retry function in sky/server/rest.py to prevent requests from blocking indefinitely. The feedback suggests refactoring the timeout logic to avoid duplicate magic numbers and wrapping the comment lines to comply with PEP 8 line length limits.

gemini-code-assist · 2026-06-30T11:09:59Z

+    # `requests` has no default timeout, so if the connection stalls during the
+    # TLS handshake or the response is dropped mid-flight (e.g. a proxy/CDN
+    # silently drops a long-lived connection), the call blocks forever. Set a
+    # connect timeout always. Only bound the read for non-streaming requests --
+    # streaming endpoints (e.g. log following) may legitimately stay idle for a
+    # long time, so they keep an unbounded read.
+    if 'timeout' not in kwargs:
+        kwargs['timeout'] = (10, None) if kwargs.get('stream') else (10, 60)


The current implementation uses hardcoded magic numbers (10, 60) and repeats the connection timeout value (10) twice. Additionally, several comment lines exceed the PEP 8 recommended limit of 79 characters (and 72 characters for comments/docstrings).

We can improve readability, maintainability, and PEP 8 compliance by wrapping the comments to shorter lines and refactoring the timeout logic to use descriptive local variables that avoid duplication.

Suggested change

    # `requests` has no default timeout, so if the connection stalls
    # during the TLS handshake or the response is dropped mid-flight
    # (e.g. a proxy/CDN silently drops a long-lived connection), the
    # call blocks forever. Set a connect timeout always. Only bound
    # the read for non-streaming requests -- streaming endpoints
    # (e.g. log following) may legitimately stay idle for a long
    # time, so they keep an unbounded read.
    if 'timeout' not in kwargs:
        connect_timeout = 10
        read_timeout = None if kwargs.get('stream') else 60
        kwargs['timeout'] = (connect_timeout, read_timeout)

References

PEP 8 recommends limiting all lines to a maximum of 79 characters, and limiting flowing long blocks of text (comments or docstrings) to 72 characters.

zpoint · 2026-06-30T11:50:09Z

/smoke-test --kubernetes --no-resource-heavy

…connections

`sky status`/`sky logs`/`sky jobs logs` stream their results from the API server's `/api/stream` (and `/logs`, `/provision_logs`, `/jobs/logs`) endpoints. The client made these streaming reads with no read timeout (`timeout=(connect, None)`), so when a streaming connection silently dies -- a proxy/CDN drops the long-lived connection, or the peer goes away mid-stream -- the client blocks in recv() forever.

We observed `sky status` wedged ~20 min in a single call (kernel stack stuck in `tcp_recvmsg`), while a fresh `sky status` returned instantly, confirming a dead connection rather than a slow server.

The server already sends a heartbeat every 30s on these streams (`stream_utils.py:_HEARTBEAT_INTERVAL`, skypilot-org#5750) to keep them busy through idle-timeout proxies. This adds the missing client-side counterpart: a read timeout of 4x the heartbeat (120s). A healthy stream is reset by each 30s heartbeat so the timeout never fires; a dead stream raises `ReadTimeout`, which these readers already retry (`retry_transient_errors`) on a fresh connection.

Scope: the retry-wrapped sync console-log streams -- `tail_logs`, `tail_provision_logs`, `stream_and_get` (sky/client/sdk.py) and `jobs.tail_logs` (sky/jobs/client/sdk.py). Left for follow-up (each needs its own retry wrapper first, else a timeout would surface instead of retry): `serve.tail_logs`, the async client streams, and `jobs.download_logs_streaming`. Heartbeat-less endpoints (`/api/get` long-poll, `format=plain` log streams) keep an unbounded read, since they may legitimately block with no intermediate output.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

gemini-code-assist Bot reviewed Jun 30, 2026

zpoint force-pushed the fix/api-request-connection-timeout branch from f4280e6 to 30ac534 Compare June 30, 2026 11:30

zpoint changed the title ~~[client] Add connection timeout to API requests to avoid indefinite hangs~~ [client] Bound read timeout on streaming API requests to detect dead connections Jun 30, 2026

zpoint force-pushed the fix/api-request-connection-timeout branch from 30ac534 to 4079a7b Compare June 30, 2026 11:41

zpoint force-pushed the fix/api-request-connection-timeout branch from 4079a7b to c942a6b Compare July 1, 2026 02:27

zpoint force-pushed the fix/api-request-connection-timeout branch from c942a6b to 3399d8e Compare July 1, 2026 02:28

zpoint wants to merge 1 commit into skypilot-org:master from zpoint:fix/api-request-connection-timeout
