
[client] Bound read timeout on streaming API requests to detect dead connections #9990

Draft
zpoint wants to merge 1 commit into skypilot-org:master from zpoint:fix/api-request-connection-timeout

@zpoint zpoint commented Jun 30, 2026

Collaborator

Problem

sky status, sky logs, and sky jobs logs stream their results from the API
server's /api/stream (and /logs, /provision_logs, /jobs/logs) endpoints.
The client makes these streaming reads with no read timeout:

timeout=(client_common.API_SERVER_REQUEST_CONNECTION_TIMEOUT_SECONDS, None)
#                                                                      ^^^^ read = unbounded

With an unbounded read timeout, requests and aiohttp wait forever for the next byte.
So if a streaming connection silently dies, the client blocks in recv()
indefinitely instead of erroring or retrying — the command hangs with no output
until Ctrl-C.

Why it hangs, and why load makes it worse

The stream stays healthy only while bytes keep arriving. The server sends a
heartbeat every 30s (stream_utils.py:_HEARTBEAT_INTERVAL, added in #5750)
specifically to keep bytes flowing, so that reverse proxies / load balancers /
CDNs — which drop idle connections (commonly after ~60–100s of silence) —
don't cut the connection. That heartbeat is the safety mechanism.

If the stream goes silent longer than the intermediary's idle limit, the
connection is dropped; and if that drop isn't cleanly signaled to the client
(a half-close, or silently dropped packets), the client's socket still looks
connected and it blocks in recv() forever — because there is no read timeout.

High load amplifies this two ways:

  1. Heartbeat slippage — the 30s heartbeat is an async coroutine on the
    server's event loop; under heavy concurrency the loop is contended and the
    heartbeat can be delayed past the proxy's idle limit, so the mechanism that
    keeps the connection alive fails and the "idle" connection gets dropped.
  2. More exposure — more and longer-lived concurrent streams are open through
    the proxy/CDN, so statistically more get hit by a transient drop/reset.

Both are amplifiers of the same root cause: no client read timeout → any
silent drop = infinite hang.
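
The heartbeat slippage described in point 1 can be reproduced at small scale. The sketch below is not SkyPilot code: the function names are hypothetical and the intervals are scaled down from 30s to milliseconds. A heartbeat coroutine records a timestamp on every tick while a second task blocks the event loop, and the largest gap between ticks ends up far above the nominal interval.

```python
import asyncio
import time
from typing import List


async def heartbeat(interval: float, ticks: List[float], count: int) -> None:
    """Record a timestamp every `interval` seconds, like a stream heartbeat."""
    for _ in range(count):
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def hog(start_after: float, block_for: float) -> None:
    """Stand-in for a contended loop: a task that blocks without yielding."""
    await asyncio.sleep(start_after)
    time.sleep(block_for)  # Blocking call: no other coroutine can run meanwhile.


async def _measure(interval: float, block_for: float) -> float:
    ticks: List[float] = []
    await asyncio.gather(
        heartbeat(interval, ticks, count=6),
        hog(start_after=interval * 1.5, block_for=block_for),
    )
    # Largest silence between two consecutive heartbeats.
    return max(later - earlier for earlier, later in zip(ticks, ticks[1:]))


def max_heartbeat_gap(interval: float = 0.05, block_for: float = 0.3) -> float:
    """Return the longest observed gap between heartbeats, in seconds."""
    return asyncio.run(_measure(interval, block_for))
```

If the intermediary's idle limit sits between the nominal interval and the observed gap, the connection is dropped even though the heartbeat task never stopped running.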

Who hits it

  • Anyone on a remote API server behind a reverse proxy / LB / CDN (the
    common hosted/team setup): sky logs, sky jobs logs, sky status,
    sky launch all stream — if the server is busy or the network idle-drops the
    stream, the CLI hangs until Ctrl-C.
  • sky logs --follow on a quiet job — the stream depends on heartbeats; a
    slipped heartbeat under load → idle-drop → hang.
  • Flaky / changing networks — laptop sleep/wake, VPN or wifi handoff kills
    the TCP connection without a clean close → the client never notices → hang.
  • Server rollout / restart while a stream is open.

Local / direct API servers (no proxy, low latency) rarely hit it.

Evidence

  • A real sky status was caught wedged ~20 min in a single call; kernel stack
    (/proc/<pid>/stack) stuck in tcp_recvmsg → blocked reading the response
    socket, with ps etime ~1229s.
  • The server/target were healthy — running the same command again on a fresh
    connection returned in <2s. So it was a single half-dead connection, not a slow
    server or a real problem with the query.
  • Deterministic repro (now a unit test — see below).

Fix

The server already sends the 30s heartbeat; the missing piece is the
client-side counterpart — a read timeout above the heartbeat interval. This
sets the streaming read timeout to 4× the heartbeat (120s):

  • On a healthy stream, each 30s heartbeat resets the read timer, so the 120s
    timeout never fires; there is no regression to live streaming.
  • A dead/stalled stream (no heartbeat for 120s) raises a RequestException
    (surfaces as ConnectionError("Read timed out") during iteration), which the
    streaming SDK functions already treat as transient (retry_transient_errors)
    and retry on a fresh connection — which, as shown above, succeeds.
  • Heartbeat-less long-poll endpoints (/api/get) are left unbounded, since
    they may legitimately block on a slow request with no intermediate output.
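
As a rough illustration of the rule in the bullets above, the timeout choice can be written as a small helper. This is a sketch only: the constant and function names are hypothetical, and the 10s connect timeout is an assumed value. Only the 30s heartbeat and the 4× margin come from this description.

```python
from typing import Optional, Tuple

# Hypothetical names. 30s and 4x are taken from the PR description.
HEARTBEAT_INTERVAL_SECONDS = 30
READ_TIMEOUT_MULTIPLIER = 4
CONNECT_TIMEOUT_SECONDS = 10  # Assumed value, for illustration only.


def stream_timeout(has_heartbeat: bool) -> Tuple[float, Optional[float]]:
    """Return a (connect, read) timeout pair in the style `requests` accepts.

    Streams that carry a heartbeat get a bounded read timeout; heartbeat-less
    endpoints keep an unbounded read (None) so they never false-fire.
    """
    if has_heartbeat:
        read = HEARTBEAT_INTERVAL_SECONDS * READ_TIMEOUT_MULTIPLIER
    else:
        read = None
    return (CONNECT_TIMEOUT_SECONDS, read)
```

Deriving the read timeout from the heartbeat constant, rather than hardcoding 120, keeps the two values from drifting apart if the heartbeat interval changes.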

Reproduction (deterministic; included as a unit test)

tests/unit_tests/test_sky/server/test_rest_stream_timeout.py starts a stub
server that sends 200 + one chunk then holds the socket open and silent
(mimicking a dropped-but-not-closed stream), and drives sky's real request
session at it:

read timeout = None  ->  blocks forever  (test asserts still blocked after 4s)  = the bug
read timeout = 2s    ->  raises "Read timed out" in ~2s (a retryable RequestException) = the fix

Plus an invariant test asserting the client read timeout stays above the
server heartbeat interval (so a healthy stream never trips it).
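
The shape of that repro can be sketched with the standard library alone. The code below is not the unit test from this PR: it uses raw sockets instead of sky's request session, and every name in it is hypothetical. A stub server sends one chunk and then holds the connection open and silent; a client with a bounded read timeout gives up, where an unbounded one would stay blocked in recv().

```python
import socket
import threading
from typing import Optional, Tuple


def start_silent_server() -> Tuple[int, threading.Event]:
    """Accept one connection, send one chunk, then stay open and silent."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # Port 0: let the OS pick a free port.
    listener.listen(1)
    port = listener.getsockname()[1]
    release = threading.Event()

    def serve() -> None:
        conn, _ = listener.accept()
        conn.sendall(b"first chunk\n")
        release.wait(timeout=10)  # Hold the socket open without sending.
        conn.close()
        listener.close()

    threading.Thread(target=serve, daemon=True).start()
    return port, release


def read_stream(port: int, read_timeout: Optional[float]) -> str:
    """Read the first chunk, then wait for a second one that never arrives."""
    with socket.create_connection(("127.0.0.1", port), timeout=5) as sock:
        sock.settimeout(read_timeout)  # None means block indefinitely.
        sock.recv(1024)  # The first chunk arrives promptly.
        try:
            data = sock.recv(1024)  # The stream is now silent.
        except socket.timeout:
            return "timed out"
        return "closed" if not data else "data"
```

Calling `read_stream(port, None)` against this server blocks until the server side releases the connection, which is the hang described in the Problem section.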

Why this layer (vs. alternatives)

  • Not a blanket default read timeout on all requests — it would break the
    long-poll /api/get and any slow synchronous endpoint.
  • Not TCP keepalive (SO_KEEPALIVE) — keepalive only detects a TCP-dead
    peer, not a proxy that holds the socket open while forwarding no data; a read
    timeout catches both, and it composes with the heartbeat the server already
    sends. (HTTP keep-alive / connection reuse is already on and is unrelated — it
    does not detect dead connections.)
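
To make the distinction concrete, the two mechanisms are separate settings on a socket. The helper below is a hypothetical sketch using only standard `socket` calls: SO_KEEPALIVE asks the kernel to probe the TCP peer, while the timeout bounds how long a blocking read waits for application data.

```python
import socket


def configure_socket(sock: socket.socket, read_timeout: float) -> None:
    """Enable TCP keepalive and a read timeout; they are independent knobs."""
    # Kernel-level probes: detect a peer that no longer answers at the TCP layer.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Application-level bound: a blocking recv() raises once this much time
    # passes with no data, even if the TCP peer still answers probes.
    sock.settimeout(read_timeout)
```

A proxy that keeps its side of the connection alive will answer keepalive probes, so only the second setting fires when it forwards no data.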

Scope

Applied to the retry-wrapped sync console-log streams (a ReadTimeout there
is retried on a fresh connection): tail_logs, tail_provision_logs,
stream_and_get (sky/client/sdk.py) and jobs.tail_logs
(sky/jobs/client/sdk.py).

Left for follow-up (same heartbeat streams, but not retry-wrapped today, so a
bare read timeout would surface instead of retrying): serve.tail_logs, the
async client streams (sky/client/sdk_async.py), and
jobs.download_logs_streaming. Each needs its own retry/re-attach first.
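
For readers unfamiliar with the pattern, a retry wrapper of the kind the deferred streams would need could look roughly like the sketch below. It is not SkyPilot's retry_transient_errors: the name and parameters are hypothetical, and the built-in ConnectionError and TimeoutError stand in for the `requests` exceptions a real client would catch.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_on_fresh_connection(
    open_and_read: Callable[[], T],
    max_attempts: int = 3,
    backoff_seconds: float = 0.0,
) -> T:
    """Call `open_and_read` until it succeeds or attempts run out.

    Each call is expected to open its own connection, so a retry never
    reuses the socket that just timed out.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return open_and_read()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error.
            time.sleep(backoff_seconds)
    raise ValueError("max_attempts must be >= 1")
```

Without a wrapper like this, a bounded read timeout turns a silent hang into a visible error, which is why those streams are deferred rather than changed here.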

Untouched on purpose: /api/get long-poll and format=plain log streams emit
no heartbeat, so a bounded read timeout there would false-fire.

Open questions for reviewers

  1. Is 4× the heartbeat (120s) the right margin under extreme load (too tight
    → false timeout + retry; too loose → slower detection)?
  2. Should the heartbeat interval be a shared constant so the client read
    timeout is provably coupled to it?
  3. Is deferring serve.tail_logs + the async client acceptable, or should this
    PR add the retry wrappers they need?

Test plan

  • Unit test: silent stream raises within the timeout instead of hanging; and
    would block forever without it.
  • Full k8s smoke suite (/smoke-test --kubernetes --no-resource-heavy) green,
    including sky logs/sky status/sky jobs logs streaming.
  • Manual: confirm sky logs --follow stays connected while idle (heartbeat
    keeps the 120s timer reset).

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request adds default timeout handling to the request_without_retry function in sky/server/rest.py to prevent requests from blocking indefinitely. The feedback suggests refactoring the timeout logic to avoid duplicate magic numbers and wrapping the comment lines to comply with PEP 8 line length limits.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread sky/server/rest.py Outdated
Comment on lines +372 to +379
# `requests` has no default timeout, so if the connection stalls during the
# TLS handshake or the response is dropped mid-flight (e.g. a proxy/CDN
# silently drops a long-lived connection), the call blocks forever. Set a
# connect timeout always. Only bound the read for non-streaming requests --
# streaming endpoints (e.g. log following) may legitimately stay idle for a
# long time, so they keep an unbounded read.
if 'timeout' not in kwargs:
    kwargs['timeout'] = (10, None) if kwargs.get('stream') else (10, 60)

Contributor


medium

The current implementation uses hardcoded magic numbers (10, 60) and repeats the connection timeout value (10) twice. Additionally, several comment lines exceed the PEP 8 recommended limit of 79 characters (and 72 characters for comments/docstrings).

We can improve readability, maintainability, and PEP 8 compliance by wrapping the comments to shorter lines and refactoring the timeout logic to use descriptive local variables that avoid duplication.

Suggested change
# `requests` has no default timeout, so if the connection stalls
# during the TLS handshake or the response is dropped mid-flight
# (e.g. a proxy/CDN silently drops a long-lived connection), the
# call blocks forever. Set a connect timeout always. Only bound
# the read for non-streaming requests -- streaming endpoints
# (e.g. log following) may legitimately stay idle for a long
# time, so they keep an unbounded read.
if 'timeout' not in kwargs:
    connect_timeout = 10
    read_timeout = None if kwargs.get('stream') else 60
    kwargs['timeout'] = (connect_timeout, read_timeout)
References
  1. PEP 8 recommends limiting all lines to a maximum of 79 characters, and limiting flowing long blocks of text (comments or docstrings) to 72 characters.

@zpoint zpoint force-pushed the fix/api-request-connection-timeout branch from f4280e6 to 30ac534 Compare June 30, 2026 11:30
@zpoint zpoint changed the title [client] Add connection timeout to API requests to avoid indefinite hangs [client] Bound read timeout on streaming API requests to detect dead connections Jun 30, 2026
@zpoint zpoint force-pushed the fix/api-request-connection-timeout branch from 30ac534 to 4079a7b Compare June 30, 2026 11:41
@zpoint

zpoint commented Jun 30, 2026

Collaborator Author

/smoke-test --kubernetes --no-resource-heavy

@zpoint zpoint force-pushed the fix/api-request-connection-timeout branch from 4079a7b to c942a6b Compare July 1, 2026 02:27
…connections

`sky status`/`sky logs`/`sky jobs logs` stream their results from the API
server's `/api/stream` (and `/logs`, `/provision_logs`, `/jobs/logs`) endpoints.
The client made these streaming reads with no read timeout
(`timeout=(connect, None)`), so when a streaming connection silently dies -- a
proxy/CDN drops the long-lived connection, or the peer goes away mid-stream --
the client blocks in recv() forever. We observed `sky status` wedged ~20 min in
a single call (kernel stack stuck in `tcp_recvmsg`), while a fresh `sky status`
returned instantly, confirming a dead connection rather than a slow server.

The server already sends a heartbeat every 30s on these streams
(`stream_utils.py:_HEARTBEAT_INTERVAL`, skypilot-org#5750) to keep them busy through
idle-timeout proxies. This adds the missing client-side counterpart: a read
timeout of 4x the heartbeat (120s). A healthy stream is reset by each 30s
heartbeat so the timeout never fires; a dead stream raises `ReadTimeout`, which
these readers already retry (`retry_transient_errors`) on a fresh connection.

Scope: the retry-wrapped sync console-log streams -- `tail_logs`,
`tail_provision_logs`, `stream_and_get` (sky/client/sdk.py) and `jobs.tail_logs`
(sky/jobs/client/sdk.py). Left for follow-up (each needs its own retry wrapper
first, else a timeout would surface instead of retry): `serve.tail_logs`, the
async client streams, and `jobs.download_logs_streaming`. Heartbeat-less
endpoints (`/api/get` long-poll, `format=plain` log streams) keep an unbounded
read, since they may legitimately block with no intermediate output.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@zpoint zpoint force-pushed the fix/api-request-connection-timeout branch from c942a6b to 3399d8e Compare July 1, 2026 02:28