Skip to content

Commit c942a6b

Browse files
zpointclaude
andcommitted
[client] Bound read timeout on streaming API requests to detect dead connections
`sky status`/`sky logs`/`sky jobs logs` stream their results from the API server's `/api/stream` (and `/logs`, `/provision_logs`, `/jobs/logs`) endpoints. The client made these streaming reads with no read timeout (`timeout=(connect, None)`), so when a streaming connection silently dies -- a proxy/CDN drops the long-lived connection, or the peer goes away mid-stream -- the client blocks in recv() forever. We observed `sky status` wedged ~20 min in a single call (kernel stack stuck in `tcp_recvmsg`), while a fresh `sky status` returned instantly, confirming a dead connection rather than a slow server. The server already sends a heartbeat every 30s on these streams (`stream_utils.py:_HEARTBEAT_INTERVAL`, #5750) to keep them busy through idle-timeout proxies. This adds the missing client-side counterpart: a read timeout of 4x the heartbeat (120s). A healthy stream is reset by each 30s heartbeat so the timeout never fires; a dead stream raises `ReadTimeout`, which these readers already retry (`retry_transient_errors`) on a fresh connection. Scope: the retry-wrapped sync console-log streams -- `tail_logs`, `tail_provision_logs`, `stream_and_get` (sky/client/sdk.py) and `jobs.tail_logs` (sky/jobs/client/sdk.py). Left for follow-up (each needs its own retry wrapper first, else a timeout would surface instead of retry): `serve.tail_logs`, the async client streams, and `jobs.download_logs_streaming`. Heartbeat-less endpoints (`/api/get` long-poll, `format=plain` log streams) keep an unbounded read, since they may legitimately block with no intermediate output. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent 7ee947c commit c942a6b

4 files changed

Lines changed: 155 additions & 642 deletions

File tree

sky/client/common.py

Lines changed: 12 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -63,6 +63,18 @@
6363
# Connection timeout when sending requests to the API server.
6464
API_SERVER_REQUEST_CONNECTION_TIMEOUT_SECONDS = 5
6565

66+
# Read timeout (seconds allowed between bytes from the server) for streaming
67+
# requests to the API server's heartbeat-emitting endpoints (e.g. /api/stream
68+
# log following). The server sends a heartbeat every 30s on these streams
69+
# (sky/server/stream_utils.py:_HEARTBEAT_INTERVAL, added for #5750) to keep the
70+
# connection busy through idle-timeout proxies/CDNs. This client-side read
71+
# timeout is the counterpart: set well above the 30s heartbeat so a healthy
72+
# stream never trips it, but a dead/stalled connection (no heartbeat received
73+
# for this long) raises ReadTimeout and is retried on a fresh connection instead
74+
# of blocking forever. NOT applied to heartbeat-less long-poll endpoints (e.g.
75+
# /api/get), which may legitimately block on a slow request with no output.
76+
API_SERVER_REQUEST_STREAM_READ_TIMEOUT_SECONDS = 120 # 4x the 30s heartbeat
77+
6678

6779
def download_logs_from_api_server(
6880
paths_on_api_server: Iterable[str],

sky/client/sdk.py

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1165,7 +1165,7 @@ def tail_logs(
11651165
json=json.loads(body.model_dump_json()),
11661166
stream=True,
11671167
timeout=(client_common.API_SERVER_REQUEST_CONNECTION_TIMEOUT_SECONDS,
1168-
None))
1168+
client_common.API_SERVER_REQUEST_STREAM_READ_TIMEOUT_SECONDS))
11691169
request_id: server_common.RequestId[int] = server_common.get_request_id(
11701170
response)
11711171
if preload_content:
@@ -1225,7 +1225,7 @@ def tail_provision_logs(cluster_name: str,
12251225
params=params,
12261226
stream=True,
12271227
timeout=(client_common.API_SERVER_REQUEST_CONNECTION_TIMEOUT_SECONDS,
1228-
None))
1228+
client_common.API_SERVER_REQUEST_STREAM_READ_TIMEOUT_SECONDS))
12291229
# Check for HTTP errors before streaming the response
12301230
if response.status_code != 200:
12311231
with ux_utils.print_exception_no_traceback():
@@ -2442,7 +2442,7 @@ def stream_and_get(
24422442
params=params,
24432443
retry=False,
24442444
timeout=(client_common.API_SERVER_REQUEST_CONNECTION_TIMEOUT_SECONDS,
2445-
None),
2445+
client_common.API_SERVER_REQUEST_STREAM_READ_TIMEOUT_SECONDS),
24462446
stream=True)
24472447
if response.status_code in [404, 400]:
24482448
detail = response.json().get('detail')

sky/jobs/client/sdk.py

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -543,7 +543,8 @@ def tail_logs(
543543
'/jobs/logs',
544544
json=json.loads(body.model_dump_json()),
545545
stream=True,
546-
timeout=(5, None))
546+
timeout=(client_common.API_SERVER_REQUEST_CONNECTION_TIMEOUT_SECONDS,
547+
client_common.API_SERVER_REQUEST_STREAM_READ_TIMEOUT_SECONDS))
547548
request_id: server_common.RequestId[int] = server_common.get_request_id(
548549
response)
549550
if preload_content:

0 commit comments

Comments
 (0)