Problem
nb execute in remote mode occasionally hangs indefinitely. NextGenKernelManager (from jupyter-server-documents) polls kernel_info_request every 100ms during its connect() startup handshake until the kernel reports idle. Each request produces a Status(Busy) → KernelInfoReply → Status(Idle) cycle on iopub, all delivered over the same multiplexed WebSocket as our execute messages. The kernel processes our execute_request on the shell channel (ExecuteReply arrives), but the corresponding iopub messages (ExecuteInput, Status(Idle) with our parent_header) are queued behind a backlog of kernel_info iopub messages. Reproduces when running several notebooks sequentially — the hang occurs on later notebooks when the kernel is fresh and the manager is polling during startup.
The execute loop's tokio::select! (nb-cli/src/execution/remote/mod.rs, lines 259 to 284 in 58b615b) reads one kernel message per iteration, interleaved with ydoc updates, and has no timeout arm:

    } else {
        tokio::select! {
            kernel_msg = ws.recv_message() => {
                if let Some(msg) = kernel_msg? {
                    let is_ours = msg.parent_header.as_ref()
                        .map(|h| h.msg_id == msg_id).unwrap_or(false);
                    if is_ours {
                        match &msg.content {
                            JupyterMessageContent::ExecuteInput(input) => {
                                expected_ec = Some(input.execution_count.0 as i64);
                            }
                            JupyterMessageContent::Status(status) => {
                                if matches!(status.execution_state,
                                    jupyter_protocol::ExecutionState::Idle) {
                                    idle_received = true;
                                }
                            }
                            _ => {}
                        }
                    }
                }
            }
            ydoc_result = ydoc.recv_update() => {
                ydoc_result.context("Y.js update error")?;
            }
        }

The polling runs at 10 req/s and can overlap with our execute_request if it lands while the startup handshake is still running. Our Status(Idle) remains queued behind kernel_info messages, and without a timeout the loop blocks indefinitely.
Trace from a hung execution:
cell=0 WS msg: ExecuteReply(...) parent=a1045bb4 is_ours=true ← our execute completed
cell=0 WS msg: KernelInfoReply(...) parent=91dad377 is_ours=false ← NextGenKernelManager polling
cell=0 WS msg: KernelInfoReply(...) parent=91dad377 is_ours=false
cell=0 WS msg: KernelInfoReply(...) parent=91dad377 is_ours=false
... (repeats for 20 seconds — Status(Idle) for a1045bb4 never reached)
When the kernel WebSocket closes, recv_message() returns Ok(None) to signal end-of-stream. The current code only handles the Some(msg) case, so None silently falls through and re-enters the loop. A closed WebSocket returns None immediately on every subsequent read, so the loop spins at full CPU until the process is killed (nb-cli/src/execution/remote/mod.rs, line 262 in 58b615b):

    if let Some(msg) = kernel_msg? {

The fallback path after the loop breaks returns ExecutionResult::success() unconditionally, even when the collected outputs contain errors.
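To make the gap concrete, here is a minimal sketch of the kind of check the fallback path skips. The `Output` enum and `first_error` function are hypothetical stand-ins; nb-cli's actual output type and the shape of `ExecutionResult` are not shown in this issue, so this only illustrates the idea of scanning collected outputs before reporting success.

```rust
/// Hypothetical, simplified view of a collected cell output.
/// Assumption: the real nb-cli type differs; only the error fields matter here.
#[derive(Debug)]
enum Output {
    /// Ordinary stream text; the payload is not inspected in this sketch.
    #[allow(dead_code)]
    Stream(String),
    /// An error output carrying the exception name and message.
    Error { ename: String, evalue: String },
}

/// Returns the first error found in the collected outputs, if any.
/// A fallback path could report failure when this returns `Some`.
fn first_error(outputs: &[Output]) -> Option<(&str, &str)> {
    outputs.iter().find_map(|output| match output {
        Output::Error { ename, evalue } => Some((ename.as_str(), evalue.as_str())),
        _ => None,
    })
}
```

With a check of this shape, the fallback would return success only when `first_error` yields `None`.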
Proposed solution
- Add a sleep_until(deadline) arm to the tokio::select! so the loop exits within the per-cell timeout regardless of iopub backlog depth
- Handle WebSocket close (None) by breaking out of the loop
- On the fallback path, collect remaining outputs from the Y.js document and check for errors instead of returning unconditional success
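
The first two points can be sketched without the async runtime. The example below is a simplified model, not the nb-cli implementation: it uses a standard-library channel in place of the WebSocket and `recv_timeout` in place of a `sleep_until(deadline)` arm, and the names `Msg`, `LoopExit`, and `wait_for_idle` are hypothetical. It shows the two exit conditions the current loop lacks: a deadline that holds no matter how many unrelated messages arrive, and a break on end-of-stream.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a kernel message: the parent msg_id and
/// whether the message is a Status(Idle).
#[derive(Debug, Clone)]
struct Msg {
    parent_id: String,
    is_idle: bool,
}

/// Why the wait loop ended.
#[derive(Debug, PartialEq)]
enum LoopExit {
    /// Status(Idle) for our request arrived.
    Idle,
    /// The stream ended (the analogue of recv_message() returning Ok(None)).
    Closed,
    /// The per-cell deadline passed first.
    TimedOut,
}

/// Waits for our idle message, bounded by `timeout` and by stream closure.
fn wait_for_idle(rx: &mpsc::Receiver<Msg>, msg_id: &str, timeout: Duration) -> LoopExit {
    let deadline = Instant::now() + timeout;
    loop {
        // Recompute the remaining budget every iteration, so a steady
        // backlog of unrelated messages cannot extend the wait.
        let remaining = deadline.saturating_duration_since(Instant::now());
        if remaining.is_zero() {
            return LoopExit::TimedOut;
        }
        match rx.recv_timeout(remaining) {
            Ok(msg) if msg.parent_id == msg_id && msg.is_idle => return LoopExit::Idle,
            // Not ours (for example kernel_info traffic): keep reading.
            Ok(_) => continue,
            // Sender gone and buffer drained: stop instead of spinning.
            Err(RecvTimeoutError::Disconnected) => return LoopExit::Closed,
            Err(RecvTimeoutError::Timeout) => return LoopExit::TimedOut,
        }
    }
}
```

A caller could treat `Closed` and `TimedOut` as the fallback path and inspect the collected outputs there, rather than assuming success.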