nb execute occasionally hangs in remote mode #87

@andrii-i

Problem

nb execute in remote mode occasionally hangs indefinitely. NextGenKernelManager (from jupyter-server-documents) polls kernel_info_request every 100ms during its connect() startup handshake until the kernel reports idle. Each request produces a Status(Busy) → KernelInfoReply → Status(Idle) cycle on iopub, all delivered over the same multiplexed WebSocket as our execute messages. The kernel processes our execute_request on the shell channel (ExecuteReply arrives), but the corresponding iopub messages (ExecuteInput, Status(Idle) with our parent_header) are queued behind a backlog of kernel_info iopub messages. The hang reproduces when running several notebooks sequentially: it occurs on later notebooks, when the kernel is fresh and the manager is still polling during startup.

The execute loop's tokio::select! reads one kernel message per iteration, interleaved with ydoc updates, and has no timeout arm:

    } else {
        tokio::select! {
            kernel_msg = ws.recv_message() => {
                if let Some(msg) = kernel_msg? {
                    let is_ours = msg.parent_header.as_ref()
                        .map(|h| h.msg_id == msg_id).unwrap_or(false);
                    if is_ours {
                        match &msg.content {
                            JupyterMessageContent::ExecuteInput(input) => {
                                expected_ec = Some(input.execution_count.0 as i64);
                            }
                            JupyterMessageContent::Status(status) => {
                                if matches!(status.execution_state,
                                    jupyter_protocol::ExecutionState::Idle) {
                                    idle_received = true;
                                }
                            }
                            _ => {}
                        }
                    }
                }
            }
            ydoc_result = ydoc.recv_update() => {
                ydoc_result.context("Y.js update error")?;
            }
        }

The polling runs at 10 req/s, so it overlaps with our execute_request whenever that request lands while the startup handshake is still running. Our Status(Idle) then remains queued behind kernel_info messages, and without a timeout the loop blocks indefinitely.
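
The effect of a deadline can be illustrated without tokio. The sketch below is a synchronous stand-in, not the project's code: a `std::sync::mpsc::Receiver` plays the role of the kernel WebSocket, `Msg`, `WaitOutcome` and `wait_for_idle` are hypothetical names, and `recv_timeout` bounded by a fixed deadline plays the role that a `sleep_until(deadline)` arm would play inside `tokio::select!`. It shows the loop terminating even when only foreign messages keep arriving.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Hypothetical, simplified message: only the fields the loop inspects.
#[derive(Debug, Clone)]
pub struct Msg {
    pub parent_id: String,
    pub is_idle: bool,
}

/// Hypothetical outcome type for the wait loop.
#[derive(Debug, PartialEq)]
pub enum WaitOutcome {
    Idle,
    TimedOut,
    Closed,
}

/// Reads messages until an idle status for `msg_id` arrives, the per-cell
/// deadline passes, or the sending side goes away.
pub fn wait_for_idle(rx: &mpsc::Receiver<Msg>, msg_id: &str, timeout: Duration) -> WaitOutcome {
    // The deadline is fixed once, so foreign messages cannot extend it.
    let deadline = Instant::now() + timeout;
    loop {
        // Remaining budget; zero once the deadline has passed.
        let remaining = deadline.saturating_duration_since(Instant::now());
        if remaining.is_zero() {
            return WaitOutcome::TimedOut;
        }
        match rx.recv_timeout(remaining) {
            // Our own idle status: the cell has finished.
            Ok(msg) if msg.parent_id == msg_id && msg.is_idle => return WaitOutcome::Idle,
            // Foreign or non-idle message: keep reading.
            Ok(_) => continue,
            Err(RecvTimeoutError::Timeout) => return WaitOutcome::TimedOut,
            Err(RecvTimeoutError::Disconnected) => return WaitOutcome::Closed,
        }
    }
}
```

Computing the deadline once, outside the loop, is the important design choice: a per-read timeout would be reset by every kernel_info message and would never fire under a steady backlog.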

Trace from a hung execution:

    cell=0 WS msg: ExecuteReply(...)      parent=a1045bb4 is_ours=true    ← our execute completed
    cell=0 WS msg: KernelInfoReply(...)   parent=91dad377 is_ours=false   ← NextGenKernelManager polling
    cell=0 WS msg: KernelInfoReply(...)   parent=91dad377 is_ours=false
    cell=0 WS msg: KernelInfoReply(...)   parent=91dad377 is_ours=false
    ...  (repeats for 20 seconds — Status(Idle) for a1045bb4 never reached)

When the kernel WebSocket closes, recv_message() returns Ok(None) to signal end-of-stream. The current code only handles the Some(msg) case. None silently falls through and re-enters the loop. A closed WebSocket returns None immediately on every subsequent read, so the loop spins at full CPU until the process is killed:

    if let Some(msg) = kernel_msg? {
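
A minimal sketch of handling end-of-stream explicitly, under stated assumptions: `FakeStream`, `LoopExit` and `drain_until_idle` are hypothetical names, messages are reduced to plain strings, and `FakeStream::recv_message` only imitates the described behavior of returning `Ok(None)` on every read once the stream is exhausted. Matching on all three cases lets the loop stop on `None` instead of re-entering.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for the kernel stream: yields queued messages,
/// then `Ok(None)` on every later read, like a closed WebSocket.
pub struct FakeStream {
    queued: VecDeque<String>,
    /// Number of reads performed, used to show that the loop does not spin.
    pub reads: usize,
}

impl FakeStream {
    pub fn new(msgs: &[&str]) -> Self {
        FakeStream {
            queued: msgs.iter().map(|m| m.to_string()).collect(),
            reads: 0,
        }
    }

    pub fn recv_message(&mut self) -> Result<Option<String>, String> {
        self.reads += 1;
        Ok(self.queued.pop_front())
    }
}

/// Hypothetical reason for leaving the loop.
#[derive(Debug, PartialEq)]
pub enum LoopExit {
    IdleReceived,
    StreamClosed,
}

pub fn drain_until_idle(ws: &mut FakeStream) -> Result<LoopExit, String> {
    loop {
        match ws.recv_message()? {
            Some(msg) if msg == "idle" => return Ok(LoopExit::IdleReceived),
            // Unrelated message: keep reading.
            Some(_) => {}
            // End-of-stream: leave the loop instead of reading again.
            None => return Ok(LoopExit::StreamClosed),
        }
    }
}
```

With two unrelated messages queued, the loop performs exactly three reads (two messages plus the first `None`) and then returns `StreamClosed`, so the caller can decide how to report the closed connection.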

The fallback path taken after the loop exits returns ExecutionResult::success() unconditionally, even when the collected outputs contain errors.
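
One way the fallback could derive its result from the outputs is sketched below. Everything here is an assumption for illustration: `Output`, `FallbackResult` and `result_from_outputs` are hypothetical names, not the project's types, and outputs are reduced to a stream variant plus an error variant carrying the `ename` and `evalue` fields that Jupyter error outputs have.

```rust
/// Hypothetical, simplified cell output; real outputs carry more variants and fields.
#[derive(Debug, Clone, PartialEq)]
pub enum Output {
    Stream(String),
    Error { ename: String, evalue: String },
}

/// Hypothetical result type standing in for the real execution result.
#[derive(Debug, PartialEq)]
pub enum FallbackResult {
    Success,
    Failed { ename: String, evalue: String },
}

/// Reports the first error found in the collected outputs,
/// and success only when there is none.
pub fn result_from_outputs(outputs: &[Output]) -> FallbackResult {
    outputs
        .iter()
        .find_map(|output| match output {
            Output::Error { ename, evalue } => Some(FallbackResult::Failed {
                ename: ename.clone(),
                evalue: evalue.clone(),
            }),
            _ => None,
        })
        .unwrap_or(FallbackResult::Success)
}
```

An empty output list still maps to success here; whether a timed-out cell with no outputs should count as success is a separate decision this sketch does not make.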

Proposed solution

  • Add a sleep_until(deadline) arm to the tokio::select! so the loop exits within the per-cell timeout regardless of iopub backlog depth
  • Handle WebSocket close (None) by breaking out of the loop
  • On the fallback path, collect remaining outputs from the Y.js document and check for errors instead of returning unconditional success

Metadata

Assignees: No one assigned
Labels: bug (Something isn't working)
Type: No type
Projects status: In progress
Milestone: No milestone
Relationships: None yet
Development: No branches or pull requests