Fix execute hang in remote mode #89

Draft
andrii-i wants to merge 3 commits into jupyter-ai-contrib:main from andrii-i:execute-stall

@andrii-i

Collaborator

Fixes #87

Problem

nb execute in remote mode can hang indefinitely. The tokio::select! in the kernel message loop waits for WebSocket messages and Y.js updates but has no timeout arm, and the per-cell deadline is applied only after Status(Idle) arrives. If that message is missed (a race condition observed when nb output clear and nb execute run in quick succession), the loop blocks forever.

A secondary issue: when the kernel WebSocket closes, recv_message() returns Ok(None). The if let Some(msg) pattern silently skips None, re-enters the loop, and spins at 100% CPU since a closed socket returns None immediately on every read.

Changes

Added a timeout arm to tokio::select!. sleep_until(deadline) is now a select branch, so the loop always exits within the configured per-cell timeout whether or not Status(Idle) arrives. This follows the pattern the local executor already uses (timeout_at wrapping each recv).
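A minimal sketch of a deadline-bounded receive loop, written as a synchronous analogue that needs only the standard library. It is not the code in this PR: KernelMsg, LoopExit, and wait_for_idle are hypothetical names, and recv_timeout on a std channel stands in for the sleep_until(deadline) branch of tokio::select!.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Instant;

/// Hypothetical stand-in for kernel messages; not the PR's real type.
#[derive(Debug, PartialEq)]
pub enum KernelMsg {
    Busy,
    Idle,
}

/// Why the loop stopped.
#[derive(Debug, PartialEq)]
pub enum LoopExit {
    Idle,
    TimedOut,
    Closed,
}

/// Waits for `KernelMsg::Idle`, but never past `deadline`.
pub fn wait_for_idle(rx: &Receiver<KernelMsg>, deadline: Instant) -> LoopExit {
    loop {
        // Time left until the per-cell deadline; zero once it has passed.
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok(KernelMsg::Idle) => return LoopExit::Idle,
            // Any other message: keep waiting, still bounded by the deadline.
            Ok(_) => continue,
            // Analogue of the timeout arm: the deadline passed first.
            Err(RecvTimeoutError::Timeout) => return LoopExit::TimedOut,
            // All senders dropped, comparable to the socket closing.
            Err(RecvTimeoutError::Disconnected) => return LoopExit::Closed,
        }
    }
}
```

Because the remaining time is recomputed from a fixed deadline on every iteration, a stream of non-idle messages cannot extend the wait.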

Fixed WebSocket close handling. Replaced if let Some(msg) with a match that has an explicit None => break arm. On WS close the loop now exits into the fallback path instead of spinning.
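The close-handling change can be illustrated with a small mock. This is a sketch under assumptions, not the PR's code: MockSocket and drain are hypothetical, and only the Ok(None)-on-close shape of recv_message() is taken from the description above.

```rust
use std::collections::VecDeque;

/// Hypothetical mock of a socket: yields queued messages, then `Ok(None)`
/// on every read once drained, the way a closed connection would.
pub struct MockSocket {
    queued: VecDeque<String>,
    /// Number of reads performed, to show the loop does not keep polling.
    pub reads: usize,
}

impl MockSocket {
    pub fn new(msgs: &[&str]) -> Self {
        Self {
            queued: msgs.iter().map(|m| m.to_string()).collect(),
            reads: 0,
        }
    }

    /// Returns `Ok(Some(msg))` while messages remain, `Ok(None)` after close.
    pub fn recv_message(&mut self) -> Result<Option<String>, String> {
        self.reads += 1;
        Ok(self.queued.pop_front())
    }
}

/// Collects messages until the socket reports close.
pub fn drain(socket: &mut MockSocket) -> Result<Vec<String>, String> {
    let mut seen = Vec::new();
    loop {
        match socket.recv_message()? {
            Some(msg) => seen.push(msg),
            // Closed: leave the loop instead of reading again.
            None => break,
        }
    }
    Ok(seen)
}
```

With two queued messages, drain performs exactly three reads: two that return messages and one that observes the close. An if let Some(msg) in the same position would ignore the third result and read again indefinitely.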

Improved fallback path to collect outputs. Previously the fallback returned ExecutionResult::success unconditionally. Now it reads any remaining outputs from the Y.js document and uses from_outputs to detect errors, so a timed-out execution that produced an error still reports as failed.

Extracted ExecutionResult::from_outputs helper. Moved the error-detection scan (previously inlined in the happy path) into a reusable method on ExecutionResult. A single-pass find_map over outputs replaces the prior any() + find_map() double scan.
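A rough sketch of what a single-pass helper of this kind might look like. The Output enum, the fields of ExecutionResult, and the error message format are assumptions made for illustration; only the names ExecutionResult and from_outputs and the use of find_map come from the description above.

```rust
/// Hypothetical output variants, loosely modeled on notebook output types.
#[derive(Debug, Clone, PartialEq)]
pub enum Output {
    Stream { text: String },
    Error { ename: String, evalue: String },
}

/// Hypothetical result shape; the real struct may differ.
#[derive(Debug, PartialEq)]
pub struct ExecutionResult {
    pub success: bool,
    pub error: Option<String>,
    pub outputs: Vec<Output>,
}

impl ExecutionResult {
    /// Builds a result from collected outputs, failing if any is an error.
    pub fn from_outputs(outputs: Vec<Output>) -> Self {
        // One pass: stop at the first error output and format it.
        let error = outputs.iter().find_map(|o| match o {
            Output::Error { ename, evalue } => Some(format!("{}: {}", ename, evalue)),
            _ => None,
        });
        Self {
            success: error.is_none(),
            error,
            outputs,
        }
    }
}
```

Since find_map both detects the error and extracts its message, the separate any() check becomes unnecessary, and the happy path and the fallback path can share the same logic.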

Labels

bug Something isn't working

Development

Successfully merging this pull request may close these issues.

nb execute occasionally hangs in remote mode
