Queue sandbox jobs with client polling to avoid overload timeouts #939
base: main
Conversation
Signed-off-by: i-vainn <[email protected]>
Walkthrough

Adds asynchronous job orchestration to the local sandbox server: introduces a Job dataclass and a serialized single-worker JobManager, routes /execute through job submission returning job_id (HTTP 202), adds job retrieval, cancellation, session list/delete, and admin reset endpoints, plus client-side polling with session-affinity handling.
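For orientation, here is a minimal sketch of the shapes the walkthrough describes. The `Job` and `JobManager` names come from the walkthrough; the field set beyond `job_id`/`status` and the exact `submit` signature are assumptions for illustration, not the PR's code:

```python
import time
import uuid
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from threading import Lock
from typing import Optional


@dataclass
class Job:
    job_id: str
    status: str = "queued"  # queued -> running -> completed / error / canceled
    result: Optional[dict] = None
    session_id: Optional[str] = None
    created_at: float = field(default_factory=time.time)


class JobManager:
    def __init__(self):
        self.jobs = {}
        self.lock = Lock()
        # a single worker thread means jobs execute strictly one at a time
        self.executor = ThreadPoolExecutor(max_workers=1)

    def submit(self, fn, *args, session_id=None) -> str:
        job = Job(job_id=uuid.uuid4().hex, session_id=session_id)
        with self.lock:
            self.jobs[job.job_id] = job
        self.executor.submit(self._run, job, fn, *args)
        return job.job_id

    def _run(self, job, fn, *args):
        with self.lock:
            job.status = "running"
        try:
            result = fn(*args)
            with self.lock:
                job.status, job.result = "completed", result
        except Exception as e:  # report failures through the job record
            with self.lock:
                job.status, job.result = "error", {"stderr": str(e)}
```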
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
participant Client
participant Server as LocalSandboxServer
participant JM as JobManager
participant Worker as SingleWorker
rect rgba(230,240,255,0.6)
Note over Client,Server: Submit code (may include X-Session-ID)
Client->>Server: POST /execute
Server->>JM: submit(job)
JM-->>Server: {job_id, queued_ahead}
Server-->>Client: 202 Accepted (job_id, queued_ahead)
end
rect rgba(240,255,240,0.6)
Note over Client,Server: Poll job status until terminal state
loop poll
Client->>Server: GET /jobs/{job_id} (X-Session-ID)
Server->>JM: get(job_id)
JM-->>Server: job record (queued/running/completed/error/canceled)
Server-->>Client: 200 OK job record
end
end
rect rgba(255,245,230,0.6)
Note over JM,Worker: Serialized execution by single worker
JM->>Worker: run(next_job)
Worker-->>JM: update job status (running -> completed/error)
JM-->>Server: job result available
end
```
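The client half of this diagram, as a rough sketch. The request payload key, base URL, and helper name are assumptions; the terminal-status set and the 0.2 s poll interval mirror the diff excerpts quoted later in this thread:

```python
import asyncio

import httpx

TERMINAL = {"completed", "timeout", "error", "failed", "canceled"}


async def execute_and_poll(base_url, code, session_id=None):
    headers = {"X-Session-ID": session_id} if session_id else {}
    async with httpx.AsyncClient(base_url=base_url) as client:
        # 1. submit: the server answers 202 with job_id and queued_ahead
        resp = await client.post("/execute", json={"generated_code": code}, headers=headers)
        job_id = resp.json()["job_id"]
        # 2. poll the job record until it reaches a terminal state
        while True:
            resp = await client.get(f"/jobs/{job_id}", headers=headers)
            payload = resp.json()
            if payload.get("status") in TERMINAL:
                return payload.get("result") or {}
            await asyncio.sleep(0.2)
```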
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
Various comments and questions -- some to verify intended behavior; in other places I think some additional logging will be helpful for future debugging.
```python
class JobManager:
    def __init__(self, shell_manager):
        self.jobs = {}
```
A potential issue with keeping this all in memory, is if the worker stops responding or crashes for some reason, all of your state disappears with the worker. Might not be something we need to fix right now, but if we start having issues of workers stopping/crashing, this will be a crux.
It's not clear to me if we need something more robust yet, though.
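If durability ever becomes a problem, one low-effort mitigation is to snapshot job metadata (not results) to disk so a restarted worker can report previously known jobs as lost instead of answering 404. A hypothetical sketch; the path, schema, and "lost" status are invented for illustration:

```python
import json
import os
from threading import Lock

SNAPSHOT_PATH = "/tmp/sandbox_jobs.json"  # hypothetical location


def snapshot_jobs(jobs: dict, lock: Lock) -> None:
    """Dump lightweight job metadata to disk, atomically."""
    with lock:
        state = {jid: {"status": j.status, "created_at": j.created_at} for jid, j in jobs.items()}
    tmp = SNAPSHOT_PATH + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, SNAPSHOT_PATH)  # atomic on POSIX


def load_snapshot() -> dict:
    """On startup, mark jobs from a previous run as lost rather than forgetting them."""
    try:
        with open(SNAPSHOT_PATH) as f:
            return {jid: {**meta, "status": "lost"} for jid, meta in json.load(f).items()}
    except FileNotFoundError:
        return {}
```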
```python
if p.get("status") in {"completed", "timeout", "error", "failed", "canceled"}:
    r = p.get("result") or {}
    return r if "process_status" in r else {**r, "process_status": p.get("status", "error")}
await asyncio.sleep(0.2)
```
log session id? maybe 0.2 is too often to log that though, if you want to log it every so often? Will be helpful for tracing timeouts.
```python
    continue
p = resp.json()
if p.get("status") in {"completed", "timeout", "error", "failed", "canceled"}:
    r = p.get("result") or {}
```
log session id? maybe 0.2 is too often to log that though, if you want to log it every so often? Will be helpful for tracing timeouts.
also, if you follow suggestion to return pid, log and pop it here alongside the session id
Good point. I’d prefer to handle that on the server side, though, so it goes into sandbox.log instead of main.log. That way, we keep the main logs clean and avoid excessive noise from frequent polling.
yeah I mean either way I would probably increase the time between the status poll logging, and probably also log when you update the state
Added logging on state update. Not sure if we want to log each polling request though
Signed-off-by: i-vainn <[email protected]>
Actionable comments posted: 0
♻️ Duplicate comments (4)
nemo_skills/code_execution/sandbox.py (3)
130-139: Document the timeout condition and consider canceling the job.

The timeout condition `time.monotonic() > deadline` should be rare under normal operation (only if `timeout * queued_ahead > 1200` or if the server is unresponsive). However, when it occurs, the job may still be running on the server. As noted in past reviews:
- Consider canceling the job via POST to `/jobs/{job_id}/cancel` before resetting the worker, since the client will never retrieve the result.
- The condition "should never happen under normal circumstances" warrants clarification: it can happen if the queue is very long or if the server becomes unresponsive. If timeouts become frequent, this may indicate a need to scale workers or investigate server health rather than continuing to send requests.
Based on past review comments.
156-156: Consider reducing polling frequency or throttling logs.

Polling every 0.2 seconds can generate excessive logs if session_id logging is added. Past reviews suggested that 0.2 seconds might be too frequent for logging each poll attempt.
Consider one of the following:
- Increase the polling interval (e.g., 0.5 or 1.0 seconds) if responsiveness allows.
- Log session_id only every N attempts (e.g., every 5 seconds) rather than on every poll (a sketch follows after this list).
Based on past review comments.
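A sketch of the second option: keep the 0.2 s poll but throttle the log line. The interval values are illustrative, and `get_status` stands in for the real HTTP request:

```python
import asyncio
import logging
import time

LOG = logging.getLogger(__name__)
TERMINAL = {"completed", "timeout", "error", "failed", "canceled"}


async def poll_with_throttled_logs(get_status, job_id, session_id, poll_interval=0.2, log_every=5.0):
    """Poll every poll_interval seconds, but log at most once per log_every seconds."""
    last_logged = time.monotonic()
    while True:
        status = await get_status(job_id)
        if status in TERMINAL:
            return status
        now = time.monotonic()
        if now - last_logged >= log_every:
            LOG.info("still polling job %s (session_id=%s, status=%s)", job_id, session_id, status)
            last_logged = now
        await asyncio.sleep(poll_interval)
```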
145-146: Log the actual status code received.

When polling fails with a non-200 status, log the status code to aid debugging. Past reviews noted this "shouldn't really happen" but if it does, detailed logging is essential.
Apply this diff:
```diff
  if getattr(resp, "status_code", 200) != 200:
-     raise RuntimeError(f"Error during polling job {job_id}: {resp.text}")
+     LOG.error("Polling job %s failed with status %d: %s", job_id, resp.status_code, resp.text)
+     raise RuntimeError(f"Error during polling job {job_id} (status {resp.status_code}): {resp.text}")
```

Based on past review comments.
nemo_skills/code_execution/local_sandbox/local_sandbox_server.py (1)
77-85: Job eviction after first retrieval may cause data loss.

Evicting finished jobs immediately after returning them (lines 83-84) means a client can only retrieve a job result once. If the client's GET request times out or fails, the result is permanently lost.
Consider one of the following:
- Keep finished jobs in memory for a grace period (e.g., 60 seconds) before eviction.
- Add a flag to track whether a job has been successfully retrieved (e.g., a `retrieved_at` timestamp; sketched below).
- Document this behavior clearly so clients know to implement retries with idempotency.
Past reviews noted: "you probably want to distinguish whether it is a finished job or a queued/incomplete job. If finished, it should probably be evicted after it is returned in order to save memory. This will also mean finished jobs can only be 'gotten' once."
Based on past review comments.
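A sketch combining the first two options above: a `retrieved_at` mark plus a grace window before eviction. The 60-second window and dict-based job records are illustrative, not the PR's code:

```python
import time
from threading import Lock

GRACE_SECONDS = 60.0  # assumed grace window


def get_job(jobs: dict, lock: Lock, job_id: str):
    """Return a job record; evict finished jobs only after a grace period."""
    with lock:
        now = time.time()
        # evict records first retrieved more than GRACE_SECONDS ago
        expired = [jid for jid, j in jobs.items()
                   if j.get("retrieved_at") and now - j["retrieved_at"] > GRACE_SECONDS]
        for jid in expired:
            jobs.pop(jid)
        job = jobs.get(job_id)
        if job is not None and job.get("status") not in ("queued", "running"):
            job.setdefault("retrieved_at", now)  # mark first retrieval, keep until grace lapses
        return job
```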
🧹 Nitpick comments (5)
nemo_skills/code_execution/sandbox.py (2)
102-107: Consider explicit None-check for session_id.

The `or` operator will cause `_affinity_key` to be used if `session_id` is any falsy value (empty string, 0, False, etc.), not just None. If this is intentional, consider adding a comment to clarify the behavior. Apply this diff for more explicit behavior:

```diff
- affinity_header = request.pop("session_id", None) or request.pop("_affinity_key", None)
+ affinity_header = request.pop("session_id", None)
+ if affinity_header is None:
+     affinity_header = request.pop("_affinity_key", None)
```
120-120: Consider using `raise ... from err` for exception chaining.

Static analysis suggests using `raise ... from err` to preserve the exception chain, which helps with debugging. Apply this diff:

```diff
- except json.JSONDecodeError:
-     raise RuntimeError(f"Error during parsing output: {output.text}")
+ except json.JSONDecodeError as err:
+     raise RuntimeError(f"Error during parsing output: {output.text}") from err
```

nemo_skills/code_execution/local_sandbox/local_sandbox_server.py (3)
192-194: Use `logging.exception` for better traceback capture.

When catching a broad Exception in job execution, use `logging.exception` instead of `logging.error` to automatically include the full traceback. Apply this diff:

```diff
  except Exception as e:
-     logging.error(f"Error during job execution: {e}\n{traceback.format_exc()}")
+     logging.exception("Error during job execution: %s", e)
      return {"process_status": "error", "stderr": f"An unexpected error occurred: {e}"}
```

This eliminates the need for manual `traceback.format_exc()` calls.
284-310: Simplify nested try-except blocks in stop_shell.

The multiple nested try-except blocks (lines 292-306) that silently ignore all exceptions make debugging difficult. While defensive coding is good, consider consolidating or logging suppressed exceptions.
Apply this diff to consolidate the cleanup logic:
```diff
-        try:
-            try:
-                conn.send({"cmd": "shutdown"})
-            except Exception:
-                pass
-            try:
-                conn.close()
-            except Exception:
-                pass
-            try:
-                proc.terminate()
-            except Exception:
-                try:
-                    os.kill(proc.pid, signal.SIGKILL)
-                except Exception:
-                    pass
-            proc.join(timeout=2.0)
-            return True, f"IPython session {shell_id} deleted successfully"
-        except Exception as e:
-            return False, f"Error stopping IPython session {shell_id}: {e}"
+        try:
+            # Best-effort shutdown sequence
+            for action, fn in [
+                ("send shutdown", lambda: conn.send({"cmd": "shutdown"})),
+                ("close conn", lambda: conn.close()),
+                ("terminate", lambda: proc.terminate()),
+            ]:
+                try:
+                    fn()
+                except Exception as exc:
+                    logging.debug("stop_shell %s: %s failed: %s", shell_id, action, exc)
+
+            # Last resort: SIGKILL
+            try:
+                if proc.is_alive():
+                    os.kill(proc.pid, signal.SIGKILL)
+            except Exception as exc:
+                logging.debug("stop_shell %s: SIGKILL failed: %s", shell_id, exc)
+
+            proc.join(timeout=2.0)
+            return True, f"IPython session {shell_id} deleted successfully"
+        except Exception as e:
+            return False, f"Error stopping IPython session {shell_id}: {e}"
```

This makes the cleanup sequence more readable and logs suppressed exceptions at DEBUG level for troubleshooting.
877-878: Log cancellation results for related jobs.

The code cancels related jobs (lines 877-878) but doesn't check or log the results. This could make debugging difficult if related job cancellation fails.
Apply this diff:
```diff
  for related_job_id in job_manager.get_related_jobs(job_id):
      ok, msg = job_manager.cancel_job(related_job_id)
+     if ok:
+         logging.info("Canceled related job %s: %s", related_job_id, msg)
+     else:
+         logging.warning("Failed to cancel related job %s: %s", related_job_id, msg)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- nemo_skills/code_execution/local_sandbox/local_sandbox_server.py (6 hunks)
- nemo_skills/code_execution/sandbox.py (3 hunks)
🧰 Additional context used
🪛 Ruff (0.14.0)
nemo_skills/code_execution/local_sandbox/local_sandbox_server.py
191-191: Consider moving this statement to an else block
(TRY300)
192-192: Do not catch blind exception: Exception
(BLE001)
193-193: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
294-295: try-except-pass detected, consider logging the exception
(S110)
294-294: Do not catch blind exception: Exception
(BLE001)
298-299: try-except-pass detected, consider logging the exception
(S110)
298-298: Do not catch blind exception: Exception
(BLE001)
302-302: Do not catch blind exception: Exception
(BLE001)
305-306: try-except-pass detected, consider logging the exception
(S110)
305-305: Do not catch blind exception: Exception
(BLE001)
308-308: Consider moving this statement to an else block
(TRY300)
309-309: Do not catch blind exception: Exception
(BLE001)
864-864: Consider moving this statement to an else block
(TRY300)
865-865: Do not catch blind exception: Exception
(BLE001)
866-866: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
nemo_skills/code_execution/sandbox.py
120-120: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
120-120: Avoid specifying long messages outside the exception class
(TRY003)
146-146: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: pre-commit
- GitHub Check: unit-tests
🔇 Additional comments (3)
nemo_skills/code_execution/sandbox.py (1)
226-280: LGTM on session restoration logic!

The restoration flow properly:
- Propagates session affinity via `_affinity_key` (line 245)
- Handles restoration failures by cleaning up the remote session
- Returns a clear error message to the user
- Avoids re-triggering cells that previously errored or timed out
nemo_skills/code_execution/local_sandbox/local_sandbox_server.py (2)
786-795: LGTM on job submission endpoint!

The execute endpoint properly:
- Submits jobs to JobManager
- Returns 202 Accepted with job_id for polling
- Includes queued_ahead count for client deadline estimation (see the sketch below)
- Logs pid and worker_id for tracing timeouts (addressing past review comments)
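For reference, one way a client can turn `queued_ahead` into a poll deadline. The per-job budget formula and the 1200 s cap (the figure quoted earlier in this review) are assumptions about the intended arithmetic:

```python
import time


def poll_deadline(timeout_per_job: float, queued_ahead: int, cap: float = 1200.0) -> float:
    """Monotonic deadline: each job ahead of ours gets a full timeout budget, capped."""
    budget = timeout_per_job * (queued_ahead + 1)
    return time.monotonic() + min(budget, cap)
```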
844-867: LGTM on reset_worker endpoint!

The delayed exit pattern (lines 856-862) ensures the HTTP response is delivered before the worker terminates. This is a good approach for administrative reset.
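The delayed-exit pattern reduced to its essentials, as a hedged sketch (Flask-style handler; the 0.5 s delay, exit code, and response body are assumptions, not the PR's exact code):

```python
import os
import threading
import time

from flask import Flask, jsonify

app = Flask(__name__)


@app.route("/admin/reset_worker", methods=["POST"])
def reset_worker():
    def _exit_soon():
        time.sleep(0.5)  # give the server time to flush the HTTP response
        os._exit(1)      # the process supervisor is expected to restart the worker

    threading.Thread(target=_exit_soon, daemon=True).start()
    return jsonify({"status": "resetting"}), 200
```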
Signed-off-by: i-vainn <[email protected]>
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_skills/code_execution/sandbox.py (1)
441-446: delete_session bypasses SSH path; fails for SSH-tunneled sandboxes

This calls AsyncClient directly; it won't work when using ssh_server/ssh_key_path.

```diff
- response = await self.http_session.delete(
-     url=f"http://{self.host}:{self.port}/sessions/{session_id}",
-     timeout=10.0,
-     headers={"X-Session-ID": session_id},
- )
+ response = await self._request(
+     "delete",
+     f"http://{self.host}:{self.port}/sessions/{session_id}",
+     timeout=3.0,
+     headers={"X-Session-ID": session_id},
+ )
```

Also reduced timeout for this fast control call, per prior feedback.
🧹 Nitpick comments (7)
nemo_skills/code_execution/sandbox.py (7)
109-116: Initial POST timeout is high; make this a short, strict call

The execute POST should return fast (202 or result). Use a smaller timeout to surface server issues quickly and avoid tying up client connections.

```diff
  output = await self._request(
      "post",
      self._get_execute_url(),
      content=payload,
-     timeout=10.0,
+     timeout=2.0,
      headers={"Content-Type": "application/json", **extra_headers},
  )
```

Past review asked for shorter timeouts for these fast paths.
125-142: Cancel job before hard worker reset; avoid immediate destructive action

On poll deadline, first try canceling the job; only reset the worker if cancel fails. This uses the new cancel endpoint and reduces recovery blast-radius.

```diff
  if time.monotonic() > deadline:
      # TODO: if the worker is healthy, instead of resetting it, we should try to restart the job
      # maybe reset the deadline and log retry attempt
      LOG.error(
          "Client timed out polling job %s; issuing hard worker reset (session_id=%s)",
          job_id,
          affinity_header,
      )
-     _ = await self._request("post", self._get_reset_worker_url(), timeout=5.0, headers=extra_headers)
+     try:
+         # Best-effort cancel
+         await self._request("post", self._get_cancel_job_url(job_id), timeout=2.0, headers=extra_headers)
+     except Exception as e:
+         LOG.warning("Cancel job %s failed before reset: %s", job_id, e)
+     # Fall back to hard reset
+     _ = await self._request("post", self._get_reset_worker_url(), timeout=3.0, headers=extra_headers)
      raise httpx.TimeoutException("Client poll deadline exceeded")
```

Also consider exponential backoff and telemetry on repeated deadline hits.
117-121: Exception chaining for JSON decode; satisfy Ruff B904

Preserve the original decode error context.

```diff
- try:
-     response_json = output.json()
- except json.JSONDecodeError:
-     raise RuntimeError(f"Error during parsing output: {output.text}")
+ try:
+     response_json = output.json()
+ except json.JSONDecodeError as exc:
+     raise RuntimeError(f"Error during parsing output: {output.text}") from exc
```

Based on static analysis hints.
149-157: Guard poll JSON parsing; return actionable error on malformed response

If the job status endpoint returns non-JSON, the client will crash. Handle decode errors and include job/session info for triage.

```diff
- job_data = resp.json()
+ try:
+     job_data = resp.json()
+ except json.JSONDecodeError as exc:
+     LOG.error(
+         "Polling job %s returned non-JSON (session_id=%s): %s",
+         job_id,
+         affinity_header,
+         getattr(resp, "text", "<no text>"),
+     )
+     raise RuntimeError("Invalid JSON from job status endpoint") from exc
```
143-147: Include session affinity in poll error logs

Add session identifier for traceability across workers.

```diff
- if getattr(resp, "status_code", 200) != 200:
-     LOG.error("Polling job %s failed with status %d: %s", job_id, resp.status_code, resp.text)
+ if getattr(resp, "status_code", 200) != 200:
+     LOG.error(
+         "Polling job %s failed with status %d (session_id=%s): %s",
+         job_id,
+         resp.status_code,
+         affinity_header,
+         resp.text,
+     )
```

This aligns with previous review feedback about logging session IDs for timeouts/errors.
130-157: Handle client cancellation by canceling remote job

If the coroutine is cancelled (e.g., shutdown), propagate cancellation to the server to avoid orphaned jobs.

```diff
  while True:
      # This should never happen under normal circumstances
      if time.monotonic() > deadline:
          ...
-     resp = await self._request("get", self._get_jobs_url(job_id), timeout=2.0, headers=extra_headers)
+     try:
+         resp = await self._request("get", self._get_jobs_url(job_id), timeout=1.0, headers=extra_headers)
+     except asyncio.CancelledError:
+         try:
+             await self._request("post", self._get_cancel_job_url(job_id), timeout=2.0, headers=extra_headers)
+         except Exception as e:
+             LOG.warning("Cancel job %s failed after client cancellation: %s", job_id, e)
+         raise
```

Also slightly reduce the poll timeout to keep the call short-lived.
82-99: Minor: request shim OK; consider reusing SSH session or lazy import guard

The current approach creates a new SSH session per call. If performance becomes an issue, reuse a session or pool. Also guard ImportError with a clearer message.

Would you like a follow-up patch to pool SSH sessions?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- nemo_skills/code_execution/sandbox.py (5 hunks)
🧰 Additional context used
🪛 Ruff (0.14.0)
nemo_skills/code_execution/sandbox.py
120-120: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
120-120: Avoid specifying long messages outside the exception class
(TRY003)
141-141: Avoid specifying long messages outside the exception class
(TRY003)
147-147: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: unit-tests
Signed-off-by: i-vainn <[email protected]>
Actionable comments posted: 0
🧹 Nitpick comments (1)
nemo_skills/code_execution/local_sandbox/local_sandbox_server.py (1)
166-197: Prefer `logging.exception` for automatic traceback inclusion.

The execution logic is correct. However, consider using `logging.exception` at Line 196 instead of manually formatting the traceback, which automatically includes the traceback and is more concise. Apply this diff:

```diff
- except Exception as e:
-     logging.error(f"Error during job execution: {e}\n{traceback.format_exc()}")
+ except Exception as e:
+     logging.exception("Error during job execution: %s", e)
      return {"process_status": "error", "stderr": f"An unexpected error occurred: {e}"}
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- nemo_skills/code_execution/local_sandbox/local_sandbox_server.py (6 hunks)
🧰 Additional context used
🪛 Ruff (0.14.0)
nemo_skills/code_execution/local_sandbox/local_sandbox_server.py
194-194: Consider moving this statement to an else block
(TRY300)
195-195: Do not catch blind exception: Exception
(BLE001)
196-196: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
297-298: try-except-pass detected, consider logging the exception
(S110)
297-297: Do not catch blind exception: Exception
(BLE001)
301-302: try-except-pass detected, consider logging the exception
(S110)
301-301: Do not catch blind exception: Exception
(BLE001)
305-305: Do not catch blind exception: Exception
(BLE001)
308-309: try-except-pass detected, consider logging the exception
(S110)
308-308: Do not catch blind exception: Exception
(BLE001)
311-311: Consider moving this statement to an else block
(TRY300)
312-312: Do not catch blind exception: Exception
(BLE001)
867-867: Consider moving this statement to an else block
(TRY300)
868-868: Do not catch blind exception: Exception
(BLE001)
869-869: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: pre-commit
- GitHub Check: unit-tests
🔇 Additional comments (15)
nemo_skills/code_execution/local_sandbox/local_sandbox_server.py (15)
28-32: LGTM!

The new imports and Job dataclass are clean and appropriate for the async job orchestration. The use of `field(default_factory=time.time)` for `created_at` is the correct pattern.

Also applies to: 48-57
59-75: LGTM!

The JobManager design is solid:
- Single-worker ThreadPoolExecutor ensures serial execution
- Lock correctly protects shared state within class methods
- Job submission with UUID generation is clean
77-85: LGTM!

The job eviction logic correctly implements the previous review feedback: finished jobs (not "queued" or "running") are popped from memory after being retrieved once, which helps manage memory usage effectively.
87-89: LGTM!

The `queued_ahead_count` method cleanly implements the previous review suggestion using a sum with boolean expressions. The lock is used correctly.
91-97: LGTM!

The `get_related_jobs` method correctly identifies jobs sharing the same session_id for coordinated cancellation. Lock usage is appropriate.
99-128: LGTM!

The `_run_job` method correctly manages job lifecycle:
- Properly transitions from queued → running → finished states
- Handles mid-execution cancellation by checking status after execution
- Comprehensive logging with worker_id and job_id
129-164: LGTM!

The `cancel_job` method implements robust cancellation logic:
- Properly handles queued vs. running jobs
- For IPython sessions, attempts to stop the shell for graceful interruption
- Best-effort approach is well-documented
- Returns informative success/failure messages
292-313: LGTM!

The defensive cleanup operations in `stop_shell` are appropriate for shutdown scenarios where failures are expected. The updated return pattern using `(ok, msg)` tuples improves caller ergonomics. Silent failures during cleanup are acceptable here as they don't impact correctness.

Based on learnings
494-498: LGTM!

The updated cleanup logic properly handles both success and failure cases from `stop_shell`, with appropriate logging for each path.
777-777: LGTM!

The global `job_manager` instance is appropriate for the singleton pattern in this server context.
780-785: LGTM!

The `/jobs/<job_id>` endpoint cleanly retrieves job state, returning 404 for non-existent jobs and using `asdict` for consistent JSON serialization.
789-798: LGTM!

The modified `/execute` endpoint correctly implements the async pattern:
- Returns HTTP 202 (Accepted) for queued jobs
- Includes job_id for polling
- Logs pid alongside job_id as requested in previous review
- Provides queued_ahead count for client feedback
844-844: LGTM!

Adding the worker identifier to the health check response improves observability.
847-870: LGTM!

The `/admin/reset_worker` endpoint correctly implements hard reset:
- Returns response before exit using background thread with delay
- Ensures HTTP response is delivered to client
- Appropriate logging
Minor: Consider using `logging.exception` at Line 869 for consistency (same pattern as the execute_job comment).
873-890: LGTM!

The `/jobs/<job_id>/cancel` endpoint correctly implements cancellation:
- Cancels related jobs sharing the same session_id
- Returns appropriate HTTP status codes (200 for success, 404/409 for failures)
- Logs with header session info for debugging
Signed-off-by: i-vainn <[email protected]>
Actionable comments posted: 0
♻️ Duplicate comments (1)
nemo_skills/code_execution/sandbox.py (1)
131-140: Cancel the job before resetting the worker.

When the client timeout is reached, the code resets the entire worker without canceling the specific job that timed out. This can compound issues if multiple requests time out, as noted in past reviews.
Apply this diff to cancel the job before resetting:
```diff
  if time.monotonic() > deadline:
      # TODO: if the worker is healthy, instead of resetting it, we should try to restart the job
      # maybe reset the deadline and log retry attempt
      LOG.error(
          "Client timed out polling job %s; issuing hard worker reset (session_id=%s)",
          job_id,
          affinity_header,
      )
+     # Cancel the job first to prevent it from running after reset
+     try:
+         _ = await self._request("post", self._get_cancel_job_url(job_id), timeout=2.0, headers=extra_headers)
+     except Exception as e:
+         LOG.warning("Failed to cancel job %s before worker reset: %s", job_id, e)
      _ = await self._request("post", self._get_reset_worker_url(), timeout=5.0, headers=extra_headers)
      raise httpx.TimeoutException("Client poll deadline exceeded")
```
🧹 Nitpick comments (1)
nemo_skills/code_execution/sandbox.py (1)
116-119: Use exception chaining for better debugging.

When re-raising exceptions, use `raise ... from err` to preserve the exception chain and make debugging easier. Apply this diff:

```diff
  try:
      response_json = output.json()
  except json.JSONDecodeError:
-     raise RuntimeError(f"Error during parsing output: {output.text}")
+     raise RuntimeError(f"Error during parsing output: {output.text}") from None
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- nemo_skills/code_execution/sandbox.py (5 hunks)
🧰 Additional context used
🪛 Ruff (0.14.0)
nemo_skills/code_execution/sandbox.py
119-119: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
119-119: Avoid specifying long messages outside the exception class
(TRY003)
140-140: Avoid specifying long messages outside the exception class
(TRY003)
146-146: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: unit-tests
🔇 Additional comments (5)
nemo_skills/code_execution/sandbox.py (5)
82-98: LGTM! Clean request routing abstraction.

The centralized request handler properly routes through SSH tunneling when configured and correctly adapts the `content` kwarg to `data` for the sshtunnel_requests library.
166-176: LGTM! Well-structured URL builder abstraction.

The abstract methods provide a clean interface for URL construction, making it easy to support different sandbox backends.
398-405: LGTM! Clean URL builder implementations.

The implementations correctly construct RESTful URLs for the local sandbox endpoints.
439-444: LGTM! Consistent with the new request architecture.

The update properly uses the centralized `_request` method and correctly passes the `X-Session-ID` header for session affinity.
102-108: Session-ID header handling confirmed.

The server reads `X-Session-ID` and injects it into the request JSON, so omitting `session_id` from the payload is safe.
@i-vainn Appears to address the major functional concerns from before. I have some lingering concerns about robustness of the overall system (e.g., if workers start failing), but it's probably best to see what happens in practice. Happy to help debug any sessions you find that are having failures with the sandbox interaction.
@i-vainn what's left for us to merge this?
This PR queues sandbox jobs for sequential execution and adds client-side polling to avoid HTTP timeouts, with session affinity tests updated to match the new server API.
Summary by CodeRabbit
New Features
Bug Fixes
Tests