feat(cloud): async suspend + resume via job queue (Workers can't SSH)#7808
standujar wants to merge 4 commits into
Suspend and resume previously called elizaSandboxService directly from the cloud-api Worker, which silently failed because Workers can't SSH the Hetzner cores. The DB row flipped to `stopped` while the container kept burning RAM, and "resume" attempted a full re-provision every time even when the original container was still on disk.

Mirrors the agent_delete refactor (PR #7746):

- new JOB_TYPES.AGENT_SUSPEND and AGENT_RESUME with data/result shapes, type guards, and idempotent enqueue helpers in provisioning-jobs.ts
- elizaSandboxService.executeSuspend: lifecycle lock, SSH docker stop (tolerant of "container gone"), DB to stopped, sandbox_id retained so the same container can be resumed
- elizaSandboxService.executeResume: fast path `docker start` on the existing container (~5s), full re-provision fallback if the container is gone (~60s). Existing Neon DB reused on both paths.
- cloud-api PATCH /eliza/agents/[id], /suspend, /resume all return 202 with a jobId; clients poll /api/v1/jobs/<id> for the final status
- second suspend/resume while one is in flight reuses the existing job

Follow-up: same pattern for agent_logs and agent_snapshot.
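The "reuses the existing job" behaviour can be sketched as a minimal in-memory model. This is an illustration only, not the code in provisioning-jobs.ts: the class `InMemoryJobQueue`, the `in_progress` and `failed` statuses, and the job id format are assumptions, and the real helpers serialize through a database advisory lock rather than an array scan.

```typescript
// Hypothetical in-memory model of an idempotent "enqueue once" helper.
type JobType = "agent_suspend" | "agent_resume";
type JobStatus = "pending" | "in_progress" | "completed" | "failed";

interface Job {
  id: string;
  type: JobType;
  agentId: string;
  status: JobStatus;
}

class InMemoryJobQueue {
  private jobs: Job[] = [];
  private nextId = 1;

  // Returns the in-flight job of the same type for this agent if one exists,
  // otherwise creates a new pending job.
  enqueueOnce(type: JobType, agentId: string): { job: Job; created: boolean } {
    const inFlight = this.jobs.find(
      (j) =>
        j.type === type &&
        j.agentId === agentId &&
        (j.status === "pending" || j.status === "in_progress"),
    );
    if (inFlight) {
      return { job: inFlight, created: false };
    }
    const job: Job = { id: `job-${this.nextId++}`, type, agentId, status: "pending" };
    this.jobs.push(job);
    return { job, created: true };
  }

  // Marks a job as finished so that later requests create a fresh job.
  complete(jobId: string): void {
    const job = this.jobs.find((j) => j.id === jobId);
    if (job) {
      job.status = "completed";
    }
  }
}
```

A route handler built on such a helper can return 202 with `job.id` regardless of whether `created` is true, so a duplicate request polls the same job.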
Catches the cheap mistakes the orchestrator daemon can't recover from at runtime: typo on the wire value, missing entry, accidental duplicates. Deeper executor tests (SSH, locks, DB writes) need a test harness the package doesn't have yet — follow-up.
Greptile flagged three issues on the resume executor:

- P1: fast path UPDATEs status='running' but doesn't restore bridge_url / health_url that executeSuspend clears, so proxy routes that guard on rec.bridge_url see a "running" agent that's unreachable.
- P2: executeResume reads outside any transaction and holds no lifecycle lock, unlike executeSuspend. Two concurrent resume jobs can race.
- P2: when provider.start is undefined the code falls through to a full re-provision with no log, hiding the "method not implemented" path.

Underlying truth: DockerSandboxProvider doesn't expose a standalone start() at all, so the fast path was dead code on the only provider that ships today; every resume already paid the re-provision cost.

Fix: drop the fast path. executeResume now delegates to provision(), which restores bridge_url / health_url from a fresh sandbox handle and acquires its own advisory lock. The fast path returns when the provider grows a start() that yields the handle; tracked as follow-up.

Also updates the resume route comment and user-facing message that falsely claimed "~5s fast path".
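For reference, one way the deferred fast path could look once a provider exposes a start() that yields a handle. This is a sketch under assumptions, not code from this PR: the `SandboxProvider` interface, its `create()` method, the `resumeSandbox` function, and the handle field names are hypothetical. It restores both URLs from the returned handle and logs every fallback, which covers the P1 and the silent fall-through finding above.

```typescript
// Hypothetical shapes for illustration; not the repository's actual types.
interface SandboxHandle {
  sandboxId: string;
  bridgeUrl: string;
  healthUrl: string;
}

interface SandboxProvider {
  // Full create flow (slow path).
  create(agentId: string): Promise<SandboxHandle>;
  // Optional fast path: restart an existing container and return a fresh handle.
  start?(sandboxId: string): Promise<SandboxHandle>;
}

async function resumeSandbox(
  provider: SandboxProvider,
  agentId: string,
  sandboxId: string | null,
): Promise<{ handle: SandboxHandle; path: "fast" | "reprovision" }> {
  if (sandboxId && provider.start) {
    try {
      const handle = await provider.start(sandboxId);
      return { handle, path: "fast" };
    } catch (err) {
      // Container gone or unhealthy: log and fall back instead of failing silently.
      console.warn(`start(${sandboxId}) failed, falling back to re-provision`, err);
    }
  } else {
    // Make the "method not implemented" path visible in logs.
    console.warn("no start() on provider or no sandbox id, re-provisioning");
  }
  const handle = await provider.create(agentId);
  return { handle, path: "reprovision" };
}
```

The caller would persist `handle.bridgeUrl` and `handle.healthUrl` on both paths, so a "running" row is never left without its URLs.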
Superseded by #7810 — branch was renamed ( |
Three new daemon-side handlers extend the cores-never-touched-from-Workers architecture started in #7808 (suspend/resume) to cover the rest of the lifecycle:

- executeRestart: shutdown() + provision() atomically on the daemon. Replaces the Worker-side sequence which silently no-op'd the stop step and could leave a stale container running alongside the new one.
- executeLogs: SSH `docker logs --tail N <container>` via the new SandboxProvider.fetchLogs() method. Works for stopped + crashed agents (the Worker-side fetch(bridge_url + /logs) returned empty for anything not actively running).
- executeSnapshot: thin wrapper around the existing snapshot() so the job dispatcher can route through a single contract. Invoked from the daemon so outbound traffic to cores uses the same network identity as every other lifecycle op.

Job machinery in provisioning-jobs.ts mirrors the suspend/resume patterns: data/result shapes, type guards, idempotent enqueue methods that reuse in-flight jobs on duplicate requests.

DockerSandboxProvider.fetchLogs() merges stderr into stdout because agent crash traces tend to land on stderr.
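The log command described above could be assembled roughly as follows. This is a sketch, not the actual fetchLogs() implementation: the helper name `buildDockerLogsCommand` and the validation rules are assumptions, and the merge is done here with a shell `2>&1` redirect, which may not be how the provider does it.

```typescript
// Hypothetical helper; the real DockerSandboxProvider.fetchLogs() may differ.
// Docker container names start with an alphanumeric and then allow _ . -
const CONTAINER_NAME = /^[a-zA-Z0-9][a-zA-Z0-9_.-]*$/;

function buildDockerLogsCommand(container: string, tail: number): string {
  // Reject anything that could break out of the remote shell command.
  if (!CONTAINER_NAME.test(container)) {
    throw new Error(`invalid container name: ${container}`);
  }
  if (!Number.isInteger(tail) || tail <= 0) {
    throw new Error(`tail must be a positive integer, got ${tail}`);
  }
  // 2>&1 merges stderr into stdout so crash traces arrive in the same stream.
  return `docker logs --tail ${tail} ${container} 2>&1`;
}
```

Because `docker logs` reads from the container's stored log rather than a live endpoint, the same command applies to stopped and crashed containers, which is the case the bridge URL fetch could not serve.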
Summary
Closes the Worker→core SSH gap for the agent suspend/resume lifecycle.
Suspend and resume were previously called directly from the cloud-api Worker via `elizaSandboxService.shutdown()/provision()`. Workers can't SSH the Hetzner cores, so the inline path silently failed to stop the container: the DB row flipped to `stopped` while the container kept burning RAM, and "resume" attempted a full re-provision every time even when the original container was still on disk.

This refactor mirrors the `agent_delete` job-queue path that already shipped in PR #7746:

- `JOB_TYPES.AGENT_SUSPEND` and `AGENT_RESUME` with data/result shapes, type guards, and idempotent `enqueueAgent{Suspend,Resume}Once` helpers in `provisioning-jobs.ts`
- `elizaSandboxService.executeSuspend`: lifecycle lock, SSH `docker stop` (tolerant of "container gone"), flips DB to `stopped`, clears `bridge_url`/`health_url`, keeps `sandbox_id` so the same container's row can be resumed
- `elizaSandboxService.executeResume`: delegates to `provision()` which restores `bridge_url`/`health_url` from a fresh sandbox handle and reuses the existing Neon DB. `provision()` acquires its own advisory lock so concurrent resumes serialize. (A future fast `docker start` path is tracked as follow-up; see Greptile thread below.)
- `PATCH /eliza/agents/[id]` (`action=shutdown/suspend`), `POST /eliza/agents/[id]/suspend`, `POST /eliza/agents/[id]/resume` all now return `202` with a `jobId`; clients poll `/api/v1/jobs/<id>` for the final status

The orchestrator daemon (`provisioning-worker.service` on the Hetzner VM) is the only thing with SSH access to the cores: it picks up the job, runs the SSH operation, and persists the result.

Greptile findings addressed (commit `7f670b9`)

- `executeResume` now delegates to `provision()`, which restores both URLs from a fresh sandbox handle.
- `provision()` already acquires its own advisory lock, so two concurrent resume jobs serialize.
- `provider.start` is absent. Same fix: there is no fallback path anymore. (Underlying truth: `DockerSandboxProvider` never exposed a standalone `start()`, so the fast path was dead code on the only provider that ships today; every resume was already paying the re-provision cost without any log indicating it.)

The fast path will return when the provider exposes `start(sandboxId): Promise<handle>` that returns a fresh sandbox handle (so `bridge_url`/`health_url` can be re-derived). Tracked as follow-up.

Why one PR
Same architectural change across 6 source files (1 type registry, 2 service files, 3 routes) + 1 test file. Bisect-friendly as a unit; splitting would create intermediate states where routes enqueue job types the daemon doesn't recognize.
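The failure mode named above, routes enqueueing a job type the daemon doesn't recognize, is the kind of thing a single type registry with a type guard is meant to prevent. A minimal sketch follows, assuming the wire values match the job names used in this PR; the `AGENT_DELETE` key and the helper names `isJobType` and `findDuplicateWireValues` are hypothetical, not the contents of the real registry file.

```typescript
// Hypothetical registry shape; the real file may hold more entries.
const JOB_TYPES = {
  AGENT_DELETE: "agent_delete",
  AGENT_SUSPEND: "agent_suspend",
  AGENT_RESUME: "agent_resume",
} as const;

type JobType = (typeof JOB_TYPES)[keyof typeof JOB_TYPES];

// Type guard: narrows an untrusted wire value to a known job type.
function isJobType(value: unknown): value is JobType {
  return (
    typeof value === "string" &&
    (Object.values(JOB_TYPES) as string[]).includes(value)
  );
}

// Returns every wire value that appears more than once in a registry.
function findDuplicateWireValues(registry: Record<string, string>): string[] {
  const seen = new Set<string>();
  const dupes = new Set<string>();
  for (const value of Object.values(registry)) {
    if (seen.has(value)) {
      dupes.add(value);
    }
    seen.add(value);
  }
  return Array.from(dupes);
}
```

With both the routes and the daemon importing the same constant, a typo in a wire value becomes a compile-time or unit-test failure rather than a job that sits unclaimed in the queue.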
What I tested
Locally:
- `tsc --noEmit` clean on cloud-shared (filtered to touched files)
- `bun test packages/cloud-shared/src/lib/services/__tests__/provisioning-job-types.test.ts` → 4 pass, 100% coverage on the registry file. Catches the cheap mistakes: typo in wire value, missing entry, accidental duplicates.

What I did NOT test yet (needs deploy):

- `agent_suspend` job → SSH `docker stop` → DB row updates → container actually stopped on the core
- `agent_resume` → `provision()` restores bridge_url + container Up
- `/api/v1/jobs/<id>` polling surfaces correct status transitions

The e2e path requires the new code to be live on both the cloud-api Worker AND the orchestrator daemon (Hetzner VM `/opt/eliza`). Smoke-test after merge or after a staging deploy.

Test gap:

The `execute*` methods (executeSuspend, executeResume, executeAgentSuspend, executeAgentResume) aren't unit-tested. The package doesn't have a harness for mocking `dbWrite.transaction`, advisory locks, or the SSH provider; same as the shipped `agent_delete` path. Adding that harness is a follow-up of its own size.

Follow-up

- `DockerSandboxProvider` with `start(sandboxId): Promise<SandboxHandle>` that does the SSH `docker start` + waits for health + returns a handle with `bridgeUrl`/`healthUrl`. Then `executeResume` can branch on container presence and skip the full create flow when the container is still on disk.
- `agent_logs` and `agent_snapshot` (currently still inline through the missing `container-control-plane` service).
- `/api/compat/agents/.../logs` to the new v1 routes.
- `enqueue*Once` methods (`provision`, `delete`, `suspend`, `resume`): ~80 lines of near-identical advisory-lock + idempotency code. Touches shipped code so out of scope here.
- `agent_delete` path).

Greptile Summary
This PR migrates `agent_suspend` and `agent_resume` from inline Worker calls (which cannot SSH Hetzner cores) to the existing job-queue pattern, closing the silent container-leak bug where the DB row flipped to `stopped` while the container kept running.

- `AGENT_SUSPEND` and `AGENT_RESUME` job types are registered in the type registry with matching data/result shapes, type guards, and idempotent `enqueueAgent{Suspend,Resume}Once` helpers that use the existing advisory-lock pattern.
- `elizaSandboxService.executeSuspend` runs inside the advisory lock, SSH-stops the container, and clears `bridge_url`/`health_url` while retaining `sandbox_id`; `executeResume` currently delegates to `provision()` (fast `docker start` path deferred to a follow-up).
- `PATCH /agents/[id]`, `POST /suspend`, `POST /resume`) now return 202 with a `jobId` for polling, consistent with the existing delete path.

Confidence Score: 3/5
The architectural direction is correct, but executeResume runs without a lifecycle lock and neither enqueue helper guards against an opposing job type being in flight, allowing docker stop and docker start to race on the same container.
Two executor-level gaps flagged in earlier review rounds remain unaddressed: executeResume reads the sandbox record and calls provision() without holding the advisory lock, and neither enqueueAgentSuspendOnce nor enqueueAgentResumeOnce checks for an active job of the opposing type before creating a new one. Both gaps affect the core suspend/resume execution path and produce non-deterministic container state under concurrent operations.
eliza-sandbox.ts (executeResume lacks the lifecycle lock that executeSuspend has) and provisioning-jobs.ts (enqueue helpers need cross-type job conflict detection).
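The cross-type check the review asks for could take roughly this shape. It is an illustration of the reviewer's suggestion, not a fix adopted in this PR: the function `findOpposingInFlightJob`, the `LifecycleJob` shape, and the `in_progress` and `failed` statuses are assumptions, and a real version would run as a query inside the same advisory lock the enqueue helpers already take.

```typescript
// Hypothetical shapes for illustration only.
type LifecycleJobType = "agent_suspend" | "agent_resume";

interface LifecycleJob {
  id: string;
  type: LifecycleJobType;
  agentId: string;
  status: "pending" | "in_progress" | "completed" | "failed";
}

// Maps each lifecycle job type to the type it must not overlap with.
const OPPOSITE: Record<LifecycleJobType, LifecycleJobType> = {
  agent_suspend: "agent_resume",
  agent_resume: "agent_suspend",
};

// Returns an unfinished job of the opposing type for the same agent, if any.
function findOpposingInFlightJob(
  jobs: LifecycleJob[],
  agentId: string,
  requested: LifecycleJobType,
): LifecycleJob | undefined {
  return jobs.find(
    (j) =>
      j.agentId === agentId &&
      j.type === OPPOSITE[requested] &&
      (j.status === "pending" || j.status === "in_progress"),
  );
}
```

An enqueue helper that finds such a job could reject the request with a conflict response or queue behind it; which of the two is appropriate is a product decision the review does not settle.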
Important Files Changed
Sequence Diagram
sequenceDiagram
    participant Client
    participant Worker as CF Worker (cloud-api)
    participant DB as PostgreSQL (jobs table)
    participant Daemon as Orchestrator Daemon (Hetzner VM)
    participant Core as Agent Core (SSH)
    Note over Client,Core: Suspend Flow
    Client->>Worker: "POST /suspend or PATCH action=suspend"
    Worker->>DB: enqueueAgentSuspendOnce (advisory lock)
    DB-->>Worker: "{ job, created }"
    Worker->>DB: triggerImmediate() [fire-and-forget]
    Worker-->>Client: "202 { jobId }"
    Daemon->>DB: poll pending jobs
    DB-->>Daemon: agent_suspend job
    Daemon->>Daemon: "executeAgentSuspend -> executeSuspend (advisory lock)"
    Daemon->>Core: SSH docker stop sandbox_id
    Core-->>Daemon: container stopped
    Daemon->>DB: "UPDATE status=stopped, bridge_url=NULL, health_url=NULL"
    Daemon->>DB: updateStatus(job, completed)
    Client->>Worker: "GET /api/v1/jobs/{jobId}"
    Worker-->>Client: "{ status: completed }"
    Note over Client,Core: Resume Flow
    Client->>Worker: POST /resume
    Worker->>DB: enqueueAgentResumeOnce (advisory lock)
    DB-->>Worker: "{ job, created }"
    Worker->>DB: triggerImmediate() [fire-and-forget]
    Worker-->>Client: "202 { jobId }"
    Daemon->>DB: poll pending jobs
    DB-->>Daemon: agent_resume job
    Daemon->>Daemon: "executeAgentResume -> executeResume (no lock)"
    Daemon->>Daemon: provision() [re-provision path]
    Daemon->>Core: SSH docker start / create container
    Core-->>Daemon: bridge_url + health_url
    Daemon->>DB: "UPDATE status=running, bridge_url, health_url"
    Daemon->>DB: updateStatus(job, completed)

Reviews (4): Last reviewed commit: "fix(cloud): drop dead fast-path in execu..."