feat(cloud): async suspend + resume via job queue (Workers can't SSH) by standujar · Pull Request #7808 · elizaOS/eliza

standujar · 2026-05-19T16:12:10Z

Summary

Closes the Worker→core SSH gap for the agent suspend/resume lifecycle.

Suspend and resume were previously called directly from the cloud-api Worker via elizaSandboxService.shutdown() / provision(). Workers can't SSH the Hetzner cores, so the inline path silently failed to stop the container — the DB row flipped to stopped while the container kept burning RAM, and "resume" attempted a full re-provision every time even when the original container was still on disk.

This refactor mirrors the agent_delete job-queue path that already shipped in PR #7746:

new JOB_TYPES.AGENT_SUSPEND and AGENT_RESUME with data/result shapes, type guards, and idempotent enqueueAgent{Suspend,Resume}Once helpers in provisioning-jobs.ts
elizaSandboxService.executeSuspend: lifecycle lock, SSH docker stop (tolerant of "container gone"), flips DB to stopped, clears bridge_url / health_url, keeps sandbox_id so the same container's row can be resumed
elizaSandboxService.executeResume: delegates to provision() which restores bridge_url / health_url from a fresh sandbox handle and reuses the existing Neon DB. provision() acquires its own advisory lock so concurrent resumes serialize. (A future fast docker start path is tracked as follow-up — see Greptile thread below.)
cloud-api PATCH /eliza/agents/[id] (action=shutdown/suspend), POST /eliza/agents/[id]/suspend, POST /eliza/agents/[id]/resume all now return 202 with a jobId; clients poll /api/v1/jobs/<id> for the final status
a second suspend/resume while one is in flight reuses the existing job (idempotent)
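The idempotent enqueue behaviour described in the list above can be sketched with an in-memory queue. This is only an illustration under stated assumptions: `InMemoryJobQueue`, the `Job` shape, and the `in_progress` / `failed` status names are hypothetical, and the real helpers in provisioning-jobs.ts use a Postgres advisory lock rather than a single-threaded array.

```typescript
// Hypothetical stand-in for the jobs table; not the PR's actual implementation.
type JobType = "agent_suspend" | "agent_resume";
type JobStatus = "pending" | "in_progress" | "completed" | "failed";

interface Job {
  id: string;
  type: JobType;
  agentId: string;
  status: JobStatus;
}

class InMemoryJobQueue {
  private jobs: Job[] = [];
  private nextId = 1;

  // Reuses the in-flight job for (type, agentId) if there is one, else creates a new job.
  enqueueOnce(type: JobType, agentId: string): { job: Job; created: boolean } {
    const inFlight = this.jobs.find(
      (j) =>
        j.type === type &&
        j.agentId === agentId &&
        (j.status === "pending" || j.status === "in_progress"),
    );
    if (inFlight) return { job: inFlight, created: false };

    const job: Job = { id: `job-${this.nextId++}`, type, agentId, status: "pending" };
    this.jobs.push(job);
    return { job, created: true };
  }

  // Marks a job finished so that a later request creates a fresh job.
  complete(jobId: string): void {
    const job = this.jobs.find((j) => j.id === jobId);
    if (job) job.status = "completed";
  }
}
```

A route handler built on this would return 202 with `job.id` whether or not `created` is true, which is what makes a rapid double click harmless.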

The orchestrator daemon (provisioning-worker.service on the Hetzner VM) is the only thing with SSH access to the cores — it picks up the job, runs the SSH operation, and persists the result.
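The daemon-side suspend step can be sketched roughly as follows, with every dependency injected so the flow is visible without a database or SSH. All names here (`SuspendDeps`, `ContainerGoneError`, `suspendAgent`) are hypothetical; the real logic lives in elizaSandboxService.executeSuspend and may differ in detail.

```typescript
// Illustrative row shape; only the columns named in the PR description are modelled.
interface SandboxRow {
  status: "running" | "stopped";
  sandbox_id: string | null;
  bridge_url: string | null;
  health_url: string | null;
}

// Assumed error type for "docker stop found no such container".
class ContainerGoneError extends Error {
  readonly code = "CONTAINER_GONE";
}

interface SuspendDeps {
  // Runs fn while holding a per-agent lifecycle lock (an advisory lock in the PR).
  withLifecycleLock<T>(agentId: string, fn: () => Promise<T>): Promise<T>;
  // Runs docker stop over SSH; assumed to throw ContainerGoneError if the container is missing.
  sshDockerStop(sandboxId: string): Promise<void>;
  loadRow(agentId: string): Promise<SandboxRow>;
  saveRow(agentId: string, row: SandboxRow): Promise<void>;
}

async function suspendAgent(deps: SuspendDeps, agentId: string): Promise<SandboxRow> {
  return deps.withLifecycleLock(agentId, async () => {
    const row = await deps.loadRow(agentId);
    if (row.sandbox_id) {
      try {
        await deps.sshDockerStop(row.sandbox_id);
      } catch (err) {
        // A missing container already satisfies "stopped"; any other error is a real failure.
        if ((err as { code?: string }).code !== "CONTAINER_GONE") throw err;
      }
    }
    // Clear the URLs but keep sandbox_id so the same row can be resumed later.
    const next: SandboxRow = { ...row, status: "stopped", bridge_url: null, health_url: null };
    await deps.saveRow(agentId, next);
    return next;
  });
}
```

The DB write happens only after the stop attempt, so the row no longer flips to stopped while the container keeps running.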

Greptile findings addressed (commit `7f670b9`)

P1 — fast path leaves bridge_url / health_url NULL. Fixed by dropping the fast path entirely. executeResume now delegates to provision(), which restores both URLs from a fresh sandbox handle.
P2 — executeResume has no lifecycle lock. Same fix: provision() already acquires its own advisory lock, so two concurrent resume jobs serialize.
P2 — silent fallback when provider.start is absent. Same fix: there is no fallback path anymore. (Underlying truth: DockerSandboxProvider never exposed a standalone start(), so the fast path was dead code on the only provider that ships today — every resume was already paying the re-provision cost without any log indicating it.)

The fast path will return when the provider exposes start(sandboxId): Promise<handle> that returns a fresh sandbox handle (so bridge_url / health_url can be re-derived). Tracked as follow-up.
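One possible shape for that follow-up is sketched below. It assumes an optional `start` method on the provider and a `SandboxHandle` carrying `bridgeUrl` / `healthUrl`; the `resumeAgent` function, its `log` parameter, and the `provision(agentId)` signature are hypothetical and only illustrate how the fallback could be made visible instead of silent.

```typescript
interface SandboxHandle {
  sandboxId: string;
  bridgeUrl: string;
  healthUrl: string;
}

interface SandboxProvider {
  provision(agentId: string): Promise<SandboxHandle>;
  // Optional until a provider actually implements it.
  start?(sandboxId: string): Promise<SandboxHandle>;
}

type ResumePath = "fast" | "reprovision";

async function resumeAgent(
  provider: SandboxProvider,
  agentId: string,
  sandboxId: string | null,
  log: (msg: string) => void = console.warn,
): Promise<{ handle: SandboxHandle; path: ResumePath }> {
  if (sandboxId && provider.start) {
    try {
      // Fast path: restart the existing container and take URLs from the fresh handle.
      const handle = await provider.start(sandboxId);
      return { handle, path: "fast" };
    } catch (err) {
      log(`start(${sandboxId}) failed, falling back to provision(): ${String(err)}`);
    }
  } else {
    // Log the fallback so a missing start() is never hidden.
    log("fast path unavailable (no sandbox id or provider.start missing); using provision()");
  }
  return { handle: await provider.provision(agentId), path: "reprovision" };
}
```

Because both branches return a handle, the caller can always write `bridge_url` and `health_url` back to the row, which is the property the dropped fast path lacked.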

Why one PR

Same architectural change across 6 source files (1 type registry, 2 service files, 3 routes) + 1 test file. Bisect-friendly as a unit; splitting would create intermediate states where routes enqueue job types the daemon doesn't recognize.

What I tested

Locally:

tsc --noEmit clean on cloud-shared (filtered to touched files)
biome check clean on all changed files
bun test packages/cloud-shared/src/lib/services/__tests__/provisioning-job-types.test.ts → 4 pass, 100% coverage on the registry file. Catches the cheap mistakes: typo in wire value, missing entry, accidental duplicates.
Existing service tests still pass (27 pass / 2 unrelated content-safety fails that pre-date this branch)
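For readers without the repo at hand, the kind of checks the registry test performs can be approximated as below. This is a sketch, not the actual test file: the `agent_provision` wire value, `findRegistryProblems`, and `isJobType` are assumptions made for illustration.

```typescript
// Wire values for suspend/resume/delete come from the PR text; agent_provision is assumed.
const JOB_TYPES = {
  AGENT_PROVISION: "agent_provision",
  AGENT_DELETE: "agent_delete",
  AGENT_SUSPEND: "agent_suspend",
  AGENT_RESUME: "agent_resume",
} as const;

type JobType = (typeof JOB_TYPES)[keyof typeof JOB_TYPES];

const SNAKE_CASE = /^[a-z]+(_[a-z]+)*$/;

// Returns one message per problem: non-snake_case wire values and duplicate wire values.
function findRegistryProblems(registry: Record<string, string>): string[] {
  const problems: string[] = [];
  const firstKeyByWire = new Map<string, string>();
  for (const [key, wire] of Object.entries(registry)) {
    if (!SNAKE_CASE.test(wire)) problems.push(`${key}: "${wire}" is not snake_case`);
    const firstKey = firstKeyByWire.get(wire);
    if (firstKey !== undefined) problems.push(`${key}: duplicates ${firstKey}`);
    else firstKeyByWire.set(wire, key);
  }
  return problems;
}

// Type guard that narrows an unknown wire value to a registered job type.
function isJobType(value: unknown): value is JobType {
  return typeof value === "string" && (Object.values(JOB_TYPES) as string[]).includes(value);
}
```

Checks like these are cheap, and they fail at test time rather than when the daemon meets a wire value it does not recognize.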

What I did NOT test yet (needs deploy):

e2e via dashboard: click "Suspend" → daemon picks up agent_suspend job → SSH docker stop → DB row updates → container actually stopped on the core
e2e via dashboard: click "Resume" on stopped agent → daemon picks up agent_resume → provision() restores bridge_url + container Up
idempotency: two rapid clicks → second returns the existing job, no duplicate
/api/v1/jobs/<id> polling surfaces correct status transitions
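The polling check in the last item could be exercised with a small client loop like the one below. It is a sketch under assumptions: the status names other than `completed`, the `pollJob` helper, and its options are hypothetical, and the status source is injected so the loop can run without a network.

```typescript
type JobStatus = "pending" | "in_progress" | "completed" | "failed";

interface PollOptions {
  intervalMs: number;
  maxAttempts: number;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Polls until the job reaches a terminal status and records each distinct status seen.
async function pollJob(
  fetchStatus: () => Promise<JobStatus>,
  opts: PollOptions,
): Promise<{ final: JobStatus; seen: JobStatus[] }> {
  const seen: JobStatus[] = [];
  for (let attempt = 0; attempt < opts.maxAttempts; attempt++) {
    const status = await fetchStatus();
    if (seen[seen.length - 1] !== status) seen.push(status);
    if (status === "completed" || status === "failed") return { final: status, seen };
    await sleep(opts.intervalMs);
  }
  throw new Error(`job still not finished after ${opts.maxAttempts} polls`);
}
```

In a real client, `fetchStatus` would wrap a GET to /api/v1/jobs/<id> and read the status field from the response; the exact response shape is not specified in this PR, so that part is left to the caller.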

The e2e path requires the new code to be live on both the cloud-api Worker AND the orchestrator daemon (Hetzner VM /opt/eliza). Smoke-test after merge or after a staging deploy.

Test gap:
The execute* methods (executeSuspend, executeResume, executeAgentSuspend, executeAgentResume) aren't unit-tested. The package doesn't have a harness for mocking dbWrite.transaction, advisory locks, or the SSH provider — same as the shipped agent_delete path. Adding that harness is a follow-up of its own size.

Follow-up

Real fast resume path: extend DockerSandboxProvider with start(sandboxId): Promise<SandboxHandle> that does the SSH docker start + waits for health + returns a handle with bridgeUrl / healthUrl. Then executeResume can branch on container presence and skip the full create flow when the container is still on disk.
Apply the same job-queue pattern to agent_logs and agent_snapshot (currently still inline through the missing container-control-plane service).
Repoint the frontend logs viewer from the legacy /api/compat/agents/.../logs to the new v1 routes.
DRY the four enqueue*Once methods (provision, delete, suspend, resume) — ~80 lines of near-identical advisory-lock + idempotency code. Touches shipped code so out of scope here.
Test harness for the execute* methods (covers this PR + the existing agent_delete path).

Greptile Summary

This PR migrates agent_suspend and agent_resume from inline Worker calls (which cannot SSH Hetzner cores) to the existing job-queue pattern, closing the silent container-leak bug where the DB row flipped to stopped while the container kept running.

New AGENT_SUSPEND and AGENT_RESUME job types are registered in the type registry with matching data/result shapes, type guards, and idempotent enqueueAgent{Suspend,Resume}Once helpers that use the existing advisory-lock pattern.
elizaSandboxService.executeSuspend runs inside the advisory lock, SSH-stops the container, and clears bridge_url/health_url while retaining sandbox_id; executeResume currently delegates to provision() (fast docker start path deferred to a follow-up).
All three affected HTTP routes (PATCH /agents/[id], POST /suspend, POST /resume) now return 202 with a jobId for polling, consistent with the existing delete path.

Confidence Score: 3/5

The architectural direction is correct, but executeResume runs without a lifecycle lock and neither enqueue helper guards against an opposing job type being in flight, allowing docker stop and docker start to race on the same container.

Two executor-level gaps flagged in earlier review rounds remain unaddressed: executeResume reads the sandbox record and calls provision() without holding the advisory lock, and neither enqueueAgentSuspendOnce nor enqueueAgentResumeOnce checks for an active job of the opposing type before creating a new one. Both gaps affect the core suspend/resume execution path and produce non-deterministic container state under concurrent operations.

The files needing attention are eliza-sandbox.ts (executeResume lacks the lifecycle lock that executeSuspend has) and provisioning-jobs.ts (the enqueue helpers need cross-type job conflict detection).
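The cross-type conflict detection mentioned here could look something like the following. This is a hypothetical sketch, not code from the PR: `decideEnqueue`, `EnqueueDecision`, and the status names are assumptions, and in practice the decision would have to run inside the same advisory lock as the insert.

```typescript
type LifecycleJobType = "agent_suspend" | "agent_resume";
type JobStatus = "pending" | "in_progress" | "completed" | "failed";

interface Job {
  id: string;
  type: LifecycleJobType;
  agentId: string;
  status: JobStatus;
}

type EnqueueDecision =
  | { kind: "create" }
  | { kind: "reuse"; job: Job } // same type already in flight: idempotent reuse
  | { kind: "conflict"; job: Job }; // opposing type in flight: caller should reject

// Looks at any in-flight lifecycle job for the agent, not only jobs of the requested type.
function decideEnqueue(
  jobs: readonly Job[],
  type: LifecycleJobType,
  agentId: string,
): EnqueueDecision {
  const inFlight = jobs.find(
    (j) => j.agentId === agentId && (j.status === "pending" || j.status === "in_progress"),
  );
  if (!inFlight) return { kind: "create" };
  return inFlight.type === type
    ? { kind: "reuse", job: inFlight }
    : { kind: "conflict", job: inFlight };
}
```

A route could map `conflict` to an error response such as 409, so a resume is never queued behind an unfinished suspend for the same agent.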

Important Files Changed

| Filename | Overview |
| --- | --- |
| `packages/cloud-shared/src/lib/services/eliza-sandbox.ts` | Adds executeSuspend (with lifecycle lock) and executeResume (no lock, delegates to provision()). The missing lifecycle lock in executeResume and the absence of a guard against deletion_pending status are previously flagged open issues. |
| `packages/cloud-shared/src/lib/services/provisioning-jobs.ts` | Adds enqueueAgentSuspendOnce and enqueueAgentResumeOnce with advisory-lock idempotency; each checks only for its own job type, not the opposing type, allowing suspend/resume to race (a previously flagged P1). |
| `packages/cloud-api/v1/eliza/agents/[agentId]/resume/route.ts` | Switches to enqueueAgentResumeOnce and returns 202; still returns HTTP 409 for the already-in-progress case (previously flagged inconsistency), and contains a dead "Agent state changed while starting" error branch inherited from the old provision path. |
| `packages/cloud-api/v1/eliza/agents/[agentId]/suspend/route.ts` | Switches to enqueueAgentSuspendOnce and returns 202 for both fresh and idempotent cases, consistent with the delete route. |
| `packages/cloud-api/v1/eliza/agents/[agentId]/route.ts` | PATCH suspend/shutdown actions now enqueue a job and return 202; checks for provisioning status before enqueuing and delegates gracefully for other states. |
| `packages/cloud-shared/src/lib/services/provisioning-job-types.ts` | Adds AGENT_SUSPEND and AGENT_RESUME wire values; registry is correct and covered by the new smoke test. |
| `packages/cloud-shared/src/lib/services/__tests__/provisioning-job-types.test.ts` | New smoke tests verify wire-value correctness, uniqueness, snake_case convention, and type narrowing; good coverage for the registry. |

Sequence Diagram

sequenceDiagram
    participant Client
    participant Worker as CF Worker (cloud-api)
    participant DB as PostgreSQL (jobs table)
    participant Daemon as Orchestrator Daemon (Hetzner VM)
    participant Core as Agent Core (SSH)

    Note over Client,Core: Suspend Flow
    Client->>Worker: "POST /suspend or PATCH action=suspend"
    Worker->>DB: enqueueAgentSuspendOnce (advisory lock)
    DB-->>Worker: "{ job, created }"
    Worker->>DB: triggerImmediate() [fire-and-forget]
    Worker-->>Client: "202 { jobId }"

    Daemon->>DB: poll pending jobs
    DB-->>Daemon: agent_suspend job
    Daemon->>Daemon: "executeAgentSuspend -> executeSuspend (advisory lock)"
    Daemon->>Core: SSH docker stop sandbox_id
    Core-->>Daemon: container stopped
    Daemon->>DB: "UPDATE status=stopped, bridge_url=NULL, health_url=NULL"
    Daemon->>DB: updateStatus(job, completed)

    Client->>Worker: "GET /api/v1/jobs/{jobId}"
    Worker-->>Client: "{ status: completed }"

    Note over Client,Core: Resume Flow
    Client->>Worker: POST /resume
    Worker->>DB: enqueueAgentResumeOnce (advisory lock)
    DB-->>Worker: "{ job, created }"
    Worker->>DB: triggerImmediate() [fire-and-forget]
    Worker-->>Client: "202 { jobId }"

    Daemon->>DB: poll pending jobs
    DB-->>Daemon: agent_resume job
    Daemon->>Daemon: "executeAgentResume -> executeResume (no lock)"
    Daemon->>Daemon: provision() [re-provision path]
    Daemon->>Core: SSH docker start / create container
    Core-->>Daemon: bridge_url + health_url
    Daemon->>DB: "UPDATE status=running, bridge_url, health_url"
    Daemon->>DB: updateStatus(job, completed)

Reviews (4): Last reviewed commit: "fix(cloud): drop dead fast-path in execu..."

Suspend and resume previously called elizaSandboxService directly from the cloud-api Worker, which silently failed because Workers can't SSH the Hetzner cores. The DB row flipped to `stopped` while the container kept burning RAM, and "resume" attempted a full re-provision every time even when the original container was still on disk.

Mirrors the agent_delete refactor (PR #7746):

- new JOB_TYPES.AGENT_SUSPEND and AGENT_RESUME with data/result shapes, type guards, and idempotent enqueue helpers in provisioning-jobs.ts
- elizaSandboxService.executeSuspend: lifecycle lock, SSH docker stop (tolerant of "container gone"), DB to stopped, sandbox_id retained so the same container can be resumed
- elizaSandboxService.executeResume: fast path `docker start` on the existing container (~5s), full re-provision fallback if the container is gone (~60s). Existing Neon DB reused on both paths.
- cloud-api PATCH /eliza/agents/[id], /suspend, /resume all return 202 with a jobId; clients poll /api/v1/jobs/<id> for the final status
- second suspend/resume while one is in flight reuses the existing job

Follow-up: same pattern for agent_logs and agent_snapshot.

coderabbitai · 2026-05-19T16:12:20Z

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


claude · 2026-05-19T16:12:56Z

Claude encountered an error after 0s.

I'll analyze this and get back to you.

Catches the cheap mistakes the orchestrator daemon can't recover from at runtime: typo on the wire value, missing entry, accidental duplicates. Deeper executor tests (SSH, locks, DB writes) need a test harness the package doesn't have yet — follow-up.

claude · 2026-05-19T17:35:58Z

Claude encountered an error after 0s.

I'll analyze this and get back to you.

claude · 2026-05-19T19:37:14Z

Claude encountered an error after 0s.

I'll analyze this and get back to you.

Greptile flagged three issues on the resume executor:

- P1: fast path UPDATEs status='running' but doesn't restore bridge_url / health_url that executeSuspend clears, so proxy routes that guard on rec.bridge_url see a "running" agent that's unreachable.
- P2: executeResume reads outside any transaction and holds no lifecycle lock, unlike executeSuspend. Two concurrent resume jobs can race.
- P2: when provider.start is undefined the code falls through to a full re-provision with no log, hiding the "method not implemented" path.

Underlying truth: DockerSandboxProvider doesn't expose a standalone start() at all, so the fast path was dead code on the only provider that ships today. Every resume already paid the re-provision cost.

Fix: drop the fast path. executeResume now delegates to provision(), which restores bridge_url / health_url from a fresh sandbox handle and acquires its own advisory lock. The fast path returns when the provider grows a start() that yields the handle (tracked as follow-up).

Also updates the resume route comment and user-facing message that falsely claimed "~5s fast path".

claude · 2026-05-19T19:53:35Z

Claude encountered an error after 0s.

I'll analyze this and get back to you.

standujar · 2026-05-19T20:10:14Z

Superseded by #7810 — branch was renamed (feat/orchestrator-http-server → feat/agent-lifecycle-via-job-queue) since the scope grew to include RESTART / LOGS / SNAPSHOT job types beyond just orchestrator wiring. Same commits + new work continues there.

Three new daemon-side handlers extend the cores-never-touched-from-Workers architecture started in #7808 (suspend/resume) to cover the rest of the lifecycle:

- executeRestart: shutdown() + provision() atomically on the daemon. Replaces the Worker-side sequence which silently no-op'd the stop step and could leave a stale container running alongside the new one.
- executeLogs: SSH `docker logs --tail N <container>` via the new SandboxProvider.fetchLogs() method. Works for stopped + crashed agents (the Worker-side fetch(bridge_url + /logs) returned empty for anything not actively running).
- executeSnapshot: thin wrapper around the existing snapshot() so the job dispatcher can route through a single contract. Invoked from the daemon so outbound traffic to cores uses the same network identity as every other lifecycle op.

Job machinery in provisioning-jobs.ts mirrors the suspend/resume patterns: data/result shapes, type guards, idempotent enqueue methods that reuse in-flight jobs on duplicate requests. DockerSandboxProvider.fetchLogs() merges stderr into stdout because agent crash traces tend to land on stderr.

greptile-apps Bot reviewed May 19, 2026


Comment thread packages/cloud-shared/src/lib/services/eliza-sandbox.ts Outdated

Comment thread packages/cloud-shared/src/lib/services/eliza-sandbox.ts

Comment thread packages/cloud-shared/src/lib/services/eliza-sandbox.ts Outdated

github-actions Bot added the Tests label May 19, 2026

Merge branch 'develop' into feat/orchestrator-http-server

34c8d67

standujar marked this pull request as draft May 19, 2026 20:01

standujar closed this May 19, 2026

standujar deleted the feat/orchestrator-http-server branch May 19, 2026 20:06

standujar mentioned this pull request May 19, 2026

feat(cloud): all agent lifecycle ops via job queue (Workers don't call cores) #7810

Merged

8 tasks

standujar wants to merge 4 commits into develop from feat/orchestrator-http-server
