feat(cloud): async suspend + resume via job queue (Workers can't SSH) #7808

Closed
standujar wants to merge 4 commits into develop from feat/orchestrator-http-server

Conversation

@standujar
Collaborator

@standujar standujar commented May 19, 2026

Summary

Closes the Worker→core SSH gap for the agent suspend/resume lifecycle.

Suspend and resume were previously called directly from the cloud-api Worker via elizaSandboxService.shutdown() / provision(). Workers can't SSH the Hetzner cores, so the inline path silently failed to stop the container — the DB row flipped to stopped while the container kept burning RAM, and "resume" attempted a full re-provision every time even when the original container was still on disk.

This refactor mirrors the agent_delete job-queue path that already shipped in PR #7746:

  • new JOB_TYPES.AGENT_SUSPEND and AGENT_RESUME with data/result shapes, type guards, and idempotent enqueueAgent{Suspend,Resume}Once helpers in provisioning-jobs.ts
  • elizaSandboxService.executeSuspend: lifecycle lock, SSH docker stop (tolerant of "container gone"), flips DB to stopped, clears bridge_url / health_url, keeps sandbox_id so the same container's row can be resumed
  • elizaSandboxService.executeResume: delegates to provision() which restores bridge_url / health_url from a fresh sandbox handle and reuses the existing Neon DB. provision() acquires its own advisory lock so concurrent resumes serialize. (A future fast docker start path is tracked as follow-up — see Greptile thread below.)
  • cloud-api PATCH /eliza/agents/[id] (action=shutdown/suspend), POST /eliza/agents/[id]/suspend, POST /eliza/agents/[id]/resume all now return 202 with a jobId; clients poll /api/v1/jobs/<id> for the final status
  • a second suspend/resume while one is in flight reuses the existing job (idempotent)
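
The idempotent enqueue behaviour described above can be illustrated with a minimal sketch. It uses an in-memory array in place of the jobs table and does not model the Postgres advisory lock; `InMemoryJobStore`, `enqueueOnce`, and the `in_progress` / `failed` status names are hypothetical and are not taken from provisioning-jobs.ts.

```typescript
// Hypothetical in-memory stand-in for the jobs table. The real helpers
// serialize on an advisory lock, which this sketch does not model.
type JobType = "agent_suspend" | "agent_resume";
type JobStatus = "pending" | "in_progress" | "completed" | "failed";

interface Job {
  id: string;
  type: JobType;
  agentId: string;
  status: JobStatus;
}

class InMemoryJobStore {
  private jobs: Job[] = [];
  private nextId = 1;

  // Returns the in-flight job for (type, agentId) if one exists,
  // otherwise creates a new pending job.
  enqueueOnce(type: JobType, agentId: string): { job: Job; created: boolean } {
    const inFlight = this.jobs.find(
      (j) =>
        j.type === type &&
        j.agentId === agentId &&
        (j.status === "pending" || j.status === "in_progress"),
    );
    if (inFlight) return { job: inFlight, created: false };

    const job: Job = { id: `job-${this.nextId++}`, type, agentId, status: "pending" };
    this.jobs.push(job);
    return { job, created: true };
  }

  // Marks a job as finished so that a later request creates a fresh job.
  complete(jobId: string): void {
    const job = this.jobs.find((j) => j.id === jobId);
    if (job) job.status = "completed";
  }
}

const store = new InMemoryJobStore();
const first = store.enqueueOnce("agent_suspend", "agent-1");
const second = store.enqueueOnce("agent_suspend", "agent-1");
```

In this sketch the second call returns the same job with `created: false`, which is the shape a route could use to answer 202 with the existing jobId.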

The orchestrator daemon (provisioning-worker.service on the Hetzner VM) is the only thing with SSH access to the cores — it picks up the job, runs the SSH operation, and persists the result.
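
As a rough illustration of the daemon-side suspend step (stop the container, tolerate "container gone", then persist the stopped state), here is a synchronous sketch. `SandboxRowSketch`, `isContainerGone`, and `executeSuspendSketch` are hypothetical names, the stop operation is injected rather than run over SSH, and matching on Docker's "No such container" error text is an assumption about how a missing container would be detected.

```typescript
// Hypothetical row shape; column names follow the PR description.
interface SandboxRowSketch {
  status: "running" | "stopped";
  sandbox_id: string;
  bridge_url: string | null;
  health_url: string | null;
}

// Assumption: a missing container surfaces as Docker's "No such container" text.
function isContainerGone(err: unknown): boolean {
  return err instanceof Error && /No such container/i.test(err.message);
}

// Stops the container via the injected function and returns the updated row.
// A "container gone" failure is tolerated; any other failure is rethrown so
// the row is not flipped to stopped while the container may still be running.
function executeSuspendSketch(
  row: SandboxRowSketch,
  stopContainer: (sandboxId: string) => void,
): SandboxRowSketch {
  try {
    stopContainer(row.sandbox_id);
  } catch (err) {
    if (!isContainerGone(err)) throw err;
  }
  // sandbox_id is kept so the same row can be resumed later.
  return { ...row, status: "stopped", bridge_url: null, health_url: null };
}
```

The lifecycle lock and the DB transaction are left out here; only the ordering (stop first, then update the row) is shown.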

Greptile findings addressed (commit 7f670b9)

  1. P1 — fast path leaves bridge_url / health_url NULL. Fixed by dropping the fast path entirely. executeResume now delegates to provision(), which restores both URLs from a fresh sandbox handle.
  2. P2 — executeResume has no lifecycle lock. Same fix: provision() already acquires its own advisory lock, so two concurrent resume jobs serialize.
  3. P2 — silent fallback when provider.start is absent. Same fix: there is no fallback path anymore. (Underlying truth: DockerSandboxProvider never exposed a standalone start(), so the fast path was dead code on the only provider that ships today — every resume was already paying the re-provision cost without any log indicating it.)

The fast path will return when the provider exposes start(sandboxId): Promise<handle> that returns a fresh sandbox handle (so bridge_url / health_url can be re-derived). Tracked as follow-up.
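
One way the deferred fast path might look once a provider exposes `start(sandboxId)` is sketched below. Everything except the `start(sandboxId)` signature is an assumption: `SandboxHandleSketch`, `SandboxProviderSketch`, `create`, and `resumeSketch` are hypothetical names, and the sketch logs the fallback explicitly rather than falling through silently.

```typescript
// Hypothetical handle shape carrying the URLs that suspend clears.
interface SandboxHandleSketch {
  sandboxId: string;
  bridgeUrl: string;
  healthUrl: string;
}

// Hypothetical provider interface; `create` stands in for the full provision flow.
interface SandboxProviderSketch {
  create(agentId: string): Promise<SandboxHandleSketch>;
  start?(sandboxId: string): Promise<SandboxHandleSketch>;
}

// Uses the fast path only when the provider implements start() and a
// sandbox id is known; otherwise logs the fallback and re-provisions.
async function resumeSketch(
  provider: SandboxProviderSketch,
  agentId: string,
  sandboxId: string | null,
  log: (message: string) => void,
): Promise<{ handle: SandboxHandleSketch; path: "fast" | "reprovision" }> {
  if (sandboxId !== null && provider.start) {
    const handle = await provider.start(sandboxId);
    return { handle, path: "fast" };
  }
  log("fast resume unavailable; falling back to full re-provision");
  const handle = await provider.create(agentId);
  return { handle, path: "reprovision" };
}
```

Because both branches return a fresh handle, the caller can always restore bridge_url and health_url from it.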

Why one PR

Same architectural change across 6 source files (1 type registry, 2 service files, 3 routes) + 1 test file. Bisect-friendly as a unit; splitting would create intermediate states where routes enqueue job types the daemon doesn't recognize.

What I tested

Locally:

  • tsc --noEmit clean on cloud-shared (filtered to touched files)
  • biome check clean on all changed files
  • bun test packages/cloud-shared/src/lib/services/__tests__/provisioning-job-types.test.ts → 4 pass, 100% coverage on the registry file. Catches the cheap mistakes: typo in wire value, missing entry, accidental duplicates.
  • Existing service tests still pass (27 pass / 2 unrelated content-safety fails that pre-date this branch)
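
The kind of registry checks described above (wire-value typos, duplicates, naming convention, type narrowing) could look roughly like the following. This is a sketch, not the contents of provisioning-job-types.test.ts: `JOB_TYPES_SKETCH`, `findRegistryProblems`, and `isJobTypeSketch` are hypothetical, and only the three wire values named in this PR are included.

```typescript
// Hypothetical registry mirroring the wire values mentioned in the PR.
const JOB_TYPES_SKETCH = {
  AGENT_DELETE: "agent_delete",
  AGENT_SUSPEND: "agent_suspend",
  AGENT_RESUME: "agent_resume",
} as const;

type JobTypeSketch = (typeof JOB_TYPES_SKETCH)[keyof typeof JOB_TYPES_SKETCH];

// Reports wire values that are not snake_case or that appear more than once.
function findRegistryProblems(registry: Record<string, string>): string[] {
  const problems: string[] = [];
  const seen = new Set<string>();
  for (const [key, value] of Object.entries(registry)) {
    if (!/^[a-z]+(_[a-z]+)*$/.test(value)) {
      problems.push(`${key}: "${value}" is not snake_case`);
    }
    if (seen.has(value)) {
      problems.push(`${key}: duplicate wire value "${value}"`);
    }
    seen.add(value);
  }
  return problems;
}

// Type guard that narrows an arbitrary string to a known job type.
function isJobTypeSketch(value: string): value is JobTypeSketch {
  const known: string[] = Object.values(JOB_TYPES_SKETCH);
  return known.includes(value);
}
```

A test can then assert that `findRegistryProblems(JOB_TYPES_SKETCH)` is empty and that the guard rejects unknown strings.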

What I did NOT test yet (needs deploy):

  • e2e via dashboard: click "Suspend" → daemon picks up agent_suspend job → SSH docker stop → DB row updates → container actually stopped on the core
  • e2e via dashboard: click "Resume" on stopped agent → daemon picks up agent_resume → provision() restores bridge_url + container Up
  • idempotency: two rapid clicks → second returns the existing job, no duplicate
  • /api/v1/jobs/<id> polling surfaces correct status transitions

The e2e path requires the new code to be live on both the cloud-api Worker AND the orchestrator daemon (Hetzner VM /opt/eliza). Smoke-test after merge or after a staging deploy.

Test gap:
The execute* methods (executeSuspend, executeResume, executeAgentSuspend, executeAgentResume) aren't unit-tested. The package doesn't have a harness for mocking dbWrite.transaction, advisory locks, or the SSH provider — same as the shipped agent_delete path. Adding that harness is a follow-up of its own size.

Follow-up

  • Real fast resume path: extend DockerSandboxProvider with start(sandboxId): Promise<SandboxHandle> that does the SSH docker start + waits for health + returns a handle with bridgeUrl / healthUrl. Then executeResume can branch on container presence and skip the full create flow when the container is still on disk.
  • Apply the same job-queue pattern to agent_logs and agent_snapshot (currently still inline through the missing container-control-plane service).
  • Repoint the frontend logs viewer from the legacy /api/compat/agents/.../logs to the new v1 routes.
  • DRY the four enqueue*Once methods (provision, delete, suspend, resume) — ~80 lines of near-identical advisory-lock + idempotency code. Touches shipped code so out of scope here.
  • Test harness for the execute* methods (covers this PR + the existing agent_delete path).

Greptile Summary

This PR migrates agent_suspend and agent_resume from inline Worker calls (which cannot SSH Hetzner cores) to the existing job-queue pattern, closing the silent container-leak bug where the DB row flipped to stopped while the container kept running.

  • New AGENT_SUSPEND and AGENT_RESUME job types are registered in the type registry with matching data/result shapes, type guards, and idempotent enqueueAgent{Suspend,Resume}Once helpers that use the existing advisory-lock pattern.
  • elizaSandboxService.executeSuspend runs inside the advisory lock, SSH-stops the container, and clears bridge_url/health_url while retaining sandbox_id; executeResume currently delegates to provision() (fast docker start path deferred to a follow-up).
  • All three affected HTTP routes (PATCH /agents/[id], POST /suspend, POST /resume) now return 202 with a jobId for polling, consistent with the existing delete path.

Confidence Score: 3/5

The architectural direction is correct, but executeResume runs without a lifecycle lock and neither enqueue helper guards against an opposing job type being in flight, allowing docker stop and docker start to race on the same container.

Two executor-level gaps flagged in earlier review rounds remain unaddressed: executeResume reads the sandbox record and calls provision() without holding the advisory lock, and neither enqueueAgentSuspendOnce nor enqueueAgentResumeOnce checks for an active job of the opposing type before creating a new one. Both gaps affect the core suspend/resume execution path and produce non-deterministic container state under concurrent operations.

eliza-sandbox.ts (executeResume lacks the lifecycle lock that executeSuspend has) and provisioning-jobs.ts (enqueue helpers need cross-type job conflict detection).
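
The cross-type conflict detection mentioned in this review could take several forms; one possible shape is sketched below. `LifecycleJobSketch`, `EnqueueResultSketch`, and `enqueueWithConflictCheck` are hypothetical, the job list is an in-memory array, and returning a `conflict` result (rather than queueing behind the opposing job) is just one design choice, not what the PR implements.

```typescript
type LifecycleJobType = "agent_suspend" | "agent_resume";

interface LifecycleJobSketch {
  id: string;
  type: LifecycleJobType;
  agentId: string;
  inFlight: boolean;
}

type EnqueueResultSketch =
  | { kind: "created"; job: LifecycleJobSketch }
  | { kind: "reused"; job: LifecycleJobSketch }
  | { kind: "conflict"; blockingJob: LifecycleJobSketch };

// Reuses an in-flight job of the same type, refuses when the opposing type
// is in flight for the same agent, and otherwise creates a new job.
function enqueueWithConflictCheck(
  jobs: LifecycleJobSketch[],
  type: LifecycleJobType,
  agentId: string,
): EnqueueResultSketch {
  const active = jobs.filter((j) => j.agentId === agentId && j.inFlight);

  const sameType = active.find((j) => j.type === type);
  if (sameType) return { kind: "reused", job: sameType };

  const opposing = active.find((j) => j.type !== type);
  if (opposing) return { kind: "conflict", blockingJob: opposing };

  const job: LifecycleJobSketch = {
    id: `job-${jobs.length + 1}`,
    type,
    agentId,
    inFlight: true,
  };
  jobs.push(job);
  return { kind: "created", job };
}
```

In the real helpers this check would need to run inside the same advisory lock as the insert, so that two Workers cannot both pass the check before either writes.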

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/cloud-shared/src/lib/services/eliza-sandbox.ts | Adds executeSuspend (with lifecycle lock) and executeResume (no lock, delegates to provision()). The missing lifecycle lock in executeResume and no guard against deletion_pending status are previously-flagged open issues. |
| packages/cloud-shared/src/lib/services/provisioning-jobs.ts | Adds enqueueAgentSuspendOnce and enqueueAgentResumeOnce with advisory-lock idempotency; each checks only for its own job type, not the opposing type, allowing suspend/resume to race (a previously-flagged P1). |
| packages/cloud-api/v1/eliza/agents/[agentId]/resume/route.ts | Switches to enqueueAgentResumeOnce and returns 202; still returns HTTP 409 for the already-in-progress case (previously-flagged inconsistency), and contains a dead "Agent state changed while starting" error branch inherited from the old provision path. |
| packages/cloud-api/v1/eliza/agents/[agentId]/suspend/route.ts | Switches to enqueueAgentSuspendOnce, returns 202 for both fresh and idempotent cases, consistent with the delete route. |
| packages/cloud-api/v1/eliza/agents/[agentId]/route.ts | PATCH suspend/shutdown actions now enqueue a job and return 202; checks for provisioning status before enqueuing and delegates gracefully for other states. |
| packages/cloud-shared/src/lib/services/provisioning-job-types.ts | Adds AGENT_SUSPEND and AGENT_RESUME wire values; registry is correct and covered by the new smoke test. |
| packages/cloud-shared/src/lib/services/__tests__/provisioning-job-types.test.ts | New smoke tests verify wire-value correctness, uniqueness, snake_case convention, and type narrowing; good coverage for the registry. |

Sequence Diagram

sequenceDiagram
    participant Client
    participant Worker as CF Worker (cloud-api)
    participant DB as PostgreSQL (jobs table)
    participant Daemon as Orchestrator Daemon (Hetzner VM)
    participant Core as Agent Core (SSH)

    Note over Client,Core: Suspend Flow
    Client->>Worker: "POST /suspend or PATCH action=suspend"
    Worker->>DB: enqueueAgentSuspendOnce (advisory lock)
    DB-->>Worker: "{ job, created }"
    Worker->>DB: triggerImmediate() [fire-and-forget]
    Worker-->>Client: "202 { jobId }"

    Daemon->>DB: poll pending jobs
    DB-->>Daemon: agent_suspend job
    Daemon->>Daemon: "executeAgentSuspend -> executeSuspend (advisory lock)"
    Daemon->>Core: SSH docker stop sandbox_id
    Core-->>Daemon: container stopped
    Daemon->>DB: "UPDATE status=stopped, bridge_url=NULL, health_url=NULL"
    Daemon->>DB: updateStatus(job, completed)

    Client->>Worker: "GET /api/v1/jobs/{jobId}"
    Worker-->>Client: "{ status: completed }"

    Note over Client,Core: Resume Flow
    Client->>Worker: POST /resume
    Worker->>DB: enqueueAgentResumeOnce (advisory lock)
    DB-->>Worker: "{ job, created }"
    Worker->>DB: triggerImmediate() [fire-and-forget]
    Worker-->>Client: "202 { jobId }"

    Daemon->>DB: poll pending jobs
    DB-->>Daemon: agent_resume job
    Daemon->>Daemon: "executeAgentResume -> executeResume (no lock)"
    Daemon->>Daemon: provision() [re-provision path]
    Daemon->>Core: SSH docker start / create container
    Core-->>Daemon: bridge_url + health_url
    Daemon->>DB: "UPDATE status=running, bridge_url, health_url"
    Daemon->>DB: updateStatus(job, completed)

Reviews (4): Last reviewed commit: "fix(cloud): drop dead fast-path in execu..."

Suspend and resume previously called elizaSandboxService directly from
the cloud-api Worker, which silently failed because Workers can't SSH
the Hetzner cores. The DB row flipped to `stopped` while the container
kept burning RAM, and "resume" attempted a full re-provision every
time even when the original container was still on disk.

Mirrors the agent_delete refactor (PR #7746):

- new JOB_TYPES.AGENT_SUSPEND and AGENT_RESUME with data/result shapes,
  type guards, and idempotent enqueue helpers in provisioning-jobs.ts
- elizaSandboxService.executeSuspend: lifecycle lock, SSH docker stop
  (tolerant of "container gone"), DB to stopped, sandbox_id retained
  so the same container can be resumed
- elizaSandboxService.executeResume: fast path `docker start` on the
  existing container (~5s), full re-provision fallback if the container
  is gone (~60s). Existing Neon DB reused on both paths.
- cloud-api PATCH /eliza/agents/[id], /suspend, /resume all return 202
  with a jobId; clients poll /api/v1/jobs/<id> for the final status
- second suspend/resume while one is in flight reuses the existing job

Follow-up: same pattern for agent_logs and agent_snapshot.
@coderabbitai
Contributor

coderabbitai Bot commented May 19, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d07f719e-55e6-46dd-96ba-80935fe99ac3

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

@claude
Contributor

claude Bot commented May 19, 2026

Claude encountered an error after 0s —— View job


I'll analyze this and get back to you.

Comment thread packages/cloud-shared/src/lib/services/eliza-sandbox.ts Outdated
Comment thread packages/cloud-shared/src/lib/services/eliza-sandbox.ts
Comment thread packages/cloud-shared/src/lib/services/eliza-sandbox.ts Outdated
Catches the cheap mistakes the orchestrator daemon can't recover from
at runtime: typo on the wire value, missing entry, accidental
duplicates. Deeper executor tests (SSH, locks, DB writes) need a test
harness the package doesn't have yet — follow-up.
@github-actions github-actions Bot added the Tests label May 19, 2026

Greptile flagged three issues on the resume executor:

P1: fast path UPDATEs status='running' but doesn't restore bridge_url /
health_url that executeSuspend clears, so proxy routes that guard on
rec.bridge_url see a "running" agent that's unreachable.

P2: executeResume reads outside any transaction and holds no lifecycle
lock, unlike executeSuspend. Two concurrent resume jobs can race.

P2: when provider.start is undefined the code falls through to a full
re-provision with no log, hiding the "method not implemented" path.

Underlying truth: DockerSandboxProvider doesn't expose a standalone
start() at all, so the fast path was dead code on the only provider
that ships today — every resume already paid the re-provision cost.

Fix: drop the fast path. executeResume now delegates to provision(),
which restores bridge_url / health_url from a fresh sandbox handle and
acquires its own advisory lock. The fast path returns when the
provider grows a start() that yields the handle — tracked as
follow-up.

Also updates the resume route comment and user-facing message that
falsely claimed "~5s fast path".

@standujar standujar marked this pull request as draft May 19, 2026 20:01
@standujar standujar closed this May 19, 2026
@standujar standujar deleted the feat/orchestrator-http-server branch May 19, 2026 20:06
@standujar
Collaborator Author

Superseded by #7810. The branch was renamed (feat/orchestrator-http-server → feat/agent-lifecycle-via-job-queue) since the scope grew to include RESTART / LOGS / SNAPSHOT job types beyond just orchestrator wiring. Same commits + new work continues there.

standujar added a commit that referenced this pull request May 19, 2026
Three new daemon-side handlers extend the cores-never-touched-from-Workers
architecture started in #7808 (suspend/resume) to cover the rest of the
lifecycle:

- executeRestart: shutdown() + provision() atomically on the daemon.
  Replaces the Worker-side sequence which silently no-op'd the stop
  step and could leave a stale container running alongside the new one.

- executeLogs: SSH `docker logs --tail N <container>` via the new
  SandboxProvider.fetchLogs() method. Works for stopped + crashed
  agents (the Worker-side fetch(bridge_url + /logs) returned empty for
  anything not actively running).

- executeSnapshot: thin wrapper around the existing snapshot() so the
  job dispatcher can route through a single contract. Invoked from
  the daemon so outbound traffic to cores uses the same network
  identity as every other lifecycle op.

Job machinery in provisioning-jobs.ts mirrors the suspend/resume
patterns: data/result shapes, type guards, idempotent enqueue methods
that reuse in-flight jobs on duplicate requests.

DockerSandboxProvider.fetchLogs() merges stderr into stdout because
agent crash traces tend to land on stderr.
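
For illustration, the command that such a log fetch might run over SSH can be built as follows. This is a sketch under assumptions: `buildDockerLogsCommand` is a hypothetical helper, and using a shell `2>&1` redirect is one common way to merge stderr into stdout; the commit does not show how DockerSandboxProvider.fetchLogs() actually does it.

```typescript
// Builds a `docker logs` command line that merges stderr into stdout.
// Inputs are validated because the string is meant to be run in a remote shell.
function buildDockerLogsCommand(containerName: string, tail: number): string {
  if (!Number.isInteger(tail) || tail <= 0) {
    throw new Error("tail must be a positive integer");
  }
  // Docker container names are limited to [a-zA-Z0-9][a-zA-Z0-9_.-]*.
  if (!/^[a-zA-Z0-9][a-zA-Z0-9_.-]*$/.test(containerName)) {
    throw new Error("unexpected container name");
  }
  return `docker logs --tail ${tail} ${containerName} 2>&1`;
}

const exampleCommand = buildDockerLogsCommand("agent-sandbox-1", 200);
```

Since `docker logs` reads from the container's stored log output, the same command works for stopped containers as well as running ones.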