feat(cloud): all agent lifecycle ops via job queue (Workers don't call cores)#7810
Conversation
Suspend and resume previously called elizaSandboxService directly from the cloud-api Worker, which silently failed because Workers can't SSH the Hetzner cores. The DB row flipped to `stopped` while the container kept burning RAM, and "resume" attempted a full re-provision every time even when the original container was still on disk. Mirrors the agent_delete refactor (PR #7746): - new JOB_TYPES.AGENT_SUSPEND and AGENT_RESUME with data/result shapes, type guards, and idempotent enqueue helpers in provisioning-jobs.ts - elizaSandboxService.executeSuspend: lifecycle lock, SSH docker stop (tolerant of "container gone"), DB to stopped, sandbox_id retained so the same container can be resumed - elizaSandboxService.executeResume: fast path `docker start` on the existing container (~5s), full re-provision fallback if the container is gone (~60s). Existing Neon DB reused on both paths. - cloud-api PATCH /eliza/agents/[id], /suspend, /resume all return 202 with a jobId; clients poll /api/v1/jobs/<id> for the final status - second suspend/resume while one is in flight reuses the existing job Follow-up: same pattern for agent_logs and agent_snapshot.
Catches the cheap mistakes the orchestrator daemon can't recover from at runtime: typo on the wire value, missing entry, accidental duplicates. Deeper executor tests (SSH, locks, DB writes) need a test harness the package doesn't have yet — follow-up.
Greptile flagged three issues on the resume executor: P1: fast path UPDATEs status='running' but doesn't restore bridge_url / health_url that executeSuspend clears, so proxy routes that guard on rec.bridge_url see a "running" agent that's unreachable. P2: executeResume reads outside any transaction and holds no lifecycle lock, unlike executeSuspend. Two concurrent resume jobs can race. P2: when provider.start is undefined the code falls through to a full re-provision with no log, hiding the "method not implemented" path. Underlying truth: DockerSandboxProvider doesn't expose a standalone start() at all, so the fast path was dead code on the only provider that ships today — every resume already paid the re-provision cost. Fix: drop the fast path. executeResume now delegates to provision(), which restores bridge_url / health_url from a fresh sandbox handle and acquires its own advisory lock. The fast path returns when the provider grows a start() that yields the handle — tracked as follow-up. Also updates the resume route comment and user-facing message that falsely claimed "~5s fast path".
…T job types Three new wire values for the orchestrator daemon to dispatch on. The executors land in subsequent commits — register first so smoke tests cover the registry before the implementations rely on it. Why all three at once: same architectural shift as suspend/resume — Cloudflare Workers never call cores directly, every cores-bound op flows through the queue. RESTART is shutdown+provision atomic, LOGS is daemon-side `docker logs`, SNAPSHOT is daemon-side `fetchSnapshotState` (today the Worker hits bridge HTTP, which fails for stopped or crashed agents).
|
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Three new daemon-side handlers extend the cores-never-touched-from-Workers architecture started in #7808 (suspend/resume) to cover the rest of the lifecycle: - executeRestart: shutdown() + provision() atomically on the daemon. Replaces the Worker-side sequence which silently no-op'd the stop step and could leave a stale container running alongside the new one. - executeLogs: SSH \`docker logs --tail N <container>\` via the new SandboxProvider.fetchLogs() method. Works for stopped + crashed agents (the Worker-side fetch(bridge_url + /logs) returned empty for anything not actively running). - executeSnapshot: thin wrapper around the existing snapshot() so the job dispatcher can route through a single contract. Invoked from the daemon so outbound traffic to cores uses the same network identity as every other lifecycle op. Job machinery in provisioning-jobs.ts mirrors the suspend/resume patterns: data/result shapes, type guards, idempotent enqueue methods that reuse in-flight jobs on duplicate requests. DockerSandboxProvider.fetchLogs() merges stderr into stdout because agent crash traces tend to land on stderr.
Each route now enqueues the appropriate job and returns 202 + jobId instead of running the operation inline. Cloudflare Workers can't SSH the Hetzner cores and the bridge-HTTP path is unreachable for non- running agents — the daemon, which owns the SSH keys and runs from a stable network identity, executes everything. Routes touched: - v1/agents/[id]/suspend (was: shutdown() inline → silent no-op) - v1/agents/[id]/restart (was: shutdown() + provision() inline) - v1/agents/[id]/logs (was: fetch bridge /logs, empty when stopped) - v1/eliza/agents/[id]/snapshot (was: bridge-HTTP fetchSnapshotState) - compat/agents/[id]/suspend - compat/agents/[id]/logs Each response includes a \`polling\` block with the jobs endpoint, recommended interval, and expected duration so clients can poll \`/api/v1/jobs/<id>\` for the final status. Breaking shape change vs. the old inline routes: clients that expected logs/snapshot results synchronously in the response body now need to poll. Frontend toasts still fire optimistically — same behavior as the already-shipped suspend route — and will be tracked through the existing job poller in a follow-up.
|
Claude encountered an error after 0s —— View job I'll analyze this and get back to you. |
| void provisioningJobService.triggerImmediate().catch(() => { | ||
| // Logged inside the service. | ||
| }); |
There was a problem hiding this comment.
triggerImmediate missing c.env in Worker-hosted routes
In Cloudflare Workers, env vars are available only as request-context bindings (c.env), not via process.env. The triggerImmediate fallback path uses process.env.CONTAINER_CONTROL_PLANE_URL, process.env.CRON_SECRET, etc., all of which will be undefined at runtime in a CF Worker. The result is that triggerImmediate() silently returns early and the daemon is never kicked — so restart jobs sit in pending until the next scheduled cron tick (up to ~1 minute), while the very same issue is avoided for AGENT_DELETE in v1/eliza/agents/[agentId]/route.ts which correctly passes c.env. The same pattern is missing in v1/agents/[agentId]/suspend/route.ts line 76 and v1/agents/[agentId]/logs/route.ts line 269.
| void provisioningJobService.triggerImmediate().catch(() => { | ||
| // Logged inside the service. | ||
| }); |
There was a problem hiding this comment.
c.env not passed in PATCH but is in DELETE in the same file
The DELETE handler on line 325 correctly calls provisioningJobService.triggerImmediate(c.env), giving the daemon access to Worker-context bindings. The new PATCH handler calls triggerImmediate() without c.env, so the daemon kick never fires in a Cloudflare Worker context — the suspend/shutdown operation will appear unresponsive for up to a full cron interval after the user clicks the button.
| async executeRestart( | ||
| agentId: string, | ||
| orgId: string, | ||
| ): Promise<{ | ||
| success: boolean; | ||
| containerStopped: boolean; | ||
| containerStarted: boolean; | ||
| bridgeUrl?: string; | ||
| healthUrl?: string; | ||
| error?: string; | ||
| }> { | ||
| const shutdownResult = await this.shutdown(agentId, orgId); | ||
| if (!shutdownResult.success) { | ||
| if (shutdownResult.error === "Agent not found") { | ||
| return { | ||
| success: false, | ||
| containerStopped: false, | ||
| containerStarted: false, | ||
| error: "Agent not found", | ||
| }; | ||
| } | ||
| logger.warn("[agent-sandbox] Shutdown during restart returned error, continuing", { | ||
| agentId, | ||
| error: shutdownResult.error, | ||
| }); | ||
| } | ||
|
|
||
| const provisionResult = await this.provision(agentId, orgId); | ||
| if (!provisionResult.success) { | ||
| return { | ||
| success: false, | ||
| containerStopped: shutdownResult.success, | ||
| containerStarted: false, | ||
| error: provisionResult.error, | ||
| }; | ||
| } | ||
|
|
||
| return { | ||
| success: true, | ||
| containerStopped: shutdownResult.success, | ||
| containerStarted: true, | ||
| bridgeUrl: provisionResult.bridgeUrl, | ||
| healthUrl: provisionResult.healthUrl, | ||
| }; |
There was a problem hiding this comment.
executeRestart is not atomic despite the docstring's claim
The docstring says "atomic on the daemon side so two concurrent restarts can't interleave stop+start out of order," but shutdown() and provision() are two separate transactions each acquiring and releasing the advisory lock independently. Between them — after shutdown() commits and before provision() starts — the lock is free and the agent status is stopped. A concurrent AGENT_RESUME job can acquire the lock at that point and race against the restart's own provision() call. The idempotency gate in enqueueAgentResumeOnce only checks for in-flight AGENT_RESUME jobs; it does not see the AGENT_RESTART in-progress marker. Depending on how provision() handles a concurrent re-entrant call, this can leave two containers provisioned or one restart result overwriting the other's bridge_url/health_url.
…s-helpers) Unblocks the lint-and-types CI step that's failing on this PR. Drift was on develop after #7804 merged — touching it from here so the merge of this branch into develop produces a clean lint pass.
|
Claude encountered an error after 0s —— View job I'll analyze this and get back to you. |
Five small wins surfaced by a second /clean pass after the lifecycle queue stack merged (elizaOS#7810/elizaOS#7813/elizaOS#7815/elizaOS#7816): - provisioning-jobs.ts: drop 4 redundant type casts. `status: "error"`, `"deletion_failed"`, and the `webhook_status` updates are all literals matching the inferred parameter type — the `as Parameters<...>[1]` / `as Partial<Job>` casts added nothing. - provisioning-jobs.ts: executeAgentProvision failure path now uses agentProvisionJobResultToRecord({...}) like every other executor, preserving the typed-serialization-boundary pattern. - v1/eliza/agents/[id]/route.ts (DELETE): drop the redundant `instanceof Error && error.message === "Agent not found"` branch. failureResponse() already maps any "not found" error to 404. - v1/agents/[id]/logs/route.ts: add `success: false` to the 404 body to match the response shape every sibling route uses. - v1/eliza/agents/[id]/resume/route.ts: trim a forward-looking comment about a "future docker start fast path" — belongs in a ticket, not the route. Kept the audit-log rationale. - __tests__/provisioning-job-types.test.ts: lock the registry size with `expect(Object.keys(JOB_TYPES)).toHaveLength(7)`. A new entry without a matching wire-value assertion now fails CI instead of being silently under-covered. Net diff: -5 LOC, 5 files. No behavior change.
Architectural shift
Cloudflare Workers never talk to Hetzner cores directly. Every cores-bound op flows through the job queue, the orchestrator daemon (the only context with SSH keys + a stable outbound network identity) executes it, and the Worker reads the result from `jobs.result`.
Previously the Worker code path called `elizaSandboxService.shutdown()` / `provision()` / `snapshot()` / bridge-HTTP-fetch inline. SSH-bound calls silently no-op'd from Workers; bridge-HTTP calls failed for stopped or crashed agents (no bridge to fetch from). Result: DB rows that claimed `stopped` while containers kept burning RAM, "save snapshot" 500s, and a logs view that returned empty for any non-trivially-running agent.
Scope (all shipped in this PR)
Six lifecycle operations, five new job types, six route rewrites. Pattern mirrors the shipped `agent_delete` (PR #7746).
Greptile-resolved (from #7808, commit `7f670b9`)
The fast-path `docker start` in `executeResume` was dead code on `DockerSandboxProvider` (no standalone `start()` exposed), so every resume already paid the full re-provision cost. Greptile flagged three derived issues: bridge_url not restored, no lifecycle lock, silent fallback. Fix: drop the fast path. `provision()` restores URLs and holds its own advisory lock. Real fast path returns when the provider grows a `start(sandboxId): Promise` — tracked as follow-up.
What I tested
Locally:
What I did NOT test (needs deploy on both Worker + daemon VM):
Test gap:
The `execute*` methods aren't unit-tested. cloud-shared has no harness for mocking `dbWrite.transaction` + advisory locks + SSH provider. Same gap as the shipped `agent_delete` / `agent_provision` paths. Follow-up of its own size.
Frontend UX note
Snapshot / suspend / restart toasts fire optimistically (the same as today for the already-shipped suspend route): the user sees "Snapshot saved" before the job actually completes. The DB eventually reflects reality. To make the toast truthful, the frontend needs to extend its existing `poller.track(agentId, jobId)` flow (used today for provision/resume) to cover the new 202 paths. Follow-up frontend PR.
Follow-up
Greptile Summary
This PR shifts all agent lifecycle operations (suspend, resume, restart, logs, snapshot) from inline Cloudflare Worker calls to a job-queue pattern executed by the daemon — the only context with SSH access to Hetzner cores. It adds five new job types mirroring the shipped
agent_deletepattern, rewrites six route handlers to return202 + jobId, and implements five daemon-sideexecute*methods inElizaSandboxService.v1/,v1/eliza/, andcompat/replace directelizaSandboxServicecalls withprovisioningJobService.enqueue*Once+triggerImmediate(), returning202and a polling envelope.execute*methods ineliza-sandbox.tshandle the actual SSH operations daemon-side:executeSuspend(advisory-locked transaction),executeResume(delegates toprovision()),executeRestart(sequentialshutdown+provision),executeLogs(SSHdocker logs), andexecuteSnapshot(delegates to existingsnapshot()).DockerSandboxProvider.fetchLogsis added with integer-clampedsafeTailand2>&1stderr merge so crashed agents' boot traces are included.Confidence Score: 4/5
The structural change is sound — six routes correctly shifted from inline Worker calls to job-queue dispatched by the daemon — but the
triggerImmediate()calls without CF Worker env bindings mean the orchestrator kick is a no-op at runtime, so jobs sit pending until the next cron tick.The
triggerImmediate/ missing env-binding issue carried from the prior review is a present runtime defect: the daemon wake-up call silently fails in every new route that adds the pattern, so suspend, restart, logs, and snapshot operations will always incur the full cron-tick latency rather than the near-instant dispatch intended.The
triggerImmediate()calls across the new route handlers (restart, suspend, logs inv1/agents/, PATCH inv1/eliza/agents/[agentId]/route.ts, and the newly added resume/snapshot/suspend underv1/eliza/agents/) all need thec.envpass-through addressed before deploy.Important Files Changed
enqueueAgentSnapshotOnceidempotency check doesn't filter bysnapshotType, mirroring the already-flaggedtail/logs issue.executeSuspendis properly transaction-locked;executeResumeandexecuteRestartrely on provision()'s own lock but are not wrapped in a single atomic boundary (already flagged in previous review).triggerImmediate()called withoutc.env(already flagged in previous review).triggerImmediate()missingc.env(already flagged).triggerImmediate()missingc.env(already flagged). Idempotency ignorestail(already flagged).triggerImmediate()missingc.envin the new path (already flagged), whereas DELETE correctly passes c.env.triggerImmediate()without env context issue as the v1 routes (same pattern flagged in previous review).Sequence Diagram
sequenceDiagram participant C as Client participant W as CF Worker (route) participant DB as Database (jobs) participant D as Daemon (orchestrator) participant H as Hetzner Core (SSH) C->>W: "POST /suspend | /restart | /snapshot" W->>DB: "enqueue*Once() advisory lock, idempotency check, INSERT job" DB-->>W: "job {id, status: pending}" W->>D: triggerImmediate() best-effort W-->>C: "202 {jobId, polling.endpoint}" Note over D: Cron tick or triggerImmediate kick D->>DB: SELECT pending jobs DB-->>D: job row alt AGENT_SUSPEND or AGENT_RESTART D->>H: SSH docker stop container H-->>D: ok D->>DB: "UPDATE status=stopped, bridge_url=NULL" else AGENT_RESUME D->>H: provision() docker run + health check H-->>D: bridgeUrl, healthUrl D->>DB: "UPDATE status=running, bridge_url, health_url" else AGENT_LOGS D->>H: SSH docker logs --tail N H-->>D: stdout+stderr else AGENT_SNAPSHOT D->>H: fetch bridge_url/snapshot H-->>D: backup metadata D->>DB: INSERT agent_sandbox_backups end D->>DB: "UPDATE jobs SET status=completed, result" C->>DB: "GET /api/v1/jobs/{id} poll until completed" DB-->>C: "{status, result}"Comments Outside Diff (1)
packages/cloud-shared/src/lib/services/provisioning-jobs.ts, line 936-957 (link)AGENT_LOGSignores thetailparameterThe in-flight dedup query matches any
pending/in_progressagent_logsjob for the same agent, regardless oftail. A second request fortail=5000while atail=100job is executing returns thetail=100job, but both route handlers respond with the caller's owntailvalue in the envelope. The client seesalreadyInProgress: truebut the job stored injobs.data.tailand eventually written tojobs.result.tailis 100 — clients polling the job will receive 100-line logs when they asked for 5000. Either includetailin the idempotency check, or make the response tail field come from the returned job's data rather than the caller's parameter.Reviews (2): Last reviewed commit: "chore(cloud-shared): biome format drift ..." | Re-trigger Greptile