Skip to content

feat(cloud): all agent lifecycle ops via job queue (Workers don't call cores)#7810

Merged
lalalune merged 9 commits into
developfrom
feat/agent-lifecycle-via-job-queue
May 20, 2026
Merged

feat(cloud): all agent lifecycle ops via job queue (Workers don't call cores)#7810
lalalune merged 9 commits into
developfrom
feat/agent-lifecycle-via-job-queue

Conversation

@standujar
Copy link
Copy Markdown
Collaborator

@standujar standujar commented May 19, 2026

Note: this PR supersedes #7808, which was auto-closed when the branch was renamed (`feat/orchestrator-http-server` → `feat/agent-lifecycle-via-job-queue`). The first three commits (`d29da4c`, `fb1a92b`, `7f670b9`) are identical to what reviewers saw on #7808 — including the Greptile findings and the fix commit `7f670b9`.

Architectural shift

Cloudflare Workers never talk to Hetzner cores directly. Every cores-bound op flows through the job queue, the orchestrator daemon (the only context with SSH keys + a stable outbound network identity) executes it, and the Worker reads the result from `jobs.result`.

Previously the Worker code path called `elizaSandboxService.shutdown()` / `provision()` / `snapshot()` / bridge-HTTP-fetch inline. SSH-bound calls silently no-op'd from Workers; bridge-HTTP calls failed for stopped or crashed agents (no bridge to fetch from). Result: DB rows that claimed `stopped` while containers kept burning RAM, "save snapshot" 500s, and a logs view that returned empty for any non-trivially-running agent.

Scope (all shipped in this PR)

Six lifecycle operations, five new job types, six route rewrites. Pattern mirrors the shipped `agent_delete` (PR #7746).

Op Job type Daemon executor Route(s)
Suspend `AGENT_SUSPEND` `executeSuspend` — SSH `docker stop`, DB→stopped, keeps sandbox_id `PATCH /v1/eliza/agents/[id]`, `POST /v1/eliza/agents/[id]/suspend`, `POST /v1/agents/[id]/suspend`, `POST /compat/agents/[id]/suspend`
Resume `AGENT_RESUME` `executeResume` → `provision()` (URLs + DB restored) `POST /v1/eliza/agents/[id]/resume`
Restart `AGENT_RESTART` `executeRestart` — shutdown + provision atomic `POST /v1/agents/[id]/restart`
Logs `AGENT_LOGS` `executeLogs` — SSH `docker logs --tail N` via `provider.fetchLogs()` `GET /v1/agents/[id]/logs`, `GET /compat/agents/[id]/logs`
Snapshot `AGENT_SNAPSHOT` `executeSnapshot` — daemon-side snapshot() `POST /v1/eliza/agents/[id]/snapshot`

Greptile-resolved (from #7808, commit `7f670b9`)

The fast-path `docker start` in `executeResume` was dead code on `DockerSandboxProvider` (no standalone `start()` exposed), so every resume already paid the full re-provision cost. Greptile flagged three derived issues: bridge_url not restored, no lifecycle lock, silent fallback. Fix: drop the fast path. `provision()` restores URLs and holds its own advisory lock. Real fast path returns when the provider grows a `start(sandboxId): Promise` — tracked as follow-up.

What I tested

Locally:

  • `tsc --noEmit` clean on cloud-shared (touched files filtered)
  • biome check clean on all 10 touched files
  • `bun test packages/cloud-shared/src/lib/services/tests/provisioning-job-types.test.ts` → 4 pass, 100% coverage on the registry file. Catches typos / missing entries / duplicates / non-snake-case.
  • Manual /clean single-pass on the diff: no slop, no defensive overkill, no `any` / `as unknown` casts. JSDocs explain Why.

What I did NOT test (needs deploy on both Worker + daemon VM):

  • e2e dashboard: each of suspend / resume / restart / logs / snapshot via the dashboard against a live agent
  • Idempotency: rapid double-click on any op → second returns existing job
  • Logs job uniqueness: completed logs job is NOT reused on a fresh click (only in-flight ones are)
  • `/api/v1/jobs/` polling surfaces status transitions correctly

Test gap:
The `execute*` methods aren't unit-tested. cloud-shared has no harness for mocking `dbWrite.transaction` + advisory locks + SSH provider. Same gap as the shipped `agent_delete` / `agent_provision` paths. Follow-up of its own size.

Frontend UX note

Snapshot / suspend / restart toasts fire optimistically (the same as today for the already-shipped suspend route): the user sees "Snapshot saved" before the job actually completes. The DB eventually reflects reality. To make the toast truthful, the frontend needs to extend its existing `poller.track(agentId, jobId)` flow (used today for provision/resume) to cover the new 202 paths. Follow-up frontend PR.

Follow-up

  • Frontend poller for the new 202 routes (suspend, restart, snapshot, logs) — kills the optimistic-toast lie.
  • Real fast-path `docker start` via `provider.start()` returning a fresh handle (perf optimization, ~5s vs ~60s for resume).
  • DRY the now-seven `enqueue*Once` methods — ~80 LOC of near-identical lock + idempotency code per method. Touches shipped code, so out of scope here.
  • Test harness for the execute* methods (covers this PR + the existing `agent_delete` / `agent_provision` paths).
  • 5 remaining `elizaSandboxService.shutdown()` callers from non-route contexts (`agent-managed-github`, `agent-managed-discord`, `eliza-managed-launch`, `active-billing`). Architectural question: should those services run as Workers or daemon? Same Worker-side gap applies.

Greptile Summary

This PR shifts all agent lifecycle operations (suspend, resume, restart, logs, snapshot) from inline Cloudflare Worker calls to a job-queue pattern executed by the daemon — the only context with SSH access to Hetzner cores. It adds five new job types mirroring the shipped agent_delete pattern, rewrites six route handlers to return 202 + jobId, and implements five daemon-side execute* methods in ElizaSandboxService.

  • Six route rewrites across v1/, v1/eliza/, and compat/ replace direct elizaSandboxService calls with provisioningJobService.enqueue*Once + triggerImmediate(), returning 202 and a polling envelope.
  • Five new execute* methods in eliza-sandbox.ts handle the actual SSH operations daemon-side: executeSuspend (advisory-locked transaction), executeResume (delegates to provision()), executeRestart (sequential shutdown + provision), executeLogs (SSH docker logs), and executeSnapshot (delegates to existing snapshot()).
  • DockerSandboxProvider.fetchLogs is added with integer-clamped safeTail and 2>&1 stderr merge so crashed agents' boot traces are included.

Confidence Score: 4/5

The structural change is sound — six routes correctly shifted from inline Worker calls to job-queue dispatched by the daemon — but the triggerImmediate() calls without CF Worker env bindings mean the orchestrator kick is a no-op at runtime, so jobs sit pending until the next cron tick.

The triggerImmediate / missing env-binding issue carried from the prior review is a present runtime defect: the daemon wake-up call silently fails in every new route that adds the pattern, so suspend, restart, logs, and snapshot operations will always incur the full cron-tick latency rather than the near-instant dispatch intended.

The triggerImmediate() calls across the new route handlers (restart, suspend, logs in v1/agents/, PATCH in v1/eliza/agents/[agentId]/route.ts, and the newly added resume/snapshot/suspend under v1/eliza/agents/) all need the c.env pass-through addressed before deploy.

Important Files Changed

Filename Overview
packages/cloud-shared/src/lib/services/provisioning-jobs.ts Core of the PR: adds five new enqueue methods and five executor methods following the existing agent_delete/provision pattern. The enqueueAgentSnapshotOnce idempotency check doesn't filter by snapshotType, mirroring the already-flagged tail/logs issue.
packages/cloud-shared/src/lib/services/eliza-sandbox.ts Adds five daemon-side execute* methods. executeSuspend is properly transaction-locked; executeResume and executeRestart rely on provision()'s own lock but are not wrapped in a single atomic boundary (already flagged in previous review).
packages/cloud-api/v1/agents/[agentId]/restart/route.ts Route correctly enqueues AGENT_RESTART and returns 202; triggerImmediate() called without c.env (already flagged in previous review).
packages/cloud-api/v1/eliza/agents/[agentId]/snapshot/route.ts Enqueues AGENT_SNAPSHOT and returns 202; pre-gates on agent.status === "running". The idempotency check in the underlying enqueue method ignores snapshotType.
packages/cloud-api/v1/agents/[agentId]/suspend/route.ts Route enqueues AGENT_SUSPEND with correct idempotency fast-paths for stopped/provisioning. triggerImmediate() missing c.env (already flagged).
packages/cloud-api/v1/agents/[agentId]/logs/route.ts Route enqueues AGENT_LOGS; triggerImmediate() missing c.env (already flagged). Idempotency ignores tail (already flagged).
packages/cloud-api/v1/eliza/agents/[agentId]/route.ts PATCH handler now enqueues AGENT_SUSPEND; triggerImmediate() missing c.env in the new path (already flagged), whereas DELETE correctly passes c.env.
packages/cloud-api/v1/eliza/agents/[agentId]/resume/route.ts Switches from enqueueAgentProvisionOnce to enqueueAgentResumeOnce; drops expectedUpdatedAt check. Clean change, no issues found.
packages/cloud-api/v1/eliza/agents/[agentId]/suspend/route.ts Correctly rewired to enqueue AGENT_SUSPEND with proper 409/already-stopped fast-paths. Clean implementation.
packages/cloud-api/compat/agents/[id]/logs/route.ts Compat logs route rewired to job queue; shares the same triggerImmediate() without env context issue as the v1 routes (same pattern flagged in previous review).
packages/cloud-api/compat/agents/[id]/suspend/route.ts Compat suspend route correctly handles stopped/provisioning fast-paths and enqueues AGENT_SUSPEND with 202 response.
packages/cloud-shared/src/lib/services/docker-sandbox-provider.ts Adds fetchLogs with integer-clamped tail (1-5000), shellQuote protection, and stderr merge via 2>&1. Clean implementation.
packages/cloud-shared/src/lib/services/provisioning-job-types.ts Adds five new job type constants; tested for uniqueness and snake_case convention in the new provisioning-job-types.test.ts.
packages/cloud-shared/src/lib/services/tests/provisioning-job-types.test.ts New smoke tests validate the job type registry for uniqueness, snake_case format, and correct wire values. Good coverage for the registry file.

Sequence Diagram

sequenceDiagram
    participant C as Client
    participant W as CF Worker (route)
    participant DB as Database (jobs)
    participant D as Daemon (orchestrator)
    participant H as Hetzner Core (SSH)

    C->>W: "POST /suspend | /restart | /snapshot"
    W->>DB: "enqueue*Once() advisory lock, idempotency check, INSERT job"
    DB-->>W: "job {id, status: pending}"
    W->>D: triggerImmediate() best-effort
    W-->>C: "202 {jobId, polling.endpoint}"

    Note over D: Cron tick or triggerImmediate kick
    D->>DB: SELECT pending jobs
    DB-->>D: job row

    alt AGENT_SUSPEND or AGENT_RESTART
        D->>H: SSH docker stop container
        H-->>D: ok
        D->>DB: "UPDATE status=stopped, bridge_url=NULL"
    else AGENT_RESUME
        D->>H: provision() docker run + health check
        H-->>D: bridgeUrl, healthUrl
        D->>DB: "UPDATE status=running, bridge_url, health_url"
    else AGENT_LOGS
        D->>H: SSH docker logs --tail N
        H-->>D: stdout+stderr
    else AGENT_SNAPSHOT
        D->>H: fetch bridge_url/snapshot
        H-->>D: backup metadata
        D->>DB: INSERT agent_sandbox_backups
    end

    D->>DB: "UPDATE jobs SET status=completed, result"
    C->>DB: "GET /api/v1/jobs/{id} poll until completed"
    DB-->>C: "{status, result}"
Loading

Comments Outside Diff (1)

  1. packages/cloud-shared/src/lib/services/provisioning-jobs.ts, line 936-957 (link)

    P2 Idempotency key for AGENT_LOGS ignores the tail parameter

    The in-flight dedup query matches any pending/in_progress agent_logs job for the same agent, regardless of tail. A second request for tail=5000 while a tail=100 job is executing returns the tail=100 job, but both route handlers respond with the caller's own tail value in the envelope. The client sees alreadyInProgress: true but the job stored in jobs.data.tail and eventually written to jobs.result.tail is 100 — clients polling the job will receive 100-line logs when they asked for 5000. Either include tail in the idempotency check, or make the response tail field come from the returned job's data rather than the caller's parameter.

Reviews (2): Last reviewed commit: "chore(cloud-shared): biome format drift ..." | Re-trigger Greptile

standujar added 5 commits May 19, 2026 18:11
Suspend and resume previously called elizaSandboxService directly from
the cloud-api Worker, which silently failed because Workers can't SSH
the Hetzner cores. The DB row flipped to `stopped` while the container
kept burning RAM, and "resume" attempted a full re-provision every
time even when the original container was still on disk.

Mirrors the agent_delete refactor (PR #7746):

- new JOB_TYPES.AGENT_SUSPEND and AGENT_RESUME with data/result shapes,
  type guards, and idempotent enqueue helpers in provisioning-jobs.ts
- elizaSandboxService.executeSuspend: lifecycle lock, SSH docker stop
  (tolerant of "container gone"), DB to stopped, sandbox_id retained
  so the same container can be resumed
- elizaSandboxService.executeResume: fast path `docker start` on the
  existing container (~5s), full re-provision fallback if the container
  is gone (~60s). Existing Neon DB reused on both paths.
- cloud-api PATCH /eliza/agents/[id], /suspend, /resume all return 202
  with a jobId; clients poll /api/v1/jobs/<id> for the final status
- second suspend/resume while one is in flight reuses the existing job

Follow-up: same pattern for agent_logs and agent_snapshot.
Catches the cheap mistakes the orchestrator daemon can't recover from
at runtime: typo on the wire value, missing entry, accidental
duplicates. Deeper executor tests (SSH, locks, DB writes) need a test
harness the package doesn't have yet — follow-up.
Greptile flagged three issues on the resume executor:

P1: fast path UPDATEs status='running' but doesn't restore bridge_url /
health_url that executeSuspend clears, so proxy routes that guard on
rec.bridge_url see a "running" agent that's unreachable.

P2: executeResume reads outside any transaction and holds no lifecycle
lock, unlike executeSuspend. Two concurrent resume jobs can race.

P2: when provider.start is undefined the code falls through to a full
re-provision with no log, hiding the "method not implemented" path.

Underlying truth: DockerSandboxProvider doesn't expose a standalone
start() at all, so the fast path was dead code on the only provider
that ships today — every resume already paid the re-provision cost.

Fix: drop the fast path. executeResume now delegates to provision(),
which restores bridge_url / health_url from a fresh sandbox handle and
acquires its own advisory lock. The fast path returns when the
provider grows a start() that yields the handle — tracked as
follow-up.

Also updates the resume route comment and user-facing message that
falsely claimed "~5s fast path".
…T job types

Three new wire values for the orchestrator daemon to dispatch on. The
executors land in subsequent commits — register first so smoke tests
cover the registry before the implementations rely on it.

Why all three at once: same architectural shift as suspend/resume —
Cloudflare Workers never call cores directly, every cores-bound op
flows through the queue. RESTART is shutdown+provision atomic, LOGS
is daemon-side `docker logs`, SNAPSHOT is daemon-side
`fetchSnapshotState` (today the Worker hits bridge HTTP, which fails
for stopped or crashed agents).
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 19, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 837641a2-77e8-4477-8eac-99fa636f3921

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/agent-lifecycle-via-job-queue

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

standujar added 2 commits May 19, 2026 23:27
Three new daemon-side handlers extend the cores-never-touched-from-Workers
architecture started in #7808 (suspend/resume) to cover the rest of the
lifecycle:

- executeRestart: shutdown() + provision() atomically on the daemon.
  Replaces the Worker-side sequence which silently no-op'd the stop
  step and could leave a stale container running alongside the new one.

- executeLogs: SSH \`docker logs --tail N <container>\` via the new
  SandboxProvider.fetchLogs() method. Works for stopped + crashed
  agents (the Worker-side fetch(bridge_url + /logs) returned empty for
  anything not actively running).

- executeSnapshot: thin wrapper around the existing snapshot() so the
  job dispatcher can route through a single contract. Invoked from
  the daemon so outbound traffic to cores uses the same network
  identity as every other lifecycle op.

Job machinery in provisioning-jobs.ts mirrors the suspend/resume
patterns: data/result shapes, type guards, idempotent enqueue methods
that reuse in-flight jobs on duplicate requests.

DockerSandboxProvider.fetchLogs() merges stderr into stdout because
agent crash traces tend to land on stderr.
Each route now enqueues the appropriate job and returns 202 + jobId
instead of running the operation inline. Cloudflare Workers can't SSH
the Hetzner cores and the bridge-HTTP path is unreachable for non-
running agents — the daemon, which owns the SSH keys and runs from a
stable network identity, executes everything.

Routes touched:
- v1/agents/[id]/suspend       (was: shutdown() inline → silent no-op)
- v1/agents/[id]/restart       (was: shutdown() + provision() inline)
- v1/agents/[id]/logs          (was: fetch bridge /logs, empty when stopped)
- v1/eliza/agents/[id]/snapshot (was: bridge-HTTP fetchSnapshotState)
- compat/agents/[id]/suspend
- compat/agents/[id]/logs

Each response includes a \`polling\` block with the jobs endpoint,
recommended interval, and expected duration so clients can poll
\`/api/v1/jobs/<id>\` for the final status.

Breaking shape change vs. the old inline routes: clients that
expected logs/snapshot results synchronously in the response body
now need to poll. Frontend toasts still fire optimistically — same
behavior as the already-shipped suspend route — and will be tracked
through the existing job poller in a follow-up.
@standujar standujar marked this pull request as ready for review May 19, 2026 21:28
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 19, 2026

Claude encountered an error after 0s —— View job


I'll analyze this and get back to you.

Comment on lines +58 to +60
void provisioningJobService.triggerImmediate().catch(() => {
// Logged inside the service.
});
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 triggerImmediate missing c.env in Worker-hosted routes

In Cloudflare Workers, env vars are available only as request-context bindings (c.env), not via process.env. The triggerImmediate fallback path uses process.env.CONTAINER_CONTROL_PLANE_URL, process.env.CRON_SECRET, etc., all of which will be undefined at runtime in a CF Worker. The result is that triggerImmediate() silently returns early and the daemon is never kicked — so restart jobs sit in pending until the next scheduled cron tick (up to ~1 minute), while the very same issue is avoided for AGENT_DELETE in v1/eliza/agents/[agentId]/route.ts which correctly passes c.env. The same pattern is missing in v1/agents/[agentId]/suspend/route.ts line 76 and v1/agents/[agentId]/logs/route.ts line 269.

Comment on lines +253 to +255
void provisioningJobService.triggerImmediate().catch(() => {
// Logged inside the service.
});
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 c.env not passed in PATCH but is in DELETE in the same file

The DELETE handler on line 325 correctly calls provisioningJobService.triggerImmediate(c.env), giving the daemon access to Worker-context bindings. The new PATCH handler calls triggerImmediate() without c.env, so the daemon kick never fires in a Cloudflare Worker context — the suspend/shutdown operation will appear unresponsive for up to a full cron interval after the user clicks the button.

Comment on lines +2823 to +2866
async executeRestart(
agentId: string,
orgId: string,
): Promise<{
success: boolean;
containerStopped: boolean;
containerStarted: boolean;
bridgeUrl?: string;
healthUrl?: string;
error?: string;
}> {
const shutdownResult = await this.shutdown(agentId, orgId);
if (!shutdownResult.success) {
if (shutdownResult.error === "Agent not found") {
return {
success: false,
containerStopped: false,
containerStarted: false,
error: "Agent not found",
};
}
logger.warn("[agent-sandbox] Shutdown during restart returned error, continuing", {
agentId,
error: shutdownResult.error,
});
}

const provisionResult = await this.provision(agentId, orgId);
if (!provisionResult.success) {
return {
success: false,
containerStopped: shutdownResult.success,
containerStarted: false,
error: provisionResult.error,
};
}

return {
success: true,
containerStopped: shutdownResult.success,
containerStarted: true,
bridgeUrl: provisionResult.bridgeUrl,
healthUrl: provisionResult.healthUrl,
};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 executeRestart is not atomic despite the docstring's claim

The docstring says "atomic on the daemon side so two concurrent restarts can't interleave stop+start out of order," but shutdown() and provision() are two separate transactions each acquiring and releasing the advisory lock independently. Between them — after shutdown() commits and before provision() starts — the lock is free and the agent status is stopped. A concurrent AGENT_RESUME job can acquire the lock at that point and race against the restart's own provision() call. The idempotency gate in enqueueAgentResumeOnce only checks for in-flight AGENT_RESUME jobs; it does not see the AGENT_RESTART in-progress marker. Depending on how provision() handles a concurrent re-entrant call, this can leave two containers provisioned or one restart result overwriting the other's bridge_url/health_url.

standujar added 2 commits May 20, 2026 00:07
…s-helpers)

Unblocks the lint-and-types CI step that's failing on this PR. Drift
was on develop after #7804 merged — touching it from here so the merge
of this branch into develop produces a clean lint pass.
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 19, 2026

Claude encountered an error after 0s —— View job


I'll analyze this and get back to you.

@lalalune lalalune merged commit fb79721 into develop May 20, 2026
31 of 36 checks passed
@lalalune lalalune deleted the feat/agent-lifecycle-via-job-queue branch May 20, 2026 02:09
2-A-M pushed a commit to 2-A-M/eliza that referenced this pull request May 20, 2026
Five small wins surfaced by a second /clean pass after the lifecycle
queue stack merged (elizaOS#7810/elizaOS#7813/elizaOS#7815/elizaOS#7816):

- provisioning-jobs.ts: drop 4 redundant type casts. `status: "error"`,
  `"deletion_failed"`, and the `webhook_status` updates are all literals
  matching the inferred parameter type — the `as Parameters<...>[1]` /
  `as Partial<Job>` casts added nothing.
- provisioning-jobs.ts: executeAgentProvision failure path now uses
  agentProvisionJobResultToRecord({...}) like every other executor,
  preserving the typed-serialization-boundary pattern.
- v1/eliza/agents/[id]/route.ts (DELETE): drop the redundant
  `instanceof Error && error.message === "Agent not found"` branch.
  failureResponse() already maps any "not found" error to 404.
- v1/agents/[id]/logs/route.ts: add `success: false` to the 404 body
  to match the response shape every sibling route uses.
- v1/eliza/agents/[id]/resume/route.ts: trim a forward-looking comment
  about a "future docker start fast path" — belongs in a ticket, not
  the route. Kept the audit-log rationale.
- __tests__/provisioning-job-types.test.ts: lock the registry size
  with `expect(Object.keys(JOB_TYPES)).toHaveLength(7)`. A new entry
  without a matching wire-value assertion now fails CI instead of being
  silently under-covered.

Net diff: -5 LOC, 5 files. No behavior change.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants