fix(cloud-frontend): track async job ids for suspend/snapshot/restart toasts#7813

Merged: lalalune merged 1 commit into develop from fix/cloud-frontend-job-poller-async-routes on May 20, 2026

@standujar (Collaborator) commented May 19, 2026

Why

Backend lifecycle ops moved to the job queue in #7810: suspend, restart, snapshot, logs all return 202 + jobId instead of completing inline.

The frontend was still firing the success toast immediately on 2xx and reloading the page, which lied about the operation being done. "Snapshot saved" appeared in < 1s while the daemon had just started a 30s job.

What changed

`packages/cloud-frontend/src/dashboard/containers/_components/agent-actions.tsx`:

  • Detection of 202 + jobId is now generic (any action), not hardcoded to provision/resume.
  • New "queued" toast variants for snapshot / suspend / shutdown.
  • For queued ops, the existing `useJobPoller` already handles `onComplete` / `onFailed` toasts + auto-refresh — no `window.location.reload()` needed on the queued path.

`packages/cloud-frontend/src/dashboard/containers/_components/eliza-agents-table.tsx`:

  • `handleSuspend` now reads 202 + jobId and `poller.track()`s it (same hook the provision flow uses).
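
The generic detection described above can be sketched as a small classifier over the status code and response body. This is a minimal illustration, not the PR's actual code: `extractJobId`, `classifyActionResponse`, and the `ActionOutcome` type are hypothetical names, and the only thing taken from the PR is the `{ data: { jobId } }` body shape and the 202/409 conventions.

```typescript
// Hypothetical helper names; only the { data: { jobId } } shape comes from the PR.
type ActionOutcome =
  | { kind: "queued"; jobId: string } // 202 + jobId: newly queued job
  | { kind: "inFlight"; jobId: string } // 409 + jobId: a job is already running
  | { kind: "inline" } // 2xx without jobId: legacy inline completion
  | { kind: "error"; status: number };

// Pull data.jobId out of an unknown JSON body without trusting its shape.
function extractJobId(body: unknown): string | undefined {
  if (typeof body !== "object" || body === null) return undefined;
  const data = (body as { data?: unknown }).data;
  if (typeof data !== "object" || data === null) return undefined;
  const jobId = (data as { jobId?: unknown }).jobId;
  return typeof jobId === "string" && jobId.length > 0 ? jobId : undefined;
}

// Classify any lifecycle action response, regardless of which action sent it.
function classifyActionResponse(status: number, body: unknown): ActionOutcome {
  const jobId = extractJobId(body);
  if (status === 409 && jobId) return { kind: "inFlight", jobId };
  if (status === 202 && jobId) return { kind: "queued", jobId };
  if (status >= 200 && status < 300) return { kind: "inline" };
  return { kind: "error", status };
}
```

A caller would hand `queued` and `inFlight` outcomes to `poller.track()` and keep the immediate success toast plus reload only for `inline`.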

Test plan

  • Click Suspend in the agents table → toast says "Suspend queued" → after ~5s of polling, table updates to stopped + toast "Agent provisioning completed" (provision-completed name is misleading but the hook fires it; copy can be tweaked later).
  • Click Save Snapshot in agent detail → "Snapshot queued" toast → poller resolves with onComplete callback.
  • Click Restart (if exposed via UI) → "Restart queued" → resolves.
  • Provision/Resume still work as before (no regression).
  • 409 path still tracked via existing fallback message.

Depends on

#7810 (backend routes returning 202 + jobId). Mergeable in any order — without #7810 the frontend changes are dead code paths (no route returns 202 + jobId for those actions yet), but they don't break anything.

Follow-up

  • Generic poller toast copy currently says "Provisioning completed" / "Provisioning failed" regardless of which op finished. Untangle by passing the action name through `useJobPoller` and templating the toast.
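
One possible shape for that follow-up is a per-action message template. The sketch below is an assumption about how it could look: `LifecycleAction`, `ACTION_LABELS`, and `jobToastMessage` are hypothetical names, the copy is illustrative, and the real implementation would presumably route through the project's i18n keys rather than hardcoded strings.

```typescript
// Hypothetical action names and labels; the real i18n keys are not shown in the PR.
type LifecycleAction =
  | "provision"
  | "resume"
  | "suspend"
  | "shutdown"
  | "snapshot"
  | "restart";
type JobOutcome = "completed" | "failed";

const ACTION_LABELS: Record<LifecycleAction, string> = {
  provision: "Provisioning",
  resume: "Resume",
  suspend: "Suspend",
  shutdown: "Shutdown",
  snapshot: "Snapshot",
  restart: "Restart",
};

// Build the completion toast from the action that was actually tracked,
// instead of always saying "Provisioning completed".
function jobToastMessage(
  action: LifecycleAction,
  outcome: JobOutcome,
  error?: string,
): string {
  const label = ACTION_LABELS[action];
  if (outcome === "completed") return `${label} completed`;
  return error ? `${label} failed: ${error}` : `${label} failed`;
}
```

With something like this, the poller would need to remember the action alongside each tracked job id so that the completion callbacks can pass it in.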

Greptile Summary

This PR fixes premature "success" toasts for async lifecycle operations (suspend, snapshot, restart) by detecting the 202+jobId response pattern — previously used only for provision/resume — and routing those actions through useJobPoller instead of resolving immediately.

  • agent-actions.tsx: 202+jobId detection is generalised to any action with per-action queued toast messages; the premature window.location.reload() on the 202 path is removed (reload still happens after the job completes via autoRefresh). The 409+jobId path is also generalised.
  • eliza-agents-table.tsx: handleSuspend now reads the response body for 202+jobId and calls poller.track(), matching the pattern handleProvision already used. However, unlike agent-actions.tsx, the 409+jobId case is not handled — a concurrent suspend attempt throws a false error instead of attaching to the existing job.

Confidence Score: 3/5

Safe for most flows, but the table's suspend handler has an incomplete edge-case that produces a visible false error toast for concurrent suspend attempts.

The core fix (routing 202+jobId through the job poller) is sound and the happy path works correctly. The table's handleSuspend does not handle the 409+jobId case that agent-actions.tsx now covers — if a suspend is already running and the user clicks suspend again from the table, they see "Suspend failed" and the optimistic status reverts, even though the daemon is healthy. The completion toasts ("Agent provisioning completed") are also incorrect for suspend and snapshot operations, which is a user-visible lie each time those jobs finish.

eliza-agents-table.tsx needs the 409+jobId guard added to handleSuspend to match the pattern in agent-actions.tsx.

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/cloud-frontend/src/dashboard/containers/_components/agent-actions.tsx | Generalises 202+jobId tracking to all actions (not just provision/resume), adds per-action queued toast messages, and removes premature window.location.reload() on the async path. Minor issues: shutdown reuses the suspendQueued i18n key, and onComplete/onFailed messages still say "provisioning" for non-provisioning operations. |
| packages/cloud-frontend/src/dashboard/containers/_components/eliza-agents-table.tsx | handleSuspend updated to read 202+jobId and call poller.track(), but misses the 409+jobId case that agent-actions.tsx now handles: a concurrent suspend attempt shows a false "Suspend failed" error instead of attaching to the in-flight job. |

Sequence Diagram

sequenceDiagram
    participant U as User
    participant FE as Frontend (agent-actions / table)
    participant API as Backend API
    participant P as useJobPoller
    participant J as /api/v1/jobs/:id

    U->>FE: Click Suspend / Snapshot / Restart
    FE->>API: PATCH/POST action
    alt 409 + jobId (already in flight)
        API-->>FE: "409 { data: { jobId } }"
        FE->>P: poller.track(agentId, jobId)
        FE-->>U: "toast.info("{action} already in progress")"
    else 202 + jobId (newly queued)
        API-->>FE: "202 { data: { jobId } }"
        FE->>P: poller.track(agentId, jobId)
        FE-->>U: "toast.success("{action} queued")"
    else 2xx no jobId (legacy inline)
        API-->>FE: "200 {}"
        FE-->>U: "toast.success("{action} done")"
        FE->>FE: window.location.reload()
    else error
        API-->>FE: 4xx/5xx
        FE-->>U: toast.error("Action failed: ...")
    end

    loop Every 5s while job active
        P->>J: GET /api/v1/jobs/:jobId
        J-->>P: "{ status, error }"
        alt completed
            P->>FE: onComplete() → toast "Agent provisioning completed"
            P->>FE: window.location.reload()
        else failed
            P->>FE: onFailed() → toast.error(job.error)
            P->>FE: window.location.reload()
        else timed out
            P->>FE: onFailed("Timed out waiting...")
        end
    end
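
The polling loop in the diagram can be approximated with the sketch below. It is not the `useJobPoller` implementation: `pollJob`, `PollOptions`, the `maxAttempts` budget, and the injectable `sleep` are hypothetical, and the `"completed"` / `"failed"` status strings are assumed from the diagram labels. Only the 5s interval and the `{ status, error }` payload come from the diagram itself.

```typescript
// Assumed job payload, based on the "{ status, error }" response in the diagram.
interface JobSnapshot {
  status: string;
  error?: string;
}

interface PollOptions {
  fetchJob: (jobId: string) => Promise<JobSnapshot>;
  onComplete: () => void;
  onFailed: (message: string) => void;
  intervalMs?: number; // 5s in the diagram
  maxAttempts?: number; // hypothetical timeout budget
  sleep?: (ms: number) => Promise<void>; // injectable so callers can skip real waits
}

// Poll until the job completes, fails, or the attempt budget runs out.
async function pollJob(jobId: string, opts: PollOptions): Promise<void> {
  const { fetchJob, onComplete, onFailed, intervalMs = 5000, maxAttempts = 60 } = opts;
  const sleep =
    opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));

  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const job = await fetchJob(jobId);
    if (job.status === "completed") {
      onComplete();
      return;
    }
    if (job.status === "failed") {
      onFailed(job.error ?? "Job failed");
      return;
    }
    await sleep(intervalMs); // job still active: wait before the next check
  }
  onFailed("Timed out waiting for job");
}
```

In the real hook, `fetchJob` would correspond to `GET /api/v1/jobs/:jobId`, and the callbacks would fire the toasts and the refresh shown in the diagram.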

Comments Outside Diff (1)

  1. packages/cloud-frontend/src/dashboard/containers/_components/agent-actions.tsx, line 34-48 (link)

    P2 Generic onComplete/onFailed messages are wrong for all newly-tracked actions

    The PR extends job tracking to snapshot, suspend, and shutdown, but the useJobPoller callbacks remain hardcoded to "Agent provisioning completed" / "Provisioning failed". A user who clicks "Save Snapshot" sees "Snapshot queued", waits ~30 s, and then receives "Agent provisioning completed" — which directly contradicts the earlier toast. The same problem applies to suspend and shutdown in both this file and eliza-agents-table.tsx (line 278–286). The follow-up mentioned in the PR description would need to pass an action-to-message map into useJobPoller or use per-action pollers to fix this.

Reviews (1): Last reviewed commit: "fix(cloud-frontend): track async job ids..."

Greptile also left 2 inline comments on this PR.

… toasts

Backend lifecycle ops moved to the job queue in #7810: suspend,
restart, snapshot, logs all return 202 + jobId instead of completing
inline. The frontend was still firing the success toast immediately on
2xx and reloading the page, which lied about the operation being done
when the daemon had only just started it.

Generalize the existing provision/resume 202+jobId path so any action
that returns a jobId attaches to the existing useJobPoller. The toast
now says "Suspend queued" / "Snapshot queued" up front and resolves
through onComplete/onFailed once the daemon actually finishes. The
window.location.reload() is gone for queued ops — the poller's
onComplete callback handles the refresh.

Also covers the standalone handleSuspend on the agents table.

Comment on lines +423 to +438
    if (!res.ok && res.status !== 202) {
      // Revert optimistic update
      void refreshData();
      throw new Error("Suspend failed");
    }

    // 202 + jobId: the daemon executes the suspend asynchronously.
    // Track the job so the table reflects the real completion (and
    // the success toast doesn't lie before the container actually
    // stops).
    const data = await res.json().catch(() => ({}));
    const jobId = (data as { data?: { jobId?: string } }).data?.jobId;
    if (res.status === 202 && jobId) {
      poller.track(id, jobId);
      toast.success("Suspend queued");
      return;

P1 409+jobId case silently errors in the table's suspend handler

agent-actions.tsx (the detail page) now correctly extracts a jobId from a 409 response and attaches to the in-flight job. handleSuspend here does not — a 409 hits the !res.ok && res.status !== 202 guard (409 satisfies both conditions), throws "Suspend failed", and reverts the optimistic update. A user who clicks Suspend while a suspend is already running will see an error toast and a spurious "stopped"→"running" flicker even though the daemon is working correctly. The table should mirror the agent-actions pattern: read the body first, check for 409+jobId, and call poller.track() before falling through to the error throw.
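
The ordering the comment asks for (read the body, check 409+jobId, then fall through to the error) could look roughly like the sketch below. It is an illustration under stated assumptions, not the actual fix: `handleSuspendResponse`, `ResponseLike`, and `SuspendDeps` are hypothetical stand-ins for the component's fetch response, `poller`, `toast`, and `refreshData`, and the "Agent suspended" copy on the legacy inline path is invented because the quoted snippet is cut off before that branch.

```typescript
// Minimal stand-ins for the fetch Response, the poller, and the toast API.
interface ResponseLike {
  status: number;
  ok: boolean;
  json(): Promise<unknown>;
}

interface SuspendDeps {
  track: (agentId: string, jobId: string) => void;
  toast: { success(msg: string): void; info(msg: string): void };
  refreshData: () => Promise<void> | void;
}

async function handleSuspendResponse(
  agentId: string,
  res: ResponseLike,
  deps: SuspendDeps,
): Promise<void> {
  // Read the body before deciding, so a 409 can still yield its jobId.
  const data = await res.json().catch(() => ({}));
  const jobId = (data as { data?: { jobId?: string } } | null)?.data?.jobId;

  // 409 + jobId: a suspend is already running, so attach to it instead of failing.
  if (res.status === 409 && jobId) {
    deps.track(agentId, jobId);
    deps.toast.info("Suspend already in progress");
    return;
  }

  if (!res.ok) {
    void deps.refreshData(); // revert the optimistic update
    throw new Error("Suspend failed");
  }

  // 202 + jobId: newly queued, let the poller report the real completion.
  if (res.status === 202 && jobId) {
    deps.track(agentId, jobId);
    deps.toast.success("Suspend queued");
    return;
  }

  // Legacy inline path (assumed copy): the suspend already finished.
  deps.toast.success("Agent suspended");
  void deps.refreshData();
}
```

Since 202 already satisfies `res.ok`, the sketch drops the `res.status !== 202` clause from the guard; the behaviour for 202 is unchanged.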

Comment on lines +131 to +136
    suspend: t("cloud.containers.agentActions.suspendQueued", {
      defaultValue: "Suspend queued",
    }),
    shutdown: t("cloud.containers.agentActions.suspendQueued", {
      defaultValue: "Suspend queued",
    }),

P2 The shutdown action uses the same i18n key as suspend (suspendQueued). If a "shutdownQueued" translation is ever added to the i18n catalogue, the shutdown action will still resolve to the suspend copy.

Suggested change
      suspend: t("cloud.containers.agentActions.suspendQueued", {
        defaultValue: "Suspend queued",
      }),
    - shutdown: t("cloud.containers.agentActions.suspendQueued", {
    -   defaultValue: "Suspend queued",
    - }),
    + shutdown: t("cloud.containers.agentActions.shutdownQueued", {
    +   defaultValue: "Shutdown queued",
    + }),
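
To make the concern concrete, here is a toy lookup showing why the shared key matters. `makeT` is a hypothetical stand-in for the real `t()` function (assumed here to return the catalogue entry when present and `defaultValue` otherwise); the catalogue contents are invented for illustration.

```typescript
// Hypothetical stand-in for t(): catalogue entry if present, else defaultValue.
type Catalogue = Record<string, string>;

function makeT(catalogue: Catalogue) {
  return (key: string, opts: { defaultValue: string }): string =>
    catalogue[key] ?? opts.defaultValue;
}

// An invented catalogue that has gained a dedicated shutdown entry.
const t = makeT({
  "cloud.containers.agentActions.suspendQueued": "Suspend queued",
  "cloud.containers.agentActions.shutdownQueued": "Shutdown queued",
});

// Shutdown pointing at the suspend key ignores the dedicated entry...
const sharedKeyCopy = t("cloud.containers.agentActions.suspendQueued", {
  defaultValue: "Suspend queued",
});

// ...while the suggested key picks it up.
const dedicatedKeyCopy = t("cloud.containers.agentActions.shutdownQueued", {
  defaultValue: "Shutdown queued",
});
```

With the shared key, `sharedKeyCopy` stays on the suspend copy no matter what is added under `shutdownQueued`.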

@lalalune lalalune merged commit 511d9f2 into develop May 20, 2026
33 of 37 checks passed
@lalalune lalalune deleted the fix/cloud-frontend-job-poller-async-routes branch May 20, 2026 02:09
2-A-M pushed a commit to 2-A-M/eliza that referenced this pull request May 20, 2026
Five small wins surfaced by a second /clean pass after the lifecycle
queue stack merged (elizaOS#7810/elizaOS#7813/elizaOS#7815/elizaOS#7816):

- provisioning-jobs.ts: drop 4 redundant type casts. `status: "error"`,
  `"deletion_failed"`, and the `webhook_status` updates are all literals
  matching the inferred parameter type — the `as Parameters<...>[1]` /
  `as Partial<Job>` casts added nothing.
- provisioning-jobs.ts: executeAgentProvision failure path now uses
  agentProvisionJobResultToRecord({...}) like every other executor,
  preserving the typed-serialization-boundary pattern.
- v1/eliza/agents/[id]/route.ts (DELETE): drop the redundant
  `instanceof Error && error.message === "Agent not found"` branch.
  failureResponse() already maps any "not found" error to 404.
- v1/agents/[id]/logs/route.ts: add `success: false` to the 404 body
  to match the response shape every sibling route uses.
- v1/eliza/agents/[id]/resume/route.ts: trim a forward-looking comment
  about a "future docker start fast path" — belongs in a ticket, not
  the route. Kept the audit-log rationale.
- __tests__/provisioning-job-types.test.ts: lock the registry size
  with `expect(Object.keys(JOB_TYPES)).toHaveLength(7)`. A new entry
  without a matching wire-value assertion now fails CI instead of being
  silently under-covered.

Net diff: -5 LOC, 5 files. No behavior change.