fix(cloud-frontend): track async job ids for suspend/snapshot/restart toasts #7813
… toasts Backend lifecycle ops moved to the job queue in #7810: suspend, restart, snapshot, logs all return 202 + jobId instead of completing inline. The frontend was still firing the success toast immediately on 2xx and reloading the page, which lied about the operation being done when the daemon had only just started it. Generalize the existing provision/resume 202+jobId path so any action that returns a jobId attaches to the existing useJobPoller. The toast now says "Suspend queued" / "Snapshot queued" up front and resolves through onComplete/onFailed once the daemon actually finishes. The window.location.reload() is gone for queued ops — the poller's onComplete callback handles the refresh. Also covers the standalone handleSuspend on the agents table.
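The "resolves through onComplete/onFailed" step can be illustrated with a small decision function. This is only a sketch under stated assumptions: the `JobSnapshot` shape, the status names, the fallback messages, and `decidePollOutcome` itself are hypothetical and are not taken from `useJobPoller`.

```typescript
// Hypothetical job payload; the real jobs endpoint may use other fields or status names.
type JobStatus = "pending" | "in_progress" | "completed" | "failed";

interface JobSnapshot {
  status: JobStatus;
  error?: string;
}

type PollOutcome =
  | { kind: "continue" } // keep polling
  | { kind: "complete" } // fire onComplete and refresh
  | { kind: "failed"; message: string }; // fire onFailed with a reason

// Decide what a poller should do after one status fetch.
// A terminal job status wins over the timeout check.
function decidePollOutcome(
  job: JobSnapshot,
  elapsedMs: number,
  timeoutMs: number,
): PollOutcome {
  if (job.status === "completed") {
    return { kind: "complete" };
  }
  if (job.status === "failed") {
    // Placeholder fallback text, not the real toast copy.
    return { kind: "failed", message: job.error ?? "Job failed" };
  }
  if (elapsedMs >= timeoutMs) {
    // Placeholder text, not the real timeout message.
    return { kind: "failed", message: "Job polling timed out" };
  }
  return { kind: "continue" };
}
```

Keeping the decision separate from the timer and the fetch makes the "queued" toast honest: nothing reports success until a fetched status is terminal.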
if (!res.ok && res.status !== 202) {
  // Revert optimistic update
  void refreshData();
  throw new Error("Suspend failed");
}

// 202 + jobId: the daemon executes the suspend asynchronously.
// Track the job so the table reflects the real completion (and
// the success toast doesn't lie before the container actually
// stops).
const data = await res.json().catch(() => ({}));
const jobId = (data as { data?: { jobId?: string } }).data?.jobId;
if (res.status === 202 && jobId) {
  poller.track(id, jobId);
  toast.success("Suspend queued");
  return;
409+jobId case silently errors in the table's suspend handler
agent-actions.tsx (the detail page) now correctly extracts a jobId from a 409 response and attaches to the in-flight job. handleSuspend here does not — a 409 hits the !res.ok && res.status !== 202 guard (409 satisfies both conditions), throws "Suspend failed", and reverts the optimistic update. A user who clicks Suspend while a suspend is already running will see an error toast and a spurious "stopped"→"running" flicker even though the daemon is working correctly. The table should mirror the agent-actions pattern: read the body first, check for 409+jobId, and call poller.track() before falling through to the error throw.
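The ordering suggested above (read the body first, check for a jobId, only then fall through to the error) could be sketched as a pure classifier. The names `classifyLifecycleResponse`, `extractJobId`, and `LifecycleResult` are hypothetical; only the `{ data: { jobId } }` body shape and the 202/409 semantics come from this PR.

```typescript
type LifecycleResult =
  | { kind: "queued"; jobId: string } // 202 + jobId: newly queued
  | { kind: "already-running"; jobId: string } // 409 + jobId: attach to in-flight job
  | { kind: "done" } // 2xx without jobId: legacy inline completion
  | { kind: "error" }; // anything else: revert and report

// Safely pull data.jobId out of an unknown JSON body.
function extractJobId(body: unknown): string | undefined {
  if (typeof body !== "object" || body === null) return undefined;
  const data = (body as { data?: unknown }).data;
  if (typeof data !== "object" || data === null) return undefined;
  const jobId = (data as { jobId?: unknown }).jobId;
  return typeof jobId === "string" && jobId.length > 0 ? jobId : undefined;
}

// Classify a lifecycle response. The 409 + jobId check runs before the
// generic error branch, so a concurrent suspend attaches to the existing
// job instead of throwing.
function classifyLifecycleResponse(status: number, body: unknown): LifecycleResult {
  const jobId = extractJobId(body);
  if (status === 409 && jobId) return { kind: "already-running", jobId };
  if (status === 202 && jobId) return { kind: "queued", jobId };
  if (status >= 200 && status < 300) return { kind: "done" };
  return { kind: "error" };
}
```

A handler would call `poller.track(id, result.jobId)` for both `queued` and `already-running`, and revert the optimistic update only for `error`.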
suspend: t("cloud.containers.agentActions.suspendQueued", {
  defaultValue: "Suspend queued",
}),
shutdown: t("cloud.containers.agentActions.suspendQueued", {
  defaultValue: "Suspend queued",
}),
The `shutdown` action uses the same i18n key as `suspend` (`suspendQueued`). If a "shutdownQueued" translation is ever added to the i18n catalogue, the shutdown action will still resolve to the suspend copy.
Suggested change:
suspend: t("cloud.containers.agentActions.suspendQueued", {
  defaultValue: "Suspend queued",
}),
shutdown: t("cloud.containers.agentActions.shutdownQueued", {
  defaultValue: "Shutdown queued",
}),
Five small wins surfaced by a second /clean pass after the lifecycle queue stack merged (elizaOS#7810/elizaOS#7813/elizaOS#7815/elizaOS#7816):
- provisioning-jobs.ts: drop 4 redundant type casts. `status: "error"`, `"deletion_failed"`, and the `webhook_status` updates are all literals matching the inferred parameter type; the `as Parameters<...>[1]` / `as Partial<Job>` casts added nothing.
- provisioning-jobs.ts: executeAgentProvision failure path now uses agentProvisionJobResultToRecord({...}) like every other executor, preserving the typed-serialization-boundary pattern.
- v1/eliza/agents/[id]/route.ts (DELETE): drop the redundant `instanceof Error && error.message === "Agent not found"` branch. failureResponse() already maps any "not found" error to 404.
- v1/agents/[id]/logs/route.ts: add `success: false` to the 404 body to match the response shape every sibling route uses.
- v1/eliza/agents/[id]/resume/route.ts: trim a forward-looking comment about a "future docker start fast path"; that belongs in a ticket, not the route. Kept the audit-log rationale.
- __tests__/provisioning-job-types.test.ts: lock the registry size with `expect(Object.keys(JOB_TYPES)).toHaveLength(7)`. A new entry without a matching wire-value assertion now fails CI instead of being silently under-covered.
Net diff: -5 LOC, 5 files. No behavior change.
Why
Backend lifecycle ops moved to the job queue in #7810: suspend, restart, snapshot, logs all return 202 + jobId instead of completing inline.
The frontend was still firing the success toast immediately on 2xx and reloading the page, which lied about the operation being done. "Snapshot saved" appeared in < 1s while the daemon had just started a 30s job.
What changed
`packages/cloud-frontend/src/dashboard/containers/_components/agent-actions.tsx`:
`packages/cloud-frontend/src/dashboard/containers/_components/eliza-agents-table.tsx`:
Test plan
Depends on
#7810 (backend routes returning 202 + jobId). Mergeable in any order — without #7810 the frontend changes are dead code paths (no route returns 202 + jobId for those actions yet), but they don't break anything.
Follow-up
Greptile Summary
This PR fixes premature "success" toasts for async lifecycle operations (suspend, snapshot, restart) by detecting the 202+jobId response pattern, previously used only for provision/resume, and routing those actions through `useJobPoller` instead of resolving immediately.
- `agent-actions.tsx`: 202+jobId detection is generalised to any action, with per-action queued toast messages; the premature `window.location.reload()` on the 202 path is removed (reload still happens after the job completes via `autoRefresh`). The 409+jobId path is also generalised.
- `eliza-agents-table.tsx`: `handleSuspend` now reads the response body for 202+jobId and calls `poller.track()`, matching the pattern `handleProvision` already used. However, unlike `agent-actions.tsx`, the 409+jobId case is not handled: a concurrent suspend attempt throws a false error instead of attaching to the existing job.
Confidence Score: 3/5
Safe for most flows, but the table's suspend handler has an incomplete edge-case that produces a visible false error toast for concurrent suspend attempts.
The core fix (routing 202+jobId through the job poller) is sound and the happy path works correctly. The table's handleSuspend does not handle the 409+jobId case that agent-actions.tsx now covers — if a suspend is already running and the user clicks suspend again from the table, they see "Suspend failed" and the optimistic status reverts, even though the daemon is healthy. The completion toasts ("Agent provisioning completed") are also incorrect for suspend and snapshot operations, which is a user-visible lie each time those jobs finish.
eliza-agents-table.tsx needs the 409+jobId guard added to handleSuspend to match the pattern in agent-actions.tsx.
Important Files Changed
Sequence Diagram
sequenceDiagram
    participant U as User
    participant FE as Frontend (agent-actions / table)
    participant API as Backend API
    participant P as useJobPoller
    participant J as /api/v1/jobs/:id
    U->>FE: Click Suspend / Snapshot / Restart
    FE->>API: PATCH/POST action
    alt 409 + jobId (already in flight)
        API-->>FE: "409 { data: { jobId } }"
        FE->>P: poller.track(agentId, jobId)
        FE-->>U: "toast.info("{action} already in progress")"
    else 202 + jobId (newly queued)
        API-->>FE: "202 { data: { jobId } }"
        FE->>P: poller.track(agentId, jobId)
        FE-->>U: "toast.success("{action} queued")"
    else 2xx no jobId (legacy inline)
        API-->>FE: "200 {}"
        FE-->>U: "toast.success("{action} done")"
        FE->>FE: window.location.reload()
    else error
        API-->>FE: 4xx/5xx
        FE-->>U: toast.error("Action failed: ...")
    end
    loop Every 5s while job active
        P->>J: GET /api/v1/jobs/:jobId
        J-->>P: "{ status, error }"
        alt completed
            P->>FE: onComplete() → toast "Agent provisioning completed"
            P->>FE: window.location.reload()
        else failed
            P->>FE: onFailed() → toast.error(job.error)
            P->>FE: window.location.reload()
        else timed out
            P->>FE: onFailed("Timed out waiting...")
        end
    end
Comments Outside Diff (1)
`packages/cloud-frontend/src/dashboard/containers/_components/agent-actions.tsx`, line 34-48: `onComplete`/`onFailed` messages are wrong for all newly-tracked actions
The PR extends job tracking to `snapshot`, `suspend`, and `shutdown`, but the `useJobPoller` callbacks remain hardcoded to "Agent provisioning completed" / "Provisioning failed". A user who clicks "Save Snapshot" sees "Snapshot queued", waits ~30 s, and then receives "Agent provisioning completed", which directly contradicts the earlier toast. The same problem applies to suspend and shutdown in both this file and `eliza-agents-table.tsx` (line 278–286). The follow-up mentioned in the PR description would need to pass an action-to-message map into `useJobPoller` or use per-action pollers to fix this.
Reviews (1): Last reviewed commit: "fix(cloud-frontend): track async job ids..."
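One possible shape for the action-to-message map mentioned in the comment above is sketched here. Everything in it is an assumption for illustration: the `TrackedAction` union, the `COMPLETION_MESSAGES` table, the `completionMessage` helper, and all toast strings other than "Agent provisioning completed" / "Provisioning failed" are hypothetical, and the real copy would presumably go through `t(...)` rather than plain literals.

```typescript
// Hypothetical set of actions that a poller might track.
type TrackedAction =
  | "provision"
  | "resume"
  | "suspend"
  | "shutdown"
  | "snapshot"
  | "restart";

interface CompletionCopy {
  done: string;
  failed: string;
}

// Placeholder copy; only the provision strings appear in the review above.
const COMPLETION_MESSAGES: Record<TrackedAction, CompletionCopy> = {
  provision: { done: "Agent provisioning completed", failed: "Provisioning failed" },
  resume: { done: "Agent resumed", failed: "Resume failed" },
  suspend: { done: "Agent suspended", failed: "Suspend failed" },
  shutdown: { done: "Agent shut down", failed: "Shutdown failed" },
  snapshot: { done: "Snapshot saved", failed: "Snapshot failed" },
  restart: { done: "Agent restarted", failed: "Restart failed" },
};

// Resolve the toast text for a finished job. When the action is unknown
// (for example, a job tracked before the map existed), fall back to the
// provisioning copy that the callbacks currently hardcode.
function completionMessage(
  action: TrackedAction | undefined,
  outcome: keyof CompletionCopy,
): string {
  const copy = COMPLETION_MESSAGES[action ?? "provision"];
  return copy[outcome];
}
```

With a map like this, the poller would need to remember which action each tracked jobId belongs to, for example by accepting the action alongside the jobId when tracking starts.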