fix(orchestrator): resume interactive workflows on chat platforms #1756

Draft
Wirasm wants to merge 3 commits into dev from archon/task-archon-fix-github-issue-experimental-1779701539931

Wirasm commented May 25, 2026

Summary

  • Problem: Approval-gate and interactive-loop workflows launched from Slack, Telegram, Discord, or GitHub never resumed after a user response — each reply triggered a brand-new run from node 0 in a fresh worktree, discarding all completed work and re-asking the same questions indefinitely.
  • Why it matters: Interactive and approval-gate workflows were completely broken for every chat-platform user; only Web worked correctly.
  • What changed: Lifted the resume-detection block (findResumableRunByParentConversation → hydrateResumableRun → resume path) out of the if (platform === 'web') gate in dispatchOrchestratorWorkflow so it runs for all platforms. Added codebase_id scoping to the resume query to prevent cross-project resume on persistent chat conversation IDs.
  • What did not change: The background-dispatch path (web + non-interactive, no resumable run) is unchanged. hydrateResumableRun is unchanged. getPausedWorkflowRun (natural-language approval interceptor) is unchanged. Issue C from the reporter (codebase name resolution) is out of scope.

UX Journey

Before

User (Slack)            Archon                      Workflow Engine
────────────            ──────                      ───────────────
sends message ────────▶ handleMessage
                        detects workflow name
                        calls dispatchOrchestratorWorkflow
                          platform === 'slack'
                          → ELSE branch (no resume check)
                          → executeWorkflow(fresh cwd) ──────────▶ starts NEW run from node 0
                                                                   creates NEW worktree
                                                                   re-asks approval question ──▶ user sees duplicate question
                        (prior paused run abandoned, loop restarts)

After

User (Slack)            Archon                      Workflow Engine
────────────            ──────                      ───────────────
sends message ────────▶ handleMessage
                        detects workflow name
                        calls dispatchOrchestratorWorkflow
                          [findResumableRunByParentConversation(name, convId, codebaseId)]
                          → resumable run found (status=paused)
                          → hydrateResumableRun → prepared != null
                          → executeWorkflow(resumableRun.working_path) ──▶ RESUMES from paused node
                                                                           continues in original worktree
                                                                           workflow completes ──────────▶ user sees result

Architecture Diagram

Before

dispatchOrchestratorWorkflow
├── if platform === 'web'
│   ├── findResumableRunByParentConversation(name, convId)  ← resume lookup
│   │   ├── found: hydrateResumableRun → executeWorkflow(working_path)
│   │   └── not found + interactive: executeWorkflow(fresh cwd)
│   └── not found + !interactive: dispatchBackgroundWorkflow
└── else  (slack / telegram / discord / github)
    └── executeWorkflow(fresh cwd)  ← ALWAYS fresh, no resume check

After

dispatchOrchestratorWorkflow
├── [~] findResumableRunByParentConversation(name, convId, codebaseId)  ← ALL platforms
│   ├── found: hydrateResumableRun → executeWorkflow(working_path)
│   └── not found:
│       ├── if platform === 'web' && !interactive: dispatchBackgroundWorkflow
│       └── else: executeWorkflow(fresh cwd)
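The "After" tree can be read as a three-way routing decision. The sketch below is a minimal model of that ordering, not the code in orchestrator-agent.ts: `chooseDispatchRoute`, `DispatchInput`, and `DispatchRoute` are hypothetical names, and `hasResumableRun` collapses "lookup found a run and hydration succeeded" into a single flag.

```typescript
// Hypothetical model of the dispatch priority shown in the "After" diagram.
type PlatformType = 'web' | 'slack' | 'telegram' | 'discord' | 'github';
type DispatchRoute = 'resume' | 'background' | 'fresh';

interface DispatchInput {
  platform: PlatformType;
  interactive: boolean;
  // Assumption: true only when a resumable run was found AND hydrated.
  hasResumableRun: boolean;
}

function chooseDispatchRoute(input: DispatchInput): DispatchRoute {
  // 1. The resume check runs first, for every platform.
  if (input.hasResumableRun) return 'resume';
  // 2. Background dispatch stays web-only and non-interactive-only.
  if (input.platform === 'web' && !input.interactive) return 'background';
  // 3. Everything else starts a fresh foreground run.
  return 'fresh';
}
```

Ordering is the point of the model: a web, non-interactive dispatch with a resumable run takes the resume route because that check is evaluated before the background gate.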

Connection inventory:

| From | To | Status | Notes |
|---|---|---|---|
| orchestrator-agent.ts:dispatchOrchestratorWorkflow | workflowDb.findResumableRunByParentConversation | modified | Now called for all platforms; adds codebaseId as 3rd arg |
| workflows.ts:findResumableRunByParentConversation | PostgreSQL/SQLite | modified | SQL gains AND codebase_id = $3 |
| orchestrator-agent.ts:dispatchOrchestratorWorkflow | executeWorkflow | unchanged | Resume path: called with working_path; fresh path: called with cwd |
| orchestrator-agent.ts:dispatchOrchestratorWorkflow | dispatchBackgroundWorkflow | unchanged | Condition unchanged: web + non-interactive + no resumable run |
| orchestrator-agent.ts:dispatchOrchestratorWorkflow | hydrateResumableRun | unchanged | Called only when resume candidate found |
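To make the "SQL gains AND codebase_id = $3" row concrete, here is a hedged sketch of what a three-parameter scoped query could look like. The table and column names are the ones this PR mentions (remote_agent_workflow_runs, workflow_name, parent_conversation_id, codebase_id); `buildResumeQuery` and `ResumeQuery` are hypothetical, and the real query's status filter, ordering, and selected columns are not shown in this conversation, so they are left out rather than guessed.

```typescript
// Hypothetical query builder; the actual SQL in workflows.ts may differ.
interface ResumeQuery {
  text: string;
  params: [string, string, string];
}

function buildResumeQuery(
  workflowName: string,
  conversationId: string,
  codebaseId: string
): ResumeQuery {
  const text = [
    'SELECT * FROM remote_agent_workflow_runs',
    'WHERE workflow_name = $1',
    'AND parent_conversation_id = $2',
    // The new clause: scope the lookup to the current project.
    'AND codebase_id = $3',
  ].join('\n');
  return { text, params: [workflowName, conversationId, codebaseId] };
}

const exampleQuery = buildResumeQuery('test-workflow', 'conv-1', 'codebase-1');
```

Passing the codebase as a bound parameter rather than interpolating it keeps the lookup safe for IDs that originate from chat adapters.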

Label Snapshot

  • Risk: risk: low
  • Size: size: S
  • Scope: core
  • Module: core:orchestrator, core:db

Change Metadata

  • Change type: bug
  • Primary scope: core

Linked Issue

Validation Evidence (required)

bun run validate

All six checks passed:

| Check | Result |
|---|---|
| check:bundled | ✅ Pass: bundled-defaults.generated.ts up to date (36 commands, 20 workflows) |
| check:bundled-skill | ✅ Pass: bundled-skill.ts up to date (21 files) |
| type-check | ✅ Pass: 0 errors across all 10 packages |
| lint | ✅ Pass: 0 errors, 0 warnings (--max-warnings 0) |
| format:check | ✅ Pass: all files formatted |
| test | ✅ Pass: all packages, 0 failures |

New tests added to orchestrator-agent.test.ts:

  • chat resume: resumes a paused run on chat platform when one exists

  • chat resume: scopes resume query to (workflow, conversation, codebase)

  • chat resume: starts fresh run when no resumable run exists on chat platform

  • Evidence provided: All automated checks passed as listed above.

  • Intentionally skipped: None.

Security Impact (required)

  • New permissions/capabilities? No
  • New external network calls? No
  • Secrets/tokens handling changed? No
  • File system access scope changed? No

Compatibility / Migration

  • Backward compatible? Yes — the resume query gains an additional codebase_id filter; all callers already have codebase.id available.
  • Config/env changes? No
  • Database migration needed? No — codebase_id is an existing column on remote_agent_workflow_runs; no schema changes.

Human Verification (required)

Automated CI covers the logic paths via the three new unit tests. Manual end-to-end verification requires a live Slack/Telegram bot with an approval-gate workflow, which was not available in the worktree environment.

  • Verified scenarios: type-check, lint, format, all unit tests (including new chat-resume tests)
  • Edge cases checked (by tests): codebase-scoped query call, fresh-run fallback when no resumable run found, paused-run resume with correct working_path
  • What was not verified: live Slack/Telegram end-to-end round-trip

Side Effects / Blast Radius (required)

  • Affected subsystems: dispatchOrchestratorWorkflow (all dispatch paths now run the resume lookup), findResumableRunByParentConversation (new required codebaseId parameter)
  • Potential unintended effects: A stale paused run pointing to a deleted worktree will be picked up and fail with a clear error; the user can run bun run cli workflow abandon <id> to clear it. This was already the behavior on web.
  • Guardrails: hydrateResumableRun returns null if no completed nodes exist, causing a graceful fall-through to a fresh run on the same worktree.
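The guardrail above can be sketched as follows. This is an illustrative model only: `PausedRun`, `PreparedResume`, `hydrate`, and `planExecution` are hypothetical stand-ins, and the real hydrateResumableRun signature and return shape are not shown in this PR.

```typescript
// Hypothetical shapes; only working_path is a field name taken from the PR.
interface PausedRun {
  id: string;
  working_path: string;
  completedNodes: string[];
}

interface PreparedResume {
  runId: string;
  skipNodes: string[];
}

// Stand-in for hydrateResumableRun: nothing to resume when no node completed.
function hydrate(run: PausedRun): PreparedResume | null {
  if (run.completedNodes.length === 0) return null;
  return { runId: run.id, skipNodes: [...run.completedNodes] };
}

interface ExecutionPlan {
  cwd: string;
  resumeFrom: PreparedResume | null;
}

// A null hydration result still reuses the prior worktree; it just starts over.
function planExecution(run: PausedRun): ExecutionPlan {
  return { cwd: run.working_path, resumeFrom: hydrate(run) };
}

const emptyRunPlan = planExecution({ id: 'r1', working_path: '/wt/a', completedNodes: [] });
const partialRunPlan = planExecution({ id: 'r2', working_path: '/wt/b', completedNodes: ['plan'] });
```

In both cases the working directory is the prior run's worktree; only the presence of `resumeFrom` distinguishes a true resume from a fresh start in place.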

Rollback Plan (required)

Risks and Mitigations

  • Risk: All existing dispatches now call findResumableRunByParentConversation; if the DB query is slow, chat dispatch latency increases slightly.
    • Mitigation: The query is indexed on (workflow_name, parent_conversation_id, codebase_id, status) via existing indexes; expected sub-millisecond latency. The query was already executed on every web dispatch.

Interactive approval-gate and interactive-loop workflows started from
Slack, Telegram, Discord, or GitHub never resumed after the user
provided their answer — each approval response triggered a brand-new
workflow run from node 0 in a fresh worktree, re-asking the same
questions indefinitely. The cause was a `platform.getPlatformType() ===
'web'` gate that wrapped the entire resume-detection block in
`dispatchOrchestratorWorkflow`, leaving all chat platforms to
unconditionally fall through to a fresh `executeWorkflow`. The chat-side
`resumeRun` mechanism that previously handled this was removed in
#915 (natural-language approval routing) without lifting the resume
lookup out of the web branch.

Changes:
- Restructure dispatchOrchestratorWorkflow so resume detection
  (findResumableRunByParentConversation + hydrateResumableRun) runs for
  every platform; only the background-dispatch branch remains web-only
- Add codebaseId parameter to findResumableRunByParentConversation so
  persistent chat conversation IDs (Telegram chat_id, Slack thread)
  cannot resume a stale run from a different project
- Add tests for chat resume, codebase scoping, and fresh-run fallback

Fixes #1741

coderabbitai Bot commented May 25, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.



Wirasm commented May 25, 2026

Comprehensive PR Review

PR: #1756 — fix(orchestrator): resume interactive workflows on chat platforms
Reviewed by: 3 specialized agents (code-review, error-handling, test-coverage)
Date: 2026-05-25


Summary

The PR cleanly lifts the interactive-workflow resume block out of the web-only gate and applies it to all platforms, with correct codebase_id scoping to prevent cross-project resumes on persistent chat IDs. All three new code paths are covered by targeted tests. No silent error swallowing introduced.

Verdict: APPROVE

| Severity | Count |
|---|---|
| 🔴 CRITICAL | 0 |
| 🟠 HIGH | 0 |
| 🟡 MEDIUM | 1 |
| 🟢 LOW | 4 |

🟡 Medium Issues (Needs Decision)

Missing test: web non-interactive + resumable run dispatch priority

📍 packages/core/src/orchestrator/orchestrator-agent.ts / orchestrator-agent.test.ts

The refactor moved resume detection before the else if (web && !interactive) background-dispatch gate. The only "non-interactive web" test uses a null resumable run. A future refactor could accidentally reintroduce the old guard without test failure — a web user's paused run would silently get a fresh background dispatch instead of resuming.

Recommended test (low effort; copy-paste of an existing pattern):

```typescript
test('web non-interactive workflow with resumable run resumes foreground (not background)', async () => {
  mockGetOrCreateConversation.mockReturnValueOnce(Promise.resolve(makeDispatchConversation()));
  mockGetCodebase.mockReturnValueOnce(Promise.resolve(makeDispatchCodebase()));
  mockHandleCommand.mockReturnValueOnce(Promise.resolve(makeWorkflowResult(undefined))); // non-interactive
  mockFindResumableRunByParentConversation.mockReturnValueOnce(
    Promise.resolve({
      id: 'web-noninteractive-resume-1',
      workflow_name: 'test-workflow',
      working_path: '/repos/test-repo/worktrees/web-feature',
      parent_conversation_id: 'conv-1',
      status: 'paused',
    })
  );

  const platform = makePlatform(); // getPlatformType returns 'web'
  await handleMessage(platform, 'conv-1', '/workflow run test-workflow');

  expect(mockHydrateResumableRun).toHaveBeenCalled();
  expect(mockExecuteWorkflow).toHaveBeenCalled();
  expect(mockDispatchBackgroundWorkflow).not.toHaveBeenCalled();
  const callArgs = mockExecuteWorkflow.mock.calls[0] as unknown[];
  expect(callArgs[3]).toBe('/repos/test-repo/worktrees/web-feature');
});
```

🟢 Low Issues

L1 — orchestrator.test.ts executor mock missing hydrateResumableRun
📍 orchestrator.test.ts:166-168

orchestrator-agent.ts imports both executeWorkflow and hydrateResumableRun from @archon/workflows/executor, but orchestrator.test.ts only mocks executeWorkflow. Safe today (all tests use null resumable run, so hydrateResumableRun is never called), but a future test exercising the resume path would get an opaque TypeError: hydrateResumableRun is not a function.

Fix: Add hydrateResumableRun: mock(() => Promise.resolve(null)) to the executor mock block.


L2 — DB resume lookup failure now blocks all platforms (behavioral scope expansion, not a bug)
📍 orchestrator-agent.ts:369-373

Previously a transient DB error only affected web dispatches. After the fix it blocks all platforms. This is correct per the fail-fast principle — launching fresh when a resumable run might exist risks duplicate worktrees. Flagged for awareness only; leave as-is.


L3 — "…starting fresh in the same worktree" message now shown on chat platforms (cosmetic)
📍 orchestrator-agent.ts:406-409

Pre-existing message, technically accurate. "Worktree" is opaque to chat users but not misleading. Out of scope for this PR.


L4 — GitHub platform not explicitly tested for chat resume path

Telegram/Slack/Discord are exercised by the 3 new tests. GitHub shares the same else branch so existing tests provide indirect coverage. Optional completeness addition; risk is low.


What's Good

  • Scoping is airtight: codebase_id added to both the SQL query and log context — a persistent Telegram chat_id spanning two projects cannot accidentally resume the wrong project's run.
  • Log improved: platformType field added to orchestrator.foreground_resume_detected — Slack/Telegram/Discord/GitHub resume events are now distinguishable from web in production logs.
  • Tests check the right things: The resume test verifies callArgs[3] is the prior working_path and opts.preCreatedRun.id comes from the hydrated run — not just toHaveBeenCalled().
  • No silent swallows: Every error handler re-throws, logs + re-throws, or explicitly notifies the user.
  • CLAUDE.md compliance: Type safety, fail-fast, YAGNI, no autonomous lifecycle mutation, logging format, DB error pattern, test isolation — all pass.

Reviewed by Archon prp-review-agents workflow

…ive resume test

- Add hydrateResumableRun to executor mock in orchestrator.test.ts to
  mirror the real module exports and prevent opaque TypeErrors for future
  test contributors
- Add test asserting that a web non-interactive workflow with a resumable
  run resumes foreground rather than dispatching a fresh background run,
  pinning the priority order of the if/else if dispatch block

Wirasm commented May 25, 2026

⚡ Self-Fix Report (Aggressive)

Status: COMPLETE
Pushed: ✅ Changes pushed to archon/task-archon-fix-github-issue-experimental-1779701539931
Commit: c0c9565c
Philosophy: Fix everything unless clearly a new concern


Fixes Applied (2 total)

| Severity | Count |
|---|---|
| 🔴 CRITICAL | 0 |
| 🟠 HIGH | 0 |
| 🟡 MEDIUM | 1 |
| 🟢 LOW | 1 |

All fixes:
  • Web non-interactive + resumable run has no test (packages/core/src/orchestrator/orchestrator-agent.test.ts) — Added test 'web non-interactive workflow with resumable run resumes foreground (not background)' pinning the dispatch priority order: resume check beats the background-dispatch gate. Asserts executeWorkflow is called (not dispatchBackgroundWorkflow) with the prior worktree path when a resumable run exists.
  • Executor mock missing hydrateResumableRun (packages/core/src/orchestrator/orchestrator.test.ts:166-168) — Added hydrateResumableRun: mock(() => Promise.resolve(null)) to mirror real module exports and prevent opaque TypeError: hydrateResumableRun is not a function for future test contributors.

Tests Added

  • packages/core/src/orchestrator/orchestrator-agent.test.ts: web non-interactive workflow with resumable run resumes foreground (not background)

Skipped (3)

| Severity | Finding | Reason |
|---|---|---|
| 🟢 LOW | DB resume lookup failure now blocks all platforms | Intentional fail-fast, correct per CLAUDE.md; launching fresh when lookup fails risks duplicate worktrees |
| 🟢 LOW | "starting fresh in the same worktree" message shown on chat platforms | Pre-existing message, out of scope, cosmetic only |
| 🟢 LOW | GitHub platform not explicitly tested | Shares same else branch as Telegram/Slack/Discord; indirect coverage sufficient |

Suggested Follow-up Issues

(none)


Validation

✅ Type check | ✅ Lint | ✅ Tests (all packages, 0 failures)


Self-fix by Archon · aggressive mode · fixes pushed to archon/task-archon-fix-github-issue-experimental-1779701539931


Wirasm commented May 25, 2026

Review Summary

Verdict: minor-fixes-needed

This PR fixes a long-standing bug where chat platforms (Slack, Telegram, Discord, GitHub) always started a fresh workflow run instead of resuming a paused one after an approval gate. The implementation is clean, the codebaseId scoping prevents cross-project resume on shared chat IDs, and the new tests cover the key permutations. One test mock needs updating before merge, and one docs line needs a quick update.

Blocking issues

  • packages/core/src/orchestrator/orchestrator.test.ts:167: mock.module('@archon/workflows/executor') stubs executeWorkflow but is missing hydrateResumableRun, which dispatchOrchestratorWorkflow also imports (orchestrator-agent.ts:34). Any test that accidentally triggers the resume path will throw TypeError: hydrateResumableRun is not a function.
    • Fix: Add hydrateResumableRun: mock(() => Promise.resolve(null)) to the mock block.

Suggested fixes

  • packages/docs-web/src/content/docs/guides/authoring-workflows.md:~531: The "DAG Resume on Failure" section says "Chat (web): Approving or rejecting a paused workflow auto-resumes..." — this excludes Slack, Telegram, Discord, and GitHub, which now also resume correctly after this PR.
    • Fix: Update to "Chat platforms (web, Slack, Telegram, Discord, GitHub)" or simply "Chat platforms". Optionally add a note that resume is scoped to the current codebase.

Minor / nice-to-have

  • packages/core/src/orchestrator/orchestrator-agent.test.ts:1318: A comment (// cwd comes from validateAndResolveIsolation (default '/test/cwd'), not a prior worktree) visually belongs to the wrong test block — the assertions themselves are correct.
  • packages/core/src/orchestrator/orchestrator-agent.test.ts:1378: Test name says "scopes resume query" but only verifies mock call args, not the null-return behavior for mismatched codebases. Not blocking — behavior is covered elsewhere.
  • packages/core/src/db/workflows.ts:342–344: Function JSDoc says "the web orchestrator" — should be "the orchestrator (all platforms)".
  • packages/docs-web/src/content/docs/guides/authoring-workflows.md:~531 (low priority): Consider adding a note that chat-platform resume is scoped to codebaseId for multi-project chat adapter safety.

Compliments

  • Excellent comments throughout: the block comment explaining why resume detection now runs for ALL platforms (orchestrator-agent.ts:364-369) and the test comment describing the ordering constraint for web non-interactive resume (orchestrator-agent.test.ts) are exactly the kind of non-obvious WHY documentation that prevents future regressions.
  • The #1741 reference in the test is appropriate — it's a permanent issue number that gives future engineers a trail to follow.
  • The codebaseId addition is a thoughtful safety measure that prevents cross-project resume on shared Telegram chat IDs without requiring users to change anything.

Reviewed via maintainer-review-pr workflow (Pi/Minimax). Aspects run: code-review, error-handling, test-coverage, comment-quality, docs-impact.


Wirasm commented May 26, 2026

Review Summary

Verdict: ready-to-merge

This PR lifts the findResumableRunByParentConversation resume-detection block out of the web-only guard and adds a codebaseId scope to the DB query, so chat platforms (Slack, Telegram, Discord, GitHub) now resume prior runs correctly — and won't accidentally resume a run from a different project on a shared conversation ID. Code quality is high and no error-handling issues were found.

Blocking issues

None.

Suggested fixes

  • packages/core/src/db/workflows.ts:341: findResumableRunByParentConversation gained a codebaseId parameter and an AND codebase_id = $3 SQL clause but lacks a direct unit test. The SQL change is exercised only transitively through mocked orchestrator tests. Add a test in workflows.test.ts validating that: (1) a matching run is found when codebase_id matches, (2) no run is found when codebase_id differs even if workflow+conversation match, (3) null is returned when no run exists. This is an explicit regression guard for the cross-project-resume fix.
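The three cases requested above can be illustrated with an in-memory model of the scoped lookup. This is a sketch under stated assumptions, not the workflows.test.ts test itself: `RunRow` and `findResumableRun` are hypothetical, the rows stand in for the database, and treating only status 'paused' as resumable is an assumption made for the example.

```typescript
// Hypothetical in-memory stand-in for the scoped DB lookup.
interface RunRow {
  id: string;
  workflow_name: string;
  parent_conversation_id: string;
  codebase_id: string;
  status: 'paused' | 'completed';
}

function findResumableRun(
  rows: readonly RunRow[],
  workflowName: string,
  conversationId: string,
  codebaseId: string
): RunRow | null {
  const match = rows.find(
    (r) =>
      r.status === 'paused' && // assumption: only paused runs are resumable here
      r.workflow_name === workflowName &&
      r.parent_conversation_id === conversationId &&
      r.codebase_id === codebaseId
  );
  return match ?? null;
}

const rows: RunRow[] = [
  {
    id: 'run-1',
    workflow_name: 'deploy',
    parent_conversation_id: 'chat-42',
    codebase_id: 'project-a',
    status: 'paused',
  },
];

// (1) same workflow, conversation, and codebase: the run is found
const sameProject = findResumableRun(rows, 'deploy', 'chat-42', 'project-a');
// (2) same workflow and conversation, different codebase: no match
const otherProject = findResumableRun(rows, 'deploy', 'chat-42', 'project-b');
// (3) no runs at all: null
const noRuns = findResumableRun([], 'deploy', 'chat-42', 'project-a');
```

Case (2) is the regression guard: a persistent chat ID shared by two projects must not surface the other project's paused run.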

Minor / nice-to-have

  • packages/core/src/orchestrator/orchestrator-agent.test.ts:1321 — The web background-dispatch test doesn't explicitly stub a codebase return, so it passes undefined as codebaseId to the now-3-arg findResumableRunByParentConversation. Works via default mock behavior, but making the null expectation explicit is cleaner.
  • packages/core/src/orchestrator/orchestrator-agent.ts:363: The 5-line block comment can be trimmed to a one-liner: "Check for a resumable run on this workflow before dispatching fresh." The platform-rationale detail belongs in the test comment (where it already exists with the #1741 reference).

Compliments


Reviewed via maintainer-review-pr workflow (Pi/Minimax). Aspects run: code-review, error-handling, test-coverage, comment-quality.


Development

Successfully merging this pull request may close these issues.

Chat (Slack/Telegram) approval & interactive-loop workflows never resume — re-ask the same questions forever