fix(cli/workflow): add SIGTERM grace period to avoid losing completion to a race by jasonjack-jay · Pull Request #1707 · coleam00/Archon

jasonjack-jay · 2026-05-17T06:37:07Z

When an external orchestrator watches the same state that a workflow's
final node modifies, it can race the DAG executor: the last node has
already done its side effect (e.g. moved an issue label out of an
'in-progress' set), the orchestrator sees the state change and SIGTERMs
the runner, and the SIGTERM lands in the small window between the bash
node returning and completeWorkflowRun being called.

Before this change, cleanup unconditionally marked the active run as
failed and exited 1. Downstream consumers saw exit-1 and retried
idempotent work that was already done, producing log noise and (in some
cases) duplicate side effects.

Cleanup now polls getActiveWorkflowRun every 100ms for up to 5s
before deciding what to do:

If the run finishes naturally during grace, look up its final status
and exit 0 only if the DB says completed — otherwise exit 1.
If the run is still active after 5s, force-fail it as before
(preserves the user-Ctrl+C semantics).
If we never see an active run during grace, preserve the previous
exit-1 behaviour — we can't tell what happened.

5s is generous: completeWorkflowRun is typically <100ms. A genuine
user-driven cancel is delayed by at most 5s, which is the right trade
for not corrupting the success/failure signal.

Test mock for the workflows DB module is extended with
getWorkflowRunStatus so the new call site has a stub. All 94 existing
workflow.test.ts tests still pass.

This was observed in production with a harness that watches GitHub state
labels: every spike workflow exited 1 with the issue label correctly
moved and the PR correctly opened. The harness's defensive guard
prevented duplicate dispatches but the exit-1 noise made it harder to
distinguish real failures from this cosmetic race.

Summary by CodeRabbit

Bug Fixes
- Enhanced workflow termination handling with a grace period for natural completion before forced shutdown. Exit codes now accurately reflect workflow completion status rather than always failing.
Tests
- Extended test mocks to support workflow status verification.

…n to a race When an external orchestrator watches the same state that a workflow's final node modifies, it can race the DAG executor: the last node has already done its side effect (e.g. moved an issue label out of an 'in-progress' set), the orchestrator sees the state change and SIGTERMs the runner, and the SIGTERM lands in the small window between the bash node returning and `completeWorkflowRun` being called. Before this change, cleanup unconditionally marked the active run as failed and exited 1. Downstream consumers saw exit-1 and retried idempotent work that was already done, producing log noise and (in some cases) duplicate side effects. Cleanup now polls `getActiveWorkflowRun` every 100ms for up to 5s before deciding what to do: - If the run finishes naturally during grace, look up its final status and exit 0 only if the DB says `completed` — otherwise exit 1. - If the run is still active after 5s, force-fail it as before (preserves the user-Ctrl+C semantics). - If we never see an active run during grace, preserve the previous exit-1 behaviour — we can't tell what happened. 5s is generous: `completeWorkflowRun` is typically <100ms. A genuine user-driven cancel is delayed by at most 5s, which is the right trade for not corrupting the success/failure signal. Test mock for the workflows DB module is extended with `getWorkflowRunStatus` so the new call site has a stub. All 94 existing workflow.test.ts tests still pass. This was observed in production with a harness that watches GitHub state labels: every spike workflow exited 1 with the issue label correctly moved and the PR correctly opened. The harness's defensive guard prevented duplicate dispatches but the exit-1 noise made it harder to distinguish real failures from this cosmetic race.

coderabbitai · 2026-05-17T06:37:22Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 957b9430-ce4e-4376-820e-cfebd34302ae

📥 Commits

Reviewing files that changed from the base of the PR and between 7bdf931 and 67766ed.

📒 Files selected for processing (2)

packages/cli/src/commands/workflow.test.ts
packages/cli/src/commands/workflow.ts

📝 Walkthrough

Walkthrough

The CLI workflow runner's termination handling is replaced with a graceful shutdown strategy. On SIGTERM/SIGINT signals, instead of immediately failing the workflow, the process now waits up to 5 seconds (polling every 100ms) to detect natural completion, only force-failing if the workflow remains active after the grace window. Exit code is 0 only when the database reports the workflow as completed.

Changes

Graceful Workflow Termination

Layer / File(s)	Summary
Graceful termination strategy with polling and signal integration `packages/cli/src/commands/workflow.ts`, `packages/cli/src/commands/workflow.test.ts`	Termination grace constants (5s, 100ms poll), re-entrancy guard, and async cleanup function poll `getActiveWorkflowRun`; if the run finishes naturally and is reported as `completed`, process exits code 0; if still active after grace or on error, `failWorkflowRun` is called and code 1 is used. Signal handlers invoke cleanup asynchronously. Test mock adds `getWorkflowRunStatus` support.

Sequence Diagram(s)

sequenceDiagram
  participant Process
  participant SignalHandler
  participant Cleanup as Cleanup Function
  participant DB as getActiveWorkflowRun
  participant FailFn as failWorkflowRun
  participant StatusDB as getWorkflowRunStatus
  
  Process->>SignalHandler: SIGTERM/SIGINT received
  SignalHandler->>Cleanup: invoke async cleanup
  Cleanup->>DB: poll for active run
  alt Run completes during grace
    DB-->>Cleanup: no active run found
    Cleanup->>StatusDB: query final status
    alt Status is completed
      StatusDB-->>Cleanup: completed
      Cleanup->>Process: exit code 0
    else Status not completed
      StatusDB-->>Cleanup: other status
      Cleanup->>Process: exit code 1
    end
  else Grace period expires
    Cleanup->>FailFn: force-fail active run
    Cleanup->>Process: exit code 1
  else Cleanup error
    Cleanup->>Process: log failure, exit code 1
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

coleam00/Archon#1563: Both PRs adjust workflow-run "zombie" handling by branching on getWorkflowRunStatus (e.g., treating running differently from completed and triggering failWorkflowRun/cleanup backstops), including adding getWorkflowRunStatus to related mocks.

Poem

🐰 A graceful bow when signals say goodbye,
Five seconds kind—no rush to fail or die,
The workflow spins with hope, then finds its peace,
While polls and patience grant the sweet release! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The PR description comprehensively explains the problem, solution, and rationale, but does not follow the required template structure with sections like Summary, UX Journey, Architecture Diagram, and other required metadata.	Rewrite the description following the provided template structure, including Summary bullets, UX Journey, Architecture Diagram, Label Snapshot, Change Metadata, Linked Issue, Validation Evidence, Security Impact, Compatibility, Human Verification, Side Effects, Rollback Plan, and Risks sections.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically describes the main change: adding a SIGTERM grace period to prevent losing workflow completion status to a race condition.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Wirasm · 2026-05-19T08:22:17Z

Review Summary

Verdict: blocking-issues

Your grace-period polling strategy for the SIGTERM race condition is solid — the production backstory and 5-second calibration in the comments are genuinely valuable. However, there's a logic bug that makes the entire "completed during grace → exit 0" path unreachable, plus a missing test suite for the new cleanup handler.

Blocking issues

packages/cli/src/commands/workflow.ts:740: The condition else if (lastSeenRunId && Date.now() - startedAt >= TERMINATION_GRACE_MS) is always true when lastSeenRunId is set, because the while loop only breaks when the time condition becomes false. Completed runs that finish during the grace window will be force-failed instead of exiting 0. Fix: change to else if (lastSeenRunId) — the if branch above already handles the timeout case. See the code snippet in the full review.
packages/cli/src/commands/workflow.ts:711-766: The cleanup() function is entirely untested. No test covers the signal registration, polling loop, or any of the three exit-code paths. Add tests for: (1) workflow finishes during grace → exit 0, (2) workflow still active after grace → failWorkflowRun called → exit 1, (3) no active run → exit 1, (4) getWorkflowRunStatus throws → exit 1 without crash. Consider exporting TERMINATION_GRACE_MS or injecting graceMs as a parameter so tests don't wait 5 seconds.

Suggested fixes

workflow.ts:727 and workflow.ts:737: getActiveWorkflowRun and getWorkflowRunStatus silently swallow DB errors. Log the error before swallowing so operators can see DB issues during termination:

workflowDb.getActiveWorkflowRun(conversation.id).catch(err => {
  getLog().error({ err, conversationId: conversation.id }, 'workflow.termination_poll_failed');
  return null;
});

workflow.ts:719: Add getLog().info({ runId, status }, 'workflow.termination_grace_satisfied') when the workflow exits 0 via the DB verdict, so logs distinguish success from the null-run path.

Minor / nice-to-have

workflow.ts:716: Comment says "Still active after grace" but the condition uses >=. At exactly 5000ms elapsed, the outer branch fires instead. Either change >= to > or update the comment to "Still active at or beyond grace boundary".
workflow.ts:719-720: void cleanup('SIGTERM') / void cleanup('SIGINT') intentionally discard the Promise — a brief // process exit imminent comment prevents future readers from flagging this as a missing await.

Compliments

The multi-paragraph block at workflow.ts:698–726 explaining the race between completeWorkflowRun and external SIGTERM is exactly the kind of non-obvious WHY that justifies a comment. The "5 seconds is generous: completeWorkflowRun typically runs in <100ms" calibration is concrete and well-reasoned.

Reviewed via maintainer-review-pr workflow (Pi/Minimax). Aspects run: code-review, error-handling, test-coverage, comment-quality.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(cli/workflow): add SIGTERM grace period to avoid losing completion to a race#1707

fix(cli/workflow): add SIGTERM grace period to avoid losing completion to a race#1707
jasonjack-jay wants to merge 1 commit into
coleam00:devfrom
jasonjack-jay:fix/sigterm-grace-period-during-completion-race

jasonjack-jay commented May 17, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 17, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

Wirasm commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jasonjack-jay commented May 17, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

Wirasm commented May 19, 2026

Review Summary

Blocking issues

Suggested fixes

Minor / nice-to-have

Compliments

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jasonjack-jay commented May 17, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 17, 2026 •

edited

Loading