Skip to content

fix(cli/workflow): add SIGTERM grace period to avoid losing completion to a race#1707

Open
jasonjack-jay wants to merge 1 commit into
coleam00:devfrom
jasonjack-jay:fix/sigterm-grace-period-during-completion-race
Open

fix(cli/workflow): add SIGTERM grace period to avoid losing completion to a race#1707
jasonjack-jay wants to merge 1 commit into
coleam00:devfrom
jasonjack-jay:fix/sigterm-grace-period-during-completion-race

Conversation

@jasonjack-jay
Copy link
Copy Markdown

@jasonjack-jay jasonjack-jay commented May 17, 2026

When an external orchestrator watches the same state that a workflow's
final node modifies, it can race the DAG executor: the last node has
already done its side effect (e.g. moved an issue label out of an
'in-progress' set), the orchestrator sees the state change and SIGTERMs
the runner, and the SIGTERM lands in the small window between the bash
node returning and completeWorkflowRun being called.

Before this change, cleanup unconditionally marked the active run as
failed and exited 1. Downstream consumers saw exit-1 and retried
idempotent work that was already done, producing log noise and (in some
cases) duplicate side effects.

Cleanup now polls getActiveWorkflowRun every 100ms for up to 5s
before deciding what to do:

  • If the run finishes naturally during grace, look up its final status
    and exit 0 only if the DB says completed — otherwise exit 1.
  • If the run is still active after 5s, force-fail it as before
    (preserves the user-Ctrl+C semantics).
  • If we never see an active run during grace, preserve the previous
    exit-1 behaviour — we can't tell what happened.

5s is generous: completeWorkflowRun is typically <100ms. A genuine
user-driven cancel is delayed by at most 5s, which is the right trade
for not corrupting the success/failure signal.

Test mock for the workflows DB module is extended with
getWorkflowRunStatus so the new call site has a stub. All 94 existing
workflow.test.ts tests still pass.

This was observed in production with a harness that watches GitHub state
labels: every spike workflow exited 1 with the issue label correctly
moved and the PR correctly opened. The harness's defensive guard
prevented duplicate dispatches but the exit-1 noise made it harder to
distinguish real failures from this cosmetic race.

Summary by CodeRabbit

  • Bug Fixes

    • Enhanced workflow termination handling with a grace period for natural completion before forced shutdown. Exit codes now accurately reflect workflow completion status rather than always failing.
  • Tests

    • Extended test mocks to support workflow status verification.

Review Change Stack

…n to a race

When an external orchestrator watches the same state that a workflow's
final node modifies, it can race the DAG executor: the last node has
already done its side effect (e.g. moved an issue label out of an
'in-progress' set), the orchestrator sees the state change and SIGTERMs
the runner, and the SIGTERM lands in the small window between the bash
node returning and `completeWorkflowRun` being called.

Before this change, cleanup unconditionally marked the active run as
failed and exited 1. Downstream consumers saw exit-1 and retried
idempotent work that was already done, producing log noise and (in some
cases) duplicate side effects.

Cleanup now polls `getActiveWorkflowRun` every 100ms for up to 5s
before deciding what to do:

  - If the run finishes naturally during grace, look up its final status
    and exit 0 only if the DB says `completed` — otherwise exit 1.
  - If the run is still active after 5s, force-fail it as before
    (preserves the user-Ctrl+C semantics).
  - If we never see an active run during grace, preserve the previous
    exit-1 behaviour — we can't tell what happened.

5s is generous: `completeWorkflowRun` is typically <100ms. A genuine
user-driven cancel is delayed by at most 5s, which is the right trade
for not corrupting the success/failure signal.

Test mock for the workflows DB module is extended with
`getWorkflowRunStatus` so the new call site has a stub. All 94 existing
workflow.test.ts tests still pass.

This was observed in production with a harness that watches GitHub state
labels: every spike workflow exited 1 with the issue label correctly
moved and the PR correctly opened. The harness's defensive guard
prevented duplicate dispatches but the exit-1 noise made it harder to
distinguish real failures from this cosmetic race.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 17, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 957b9430-ce4e-4376-820e-cfebd34302ae

📥 Commits

Reviewing files that changed from the base of the PR and between 7bdf931 and 67766ed.

📒 Files selected for processing (2)
  • packages/cli/src/commands/workflow.test.ts
  • packages/cli/src/commands/workflow.ts

📝 Walkthrough

Walkthrough

The CLI workflow runner's termination handling is replaced with a graceful shutdown strategy. On SIGTERM/SIGINT signals, instead of immediately failing the workflow, the process now waits up to 5 seconds (polling every 100ms) to detect natural completion, only force-failing if the workflow remains active after the grace window. Exit code is 0 only when the database reports the workflow as completed.

Changes

Graceful Workflow Termination

Layer / File(s) Summary
Graceful termination strategy with polling and signal integration
packages/cli/src/commands/workflow.ts, packages/cli/src/commands/workflow.test.ts
Termination grace constants (5s, 100ms poll), re-entrancy guard, and async cleanup function poll getActiveWorkflowRun; if the run finishes naturally and is reported as completed, process exits code 0; if still active after grace or on error, failWorkflowRun is called and code 1 is used. Signal handlers invoke cleanup asynchronously. Test mock adds getWorkflowRunStatus support.

Sequence Diagram(s)

sequenceDiagram
  participant Process
  participant SignalHandler
  participant Cleanup as Cleanup Function
  participant DB as getActiveWorkflowRun
  participant FailFn as failWorkflowRun
  participant StatusDB as getWorkflowRunStatus
  
  Process->>SignalHandler: SIGTERM/SIGINT received
  SignalHandler->>Cleanup: invoke async cleanup
  Cleanup->>DB: poll for active run
  alt Run completes during grace
    DB-->>Cleanup: no active run found
    Cleanup->>StatusDB: query final status
    alt Status is completed
      StatusDB-->>Cleanup: completed
      Cleanup->>Process: exit code 0
    else Status not completed
      StatusDB-->>Cleanup: other status
      Cleanup->>Process: exit code 1
    end
  else Grace period expires
    Cleanup->>FailFn: force-fail active run
    Cleanup->>Process: exit code 1
  else Cleanup error
    Cleanup->>Process: log failure, exit code 1
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • coleam00/Archon#1563: Both PRs adjust workflow-run "zombie" handling by branching on getWorkflowRunStatus (e.g., treating running differently from completed and triggering failWorkflowRun/cleanup backstops), including adding getWorkflowRunStatus to related mocks.

Poem

🐰 A graceful bow when signals say goodbye,
Five seconds kind—no rush to fail or die,
The workflow spins with hope, then finds its peace,
While polls and patience grant the sweet release! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description comprehensively explains the problem, solution, and rationale, but does not follow the required template structure with sections like Summary, UX Journey, Architecture Diagram, and other required metadata. Rewrite the description following the provided template structure, including Summary bullets, UX Journey, Architecture Diagram, Label Snapshot, Change Metadata, Linked Issue, Validation Evidence, Security Impact, Compatibility, Human Verification, Side Effects, Rollback Plan, and Risks sections.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: adding a SIGTERM grace period to prevent losing workflow completion status to a race condition.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@Wirasm
Copy link
Copy Markdown
Collaborator

Wirasm commented May 19, 2026

Review Summary

Verdict: blocking-issues

Your grace-period polling strategy for the SIGTERM race condition is solid — the production backstory and 5-second calibration in the comments are genuinely valuable. However, there's a logic bug that makes the entire "completed during grace → exit 0" path unreachable, plus a missing test suite for the new cleanup handler.

Blocking issues

  • packages/cli/src/commands/workflow.ts:740: The condition else if (lastSeenRunId && Date.now() - startedAt >= TERMINATION_GRACE_MS) is always true when lastSeenRunId is set, because the while loop only breaks when the time condition becomes false. Completed runs that finish during the grace window will be force-failed instead of exiting 0. Fix: change to else if (lastSeenRunId) — the if branch above already handles the timeout case. See the code snippet in the full review.

  • packages/cli/src/commands/workflow.ts:711-766: The cleanup() function is entirely untested. No test covers the signal registration, polling loop, or any of the three exit-code paths. Add tests for: (1) workflow finishes during grace → exit 0, (2) workflow still active after grace → failWorkflowRun called → exit 1, (3) no active run → exit 1, (4) getWorkflowRunStatus throws → exit 1 without crash. Consider exporting TERMINATION_GRACE_MS or injecting graceMs as a parameter so tests don't wait 5 seconds.

Suggested fixes

  • workflow.ts:727 and workflow.ts:737: getActiveWorkflowRun and getWorkflowRunStatus silently swallow DB errors. Log the error before swallowing so operators can see DB issues during termination:

    workflowDb.getActiveWorkflowRun(conversation.id).catch(err => {
      getLog().error({ err, conversationId: conversation.id }, 'workflow.termination_poll_failed');
      return null;
    });
  • workflow.ts:719: Add getLog().info({ runId, status }, 'workflow.termination_grace_satisfied') when the workflow exits 0 via the DB verdict, so logs distinguish success from the null-run path.

Minor / nice-to-have

  • workflow.ts:716: Comment says "Still active after grace" but the condition uses >=. At exactly 5000ms elapsed, the outer branch fires instead. Either change >= to > or update the comment to "Still active at or beyond grace boundary".

  • workflow.ts:719-720: void cleanup('SIGTERM') / void cleanup('SIGINT') intentionally discard the Promise — a brief // process exit imminent comment prevents future readers from flagging this as a missing await.

Compliments

The multi-paragraph block at workflow.ts:698–726 explaining the race between completeWorkflowRun and external SIGTERM is exactly the kind of non-obvious WHY that justifies a comment. The "5 seconds is generous: completeWorkflowRun typically runs in <100ms" calibration is concrete and well-reasoned.


Reviewed via maintainer-review-pr workflow (Pi/Minimax). Aspects run: code-review, error-handling, test-coverage, comment-quality.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants