fix(cli/workflow): add SIGTERM grace period to avoid losing completion to a race #1707
When an external orchestrator watches the same state that a workflow's
final node modifies, it can race the DAG executor: the last node has
already done its side effect (e.g. moved an issue label out of an
'in-progress' set), the orchestrator sees the state change and SIGTERMs
the runner, and the SIGTERM lands in the small window between the bash
node returning and `completeWorkflowRun` being called.
Before this change, cleanup unconditionally marked the active run as
failed and exited 1. Downstream consumers saw exit-1 and retried
idempotent work that was already done, producing log noise and (in some
cases) duplicate side effects.
Cleanup now polls `getActiveWorkflowRun` every 100ms for up to 5s
before deciding what to do:
- If the run finishes naturally during grace, look up its final status
and exit 0 only if the DB says `completed` — otherwise exit 1.
- If the run is still active after 5s, force-fail it as before
(preserves the user-Ctrl+C semantics).
- If we never see an active run during grace, preserve the previous
exit-1 behaviour — we can't tell what happened.
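
The three outcomes above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the `RunDb` interface, the function name `resolveExitCode`, the parameter names, and the exact signatures of `getActiveWorkflowRun`, `getWorkflowRunStatus`, and `failWorkflowRun` are assumptions made for the example. The sketch also assumes the final status is looked up by the id of the run last seen active.

```typescript
// Hypothetical shapes: the real workflows DB module may use different signatures.
interface RunDb {
  getActiveWorkflowRun(): Promise<{ id: string } | null>;
  getWorkflowRunStatus(runId: string): Promise<string | null>;
  failWorkflowRun(runId: string, reason: string): Promise<void>;
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Polls during the grace window and returns the exit code the process should use.
async function resolveExitCode(
  db: RunDb,
  graceMs = 5000,
  pollMs = 100,
): Promise<number> {
  const deadline = Date.now() + graceMs;
  let seenRunId: string | null = null;

  for (;;) {
    const active = await db.getActiveWorkflowRun();
    if (active !== null) {
      // Remember the run so its final status can be looked up once it finishes.
      seenRunId = active.id;
    } else if (seenRunId !== null) {
      // The run was active earlier and has now finished on its own.
      const status = await db.getWorkflowRunStatus(seenRunId);
      return status === "completed" ? 0 : 1;
    }
    if (Date.now() >= deadline) break;
    await sleep(pollMs);
  }

  if (seenRunId !== null) {
    // Still active when the grace window closed: force-fail, as before the change.
    await db.failWorkflowRun(seenRunId, "terminated by signal");
  }
  // Either force-failed or never observed: exit 1 in both cases.
  return 1;
}
```

Remembering the run id while the run is still active is what keeps the "finished during grace" branch reachable in this sketch: once the run is no longer active, `getActiveWorkflowRun` has nothing left to return.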
5s is generous: `completeWorkflowRun` is typically <100ms. A genuine
user-driven cancel is delayed by at most 5s, which is the right trade
for not corrupting the success/failure signal.
The test mock for the workflows DB module is extended with
`getWorkflowRunStatus` so the new call site has a stub. All 94 existing
workflow.test.ts tests still pass.
This was observed in production with a harness that watches GitHub state
labels: every spike workflow exited 1 even though the issue label had been
moved correctly and the PR opened correctly. The harness's defensive guard
prevented duplicate dispatches, but the exit-1 noise made it harder to
distinguish real failures from this cosmetic race.
Walkthrough
The CLI workflow runner's termination handling is replaced with a graceful shutdown strategy. On SIGTERM/SIGINT signals, instead of immediately failing the workflow, the process now waits up to 5 seconds (polling every 100ms) to detect natural completion, only force-failing if the workflow remains active after the grace window. Exit code is 0 only when the database reports the workflow as completed.
Sequence Diagram(s)
sequenceDiagram
    participant Process
    participant SignalHandler
    participant Cleanup as Cleanup Function
    participant DB as getActiveWorkflowRun
    participant FailFn as failWorkflowRun
    participant StatusDB as getWorkflowRunStatus
    Process->>SignalHandler: SIGTERM/SIGINT received
    SignalHandler->>Cleanup: invoke async cleanup
    Cleanup->>DB: poll for active run
    alt Run completes during grace
        DB-->>Cleanup: no active run found
        Cleanup->>StatusDB: query final status
        alt Status is completed
            StatusDB-->>Cleanup: completed
            Cleanup->>Process: exit code 0
        else Status not completed
            StatusDB-->>Cleanup: other status
            Cleanup->>Process: exit code 1
        end
    else Grace period expires
        Cleanup->>FailFn: force-fail active run
        Cleanup->>Process: exit code 1
    else Cleanup error
        Cleanup->>Process: log failure, exit code 1
    end
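
The outer flow in the diagram (signal received, async cleanup invoked, exit code applied, cleanup errors logged and mapped to exit 1) could be wired up along these lines. This is a hedged sketch: `SignalSource`, `installGracefulShutdown`, and the injected `exit` callback are hypothetical names introduced for illustration, and the guard against repeated signals is an assumption that does not appear in the PR description.

```typescript
// Hypothetical wiring; these names are illustrative, not the PR's actual code.
interface SignalSource {
  on(signal: "SIGTERM" | "SIGINT", handler: () => void): unknown;
}

function installGracefulShutdown(
  source: SignalSource,
  cleanup: () => Promise<number>,
  exit: (code: number) => void,
): void {
  // Assumption: a second signal during the grace window is ignored, not restarted.
  let started = false;

  const handler = (): void => {
    if (started) return;
    started = true;
    cleanup()
      .then((code) => exit(code))
      .catch((err: unknown) => {
        // Matches the "Cleanup error" branch: log the failure, then exit 1.
        console.error("cleanup failed:", err);
        exit(1);
      });
  };

  source.on("SIGTERM", handler);
  source.on("SIGINT", handler);
}
```

In a Node.js CLI the natural arguments would be `process` as the signal source and `process.exit` as the exit callback. Injecting them keeps the handler testable without sending real signals.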
Review Summary
Verdict: blocking-issues
Your grace-period polling strategy for the SIGTERM race condition is solid; the production backstory and 5-second calibration in the comments are genuinely valuable. However, there's a logic bug that makes the entire "completed during grace → exit 0" path unreachable, plus a missing test suite for the new cleanup handler.
Blocking issues
Suggested fixes
Minor / nice-to-have
Compliments
The multi-paragraph block at
Reviewed via maintainer-review-pr workflow (Pi/Minimax). Aspects run: code-review, error-handling, test-coverage, comment-quality.