feat(workflows): add always_run node opt-out for resume caching (closes #1391)#1730
Conversation
Closes coleam00#1391. Adds an optional `always_run: boolean` field on every DAG node. When `true`, the node re-executes on resume even if it completed in the prior run. The resume pre-populate filters out always_run node IDs, and the per-node skip-check is gated by `!node.always_run`. Use case: producers whose exit code does not validate their output (bash that writes a file the consumer parses, code generators, fetch scripts). Today a successful-but-garbage producer stays cached across every resume; the only escape is renaming the node. Default is unchanged. Normal cached nodes in the same run still skip. Emits a new `dag.node_always_run_resume_forced` log event so operators can see the flag firing.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughThis PR adds an optional Changesalways_run Resume Opt-Out Feature
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related issues
Possibly related PRs
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
packages/workflows/src/dag-executor.test.ts (1)
4807-4851: ⚡ Quick winAssert the forced-resume log event for
always_runnodes.This validates rerun semantics, but it doesn’t verify the promised
dag.node_always_run_resume_forcedemission. Adding that assertion will catch observability regressions alongside behavior regressions.Suggested test assertion
it('re-runs node flagged always_run even when present in priorCompletedNodes', async () => { const store = createMockStore(); const mockDeps = createMockDeps(store); const platform = createMockPlatform(); const workflowRun = makeWorkflowRun(); + mockLogFn.mockClear(); const priorCompletedNodes = new Map([['producer', 'cached stale output']]); await executeDagWorkflow( mockDeps, @@ const skippedEvent = eventCalls.find( (call: unknown[]) => (call[0] as { event_type: string }).event_type === 'node_skipped_prior_success' && (call[0] as { step_name: string }).step_name === 'producer' ); expect(skippedEvent).toBeUndefined(); + + const forcedResumeLogs = mockLogFn.mock.calls.filter((call: unknown[]) => + call.some( + arg => typeof arg === 'string' && arg.includes('dag.node_always_run_resume_forced') + ) + ); + expect(forcedResumeLogs.length).toBeGreaterThan(0); });As per coding guidelines, "Use structured logging with Pino; event naming format:
{domain}.{action}_{state}with standard states: _started, _completed, _failed, _validated, _rejected".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/workflows/src/dag-executor.test.ts` around lines 4807 - 4851, The test it('re-runs node flagged always_run even when present in priorCompletedNodes', ...) currently checks behavior but does not assert the structured log/event emission for forced resume; update the test to assert that store.createWorkflowEvent was called with an event having event_type 'dag.node_always_run_resume_forced' and step_name 'producer' (alongside the existing skippedEvent check and mockSendQueryDag assertions) so the test verifies the dag.node_always_run_resume_forced emission from the code paths that rerun always_run nodes.packages/docs-web/src/content/docs/guides/authoring-workflows.md (1)
575-576: 💤 Low valueConsider clarifying "same run" phrasing for precision.
The phrase "Normal cached nodes in the same run are still skipped" is potentially ambiguous in the context of resume behavior. Readers might wonder whether "same run" refers to the resume run or to same-run (non-resume) caching behavior.
✏️ Suggested clearer phrasing
-On resume, `fetch-data` re-runs regardless of prior success, so `process-data` reads a freshly produced file. Normal cached nodes in the same run are still skipped — `always_run` is per-node. +On resume, `fetch-data` re-runs regardless of prior success, so `process-data` reads a freshly produced file. Other nodes without `always_run` are still skipped as normal — the flag is per-node.Alternative:
-On resume, `fetch-data` re-runs regardless of prior success, so `process-data` reads a freshly produced file. Normal cached nodes in the same run are still skipped — `always_run` is per-node. +On resume, `fetch-data` re-runs regardless of prior success, so `process-data` reads a freshly produced file. Non-`always_run` nodes in the resumed workflow remain skipped — the flag is per-node.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/docs-web/src/content/docs/guides/authoring-workflows.md` around lines 575 - 576, Update the sentence "Normal cached nodes in the same run are still skipped — `always_run` is per-node." to explicitly state that "same run" refers to the resumed execution (the resume run), e.g. replace with wording like: "During a resume, cached nodes that were up-to-date at the start of the resume run are still skipped — `always_run` applies per node." Locate and change the sentence following the explanation that "`fetch-data` re-runs regardless of prior success" so readers understand "same run" means the resume run rather than the original run.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@packages/docs-web/src/content/docs/guides/authoring-workflows.md`:
- Around line 575-576: Update the sentence "Normal cached nodes in the same run
are still skipped — `always_run` is per-node." to explicitly state that "same
run" refers to the resumed execution (the resume run), e.g. replace with wording
like: "During a resume, cached nodes that were up-to-date at the start of the
resume run are still skipped — `always_run` applies per node." Locate and change
the sentence following the explanation that "`fetch-data` re-runs regardless of
prior success" so readers understand "same run" means the resume run rather than
the original run.
In `@packages/workflows/src/dag-executor.test.ts`:
- Around line 4807-4851: The test it('re-runs node flagged always_run even when
present in priorCompletedNodes', ...) currently checks behavior but does not
assert the structured log/event emission for forced resume; update the test to
assert that store.createWorkflowEvent was called with an event having event_type
'dag.node_always_run_resume_forced' and step_name 'producer' (alongside the
existing skippedEvent check and mockSendQueryDag assertions) so the test
verifies the dag.node_always_run_resume_forced emission from the code paths that
rerun always_run nodes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 87c9efac-1536-44e1-91ff-055c8622b34c
📒 Files selected for processing (5)
packages/docs-web/src/content/docs/guides/authoring-workflows.mdpackages/workflows/src/dag-executor.test.tspackages/workflows/src/dag-executor.tspackages/workflows/src/loader.test.tspackages/workflows/src/schemas/dag-node.ts
Review SummaryVerdict: blocking-issues This PR adds Blocking issues
Suggested fixes
Minor / nice-to-have
Compliments
Reviewed via maintainer-review-pr workflow (Pi/Minimax). Aspects run: code-review, error-handling, test-coverage, comment-quality, docs-impact. |
The always_run resume-forced path only wrote a structured log line. The prior_success skip path writes a DB workflow_event, so resume forensics could see skipped nodes but not nodes that were reset from the skip list. Add a symmetric node_always_run_reset event with the prior output so operators can reconstruct resume decisions from the workflow_events table. Drop the trailing PR reference from the comment — surrounding text explains intent.
Blocking issues
Suggested fixes
Minor
|
Summary
always_run: booleanfield on every DAG node. Whentrue, the node re-executes on resume even whenpriorCompletedNodessays it completed — both the resume pre-populate and the per-node skip-check honor the flag.UX Journey
Before
After
Architecture Diagram
Before
After
Connection inventory:
always_run?: booleanto all variantsalways_runwhen definedLabel Snapshot
risk: lowsize: Sworkflowsworkflows:dag-executor,workflows:schemasChange Metadata
featureworkflowsLinked Issue
Validation Evidence
New tests cover:
priorCompletedNodes(nonode_skipped_prior_successevent)always_run: trueand leaves it undefined when omittedSecurity Impact
Compatibility / Migration
Human Verification
always_run: trueon a real two-node YAML (fetch-databash producer +process-dataconsumer); the parsed AST exposes the field; consumer reads fresh output post-resume.nodeOutputs.set(...)write.What was not verified beyond CI:
archon workflow run ... --resume) was not exercised — the dag-executor unit tests directly drive the same code path viapriorCompletedNodes.Side Effects / Blast Radius
packages/workflows/src/dag-executor.tsresume pre-populate + skip-check,packages/workflows/src/schemas/dag-node.tsschema field.always_run: trueon a node whose downstream consumer depends on idempotency, the consumer must tolerate re-execution — same contract as any producer node.dag.node_always_run_resume_forcedsurfaces every time the flag fires, so operators can see resume behavior in workflow logs.Rollback Plan
always_runwould see the field ignored — Zod still parses it (optional unknown field), no runtime error.dag.node_always_run_resume_forcedto identify which node opted out.Risks and Mitigations
always_run: trueon a destructive node (e.g.rm -rf <dir>) and triggers it on every resume.nodeOutputs.set(...)writes the fresh value before the consumer'ssubstituteNodeOutputRefscall. Test 3 in the new suite asserts this end-to-end.Summary by CodeRabbit
New Features
always_runnode option to opt a node out of resume caching so it is re-executed on workflow resume.Behavior
always_runnodes to rerun and emits a dedicated workflow event when prior outputs are reset.Documentation
always_run.Tests