Skip to content

[One Workflow] Fix live progress flush racing concurrency cancellation#271229

Open
h88 wants to merge 4 commits into
elastic:mainfrom
h88:1wf-main-live-progress-concurrency-fix
Open

[One Workflow] Fix live progress flush racing concurrency cancellation#271229
h88 wants to merge 4 commits into
elastic:mainfrom
h88:1wf-main-live-progress-concurrency-fix

Conversation

@h88
Copy link
Copy Markdown
Contributor

@h88 h88 commented May 26, 2026

Summary

Follow-up to #270900 (live progress during timer waits). That change called flushState() before setting the workflow to WAITING for all timer-based waits. That makes concurrency cancellation during wait steps race with persisted WAITING state and can surface as Scout failures (cancelled/skipped vs failed), tracked in #257103.

This PR keeps live progress for short in-process waits while avoiding the race:

  • Call flushState() only on the short wait path (diff < 5s), not before long waits that schedule a Task Manager resume task.
  • Flush while the workflow is still RUNNING, then set WAITING, so the persistence loop is not stopped before cancellation can propagate.
  • On TimeoutAbortedError, do not reset workflow status to RUNNING when the step abort signal was triggered by cancellation.

The short-wait abort behavior intentionally diverges from #260406, which added a unit test expecting RUNNING to be set on any in-process sleep abort (including cancellation). That reset interferes with concurrency cancel-in-progress / drop handling during timer waits.

9.4: The same fix is manually backported on #271218 (backport follow-up to #270968).

Backport note

Auto-backport between main and 9.4 for this change is not recommended as a single chain:

  • 9.4 already has #271218 with the same logic on a smaller handle_execution_delay.ts (no WAITING_FOR_CHILD / idle-timeout scheduling).
  • main has additional timer-wait code paths from #260406 and related work.

Merge both PRs independently - bot backport in either direction will likely hit merge conflicts.

Scout status on main (before this fix)

concurrency_control.spec.ts was run locally on up-to-date main with #270900 merged and passed 2/2 twice. The failure mode is timing-dependent (same class of flake CI saw on #270900 / #270968), so a green local run does not mean the race is absent.

Test plan

  • node scripts/jest src/platform/plugins/shared/workflows_execution_engine/server/workflow_execution_loop/handle_execution_delay.test.ts
  • node scripts/scout run-tests --arch stateful --domain classic --serverConfigSet workflows_ui --testFiles src/platform/plugins/shared/workflows_management/test/scout_workflows_ui/api/tests/workflow_execution/concurrency_control.spec.ts

Made with Cursor

Fixes #257103

Follow-up to elastic#270900: scope flushState to short in-process waits only, flush
while the workflow is still RUNNING, and skip RUNNING reset when the delay
abort signal was triggered by cancellation. Reverts the short-wait abort
behavior from elastic#260406 that reset status to RUNNING on all TimeoutAbortedError.

Fixes elastic#257103

Co-authored-by: Cursor <cursoragent@cursor.com>
@h88 h88 self-assigned this May 26, 2026
@h88 h88 added release_note:skip Skip the PR/issue when compiling release notes backport:skip This PR does not require backporting Team:One Workflow Team label for One Workflow (Workflow automation) labels May 26, 2026
@h88 h88 marked this pull request as ready for review May 26, 2026 10:49
@h88 h88 requested a review from a team as a code owner May 26, 2026 10:49
Copy link
Copy Markdown
Contributor

@skynetigor skynetigor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@delanni
Copy link
Copy Markdown
Member

delanni commented May 26, 2026

👍 @h88 - can we get this backported to 9.4, it's flaky there

@kibanamachine
Copy link
Copy Markdown
Contributor

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] Jest Tests #2 / should pass the document flyout scope id to the full reason renderer

Metrics [docs]

✅ unchanged

History

cc @h88

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport:skip This PR does not require backporting release_note:skip Skip the PR/issue when compiling release notes Team:One Workflow Team label for One Workflow (Workflow automation)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Failing test: Workflow execution concurrency control - drop strategy drops new executions until there is an already running execution

4 participants