[One Workflow] Fix live progress flush racing concurrency cancellation#271229
Open
h88 wants to merge 4 commits into
Follow-up to elastic#270900: scope flushState to short in-process waits only, flush while the workflow is still RUNNING, and skip RUNNING reset when the delay abort signal was triggered by cancellation.

Reverts the short-wait abort behavior from elastic#260406 that reset status to RUNNING on all TimeoutAbortedError.

Fixes elastic#257103

Co-authored-by: Cursor <cursoragent@cursor.com>
Member
👍 @h88 - can we get this backported to 9.4? It's flaky there.
Contributor
💛 Build succeeded, but was flaky
cc @h88
Summary
Follow-up to #270900 (live progress during timer waits). That change called `flushState()` before setting the workflow to `WAITING` for all timer-based waits. As a result, concurrency cancellation during `wait` steps races with the persisted `WAITING` state, which can surface as Scout failures (`cancelled`/`skipped` vs `failed`), tracked in #257103.

This PR keeps live progress for short in-process waits while avoiding the race:

- Call `flushState()` only on the short wait path (`diff < 5s`), not before long waits that schedule a Task Manager resume task.
- Flush while the workflow is still `RUNNING`, then set `WAITING`, so the persistence loop is not stopped before cancellation can propagate.
- On `TimeoutAbortedError`, do not reset workflow status to `RUNNING` when the step abort signal was triggered by cancellation.

The short-wait abort behavior intentionally diverges from #260406, which added a unit test expecting `RUNNING` to be set on any in-process sleep abort (including cancellation). That reset interferes with concurrency `cancel-in-progress` / `drop` handling during timer waits.

9.4: The same fix is manually backported on #271218 (backport follow-up to #270968).
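For reviewers who want the three rules from the summary in one place, here is a minimal sketch of the intended decision logic. It is not the code in `handle_execution_delay.ts`: `planTimerWait`, `statusAfterSleepAbort`, `WaitStep`, and `SHORT_WAIT_THRESHOLD_MS` are hypothetical names, and the ordering of steps on the long-wait path is an assumption made only for illustration.

```typescript
// Illustrative sketch only: every name here is hypothetical and is not taken
// from handle_execution_delay.ts.

type WorkflowStatus = 'RUNNING' | 'WAITING' | 'CANCELLED';

// The PR description gives the short-wait cutoff as "diff < 5s".
const SHORT_WAIT_THRESHOLD_MS = 5000;

// Ordered side effects performed when entering a timer wait.
type WaitStep =
  | 'flushState'
  | 'setStatus:WAITING'
  | 'sleepInProcess'
  | 'scheduleResumeTask';

function planTimerWait(diffMs: number): WaitStep[] {
  if (diffMs < SHORT_WAIT_THRESHOLD_MS) {
    // Short wait: flush while the workflow is still RUNNING, then mark it
    // WAITING, then sleep in process.
    return ['flushState', 'setStatus:WAITING', 'sleepInProcess'];
  }
  // Long wait: no flush before handing off to a scheduled resume task.
  // (The relative order of these two steps is assumed, not taken from the PR.)
  return ['setStatus:WAITING', 'scheduleResumeTask'];
}

function statusAfterSleepAbort(
  current: WorkflowStatus,
  abortedByCancellation: boolean
): WorkflowStatus {
  // When cancellation triggered the abort, leave the status alone so the
  // cancellation path decides the final state.
  if (abortedByCancellation) {
    return current;
  }
  // Any other in-process sleep abort resumes normal execution.
  return 'RUNNING';
}
```

Under these assumptions, `planTimerWait(1000)` includes a flush and `planTimerWait(60000)` does not, and `statusAfterSleepAbort('WAITING', true)` leaves the status untouched instead of forcing it back to `RUNNING`.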
Backport note
Auto-backport between `main` and `9.4` for this change is not recommended as a single chain:

- `9.4` already has #271218 with the same logic on a smaller `handle_execution_delay.ts` (no `WAITING_FOR_CHILD` / idle-timeout scheduling).
- `main` has additional timer-wait code paths from #260406 and related work.

Merge both PRs independently - bot backport in either direction will likely hit merge conflicts.
Scout status on `main` (before this fix)

`concurrency_control.spec.ts` was run locally on up-to-date `main` with #270900 merged and passed 2/2 twice. The failure mode is timing-dependent (same class of flake CI saw on #270900 / #270968), so a green local run does not mean the race is absent.

Test plan

- `node scripts/jest src/platform/plugins/shared/workflows_execution_engine/server/workflow_execution_loop/handle_execution_delay.test.ts`
- `node scripts/scout run-tests --arch stateful --domain classic --serverConfigSet workflows_ui --testFiles src/platform/plugins/shared/workflows_management/test/scout_workflows_ui/api/tests/workflow_execution/concurrency_control.spec.ts`

Made with Cursor
Fixes #257103