fix: resolve local activities in order of history events by chris-olszewski · Pull Request #1075 · temporalio/sdk-core

chris-olszewski · 2025-12-09T21:49:18Z

What was changed

On replay, local activities will now be resolved in the order that their markers appear in history as opposed to their schedule order.

In order to achieve this a few things had to change:

Peekahead Preresolutions

Previously we stored these as a map, switch to a vec so we can preserve the order we peek them in the history.

LA machine changes

Before

After

Important change here is the addition of a state between scheduling and getting resolved. This allows us to still create the LA machines as they come in, but delays resolving them until later so we can resolve them in the order we peeked the resolutions.

Removal of `FakeMarker`

This was causing failures with the new state machine as we would end up applying workflow completion events to LA machines.

I did a quick test off of master and removing the FakeMarker that gets emitted on transitioning to WaitingMarkerEventPreResolved doesn't result in any test failure. I can do this change in a separate PR if desired.

Why?

This could cause NDE errors if there were additional commands produced on each LA completion and the first scheduled LA didn't complete first e.g.

Promise.all([a().then(() => b()), c().then(() => d())])

would trigger NDE on replay if in history c finished before a.

Checklist

Closes [Bug] NDE replaying nested promises sdk-typescript#1744 once TS is updated to include this commit
How was this tested:
Existing test suite. Added test where LAs are resolved in different order than they are scheduled.
Updated TS to this branch and the failing nested promise replay test now passes without the NDE
Any docs updates needed?
No

chris-olszewski · 2025-12-10T14:43:53Z

crates/common/src/protos/canned_histories.rs

+///  8: EVENT_TYPE_WORKFLOW_TASK_SCHEDULED
+///  9: EVENT_TYPE_WORKFLOW_TASK_STARTED


Just updating comment here to match actual history

Sushisource

Really good fix. Love how you simplified things here.

The only thing left to do here is ensure that we're not breaking any existing histories (that wouldn't have already been broken by this bug). The way to do that is to take your new tests, and run them without your fix (multiple times since they will have different LA ordering?), save the histories (crates/sdk-core/tests/histories), and then use those in the replayer too.

I think, with this fix, it'll be the case that either the history triggers an NDE or not depending on if the original run happened to line up with the now-deterministic ordering of the LAs or not. I think the potentially worrying scenario here is that users possibly could've had workflows that, under replay, would sporadically throw NDEs but eventually pass after a few retries because they happened to hit the right sequence. Now, those workflows may fail/pass 100% of the time. I think that's probably acceptable, considering how rare this bug was to trigger anyway, but the moral of the story is we want to prove to ourselves the circumstances where the behavior change does or does not cause old histories to fail to replay. @mjameswh probably has some good advice about what other tests can be added based on his original research.

crates/sdk-core/src/worker/workflow/machines/workflow_machines.rs

mjameswh · 2025-12-10T21:10:59Z

crates/sdk-core/src/worker/workflow/machines/workflow_machines.rs

                            self.process_machine_responses(mk, resps)?;
                        } else {
-                            self.local_activity_data.insert_peeked_marker(*la_dat);
+                            // Since the LA machine exists, we have encountered the LA WF command.


Is this case even possible? If it is, it feels wrong, and possibly dangerous, to simply drop the marker.

Unless we can prove this case exists, or otherwise prove that dropping the marker would be safe, I think we should just remove the call to will_accept_resolve_marker above, and let try_resolve_with_dat propagate an error should we ever reached that from an unexpected state.

This case is hit during query_during_wft_heartbeat_doesnt_accidentally_fail_to_continue_heartbeat both before/after the changes in this PR. Previously it would be added to peeked collection, but just sit there unused since the LAM was already created.

The test in question executes the LA, but has 2 history polls where the second one includes 2 WFT so we do perform a peek ahead even in this non-replay scenario. The marker will get handled through the normal means when we process the 3rd WFT.

mjameswh · 2025-12-11T07:47:18Z

Great! Thanks for completing that work. And really pleased to see this ended up removing the need for fake markers, that's a welcome simplification.

I think, with this fix, it'll be the case that either the history triggers an NDE or not depending on if the original run happened to line up with the now-deterministic ordering of the LAs or not. I think the potentially worrying scenario here is that users possibly could've had workflows that, under replay, would sporadically throw NDEs but eventually pass after a few retries because they happened to hit the right sequence.

I'm quite sure there's no risk of "sporadic" repro on this issue. Legacy behavior was actually deterministic (according to the traditional, non-temporal definition of that word) in both the non-replay and replay cases, though possibly inconsistent between these two (hence the non-determinism issue according to Temporal's definition of that word). That is:

In the non-replay mode, LAs would always notify lang in the order that the LA execution completed;
In replay mode, they would always complete in the order that they got scheduled, assuming they originally completed within the same WFT.

It's true that some workflows were replaying successfully despite technically hitting this LA bug, but that's still fully deterministic (traditional definition) for a given workflow execution. There would be no case of "retry a few times, and it may eventually pass".

Note that it is also possible to get various situations where we have within the scope of a single WFT, some LAs that completed within the same WFT that they were scheduled, mixed with other LAs that completed in different WFT (i.e. either scheduled during a previous WFT, or that will complete in a following WFT). I'm definitely NOT confident that we properly handled those scenarios so far in this PR. They will have to be explicitly tested.

I have other concerns that I want us to cover in tests. I'll detail that tomorrow.

Sushisource · 2026-03-16T16:28:11Z

@claude Review

chris-olszewski · 2026-03-16T19:25:08Z

Discussed with @mjameswh and this PR is currently on hold until #1146 is fixed as we believe it will be easier to land with that fix already in place.

Sushisource · 2026-03-20T16:36:54Z

(sorry, trying to figure out why the bot isn't working)

Sushisource · 2026-03-20T16:53:54Z

@claude review

claude · 2026-03-20T17:43:00Z

crates/sdk-core/src/worker/workflow/machines/local_activity_state_machine.rs

    // Replay path ================================================================================
-    // LAs on the replay path always need to eventually see the marker
-    WaitingMarkerEvent --(MarkerRecorded(CompleteLocalActivityData), shared on_marker_recorded)
-      --> MarkerCommandRecorded;
+    WaitingResolveFromMarkerLookAhead --(HandleKnownResult(ResolveDat), on_handle_result) --> ResolvedFromMarkerLookAheadWaitingMarkerEvent;
    // If we are told to cancel while waiting for the marker, we still need to wait for the marker.
-    WaitingMarkerEvent --(Cancel, on_cancel_requested) --> WaitingMarkerEvent;
+    WaitingResolveFromMarkerLookAhead --(Cancel, on_cancel_requested) --> WaitingResolveFromMarkerLookAhead;
+    ResolvedFromMarkerLookAheadWaitingMarkerEvent --(Cancel, on_cancel_requested) --> ResolvedFromMarkerLookAheadWaitingMarkerEvent;
+
    // Because there could be non-heartbeat WFTs (ex: signals being received) between scheduling
    // the LA and the marker being recorded, peekahead might not always resolve the LA *before*
    // scheduling it. This transition accounts for that.
-    WaitingMarkerEvent --(HandleKnownResult(ResolveDat), on_handle_result) --> WaitingMarkerEvent;
-    WaitingMarkerEvent --(NoWaitCancel(ActivityCancellationType),
-                          on_no_wait_cancel) --> WaitingMarkerEvent;
+    WaitingResolveFromMarkerLookAhead --(NoWaitCancel(ActivityCancellationType),
+                                                     on_no_wait_cancel) --> WaitingResolveFromMarkerLookAhead;
+    ResolvedFromMarkerLookAheadWaitingMarkerEvent --(NoWaitCancel(ActivityCancellationType),
+                                                     on_no_wait_cancel) --> ResolvedFromMarkerLookAheadWaitingMarkerEvent;
+
+    // LAs on the replay path always need to eventually see the marker
+    ResolvedFromMarkerLookAheadWaitingMarkerEvent --(MarkerRecorded(CompleteLocalActivityData), shared on_marker_recorded)
+      --> MarkerCommandRecorded;

    // It is entirely possible to have started the LA while replaying, only to find that we have
    // reached a new WFT and there still was no marker. In such cases we need to execute the LA.
    // This can easily happen if upon first execution, the worker does WFT heartbeating but then
    // dies for some reason.


🔴 The WaitingResolveFromMarkerLookAhead state is missing a MarkerRecorded transition and is not handled in marker_should_get_special_handling(), causing a fatal error when LA markers arrive before lookahead has pre-resolved the machine. This happens during paginated replay with heartbeat WFTs: the 2-WFT extraction guarantee may yield WFT1 + heartbeat-WFT2 (no markers), and when WFT3 (containing the markers) arrives via page fetch, the LA machines are still in WaitingResolveFromMarkerLookAhead and hit the catch-all fatal!() branch. Fix by adding a MarkerRecorded transition from WaitingResolveFromMarkerLookAhead (resolving directly) and including it in marker_should_get_special_handling().

Extended reasoning...

Bug Analysis

The old code had a WaitingMarkerEvent state with a direct MarkerRecorded transition:

WaitingMarkerEvent --(MarkerRecorded(...)) --> MarkerCommandRecorded

and marker_should_get_special_handling() returned Ok(true) for WaitingMarkerEvent. The new code replaces this with WaitingResolveFromMarkerLookAhead, which has no MarkerRecorded transition and is not listed in marker_should_get_special_handling() (falling into _ => Err(fatal!(...))). The design assumes lookahead always pre-resolves the machine to ResolvedFromMarkerLookAheadWaitingMarkerEvent before the marker event is processed.

Why the assumption fails

The HistoryPaginator::extract_next_update() guarantees at least 2 WFT sequences when more pages exist (line 294: if update.wft_count < 2 { continue }). However, this guarantee is insufficient when heartbeat WFTs are present. Consider this history:

WFT1 (events 1-3): Schedule LA1, LA2 WFT2 (events 4-5): Heartbeat (LAs still running) WFT3 (events 6-10): WFT2_completed, LA1 marker, LA2 marker, WFT3_scheduled, WFT3_started

extract_next_update() returns events 1-5 (WFT1 + WFT2 = 2 WFTs), satisfying the guarantee, but the LA markers are in WFT3 on a different page.

Step-by-step proof

Initial update: extract_next_update() returns events 1-5 (WFT1 + heartbeat WFT2). wft_count == 2, so it stops fetching despite more pages existing.

Process WFT1: apply_next_wft_from_history takes events 1-3. The lookahead (peek_next_wft_sequence) sees events 4-5 (heartbeat WFT2) — no LA markers found, so no preresolutions are stored.

Activation to lang: Lang schedules LA1 and LA2.

Completion processing: push_commands_and_iterate → iterate_machines → handle_driven_results creates LA machines in WaitingResolveFromMarkerLookAhead. apply_local_action_peeked_resolutions() finds no preresolutions.

Process WFT2 (heartbeat): apply_next_task_if_ready → apply_next_wft_from_history takes events 4-5. Lookahead peeks — buffer is now empty (WFT3 not fetched yet). No preresolutions stored.

Lang completes WFT2: ready_to_apply_next_wft() returns false (buffer empty, not last WFT). Completion is deferred, page fetch is triggered.

Page 2 arrives: _fetched_page_completion → _process_completion calls push_commands_and_iterate (for deferred WFT2 completion) then feed_history_from_new_page(update).

feed_history_from_new_page → new_history_from_server → apply_next_wft_from_history processes WFT3 events. The event loop encounters event 7 (LA marker).

handle_command_event → marker_should_get_special_handling() is called on the LA machine, which is still in WaitingResolveFromMarkerLookAhead. This state hits the _ => Err(fatal!(...)) catch-all → fatal error.

The lookahead at the end of apply_next_wft_from_history (which peeks at WFT4) cannot help because the marker is processed in the event loop before the lookahead runs.

Addressing the refutation

The refutation correctly states that extract_next_update guarantees ≥2 WFT sequences. However, when heartbeat WFTs are present, those 2 WFTs can be WFT1 + heartbeat-WFT2, which contain no LA markers. The markers in WFT3 require a separate page fetch, during which the machines remain in the unresolvable WaitingResolveFromMarkerLookAhead state. The 2-WFT guarantee prevents the scenario only when markers appear in the immediate next WFT after scheduling — it does not cover the heartbeat + pagination case.

Impact

This is a regression from the old code. In production, long-running workflows commonly use WFT heartbeating (especially with local activities), and large histories require pagination. When page boundaries fall between a heartbeat WFT and the WFT containing LA markers, the workflow will hit a fatal error during replay, preventing it from making progress.

Fix

Add a MarkerRecorded transition from WaitingResolveFromMarkerLookAhead that resolves the LA directly (like the old WaitingMarkerEvent did), and add WaitingResolveFromMarkerLookAhead to the match in marker_should_get_special_handling() returning Ok(true). This restores the fallback path that the old code had.

claude · 2026-03-20T17:43:00Z

crates/sdk-core/src/worker/workflow/machines/local_activity_state_machine.rs

    // Because there could be non-heartbeat WFTs (ex: signals being received) between scheduling
    // the LA and the marker being recorded, peekahead might not always resolve the LA *before*
    // scheduling it. This transition accounts for that.
-    WaitingMarkerEvent --(HandleKnownResult(ResolveDat), on_handle_result) --> WaitingMarkerEvent;
-    WaitingMarkerEvent --(NoWaitCancel(ActivityCancellationType),
-                          on_no_wait_cancel) --> WaitingMarkerEvent;
+    WaitingResolveFromMarkerLookAhead --(NoWaitCancel(ActivityCancellationType),
+                                                     on_no_wait_cancel) --> WaitingResolveFromMarkerLookAhead;
+    ResolvedFromMarkerLookAheadWaitingMarkerEvent --(NoWaitCancel(ActivityCancellationType),
+                                                     on_no_wait_cancel) --> ResolvedFromMarkerLookAheadWaitingMarkerEvent;


🟡 Two minor naming/comment issues: (1) apply_local_action_peeked_resolutions and its doc comment use "local action" instead of "local activity", inconsistent with the rest of the codebase (LocalActivityMachine, LocalActivityData, local_activity_data, etc.). (2) The comment at lines 73-75 about peekahead resolution timing was originally attached to the HandleKnownResult transition but after the refactor now incorrectly describes the NoWaitCancel transitions below it — it should be moved above line 68.

Extended reasoning...

Naming inconsistency: "local action" vs "local activity"

The new function apply_local_action_peeked_resolutions (workflow_machines.rs:1660) and its doc comment on line 1659 ("Applies local action preresolutions peeked from history...") use the term "local action" instead of "local activity". The entire codebase consistently uses "local activity" — LocalActivityMachine, LocalActivityData, local_activity_data, LocalActivityExecutionResult, etc. The same incorrect term also appears in the comment at line 814 referencing this function name. This should be apply_local_activity_peeked_resolutions with the comment updated to say "local activity".

Misplaced FSM comment

In local_activity_state_machine.rs, lines 73-75 contain a comment:

// Because there could be non-heartbeat WFTs (ex: signals being received) between scheduling // the LA and the marker being recorded, peekahead might not always resolve the LA *before* // scheduling it. This transition accounts for that.

This comment was originally placed above the HandleKnownResult(ResolveDat) transition in the old code, which it correctly described — that transition handles the case where peekahead resolves the LA after it has already been scheduled. After the refactor, HandleKnownResult was moved to line 68, but the comment was left at lines 73-75 where it now precedes the NoWaitCancel transitions on lines 76-79, which deal with cancellation semantics, not peekahead resolution timing.

Step-by-step proof

Looking at the FSM definition starting at line 65:

Line 68: WaitingResolveFromMarkerLookAhead --(HandleKnownResult(ResolveDat), on_handle_result) --> ResolvedFromMarkerLookAheadWaitingMarkerEvent; — this is the transition the comment describes

Lines 73-75: The peekahead comment

Lines 76-79: WaitingResolveFromMarkerLookAhead --(NoWaitCancel(...)) and ResolvedFromMarkerLookAheadWaitingMarkerEvent --(NoWaitCancel(...)) — these are cancellation transitions, unrelated to peekahead

The comment says "This transition accounts for that" but "that" (peekahead not resolving before scheduling) is accounted for by HandleKnownResult on line 68, not NoWaitCancel on lines 76-79.

Impact

Both issues are cosmetic with zero runtime impact. The naming inconsistency makes the function harder to find via grep/search for "local activity", and the misplaced comment could confuse future contributors reading the FSM definition.

Fix

Rename apply_local_action_peeked_resolutions to apply_local_activity_peeked_resolutions and update the doc comment and all references.

Move the comment block from lines 73-75 to just above line 68 (before the HandleKnownResult transition).

chris-olszewski commented Dec 10, 2025

View reviewed changes

chris-olszewski marked this pull request as ready for review December 10, 2025 15:22

chris-olszewski requested a review from a team as a code owner December 10, 2025 15:22

Sushisource reviewed Dec 10, 2025

View reviewed changes

crates/sdk-core/src/worker/workflow/machines/workflow_machines.rs Show resolved Hide resolved

mjameswh reviewed Dec 10, 2025

View reviewed changes

mjameswh mentioned this pull request Dec 17, 2025

[DRAFT] Fix LA completion order on replay #1035

Closed

chris-olszewski requested a review from mjameswh December 29, 2025 21:30

chris-olszewski added 11 commits March 12, 2026 11:43

add failing test

d7baa51

initial fix

ac4ddeb

avoid saving unneeded peeked resolutions

a0f9ff4

remove dead code, clean up comments

2040383

remove MachineAssociatedCommand

244deb3

clean up comments

e659d5a

more comments, fix formatting

4cc9796

remove special case check

18e19d7

make error nde

b85167b

additional tests

6e3ac66

fix test workflows to have correct ordering

2afa4af

chris-olszewski force-pushed the olszewski/la_completion_order branch from e0690a4 to 2afa4af Compare March 12, 2026 16:37

claude bot reviewed Mar 20, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: resolve local activities in order of history events#1075

fix: resolve local activities in order of history events#1075
chris-olszewski wants to merge 11 commits intomasterfrom
olszewski/la_completion_order

chris-olszewski commented Dec 9, 2025 •

edited

Loading

Uh oh!

chris-olszewski Dec 10, 2025

Uh oh!

Sushisource left a comment

Uh oh!

Uh oh!

mjameswh Dec 10, 2025

Uh oh!

chris-olszewski Dec 17, 2025

Uh oh!

mjameswh commented Dec 11, 2025 •

edited

Loading

Uh oh!

Sushisource commented Mar 16, 2026 •

edited

Loading

Uh oh!

chris-olszewski commented Mar 16, 2026

Uh oh!

Sushisource commented Mar 20, 2026

Uh oh!

Sushisource commented Mar 20, 2026

Uh oh!

claude bot Mar 20, 2026

Uh oh!

claude bot Mar 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		/// 8: EVENT_TYPE_WORKFLOW_TASK_SCHEDULED
		/// 9: EVENT_TYPE_WORKFLOW_TASK_STARTED

Conversation

chris-olszewski commented Dec 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What was changed

Peekahead Preresolutions

LA machine changes

Removal of FakeMarker

Why?

Checklist

Uh oh!

chris-olszewski Dec 10, 2025

Choose a reason for hiding this comment

Uh oh!

Sushisource left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

mjameswh Dec 10, 2025

Choose a reason for hiding this comment

Uh oh!

chris-olszewski Dec 17, 2025

Choose a reason for hiding this comment

Uh oh!

mjameswh commented Dec 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Sushisource commented Mar 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

chris-olszewski commented Mar 16, 2026

Uh oh!

Sushisource commented Mar 20, 2026

Uh oh!

Sushisource commented Mar 20, 2026

Uh oh!

claude bot Mar 20, 2026

Choose a reason for hiding this comment

Bug Analysis

Why the assumption fails

Step-by-step proof

Addressing the refutation

Impact

Fix

Uh oh!

claude bot Mar 20, 2026

Choose a reason for hiding this comment

Naming inconsistency: "local action" vs "local activity"

Misplaced FSM comment

Step-by-step proof

Impact

Fix

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

chris-olszewski commented Dec 9, 2025 •

edited

Loading

Removal of `FakeMarker`

mjameswh commented Dec 11, 2025 •

edited

Loading

Sushisource commented Mar 16, 2026 •

edited

Loading