Skip to content

Fix possibly incorrect is_replaying values when processing empty WFTs#910

Merged
Sushisource merged 8 commits intomasterfrom
fix-update-replay-nde
May 6, 2025
Merged

Fix possibly incorrect is_replaying values when processing empty WFTs#910
Sushisource merged 8 commits intomasterfrom
fix-update-replay-nde

Conversation

@Sushisource
Copy link
Copy Markdown
Member

@Sushisource Sushisource commented Apr 29, 2025

Currently, everything passes except some UTs that specifically test WFT boundaries involving histories with empty WFTs, which this fix changes.

I tried this change with Python, and everything passes. My intuition is that this is probably a compatible change, but we could always flag it to be sure. If anyone can come up with a counterexample please do. It's possible that this change can potentially move jobs from one activation to another (without changing ordering), but I don't think that matters in any way that actually affects anything in terms of how user code is woken up (besides making is_replay be correct in places where it previously wasn't, which would've led to incorrect behavior like in the bug, or NDEs anyway)

Comment on lines +737 to +744
if !saw_command
&& next_next_event.event_type() == EventType::WorkflowTaskScheduled
Copy link
Copy Markdown
Member Author

@Sushisource Sushisource Apr 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the fix.

Before this change, if the history looked something like this (newest event at the bottom):

  • WF started
  • Full workflow task <- this is previous WFT started
  • Full activity sched/start/compl
  • Full workflow task
  • WFT Scheduled
  • WFT Started

I'd end up sending an activation to lang with the activity resolution but also with replaying = false. That's because this function here, which decides the task boundaries, was not properly considering the end of the last WFT through the activity events to the next WFT as the next sequence. It was also including the next (partial) WFT (and since that WFT was the end of history, decided replaying is now false when processing the events).

It did so because WFT "heartbeats" normally don't count as a "real" wft and are skipped over because they shouldn't cause spurious wakeups (ex: when LAs are running).

However, in this case, it really should count as a separate WFT, because the activity resolution happened in that sequence, and should be considered replaying, and then we should move on to the new (partial) WFT and set replaying to false at that point.

assert_eq!(seq.len(), 6);
let seq = next_check_peek(&mut update, 6);
assert_eq!(seq.len(), 13);
assert_eq!(seq.len(), 4);
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the test that is updated to work with the new boundaries

@Sushisource Sushisource force-pushed the fix-update-replay-nde branch 3 times, most recently from bb31a3c to fc0daae Compare April 29, 2025 22:34
@Sushisource Sushisource marked this pull request as ready for review May 5, 2025 21:29
@Sushisource Sushisource requested a review from a team as a code owner May 5, 2025 21:29
@Sushisource Sushisource force-pushed the fix-update-replay-nde branch 2 times, most recently from 86b8a84 to db9ecc1 Compare May 5, 2025 21:45
Copy link
Copy Markdown
Contributor

@cretz cretz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I can't think of an obvious way to break existing workflows that may inadvertently rely on the existing task-end-boundary expectation. We may want to consider an environment variable to be able to flip back for a release or two just in case.

@Sushisource Sushisource force-pushed the fix-update-replay-nde branch from d7d371d to deec018 Compare May 6, 2025 21:02
@Sushisource
Copy link
Copy Markdown
Member Author

Added emergency env var

@Sushisource Sushisource enabled auto-merge (squash) May 6, 2025 21:02
@Sushisource Sushisource merged commit 9af3cb5 into master May 6, 2025
17 checks passed
@Sushisource Sushisource deleted the fix-update-replay-nde branch May 6, 2025 21:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants