Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support updating closed workflows by mutation #7420

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

yycptt
Copy link
Member

@yycptt yycptt commented Mar 5, 2025

What changed?

  • Support update closed workflow by mutation

Why?

  • Today, any updates (e.g. update callbacks after workflows & update branch token. With CHASM, there will be more cases) for closed workflows require persisting the entire mutable snapshot using the SetWorkflowExecution persistence API in order to skip the current record validation.
  • Application logic is complex and have to call different APIs based on workflow state
  • Performance is also bad as we are writing unnecessary data when persisting a snapshot, even if they haven't been mutated.

How did you test it?

  • Unit & persistence integration tests

Potential risks

Documentation

Is hotfix candidate?

  • No.

@yycptt yycptt changed the title POC: Support update closed workflow by mutation POC: Support updating closed workflows by mutation Mar 5, 2025
@yux0
Copy link
Contributor

yux0 commented Mar 7, 2025

LGTM. No concern on this change.

@yycptt yycptt force-pushed the update-skip-current branch from 0ae5403 to 8e018ba Compare March 11, 2025 01:22
@yycptt yycptt changed the title POC: Support updating closed workflows by mutation Support updating closed workflows by mutation Mar 11, 2025
@yycptt yycptt marked this pull request as ready for review March 11, 2025 19:07
@yycptt yycptt requested a review from a team as a code owner March 11, 2025 19:07
@yycptt yycptt force-pushed the update-skip-current branch 3 times, most recently from bc9e82b to 6169664 Compare March 21, 2025 23:04
@yycptt yycptt force-pushed the update-skip-current branch from 6169664 to e2ab6b4 Compare March 21, 2025 23:08
return executionContext.UpdateWorkflowExecutionAsPassive(ctx, r.shardContext)
}

// TODO: remove following code once EnableUpdateClosedWorkflowByMutation config is deprecated.
updateMode := persistence.UpdateWorkflowModeUpdateCurrent
if state, _ := mutableState.GetWorkflowStateStatus(); state == enumsspb.WORKFLOW_EXECUTION_STATE_ZOMBIE {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous logic is kinda wrong/misleading as it making decision based on the in memory workflow state instead of the workflow state in DB. The updateMode though, is for doing current record CAS in db, so we need to pick it based on the state in db (i.e. the workflow state before the current mutable state transaction)

The logic still works because activity state replicator (and same for the hsm state replicator) below don't really change the workflow state, so in memory state == state in DB.

return wfCtx.UpdateWorkflowExecutionAsActive(ctx, e.shardContext)
}

// TODO: remove following code once EnableUpdateClosedWorkflowByMutation config is deprecated.
if ms.GetExecutionState().State == enumsspb.WORKFLOW_EXECUTION_STATE_COMPLETED {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here. The hsm framework doesn't really change workflow state right now.

@yycptt yycptt requested a review from gow April 4, 2025 19:00
Copy link
Contributor

@gow gow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Some nit suggestions and comments below.

@@ -1425,6 +1425,12 @@ HistoryCacheSizeBasedLimit is set to true.`,
timeout timer when execution timeout is specified when starting a workflow.
For backward compatibility, this feature is disabled by default and should only be enabled after server version
containing this flag is deployed to all history service nodes in the cluster.`,
)
EnableUpdateClosedWorkflowByMutation = NewGlobalBoolSetting(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Name this EnableUpdateWorkflowModeSkipCurrent or something similar since that is what it enables. It makes it easier to discover this config and its purpose.

// For example, when updating a closed workflow, it may or may not be the current workflow.
// This is similar to SetWorkflowExecution, but UpdateWorkflowExecution with this mode persists the workflow as a mutation,
// instead of a snapshot.
UpdateWorkflowModeSkipCurrent
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does conflict a lot with UpdateWorkflowModeBypassCurrent.
Can we consider something like UpdateWorkflowModeIgnoreCurrent? It'll be even better if we can get rid of one of these modes all together (if possible)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah makes sense. IgnoreCurrent is a better name.

This updateMode is part of the persistence interface so not trivial to get rid of any existing ones.
Long term if we ever get a chance to redo our persistence interface, I think the interface can explicit ask to the caller to state which columns/rows to update and to what, including changes for current rows. That will make things much more explicit. Right now the current row concept, which is an application concept is almost entirely in persistence layer...

@@ -461,6 +461,12 @@ func (t *visibilityQueueTaskExecutor) cleanupExecutionInfo(
executionInfo.Memo = nil
executionInfo.SearchAttributes = nil
executionInfo.RelocatableAttributesRemoved = true

if t.shardContext.GetConfig().EnableUpdateClosedWorkflowByMutation() {
return weContext.UpdateWorkflowExecutionAsPassive(ctx, t.shardContext)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why UpdateWorkflowExecutionAsPassive here and HSM state replicator. Why not UpdateWorkflowExecutionAsActive?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HSM/Activity state replicator are replication logic for applying changes in the standby cluster of a namespace. And when "AsPassive" is used, the mutable state logic will prevent another replication task from being generated (there are some other differences, but this is the most important one).

For visibility, you can think this the removal of Memo & SearchAttributes is a cluster local operation (move the data from mutable state to ES). and this operation doesn't need to be replicated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants