Skip to content

fix: WorkflowTaskSets size bloat for large workflows#16075

Open
ntny wants to merge 5 commits into
argoproj:mainfrom
ntny:fix-tasksets-size-bloat
Open

fix: WorkflowTaskSets size bloat for large workflows#16075
ntny wants to merge 5 commits into
argoproj:mainfrom
ntny:fix-tasksets-size-bloat

Conversation

@ntny

@ntny ntny commented May 10, 2026

Copy link
Copy Markdown

Motivation

removeCompletedTaskSetStatus was executed after workflow dehydration, causing completed task set nodes to be skipped during cleanup and leading to WorkflowTaskSets size bloat. This is because getDeleteTaskAndNodePatch relies on in-memory status.nodes, which are cleared during workflow compression.

As a result, large compressed workflows could continuously accumulate completed task set data and eventually fail with errors such as:

etcdserver: request is too large
TaskSet Patch Failed

This change:

  • moves cleanup before dehydration
  • checks that the workflow is not compressed before tracking task set nodes

Fixes #TODO

Verification

Verified the changes on:

  • regular workflows
  • compressed workflows (AWF) that previously reproduced the issue

Confirmed that completed task sets are cleaned up correctly and workflows no longer fail with oversized patch requests.

AI

@ntny ntny changed the title Fix WorkflowTaskSets size bloat for large workflows fix: WorkflowTaskSets size bloat for large workflows May 10, 2026
@ntny ntny force-pushed the fix-tasksets-size-bloat branch 4 times, most recently from 0fa410b to 1bdcf97 Compare May 10, 2026 21:31

@terrytangyuan terrytangyuan left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add some tests to exercise this change?

@ntny

ntny commented May 11, 2026

Copy link
Copy Markdown
Author

Could you add some tests to exercise this change?

Sure, I added a test to verify that compressed workflows now fail explicitly with a clear error instead of silently proceeding with invalid taskset cleanup logic.

I also was considering adding a large stress/e2e test for the executor plugin around the 2MB limit, but I’m not sure it’s a good idea.

@ntny ntny force-pushed the fix-tasksets-size-bloat branch 6 times, most recently from d71d2d5 to 0bd9e8c Compare May 14, 2026 14:22
@ntny

ntny commented May 14, 2026

Copy link
Copy Markdown
Author

@terrytangyuan
I’ve added unit and e2e tests
Could you please review?

  • unit RemoveCompletedTaskSetStatusFailsForCompressedWorkflow just check that compressed workflows now fail explicitly with a clear error instead of silently proceeding with invalid taskset cleanup logic (the operator should decompress the workflow before cleanup calls)
  • E2E TestCompressedTemplateExecutor_WorkflowTaskSetIsProperlyCleaned verifies that for an AWF that has reached the compressed size limit and has been compressed, tasksets were properly cleaned up

@terrytangyuan terrytangyuan left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Joibel Can you take another look here?

@Joibel Joibel self-assigned this May 19, 2026
Comment thread test/e2e/fixtures/then.go
}

func (t *Then) ExpectWorkflowCompressed() *Then {
ctx := logging.TestContext(t.t.Context())

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow ExpectWorkflow, ExpectWorkflowName and call expectWorkflow(name, func) instead of hand rolling this.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I considered using ExpectWorkflow as well, but it dehydrates the workflow here before the function check, so it doesn't work for this particular check.

Comment thread test/e2e/manifests/plugins/massive-executor-plugin-configmap.yaml Outdated
Comment thread util/file/watch.go
return err
}
if last == nil || fi.ModTime() != last.ModTime() || fi.Size() != last.Size() {
if last == nil || !fi.ModTime().Equal(last.ModTime()) || fi.Size() != last.Size() {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this relevant? It looks like a good change, but perhaps should be separate - is there something here that tickled this problem?

Having said that, this is better so happy to retain the change.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not directly related to the fix. I noticed it while working on this area and changed it as a small cleanup.

I thought it might not be worth opening a separate PR and running the full e2e pipeline just for this small change. That said, if you prefer to keep the PR strictly focused, I can remove it.

Comment thread workflow/controller/taskset.go Outdated
@ntny ntny force-pushed the fix-tasksets-size-bloat branch from 35fd092 to 0f501b7 Compare June 2, 2026 18:44
@ntny ntny requested a review from a team as a code owner June 2, 2026 18:44
@ntny ntny force-pushed the fix-tasksets-size-bloat branch from 0f501b7 to 27b0f6e Compare June 2, 2026 18:46
arpechenin added 4 commits June 2, 2026 22:13
removeCompletedTaskSetStatus was executed after workflow dehydration, causing completed task set nodes to be skipped during cleanup.

This change:

This change:
- moves cleanup before dehydration

Signed-off-by: arpechenin <arpechenin@avito.ru>
- add explicit failure test for compressed workflows in removeCompletedTaskSetStatus
- add e2e coverage for WorkflowTaskSet cleanup in completed compressed workflows

Signed-off-by: arpechenin <arpechenin@avito.ru>
- add explicit failure test for compressed workflows in removeCompletedTaskSetStatus
- add e2e coverage for WorkflowTaskSet cleanup in completed compressed workflows

Signed-off-by: arpechenin <arpechenin@avito.ru>
- add explicit failure test for compressed workflows in removeCompletedTaskSetStatus
- add e2e coverage for WorkflowTaskSet cleanup in completed compressed workflows

Signed-off-by: arpechenin <arpechenin@avito.ru>
@ntny ntny force-pushed the fix-tasksets-size-bloat branch from 27b0f6e to 682545a Compare June 2, 2026 19:13
Signed-off-by: arpechenin <arpechenin@avito.ru>
@ntny ntny force-pushed the fix-tasksets-size-bloat branch from 682545a to eea539d Compare June 2, 2026 19:52
@ntny

ntny commented Jun 9, 2026

Copy link
Copy Markdown
Author

@Joibel could you please take another look? I’ve addressed your review comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants