
[One Workflow] Orphaned scheduled task root cause fix + self-healing#271610

Open
VladimirFilonov wants to merge 3 commits into
elastic:mainfrom
VladimirFilonov:fix/17420-bug-one-workflow-workflow-not-found-issue-in-serverless

Conversation

VladimirFilonov commented May 28, 2026

Summary

closes: https://github.com/elastic/security-team/issues/17420

Root cause fix

Fixes orphaned workflow:scheduled Task Manager tasks that were left behind after workflow deletion and caused "Workflow <id> not found" errors in workflowsExecutionEngine (observed on observability serverless).

Root cause: unscheduleWorkflowTasks used taskManager.fetch to find tasks before calling bulkRemove. Task Manager schedules tasks with refresh: false, so a newly scheduled task could be missing from search results at delete time while still being created or becoming claimable later. The unschedule call then became a no-op, leaving behind a task that later ran against a deleted or tombstoned workflow.
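
To illustrate the race, here is a toy in-memory model, not the Task Manager code itself. FakeTaskStore, toyTaskId, unscheduleViaSearch and unscheduleById are hypothetical names made up for this sketch; the only things taken from this PR are the refresh: false behavior and the deterministic id format.

```typescript
// Toy model: a write lands in `docs` immediately but only becomes visible to
// search after refresh(), mimicking a write made with refresh: false.
class FakeTaskStore {
  private docs: Record<string, true> = {};
  private searchable: string[] = [];

  create(id: string): void {
    this.docs[id] = true;
  }

  refresh(): void {
    this.searchable = Object.keys(this.docs);
  }

  // Search only sees what the last refresh exposed.
  search(id: string): string[] {
    return this.searchable.filter((candidate) => candidate === id);
  }

  // Direct delete by id; returns false when the doc did not exist (the "404" case).
  removeById(id: string): boolean {
    const existed = id in this.docs;
    delete this.docs[id];
    return existed;
  }

  has(id: string): boolean {
    return id in this.docs;
  }
}

const toyTaskId = (workflowId: string): string => `workflow:${workflowId}:scheduled`;

// Old approach: search first, then remove only what the search returned.
function unscheduleViaSearch(store: FakeTaskStore, workflowId: string): number {
  const found = store.search(toyTaskId(workflowId));
  found.forEach((id) => store.removeById(id));
  return found.length;
}

// New approach: remove by deterministic id; a missing doc is not an error.
function unscheduleById(store: FakeTaskStore, workflowId: string): void {
  store.removeById(toyTaskId(workflowId));
}

const toyStore = new FakeTaskStore();
toyStore.create(toyTaskId('wf-1')); // scheduled, but no refresh has happened yet
const removedBySearch = unscheduleViaSearch(toyStore, 'wf-1'); // search sees nothing
const orphanedAfterSearch = toyStore.has(toyTaskId('wf-1')); // the task survived
unscheduleById(toyStore, 'wf-1');
const presentAfterRemoveById = toyStore.has(toyTaskId('wf-1'));
```

In this model the search-based path removes nothing and the task survives, while the id-based path removes it without depending on index visibility.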

Fix:

  • Single-workflow unschedule (syncSchedulerAfterSave, etc.): call removeIfExists with the deterministic id workflow:<workflowId>:scheduled — no fetch.
  • Bulk unschedule (soft/hard delete, disableAllWorkflows): add bulkUnscheduleWorkflowTasks that bulkRemoves by those ids; treat 404 as success; log per-workflow warnings on other failures without failing the batch.
  • Thin unschedule_workflow_tasks wrapper delegates to bulkUnscheduleWorkflowTasks and drops the unused logger parameter from call sites.
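
A minimal sketch of the result handling described in the bulk bullet above, under stated assumptions: RemoveStatus is an assumed per-id result shape (the real bulkRemove response may differ), and summarizeBulkRemove and UnscheduleSummary are hypothetical names for illustration, not code from this PR. Only the id format and the "404 is success, other failures are warnings" rule come from the description.

```typescript
// Assumed per-id result shape; Task Manager's real bulkRemove response may differ.
interface RemoveStatus {
  id: string;
  success: boolean;
  error?: { statusCode: number; message: string };
}

// Hypothetical summary type used only by this sketch.
interface UnscheduleSummary {
  removed: string[];
  alreadyGone: string[];
  warnings: string[];
}

// Deterministic task id, as stated in the PR description.
const scheduledTaskId = (workflowId: string): string =>
  `workflow:${workflowId}:scheduled`;

function summarizeBulkRemove(statuses: RemoveStatus[]): UnscheduleSummary {
  const summary: UnscheduleSummary = { removed: [], alreadyGone: [], warnings: [] };
  for (const status of statuses) {
    if (status.success) {
      summary.removed.push(status.id);
    } else if (status.error !== undefined && status.error.statusCode === 404) {
      // Task was never created or is already deleted: treated as success.
      summary.alreadyGone.push(status.id);
    } else {
      // Any other failure becomes a per-workflow warning; the batch keeps going.
      const reason = status.error !== undefined ? status.error.message : 'unknown error';
      summary.warnings.push(`Failed to unschedule ${status.id}: ${reason}`);
    }
  }
  return summary;
}

const exampleSummary = summarizeBulkRemove([
  { id: scheduledTaskId('a'), success: true },
  { id: scheduledTaskId('b'), success: false, error: { statusCode: 404, message: 'Not found' } },
  { id: scheduledTaskId('c'), success: false, error: { statusCode: 503, message: 'Unavailable' } },
]);
```

The caller would log each entry in warnings and still report the batch as complete, so one failing workflow does not block the rest.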

Self-healing for already affected envs

  • add self-healing behavior for workflow:scheduled runs when the workflow document is missing
  • return shouldDeleteTask: true from the scheduled task runner so Task Manager deletes orphaned recurring tasks
  • add a focused unit test to verify the missing-workflow path returns delete intent and skips execution
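
The self-healing path can be sketched roughly as follows. This is a simplified, synchronous illustration rather than the actual runner: ToyWorkflow, ToyRunResult, runScheduledWorkflow and the injected callbacks are hypothetical, and the real runner is presumably async. The parts taken from this PR are the missing-workflow check and the shouldDeleteTask: true return value.

```typescript
// Hypothetical, simplified types for this sketch only.
interface ToyWorkflow {
  id: string;
}

interface ToyRunResult {
  state: Record<string, unknown>;
  shouldDeleteTask?: boolean;
}

function runScheduledWorkflow(
  workflowId: string,
  getWorkflow: (id: string) => ToyWorkflow | null,
  execute: (workflow: ToyWorkflow) => void
): ToyRunResult {
  const workflow = getWorkflow(workflowId);
  if (workflow === null) {
    // Orphaned task: skip execution and ask Task Manager to delete the task.
    return { state: {}, shouldDeleteTask: true };
  }
  execute(workflow);
  return { state: {} };
}

// Only 'wf-1' exists in this toy lookup; every other id counts as missing.
const toyLookup = (id: string): ToyWorkflow | null => (id === 'wf-1' ? { id } : null);

let executions = 0;
const countExecution = (): void => {
  executions += 1;
};

const orphanResult = runScheduledWorkflow('wf-deleted', toyLookup, countExecution);
const executionsAfterOrphan = executions;
const liveResult = runScheduledWorkflow('wf-1', toyLookup, countExecution);
const executionsAfterLive = executions;
```

The orphaned run returns delete intent without invoking the execute callback, which is the behavior the focused unit test above is meant to verify.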

VladimirFilonov requested a review from a team as a code owner May 28, 2026 07:17
botelastic (bot) added the Team:One Workflow label May 28, 2026
VladimirFilonov changed the title from "[One Workflow] Orphaned scheduled task self-healing" to "[One Workflow] Orphaned scheduled task root cause fix + self-healing" May 28, 2026
@kibanamachine

💛 Build succeeded, but was flaky

Failed CI Steps

Metrics [docs]

✅ unchanged

History

h88 left a comment

LGTM
One question: self-healing deletes when getWorkflow returns null, which includes soft-deleted tombstones (not just hard-deleted). Is that intentional?

@VladimirFilonov

> LGTM. One question: self-healing deletes when getWorkflow returns null, which includes soft-deleted tombstones (not just hard-deleted). Is that intentional?

Yes. On any edit (or restore, if that is possible), the task will be rescheduled.

VladimirFilonov enabled auto-merge (squash) June 1, 2026 08:02

Labels

  • 9.5 candidate
  • backport:version (Backport to applied version labels)
  • release_note:fix
  • Team:One Workflow (Team label for One Workflow (Workflow automation))
  • v9.4.2
