[One Workflow] Orphaned scheduled task root cause fix + self-healing #271610
Open
VladimirFilonov wants to merge 3 commits into
Contributor
💛 Build succeeded, but was flaky
Failed CI Steps
Metrics [docs]
h88
approved these changes
May 31, 2026
Contributor
h88
left a comment
LGTM
One question: self-healing deletes the task when `getWorkflow` returns null, which includes soft-deleted tombstones (not just hard-deleted workflows). Is that intentional?
…d-issue-in-serverless
Contributor
Author
Yes. On any edit (or a restore, if that is possible), the task will be rescheduled.
Summary
closes: https://github.com/elastic/security-team/issues/17420
Root cause fix
Fix orphaned `workflow:scheduled` Task Manager tasks left behind after workflow deletion, which caused `Workflow <id> not found` errors in `workflowsExecutionEngine` (observed on observability serverless).
Root cause:
`unscheduleWorkflowTasks` used `taskManager.fetch` to find tasks before calling `bulkRemove`. Task Manager schedules with `refresh: false`, so a task could be missing from search at delete time while still being created or claimable later. Unschedule then became a no-op, leaving a task that ran against a deleted or tombstoned workflow.
Fix:
- `syncSchedulerAfterSave`, etc.: call `removeIfExists` with the deterministic id `workflow:<workflowId>:scheduled` (no fetch).
- `disableAllWorkflows`: add `bulkUnscheduleWorkflowTasks`, which calls `bulkRemove` with those ids, treats 404 as success, and logs per-workflow warnings on other failures without failing the batch.
- The `unschedule_workflow_tasks` wrapper delegates to `bulkUnscheduleWorkflowTasks` and drops the unused `logger` parameter from call sites.
Self-healing for already affected envs
- When a `workflow:scheduled` run finds the workflow document missing, the scheduled task runner returns `shouldDeleteTask: true` so Task Manager deletes the orphaned recurring task.
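For readers skimming the description, here is a minimal sketch of the two ideas: removal by deterministic id, and the self-healing runner. It is not the code in this PR. `TaskRemover`, `RemoveStatus`, `scheduledTaskId`, `bulkUnschedule`, and `runScheduledTask` are hypothetical names, and the response shape is a simplified stand-in rather than the real Task Manager types. Only the id format `workflow:<workflowId>:scheduled`, the 404-as-success rule, and the `shouldDeleteTask: true` result come from the description.

```typescript
// Hypothetical, simplified stand-ins; NOT the real Kibana Task Manager types.
interface RemoveStatus {
  id: string;
  success: boolean;
  error?: { statusCode: number; message: string };
}

interface TaskRemover {
  bulkRemove(ids: string[]): Promise<{ statuses: RemoveStatus[] }>;
}

// Deterministic task id from the description: workflow:<workflowId>:scheduled
function scheduledTaskId(workflowId: string): string {
  return `workflow:${workflowId}:scheduled`;
}

// Remove by id without searching first, so a task that is not yet visible
// to search (scheduled with refresh: false) is still removed.
async function bulkUnschedule(
  remover: TaskRemover,
  workflowIds: string[],
  warn: (msg: string) => void
): Promise<string[]> {
  if (workflowIds.length === 0) return [];
  const { statuses } = await remover.bulkRemove(workflowIds.map(scheduledTaskId));
  const failed: string[] = [];
  for (const status of statuses) {
    // A 404 means the task is already gone, which is the desired end state.
    if (status.success || status.error?.statusCode === 404) continue;
    // Any other failure is logged per task; the batch itself does not throw.
    warn(`Failed to unschedule ${status.id}: ${status.error?.message ?? "unknown error"}`);
    failed.push(status.id);
  }
  return failed;
}

// Self-healing: when the workflow document is missing, skip execution and
// ask the task framework to delete the recurring task.
async function runScheduledTask(
  workflowId: string,
  getWorkflow: (id: string) => Promise<object | null>,
  execute: (workflow: object) => Promise<void>
): Promise<{ shouldDeleteTask?: boolean }> {
  const workflow = await getWorkflow(workflowId);
  if (workflow === null) {
    return { shouldDeleteTask: true };
  }
  await execute(workflow);
  return {};
}
```

In this sketch `bulkUnschedule` returns the ids that failed for reasons other than 404, so a caller can decide whether to retry. That return value is an illustrative choice, not something the description specifies.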