Prevent simultaneous task failures multiple archive calls #1008
base: develop
55553f1 to aae765e
I adjusted the solution based on discussion. Since the behavior of a failing task failing and archiving the entire workflow may change, I added a more general check to |
The workflow that caused the error no longer causes it. However, the workflow above kept running and its final status never made it into the archive: a task failed on one branch, the other branch continued to run, and the workflow was archived too early. I've added the example. This workflow should be archived and removed from the ~/.beeflow/workflows directory. Instead, it was archived early and the archived status is incorrect. If you do a beeflow query on it, it looks fine because the query is accessing the ~/.beeflow/workflows directory. |
To solve that problem we need to think about how we want this to work at the moment. There's a couple things to consider:
For the first, it seems like the answer is no. Perhaps we may want to add some way of specifying this in the future, where failure of some very important task kills the workflow entirely. For the second, we could use the "Archived/Failed" state if any task fails (setting a flag) or just not use this state. I'm not sure which is clearer to a user. I think I personally lean towards not using "Archived/Failed". |
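A minimal sketch of the two options discussed above, assuming the workflow is archived only after every task has reached a final state. The task state names (`COMPLETED`, `FAILED`, `RUNNING`), the function name, and the `use_failed_state` flag are hypothetical and are not taken from the BEE code base; only the "Archived" and "Archived/Failed" workflow states come from this discussion.

```python
# Hypothetical task states that count as final; BEE's real names may differ.
TERMINAL_TASK_STATES = {"COMPLETED", "FAILED"}


def final_workflow_state(task_states, use_failed_state=True):
    """Return the state to archive with, or None if the workflow is not done.

    A failed task does not end the workflow: archiving waits until every
    task, on every branch, has reached a terminal state.
    """
    states = list(task_states)
    if any(state not in TERMINAL_TASK_STATES for state in states):
        return None  # Another branch is still running; do not archive yet.
    if use_failed_state and "FAILED" in states:
        return "Archived/Failed"
    return "Archived"
```

With `use_failed_state=False`, this models the "just not use this state" option: a workflow with a failed task is still archived as "Archived" once all branches finish.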
|
I noticed that if there is a build fail for a task, it also fails/archives the workflow. I think this case would have the same issues as the topic of this PR. It seems like this should be handled the same as any task failure; I'm not sure why it is being handled differently at the moment. Is this correct? |
I’ll have to look at this one. The problem is: I think it’s difficult to tell when a build fails. Let me see if I can simulate one.
From: Aaron R Hall ***@***.***>
Reply-To: lanl/BEE ***@***.***>
Date: Wednesday, February 19, 2025 at 9:19 AM
To: lanl/BEE ***@***.***>
Cc: "Grubel, Patricia A" ***@***.***>, Review requested ***@***.***>
Subject: [EXTERNAL] Re: [lanl/BEE] Prevent simultaneous task failures multiple archive calls (PR #1008)
|
Apparently, Build_FAIL is not being handled correctly at all. ffmpeg depends on clamr, so the entire workflow should fail. |
I'm thinking a container failure is fairly catastrophic to a workflow. Maybe we should stop the workflow. As we work on the builder code, we can make other decisions. |
9cdf965 to df4cd0a
Currently no changes to build failure
df4cd0a to 801fc56
I think this is ready for review again. With the latest commit, a failed task will not fail the workflow ( Should make issues about the following that came up during review, but are outside the scope of this PR:
|
If multiple tasks fail simultaneously, they will all attempt to call `beeflow.wf_manager.resources.wf_update.archive_workflow`. This will succeed for the first task that makes the call, but then lead to an endless loop of calls for the other tasks that error, since the workflow folder will no longer exist in `.beeflow/workflows`.

This solves that problem by checking that the workflow state does not begin with `Archived` when attempting to archive a failed workflow. `db.workflows.get_workflow_state` seems to be the only way to do this once a workflow has been archived.

I put this check where a failed task would call `archive_fail_workflow`, but potentially this could be in `archive_workflow` directly.

Needs a test before removing `WIP`.
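The check described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `FakeWorkflowDB`, `archive_failed_once`, the `archive` callback, and the `Running` state are hypothetical stand-ins, and the lock is an extra assumption added so the check and the archive call cannot interleave between threads. Only the idea of skipping the archive when the state already begins with `Archived` comes from the description.

```python
import threading


class FakeWorkflowDB:
    """Hypothetical in-memory stand-in for the workflow database."""

    def __init__(self):
        self.states = {}
        self.lock = threading.Lock()

    def get_workflow_state(self, wf_id):
        return self.states[wf_id]

    def set_workflow_state(self, wf_id, state):
        self.states[wf_id] = state


def archive_failed_once(db, wf_id, archive):
    """Archive a failed workflow unless it has already been archived.

    Returns True if this call performed the archive, False if it was skipped.
    """
    with db.lock:  # Assumption: serialize the check and the archive call.
        if db.get_workflow_state(wf_id).startswith("Archived"):
            return False  # Another failed task already archived the workflow.
        archive(wf_id)
        db.set_workflow_state(wf_id, "Archived/Failed")
        return True


# Three failed tasks report in; only the first one archives the workflow.
db = FakeWorkflowDB()
db.set_workflow_state("wf1", "Running")
archived = []
results = [archive_failed_once(db, "wf1", archived.append) for _ in range(3)]
# results == [True, False, False]; archived == ["wf1"]
```

Placing the guard inside the archive function itself, as suggested for `archive_workflow`, would protect every caller rather than only the failed-task path.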