Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent simultaneous task failures multiple archive calls #1008

Open
wants to merge 4 commits into
base: develop
Choose a base branch
from

Conversation

arhall0
Copy link
Collaborator

@arhall0 arhall0 commented Feb 11, 2025

If multiple tasks fail simultaneously, they will all attempt to call beeflow.wf_manager.resources.wf_update.archive_workflow. This will succeed for the first task that makes the call, but then lead to an endless loop of calls for the other tasks that error since the workflow folder will no longer exist in .beeflow/workflows.

This solves that problem by checking the workflow state does not begin with Archived when attempting to archive a failed workflow. db.workflows.get_workflow_state seems to be the only way to do this once a workflow has been archived.

I put this check where a failed task would call archive_fail_workflow, but potentially this could be in archive_workflow directly.

Needs a test before removing WIP.

@arhall0 arhall0 added the WIP Work in progress label Feb 11, 2025
@arhall0 arhall0 self-assigned this Feb 11, 2025
@arhall0 arhall0 added bug Something isn't working WIP Work in progress and removed WIP Work in progress labels Feb 11, 2025
@arhall0 arhall0 force-pushed the prevent-multi-failed-tasks-archiving branch from 55553f1 to aae765e Compare February 12, 2025 17:31
@arhall0
Copy link
Collaborator Author

arhall0 commented Feb 12, 2025

I adjusted the solution based on discussion. Since the behavior of a failing task failing and archiving the entire workflow may change, I added a more general check to archive_workflow that prevents archiving if the workflow is already in an Archived state and logs a warning about this. I think this is a more robust fix as it should not need to be changed if task failing behavior changes.

@arhall0 arhall0 removed the WIP Work in progress label Feb 12, 2025
@arhall0 arhall0 requested review from rstyd and pagrubel February 12, 2025 21:00
@pagrubel
Copy link
Collaborator

Nice I tried a workflow of cat-grep-tar with a 2nd branch starting with running the cat step without a proper input file and all looks good. One branch fails and the workflow is still archived. Here is the dag. I think I may put the workflow in our data directory. I'll add a readme for it.
2fe1fb

@pagrubel
Copy link
Collaborator

pagrubel commented Feb 15, 2025

So the workflow that caused the error doesn't cause the error anymore, however, the above workflow continued but the final status was not in the archive so because a task failed on one branch the other branch continued to run but the workflow was archived too early. I've added the example. This workflow should be archived and removed from the ~/.beeflow/workflows directory. It was archived early and the status is incorrect. If you do a beeflow query on it, it looks fine because it is accessing the ~/.beeflow/workflow directory.

@arhall0
Copy link
Collaborator Author

arhall0 commented Feb 18, 2025

To solve that problem we need to think about how we want this to work at the moment.

There's a couple things to consider:

  • Should a failed task ever archive the workflow
  • What determines if a workflow has the state "Archived/Failed" (if we want to use this at all)

For the first, it seems like the answer is no. Perhaps we may want to add some way of specifying this in the future; where failure of some very important task kills the workflow entirely.

For the second, we could use the "Archived/Failed" state if any task fails (setting a flag) or just not use this state. I'm not sure which is more clear to a user. If you do beeflow query you would see tasks had failed, in either case. But if you did beeflow list then it could be confusing. Using "Archived/Failed" may lead them to believe the entire workflow failed. Using just "Archive" may hide that some tasks failed.

I think I personally lean towards not using "Archived/Failed".

@pagrubel
Copy link
Collaborator

To solve that problem we need to think about how we want this to work at the moment.

There's a couple things to consider:

  • Should a failed task ever archive the workflow
  • What determines if a workflow has the state "Archived/Failed" (if we want to use this at all)

For the first, it seems like the answer is no. Perhaps we may want to add some way of specifying this in the future; where failure of some very important task kills the workflow entirely.

For the second, we could use the "Archived/Failed" state if any task fails (setting a flag) or just not use this state. I'm not sure which is more clear to a user. If you do beeflow query you would see tasks had failed, in either case. But if you did beeflow list then it could be confusing. Using "Archived/Failed" may lead them to believe the entire workflow failed. Using just "Archive" may hide that some tasks failed.

I think I personally lean towards not using "Archived/Failed".
Sounds fine.

  1. We should think about implementing a way to do your first point, I guess this can be an issue or discussion. At that time we may want to re-introduce the Archived/Failed status.
  2. We intend to add more information to query and list.

@arhall0
Copy link
Collaborator Author

arhall0 commented Feb 19, 2025

I noticed that if there is a build fail for a task, it also fails/archives the workflow. I think this case would have the same issues as the topic of this PR.

It seems like this should be handled the same as any task failure; I'm not sure why it is being handled differently at the moment. Is this correct?

@pagrubel
Copy link
Collaborator

pagrubel commented Feb 19, 2025 via email

@pagrubel
Copy link
Collaborator

pagrubel commented Feb 19, 2025

Apparently, Build_FAIL is not being handle correctly at all
beeflow query 0f
Running
clamr--BUILD_FAIL
ffmpeg--WAITING

ffmpeg depends on clamr so the entire workflow should fail.

@pagrubel
Copy link
Collaborator

I'm thinking a container failure is fairly catastrophic to a workflow. Maybe we should stop the workflow. As we work on the builder code we can make other decisions

@arhall0 arhall0 force-pushed the prevent-multi-failed-tasks-archiving branch from 9cdf965 to df4cd0a Compare February 20, 2025 17:43
Currently no changes to build failure
@arhall0 arhall0 force-pushed the prevent-multi-failed-tasks-archiving branch from df4cd0a to 801fc56 Compare February 20, 2025 19:00
@arhall0
Copy link
Collaborator Author

arhall0 commented Feb 21, 2025

I think this is ready for review again. With the latest commit, a failed task will not fail the workflow (BUILD_FAIL is not being changed by this PR however). I checked the different workflows we've discussed on this PR are working as expected.

Should make issues about the following that came up during review, but are outside the scope of this PR:

  1. End workflow state should be Archived, Archived/Failed, or Archived/Partial-Fail depending on state of final task nodes
  2. If a task ends in BUILD_FAIL, the entire workflow should end; including fixing a bug where BUILD_FAIL doesn't even fail dependent tasks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants