Skip to content

Complete action listener on failed synchronous workflow provisioning #1098

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Apr 7, 2025

Conversation

dbwiddis
Copy link
Member

@dbwiddis dbwiddis commented Apr 7, 2025

Description

During synchronous provisioning (and reprovisioning) the actionListener was never completed on an exception in the workflow steps (which throw an exception on actionGet()).

  • For asynchronous provisioning this was expected, as the listener was already completed. In the sync case we ended up just failing on the timeout in this case.
  • Also had to handle the case of synchronous provisioning timeout

Related Issues

Resolves #1097

Check List

  • New functionality includes testing.
  • Commits are signed per the DCO using --signoff.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Copy link

codecov bot commented Apr 7, 2025

Codecov Report

Attention: Patch coverage is 34.37500% with 21 lines in your changes missing coverage. Please review.

Project coverage is 77.29%. Comparing base (72ce8db) to head (f484d3d).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
...rk/transport/ProvisionWorkflowTransportAction.java 27.27% 7 Missing and 1 partial ⚠️
...rch/flowframework/util/WorkflowTimeoutUtility.java 30.00% 6 Missing and 1 partial ⚠️
.../transport/ReprovisionWorkflowTransportAction.java 45.45% 6 Missing ⚠️

❌ Your patch status has failed because the patch coverage (34.37%) is below the target coverage (70.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1098      +/-   ##
============================================
+ Coverage     76.25%   77.29%   +1.03%     
- Complexity     1079     1093      +14     
============================================
  Files           101      101              
  Lines          5285     5302      +17     
  Branches        505      509       +4     
============================================
+ Hits           4030     4098      +68     
+ Misses         1006      954      -52     
- Partials        249      250       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@junweid62
Copy link
Contributor

LGTM overall!

One question similar to @amitgalitz – are we sure we want to allow releasing while provisioning might still be ongoing? Since our timeout logic doesn't cancel provisioning, it could still be running in the background. Would that cause issues if we release early?

@dbwiddis dbwiddis force-pushed the fix-provision-listener branch from 68ffb33 to eab73e4 Compare April 7, 2025 17:51
@dbwiddis
Copy link
Member Author

dbwiddis commented Apr 7, 2025

Tenant-aware listener: releases on failure, we need to release (exactly once) on success.
Possible code paths:

  • We wrap the original listener in this, so the listener throughout the code is this one.

    • If we fail on lack of permission, listener is failed. Done. Otherwise getWorkflow().
    • If we fail to get the workflow, listener is failed. Done. Otherwise onGetWorkflowResponse().
    • If we fail parsing etc. listener is failed. Done. Otherwise run the function executeProvisionRequest().
    • Get the template, update state, etc. If we fail in any, listener is failed. Done. Otherwise:
      • If wait for completion timeout is MINUS_ONE (line 235) async provisioning. Also (line 252) Complete the listener successfully and return. (This does not release the provisioning lock which must be released on successful or failed provisioning, one of the fixes in this PR.)
      • Else sync provisioning
  • Asynchronous provisioning (existing behavior). Occurs when wait_for_completion_timeout is -1.

    • call executeWorkflow with isSyncExecution false
  • Synchronous provisioning. TImeout >= 0.

    • call executeWorkflow with a new action listener.
      • On success calls timeout utility response which completes the original listener
      • on failure calls timeout utility failure which completes the original listener
  • executeWorkflow

    • Provisions asynchronously (but tracks return-to-user with isSyncExecution value
      • If successful completion of the workflow (all workflow steps complete) we update the state
        • if isSyncExecution we fetch the workflow state to return
          • we either release provision lock and complete onResponse on update, which goes to timeout utility handleResponse
          • or fail on update which goes to handleFailure which completes original listener and releases the lock
        • if not isSyncExecution we just release the lock (we have already completed action listener)
      • If an exception occurs in a workflow, actionGet() throws exception at line 405 entering the catch block at line 448
        • we update the failed state.
          • on success state update
            • with isSyncExecution we fail which goes to timeout utility handle failure and completes the original listener (this is the main bug fix of this PR)
            • otherwise we release the lock (original listener was already completed)
          • on failed state update
            • with isSyncExecution we fail which goes to timeout utility handle failure and completes the original listener (this is the main bug fix of this PR)
            • otherwise we release the lock (original listener was already completed)
  • Other possible outcome: we are sync provisioning but timeout expires. This is similar to async provisioning.

    • we will complete the original action listener on success but continue "async" provisioning in the background under the isSyncExecution code paths above. WE do not want to release the provision lock.
    • on failed return value we still do not want to release the provision lock but we DO want to fail the original action listener.
    • Code path here starts with scheduleTimeoutHandler() on line 363
    • Eventually goes to fetchWorkflowState after timeout
    • We do NOT want to fail the original listener here because this would release the provision lock
      • so needed change, exception path on workflow timeout utility line 210 should NOT complete the "tenant aware lock" listener but rather complete on response but with a failed state.

@dbwiddis
Copy link
Member Author

dbwiddis commented Apr 7, 2025

One question similar to @amitgalitz – are we sure we want to allow releasing while provisioning might still be ongoing? Since our timeout logic doesn't cancel provisioning, it could still be running in the background. Would that cause issues if we release early?

Yes, both @junweid62 and @amitgalitz called out correctly that for timeout we want to continue with the provisioning as we normally would for the "async" case. I've gone through all the code paths and updated the PR.

  • all of the releasing is handled either by the onFailure listener (which can only release once) or manually in the (Re)Provision transport actions on the success cases. No releasing occurs in the timeout utility.
  • the timeout utility keeps track of whether the listener onResponse was sent, so in the success case it only sends it once, but the provision release still always occurs even if the provision success onResponse occurs after timeout; it releases the lock and then the onResponse ends up being a no-op.

@dbwiddis dbwiddis merged commit 0f04d61 into opensearch-project:main Apr 7, 2025
21 checks passed
@opensearch-trigger-bot
Copy link
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/flow-framework/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/flow-framework/backport-2.x
# Create a new branch
git switch --create backport/backport-1098-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 0f04d61ef9994e3a5aa2d6c48b40ab9e1388771e
# Push it to GitHub
git push --set-upstream origin backport/backport-1098-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/flow-framework/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-1098-to-2.x.

@opensearch-trigger-bot opensearch-trigger-bot bot added the backport-failed Applied to PRs when the automatic backport fails label Apr 7, 2025
@opensearch-trigger-bot
Copy link
Contributor

The backport to 2.19 failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/flow-framework/backport-2.19 2.19
# Navigate to the new working tree
pushd ../.worktrees/flow-framework/backport-2.19
# Create a new branch
git switch --create backport/backport-1098-to-2.19
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 0f04d61ef9994e3a5aa2d6c48b40ab9e1388771e
# Push it to GitHub
git push --set-upstream origin backport/backport-1098-to-2.19
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/flow-framework/backport-2.19

Then, create a pull request where the base branch is 2.19 and the compare/head branch is backport/backport-1098-to-2.19.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x backport PRs to 2.x branch backport 2.19 backport-failed Applied to PRs when the automatic backport fails
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Action Listener is not completed on failed synchronous provisioning
4 participants