Add workflow concurrency control#6475
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #6475 +/- ##
==========================================
+ Coverage 58.67% 58.70% +0.02%
==========================================
Files 938 938
Lines 71466 71690 +224
==========================================
+ Hits 41933 42085 +152
- Misses 26346 26413 +67
- Partials 3187 3192 +5
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Can you run |
|
@thomasjhuang thanks for your contribution. Also please check out the failing DCO check. |
|
Taking a look |
|
@troychiu can you please take a first look, you looked into this recently right? |
|
I don't really have context but would love to take a look. |
|
@troychiu sorry wrong one. I meant the volcano PR. I can take a look at this one |
da55f4d to
532dfc4
Compare
532dfc4 to
1e6ffd4
Compare
|
Okay I've added a few things - docs, unit test, and ran |
4e02f0a to
235c354
Compare
|
@thomasjhuang thanks for this amazing contribution. Sorry about this but could you port the docs update to the new repo? https://github.com/unionai/docs I'll take point on updating this repo's |
EngHabu
left a comment
There was a problem hiding this comment.
Looks great! (and simple 🥳 🥳 🥳 )... left a couple of comments
| WHERE | ||
| executions.project = '${lpProject}' | ||
| AND executions.domain = '${lpDomain}' | ||
| AND launch_plans.name = '${lpName}' |
There was a problem hiding this comment.
Should it also filter on launch_plans.project/domain? If you confirmed it uses the right index, then nvm
There was a problem hiding this comment.
Adding launch_plans.project/domain probably isn't needed bc the join already ensures we only see launch plans from the same project/domain as the filtered executions. I had done a MySQL explain and it does use the newly created index correctly.
There was a problem hiding this comment.
I'm looking at porting a version of this to our fork and one thing I'm looking at is how this affects Flyte scheduler. Flyte scheduler is hardcoded to try and retry the execution creation up to 30 times if it fails, unless the gRPC status code in the error is codes.AlreadyExists.
I think the scheduler code will need to be updated to either give up on ResourceExhausted or we'll need to a way to articulate this case through richer error details.
Got it, I've opened a PR here - unionai/unionai-docs#417 |
Makes sense - this change will increase |
|
Bito Automatic Review Skipped - Files Excluded |
|
We have an end to end test currently validating this functionality and it seems like there might be a correctness error in the logic that looks for previous executions. func TestFlyte_WorkflowConcurrencyLimits(t *testing.T) {
lp := "e2e_singleton_workflow"
client, err := config.Flyte.GetAdminClient(config.TestCtx)
require.NoError(t, err, "getting flyte admin client")
latestLaunchPlan := config.Flyte.FindLatestLaunchPlan(config.TestCtx, t, lp)
t.Logf("Found most recent launch plan with version [%s]", latestLaunchPlan.GetId().GetVersion())
_, err = client.AdminClient().CreateExecution(config.TestCtx, &pbadmin.ExecutionCreateRequest{
Project: config.Flyte.Project,
Domain: config.Flyte.Domain,
Spec: &pbadmin.ExecutionSpec{
LaunchPlan: &pbcore.Identifier{
ResourceType: pbcore.ResourceType_LAUNCH_PLAN,
Project: config.Flyte.Project,
Domain: config.Flyte.Domain,
Name: lp,
Version: latestLaunchPlan.GetId().GetVersion(),
},
},
})
require.NoError(t, err, "creating execution")
// Creating a second execution should fail while the first is non-terminal
_, err = client.AdminClient().CreateExecution(config.TestCtx, &pbadmin.ExecutionCreateRequest{
Project: config.Flyte.Project,
Domain: config.Flyte.Domain,
Spec: &pbadmin.ExecutionSpec{
LaunchPlan: &pbcore.Identifier{
ResourceType: pbcore.ResourceType_LAUNCH_PLAN,
Project: config.Flyte.Project,
Domain: config.Flyte.Domain,
Name: lp,
Version: latestLaunchPlan.GetId().GetVersion(),
},
},
})
require.Error(t, err, "creating execution")
s, ok := status.FromError(err)
require.True(t, ok, "should be a grpc status error")
require.Equal(t, codes.ResourceExhausted, s.Code())
}This is failing in our production environment where there is more load and I'm wondering if the state filtering isn't quite right. |
For this version can we at least treat it as non-retryable? That's what we're doing and it seems to be ok. |
8a83072 to
0f57bdd
Compare
Signed-off-by: thomasjhuang <thomashuang63@gmail.com>
Signed-off-by: thomasjhuang <thomashuang63@gmail.com>
Signed-off-by: thomasjhuang <thomashuang63@gmail.com>
Signed-off-by: thomasjhuang <thomashuang63@gmail.com>
Signed-off-by: thomasjhuang <thomashuang63@gmail.com>
Signed-off-by: thomasjhuang <thomashuang63@gmail.com>
Co-authored-by: Haytham Abuelfutuh <haytham@afutuh.com> Signed-off-by: thomasjhuang <thomashuang63@gmail.com>
Co-authored-by: Haytham Abuelfutuh <haytham@afutuh.com> Signed-off-by: thomasjhuang <thomashuang63@gmail.com>
Signed-off-by: thomasjhuang <thomashuang63@gmail.com>
Signed-off-by: thomasjhuang <thomashuang63@gmail.com>
Signed-off-by: thomasjhuang <thomashuang63@gmail.com>
Signed-off-by: thomasjhuang <thomashuang63@gmail.com>
Signed-off-by: thomasjhuang <thomashuang63@gmail.com>
0f57bdd to
8208637
Compare
|
@EngHabu seemed happy with this. We've been using a variation of it in production for the past couple weeks so I think its safe to land this. |
|
Congrats on merging your first pull request! 🎉 |
Hi @thomasjhuang @Sovietaced . Since I’m currently working on some scheduler related issues, I can go ahead and open an issue for this and submit a PR. |
Tracking issue
This solves #5659 and also closes out #6309. The related flytekit change is here - flyteorg/flytekit#3267. Docs are added in this PR - unionai/unionai-docs#417
Why are the changes needed?
This work is to provide workflow concurrency control at a project level. It is a barebones implementation with the guarantee that at most
xmany workflows will be running, but it cannot guarantee thatxmany workflows are always running. We can control concurrency across all versions, and users specify the controls viaLaunchPlaninstantiation:What changes were proposed in this pull request?
The primary mechanism is fairly straightforward - whenever we attempt to launch an execution, make a db query to check for running executions given the NamedEntityIdentifier triplet (project/domain/wf_name), and if the running executions is above the threshold for concurrency (
max_concurrency) then we immediately fail to create the execution.How was this patch tested?
Unit tests added to execution_manager_test.go, but namely this was internally tested at LinkedIn since we are porting this feature externally. More testing is in progress on local sandbox.
Labels
ConcurrencyPolicyand logic surrounding workflow concurrency management, as well as db migration to add an index onexecutionstable forexecution_phase.Setup process
Screenshots
I was able to confirm locally that we receive the correct error message when we are running multiple workflows concurrently at limits of 1 and 2. Within the same project, running a workflow without limits works.
Check all the applicable boxes
Related PRs
#5659 and #6309, will also add in flytekit PR as reference.
Docs link
Summary by Bito
This pull request introduces a feature for managing workflow concurrency at the project level, allowing users to set limits on concurrent executions via a ConcurrencyPolicy. It includes logic to check running executions, enhances logging, and updates the database schema to support these changes, addressing issues #5659 and #6309.