Add workflow concurrency control by thomasjhuang · Pull Request #6475 · flyteorg/flyte

thomasjhuang · 2025-05-29T20:43:06Z

Tracking issue

This solves #5659 and also closes out #6309. The related flytekit change is here - flyteorg/flytekit#3267. Docs are added in this PR - unionai/unionai-docs#417

Why are the changes needed?

This work provides workflow concurrency control at a project level. It is a barebones implementation: it guarantees that at most x workflows are running at once, but it cannot guarantee that x workflows are always running. Concurrency is controlled across all versions, and users specify the controls when instantiating a LaunchPlan:

concurrency_limited_lp = LaunchPlan.get_or_create(
    name="my_concurrent_lp",
    workflow=my_workflow,
    concurrency=ConcurrencyPolicy(
        max_concurrency=3,
        behavior=ConcurrencyLimitBehavior.SKIP,
    ),
)

What changes were proposed in this pull request?

The primary mechanism is fairly straightforward: whenever we attempt to launch an execution, we make a db query to count the running executions for the NamedEntityIdentifier triplet (project/domain/wf_name). If the number of running executions is above the concurrency threshold (max_concurrency), we immediately fail to create the execution.
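The check described above can be sketched in a few lines of Python. This is a minimal illustration, not the flyteadmin implementation: the `Execution` record, `count_active`, `check_concurrency`, and `ConcurrencyLimitReached` are hypothetical names, the in-memory list stands in for the db query, and the set of terminal phases is an assumption. The sketch rejects when the active count has reached `max_concurrency` (using `>=`), which is what the "at most x" guarantee requires.

```python
from dataclasses import dataclass

# Phases treated as finished. Assumed for this sketch, not taken from the PR diff.
TERMINAL_PHASES = {"SUCCEEDED", "FAILED", "ABORTED", "TIMED_OUT"}


class ConcurrencyLimitReached(Exception):
    """Hypothetical stand-in for the ResourceExhausted error returned to the caller."""


@dataclass(frozen=True)
class Execution:
    project: str
    domain: str
    launch_plan_name: str
    phase: str


def count_active(executions, project, domain, launch_plan_name):
    """Stand-in for the db query: count non-terminal executions for one triplet."""
    return sum(
        1
        for e in executions
        if (e.project, e.domain, e.launch_plan_name) == (project, domain, launch_plan_name)
        and e.phase not in TERMINAL_PHASES
    )


def check_concurrency(executions, project, domain, launch_plan_name, max_concurrency):
    """Raise if one more execution would push the active count past max_concurrency."""
    active = count_active(executions, project, domain, launch_plan_name)
    if active >= max_concurrency:
        raise ConcurrencyLimitReached(
            f"Concurrency limit ({max_concurrency}) reached for launch plan "
            f"{launch_plan_name}. Skipping execution."
        )
```

Because the count and the subsequent insert are separate steps, a sketch like this says nothing about two requests racing each other; it only shows the count-then-compare shape of the check.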

How was this patch tested?

Unit tests were added to execution_manager_test.go, but this was mainly tested internally at LinkedIn, since we are porting this feature externally. More testing is in progress on a local sandbox.

Labels

added: ConcurrencyPolicy and logic surrounding workflow concurrency management, as well as db migration to add an index on executions table for execution_phase.

Setup process

Screenshots

I was able to confirm locally that we receive the correct error message when we are running multiple workflows concurrently at limits of 1 and 2. Within the same project, running a workflow without limits works.

Error: rpc error: code = ResourceExhausted desc = Concurrency limit (1) reached for launch plan concurrency_limit_1. Skipping execution.
{"json":{},"level":"error","msg":"rpc error: code = ResourceExhausted desc = Concurrency limit (1) reached for launch plan concurrency_limit_1. Skipping execution.","ts":"2025-07-07T23:12:38-07:00"}

Check all the applicable boxes

I updated the documentation accordingly.
All new and existing tests passed.
All commits are signed-off.

Related PRs

#5659 and #6309, will also add in flytekit PR as reference.

Docs link

Will add in docs

Summary by Bito

This pull request introduces a feature for managing workflow concurrency at the project level, allowing users to set limits on concurrent executions via a ConcurrencyPolicy. It includes logic to check running executions, enhances logging, and updates the database schema to support these changes, addressing issues #5659 and #6309.

codecov · 2025-05-29T21:17:03Z

Codecov Report

❌ Patch coverage is 63.63636% with 28 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.70%. Comparing base (c085cb5) to head (8208637).
⚠️ Report is 12 commits behind head on master.

Files with missing lines	Patch %	Lines
flyteadmin/pkg/manager/impl/execution_manager.go	71.01%	15 Missing and 5 partials ⚠️
flyteadmin/pkg/repositories/config/migrations.go	0.00%	8 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #6475      +/-   ##
==========================================
+ Coverage   58.67%   58.70%   +0.02%     
==========================================
  Files         938      938              
  Lines       71466    71690     +224     
==========================================
+ Hits        41933    42085     +152     
- Misses      26346    26413      +67     
- Partials     3187     3192       +5

Flag	Coverage Δ
unittests-datacatalog	`59.03% <ø> (ø)`
unittests-flyteadmin	`56.14% <63.63%> (-0.09%)`	⬇️
unittests-flytecopilot	`39.56% <ø> (ø)`
unittests-flytectl	`64.72% <ø> (ø)`
unittests-flyteidl	`76.12% <ø> (ø)`
unittests-flyteplugins	`61.14% <ø> (+<0.01%)`	⬆️
unittests-flytepropeller	`55.06% <ø> (+0.22%)`	⬆️
unittests-flytestdlib	`64.02% <ø> (ø)`

Sovietaced · 2025-05-30T03:10:55Z

Can you run make generate in flyteidl dir and make lint-fix in flyteadmin dir? There are some build failures

davidmirror-ops · 2025-06-03T22:33:58Z

@thomasjhuang thanks for your contribution. Also please check out the failing DCO check.
Do you plan to set a specific phase when an execution is skipped? That could be a bit beyond the scope of this PR though but it'd be useful.
Also please add some docs so users know how to leverage this in launchplans.

kumare3 · 2025-06-13T20:58:10Z

Taking a look

kumare3 · 2025-06-13T20:58:34Z

@troychiu can you please take a first look, you looked into this recently right?

troychiu · 2025-06-13T23:23:16Z

I don't really have context but would love to take a look.

kumare3 · 2025-06-17T22:17:57Z

@troychiu sorry wrong one. I meant the volcano PR. I can take a look at this one

thomasjhuang · 2025-07-08T05:13:13Z

Okay I've added a few things - docs, unit test, and ran make generate. make lint-fix didn't indicate any issues on code changes that I've introduced here. Is there a reason why the checks are all pending? Probably ready for review now. Please also check out the related flytekit change - flyteorg/flytekit#3267.

@davidmirror-ops @kumare3 @Sovietaced

davidmirror-ops · 2025-07-09T16:41:19Z

@thomasjhuang thanks for this amazing contribution. Sorry about this but could you port the docs update to the new repo? https://github.com/unionai/docs I'll take point on updating this repo's docs folder to indicate it's no longer the content that ends up rendered in the docs site

EngHabu

Looks great! (and simple 🥳 🥳 🥳 )... left a couple of comments

EngHabu · 2025-07-09T19:12:34Z

+		WHERE
+			executions.project = '${lpProject}'
+			AND executions.domain = '${lpDomain}'
+			AND launch_plans.name = '${lpName}'


Should it also filter on launch_plans.project/domain? If you confirmed it uses the right index, then nvm

Adding launch_plans.project/domain probably isn't needed bc the join already ensures we only see launch plans from the same project/domain as the filtered executions. I had done a MySQL explain and it does use the newly created index correctly.

Sovietaced

I'm looking at porting a version of this to our fork and one thing I'm looking at is how this affects Flyte scheduler. Flyte scheduler is hardcoded to try and retry the execution creation up to 30 times if it fails, unless the gRPC status code in the error is codes.AlreadyExists.

I think the scheduler code will need to be updated to either give up on ResourceExhausted or we'll need a way to articulate this case through richer error details.
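To make the retry concern concrete, here is a minimal Python sketch of a retry loop that stops early on selected status codes. It is not the flytescheduler code: `Code`, `RpcError`, `NON_RETRYABLE`, and `create_with_retries` are hypothetical names, the enum is a local stand-in for gRPC status codes, and backoff between attempts is omitted. Only the 30-attempt cap and the AlreadyExists exception come from the comment above; treating ResourceExhausted as non-retryable is the proposed change.

```python
from enum import Enum


class Code(Enum):
    """Local stand-in for the gRPC status codes mentioned in this thread."""
    UNAVAILABLE = "Unavailable"
    ALREADY_EXISTS = "AlreadyExists"
    RESOURCE_EXHAUSTED = "ResourceExhausted"


class RpcError(Exception):
    """Hypothetical error type carrying a status code."""

    def __init__(self, code):
        super().__init__(code.value)
        self.code = code


# Codes that end the retry loop immediately. ALREADY_EXISTS is the existing
# exception described above; RESOURCE_EXHAUSTED is the proposed addition.
NON_RETRYABLE = {Code.ALREADY_EXISTS, Code.RESOURCE_EXHAUSTED}


def create_with_retries(create_execution, max_attempts=30):
    """Call create_execution until it succeeds, fails with a non-retryable
    code, or runs out of attempts. Returns (succeeded, attempts_made)."""
    for attempt in range(1, max_attempts + 1):
        try:
            create_execution()
            return True, attempt
        except RpcError as err:
            if err.code in NON_RETRYABLE:
                # Retrying cannot help, so give up right away.
                return False, attempt
    return False, max_attempts
```

With this shape, a launch plan that is at its concurrency limit costs the scheduler one call instead of thirty, while transient failures are still retried.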

thomasjhuang · 2025-07-22T18:11:41Z

> @thomasjhuang thanks for this amazing contribution. Sorry about this but could you port the docs update to the new repo? https://github.com/unionai/docs I'll take point on updating this repo's docs folder to indicate it's no longer the content that ends up rendered in the docs site

Got it, I've opened a PR here - unionai/unionai-docs#417

thomasjhuang · 2025-07-22T20:42:59Z

> I'm looking at porting a version of this to our fork and one thing I'm looking at is how this affects Flyte scheduler. Flyte scheduler is hardcoded to try and retry the execution creation up to 30 times if it fails, unless the gRPC status code in the error is codes.AlreadyExists.
>
> I think the scheduler code will need to be updated to either give up on ResourceExhausted or we'll need a way to articulate this case through richer error details.

Makes sense. This change will noticeably increase ResourceExhausted errors, especially if they are measured as a metric, and that can be misleading since a skip isn't really an error. Some changes to flytescheduler are probably needed. Although it is not implemented yet, I also intend to add a SKIPPED execution phase so that the UI can indicate skipped executions rather than throw a resource error.

flyte-bot · 2025-07-22T21:48:49Z

Bito Automatic Review Skipped - Files Excluded

Bito didn't auto-review this change because all changed files are in the exclusion list for automatic reviews. No action is needed if you didn't intend for the agent to review it. Otherwise, to manually trigger a review, type /review in a comment and save.
You can change the excluded files settings here, or contact your Bito workspace admin at eduardo@union.ai.

Sovietaced · 2025-07-29T16:51:12Z

We have an end to end test currently validating this functionality and it seems like there might be a correctness error in the logic that looks for previous executions.

func TestFlyte_WorkflowConcurrencyLimits(t *testing.T) {

	lp := "e2e_singleton_workflow"
	client, err := config.Flyte.GetAdminClient(config.TestCtx)
	require.NoError(t, err, "getting flyte admin client")

	latestLaunchPlan := config.Flyte.FindLatestLaunchPlan(config.TestCtx, t, lp)

	t.Logf("Found most recent launch plan with version [%s]", latestLaunchPlan.GetId().GetVersion())

	_, err = client.AdminClient().CreateExecution(config.TestCtx, &pbadmin.ExecutionCreateRequest{
		Project: config.Flyte.Project,
		Domain:  config.Flyte.Domain,
		Spec: &pbadmin.ExecutionSpec{
			LaunchPlan: &pbcore.Identifier{
				ResourceType: pbcore.ResourceType_LAUNCH_PLAN,
				Project:      config.Flyte.Project,
				Domain:       config.Flyte.Domain,
				Name:         lp,
				Version:      latestLaunchPlan.GetId().GetVersion(),
			},
		},
	})
	require.NoError(t, err, "creating execution")

	// Creating a second execution should fail while the first is non-terminal
	_, err = client.AdminClient().CreateExecution(config.TestCtx, &pbadmin.ExecutionCreateRequest{
		Project: config.Flyte.Project,
		Domain:  config.Flyte.Domain,
		Spec: &pbadmin.ExecutionSpec{
			LaunchPlan: &pbcore.Identifier{
				ResourceType: pbcore.ResourceType_LAUNCH_PLAN,
				Project:      config.Flyte.Project,
				Domain:       config.Flyte.Domain,
				Name:         lp,
				Version:      latestLaunchPlan.GetId().GetVersion(),
			},
		},
	})
	require.Error(t, err, "creating execution")
	s, ok := status.FromError(err)
	require.True(t, ok, "should be a grpc status error")
	require.Equal(t, codes.ResourceExhausted, s.Code())
}

This is failing in our production environment where there is more load and I'm wondering if the state filtering isn't quite right.

Sovietaced · 2025-08-01T16:55:38Z

> Makes sense. This change will noticeably increase ResourceExhausted errors, especially if they are measured as a metric, and that can be misleading since a skip isn't really an error. Some changes to flytescheduler are probably needed. Although it is not implemented yet, I also intend to add a SKIPPED execution phase so that the UI can indicate skipped executions rather than throw a resource error.

For this version can we at least treat it as non-retryable? That's what we're doing and it seems to be ok.

Sovietaced · 2025-08-01T21:16:59Z

@EngHabu seemed happy with this. We've been using a variation of it in production for the past couple of weeks, so I think it's safe to land this.

welcome · 2025-08-01T21:17:11Z

Congrats on merging your first pull request! 🎉

popojk · 2025-08-05T03:23:30Z

> I'm looking at porting a version of this to our fork and one thing I'm looking at is how this affects Flyte scheduler. Flyte scheduler is hardcoded to try and retry the execution creation up to 30 times if it fails, unless the gRPC status code in the error is codes.AlreadyExists.
> I think the scheduler code will need to be updated to either give up on ResourceExhausted or we'll need a way to articulate this case through richer error details.

> Makes sense. This change will noticeably increase ResourceExhausted errors, especially if they are measured as a metric, and that can be misleading since a skip isn't really an error. Some changes to flytescheduler are probably needed. Although it is not implemented yet, I also intend to add a SKIPPED execution phase so that the UI can indicate skipped executions rather than throw a resource error.

Hi @thomasjhuang @Sovietaced. Since I’m currently working on some scheduler-related issues, I can go ahead and open an issue for this and submit a PR.

davidmirror-ops added the triage/discuss label Jun 3, 2025

thomasjhuang mentioned this pull request Jun 6, 2025

Add concurrency policy flyteorg/flytekit#3267

Merged

3 tasks

thomasjhuang force-pushed the thhuang/internal-concurrency-commit branch from da55f4d to 532dfc4 Compare June 20, 2025 22:46

thomasjhuang force-pushed the thhuang/internal-concurrency-commit branch from 532dfc4 to 1e6ffd4 Compare July 8, 2025 05:08

thomasjhuang requested a review from ppiegaze as a code owner July 8, 2025 05:08

thomasjhuang force-pushed the thhuang/internal-concurrency-commit branch from 4e02f0a to 235c354 Compare July 8, 2025 17:44

EngHabu reviewed Jul 9, 2025

Sovietaced approved these changes Jul 13, 2025

Sovietaced self-requested a review July 15, 2025 21:13

Sovietaced requested changes Jul 15, 2025

thomasjhuang mentioned this pull request Jul 22, 2025

Add concurrency control docs to launch plan section unionai/unionai-docs#417

Merged

Sovietaced reviewed Jul 29, 2025

Comment thread flyteadmin/pkg/manager/impl/execution_manager.go

Sovietaced reviewed Jul 30, 2025

Comment thread flyteadmin/pkg/manager/impl/execution_manager_test.go Outdated

Sovietaced approved these changes Aug 1, 2025

thomasjhuang force-pushed the thhuang/internal-concurrency-commit branch from 8a83072 to 0f57bdd Compare August 1, 2025 20:07

Add workflow concurrency control

8c2a0af

Signed-off-by: thomasjhuang <thomashuang63@gmail.com>

thomasjhuang added 12 commits August 1, 2025 13:33

Make generate

4949a4e

Signed-off-by: thomasjhuang <thomashuang63@gmail.com>

Add docs and add unit test

ac4a1f5

Signed-off-by: thomasjhuang <thomashuang63@gmail.com>

Remove unnecessary phase checks

785ca19

Signed-off-by: thomasjhuang <thomashuang63@gmail.com>

Fix linter issues

3bd4c90

Signed-off-by: thomasjhuang <thomashuang63@gmail.com>

Update docs

509eb26

Signed-off-by: thomasjhuang <thomashuang63@gmail.com>

Update flyteidl/protos/flyteidl/admin/launch_plan.proto

c2a2f96

Co-authored-by: Haytham Abuelfutuh <haytham@afutuh.com> Signed-off-by: thomasjhuang <thomashuang63@gmail.com>

Update flyteadmin/pkg/manager/impl/execution_manager.go

2276a55

Co-authored-by: Haytham Abuelfutuh <haytham@afutuh.com> Signed-off-by: thomasjhuang <thomashuang63@gmail.com>

Regenerate protos

4445fe8

Signed-off-by: thomasjhuang <thomashuang63@gmail.com>

Fix logging/error nits

7c85780

Signed-off-by: thomasjhuang <thomashuang63@gmail.com>

Small fixes and move to switch case

2e44453

Signed-off-by: thomasjhuang <thomashuang63@gmail.com>

Remove concurrency docs, to be added in different repo

143a60d

Signed-off-by: thomasjhuang <thomashuang63@gmail.com>

Update test and switch to non-retryable via AlreadyExists

8208637

Signed-off-by: thomasjhuang <thomashuang63@gmail.com>

Sovietaced added the added label (Merged changes that add new functionality) and removed the triage/discuss label Aug 1, 2025

thomasjhuang force-pushed the thhuang/internal-concurrency-commit branch from 0f57bdd to 8208637 Compare August 1, 2025 20:34

Sovietaced merged commit 640ad57 into flyteorg:master Aug 1, 2025
49 checks passed
