Skip to content

Fix type error for suspended kubeflow jobs#7241

Merged
pingsutw merged 3 commits into
flyteorg:masterfrom
strigazi:fix/suspended-kfjobs
Apr 23, 2026
Merged

Fix type error for suspended kubeflow jobs#7241
pingsutw merged 3 commits into
flyteorg:masterfrom
strigazi:fix/suspended-kfjobs

Conversation

@strigazi

Copy link
Copy Markdown

When kubeflow jobs are suspended by an external system (eg kueue), the controller error's with the following [0].

https://github.com/flyteorg/flyte/pull/6295/changes relies only on app.Spec.RunPolicy.Suspend and does not take into account the phase.

[0] failed Execute for node. Error: failed at Node[dn1]. RuntimeExecutionError: failed during plugin execution, caused by: Invalid transition for plugin [pytorch]: transition doesn't have task info nor an execution error filled [TransitionTypeEphemeral,Phase<PhaseUndefined:0 <nil> Reason:>]

related: #6294 #6295

Observed in versions: v1.16.3, v1.16.4, v1.16.5

@strigazi

Copy link
Copy Markdown
Author

Hello @fg91, following your PR #6295 to make flyte work with suspended kubeflow jobs, can you have a quick look in this PR? Does the direction of this patch look good to you?

pingsutw
pingsutw previously approved these changes Apr 20, 2026
@codecov

codecov Bot commented Apr 20, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 56.95%. Comparing base (3fe1a5e) to head (5a3a069).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7241      +/-   ##
==========================================
+ Coverage   56.58%   56.95%   +0.37%     
==========================================
  Files         931      931              
  Lines       64866    58250    -6616     
==========================================
- Hits        36707    33179    -3528     
+ Misses      25105    22017    -3088     
  Partials     3054     3054              
Flag Coverage Δ
unittests-datacatalog 53.51% <ø> (ø)
unittests-flyteadmin 53.14% <ø> (ø)
unittests-flytecopilot 43.06% <ø> (+1.78%) ⬆️
unittests-flytectl 64.09% <ø> (ø)
unittests-flyteidl 75.71% <ø> (+2.49%) ⬆️
unittests-flyteplugins 60.19% <100.00%> (+0.85%) ⬆️
unittests-flytepropeller 53.71% <ø> (+0.33%) ⬆️
unittests-flytestdlib 62.61% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@fg91 fg91 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me it doesn't make sense that a kubeflow job that is .JobSuspended (meaning not yet running) is treated as pluginsCore.PhaseInfoRunning. The ray plugin treats it as "phase info queued", see here which makes more sense to me. Let's return pluginsCore.PhaseInfoQueued(... instead.

@fg91 fg91 dismissed pingsutw’s stale review April 20, 2026 19:37

@pingsutw I think this is currently not correct :)

@pingsutw

Copy link
Copy Markdown
Member

but kubeflow may also suspend a running job too, (running -> suspended)

@fg91

fg91 commented Apr 20, 2026

Copy link
Copy Markdown
Member

but kubeflow may also suspend a running job too, (running -> suspended)

Do you know in which situation kubeflow does this? Haven't seen this yet, only by external systems like kueue via an admission webhook right after creation of the job object.

@strigazi

Copy link
Copy Markdown
Author

I think that pluginsCore.PhaseInfoQueued makes more sense as well. Kueue suspends workloads to queue them for later. Either right after creation or after already running.

@strigazi strigazi force-pushed the fix/suspended-kfjobs branch from e914edb to fe3c9ad Compare April 21, 2026 08:20
@strigazi strigazi requested a review from fg91 April 21, 2026 08:31
Comment thread flyteplugins/go/tasks/plugins/k8s/kfoperators/mpi/mpi_test.go Outdated
When kubeflow jobs are suspended by an external system (eg kueue),
the controller error's with the following [0].

https://github.com/flyteorg/flyte/pull/6295/changes relies only on
app.Spec.RunPolicy.Suspend and does not take into account the phase.

[0] RuntimeExecutionError: failed during plugin execution, caused by:
    Invalid transition for plugin [pytorch]: transition doesn't have task
    info nor an execution error filled
    [TransitionTypeEphemeral,Phase<PhaseUndefined:0 <nil> Reason:>]

Signed-off-by: Spyros Trigazis <spyros.trigazis@verda.com>
@strigazi strigazi force-pushed the fix/suspended-kfjobs branch from fe3c9ad to ece7e47 Compare April 22, 2026 08:38

@fg91 fg91 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM now, thanks!

@fg91 fg91 requested a review from pingsutw April 22, 2026 20:27
@pingsutw pingsutw enabled auto-merge (squash) April 23, 2026 18:20
@pingsutw pingsutw merged commit 682d81e into flyteorg:master Apr 23, 2026
48 checks passed
@welcome

welcome Bot commented Apr 23, 2026

Copy link
Copy Markdown

Congrats on merging your first pull request! 🎉

@strigazi

Copy link
Copy Markdown
Author

Are there any requirements to cut a release? In this case 1.16.6 (with this patch)?

@Sovietaced

Copy link
Copy Markdown
Member

Are there any requirements to cut a release? In this case 1.16.6 (with this patch)?

Not really. I can push one out tomorrow if someone else doesn't.

muskan-creates352 pushed a commit to muskan-creates352/flyte that referenced this pull request Apr 25, 2026
Signed-off-by: Spyros Trigazis <spyros.trigazis@verda.com>
Co-authored-by: Jason Parraga <Sovietaced@gmail.com>
Signed-off-by: Muskan Kumari <er.muskan09@gmail.com>
nuthalapativarun pushed a commit to nuthalapativarun/flyte that referenced this pull request May 1, 2026
Signed-off-by: Spyros Trigazis <spyros.trigazis@verda.com>
Co-authored-by: Jason Parraga <Sovietaced@gmail.com>
Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants