Fix type error for suspended kubeflow jobs#7241
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #7241 +/- ##
==========================================
+ Coverage 56.58% 56.95% +0.37%
==========================================
Files 931 931
Lines 64866 58250 -6616
==========================================
- Hits 36707 33179 -3528
+ Misses 25105 22017 -3088
Partials 3054 3054
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
To me it doesn't make sense that a kubeflow job that is .JobSuspended (meaning not yet running) is treated as pluginsCore.PhaseInfoRunning. The ray plugin treats it as "phase info queued", see here which makes more sense to me. Let's return pluginsCore.PhaseInfoQueued(... instead.
@pingsutw I think this is currently not correct :)
|
but kubeflow may also suspend a running job too, (running -> suspended) |
Do you know in which situation kubeflow does this? Haven't seen this yet, only by external systems like kueue via an admission webhook right after creation of the job object. |
|
I think that pluginsCore.PhaseInfoQueued makes more sense as well. Kueue suspends workloads to queue them for later. Either right after creation or after already running. |
e914edb to
fe3c9ad
Compare
When kubeflow jobs are suspended by an external system (eg kueue), the controller error's with the following [0]. https://github.com/flyteorg/flyte/pull/6295/changes relies only on app.Spec.RunPolicy.Suspend and does not take into account the phase. [0] RuntimeExecutionError: failed during plugin execution, caused by: Invalid transition for plugin [pytorch]: transition doesn't have task info nor an execution error filled [TransitionTypeEphemeral,Phase<PhaseUndefined:0 <nil> Reason:>] Signed-off-by: Spyros Trigazis <spyros.trigazis@verda.com>
fe3c9ad to
ece7e47
Compare
|
Congrats on merging your first pull request! 🎉 |
|
Are there any requirements to cut a release? In this case 1.16.6 (with this patch)? |
Not really. I can push one out tomorrow if someone else doesn't. |
Signed-off-by: Spyros Trigazis <spyros.trigazis@verda.com> Co-authored-by: Jason Parraga <Sovietaced@gmail.com> Signed-off-by: Muskan Kumari <er.muskan09@gmail.com>
Signed-off-by: Spyros Trigazis <spyros.trigazis@verda.com> Co-authored-by: Jason Parraga <Sovietaced@gmail.com> Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>
When kubeflow jobs are suspended by an external system (eg kueue), the controller error's with the following [0].
https://github.com/flyteorg/flyte/pull/6295/changes relies only on app.Spec.RunPolicy.Suspend and does not take into account the phase.
[0]
failed Execute for node. Error: failed at Node[dn1]. RuntimeExecutionError: failed during plugin execution, caused by: Invalid transition for plugin [pytorch]: transition doesn't have task info nor an execution error filled [TransitionTypeEphemeral,Phase<PhaseUndefined:0 <nil> Reason:>]related: #6294 #6295
Observed in versions: v1.16.3, v1.16.4, v1.16.5