### Problem
Currently, when a Kubeflow Pipeline fails due to pod lifecycle issues (such as CrashLoopBackOff, OOMKilled, or ImagePullBackOff), the UI does not clearly indicate the reason for failure.
Instead, the pipeline often appears to be stuck or not progressing, which creates confusion for users. To understand the issue, users need to manually inspect Kubernetes resources using kubectl, which breaks the abstraction that Kubeflow aims to provide.
### Proposed Improvement
I would like to propose enhancements to improve failure visibility and handling in the Kubeflow Pipelines UI:
- Detect and classify pod lifecycle failures into categories:
  - Provisioning failures (ImagePullBackOff, Unschedulable)
  - Runtime failures (CrashLoopBackOff, OOMKilled)
  - Node-level failures (NodeLost, Preempted)
- Display clear failure reasons directly in the UI:
  - Show the error type and message
  - Highlight failed pipeline nodes visually
- Introduce timeout handling:
  - Prevent pipelines from appearing stuck indefinitely
  - Allow a configurable timeout based on failure type
- Improve the user experience:
  - Provide human-readable explanations
  - Optionally suggest possible fixes (e.g., increase memory for OOMKilled)
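To make the classification idea concrete, here is a minimal, hypothetical sketch of how a backend could map Kubernetes failure reasons to the proposed categories and suggestions. The reason strings are the standard values surfaced in pod/container status fields; the function and mapping names are illustrative only, not an existing KFP API:

```python
# Hypothetical sketch: map Kubernetes pod/container status "reason" values
# to the failure categories proposed above. Names are illustrative.

PROVISIONING = "Provisioning failure"
RUNTIME = "Runtime failure"
NODE_LEVEL = "Node-level failure"
UNKNOWN = "Unknown failure"

# Reasons as reported by the Kubernetes API in container statuses
# and pod conditions.
REASON_CATEGORIES = {
    "ImagePullBackOff": PROVISIONING,
    "ErrImagePull": PROVISIONING,
    "Unschedulable": PROVISIONING,
    "CrashLoopBackOff": RUNTIME,
    "OOMKilled": RUNTIME,
    "NodeLost": NODE_LEVEL,
    "Preempted": NODE_LEVEL,
}

# Optional human-readable hints for the UI (illustrative wording).
SUGGESTED_FIXES = {
    "OOMKilled": "Increase the component's memory request/limit.",
    "ImagePullBackOff": "Check the image name, tag, and registry credentials.",
    "Unschedulable": "Check node resources, taints, and affinity rules.",
}


def classify_failure(reason: str) -> dict:
    """Return a UI-friendly summary for a pod failure reason."""
    return {
        "reason": reason,
        "category": REASON_CATEGORIES.get(reason, UNKNOWN),
        "suggestion": SUGGESTED_FIXES.get(reason),  # None if no hint known
    }
```

A lookup table like this keeps the UI decoupled from Kubernetes internals: unrecognized reasons fall through to an "Unknown failure" bucket instead of leaving the pipeline looking stuck.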
### Expected Impact
- Improved debugging experience for users
- Reduced dependency on Kubernetes CLI tools
- Better alignment with Kubeflow’s goal of abstracting infrastructure complexity
### Additional Context
I am exploring contributing to this area as part of GSoC and would love feedback from maintainers on feasibility and design direction.