Improve Visibility and Handling of Pod Lifecycle Failures in Kubeflow Pipelines UI #13182

@SinghhSarvesh

### Problem

When a Kubeflow Pipelines run fails because of a pod lifecycle issue (such as `CrashLoopBackOff`, `OOMKilled`, or `ImagePullBackOff`), the UI does not clearly indicate the reason for the failure.

Instead, the pipeline often appears to be stuck or not progressing, which confuses users. To find the cause, they have to inspect Kubernetes resources manually with `kubectl`, which breaks the abstraction that Kubeflow aims to provide.
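For context, the reasons users currently dig out by hand are already present in the Pod object that `kubectl get pod <name> -o json` returns. Below is a minimal sketch of where they live, assuming the standard Pod status fields (`status.reason`, the `PodScheduled` condition, and each container's `state` / `lastState`). The function name `collect_status_reasons` is hypothetical and not part of Kubeflow Pipelines.

```python
from typing import Any, Dict, List


def collect_status_reasons(pod: Dict[str, Any]) -> List[str]:
    """Collect reason strings from a Pod object parsed from JSON.

    Hypothetical helper for illustration only. It returns every reason it
    finds, in order of discovery and without duplicates; deciding which
    reasons count as failures is a separate step.
    """
    status = pod.get("status") or {}
    reasons: List[str] = []

    def add(reason: Any) -> None:
        # Skip missing values and avoid listing the same reason twice.
        if reason and reason not in reasons:
            reasons.append(reason)

    # Pod-level reason, set for cases such as eviction.
    add(status.get("reason"))

    # Scheduling problems appear as a PodScheduled condition with status "False".
    for cond in status.get("conditions") or []:
        if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
            add(cond.get("reason"))

    # Container-level reasons live under state / lastState,
    # in either the "waiting" or the "terminated" sub-object.
    for key in ("initContainerStatuses", "containerStatuses"):
        for cs in status.get(key) or []:
            for state_key in ("state", "lastState"):
                state = cs.get(state_key) or {}
                for phase in ("waiting", "terminated"):
                    add((state.get(phase) or {}).get("reason"))

    return reasons
```

The input would be the result of `json.load` on the `kubectl` output. Note that the list can also contain non-failure reasons such as `ContainerCreating` or `Completed`, so a UI would still need to filter or classify them.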

### Proposed Improvement

I would like to propose enhancements to improve failure visibility and handling in the Kubeflow Pipelines UI:

1. Detect and classify pod lifecycle failures into categories:
   - Provisioning failures (`ImagePullBackOff`, `Unschedulable`)
   - Runtime failures (`CrashLoopBackOff`, `OOMKilled`)
   - Node-level failures (`NodeLost`, `Preempted`)
2. Display clear failure reasons directly in the UI:
   - Show the error type and message
   - Highlight failed pipeline nodes visually
3. Introduce timeout handling:
   - Prevent pipelines from appearing stuck indefinitely
   - Allow a configurable timeout per failure type
4. Improve the user experience:
   - Provide human-readable explanations
   - Optionally suggest possible fixes (e.g., increase memory for `OOMKilled`)
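The classification, explanation, and timeout ideas above could be sketched roughly as follows. This is an illustration under assumptions, not an existing Kubeflow Pipelines API: the names `FailureCategory`, `FailureInfo`, `KNOWN_FAILURES`, `DEFAULT_TIMEOUTS_SECONDS`, `classify`, and `is_timed_out` are all hypothetical, the reason strings are the ones listed in this issue, and the timeout values are placeholders.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class FailureCategory(Enum):
    PROVISIONING = "provisioning"
    RUNTIME = "runtime"
    NODE = "node"


@dataclass(frozen=True)
class FailureInfo:
    category: FailureCategory
    explanation: str                   # human-readable text for the UI
    suggestion: Optional[str] = None   # optional hint on how to fix it


# Reason strings taken from the proposal; the mapping itself is an assumption.
KNOWN_FAILURES: Dict[str, FailureInfo] = {
    "ImagePullBackOff": FailureInfo(
        FailureCategory.PROVISIONING,
        "The container image could not be pulled.",
        "Check the image name, tag, and registry credentials.",
    ),
    "Unschedulable": FailureInfo(
        FailureCategory.PROVISIONING,
        "No node satisfies the pod's scheduling requirements.",
        "Check resource requests, node selectors, and tolerations.",
    ),
    "CrashLoopBackOff": FailureInfo(
        FailureCategory.RUNTIME,
        "The container keeps exiting shortly after it starts.",
        "Inspect the container logs for the startup error.",
    ),
    "OOMKilled": FailureInfo(
        FailureCategory.RUNTIME,
        "The container was killed because it ran out of memory.",
        "Increase the memory limit for this step.",
    ),
    "NodeLost": FailureInfo(
        FailureCategory.NODE,
        "The node running the pod became unreachable.",
    ),
    "Preempted": FailureInfo(
        FailureCategory.NODE,
        "The pod was removed to make room for a higher-priority pod.",
    ),
}

# Placeholder per-category timeouts in seconds; real values would be configurable.
DEFAULT_TIMEOUTS_SECONDS: Dict[FailureCategory, int] = {
    FailureCategory.PROVISIONING: 300,
    FailureCategory.RUNTIME: 120,
    FailureCategory.NODE: 60,
}


def classify(reason: str) -> Optional[FailureInfo]:
    """Return failure details for a known reason, or None if unrecognized."""
    return KNOWN_FAILURES.get(reason)


def is_timed_out(
    reason: str,
    seconds_in_state: float,
    timeouts: Dict[FailureCategory, int] = DEFAULT_TIMEOUTS_SECONDS,
) -> bool:
    """True if a known failure has persisted for at least its category's timeout."""
    info = classify(reason)
    if info is None:
        return False  # unknown reasons never time out in this sketch
    return seconds_in_state >= timeouts[info.category]
```

With this shape, the UI could show `classify(reason).explanation` next to a failed node and mark the step as failed once `is_timed_out` returns true, instead of leaving it pending. Returning `None` for unrecognized reasons keeps the behavior unchanged for cases the table does not cover.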

### Expected Impact

- Improved debugging experience for users
- Reduced dependency on Kubernetes CLI tools
- Better alignment with Kubeflow’s goal of abstracting infrastructure complexity

### Additional Context

I am exploring contributing to this area as part of GSoC and would love feedback from maintainers on feasibility and design direction.
