Error/Debugging Experience Needs Improvements #286

@gflarity

What you would like to be added?

Grove's error and debugging experience needs improvement. We should flatten the learning curve, make errors and gating conditions easier to discover, and reduce the friction of troubleshooting a Grove deployment (a PodCliqueSet, or PCS).

Why is this needed?

Users Face a Steep Learning Curve

  • Many Grove users deploying AI workloads may have limited Kubernetes operational experience
  • Even experienced Kubernetes users must learn Grove's abstraction layers (PodCliqueSet → PodCliqueScalingGroup → PodClique → Pod)
  • Gang scheduling introduces non-obvious lateral dependencies between sibling/cousin resources—a concept absent from standard Kubernetes workloads
  • When something goes wrong, users must mentally reconstruct gang topology to understand why seemingly independent resources are blocking each other

Errors Are Difficult to Discover

  • Analysis found 41 error conditions that log internally but don't emit Kubernetes events—invisible to users without operator log access
  • Status conditions like InsufficientScheduledPods describe symptoms but not root causes
  • Schedule-gated PodCliques provide no indication they're blocked by gang dependencies or which related resource is at fault
  • Duplicate warning events create noise that obscures actionable diagnostic information

Troubleshooting Requires Complex Navigation

  • Users must check multiple resource types and understand their relationships to diagnose issues
  • No unified view shows the health of an entire workload topology
  • Standard tools like kubectl require multiple commands to piece together the full picture

Labels: enhancement
