Open
Labels
enhancement (New feature or request)
Description
What would you like to be added?
Grove's error and debugging experience needs improvement. We should reduce the learning curve, make errors and gating conditions easier to discover, and reduce the friction of troubleshooting a Grove deployment (a PodCliqueSet, or PCS).
Why is this needed?
Users Face a Steep Learning Curve
- Many Grove users deploying AI workloads may have limited Kubernetes operational experience
- Even experienced Kubernetes users must learn Grove's abstraction layers (PodCliqueSet → PodCliqueScalingGroup → PodClique → Pod)
- Gang scheduling introduces non-obvious lateral dependencies between sibling/cousin resources—a concept absent from standard Kubernetes workloads
- When something goes wrong, users must mentally reconstruct gang topology to understand why seemingly independent resources are blocking each other
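The lateral dependency described above can be illustrated with a toy model. This is only a sketch: the `Clique` class, its `min_available` and `schedulable` fields, and the clique names are hypothetical stand-ins chosen for illustration, not Grove's actual API or scheduling logic. It assumes a gang starts only when every member clique can place its minimum number of pods, and shows the kind of explanation users currently have to work out by hand.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Clique:
    """Toy stand-in for a PodClique; fields are illustrative, not Grove's API."""
    name: str
    min_available: int  # pods that must be placeable before the gang may start
    schedulable: int    # pods the scheduler could place right now


def blocking_cliques(gang: list[Clique]) -> list[str]:
    """Return the names of cliques that keep the whole gang from scheduling."""
    return [c.name for c in gang if c.schedulable < c.min_available]


def explain(gang: list[Clique]) -> dict[str, str]:
    """Map each clique to a human-readable reason for its current state."""
    blockers = blocking_cliques(gang)
    reasons: dict[str, str] = {}
    for c in gang:
        if not blockers:
            reasons[c.name] = "gang satisfied"
        elif c.name in blockers:
            reasons[c.name] = (
                f"needs {c.min_available} pods, only {c.schedulable} schedulable"
            )
        else:
            # This clique is fine on its own but is held back by its siblings.
            reasons[c.name] = "gated by sibling(s): " + ", ".join(blockers)
    return reasons


# Hypothetical gang: "prefill" is satisfiable, "decode" is short on capacity.
gang = [
    Clique("prefill", min_available=2, schedulable=2),
    Clique("decode", min_available=4, schedulable=1),
]
reasons = explain(gang)
for name, reason in reasons.items():
    print(f"{name}: {reason}")
```

In this model `prefill` looks healthy in isolation yet cannot start, which is the situation that is hard to diagnose today without knowing the gang topology.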
Errors Are Difficult to Discover
- Analysis found 41 error conditions that log internally but don't emit Kubernetes events—invisible to users without operator log access
- Status conditions like `InsufficientScheduledPods` describe symptoms but not root causes
- Schedule-gated PodCliques provide no indication they're blocked by gang dependencies or which related resource is at fault
- Duplicate warning events create noise that obscures actionable diagnostic information
Troubleshooting Requires Complex Navigation
- Users must check multiple resource types and understand their relationships to diagnose issues
- No unified view shows the health of an entire workload topology
- Standard tools like `kubectl` require multiple commands to piece together the full picture
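As a rough illustration of what a unified view could look like, the sketch below rolls health up through the PodCliqueSet → PodCliqueScalingGroup → PodClique → Pod hierarchy and prints it as one tree. The `Node` class, the `OK`/`DEGRADED` labels, and the resource names are assumptions made for this example; it does not query a cluster or reflect any existing Grove tooling.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical in-memory view of one resource in the workload tree."""
    kind: str
    name: str
    ready: bool = True
    children: list[Node] = field(default_factory=list)


def healthy(node: Node) -> bool:
    """A node is healthy only if it and all of its descendants are ready."""
    return node.ready and all(healthy(child) for child in node.children)


def render(node: Node, depth: int = 0) -> list[str]:
    """Return one indented line per resource, parents before children."""
    mark = "OK" if healthy(node) else "DEGRADED"
    lines = [f"{'  ' * depth}{node.kind}/{node.name} [{mark}]"]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


# Hypothetical workload with a single pod that is not ready.
tree = Node("PodCliqueSet", "inference", children=[
    Node("PodCliqueScalingGroup", "workers", children=[
        Node("PodClique", "decode", children=[
            Node("Pod", "decode-0"),
            Node("Pod", "decode-1", ready=False),
        ]),
    ]),
])
lines = render(tree)
print("\n".join(lines))
```

A single command producing output of this shape would replace several per-resource lookups, since the unhealthy leaf and every ancestor it affects are visible at once.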