[KYUUBI #7424] Periodically delete pod stuck at FailedMount state #7425

Open
oh0873 wants to merge 3 commits into apache:master from oh0873:hoonoh/failedmountCleanups


@oh0873 oh0873 commented Apr 28, 2026

Why are the changes needed?

When the Kyuubi server pod crashes, some Spark driver pods may get stuck in the `FailedMount` state.
This PR adds a periodic check that cleans up these pods.

How was this patch tested?

Deployed and tested in our environment. We observed that `FailedMount` pods were deleted after the configured time.

Was this patch authored or co-authored using generative AI tooling?

Yes, Cursor was used.

oh0873 added 2 commits April 16, 2026 12:08
---
**Work Item:** #10502574

---

**Problem**
--
Kyuubi driver pods get stuck with a `FailedMount` error.

When an application is submitted, `spark-submit` creates a driver pod and a ConfigMap. Sometimes the driver pod is created but the ConfigMap is not. This happens if Kyuubi dies between those two steps.

**Approach**
--
We want to clean up driver pods that are stuck in the `FailedMount` state.

We check every pending application and its pods.
If a pod's `FailedMount` event count exceeds a configured threshold, we delete the pod.

This removes all driver pods that are stuck.

---

**Code Changes**
--

**Three Configurations Added (`KyuubiConf.scala`)**:
- `KUBERNETES_POD_FAILED_MOUNT_LOOP_CHECK_ENABLED`: If true, check for and delete pods stuck in `FailedMount`.
- `KUBERNETES_POD_FAILED_MOUNT_LOOP_CHECK_INTERVAL`: Interval between checks for pods stuck in `FailedMount`. Currently set to 1 hour, so the check runs every hour after the server is deployed.
- `KUBERNETES_POD_FAILED_MOUNT_LOOP_THRESHOLD`: If a pod's `FailedMount` event count exceeds this threshold, the pod is deleted. It is set to 720, which is roughly 24 hours.
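
To make the link between the threshold and the "roughly 24 hours" figure concrete, here is a minimal sketch. It assumes the `FailedMount` event count grows about once every 2 minutes, which is the rate implied by 720 counts being roughly 24 hours. `FailedMountCheckConf`, `FailedMountThresholdSketch`, and `approxStuckDuration` are hypothetical names for illustration, not the identifiers used in `KyuubiConf.scala`, and the `enabled` default is assumed.

```scala
import scala.concurrent.duration._

// Hypothetical container for the three settings; the real entries live in KyuubiConf.scala.
final case class FailedMountCheckConf(
    enabled: Boolean = true,                // assumed default, for illustration only
    checkInterval: FiniteDuration = 1.hour, // how often the check runs
    threshold: Int = 720)                   // FailedMount event count that triggers deletion

object FailedMountThresholdSketch {
  // Assumption: one FailedMount event is counted roughly every 2 minutes.
  val assumedEventPeriod: FiniteDuration = 2.minutes

  // Approximate time a pod has been stuck once its event count reaches `count`.
  def approxStuckDuration(
      count: Int,
      period: FiniteDuration = assumedEventPeriod): FiniteDuration =
    period * count.toLong
}
```

Under that assumption, `approxStuckDuration(720)` is 1440 minutes, i.e. 24 hours. If events are counted at a different rate in a given cluster, the threshold would need to be scaled accordingly.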

**Added Failed Mount Check in `KubernetesApplicationOperation.scala`**:

Added `cleanupFailedMountLoopPodExecutor`. It runs once per interval (set to 1 hour) to check for `FailedMount` pods:
- Goes through all applications.
- If `appInfo.state` is `PENDING`, gets all of its pods.
- For each pod, runs `checkPodFailedMountLoop`.
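
The scan loop above can be sketched roughly as follows. This is a simplified, self-contained illustration rather than the code in `KubernetesApplicationOperation.scala`: `AppInfo`, `AppState`, `podsToCheck`, and `start` are hypothetical stand-ins, and using the interval as the initial delay is an assumption.

```scala
import java.util.concurrent.{Executors, ScheduledExecutorService, TimeUnit}

object FailedMountScanSketch {
  // Simplified stand-ins for Kyuubi's application info and state.
  sealed trait AppState
  case object Pending extends AppState
  case object Running extends AppState
  final case class AppInfo(id: String, state: AppState, podNames: Seq[String])

  // Select the pods that need a FailedMount check: only pods of PENDING applications.
  def podsToCheck(apps: Seq[AppInfo]): Seq[String] =
    apps.filter(_.state == Pending).flatMap(_.podNames)

  // Run `checkPod` for every candidate pod at a fixed delay; the caller owns shutdown.
  def start(
      listApps: () => Seq[AppInfo],
      checkPod: String => Unit,
      intervalSeconds: Long): ScheduledExecutorService = {
    val executor = Executors.newSingleThreadScheduledExecutor()
    val task = new Runnable {
      override def run(): Unit = {
        // Catch exceptions here: an uncaught one would suppress all later runs.
        try {
          podsToCheck(listApps()).foreach(checkPod)
        } catch {
          case e: Exception =>
            System.err.println(s"FailedMount scan failed: ${e.getMessage}")
        }
      }
    }
    executor.scheduleWithFixedDelay(task, intervalSeconds, intervalSeconds, TimeUnit.SECONDS)
    executor
  }
}
```

Keeping the pod selection in a pure function (`podsToCheck`) makes the filtering easy to unit test without a cluster or a running scheduler.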

Added `checkPodFailedMountLoop` to delete a stuck pod:
- Gets all events associated with the pod.
- Checks whether any event contains "FailedMount" and has a count greater than the threshold.
- If such an event exists, deletes the pod.
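
The decision logic described above can be illustrated with a small sketch. It models an event with a plain case class instead of the Kubernetes client types, and `PodEvent`, `exceedsFailedMountThreshold`, and `deleteIfStuck` are hypothetical names; the actual method signature in the patch may differ.

```scala
object FailedMountDecisionSketch {
  // Simplified view of a Kubernetes event: only the fields the check needs.
  final case class PodEvent(reason: String, count: Int)

  // True when any FailedMount event has been counted more than `threshold` times.
  def exceedsFailedMountThreshold(events: Seq[PodEvent], threshold: Int): Boolean =
    events.exists(e => e.reason.contains("FailedMount") && e.count > threshold)

  // Delete the pod through the supplied callback; returns whether a deletion happened.
  def deleteIfStuck(
      podName: String,
      events: Seq[PodEvent],
      threshold: Int,
      deletePod: String => Unit): Boolean = {
    val stuck = exceedsFailedMountThreshold(events, threshold)
    if (stuck) deletePod(podName)
    stuck
  }
}
```

Note that the comparison is strict, so a pod whose count equals the threshold exactly is left alone until the next check.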

**Minor change**
- Added `.metals` to `.gitignore`. (Cursor keeps creating `.metals` for scala projects).

**Test**
--

Tested in a POC; it detected and deleted driver pods that had been stuck in `FailedMount` for a long time.

The log shows the detection and deletion of the stuck driver pod (newest event first).

![Pasted Graphic.png](https://geico.visualstudio.com/a9381017-9a49-48ba-968b-b91b8c491290/_apis/git/repositories/59e64e8f-fe99-4025-8d42-a666f09c4ed3/pullRequests/4876778/attachments/Pasted%20Graphic.png) 

---

**Concern**
--

**Performance**
`FailedMount` errors do not happen frequently, so scanning all pending pods on every check may be costly for little benefit.
Also note that the check fetches all events and then filters them by the Kyuubi ID (equivalent to `kubectl get events --field-selector involvedObject.kind=Pod,involvedObject.name=kyuubi-spark-def6721d-bff8-4b29-94c6-dd1708cfd596-driver`). If there are a lot of events, this may not be cheap.
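
For reference, the field selector in the `kubectl` command above can be built per pod as in the sketch below. `EventSelectorSketch` and `podEventSelector` are hypothetical names, and the optional `reason` term assumes the API server accepts `reason` as a field selector for events; it is not something this patch is confirmed to use.

```scala
object EventSelectorSketch {
  // Build a field selector that asks for the events of one pod only.
  // `reason` (e.g. Some("FailedMount")) optionally narrows the result further.
  def podEventSelector(podName: String, reason: Option[String] = None): String = {
    val base = Seq("involvedObject.kind" -> "Pod", "involvedObject.name" -> podName)
    val all = base ++ reason.map(r => "reason" -> r)
    all.map { case (k, v) => s"$k=$v" }.mkString(",")
  }
}
```

If the selector is applied on the server side, the number of events transferred per pod stays small even when the namespace holds many events.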

**False Positive**
If a pod is stuck in `FailedMount` for 24 hours but somehow starts working again, this code will still delete it, because the (past) event still exists. A pod should not be stuck in the `FailedMount` state for more than 24 hours unless the ConfigMap was never created and never will be.
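
One possible way to narrow this false positive, sketched here under assumptions and not necessarily what the patch does, is to also require that the `FailedMount` event was seen recently. `PodEvent`, `lastSeen`, `maxAge`, and `stillStuck` are hypothetical names; `lastSeen` stands in for whatever last-occurrence timestamp the event carries.

```scala
import java.time.{Duration, Instant}

object FailedMountRecencySketch {
  // Simplified event view; `lastSeen` is the time the event last occurred.
  final case class PodEvent(reason: String, count: Int, lastSeen: Instant)

  // Treat the pod as still stuck only if a FailedMount event over the threshold
  // was last seen no longer than `maxAge` before `now`.
  def stillStuck(
      events: Seq[PodEvent],
      threshold: Int,
      maxAge: Duration,
      now: Instant): Boolean =
    events.exists { e =>
      e.reason.contains("FailedMount") &&
      e.count > threshold &&
      Duration.between(e.lastSeen, now).compareTo(maxAge) <= 0
    }
}
```

With this extra condition, a pod that recovered hours ago would be skipped even though its old event still has a high count. The value chosen for `maxAge` would need to be larger than the gap between successive `FailedMount` events.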

**Questions**
--
- Is checking every hour reasonable? Should I check less frequently to improve performance?
- I...