[KYUUBI #7424] Periodically delete pod stuck at FailedMount state #7425
Open
oh0873 wants to merge 3 commits into apache:master
---

**Work Item:** #10502574

---

**Problem**
--
Kyuubi driver pods can get stuck with a `FailedMount` error. When an application is submitted, `spark-submit` creates a driver pod and a ConfigMap. Sometimes the driver pod is created but the ConfigMap is not. This can happen if Kyuubi dies between those two steps.

**Approach**
--
We want to clean up driver pods that are stuck at the `FailedMount` stage. We check every pending application and its pods. If a pod's `FailedMount` event count exceeds a configured threshold, we delete the pod. This removes all driver pods that are stuck.

---

**Code Changes**
--
**Three configurations added (`KyuubiConf.scala`)**:
- `KUBERNETES_POD_FAILED_MOUNT_LOOP_CHECK_ENABLED`: if true, check for and delete pods stuck at `FailedMount`.
- `KUBERNETES_POD_FAILED_MOUNT_LOOP_CHECK_INTERVAL`: interval at which to check for pods stuck at `FailedMount`. Currently set to 1 hour, so the check runs every hour starting from deployment.
- `KUBERNETES_POD_FAILED_MOUNT_LOOP_THRESHOLD`: if the `FailedMount` event count exceeds the threshold, the pod is deleted. It is set to 720, which is roughly 24 hours.

**Added failed-mount check in `KubernetesApplicationOperation.scala`**:

Added `cleanupFailedMountLoopPodExecutor`. It runs at every interval (set to 1 hour) to check for `FailedMount` pods.
- Goes through all applications.
- If `appInfo.state` is PENDING, gets all pods.
- For each pod, runs `checkPodFailedMountLoop`.

Added `checkPodFailedMountLoop` to delete the pod.
- Gets all events associated with the pod.
- Checks whether an event has "FailedMount" in it and its count is bigger than the threshold.
- If a `FailedMount` event exceeds the threshold, deletes the pod.

**Minor change**
- Added `.metals` to `.gitignore` (Cursor keeps creating `.metals` for Scala projects).

**Test**
--
Tested in a POC; it was able to detect and delete driver pods that had been stuck at `FailedMount` for a long time. The log shows detection and deletion of the stuck driver pod (newest event first).
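The decision rule described for `checkPodFailedMountLoop` can be sketched in isolation. This is a minimal illustration, not the code in this PR: `PodEvent`, `FailedMountCheck`, and both function names are hypothetical, and the event is reduced to a reason string and a count instead of the Kubernetes client's event type. The second helper only spells out the arithmetic behind "720 is roughly 24 hours", which implies about one `FailedMount` event every 2 minutes.

```scala
// Hypothetical, simplified stand-in for a Kubernetes event (not the client library's type).
final case class PodEvent(reason: String, count: Int)

object FailedMountCheck {
  // True when any event mentions FailedMount and its count is strictly above the threshold.
  def exceedsFailedMountThreshold(events: Seq[PodEvent], threshold: Int): Boolean =
    events.exists(e => e.reason.contains("FailedMount") && e.count > threshold)

  // Average minutes between events implied by a threshold spread over a time window.
  def impliedMinutesPerEvent(threshold: Int, windowHours: Int): Double =
    windowHours * 60.0 / threshold
}
```

With a threshold of 720, a pod whose `FailedMount` event count is exactly 720 is kept, and one with 721 is flagged for deletion, assuming "exceeds" means strictly greater.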
---

**Concern**
--
**Performance**

`FailedMount` errors do not happen frequently, so scanning all pending pods may be costly. Also note that the check gets all events and then filters by kyuubiId (equivalent to `kubectl get events --field-selector involvedObject.kind=Pod,involvedObject.name=kyuubi-spark-def6721d-bff8-4b29-94c6-dd1708cfd596-driver`). If there are a lot of events, this may not be cheap.

**False Positive**

If a pod is stuck at `FailedMount` for 24 hours but somehow starts working again, this code will still delete the pod, because the past event still exists. A pod should not be stuck at the `FailedMount` stage for more than 24 hours unless the ConfigMap was not, is not, and will not be created.

**Questions**
--
- Is checking every hour good? Should I check less frequently to improve performance?
- I...
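To make the false-positive concern concrete, here is a small sketch of how a count-only check differs from one that also looks at when the event was last seen. It is only an illustration of the concern and one possible mitigation, not something this PR implements: `TimedPodEvent`, `FailedMountRecency`, `countOnly`, `stillStuck`, and the `maxAge` parameter are all hypothetical names.

```scala
import java.time.{Duration, Instant}

// Hypothetical, simplified event carrying the time it was last observed.
final case class TimedPodEvent(reason: String, count: Int, lastSeen: Instant)

object FailedMountRecency {
  // Count-only rule: flags the pod even if the event stopped recurring long ago.
  def countOnly(events: Seq[TimedPodEvent], threshold: Int): Boolean =
    events.exists(e => e.reason.contains("FailedMount") && e.count > threshold)

  // Assumed mitigation: additionally require the event to have been seen within maxAge of now.
  def stillStuck(
      events: Seq[TimedPodEvent],
      threshold: Int,
      now: Instant,
      maxAge: Duration): Boolean =
    events.exists { e =>
      e.reason.contains("FailedMount") &&
      e.count > threshold &&
      Duration.between(e.lastSeen, now).compareTo(maxAge) <= 0
    }
}
```

A pod whose last `FailedMount` event is several hours old would still be flagged by `countOnly` but not by `stillStuck` with a short `maxAge`. Whether such a guard is worth the extra complexity depends on how often a recovered pod actually occurs.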
Why are the changes needed?
When the Kyuubi server pod crashes, some Spark driver pods may get stuck in the FailedMount state.
This PR adds a periodic check to clean up these pods.
How was this patch tested?
Deployed and tested in our environment. We observed that FailedMount pods are deleted after the configured time.
Was this patch authored or co-authored using generative AI tooling?
Yes, Cursor was used.