Description
Is there an existing issue for this?
- I have searched the existing issues
Version
0.10.3
What happened?
When the ShadowPod in the provider cluster is deleted, either directly or because the offloaded namespace was recreated, Liqo moves the corresponding local pods out of their terminal phases (Succeeded and Failed) and back into a Pending state. We have seen pods that finished over 40 days ago come back to Pending and run again because of this bug. This is a major issue for batch workloads and one-shot pods, which either run again or clog the queue until they do. The apparent root cause is that the workload namespace mapper does not check whether the local pod is in a terminal phase when it detects that a ShadowPod was deleted. Interestingly, the fallback handler does correctly ignore pods in the Succeeded phase, but not pods in the Failed phase.
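For context, here is a minimal sketch of the kind of guard that appears to be missing. The names (`isPodTerminal`, `handleShadowPodDeletion`, the `enqueue` callback) are illustrative assumptions, not Liqo's actual API:

```go
package workload

import (
	corev1 "k8s.io/api/core/v1"
)

// isPodTerminal reports whether a pod has finished for good.
// Both Succeeded and Failed must be covered; handling only
// Succeeded reproduces the asymmetry described above.
func isPodTerminal(pod *corev1.Pod) bool {
	return pod.Status.Phase == corev1.PodSucceeded ||
		pod.Status.Phase == corev1.PodFailed
}

// handleShadowPodDeletion sketches where the check would belong:
// return early instead of re-enqueueing the pod for reflection.
func handleShadowPodDeletion(localPod *corev1.Pod, enqueue func(*corev1.Pod)) {
	if isPodTerminal(localPod) {
		// The pod already ran to completion (or failed); recreating
		// the ShadowPod would resurrect a finished workload.
		return
	}
	enqueue(localPod)
}
```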
Relevant log output
How can we reproduce the issue?
- Using a Kubernetes batch Job definition (a minimal example is sketched below), launch a pod that gets reflected from the consumer to the provider cluster and wait for it to reach the Completed state.
- Find the corresponding ShadowPod definition in the provider cluster and delete it.
-> You should see the source pod immediately go back into a Pending or Running state and the ShadowPod get recreated.
You can trigger the same behavior for failed pods by creating a Job pod that exits with a non-zero code, waiting for it to reach the Failed phase, and then deleting the corresponding ShadowPod in the provider cluster.
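A minimal Job manifest that exercises the Succeeded case; the Job name and namespace are placeholders, and the namespace is assumed to already be offloaded by Liqo:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: liqo-terminal-pod-repro
  namespace: offloaded-ns   # placeholder: a namespace offloaded by Liqo
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: task
          image: busybox:1.36
          # Exit 0 to reproduce the Succeeded case; change to
          # "exit 1" to reproduce the Failed case instead.
          command: ["sh", "-c", "echo done && exit 0"]
```

Once the Job completes, deleting the corresponding ShadowPod in the mapped namespace on the provider cluster (for instance with `kubectl delete shadowpods <pod-name>`, assuming the ShadowPod CRD is queryable as `shadowpods`) should reproduce the resurrection.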
Provider or distribution
EKS, K3s
CNI version
No response
Kernel Version
No response
Kubernetes Version
1.31
Code of Conduct
- I agree to follow this project's Code of Conduct