Stuck hooks issue when a sync task contains a Job resource with a ttlSecondsAfterFinished field set #21055
Description
Checklist:
- I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
- I've included steps to reproduce the bug.
- I've pasted the output of argocd version.
Describe the bug
It is pretty common to implement init logic in Kubernetes Job resources and use some kind of hooks to run the init logic before the main application.
An example in Helm would be to annotate the init Job with Helm hooks (i.e. "helm.sh/hook": pre-install and "helm.sh/hook-delete-policy": hook-succeeded,hook-failed) so that, for example, migrations run before the Deployment resource creates the actual application.
ArgoCD has a long-standing bug: if a Job has ttlSecondsAfterFinished set to 0 or a low value, the Job gets deleted before ArgoCD can mark the hook phase as completed, so the sync gets stuck in the hook phase and cannot progress further.
The infinite loop happens in this part of the code.
When the Job resource gets deleted by the Job controller because of the expired TTL, the syncTask for the hook no longer has a liveObject, so it cannot call the getOperationPhase function here to get the updated status.
The bug happens in the gitops-engine rather than core ArgoCD.
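The stuck state described above can be modelled with a small, self-contained sketch. This is not the gitops-engine code: the types syncTask and liveObject and the functions refreshPhase and reconcile are simplified, hypothetical stand-ins, and the only behaviour modelled is that a task's phase is refreshed solely from its live object.

```go
package main

import "fmt"

// Phase is a simplified stand-in for the operation phase of a sync task.
type Phase string

const (
	Running   Phase = "Running"
	Succeeded Phase = "Succeeded"
)

// liveObject is a hypothetical stand-in for the Job as seen in the cluster.
type liveObject struct {
	succeeded bool
}

// syncTask is a simplified model of a hook task (assumption: the real type
// carries much more state).
type syncTask struct {
	name  string
	live  *liveObject // nil once the Job has been deleted after its TTL
	phase Phase
}

// refreshPhase updates the phase only when a live object is available,
// which is the behaviour described in the issue.
func refreshPhase(t *syncTask) {
	if t.live == nil {
		return // nothing to inspect, so the phase never changes
	}
	if t.live.succeeded {
		t.phase = Succeeded
	}
}

// reconcile runs up to maxRounds passes. finishAt is the round in which the
// Job finishes; deleteAt is the round in which it is removed because of TTL.
func reconcile(finishAt, deleteAt, maxRounds int) (Phase, int) {
	job := &liveObject{}
	task := &syncTask{name: "hello-world-job", live: job, phase: Running}
	for round := 0; round < maxRounds; round++ {
		if round == finishAt {
			job.succeeded = true
		}
		if round == deleteAt {
			task.live = nil
		}
		refreshPhase(task)
		if task.phase != Running {
			return task.phase, round
		}
	}
	return task.phase, maxRounds
}

func main() {
	// TTL of 0: the Job finishes and is deleted before the next pass sees it.
	phase, rounds := reconcile(2, 2, 10)
	fmt.Printf("ttl=0: phase=%s after %d rounds\n", phase, rounds)

	// Longer TTL: the finished Job is still present when the pass runs.
	phase, rounds = reconcile(2, 5, 10)
	fmt.Printf("longer ttl: phase=%s in round %d\n", phase, rounds)
}
```

In this model the first call never leaves the Running phase no matter how many rounds are allowed, while the second completes as soon as the finished Job is observed.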
This issue has been mentioned in a couple of places:
- waiting for completion of hook and hook never succeds #6880
- fix: set post-install-hook ttlSecondsAfterFinished to 60 (#6765) aws/karpenter-provider-aws#6825
- ArgoCD stuck in waiting for completion of hook batch/Job/argocd-redis-secret-init argo-helm#2887
- https://github.com/G-Research/gr-oss/issues/316
To Reproduce
The Helm chart used for testing can be found here (see the repoURL and path in the Application manifest below).
The chart has the following resources:
- Deployment
- Job with a ttlSecondsAfterFinished field and helm hook & delete policy.
- Install any version of ArgoCD or run it locally using make start-local
- Create the following Application
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: argocd-test
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/dejanzele/argocd-hook-test
    targetRevision: HEAD
    path: hooks
    helm:
      releaseName: argocd-test
      values: |
        job:
          sleepSeconds: 15
          exitCode: 0
          ttlSecondsAfterFinished: 0
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd-test
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
- Open the UI, then open the created Application and notice it is constantly syncing and the message is waiting for completion of hook batch/Job/hello-world-job
Expected behavior
PreSync hook completes successfully and the Sync progresses to Healthy.
Version
argocd commit 730363f
gitops-engine commit 0371401803996f84bcd70a5f6bb2f0ecc7d7b5d2