Stuck hooks issue when a sync task contains a Job resource with a ttlSecondsAfterFinished field set #21055

Open
argoproj/gitops-engine
#646
@dejanzele

Description

Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

It is common to implement init logic in a Kubernetes Job resource and use hooks to run that logic before the main application starts.
In Helm, for example, we can annotate the init Job with Helm hooks (e.g. "helm.sh/hook": pre-install and "helm.sh/hook-delete-policy": hook-succeeded,hook-failed) so that migrations run before the Deployment resource creates the actual application.

ArgoCD has a long-standing bug: if a Job has ttlSecondsAfterFinished set to 0 or a low value, the Job is deleted before ArgoCD can mark the hook phase as completed, so the sync gets stuck in the hook phase and cannot progress further.
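
For illustration, a Job of the kind described above might look roughly like the following. This is a minimal sketch, not the manifest from the test chart: the Job name, image, and command are hypothetical placeholders, and only the two Helm annotations and the ttlSecondsAfterFinished field come from this report.

```yaml
# Hypothetical init Job; name, image, and command are placeholders
# and are NOT taken from the chart used to reproduce this issue.
apiVersion: batch/v1
kind: Job
metadata:
  name: example-init-job
  annotations:
    # Helm hook annotations quoted in the description above.
    "helm.sh/hook": pre-install
    "helm.sh/hook-delete-policy": hook-succeeded,hook-failed
spec:
  # With 0, the Job becomes eligible for deletion by the TTL mechanism
  # as soon as it finishes, which is the condition that triggers the bug.
  ttlSecondsAfterFinished: 0
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: init
          image: busybox:1.36
          command: ["sh", "-c", "echo running init logic; sleep 15"]
```

The key combination is a hook annotation together with a TTL that is short compared to how quickly the sync operation observes the finished Job.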

The infinite loop happens in this part of the code.

When the Job resource gets deleted by the Job controller because of expired TTL, the syncTask for the hook does not have a liveObject anymore, and it cannot call the getOperationPhase function here to get the updated status.

The bug happens in the gitops-engine rather than core ArgoCD.

This issue has been mentioned in a couple of places:

To Reproduce

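The Helm chart used for testing is the one referenced by the Application manifest below (https://github.com/dejanzele/argocd-hook-test, path hooks).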
The chart has the following resources:

  • Deployment
  • Job with a ttlSecondsAfterFinished field, a Helm hook annotation, and a hook delete policy

  1. Install any version of ArgoCD or run it locally using make start-local
  2. Create the following Application:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: argocd-test
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/dejanzele/argocd-hook-test
    targetRevision: HEAD
    path: hooks
    helm:
      releaseName: argocd-test
      values: |
        job:
          sleepSeconds: 15
          exitCode: 0
          ttlSecondsAfterFinished: 0
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd-test
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
  3. Open the UI, then open the created Application and notice that it is constantly syncing with the message "waiting for completion of hook batch/Job/hello-world-job".

Expected behavior

PreSync hook completes successfully and the Sync progresses to Healthy.

Version

argocd commit 730363f
gitops-engine commit 0371401803996f84bcd70a5f6bb2f0ecc7d7b5d2

    Labels

    bug (Something isn't working), component:hooks, version:2.14 (Latest confirmed affected version is 2.14)
