
{{pod.name}} value different when using retryStrategy within templateDefaults #13691

Closed
@paulfouquet

Description


Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

Using Argo Workflows v3.5.5 on an AWS EKS cluster running Kubernetes 1.28, we had a task that used {{pod.name}} to generate a path to our S3 artifact bucket. The value returned by {{pod.name}} had the form [WORKFLOW NAME]-[TASK NAME]-[POD ID] (the same value as the directory created under our S3 artifact bucket for the pod running the task).
We then upgraded our cluster from EKS 1.28 to 1.29. Since this upgrade, {{pod.name}} returns a string of the form [WORKFLOW NAME]-[TASK NAME]-[TASK ID] (TASK ID instead of POD ID), but the directory created under our S3 artifact bucket is still named [WORKFLOW NAME]-[TASK NAME]-[POD ID], as it was prior to the upgrade. Note that we did not upgrade Argo Workflows and did not delete the cluster before upgrading.

We first suspected that we might have been in an odd state where Argo Workflows was using POD_NAMES=v1 (left over from a previous Argo Workflows version on the cluster), and that the cluster upgrade had restarted an Argo Workflows server / controller node with the new default POD_NAMES=v2. However, our tests with POD_NAMES=v1 show that {{pod.name}} now returns a string of the form [TASK NAME]-[TASK ID].
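For reference, the POD_NAMES=v1 test was done by setting the environment variable on the controller, roughly along these lines (a minimal deployment fragment; POD_NAMES with values v1/v2 is the documented controller setting, the rest of the manifest is trimmed to the relevant part):

```yaml
# workflow-controller deployment fragment: force v1 pod naming.
# Only the env entry is the point here; other fields are elided.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: workflow-controller
  namespace: argo
spec:
  template:
    spec:
      containers:
        - name: workflow-controller
          env:
            - name: POD_NAMES
              value: v1
```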

The strange thing is that we then deployed a brand new cluster, using EKS 1.28 and Argo Workflows v3.5.5, and encountered the same issue: {{pod.name}} returns [WORKFLOW NAME]-[TASK NAME]-[TASK ID] while the S3 artifact bucket folder is [WORKFLOW NAME]-[TASK NAME]-[POD ID].

The same behaviour is also observed with EKS 1.30 and Argo Workflows v3.5.11.

This issue initially seemed hard to reproduce. I am linking our public repository that uses Argo Workflows, with the IaC for our cluster: https://github.com/linz/topo-workflows/tree/master/infra — the task that uses {{pod.name}} is here.

The issue is reproduced using the minimal workflow below:

> argo logs test-pod-name-plf6v test-pod-name-plf6v-get-pod-name-1010547582 -n argo
test-pod-name-plf6v-get-pod-name-1010547582: pod.name=test-pod-name-plf6v-get-pod-name-3577080563

The value returned by {{pod.name}} is not the one we expected; it appears to resolve to the ID of the retry node rather than the pod node. For example: pod.name=test-pod-name-8zrgl-get-pod-name-2484369038, where we expected test-pod-name-8zrgl-get-pod-name-121682877.
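For what it's worth, the numeric suffixes look like 32-bit FNV-1a hashes of the node names (as far as I can tell, that is how the controller derives node IDs). A minimal Node.js sketch, assuming that hash, shows how the retry parent and its pod child get different suffixes; the two node names below are taken from the controller logs:

```javascript
// FNV-1a 32-bit hash, which (per my reading of the controller code)
// is what turns a node name into the numeric ID suffix.
function fnv1a32(s) {
  let h = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);          // XOR in the next byte (ASCII here)
    h = Math.imul(h, 0x01000193);  // multiply by the FNV prime, mod 2^32
  }
  return h >>> 0; // as unsigned 32-bit
}

// Node names from the controller logs above: the retry parent node
// and its child node (the node the pod is actually created for).
const retryNode = 'test-pod-name-plf6v(0).get-pod-name';
const podNode = 'test-pod-name-plf6v(0).get-pod-name(0)';

// The two names hash to different suffixes, which would explain why
// {{pod.name}} and the created pod's name disagree when the template
// variable is resolved against the retry node instead of the pod node.
console.log('retry node suffix:', fnv1a32(retryNode));
console.log('pod node suffix:  ', fnv1a32(podNode));
```

This is only a sketch of the suspected mechanism, not a claim about where in the controller the wrong node is picked.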

Thank you for your help.

Version(s)

v3.5.5
v3.5.11

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: test-pod-name-
spec:
  templateDefaults:
    retryStrategy:
      limit: '2'
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: get-pod-name
            template: get-pod-name
    - name: get-pod-name
      script:
        image: 'ghcr.io/linz/argo-tasks:latest'
        command: [node]
        source: |
          console.log('pod.name={{pod.name}}');
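Since the issue title suggests the behaviour is specific to templateDefaults, it may be worth comparing against a variant (untested beyond our environment) that sets retryStrategy on the template itself:

```yaml
# Comparison variant: retryStrategy moved from templateDefaults
# onto the template. Useful to check whether {{pod.name}} resolves
# correctly when the retry node is introduced this way instead.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: test-pod-name-
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: get-pod-name
            template: get-pod-name
    - name: get-pod-name
      retryStrategy:
        limit: '2'
      script:
        image: 'ghcr.io/linz/argo-tasks:latest'
        command: [node]
        source: |
          console.log('pod.name={{pod.name}}');
```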

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep test-pod-name-plf6v
time="2024-10-13T20:04:48.996Z" level=info msg="Processing workflow" Phase= ResourceVersion=959 namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:49.000Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=test-pod-name-plf6v
time="2024-10-13T20:04:49.000Z" level=info msg="Updated phase  -> Running" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:49.000Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:49.000Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:49.000Z" level=info msg="Retry node test-pod-name-plf6v initialized Running" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:49.000Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:49.000Z" level=info msg="DAG node test-pod-name-plf6v-4260280029 initialized Running" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:49.000Z" level=warning msg="was unable to obtain the node for test-pod-name-plf6v-3577080563, taskName get-pod-name"
time="2024-10-13T20:04:49.001Z" level=warning msg="was unable to obtain the node for test-pod-name-plf6v-3577080563, taskName get-pod-name"
time="2024-10-13T20:04:49.001Z" level=info msg="All of node test-pod-name-plf6v(0).get-pod-name dependencies [] completed" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:49.001Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:49.001Z" level=info msg="Retry node test-pod-name-plf6v-3577080563 initialized Running" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:49.001Z" level=info msg="Pod node test-pod-name-plf6v-1010547582 initialized Pending" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:49.015Z" level=info msg="Created pod: test-pod-name-plf6v(0).get-pod-name(0) (test-pod-name-plf6v-get-pod-name-1010547582)" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:49.015Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:49.015Z" level=info msg=reconcileAgentPod namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:49.026Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=964 workflow=test-pod-name-plf6v
time="2024-10-13T20:04:59.017Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=964 namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:59.017Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=test-pod-name-plf6v
time="2024-10-13T20:04:59.018Z" level=info msg="node changed" namespace=argo new.message= new.phase=Succeeded new.progress=0/1 nodeID=test-pod-name-plf6v-1010547582 old.message= old.phase=Pending old.progress=0/1 workflow=test-pod-name-plf6v
time="2024-10-13T20:04:59.018Z" level=info msg="node test-pod-name-plf6v-3577080563 phase Running -> Succeeded" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:59.018Z" level=info msg="node test-pod-name-plf6v-3577080563 finished: 2024-10-13 20:04:59.018963346 +0000 UTC" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:59.019Z" level=info msg="Outbound nodes of test-pod-name-plf6v-4260280029 set to [test-pod-name-plf6v-1010547582]" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:59.019Z" level=info msg="node test-pod-name-plf6v-4260280029 phase Running -> Succeeded" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:59.019Z" level=info msg="node test-pod-name-plf6v-4260280029 finished: 2024-10-13 20:04:59.019062436 +0000 UTC" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:59.019Z" level=info msg="node test-pod-name-plf6v phase Running -> Succeeded" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:59.019Z" level=info msg="node test-pod-name-plf6v finished: 2024-10-13 20:04:59.019267226 +0000 UTC" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:59.019Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:59.019Z" level=info msg=reconcileAgentPod namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:59.019Z" level=info msg="Updated phase Running -> Succeeded" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:59.019Z" level=info msg="Marking workflow completed" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:59.046Z" level=info msg="Workflow update successful" namespace=argo phase=Succeeded resourceVersion=1002 workflow=test-pod-name-plf6v
time="2024-10-13T20:04:59.062Z" level=info msg="cleaning up pod" action=labelPodCompleted key=argo/test-pod-name-plf6v-get-pod-name-1010547582/labelPodCompleted

Logs from in your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=test-pod-name-plf6v
time="2024-10-13T20:04:51.074Z" level=info msg="Starting Workflow Executor" version=v3.5.11
time="2024-10-13T20:04:51.078Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2024-10-13T20:04:51.078Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=argo podName=test-pod-name-plf6v-get-pod-name-1010547582 templateName=get-pod-name version="&Version{Version:v3.5.11,BuildDate:2024-09-20T14:09:00Z,GitCommit:25bbb71cced32b671f9ad35f0ffd1f0ddb8226ee,GitTag:v3.5.11,GitTreeState:clean,GoVersion:go1.21.13,Compiler:gc,Platform:linux/amd64,}"
time="2024-10-13T20:04:51.085Z" level=info msg="Starting deadline monitor"
time="2024-10-13T20:04:53.087Z" level=info msg="Main container completed" error="<nil>"
time="2024-10-13T20:04:53.087Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2024-10-13T20:04:53.087Z" level=info msg="No output parameters"
time="2024-10-13T20:04:53.087Z" level=info msg="No output artifacts"
time="2024-10-13T20:04:53.097Z" level=info msg="Alloc=7908 TotalAlloc=13631 Sys=25189 NumGC=4 Goroutines=8"
time="2024-10-13T20:04:53.102Z" level=info msg="Deadline monitor stopped"

Metadata

Assignees

No one assigned

    Labels

    • P3: Low priority
    • area/controller: Controller issues, panics
    • area/templating: Templating with `{{...}}`
    • solution/suggested: A solution to the bug has been suggested. Someone needs to implement it.
    • type/bug
