{{pod.name}} inconsistency #13691

Open
2 of 4 tasks
paulfouquet opened this issue Oct 2, 2024 · 0 comments
Labels
area/controller (Controller issues, panics) · area/templating (Templating with `{{...}}`) · P3 (Low priority) · type/bug

Comments


paulfouquet commented Oct 2, 2024

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

Using Argo Workflows v3.5.5 on an AWS EKS 1.28 cluster, we had a task that used {{pod.name}} to generate a path to our S3 artifact bucket. The value returned by {{pod.name}} was composed of [WORKFLOW NAME]-[TASK NAME]-[POD ID], the same value as the directory created under our S3 artifact bucket for the pod running our task.
We then upgraded the cluster from EKS 1.28 to 1.29. Since this upgrade, {{pod.name}} returns a string composed of [WORKFLOW NAME]-[TASK NAME]-[TASK ID] (TASK ID instead of POD ID), but the directory created under our S3 artifact bucket is still named [WORKFLOW NAME]-[TASK NAME]-[POD ID], as it was prior to the upgrade. Note that we did not upgrade Argo Workflows, and we did not delete the cluster prior to the upgrade.
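For illustration, this is a minimal sketch of the pattern we rely on; the artifact name, bucket, path, and key layout below are hypothetical, not our actual configuration:

    - name: get-location
      outputs:
        artifacts:
          - name: result
            path: /tmp/result.json
            s3:
              # hypothetical bucket and key; the key is built from {{pod.name}},
              # so we expect it to match the pod's actual generated name
              bucket: example-artifact-bucket
              key: 'outputs/{{pod.name}}/result.json'
      script:
        image: 'ghcr.io/linz/argo-tasks:latest'
        command: [node]
        source: |
          console.log('{{pod.name}}');

The problem is that the key rendered from {{pod.name}} and the directory actually written by the executor no longer agree.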

We first suspected we had been in a stale state where Argo Workflows was using POD_NAMES=v1 (left over from a previous Argo Workflows version on the cluster), and that the cluster upgrade had restarted the Argo Workflows server / controller pods with the new default POD_NAMES=v2. However, our tests with POD_NAMES=v1 show that {{pod.name}} then returns a string composed of [TASK NAME]-[TASK ID].
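For reference, this is how we set POD_NAMES during those tests, as an environment variable on the workflow-controller Deployment (a sketch only; our actual manifests are managed by the IaC linked below):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: argo-workflows-workflow-controller
  namespace: argo
spec:
  template:
    spec:
      containers:
        - name: workflow-controller
          env:
            # v1 = legacy pod naming, v2 = names containing the template name (current default)
            - name: POD_NAMES
              value: v1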

The strange part is that we then deployed a brand new cluster with EKS 1.28 and Argo Workflows v3.5.5, and we still encounter the issue: {{pod.name}} returns [WORKFLOW NAME]-[TASK NAME]-[TASK ID] while the S3 artifact bucket folder is [WORKFLOW NAME]-[TASK NAME]-[POD ID].

The same behaviour is also observed with EKS 1.30 and Argo Workflows v3.5.11.

This issue may be hard to reproduce. I am linking our public repository containing the IaC for our cluster, which uses Argo Workflows: https://github.com/linz/topo-workflows/tree/master/infra; the task that uses {{pod.name}} is here.

Thank you for your help.

Version(s)

v3.5.5
v3.5.11

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: test-location-
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: env
            template: env
          - name: get-location
            template: get-location
    - name: env
      container:
        image: ghcr.io/linz/topo-imagery:latest
        command: [env]

    - name: get-location
      script:
        image: 'ghcr.io/linz/argo-tasks:latest'
        command: [node]
        source: |
          console.log('{{pod.name}}');
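To reproduce, the workflow can be submitted and inspected with the Argo CLI (assuming it is saved as test-location.yaml; the filename is arbitrary):

argo submit -n argo --watch test-location.yaml
# print the {{pod.name}} value logged by the get-location step
argo logs -n argo @latest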

Logs from the workflow controller

kubectl logs -n argo deploy/argo-workflows-workflow-controller | grep test-location-kxlhj
Found 2 pods, using pod/argo-workflows-workflow-controller-58d49f9597-264td
time="2024-10-02T02:22:02.099Z" level=info msg="Processing workflow" Phase= ResourceVersion=71837 namespace=argo workflow=test-location-kxlhj
time="2024-10-02T02:22:02.107Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=test-location-kxlhj
time="2024-10-02T02:22:02.107Z" level=info msg="Updated phase  -> Running" namespace=argo workflow=test-location-kxlhj
time="2024-10-02T02:22:02.107Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo workflow=test-location-kxlhj
time="2024-10-02T02:22:02.107Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=argo workflow=test-location-kxlhj
time="2024-10-02T02:22:02.107Z" level=info msg="Retry node test-location-kxlhj initialized Running" namespace=argo workflow=test-location-kxlhj
time="2024-10-02T02:22:02.107Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=argo workflow=test-location-kxlhj
time="2024-10-02T02:22:02.107Z" level=info msg="DAG node test-location-kxlhj-205658360 initialized Running" namespace=argo workflow=test-location-kxlhj
time="2024-10-02T02:22:02.107Z" level=warning msg="was unable to obtain the node for test-location-kxlhj-2554508165, taskName env"
time="2024-10-02T02:22:02.108Z" level=warning msg="was unable to obtain the node for test-location-kxlhj-2554508165, taskName env"
time="2024-10-02T02:22:02.108Z" level=info msg="All of node test-location-kxlhj(0).env dependencies [] completed" namespace=argo workflow=test-location-kxlhj
time="2024-10-02T02:22:02.108Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo workflow=test-location-kxlhj
time="2024-10-02T02:22:02.108Z" level=info msg="Retry node test-location-kxlhj-2554508165 initialized Running" namespace=argo workflow=test-location-kxlhj
time="2024-10-02T02:22:02.108Z" level=info msg="Pod node test-location-kxlhj-3653376484 initialized Pending" namespace=argo workflow=test-location-kxlhj
time="2024-10-02T02:22:02.159Z" level=info msg="Created pod: test-location-kxlhj(0).env(0) (test-location-kxlhj-env-3653376484)" namespace=argo workflow=test-location-kxlhj
time="2024-10-02T02:22:02.159Z" level=warning msg="was unable to obtain the node for test-location-kxlhj-2381258082, taskName get-location"
time="2024-10-02T02:22:02.159Z" level=warning msg="was unable to obtain the node for test-location-kxlhj-2381258082, taskName get-location"
time="2024-10-02T02:22:02.159Z" level=info msg="All of node test-location-kxlhj(0).get-location dependencies [] completed" namespace=argo workflow=test-location-kxlhj
time="2024-10-02T02:22:02.159Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo workflow=test-location-kxlhj
time="2024-10-02T02:22:02.159Z" level=info msg="Retry node test-location-kxlhj-2381258082 initialized Running" namespace=argo workflow=test-location-kxlhj
time="2024-10-02T02:22:02.159Z" level=info msg="Pod node test-location-kxlhj-4220356185 initialized Pending" namespace=argo workflow=test-location-kxlhj
time="2024-10-02T02:22:02.237Z" level=info msg="Created pod: test-location-kxlhj(0).get-location(0) (test-location-kxlhj-get-location-4220356185)" namespace=argo workflow=test-location-kxlhj
time="2024-10-02T02:22:02.237Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=test-location-kxlhj
time="2024-10-02T02:22:02.237Z" level=info msg=reconcileAgentPod namespace=argo workflow=test-location-kxlhj
time="2024-10-02T02:22:02.250Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=71845 workflow=test-location-kxlhj
time="2024-10-02T02:22:12.161Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=71845 namespace=argo workflow=test-location-kxlhj
time="2024-10-02T02:22:12.161Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=2 workflow=test-location-kxlhj
time="2024-10-02T02:22:12.161Z" level=info msg="task-result changed" namespace=argo nodeID=test-location-kxlhj-4220356185 workflow=test-location-kxlhj
time="2024-10-02T02:22:12.162Z" level=info msg="task-result changed" namespace=argo nodeID=test-location-kxlhj-3653376484 workflow=test-location-kxlhj
time="2024-10-02T02:22:12.162Z" level=info msg="node changed" namespace=argo new.message= new.phase=Succeeded new.progress=0/1 nodeID=test-location-kxlhj-4220356185 old.message= old.phase=Pending old.progress=0/1 workflow=test-location-kxlhj
time="2024-10-02T02:22:12.162Z" level=info msg="node changed" namespace=argo new.message= new.phase=Succeeded new.progress=0/1 nodeID=test-location-kxlhj-3653376484 old.message= old.phase=Pending old.progress=0/1 workflow=test-location-kxlhj
time="2024-10-02T02:22:12.162Z" level=info msg="node test-location-kxlhj-2554508165 phase Running -> Succeeded" namespace=argo workflow=test-location-kxlhj
time="2024-10-02T02:22:12.162Z" level=info msg="node test-location-kxlhj-2554508165 finished: 2024-10-02 02:22:12.162630489 +0000 UTC" namespace=argo workflow=test-location-kxlhj
time="2024-10-02T02:22:12.162Z" level=info msg="node test-location-kxlhj-2381258082 phase Running -> Succeeded" namespace=argo workflow=test-location-kxlhj
time="2024-10-02T02:22:12.162Z" level=info msg="node test-location-kxlhj-2381258082 finished: 2024-10-02 02:22:12.162829644 +0000 UTC" namespace=argo workflow=test-location-kxlhj
time="2024-10-02T02:22:12.162Z" level=info msg="Outbound nodes of test-location-kxlhj-205658360 set to [test-location-kxlhj-3653376484 test-location-kxlhj-4220356185]" namespace=argo workflow=test-location-kxlhj
time="2024-10-02T02:22:12.162Z" level=info msg="node test-location-kxlhj-205658360 phase Running -> Succeeded" namespace=argo workflow=test-location-kxlhj
time="2024-10-02T02:22:12.162Z" level=info msg="node test-location-kxlhj-205658360 finished: 2024-10-02 02:22:12.162956927 +0000 UTC" namespace=argo workflow=test-location-kxlhj
time="2024-10-02T02:22:12.163Z" level=info msg="node test-location-kxlhj phase Running -> Succeeded" namespace=argo workflow=test-location-kxlhj
time="2024-10-02T02:22:12.163Z" level=info msg="node test-location-kxlhj finished: 2024-10-02 02:22:12.163150685 +0000 UTC" namespace=argo workflow=test-location-kxlhj
time="2024-10-02T02:22:12.163Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=test-location-kxlhj
time="2024-10-02T02:22:12.163Z" level=info msg=reconcileAgentPod namespace=argo workflow=test-location-kxlhj
time="2024-10-02T02:22:12.163Z" level=info msg="Updated phase Running -> Succeeded" namespace=argo workflow=test-location-kxlhj
time="2024-10-02T02:22:12.163Z" level=info msg="Marking workflow completed" namespace=argo workflow=test-location-kxlhj
time="2024-10-02T02:22:12.168Z" level=info msg="cleaning up pod" action=deletePod key=argo/test-location-kxlhj-1340600742-agent/deletePod
time="2024-10-02T02:22:12.177Z" level=info msg="Workflow update successful" namespace=argo phase=Succeeded resourceVersion=71923 workflow=test-location-kxlhj
time="2024-10-02T02:22:17.210Z" level=info msg="cleaning up pod" action=deletePod key=argo/test-location-kxlhj-env-3653376484/deletePod
time="2024-10-02T02:22:17.210Z" level=info msg="cleaning up pod" action=deletePod key=argo/test-location-kxlhj-get-location-4220356185/deletePod
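For comparison, the actual pod names can be listed while the workflow is still running (a standard kubectl invocation, using the same label selector as below):

kubectl get pods -n argo -l workflows.argoproj.io/workflow=test-location-kxlhj -o name
# expected output, matching the "Created pod" entries in the controller logs above:
#   pod/test-location-kxlhj-env-3653376484
#   pod/test-location-kxlhj-get-location-4220356185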

Logs from your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=test-location-kxlhj,workflow.argoproj.io/phase!=Succeeded
No resources found in argo namespace.
@shuangkun shuangkun added the area/controller Controller issues, panics label Oct 2, 2024
@agilgur5 agilgur5 changed the title to {{pod.name}} inconsistency Oct 2, 2024
@agilgur5 agilgur5 added area/templating Templating with `{{...}}` P3 Low priority labels Oct 5, 2024