Input artifact load fails ("rename ... file exists") when destination is a volume mount root on ContainerSet templates #16144

@j-walther

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. v4.0.5, gitCommit=0ab1452144d8f4d57c50b37ce50dad218868e950) and can confirm the issue still exists
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself

What happened? What did you expect to happen?

When a ContainerSet template's input artifact path coincides with one of the containerSet's volumeMounts.mountPath values, the artifact init container fails with:

artifact <name> failed to load: rename /mainctrfs/<mount>.tmp /mainctrfs/<mount>: file exists

The executor stages the loaded artifact at artPath + ".tmp" and then os.Renames it onto the mount. The staging sibling lives on the init container's local filesystem, while the destination is a volume mount on a different filesystem — the kernel refuses the rename (seen as EEXIST on Linux/overlayfs in our reproducer; EXDEV, EBUSY, and ENOTEMPTY are also possible depending on the kernel and underlying filesystem). The same problem affects unpack for tar/zip artifacts (it renames <destPath>.tmpdir onto destPath).

The bug is independent of the artifact driver: raw: is enough to trigger it, as is git:, s3:, gcs:, http:, plugin:, etc.

For the plain Container template form, the validator at workflow/validate/validate.go:789-803 rejects this configuration upfront (already mounted in container.volumeMounts.<name>). That check was presumably added because of this same rename bug. It is not applied to ContainerSet templates: the validate path at lines 819-829 only calls tmpl.ContainerSet.Validate(), which checks intra-set mount collisions, not artifact-vs-mount overlap. So ContainerSet templates pass validation, attempt the doomed rename, and fail at runtime in the init container.

Expected: Either the artifact lands inside the volume mount (the existing source comment on the overlap branch says "extracting to volume mount"), or the validator rejects the config consistently for all template shapes.

Actual: ContainerSet templates with an artifact whose path equals a mount path are accepted, the pod is created, the artifact init container fails with the rename error, and the workflow never starts its main container.

Related but distinct issues

Version(s)

v4.0.5, gitCommit=0ab1452144d8f4d57c50b37ce50dad218868e950. The same code shape exists on main at 6cc0d115b... and on release-4.0.

Paste a minimal workflow that reproduces the issue.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: rename-onto-mount-repro-
spec:
  entrypoint: main
  templates:
    - name: main
      inputs:
        artifacts:
          - name: data
            path: /opt/workspace        # <-- same as the volumeMount below
            raw:
              data: "hello world"
      volumes:
        - name: workspace
          emptyDir: {}
      containerSet:
        volumeMounts:
          - name: workspace
            mountPath: /opt/workspace   # <-- same as inputs.artifacts[0].path
        containers:
          - name: main
            image: alpine:3.20
            command: [sh, -c]
            args: ["ls -la /opt/workspace"]

Workflow phase ends as Error with message: init: Error (exit code 64): rename /mainctrfs/opt/workspace.tmp /mainctrfs/opt/workspace: file exists.

(Switching this to a Container template instead of ContainerSet will be rejected by the validator with "templates.main.inputs.artifacts[0].path '/opt/workspace' already mounted in container.volumeMounts.workspace", demonstrating that the validator already knows this case can't work — it just doesn't cover ContainerSet.)

Logs from the workflow controller

The controller accepts the workflow and creates the pod normally. The failure surfaces at the init-container level (below). The controller log entry for the failed node:

level=ERROR msg="Workflow has failed" message="init: Error (exit code 64): rename /mainctrfs/opt/workspace.tmp /mainctrfs/opt/workspace: file exists"

Logs from your workflow's wait container

The wait container never starts because the artifact init container exits non-zero. The init container's logs:

time=... level=INFO msg="Starting Workflow Executor" version=v4.0.5 gitCommit=0ab1452144d8f4d57c50b37ce50dad218868e950
time=... level=INFO msg="Start loading input artifacts..."
time=... level=INFO msg="Downloading artifact" name=data
time=... level=INFO msg="Specified artifact path overlaps with volume mount, extracting to volume mount" path=/opt/workspace mountPath=/opt/workspace
time=... level=INFO msg="Loading artifact"
time=... level=INFO msg="Load artifact" artifactName=data error=<nil> key=""
time=... level=INFO msg="Detecting if file is a tarball" path=/mainctrfs/opt/workspace.tmp
time=... level=ERROR msg="executor error" error="rename /mainctrfs/opt/workspace.tmp /mainctrfs/opt/workspace: file exists"
Error: rename /mainctrfs/opt/workspace.tmp /mainctrfs/opt/workspace: file exists

Root cause

workflow/executor/executor.go::loadArtifact (and the same shape in unpack) promotes the loaded artifact via os.Rename(temp, dest). When dest is a volume mount, temp is a sibling on the init container's local filesystem — different filesystem from the mount, so the kernel refuses. The check at workflow/validate/validate.go:789-803 papers over this for tmpl.Container but is not applied to tmpl.ContainerSet (or, in principle, any other template shape that produces a pod with volume mounts).

Proposed fix

Make the executor's rename actually work for the overlap case. Add a renameOrMerge helper that tries os.Rename first and, on the specific errnos that flag this cross-filesystem-onto-mount case (EXDEV, EBUSY, EEXIST, ENOTEMPTY), falls back to recursively copying source contents into the destination and removing the source. Use it at the three artifact-promotion call sites: the non-tar/non-zip branch in loadArtifact and both rename branches in unpack. The untar/unzip extraction into a sibling .tmpdir is unchanged; only the final move onto the destination switches to the helper.

PR attached as a follow-up. Tests cover the fast-path atomic rename, the merge-into-existing-directory fallback, symlink preservation, file mode preservation, and ENOENT propagation.

I did not touch the validator check for tmpl.Container in this PR because removing it would relax behavior and risk surprising existing users; that can be done in a follow-up once the executor fix lands. If maintainers prefer the alternative path (validator-only fix: extend the existing check to cover tmpl.ContainerSet too, rejecting these workflows upfront for all template shapes), happy to swap to that approach instead.
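
For comparison, the validator-only alternative could be sketched roughly as below. Artifact, VolumeMount, and checkArtifactMountOverlap are hypothetical stand-ins, not the real API types or validator code, and comparing cleaned paths for equality is an assumption; the existing check in validate.go may match paths differently. The error text follows the wording of the existing Container message quoted earlier, with the template shape substituted.

```go
package main

import (
	"fmt"
	"path"
)

// Artifact and VolumeMount are hypothetical stand-ins for the real API types.
type Artifact struct{ Name, Path string }
type VolumeMount struct{ Name, MountPath string }

// checkArtifactMountOverlap rejects any input artifact whose path equals a
// mount path. shape is the template field being checked, for example
// "container" or "containerSet" (assumed naming for the error message).
func checkArtifactMountOverlap(tmplName, shape string, arts []Artifact, mounts []VolumeMount) error {
	for i, art := range arts {
		for _, m := range mounts {
			// Compare cleaned paths so a trailing slash does not hide a match.
			if path.Clean(art.Path) == path.Clean(m.MountPath) {
				return fmt.Errorf("templates.%s.inputs.artifacts[%d].path '%s' already mounted in %s.volumeMounts.%s",
					tmplName, i, art.Path, shape, m.Name)
			}
		}
	}
	return nil
}

func main() {
	// Same overlap as the reproducer workflow in this issue.
	err := checkArtifactMountOverlap("main", "containerSet",
		[]Artifact{{Name: "data", Path: "/opt/workspace"}},
		[]VolumeMount{{Name: "workspace", MountPath: "/opt/workspace"}})
	fmt.Println(err)
}
```

Applying the same function to every template shape that carries volume mounts would make the rejection consistent, at the cost of refusing workflows that the executor fix would otherwise allow.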

Behavioural note

When the fallback runs, any pre-existing files in the destination directory are merged with the artifact contents (entries from the staging path overwrite same-named entries at the destination). For the volume-mount case the mount is typically a fresh emptyDir so this is invisible, but for persistentVolumeClaim-backed mounts that already contain files this is a softer "merge" semantic than the original atomic rename would have provided. The existing source comment on the overlap branch ("Extracting to volume mount") already reads as "extract into", so this aligns with the documented intent.

Disclosure

This issue (and the accompanying PR) was authored with assistance from Claude Code (Anthropic's coding agent). The reproducer was run on a local k3s cluster, the logs and validator behaviour quoted above were captured first-hand, and I have read and reviewed every line before submitting.
