dex: copyutil init container OOMs on cgroup v2 + `cp -n` silently preserves partial binary → permanent SIGSEGV #3895

## Summary

The `dex-server` Deployment's `copyutil` init container (which copies the `argocd` binary into a shared volume so Dex can run `argocd-dex rundex`) has a default memory limit of 64Mi, inherited from `dex.resources.limits.memory`. On cgroup v2 hosts this default is **fragile** because the page cache generated by copying the 219 MB `argocd` binary is charged against the cgroup's `memory.max`. When the copy crosses 64Mi mid-write, the kernel OOM killer terminates `cp` with SIGKILL (exit 137), leaving a **partial binary** on the shared `emptyDir`.

The init container's command is `cp -n /usr/local/bin/argocd /shared/argocd-dex`. When the kubelet restarts the init container, `cp -n` finds an existing `/shared/argocd-dex` (the partial file from the OOM-killed attempt), **silently skips the copy**, and exits 0. The kubelet marks the init container Completed.

The main `dex-server` container then `exec`s `/shared/argocd-dex rundex`. Because the binary is truncated, the process **segfaults immediately (exit 139)**. The kubelet restarts only the main container (init containers do not re-run on a container restart), so every restart re-execs the same corrupt binary and segfaults again. We observed this in a homelab cluster: **1100+ restarts over 4 days**, OIDC fully broken, and no useful logs because the process dies before writing anything to stderr.
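The skip-on-retry behavior can be checked locally without a cluster. The following is a minimal sketch, not the chart's actual init script: the temporary paths stand in for `/usr/local/bin/argocd` and `/shared/argocd-dex`, and the 1 MiB and 64 KiB sizes are arbitrary. The exit status of `cp -n` when it skips an existing file is not the same across all `cp` implementations and versions, so the sketch ignores it and compares bytes instead.

```shell
#!/bin/sh
# Local simulation only; these temp paths are stand-ins for the real
# /usr/local/bin/argocd (source) and /shared/argocd-dex (destination).
workdir=$(mktemp -d)
src="$workdir/argocd"
dst="$workdir/argocd-dex"

# 1 MiB of random bytes as a stand-in for the real binary.
head -c 1048576 /dev/urandom > "$src"

# Simulate the OOM-killed first attempt: only the first 64 KiB landed.
head -c 65536 "$src" > "$dst"

# Retry with the same flag the init container uses.
# Exit status is deliberately ignored; it varies between cp versions.
cp -n "$src" "$dst" 2>/dev/null || true

src_size=$(wc -c < "$src" | tr -d ' ')
dst_size=$(wc -c < "$dst" | tr -d ' ')
if cmp -s "$src" "$dst"; then
  retry_result=complete
else
  retry_result=partial
fi
echo "source=$src_size bytes destination=$dst_size bytes result=$retry_result"
```

The destination stays at the truncated size after the retry, which matches what the main container ends up executing.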

The bug is **silent** because:

1. The init container's `Last State` shows the OOMKill, but its current `State` is `Completed` with exit 0 (the retry).
2. `kubectl describe pod` emphasizes only the latter, so the OOMKill is easy to miss.
3. `kubectl logs argocd-dex-server-...` returns nothing, because the SIGSEGV happens before any output.
4. The Pod sits in `Running` without ever becoming Ready, with a climbing `RestartCount`, which superficially looks like a flapping app rather than a poisoned binary.
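Since the only visible clues are the exit codes, a small decoder makes the pattern easier to spot. This is a hypothetical triage sketch: the function names `describe_exit_code` and `diagnose` are invented for illustration, and the mapping relies on the usual convention that an exit code above 128 means "killed by signal (code - 128)", with Linux signal numbers 9 (SIGKILL) and 11 (SIGSEGV).

```shell
#!/bin/sh
# Map a container exit code to the signal behind it (128 + signal number).
describe_exit_code() {
  code=$1
  if [ "$code" -le 128 ]; then
    echo "exit $code"
    return
  fi
  case $((code - 128)) in
    9)  echo "SIGKILL" ;;   # what the OOM killer sends
    11) echo "SIGSEGV" ;;
    *)  echo "signal $((code - 128))" ;;
  esac
}

# Hypothetical helper: flag the pattern described in this issue.
#   $1 = init container lastState exit code
#   $2 = init container current state exit code
#   $3 = main container last exit code
diagnose() {
  if [ "$1" -eq 137 ] && [ "$2" -eq 0 ] && [ "$3" -eq 139 ]; then
    echo "suspect partial binary"
  else
    echo "pattern not matched"
  fi
}

echo "init last: $(describe_exit_code 137)"
echo "main last: $(describe_exit_code 139)"
diagnose 137 0 139
```

The three inputs correspond to `lastState.terminated.exitCode` and `state.terminated.exitCode` under `.status.initContainerStatuses[0]`, plus the main container's `lastState.terminated.exitCode` under `.status.containerStatuses[0]`.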

## Reproduction

Single-node cluster, kernel `7.0.0-15-generic` (cgroup v2). Helm chart `argo-cd-9.5.14`, image `quay.io/argoproj/argocd:v3.4.2` (219 MB binary). Chart defaults applied — `dex.resources.limits.memory: 64Mi`, no override on `dex.initImage.resources`.

`kubectl get pod argocd-dex-server-... -o jsonpath='{.status.initContainerStatuses[0]}'`:

```yaml
lastState:
  terminated:
    exitCode: 137
    reason: OOMKilled
state:
  terminated:
    exitCode: 0
    reason: Completed
restartCount: 1
```

Main container shows `exitCode: 139` (SIGSEGV) on every restart. The identical pod spec, applied as a one-off Pod with the init container's memory limit raised to 512Mi (the equivalent of `dex.initImage.resources.limits.memory: 512Mi`), starts cleanly on the first try: the init container completes without an OOM kill, and Dex initializes its signing keys and listens on 5556/5557/5558.

## Why this surfaced now

With page cache charged to the cgroup, the 219 MB binary is on the edge of triggering an OOM kill under a 64Mi limit. The `argocd` binary has grown release by release; earlier 3.x patch releases were small enough that some clusters never crossed the threshold. On cgroup v2 hosts the threshold is much easier to hit than on cgroup v1, where filesystem page cache was accounted to the cgroup less aggressively.

## Proposal

Two changes, both small:

1. **Bump default memory limit** for `dex.initImage.resources` (or whichever key controls `copyutil` resources, including the matching repo-server / server `copyutil` instances if they have the same issue). Suggested: `limits.memory: 256Mi` — comfortable headroom over the 219 MB binary plus copy buffers.

2. **Switch `cp -n` to `cp -f`** (force-overwrite) in the init container `command`. This way an OOM-killed partial copy is unconditionally truncated and rewritten on retry instead of being silently preserved. Even with the higher default this is a defense-in-depth fix: if anyone hits an OOM kill for any reason (resource contention, tight overrides, etc.), retries will succeed instead of cementing a poisoned binary.
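For a rough sense of the headroom behind the suggested limit, the arithmetic can be sketched as below. It assumes "219 MB" means 219 × 10^6 bytes (the headroom is about 10 MiB smaller if MiB was meant) and treats the worst case as the whole file being charged to the cgroup at once, which the kernel's reclaim will normally improve on. The function name `headroom_mib` is made up for this example.

```shell
#!/bin/sh
# Assumption: "219 MB" is 219 * 10^6 bytes.
binary_bytes=$((219 * 1000 * 1000))
mib=$((1024 * 1024))

# Print (limit - binary size) in whole MiB, truncated toward zero.
# A negative result means the file alone is larger than the limit.
headroom_mib() {
  limit_bytes=$(($1 * mib))
  echo $(( (limit_bytes - binary_bytes) / mib ))
}

for limit in 64 256 512; do
  echo "limit=${limit}Mi headroom=$(headroom_mib "$limit")MiB"
done
```

Under these assumptions the current 64Mi default is smaller than the file by roughly 144 MiB, while 256Mi leaves about 47 MiB spare and the 512Mi used in the reproduction leaves about 303 MiB.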

Happy to PR either or both if there's appetite. The `cp -n` → `cp -f` change is a one-character diff in `templates/dex/deployment.yaml`.
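The effect of the flag change on a retry can be sanity-checked locally. This is a sketch under the same kind of assumptions as any local simulation: temporary files stand in for the real source and destination, the sizes are arbitrary, and `retry_after_partial` is a hypothetical helper name, not something from the chart.

```shell
#!/bin/sh
# Simulate a truncated first copy, retry with the given cp flag,
# and report whether the destination now matches the source.
retry_after_partial() {
  flag=$1
  workdir=$(mktemp -d)
  src="$workdir/argocd"        # stand-in for /usr/local/bin/argocd
  dst="$workdir/argocd-dex"    # stand-in for /shared/argocd-dex

  head -c 1048576 /dev/urandom > "$src"   # 1 MiB fake binary
  head -c 65536 "$src" > "$dst"           # partial file from the killed attempt

  # Exit status ignored: cp -n reports a skip differently across versions.
  cp "$flag" "$src" "$dst" 2>/dev/null || true

  if cmp -s "$src" "$dst"; then
    echo complete
  else
    echo partial
  fi
  rm -rf "$workdir"
}

echo "cp -f retry: $(retry_after_partial -f)"
echo "cp -n retry: $(retry_after_partial -n)"
```

With `-f` the retry overwrites the truncated file and the destination matches the source; with `-n` the truncated file survives.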

