AICR currently pins container images inconsistently across the registry. The
just-published BOM (docs/user/container-images.md) catalogs 70 unique
images across 22 components; only 4 carry a digest, and 4 helm
components have no chart-version pin at all (gpu-operator,
network-operator, nvidia-dra-driver-gpu, nodewright-operator). Every
other version is a mutable tag.
Three classes of customer pain follow:
- Same recipe, different bytes. A render today and a render six months
from now can produce a different image set when chart-default sub-images
are unpinned and upstream charts publish new defaults. The weekly
bom-refreshaction surfaces this drift; without a policy it's open-ended. - No customer-facing reproducibility statement. Security review and air-gap planning need a definitive "this recipe at this AICR version deploys exactly these bytes." We cannot assert that today.
- No clear contract for new components. A contributor adding a new chart has no documented expectation about what to pin. The result is ad-hoc per-component decisions that drift in different directions.
The trade-off is real and not free: pinning everything in-tree blocks upstream security patches behind an AICR release; pinning nothing maximizes patch flow but breaks reproducibility. We need an intentional position for each layer of the supply chain.
- Implementing admission-time digest verification. That's Stage 3 of the supply-chain epic (#745); this ADR only declares it as the destination for the layer this policy intentionally leaves unpinned.
- Forking upstream charts to expose digest fields they don't provide natively. Maintenance burden is unbounded.
- Mandating signature verification on upstream artifacts at this policy layer. Signing requests to upstream maintainers are a separate workstream filed alongside #740 once upstream repos have been surveyed.
- Re-pinning every existing in-tree image reference in the same change. This ADR sets policy; #748/#749 land the in-tree changes incrementally under that policy.
The supply chain divides into three concentric layers, each with a different ownership boundary:
- Helm chart selection. AICR picks a chart at a specific version. The
chart name + version is recorded in
recipes/registry.yaml. - Explicit image overrides. AICR's per-component
values.yamland embedded manifests override specific image references the upstream chart would otherwise pull. Today these are tag-pinned; some carry digests. - Chart-default sub-images. The chart's own
values.yamlreferences a set of sub-images (gpu-operatoralone ships ~14: driver, toolkit, dcgm, dcgm-exporter, device-plugin, validator, mig-manager, vgpu-manager, sandbox-device-plugin, gfd, gdrcopy, kata, cc-manager). AICR does not override these; the upstream chart's defaults are deployed.
Layer 1 is wholly under AICR's control. Layer 2 is wholly under AICR's
control. Layer 3 is under the upstream chart maintainer's control; AICR can
override individual sub-images with bundle-time --set flags but
maintaining a complete override matrix per component is intractable. Many
charts also do not expose digest fields for sub-images natively, only tags.
The reproducibility / patch-flow trade-off looks different at each layer:
| Layer | Pin? | Cost of pinning | Cost of not pinning |
|---|---|---|---|
| Chart version | yes | tiny — one version per component, Renovate-driven bumps | render drift across time, no audit baseline |
| Explicit override (image we already specify) | yes, digest | small — tag→digest is mechanical, Renovate handles ongoing rotation | tag-rebuild silently changes deployed bytes |
| Chart-default sub-image | no in-tree | very high — per-sub-image override matrix per chart, blocks upstream patches | residual drift; mitigated by chart-version pin + Stage 3 admission verification |
Three-layer pinning policy:
-
Pin chart versions for every Helm component, no exceptions.
recipes/registry.yamlMUST declaredefaultVersionfor every helm component. New components without a pin are rejected at PR review and bymake bom BOM_STRICT=1(wired intomake qualify). -
Digest-pin every image AICR overrides explicitly in-tree. Anywhere AICR's
recipes/components/<name>/values.yamlor embedded manifests set animage:value, that value MUST carry an@sha256:digest in addition to its tag. Tag-only references are rejected by the new test added under #761 once it's extended to the helm-rendered surface (see #765). -
Do not pin chart-default sub-images in-tree. AICR ships chart-default sub-images at whatever the upstream chart resolves at the pinned chart version. Reproducibility for these images is delivered by admission-time digest verification (#745), not by per-sub-image in-tree overrides.
Renovate is the durable maintenance mechanism. It auto-opens PRs for chart-version bumps and digest rotations under the same tag. Patches still flow; the diff lands as a normal PR with CI.
| Component | Chart version | Explicit overrides | Chart-default sub-images |
|---|---|---|---|
gpu-operator |
pinned (Phase B of #748) | digest-pin our overrides (#749) | unpinned in-tree → #745 |
network-operator |
pinned (Phase B of #748) | digest-pin our overrides (#749) | unpinned in-tree → #745 |
nvidia-dra-driver-gpu |
pinned (Phase B of #748) | digest-pin our overrides (#749) | unpinned in-tree → #745 |
nodewright-operator |
pinned (Phase B of #748) | already digest-pinned (skyhook packages) | unpinned in-tree → #745 |
cert-manager |
already pinned (v1.20.2) |
none today | unpinned in-tree → #745 |
prometheus-operator-crds |
already pinned (28.0.1) |
none today | none (CRDs-only chart, no images) |
kube-prometheus-stack |
already pinned (84.4.0) |
none today | unpinned in-tree → #745 |
nfd, k8s-ephemeral-storage-metrics |
pinned (Phase A of #748, ✅) | none today | unpinned in-tree → #745 |
| All others | pinned today | digest-pin our overrides (#749) | unpinned in-tree → #745 |
A PR adding a new helm component to recipes/registry.yaml must:
- Set
defaultVersion. CI gates this viamake bom BOM_STRICT=1. - If the PR's
recipes/components/<name>/values.yamloverrides anyimage:reference, every overridden value must include an@sha256:digest. - If the new component is on a registry that uses regional or
account-scoped URIs (e.g., AWS regional ECR), document the override
pattern in the values.yaml comment block, following the precedent set
by
aws-efa(PR #774, #764).
- Predictable customer-facing claim. "AICR vX.Y deploys recipe R, which
produces the chart-version set documented in
docs/user/container-images.md. Within that set, every image AICR controls in-tree is digest-pinned. Chart-default sub-images are deployed at the upstream chart's defaults at the pinned chart version." - Bounded maintenance. One chart-version per component; digest rotation on explicit overrides handled by Renovate. No per-sub-image override matrix.
- Clear contract for new components. PR review and CI both enforce.
- Sequencing for the rest of #739 unlocks. #748 Phase B (NVIDIA-owned charts), #749 (digest pinning explicit refs), and #745 (admission-time verification) all flow from this policy without further policy debates.
- Chart-default sub-images remain non-deterministic in-tree. A chart
rebuild that re-tags
:latest(or the chart's chosen tag) for a sub-image silently changes deployed bytes. The weeklybom-refreshaction surfaces the change; admission-time verification (#745) is the long-term answer. Customers who need today, end-to-end byte-level reproducibility must wait for Stage 3. gpu-operatorchart-default sub-images stay unpinned (~14 images). This is the single largest implicit-surface component. The decision accepts that tradeoff in exchange for upstream patch flow.- More CI friction on Renovate PRs. Pinning chart versions and digest-pinning explicit refs means more bot-driven PRs land per quarter. Renovate's grouping config keeps the noise manageable.
- Upstream signing. AICR and NVSentinel ship cosign-signed artifacts with SBOMs. Asking upstream chart and image publishers to do the same enables admission-time signature verification (#745) without per-image digest allow-lists. A separate workstream — filed against each upstream repo once they've been surveyed — pursues that. This ADR doesn't depend on it; #745 can land first as a digest-allow-list policy and migrate to signature verification when upstream coverage is sufficient.
- Per-recipe stability signal (#613). Orthogonal but adjacent — once a recipe declares its stability, customers can map "stable" to "byte-level reproducible (chart pin + digest pin + admission verification)" and "preview" to "chart-pinned only".
- This ADR lands. Sets policy, no code changes.
- #748 Phase B lands. Pin
gpu-operator,network-operator,nvidia-dra-driver-gpu,nodewright-operatorchart versions. Wiremake bom BOM_STRICT=1intomake qualifyso the contract is enforced for new components. - #765 lands. Tighten
pkg/bom.isLikelyImageso chart-default placeholders (vgpu-manager-style bare scalars) don't dilute the published BOM. - #749 lands. Digest-pin explicit overrides; configure Renovate's
pinDigestsfor ongoing rotation; extendmake qualifyto require digests on explicit refs. - #745 lands (separately tracked, Stage 3 of #739). Admission-time digest or signature verification covers chart-default sub-images.
Each step is independently revertible without invalidating the policy.