fix kind integration tests in gha by zsblevins · Pull Request #13 · NVIDIA/nv-config-manager

zsblevins · 2026-05-29T19:27:17Z

Description

Fix timeout in kind integration test caused by busybox pull

Validation

Standard CI passes.
Kind integration passes, or this PR explains why it was not run.

The kind integration test is manual due to taking ~30 min to complete. When the PR is ready for review,
run Actions -> Kind Integration -> Run workflow against the copy-pr-bot generated
pull-request/<PR_NUMBER> branch. Use the default test_path for the full suite,
or narrow it only while debugging.

Passing Kind Integration run:

Checklist

I am familiar with the contributing guidelines in CONTRIBUTING.md.
Commits are signed off for DCO compliance.
New or existing tests cover these changes, or the PR explains why tests are not needed.
Documentation is updated for user-facing behavior changes.
Generated artifacts are updated when applicable, such as OpenAPI specs,
docs screenshots, or Helm/rendered outputs.

Summary by CodeRabbit

New Features
- Platform-aware preloading of helper images into local Kind clusters; parallel local Docker image builds; new --helm-debug and --watch-pods deployment flags; config supports kind_preload_images.
Bug Fixes
- Pod readiness timeouts now include concise diagnostic summaries (container/init states and recent events).
Tests
- Expanded coverage for preload behavior, parallel builds, Helm debug/watch-pods, kind_preload_images, and pod readiness summaries; added license headers.
Chores
- BuildKit caching for Docker builds; tightened numpy constraint; Makefile/CI helm-debug support.

copy-pr-bot · 2026-05-29T19:27:20Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-05-29T19:27:27Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6b75b92b-e784-42f2-a831-4adb5ba598f2

📥 Commits

Reviewing files that changed from the base of the PR and between 80949e7 and 0f10224.

📒 Files selected for processing (2)

installer/src/nv_config_manager_installer/deployer.py
installer/tests/test_deployer.py

🚧 Files skipped from review as they are similar to previous changes (2)

installer/src/nv_config_manager_installer/deployer.py
installer/tests/test_deployer.py

📝 Walkthrough

Walkthrough

Adds LOADER_POD_IMAGE and platform-aware Kind helper preload, implements concurrent Docker image builds with streaming logs, adds helm_debug/watch-pods wiring (CLI/Makefile/DeployOptions), improves pod readiness diagnostics, and updates tests, CI, Dockerfiles, schema, and license headers.

Changes

Kind cluster image preloading, build parallelism, Helm debug, and diagnostics

Layer / File(s)	Summary
CI timeouts, Makefile, and BuildKit `.github/workflows/kind-integration.yml`, `Makefile`, `build/*.Dockerfile`	Adds step-level timeouts, wires BUILDX_BUILDER and HELM_DEBUG into CI deploy, adds HELM_DEBUG flag in Makefile, and enables BuildKit frontend + cache mounts for uv package directories.
LOADER_POD_IMAGE and k8s diagnostics `installer/src/nv_config_manager_installer/k8s.py`	Adds `LOADER_POD_IMAGE`, uses it as loader-pod default, and implements compact pod not-ready diagnostics used in timeouts.
Deployer imports and options wiring `installer/src/nv_config_manager_installer/deployer.py`, `installer/src/nv_config_manager_installer/cli.py`	Import adjustments, add `helm_debug` and `watch_pods` to DeployOptions and CLI, and wire the flags into helm invocation paths.
Subprocess runners and pod-summary polling `installer/src/nv_config_manager_installer/deployer.py`	Introduce `_ParallelCommand`/_RunningCommand, `_run_logged_pipe`, `_run_logged_with_pod_summary`, and pod-summary utilities.
Parallel runner and timeout logic `installer/src/nv_config_manager_installer/deployer.py`	Add `_run_logged_parallel`: fail-fast concurrent executor with prefixed streaming, periodic running-status, per-command timeouts, and process-kill-on-error.
Parallelize image builds and tests `installer/src/nv_config_manager_installer/deployer.py`, `installer/tests/test_deployer.py`	Refactor `_build_images` to run Docker builds concurrently (respecting BUILDX), bound parallelism, tag images post-build, populate local tags, and add tests validating parallel execution and log ordering.
Kind helper preload implementation and tests `installer/src/nv_config_manager_installer/deployer.py`, `deploy/configs/ci-integration-test.yaml`, `installer/tests/test_deployer.py`	Detect Docker server `linux/<arch>`, pull platform-scoped helper sources with `--platform`, retag to `LOADER_POD_IMAGE` if redirected, and import into Kind nodes via `ctr images import --platform`; CI config and tests added for preload lists.
Schema: kind_preload_images `installer/src/nv_config_manager_installer/schema.py`, `installer/tests/test_schema.py`	Add `ImagesConfig.kind_preload_images: list[str]`, update YAML pruning behavior, and add tests for default and round-trip serialization.
License/header and small files `installer/tests/*`, `installer/scripts/screenshot_tui.py`, `components/.../__init__.py`	Expand/insert Apache-2.0 SPDX headers across tests and scripts; add license-only `__init__.py` files; no runtime behavior changes.
Docs & pyproject `docs/install/index.mdx`, `pyproject.toml`	Add CI note about `--load-kind` and preload images; bump numpy constraint to `>=2.1.0,<3.0.0`.

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant Deployer
  participant Docker
  participant KindNode
  participant Helm
  User->>Deployer: run deploy (options: load_kind, helm_debug, watch_pods)
  Deployer->>Docker: docker version (detect server os/arch)
  Deployer->>Docker: docker pull --platform <os/arch> <helper source image>
  Docker-->>Deployer: pulled image (possibly redirected)
  Deployer->>Docker: docker tag <source> <LOADER_POD_IMAGE> (if redirected)
  Deployer->>KindNode: docker save --platform | ctr images import --platform
  Deployer->>Docker: start parallel docker build commands via _run_logged_parallel
  Deployer->>Helm: helm upgrade --install [--debug]
  Helm->>Deployer: waiting (if watch_pods)
  Deployer->>KindNode: poll pods and emit unready summaries while Helm waits

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

dmathur0
spidercensus
ryanheffernan

"A rabbit scurries with a tiny chart,
I fetch the right-arch image to start,
I tag and load into every node,
I sprint parallel builds down the road,
Then hum as pods report their heart." 🐇✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 32.93% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check	❓ Inconclusive	The title 'fix kind integration tests in gha' is vague and does not clearly convey the main changes in the pull request.	Use a more descriptive title that explains the specific problem being fixed, such as 'Add Kind helper image preloading and parallel Docker builds to fix integration timeouts' or 'Preload Kind helper images and implement concurrent Docker builds'.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix-ci-integration

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

zsblevins · 2026-05-29T19:28:03Z

/ok to test a4e6a61

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

installer/tests/test_deployer.py (1)

402-423: ⚡ Quick win

Assert the cache probe in the cached-image test.

This test currently verifies “no pull” but not that cache detection was actually performed. Adding an inspect assertion makes the regression signal tighter.

✅ Suggested test hardening

diff --git a/installer/tests/test_deployer.py b/installer/tests/test_deployer.py
@@
     def test_load_kind_uses_cached_loader_image(self, monkeypatch):
+        run_commands: list[list[str]] = []
         logged_commands: list[list[str]] = []
 
         def fake_run(cmd, **kwargs):
+            run_commands.append(cmd)
             return MagicMock(returncode=0, stdout="", stderr="")
@@
         deployer._load_kind()
 
+        assert ["docker", "image", "inspect", LOADER_POD_IMAGE] in run_commands
         assert ["docker", "pull", LOADER_POD_IMAGE] not in logged_commands
         assert [
             "kind",

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@installer/tests/test_deployer.py` around lines 402 - 423, The test
test_load_kind_uses_cached_loader_image currently only asserts that a docker
pull wasn't issued; update it to also assert that the cache probe (docker image
inspect) was performed so cache detection is verified: in the test add an
assertion that logged_commands contains a call matching ["docker", "image",
"inspect", LOADER_POD_IMAGE] (or the exact args used by _load_kind’s cache
probe) before asserting no pull, referencing the Deployer._load_kind flow and
the LOADER_POD_IMAGE symbol.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@installer/src/nv_config_manager_installer/deployer.py`:
- Around line 761-771: The loop that inspects/pulls _KIND_PRELOAD_IMAGES can
fail with an unclear raw docker error when running with --load-kind because the
Docker prereq check is only run for --build-images; modify the _load_kind path
to fail fast by invoking the same Docker prerequisite check used for
--build-images before the image loop (or, if that helper doesn't exist, run a
quick check like _run(["docker","--version"], check=False) or
_run(["docker","info"], check=False) and raise/log a clear prereq error). Ensure
the check happens before referencing _KIND_PRELOAD_IMAGES and before calling
_run/_run_logged so missing Docker yields a clear prereq message rather than a
late raw command failure.

---

Nitpick comments:
In `@installer/tests/test_deployer.py`:
- Around line 402-423: The test test_load_kind_uses_cached_loader_image
currently only asserts that a docker pull wasn't issued; update it to also
assert that the cache probe (docker image inspect) was performed so cache
detection is verified: in the test add an assertion that logged_commands
contains a call matching ["docker", "image", "inspect", LOADER_POD_IMAGE] (or
the exact args used by _load_kind’s cache probe) before asserting no pull,
referencing the Deployer._load_kind flow and the LOADER_POD_IMAGE symbol.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 022a2ae2-0149-4a53-a998-e548ee577262

📥 Commits

Reviewing files that changed from the base of the PR and between 2083d0e and a4e6a61.

📒 Files selected for processing (3)

installer/src/nv_config_manager_installer/deployer.py
installer/src/nv_config_manager_installer/k8s.py
installer/tests/test_deployer.py

coderabbitai · 2026-05-29T19:32:07Z

+        for img in _KIND_PRELOAD_IMAGES:
+            if _run(["docker", "image", "inspect", img], check=False).returncode != 0:
+                self.callback.on_log(f"Pulling helper image {img} for Kind preload...")
+                _run_logged(["docker", "pull", img], step, self.callback, timeout=300)
+            self.callback.on_log(f"Loading helper image {img} into Kind cluster {cluster}...")
+            _run_logged(
+                ["kind", "load", "docker-image", img, "--name", cluster],
+                step,
+                self.callback,
+                timeout=300,
+            )


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="installer/src/nv_config_manager_installer/deployer.py" # Sanity: show lines around 740-800 (snippet area) python3 - <<'PY' path="installer/src/nv_config_manager_installer/deployer.py" start=730 end=810 with open(path,'r',encoding='utf-8') as f: for i,line in enumerate(f, start=1): if start <= i <= end: print(f"{i}:{line.rstrip()}") PY # Locate _load_kind definition rg -n "def _load_kind|_load_kind\\b|load_kind" "$FILE" # Locate docker prereq checks (shutil.which docker / RuntimeError messages) rg -n "shutil\\.which\$\"docker\"\$|docker is required|--build-images|build_images|--load-kind|load_kind" "$FILE" # Search repo for the same message or check rg -n "docker is required for --build-images|--build-images/--load-kind" installer/src | head -n 50

Repository: NVIDIA/nv-config-manager

Length of output: 4390

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="installer/src/nv_config_manager_installer/deployer.py" # Show lines around 740-800 where snippet is python3 - <<'PY' path="installer/src/nv_config_manager_installer/deployer.py" start=740 end=800 with open(path,'r',encoding='utf-8') as f: for i,line in enumerate(f, start=1): if start <= i <= end: print(f"{i}:{line.rstrip()}") PY # Find _load_kind definition and where options.load_kind is used rg -n "def _load_kind|_load_kind\\b|load_kind|build_images" "$FILE" # Find docker prerequisite checks rg -n "shutil\\.which\$\"docker\"\$|docker is required|RuntimeError\$\"docker|docker\\s+is\\s+required" "$FILE" # Search for any existing docker prerequisite logic elsewhere rg -n "shutil\\.which\\(\"docker\"\$" installer/src | head -n 50 rg -n "docker is required" installer/src | head -n 50

Repository: NVIDIA/nv-config-manager

Length of output: 3636

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="installer/src/nv_config_manager_installer/deployer.py" python3 - <<'PY' path="installer/src/nv_config_manager_installer/deployer.py" start=730 end=820 with open(path,'r',encoding='utf-8') as f: for i,line in enumerate(f, start=1): if start <= i <= end: print(f"{i}:{line.rstrip()}") PY echo "---- rg _load_kind and options ----" rg -n "def _load_kind|_load_kind\\b|load_kind|build_images" "$FILE" echo "---- rg docker prerequisite checks in file ----" rg -n "shutil\\.which\$\"docker\"\$|docker is required|RuntimeError\$\"docker|docker\\s+is\\s+required" "$FILE" || true echo "---- rg docker prerequisite checks in repo ----" rg -n "shutil\\.which\\(\"docker\"\$" installer/src | head -n 50 || true rg -n "docker is required" installer/src | head -n 50 || true

Repository: NVIDIA/nv-config-manager

Length of output: 4888

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="installer/src/nv_config_manager_installer/deployer.py" python3 - <<'PY' path="installer/src/nv_config_manager_installer/deployer.py" start=730 end=820 with open(path,'r',encoding='utf-8') as f: for i,line in enumerate(f, start=1): if start <= i <= end: print(f"{i}:{line.rstrip()}") PY echo "---- rg _load_kind and options ----" rg -n "def _load_kind|_load_kind\\b|load_kind|build_images" "$FILE" echo "---- rg docker prerequisite checks in file ----" rg -n "shutil\\.which\$\"docker\"\$|docker is required|RuntimeError\$\"docker|docker\\s+is\\s+required" "$FILE" || true echo "---- rg docker prerequisite checks in repo ----" rg -n "shutil\\.which\\(\"docker\"\$" installer/src | head -n 50 || true rg -n "docker is required" installer/src | head -n 50 || true

Repository: NVIDIA/nv-config-manager

Length of output: 4888

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="installer/src/nv_config_manager_installer/deployer.py" python3 - <<'PY' path="installer/src/nv_config_manager_installer/deployer.py" start=730 end=820 with open(path,'r',encoding='utf-8') as f: for i,line in enumerate(f, start=1): if start <= i <= end: print(f"{i}:{line.rstrip()}") PY echo "---- rg _load_kind and options ----" rg -n "def _load_kind|_load_kind\\b|load_kind|build_images" "$FILE" echo "---- rg docker prerequisite checks in file ----" rg -n "shutil\\.which\$\"docker\"\$|docker is required|RuntimeError\$\"docker|docker\\s+is\\s+required" "$FILE" || true echo "---- rg docker prerequisite checks in repo ----" rg -n "shutil\\.which\\(\"docker\"\$" installer/src | head -n 50 || true rg -n "docker is required" installer/src | head -n 50 || true

Repository: NVIDIA/nv-config-manager

Length of output: 4888

Fail fast when Docker is missing for --load-kind.

_load_kind() runs docker image inspect / docker pull for _KIND_PRELOAD_IMAGES, but the existing Docker prerequisite check only triggers for --build-images, so --load-kind can fail late with a raw docker command error instead of a clear prereq message.

🔧 Proposed fix

diff --git a/installer/src/nv_config_manager_installer/deployer.py b/installer/src/nv_config_manager_installer/deployer.py @@ - if self.options.build_images and not shutil.which("docker"): - raise RuntimeError("docker is required for --build-images") + if (self.options.build_images or self.options.load_kind) and not shutil.which("docker"): + raise RuntimeError("docker is required for --build-images/--load-kind")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@installer/src/nv_config_manager_installer/deployer.py` around lines 761 - 771, The loop that inspects/pulls _KIND_PRELOAD_IMAGES can fail with an unclear raw docker error when running with --load-kind because the Docker prereq check is only run for --build-images; modify the _load_kind path to fail fast by invoking the same Docker prerequisite check used for --build-images before the image loop (or, if that helper doesn't exist, run a quick check like _run(["docker","--version"], check=False) or _run(["docker","info"], check=False) and raise/log a clear prereq error). Ensure the check happens before referencing _KIND_PRELOAD_IMAGES and before calling _run/_run_logged so missing Docker yields a clear prereq message rather than a late raw command failure.

zsblevins · 2026-05-29T19:51:24Z

/ok to test 6229738

coderabbitai

♻️ Duplicate comments (1)

installer/src/nv_config_manager_installer/deployer.py (1)

782-797: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Docker prerequisite check missing for --load-kind (duplicate of past review).

The new preload logic calls docker pull and _docker_server_platform() (which runs docker version), but the prerequisite check at lines 595-598 only verifies docker is available for --build-images, not for --load-kind. If docker is missing, _load_kind() will fail late with a raw command error instead of a clear prerequisite message.

🔧 Proposed fix

-        if self.options.build_images and not shutil.which("docker"):
-            raise RuntimeError("docker is required for --build-images")
+        if (self.options.build_images or self.options.load_kind) and not shutil.which("docker"):
+            raise RuntimeError("docker is required for --build-images/--load-kind")
         if self.options.load_kind and not shutil.which("kind"):
             raise RuntimeError("kind is required for --load-kind")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@installer/src/nv_config_manager_installer/deployer.py` around lines 782 -
797, The preload path for --load-kind calls _docker_server_platform(), runs
"docker pull", and uses _KIND_PRELOAD_IMAGES but the existing prerequisite check
only verifies docker for --build-images; update the prereq logic so docker
availability is also checked when _load_kind() will be executed (i.e., when the
--load-kind flag or the code path that triggers image preloading runs).
Specifically, add a docker check before calling _docker_server_platform() or
entering the loop over _KIND_PRELOAD_IMAGES, and surface a clear prerequisite
error via the same callback.on_log/process-exit behavior used elsewhere rather
than letting _run_logged/_docker_server_platform() raise a raw command error.

🧹 Nitpick comments (1)

installer/tests/test_deployer.py (1)

366-406: 💤 Low value

Consider adding tests for edge cases.

The test covers the happy path well. Consider adding tests for:

Fallback behavior when docker version fails (line 374 could return returncode=1)
Multiple images in _KIND_PRELOAD_IMAGES to verify the loop

📝 Example: test for docker version fallback

def test_load_kind_uses_fallback_when_docker_version_fails(self, monkeypatch):
    run_commands: list[list[str]] = []
    logged_commands: list[list[str]] = []

    def fake_run(cmd, **kwargs):
        run_commands.append(cmd)
        if cmd[:3] == ["docker", "version", "--format"]:
            return MagicMock(returncode=1, stdout="", stderr="error")
        return MagicMock(returncode=0, stdout="", stderr="")

    def fake_run_logged(cmd, step, callback, **kwargs):
        logged_commands.append(cmd)
        return MagicMock(returncode=0, stdout="", stderr="")

    monkeypatch.setattr("nv_config_manager_installer.deployer._run", fake_run)
    monkeypatch.setattr("nv_config_manager_installer.deployer._run_logged", fake_run_logged)
    monkeypatch.setattr("nv_config_manager_installer.deployer.platform.machine", lambda: "x86_64")

    deployer = Deployer(
        _make_config(),
        DeployOptions(load_kind=True, kind_cluster="test-cluster"),
        RecordingCallback(),
    )
    deployer._load_kind()

    assert ["docker", "pull", "--platform", "linux/amd64", LOADER_POD_IMAGE] in logged_commands

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@installer/tests/test_deployer.py` around lines 366 - 406, Add two tests to
TestKindImageLoading to cover edge cases: (1) a fallback when docker version
fails — copy the existing test but have fake_run return MagicMock(returncode=1,
...) for the ["docker","version","--format"] call, monkeypatch platform.machine
to "x86_64", call Deployer._load_kind and assert that
["docker","pull","--platform","linux/amd64", LOADER_POD_IMAGE] appears in
logged_commands (reference fake_run, fake_run_logged, Deployer._load_kind,
LOADER_POD_IMAGE); (2) multiple images in _KIND_PRELOAD_IMAGES — monkeypatch
nv_config_manager_installer.deployer._KIND_PRELOAD_IMAGES to a list of two image
names, run the same setup and assert that a docker pull (with
--platform/linux/amd64 fallback as needed) and a kind load are invoked for each
image in logged_commands (reference _KIND_PRELOAD_IMAGES and the same
fake_run/fake_run_logged helpers).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@installer/src/nv_config_manager_installer/deployer.py`:
- Around line 782-797: The preload path for --load-kind calls
_docker_server_platform(), runs "docker pull", and uses _KIND_PRELOAD_IMAGES but
the existing prerequisite check only verifies docker for --build-images; update
the prereq logic so docker availability is also checked when _load_kind() will
be executed (i.e., when the --load-kind flag or the code path that triggers
image preloading runs). Specifically, add a docker check before calling
_docker_server_platform() or entering the loop over _KIND_PRELOAD_IMAGES, and
surface a clear prerequisite error via the same callback.on_log/process-exit
behavior used elsewhere rather than letting
_run_logged/_docker_server_platform() raise a raw command error.

---

Nitpick comments:
In `@installer/tests/test_deployer.py`:
- Around line 366-406: Add two tests to TestKindImageLoading to cover edge
cases: (1) a fallback when docker version fails — copy the existing test but
have fake_run return MagicMock(returncode=1, ...) for the
["docker","version","--format"] call, monkeypatch platform.machine to "x86_64",
call Deployer._load_kind and assert that
["docker","pull","--platform","linux/amd64", LOADER_POD_IMAGE] appears in
logged_commands (reference fake_run, fake_run_logged, Deployer._load_kind,
LOADER_POD_IMAGE); (2) multiple images in _KIND_PRELOAD_IMAGES — monkeypatch
nv_config_manager_installer.deployer._KIND_PRELOAD_IMAGES to a list of two image
names, run the same setup and assert that a docker pull (with
--platform/linux/amd64 fallback as needed) and a kind load are invoked for each
image in logged_commands (reference _KIND_PRELOAD_IMAGES and the same
fake_run/fake_run_logged helpers).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b59f860d-3f8a-48af-aa9d-120de27c0bc5

📥 Commits

Reviewing files that changed from the base of the PR and between a4e6a61 and 6229738.

📒 Files selected for processing (3)

installer/src/nv_config_manager_installer/deployer.py
installer/src/nv_config_manager_installer/k8s.py
installer/tests/test_deployer.py

🚧 Files skipped from review as they are similar to previous changes (1)

installer/src/nv_config_manager_installer/k8s.py

zsblevins · 2026-05-29T20:11:46Z

/ok to test be45d7c

github-actions · 2026-05-29T20:16:36Z

Docs preview: https://nvidia-preview-nv-config-manager-pr-13.docs.buildwithfern.com/switch-infrastructure/config-manager

zsblevins · 2026-05-29T20:44:23Z

/ok to test 1d181d5

zsblevins · 2026-05-29T21:05:45Z

/ok to test 37047ae

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

.github/workflows/kind-integration.yml (1)

121-134: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Retry only timeout/transient failures; current loop retries deterministic test failures and overwrites diagnostics.

At Line 123 and Line 127, any non-zero pytest exit is retried, and integration-test-results.xml is overwritten on each attempt. This increases CI time on real failures and can hide first-failure context.

Suggested patch

       - name: Run integration tests
         timeout-minutes: 50
         run: |
           for attempt in 1 2 3; do
             echo "Attempt ${attempt} of 3"
-            if timeout --kill-after=1m 15m uv run pytest "${TEST_PATH}" -v \
+            junit_file="integration-test-results-attempt-${attempt}.xml"
+            if timeout --kill-after=1m 15m uv run pytest "${TEST_PATH}" -v \
               --ci \
               --nv-config-manager-namespace "${NV_CONFIG_MANAGER_NAMESPACE}" \
               --tb=short \
-              --junitxml=integration-test-results.xml; then
+              --junitxml="${junit_file}"; then
+              cp "${junit_file}" integration-test-results.xml
               exit 0
             fi
+            exit_code=$?
+            # Retry only on timeout (124/137). Fail fast on deterministic test failures.
+            if [ "${exit_code}" -ne 124 ] && [ "${exit_code}" -ne 137 ]; then
+              cp "${junit_file}" integration-test-results.xml || true
+              exit "${exit_code}"
+            fi
             if [ "${attempt}" -lt 3 ]; then
               sleep 10
             fi
           done
           exit 1

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/kind-integration.yml around lines 121 - 134, The retry
loop currently reruns the command `timeout --kill-after=1m 15m uv run pytest
"${TEST_PATH}" -v ... --junitxml=integration-test-results.xml` for any non-zero
exit and overwrites `integration-test-results.xml`; change it to run pytest into
an attempt-specific results file (e.g.
`--junitxml=integration-test-results-${attempt}.xml`), capture the pytest exit
code, and decide: if exit indicates a timeout/kill (e.g. process killed by
timeout/exit code 137 or signal) or other known transient condition, then retry
up to 3 attempts; otherwise for deterministic failures immediately copy the
current attempt XML to `integration-test-results.xml` and exit non-zero; on
success copy the successful attempt XML to `integration-test-results.xml` and
exit 0. Update the loop surrounding the `timeout ... uv run pytest` invocation
and reference `${attempt}`, `${TEST_PATH}`, and `integration-test-results.xml`
when implementing these checks and file copies.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In @.github/workflows/kind-integration.yml:
- Around line 121-134: The retry loop currently reruns the command `timeout
--kill-after=1m 15m uv run pytest "${TEST_PATH}" -v ...
--junitxml=integration-test-results.xml` for any non-zero exit and overwrites
`integration-test-results.xml`; change it to run pytest into an attempt-specific
results file (e.g. `--junitxml=integration-test-results-${attempt}.xml`),
capture the pytest exit code, and decide: if exit indicates a timeout/kill (e.g.
process killed by timeout/exit code 137 or signal) or other known transient
condition, then retry up to 3 attempts; otherwise for deterministic failures
immediately copy the current attempt XML to `integration-test-results.xml` and
exit non-zero; on success copy the successful attempt XML to
`integration-test-results.xml` and exit 0. Update the loop surrounding the
`timeout ... uv run pytest` invocation and reference `${attempt}`,
`${TEST_PATH}`, and `integration-test-results.xml` when implementing these
checks and file copies.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 06525d1b-baa4-48eb-8777-1be8f1eee622

📥 Commits

Reviewing files that changed from the base of the PR and between 1d181d5 and 37047ae.

📒 Files selected for processing (6)

.github/workflows/kind-integration.yml
Makefile
installer/src/nv_config_manager_installer/cli.py
installer/src/nv_config_manager_installer/deployer.py
installer/tests/test_cli.py
installer/tests/test_deployer.py

🚧 Files skipped from review as they are similar to previous changes (5)

Makefile
installer/tests/test_cli.py
installer/src/nv_config_manager_installer/cli.py
installer/tests/test_deployer.py
installer/src/nv_config_manager_installer/deployer.py

zsblevins · 2026-05-29T21:20:35Z

/ok to test 864f985

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@installer/src/nv_config_manager_installer/deployer.py`:
- Around line 602-612: In _poll_pod_summary(), after calling
_unready_pod_summary_lines(k8s, namespace) and before any call to
_emit_pod_summary or computing signature/heartbeat logic, re-check
stop_event.is_set() and return/break early if set; this prevents emitting a
final pod-summary after shutdown was requested while list_namespaced_pod() was
in-flight. Update the loop in _poll_pod_summary to check stop_event immediately
after obtaining lines (and before computing
signature/has_blockers/last_emit/_emit_pod_summary) so the thread exits without
logging when stop_event has been set.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: cc0d0934-655e-49c7-aac6-69e036e4933a

📥 Commits

Reviewing files that changed from the base of the PR and between 864f985 and d933ccd.

📒 Files selected for processing (5)

.github/workflows/kind-integration.yml
installer/src/nv_config_manager_installer/cli.py
installer/src/nv_config_manager_installer/deployer.py
installer/tests/test_cli.py
installer/tests/test_deployer.py

✅ Files skipped from review due to trivial changes (1)

installer/tests/test_cli.py

🚧 Files skipped from review as they are similar to previous changes (1)

installer/src/nv_config_manager_installer/cli.py

coderabbitai · 2026-05-29T21:53:31Z

+    while not stop_event.is_set():
+        lines = _unready_pod_summary_lines(k8s, namespace)
+        signature = "\n".join(lines)
+        now = time.monotonic()
+        has_blockers = lines != ["all pods ready"]
+        if signature != last_signature or (has_blockers and now - last_emit >= heartbeat_interval):
+            _emit_pod_summary(lines, step, callback)
+            last_signature = signature
+            last_emit = now
+        if stop_event.wait(interval):
+            break


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail # Locate the file ls -la installer/src/nv_config_manager_installer/deployer.py # Show the relevant sections around the referenced line numbers sed -n '560,700p' installer/src/nv_config_manager_installer/deployer.py | cat -n # Find the poller/thread lifecycle helpers rg -n "def _poll_pod_summary|_run_logged_with_pod_summary|stop_event|watch_pods|_unready_pod_summary_lines" installer/src/nv_config_manager_installer/deployer.py # Show the helper implementations for the poller and its callers # (Use narrower slices around matches to avoid large output) python3 - <<'PY' import re, subprocess, os, textwrap, sys, pathlib path="installer/src/nv_config_manager_installer/deployer.py" # Get line numbers for key defs/usages patterns=[ r"def _poll_pod_summary\(", r"def _run_logged_with_pod_summary\(", r"def _unready_pod_summary_lines\(", r"def _helm_install\(", ] # Use rg to get exact line numbers for pat in patterns: out=subprocess.check_output(["rg","-n",pat,path], text=True).strip().splitlines() if not out: print("NO MATCH", pat) continue for line in out: ln=int(line.split(":")[0]) print(f"\n--- {pat} around {ln} ---") start=max(1,ln-60); end=ln+120 subprocess.run(["sed","-n",f"{start},{end}p",path], check=True) PY

Repository: NVIDIA/nv-config-manager

Length of output: 31655

Re-check stop_event before emitting pod summary logs in _poll_pod_summary()

_poll_pod_summary() emits [pods] after _unready_pod_summary_lines() returns, but it doesn’t re-check stop_event in between. If shutdown sets stop_event while k8s.v1.list_namespaced_pod() is still in-flight, the poll thread can still log one more pod-summary batch even though _run_logged_with_pod_summary() has returned (it only join(timeout=2) and the thread is daemon=True). Apply the early-exit check before computing/signaling emissions.

🔧 Proposed fix

while not stop_event.is_set(): lines = _unready_pod_summary_lines(k8s, namespace) + if stop_event.is_set(): + break signature = "\n".join(lines) now = time.monotonic() has_blockers = lines != ["all pods ready"] if signature != last_signature or (has_blockers and now - last_emit >= heartbeat_interval): _emit_pod_summary(lines, step, callback)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@installer/src/nv_config_manager_installer/deployer.py` around lines 602 - 612, In _poll_pod_summary(), after calling _unready_pod_summary_lines(k8s, namespace) and before any call to _emit_pod_summary or computing signature/heartbeat logic, re-check stop_event.is_set() and return/break early if set; this prevents emitting a final pod-summary after shutdown was requested while list_namespaced_pod() was in-flight. Update the loop in _poll_pod_summary to check stop_event immediately after obtaining lines (and before computing signature/has_blockers/last_emit/_emit_pod_summary) so the thread exits without logging when stop_event has been set.

zsblevins · 2026-05-29T22:00:39Z

/ok to test 6da405f

Signed-off-by: Zoe Blevins <zblevins@nvidia.com>

zsblevins · 2026-05-29T22:11:12Z

/ok to test e62bc48

zsblevins · 2026-05-29T22:31:24Z

/ok to test ebf71f8

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (1)

deploy/configs/ci-integration-test.yaml (1)
81-82: 💤 Low value

Comment mentions only Docker Hub but list includes multiple registries.

The comment states images are from "Docker Hub" but the list includes images from docker.io, quay.io, and ghcr.io. Consider updating to "from public registries" or "from Docker Hub and other registries" for accuracy.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/configs/ci-integration-test.yaml` around lines 81 - 82, Update the
descriptive comment that currently says "CI preloads third-party runtime images
into Kind through the host Docker daemon. This avoids anonymous in-cluster pulls
from Docker Hub during Helm's --wait window." to accurately reflect the
registries used by the image list: replace "Docker Hub" with "public registries
(e.g., docker.io, quay.io, ghcr.io)" or "Docker Hub and other registries" so the
comment above the CI preload block correctly describes the sources.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@deploy/configs/ci-integration-test.yaml`:
- Around line 83-102: The kind_preload_images list in the YAML doesn't include
busybox despite the PR stating the change fixes a timeout caused by pulling the
busybox image; either add the busybox image (e.g.,
docker.io/library/busybox:latest or a specific tag) to the kind_preload_images
array or update the PR description to remove the busybox-related claim; locate
the kind_preload_images section and add the busybox entry (or change the PR
text) so the code and PR description are consistent.
- Around line 83-102: Update the kind_preload_images verification workflow to
stop using docker manifest inspect and the dummy "Advisory check for:" log:
replace the docker-based manifest check with a Dockerless registry inspect
(e.g., use skopeo inspect --raw docker://<image> or crane manifest <image>) to
confirm each tag exists and return non-zero on missing tags, and replace the
placeholder advisory step with a remote image scan using trivy (trivy image
--severity HIGH,CRITICAL --exit-code 1 --remote <image>) so the job fails when
HIGH/CRITICAL findings are present; target the sections referencing
kind_preload_images and the current "docker manifest inspect" and "Advisory
check for:" steps to implement these changes.
- Around line 102-106: The PostgreSQL image is declared twice (in
kind_preload_images and in overrides.postgresql.repository/tag), which can lead
to divergence; add a concise clarifying comment next to both declarations (the
kind_preload_images entry and the overrides.postgresql.repository/tag block)
stating that these two values must be updated together when bumping PostgreSQL,
and optionally include the intended source of truth (e.g., "update both entries
when upgrading PostgreSQL") so future editors will keep them in sync.

---

Nitpick comments:
In `@deploy/configs/ci-integration-test.yaml`:
- Around line 81-82: Update the descriptive comment that currently says "CI
preloads third-party runtime images into Kind through the host Docker daemon.
This avoids anonymous in-cluster pulls from Docker Hub during Helm's --wait
window." to accurately reflect the registries used by the image list: replace
"Docker Hub" with "public registries (e.g., docker.io, quay.io, ghcr.io)" or
"Docker Hub and other registries" so the comment above the CI preload block
correctly describes the sources.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ddc75572-3850-451a-8175-494931f97fd7

📥 Commits

Reviewing files that changed from the base of the PR and between e62bc48 and ebf71f8.

📒 Files selected for processing (6)

deploy/configs/ci-integration-test.yaml
docs/install/index.mdx
installer/src/nv_config_manager_installer/deployer.py
installer/src/nv_config_manager_installer/schema.py
installer/tests/test_deployer.py
installer/tests/test_schema.py

✅ Files skipped from review due to trivial changes (1)

docs/install/index.mdx

🚧 Files skipped from review as they are similar to previous changes (4)

installer/src/nv_config_manager_installer/schema.py
installer/tests/test_schema.py
installer/tests/test_deployer.py
installer/src/nv_config_manager_installer/deployer.py

coderabbitai · 2026-05-29T22:35:18Z

+  kind_preload_images:
+    - docker.io/hashicorp/http-echo:1.0
+    - docker.io/alpine/kubectl:1.35.4
+    - docker.io/library/redis:7-alpine
+    - docker.io/library/nats:2.10-alpine
+    - docker.io/natsio/nats-box:0.14.3
+    - docker.io/temporalio/server:1.29
+    - docker.io/temporalio/admin-tools:1.29
+    - docker.io/temporalio/ui:v2.37.4
+    - docker.io/nginxinc/nginx-unprivileged:1.27
+    - docker.io/envoyproxy/gateway:v1.6.5
+    - docker.io/envoyproxy/ratelimit:c8765e89
+    - docker.io/envoyproxy/envoy:distroless-v1.36.5
+    - quay.io/jetstack/cert-manager-controller:v1.20.2
+    - quay.io/jetstack/cert-manager-webhook:v1.20.2
+    - quay.io/jetstack/cert-manager-cainjector:v1.20.2
+    - quay.io/jetstack/cert-manager-startupapicheck:v1.20.2
+    - quay.io/jetstack/cert-manager-acmesolver:v1.20.2
+    - ghcr.io/cloudnative-pg/cloudnative-pg:1.29.0
+    - ghcr.io/cloudnative-pg/postgresql:18.0-system-trixie


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Busybox image missing from preload list despite PR objectives.

The PR description states the purpose is to "Fix a timeout in the Kind integration test caused by pulling the busybox image," but busybox (or library/busybox) is not present in the kind_preload_images list. This is inconsistent with the stated objective.

If the busybox timeout has been resolved through other means or is no longer relevant, please update the PR description. Otherwise, consider adding the busybox image to this preload list.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deploy/configs/ci-integration-test.yaml` around lines 83 - 102, The kind_preload_images list in the YAML doesn't include busybox despite the PR stating the change fixes a timeout caused by pulling the busybox image; either add the busybox image (e.g., docker.io/library/busybox:latest or a specific tag) to the kind_preload_images array or update the PR description to remove the busybox-related claim; locate the kind_preload_images section and add the busybox entry (or change the PR text) so the code and PR description are consistent.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Description: Verify all preloaded images exist and check for vulnerabilities images=( "docker.io/hashicorp/http-echo:1.0" "docker.io/alpine/kubectl:1.35.4" "docker.io/library/redis:7-alpine" "docker.io/library/nats:2.10-alpine" "docker.io/natsio/nats-box:0.14.3" "docker.io/temporalio/server:1.29" "docker.io/temporalio/admin-tools:1.29" "docker.io/temporalio/ui:v2.37.4" "docker.io/nginxinc/nginx-unprivileged:1.27" "docker.io/envoyproxy/gateway:v1.6.5" "docker.io/envoyproxy/ratelimit:c8765e89" "docker.io/envoyproxy/envoy:distroless-v1.36.5" "quay.io/jetstack/cert-manager-controller:v1.20.2" "quay.io/jetstack/cert-manager-webhook:v1.20.2" "quay.io/jetstack/cert-manager-cainjector:v1.20.2" "quay.io/jetstack/cert-manager-startupapicheck:v1.20.2" "quay.io/jetstack/cert-manager-acmesolver:v1.20.2" "ghcr.io/cloudnative-pg/cloudnative-pg:1.29.0" "ghcr.io/cloudnative-pg/postgresql:18.0-system-trixie" ) echo "Checking image manifest existence..." for img in "${images[@]}"; do echo "Checking: $img" if ! docker manifest inspect "$img" > /dev/null 2>&1; then echo " ❌ FAILED: Image not found or inaccessible" else echo " ✓ OK" fi done echo "" echo "Checking for security advisories (GitHub)..." # Check for known CVEs in common images for img in "${images[@]}"; do repo=$(echo "$img" | cut -d: -f1 | sed 's|docker.io/library/||' | sed 's|docker.io/||' | sed 's|/|%2F|g') echo "Advisory check for: $img" done

Repository: NVIDIA/nv-config-manager

Length of output: 3117

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "== docker availability ==" command -v docker || true docker --version || true docker info >/dev/null 2>&1 && echo "docker info: OK" || echo "docker info: FAILED (daemon/network?)" echo echo "== manifest lookup sanity check ==" # Use a very common public image KNOWN="docker.io/library/busybox:latest" echo "Checking: $KNOWN" if docker manifest inspect "$KNOWN" >/dev/null 2>&1; then echo " ✓ OK" else echo " ❌ FAILED" fi echo echo "== tool availability (optional) ==" command -v trivy >/dev/null 2>&1 && echo "trivy: present" || echo "trivy: not present"

Repository: NVIDIA/nv-config-manager

Length of output: 336

Fix the kind_preload_images verification to actually check tag existence + vulnerabilities

The proposed docker manifest inspect approach is not portable: it requires a working Docker client/daemon and registry access; otherwise it fails to validate the tags. Use a Dockerless registry tool (e.g., skopeo inspect --raw docker://<image> or crane manifest <image>) to confirm each tag exists.

The “Checking for security advisories (GitHub)” section doesn’t perform any real advisory/CVE lookup (it only prints Advisory check for: ...). Replace that part with an actual scanner; trivy is available, so scan the remote images for HIGH/CRITICAL and fail the job on findings.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deploy/configs/ci-integration-test.yaml` around lines 83 - 102, Update the kind_preload_images verification workflow to stop using docker manifest inspect and the dummy "Advisory check for:" log: replace the docker-based manifest check with a Dockerless registry inspect (e.g., use skopeo inspect --raw docker://<image> or crane manifest <image>) to confirm each tag exists and return non-zero on missing tags, and replace the placeholder advisory step with a remote image scan using trivy (trivy image --severity HIGH,CRITICAL --exit-code 1 --remote <image>) so the job fails when HIGH/CRITICAL findings are present; target the sections referencing kind_preload_images and the current "docker manifest inspect" and "Advisory check for:" steps to implement these changes.

coderabbitai · 2026-05-29T22:35:18Z

+    - ghcr.io/cloudnative-pg/postgresql:18.0-system-trixie
+  overrides:
+    postgresql:
+      repository: ghcr.io/cloudnative-pg/postgresql
+      tag: 18.0-system-trixie


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

PostgreSQL image specified in both preload list and overrides.

The PostgreSQL image ghcr.io/cloudnative-pg/postgresql:18.0-system-trixie is defined in two places:

Line 102: In the kind_preload_images list

Lines 104-106: In the overrides.postgresql block

While currently consistent, maintaining the same image in two locations increases the risk of divergence during future updates.

Consider adding a comment linking these two declarations or documenting that both must be updated together when upgrading PostgreSQL.

📝 Example clarifying comment

- ghcr.io/cloudnative-pg/cloudnative-pg:1.29.0 + # Note: PostgreSQL image below must match overrides.postgresql (lines 104-106) - ghcr.io/cloudnative-pg/postgresql:18.0-system-trixie overrides: + # Note: Must match kind_preload_images entry (line 102) postgresql: repository: ghcr.io/cloudnative-pg/postgresql tag: 18.0-system-trixie

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deploy/configs/ci-integration-test.yaml` around lines 102 - 106, The PostgreSQL image is declared twice (in kind_preload_images and in overrides.postgresql.repository/tag), which can lead to divergence; add a concise clarifying comment next to both declarations (the kind_preload_images entry and the overrides.postgresql.repository/tag block) stating that these two values must be updated together when bumping PostgreSQL, and optionally include the intended source of truth (e.g., "update both entries when upgrading PostgreSQL") so future editors will keep them in sync.

Signed-off-by: Zoe Blevins <zblevins@nvidia.com>

zsblevins · 2026-05-29T22:43:33Z

/ok to test 76a5b5b

Plain `docker save` of a multi-arch tag under Docker's containerd image store exports an OCI index referencing every platform and, for indices with several platforms (e.g. cert-manager-controller v1.20.2 has 5), can omit the selected platform's config blob from the tar. This made `ctr images import --platform=linux/amd64` fail in kind-up with: ctr: failed to resolve rootfs: content digest sha256:...: not found Pass the same `--platform` to `docker save` (Docker 28+/API 1.48+) so the archive is a self-contained single-platform OCI image. Signed-off-by: Davesh Mathur <daveshm@nvidia.com>

dmathur0 · 2026-05-30T08:05:02Z

/ok to test 80949e7

Diagnostic for docker/cli#6457: `docker save --platform <p>` is refusing to export freshly-pulled multi-arch helper images ("no suitable export target found ... does not provide the specified platform") even though all blobs for the requested platform are present. One suspected trigger is layer dedup with sibling images pulled earlier in the same preload loop (e.g. cert-manager-controller and envoyproxy/envoy:distroless both ride on the same distroless cc-debian12 base). Run `docker system prune -af` before each helper pull so the next pull cannot share blobs with the previous one. Images held by running containers (the kind node) are not affected. If this changes the failure mode the dedup hypothesis is confirmed; if not we'll move off `docker save` entirely. Signed-off-by: Davesh Mathur <daveshm@nvidia.com>

dmathur0 · 2026-05-30T09:18:54Z

/ok to test 0f10224

zsblevins requested review from dmathur0, jojung1, polarweasel, ryanheffernan, spidercensus and susanhooks as code owners May 29, 2026 19:27

coderabbitai Bot reviewed May 29, 2026

View reviewed changes

zsblevins force-pushed the fix-ci-integration branch from a4e6a61 to 6229738 Compare May 29, 2026 19:50

coderabbitai Bot reviewed May 29, 2026

View reviewed changes

zsblevins force-pushed the fix-ci-integration branch from 6229738 to be45d7c Compare May 29, 2026 20:09

zsblevins force-pushed the fix-ci-integration branch from 1d181d5 to 37047ae Compare May 29, 2026 21:05

coderabbitai Bot reviewed May 29, 2026

View reviewed changes

zsblevins added 5 commits May 29, 2026 16:10

fix kind integration tests in gha

c604164

Signed-off-by: Zoe Blevins <zblevins@nvidia.com>

speed up builds

862ce9b

Signed-off-by: Zoe Blevins <zblevins@nvidia.com>

add debug output

2842038

Signed-off-by: Zoe Blevins <zblevins@nvidia.com>

fix buildkit behavior + code rabbit feedback

be12efb

Signed-off-by: Zoe Blevins <zblevins@nvidia.com>

better visibility

d4f6511

Signed-off-by: Zoe Blevins <zblevins@nvidia.com>

zsblevins force-pushed the fix-ci-integration branch from 6da405f to e62bc48 Compare May 29, 2026 22:10

zsblevins force-pushed the fix-ci-integration branch from e62bc48 to ebf71f8 Compare May 29, 2026 22:30

coderabbitai Bot reviewed May 29, 2026

View reviewed changes

preload everything in gha due to rate-limiting

76a5b5b

Signed-off-by: Zoe Blevins <zblevins@nvidia.com>

zsblevins force-pushed the fix-ci-integration branch from ebf71f8 to 76a5b5b Compare May 29, 2026 22:42

dmathur0 force-pushed the fix-ci-integration branch from 6a92183 to 80949e7 Compare May 30, 2026 08:00

Conversation

zsblevins commented May 29, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Validation

Checklist

Summary by CodeRabbit

Uh oh!

copy-pr-bot Bot commented May 29, 2026

Uh oh!

coderabbitai Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Suggested reviewers

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

zsblevins commented May 29, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 29, 2026

Choose a reason for hiding this comment

Uh oh!

zsblevins commented May 29, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

zsblevins commented May 29, 2026

Uh oh!

github-actions Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zsblevins commented May 29, 2026

Uh oh!

zsblevins commented May 29, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

zsblevins commented May 29, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 29, 2026

Choose a reason for hiding this comment

Uh oh!

zsblevins commented May 29, 2026

Uh oh!

zsblevins commented May 29, 2026

Uh oh!

zsblevins commented May 29, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 29, 2026

Choose a reason for hiding this comment

Uh oh!

zsblevins commented May 29, 2026

Uh oh!

dmathur0 commented May 30, 2026

Uh oh!

dmathur0 commented May 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

zsblevins commented May 29, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 29, 2026 •

edited

Loading

github-actions Bot commented May 29, 2026 •

edited

Loading