Skip to content

fix kind integration tests in gha#13

Open
zsblevins wants to merge 8 commits into
mainfrom
fix-ci-integration
Open

fix kind integration tests in gha#13
zsblevins wants to merge 8 commits into
mainfrom
fix-ci-integration

Conversation

@zsblevins
Copy link
Copy Markdown
Collaborator

@zsblevins zsblevins commented May 29, 2026

Description

Fix timeout in kind integration test caused by busybox pull

Validation

  • Standard CI passes.
  • Kind integration passes, or this PR explains why it was not run.

The kind integration test is manual due to taking ~30 min to complete. When the PR is ready for review,
run Actions -> Kind Integration -> Run workflow against the copy-pr-bot generated
pull-request/<PR_NUMBER> branch. Use the default test_path for the full suite,
or narrow it only while debugging.

Passing Kind Integration run:

Checklist

  • I am familiar with the contributing guidelines in CONTRIBUTING.md.
  • Commits are signed off for DCO compliance.
  • New or existing tests cover these changes, or the PR explains why tests are not needed.
  • Documentation is updated for user-facing behavior changes.
  • Generated artifacts are updated when applicable, such as OpenAPI specs,
    docs screenshots, or Helm/rendered outputs.

Summary by CodeRabbit

  • New Features

    • Platform-aware preloading of helper images into local Kind clusters; parallel local Docker image builds; new --helm-debug and --watch-pods deployment flags; config supports kind_preload_images.
  • Bug Fixes

    • Pod readiness timeouts now include concise diagnostic summaries (container/init states and recent events).
  • Tests

    • Expanded coverage for preload behavior, parallel builds, Helm debug/watch-pods, kind_preload_images, and pod readiness summaries; added license headers.
  • Chores

    • BuildKit caching for Docker builds; tightened numpy constraint; Makefile/CI helm-debug support.

Review Change Stack

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 29, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 29, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6b75b92b-e784-42f2-a831-4adb5ba598f2

📥 Commits

Reviewing files that changed from the base of the PR and between 80949e7 and 0f10224.

📒 Files selected for processing (2)
  • installer/src/nv_config_manager_installer/deployer.py
  • installer/tests/test_deployer.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • installer/src/nv_config_manager_installer/deployer.py
  • installer/tests/test_deployer.py

📝 Walkthrough

Walkthrough

Adds LOADER_POD_IMAGE and platform-aware Kind helper preload, implements concurrent Docker image builds with streaming logs, adds helm_debug/watch-pods wiring (CLI/Makefile/DeployOptions), improves pod readiness diagnostics, and updates tests, CI, Dockerfiles, schema, and license headers.

Changes

Kind cluster image preloading, build parallelism, Helm debug, and diagnostics

Layer / File(s) Summary
CI timeouts, Makefile, and BuildKit
.github/workflows/kind-integration.yml, Makefile, build/*.Dockerfile
Adds step-level timeouts, wires BUILDX_BUILDER and HELM_DEBUG into CI deploy, adds HELM_DEBUG flag in Makefile, and enables BuildKit frontend + cache mounts for uv package directories.
LOADER_POD_IMAGE and k8s diagnostics
installer/src/nv_config_manager_installer/k8s.py
Adds LOADER_POD_IMAGE, uses it as loader-pod default, and implements compact pod not-ready diagnostics used in timeouts.
Deployer imports and options wiring
installer/src/nv_config_manager_installer/deployer.py, installer/src/nv_config_manager_installer/cli.py
Import adjustments, add helm_debug and watch_pods to DeployOptions and CLI, and wire the flags into helm invocation paths.
Subprocess runners and pod-summary polling
installer/src/nv_config_manager_installer/deployer.py
Introduce _ParallelCommand/_RunningCommand, _run_logged_pipe, _run_logged_with_pod_summary, and pod-summary utilities.
Parallel runner and timeout logic
installer/src/nv_config_manager_installer/deployer.py
Add _run_logged_parallel: fail-fast concurrent executor with prefixed streaming, periodic running-status, per-command timeouts, and process-kill-on-error.
Parallelize image builds and tests
installer/src/nv_config_manager_installer/deployer.py, installer/tests/test_deployer.py
Refactor _build_images to run Docker builds concurrently (respecting BUILDX), bound parallelism, tag images post-build, populate local tags, and add tests validating parallel execution and log ordering.
Kind helper preload implementation and tests
installer/src/nv_config_manager_installer/deployer.py, deploy/configs/ci-integration-test.yaml, installer/tests/test_deployer.py
Detect Docker server linux/<arch>, pull platform-scoped helper sources with --platform, retag to LOADER_POD_IMAGE if redirected, and import into Kind nodes via ctr images import --platform; CI config and tests added for preload lists.
Schema: kind_preload_images
installer/src/nv_config_manager_installer/schema.py, installer/tests/test_schema.py
Add ImagesConfig.kind_preload_images: list[str], update YAML pruning behavior, and add tests for default and round-trip serialization.
License/header and small files
installer/tests/*, installer/scripts/screenshot_tui.py, components/.../__init__.py
Expand/insert Apache-2.0 SPDX headers across tests and scripts; add license-only __init__.py files; no runtime behavior changes.
Docs & pyproject
docs/install/index.mdx, pyproject.toml
Add CI note about --load-kind and preload images; bump numpy constraint to >=2.1.0,<3.0.0.

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant Deployer
  participant Docker
  participant KindNode
  participant Helm
  User->>Deployer: run deploy (options: load_kind, helm_debug, watch_pods)
  Deployer->>Docker: docker version (detect server os/arch)
  Deployer->>Docker: docker pull --platform <os/arch> <helper source image>
  Docker-->>Deployer: pulled image (possibly redirected)
  Deployer->>Docker: docker tag <source> <LOADER_POD_IMAGE> (if redirected)
  Deployer->>KindNode: docker save --platform | ctr images import --platform
  Deployer->>Docker: start parallel docker build commands via _run_logged_parallel
  Deployer->>Helm: helm upgrade --install [--debug]
  Helm->>Deployer: waiting (if watch_pods)
  Deployer->>KindNode: poll pods and emit unready summaries while Helm waits
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • dmathur0
  • spidercensus
  • ryanheffernan

"A rabbit scurries with a tiny chart,
I fetch the right-arch image to start,
I tag and load into every node,
I sprint parallel builds down the road,
Then hum as pods report their heart." 🐇✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 32.93% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title 'fix kind integration tests in gha' is vague and does not clearly convey the main changes in the pull request. Use a more descriptive title that explains the specific problem being fixed, such as 'Add Kind helper image preloading and parallel Docker builds to fix integration timeouts' or 'Preload Kind helper images and implement concurrent Docker builds'.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix-ci-integration

Comment @coderabbitai help to get the list of available commands and usage tips.

@zsblevins
Copy link
Copy Markdown
Collaborator Author

/ok to test a4e6a61

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
installer/tests/test_deployer.py (1)

402-423: ⚡ Quick win

Assert the cache probe in the cached-image test.

This test currently verifies “no pull” but not that cache detection was actually performed. Adding an inspect assertion makes the regression signal tighter.

✅ Suggested test hardening
diff --git a/installer/tests/test_deployer.py b/installer/tests/test_deployer.py
@@
     def test_load_kind_uses_cached_loader_image(self, monkeypatch):
+        run_commands: list[list[str]] = []
         logged_commands: list[list[str]] = []
 
         def fake_run(cmd, **kwargs):
+            run_commands.append(cmd)
             return MagicMock(returncode=0, stdout="", stderr="")
@@
         deployer._load_kind()
 
+        assert ["docker", "image", "inspect", LOADER_POD_IMAGE] in run_commands
         assert ["docker", "pull", LOADER_POD_IMAGE] not in logged_commands
         assert [
             "kind",
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@installer/tests/test_deployer.py` around lines 402 - 423, The test
test_load_kind_uses_cached_loader_image currently only asserts that a docker
pull wasn't issued; update it to also assert that the cache probe (docker image
inspect) was performed so cache detection is verified: in the test add an
assertion that logged_commands contains a call matching ["docker", "image",
"inspect", LOADER_POD_IMAGE] (or the exact args used by _load_kind’s cache
probe) before asserting no pull, referencing the Deployer._load_kind flow and
the LOADER_POD_IMAGE symbol.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@installer/src/nv_config_manager_installer/deployer.py`:
- Around line 761-771: The loop that inspects/pulls _KIND_PRELOAD_IMAGES can
fail with an unclear raw docker error when running with --load-kind because the
Docker prereq check is only run for --build-images; modify the _load_kind path
to fail fast by invoking the same Docker prerequisite check used for
--build-images before the image loop (or, if that helper doesn't exist, run a
quick check like _run(["docker","--version"], check=False) or
_run(["docker","info"], check=False) and raise/log a clear prereq error). Ensure
the check happens before referencing _KIND_PRELOAD_IMAGES and before calling
_run/_run_logged so missing Docker yields a clear prereq message rather than a
late raw command failure.

---

Nitpick comments:
In `@installer/tests/test_deployer.py`:
- Around line 402-423: The test test_load_kind_uses_cached_loader_image
currently only asserts that a docker pull wasn't issued; update it to also
assert that the cache probe (docker image inspect) was performed so cache
detection is verified: in the test add an assertion that logged_commands
contains a call matching ["docker", "image", "inspect", LOADER_POD_IMAGE] (or
the exact args used by _load_kind’s cache probe) before asserting no pull,
referencing the Deployer._load_kind flow and the LOADER_POD_IMAGE symbol.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 022a2ae2-0149-4a53-a998-e548ee577262

📥 Commits

Reviewing files that changed from the base of the PR and between 2083d0e and a4e6a61.

📒 Files selected for processing (3)
  • installer/src/nv_config_manager_installer/deployer.py
  • installer/src/nv_config_manager_installer/k8s.py
  • installer/tests/test_deployer.py

Comment on lines +761 to +771
for img in _KIND_PRELOAD_IMAGES:
if _run(["docker", "image", "inspect", img], check=False).returncode != 0:
self.callback.on_log(f"Pulling helper image {img} for Kind preload...")
_run_logged(["docker", "pull", img], step, self.callback, timeout=300)
self.callback.on_log(f"Loading helper image {img} into Kind cluster {cluster}...")
_run_logged(
["kind", "load", "docker-image", img, "--name", cluster],
step,
self.callback,
timeout=300,
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="installer/src/nv_config_manager_installer/deployer.py"

# Sanity: show lines around 740-800 (snippet area)
python3 - <<'PY'
path="installer/src/nv_config_manager_installer/deployer.py"
start=730
end=810
with open(path,'r',encoding='utf-8') as f:
    for i,line in enumerate(f, start=1):
        if start <= i <= end:
            print(f"{i}:{line.rstrip()}")
PY

# Locate _load_kind definition
rg -n "def _load_kind|_load_kind\\b|load_kind" "$FILE"

# Locate docker prereq checks (shutil.which docker / RuntimeError messages)
rg -n "shutil\\.which\\(\"docker\"\\)|docker is required|--build-images|build_images|--load-kind|load_kind" "$FILE"

# Search repo for the same message or check
rg -n "docker is required for --build-images|--build-images/--load-kind" installer/src | head -n 50

Repository: NVIDIA/nv-config-manager

Length of output: 4390


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="installer/src/nv_config_manager_installer/deployer.py"

# Show lines around 740-800 where snippet is
python3 - <<'PY'
path="installer/src/nv_config_manager_installer/deployer.py"
start=740
end=800
with open(path,'r',encoding='utf-8') as f:
    for i,line in enumerate(f, start=1):
        if start <= i <= end:
            print(f"{i}:{line.rstrip()}")
PY

# Find _load_kind definition and where options.load_kind is used
rg -n "def _load_kind|_load_kind\\b|load_kind|build_images" "$FILE"

# Find docker prerequisite checks
rg -n "shutil\\.which\\(\"docker\"\\)|docker is required|RuntimeError\\(\"docker|docker\\s+is\\s+required" "$FILE"

# Search for any existing docker prerequisite logic elsewhere
rg -n "shutil\\.which\\(\"docker\"\\)" installer/src | head -n 50
rg -n "docker is required" installer/src | head -n 50

Repository: NVIDIA/nv-config-manager

Length of output: 3636


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="installer/src/nv_config_manager_installer/deployer.py"

python3 - <<'PY'
path="installer/src/nv_config_manager_installer/deployer.py"
start=730
end=820
with open(path,'r',encoding='utf-8') as f:
    for i,line in enumerate(f, start=1):
        if start <= i <= end:
            print(f"{i}:{line.rstrip()}")
PY

echo "---- rg _load_kind and options ----"
rg -n "def _load_kind|_load_kind\\b|load_kind|build_images" "$FILE"

echo "---- rg docker prerequisite checks in file ----"
rg -n "shutil\\.which\\(\"docker\"\\)|docker is required|RuntimeError\\(\"docker|docker\\s+is\\s+required" "$FILE" || true

echo "---- rg docker prerequisite checks in repo ----"
rg -n "shutil\\.which\\(\"docker\"\\)" installer/src | head -n 50 || true
rg -n "docker is required" installer/src | head -n 50 || true

Repository: NVIDIA/nv-config-manager

Length of output: 4888


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="installer/src/nv_config_manager_installer/deployer.py"

python3 - <<'PY'
path="installer/src/nv_config_manager_installer/deployer.py"
start=730
end=820
with open(path,'r',encoding='utf-8') as f:
    for i,line in enumerate(f, start=1):
        if start <= i <= end:
            print(f"{i}:{line.rstrip()}")
PY

echo "---- rg _load_kind and options ----"
rg -n "def _load_kind|_load_kind\\b|load_kind|build_images" "$FILE"

echo "---- rg docker prerequisite checks in file ----"
rg -n "shutil\\.which\\(\"docker\"\\)|docker is required|RuntimeError\\(\"docker|docker\\s+is\\s+required" "$FILE" || true

echo "---- rg docker prerequisite checks in repo ----"
rg -n "shutil\\.which\\(\"docker\"\\)" installer/src | head -n 50 || true
rg -n "docker is required" installer/src | head -n 50 || true

Repository: NVIDIA/nv-config-manager

Length of output: 4888


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="installer/src/nv_config_manager_installer/deployer.py"

python3 - <<'PY'
path="installer/src/nv_config_manager_installer/deployer.py"
start=730
end=820
with open(path,'r',encoding='utf-8') as f:
    for i,line in enumerate(f, start=1):
        if start <= i <= end:
            print(f"{i}:{line.rstrip()}")
PY

echo "---- rg _load_kind and options ----"
rg -n "def _load_kind|_load_kind\\b|load_kind|build_images" "$FILE"

echo "---- rg docker prerequisite checks in file ----"
rg -n "shutil\\.which\\(\"docker\"\\)|docker is required|RuntimeError\\(\"docker|docker\\s+is\\s+required" "$FILE" || true

echo "---- rg docker prerequisite checks in repo ----"
rg -n "shutil\\.which\\(\"docker\"\\)" installer/src | head -n 50 || true
rg -n "docker is required" installer/src | head -n 50 || true

Repository: NVIDIA/nv-config-manager

Length of output: 4888


Fail fast when Docker is missing for --load-kind.

_load_kind() runs docker image inspect / docker pull for _KIND_PRELOAD_IMAGES, but the existing Docker prerequisite check only triggers for --build-images, so --load-kind can fail late with a raw docker command error instead of a clear prereq message.

🔧 Proposed fix
diff --git a/installer/src/nv_config_manager_installer/deployer.py b/installer/src/nv_config_manager_installer/deployer.py
@@
-        if self.options.build_images and not shutil.which("docker"):
-            raise RuntimeError("docker is required for --build-images")
+        if (self.options.build_images or self.options.load_kind) and not shutil.which("docker"):
+            raise RuntimeError("docker is required for --build-images/--load-kind")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@installer/src/nv_config_manager_installer/deployer.py` around lines 761 -
771, The loop that inspects/pulls _KIND_PRELOAD_IMAGES can fail with an unclear
raw docker error when running with --load-kind because the Docker prereq check
is only run for --build-images; modify the _load_kind path to fail fast by
invoking the same Docker prerequisite check used for --build-images before the
image loop (or, if that helper doesn't exist, run a quick check like
_run(["docker","--version"], check=False) or _run(["docker","info"],
check=False) and raise/log a clear prereq error). Ensure the check happens
before referencing _KIND_PRELOAD_IMAGES and before calling _run/_run_logged so
missing Docker yields a clear prereq message rather than a late raw command
failure.

@zsblevins zsblevins force-pushed the fix-ci-integration branch from a4e6a61 to 6229738 Compare May 29, 2026 19:50
@zsblevins
Copy link
Copy Markdown
Collaborator Author

/ok to test 6229738

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
installer/src/nv_config_manager_installer/deployer.py (1)

782-797: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Docker prerequisite check missing for --load-kind (duplicate of past review).

The new preload logic calls docker pull and _docker_server_platform() (which runs docker version), but the prerequisite check at lines 595-598 only verifies docker is available for --build-images, not for --load-kind. If docker is missing, _load_kind() will fail late with a raw command error instead of a clear prerequisite message.

🔧 Proposed fix
-        if self.options.build_images and not shutil.which("docker"):
-            raise RuntimeError("docker is required for --build-images")
+        if (self.options.build_images or self.options.load_kind) and not shutil.which("docker"):
+            raise RuntimeError("docker is required for --build-images/--load-kind")
         if self.options.load_kind and not shutil.which("kind"):
             raise RuntimeError("kind is required for --load-kind")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@installer/src/nv_config_manager_installer/deployer.py` around lines 782 -
797, The preload path for --load-kind calls _docker_server_platform(), runs
"docker pull", and uses _KIND_PRELOAD_IMAGES but the existing prerequisite check
only verifies docker for --build-images; update the prereq logic so docker
availability is also checked when _load_kind() will be executed (i.e., when the
--load-kind flag or the code path that triggers image preloading runs).
Specifically, add a docker check before calling _docker_server_platform() or
entering the loop over _KIND_PRELOAD_IMAGES, and surface a clear prerequisite
error via the same callback.on_log/process-exit behavior used elsewhere rather
than letting _run_logged/_docker_server_platform() raise a raw command error.
🧹 Nitpick comments (1)
installer/tests/test_deployer.py (1)

366-406: 💤 Low value

Consider adding tests for edge cases.

The test covers the happy path well. Consider adding tests for:

  1. Fallback behavior when docker version fails (line 374 could return returncode=1)
  2. Multiple images in _KIND_PRELOAD_IMAGES to verify the loop
📝 Example: test for docker version fallback
def test_load_kind_uses_fallback_when_docker_version_fails(self, monkeypatch):
    run_commands: list[list[str]] = []
    logged_commands: list[list[str]] = []

    def fake_run(cmd, **kwargs):
        run_commands.append(cmd)
        if cmd[:3] == ["docker", "version", "--format"]:
            return MagicMock(returncode=1, stdout="", stderr="error")
        return MagicMock(returncode=0, stdout="", stderr="")

    def fake_run_logged(cmd, step, callback, **kwargs):
        logged_commands.append(cmd)
        return MagicMock(returncode=0, stdout="", stderr="")

    monkeypatch.setattr("nv_config_manager_installer.deployer._run", fake_run)
    monkeypatch.setattr("nv_config_manager_installer.deployer._run_logged", fake_run_logged)
    monkeypatch.setattr("nv_config_manager_installer.deployer.platform.machine", lambda: "x86_64")

    deployer = Deployer(
        _make_config(),
        DeployOptions(load_kind=True, kind_cluster="test-cluster"),
        RecordingCallback(),
    )
    deployer._load_kind()

    assert ["docker", "pull", "--platform", "linux/amd64", LOADER_POD_IMAGE] in logged_commands
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@installer/tests/test_deployer.py` around lines 366 - 406, Add two tests to
TestKindImageLoading to cover edge cases: (1) a fallback when docker version
fails — copy the existing test but have fake_run return MagicMock(returncode=1,
...) for the ["docker","version","--format"] call, monkeypatch platform.machine
to "x86_64", call Deployer._load_kind and assert that
["docker","pull","--platform","linux/amd64", LOADER_POD_IMAGE] appears in
logged_commands (reference fake_run, fake_run_logged, Deployer._load_kind,
LOADER_POD_IMAGE); (2) multiple images in _KIND_PRELOAD_IMAGES — monkeypatch
nv_config_manager_installer.deployer._KIND_PRELOAD_IMAGES to a list of two image
names, run the same setup and assert that a docker pull (with
--platform/linux/amd64 fallback as needed) and a kind load are invoked for each
image in logged_commands (reference _KIND_PRELOAD_IMAGES and the same
fake_run/fake_run_logged helpers).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@installer/src/nv_config_manager_installer/deployer.py`:
- Around line 782-797: The preload path for --load-kind calls
_docker_server_platform(), runs "docker pull", and uses _KIND_PRELOAD_IMAGES but
the existing prerequisite check only verifies docker for --build-images; update
the prereq logic so docker availability is also checked when _load_kind() will
be executed (i.e., when the --load-kind flag or the code path that triggers
image preloading runs). Specifically, add a docker check before calling
_docker_server_platform() or entering the loop over _KIND_PRELOAD_IMAGES, and
surface a clear prerequisite error via the same callback.on_log/process-exit
behavior used elsewhere rather than letting
_run_logged/_docker_server_platform() raise a raw command error.

---

Nitpick comments:
In `@installer/tests/test_deployer.py`:
- Around line 366-406: Add two tests to TestKindImageLoading to cover edge
cases: (1) a fallback when docker version fails — copy the existing test but
have fake_run return MagicMock(returncode=1, ...) for the
["docker","version","--format"] call, monkeypatch platform.machine to "x86_64",
call Deployer._load_kind and assert that
["docker","pull","--platform","linux/amd64", LOADER_POD_IMAGE] appears in
logged_commands (reference fake_run, fake_run_logged, Deployer._load_kind,
LOADER_POD_IMAGE); (2) multiple images in _KIND_PRELOAD_IMAGES — monkeypatch
nv_config_manager_installer.deployer._KIND_PRELOAD_IMAGES to a list of two image
names, run the same setup and assert that a docker pull (with
--platform/linux/amd64 fallback as needed) and a kind load are invoked for each
image in logged_commands (reference _KIND_PRELOAD_IMAGES and the same
fake_run/fake_run_logged helpers).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b59f860d-3f8a-48af-aa9d-120de27c0bc5

📥 Commits

Reviewing files that changed from the base of the PR and between a4e6a61 and 6229738.

📒 Files selected for processing (3)
  • installer/src/nv_config_manager_installer/deployer.py
  • installer/src/nv_config_manager_installer/k8s.py
  • installer/tests/test_deployer.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • installer/src/nv_config_manager_installer/k8s.py

@zsblevins zsblevins force-pushed the fix-ci-integration branch from 6229738 to be45d7c Compare May 29, 2026 20:09
@zsblevins
Copy link
Copy Markdown
Collaborator Author

/ok to test be45d7c

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 29, 2026

@zsblevins
Copy link
Copy Markdown
Collaborator Author

/ok to test 1d181d5

@zsblevins zsblevins force-pushed the fix-ci-integration branch from 1d181d5 to 37047ae Compare May 29, 2026 21:05
@zsblevins
Copy link
Copy Markdown
Collaborator Author

/ok to test 37047ae

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/kind-integration.yml (1)

121-134: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Retry only timeout/transient failures; current loop retries deterministic test failures and overwrites diagnostics.

At Line 123 and Line 127, any non-zero pytest exit is retried, and integration-test-results.xml is overwritten on each attempt. This increases CI time on real failures and can hide first-failure context.

Suggested patch
       - name: Run integration tests
         timeout-minutes: 50
         run: |
           for attempt in 1 2 3; do
             echo "Attempt ${attempt} of 3"
-            if timeout --kill-after=1m 15m uv run pytest "${TEST_PATH}" -v \
+            junit_file="integration-test-results-attempt-${attempt}.xml"
+            if timeout --kill-after=1m 15m uv run pytest "${TEST_PATH}" -v \
               --ci \
               --nv-config-manager-namespace "${NV_CONFIG_MANAGER_NAMESPACE}" \
               --tb=short \
-              --junitxml=integration-test-results.xml; then
+              --junitxml="${junit_file}"; then
+              cp "${junit_file}" integration-test-results.xml
               exit 0
             fi
+            exit_code=$?
+            # Retry only on timeout (124/137). Fail fast on deterministic test failures.
+            if [ "${exit_code}" -ne 124 ] && [ "${exit_code}" -ne 137 ]; then
+              cp "${junit_file}" integration-test-results.xml || true
+              exit "${exit_code}"
+            fi
             if [ "${attempt}" -lt 3 ]; then
               sleep 10
             fi
           done
           exit 1
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/kind-integration.yml around lines 121 - 134, The retry
loop currently reruns the command `timeout --kill-after=1m 15m uv run pytest
"${TEST_PATH}" -v ... --junitxml=integration-test-results.xml` for any non-zero
exit and overwrites `integration-test-results.xml`; change it to run pytest into
an attempt-specific results file (e.g.
`--junitxml=integration-test-results-${attempt}.xml`), capture the pytest exit
code, and decide: if exit indicates a timeout/kill (e.g. process killed by
timeout/exit code 137 or signal) or other known transient condition, then retry
up to 3 attempts; otherwise for deterministic failures immediately copy the
current attempt XML to `integration-test-results.xml` and exit non-zero; on
success copy the successful attempt XML to `integration-test-results.xml` and
exit 0. Update the loop surrounding the `timeout ... uv run pytest` invocation
and reference `${attempt}`, `${TEST_PATH}`, and `integration-test-results.xml`
when implementing these checks and file copies.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In @.github/workflows/kind-integration.yml:
- Around line 121-134: The retry loop currently reruns the command `timeout
--kill-after=1m 15m uv run pytest "${TEST_PATH}" -v ...
--junitxml=integration-test-results.xml` for any non-zero exit and overwrites
`integration-test-results.xml`; change it to run pytest into an attempt-specific
results file (e.g. `--junitxml=integration-test-results-${attempt}.xml`),
capture the pytest exit code, and decide: if exit indicates a timeout/kill (e.g.
process killed by timeout/exit code 137 or signal) or other known transient
condition, then retry up to 3 attempts; otherwise for deterministic failures
immediately copy the current attempt XML to `integration-test-results.xml` and
exit non-zero; on success copy the successful attempt XML to
`integration-test-results.xml` and exit 0. Update the loop surrounding the
`timeout ... uv run pytest` invocation and reference `${attempt}`,
`${TEST_PATH}`, and `integration-test-results.xml` when implementing these
checks and file copies.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 06525d1b-baa4-48eb-8777-1be8f1eee622

📥 Commits

Reviewing files that changed from the base of the PR and between 1d181d5 and 37047ae.

📒 Files selected for processing (6)
  • .github/workflows/kind-integration.yml
  • Makefile
  • installer/src/nv_config_manager_installer/cli.py
  • installer/src/nv_config_manager_installer/deployer.py
  • installer/tests/test_cli.py
  • installer/tests/test_deployer.py
🚧 Files skipped from review as they are similar to previous changes (5)
  • Makefile
  • installer/tests/test_cli.py
  • installer/src/nv_config_manager_installer/cli.py
  • installer/tests/test_deployer.py
  • installer/src/nv_config_manager_installer/deployer.py

@zsblevins
Copy link
Copy Markdown
Collaborator Author

/ok to test 864f985

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@installer/src/nv_config_manager_installer/deployer.py`:
- Around line 602-612: In _poll_pod_summary(), after calling
_unready_pod_summary_lines(k8s, namespace) and before any call to
_emit_pod_summary or computing signature/heartbeat logic, re-check
stop_event.is_set() and return/break early if set; this prevents emitting a
final pod-summary after shutdown was requested while list_namespaced_pod() was
in-flight. Update the loop in _poll_pod_summary to check stop_event immediately
after obtaining lines (and before computing
signature/has_blockers/last_emit/_emit_pod_summary) so the thread exits without
logging when stop_event has been set.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: cc0d0934-655e-49c7-aac6-69e036e4933a

📥 Commits

Reviewing files that changed from the base of the PR and between 864f985 and d933ccd.

📒 Files selected for processing (5)
  • .github/workflows/kind-integration.yml
  • installer/src/nv_config_manager_installer/cli.py
  • installer/src/nv_config_manager_installer/deployer.py
  • installer/tests/test_cli.py
  • installer/tests/test_deployer.py
✅ Files skipped from review due to trivial changes (1)
  • installer/tests/test_cli.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • installer/src/nv_config_manager_installer/cli.py

Comment on lines +602 to +612
while not stop_event.is_set():
lines = _unready_pod_summary_lines(k8s, namespace)
signature = "\n".join(lines)
now = time.monotonic()
has_blockers = lines != ["all pods ready"]
if signature != last_signature or (has_blockers and now - last_emit >= heartbeat_interval):
_emit_pod_summary(lines, step, callback)
last_signature = signature
last_emit = now
if stop_event.wait(interval):
break
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Locate the file
ls -la installer/src/nv_config_manager_installer/deployer.py

# Show the relevant sections around the referenced line numbers
sed -n '560,700p' installer/src/nv_config_manager_installer/deployer.py | cat -n

# Find the poller/thread lifecycle helpers
rg -n "def _poll_pod_summary|_run_logged_with_pod_summary|stop_event|watch_pods|_unready_pod_summary_lines" installer/src/nv_config_manager_installer/deployer.py

# Show the helper implementations for the poller and its callers
# (Use narrower slices around matches to avoid large output)
python3 - <<'PY'
import re, subprocess, os, textwrap, sys, pathlib
path="installer/src/nv_config_manager_installer/deployer.py"
# Get line numbers for key defs/usages
patterns=[
    r"def _poll_pod_summary\(",
    r"def _run_logged_with_pod_summary\(",
    r"def _unready_pod_summary_lines\(",
    r"def _helm_install\(",
]
# Use rg to get exact line numbers
for pat in patterns:
    out=subprocess.check_output(["rg","-n",pat,path], text=True).strip().splitlines()
    if not out:
        print("NO MATCH", pat)
        continue
    for line in out:
        ln=int(line.split(":")[0])
        print(f"\n--- {pat} around {ln} ---")
        start=max(1,ln-60); end=ln+120
        subprocess.run(["sed","-n",f"{start},{end}p",path], check=True)
PY

Repository: NVIDIA/nv-config-manager

Length of output: 31655


Re-check stop_event before emitting pod summary logs in _poll_pod_summary()

_poll_pod_summary() emits [pods] after _unready_pod_summary_lines() returns, but it doesn’t re-check stop_event in between. If shutdown sets stop_event while k8s.v1.list_namespaced_pod() is still in-flight, the poll thread can still log one more pod-summary batch even though _run_logged_with_pod_summary() has returned (it only join(timeout=2) and the thread is daemon=True). Apply the early-exit check before computing/signaling emissions.

🔧 Proposed fix
     while not stop_event.is_set():
         lines = _unready_pod_summary_lines(k8s, namespace)
+        if stop_event.is_set():
+            break
         signature = "\n".join(lines)
         now = time.monotonic()
         has_blockers = lines != ["all pods ready"]
         if signature != last_signature or (has_blockers and now - last_emit >= heartbeat_interval):
             _emit_pod_summary(lines, step, callback)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@installer/src/nv_config_manager_installer/deployer.py` around lines 602 -
612, In _poll_pod_summary(), after calling _unready_pod_summary_lines(k8s,
namespace) and before any call to _emit_pod_summary or computing
signature/heartbeat logic, re-check stop_event.is_set() and return/break early
if set; this prevents emitting a final pod-summary after shutdown was requested
while list_namespaced_pod() was in-flight. Update the loop in _poll_pod_summary
to check stop_event immediately after obtaining lines (and before computing
signature/has_blockers/last_emit/_emit_pod_summary) so the thread exits without
logging when stop_event has been set.

@zsblevins
Copy link
Copy Markdown
Collaborator Author

/ok to test 6da405f

zsblevins added 5 commits May 29, 2026 16:10
Signed-off-by: Zoe Blevins <zblevins@nvidia.com>
Signed-off-by: Zoe Blevins <zblevins@nvidia.com>
Signed-off-by: Zoe Blevins <zblevins@nvidia.com>
Signed-off-by: Zoe Blevins <zblevins@nvidia.com>
Signed-off-by: Zoe Blevins <zblevins@nvidia.com>
@zsblevins zsblevins force-pushed the fix-ci-integration branch from 6da405f to e62bc48 Compare May 29, 2026 22:10
@zsblevins
Copy link
Copy Markdown
Collaborator Author

/ok to test e62bc48

@zsblevins zsblevins force-pushed the fix-ci-integration branch from e62bc48 to ebf71f8 Compare May 29, 2026 22:30
@zsblevins
Copy link
Copy Markdown
Collaborator Author

/ok to test ebf71f8

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
deploy/configs/ci-integration-test.yaml (1)

81-82: 💤 Low value

Comment mentions only Docker Hub but list includes multiple registries.

The comment states images are from "Docker Hub" but the list includes images from docker.io, quay.io, and ghcr.io. Consider updating to "from public registries" or "from Docker Hub and other registries" for accuracy.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/configs/ci-integration-test.yaml` around lines 81 - 82, Update the
descriptive comment that currently says "CI preloads third-party runtime images
into Kind through the host Docker daemon. This avoids anonymous in-cluster pulls
from Docker Hub during Helm's --wait window." to accurately reflect the
registries used by the image list: replace "Docker Hub" with "public registries
(e.g., docker.io, quay.io, ghcr.io)" or "Docker Hub and other registries" so the
comment above the CI preload block correctly describes the sources.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@deploy/configs/ci-integration-test.yaml`:
- Around line 83-102: The kind_preload_images list in the YAML doesn't include
busybox despite the PR stating the change fixes a timeout caused by pulling the
busybox image; either add the busybox image (e.g.,
docker.io/library/busybox:latest or a specific tag) to the kind_preload_images
array or update the PR description to remove the busybox-related claim; locate
the kind_preload_images section and add the busybox entry (or change the PR
text) so the code and PR description are consistent.
- Around line 83-102: Update the kind_preload_images verification workflow to
stop using docker manifest inspect and the dummy "Advisory check for:" log:
replace the docker-based manifest check with a Dockerless registry inspect
(e.g., use skopeo inspect --raw docker://<image> or crane manifest <image>) to
confirm each tag exists and return non-zero on missing tags, and replace the
placeholder advisory step with a remote image scan using trivy (trivy image
--severity HIGH,CRITICAL --exit-code 1 --remote <image>) so the job fails when
HIGH/CRITICAL findings are present; target the sections referencing
kind_preload_images and the current "docker manifest inspect" and "Advisory
check for:" steps to implement these changes.
- Around line 102-106: The PostgreSQL image is declared twice (in
kind_preload_images and in overrides.postgresql.repository/tag), which can lead
to divergence; add a concise clarifying comment next to both declarations (the
kind_preload_images entry and the overrides.postgresql.repository/tag block)
stating that these two values must be updated together when bumping PostgreSQL,
and optionally include the intended source of truth (e.g., "update both entries
when upgrading PostgreSQL") so future editors will keep them in sync.

---

Nitpick comments:
In `@deploy/configs/ci-integration-test.yaml`:
- Around line 81-82: Update the descriptive comment that currently says "CI
preloads third-party runtime images into Kind through the host Docker daemon.
This avoids anonymous in-cluster pulls from Docker Hub during Helm's --wait
window." to accurately reflect the registries used by the image list: replace
"Docker Hub" with "public registries (e.g., docker.io, quay.io, ghcr.io)" or
"Docker Hub and other registries" so the comment above the CI preload block
correctly describes the sources.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ddc75572-3850-451a-8175-494931f97fd7

📥 Commits

Reviewing files that changed from the base of the PR and between e62bc48 and ebf71f8.

📒 Files selected for processing (6)
  • deploy/configs/ci-integration-test.yaml
  • docs/install/index.mdx
  • installer/src/nv_config_manager_installer/deployer.py
  • installer/src/nv_config_manager_installer/schema.py
  • installer/tests/test_deployer.py
  • installer/tests/test_schema.py
✅ Files skipped from review due to trivial changes (1)
  • docs/install/index.mdx
🚧 Files skipped from review as they are similar to previous changes (4)
  • installer/src/nv_config_manager_installer/schema.py
  • installer/tests/test_schema.py
  • installer/tests/test_deployer.py
  • installer/src/nv_config_manager_installer/deployer.py

Comment on lines +83 to +102
kind_preload_images:
- docker.io/hashicorp/http-echo:1.0
- docker.io/alpine/kubectl:1.35.4
- docker.io/library/redis:7-alpine
- docker.io/library/nats:2.10-alpine
- docker.io/natsio/nats-box:0.14.3
- docker.io/temporalio/server:1.29
- docker.io/temporalio/admin-tools:1.29
- docker.io/temporalio/ui:v2.37.4
- docker.io/nginxinc/nginx-unprivileged:1.27
- docker.io/envoyproxy/gateway:v1.6.5
- docker.io/envoyproxy/ratelimit:c8765e89
- docker.io/envoyproxy/envoy:distroless-v1.36.5
- quay.io/jetstack/cert-manager-controller:v1.20.2
- quay.io/jetstack/cert-manager-webhook:v1.20.2
- quay.io/jetstack/cert-manager-cainjector:v1.20.2
- quay.io/jetstack/cert-manager-startupapicheck:v1.20.2
- quay.io/jetstack/cert-manager-acmesolver:v1.20.2
- ghcr.io/cloudnative-pg/cloudnative-pg:1.29.0
- ghcr.io/cloudnative-pg/postgresql:18.0-system-trixie
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Busybox image missing from preload list despite PR objectives.

The PR description states the purpose is to "Fix a timeout in the Kind integration test caused by pulling the busybox image," but busybox (or library/busybox) is not present in the kind_preload_images list. This is inconsistent with the stated objective.

If the busybox timeout has been resolved through other means or is no longer relevant, please update the PR description. Otherwise, consider adding the busybox image to this preload list.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/configs/ci-integration-test.yaml` around lines 83 - 102, The
kind_preload_images list in the YAML doesn't include busybox despite the PR
stating the change fixes a timeout caused by pulling the busybox image; either
add the busybox image (e.g., docker.io/library/busybox:latest or a specific tag)
to the kind_preload_images array or update the PR description to remove the
busybox-related claim; locate the kind_preload_images section and add the
busybox entry (or change the PR text) so the code and PR description are
consistent.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Verify all preloaded images exist and check for vulnerabilities

images=(
  "docker.io/hashicorp/http-echo:1.0"
  "docker.io/alpine/kubectl:1.35.4"
  "docker.io/library/redis:7-alpine"
  "docker.io/library/nats:2.10-alpine"
  "docker.io/natsio/nats-box:0.14.3"
  "docker.io/temporalio/server:1.29"
  "docker.io/temporalio/admin-tools:1.29"
  "docker.io/temporalio/ui:v2.37.4"
  "docker.io/nginxinc/nginx-unprivileged:1.27"
  "docker.io/envoyproxy/gateway:v1.6.5"
  "docker.io/envoyproxy/ratelimit:c8765e89"
  "docker.io/envoyproxy/envoy:distroless-v1.36.5"
  "quay.io/jetstack/cert-manager-controller:v1.20.2"
  "quay.io/jetstack/cert-manager-webhook:v1.20.2"
  "quay.io/jetstack/cert-manager-cainjector:v1.20.2"
  "quay.io/jetstack/cert-manager-startupapicheck:v1.20.2"
  "quay.io/jetstack/cert-manager-acmesolver:v1.20.2"
  "ghcr.io/cloudnative-pg/cloudnative-pg:1.29.0"
  "ghcr.io/cloudnative-pg/postgresql:18.0-system-trixie"
)

echo "Checking image manifest existence..."
for img in "${images[@]}"; do
  echo "Checking: $img"
  if ! docker manifest inspect "$img" > /dev/null 2>&1; then
    echo "  ❌ FAILED: Image not found or inaccessible"
  else
    echo "  ✓ OK"
  fi
done

echo ""
echo "Checking for security advisories (GitHub)..."
# Check for known CVEs in common images
for img in "${images[@]}"; do
  repo=$(echo "$img" | cut -d: -f1 | sed 's|docker.io/library/||' | sed 's|docker.io/||' | sed 's|/|%2F|g')
  echo "Advisory check for: $img"
done

Repository: NVIDIA/nv-config-manager

Length of output: 3117


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== docker availability =="
command -v docker || true
docker --version || true
docker info >/dev/null 2>&1 && echo "docker info: OK" || echo "docker info: FAILED (daemon/network?)"

echo
echo "== manifest lookup sanity check =="
# Use a very common public image
KNOWN="docker.io/library/busybox:latest"
echo "Checking: $KNOWN"
if docker manifest inspect "$KNOWN" >/dev/null 2>&1; then
  echo "  ✓ OK"
else
  echo "  ❌ FAILED"
fi

echo
echo "== tool availability (optional) =="
command -v trivy >/dev/null 2>&1 && echo "trivy: present" || echo "trivy: not present"

Repository: NVIDIA/nv-config-manager

Length of output: 336


Fix the kind_preload_images verification to actually check tag existence + vulnerabilities

  • The proposed docker manifest inspect approach is not portable: it requires a working Docker client/daemon and registry access; otherwise it fails to validate the tags. Use a Dockerless registry tool (e.g., skopeo inspect --raw docker://<image> or crane manifest <image>) to confirm each tag exists.
  • The “Checking for security advisories (GitHub)” section doesn’t perform any real advisory/CVE lookup (it only prints Advisory check for: ...). Replace that part with an actual scanner; trivy is available, so scan the remote images for HIGH/CRITICAL and fail the job on findings.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/configs/ci-integration-test.yaml` around lines 83 - 102, Update the
kind_preload_images verification workflow to stop using docker manifest inspect
and the dummy "Advisory check for:" log: replace the docker-based manifest check
with a Dockerless registry inspect (e.g., use skopeo inspect --raw
docker://<image> or crane manifest <image>) to confirm each tag exists and
return non-zero on missing tags, and replace the placeholder advisory step with
a remote image scan using trivy (trivy image --severity HIGH,CRITICAL
--exit-code 1 --remote <image>) so the job fails when HIGH/CRITICAL findings are
present; target the sections referencing kind_preload_images and the current
"docker manifest inspect" and "Advisory check for:" steps to implement these
changes.

Comment on lines +102 to +106
- ghcr.io/cloudnative-pg/postgresql:18.0-system-trixie
overrides:
postgresql:
repository: ghcr.io/cloudnative-pg/postgresql
tag: 18.0-system-trixie
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

PostgreSQL image specified in both preload list and overrides.

The PostgreSQL image ghcr.io/cloudnative-pg/postgresql:18.0-system-trixie is defined in two places:

  • Line 102: In the kind_preload_images list
  • Lines 104-106: In the overrides.postgresql block

While currently consistent, maintaining the same image in two locations increases the risk of divergence during future updates.

Consider adding a comment linking these two declarations or documenting that both must be updated together when upgrading PostgreSQL.

📝 Example clarifying comment
   - ghcr.io/cloudnative-pg/cloudnative-pg:1.29.0
+  # Note: PostgreSQL image below must match overrides.postgresql (lines 104-106)
   - ghcr.io/cloudnative-pg/postgresql:18.0-system-trixie
 overrides:
+  # Note: Must match kind_preload_images entry (line 102)
   postgresql:
     repository: ghcr.io/cloudnative-pg/postgresql
     tag: 18.0-system-trixie
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/configs/ci-integration-test.yaml` around lines 102 - 106, The
PostgreSQL image is declared twice (in kind_preload_images and in
overrides.postgresql.repository/tag), which can lead to divergence; add a
concise clarifying comment next to both declarations (the kind_preload_images
entry and the overrides.postgresql.repository/tag block) stating that these two
values must be updated together when bumping PostgreSQL, and optionally include
the intended source of truth (e.g., "update both entries when upgrading
PostgreSQL") so future editors will keep them in sync.

Signed-off-by: Zoe Blevins <zblevins@nvidia.com>
@zsblevins zsblevins force-pushed the fix-ci-integration branch from ebf71f8 to 76a5b5b Compare May 29, 2026 22:42
@zsblevins
Copy link
Copy Markdown
Collaborator Author

/ok to test 76a5b5b

Plain `docker save` of a multi-arch tag under Docker's containerd image
store exports an OCI index referencing every platform and, for indices
with several platforms (e.g. cert-manager-controller v1.20.2 has 5),
can omit the selected platform's config blob from the tar. This made
`ctr images import --platform=linux/amd64` fail in kind-up with:

  ctr: failed to resolve rootfs: content digest sha256:...: not found

Pass the same `--platform` to `docker save` (Docker 28+/API 1.48+) so
the archive is a self-contained single-platform OCI image.

Signed-off-by: Davesh Mathur <daveshm@nvidia.com>
@dmathur0 dmathur0 force-pushed the fix-ci-integration branch from 6a92183 to 80949e7 Compare May 30, 2026 08:00
@dmathur0
Copy link
Copy Markdown
Collaborator

/ok to test 80949e7

Diagnostic for docker/cli#6457: `docker save --platform <p>` is
refusing to export freshly-pulled multi-arch helper images
("no suitable export target found ... does not provide the specified
platform") even though all blobs for the requested platform are
present. One suspected trigger is layer dedup with sibling images
pulled earlier in the same preload loop (e.g. cert-manager-controller
and envoyproxy/envoy:distroless both ride on the same distroless
cc-debian12 base).

Run `docker system prune -af` before each helper pull so the next
pull cannot share blobs with the previous one. Images held by
running containers (the kind node) are not affected. If this changes
the failure mode the dedup hypothesis is confirmed; if not we'll
move off `docker save` entirely.

Signed-off-by: Davesh Mathur <daveshm@nvidia.com>
@dmathur0
Copy link
Copy Markdown
Collaborator

/ok to test 0f10224

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants