fix kind integration tests in gha#13
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughAdds LOADER_POD_IMAGE and platform-aware Kind helper preload, implements concurrent Docker image builds with streaming logs, adds helm_debug/watch-pods wiring (CLI/Makefile/DeployOptions), improves pod readiness diagnostics, and updates tests, CI, Dockerfiles, schema, and license headers. ChangesKind cluster image preloading, build parallelism, Helm debug, and diagnostics
Sequence Diagram(s)sequenceDiagram
participant User
participant Deployer
participant Docker
participant KindNode
participant Helm
User->>Deployer: run deploy (options: load_kind, helm_debug, watch_pods)
Deployer->>Docker: docker version (detect server os/arch)
Deployer->>Docker: docker pull --platform <os/arch> <helper source image>
Docker-->>Deployer: pulled image (possibly redirected)
Deployer->>Docker: docker tag <source> <LOADER_POD_IMAGE> (if redirected)
Deployer->>KindNode: docker save --platform | ctr images import --platform
Deployer->>Docker: start parallel docker build commands via _run_logged_parallel
Deployer->>Helm: helm upgrade --install [--debug]
Helm->>Deployer: waiting (if watch_pods)
Deployer->>KindNode: poll pods and emit unready summaries while Helm waits
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
|
/ok to test a4e6a61 |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
installer/tests/test_deployer.py (1)
402-423: ⚡ Quick winAssert the cache probe in the cached-image test.
This test currently verifies “no pull” but not that cache detection was actually performed. Adding an inspect assertion makes the regression signal tighter.
✅ Suggested test hardening
diff --git a/installer/tests/test_deployer.py b/installer/tests/test_deployer.py @@ def test_load_kind_uses_cached_loader_image(self, monkeypatch): + run_commands: list[list[str]] = [] logged_commands: list[list[str]] = [] def fake_run(cmd, **kwargs): + run_commands.append(cmd) return MagicMock(returncode=0, stdout="", stderr="") @@ deployer._load_kind() + assert ["docker", "image", "inspect", LOADER_POD_IMAGE] in run_commands assert ["docker", "pull", LOADER_POD_IMAGE] not in logged_commands assert [ "kind",🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@installer/tests/test_deployer.py` around lines 402 - 423, The test test_load_kind_uses_cached_loader_image currently only asserts that a docker pull wasn't issued; update it to also assert that the cache probe (docker image inspect) was performed so cache detection is verified: in the test add an assertion that logged_commands contains a call matching ["docker", "image", "inspect", LOADER_POD_IMAGE] (or the exact args used by _load_kind’s cache probe) before asserting no pull, referencing the Deployer._load_kind flow and the LOADER_POD_IMAGE symbol.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@installer/src/nv_config_manager_installer/deployer.py`:
- Around line 761-771: The loop that inspects/pulls _KIND_PRELOAD_IMAGES can
fail with an unclear raw docker error when running with --load-kind because the
Docker prereq check is only run for --build-images; modify the _load_kind path
to fail fast by invoking the same Docker prerequisite check used for
--build-images before the image loop (or, if that helper doesn't exist, run a
quick check like _run(["docker","--version"], check=False) or
_run(["docker","info"], check=False) and raise/log a clear prereq error). Ensure
the check happens before referencing _KIND_PRELOAD_IMAGES and before calling
_run/_run_logged so missing Docker yields a clear prereq message rather than a
late raw command failure.
---
Nitpick comments:
In `@installer/tests/test_deployer.py`:
- Around line 402-423: The test test_load_kind_uses_cached_loader_image
currently only asserts that a docker pull wasn't issued; update it to also
assert that the cache probe (docker image inspect) was performed so cache
detection is verified: in the test add an assertion that logged_commands
contains a call matching ["docker", "image", "inspect", LOADER_POD_IMAGE] (or
the exact args used by _load_kind’s cache probe) before asserting no pull,
referencing the Deployer._load_kind flow and the LOADER_POD_IMAGE symbol.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 022a2ae2-0149-4a53-a998-e548ee577262
📒 Files selected for processing (3)
installer/src/nv_config_manager_installer/deployer.pyinstaller/src/nv_config_manager_installer/k8s.pyinstaller/tests/test_deployer.py
| for img in _KIND_PRELOAD_IMAGES: | ||
| if _run(["docker", "image", "inspect", img], check=False).returncode != 0: | ||
| self.callback.on_log(f"Pulling helper image {img} for Kind preload...") | ||
| _run_logged(["docker", "pull", img], step, self.callback, timeout=300) | ||
| self.callback.on_log(f"Loading helper image {img} into Kind cluster {cluster}...") | ||
| _run_logged( | ||
| ["kind", "load", "docker-image", img, "--name", cluster], | ||
| step, | ||
| self.callback, | ||
| timeout=300, | ||
| ) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="installer/src/nv_config_manager_installer/deployer.py"
# Sanity: show lines around 740-800 (snippet area)
python3 - <<'PY'
path="installer/src/nv_config_manager_installer/deployer.py"
start=730
end=810
with open(path,'r',encoding='utf-8') as f:
for i,line in enumerate(f, start=1):
if start <= i <= end:
print(f"{i}:{line.rstrip()}")
PY
# Locate _load_kind definition
rg -n "def _load_kind|_load_kind\\b|load_kind" "$FILE"
# Locate docker prereq checks (shutil.which docker / RuntimeError messages)
rg -n "shutil\\.which\\(\"docker\"\\)|docker is required|--build-images|build_images|--load-kind|load_kind" "$FILE"
# Search repo for the same message or check
rg -n "docker is required for --build-images|--build-images/--load-kind" installer/src | head -n 50Repository: NVIDIA/nv-config-manager
Length of output: 4390
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="installer/src/nv_config_manager_installer/deployer.py"
# Show lines around 740-800 where snippet is
python3 - <<'PY'
path="installer/src/nv_config_manager_installer/deployer.py"
start=740
end=800
with open(path,'r',encoding='utf-8') as f:
for i,line in enumerate(f, start=1):
if start <= i <= end:
print(f"{i}:{line.rstrip()}")
PY
# Find _load_kind definition and where options.load_kind is used
rg -n "def _load_kind|_load_kind\\b|load_kind|build_images" "$FILE"
# Find docker prerequisite checks
rg -n "shutil\\.which\\(\"docker\"\\)|docker is required|RuntimeError\\(\"docker|docker\\s+is\\s+required" "$FILE"
# Search for any existing docker prerequisite logic elsewhere
rg -n "shutil\\.which\\(\"docker\"\\)" installer/src | head -n 50
rg -n "docker is required" installer/src | head -n 50Repository: NVIDIA/nv-config-manager
Length of output: 3636
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="installer/src/nv_config_manager_installer/deployer.py"
python3 - <<'PY'
path="installer/src/nv_config_manager_installer/deployer.py"
start=730
end=820
with open(path,'r',encoding='utf-8') as f:
for i,line in enumerate(f, start=1):
if start <= i <= end:
print(f"{i}:{line.rstrip()}")
PY
echo "---- rg _load_kind and options ----"
rg -n "def _load_kind|_load_kind\\b|load_kind|build_images" "$FILE"
echo "---- rg docker prerequisite checks in file ----"
rg -n "shutil\\.which\\(\"docker\"\\)|docker is required|RuntimeError\\(\"docker|docker\\s+is\\s+required" "$FILE" || true
echo "---- rg docker prerequisite checks in repo ----"
rg -n "shutil\\.which\\(\"docker\"\\)" installer/src | head -n 50 || true
rg -n "docker is required" installer/src | head -n 50 || trueRepository: NVIDIA/nv-config-manager
Length of output: 4888
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="installer/src/nv_config_manager_installer/deployer.py"
python3 - <<'PY'
path="installer/src/nv_config_manager_installer/deployer.py"
start=730
end=820
with open(path,'r',encoding='utf-8') as f:
for i,line in enumerate(f, start=1):
if start <= i <= end:
print(f"{i}:{line.rstrip()}")
PY
echo "---- rg _load_kind and options ----"
rg -n "def _load_kind|_load_kind\\b|load_kind|build_images" "$FILE"
echo "---- rg docker prerequisite checks in file ----"
rg -n "shutil\\.which\\(\"docker\"\\)|docker is required|RuntimeError\\(\"docker|docker\\s+is\\s+required" "$FILE" || true
echo "---- rg docker prerequisite checks in repo ----"
rg -n "shutil\\.which\\(\"docker\"\\)" installer/src | head -n 50 || true
rg -n "docker is required" installer/src | head -n 50 || trueRepository: NVIDIA/nv-config-manager
Length of output: 4888
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="installer/src/nv_config_manager_installer/deployer.py"
python3 - <<'PY'
path="installer/src/nv_config_manager_installer/deployer.py"
start=730
end=820
with open(path,'r',encoding='utf-8') as f:
for i,line in enumerate(f, start=1):
if start <= i <= end:
print(f"{i}:{line.rstrip()}")
PY
echo "---- rg _load_kind and options ----"
rg -n "def _load_kind|_load_kind\\b|load_kind|build_images" "$FILE"
echo "---- rg docker prerequisite checks in file ----"
rg -n "shutil\\.which\\(\"docker\"\\)|docker is required|RuntimeError\\(\"docker|docker\\s+is\\s+required" "$FILE" || true
echo "---- rg docker prerequisite checks in repo ----"
rg -n "shutil\\.which\\(\"docker\"\\)" installer/src | head -n 50 || true
rg -n "docker is required" installer/src | head -n 50 || trueRepository: NVIDIA/nv-config-manager
Length of output: 4888
Fail fast when Docker is missing for --load-kind.
_load_kind() runs docker image inspect / docker pull for _KIND_PRELOAD_IMAGES, but the existing Docker prerequisite check only triggers for --build-images, so --load-kind can fail late with a raw docker command error instead of a clear prereq message.
🔧 Proposed fix
diff --git a/installer/src/nv_config_manager_installer/deployer.py b/installer/src/nv_config_manager_installer/deployer.py
@@
- if self.options.build_images and not shutil.which("docker"):
- raise RuntimeError("docker is required for --build-images")
+ if (self.options.build_images or self.options.load_kind) and not shutil.which("docker"):
+ raise RuntimeError("docker is required for --build-images/--load-kind")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@installer/src/nv_config_manager_installer/deployer.py` around lines 761 -
771, The loop that inspects/pulls _KIND_PRELOAD_IMAGES can fail with an unclear
raw docker error when running with --load-kind because the Docker prereq check
is only run for --build-images; modify the _load_kind path to fail fast by
invoking the same Docker prerequisite check used for --build-images before the
image loop (or, if that helper doesn't exist, run a quick check like
_run(["docker","--version"], check=False) or _run(["docker","info"],
check=False) and raise/log a clear prereq error). Ensure the check happens
before referencing _KIND_PRELOAD_IMAGES and before calling _run/_run_logged so
missing Docker yields a clear prereq message rather than a late raw command
failure.
a4e6a61 to
6229738
Compare
|
/ok to test 6229738 |
There was a problem hiding this comment.
♻️ Duplicate comments (1)
installer/src/nv_config_manager_installer/deployer.py (1)
782-797:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDocker prerequisite check missing for
--load-kind(duplicate of past review).The new preload logic calls
docker pulland_docker_server_platform()(which runsdocker version), but the prerequisite check at lines 595-598 only verifiesdockeris available for--build-images, not for--load-kind. If docker is missing,_load_kind()will fail late with a raw command error instead of a clear prerequisite message.🔧 Proposed fix
- if self.options.build_images and not shutil.which("docker"): - raise RuntimeError("docker is required for --build-images") + if (self.options.build_images or self.options.load_kind) and not shutil.which("docker"): + raise RuntimeError("docker is required for --build-images/--load-kind") if self.options.load_kind and not shutil.which("kind"): raise RuntimeError("kind is required for --load-kind")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@installer/src/nv_config_manager_installer/deployer.py` around lines 782 - 797, The preload path for --load-kind calls _docker_server_platform(), runs "docker pull", and uses _KIND_PRELOAD_IMAGES but the existing prerequisite check only verifies docker for --build-images; update the prereq logic so docker availability is also checked when _load_kind() will be executed (i.e., when the --load-kind flag or the code path that triggers image preloading runs). Specifically, add a docker check before calling _docker_server_platform() or entering the loop over _KIND_PRELOAD_IMAGES, and surface a clear prerequisite error via the same callback.on_log/process-exit behavior used elsewhere rather than letting _run_logged/_docker_server_platform() raise a raw command error.
🧹 Nitpick comments (1)
installer/tests/test_deployer.py (1)
366-406: 💤 Low valueConsider adding tests for edge cases.
The test covers the happy path well. Consider adding tests for:
- Fallback behavior when
docker versionfails (line 374 could returnreturncode=1)- Multiple images in
_KIND_PRELOAD_IMAGESto verify the loop📝 Example: test for docker version fallback
def test_load_kind_uses_fallback_when_docker_version_fails(self, monkeypatch): run_commands: list[list[str]] = [] logged_commands: list[list[str]] = [] def fake_run(cmd, **kwargs): run_commands.append(cmd) if cmd[:3] == ["docker", "version", "--format"]: return MagicMock(returncode=1, stdout="", stderr="error") return MagicMock(returncode=0, stdout="", stderr="") def fake_run_logged(cmd, step, callback, **kwargs): logged_commands.append(cmd) return MagicMock(returncode=0, stdout="", stderr="") monkeypatch.setattr("nv_config_manager_installer.deployer._run", fake_run) monkeypatch.setattr("nv_config_manager_installer.deployer._run_logged", fake_run_logged) monkeypatch.setattr("nv_config_manager_installer.deployer.platform.machine", lambda: "x86_64") deployer = Deployer( _make_config(), DeployOptions(load_kind=True, kind_cluster="test-cluster"), RecordingCallback(), ) deployer._load_kind() assert ["docker", "pull", "--platform", "linux/amd64", LOADER_POD_IMAGE] in logged_commands🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@installer/tests/test_deployer.py` around lines 366 - 406, Add two tests to TestKindImageLoading to cover edge cases: (1) a fallback when docker version fails — copy the existing test but have fake_run return MagicMock(returncode=1, ...) for the ["docker","version","--format"] call, monkeypatch platform.machine to "x86_64", call Deployer._load_kind and assert that ["docker","pull","--platform","linux/amd64", LOADER_POD_IMAGE] appears in logged_commands (reference fake_run, fake_run_logged, Deployer._load_kind, LOADER_POD_IMAGE); (2) multiple images in _KIND_PRELOAD_IMAGES — monkeypatch nv_config_manager_installer.deployer._KIND_PRELOAD_IMAGES to a list of two image names, run the same setup and assert that a docker pull (with --platform/linux/amd64 fallback as needed) and a kind load are invoked for each image in logged_commands (reference _KIND_PRELOAD_IMAGES and the same fake_run/fake_run_logged helpers).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@installer/src/nv_config_manager_installer/deployer.py`:
- Around line 782-797: The preload path for --load-kind calls
_docker_server_platform(), runs "docker pull", and uses _KIND_PRELOAD_IMAGES but
the existing prerequisite check only verifies docker for --build-images; update
the prereq logic so docker availability is also checked when _load_kind() will
be executed (i.e., when the --load-kind flag or the code path that triggers
image preloading runs). Specifically, add a docker check before calling
_docker_server_platform() or entering the loop over _KIND_PRELOAD_IMAGES, and
surface a clear prerequisite error via the same callback.on_log/process-exit
behavior used elsewhere rather than letting
_run_logged/_docker_server_platform() raise a raw command error.
---
Nitpick comments:
In `@installer/tests/test_deployer.py`:
- Around line 366-406: Add two tests to TestKindImageLoading to cover edge
cases: (1) a fallback when docker version fails — copy the existing test but
have fake_run return MagicMock(returncode=1, ...) for the
["docker","version","--format"] call, monkeypatch platform.machine to "x86_64",
call Deployer._load_kind and assert that
["docker","pull","--platform","linux/amd64", LOADER_POD_IMAGE] appears in
logged_commands (reference fake_run, fake_run_logged, Deployer._load_kind,
LOADER_POD_IMAGE); (2) multiple images in _KIND_PRELOAD_IMAGES — monkeypatch
nv_config_manager_installer.deployer._KIND_PRELOAD_IMAGES to a list of two image
names, run the same setup and assert that a docker pull (with
--platform/linux/amd64 fallback as needed) and a kind load are invoked for each
image in logged_commands (reference _KIND_PRELOAD_IMAGES and the same
fake_run/fake_run_logged helpers).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b59f860d-3f8a-48af-aa9d-120de27c0bc5
📒 Files selected for processing (3)
installer/src/nv_config_manager_installer/deployer.pyinstaller/src/nv_config_manager_installer/k8s.pyinstaller/tests/test_deployer.py
🚧 Files skipped from review as they are similar to previous changes (1)
- installer/src/nv_config_manager_installer/k8s.py
6229738 to
be45d7c
Compare
|
/ok to test be45d7c |
|
/ok to test 1d181d5 |
1d181d5 to
37047ae
Compare
|
/ok to test 37047ae |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.github/workflows/kind-integration.yml (1)
121-134:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRetry only timeout/transient failures; current loop retries deterministic test failures and overwrites diagnostics.
At Line 123 and Line 127, any non-zero pytest exit is retried, and
integration-test-results.xmlis overwritten on each attempt. This increases CI time on real failures and can hide first-failure context.Suggested patch
- name: Run integration tests timeout-minutes: 50 run: | for attempt in 1 2 3; do echo "Attempt ${attempt} of 3" - if timeout --kill-after=1m 15m uv run pytest "${TEST_PATH}" -v \ + junit_file="integration-test-results-attempt-${attempt}.xml" + if timeout --kill-after=1m 15m uv run pytest "${TEST_PATH}" -v \ --ci \ --nv-config-manager-namespace "${NV_CONFIG_MANAGER_NAMESPACE}" \ --tb=short \ - --junitxml=integration-test-results.xml; then + --junitxml="${junit_file}"; then + cp "${junit_file}" integration-test-results.xml exit 0 fi + exit_code=$? + # Retry only on timeout (124/137). Fail fast on deterministic test failures. + if [ "${exit_code}" -ne 124 ] && [ "${exit_code}" -ne 137 ]; then + cp "${junit_file}" integration-test-results.xml || true + exit "${exit_code}" + fi if [ "${attempt}" -lt 3 ]; then sleep 10 fi done exit 1🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/kind-integration.yml around lines 121 - 134, The retry loop currently reruns the command `timeout --kill-after=1m 15m uv run pytest "${TEST_PATH}" -v ... --junitxml=integration-test-results.xml` for any non-zero exit and overwrites `integration-test-results.xml`; change it to run pytest into an attempt-specific results file (e.g. `--junitxml=integration-test-results-${attempt}.xml`), capture the pytest exit code, and decide: if exit indicates a timeout/kill (e.g. process killed by timeout/exit code 137 or signal) or other known transient condition, then retry up to 3 attempts; otherwise for deterministic failures immediately copy the current attempt XML to `integration-test-results.xml` and exit non-zero; on success copy the successful attempt XML to `integration-test-results.xml` and exit 0. Update the loop surrounding the `timeout ... uv run pytest` invocation and reference `${attempt}`, `${TEST_PATH}`, and `integration-test-results.xml` when implementing these checks and file copies.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In @.github/workflows/kind-integration.yml:
- Around line 121-134: The retry loop currently reruns the command `timeout
--kill-after=1m 15m uv run pytest "${TEST_PATH}" -v ...
--junitxml=integration-test-results.xml` for any non-zero exit and overwrites
`integration-test-results.xml`; change it to run pytest into an attempt-specific
results file (e.g. `--junitxml=integration-test-results-${attempt}.xml`),
capture the pytest exit code, and decide: if exit indicates a timeout/kill (e.g.
process killed by timeout/exit code 137 or signal) or other known transient
condition, then retry up to 3 attempts; otherwise for deterministic failures
immediately copy the current attempt XML to `integration-test-results.xml` and
exit non-zero; on success copy the successful attempt XML to
`integration-test-results.xml` and exit 0. Update the loop surrounding the
`timeout ... uv run pytest` invocation and reference `${attempt}`,
`${TEST_PATH}`, and `integration-test-results.xml` when implementing these
checks and file copies.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 06525d1b-baa4-48eb-8777-1be8f1eee622
📒 Files selected for processing (6)
.github/workflows/kind-integration.ymlMakefileinstaller/src/nv_config_manager_installer/cli.pyinstaller/src/nv_config_manager_installer/deployer.pyinstaller/tests/test_cli.pyinstaller/tests/test_deployer.py
🚧 Files skipped from review as they are similar to previous changes (5)
- Makefile
- installer/tests/test_cli.py
- installer/src/nv_config_manager_installer/cli.py
- installer/tests/test_deployer.py
- installer/src/nv_config_manager_installer/deployer.py
|
/ok to test 864f985 |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@installer/src/nv_config_manager_installer/deployer.py`:
- Around line 602-612: In _poll_pod_summary(), after calling
_unready_pod_summary_lines(k8s, namespace) and before any call to
_emit_pod_summary or computing signature/heartbeat logic, re-check
stop_event.is_set() and return/break early if set; this prevents emitting a
final pod-summary after shutdown was requested while list_namespaced_pod() was
in-flight. Update the loop in _poll_pod_summary to check stop_event immediately
after obtaining lines (and before computing
signature/has_blockers/last_emit/_emit_pod_summary) so the thread exits without
logging when stop_event has been set.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: cc0d0934-655e-49c7-aac6-69e036e4933a
📒 Files selected for processing (5)
.github/workflows/kind-integration.ymlinstaller/src/nv_config_manager_installer/cli.pyinstaller/src/nv_config_manager_installer/deployer.pyinstaller/tests/test_cli.pyinstaller/tests/test_deployer.py
✅ Files skipped from review due to trivial changes (1)
- installer/tests/test_cli.py
🚧 Files skipped from review as they are similar to previous changes (1)
- installer/src/nv_config_manager_installer/cli.py
| while not stop_event.is_set(): | ||
| lines = _unready_pod_summary_lines(k8s, namespace) | ||
| signature = "\n".join(lines) | ||
| now = time.monotonic() | ||
| has_blockers = lines != ["all pods ready"] | ||
| if signature != last_signature or (has_blockers and now - last_emit >= heartbeat_interval): | ||
| _emit_pod_summary(lines, step, callback) | ||
| last_signature = signature | ||
| last_emit = now | ||
| if stop_event.wait(interval): | ||
| break |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Locate the file
ls -la installer/src/nv_config_manager_installer/deployer.py
# Show the relevant sections around the referenced line numbers
sed -n '560,700p' installer/src/nv_config_manager_installer/deployer.py | cat -n
# Find the poller/thread lifecycle helpers
rg -n "def _poll_pod_summary|_run_logged_with_pod_summary|stop_event|watch_pods|_unready_pod_summary_lines" installer/src/nv_config_manager_installer/deployer.py
# Show the helper implementations for the poller and its callers
# (Use narrower slices around matches to avoid large output)
python3 - <<'PY'
import re, subprocess, os, textwrap, sys, pathlib
path="installer/src/nv_config_manager_installer/deployer.py"
# Get line numbers for key defs/usages
patterns=[
r"def _poll_pod_summary\(",
r"def _run_logged_with_pod_summary\(",
r"def _unready_pod_summary_lines\(",
r"def _helm_install\(",
]
# Use rg to get exact line numbers
for pat in patterns:
out=subprocess.check_output(["rg","-n",pat,path], text=True).strip().splitlines()
if not out:
print("NO MATCH", pat)
continue
for line in out:
ln=int(line.split(":")[0])
print(f"\n--- {pat} around {ln} ---")
start=max(1,ln-60); end=ln+120
subprocess.run(["sed","-n",f"{start},{end}p",path], check=True)
PYRepository: NVIDIA/nv-config-manager
Length of output: 31655
Re-check stop_event before emitting pod summary logs in _poll_pod_summary()
_poll_pod_summary() emits [pods] after _unready_pod_summary_lines() returns, but it doesn’t re-check stop_event in between. If shutdown sets stop_event while k8s.v1.list_namespaced_pod() is still in-flight, the poll thread can still log one more pod-summary batch even though _run_logged_with_pod_summary() has returned (it only join(timeout=2) and the thread is daemon=True). Apply the early-exit check before computing/signaling emissions.
🔧 Proposed fix
while not stop_event.is_set():
lines = _unready_pod_summary_lines(k8s, namespace)
+ if stop_event.is_set():
+ break
signature = "\n".join(lines)
now = time.monotonic()
has_blockers = lines != ["all pods ready"]
if signature != last_signature or (has_blockers and now - last_emit >= heartbeat_interval):
_emit_pod_summary(lines, step, callback)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@installer/src/nv_config_manager_installer/deployer.py` around lines 602 -
612, In _poll_pod_summary(), after calling _unready_pod_summary_lines(k8s,
namespace) and before any call to _emit_pod_summary or computing
signature/heartbeat logic, re-check stop_event.is_set() and return/break early
if set; this prevents emitting a final pod-summary after shutdown was requested
while list_namespaced_pod() was in-flight. Update the loop in _poll_pod_summary
to check stop_event immediately after obtaining lines (and before computing
signature/has_blockers/last_emit/_emit_pod_summary) so the thread exits without
logging when stop_event has been set.
|
/ok to test 6da405f |
Signed-off-by: Zoe Blevins <zblevins@nvidia.com>
Signed-off-by: Zoe Blevins <zblevins@nvidia.com>
Signed-off-by: Zoe Blevins <zblevins@nvidia.com>
Signed-off-by: Zoe Blevins <zblevins@nvidia.com>
Signed-off-by: Zoe Blevins <zblevins@nvidia.com>
6da405f to
e62bc48
Compare
|
/ok to test e62bc48 |
e62bc48 to
ebf71f8
Compare
|
/ok to test ebf71f8 |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
deploy/configs/ci-integration-test.yaml (1)
81-82: 💤 Low valueComment mentions only Docker Hub but list includes multiple registries.
The comment states images are from "Docker Hub" but the list includes images from
docker.io,quay.io, andghcr.io. Consider updating to "from public registries" or "from Docker Hub and other registries" for accuracy.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deploy/configs/ci-integration-test.yaml` around lines 81 - 82, Update the descriptive comment that currently says "CI preloads third-party runtime images into Kind through the host Docker daemon. This avoids anonymous in-cluster pulls from Docker Hub during Helm's --wait window." to accurately reflect the registries used by the image list: replace "Docker Hub" with "public registries (e.g., docker.io, quay.io, ghcr.io)" or "Docker Hub and other registries" so the comment above the CI preload block correctly describes the sources.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@deploy/configs/ci-integration-test.yaml`:
- Around line 83-102: The kind_preload_images list in the YAML doesn't include
busybox despite the PR stating the change fixes a timeout caused by pulling the
busybox image; either add the busybox image (e.g.,
docker.io/library/busybox:latest or a specific tag) to the kind_preload_images
array or update the PR description to remove the busybox-related claim; locate
the kind_preload_images section and add the busybox entry (or change the PR
text) so the code and PR description are consistent.
- Around line 83-102: Update the kind_preload_images verification workflow to
stop using docker manifest inspect and the dummy "Advisory check for:" log:
replace the docker-based manifest check with a Dockerless registry inspect
(e.g., use skopeo inspect --raw docker://<image> or crane manifest <image>) to
confirm each tag exists and return non-zero on missing tags, and replace the
placeholder advisory step with a remote image scan using trivy (trivy image
--severity HIGH,CRITICAL --exit-code 1 --remote <image>) so the job fails when
HIGH/CRITICAL findings are present; target the sections referencing
kind_preload_images and the current "docker manifest inspect" and "Advisory
check for:" steps to implement these changes.
- Around line 102-106: The PostgreSQL image is declared twice (in
kind_preload_images and in overrides.postgresql.repository/tag), which can lead
to divergence; add a concise clarifying comment next to both declarations (the
kind_preload_images entry and the overrides.postgresql.repository/tag block)
stating that these two values must be updated together when bumping PostgreSQL,
and optionally include the intended source of truth (e.g., "update both entries
when upgrading PostgreSQL") so future editors will keep them in sync.
---
Nitpick comments:
In `@deploy/configs/ci-integration-test.yaml`:
- Around line 81-82: Update the descriptive comment that currently says "CI
preloads third-party runtime images into Kind through the host Docker daemon.
This avoids anonymous in-cluster pulls from Docker Hub during Helm's --wait
window." to accurately reflect the registries used by the image list: replace
"Docker Hub" with "public registries (e.g., docker.io, quay.io, ghcr.io)" or
"Docker Hub and other registries" so the comment above the CI preload block
correctly describes the sources.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: ddc75572-3850-451a-8175-494931f97fd7
📒 Files selected for processing (6)
deploy/configs/ci-integration-test.yamldocs/install/index.mdxinstaller/src/nv_config_manager_installer/deployer.pyinstaller/src/nv_config_manager_installer/schema.pyinstaller/tests/test_deployer.pyinstaller/tests/test_schema.py
✅ Files skipped from review due to trivial changes (1)
- docs/install/index.mdx
🚧 Files skipped from review as they are similar to previous changes (4)
- installer/src/nv_config_manager_installer/schema.py
- installer/tests/test_schema.py
- installer/tests/test_deployer.py
- installer/src/nv_config_manager_installer/deployer.py
| kind_preload_images: | ||
| - docker.io/hashicorp/http-echo:1.0 | ||
| - docker.io/alpine/kubectl:1.35.4 | ||
| - docker.io/library/redis:7-alpine | ||
| - docker.io/library/nats:2.10-alpine | ||
| - docker.io/natsio/nats-box:0.14.3 | ||
| - docker.io/temporalio/server:1.29 | ||
| - docker.io/temporalio/admin-tools:1.29 | ||
| - docker.io/temporalio/ui:v2.37.4 | ||
| - docker.io/nginxinc/nginx-unprivileged:1.27 | ||
| - docker.io/envoyproxy/gateway:v1.6.5 | ||
| - docker.io/envoyproxy/ratelimit:c8765e89 | ||
| - docker.io/envoyproxy/envoy:distroless-v1.36.5 | ||
| - quay.io/jetstack/cert-manager-controller:v1.20.2 | ||
| - quay.io/jetstack/cert-manager-webhook:v1.20.2 | ||
| - quay.io/jetstack/cert-manager-cainjector:v1.20.2 | ||
| - quay.io/jetstack/cert-manager-startupapicheck:v1.20.2 | ||
| - quay.io/jetstack/cert-manager-acmesolver:v1.20.2 | ||
| - ghcr.io/cloudnative-pg/cloudnative-pg:1.29.0 | ||
| - ghcr.io/cloudnative-pg/postgresql:18.0-system-trixie |
There was a problem hiding this comment.
Busybox image missing from preload list despite PR objectives.
The PR description states the purpose is to "Fix a timeout in the Kind integration test caused by pulling the busybox image," but busybox (or library/busybox) is not present in the kind_preload_images list. This is inconsistent with the stated objective.
If the busybox timeout has been resolved through other means or is no longer relevant, please update the PR description. Otherwise, consider adding the busybox image to this preload list.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@deploy/configs/ci-integration-test.yaml` around lines 83 - 102, The
kind_preload_images list in the YAML doesn't include busybox despite the PR
stating the change fixes a timeout caused by pulling the busybox image; either
add the busybox image (e.g., docker.io/library/busybox:latest or a specific tag)
to the kind_preload_images array or update the PR description to remove the
busybox-related claim; locate the kind_preload_images section and add the
busybox entry (or change the PR text) so the code and PR description are
consistent.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Verify all preloaded images exist and check for vulnerabilities
images=(
"docker.io/hashicorp/http-echo:1.0"
"docker.io/alpine/kubectl:1.35.4"
"docker.io/library/redis:7-alpine"
"docker.io/library/nats:2.10-alpine"
"docker.io/natsio/nats-box:0.14.3"
"docker.io/temporalio/server:1.29"
"docker.io/temporalio/admin-tools:1.29"
"docker.io/temporalio/ui:v2.37.4"
"docker.io/nginxinc/nginx-unprivileged:1.27"
"docker.io/envoyproxy/gateway:v1.6.5"
"docker.io/envoyproxy/ratelimit:c8765e89"
"docker.io/envoyproxy/envoy:distroless-v1.36.5"
"quay.io/jetstack/cert-manager-controller:v1.20.2"
"quay.io/jetstack/cert-manager-webhook:v1.20.2"
"quay.io/jetstack/cert-manager-cainjector:v1.20.2"
"quay.io/jetstack/cert-manager-startupapicheck:v1.20.2"
"quay.io/jetstack/cert-manager-acmesolver:v1.20.2"
"ghcr.io/cloudnative-pg/cloudnative-pg:1.29.0"
"ghcr.io/cloudnative-pg/postgresql:18.0-system-trixie"
)
echo "Checking image manifest existence..."
for img in "${images[@]}"; do
echo "Checking: $img"
if ! docker manifest inspect "$img" > /dev/null 2>&1; then
echo " ❌ FAILED: Image not found or inaccessible"
else
echo " ✓ OK"
fi
done
echo ""
echo "Checking for security advisories (GitHub)..."
# Check for known CVEs in common images
for img in "${images[@]}"; do
repo=$(echo "$img" | cut -d: -f1 | sed 's|docker.io/library/||' | sed 's|docker.io/||' | sed 's|/|%2F|g')
echo "Advisory check for: $img"
doneRepository: NVIDIA/nv-config-manager
Length of output: 3117
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== docker availability =="
command -v docker || true
docker --version || true
docker info >/dev/null 2>&1 && echo "docker info: OK" || echo "docker info: FAILED (daemon/network?)"
echo
echo "== manifest lookup sanity check =="
# Use a very common public image
KNOWN="docker.io/library/busybox:latest"
echo "Checking: $KNOWN"
if docker manifest inspect "$KNOWN" >/dev/null 2>&1; then
echo " ✓ OK"
else
echo " ❌ FAILED"
fi
echo
echo "== tool availability (optional) =="
command -v trivy >/dev/null 2>&1 && echo "trivy: present" || echo "trivy: not present"Repository: NVIDIA/nv-config-manager
Length of output: 336
Fix the kind_preload_images verification to actually check tag existence + vulnerabilities
- The proposed
docker manifest inspectapproach is not portable: it requires a working Docker client/daemon and registry access; otherwise it fails to validate the tags. Use a Dockerless registry tool (e.g.,skopeo inspect --raw docker://<image>orcrane manifest <image>) to confirm each tag exists. - The “Checking for security advisories (GitHub)” section doesn’t perform any real advisory/CVE lookup (it only prints
Advisory check for: ...). Replace that part with an actual scanner;trivyis available, so scan the remote images for HIGH/CRITICAL and fail the job on findings.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@deploy/configs/ci-integration-test.yaml` around lines 83 - 102, Update the
kind_preload_images verification workflow to stop using docker manifest inspect
and the dummy "Advisory check for:" log: replace the docker-based manifest check
with a Dockerless registry inspect (e.g., use skopeo inspect --raw
docker://<image> or crane manifest <image>) to confirm each tag exists and
return non-zero on missing tags, and replace the placeholder advisory step with
a remote image scan using trivy (trivy image --severity HIGH,CRITICAL
--exit-code 1 --remote <image>) so the job fails when HIGH/CRITICAL findings are
present; target the sections referencing kind_preload_images and the current
"docker manifest inspect" and "Advisory check for:" steps to implement these
changes.
| - ghcr.io/cloudnative-pg/postgresql:18.0-system-trixie | ||
| overrides: | ||
| postgresql: | ||
| repository: ghcr.io/cloudnative-pg/postgresql | ||
| tag: 18.0-system-trixie |
There was a problem hiding this comment.
PostgreSQL image specified in both preload list and overrides.
The PostgreSQL image ghcr.io/cloudnative-pg/postgresql:18.0-system-trixie is defined in two places:
- Line 102: In the
kind_preload_imageslist - Lines 104-106: In the
overrides.postgresqlblock
While currently consistent, maintaining the same image in two locations increases the risk of divergence during future updates.
Consider adding a comment linking these two declarations or documenting that both must be updated together when upgrading PostgreSQL.
📝 Example clarifying comment
- ghcr.io/cloudnative-pg/cloudnative-pg:1.29.0
+ # Note: PostgreSQL image below must match overrides.postgresql (lines 104-106)
- ghcr.io/cloudnative-pg/postgresql:18.0-system-trixie
overrides:
+ # Note: Must match kind_preload_images entry (line 102)
postgresql:
repository: ghcr.io/cloudnative-pg/postgresql
tag: 18.0-system-trixie🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@deploy/configs/ci-integration-test.yaml` around lines 102 - 106, The
PostgreSQL image is declared twice (in kind_preload_images and in
overrides.postgresql.repository/tag), which can lead to divergence; add a
concise clarifying comment next to both declarations (the kind_preload_images
entry and the overrides.postgresql.repository/tag block) stating that these two
values must be updated together when bumping PostgreSQL, and optionally include
the intended source of truth (e.g., "update both entries when upgrading
PostgreSQL") so future editors will keep them in sync.
Signed-off-by: Zoe Blevins <zblevins@nvidia.com>
ebf71f8 to
76a5b5b
Compare
|
/ok to test 76a5b5b |
Plain `docker save` of a multi-arch tag under Docker's containerd image store exports an OCI index referencing every platform and, for indices with several platforms (e.g. cert-manager-controller v1.20.2 has 5), can omit the selected platform's config blob from the tar. This made `ctr images import --platform=linux/amd64` fail in kind-up with: ctr: failed to resolve rootfs: content digest sha256:...: not found Pass the same `--platform` to `docker save` (Docker 28+/API 1.48+) so the archive is a self-contained single-platform OCI image. Signed-off-by: Davesh Mathur <daveshm@nvidia.com>
6a92183 to
80949e7
Compare
|
/ok to test 80949e7 |
Diagnostic for docker/cli#6457: `docker save --platform <p>` is refusing to export freshly-pulled multi-arch helper images ("no suitable export target found ... does not provide the specified platform") even though all blobs for the requested platform are present. One suspected trigger is layer dedup with sibling images pulled earlier in the same preload loop (e.g. cert-manager-controller and envoyproxy/envoy:distroless both ride on the same distroless cc-debian12 base). Run `docker system prune -af` before each helper pull so the next pull cannot share blobs with the previous one. Images held by running containers (the kind node) are not affected. If this changes the failure mode the dedup hypothesis is confirmed; if not we'll move off `docker save` entirely. Signed-off-by: Davesh Mathur <daveshm@nvidia.com>
|
/ok to test 0f10224 |
Description
Fix timeout in kind integration test caused by busybox pull
Validation
The kind integration test is manual due to taking ~30 min to complete. When the PR is ready for review,
run Actions -> Kind Integration -> Run workflow against the copy-pr-bot generated
pull-request/<PR_NUMBER>branch. Use the defaulttest_pathfor the full suite,or narrow it only while debugging.
Passing Kind Integration run:
Checklist
CONTRIBUTING.md.docs screenshots, or Helm/rendered outputs.
Summary by CodeRabbit
New Features
Bug Fixes
Tests
Chores