Update existing training images to latest version #130
sutaakar merged 1 commit into opendatahub-io:main from
📝 Walkthrough
This pull request updates container image digest references across parameter files and Kubernetes runtime manifests. Five digests are replaced: two in
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~4 minutes
Security findings
Verify image digest provenance before merge. Container image pinning via SHA256 digests is a sound supply chain security practice, but these digests must be validated against known-good sources prior to deployment to mitigate CWE-426 (Untrusted Search Path) and general supply chain attack vectors. Confirm:
If automated image scanning is not part of the deployment workflow, add it before merging any future digest updates.
🚥 Pre-merge checks | ✅ 2 passed
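As a first, purely local step toward the provenance check described in the security findings, one can confirm that each reference is pinned by a full, well-formed sha256 digest. The sketch below checks the format only; it says nothing about whether the digest exists in the registry or comes from a trusted build. The helper name `is_pinned_ref` and the `:latest` reference are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: succeed only if a reference is pinned by a full sha256 digest.
# is_pinned_ref is a hypothetical helper name, not part of the repository.
is_pinned_ref() {
    # Repository path, then "@sha256:", then exactly 64 lowercase hex characters.
    printf '%s\n' "$1" | grep -Eq '^[^@[:space:]]+@sha256:[0-9a-f]{64}$'
}

# Usage: the first reference is taken from this PR; the second (":latest")
# is a made-up tag-based reference used only to show the failing case.
for ref in \
    "quay.io/opendatahub/odh-training-cuda128-torch29-py312@sha256:0be52d5775e95026c3899a208d9fbecb59489d48763664e842b92e66d3c112c8" \
    "quay.io/opendatahub/odh-training-cuda128-torch29-py312:latest"
do
    if is_pinned_ref "$ref"; then
        echo "pinned:   $ref"
    else
        echo "unpinned: $ref"
    fi
done
```

A format check like this is cheap enough to run on every change, but it would have to be paired with a registry lookup and an image scan to cover the findings above.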
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@manifests/rhoai/params.env`:
- Around line 2-3: The two image digest entries
odh-training-cuda128-torch29-py312-image and
odh-training-rocm64-torch29-py312-image in params.env reference nonexistent
digests; update those variable values to point to valid image references (either
a verified sha256 digest or a stable tag) so kustomize can pull the images;
locate and edit the odh-training-cuda128-torch29-py312-image and
odh-training-rocm64-torch29-py312-image lines in manifests/rhoai/params.env and
replace the broken sha256 values with the correct digest or tag obtained from
the registry, then verify by doing a docker/registry pull to confirm the values
resolve.
In `@manifests/rhoai/runtimes/torch_distributed_rocm64_torch29_py312.yaml`:
- Line 25: The YAML contains a hardcoded image digest for
torch-distributed-rocm64-torch29-py312 that conflicts with the existing
kustomization replacement sourced from params.env; remove the digest portion of
the image string so the image field becomes image:
quay.io/opendatahub/odh-training-rocm64-torch29-py312 (no `@sha256` suffix) and
rely solely on the kustomization replacement for injecting the digest, or
alternatively remove/disable the kustomization replacement if you intend to keep
the digest here—update the manifest referencing the image field for the resource
named torch-distributed-rocm64-torch29-py312 accordingly.
In `@manifests/rhoai/runtimes/training_hub_th05_cuda128_torch29_py312.yaml`:
- Line 25: The image digest referenced in the image field
("quay.io/opendatahub/odh-training-cuda128-torch29-py312@sha256:0be52d5775e95026c3899a208d9fbecb59489d48763664e842b92e66d3c112c8")
cannot be resolved; update the image reference in the manifest by querying the
registry for the correct digest or replacing the digest reference with a
verified, scanned image tag (e.g., a known good sha256 from quay or an approved
tag like a release image) and ensure the new value is used in the image: line so
the pod can pull a valid, integrity-checked image.
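The prompts above ask for the correct digest to be obtained from the registry. A minimal sketch of that lookup, assuming `skopeo` is installed and that a tag pointing at the intended build is known (the PR does not name one, so the tag is left as a parameter). `pin_ref` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: turn a tag into a digest-pinned reference by asking the registry.
# pin_ref is a hypothetical helper; the tag passed to it is an assumption,
# since the PR does not say which tag the new digests were taken from.
pin_ref() (
    repo=$1
    tag=$2
    # skopeo prints the manifest digest of the tagged image, e.g. "sha256:...".
    digest=$(skopeo inspect --format '{{.Digest}}' "docker://${repo}:${tag}") || exit 1
    case $digest in
        sha256:*) printf '%s@%s\n' "$repo" "$digest" ;;
        *) echo "unexpected digest: $digest" >&2; exit 1 ;;
    esac
)

# Usage (requires registry access):
#   pin_ref quay.io/opendatahub/odh-training-cuda128-torch29-py312 <tag>
```

The printed value can then be pasted into params.env, which keeps the digest traceable to a specific tag at the time of the update.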
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 599dd017-ea11-44df-9dbb-b5180bccdc26
📒 Files selected for processing (4)
manifests/rhoai/params.env
manifests/rhoai/runtimes/torch_distributed_cuda128_torch29_py312.yaml
manifests/rhoai/runtimes/torch_distributed_rocm64_torch29_py312.yaml
manifests/rhoai/runtimes/training_hub_th05_cuda128_torch29_py312.yaml
odh-training-cuda128-torch29-py312-image=quay.io/opendatahub/odh-training-cuda128-torch29-py312@sha256:0be52d5775e95026c3899a208d9fbecb59489d48763664e842b92e66d3c112c8
odh-training-rocm64-torch29-py312-image=quay.io/opendatahub/odh-training-rocm64-torch29-py312@sha256:80878d0d51fa6bc8957f669e7f3facac13669562d393a6bfc45ca8dff277c2fa
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify both CUDA128 and ROCm64 image digests exist
CUDA_IMAGE="quay.io/opendatahub/odh-training-cuda128-torch29-py312@sha256:0be52d5775e95026c3899a208d9fbecb59489d48763664e842b92e66d3c112c8"
ROCM_IMAGE="quay.io/opendatahub/odh-training-rocm64-torch29-py312@sha256:80878d0d51fa6bc8957f669e7f3facac13669562d393a6bfc45ca8dff277c2fa"
echo "=== Verifying CUDA128 image ==="
skopeo inspect --raw docker://${CUDA_IMAGE} > /dev/null 2>&1
if [ $? -eq 0 ]; then
  echo "✓ CUDA128 digest exists"
  skopeo inspect docker://${CUDA_IMAGE} | jq '{Digest: .Digest, Created: .Created, Size: .Size}'
else
  echo "✗ CUDA128 digest not found"
fi
echo -e "\n=== Verifying ROCm64 image ==="
skopeo inspect --raw docker://${ROCM_IMAGE} > /dev/null 2>&1
if [ $? -eq 0 ]; then
  echo "✓ ROCm64 digest exists"
  skopeo inspect docker://${ROCM_IMAGE} | jq '{Digest: .Digest, Created: .Created, Size: .Size}'
else
  echo "✗ ROCm64 digest not found"
fi
Repository: opendatahub-io/trainer
Length of output: 182
Neither image digest exists in the registry.
Lines 2–3 reference nonexistent digests for the CUDA128 and ROCm64 images. Kustomize only injects these references and does not pull images itself, so the failure will surface at deploy time, when the cluster attempts to pull them. Supply chain risk (CWE-494): params.env is the authoritative source for all dependent runtimes, so nonexistent digests here block all downstream deployments.
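Because params.env feeds every dependent runtime, the same existence check can be run over the whole file rather than over two hand-copied variables. The following is a minimal sketch, assuming `skopeo` is available and that image parameters are the keys ending in `-image` (true of the two entries shown in this PR, assumed for the rest). `check_params_images` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: check that every image in a params.env-style file resolves.
# check_params_images is a hypothetical helper name; treating keys that end
# in "-image" as image references is an assumption based on this PR's entries.
check_params_images() (
    file=$1
    status=0
    # "|| [ -n "$key" ]" keeps a final line that lacks a trailing newline.
    while IFS='=' read -r key value || [ -n "$key" ]; do
        case $key in
            ''|'#'*) continue ;;   # blank line or comment
            *-image) ;;            # image parameter: check it below
            *) continue ;;         # any other parameter
        esac
        if skopeo inspect --raw "docker://${value}" </dev/null >/dev/null 2>&1; then
            echo "OK      ${key}"
        else
            echo "MISSING ${key}"
            status=1
        fi
    done < "$file"
    exit "$status"
)

# Usage (requires registry access):
#   check_params_images manifests/rhoai/params.env
```

The function returns non-zero if any entry fails to resolve, so it could serve as a pre-merge gate. A failure only means `skopeo` could not inspect the reference; network or authentication problems produce the same result as a missing digest.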
  containers:
  - name: node
-   image: quay.io/opendatahub/odh-training-rocm64-torch29-py312@sha256:76a88281fe9465115d48803b05435cfb48b9d0e6b279b6575d9c9dbb6e4e1908
+   image: quay.io/opendatahub/odh-training-rocm64-torch29-py312@sha256:80878d0d51fa6bc8957f669e7f3facac13669562d393a6bfc45ca8dff277c2fa
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check if kustomization replacement exists for ROCm64 runtime
echo "=== Checking kustomization replacement for ROCm64 runtime ==="
KUSTOMIZATION_FILE="manifests/rhoai/kustomization.yaml"
if [ -f "$KUSTOMIZATION_FILE" ]; then
  # Check for replacement rule for torch-distributed-rocm64-torch29-py312
  yq eval '.replacements[] | select(.targets[].select.name == "torch-distributed-rocm64-torch29-py312")' "$KUSTOMIZATION_FILE"
  echo -e "\n=== Digest in runtime file ==="
  yq eval '.spec.template.spec.replicatedJobs[] | select(.name == "node") | .template.spec.template.spec.containers[] | select(.name == "node") | .image' manifests/rhoai/runtimes/torch_distributed_rocm64_torch29_py312.yaml
  echo -e "\n=== Digest in params.env ==="
  grep "odh-training-rocm64-torch29-py312-image" manifests/rhoai/params.env
else
  echo "kustomization.yaml not found"
fi
Repository: opendatahub-io/trainer
Length of output: 455
🏁 Script executed:
#!/bin/bash
# Read the kustomization.yaml file to check for ROCm64 replacement rules
echo "=== kustomization.yaml content ==="
head -100 manifests/rhoai/kustomization.yaml
echo -e "\n=== Searching for rocm64 in kustomization.yaml ==="
grep -i "rocm64" manifests/rhoai/kustomization.yaml || echo "No rocm64 references found"
echo -e "\n=== Searching for torch-distributed replacements ==="
grep -A 5 "torch-distributed" manifests/rhoai/kustomization.yaml || echo "No torch-distributed replacements found"
echo -e "\n=== Searching for image replacements ==="
grep -B 2 -A 5 "kind: ImagePolicy\|replacements:" manifests/rhoai/kustomization.yaml | head -50
Repository: opendatahub-io/trainer
Length of output: 6751
🏁 Script executed:
#!/bin/bash
# Read the ROCm64 runtime file to see structure
echo "=== torch_distributed_rocm64_torch29_py312.yaml ==="
cat -n manifests/rhoai/runtimes/torch_distributed_rocm64_torch29_py312.yaml
echo -e "\n=== For comparison, CUDA128 runtime file ==="
cat -n manifests/rhoai/runtimes/torch_distributed_cuda128_torch29_py312.yaml
Repository: opendatahub-io/trainer
Length of output: 1979
Remove the hardcoded digest and rely solely on the kustomization replacement to avoid configuration drift.
A kustomization replacement rule exists for torch-distributed-rocm64-torch29-py312 that sources the image digest from params.env. The hardcoded digest at line 25 is redundant and will be overridden by the replacement. Keeping the same digest in both params.env and the YAML file creates a maintenance burden: if only one is updated, the two files hold conflicting values. Either remove the digest from the YAML (relying entirely on the kustomization replacement) or document why both are necessary.
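If both copies of the digest are kept, the drift described in this comment can at least be detected mechanically. The sketch below compares the digest recorded in params.env for a given key with any digest hardcoded on an `image:` line of a manifest. It matches text with `sed` rather than parsing YAML, which is a simplification, and `check_digest_drift` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: detect drift between params.env and a digest hardcoded in a manifest.
# check_digest_drift is a hypothetical helper; it matches text line by line
# instead of parsing YAML, so unusual formatting may be missed.
check_digest_drift() (
    params=$1
    key=$2
    manifest=$3
    digest_re='sha256:[0-9a-f]\{64\}'
    # Digest recorded for this key in params.env.
    want=$(sed -n "s/^${key}=.*@\(${digest_re}\)\$/\1/p" "$params")
    [ -n "$want" ] || { echo "no pinned digest for ${key}" >&2; exit 2; }
    # Every distinct digest that appears on an "image:" line of the manifest.
    have=$(sed -n "s/^[[:space:]-]*image:.*@\(${digest_re}\).*/\1/p" "$manifest" | sort -u)
    if [ -z "$have" ]; then
        echo "unpinned"   # manifest relies on the kustomization replacement alone
    elif [ "$have" = "$want" ]; then
        echo "in sync"
    else
        echo "drift"
        exit 1
    fi
)

# Usage:
#   check_digest_drift manifests/rhoai/params.env \
#       odh-training-rocm64-torch29-py312-image \
#       manifests/rhoai/runtimes/torch_distributed_rocm64_torch29_py312.yaml
```

Removing the digest from the manifest, as the comment recommends, makes this check report `unpinned` and removes the need for it; the check is mainly useful if the duplication is kept deliberately.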
What this PR does / why we need it:
Which issue(s) this PR fixes (optional, in `Fixes #<issue number>, #<issue number>, ...` format, will close the issue(s) when PR gets merged):
Fixes #
Checklist:
Summary by CodeRabbit
Release Notes