
Update existing training images to latest version #130

Merged
sutaakar merged 1 commit into opendatahub-io:main from sutaakar:image-update
Mar 31, 2026

Conversation

@sutaakar
Collaborator

@sutaakar sutaakar commented Mar 31, 2026

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

  • Docs included if any changes are user facing

Summary by CodeRabbit

Release Notes

  • Chores
    • Updated container images for distributed PyTorch training runtimes supporting CUDA 12.8 and ROCm 6.4 configurations.

@sutaakar sutaakar requested a review from MStokluska March 31, 2026 09:52
@coderabbitai

coderabbitai Bot commented Mar 31, 2026

📝 Walkthrough

Walkthrough

This pull request updates container image digest references across parameter files and Kubernetes runtime manifests. Five digests are replaced: two in manifests/rhoai/params.env for CUDA and ROCm training images, and one each in three runtime YAML files (torch_distributed_cuda128_torch29_py312.yaml, torch_distributed_rocm64_torch29_py312.yaml, training_hub_th05_cuda128_torch29_py312.yaml). No structural changes or configuration logic modifications are present.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~4 minutes

Security findings

Verify image digest provenance before merge. Container image pinning via SHA256 digests is a sound supply chain security practice, but these digests must be validated against known-good sources prior to deployment to mitigate CWE-426 (Untrusted Search Path) and general supply chain attack vectors. Confirm:

  • Each new digest corresponds to a legitimately built, signed, and scanned image from the authoritative registry (quay.io/opendatahub/)
  • Images have passed vulnerability scanning and attestation checks
  • No digest tampering or man-in-the-middle substitution occurred during the update process

If automated image scanning is not part of the deployment workflow, add it before merging any future digest updates.
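
The checks listed above need registry access and signing tooling, but a cheaper first gate can run offline. The following is only a sketch: it assumes every entry in a params.env-style file is an image reference, that the expected registry prefix is quay.io/opendatahub/ (taken from the review text), and the function name `lint_params` is hypothetical. It confirms that each value sits under that prefix and is pinned by a well-formed sha256 digest; it does not prove that the digest exists or that the image is signed.

```shell
#!/bin/bash
# Offline lint for a params.env-style file (a sketch, not the project's tooling).
# Assumption: every KEY=VALUE entry in the file is a container image reference.
# Assumption: EXPECTED_PREFIX matches the registry namespace named in the review.
EXPECTED_PREFIX="${EXPECTED_PREFIX:-quay.io/opendatahub/}"

lint_params() {
  local file="$1" status=0 key value
  # Split each line at the first '='; also handle a last line without a newline.
  while IFS='=' read -r key value || [ -n "$key" ]; do
    # Skip blank lines and comments.
    case "$key" in
      ''|'#'*) continue ;;
    esac
    # The reference must live under the expected registry prefix.
    case "$value" in
      "$EXPECTED_PREFIX"*) ;;
      *)
        echo "FAIL $key: not under $EXPECTED_PREFIX"
        status=1
        continue
        ;;
    esac
    # The reference must end in a 64-hex-character sha256 digest.
    if printf '%s\n' "$value" | grep -Eq '@sha256:[0-9a-f]{64}$'; then
      echo "OK   $key"
    else
      echo "FAIL $key: not pinned by a sha256 digest"
      status=1
    fi
  done < "$file"
  return "$status"
}
```

Running `lint_params manifests/rhoai/params.env` in CI would reject tag-based or truncated references before any registry lookup is attempted.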

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: updating container image digests across multiple training runtime manifests to pinned versions. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@manifests/rhoai/params.env`:
- Around line 2-3: The two image digest entries
odh-training-cuda128-torch29-py312-image and
odh-training-rocm64-torch29-py312-image in params.env reference nonexistent
digests; update those variable values to point to valid image references (either
a verified sha256 digest or a stable tag) so kustomize can pull the images;
locate and edit the odh-training-cuda128-torch29-py312-image and
odh-training-rocm64-torch29-py312-image lines in manifests/rhoai/params.env and
replace the broken sha256 values with the correct digest or tag obtained from
the registry, then verify by doing a docker/registry pull to confirm the values
resolve.

In `@manifests/rhoai/runtimes/torch_distributed_rocm64_torch29_py312.yaml`:
- Line 25: The YAML contains a hardcoded image digest for
torch-distributed-rocm64-torch29-py312 that conflicts with the existing
kustomization replacement sourced from params.env; remove the digest portion of
the image string so the image field becomes image:
quay.io/opendatahub/odh-training-rocm64-torch29-py312 (no `@sha256` suffix) and
rely solely on the kustomization replacement for injecting the digest, or
alternatively remove/disable the kustomization replacement if you intend to keep
the digest here—update the manifest referencing the image field for the resource
named torch-distributed-rocm64-torch29-py312 accordingly.

In `@manifests/rhoai/runtimes/training_hub_th05_cuda128_torch29_py312.yaml`:
- Line 25: The image digest referenced in the image field
("quay.io/opendatahub/odh-training-cuda128-torch29-py312@sha256:0be52d5775e95026c3899a208d9fbecb59489d48763664e842b92e66d3c112c8")
cannot be resolved; update the image reference in the manifest by querying the
registry for the correct digest or replacing the digest reference with a
verified, scanned image tag (e.g., a known good sha256 from quay or an approved
tag like a release image) and ensure the new value is used in the image: line so
the pod can pull a valid, integrity-checked image.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 599dd017-ea11-44df-9dbb-b5180bccdc26

📥 Commits

Reviewing files that changed from the base of the PR and between a85096e and 42baa0f.

📒 Files selected for processing (4)
  • manifests/rhoai/params.env
  • manifests/rhoai/runtimes/torch_distributed_cuda128_torch29_py312.yaml
  • manifests/rhoai/runtimes/torch_distributed_rocm64_torch29_py312.yaml
  • manifests/rhoai/runtimes/training_hub_th05_cuda128_torch29_py312.yaml

Comment on lines +2 to +3
odh-training-cuda128-torch29-py312-image=quay.io/opendatahub/odh-training-cuda128-torch29-py312@sha256:0be52d5775e95026c3899a208d9fbecb59489d48763664e842b92e66d3c112c8
odh-training-rocm64-torch29-py312-image=quay.io/opendatahub/odh-training-rocm64-torch29-py312@sha256:80878d0d51fa6bc8957f669e7f3facac13669562d393a6bfc45ca8dff277c2fa

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify both CUDA128 and ROCm64 image digests exist

CUDA_IMAGE="quay.io/opendatahub/odh-training-cuda128-torch29-py312@sha256:0be52d5775e95026c3899a208d9fbecb59489d48763664e842b92e66d3c112c8"
ROCM_IMAGE="quay.io/opendatahub/odh-training-rocm64-torch29-py312@sha256:80878d0d51fa6bc8957f669e7f3facac13669562d393a6bfc45ca8dff277c2fa"

echo "=== Verifying CUDA128 image ==="
skopeo inspect --raw docker://${CUDA_IMAGE} > /dev/null 2>&1
if [ $? -eq 0 ]; then
  echo "✓ CUDA128 digest exists"
  skopeo inspect docker://${CUDA_IMAGE} | jq '{Digest: .Digest, Created: .Created, Size: .Size}'
else
  echo "✗ CUDA128 digest not found"
fi

echo -e "\n=== Verifying ROCm64 image ==="
skopeo inspect --raw docker://${ROCM_IMAGE} > /dev/null 2>&1
if [ $? -eq 0 ]; then
  echo "✓ ROCm64 digest exists"
  skopeo inspect docker://${ROCM_IMAGE} | jq '{Digest: .Digest, Created: .Created, Size: .Size}'
else
  echo "✗ ROCm64 digest not found"
fi

Repository: opendatahub-io/trainer

Length of output: 182


Neither image digest exists in the registry.

Lines 2–3 reference nonexistent digests for the CUDA128 and ROCm64 images. Any deployment that tries to pull these images will fail. Supply chain risk (CWE-494): params.env is the authoritative source for all dependent runtimes, so nonexistent digests here block all downstream deployments.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@manifests/rhoai/params.env` around lines 2 - 3, The two image digest entries
odh-training-cuda128-torch29-py312-image and
odh-training-rocm64-torch29-py312-image in params.env reference nonexistent
digests; update those variable values to point to valid image references (either
a verified sha256 digest or a stable tag) so kustomize can pull the images;
locate and edit the odh-training-cuda128-torch29-py312-image and
odh-training-rocm64-torch29-py312-image lines in manifests/rhoai/params.env and
replace the broken sha256 values with the correct digest or tag obtained from
the registry, then verify by doing a docker/registry pull to confirm the values
resolve.
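
The "verify by pulling" step in the prompt above could be generalized so it covers every pinned entry rather than two hard-coded variables. This is a hedged sketch: `list_pinned_refs`, `check_refs`, and `skopeo_exists` are hypothetical helper names, and the default lookup assumes `skopeo` is installed and the registry is reachable. The checker command is passed in as arguments so the loop can be exercised without network access.

```shell
#!/bin/bash
# Sketch: check every digest-pinned reference in a params file.
# All function names here are illustrative, not part of the repository.

# Print the value of each KEY=VALUE line that ends in a sha256 digest.
list_pinned_refs() {
  grep -E '^[^#=]+=[^=]+@sha256:[0-9a-f]{64}$' "$1" | cut -d= -f2-
}

# Default checker: succeeds when the registry can serve the manifest.
skopeo_exists() {
  skopeo inspect --raw "docker://$1" > /dev/null 2>&1
}

# check_refs FILE [CHECKER...]
# Runs CHECKER with each reference appended as its last argument.
# Returns 0 only if every reference was found.
check_refs() {
  local file="$1" refs ref missing=0
  shift
  # Fall back to the skopeo lookup when no checker command is given.
  [ "$#" -gt 0 ] || set -- skopeo_exists
  refs=$(list_pinned_refs "$file") || true
  # Image references contain no whitespace, so word splitting is safe here.
  for ref in $refs; do
    if "$@" "$ref"; then
      echo "found   $ref"
    else
      echo "missing $ref"
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}
```

With network access, `check_refs manifests/rhoai/params.env` would use the skopeo lookup; a failed lookup can also mean missing tooling or credentials, so a "missing" result is worth confirming by hand before treating it as a broken digest.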

  containers:
  - name: node
-   image: quay.io/opendatahub/odh-training-rocm64-torch29-py312@sha256:76a88281fe9465115d48803b05435cfb48b9d0e6b279b6575d9c9dbb6e4e1908
+   image: quay.io/opendatahub/odh-training-rocm64-torch29-py312@sha256:80878d0d51fa6bc8957f669e7f3facac13669562d393a6bfc45ca8dff277c2fa

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if kustomization replacement exists for ROCm64 runtime

echo "=== Checking kustomization replacement for ROCm64 runtime ==="
KUSTOMIZATION_FILE="manifests/rhoai/kustomization.yaml"

if [ -f "$KUSTOMIZATION_FILE" ]; then
  # Check for replacement rule for torch-distributed-rocm64-torch29-py312
  yq eval '.replacements[] | select(.targets[].select.name == "torch-distributed-rocm64-torch29-py312")' "$KUSTOMIZATION_FILE"
  
  echo -e "\n=== Digest in runtime file ==="
  yq eval '.spec.template.spec.replicatedJobs[] | select(.name == "node") | .template.spec.template.spec.containers[] | select(.name == "node") | .image' manifests/rhoai/runtimes/torch_distributed_rocm64_torch29_py312.yaml
  
  echo -e "\n=== Digest in params.env ==="
  grep "odh-training-rocm64-torch29-py312-image" manifests/rhoai/params.env
else
  echo "kustomization.yaml not found"
fi

Repository: opendatahub-io/trainer

Length of output: 455


🏁 Script executed:

#!/bin/bash
# Read the kustomization.yaml file to check for ROCm64 replacement rules
echo "=== kustomization.yaml content ===" 
head -100 manifests/rhoai/kustomization.yaml

echo -e "\n=== Searching for rocm64 in kustomization.yaml ===" 
grep -i "rocm64" manifests/rhoai/kustomization.yaml || echo "No rocm64 references found"

echo -e "\n=== Searching for torch-distributed replacements ===" 
grep -A 5 "torch-distributed" manifests/rhoai/kustomization.yaml || echo "No torch-distributed replacements found"

echo -e "\n=== Searching for image replacements ===" 
grep -B 2 -A 5 "kind: ImagePolicy\|replacements:" manifests/rhoai/kustomization.yaml | head -50

Repository: opendatahub-io/trainer

Length of output: 6751


🏁 Script executed:

#!/bin/bash
# Read the ROCm64 runtime file to see structure
echo "=== torch_distributed_rocm64_torch29_py312.yaml ===" 
cat -n manifests/rhoai/runtimes/torch_distributed_rocm64_torch29_py312.yaml

echo -e "\n=== For comparison, CUDA128 runtime file ===" 
cat -n manifests/rhoai/runtimes/torch_distributed_cuda128_torch29_py312.yaml

Repository: opendatahub-io/trainer

Length of output: 1979


Remove hardcoded digest or rely solely on kustomization replacement to avoid configuration drift.

A kustomization replacement rule exists for torch-distributed-rocm64-torch29-py312 that sources the image digest from params.env. The hardcoded digest at line 25 is redundant and will be overridden by the replacement. Having the same digest in both params.env and the YAML file creates a maintenance burden—if only one is updated, the system has conflicting values. Either remove the digest from the YAML (relying entirely on the kustomization replacement) or document why both are necessary.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@manifests/rhoai/runtimes/torch_distributed_rocm64_torch29_py312.yaml` at line
25, The YAML contains a hardcoded image digest for
torch-distributed-rocm64-torch29-py312 that conflicts with the existing
kustomization replacement sourced from params.env; remove the digest portion of
the image string so the image field becomes image:
quay.io/opendatahub/odh-training-rocm64-torch29-py312 (no `@sha256` suffix) and
rely solely on the kustomization replacement for injecting the digest, or
alternatively remove/disable the kustomization replacement if you intend to keep
the digest here—update the manifest referencing the image field for the resource
named torch-distributed-rocm64-torch29-py312 accordingly.
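
If both copies of the digest are kept, the drift described above could be caught mechanically. The sketch below is a plain text comparison, not a reimplementation of the kustomize replacement: `digests_for` and `check_drift` are hypothetical names, and the repository string is used as a regular expression, so its dots match any character. A manifest that pins no digest at all is treated as acceptable, on the assumption that the replacement injects it at build time.

```shell
#!/bin/bash
# Sketch: compare the digest pinned in params.env with the one in a runtime YAML.
# Function names are illustrative and not part of the repository.

# digests_for REPO FILE: print the unique sha256 digests FILE pins for REPO.
digests_for() {
  grep -Eo "$1@sha256:[0-9a-f]{64}" "$2" | sed 's/.*@//' | sort -u
}

# check_drift REPO PARAMS_FILE MANIFEST_FILE
# Returns 0 when the manifest has no digest or matches params, 1 on mismatch.
check_drift() {
  local repo="$1" params="$2" manifest="$3"
  local from_params from_manifest
  from_params=$(digests_for "$repo" "$params")
  from_manifest=$(digests_for "$repo" "$manifest")
  if [ -z "$from_manifest" ]; then
    # Assumption: an unpinned manifest relies on the replacement rule.
    echo "unpinned: $repo (manifest has no digest)"
    return 0
  fi
  if [ "$from_params" = "$from_manifest" ]; then
    echo "in sync: $repo"
    return 0
  fi
  echo "drift: $repo"
  echo "  params:   ${from_params:-<none>}"
  echo "  manifest: ${from_manifest:-<none>}"
  return 1
}
```

A call such as `check_drift quay.io/opendatahub/odh-training-rocm64-torch29-py312 manifests/rhoai/params.env manifests/rhoai/runtimes/torch_distributed_rocm64_torch29_py312.yaml` would flag an update that touched only one of the two files.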

@sutaakar sutaakar merged commit 66c245b into opendatahub-io:main Mar 31, 2026
6 checks passed
@sutaakar sutaakar deleted the image-update branch March 31, 2026 10:13


3 participants