Add Ray CUDA 12.9 image with Training Hub support by Fiona-Waters · Pull Request #929 · opendatahub-io/distributed-workloads

Fiona-Waters · 2026-06-22T15:36:32Z

Summary

RHAISTRAT-1693: Integrate Training Hub with Ray so that Training Hub algorithms (SFT, OSFT, LoRA, GRPO) can be run on Ray clusters in OpenShift AI.
Jira for the image work: https://redhat.atlassian.net/browse/RHOAIENG-61568

Adds a new Ray CUDA 12.9 image (2.55.1-py312-cu129) with training-hub and its runtime dependencies (vllm, verl, CUDA extensions) sourced from the RHOAI 3.3 AIPCC index. The existing CUDA 12.8 image (2.55.1-py312-cu128) is kept unchanged.

What's added

New image: 2.55.1-py312-cu129

CUDA 12.9.2 toolkit (NCCL 2.27.3, cuDNN 9.10.2.21)
Ray 2.55.1 + training-hub 0.8.1 and transitive dependencies via Pipfile.lock
AIPCC CUDA wheels: torch 2.9.0, torchvision, torchaudio, triton, vllm 0.13.0, flash-attn 2.8.3, mamba-ssm, causal-conv1d, xformers
verl 0.8.0 (installed with --no-deps to avoid numpy conflict)
vllm + verl runtime dependencies (anthropic, compressed-tensors, xgrammar, etc.)
pyzmq from PyPI (the AIPCC wheel expects a system libzmq, which UBI9 does not provide)

New Tekton pipeline: ray-2.55.1-py312-cu129-push.yaml

Builds and pushes to quay.io/modh/ray:2.55.1-py312-cu129
Triggers on changes to images/runtime/ray/cuda/2.55.1-py312-cu129/**
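
To make the trigger rule concrete, here is a minimal Python sketch of the path filter. It does not reproduce how Tekton evaluates the condition; `triggers_build` and the sample file lists are hypothetical, and a plain "is under this directory" test stands in for the `**` glob.

```python
from pathlib import PurePosixPath

# Directory watched by the cu129 pipeline, taken from the trigger glob above.
WATCHED_DIR = PurePosixPath("images/runtime/ray/cuda/2.55.1-py312-cu129")


def triggers_build(changed_files):
    """Return True if any changed path lies somewhere below WATCHED_DIR."""
    for name in changed_files:
        path = PurePosixPath(name)
        # `**` matches files at any depth, so check every ancestor directory.
        if WATCHED_DIR in path.parents:
            return True
    return False


if __name__ == "__main__":
    # Hypothetical change sets: one touching cu129, one touching only cu128.
    print(triggers_build(["images/runtime/ray/cuda/2.55.1-py312-cu129/Dockerfile"]))
    print(triggers_build(["images/runtime/ray/cuda/2.55.1-py312-cu128/Pipfile"]))
```

Under this model, a change confined to the cu128 directory would not start the cu129 build, which matches the intent of keeping the two images independent.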

Unchanged: 2.55.1-py312-cu128

Existing image, Pipfile, Pipfile.lock, and Tekton pipeline are preserved as-is.

Pipfile changes (cu129)

Added: training-hub==0.8.1, transformers>=4.57.6,<5.0, kernels>=0.9.0,<0.15, unsloth>=2026.1.1, einops>=0.8, bitsandbytes>=0.47.0, liger-kernel>=0.5.10

How Has This Been Tested?

Merge criteria:

The commits are squashed in a cohesive manner and have meaningful messages.
Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
The developer has manually tested the changes and verified that they work.

Summary by CodeRabbit

New Features
- Added a new Ray runtime image variant for CUDA 12.9 with Python 3.12.
- Updated build automation to publish and tag the new CUDA 12.9 image variant.
Documentation
- Updated the runtime documentation to reflect CUDA 12.9 support.
Chores
- Added supporting licensing and repository configuration for CUDA package installation.

openshift-ci · 2026-06-22T15:36:40Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kramaranya for approval. For more information see the Code Review Process.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-06-22T15:36:51Z

📝 Walkthrough

Adds a new Ray 2.55.1 Python 3.12 CUDA 12.9 (cu129) runtime image: Dockerfile with per-arch CUDA repo selection, driver/cuDNN installation, and pinned PyTorch/verl/serving dependencies installed via pip and Pipfile; new Pipfile, README update, NVIDIA container license file, and x86_64/arm64 cuda.repo files. Tekton PipelineRun/trigger config updated to build and push this cu129 image variant.

Estimated code review effort: 4 (Complex) | ~60 minutes

Security concerns (CWE-flagged)

GPG key import over unauthenticated transport risk: cuda.repo-x86_64/cuda.repo-arm64 reference file:///etc/pki/rpm-gpg/RPM-GPG-KEY-NVIDIA — verify the key file itself is fetched via integrity-checked channel during Dockerfile build (CWE-494: Download of Code Without Integrity Check) since the repo files only enforce gpgcheck post-import, not the initial key retrieval.
pip install with --no-deps for verl==0.8.0 and force-reinstall of ABI-matched wheels from a third-party AIPCC CUDA index bypasses supply-chain hash verification (CWE-829: Inclusion of Functionality from Untrusted Control Sphere). No hash pinning (--require-hashes) visible for pip-installed packages — full dependency tree (torch, vllm, flash-attn, mamba-ssm, xformers, triton) pulled from an external, non-PyPI index without cited checksum verification.
Root-to-non-root user switch (USER 1001) occurs only at line 213–215, after all package installs — standard practice, but confirm no writable setuid/setgid artifacts or credentials (e.g. pip cache, cuda.repo GPG key copies) persist in image layers accessible to UID 1001 (CWE-732: Incorrect Permission Assignment for Critical Resource).
Pipfile uses unpinned/loose version ranges for transformers, kernels, unsloth, einops, bitsandbytes, liger-kernel — floating dependency ranges increase risk of unreviewed code injection via future malicious releases (CWE-1104: Use of Unmaintained Third Party Components / dependency confusion surface). Recommend pinning exact versions or hash-locking via Pipfile.lock committed to repo (noted as deleted mid-build, so no lockfile persists for audit — CWE-1329: Reliance on Component That is Not Updateable).
Tekton pipeline trigger CEL condition and additional-tag/output-image params directly interpolate {{revision}} — confirm this is not attacker-controllable via untrusted PR source (CWE-88: Argument Injection) given this is a push-triggered pipeline building/publishing to quay.io/modh/ray.
NVIDIA_REQUIRE_CUDA driver/brand constraint string is duplicated boilerplate — no CVE noted here, but ensure the CUDA 12.9 base packages pinned in Dockerfile are current: verify no known CVEs pending against pinned cuda-cudart/libcublas/libnccl versions before merge.

No praise. Verify hash pinning and lockfile retention before approval.
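
The integrity check requested in the first item amounts to comparing a SHA-256 digest of the downloaded bytes against a value pinned in the build file, which is what `sha256sum -c --strict` does in a shell step. A minimal Python sketch of the same idea follows; `verify_sha256` is a hypothetical helper and the sample bytes are made up for illustration.

```python
import hashlib
import hmac


def verify_sha256(data: bytes, expected_hex: str) -> bool:
    """Return True if the SHA-256 digest of `data` equals `expected_hex`."""
    actual = hashlib.sha256(data).hexdigest()
    # compare_digest performs a constant-time comparison of the two strings.
    return hmac.compare_digest(actual, expected_hex.lower())


if __name__ == "__main__":
    # Hypothetical payload standing in for a downloaded key file.
    payload = b"example key material"
    pinned = hashlib.sha256(payload).hexdigest()
    print(verify_sha256(payload, pinned))
    print(verify_sha256(payload + b"tampered", pinned))
```

A build step would fail when the function returns False, so a modified download never reaches the package manager.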

🚥 Pre-merge checks | ✅ 10

✅ Passed checks (10 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately reflects the main change: a new Ray CUDA 12.9 image with Training Hub support.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Contribution Quality And Spam Detection	✅ Passed	Only one weak signal: a template-like PR body. No second-category evidence; author has prior repo commits and the description includes specific Jira links and concrete image details.
No Hardcoded Secrets	✅ Passed	No hardcoded secrets found; Tekton only references templated secretName '{{ git_auth_secret }}', and no embedded creds/private keys appeared (CWE-798).
No Weak Cryptography	✅ Passed	No MD5/SHA1/DES/RC4/3DES/Blowfish/ECB, custom crypto, or secret compares found; only SHA-256 checksums and harmless deps like cryptography.
No Injection Vectors	✅ Passed	PASS: No CWE-78/89/94/502/79 sink found; new YAML/Dockerfile only use hardcoded values and trusted enum branching (e.g., TARGETARCH).
No Privileged Containers	✅ Passed	No privileged/hostPID/hostNetwork/hostIPC/allowPrivilegeEscalation/runAsUser:0 settings were found in the new Ray CUDA manifest tree, and the Dockerfile drops to USER 1001 at the end.
No Sensitive Data In Logs	✅ Passed	No log statements or debug output exposing secrets/PII were added; this PR is build/config only. No CWE-532 evidence.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@images/runtime/ray/cuda/2.55.1-py312-cu129/Dockerfile`:
- Around line 1-215: The Dockerfile currently uses a single-stage build that
includes unnecessary build tooling, development packages, and intermediate
artifacts in the final image, increasing the attack surface. Convert this to a
multi-stage build by creating a builder stage that performs all installations
and a runtime stage that copies only the required runtime artifacts. In the
builder stage, keep all the current installation steps including the yum install
commands for development packages and the pip install commands. In a new runtime
stage, use the same base image (UBI9 Python) and copy only the necessary runtime
components from the builder: the installed Python packages from site-packages,
CUDA runtime libraries, and runtime configuration files. Remove the
development-only packages (those with -devel suffix, make, findutils,
cuda-command-line-tools, and similar build tools) from the runtime stage by not
copying those artifacts and not installing them in the final stage. Ensure the
environment variables for CUDA runtime paths and Python are preserved in the
runtime stage.
- Around line 149-151: The pip install commands use --extra-index-url for AIPCC
which makes it a secondary index with PyPI as primary, creating a dependency
confusion vulnerability where unpinned packages could resolve to malicious PyPI
versions. Fix this by changing --extra-index-url to --index-url for the
AIPCC_INDEX variable to make it the primary index, adding --extra-index-url for
PyPI as a fallback, and pinning all packages (torch, blake3, cachetools, cbor2,
cloudpickle, email-validator, ijson, mcp, msgspec, openai, wandb, and any
others) to exact versions instead of leaving them unpinned. Apply this change to
all affected pip install commands in the Dockerfile.
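
The dependency-confusion concern in the second comment can be illustrated with a toy resolver. The sketch assumes, as a simplification, that the installer pools candidates from every configured index and takes the highest version that satisfies the requirement; the index names and version numbers are invented, and `resolve` is a hypothetical function rather than anything pip exposes.

```python
def parse_version(text):
    """Turn '2.9.0' into (2, 9, 0); enough for this toy example."""
    return tuple(int(part) for part in text.split("."))


def resolve(requirement_pin, candidates):
    """Pick the candidate an index-agnostic installer would choose.

    candidates: list of (index_name, version_string) pooled from all indexes.
    requirement_pin: exact version string, or None for an unpinned requirement.
    """
    pool = [c for c in candidates if requirement_pin in (None, c[1])]
    return max(pool, key=lambda c: parse_version(c[1]))


if __name__ == "__main__":
    # Invented data: a trusted build and a higher-numbered public upload.
    available = [("trusted-index", "1.4.0"), ("public-index", "99.0.0")]
    print(resolve(None, available))     # unpinned requirement
    print(resolve("1.4.0", available))  # exact pin
```

In this model an exact pin removes the higher-numbered upload from consideration, but it does not say which index serves the pinned version, which is why hash checking is usually suggested alongside pinning.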

ℹ️ Review info

⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 4a59c4ee-5477-42f9-a5d3-be4330a8dc80

📥 Commits

Reviewing files that changed from the base of the PR and between 36f9003 and 7f879b1.

⛔ Files ignored due to path filters (2)

images/runtime/ray/cuda/2.55.1-py312-cu128/Pipfile.lock is excluded by !**/*.lock
images/runtime/ray/cuda/2.55.1-py312-cu129/Pipfile.lock is excluded by !**/*.lock

📒 Files selected for processing (9)

.tekton/ray-2.55.1-py312-cu129-push.yaml
images/runtime/ray/cuda/2.55.1-py312-cu128/Dockerfile
images/runtime/ray/cuda/2.55.1-py312-cu128/Pipfile
images/runtime/ray/cuda/2.55.1-py312-cu129/Dockerfile
images/runtime/ray/cuda/2.55.1-py312-cu129/NGC-DL-CONTAINER-LICENSE
images/runtime/ray/cuda/2.55.1-py312-cu129/Pipfile
images/runtime/ray/cuda/2.55.1-py312-cu129/README.md
images/runtime/ray/cuda/2.55.1-py312-cu129/cuda.repo-arm64
images/runtime/ray/cuda/2.55.1-py312-cu129/cuda.repo-x86_64

💤 Files with no reviewable changes (2)

images/runtime/ray/cuda/2.55.1-py312-cu128/Pipfile
images/runtime/ray/cuda/2.55.1-py312-cu128/Dockerfile

coderabbitai · 2026-06-22T15:42:50Z

+ARG PYTHON_VERSION=312
+ARG IMAGE_TAG=9.7-1778488949
+
+FROM registry.access.redhat.com/ubi9/python-${PYTHON_VERSION}:${IMAGE_TAG}
+
+ARG TARGETARCH
+
+LABEL name="ray-ubi9-py312-cu129" \
+      summary="CUDA 12.9 Python 3.12 image based on UBI9 for Ray" \
+      description="CUDA 12.9 Python 3.12 image based on UBI9 for Ray" \
+      io.k8s.display-name="CUDA 12.9 Python 3.12 base image for Ray" \
+      io.k8s.description="CUDA 12.9 Python 3.12 image based on UBI9 for Ray" \
+      authoritative-source-url="https://github.com/opendatahub-io/distributed-workloads"
+
+# Install CUDA base from:
+# https://gitlab.com/nvidia/container-images/cuda/-/blob/master/dist/12.9.2/ubi9/base/Dockerfile
+USER 0
+WORKDIR /opt/app-root/bin
+
+ENV NVIDIA_REQUIRE_CUDA="cuda>=12.9 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551 brand=unknown,driver>=560,driver<561 brand=grid,driver>=560,driver<561 brand=tesla,driver>=560,driver<561 brand=nvidia,driver>=560,driver<561 brand=quadro,driver>=560,driver<561 brand=quadrortx,driver>=560,driver<561 brand=nvidiartx,driver>=560,driver<561 brand=vapps,driver>=560,driver<561 brand=vpc,driver>=560,driver<561 brand=vcs,driver>=560,driver<561 brand=vws,driver>=560,driver<561 brand=cloudgaming,driver>=560,driver<561 brand=unknown,driver>=565,driver<566 brand=grid,driver>=565,driver<566 brand=tesla,driver>=565,driver<566 brand=nvidia,driver>=565,driver<566 brand=quadro,driver>=565,driver<566 brand=quadrortx,driver>=565,driver<566 brand=nvidiartx,driver>=565,driver<566 brand=vapps,driver>=565,driver<566 brand=vpc,driver>=565,driver<566 brand=vcs,driver>=565,driver<566 brand=vws,driver>=565,driver<566 brand=cloudgaming,driver>=565,driver<566 brand=unknown,driver>=570,driver<571 brand=grid,driver>=570,driver<571 brand=tesla,driver>=570,driver<571 brand=nvidia,driver>=570,driver<571 brand=quadro,driver>=570,driver<571 brand=quadrortx,driver>=570,driver<571 brand=nvidiartx,driver>=570,driver<571 brand=vapps,driver>=570,driver<571 brand=vpc,driver>=570,driver<571 brand=vcs,driver>=570,driver<571 brand=vws,driver>=570,driver<571 brand=cloudgaming,driver>=570,driver<571 brand=unknown,driver>=580,driver<581 brand=grid,driver>=580,driver<581 brand=tesla,driver>=580,driver<581 brand=nvidia,driver>=580,driver<581 brand=quadro,driver>=580,driver<581 brand=quadrortx,driver>=580,driver<581 brand=nvidiartx,driver>=580,driver<581 brand=vapps,driver>=580,driver<581 brand=vpc,driver>=580,driver<581 brand=vcs,driver>=580,driver<581 brand=vws,driver>=580,driver<581 brand=cloudgaming,driver>=580,driver<581"
+ENV NV_CUDA_CUDART_VERSION=12.9.79-1
+
+RUN NVIDIA_GPGKEY_SUM=d0664fbbdb8c32356d45de36c5984617217b2d0bef41b93ccecd326ba3b80c87 && \
+    if [ "${TARGETARCH}" = "arm64" ]; then NVARCH=sbsa; else NVARCH=x86_64; fi && \
+    curl -fsSL https://developer.download.nvidia.com/compute/cuda/repos/rhel9/${NVARCH}/D42D0685.pub | sed '/^Version/d' > /etc/pki/rpm-gpg/RPM-GPG-KEY-NVIDIA && \
+    echo "$NVIDIA_GPGKEY_SUM  /etc/pki/rpm-gpg/RPM-GPG-KEY-NVIDIA" | sha256sum -c --strict -
+
+ENV CUDA_VERSION=12.9.2
+
+COPY cuda.repo-* ./
+COPY NGC-DL-CONTAINER-LICENSE /
+
+RUN if [ "${TARGETARCH}" = "arm64" ]; then \
+        cp cuda.repo-arm64 /etc/yum.repos.d/cuda.repo; \
+    else \
+        cp cuda.repo-x86_64 /etc/yum.repos.d/cuda.repo; \
+    fi
+
+# For libraries in the cuda-compat-* package: https://docs.nvidia.com/cuda/eula/index.html#attachment-a
+RUN yum upgrade -y && yum install -y \
+    cuda-cudart-12-9-${NV_CUDA_CUDART_VERSION} \
+    cuda-compat-12-9 \
+    && yum clean all \
+    && rm -rf /var/cache/yum/*
+
+# nvidia-docker 1.0
+RUN echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf && \
+    echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf
+
+ENV PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:${PATH}
+ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
+
+# nvidia-container-runtime
+ENV NVIDIA_VISIBLE_DEVICES=all
+ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility
+
+# Install CUDA runtime from:
+# https://gitlab.com/nvidia/container-images/cuda/-/blob/master/dist/12.9.2/ubi9/runtime/Dockerfile
+ENV NV_CUDA_LIB_VERSION=12.9.2-1
+ENV NV_NVTX_VERSION=12.9.79-1
+ENV NV_LIBNPP_VERSION=12.4.1.87-1
+ENV NV_LIBNPP_PACKAGE=libnpp-12-9-${NV_LIBNPP_VERSION}
+ENV NV_LIBCUBLAS_VERSION=12.9.2.10-1
+ENV NV_LIBNCCL_PACKAGE_NAME=libnccl
+ENV NV_LIBNCCL_PACKAGE_VERSION=2.27.3-1
+ENV NV_LIBNCCL_VERSION=2.27.3
+ENV NCCL_VERSION=2.27.3
+ENV NV_LIBNCCL_PACKAGE=${NV_LIBNCCL_PACKAGE_NAME}-${NV_LIBNCCL_PACKAGE_VERSION}+cuda12.9
+
+RUN yum install -y \
+    cuda-libraries-12-9-${NV_CUDA_LIB_VERSION} \
+    cuda-nvtx-12-9-${NV_NVTX_VERSION} \
+    ${NV_LIBNPP_PACKAGE} \
+    libcublas-12-9-${NV_LIBCUBLAS_VERSION} \
+    ${NV_LIBNCCL_PACKAGE} \
+    && yum clean all \
+    && rm -rf /var/cache/yum/*
+
+# Set this flag so that libraries can find the location of CUDA
+ENV XLA_FLAGS=--xla_gpu_cuda_data_dir=/usr/local/cuda
+
+# Install CUDA devel from:
+# https://gitlab.com/nvidia/container-images/cuda/-/blob/master/dist/12.9.2/ubi9/devel/Dockerfile
+ENV NV_CUDA_LIB_VERSION=12.9.2-1
+# ARM64 doesn't have nvprof package - set in runtime
+ENV NV_NVPROF_VERSION=12.9.79-1
+ENV NV_NVPROF_DEV_PACKAGE=cuda-nvprof-12-9-${NV_NVPROF_VERSION}
+ENV NV_CUDA_CUDART_DEV_VERSION=12.9.79-1
+ENV NV_NVML_DEV_VERSION=12.9.79-1
+ENV NV_LIBCUBLAS_DEV_VERSION=12.9.2.10-1
+ENV NV_LIBNPP_DEV_VERSION=12.4.1.87-1
+ENV NV_LIBNPP_DEV_PACKAGE=libnpp-devel-12-9-${NV_LIBNPP_DEV_VERSION}
+ENV NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-devel
+ENV NV_LIBNCCL_DEV_PACKAGE_VERSION=2.27.3-1
+ENV NCCL_VERSION=2.27.3
+ENV NV_LIBNCCL_DEV_PACKAGE=${NV_LIBNCCL_DEV_PACKAGE_NAME}-${NV_LIBNCCL_DEV_PACKAGE_VERSION}+cuda12.9
+ENV NV_CUDA_NSIGHT_COMPUTE_VERSION=12.9.2-1
+ENV NV_CUDA_NSIGHT_COMPUTE_DEV_PACKAGE=cuda-nsight-compute-12-9-${NV_CUDA_NSIGHT_COMPUTE_VERSION}
+
+RUN yum install -y \
+    make \
+    findutils \
+    cuda-command-line-tools-12-9-${NV_CUDA_LIB_VERSION} \
+    cuda-libraries-devel-12-9-${NV_CUDA_LIB_VERSION} \
+    cuda-minimal-build-12-9-${NV_CUDA_LIB_VERSION} \
+    cuda-cudart-devel-12-9-${NV_CUDA_CUDART_DEV_VERSION} \
+    cuda-nvml-devel-12-9-${NV_NVML_DEV_VERSION} \
+    libcublas-devel-12-9-${NV_LIBCUBLAS_DEV_VERSION} \
+    ${NV_LIBNPP_DEV_PACKAGE} \
+    ${NV_LIBNCCL_DEV_PACKAGE} \
+    ${NV_CUDA_NSIGHT_COMPUTE_DEV_PACKAGE} \
+    && if [ "${TARGETARCH}" != "arm64" ]; then \
+        yum install -y ${NV_NVPROF_DEV_PACKAGE}; \
+    fi \
+    && yum clean all \
+    && rm -rf /var/cache/yum/*
+
+ENV LIBRARY_PATH=/usr/local/cuda/lib64/stubs
+
+# Install CUDA devel cudnn from:
+# https://gitlab.com/nvidia/container-images/cuda/-/blob/master/dist/12.9.2/ubi9/devel/cudnn/Dockerfile
+ENV NV_CUDNN_VERSION=9.10.2.21-1
+ENV NV_CUDNN_PACKAGE=libcudnn9-cuda-12-${NV_CUDNN_VERSION}
+ENV NV_CUDNN_PACKAGE_DEV=libcudnn9-devel-cuda-12-${NV_CUDNN_VERSION}
+
+LABEL com.nvidia.cudnn.version="${NV_CUDNN_VERSION}"
+
+RUN yum install -y \
+    ${NV_CUDNN_PACKAGE} \
+    ${NV_CUDNN_PACKAGE_DEV} \
+    && yum clean all \
+    && rm -rf /var/cache/yum/*
+
+# ---------------------------------------------------------------------------
+# Install Python packages
+# ---------------------------------------------------------------------------
+
+RUN pip install --no-cache-dir -U "micropipenv[toml]"
+
+# Pipfile.lock provides ray, training-hub, and their transitive Python deps.
+# torch is NOT in the Pipfile — it comes exclusively from the AIPCC index below.
+COPY Pipfile.lock ./
+RUN micropipenv install && rm -f ./Pipfile.lock
+
+# AIPCC index for pre-built CUDA wheels (all compiled against torch 2.9.0 build 13).
+ENV AIPCC_INDEX=https://packages.redhat.com/api/pypi/public-rhai/rhoai/3.3/cuda12.9-ubi9/simple/
+
+# CUDA extensions from AIPCC — overwrites PyPI torch with the ABI-matched build.
+RUN pip install --no-cache-dir --no-deps --force-reinstall \
+    --extra-index-url ${AIPCC_INDEX} \
+    "torch==2.9.0" \
+    "torchvision==0.24.0" \
+    "torchaudio==2.9.0" \
+    "triton==3.5.0" \
+    "vllm==0.13.0" \
+    "flash-attn==2.8.3" \
+    "mamba-ssm==2.3.0" \
+    "causal-conv1d==1.6.0" \
+    "xformers==0.0.33.post2"
+
+# verl: --no-deps because its numpy<2.0.0 pin conflicts with vllm's numpy>=2.
+RUN pip install --no-cache-dir --no-deps verl==0.8.0
+
+# vllm 0.13.0 + verl 0.8.0 runtime dependencies.
+# Many deps (aiohttp, fastapi, pydantic, numpy, ray, transformers, etc.) are
+# already installed via Pipfile.lock and are not repeated here.
+RUN pip install --no-cache-dir \
+    --extra-index-url ${AIPCC_INDEX} \
+    anthropic==0.71.0 \
+    blake3 \
+    cachetools \
+    cbor2 \
+    cloudpickle \
+    "compressed-tensors==0.13.0" \
+    depyf==0.20.0 \
+    diskcache==5.6.3 \
+    email-validator \
+    "gguf>=0.17.0" \
+    ijson \
+    lark==1.2.2 \
+    "llguidance>=1.3.0,<1.4.0" \
+    lm-format-enforcer==0.11.3 \
+    mcp \
+    "mistral-common>=1.8.5" \
+    "model-hosting-container-standards>=0.1.9,<1.0.0" \
+    msgspec \
+    openai \
+    "openai-harmony>=0.0.3" \
+    outlines-core==0.2.11 \
+    partial-json-parser \
+    "prometheus-fastapi-instrumentator>=7.0.0" \
+    pybase64 \
+    python-json-logger \
+    python-multipart \
+    setproctitle \
+    tiktoken \
+    watchfiles \
+    xgrammar==0.1.27 \
+    codetiming \
+    hydra-core \
+    pybind11 \
+    pylatexenc \
+    tensorboard \
+    "tensordict!=0.9.0,<=0.10.0,>=0.8.0" \
+    torchdata \
+    wandb \
+    "torch==2.9.0"
+
+# pyzmq: must come from PyPI (manylinux wheel bundles libzmq.so.5).
+# The AIPCC wheel expects system libzmq which doesn't exist in UBI9.
+RUN pip install --no-cache-dir pyzmq
+
+# Restore user workspace
+USER 1001
+WORKDIR /opt/app-root/src
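
The `NVIDIA_REQUIRE_CUDA` value in the Dockerfile above is a regular product of twelve brand names and six driver branches, which makes it hard to review by eye. A minimal Python sketch that rebuilds the string from those two lists is given below, so a reviewer could diff a generated value against the committed one. `build_require_cuda`, `BRANDS`, and `DRIVER_BRANCHES` are hypothetical names and not part of the PR.

```python
# Brand names and driver branches as they appear in the Dockerfile's ENV line.
BRANDS = [
    "unknown", "grid", "tesla", "nvidia", "quadro", "quadrortx",
    "nvidiartx", "vapps", "vpc", "vcs", "vws", "cloudgaming",
]
DRIVER_BRANCHES = [535, 550, 560, 565, 570, 580]


def build_require_cuda(cuda_min="12.9"):
    """Rebuild the constraint string: one clause per (driver branch, brand)."""
    clauses = [f"cuda>={cuda_min}"]
    for branch in DRIVER_BRANCHES:
        for brand in BRANDS:
            # Each clause restricts one brand to a single driver major version.
            clauses.append(f"brand={brand},driver>={branch},driver<{branch + 1}")
    return " ".join(clauses)


if __name__ == "__main__":
    value = build_require_cuda()
    print(value.count("brand="), "brand clauses")
```

Generating the value this way is only a review aid; the Dockerfile itself copies the string from NVIDIA's upstream image definition.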


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Dockerfile must be converted to multi-stage to satisfy the repository’s Docker security policy.

Severity: Medium. Exploit scenario: shipping build tooling and intermediate artifacts in the final image increases attack surface and post-exploit utility (CWE-250 context: least-privilege hardening gap).
Split into builder/runtime stages and copy only required runtime artifacts.

Remediation code (pattern)

-FROM registry.access.redhat.com/ubi9/python-${PYTHON_VERSION}:${IMAGE_TAG}
+FROM registry.access.redhat.com/ubi9/python-${PYTHON_VERSION}:${IMAGE_TAG} AS builder
@@
-RUN pip install --no-cache-dir ...
+RUN pip install --no-cache-dir ...
@@
-USER 1001
-WORKDIR /opt/app-root/src
+FROM registry.access.redhat.com/ubi9/python-${PYTHON_VERSION}:${IMAGE_TAG} AS runtime
+USER 0
+# install only runtime OS/CUDA libs needed at runtime
+COPY --from=builder /opt/app-root /opt/app-root
+COPY --from=builder /usr/local/lib/python3.12/site-packages /usr/local/lib/python3.12/site-packages
+USER 1001
+WORKDIR /opt/app-root/src

As per coding guidelines, "**/Dockerfile*: DOCKERFILE SECURITY ... 4. Use multi-stage builds".
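
Whether a Dockerfile is single-stage or multi-stage can be checked mechanically by counting its `FROM` instructions. The following is a minimal Python sketch of such a check, not the repository's actual policy tooling; `list_stages` and `is_multi_stage` are hypothetical helpers, and the parser ignores line continuations and other edge cases.

```python
import re

# Matches `FROM [--flag=value ...] <image> [AS <name>]`, case-insensitively.
FROM_LINE = re.compile(
    r"^\s*FROM\s+(?:--\S+\s+)*(?P<image>\S+)(?:\s+AS\s+(?P<name>\S+))?\s*$",
    re.IGNORECASE,
)


def list_stages(dockerfile_text):
    """Return one (image, stage_name_or_None) tuple per FROM instruction."""
    stages = []
    for line in dockerfile_text.splitlines():
        match = FROM_LINE.match(line)
        if match:
            stages.append((match.group("image"), match.group("name")))
    return stages


def is_multi_stage(dockerfile_text):
    """A build is multi-stage when it contains more than one FROM."""
    return len(list_stages(dockerfile_text)) > 1


if __name__ == "__main__":
    # Hypothetical two-stage layout following the remediation pattern.
    sample = (
        "FROM example/base:tag AS builder\n"
        "RUN pip install --no-cache-dir example\n"
        "FROM example/base:tag AS runtime\n"
        "COPY --from=builder /opt/app-root /opt/app-root\n"
    )
    print(list_stages(sample))
```

Applied to the Dockerfile in this PR, which has a single `FROM`, the check would report a single-stage build.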

🧰 Tools

🪛 Betterleaks (1.5.0)

[high] 23-23: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

(generic-api-key)

Source: Coding guidelines

coderabbitai · 2026-06-22T15:42:50Z

+RUN pip install --no-cache-dir --no-deps --force-reinstall \
+    --extra-index-url ${AIPCC_INDEX} \
+    "torch==2.9.0" \


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

find . -name "Dockerfile" -path "*ray/cuda/2.55.1*" -type f

Repository: opendatahub-io/distributed-workloads

Length of output: 136

🏁 Script executed:

wc -l images/runtime/ray/cuda/2.55.1-py312-cu129/Dockerfile

Repository: opendatahub-io/distributed-workloads

Length of output: 138

🏁 Script executed:

sed -n '1,30p' images/runtime/ray/cuda/2.55.1-py312-cu129/Dockerfile

Repository: opendatahub-io/distributed-workloads

Length of output: 3938

🏁 Script executed:

sed -n '140,160p' images/runtime/ray/cuda/2.55.1-py312-cu129/Dockerfile

Repository: opendatahub-io/distributed-workloads

Length of output: 900

🏁 Script executed:

sed -n '160,220p' images/runtime/ray/cuda/2.55.1-py312-cu129/Dockerfile

Repository: opendatahub-io/distributed-workloads

Length of output: 1579

🏁 Script executed:

# Check if Dockerfile uses multi-stage build (FROM statements)
grep -n "^FROM" images/runtime/ray/cuda/2.55.1-py312-cu129/Dockerfile

Repository: opendatahub-io/distributed-workloads

Length of output: 157

🏁 Script executed:

# Verify guideline enforcement context in repo
find . -name "*.md" -o -name "*.yml" | head -20 | xargs grep -l "multi-stage\|Dockerfile" 2>/dev/null | head -5

Repository: opendatahub-io/distributed-workloads

Length of output: 125

🏁 Script executed:

# Check if multi-stage build is a repo-wide requirement
cat ./.github/workflows/snyk-dockerfile-scan.yml | head -50

Repository: opendatahub-io/distributed-workloads

Length of output: 1131

Pin all packages and use AIPCC as primary index to prevent dependency confusion (CWE-829).

Unpinned packages (blake3, cachetools, cbor2, cloudpickle, email-validator, ijson, mcp, msgspec, openai, wandb, etc.) with --extra-index-url allow pip to resolve to malicious PyPI versions if AIPCC doesn't provide them. Correct approach: make AIPCC primary index (--index-url), make PyPI a fallback (--extra-index-url), and require exact version pins for all packages.

Remediation code

 RUN pip install --no-cache-dir --no-deps --force-reinstall \
+    --index-url ${AIPCC_INDEX} \
-    --extra-index-url ${AIPCC_INDEX} \
+    --extra-index-url https://pypi.org/simple \
     "torch==2.9.0" \
     "torchvision==0.24.0" \
     "torchaudio==2.9.0" \
@@
 RUN pip install --no-cache-dir \
+    --index-url ${AIPCC_INDEX} \
-    --extra-index-url ${AIPCC_INDEX} \
+    --extra-index-url https://pypi.org/simple \
     anthropic==0.71.0 \
-    blake3 \
-    cachetools \
-    cbor2 \
-    cloudpickle \
+    blake3==<exact-version> \
+    cachetools==<exact-version> \
+    cbor2==<exact-version> \
+    cloudpickle==<exact-version> \
-    email-validator \
+    email-validator==<exact-version> \
@@
-    ijson \
+    ijson==<exact-version> \
     lark==1.2.2 \
     "llguidance>=1.3.0,<1.4.0" \
     lm-format-enforcer==0.11.3 \
-    mcp \
+    mcp==<exact-version> \
     "mistral-common>=1.8.5" \
     "model-hosting-container-standards>=0.1.9,<1.0.0" \
-    msgspec \
-    openai \
+    msgspec==<exact-version> \
+    openai==<exact-version> \
     "openai-harmony>=0.0.3" \
     outlines-core==0.2.11 \
-    partial-json-parser \
+    partial-json-parser==<exact-version> \
     "prometheus-fastapi-instrumentator>=7.0.0" \
-    pybase64 \
-    python-json-logger \
-    python-multipart \
+    pybase64==<exact-version> \
+    python-json-logger==<exact-version> \
+    python-multipart==<exact-version> \
-    setproctitle \
+    setproctitle==<exact-version> \
     tiktoken \
     watchfiles \
     xgrammar==0.1.27 \
-    codetiming \
-    hydra-core \
-    pybind11 \
-    pylatexenc \
+    codetiming==<exact-version> \
+    hydra-core==<exact-version> \
+    pybind11==<exact-version> \
+    pylatexenc==<exact-version> \
     tensorboard \
     "tensordict!=0.9.0,<=0.10.0,>=0.8.0" \
     torchdata \
-    wandb \
+    wandb==<exact-version> \
     "torch==2.9.0"

Also applies to: lines 167–207.
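
Finding the requirements that still lack an exact pin can be automated. The sketch below takes the argument list of a `pip install` command and returns every requirement that is not of the form `name==version`. `unpinned_requirements` is a hypothetical helper, the index URL in the example is a placeholder, and only the two index options are treated as taking a value.

```python
import re

# A requirement counts as pinned only if it has the form name==version,
# optionally with extras, and no wildcard or additional clauses.
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]+\])?==[^,*]+$")

# Options whose following token is a value, not a requirement.
TAKES_VALUE = {"--index-url", "--extra-index-url"}


def unpinned_requirements(pip_args):
    """Return requirement tokens from `pip_args` that are not exact pins."""
    result = []
    skip_next = False
    for token in pip_args:
        if skip_next:
            skip_next = False
            continue
        if token in TAKES_VALUE:
            skip_next = True
            continue
        if token.startswith("-"):
            continue  # boolean flags such as --no-cache-dir
        if not PINNED.match(token):
            result.append(token)
    return result


if __name__ == "__main__":
    # A shortened argument list modelled on the Dockerfile's install step.
    args = [
        "--no-cache-dir",
        "--extra-index-url", "https://index.example.invalid/simple/",
        "anthropic==0.71.0", "blake3", "gguf>=0.17.0", "torch==2.9.0",
    ]
    print(unpinned_requirements(args))
```

Range specifiers such as `gguf>=0.17.0` are reported as unpinned on purpose, since they can still resolve to a different version on a later build.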

Source: Coding guidelines

Rename 2.55.1-py312-cu128 to 2.55.1-py312-cu129 and add training-hub runtime with vllm 0.13.0, verl 0.8.0, and AIPCC CUDA extensions. Co-authored-by: Cursor <cursoragent@cursor.com>

Keep the existing 2.55.1-py312-cu128 image and Tekton pipeline unchanged, adding the cu129 Training Hub variant as a separate image rather than replacing cu128. Co-authored-by: Cursor <cursoragent@cursor.com>

coderabbitai

🧹 Nitpick comments (2)

images/runtime/ray/cuda/2.55.1-py312-cu129/README.md (1)
3-3: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Hyphenate compound modifier.

"CUDA enabled" should be "CUDA-enabled" when used attributively before "container image."
Fix
-CUDA enabled container image for Ray in OpenShift AI.
+CUDA-enabled container image for Ray in OpenShift AI.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@images/runtime/ray/cuda/2.55.1-py312-cu129/README.md` at line 3, Hyphenate
the attributive compound in the README summary: update the phrase in the image
description so that “CUDA enabled” becomes “CUDA-enabled” before “container
image.” Make the wording change in the introductory description only, keeping
the rest of the text unchanged.
Source: Linters/SAST tools
images/runtime/ray/cuda/2.55.1-py312-cu129/Dockerfile (1)
138-138: 🔒 Security & Privacy | 🔵 Trivial | ⚡ Quick win

Pin micropipenv[toml] to an exact version.

Unpinned build tooling can silently change how Pipfile.lock is resolved/installed between builds (CWE-829 class risk), undermining the reproducibility the rest of this file otherwise enforces via strict CUDA/NCCL/cuDNN pins.
Remediation
-RUN pip install --no-cache-dir -U "micropipenv[toml]"
+RUN pip install --no-cache-dir "micropipenv[toml]==<pinned-version>"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@images/runtime/ray/cuda/2.55.1-py312-cu129/Dockerfile` at line 138, The
Dockerfile currently installs micropipenv[toml] without an exact version, which
leaves build tooling behavior mutable across builds. Update the RUN pip install
step to pin micropipenv[toml] to a specific version in this CUDA image so the
build remains reproducible and consistent with the other strict dependency pins.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 60d831ed-2c0f-440f-90b1-e06420f9b482

📥 Commits

Reviewing files that changed from the base of the PR and between 7f879b1 and 19c9704.

⛔ Files ignored due to path filters (1)

images/runtime/ray/cuda/2.55.1-py312-cu129/Pipfile.lock is excluded by !**/*.lock

📒 Files selected for processing (7)

.tekton/ray-2.55.1-py312-cu129-push.yaml
images/runtime/ray/cuda/2.55.1-py312-cu129/Dockerfile
images/runtime/ray/cuda/2.55.1-py312-cu129/NGC-DL-CONTAINER-LICENSE
images/runtime/ray/cuda/2.55.1-py312-cu129/Pipfile
images/runtime/ray/cuda/2.55.1-py312-cu129/README.md
images/runtime/ray/cuda/2.55.1-py312-cu129/cuda.repo-arm64
images/runtime/ray/cuda/2.55.1-py312-cu129/cuda.repo-x86_64

✅ Files skipped from review due to trivial changes (1)

images/runtime/ray/cuda/2.55.1-py312-cu129/NGC-DL-CONTAINER-LICENSE

🚧 Files skipped from review as they are similar to previous changes (2)

images/runtime/ray/cuda/2.55.1-py312-cu129/Pipfile
.tekton/ray-2.55.1-py312-cu129-push.yaml

openshift-ci Bot requested review from laurafitzgerald and sutaakar June 22, 2026 15:36

coderabbitai Bot reviewed Jun 22, 2026


Fiona-Waters changed the title from "Upgrade Ray CUDA for use with Training Hub" to "Add Ray CUDA 12.9 image with Training Hub support" on Jul 2, 2026

Fiona-Waters and others added 2 commits July 3, 2026 08:30

Upgrade Ray CUDA image from 12.8 to 12.9

3171138

Rename 2.55.1-py312-cu128 to 2.55.1-py312-cu129 and add training-hub runtime with vllm 0.13.0, verl 0.8.0, and AIPCC CUDA extensions.
Co-authored-by: Cursor <cursoragent@cursor.com>

Restore cu128 Ray image alongside new cu129

19c9704

Keep the existing 2.55.1-py312-cu128 image and Tekton pipeline unchanged, adding the cu129 Training Hub variant as a separate image rather than replacing cu128.
Co-authored-by: Cursor <cursoragent@cursor.com>

Fiona-Waters force-pushed the training-hub-ray-image branch from 3ad10ae to 19c9704 on July 3, 2026 07:31

coderabbitai Bot reviewed Jul 3, 2026


Add Ray CUDA 12.9 image with Training Hub support #929
Fiona-Waters wants to merge 2 commits into opendatahub-io:main from Fiona-Waters:training-hub-ray-image
