Add Ray CUDA 12.9 image with Training Hub support #929

Open
Fiona-Waters wants to merge 2 commits into
opendatahub-io:main from
Fiona-Waters:training-hub-ray-image

@Fiona-Waters Fiona-Waters commented Jun 22, 2026

Contributor

Summary

RHAISTRAT-1693: Integrate Training Hub with Ray so that Training Hub algorithms (SFT, OSFT, LoRA, GRPO) can be run on Ray clusters in OpenShift AI.
Jira for the image work: https://redhat.atlassian.net/browse/RHOAIENG-61568

Adds a new Ray CUDA 12.9 image (2.55.1-py312-cu129) with training-hub and its runtime dependencies (vllm, verl, CUDA extensions) sourced from the RHOAI 3.3 AIPCC index. The existing CUDA 12.8 image (2.55.1-py312-cu128) is kept unchanged.

What's added

New image: 2.55.1-py312-cu129

  • CUDA 12.9.2 toolkit (NCCL 2.27.3, cuDNN 9.10.2.21)
  • Ray 2.55.1 + training-hub 0.8.1 and transitive dependencies via Pipfile.lock
  • AIPCC CUDA wheels: torch 2.9.0, torchvision, torchaudio, triton, vllm 0.13.0, flash-attn 2.8.3, mamba-ssm, causal-conv1d, xformers
  • verl 0.8.0 (installed with --no-deps to avoid numpy conflict)
  • vllm + verl runtime dependencies (anthropic, compressed-tensors, xgrammar, etc.)
  • pyzmq from PyPI (AIPCC wheel expects system libzmq which doesn't exist in UBI9)
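The version pins listed above could be spot-checked inside the built image. The following is a minimal sketch, not part of the PR: `check_pins` is a hypothetical helper that compares `name==version` pins against `pip freeze`-style text. It assumes the freeze output spells package names the same way as the pins (some wheels report `flash_attn` rather than `flash-attn`) and that versions carry no local suffix.

```shell
# Hypothetical helper (not part of the PR).
# Usage: check_pins FREEZE_FILE name==version [name==version ...]
# Prints one ok/MISMATCH/MISSING line per pin; returns 1 if any pin fails.
check_pins() {
  local freeze_file pin name want got rc
  freeze_file=$1
  rc=0
  shift
  for pin in "$@"; do
    name=${pin%%==*}
    want=${pin#*==}
    # First line starting with "<name>==", case-insensitive; field 3 is the version.
    got=$(grep -i -m1 "^${name}==" "$freeze_file" | cut -d= -f3 || true)
    if [ -z "$got" ]; then
      echo "MISSING  $name (want $want)"
      rc=1
    elif [ "$got" != "$want" ]; then
      echo "MISMATCH $name (want $want, got $got)"
      rc=1
    else
      echo "ok       $name==$want"
    fi
  done
  return $rc
}

# Example, run inside the image (versions taken from the summary above):
#   pip freeze > /tmp/freeze.txt
#   check_pins /tmp/freeze.txt ray==2.55.1 training-hub==0.8.1 \
#     torch==2.9.0 vllm==0.13.0 verl==0.8.0
```

A non-zero exit status makes the helper usable as a smoke test in a build or CI step.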

New Tekton pipeline: ray-2.55.1-py312-cu129-push.yaml

  • Builds and pushes to quay.io/modh/ray:2.55.1-py312-cu129
  • Triggers on changes to images/runtime/ray/cuda/2.55.1-py312-cu129/**

Unchanged: 2.55.1-py312-cu128

  • Existing image, Pipfile, Pipfile.lock, and Tekton pipeline are preserved as-is.

Pipfile changes (cu129)

  • Added: training-hub==0.8.1, transformers>=4.57.6,<5.0, kernels>=0.9.0,<0.15, unsloth>=2026.1.1, einops>=0.8, bitsandbytes>=0.47.0, liger-kernel>=0.5.10

How Has This Been Tested?

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

Summary by CodeRabbit

  • New Features

    • Added a new Ray runtime image variant for CUDA 12.9 with Python 3.12.
    • Updated build automation to publish and tag the new CUDA 12.9 image variant.
  • Documentation

    • Updated the runtime documentation to reflect CUDA 12.9 support.
  • Chores

    • Added supporting licensing and repository configuration for CUDA package installation.

openshift-ci Bot commented Jun 22, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kramaranya for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai Bot commented Jun 22, 2026

📝 Walkthrough

Adds a new Ray 2.55.1 Python 3.12 CUDA 12.9 (cu129) runtime image: Dockerfile with per-arch CUDA repo selection, driver/cuDNN installation, and pinned PyTorch/verl/serving dependencies installed via pip and Pipfile; new Pipfile, README update, NVIDIA container license file, and x86_64/arm64 cuda.repo files. Tekton PipelineRun/trigger config updated to build and push this cu129 image variant.

Estimated code review effort: 4 (Complex) | ~60 minutes

Security concerns (CWE-flagged)

  • GPG key import over unauthenticated transport risk: cuda.repo-x86_64/cuda.repo-arm64 reference file:///etc/pki/rpm-gpg/RPM-GPG-KEY-NVIDIA — verify the key file itself is fetched via integrity-checked channel during Dockerfile build (CWE-494: Download of Code Without Integrity Check) since the repo files only enforce gpgcheck post-import, not the initial key retrieval.
  • pip install with --no-deps for verl==0.8.0 and force-reinstall of ABI-matched wheels from a third-party AIPCC CUDA index bypasses supply-chain hash verification (CWE-829: Inclusion of Functionality from Untrusted Control Sphere). No hash pinning (--require-hashes) visible for pip-installed packages — full dependency tree (torch, vllm, flash-attn, mamba-ssm, xformers, triton) pulled from an external, non-PyPI index without cited checksum verification.
  • Root-to-non-root user switch (USER 1001) occurs only at lines 213–215, after all package installs. This is standard practice, but confirm no writable setuid/setgid artifacts or credentials (e.g. pip cache, cuda.repo GPG key copies) persist in image layers accessible to UID 1001 (CWE-732: Incorrect Permission Assignment for Critical Resource).
  • Pipfile uses unpinned/loose version ranges for transformers, kernels, unsloth, einops, bitsandbytes, liger-kernel — floating dependency ranges increase risk of unreviewed code injection via future malicious releases (CWE-1104: Use of Unmaintained Third Party Components / dependency confusion surface). Recommend pinning exact versions or hash-locking via Pipfile.lock committed to repo (noted as deleted mid-build, so no lockfile persists for audit — CWE-1329: Reliance on Component That is Not Updateable).
  • Tekton pipeline trigger CEL condition and additional-tag/output-image params directly interpolate {{revision}} — confirm this is not attacker-controllable via untrusted PR source (CWE-88: Argument Injection) given this is a push-triggered pipeline building/publishing to quay.io/modh/ray.
  • NVIDIA_REQUIRE_CUDA driver/brand constraint string is duplicated boilerplate — no CVE noted here, but ensure the CUDA 12.9 base packages pinned in Dockerfile are current: verify no known CVEs pending against pinned cuda-cudart/libcublas/libnccl versions before merge.

No praise. Verify hash pinning and lockfile retention before approval.
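On the first concern above (integrity of the downloaded GPG key), the Dockerfile in this PR already appears to pin a digest in `NVIDIA_GPGKEY_SUM` and check it with `sha256sum -c --strict`. The sketch below shows that download-then-verify pattern in isolation, on a throwaway local file. `verify_sha256` is a hypothetical helper name, and GNU coreutils `sha256sum` is assumed.

```shell
# Hypothetical helper (not part of the PR).
# Usage: verify_sha256 FILE EXPECTED_HEX
# Returns 0 only if FILE's SHA-256 digest equals EXPECTED_HEX.
verify_sha256() {
  local file expected
  file=$1
  expected=$2
  # sha256sum -c reads "<digest>  <path>" lines from stdin ("-").
  # --strict fails on malformed lines; --status reports via exit code only.
  echo "${expected}  ${file}" | sha256sum -c --strict --status -
}

# Demo on a temporary file; no network access is needed.
demo=$(mktemp)
printf 'example key material\n' > "$demo"
good=$(sha256sum "$demo" | cut -d' ' -f1)
verify_sha256 "$demo" "$good" && echo "digest matches"
printf 'tampered\n' >> "$demo"
verify_sha256 "$demo" "$good" || echo "digest mismatch detected"
rm -f "$demo"
```

Because the expected digest lives in the Dockerfile itself, a changed or tampered key fails the build step rather than being imported silently.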

🚥 Pre-merge checks | ✅ 10
✅ Passed checks (10 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately reflects the main change: a new Ray CUDA 12.9 image with Training Hub support. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Contribution Quality And Spam Detection | ✅ Passed | Only one weak signal: a template-like PR body. No second-category evidence; author has prior repo commits and the description includes specific Jira links and concrete image details. |
| No Hardcoded Secrets | ✅ Passed | No hardcoded secrets found; Tekton only references templated secretName '{{ git_auth_secret }}', and no embedded creds/private keys appeared (CWE-798). |
| No Weak Cryptography | ✅ Passed | No MD5/SHA1/DES/RC4/3DES/Blowfish/ECB, custom crypto, or secret compares found; only SHA-256 checksums and harmless deps like cryptography. |
| No Injection Vectors | ✅ Passed | PASS: No CWE-78/89/94/502/79 sink found; new YAML/Dockerfile only use hardcoded values and trusted enum branching (e.g., TARGETARCH). |
| No Privileged Containers | ✅ Passed | No privileged/hostPID/hostNetwork/hostIPC/allowPrivilegeEscalation/runAsUser:0 settings were found in the new Ray CUDA manifest tree, and the Dockerfile drops to USER 1001 at the end. |
| No Sensitive Data In Logs | ✅ Passed | No log statements or debug output exposing secrets/PII were added; this PR is build/config only. No CWE-532 evidence. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@images/runtime/ray/cuda/2.55.1-py312-cu129/Dockerfile`:
- Around line 1-215: The Dockerfile currently uses a single-stage build that
includes unnecessary build tooling, development packages, and intermediate
artifacts in the final image, increasing the attack surface. Convert this to a
multi-stage build by creating a builder stage that performs all installations
and a runtime stage that copies only the required runtime artifacts. In the
builder stage, keep all the current installation steps including the yum install
commands for development packages and the pip install commands. In a new runtime
stage, use the same base image (UBI9 Python) and copy only the necessary runtime
components from the builder: the installed Python packages from site-packages,
CUDA runtime libraries, and runtime configuration files. Remove the
development-only packages (those with -devel suffix, make, findutils,
cuda-command-line-tools, and similar build tools) from the runtime stage by not
copying those artifacts and not installing them in the final stage. Ensure the
environment variables for CUDA runtime paths and Python are preserved in the
runtime stage.
- Around line 149-151: The pip install commands use --extra-index-url for AIPCC
which makes it a secondary index with PyPI as primary, creating a dependency
confusion vulnerability where unpinned packages could resolve to malicious PyPI
versions. Fix this by changing --extra-index-url to --index-url for the
AIPCC_INDEX variable to make it the primary index, adding --extra-index-url for
PyPI as a fallback, and pinning all packages (torch, blake3, cachetools, cbor2,
cloudpickle, email-validator, ijson, mcp, msgspec, openai, wandb, and any
others) to exact versions instead of leaving them unpinned. Apply this change to
all affected pip install commands in the Dockerfile.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 4a59c4ee-5477-42f9-a5d3-be4330a8dc80

📥 Commits

Reviewing files that changed from the base of the PR and between 36f9003 and 7f879b1.

⛔ Files ignored due to path filters (2)
  • images/runtime/ray/cuda/2.55.1-py312-cu128/Pipfile.lock is excluded by !**/*.lock
  • images/runtime/ray/cuda/2.55.1-py312-cu129/Pipfile.lock is excluded by !**/*.lock
📒 Files selected for processing (9)
  • .tekton/ray-2.55.1-py312-cu129-push.yaml
  • images/runtime/ray/cuda/2.55.1-py312-cu128/Dockerfile
  • images/runtime/ray/cuda/2.55.1-py312-cu128/Pipfile
  • images/runtime/ray/cuda/2.55.1-py312-cu129/Dockerfile
  • images/runtime/ray/cuda/2.55.1-py312-cu129/NGC-DL-CONTAINER-LICENSE
  • images/runtime/ray/cuda/2.55.1-py312-cu129/Pipfile
  • images/runtime/ray/cuda/2.55.1-py312-cu129/README.md
  • images/runtime/ray/cuda/2.55.1-py312-cu129/cuda.repo-arm64
  • images/runtime/ray/cuda/2.55.1-py312-cu129/cuda.repo-x86_64
💤 Files with no reviewable changes (2)
  • images/runtime/ray/cuda/2.55.1-py312-cu128/Pipfile
  • images/runtime/ray/cuda/2.55.1-py312-cu128/Dockerfile

Comment on lines +1 to +215
ARG PYTHON_VERSION=312
ARG IMAGE_TAG=9.7-1778488949

FROM registry.access.redhat.com/ubi9/python-${PYTHON_VERSION}:${IMAGE_TAG}

ARG TARGETARCH

LABEL name="ray-ubi9-py312-cu129" \
summary="CUDA 12.9 Python 3.12 image based on UBI9 for Ray" \
description="CUDA 12.9 Python 3.12 image based on UBI9 for Ray" \
io.k8s.display-name="CUDA 12.9 Python 3.12 base image for Ray" \
io.k8s.description="CUDA 12.9 Python 3.12 image based on UBI9 for Ray" \
authoritative-source-url="https://github.com/opendatahub-io/distributed-workloads"

# Install CUDA base from:
# https://gitlab.com/nvidia/container-images/cuda/-/blob/master/dist/12.9.2/ubi9/base/Dockerfile
USER 0
WORKDIR /opt/app-root/bin

ENV NVIDIA_REQUIRE_CUDA="cuda>=12.9 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551 brand=unknown,driver>=560,driver<561 brand=grid,driver>=560,driver<561 brand=tesla,driver>=560,driver<561 brand=nvidia,driver>=560,driver<561 brand=quadro,driver>=560,driver<561 brand=quadrortx,driver>=560,driver<561 brand=nvidiartx,driver>=560,driver<561 brand=vapps,driver>=560,driver<561 brand=vpc,driver>=560,driver<561 brand=vcs,driver>=560,driver<561 brand=vws,driver>=560,driver<561 brand=cloudgaming,driver>=560,driver<561 brand=unknown,driver>=565,driver<566 brand=grid,driver>=565,driver<566 brand=tesla,driver>=565,driver<566 brand=nvidia,driver>=565,driver<566 brand=quadro,driver>=565,driver<566 brand=quadrortx,driver>=565,driver<566 brand=nvidiartx,driver>=565,driver<566 brand=vapps,driver>=565,driver<566 brand=vpc,driver>=565,driver<566 brand=vcs,driver>=565,driver<566 brand=vws,driver>=565,driver<566 brand=cloudgaming,driver>=565,driver<566 brand=unknown,driver>=570,driver<571 brand=grid,driver>=570,driver<571 brand=tesla,driver>=570,driver<571 brand=nvidia,driver>=570,driver<571 brand=quadro,driver>=570,driver<571 brand=quadrortx,driver>=570,driver<571 
brand=nvidiartx,driver>=570,driver<571 brand=vapps,driver>=570,driver<571 brand=vpc,driver>=570,driver<571 brand=vcs,driver>=570,driver<571 brand=vws,driver>=570,driver<571 brand=cloudgaming,driver>=570,driver<571 brand=unknown,driver>=580,driver<581 brand=grid,driver>=580,driver<581 brand=tesla,driver>=580,driver<581 brand=nvidia,driver>=580,driver<581 brand=quadro,driver>=580,driver<581 brand=quadrortx,driver>=580,driver<581 brand=nvidiartx,driver>=580,driver<581 brand=vapps,driver>=580,driver<581 brand=vpc,driver>=580,driver<581 brand=vcs,driver>=580,driver<581 brand=vws,driver>=580,driver<581 brand=cloudgaming,driver>=580,driver<581"
ENV NV_CUDA_CUDART_VERSION=12.9.79-1

RUN NVIDIA_GPGKEY_SUM=d0664fbbdb8c32356d45de36c5984617217b2d0bef41b93ccecd326ba3b80c87 && \
if [ "${TARGETARCH}" = "arm64" ]; then NVARCH=sbsa; else NVARCH=x86_64; fi && \
curl -fsSL https://developer.download.nvidia.com/compute/cuda/repos/rhel9/${NVARCH}/D42D0685.pub | sed '/^Version/d' > /etc/pki/rpm-gpg/RPM-GPG-KEY-NVIDIA && \
echo "$NVIDIA_GPGKEY_SUM /etc/pki/rpm-gpg/RPM-GPG-KEY-NVIDIA" | sha256sum -c --strict -

ENV CUDA_VERSION=12.9.2

COPY cuda.repo-* ./
COPY NGC-DL-CONTAINER-LICENSE /

RUN if [ "${TARGETARCH}" = "arm64" ]; then \
cp cuda.repo-arm64 /etc/yum.repos.d/cuda.repo; \
else \
cp cuda.repo-x86_64 /etc/yum.repos.d/cuda.repo; \
fi

# For libraries in the cuda-compat-* package: https://docs.nvidia.com/cuda/eula/index.html#attachment-a
RUN yum upgrade -y && yum install -y \
cuda-cudart-12-9-${NV_CUDA_CUDART_VERSION} \
cuda-compat-12-9 \
&& yum clean all \
&& rm -rf /var/cache/yum/*

# nvidia-docker 1.0
RUN echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf && \
echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf

ENV PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:${PATH}
ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64

# nvidia-container-runtime
ENV NVIDIA_VISIBLE_DEVICES=all
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility

# Install CUDA runtime from:
# https://gitlab.com/nvidia/container-images/cuda/-/blob/master/dist/12.9.2/ubi9/runtime/Dockerfile
ENV NV_CUDA_LIB_VERSION=12.9.2-1
ENV NV_NVTX_VERSION=12.9.79-1
ENV NV_LIBNPP_VERSION=12.4.1.87-1
ENV NV_LIBNPP_PACKAGE=libnpp-12-9-${NV_LIBNPP_VERSION}
ENV NV_LIBCUBLAS_VERSION=12.9.2.10-1
ENV NV_LIBNCCL_PACKAGE_NAME=libnccl
ENV NV_LIBNCCL_PACKAGE_VERSION=2.27.3-1
ENV NV_LIBNCCL_VERSION=2.27.3
ENV NCCL_VERSION=2.27.3
ENV NV_LIBNCCL_PACKAGE=${NV_LIBNCCL_PACKAGE_NAME}-${NV_LIBNCCL_PACKAGE_VERSION}+cuda12.9

RUN yum install -y \
cuda-libraries-12-9-${NV_CUDA_LIB_VERSION} \
cuda-nvtx-12-9-${NV_NVTX_VERSION} \
${NV_LIBNPP_PACKAGE} \
libcublas-12-9-${NV_LIBCUBLAS_VERSION} \
${NV_LIBNCCL_PACKAGE} \
&& yum clean all \
&& rm -rf /var/cache/yum/*

# Set this flag so that libraries can find the location of CUDA
ENV XLA_FLAGS=--xla_gpu_cuda_data_dir=/usr/local/cuda

# Install CUDA devel from:
# https://gitlab.com/nvidia/container-images/cuda/-/blob/master/dist/12.9.2/ubi9/devel/Dockerfile
ENV NV_CUDA_LIB_VERSION=12.9.2-1
# ARM64 doesn't have nvprof package - set in runtime
ENV NV_NVPROF_VERSION=12.9.79-1
ENV NV_NVPROF_DEV_PACKAGE=cuda-nvprof-12-9-${NV_NVPROF_VERSION}
ENV NV_CUDA_CUDART_DEV_VERSION=12.9.79-1
ENV NV_NVML_DEV_VERSION=12.9.79-1
ENV NV_LIBCUBLAS_DEV_VERSION=12.9.2.10-1
ENV NV_LIBNPP_DEV_VERSION=12.4.1.87-1
ENV NV_LIBNPP_DEV_PACKAGE=libnpp-devel-12-9-${NV_LIBNPP_DEV_VERSION}
ENV NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-devel
ENV NV_LIBNCCL_DEV_PACKAGE_VERSION=2.27.3-1
ENV NCCL_VERSION=2.27.3
ENV NV_LIBNCCL_DEV_PACKAGE=${NV_LIBNCCL_DEV_PACKAGE_NAME}-${NV_LIBNCCL_DEV_PACKAGE_VERSION}+cuda12.9
ENV NV_CUDA_NSIGHT_COMPUTE_VERSION=12.9.2-1
ENV NV_CUDA_NSIGHT_COMPUTE_DEV_PACKAGE=cuda-nsight-compute-12-9-${NV_CUDA_NSIGHT_COMPUTE_VERSION}

RUN yum install -y \
make \
findutils \
cuda-command-line-tools-12-9-${NV_CUDA_LIB_VERSION} \
cuda-libraries-devel-12-9-${NV_CUDA_LIB_VERSION} \
cuda-minimal-build-12-9-${NV_CUDA_LIB_VERSION} \
cuda-cudart-devel-12-9-${NV_CUDA_CUDART_DEV_VERSION} \
cuda-nvml-devel-12-9-${NV_NVML_DEV_VERSION} \
libcublas-devel-12-9-${NV_LIBCUBLAS_DEV_VERSION} \
${NV_LIBNPP_DEV_PACKAGE} \
${NV_LIBNCCL_DEV_PACKAGE} \
${NV_CUDA_NSIGHT_COMPUTE_DEV_PACKAGE} \
&& if [ "${TARGETARCH}" != "arm64" ]; then \
yum install -y ${NV_NVPROF_DEV_PACKAGE}; \
fi \
&& yum clean all \
&& rm -rf /var/cache/yum/*

ENV LIBRARY_PATH=/usr/local/cuda/lib64/stubs

# Install CUDA devel cudnn from:
# https://gitlab.com/nvidia/container-images/cuda/-/blob/master/dist/12.9.2/ubi9/devel/cudnn/Dockerfile
ENV NV_CUDNN_VERSION=9.10.2.21-1
ENV NV_CUDNN_PACKAGE=libcudnn9-cuda-12-${NV_CUDNN_VERSION}
ENV NV_CUDNN_PACKAGE_DEV=libcudnn9-devel-cuda-12-${NV_CUDNN_VERSION}

LABEL com.nvidia.cudnn.version="${NV_CUDNN_VERSION}"

RUN yum install -y \
${NV_CUDNN_PACKAGE} \
${NV_CUDNN_PACKAGE_DEV} \
&& yum clean all \
&& rm -rf /var/cache/yum/*

# ---------------------------------------------------------------------------
# Install Python packages
# ---------------------------------------------------------------------------

RUN pip install --no-cache-dir -U "micropipenv[toml]"

# Pipfile.lock provides ray, training-hub, and their transitive Python deps.
# torch is NOT in the Pipfile — it comes exclusively from the AIPCC index below.
COPY Pipfile.lock ./
RUN micropipenv install && rm -f ./Pipfile.lock

# AIPCC index for pre-built CUDA wheels (all compiled against torch 2.9.0 build 13).
ENV AIPCC_INDEX=https://packages.redhat.com/api/pypi/public-rhai/rhoai/3.3/cuda12.9-ubi9/simple/

# CUDA extensions from AIPCC — overwrites PyPI torch with the ABI-matched build.
RUN pip install --no-cache-dir --no-deps --force-reinstall \
--extra-index-url ${AIPCC_INDEX} \
"torch==2.9.0" \
"torchvision==0.24.0" \
"torchaudio==2.9.0" \
"triton==3.5.0" \
"vllm==0.13.0" \
"flash-attn==2.8.3" \
"mamba-ssm==2.3.0" \
"causal-conv1d==1.6.0" \
"xformers==0.0.33.post2"

# verl: --no-deps because its numpy<2.0.0 pin conflicts with vllm's numpy>=2.
RUN pip install --no-cache-dir --no-deps verl==0.8.0

# vllm 0.13.0 + verl 0.8.0 runtime dependencies.
# Many deps (aiohttp, fastapi, pydantic, numpy, ray, transformers, etc.) are
# already installed via Pipfile.lock and are not repeated here.
RUN pip install --no-cache-dir \
--extra-index-url ${AIPCC_INDEX} \
anthropic==0.71.0 \
blake3 \
cachetools \
cbor2 \
cloudpickle \
"compressed-tensors==0.13.0" \
depyf==0.20.0 \
diskcache==5.6.3 \
email-validator \
"gguf>=0.17.0" \
ijson \
lark==1.2.2 \
"llguidance>=1.3.0,<1.4.0" \
lm-format-enforcer==0.11.3 \
mcp \
"mistral-common>=1.8.5" \
"model-hosting-container-standards>=0.1.9,<1.0.0" \
msgspec \
openai \
"openai-harmony>=0.0.3" \
outlines-core==0.2.11 \
partial-json-parser \
"prometheus-fastapi-instrumentator>=7.0.0" \
pybase64 \
python-json-logger \
python-multipart \
setproctitle \
tiktoken \
watchfiles \
xgrammar==0.1.27 \
codetiming \
hydra-core \
pybind11 \
pylatexenc \
tensorboard \
"tensordict!=0.9.0,<=0.10.0,>=0.8.0" \
torchdata \
wandb \
"torch==2.9.0"

# pyzmq: must come from PyPI (manylinux wheel bundles libzmq.so.5).
# The AIPCC wheel expects system libzmq which doesn't exist in UBI9.
RUN pip install --no-cache-dir pyzmq

# Restore user workspace
USER 1001
WORKDIR /opt/app-root/src
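The `NVIDIA_REQUIRE_CUDA` value in the listing above is long enough that it is hard to see which driver branches it names. A small sketch for summarising it follows. `driver_branches` is a hypothetical helper, and it only extracts the lower-bound `driver>=N` terms rather than evaluating the constraint expression.

```shell
# Hypothetical helper (not part of the PR).
# Usage: driver_branches "cuda>=12.9 brand=tesla,driver>=535,driver<536 ..."
# Prints each distinct driver major version found in a "driver>=N" term.
driver_branches() {
  printf '%s\n' "$1" | grep -o 'driver>=[0-9]*' | cut -d= -f2 | sort -un
}

# Shortened sample in the same format as the ENV value above.
sample='cuda>=12.9 brand=tesla,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=550,driver<551'
driver_branches "$sample"   # prints 535 and 550, one per line
```

Run against the full value from the Dockerfile, this should list the branches 535, 550, 560, 565, 570, and 580, which can then be compared with the upstream NVIDIA 12.9.2 base image.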

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Dockerfile must be converted to multi-stage to satisfy the repository’s Docker security policy.

Severity: Medium. Exploit scenario: shipping build tooling and intermediate artifacts in the final image increases attack surface and post-exploit utility (CWE-250 context: least-privilege hardening gap).
Split into builder/runtime stages and copy only required runtime artifacts.

Remediation code (pattern)
-FROM registry.access.redhat.com/ubi9/python-${PYTHON_VERSION}:${IMAGE_TAG}
+FROM registry.access.redhat.com/ubi9/python-${PYTHON_VERSION}:${IMAGE_TAG} AS builder
@@
-RUN pip install --no-cache-dir ...
+RUN pip install --no-cache-dir ...
@@
-USER 1001
-WORKDIR /opt/app-root/src
+FROM registry.access.redhat.com/ubi9/python-${PYTHON_VERSION}:${IMAGE_TAG} AS runtime
+USER 0
+# install only runtime OS/CUDA libs needed at runtime
+COPY --from=builder /opt/app-root /opt/app-root
+COPY --from=builder /usr/local/lib/python3.12/site-packages /usr/local/lib/python3.12/site-packages
+USER 1001
+WORKDIR /opt/app-root/src

As per coding guidelines, "**/Dockerfile*: DOCKERFILE SECURITY ... 4. Use multi-stage builds".
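Whether a Dockerfile already uses more than one stage can be checked mechanically, in the same spirit as the `grep -n "^FROM"` command run later in this review. The sketch below is illustrative only; `list_stages` and `is_multistage` are hypothetical helper names.

```shell
# Hypothetical helpers (not part of the PR).
# Usage: list_stages DOCKERFILE
# Prints "<line number>:<FROM instruction>" for every build stage.
list_stages() {
  grep -n -i -E '^[[:space:]]*FROM[[:space:]]' "$1"
}

# Usage: is_multistage DOCKERFILE
# Returns 0 when the file contains more than one FROM instruction.
is_multistage() {
  local n
  # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
  n=$(grep -c -i -E '^[[:space:]]*FROM[[:space:]]' "$1" || true)
  [ "$n" -gt 1 ]
}

# Example:
#   is_multistage images/runtime/ray/cuda/2.55.1-py312-cu129/Dockerfile \
#     || echo "single-stage build"
```

If the remediation pattern above is adopted, it may be worth confirming where the UBI9 Python image actually installs packages before copying `/usr/local/lib/python3.12/site-packages`, for example by running `python -c 'import site; print(site.getsitepackages())'` in the builder stage.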

🧰 Tools
🪛 Betterleaks (1.5.0)

[high] 23-23: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

(generic-api-key)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@images/runtime/ray/cuda/2.55.1-py312-cu129/Dockerfile` around lines 1 - 215,
The Dockerfile currently uses a single-stage build that includes unnecessary
build tooling, development packages, and intermediate artifacts in the final
image, increasing the attack surface. Convert this to a multi-stage build by
creating a builder stage that performs all installations and a runtime stage
that copies only the required runtime artifacts. In the builder stage, keep all
the current installation steps including the yum install commands for
development packages and the pip install commands. In a new runtime stage, use
the same base image (UBI9 Python) and copy only the necessary runtime components
from the builder: the installed Python packages from site-packages, CUDA runtime
libraries, and runtime configuration files. Remove the development-only packages
(those with -devel suffix, make, findutils, cuda-command-line-tools, and similar
build tools) from the runtime stage by not copying those artifacts and not
installing them in the final stage. Ensure the environment variables for CUDA
runtime paths and Python are preserved in the runtime stage.

Source: Coding guidelines

Comment on lines +149 to +151
RUN pip install --no-cache-dir --no-deps --force-reinstall \
--extra-index-url ${AIPCC_INDEX} \
"torch==2.9.0" \

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

find . -name "Dockerfile" -path "*ray/cuda/2.55.1*" -type f

Repository: opendatahub-io/distributed-workloads

Length of output: 136


🏁 Script executed:

wc -l images/runtime/ray/cuda/2.55.1-py312-cu129/Dockerfile

Repository: opendatahub-io/distributed-workloads

Length of output: 138


🏁 Script executed:

sed -n '1,30p' images/runtime/ray/cuda/2.55.1-py312-cu129/Dockerfile

Repository: opendatahub-io/distributed-workloads

Length of output: 3938


🏁 Script executed:

sed -n '140,160p' images/runtime/ray/cuda/2.55.1-py312-cu129/Dockerfile

Repository: opendatahub-io/distributed-workloads

Length of output: 900


🏁 Script executed:

sed -n '160,220p' images/runtime/ray/cuda/2.55.1-py312-cu129/Dockerfile

Repository: opendatahub-io/distributed-workloads

Length of output: 1579


🏁 Script executed:

# Check if Dockerfile uses multi-stage build (FROM statements)
grep -n "^FROM" images/runtime/ray/cuda/2.55.1-py312-cu129/Dockerfile

Repository: opendatahub-io/distributed-workloads

Length of output: 157


🏁 Script executed:

# Verify guideline enforcement context in repo
find . -name "*.md" -o -name "*.yml" | head -20 | xargs grep -l "multi-stage\|Dockerfile" 2>/dev/null | head -5

Repository: opendatahub-io/distributed-workloads

Length of output: 125


🏁 Script executed:

# Check if multi-stage build is a repo-wide requirement
cat ./.github/workflows/snyk-dockerfile-scan.yml | head -50

Repository: opendatahub-io/distributed-workloads

Length of output: 1131


Pin all packages and use AIPCC as primary index to prevent dependency confusion (CWE-829).

Unpinned packages (blake3, cachetools, cbor2, cloudpickle, email-validator, ijson, mcp, msgspec, openai, wandb, etc.) with --extra-index-url allow pip to resolve to malicious PyPI versions if AIPCC doesn't provide them. Correct approach: make AIPCC primary index (--index-url), make PyPI a fallback (--extra-index-url), and require exact version pins for all packages.

Remediation code
RUN pip install --no-cache-dir --no-deps --force-reinstall \
+    --index-url ${AIPCC_INDEX} \
-    --extra-index-url ${AIPCC_INDEX} \
+    --extra-index-url https://pypi.org/simple \
     "torch==2.9.0" \
     "torchvision==0.24.0" \
     "torchaudio==2.9.0" \
@@
RUN pip install --no-cache-dir \
+    --index-url ${AIPCC_INDEX} \
-    --extra-index-url ${AIPCC_INDEX} \
+    --extra-index-url https://pypi.org/simple \
     anthropic==0.71.0 \
-    blake3 \
-    cachetools \
-    cbor2 \
-    cloudpickle \
+    blake3==<exact-version> \
+    cachetools==<exact-version> \
+    cbor2==<exact-version> \
+    cloudpickle==<exact-version> \
-    email-validator \
+    email-validator==<exact-version> \
@@
-    ijson \
+    ijson==<exact-version> \
     lark==1.2.2 \
     "llguidance>=1.3.0,<1.4.0" \
     lm-format-enforcer==0.11.3 \
-    mcp \
+    mcp==<exact-version> \
     "mistral-common>=1.8.5" \
     "model-hosting-container-standards>=0.1.9,<1.0.0" \
-    msgspec \
-    openai \
+    msgspec==<exact-version> \
+    openai==<exact-version> \
     "openai-harmony>=0.0.3" \
     outlines-core==0.2.11 \
-    partial-json-parser \
+    partial-json-parser==<exact-version> \
     "prometheus-fastapi-instrumentator>=7.0.0" \
-    pybase64 \
-    python-json-logger \
-    python-multipart \
+    pybase64==<exact-version> \
+    python-json-logger==<exact-version> \
+    python-multipart==<exact-version> \
-    setproctitle \
+    setproctitle==<exact-version> \
     tiktoken \
     watchfiles \
     xgrammar==0.1.27 \
-    codetiming \
-    hydra-core \
-    pybind11 \
-    pylatexenc \
+    codetiming==<exact-version> \
+    hydra-core==<exact-version> \
+    pybind11==<exact-version> \
+    pylatexenc==<exact-version> \
     tensorboard \
     "tensordict!=0.9.0,<=0.10.0,>=0.8.0" \
     torchdata \
-    wandb \
+    wandb==<exact-version> \
     "torch==2.9.0"

Also applies to: lines 167–207.
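To find which requirements in a `pip install` command still lack an exact pin, the argument list can be scanned line by line. This is a minimal sketch under stated assumptions: `unpinned_specs` is a hypothetical helper, it expects one requirement per line as in the Dockerfile above, and it treats anything without `==` (bare names, `>=` ranges, `!=` exclusions) as unpinned.

```shell
# Hypothetical helper (not part of the PR).
# Usage: unpinned_specs < FILE   (one requirement per line)
# Prints every requirement specifier that is not pinned with "==".
unpinned_specs() {
  local line spec
  while IFS= read -r line || [ -n "$line" ]; do
    # Strip the trailing continuation backslash, surrounding whitespace, quotes.
    spec=$(printf '%s' "$line" | sed -e 's/\\$//' -e 's/^[[:space:]]*//' \
      -e 's/[[:space:]]*$//' -e 's/^"//' -e 's/"$//')
    case $spec in
      ''|'#'*|-*|RUN*) continue ;;  # blanks, comments, pip options, RUN line
      *==*) ;;                       # exact pin: nothing to report
      *) echo "$spec" ;;
    esac
  done
}

# Demo on an excerpt of the install command discussed above.
unpinned_specs <<'EOF'
RUN pip install --no-cache-dir \
    --extra-index-url ${AIPCC_INDEX} \
    anthropic==0.71.0 \
    blake3 \
    "gguf>=0.17.0" \
    "torch==2.9.0"
EOF
# prints: blake3, then gguf>=0.17.0
```

One caveat worth verifying against pip's documentation: to my knowledge pip does not rank `--index-url` above `--extra-index-url`. It gathers candidates from every configured index and picks the best matching version, so swapping the two flags alone may not change which index wins. Exact pins combined with hash checking (`--require-hashes`) are likely the stronger control.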

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@images/runtime/ray/cuda/2.55.1-py312-cu129/Dockerfile` around lines 149 -
151, The pip install commands use --extra-index-url for AIPCC which makes it a
secondary index with PyPI as primary, creating a dependency confusion
vulnerability where unpinned packages could resolve to malicious PyPI versions.
Fix this by changing --extra-index-url to --index-url for the AIPCC_INDEX
variable to make it the primary index, adding --extra-index-url for PyPI as a
fallback, and pinning all packages (torch, blake3, cachetools, cbor2,
cloudpickle, email-validator, ijson, mcp, msgspec, openai, wandb, and any
others) to exact versions instead of leaving them unpinned. Apply this change to
all affected pip install commands in the Dockerfile.

Source: Coding guidelines

@Fiona-Waters Fiona-Waters changed the title from "Upgrade Ray CUDA for use with Training Hub" to "Add Ray CUDA 12.9 image with Training Hub support" Jul 2, 2026
Fiona-Waters and others added 2 commits July 3, 2026 08:30
Rename 2.55.1-py312-cu128 to 2.55.1-py312-cu129 and add training-hub
runtime with vllm 0.13.0, verl 0.8.0, and AIPCC CUDA extensions.

Co-authored-by: Cursor <cursoragent@cursor.com>

Keep the existing 2.55.1-py312-cu128 image and Tekton pipeline
unchanged, adding the cu129 Training Hub variant as a separate image
rather than replacing cu128.

Co-authored-by: Cursor <cursoragent@cursor.com>
@Fiona-Waters Fiona-Waters force-pushed the training-hub-ray-image branch from 3ad10ae to 19c9704 Compare July 3, 2026 07:31

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (2)
images/runtime/ray/cuda/2.55.1-py312-cu129/README.md (1)

3-3: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Hyphenate compound modifier.

"CUDA enabled" should be "CUDA-enabled" when used attributively before "container image."

Fix
-CUDA enabled container image for Ray in OpenShift AI.
+CUDA-enabled container image for Ray in OpenShift AI.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@images/runtime/ray/cuda/2.55.1-py312-cu129/README.md` at line 3, Hyphenate
the attributive compound in the README summary: update the phrase in the image
description so that “CUDA enabled” becomes “CUDA-enabled” before “container
image.” Make the wording change in the introductory description only, keeping
the rest of the text unchanged.

Source: Linters/SAST tools

images/runtime/ray/cuda/2.55.1-py312-cu129/Dockerfile (1)

138-138: 🔒 Security & Privacy | 🔵 Trivial | ⚡ Quick win

Pin micropipenv[toml] to an exact version.

Unpinned build tooling can silently change how Pipfile.lock is resolved/installed between builds (CWE-829 class risk), undermining the reproducibility the rest of this file otherwise enforces via strict CUDA/NCCL/cuDNN pins.

Remediation
-RUN pip install --no-cache-dir -U "micropipenv[toml]"
+RUN pip install --no-cache-dir "micropipenv[toml]==<pinned-version>"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@images/runtime/ray/cuda/2.55.1-py312-cu129/Dockerfile` at line 138, The
Dockerfile currently installs micropipenv[toml] without an exact version, which
leaves build tooling behavior mutable across builds. Update the RUN pip install
step to pin micropipenv[toml] to a specific version in this CUDA image so the
build remains reproducible and consistent with the other strict dependency pins.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 60d831ed-2c0f-440f-90b1-e06420f9b482

📥 Commits

Reviewing files that changed from the base of the PR and between 7f879b1 and 19c9704.

⛔ Files ignored due to path filters (1)
  • images/runtime/ray/cuda/2.55.1-py312-cu129/Pipfile.lock is excluded by !**/*.lock
📒 Files selected for processing (7)
  • .tekton/ray-2.55.1-py312-cu129-push.yaml
  • images/runtime/ray/cuda/2.55.1-py312-cu129/Dockerfile
  • images/runtime/ray/cuda/2.55.1-py312-cu129/NGC-DL-CONTAINER-LICENSE
  • images/runtime/ray/cuda/2.55.1-py312-cu129/Pipfile
  • images/runtime/ray/cuda/2.55.1-py312-cu129/README.md
  • images/runtime/ray/cuda/2.55.1-py312-cu129/cuda.repo-arm64
  • images/runtime/ray/cuda/2.55.1-py312-cu129/cuda.repo-x86_64
✅ Files skipped from review due to trivial changes (1)
  • images/runtime/ray/cuda/2.55.1-py312-cu129/NGC-DL-CONTAINER-LICENSE
🚧 Files skipped from review as they are similar to previous changes (2)
  • images/runtime/ray/cuda/2.55.1-py312-cu129/Pipfile
  • .tekton/ray-2.55.1-py312-cu129-push.yaml
