Add Ray CUDA 12.9 image with Training Hub support #929
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
📝 Walkthrough
Adds a new Ray 2.55.1 Python 3.12 CUDA 12.9 (cu129) runtime image: a Dockerfile with per-arch CUDA repo selection, driver/cuDNN installation, and pinned PyTorch/verl/serving dependencies installed via pip and Pipfile; a new Pipfile, a README update, the NVIDIA container license file, and x86_64/arm64 cuda.repo files. The Tekton PipelineRun/trigger config is updated to build and push this cu129 image variant.
Estimated code review effort: 4 (Complex) | ~60 minutes
Security concerns (CWE-flagged)
No praise. Verify hash pinning and lockfile retention before approval.
🚥 Pre-merge checks | ✅ 10
✅ Passed checks (10 passed)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@images/runtime/ray/cuda/2.55.1-py312-cu129/Dockerfile`:
- Around line 1-215: The Dockerfile currently uses a single-stage build that
includes unnecessary build tooling, development packages, and intermediate
artifacts in the final image, increasing the attack surface. Convert this to a
multi-stage build by creating a builder stage that performs all installations
and a runtime stage that copies only the required runtime artifacts. In the
builder stage, keep all the current installation steps including the yum install
commands for development packages and the pip install commands. In a new runtime
stage, use the same base image (UBI9 Python) and copy only the necessary runtime
components from the builder: the installed Python packages from site-packages,
CUDA runtime libraries, and runtime configuration files. Remove the
development-only packages (those with -devel suffix, make, findutils,
cuda-command-line-tools, and similar build tools) from the runtime stage by not
copying those artifacts and not installing them in the final stage. Ensure the
environment variables for CUDA runtime paths and Python are preserved in the
runtime stage.
- Around line 149-151: The pip install commands use --extra-index-url for AIPCC
which makes it a secondary index with PyPI as primary, creating a dependency
confusion vulnerability where unpinned packages could resolve to malicious PyPI
versions. Fix this by changing --extra-index-url to --index-url for the
AIPCC_INDEX variable to make it the primary index, adding --extra-index-url for
PyPI as a fallback, and pinning all packages (torch, blake3, cachetools, cbor2,
cloudpickle, email-validator, ijson, mcp, msgspec, openai, wandb, and any
others) to exact versions instead of leaving them unpinned. Apply this change to
all affected pip install commands in the Dockerfile.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 4a59c4ee-5477-42f9-a5d3-be4330a8dc80
⛔ Files ignored due to path filters (2)
- images/runtime/ray/cuda/2.55.1-py312-cu128/Pipfile.lock is excluded by !**/*.lock
- images/runtime/ray/cuda/2.55.1-py312-cu129/Pipfile.lock is excluded by !**/*.lock
📒 Files selected for processing (9)
- .tekton/ray-2.55.1-py312-cu129-push.yaml
- images/runtime/ray/cuda/2.55.1-py312-cu128/Dockerfile
- images/runtime/ray/cuda/2.55.1-py312-cu128/Pipfile
- images/runtime/ray/cuda/2.55.1-py312-cu129/Dockerfile
- images/runtime/ray/cuda/2.55.1-py312-cu129/NGC-DL-CONTAINER-LICENSE
- images/runtime/ray/cuda/2.55.1-py312-cu129/Pipfile
- images/runtime/ray/cuda/2.55.1-py312-cu129/README.md
- images/runtime/ray/cuda/2.55.1-py312-cu129/cuda.repo-arm64
- images/runtime/ray/cuda/2.55.1-py312-cu129/cuda.repo-x86_64
💤 Files with no reviewable changes (2)
- images/runtime/ray/cuda/2.55.1-py312-cu128/Pipfile
- images/runtime/ray/cuda/2.55.1-py312-cu128/Dockerfile
ARG PYTHON_VERSION=312
ARG IMAGE_TAG=9.7-1778488949

FROM registry.access.redhat.com/ubi9/python-${PYTHON_VERSION}:${IMAGE_TAG}

ARG TARGETARCH

LABEL name="ray-ubi9-py312-cu129" \
      summary="CUDA 12.9 Python 3.12 image based on UBI9 for Ray" \
      description="CUDA 12.9 Python 3.12 image based on UBI9 for Ray" \
      io.k8s.display-name="CUDA 12.9 Python 3.12 base image for Ray" \
      io.k8s.description="CUDA 12.9 Python 3.12 image based on UBI9 for Ray" \
      authoritative-source-url="https://github.com/opendatahub-io/distributed-workloads"

# Install CUDA base from:
# https://gitlab.com/nvidia/container-images/cuda/-/blob/master/dist/12.9.2/ubi9/base/Dockerfile
USER 0
WORKDIR /opt/app-root/bin

ENV NVIDIA_REQUIRE_CUDA="cuda>=12.9 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551 brand=unknown,driver>=560,driver<561 brand=grid,driver>=560,driver<561 brand=tesla,driver>=560,driver<561 brand=nvidia,driver>=560,driver<561 brand=quadro,driver>=560,driver<561 brand=quadrortx,driver>=560,driver<561 brand=nvidiartx,driver>=560,driver<561 brand=vapps,driver>=560,driver<561 brand=vpc,driver>=560,driver<561 brand=vcs,driver>=560,driver<561 brand=vws,driver>=560,driver<561 brand=cloudgaming,driver>=560,driver<561 brand=unknown,driver>=565,driver<566 brand=grid,driver>=565,driver<566 brand=tesla,driver>=565,driver<566 brand=nvidia,driver>=565,driver<566 brand=quadro,driver>=565,driver<566 brand=quadrortx,driver>=565,driver<566 brand=nvidiartx,driver>=565,driver<566 brand=vapps,driver>=565,driver<566 brand=vpc,driver>=565,driver<566 brand=vcs,driver>=565,driver<566 brand=vws,driver>=565,driver<566 brand=cloudgaming,driver>=565,driver<566 brand=unknown,driver>=570,driver<571 brand=grid,driver>=570,driver<571 brand=tesla,driver>=570,driver<571 brand=nvidia,driver>=570,driver<571 brand=quadro,driver>=570,driver<571 brand=quadrortx,driver>=570,driver<571 brand=nvidiartx,driver>=570,driver<571 brand=vapps,driver>=570,driver<571 brand=vpc,driver>=570,driver<571 brand=vcs,driver>=570,driver<571 brand=vws,driver>=570,driver<571 brand=cloudgaming,driver>=570,driver<571 brand=unknown,driver>=580,driver<581 brand=grid,driver>=580,driver<581 brand=tesla,driver>=580,driver<581 brand=nvidia,driver>=580,driver<581 brand=quadro,driver>=580,driver<581 brand=quadrortx,driver>=580,driver<581 brand=nvidiartx,driver>=580,driver<581 brand=vapps,driver>=580,driver<581 brand=vpc,driver>=580,driver<581 brand=vcs,driver>=580,driver<581 brand=vws,driver>=580,driver<581 brand=cloudgaming,driver>=580,driver<581"
ENV NV_CUDA_CUDART_VERSION=12.9.79-1

RUN NVIDIA_GPGKEY_SUM=d0664fbbdb8c32356d45de36c5984617217b2d0bef41b93ccecd326ba3b80c87 && \
    if [ "${TARGETARCH}" = "arm64" ]; then NVARCH=sbsa; else NVARCH=x86_64; fi && \
    curl -fsSL https://developer.download.nvidia.com/compute/cuda/repos/rhel9/${NVARCH}/D42D0685.pub | sed '/^Version/d' > /etc/pki/rpm-gpg/RPM-GPG-KEY-NVIDIA && \
    echo "$NVIDIA_GPGKEY_SUM /etc/pki/rpm-gpg/RPM-GPG-KEY-NVIDIA" | sha256sum -c --strict -

ENV CUDA_VERSION=12.9.2

COPY cuda.repo-* ./
COPY NGC-DL-CONTAINER-LICENSE /

RUN if [ "${TARGETARCH}" = "arm64" ]; then \
        cp cuda.repo-arm64 /etc/yum.repos.d/cuda.repo; \
    else \
        cp cuda.repo-x86_64 /etc/yum.repos.d/cuda.repo; \
    fi

# For libraries in the cuda-compat-* package: https://docs.nvidia.com/cuda/eula/index.html#attachment-a
RUN yum upgrade -y && yum install -y \
    cuda-cudart-12-9-${NV_CUDA_CUDART_VERSION} \
    cuda-compat-12-9 \
    && yum clean all \
    && rm -rf /var/cache/yum/*

# nvidia-docker 1.0
RUN echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf && \
    echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf

ENV PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:${PATH}
ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64

# nvidia-container-runtime
ENV NVIDIA_VISIBLE_DEVICES=all
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility

# Install CUDA runtime from:
# https://gitlab.com/nvidia/container-images/cuda/-/blob/master/dist/12.9.2/ubi9/runtime/Dockerfile
ENV NV_CUDA_LIB_VERSION=12.9.2-1
ENV NV_NVTX_VERSION=12.9.79-1
ENV NV_LIBNPP_VERSION=12.4.1.87-1
ENV NV_LIBNPP_PACKAGE=libnpp-12-9-${NV_LIBNPP_VERSION}
ENV NV_LIBCUBLAS_VERSION=12.9.2.10-1
ENV NV_LIBNCCL_PACKAGE_NAME=libnccl
ENV NV_LIBNCCL_PACKAGE_VERSION=2.27.3-1
ENV NV_LIBNCCL_VERSION=2.27.3
ENV NCCL_VERSION=2.27.3
ENV NV_LIBNCCL_PACKAGE=${NV_LIBNCCL_PACKAGE_NAME}-${NV_LIBNCCL_PACKAGE_VERSION}+cuda12.9

RUN yum install -y \
    cuda-libraries-12-9-${NV_CUDA_LIB_VERSION} \
    cuda-nvtx-12-9-${NV_NVTX_VERSION} \
    ${NV_LIBNPP_PACKAGE} \
    libcublas-12-9-${NV_LIBCUBLAS_VERSION} \
    ${NV_LIBNCCL_PACKAGE} \
    && yum clean all \
    && rm -rf /var/cache/yum/*

# Set this flag so that libraries can find the location of CUDA
ENV XLA_FLAGS=--xla_gpu_cuda_data_dir=/usr/local/cuda

# Install CUDA devel from:
# https://gitlab.com/nvidia/container-images/cuda/-/blob/master/dist/12.9.2/ubi9/devel/Dockerfile
ENV NV_CUDA_LIB_VERSION=12.9.2-1
# ARM64 doesn't have nvprof package - set in runtime
ENV NV_NVPROF_VERSION=12.9.79-1
ENV NV_NVPROF_DEV_PACKAGE=cuda-nvprof-12-9-${NV_NVPROF_VERSION}
ENV NV_CUDA_CUDART_DEV_VERSION=12.9.79-1
ENV NV_NVML_DEV_VERSION=12.9.79-1
ENV NV_LIBCUBLAS_DEV_VERSION=12.9.2.10-1
ENV NV_LIBNPP_DEV_VERSION=12.4.1.87-1
ENV NV_LIBNPP_DEV_PACKAGE=libnpp-devel-12-9-${NV_LIBNPP_DEV_VERSION}
ENV NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-devel
ENV NV_LIBNCCL_DEV_PACKAGE_VERSION=2.27.3-1
ENV NCCL_VERSION=2.27.3
ENV NV_LIBNCCL_DEV_PACKAGE=${NV_LIBNCCL_DEV_PACKAGE_NAME}-${NV_LIBNCCL_DEV_PACKAGE_VERSION}+cuda12.9
ENV NV_CUDA_NSIGHT_COMPUTE_VERSION=12.9.2-1
ENV NV_CUDA_NSIGHT_COMPUTE_DEV_PACKAGE=cuda-nsight-compute-12-9-${NV_CUDA_NSIGHT_COMPUTE_VERSION}

RUN yum install -y \
    make \
    findutils \
    cuda-command-line-tools-12-9-${NV_CUDA_LIB_VERSION} \
    cuda-libraries-devel-12-9-${NV_CUDA_LIB_VERSION} \
    cuda-minimal-build-12-9-${NV_CUDA_LIB_VERSION} \
    cuda-cudart-devel-12-9-${NV_CUDA_CUDART_DEV_VERSION} \
    cuda-nvml-devel-12-9-${NV_NVML_DEV_VERSION} \
    libcublas-devel-12-9-${NV_LIBCUBLAS_DEV_VERSION} \
    ${NV_LIBNPP_DEV_PACKAGE} \
    ${NV_LIBNCCL_DEV_PACKAGE} \
    ${NV_CUDA_NSIGHT_COMPUTE_DEV_PACKAGE} \
    && if [ "${TARGETARCH}" != "arm64" ]; then \
        yum install -y ${NV_NVPROF_DEV_PACKAGE}; \
    fi \
    && yum clean all \
    && rm -rf /var/cache/yum/*

ENV LIBRARY_PATH=/usr/local/cuda/lib64/stubs

# Install CUDA devel cudnn from:
# https://gitlab.com/nvidia/container-images/cuda/-/blob/master/dist/12.9.2/ubi9/devel/cudnn/Dockerfile
ENV NV_CUDNN_VERSION=9.10.2.21-1
ENV NV_CUDNN_PACKAGE=libcudnn9-cuda-12-${NV_CUDNN_VERSION}
ENV NV_CUDNN_PACKAGE_DEV=libcudnn9-devel-cuda-12-${NV_CUDNN_VERSION}

LABEL com.nvidia.cudnn.version="${NV_CUDNN_VERSION}"

RUN yum install -y \
    ${NV_CUDNN_PACKAGE} \
    ${NV_CUDNN_PACKAGE_DEV} \
    && yum clean all \
    && rm -rf /var/cache/yum/*

# ---------------------------------------------------------------------------
# Install Python packages
# ---------------------------------------------------------------------------

RUN pip install --no-cache-dir -U "micropipenv[toml]"

# Pipfile.lock provides ray, training-hub, and their transitive Python deps.
# torch is NOT in the Pipfile — it comes exclusively from the AIPCC index below.
COPY Pipfile.lock ./
RUN micropipenv install && rm -f ./Pipfile.lock

# AIPCC index for pre-built CUDA wheels (all compiled against torch 2.9.0 build 13).
ENV AIPCC_INDEX=https://packages.redhat.com/api/pypi/public-rhai/rhoai/3.3/cuda12.9-ubi9/simple/

# CUDA extensions from AIPCC — overwrites PyPI torch with the ABI-matched build.
RUN pip install --no-cache-dir --no-deps --force-reinstall \
    --extra-index-url ${AIPCC_INDEX} \
    "torch==2.9.0" \
    "torchvision==0.24.0" \
    "torchaudio==2.9.0" \
    "triton==3.5.0" \
    "vllm==0.13.0" \
    "flash-attn==2.8.3" \
    "mamba-ssm==2.3.0" \
    "causal-conv1d==1.6.0" \
    "xformers==0.0.33.post2"

# verl: --no-deps because its numpy<2.0.0 pin conflicts with vllm's numpy>=2.
RUN pip install --no-cache-dir --no-deps verl==0.8.0

# vllm 0.13.0 + verl 0.8.0 runtime dependencies.
# Many deps (aiohttp, fastapi, pydantic, numpy, ray, transformers, etc.) are
# already installed via Pipfile.lock and are not repeated here.
RUN pip install --no-cache-dir \
    --extra-index-url ${AIPCC_INDEX} \
    anthropic==0.71.0 \
    blake3 \
    cachetools \
    cbor2 \
    cloudpickle \
    "compressed-tensors==0.13.0" \
    depyf==0.20.0 \
    diskcache==5.6.3 \
    email-validator \
    "gguf>=0.17.0" \
    ijson \
    lark==1.2.2 \
    "llguidance>=1.3.0,<1.4.0" \
    lm-format-enforcer==0.11.3 \
    mcp \
    "mistral-common>=1.8.5" \
    "model-hosting-container-standards>=0.1.9,<1.0.0" \
    msgspec \
    openai \
    "openai-harmony>=0.0.3" \
    outlines-core==0.2.11 \
    partial-json-parser \
    "prometheus-fastapi-instrumentator>=7.0.0" \
    pybase64 \
    python-json-logger \
    python-multipart \
    setproctitle \
    tiktoken \
    watchfiles \
    xgrammar==0.1.27 \
    codetiming \
    hydra-core \
    pybind11 \
    pylatexenc \
    tensorboard \
    "tensordict!=0.9.0,<=0.10.0,>=0.8.0" \
    torchdata \
    wandb \
    "torch==2.9.0"

# pyzmq: must come from PyPI (manylinux wheel bundles libzmq.so.5).
# The AIPCC wheel expects system libzmq which doesn't exist in UBI9.
RUN pip install --no-cache-dir pyzmq

# Restore user workspace
USER 1001
WORKDIR /opt/app-root/src
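The per-architecture rule in the Dockerfile above (a `TARGETARCH` of `arm64` selects NVIDIA's `sbsa` repository path and `cuda.repo-arm64`; anything else falls back to x86_64) can be mirrored in a few lines of Python to make the branching easy to test. This is only an illustrative sketch: the function names are hypothetical and are not part of the PR.

```python
# Hypothetical helpers mirroring the Dockerfile's TARGETARCH branches.
KEY_URL_TEMPLATE = (
    "https://developer.download.nvidia.com/compute/cuda/repos/rhel9/{arch}/D42D0685.pub"
)


def nvidia_arch(targetarch: str) -> str:
    """arm64 maps to NVIDIA's 'sbsa' repo path; every other value maps to x86_64."""
    return "sbsa" if targetarch == "arm64" else "x86_64"


def repo_file(targetarch: str) -> str:
    """Name of the repo file copied to /etc/yum.repos.d/cuda.repo."""
    return "cuda.repo-arm64" if targetarch == "arm64" else "cuda.repo-x86_64"


def gpg_key_url(targetarch: str) -> str:
    """URL the Dockerfile downloads the NVIDIA GPG key from."""
    return KEY_URL_TEMPLATE.format(arch=nvidia_arch(targetarch))


for arch in ("amd64", "arm64"):
    print(arch, nvidia_arch(arch), repo_file(arch), gpg_key_url(arch))
```

Note that any value other than `arm64`, including an empty `TARGETARCH`, takes the x86_64 branch, which matches the `else` arms in the Dockerfile.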
Dockerfile must be converted to multi-stage to satisfy the repository’s Docker security policy.
Severity: Medium. Exploit scenario: shipping build tooling and intermediate artifacts in the final image increases attack surface and post-exploit utility (CWE-250 context: least-privilege hardening gap).
Split into builder/runtime stages and copy only required runtime artifacts.
Remediation code (pattern)
-FROM registry.access.redhat.com/ubi9/python-${PYTHON_VERSION}:${IMAGE_TAG}
+FROM registry.access.redhat.com/ubi9/python-${PYTHON_VERSION}:${IMAGE_TAG} AS builder
@@
-RUN pip install --no-cache-dir ...
+RUN pip install --no-cache-dir ...
@@
-USER 1001
-WORKDIR /opt/app-root/src
+FROM registry.access.redhat.com/ubi9/python-${PYTHON_VERSION}:${IMAGE_TAG} AS runtime
+USER 0
+# install only runtime OS/CUDA libs needed at runtime
+COPY --from=builder /opt/app-root /opt/app-root
+COPY --from=builder /usr/local/lib/python3.12/site-packages /usr/local/lib/python3.12/site-packages
+USER 1001
+WORKDIR /opt/app-root/src

As per coding guidelines, "**/Dockerfile*: DOCKERFILE SECURITY ... 4. Use multi-stage builds".
🧰 Tools
🪛 Betterleaks (1.5.0)
[high] 23-23: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.
(generic-api-key)
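For context on the Betterleaks hit: line 23 of the Dockerfile sets `NVIDIA_GPGKEY_SUM`, which appears to be the expected SHA-256 digest of the downloaded NVIDIA GPG key rather than an API key, since the same `RUN` step passes it to `sha256sum -c --strict`. A minimal Python sketch of that kind of digest check follows; the helper name is hypothetical, and the empty byte string stands in for the key file.

```python
import hashlib


def sha256_matches(data: bytes, expected_hex: str) -> bool:
    """Return True when the SHA-256 digest of `data` equals `expected_hex`."""
    return hashlib.sha256(data).hexdigest() == expected_hex.strip().lower()


# Stand-in payload: the empty byte string, whose SHA-256 digest is well known.
EMPTY_SHA256 = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"

print(sha256_matches(b"", EMPTY_SHA256))          # True
print(sha256_matches(b"tampered", EMPTY_SHA256))  # False
```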
Source: Coding guidelines
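One way to plan the builder/runtime split is to classify the packages the Dockerfile installs. The sketch below is an assumption-laden illustration, not the PR's method: the helper names are hypothetical, the rules cover the items the review names (`-devel` packages, `make`, `findutils`, `cuda-command-line-tools`), and the extra prefixes are guesses at what "similar build tools" means.

```python
# Hypothetical classification rules; prefixes marked "assumption" go beyond the review's list.
BUILD_ONLY_NAMES = {"make", "findutils"}
BUILD_ONLY_PREFIXES = (
    "cuda-command-line-tools",
    "cuda-minimal-build",   # assumption: compiler toolchain, build-time only
    "cuda-nsight-compute",  # assumption: profiler, not needed to run workloads
    "cuda-nvprof",          # assumption: profiler, not needed to run workloads
)


def is_build_only(package: str) -> bool:
    """True for packages that would stay in the builder stage under these rules."""
    return (
        package in BUILD_ONLY_NAMES
        or "-devel" in package
        or package.startswith(BUILD_ONLY_PREFIXES)
    )


def split_packages(packages):
    """Split package names into (build_only, runtime), preserving input order."""
    build_only = [p for p in packages if is_build_only(p)]
    runtime = [p for p in packages if not is_build_only(p)]
    return build_only, runtime


# Package names taken from the Dockerfile, with version suffixes dropped.
packages = [
    "make", "findutils", "cuda-cudart-12-9", "cuda-libraries-12-9",
    "cuda-libraries-devel-12-9", "libcublas-12-9", "libcublas-devel-12-9",
    "libnccl", "libnccl-devel", "libcudnn9-cuda-12", "libcudnn9-devel-cuda-12",
]
build_only, runtime = split_packages(packages)
print("builder stage only:", build_only)
print("runtime stage:", runtime)
```

Whether the final image can drop every build-only package depends on whether anything compiles CUDA code at runtime, which the PR does not state.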
RUN pip install --no-cache-dir --no-deps --force-reinstall \
    --extra-index-url ${AIPCC_INDEX} \
    "torch==2.9.0" \
🧩 Analysis chain
🏁 Script executed:
find . -name "Dockerfile" -path "*ray/cuda/2.55.1*" -type f
Repository: opendatahub-io/distributed-workloads
Length of output: 136
🏁 Script executed:
wc -l images/runtime/ray/cuda/2.55.1-py312-cu129/Dockerfile
Repository: opendatahub-io/distributed-workloads
Length of output: 138
🏁 Script executed:
sed -n '1,30p' images/runtime/ray/cuda/2.55.1-py312-cu129/Dockerfile
Repository: opendatahub-io/distributed-workloads
Length of output: 3938
🏁 Script executed:
sed -n '140,160p' images/runtime/ray/cuda/2.55.1-py312-cu129/Dockerfile
Repository: opendatahub-io/distributed-workloads
Length of output: 900
🏁 Script executed:
sed -n '160,220p' images/runtime/ray/cuda/2.55.1-py312-cu129/Dockerfile
Repository: opendatahub-io/distributed-workloads
Length of output: 1579
🏁 Script executed:
# Check if Dockerfile uses multi-stage build (FROM statements)
grep -n "^FROM" images/runtime/ray/cuda/2.55.1-py312-cu129/Dockerfile
Repository: opendatahub-io/distributed-workloads
Length of output: 157
🏁 Script executed:
# Verify guideline enforcement context in repo
find . -name "*.md" -o -name "*.yml" | head -20 | xargs grep -l "multi-stage\|Dockerfile" 2>/dev/null | head -5
Repository: opendatahub-io/distributed-workloads
Length of output: 125
🏁 Script executed:
# Check if multi-stage build is a repo-wide requirement
cat ./.github/workflows/snyk-dockerfile-scan.yml | head -50
Repository: opendatahub-io/distributed-workloads
Length of output: 1131
Pin all packages and use AIPCC as primary index to prevent dependency confusion (CWE-829).
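Unpinned packages (blake3, cachetools, cbor2, cloudpickle, email-validator, ijson, mcp, msgspec, openai, wandb, etc.) combined with --extra-index-url allow pip to resolve to malicious PyPI versions if AIPCC doesn't provide them. Correct approach: make AIPCC the primary index (--index-url), make PyPI a fallback (--extra-index-url), and require exact version pins for all packages. Note that pip does not rank --index-url above --extra-index-url; it compares candidates from every configured index, so the exact pins (ideally with hashes) carry most of the protection.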
Remediation code
RUN pip install --no-cache-dir --no-deps --force-reinstall \
+ --index-url ${AIPCC_INDEX} \
- --extra-index-url ${AIPCC_INDEX} \
+ --extra-index-url https://pypi.org/simple \
"torch==2.9.0" \
"torchvision==0.24.0" \
"torchaudio==2.9.0" \
@@
RUN pip install --no-cache-dir \
+ --index-url ${AIPCC_INDEX} \
- --extra-index-url ${AIPCC_INDEX} \
+ --extra-index-url https://pypi.org/simple \
anthropic==0.71.0 \
- blake3 \
- cachetools \
- cbor2 \
- cloudpickle \
+ blake3==<exact-version> \
+ cachetools==<exact-version> \
+ cbor2==<exact-version> \
+ cloudpickle==<exact-version> \
- email-validator \
+ email-validator==<exact-version> \
@@
- ijson \
+ ijson==<exact-version> \
lark==1.2.2 \
"llguidance>=1.3.0,<1.4.0" \
lm-format-enforcer==0.11.3 \
- mcp \
+ mcp==<exact-version> \
"mistral-common>=1.8.5" \
"model-hosting-container-standards>=0.1.9,<1.0.0" \
- msgspec \
- openai \
+ msgspec==<exact-version> \
+ openai==<exact-version> \
"openai-harmony>=0.0.3" \
outlines-core==0.2.11 \
- partial-json-parser \
+ partial-json-parser==<exact-version> \
"prometheus-fastapi-instrumentator>=7.0.0" \
- pybase64 \
- python-json-logger \
- python-multipart \
+ pybase64==<exact-version> \
+ python-json-logger==<exact-version> \
+ python-multipart==<exact-version> \
- setproctitle \
+ setproctitle==<exact-version> \
tiktoken \
watchfiles \
xgrammar==0.1.27 \
- codetiming \
- hydra-core \
- pybind11 \
- pylatexenc \
+ codetiming==<exact-version> \
+ hydra-core==<exact-version> \
+ pybind11==<exact-version> \
+ pylatexenc==<exact-version> \
tensorboard \
"tensordict!=0.9.0,<=0.10.0,>=0.8.0" \
torchdata \
- wandb \
+ wandb==<exact-version> \
"torch==2.9.0"

Also applies to: lines 167–207.
Source: Coding guidelines
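To find which specifiers in a `pip install` line still float, a small check like the following could be run over the requirement strings. It is a sketch under stated assumptions: the regular expression and the `unpinned` helper are hypothetical, and "exactly pinned" is taken to mean a single `==` clause with no wildcard.

```python
import re

# Hypothetical rule: "name==version" (extras allowed), with no ranges and no wildcards.
EXACT_PIN = re.compile(r"^[A-Za-z0-9._-]+(\[[A-Za-z0-9._,-]+\])?==[^=<>!~,\s*]+$")


def unpinned(requirements):
    """Return the specifiers that are not pinned to one exact version."""
    return [r for r in requirements if not EXACT_PIN.match(r.strip().strip('"'))]


# A few specifiers copied from the Dockerfile's second pip install step.
sample = [
    "anthropic==0.71.0",
    "blake3",
    '"gguf>=0.17.0"',
    '"llguidance>=1.3.0,<1.4.0"',
    "xgrammar==0.1.27",
    '"torch==2.9.0"',
]
print(unpinned(sample))
```

Range specifiers such as `gguf>=0.17.0` are reported alongside bare names, since either form lets the resolved version change between builds.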
Rename 2.55.1-py312-cu128 to 2.55.1-py312-cu129 and add training-hub runtime with vllm 0.13.0, verl 0.8.0, and AIPCC CUDA extensions. Co-authored-by: Cursor <cursoragent@cursor.com>
Keep the existing 2.55.1-py312-cu128 image and Tekton pipeline unchanged, adding the cu129 Training Hub variant as a separate image rather than replacing cu128. Co-authored-by: Cursor <cursoragent@cursor.com>
3ad10ae to 19c9704
🧹 Nitpick comments (2)
images/runtime/ray/cuda/2.55.1-py312-cu129/README.md (1)
3-3: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win
Hyphenate compound modifier.
"CUDA enabled" should be "CUDA-enabled" when used attributively before "container image."
Fix
-CUDA enabled container image for Ray in OpenShift AI.
+CUDA-enabled container image for Ray in OpenShift AI.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In `@images/runtime/ray/cuda/2.55.1-py312-cu129/README.md` at line 3, Hyphenate the attributive compound in the README summary: update the phrase in the image description so that “CUDA enabled” becomes “CUDA-enabled” before “container image.” Make the wording change in the introductory description only, keeping the rest of the text unchanged.
Source: Linters/SAST tools
images/runtime/ray/cuda/2.55.1-py312-cu129/Dockerfile (1)
138-138: 🔒 Security & Privacy | 🔵 Trivial | ⚡ Quick win
Pin `micropipenv[toml]` to an exact version.
Unpinned build tooling can silently change how `Pipfile.lock` is resolved/installed between builds (CWE-829 class risk), undermining the reproducibility the rest of this file otherwise enforces via strict CUDA/NCCL/cuDNN pins.
Remediation
-RUN pip install --no-cache-dir -U "micropipenv[toml]"
+RUN pip install --no-cache-dir "micropipenv[toml]==<pinned-version>"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In `@images/runtime/ray/cuda/2.55.1-py312-cu129/Dockerfile` at line 138, The Dockerfile currently installs micropipenv[toml] without an exact version, which leaves build tooling behavior mutable across builds. Update the RUN pip install step to pin micropipenv[toml] to a specific version in this CUDA image so the build remains reproducible and consistent with the other strict dependency pins.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 60d831ed-2c0f-440f-90b1-e06420f9b482
⛔ Files ignored due to path filters (1)
- images/runtime/ray/cuda/2.55.1-py312-cu129/Pipfile.lock is excluded by !**/*.lock
📒 Files selected for processing (7)
- .tekton/ray-2.55.1-py312-cu129-push.yaml
- images/runtime/ray/cuda/2.55.1-py312-cu129/Dockerfile
- images/runtime/ray/cuda/2.55.1-py312-cu129/NGC-DL-CONTAINER-LICENSE
- images/runtime/ray/cuda/2.55.1-py312-cu129/Pipfile
- images/runtime/ray/cuda/2.55.1-py312-cu129/README.md
- images/runtime/ray/cuda/2.55.1-py312-cu129/cuda.repo-arm64
- images/runtime/ray/cuda/2.55.1-py312-cu129/cuda.repo-x86_64
✅ Files skipped from review due to trivial changes (1)
- images/runtime/ray/cuda/2.55.1-py312-cu129/NGC-DL-CONTAINER-LICENSE
🚧 Files skipped from review as they are similar to previous changes (2)
- images/runtime/ray/cuda/2.55.1-py312-cu129/Pipfile
- .tekton/ray-2.55.1-py312-cu129-push.yaml
Summary
RHAISTRAT-1693: Integrate Training Hub with Ray so that Training Hub algorithms (SFT, OSFT, LoRA, GRPO) can be run on Ray clusters in OpenShift AI.
Jira for the image work: https://redhat.atlassian.net/browse/RHOAIENG-61568
Adds a new Ray CUDA 12.9 image (2.55.1-py312-cu129) with training-hub and its runtime dependencies (vllm, verl, CUDA extensions) sourced from the RHOAI 3.3 AIPCC index. The existing CUDA 12.8 image (2.55.1-py312-cu128) is kept unchanged.
What's added
New image: 2.55.1-py312-cu129
--no-deps to avoid numpy conflict)
New Tekton pipeline: ray-2.55.1-py312-cu129-push.yaml
- quay.io/modh/ray:2.55.1-py312-cu129
- images/runtime/ray/cuda/2.55.1-py312-cu129/**
Unchanged: 2.55.1-py312-cu128
Pipfile changes (cu129)
- training-hub==0.8.1
- transformers>=4.57.6,<5.0
- kernels>=0.9.0,<0.15
- unsloth>=2026.1.1
- einops>=0.8
- bitsandbytes>=0.47.0
- liger-kernel>=0.5.10
How Has This Been Tested?
Merge criteria:
Summary by CodeRabbit
New Features
Documentation
Chores