Skip to content

feat: add Dockerfile and configuration for ROCm MPI training runtime …#650

Merged
sutaakar merged 15 commits into
opendatahub-io:mainfrom
kapil27:mpi-rocm-image
Mar 11, 2026
Merged

feat: add Dockerfile and configuration for ROCm MPI training runtime …#650
sutaakar merged 15 commits into
opendatahub-io:mainfrom
kapil27:mpi-rocm-image

Conversation

@kapil27

@kapil27 kapil27 commented Mar 5, 2026

Copy link
Copy Markdown
Contributor

Description

How Has This Been Tested?

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

Summary by CodeRabbit

  • New Features

    • Added ROCm 6.4-based GPU training runtime with OpenMPI support for distributed training operations and InfiniBand/RDMA acceleration.
  • Chores

    • Configured runtime startup with automatic UID registration for container compatibility.
    • Configured Python environment with training frameworks and deep learning libraries including PyTorch, HuggingFace, and related dependencies.

@coderabbitai

coderabbitai Bot commented Mar 5, 2026

Copy link
Copy Markdown

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This pull request adds a new ROCm-based MPI training container runtime targeting PyTorch 2.9 with OpenMPI 4.1 on Python 3.12. Four new files configure the runtime: a Dockerfile.konflux that builds from ROCm 6.4 UBI9 with SSH, OpenMPI wrappers, and dependency management via micropipenv; two shell entrypoint scripts (uid_entrypoint.sh and mpirun_wrapper.sh) that register the container's random UID in /etc/passwd at startup; and a pyproject.toml declaring project dependencies for ML workloads including PyTorch, HuggingFace models, and training utilities.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Security findings

CWE-20: Improper Input Validation — mpirun_wrapper.sh validates HOME for colon, newline, and carriage return, but does not reject other passwd-format-breaking characters (spaces, forward slash in the UID field context, or unescaped special characters). An attacker controlling HOME could inject additional passwd entries. Validate HOME against a strict allowlist: ^[a-zA-Z0-9._/-]+$ or similar, and reject anything else.

CWE-427: Uncontrolled Search Path Element — The Dockerfile propagates LD_LIBRARY_PATH for SSH and OpenMPI sessions. Confirm that:

  • Library paths are absolute and do not include relative directories (e.g., .)
  • LD_LIBRARY_PATH is not set by untrusted sources before the entrypoint
  • InfiniBand/RDMA libraries installed do not load unsigned or unverified plugins

CWE-269: Improper Access Control / GID 0 Handling — File permissions set with g=u and GID 0 compatibility for OpenShift UIDs require verification that world-readable secrets (SSH keys, credentials in /etc/passwd entries) are not inadvertently exposed. Audit mpiuser workspace and SSH configuration for overpermissive modes.

Missing entrypoint error handling — uid_entrypoint.sh exits non-zero if /etc/passwd is non-writable, but does not log the specific error to stderr before exit. Add explicit error messages to aid troubleshooting without silently failing container startup.

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately reflects the main change: adding a Dockerfile and configuration for a ROCm MPI training runtime, which is the core objective of the PR.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@images/runtime/training/py312-rocm64-torch29-openmpi41/Dockerfile.konflux`:
- Around line 3-96: Turn this single-stage Dockerfile into a multi-stage build:
create a builder stage (FROM ... AS builder) that runs all dnf installs,
micropipenv install (referencing requirements.txt and micropipenv install), pip
installs (s3fs==2026.1.0), creates /opt/app-root site-packages and the
/usr/lib64/openmpi symlinks and /var/run/sshd setup, then in the final minimal
runtime stage copy only the runtime artifacts needed (copy /opt/app-root/bin,
/opt/app-root/lib/python3.12/site-packages, the /usr/lib64/openmpi/bin and
/usr/lib64/openmpi/lib files or symlinks, and uid_entrypoint.sh) and re-create
minimal config (ENV PATH, ENV LD_LIBRARY_PATH, ENTRYPOINT
["/usr/local/bin/uid_entrypoint.sh"], USER 1001) while omitting package
managers, build caches, and dnf/pip tools to reduce attack surface; ensure files
referenced in scripts (uid_entrypoint.sh,
/home/mpiuser/.openmpi/mca-params.conf) are present in the final image and
preserve permissions (chmod +x uid_entrypoint.sh, chgrp/chmod on /home/mpiuser)
when copying.
- Around line 36-39: The Dockerfile currently disables SSH host verification by
setting StrictHostKeyChecking to no and UserKnownHostsFile to /dev/null in
/etc/ssh/ssh_config; revert this change and instead ensure host key verification
remains enabled (remove the sed and echo lines that set StrictHostKeyChecking
and UserKnownHostsFile), and implement a secure alternative: populate
/etc/ssh/ssh_known_hosts or the user's ~/.ssh/known_hosts with the cluster node
public host keys at build/runtime (or use StrictHostKeyChecking accept-new if
you must auto-add on first connect), keep the SSH_PORT echo line only if needed,
and do not suppress SendEnv unless intentional; reference the modified RUN
command that edits /etc/ssh/ssh_config and the SSH_PORT environment variable
when making these changes.
- Around line 83-84: Remove the RUN chmod g=u /etc/passwd line from the
Dockerfile.konflux to avoid making /etc/passwd writable; instead update the
uid_entrypoint.sh invoked at container start to create a temp passwd file (copy
/etc/passwd, append the generated UID entry there) and export NSS_WRAPPER_PASSWD
pointing to that temp file and NSS_WRAPPER_GROUP=/etc/group so identity
resolution uses the temp copy without altering the real /etc/passwd; ensure any
Dockerfile changes remove the chmod and that uid_entrypoint.sh performs the safe
copy/append and exports NSS_WRAPPER_* before launching the main process.
- Around line 67-76: The Dockerfile is copying a non-existent requirements.txt;
replace that COPY and the subsequent RUN which expects requirements.txt with
operations that use Pipfile.lock (the hermetic source for micropipenv).
Specifically, change the COPY from "requirements.txt" to "Pipfile.lock" and
update the RUN that calls micropipenv (in the block that currently runs
micropipenv install && rm -f ./requirements.txt && pip install ... s3fs) to
install from the copied Pipfile.lock (invoke micropipenv install against the
lock file or use the --deploy flag), remove the rm -f ./requirements.txt step,
and keep the pip install --no-cache-dir --no-deps s3fs==2026.1.0 line; reference
the COPY and RUN lines around micropipenv, Pipfile.lock, and the s3fs install to
locate where to change.

In `@images/runtime/training/py312-rocm64-torch29-openmpi41/Pipfile`:
- Around line 8-39: The build fails because Dockerfile.konflux expects a
requirements.txt (the Dockerfile.konflux line with COPY requirements.txt ./) but
none is committed; fix by either (A) generating a requirements.txt from the
committed Pipfile.lock (e.g., pipenv lock -r or micropipenv lock --requirements)
and commit that requirements.txt so COPY requirements.txt ./ succeeds, or (B)
modify Dockerfile.konflux to install from Pipfile.lock directly using
pipenv/micropipenv (e.g., micropipenv install --deploy --requirement
Pipfile.lock or pip install -r <(pipenv lock -r)) so it no longer references
requirements.txt; update CI/Tekton to use the same approach and ensure
Pipfile.lock remains the authoritative lockfile.

In `@images/runtime/training/py312-rocm64-torch29-openmpi41/uid_entrypoint.sh`:
- Around line 10-11: Sanitize the HOME value before writing /etc/passwd in
uid_entrypoint.sh: validate and reject or sanitize ${HOME} (the home_dir
variable) so it contains no colon ':' or newline characters (or other
passwd-delimiting bytes), falling back to a safe default (/home/mpiuser) if
validation fails; then use the sanitized variable when calling printf to append
the mpiuser entry (referencing home_dir and uid) so no untrusted delimiter or
newline can inject additional passwd records.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 8d074fcd-4519-4e13-8a39-71cce4d4dbd0

📥 Commits

Reviewing files that changed from the base of the PR and between 45ec38c and 207ef30.

⛔ Files ignored due to path filters (1)
  • images/runtime/training/py312-rocm64-torch29-openmpi41/Pipfile.lock is excluded by !**/*.lock
📒 Files selected for processing (3)
  • images/runtime/training/py312-rocm64-torch29-openmpi41/Dockerfile.konflux
  • images/runtime/training/py312-rocm64-torch29-openmpi41/Pipfile
  • images/runtime/training/py312-rocm64-torch29-openmpi41/uid_entrypoint.sh

Comment thread images/runtime/training/py312-rocm64-torch29-openmpi41/Dockerfile.konflux Outdated
Comment thread images/runtime/training/py312-rocm64-torch29-openmpi41/Pipfile Outdated
Comment thread images/runtime/training/py312-rocm64-torch29-openmpi41/Dockerfile.konflux Outdated
kapil27 added 4 commits March 6, 2026 16:58
- Added `mpirun_wrapper.sh` to register OpenShift random UID in `/etc/passwd` for proper user identification.
- Updated Dockerfile to include the `mpirun_wrapper.sh` and adjusted symlink for `mpirun`.
- Changed installation method for Python dependencies from `requirements.txt` to `Pipfile.lock` for better dependency management.
- Modified environment variables to prioritize `/usr/local/bin` in the PATH for the new wrapper script.
@sutaakar

sutaakar commented Mar 6, 2026

Copy link
Copy Markdown
Contributor

/lgtm

@openshift-ci openshift-ci Bot removed the lgtm label Mar 9, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
images/runtime/training/py312-rocm64-torch29-openmpi41/Dockerfile.konflux (1)

48-51: Redundant sshd_config entries — sed already sets values, then echo appends duplicates.

Lines 48-49 use sed to uncomment and set StrictModes and Port, then lines 50-51 append the same directives again. This creates duplicate config entries.

🔧 Proposed fix
 RUN sed -i "s/#\(StrictModes \).*/\1no/g" /etc/ssh/sshd_config \
     && sed -i "s/#\(Port \).*/\1${SSH_PORT}/g" /etc/ssh/sshd_config \
-    && echo "StrictModes no" >> /etc/ssh/sshd_config \
-    && echo "Port ${SSH_PORT}" >> /etc/ssh/sshd_config
+    && echo "AcceptEnv PATH LD_LIBRARY_PATH" >> /etc/ssh/sshd_config

Note: The AcceptEnv directive is needed to complement SendEnv from line 45, enabling environment propagation for MPI processes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@images/runtime/training/py312-rocm64-torch29-openmpi41/Dockerfile.konflux`
around lines 48 - 51, The Dockerfile edits to /etc/ssh/sshd_config currently run
sed to set/uncomment StrictModes and Port and then append the same directives,
causing duplicates; remove the redundant echo lines that append "StrictModes no"
and "Port ${SSH_PORT}" so only the sed changes remain, and ensure an AcceptEnv
directive is present (e.g., add or uncomment "AcceptEnv ..." to complement the
existing SendEnv) so MPI environment variables propagate for OpenMPI processes;
locate occurrences referencing StrictModes, Port, sed, and sshd_config in the
Dockerfile.konflux to update.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@images/runtime/training/py312-rocm64-torch29-openmpi41/Dockerfile.konflux`:
- Line 102: The ENV line sets LD_LIBRARY_PATH with a trailing colon when the
base LD_LIBRARY_PATH is unset, causing an unsafe addition of the current
directory; update the ENV assignment for LD_LIBRARY_PATH (the line with ENV
LD_LIBRARY_PATH) to append the existing value only when it is non-empty using
parameter expansion (e.g., use ${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}) so the
resulting value is /usr/lib64/openmpi/lib when unset and
/usr/lib64/openmpi/lib:<old> when set.

---

Nitpick comments:
In `@images/runtime/training/py312-rocm64-torch29-openmpi41/Dockerfile.konflux`:
- Around line 48-51: The Dockerfile edits to /etc/ssh/sshd_config currently run
sed to set/uncomment StrictModes and Port and then append the same directives,
causing duplicates; remove the redundant echo lines that append "StrictModes no"
and "Port ${SSH_PORT}" so only the sed changes remain, and ensure an AcceptEnv
directive is present (e.g., add or uncomment "AcceptEnv ..." to complement the
existing SendEnv) so MPI environment variables propagate for OpenMPI processes;
locate occurrences referencing StrictModes, Port, sed, and sshd_config in the
Dockerfile.konflux to update.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: f44cfaea-51fd-4d27-8d6b-04355eb792ff

📥 Commits

Reviewing files that changed from the base of the PR and between 207ef30 and 42da707.

⛔ Files ignored due to path filters (1)
  • images/runtime/training/py312-rocm64-torch29-openmpi41/Pipfile.lock is excluded by !**/*.lock
📒 Files selected for processing (3)
  • images/runtime/training/py312-rocm64-torch29-openmpi41/Dockerfile.konflux
  • images/runtime/training/py312-rocm64-torch29-openmpi41/mpirun_wrapper.sh
  • images/runtime/training/py312-rocm64-torch29-openmpi41/uid_entrypoint.sh
🚧 Files skipped from review as they are similar to previous changes (1)
  • images/runtime/training/py312-rocm64-torch29-openmpi41/uid_entrypoint.sh

Comment thread images/runtime/training/py312-rocm64-torch29-openmpi41/Dockerfile.konflux Outdated
Comment thread images/runtime/training/py312-rocm64-torch29-openmpi41/Pipfile Outdated
- Removed outdated comments from Dockerfile related to OpenMPI configuration.
- Added new dependencies `causal-conv1d` and `mamba-ssm` to Pipfile for enhanced functionality.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (3)
images/runtime/training/py312-rocm64-torch29-openmpi41/Dockerfile.konflux (3)

86-92: ⚠️ Potential issue | 🟠 Major

Keep /etc/passwd immutable.

chmod g=u /etc/passwd turns every GID 0 process into an account-database writer (CWE-732). With the current UID-registration flow, a compromised training process can poison identity resolution for later SSH or Python helper calls.

🔧 Remediation
-# Allow uid_entrypoint to add random UID to /etc/passwd at runtime
-RUN chmod g=u /etc/passwd
+RUN dnf install -y nss_wrapper && dnf clean all
+ENV LD_PRELOAD=/usr/lib64/libnss_wrapper.so

Then update both runtime UID-registration scripts to use a temporary passwd file instead of appending to /etc/passwd:

passwd_file="$(mktemp)"
cp /etc/passwd "$passwd_file"
printf 'mpiuser:x:%s:0:mpiuser:%s:/bin/sh\n' "$(id -u)" "${HOME:-/home/mpiuser}" >> "$passwd_file"
export NSS_WRAPPER_PASSWD="$passwd_file"
export NSS_WRAPPER_GROUP=/etc/group

Run this to confirm the current scripts still append directly to /etc/passwd and do not use NSS_WRAPPER_*:

#!/bin/bash
set -euo pipefail

fd -HI '^(uid_entrypoint|mpirun_wrapper)\.sh$' images/runtime/training \
  -x sh -c '
    echo "=== $1 ==="
    rg -n -C2 "/etc/passwd|NSS_WRAPPER" "$1" || true
    echo
  ' sh {}

Expected result: the current scripts will show direct /etc/passwd writes and no NSS_WRAPPER_PASSWD/NSS_WRAPPER_GROUP exports.

As per coding guidelines **: REVIEW PRIORITIES: 1. Security vulnerabilities (provide severity, exploit scenario, and remediation code).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@images/runtime/training/py312-rocm64-torch29-openmpi41/Dockerfile.konflux`
around lines 86 - 92, The Dockerfile currently makes /etc/passwd group-writable
(RUN chmod g=u /etc/passwd) which is insecure; remove that RUN line and change
the uid registration flow in uid_entrypoint.sh (and any mpirun_wrapper.sh) so
they do not append to /etc/passwd but instead create a temporary passwd file
(mktemp), copy /etc/passwd into it, append the runtime UID entry there, and
export NSS_WRAPPER_PASSWD pointing to that temp file and
NSS_WRAPPER_GROUP=/etc/group before starting the process; update references to
uid_entrypoint.sh in the Dockerfile (COPY/CHMOD steps remain) and ensure no code
appends directly to /etc/passwd anywhere in the image.

2-99: 🛠️ Refactor suggestion | 🟠 Major

Split the builder from the runtime image.

This single-stage build leaves dnf, pip, and build-time setup artifacts in the runtime layer, which expands the post-compromise surface and violates the repo’s Dockerfile hardening rule.

🔧 Refactor skeleton
-FROM quay.io/aipcc/base-images/rocm-6.4-el9.6:3.4.0-ea.1-1772615238
+FROM quay.io/aipcc/base-images/rocm-6.4-el9.6:3.4.0-ea.1-1772615238 AS builder
 ...
-RUN pip install --no-cache-dir -U "micropipenv[toml]"
+RUN pip install --no-cache-dir -U "micropipenv[toml]"
 ...
+FROM quay.io/aipcc/base-images/rocm-6.4-el9.6:3.4.0-ea.1-1772615238 AS runtime
+COPY --from=builder /opt/app-root /opt/app-root
+COPY --from=builder /usr/local/bin/mpirun /usr/local/bin/mpirun
+COPY --from=builder /usr/local/bin/python /usr/local/bin/python
+COPY --from=builder /usr/local/bin/uid_entrypoint.sh /usr/local/bin/uid_entrypoint.sh

As per coding guidelines **/Dockerfile*: DOCKERFILE SECURITY (CWE-798, CWE-250): 4. Use multi-stage builds.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@images/runtime/training/py312-rocm64-torch29-openmpi41/Dockerfile.konflux`
around lines 2 - 99, The Dockerfile is a single-stage image that leaves dnf, pip
and build-time artifacts in the runtime layer; convert it into a multi-stage
build by adding a builder stage (e.g., FROM ... AS builder) that performs all
RUN dnf install ..., pip install --no-cache-dir -U "micropipenv[toml]",
micropipenv install, and any builds, and then copy only the runtime artifacts
into a slim runtime stage (the original base image or a minimal UBI9) —
specifically copy the installed Python site-packages (from
/opt/app-root/lib/python3.12/site-packages), the mpirun_wrapper.sh
(/usr/local/bin/mpirun), uid_entrypoint.sh (/usr/local/bin/uid_entrypoint.sh),
the /usr/local/bin/python wrapper, and any required OpenMPI libs or symlinks
into the final image; ensure final stage does not run dnf/pip, preserves
ENTRYPOINT ["/usr/local/bin/uid_entrypoint.sh"], sets HOME/PATH/LD_LIBRARY_PATH
and file permissions (chgrp/chmod for /home/mpiuser and /etc/passwd) as needed,
and removes package manager artifacts so no build tools remain in the runtime
layer.

41-44: ⚠️ Potential issue | 🔴 Critical

Re-enable SSH host verification.

StrictHostKeyChecking no plus UserKnownHostsFile /dev/null removes peer authentication (CWE-295). On a shared cluster network, a spoofed launcher or worker can impersonate a peer and receive MPI-initiated remote commands.

🔧 Remediation
-RUN sed -i "s/[ #]\(.*StrictHostKeyChecking \).*/ \1no/g" /etc/ssh/ssh_config \
-    && echo "    UserKnownHostsFile /dev/null" >> /etc/ssh/ssh_config \
+RUN install -d -m 700 /home/mpiuser/.ssh \
+    && touch /home/mpiuser/.ssh/known_hosts \
+    && chmod 600 /home/mpiuser/.ssh/known_hosts \
+    && sed -i "s/[ #]\(.*StrictHostKeyChecking \).*/ \1accept-new/g" /etc/ssh/ssh_config \
+    && echo "    UserKnownHostsFile /home/mpiuser/.ssh/known_hosts" >> /etc/ssh/ssh_config \
     && echo "    Port ${SSH_PORT}" >> /etc/ssh/ssh_config \
     && echo "    SendEnv PATH LD_LIBRARY_PATH" >> /etc/ssh/ssh_config

As per coding guidelines **: REVIEW PRIORITIES: 1. Security vulnerabilities (provide severity, exploit scenario, and remediation code).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@images/runtime/training/py312-rocm64-torch29-openmpi41/Dockerfile.konflux`
around lines 41 - 44, The Dockerfile step is disabling SSH host verification by
setting "StrictHostKeyChecking no" and "UserKnownHostsFile /dev/null" (and
appending SSH config lines using SSH_PORT and SendEnv); revert this by removing
or not writing those insecure settings and instead allow the default host
verification (or use "StrictHostKeyChecking accept-new" if automated first-time
acceptance is required), and ensure known_hosts is provisioned into the image at
build or runtime; update the RUN block that edits /etc/ssh/ssh_config (the sed
and echo lines referencing StrictHostKeyChecking, UserKnownHostsFile, SSH_PORT,
and SendEnv) accordingly so peer authentication is preserved.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@images/runtime/training/py312-rocm64-torch29-openmpi41/Dockerfile.konflux`:
- Around line 77-79: The Dockerfile currently installs s3fs with pip using
--no-deps which bypasses dependency resolution and prevents
aiobotocore/botocore/aioitertools from being locked; instead add s3fs==2026.1.0
to the project's Pipfile, remove the `pip install --no-cache-dir --no-deps
s3fs==2026.1.0` line from Dockerfile.konflux (the `pip install` invocation),
regenerate the lockfile with micropipenv/pipenv (e.g., run micropipenv lock or
the project's lock command) so Pipfile.lock contains s3fs and its transitive
deps, and then keep the Dockerfile step that runs `micropipenv install` so the
locked deps are installed.

---

Duplicate comments:
In `@images/runtime/training/py312-rocm64-torch29-openmpi41/Dockerfile.konflux`:
- Around line 86-92: The Dockerfile currently makes /etc/passwd group-writable
(RUN chmod g=u /etc/passwd) which is insecure; remove that RUN line and change
the uid registration flow in uid_entrypoint.sh (and any mpirun_wrapper.sh) so
they do not append to /etc/passwd but instead create a temporary passwd file
(mktemp), copy /etc/passwd into it, append the runtime UID entry there, and
export NSS_WRAPPER_PASSWD pointing to that temp file and
NSS_WRAPPER_GROUP=/etc/group before starting the process; update references to
uid_entrypoint.sh in the Dockerfile (COPY/CHMOD steps remain) and ensure no code
appends directly to /etc/passwd anywhere in the image.
- Around line 2-99: The Dockerfile is a single-stage image that leaves dnf, pip
and build-time artifacts in the runtime layer; convert it into a multi-stage
build by adding a builder stage (e.g., FROM ... AS builder) that performs all
RUN dnf install ..., pip install --no-cache-dir -U "micropipenv[toml]",
micropipenv install, and any builds, and then copy only the runtime artifacts
into a slim runtime stage (the original base image or a minimal UBI9) —
specifically copy the installed Python site-packages (from
/opt/app-root/lib/python3.12/site-packages), the mpirun_wrapper.sh
(/usr/local/bin/mpirun), uid_entrypoint.sh (/usr/local/bin/uid_entrypoint.sh),
the /usr/local/bin/python wrapper, and any required OpenMPI libs or symlinks
into the final image; ensure final stage does not run dnf/pip, preserves
ENTRYPOINT ["/usr/local/bin/uid_entrypoint.sh"], sets HOME/PATH/LD_LIBRARY_PATH
and file permissions (chgrp/chmod for /home/mpiuser and /etc/passwd) as needed,
and removes package manager artifacts so no build tools remain in the runtime
layer.
- Around line 41-44: The Dockerfile step is disabling SSH host verification by
setting "StrictHostKeyChecking no" and "UserKnownHostsFile /dev/null" (and
appending SSH config lines using SSH_PORT and SendEnv); revert this by removing
or not writing those insecure settings and instead allow the default host
verification (or use "StrictHostKeyChecking accept-new" if automated first-time
acceptance is required), and ensure known_hosts is provisioned into the image at
build or runtime; update the RUN block that edits /etc/ssh/ssh_config (the sed
and echo lines referencing StrictHostKeyChecking, UserKnownHostsFile, SSH_PORT,
and SendEnv) accordingly so peer authentication is preserved.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 21bba76e-cbeb-4c92-8be4-56360fe8da24

📥 Commits

Reviewing files that changed from the base of the PR and between 475e781 and 25fbf34.

📒 Files selected for processing (2)
  • images/runtime/training/py312-rocm64-torch29-openmpi41/Dockerfile.konflux
  • images/runtime/training/py312-rocm64-torch29-openmpi41/Pipfile

Comment thread images/runtime/training/py312-rocm64-torch29-openmpi41/Dockerfile.konflux Outdated
kapil27 added 2 commits March 10, 2026 19:52
…MPI training runtime

- Updated Dockerfile comments for clarity and accuracy regarding ROCm and OpenMPI configurations.
- Replaced Pipfile with requirements.txt for Python dependencies, generated from pyproject.toml.
- Added new pyproject.toml for better dependency management and streamlined installation process.
- Removed outdated Pipfile and Pipfile.lock to eliminate redundancy.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
images/runtime/training/py312-rocm64-torch29-openmpi41/Dockerfile.konflux (1)

47-50: StrictModes no applied twice — redundant.

Line 47 uses sed to set StrictModes no, and line 49 appends StrictModes no again. The echo is superfluous.

Proposed fix
 RUN sed -i "s/#\(StrictModes \).*/\1no/g" /etc/ssh/sshd_config \
     && sed -i "s/#\(Port \).*/\1${SSH_PORT}/g" /etc/ssh/sshd_config \
-    && echo "StrictModes no" >> /etc/ssh/sshd_config \
     && echo "Port ${SSH_PORT}" >> /etc/ssh/sshd_config
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@images/runtime/training/py312-rocm64-torch29-openmpi41/Dockerfile.konflux`
around lines 47 - 50, The Dockerfile sets StrictModes twice in
/etc/ssh/sshd_config: the sed invocation sed -i "s/#\(StrictModes \).*/\1no/g"
already enables StrictModes no, so remove the redundant echo "StrictModes no"
append (and related duplicate echo of Port if present) to avoid duplicate
entries; keep the sed lines (sed -i "s/#\(StrictModes \).*/\1no/g" and sed -i
"s/#\(Port \).*/\1${SSH_PORT}/g") and only append values for keys that sed does
not handle.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@images/runtime/training/py312-rocm64-torch29-openmpi41/pyproject.toml`:
- Line 61: pyproject.toml pins s3fs==2025.9.0 but Dockerfile.konflux separately
installs s3fs==2026.1.0 with pip install --no-deps, causing inconsistent
lockfile and missing transitive deps; fix by making versions consistent: either
update pyproject.toml to s3fs==2026.1.0 and regenerate requirements.txt
(preferred) or remove the explicit pip install --no-deps from Dockerfile.konflux
so the build uses the lockfile-managed s3fs, and ensure requirements.txt is
regenerated after any version change.
- Around line 80-81: The pyproject.toml currently sets index-url and also
configures extra-index-url pointing to https://pypi.org/simple which allows
fallback to an unvetted source; remove the extra-index-url entry (or replace it
with a secure internal fallback) so the installer uses only the vetted AIPCC
index, or if the fallback is required, add a documented security justification
and an approval tag in the file explaining why PyPI is permitted; locate the
extra-index-url key in pyproject.toml and either delete that line or replace it
with the documented justification and approval metadata.

---

Nitpick comments:
In `@images/runtime/training/py312-rocm64-torch29-openmpi41/Dockerfile.konflux`:
- Around line 47-50: The Dockerfile sets StrictModes twice in
/etc/ssh/sshd_config: the sed invocation sed -i "s/#\(StrictModes \).*/\1no/g"
already enables StrictModes no, so remove the redundant echo "StrictModes no"
append (and related duplicate echo of Port if present) to avoid duplicate
entries; keep the sed lines (sed -i "s/#\(StrictModes \).*/\1no/g" and sed -i
"s/#\(Port \).*/\1${SSH_PORT}/g") and only append values for keys that sed does
not handle.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: d2e1c3fb-4dd7-46c3-af60-aabfe126c250

📥 Commits

Reviewing files that changed from the base of the PR and between 25fbf34 and f1a4f46.

📒 Files selected for processing (3)
  • images/runtime/training/py312-rocm64-torch29-openmpi41/Dockerfile.konflux
  • images/runtime/training/py312-rocm64-torch29-openmpi41/pyproject.toml
  • images/runtime/training/py312-rocm64-torch29-openmpi41/requirements.txt

"mlflow>=3.8.1",

# Storage
"s3fs==2025.9.0",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

s3fs version mismatch: pyproject.toml declares 2025.9.0, Dockerfile installs 2026.1.0.

pyproject.toml specifies s3fs==2025.9.0 but Dockerfile.konflux line 85 installs s3fs==2026.1.0 via a separate pip install --no-deps call. This creates two problems:

  1. The lockfile (requirements.txt) will contain the wrong version.
  2. --no-deps bypasses dependency resolution, risking missing transitive deps (aiobotocore, botocore, etc.).

Either align the version here and regenerate requirements.txt (preferred), or remove the separate pip install in the Dockerfile.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@images/runtime/training/py312-rocm64-torch29-openmpi41/pyproject.toml` at
line 61, pyproject.toml pins s3fs==2025.9.0 but Dockerfile.konflux separately
installs s3fs==2026.1.0 with pip install --no-deps, causing inconsistent
lockfile and missing transitive deps; fix by making versions consistent: either
update pyproject.toml to s3fs==2026.1.0 and regenerate requirements.txt
(preferred) or remove the explicit pip install --no-deps from Dockerfile.konflux
so the build uses the lockfile-managed s3fs, and ensure requirements.txt is
regenerated after any version change.

Comment thread images/runtime/training/py312-rocm64-torch29-openmpi41/pyproject.toml Outdated
Comment thread images/runtime/training/py312-rocm64-torch29-openmpi41/pyproject.toml Outdated
Comment thread images/runtime/training/py312-rocm64-torch29-openmpi41/requirements.txt Outdated
…ments.txt

Co-authored-by: Karel Suta <krlsuta@gmail.com>
…ct.toml

Co-authored-by: Karel Suta <krlsuta@gmail.com>
@openshift-ci openshift-ci Bot added the lgtm label Mar 11, 2026
@openshift-ci

openshift-ci Bot commented Mar 11, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sutaakar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sutaakar sutaakar merged commit 692c377 into opendatahub-io:main Mar 11, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants