Skip to content

fix: install rocm280 flash-attn outside lockfile on main#936

Merged
sutaakar merged 1 commit into
opendatahub-io:mainfrom
kapil27:fix-rocm280-flash-attn-install-odh-main
Jun 30, 2026
Merged

fix: install rocm280 flash-attn outside lockfile on main#936
sutaakar merged 1 commit into
opendatahub-io:mainfrom
kapil27:fix-rocm280-flash-attn-install-odh-main

Conversation

@kapil27

@kapil27 kapil27 commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Flash-attn build hooks can fail during micropipenv lockfile installation with torch import errors in this image path. Remove flash-attn from lockfile resolution and install it explicitly with no-build-isolation after base dependencies are installed.

Description

How Has This Been Tested?

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

Summary by CodeRabbit

  • New Features

    • Added support for an additional attention optimization package in the Python 3.12 ROCm training image.
    • Improved the training environment setup so builds complete with the required dependencies already available.
  • Bug Fixes

    • Resolved build-time issues related to package installation order, helping training images build more reliably.

Flash-attn build hooks can fail during micropipenv lockfile installation with torch import errors in this image path. Remove flash-attn from lockfile resolution and install it explicitly with no-build-isolation after base dependencies are installed.

Co-authored-by: Cursor <cursoragent@cursor.com>
@openshift-ci openshift-ci Bot requested review from Fiona-Waters and efazal June 30, 2026 11:14
@coderabbitai

coderabbitai Bot commented Jun 30, 2026

Copy link
Copy Markdown
📝 Walkthrough

Walkthrough

A new build step is added to images/runtime/training/py312-rocm64-torch280/Dockerfile that installs flash-attn==2.8.3 outside lockfile resolution using --no-build-isolation, --no-deps, and --no-cache-dir, then runs fix-permissions /opt/app-root -P. A comment notes that Torch must be visible at build time for this step.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes


Supply chain / security flags — no praise, just findings:

CWE-829 (Inclusion of Functionality from Untrusted Control Sphere) / CWE-494 (Download of Code Without Integrity Check)

flash-attn==2.8.3 is pinned by version only. There is no hash verification (--require-hashes). Version pinning alone does not prevent a compromised or substituted package on PyPI — a malicious upload under the same version string (e.g., via a yanked+re-uploaded release or index confusion) would be silently accepted.

Mitigation: add --require-hashes with the expected SHA-256 digest for flash-attn==2.8.3 and its wheel artifact.

--no-build-isolation as an attack surface

Disabling build isolation (--no-build-isolation) causes pip to use whatever is installed in the container environment as the build backend, rather than a clean, isolated environment. If any package in the base image is compromised or shadowed, it participates in the build of flash-attn. This is a known supply-chain risk in CI/CD image builds (relevant to CWE-426: Untrusted Search Path).

--no-deps — dependency skew

flash-attn dependencies are suppressed. If flash-attn==2.8.3 has transitive requirements that conflict or are missing at runtime, this will manifest as a runtime failure rather than a build failure. More critically, if a future update pulls in a dependency that was previously expected to be supplied by --no-deps, that gap is invisible.

No provenance / SBOM update visible

This change adds a package outside the lockfile resolution mechanism. Confirm whether the SBOM or lockfile for this image is updated elsewhere to track this addition — absent that, the image's software inventory is incomplete (relevant to NIST SSDF PW.4).

🚥 Pre-merge checks | ✅ 9 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
No Privileged Containers ⚠️ Warning Dockerfile runs flash-attn install under USER 0 with no justification, exposing root build-time execution (CWE-250/CWE-494). Move the install to an unprivileged builder stage or switch to a non-root USER before third-party build hooks.
✅ Passed checks (9 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly matches the main change: moving flash-attn installation outside the lockfile for the ROCm 2.8.0 image on main.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Contribution Quality And Spam Detection ✅ Passed Weak signals only: templated body, empty testing section, and one small Dockerfile change. No second category of spam/low-quality evidence is strong enough.
No Hardcoded Secrets ✅ Passed No hardcoded creds/secrets found; the Dockerfile only adds a version-pinned flash-attn install and repo URLs with no embedded credentials (CWE-798/CWE-321).
No Weak Cryptography ✅ Passed Touched Dockerfile only adds a pip install; no MD5/SHA1/DES/RC4/3DES/Blowfish/ECB, custom crypto, or secret comparisons (CWE-327/CWE-328).
No Injection Vectors ✅ Passed Only a constant pip install flash-attn==2.8.3 was added; no user-controlled interpolation, eval/exec, YAML, SQL, or shell-injection primitives (CWE-78/89/94/502/79).
No Sensitive Data In Logs ✅ Passed No log/print/echo statements or raw request bodies added; no CWE-532/CWE-200 sensitive-data exposure in the Dockerfile.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@images/runtime/training/py312-rocm64-torch280/Dockerfile`:
- Around line 81-83: Move the flash-attn installation out of the final
root-owned stage and into a throwaway unprivileged builder stage, since the
current RUN in the Dockerfile executes third-party build hooks as USER 0. Build
the wheel in that stage using the existing flash-attn install flow, then copy
only the resulting wheel into the final image and install from it there; keep
the unique install step around the flash-attn==2.8.3 command and preserve the
fix-permissions behavior in the final stage.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: b011dcb7-a307-4c67-9504-12f9684078f8

📥 Commits

Reviewing files that changed from the base of the PR and between 067f0b8 and 0cdc6e7.

⛔ Files ignored due to path filters (1)
  • images/runtime/training/py312-rocm64-torch280/Pipfile.lock is excluded by !**/*.lock
📒 Files selected for processing (1)
  • images/runtime/training/py312-rocm64-torch280/Dockerfile

Comment on lines +81 to +83
# Install flash-attn outside lockfile resolution; it needs torch visible at build time.
RUN pip install --no-build-isolation --no-cache-dir --no-deps flash-attn==2.8.3 && \
fix-permissions /opt/app-root -P

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟠 Major | 🏗️ Heavy lift

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '== Dockerfile context ==\n'
nl -ba images/runtime/training/py312-rocm64-torch280/Dockerfile | sed -n '1,140p'

printf '\n== README context around flash-attn ==\n'
nl -ba images/runtime/training/README.md | sed -n '40,110p'

printf '\n== Multi-stage indicators in this Dockerfile ==\n'
rg -n '^FROM |^USER |flash-attn|no-build-isolation|fix-permissions' images/runtime/training/py312-rocm64-torch280/Dockerfile

Repository: opendatahub-io/distributed-workloads

Length of output: 239


🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '== Dockerfile context ==\n'
awk 'NR>=1 && NR<=140 { printf "%4d  %s\n", NR, $0 }' images/runtime/training/py312-rocm64-torch280/Dockerfile

printf '\n== README context around flash-attn ==\n'
awk 'NR>=40 && NR<=110 { printf "%4d  %s\n", NR, $0 }' images/runtime/training/README.md

printf '\n== Multi-stage indicators in this Dockerfile ==\n'
rg -n '^FROM |^USER |flash-attn|no-build-isolation|fix-permissions' images/runtime/training/py312-rocm64-torch280/Dockerfile

Repository: opendatahub-io/distributed-workloads

Length of output: 7024


Build flash-attn in a non-root builder stage. pip install --no-build-isolation --no-deps flash-attn==2.8.3 runs third-party build hooks as USER 0, so a compromised sdist or index response becomes build-time root RCE and can taint the image (CWE-250, CWE-494). Move this install to a throwaway unprivileged stage and copy only the built wheel into the final image.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@images/runtime/training/py312-rocm64-torch280/Dockerfile` around lines 81 -
83, Move the flash-attn installation out of the final root-owned stage and into
a throwaway unprivileged builder stage, since the current RUN in the Dockerfile
executes third-party build hooks as USER 0. Build the wheel in that stage using
the existing flash-attn install flow, then copy only the resulting wheel into
the final image and install from it there; keep the unique install step around
the flash-attn==2.8.3 command and preserve the fix-permissions behavior in the
final stage.

Source: Path instructions

@openshift-ci

openshift-ci Bot commented Jun 30, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sutaakar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sutaakar sutaakar merged commit 2626e9b into opendatahub-io:main Jun 30, 2026
6 checks passed
@rhods-ci-bot

Copy link
Copy Markdown

@kapil27: The following test has Failed:

OCI Artifact Browser URL

View in Artifact Browser

Inspecting Test Artifacts Manually

To inspect your test artifacts manually, follow these steps:

  1. Install ORAS (see the ORAS installation guide).
  2. Download artifacts with the following commands:
mkdir -p oras-artifacts
cd oras-artifacts
oras pull quay.io/opendatahub/odh-ci-artifacts:odh-pr-test-distributed-workloads-kw9tr

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants