Add AIPCC-based Containerfile for Spark operator #74
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough

Adds a new OpenShift-targeted multi-stage Dockerfile at examples/openshift/Dockerfile.odh. Stage 1 builds a static Go binary.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes

Security and Code Quality Issues:
- CWE-20 / CWE-426 — Unvalidated external input / supply-chain injection
- CWE-276 / CWE-732 — Incorrect default permissions / Exposure of sensitive files
- CWE-269 — Improper Access Control for Arbitrary UID containers
- CWE-1104 — Supply-chain drift risk
- CWE-444 — Build determinism and platform mismatch
- Actionable Dockerfile checks
🚥 Pre-merge checks | ✅ 1 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (1 passed)

✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 3
🧹 Nitpick comments (1)
examples/openshift/Dockerfile.odh (1)
108-108: Redundant permission setting.

`chmod g+w /opt/spark/work-dir` here is overridden by `chmod -R 775 /opt/spark/work-dir` at line 131.

Remove redundant line:

```diff
   ln -s ${PYSPARK_DIR} /opt/spark; \
-  mkdir -p /opt/spark/work-dir /opt/spark/logs; \
-  chmod g+w /opt/spark/work-dir; \
+  mkdir -p /opt/spark/work-dir /opt/spark/logs; \
   touch /opt/spark/RELEASE; \
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/openshift/Dockerfile.odh` at line 108, The Dockerfile contains a redundant permission change: remove the standalone "chmod g+w /opt/spark/work-dir" command since a later "chmod -R 775 /opt/spark/work-dir" already sets the intended permissions recursively; simply delete the earlier chmod g+w invocation so only the recursive chmod -R 775 on /opt/spark/work-dir remains.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/openshift/Dockerfile.odh`:
- Around line 95-96: The Dockerfile currently downloads tini using the
TINI_VERSION variable and writes it to /usr/bin/tini without integrity checks;
modify the RUN step to also fetch or embed the known SHA256 for the specific
TINI_VERSION, verify the downloaded file's checksum (e.g., with sha256sum or by
comparing to the .sha256 from the release) before chmod +x and only install if
the checksum matches, failing the build on mismatch; reference the TINI_VERSION
variable and the /usr/bin/tini target when updating the RUN command so the
download, checksum verification, and installation are performed atomically.
- Around line 117-119: The Dockerfile currently downloads the Spark entrypoint
script via RUN curl ... -o /opt/entrypoint.sh (using PYSPARK_VERSION) without
integrity verification; either vendor the exact spark-entrypoint.sh (e.g.,
spark-entrypoint.sh from Spark 4.0.1) into the repository and update the
Dockerfile to COPY that file into /opt/entrypoint.sh and set executable
permissions, or if you must download at build time, add a known-good SHA256
(stored in the repo or as a build ARG) and validate the downloaded
/opt/entrypoint.sh against that checksum before setting permissions (failing the
build if the hash does not match) so the RUN curl ... command is followed by a
checksum verification step referencing /opt/entrypoint.sh and the stored
expected hash.
- Around line 69-83: The RUN heredoc that writes /etc/yum.repos.d/ubi9.repo (the
block starting with "RUN cat > /etc/yum.repos.d/ubi9.repo << 'EOF'") triggers
Hadolint DL1000 because heredocs in RUN require BuildKit; fix by switching to a
BuildKit-safe form: add the BuildKit syntax directive "#
syntax=docker/dockerfile:1" at the top of the Dockerfile if missing, then
replace the RUN heredoc with a COPY heredoc (or use the BuildKit-native heredoc
form) that embeds the repo content and escape the variable $basearch as
\$basearch so the literal $basearch is preserved in the Dockerfile; ensure the
target path remains /etc/yum.repos.d/ubi9.repo and the gpgkey/gpgcheck lines are
retained.
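A minimal sketch of the BuildKit-safe form described above (the repo body below is illustrative, not the actual ubi9.repo content from this PR):

```dockerfile
# syntax=docker/dockerfile:1

# COPY heredoc: BuildKit writes the body verbatim to the target file. With the
# quoted 'EOF' delimiter, $basearch is preserved literally, so no \$ escaping
# is needed in this form (escaping matters for the unquoted-delimiter form).
COPY <<'EOF' /etc/yum.repos.d/ubi9.repo
[ubi-9-baseos]
name=UBI 9 BaseOS (illustrative)
baseurl=https://cdn-ubi.redhat.com/content/public/ubi/dist/ubi9/9/$basearch/baseos/os
enabled=1
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release
EOF
```

Hadolint's DL1000 goes away because the heredoc is a documented BuildKit construct once the syntax directive is present.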
---
Nitpick comments:
In `@examples/openshift/Dockerfile.odh`:
- Line 108: The Dockerfile contains a redundant permission change: remove the
standalone "chmod g+w /opt/spark/work-dir" command since a later "chmod -R 775
/opt/spark/work-dir" already sets the intended permissions recursively; simply
delete the earlier chmod g+w invocation so only the recursive chmod -R 775 on
/opt/spark/work-dir remains.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 1b8f1a98-b12f-4376-9eb9-cbe30a87bc82
📒 Files selected for processing (1)
examples/openshift/Dockerfile.odh
examples/openshift/Dockerfile.odh (Outdated)
```dockerfile
RUN curl -fsSL https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini -o /usr/bin/tini && \
    chmod +x /usr/bin/tini
```
CWE-494: tini downloaded without integrity verification.
The tini binary is fetched from GitHub without checksum validation. A compromised release or MITM attack could inject a malicious init process.
Proposed fix with SHA256 verification:

```diff
-RUN curl -fsSL https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini -o /usr/bin/tini && \
-    chmod +x /usr/bin/tini
+ARG TINI_SHA256=12d20136605531b09a2c2dac02ccee85e1b874eb322ef6baf7561cd93f93c855
+RUN curl -fsSL https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini -o /usr/bin/tini && \
+    echo "${TINI_SHA256}  /usr/bin/tini" | sha256sum -c - && \
+    chmod +x /usr/bin/tini
```

Get the correct SHA256 from tini's GitHub releases page for v0.19.0.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```dockerfile
ARG TINI_SHA256=12d20136605531b09a2c2dac02ccee85e1b874eb322ef6baf7561cd93f93c855
RUN curl -fsSL https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini -o /usr/bin/tini && \
    echo "${TINI_SHA256}  /usr/bin/tini" | sha256sum -c - && \
    chmod +x /usr/bin/tini
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/openshift/Dockerfile.odh` around lines 95 - 96, The Dockerfile
currently downloads tini using the TINI_VERSION variable and writes it to
/usr/bin/tini without integrity checks; modify the RUN step to also fetch or
embed the known SHA256 for the specific TINI_VERSION, verify the downloaded
file's checksum (e.g., with sha256sum or by comparing to the .sha256 from the
release) before chmod +x and only install if the checksum matches, failing the
build on mismatch; reference the TINI_VERSION variable and the /usr/bin/tini
target when updating the RUN command so the download, checksum verification, and
installation are performed atomically.
Good catch by CodeRabbit. The Dockerfile uses ARG TARGETARCH for multi-architecture support but downloads the generic tini binary, which defaults to amd64. This will fail on ARM64 systems. There are also these issues:
- No checksum verification for the tini binary download
- No checksum verification for the Spark entrypoint.sh download

This makes the build vulnerable to MITM attacks.
@RobuRishabh, excellent points — there are actually three compounding problems here:
Issue 1 — Wrong tini binary for non-amd64 architectures (CWE-20)
The current code downloads the generic tini binary, which is the amd64 build. tini v0.19.0 ships separate architecture-specific binaries: tini-amd64, tini-arm64, etc. A build with --platform=linux/arm64 will silently install the wrong binary, causing runtime failures.
Issue 2 — No integrity check on tini (CWE-494)
The generic tini binary is downloaded without checksum verification. The official tini documentation recommends downloading both the binary and its .sha256sum companion file and verifying before installation.
Issue 3 — No integrity check on entrypoint.sh (CWE-494)
Per the AI summary, the Dockerfile also downloads Spark's Kubernetes entrypoint.sh from a remote URL without any hash verification — same MITM risk.
Proposed fix (all three issues combined):

```dockerfile
ARG TARGETARCH
# Map Docker TARGETARCH (amd64/arm64) to tini binary name
RUN TINI_ARCH="${TARGETARCH}" && \
    curl -fsSL \
      "https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini-${TINI_ARCH}" \
      -o /usr/bin/tini && \
    curl -fsSL \
      "https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini-${TINI_ARCH}.sha256sum" \
      -o /tmp/tini.sha256sum && \
    # Adjust the filename in the checksum file to match the install target
    sed -i "s|tini-${TINI_ARCH}|/usr/bin/tini|" /tmp/tini.sha256sum && \
    sha256sum -c /tmp/tini.sha256sum && \
    rm -f /tmp/tini.sha256sum && \
    chmod +x /usr/bin/tini
```

For entrypoint.sh, pin the download to a specific git commit SHA and verify its hash:

```dockerfile
ARG SPARK_ENTRYPOINT_SHA256=<known-sha256-for-the-pinned-version>
RUN curl -fsSL \
      "https://raw.githubusercontent.com/apache/spark/<commit-sha>/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh" \
      -o /opt/entrypoint.sh && \
    echo "${SPARK_ENTRYPOINT_SHA256}  /opt/entrypoint.sh" | sha256sum -c - && \
    chmod +x /opt/entrypoint.sh
```

Note: Replace `<known-sha256-for-the-pinned-version>` and `<commit-sha>` with the values from the specific Spark 4.0.1 release tag. Using a raw GitHub URL pinned to a release tag (e.g., `refs/tags/v4.0.1`) rather than a branch like main also reduces drift risk.
examples/openshift/Dockerfile.odh (Outdated)
```dockerfile
chown -R spark:spark /opt/spark/work-dir /opt/spark/RELEASE

# Download Spark's Kubernetes entrypoint for driver/executor pods
RUN curl -fsSL https://raw.githubusercontent.com/apache/spark/v${PYSPARK_VERSION}/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh \
```
See the comment above: the Spark entrypoint.sh download has no checksum verification, so it is prone to MITM attacks. tini runs as the root init process (PID 1) for every container, so it's very important that no one gets complete control of our container.
For the Spark entrypoint, consider either:
- Adding checksum verification
- Bundling the script in the repository instead of downloading from GitHub
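The bundling option can be sketched as a one-line Dockerfile change (the in-repo path `examples/openshift/spark-entrypoint.sh` is an assumption, not a file that exists in this PR):

```dockerfile
# Vendor the script in the repository and copy it instead of fetching it at
# build time; --chmod avoids a separate chmod layer.
COPY --chmod=0755 examples/openshift/spark-entrypoint.sh /opt/entrypoint.sh
```

This removes the network fetch entirely, so integrity is guaranteed by the repo's own review process.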
examples/openshift/Dockerfile.odh (Outdated)
```dockerfile
@@ -0,0 +1,147 @@
# Containerfile for Spark Operator based on AIPCC base image
#
# Build: docker build -f examples/openshift/Containerfile -t <image> .
```
Instead of Containerfile, it should be Dockerfile.odh:

`Build: docker build -f examples/openshift/Dockerfile.odh -t <image> .`
examples/openshift/Dockerfile.odh (Outdated)
```dockerfile
USER ${SPARK_UID}

ENTRYPOINT ["/usr/bin/wrapper-entrypoint.sh"]
```

(No newline at end of file)
Add a newline after this last line; POSIX compliance requires text files to end with a newline character.
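A quick CI-style check for this, sketched in portable shell (the function name is illustrative):

```shell
# Returns 0 if the file's last byte is a newline. Command substitution strips
# trailing newlines, so a file ending in '\n' yields empty output from tail.
# Note: an empty file also passes this check.
ends_with_newline() {
  [ -z "$(tail -c1 "$1")" ]
}
```

Running it against the Dockerfile in CI would catch a missing trailing newline before review.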
examples/openshift/Dockerfile.odh (Outdated)
```dockerfile
# Build Arguments
################################################################################
ARG GO_BUILDER_IMAGE=registry.access.redhat.com/ubi9/go-toolset:1.25.7
ARG BASE_IMAGE=quay.io/aipcc/base-images/cpu:3.4.0-1773957977
```
Is this tag guaranteed to be immutable? If it's a rolling tag, builds could become non-reproducible. I am not sure about this.
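One way to check is to resolve the digest currently behind the tag and pin it explicitly, e.g. with skopeo (assuming skopeo is installed and the registry is reachable; this is a sketch, not verified against this build environment):

```shell
# Print the sha256 digest the tag currently resolves to, suitable for
# pinning as quay.io/aipcc/base-images/cpu@sha256:<digest>
skopeo inspect --format '{{.Digest}}' docker://quay.io/aipcc/base-images/cpu:3.4.0-1773957977
```

If the printed digest changes between runs, the tag is rolling and should be pinned by digest.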
examples/openshift/Dockerfile.odh (Outdated)
```dockerfile
if [ -f ${PYSPARK_DIR}/kubernetes/dockerfiles/spark/decom.sh ]; then \
    cp ${PYSPARK_DIR}/kubernetes/dockerfiles/spark/decom.sh /opt/decom.sh; \
    chmod a+x /opt/decom.sh; \
```
Error handling is needed here: the build silently continues if the file doesn't exist. Is decom.sh required for Spark operator functionality? If yes, the build should fail when it's missing. If no, add a comment explaining why it's optional.
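A fail-fast variant could look like the sketch below (plain shell, assuming decom.sh is required; the function name is illustrative, and in the Dockerfile this logic would live inside the RUN step):

```shell
#!/bin/sh
set -eu

# Copy a required script, failing loudly instead of silently continuing
# when the source file is missing.
copy_required() {
  src="$1"; dest="$2"
  if [ -f "$src" ]; then
    cp "$src" "$dest"
    chmod a+x "$dest"
  else
    echo "ERROR: required script not found: $src" >&2
    return 1
  fi
}
```

If decom.sh is genuinely optional, keep the silent branch but log a "decom.sh not found, skipping" message and add a comment so the behavior is documented.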
examples/openshift/Dockerfile.odh (Outdated)
```dockerfile
RUN printf '#!/bin/bash\ncase "${1:-}" in\n  driver|driver-py|driver-r|executor)\n    exec /opt/entrypoint.sh "$@"\n    ;;\n  *)\n    exec /usr/bin/entrypoint.sh "$@"\n    ;;\nesac\n' \
    > /usr/bin/wrapper-entrypoint.sh && \
    chmod +x /usr/bin/wrapper-entrypoint.sh
```
Add shebang validation and error handling such as `set -e` here. Better yet, create wrapper-entrypoint.sh as a separate file in the repository instead of generating it with printf.
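A sketch of that standalone repo file (the paths /opt/entrypoint.sh and /usr/bin/entrypoint.sh come from the generated script above; the helper function is illustrative, and the final exec is shown as a comment so the dispatch logic stays testable on its own):

```shell
#!/bin/sh
set -eu

# Decide which entrypoint should handle the invocation: Spark runtime roles
# go to Spark's entrypoint, everything else to the operator's entrypoint.
dispatch_target() {
  case "${1:-}" in
    driver|driver-py|driver-r|executor) echo "/opt/entrypoint.sh" ;;
    *)                                  echo "/usr/bin/entrypoint.sh" ;;
  esac
}

# In the real wrapper, the last line would be:
#   exec "$(dispatch_target "${1:-}")" "$@"
```

Keeping it as a repo file makes it reviewable, lintable with shellcheck, and unit-testable, none of which the inline printf allows.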
examples/openshift/Dockerfile.odh (Outdated)
```dockerfile
RUN dnf --setopt=install_weak_deps=0 --setopt=tsflags=nodocs install -y \
        java-17-openjdk-headless \
        procps-ng net-tools krb5-workstation nss_wrapper \
    && dnf clean all && rm -rf /var/cache/dnf/*

ENV JAVA_HOME=/usr/lib/jvm/jre-17-openjdk
ENV PATH="${JAVA_HOME}/bin:${PATH}"
```
Add Java installation verification here:

```dockerfile
RUN java -version 2>&1 | grep "17\."
```
```dockerfile
################################################################################
FROM ${BASE_IMAGE}

LABEL name="spark-operator" \
```
Are we also going to add some kind of version label here, @alimaredia?
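If a version label is wanted, one sketch is to thread it through a build arg (the arg name and default value here are assumptions, not from this PR):

```dockerfile
# Hypothetical: pass --build-arg VERSION=<tag> from CI and surface it as a label.
ARG VERSION=0.0.0
LABEL name="spark-operator" \
      version="${VERSION}"
```

This keeps the label in sync with whatever tag CI builds, rather than hardcoding a version that drifts.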
@RobuRishabh Yesterday, @Vedant-Deshpande and I talked and the following changes will be made. I'm not going to give another round of review until he's got a new version of the Dockerfile with those changes.
Force-pushed from db07eee to 8fef643
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/openshift/Dockerfile.odh`:
- Around line 51-72: The account-creation block hardcodes UID/GID 185 instead of
using the build arg SPARK_UID, causing mismatched ownership when SPARK_UID is
changed; update the conditional checks and commands (getent group/passwd,
groupmod, groupadd, usermod, useradd, chown) to reference the variable SPARK_UID
(and use it for both UID and GID where appropriate) so the group/user
creation/renaming and final chown use ${SPARK_UID} consistently and match the
runtime user selection.
- Around line 74-79: The Dockerfile currently copies system crypto and Java
config from the separate JDK stage via the COPY --from=jdk ... lines (notably
the COPY lines for /etc/crypto-policies, /etc/java, /etc/pki/java and related
/usr/lib/jvm), which risks crypto/TLS/FIPS divergence; remove the COPY
--from=jdk of /etc/crypto-policies (and avoid pulling system-wide /etc/* trees
from a different image lineage) and instead either install Java directly into
the runtime base image during build or switch the base to the supported Red Hat
Java runtime image so Java and system crypto policies come from the same image
lineage; update or delete the COPY --from=jdk /etc/crypto-policies (and related
/etc/* system copies) and adjust subsequent Dockerfile steps that assume those
files exist.
- Around line 97-115: The Dockerfile creates a symlink from the installed
pyspark package into /opt/spark (via PYSPARK_DIR -> /opt/spark) which causes
later recursive chmod/chgrp (chgrp -R 0 /opt/spark; chmod -R g=u /opt/spark) to
modify the package; instead copy the Spark distribution into /opt/spark rather
than symlinking (replace the ln -s ${PYSPARK_DIR} /opt/spark step with a safe
copy of only the distribution files), and limit recursive permission changes to
the state directories only (apply chown/chmod -R to /opt/spark/work-dir and
/opt/spark/logs instead of /opt/spark); adjust the related commands that set
SPARK_HOME, PYSPARK_DIR handling, chown -R spark:spark, chgrp -R and chmod -R
usages so no recursive operations touch the installed pyspark package.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: a45ab083-3889-4b1f-a30a-8caf2e96c228
📒 Files selected for processing (2)
entrypoint.sh
examples/openshift/Dockerfile.odh
♻️ Duplicate comments (5)
examples/openshift/Dockerfile.odh (5)
10-10: ⚠️ Potential issue | 🟠 Major

Pin `JDK_IMAGE` to an immutable reference.

`registry.redhat.io/ubi9/openjdk-21:latest` makes this stage non-reproducible. A later push to `latest` can silently change the Java payload copied into the runtime image without any Dockerfile change. Pin a concrete release and, ideally, its digest.

Suggested fix:

```diff
-ARG JDK_IMAGE=registry.redhat.io/ubi9/openjdk-21:latest
+ARG JDK_IMAGE=registry.redhat.io/ubi9/openjdk-21:<release>@sha256:<digest>
```

As per coding guidelines, `**/Dockerfile*`: Use specific image tags (not :latest).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/openshift/Dockerfile.odh` at line 10, The build ARG JDK_IMAGE is currently set to an unstable mutable tag ("registry.redhat.io/ubi9/openjdk-21:latest"); update the ARG JDK_IMAGE declaration to reference a concrete, immutable image (use a specific release tag or, preferably, the image digest form) so the runtime image is reproducible—locate the ARG JDK_IMAGE in the Dockerfile.odh and replace the :latest value with a stable version or `@sha256`:<digest>.
43-43: ⚠️ Potential issue | 🟠 Major

CWE-16: don't overlay Java security config from a different base image.

Line 43 builds the runtime from `${BASE_IMAGE}`, then lines 75-79 copy `/etc/java`, `/etc/pki/java`, and `/etc/crypto-policies` out of `${JDK_IMAGE}`. That splices together two independent patch streams. After a base refresh, the final image can carry stale Java CAs or crypto policy relative to the OS. Make the Red Hat Java image the actual runtime base, or install Java from the runtime image's own repos instead of copying `/etc/*` trees.

As per coding guidelines, `**`: REVIEW PRIORITIES: 1. Security vulnerabilities 2. Architectural issues and anti-patterns.

Also applies to: 74-79
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/openshift/Dockerfile.odh` at line 43, The Dockerfile currently uses FROM ${BASE_IMAGE} then later copies Java config trees from ${JDK_IMAGE} (/etc/java, /etc/pki/java, /etc/crypto-policies), which mixes two different base images and can leave stale CAs/crypto policies; fix by making the Red Hat Java image the actual runtime base (use ${JDK_IMAGE} in FROM) or remove the COPY of those /etc/* trees and instead install Java and crypto policy from the runtime image's package repos (do not splice config from ${JDK_IMAGE} into ${BASE_IMAGE}); update references to BASE_IMAGE, JDK_IMAGE, and the COPY steps that target /etc/java, /etc/pki/java, and /etc/crypto-policies accordingly.
120-125: ⚠️ Potential issue | 🔴 Critical

Critical: this entrypoint never dispatches to Spark runtime mode.

`COPY --chmod=0755 entrypoint.sh /usr/bin/` installs the repository root `entrypoint.sh`, and `entrypoint.sh` lines 1-26 always end by execing `/usr/bin/spark-operator`. This Dockerfile also never copies a Spark Kubernetes entrypoint into the image. Any pod using this image as a Spark runtime will feed Spark args to the operator binary and fail. Copy a real wrapper here that dispatches Spark runtime commands to Spark's entrypoint and everything else to `spark-operator`, or split the operator and runtime images.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/openshift/Dockerfile.odh` around lines 120 - 125, The current COPY of entrypoint.sh (COPY --chmod=0755 entrypoint.sh /usr/bin/) installs a script that always execs /usr/bin/spark-operator, so any Spark runtime invocation ends up being passed to the operator binary; fix this by replacing or augmenting the deployed entrypoint: add a wrapper entrypoint that inspects the command/args (e.g., check first arg or presence of "spark-submit"/"spark-submit" style flags) and, for Spark runtime invocations, dispatches to the real Spark entrypoint (the cluster Spark binary/launch script you include in the image) while forwarding all other invocations to exec /usr/bin/spark-operator; alternatively split into separate operator and runtime images so the Dockerfile no longer copies the operator-only entrypoint.sh as the image ENTRYPOINT.
98-115: ⚠️ Potential issue | 🟠 Major

CWE-732: `/opt/spark` becomes group-writable because the symlink is followed.

`ln -s ${PYSPARK_DIR} /opt/spark` makes `/opt/spark` point at the installed `pyspark` package, then lines 113-114 recurse through that link and grant GID 0 write access to shipped code. Any process sharing the OpenShift root group can tamper with Spark jars/scripts that later execute. Copy the distribution into `/opt/spark` and limit recursive permission changes to `work-dir` and `logs`.

Suggested fix:

```diff
 RUN PYSPARK_DIR=$(python3 -c "import pyspark; import os; print(os.path.dirname(pyspark.__file__))") && \
-    ln -s ${PYSPARK_DIR} /opt/spark && \
+    mkdir -p /opt/spark && \
+    cp -a "${PYSPARK_DIR}/." /opt/spark/ && \
     mkdir -p /opt/spark/work-dir /opt/spark/logs && \
     chmod g+w /opt/spark/work-dir && \
     touch /opt/spark/RELEASE && \
-    chown -R spark:spark /opt/spark/work-dir /opt/spark/RELEASE
+    chown -R spark:spark /opt/spark/work-dir /opt/spark/logs /opt/spark/RELEASE

-RUN chgrp -R 0 /opt/spark /etc/k8s-webhook-server && \
-    chmod -R g=u /opt/spark /etc/k8s-webhook-server && \
+RUN chgrp -R 0 /opt/spark/work-dir /opt/spark/logs /etc/k8s-webhook-server && \
+    chmod -R g=u /opt/spark/work-dir /opt/spark/logs /etc/k8s-webhook-server && \
     chmod -R 775 /opt/spark/work-dir /opt/spark/logs
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/openshift/Dockerfile.odh` around lines 98 - 115, Replace the symlink creation (ln -s ${PYSPARK_DIR} /opt/spark) with a safe copy of the pyspark distribution into /opt/spark (e.g., cp -a ${PYSPARK_DIR} /opt/spark or rsync -a) so you don't grant group write on the installed package; then restrict recursive ownership/permission changes (chown -R, chgrp -R, chmod -R) to only /opt/spark/work-dir and /opt/spark/logs and keep a non-recursive chown/chmod for the /opt/spark top-level RELEASE file if needed (update the RUN block that creates /opt/spark, /opt/spark/work-dir, /opt/spark/logs and the later OpenShift permission block to operate only on work-dir/logs rather than following a symlink into the pyspark package).
59-72: ⚠️ Potential issue | 🟠 Major

Use `${SPARK_UID}` consistently during account creation.

Lines 59-69 hardcode 185, but line 123 runs as `${SPARK_UID}`. Any non-default build arg produces a runtime UID that does not own `/home/spark` and may not have matching passwd/group entries, which breaks the arbitrary-UID path this image is trying to support.

Suggested fix:

```diff
-RUN if getent group 185 > /dev/null 2>&1; then \
-        EXISTING_GROUP=$(getent group 185 | cut -d: -f1); \
+RUN if getent group "${SPARK_UID}" > /dev/null 2>&1; then \
+        EXISTING_GROUP=$(getent group "${SPARK_UID}" | cut -d: -f1); \
         groupmod -n spark "$EXISTING_GROUP"; \
     else \
-        groupadd -g 185 spark; \
+        groupadd -g "${SPARK_UID}" spark; \
     fi && \
-    if getent passwd 185 > /dev/null 2>&1; then \
-        EXISTING_USER=$(getent passwd 185 | cut -d: -f1); \
+    if getent passwd "${SPARK_UID}" > /dev/null 2>&1; then \
+        EXISTING_USER=$(getent passwd "${SPARK_UID}" | cut -d: -f1); \
         usermod -l spark -d /home/spark "$EXISTING_USER"; \
     else \
-        useradd -u 185 -g spark -d /home/spark -s /bin/bash spark; \
+        useradd -u "${SPARK_UID}" -g spark -d /home/spark -s /bin/bash spark; \
     fi && \
     mkdir -p /home/spark && \
     chown spark:spark /home/spark
```

Also applies to: 123-123
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/openshift/Dockerfile.odh` around lines 59 - 72, The Dockerfile block hardcodes UID/GID 185 which breaks arbitrary-UID builds; replace every literal 185 in the group/user checks and creation with the build arg variable ${SPARK_UID} (e.g. change getent group 185, getent passwd 185, groupadd -g 185, useradd -u 185) and use the same variable when modifying names (groupmod/usermod) so lookups and creations use ${SPARK_UID}; finally ensure ownership of /home/spark is set to the numeric UID/GID (chown "${SPARK_UID}:${SPARK_UID}" /home/spark) so a non-default runtime UID owns the home directory.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@examples/openshift/Dockerfile.odh`:
- Line 10: The build ARG JDK_IMAGE is currently set to an unstable mutable tag
("registry.redhat.io/ubi9/openjdk-21:latest"); update the ARG JDK_IMAGE
declaration to reference a concrete, immutable image (use a specific release tag
or, preferably, the image digest form) so the runtime image is
reproducible—locate the ARG JDK_IMAGE in the Dockerfile.odh and replace the
:latest value with a stable version or `@sha256`:<digest>.
- Line 43: The Dockerfile currently uses FROM ${BASE_IMAGE} then later copies
Java config trees from ${JDK_IMAGE} (/etc/java, /etc/pki/java,
/etc/crypto-policies), which mixes two different base images and can leave stale
CAs/crypto policies; fix by making the Red Hat Java image the actual runtime
base (use ${JDK_IMAGE} in FROM) or remove the COPY of those /etc/* trees and
instead install Java and crypto policy from the runtime image's package repos
(do not splice config from ${JDK_IMAGE} into ${BASE_IMAGE}); update references
to BASE_IMAGE, JDK_IMAGE, and the COPY steps that target /etc/java,
/etc/pki/java, and /etc/crypto-policies accordingly.
- Around line 120-125: The current COPY of entrypoint.sh (COPY --chmod=0755
entrypoint.sh /usr/bin/) installs a script that always execs
/usr/bin/spark-operator, so any Spark runtime invocation ends up being passed to
the operator binary; fix this by replacing or augmenting the deployed
entrypoint: add a wrapper entrypoint that inspects the command/args (e.g., check
first arg or presence of "spark-submit"/"spark-submit" style flags) and, for
Spark runtime invocations, dispatches to the real Spark entrypoint (the cluster
Spark binary/launch script you include in the image) while forwarding all other
invocations to exec /usr/bin/spark-operator; alternatively split into separate
operator and runtime images so the Dockerfile no longer copies the operator-only
entrypoint.sh as the image ENTRYPOINT.
- Around line 98-115: Replace the symlink creation (ln -s ${PYSPARK_DIR}
/opt/spark) with a safe copy of the pyspark distribution into /opt/spark (e.g.,
cp -a ${PYSPARK_DIR} /opt/spark or rsync -a) so you don't grant group write on
the installed package; then restrict recursive ownership/permission changes
(chown -R, chgrp -R, chmod -R) to only /opt/spark/work-dir and /opt/spark/logs
and keep a non-recursive chown/chmod for the /opt/spark top-level RELEASE file
if needed (update the RUN block that creates /opt/spark, /opt/spark/work-dir,
/opt/spark/logs and the later OpenShift permission block to operate only on
work-dir/logs rather than following a symlink into the pyspark package).
- Around line 59-72: The Dockerfile block hardcodes UID/GID 185 which breaks
arbitrary-UID builds; replace every literal 185 in the group/user checks and
creation with the build arg variable ${SPARK_UID} (e.g. change getent group 185,
getent passwd 185, groupadd -g 185, useradd -u 185) and use the same variable
when modifying names (groupmod/usermod) so lookups and creations use
${SPARK_UID}; finally ensure ownership of /home/spark is set to the numeric
UID/GID (chown "${SPARK_UID}:${SPARK_UID}" /home/spark) so a non-default runtime
UID owns the home directory.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: c257d66e-45c7-4206-8fd1-b0166e3facc6
📒 Files selected for processing (1)
examples/openshift/Dockerfile.odh
Force-pushed from 8fef643 to b5e9b69
Actionable comments posted: 1
♻️ Duplicate comments (1)
examples/openshift/Dockerfile.odh (1)
92-104: ⚠️ Potential issue | 🟠 Major

Spark runtime is still present; requested removal is not implemented.

Lines 92-104 still add PySpark/Spark artifacts. This conflicts with the stated requested change to remove Spark installation from this image.

Remediation (operator-only image path):

```diff
-ARG PYSPARK_VERSION=4.0.1
-ARG PYSPARK_INDEX_URL=https://packages.redhat.com/api/pypi/public-rhai/rhoai/3.4-EA2/cpu-ubi9/simple/
@@
-# Install PySpark from Red Hat's internal PyPI index
-RUN pip3 install --no-cache-dir \
-    --index-url ${PYSPARK_INDEX_URL} \
-    pyspark==${PYSPARK_VERSION}
-
-# Copy Spark distribution to /opt/spark
-ENV SPARK_HOME=/opt/spark
-RUN PYSPARK_DIR=$(python3 -c "import pyspark; import os; print(os.path.dirname(pyspark.__file__))") && \
-    cp -r "${PYSPARK_DIR}" /opt/spark && \
-    mkdir -p /opt/spark/work-dir /opt/spark/logs && \
-    touch /opt/spark/RELEASE && \
-    chown -R spark:spark /opt/spark/work-dir /opt/spark/logs /opt/spark/RELEASE
-
-ENV PATH="${PATH}:${SPARK_HOME}/bin:${SPARK_HOME}/sbin"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/openshift/Dockerfile.odh` around lines 92 - 104, The Dockerfile still installs PySpark and copies the Spark distribution—remove the RUN pip3 install ... pyspark==${PYSPARK_VERSION} step, delete the block that computes PYSPARK_DIR and copies to /opt/spark (the RUN that sets PYSPARK_DIR and cp -r ... /opt/spark and related mkdir/chown/touch), and drop the SPARK_HOME and PATH environment lines (ENV SPARK_HOME=/opt/spark and ENV PATH="${PATH}:${SPARK_HOME}/bin:${SPARK_HOME}/sbin") so no Spark artifacts remain in the image; ensure any references to SPARK_HOME, PYSPARK_DIR, or pyspark installation are removed or gated behind an operator-only build path.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/openshift/Dockerfile.odh`:
- Line 10: The ARG JDK_IMAGE currently uses a mutable tag
("registry.redhat.io/ubi9/openjdk-21:latest"); replace it with an immutable
digest form (e.g., "registry.redhat.io/ubi9/openjdk-21@sha256:<digest>") by
resolving the correct sha256 digest for the desired image version from the
registry and updating the ARG JDK_IMAGE value accordingly so builds always pull
a known immutable image.
---
Duplicate comments:
In `@examples/openshift/Dockerfile.odh`:
- Around line 92-104: The Dockerfile still installs PySpark and copies the Spark
distribution—remove the RUN pip3 install ... pyspark==${PYSPARK_VERSION} step,
delete the block that computes PYSPARK_DIR and copies to /opt/spark (the RUN
that sets PYSPARK_DIR and cp -r ... /opt/spark and related mkdir/chown/touch),
and drop the SPARK_HOME and PATH environment lines (ENV SPARK_HOME=/opt/spark
and ENV PATH="${PATH}:${SPARK_HOME}/bin:${SPARK_HOME}/sbin") so no Spark
artifacts remain in the image; ensure any references to SPARK_HOME, PYSPARK_DIR,
or pyspark installation are removed or gated behind an operator-only build path.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 0dbb02d5-f5f2-4ad1-8748-71d648b7aee5
📒 Files selected for processing (1)
examples/openshift/Dockerfile.odh
examples/openshift/Dockerfile.odh
Outdated
```dockerfile
# Build Arguments
################################################################################
ARG GO_BUILDER_IMAGE=registry.access.redhat.com/ubi9/go-toolset:1.25.7
ARG JDK_IMAGE=registry.redhat.io/ubi9/openjdk-21:latest
```
There was a problem hiding this comment.
🧩 Analysis chain

🏁 Script executed:

```shell
fd "Dockerfile.odh" --type f
```

Repository: opendatahub-io/spark-operator

Length of output: 106

🏁 Script executed:

```shell
cat -n examples/openshift/Dockerfile.odh
```

Repository: opendatahub-io/spark-operator

Length of output: 5872
Pin JDK_IMAGE to an immutable digest instead of :latest.
Severity: Major — CWE-494 (download of code without integrity verification).
Exploit scenario: retagged openjdk-21:latest can silently change the runtime filesystem and toolchain, pulling unreviewed content into builds.
Remediation

```diff
-ARG JDK_IMAGE=registry.redhat.io/ubi9/openjdk-21:latest
+ARG JDK_IMAGE=registry.redhat.io/ubi9/openjdk-21@sha256:<approved_digest>
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/openshift/Dockerfile.odh` at line 10, The ARG JDK_IMAGE currently
uses a mutable tag ("registry.redhat.io/ubi9/openjdk-21:latest"); replace it
with an immutable digest form (e.g.,
"registry.redhat.io/ubi9/openjdk-21@sha256:<digest>") by resolving the correct
sha256 digest for the desired image version from the registry and updating the
ARG JDK_IMAGE value accordingly so builds always pull a known immutable image.
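A hedged sketch of the pinned form; the digest below is a placeholder, not a real published digest, and the `skopeo` invocation is one way to resolve it:

```dockerfile
# Resolve the digest once, e.g.:
#   skopeo inspect --format '{{.Digest}}' docker://registry.redhat.io/ubi9/openjdk-21:latest
# then pin the resolved value here (placeholder shown)
ARG JDK_IMAGE=registry.redhat.io/ubi9/openjdk-21@sha256:0000000000000000000000000000000000000000000000000000000000000000
```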
examples/openshift/Dockerfile.odh
Outdated
```dockerfile
RUN PYSPARK_DIR=$(python3 -c "import pyspark; import os; print(os.path.dirname(pyspark.__file__))") && \
    cp -r "${PYSPARK_DIR}" /opt/spark && \
    mkdir -p /opt/spark/work-dir /opt/spark/logs && \
    touch /opt/spark/RELEASE && \
```
I guess we can put something like

```dockerfile
echo "Spark ${PYSPARK_VERSION} (PySpark)" > /opt/spark/RELEASE
```

to get the installed Spark version.
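In context, the suggestion would look roughly like this; the surrounding `RUN` chain is a sketch based on the snippet quoted above, not the exact file contents:

```dockerfile
# Record the installed Spark version in RELEASE instead of leaving it empty
RUN PYSPARK_DIR=$(python3 -c "import pyspark, os; print(os.path.dirname(pyspark.__file__))") && \
    cp -r "${PYSPARK_DIR}" /opt/spark && \
    mkdir -p /opt/spark/work-dir /opt/spark/logs && \
    echo "Spark ${PYSPARK_VERSION} (PySpark)" > /opt/spark/RELEASE && \
    chown -R spark:spark /opt/spark/work-dir /opt/spark/logs /opt/spark/RELEASE
```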
examples/openshift/Dockerfile.odh
Outdated
```dockerfile
# Install PySpark from Red Hat's internal PyPI index
RUN pip3 install --no-cache-dir \
    --index-url ${PYSPARK_INDEX_URL} \
```
--index-url here shouldn't be needed since the AIPCC base image only uses the Red Hat python registry.
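If the AIPCC base image's pip is already configured for the Red Hat Python registry, as this comment assumes, the install step could shrink to:

```dockerfile
# Assumes the AIPCC base image already points pip at the Red Hat
# Python registry, so no explicit --index-url is needed
ARG PYSPARK_VERSION=4.0.1
RUN pip3 install --no-cache-dir pyspark==${PYSPARK_VERSION}
```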
examples/openshift/Dockerfile.odh
Outdated
```dockerfile
RUN dnf install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm && \
    dnf install -y tini && \
    dnf clean all && rm -rf /var/cache/dnf/* && \
    java -version 2>&1 | grep "21\." && \
```
Why is java -version being called here?
It's just a sanity check to confirm we got Java from the JDK; I can remove it to reduce the code.
examples/openshift/Dockerfile.odh
Outdated
```dockerfile
# Copy operator entrypoint
COPY --chmod=0755 entrypoint.sh /usr/bin/

USER ${SPARK_UID}
```
Just to ensure we're keeping as close to the upstream Dockerfile as possible, we'll want this to be:

```dockerfile
USER ${SPARK_UID}:${SPARK_GID}
```

Take a look at:
https://github.com/kubeflow/spark-operator/blame/master/Dockerfile#L54
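Applied here, the upstream-matching form would be roughly as below; the UID/GID values shown are assumptions based on common upstream defaults, not taken from this PR:

```dockerfile
# Illustrative defaults; upstream passes these as build args
ARG SPARK_UID=185
ARG SPARK_GID=0
# Setting GID 0 explicitly keeps arbitrary-UID OpenShift deployments
# working with group-writable paths owned by the root group
USER ${SPARK_UID}:${SPARK_GID}
```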
examples/openshift/Dockerfile.odh
Outdated
```dockerfile
COPY --from=builder /workspace/bin/spark-operator /usr/bin/spark-operator

# Copy operator entrypoint
COPY --chmod=0755 entrypoint.sh /usr/bin/
```
There are no permissions set on the COPY upstream (https://github.com/kubeflow/spark-operator/blob/master/Dockerfile#L58). Do we need the 0755?
Nope, it's redundant; entrypoint.sh already has this permission, so I'll remove it.
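Dropping the redundant mode flag, the copy steps would match upstream; this sketch assumes entrypoint.sh is already executable in the build context:

```dockerfile
COPY --from=builder /workspace/bin/spark-operator /usr/bin/spark-operator

# Copy operator entrypoint; relies on entrypoint.sh already being
# mode 0755 in the repository, as upstream does
COPY entrypoint.sh /usr/bin/
```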
examples/openshift/Dockerfile.odh
Outdated
```dockerfile
# Build Arguments
################################################################################
ARG GO_BUILDER_IMAGE=registry.access.redhat.com/ubi9/go-toolset:1.25.7
ARG JDK_IMAGE=registry.redhat.io/ubi9/openjdk-21:latest
```
To stay in tune with what the upstream Spark image is using, let's use JDK 17 instead of 21. Here's a container example: https://catalog.redhat.com/en/software/containers/ubi9/openjdk-17/61ee7c26ed74b2ffb22b07f6.
You can see the Spark:4.0.1 image details here: https://hub.docker.com/layers/library/spark/4.0.1/images/sha256-d5079fdf698b7cc6fd17332373fcd623eb6ff166359ebaf4c1bf20e9eaf611b1
examples/openshift/Dockerfile.odh
Outdated
```dockerfile
LABEL name="spark-operator" \
    summary="Spark Operator for OpenShift" \
    description="Kubeflow Spark Operator built on AIPCC base with Java 21 and PySpark 4.0.1" \
```
This would have to change to Java 17 per the comment above
b5e9b69 to
273f1ad
Compare
273f1ad to
d2c42e8
Compare
|
@Vedant-Deshpande: Please add the manual run results to the PR. Did you get this qualified against ODH integration? We could point the odh-operator repo at your fork's SHA and trigger the tests.
- Multi-stage build using AIPCC 3.4 EA-2 base image
- Builds Go operator binary from Red Hat registry toolset
- Installs Java 17 via dnf from UBI9 repos (no cross-image copy)
- Installs PySpark 4.0.1 from Red Hat's internal PyPI registry
- Installs tini from EPEL
- OpenShift arbitrary UID compatibility
- User brings their own Spark Image for driver/executor pods
d2c42e8 to
9609ea7
Compare
- Multi-stage build using AIPCC 3.4 EA-2 base image
- Builds Go operator binary from Red Hat registry toolset
- Installs Java 17 via dnf from UBI9 repos (no cross-image copy)
- Installs PySpark 4.0.1 from Red Hat's internal PyPI registry
- Installs tini from EPEL
- OpenShift arbitrary UID compatibility
- User brings their own Spark Image for driver/executor pods
Purpose of this PR
Currently in RHOAI builds, the base image for the Spark operator is the upstream Spark container. This PR introduces a new multi-stage Containerfile built on a Red Hat AIPCC base image (quay.io/aipcc/base-images/cpu:3.4.0-1773957977) instead, containing the necessary Java libraries (OpenJDK 17), a PySpark 4.0.1 installation (providing spark-submit and all Spark jars), and the compiled operator binary.
Proposed changes:
Change Category
Rationale
The upstream Spark operator image uses docker.io/library/spark:4.0.1 as its base, which is not suitable for RHOAI builds. A Red Hat-based image is needed for supportability, FIPS compliance, and compatibility with the AIPCC ecosystem.
Checklist
Additional Notes
Manually tested by triggering workflow: https://github.com/opendatahub-io/spark-operator/actions/runs/23923611584
Summary by CodeRabbit