
build: optimize hermetic prefetch with arch filtering and parallel downloads#3352

Draft
jiridanek wants to merge 4 commits into main from jd_codeserver_overwrite


@jiridanek jiridanek commented Apr 11, 2026

Problem

Currently, scripts/lockfile-generators/prefetch-all.sh downloads dependencies for all supported architectures (x86_64, aarch64, ppc64le, s390x) sequentially before the build begins. This results in a massive waste of network bandwidth, disk space, and CI compute time (e.g., downloading ~5.8 GB of wheels and RPMs when the target architecture only needs ~1.4 GB).

Furthermore, cachi2/output/ was hardcoded as a global singleton. When building multiple different images sequentially or in parallel, the prefetch scripts would overwrite or pollute this shared directory, leading to corrupted or incorrect builds (Fixes #3250).

Solution

This PR optimizes the local and CI hermetic prefetch process (Phase 1):

  1. Architecture filtering (--arch flag): Added to prefetch-all.sh to prevent downloading GBs of unused pip wheels and RPMs for other architectures.
  2. Parallel Downloads: Parallelized NPM and Pip downloads using xargs and concurrent.futures.
  3. Namespaced cachi2/output/: The output directory is now dynamically namespaced by the component hash (e.g., cachi2/output/<hash>/deps/). This prevents directory collisions and enables safe parallel local builds. The Makefile has been updated to auto-detect and mount this new directory structure.
  4. Documentation: Updated all related hermetic build documentation to reflect the new architecture.
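
For illustration, the namespacing in item 3 can be sketched roughly as follows. This is a minimal sketch, not the PR's code: the function name `cachi2_out_dir` is hypothetical, and it assumes the hash is the MD5 of the component directory string with trailing slashes stripped, which is how the scheme is described elsewhere in this thread.

```python
import hashlib
from pathlib import PurePosixPath


def cachi2_out_dir(component_dir: str, root: str = "cachi2/output") -> PurePosixPath:
    """Return a per-component output directory (hypothetical helper)."""
    # Strip trailing slashes so "foo/" and "foo" map to the same directory.
    normalized = component_dir.rstrip("/")
    # MD5 is used only to derive a stable directory name, not for security.
    digest = hashlib.md5(normalized.encode()).hexdigest()
    return PurePosixPath(root) / digest


if __name__ == "__main__":
    # Dependencies for this component would then live under <root>/<hash>/deps/.
    print(cachi2_out_dir("codeserver/ubi9-python-3.12/") / "deps")
```

Because the directory name depends only on the normalized component path, the prefetch script and the Makefile can each compute it independently, provided they normalize the path the same way.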

cc @coderabbitai

Summary by CodeRabbit

  • New Features

    • Added architecture-aware build option for arch-specific dependency selection.
    • Switched to per-component hashed dependency output directories to avoid cross-component conflicts.
  • Performance Improvements

    • Parallelized npm and pip downloads for faster dependency fetching.
  • Documentation

    • Added detailed hermetic build architecture guide and updated build docs to reflect per-component outputs and arch handling.


openshift-ci Bot commented Apr 11, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign atheo89 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot requested review from atheo89 and dibryant April 11, 2026 00:18
@github-actions github-actions Bot added the review-requested label (GitHub Bot creates notification on #pr-review-ai-ide-team slack channel) Apr 11, 2026
@openshift-ci openshift-ci Bot added the size/l label Apr 11, 2026

coderabbitai Bot commented Apr 11, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited), Repository UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: a0ac7523-bc10-41cc-bdbd-2fc73b674716

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Replaces the global cachi2/output/ with per-component MD5-hashed output directories (cachi2/output/<hash>/) and wires that through the Makefile, hermetic build docs, prefetch orchestration (prefetch-all.sh), and multiple lockfile generator helpers and downloaders. Exposes CACHI2_OUT_DIR to child scripts, adds --arch filtering, parallelizes pip/npm downloads, and updates scripts to read/write dependency artifacts under the component-specific hashed output path. Several scripts now accept/propagate ARCH and filter artifacts by architecture.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Actionable Issues

  1. Platform md5 tool dependency (portability)
  • Problem: prefetch-all.sh invokes md5sum without fallback; macOS uses md5.
  • Fix: Detect and use available tool or a portable Python fallback: python3 -c "import hashlib,sys;print(hashlib.md5(sys.argv[1].encode()).hexdigest())".
  • Risk: Build failure on macOS (operational).
  2. Incomplete cleanup of temporary hermeto source (resource leak)
  • Problem: hermeto-fetch-rpm.sh uses ${CLEANUP_SOURCE:+...} but CLEANUP_SOURCE is not set in this diff, so temporary filtered source dirs may persist.
  • Fix: Set CLEANUP_SOURCE=1 when creating the temp dir and ensure trap unconditionally removes temp dir.
  • Risk: Disk/resource leak.
  3. Concurrent prefetch race for same component (data corruption)
  • Problem: Multiple concurrent prefetch-all.sh runs for the same component write the same CACHI2_OUT_DIR, risking partial/corrupt artifacts. (CWE-362: Race Condition)
  • Fix: Add a lock around CACHI2_OUT_DIR creation and writing (e.g., flock on a lockfile, or an atomic lock directory with PID+timeout), and fail fast or wait.
  • Risk: Corrupted lockfiles/partial downloads.
  4. Path traversal / unvalidated component dir (path injection)
  • Problem: COMPONENT_DIR is accepted and hashed without validating it’s a sane relative path; malicious or malformed input (e.g., ../../…) may allow unintended filesystem interactions. (CWE-22: Path Traversal)
  • Fix: Validate COMPONENT_DIR is canonical and relative (no .. segments), normalize and reject absolute paths or .. before hashing/using. Consider restricting to allowed chars/patterns.
  5. Loss of aggregated failure metrics in parallel npm downloader (observability)
  • Problem: download-npm.sh switched to parallel xargs -P but removed global counters and final failure summary, hiding systemic failures.
  • Fix: Capture worker exit codes and aggregate counts (success/skip/fail) after join; print totals and non-zero exit if any failures.
  6. JSON read/deserialization change in download-pip-packages.py (robustness)
  • Problem: Replaced json.load(r) with json.loads(r.read().decode()) without explicit error handling; partial reads or encoding issues may produce opaque failures.
  • Fix: Wrap .read() + json.loads() in try/except and surface clear errors; validate response length/HTTP status before parsing.
  7. yq dependency not validated (missing-tool failure mode)
  • Problem: hermeto-fetch-rpm.sh relies on yq when --arch is used but does not check availability, causing unclear failures.
  • Fix: Early command -v yq >/dev/null || { echo "yq required for --arch"; exit 1; }.
  8. Missing verification for required tools used in new parallel/arch flows (robustness)
  • Problem: New flows use xargs, yq, python3, sha256sum/shasum variants, and wget/curl interchangeably without preflight checks.
  • Fix: Add a preflight check that verifies required binaries (and preferred fallbacks) and prints actionable errors.
  9. Use of MD5 for directory names (collision considerations)
  • Problem: MD5 is used to derive a directory name; while collision likelihood is low, MD5 is cryptographically weak.
  • Fix: If security-grade uniqueness matters, prefer SHA256 truncated for filesystem names; otherwise document collision acceptance. (Note: this is operational, not a direct attack vector here.)
  10. Missing aggregated exit handling in parallel pip downloader (consistency)
  • Problem: helpers/download-pip-packages.py parallelizes downloads but must ensure the process exits non-zero when any worker fails; diff claims it sets exit after all tasks, verify behavior.
  • Fix: Ensure worker exceptions are propagated and main sets non-zero exit if any failures; include per-file and summary failure reporting.
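
The failure-aggregation fixes suggested for the parallel downloaders could look roughly like the sketch below. It is an assumption-laden illustration rather than the code in download-pip-packages.py: `run_all` and `summarize` are hypothetical names, and `fetch` stands in for whatever function actually downloads and verifies one file.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable, List, Tuple


def run_all(items: Iterable[str], fetch: Callable[[str], object],
            workers: int = 8) -> List[Tuple[str, Exception]]:
    """Run fetch(item) for every item in parallel and collect the failures."""
    failures = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch, item): item for item in items}
        for future in as_completed(futures):
            item = futures[future]
            try:
                future.result()  # re-raises any exception raised in the worker
            except Exception as exc:
                failures.append((item, exc))
    return failures


def summarize(items: Iterable[str], fetch: Callable[[str], object]) -> int:
    """Print per-item failures plus a total, and return a process exit code."""
    items = list(items)
    failures = run_all(items, fetch)
    for item, exc in failures:
        print(f"FAILED {item}: {exc}")
    print(f"{len(items) - len(failures)} ok, {len(failures)} failed")
    return 1 if failures else 0
```

A caller would typically end with `sys.exit(summarize(urls, download_one))`, so that a single failed worker makes the whole prefetch step fail instead of being lost in the log.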

CWE references: CWE-22 (Path Traversal), CWE-362 (Race Condition).
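
For the race condition (CWE-362), one of the options mentioned above is an atomic lock directory with a timeout. A minimal sketch of that idea follows, assuming a hypothetical `dir_lock` helper and a lock path chosen by the caller; it does not record a PID or recover stale locks, which a production version would likely need.

```python
import os
import time
from contextlib import contextmanager


@contextmanager
def dir_lock(lock_path: str, timeout: float = 30.0, poll: float = 0.2):
    """Hold an exclusive lock by atomically creating a directory."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            os.mkdir(lock_path)  # atomic: raises if the directory already exists
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire lock {lock_path}")
            time.sleep(poll)
    try:
        yield
    finally:
        os.rmdir(lock_path)  # release the lock even if the body raised
```

A prefetch run would wrap all writes to its output directory in `with dir_lock(...)`, so a second run for the same component either waits or fails fast with `TimeoutError`.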

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | Title uses imperative mood ('optimize') and is directly related to the main changes (architecture filtering, parallel downloads, and hermetic prefetch optimization). However, it lacks a ticket/issue reference (RHAIENG-XXXX, NO-JIRA, or ISSUE #NNN). |
| Description check | ✅ Passed | PR description clearly outlines the problem (singleton cachi2/output directory causing overwrite/pollution), solution (architecture filtering, parallel downloads, namespaced output), and Phase 1 approach. However, the self-checklist in the description template is not completed (no checkboxes marked), and 'How Has This Been Tested?' and 'Merge criteria' sections are missing. |
| Linked Issues check | ✅ Passed | The PR addresses issue #3250 (per-component output directories via MD5 hashing, architecture filtering, parallel downloads) and demonstrates compliance with coding requirements: CACHI2_OUT_DIR environment variable export, --arch flag implementation, dynamic path construction, and concurrent download pipelines via ThreadPoolExecutor and xargs. |
| Out of Scope Changes check | ✅ Passed | All code changes align with the stated objectives: architecture filtering (--arch flag propagation), parallel downloads (ThreadPoolExecutor in download-pip-packages.py, xargs in download-npm.sh), CACHI2_OUT_DIR namespacing, and documentation updates. No unrelated refactoring, dependency upgrades, or feature additions detected. |
| Branch Prefix Policy | ✅ Passed | PR title 'build: optimize hermetic prefetch with arch filtering and parallel downloads (PR #3352)' correctly omits branch prefix for main branch target, following policy. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot added size/l and removed size/l labels Apr 11, 2026
codecov-commenter commented Apr 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 3.54%. Comparing base (3de4907) to head (4d6496d).
⚠️ Report is 2 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files

Impacted file tree graph

@@          Coverage Diff          @@
##            main   #3352   +/-   ##
=====================================
  Coverage   3.54%   3.54%           
=====================================
  Files         30      30           
  Lines       3359    3359           
  Branches     537     537           
=====================================
  Hits         119     119           
  Misses      3238    3238           
  Partials       2       2           
Flag Coverage Δ
python 3.54% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3de4907...4d6496d. Read the comment docs.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
scripts/lockfile-generators/helpers/download-pip-packages.py (1)

148-151: ⚠️ Potential issue | 🟠 Major

Reject path separators in remote artifact names (CWE-22).

filename comes from remote index metadata and is concatenated into out_dir / filename without validation. A compromised custom index can return ../../... here and make wget -O overwrite files outside the cache directory during CI or local prefetch.

Remediation
+from pathlib import PurePosixPath
+from urllib.parse import urljoin, urlparse
...
-        download_url, sha, filename = m.group(1), m.group(2), m.group(3).strip()
+        raw_url, sha = m.group(1), m.group(2)
+        download_url = urljoin(page_url, raw_url)
+        filename = PurePosixPath(urlparse(download_url).path).name
+        if not filename:
+            continue
         if sha in wanted_hashes:
             out.append((download_url, filename, sha))
...
-                to_fetch.append((out_dir / filename, expected_hash, url, name, version, filename))
+                target = (out_dir / filename).resolve()
+                if out_dir not in target.parents:
+                    raise ValueError(f"Refusing to write outside {out_dir}: {filename}")
+                to_fetch.append((target, expected_hash, url, name, version, filename))
As per coding guidelines, "Validate file paths (prevent path traversal)".

Also applies to: 232-234
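
As a standalone illustration of the remediation above, the sketch below derives a local file name from a download URL and refuses anything that would land outside the output directory. The helper name `safe_target` and the example URLs are hypothetical; the real script's structure may differ.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlparse


def safe_target(out_dir: Path, download_url: str) -> Path:
    """Map a download URL to a path directly inside out_dir, or raise."""
    # Keep only the final path component, after decoding %-escapes.
    name = PurePosixPath(unquote(urlparse(download_url).path)).name
    if name in ("", ".", ".."):
        raise ValueError(f"no usable file name in {download_url!r}")
    base = out_dir.resolve()
    target = (base / name).resolve()
    # Defence in depth: the resolved file must sit directly under out_dir.
    if target.parent != base:
        raise ValueError(f"refusing to write outside {base}: {name!r}")
    return target
```

Taking the name from the URL path rather than from index metadata means a hostile `../../` prefix is discarded before the path is joined, and the parent check additionally catches symlinks inside the cache directory.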

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/lockfile-generators/helpers/download-pip-packages.py` around lines
148 - 151, The code appends a remote-provided filename (variable filename)
directly into out so it will later be joined with out_dir, allowing path
traversal; before using filename (inside the loop where re.finditer yields
download_url, sha, filename and the other similar block at lines 232-234),
validate and sanitize it: reject or canonicalize any path containing path
separators or parent segments (../ or any os.path.sep or os.path.altsep), or
replace it with a safe basename (e.g., use pathlib.Path(filename).name) and
explicitly check for empty or '.'/ '..' names; only append (download_url,
safe_filename, sha) when the sanitized name passes these checks so no "../" can
escape the cache directory.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/learnings/hermetic-build-architecture_.md`:
- Around line 22-28: Update the document to stop referencing the old singleton
host paths like "cachi2/output/deps/..." and "cachi2/output:/cachi2/output" and
instead show the new hashed mount layout used by the system; for each generator
listed (e.g. create-artifact-lockfile.py, create-requirements-lockfile.sh,
download-npm.sh, hermeto-fetch-rpm.sh, create-go-lockfile.sh) replace examples
and outputs that point to "/cachi2/output/deps/..." with the corresponding
namespaced/hashed mount paths (the hashed-volume/* or namespaced mount pattern
used by the new layout) so readers are directed at the correct host paths and
remove any implication that /cachi2/output/deps/... exists directly under the
shared root; apply the same replacements to the other affected regions mentioned
in the review.
- Around line 89-95: Update the local prefetch example to include the
architecture flag so it doesn't default to amd64; change the command example
that calls scripts/lockfile-generators/prefetch-all.sh to show --arch=$(ARCH)
(or an explicit value like --arch=arm64) and mention that the helper defaults to
amd64 when ARCH is unset, and ensure the Makefile note about auto-injecting
cachi2/output/ remains unchanged; reference the prefetch script by name
(scripts/lockfile-generators/prefetch-all.sh) and the Makefile behavior in the
same paragraph.

In `@scripts/lockfile-generators/download-npm.sh`:
- Around line 258-275: The download_file function currently logs failures but
still exits 0, so xargs treats the run as successful; update download_file (the
function named download_file) to return a non-zero exit code when wget fails
(e.g., after removing the partial file call return 1) so the failure is
propagated, and keep the xargs invocation that calls bash -c 'download_file "$1"
"$2"' _ so xargs will observe the non-zero exit and fail the overall command
when any download fails.

In `@scripts/lockfile-generators/helpers/download-pip-packages.py`:
- Around line 182-209: The current should_keep_for_arch treats any substring
"any" or "noarch" as arch-independent which falsely matches manylinux_* names;
change the logic in should_keep_for_arch to detect "any" and "noarch" as
standalone wheel/platform tags rather than raw substrings (e.g., split the
filename on '-' (and/or '.' segments) and check if 'any' or 'noarch' appears as
a complete tag), keep the existing arch_tags/all_arches checks for explicit arch
matches, and ensure the early-return uses the tag presence (not substring) so
manylinux_* files are not misclassified.

In `@scripts/lockfile-generators/helpers/hermeto-fetch-rpm.sh`:
- Around line 217-228: When ARCH is set but yq is unavailable the script
currently skips filtering silently; modify hermeto-fetch-rpm.sh so that when
ARCH (the --arch flag) is non-empty and command -v yq fails you print a clear
error and exit non-zero instead of continuing. Concretely, locate the block that
checks [[ -n "$ARCH" ]] && command -v yq and change it so the presence of ARCH
is checked first, then if yq is missing emit an error referencing yq and the
requested $ARCH (or $rpm_arch) and exit 1; only proceed to set rpm_arch, copy
PREFETCH_DIR to HERMETO_SOURCE and run yq eval on rpms.lock.yaml when yq is
available.

In `@scripts/lockfile-generators/prefetch-all.sh`:
- Around line 143-147: The script currently hashes the raw CLI value in
COMPONENT_DIR when setting CACHI2_OUT_DIR, causing mismatches for paths like
"foo/" vs "foo"; normalize COMPONENT_DIR first (e.g., strip trailing slashes
and/or resolve to an absolute/real path using realpath or similar) before
computing the MD5 so the hash matches the Makefile behavior; update the code
that exports CACHI2_OUT_DIR to compute the hash from the normalized
COMPONENT_DIR value (refer to the COMPONENT_DIR variable and the CACHI2_OUT_DIR
assignment) and ensure ARCH export remains unchanged.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited), Repository UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 9d156257-a8e3-4008-9edb-6430b9f9e28b

📥 Commits

Reviewing files that changed from the base of the PR and between 6f93632 and b3dcf4c.

📒 Files selected for processing (10)
  • Makefile
  • docs/hermetic-guide.md
  • docs/learnings/hermetic-build-architecture_.md
  • scripts/lockfile-generators/README.md
  • scripts/lockfile-generators/create-artifact-lockfile.py
  • scripts/lockfile-generators/create-requirements-lockfile.sh
  • scripts/lockfile-generators/download-npm.sh
  • scripts/lockfile-generators/helpers/download-pip-packages.py
  • scripts/lockfile-generators/helpers/hermeto-fetch-rpm.sh
  • scripts/lockfile-generators/prefetch-all.sh

Comment thread docs/learnings/hermetic-build-architecture_.md Outdated
Comment thread docs/learnings/hermetic-build-architecture_.md
Comment thread scripts/lockfile-generators/download-npm.sh
Comment thread scripts/lockfile-generators/helpers/download-pip-packages.py Outdated
Comment thread scripts/lockfile-generators/helpers/hermeto-fetch-rpm.sh Outdated
Comment thread scripts/lockfile-generators/prefetch-all.sh
@jiridanek jiridanek marked this pull request as draft April 11, 2026 00:32

ysok commented Apr 11, 2026

I think some files here are outdated and no longer used by the prefetch-all.sh script, e.g., helpers/download-pip-packages.py.

For pip wheels, I wget the URLs specified in requirements.txt directly instead of using Hermeto as we do for gomod, rpm, and npm. It was a quick solution when I first started working on it, but for a proper solution I recommend switching to Hermeto and reading the value from the .tekton/*yaml file to determine which architectures it needs to fetch. Ideally, we should replicate the way this is set up and used in Konflux. E.g., we define arch here in the .tekton file.

For RPM, we defined those arches in rpms.in.yaml.

Note: Hermeto creates a random, unique temporary directory each time it runs, so it does not pollute other directories.

jiridanek and others added 3 commits April 11, 2026 14:34
…wnloads

- Add --arch flag to filter out unnecessary pip wheels and RPMs for other architectures, saving GBs of download per build.
- Parallelize NPM and PIP downloads using xargs and concurrent.futures.
- Namespace cachi2/output/ by component hash to prevent directory collisions and enable parallel local builds (fixes #3250).
- Fix CACHI2_OUT_DIR not used by hermeto-fetch-gomod.sh and download-rpms.sh,
  which would write deps to wrong directory under namespaced layout
- Fix npm download failures silently swallowed: add return 1 on wget failure
  and check xargs exit code
- Fix should_keep_for_arch false positives: parse wheel platform tag instead
  of naive substring matching (e.g. "any" in "manylinux" was always true)
- Fix --arch silently skipped when yq missing in hermeto-fetch-rpm.sh: now
  fails fast with clear error
- Normalize COMPONENT_DIR (strip trailing slash) before hashing to match
  Makefile's patsubst behavior
- Use symlinks instead of cp -r in hermeto-fetch-rpm.sh arch filtering
- Remove dead counter variables in download-npm.sh
- Update docs to reflect hashed cachi2/output/<hash>/ layout
- Fix stale README claim about download-pip-packages.py not being called
- Fix missing newline at end of create-requirements-lockfile.sh

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jiridanek jiridanek force-pushed the jd_codeserver_overwrite branch from b3dcf4c to efd17a4 Compare April 11, 2026 13:01
@openshift-ci openshift-ci Bot added size/xl and removed size/l labels Apr 11, 2026
@jiridanek
Member Author

@ysok Thanks for the review! A few clarifications:

download-pip-packages.py is actively used: this PR wired it into the prefetch chain. The call path is prefetch-all.sh → create-requirements-lockfile.sh --download → download-pip-packages.py. The old inline shell download loop (73 lines parsing pylock.toml URLs directly) was replaced with a call to this script, which reads from requirements.txt instead. That's actually a simplification: it means we could eventually drop pylock.toml generation entirely and just use requirements.txt.

Hermeto for pip (Phase 2) — Agreed that switching pip to Hermeto (like gomod/rpm/npm) would be the proper long-term solution, including reading arch from .tekton/*.yaml. This PR takes the pragmatic incremental approach (parallel wget + --arch filtering) since it works now and cuts download size from ~5.8 GB to ~1.4 GB. Happy to track the Hermeto migration as a follow-up issue.

Arch from .tekton/*.yaml — Good pointer to the platform definitions. For local dev, --arch defaults to the host architecture (uname -m), which is the common case. Reading from Tekton YAML would make sense when we move to Hermeto.
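
A rough sketch of the "default to host architecture" behaviour, written in Python for illustration. The alias table and the `resolve_arch` name are assumptions: the thread mentions x86_64, aarch64, ppc64le, and s390x as supported architectures and uses amd64/arm64 in examples, but the exact values prefetch-all.sh accepts are not shown here.

```python
import platform
from typing import Optional

# Assumed alias table; the canonical names follow `uname -m` style.
ARCH_ALIASES = {
    "x86_64": "x86_64",
    "amd64": "x86_64",
    "aarch64": "aarch64",
    "arm64": "aarch64",
    "ppc64le": "ppc64le",
    "s390x": "s390x",
}


def resolve_arch(requested: Optional[str] = None) -> str:
    """Return the canonical arch, defaulting to the host machine type."""
    raw = requested or platform.machine()  # same information as `uname -m`
    try:
        return ARCH_ALIASES[raw.lower()]
    except KeyError:
        raise ValueError(f"unsupported architecture: {raw!r}") from None
```

Failing on unknown values, rather than silently falling back, keeps a typo in `--arch` from quietly prefetching the wrong set of artifacts.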

Hermeto temp dirs — Right, Hermeto already namespaces its output. The MD5-hashed cachi2/output/<hash>/ we added here is for the non-Hermeto steps (pip wget, generic artifacts) that don't get that for free. Once everything goes through Hermeto this namespacing could be simplified.

I've also addressed the @coderabbitai review findings in the latest commit (efd17a4) — npm failure propagation, arch filtering false positives, CACHI2_OUT_DIR not being used by hermeto-fetch-gomod.sh/download-rpms.sh, trailing slash normalization, etc.
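
To make the arch-filtering false positive concrete, here is a hedged re-implementation of the tag-based check. It is not the code from efd17a4: it assumes standard wheel file names, where the platform tag is the last dash-separated field and may contain several dot-separated tags, and it filters by CPU architecture only, not by operating system.

```python
def should_keep_for_arch(filename: str, arch: str) -> bool:
    """Keep a file if it is arch-independent or built for the given arch."""
    if not filename.endswith(".whl"):
        # sdists and other archives carry no platform tag in their name.
        return True
    # Wheel names end in ...-{python tag}-{abi tag}-{platform tag}.whl
    platform_tag = filename[: -len(".whl")].rsplit("-", 1)[-1]
    # Compressed tag sets look like "manylinux_2_17_x86_64.manylinux2014_x86_64".
    tags = platform_tag.split(".")
    # Compare whole tags: "any" must match exactly, never as a substring.
    return any(tag == "any" or tag.endswith("_" + arch) for tag in tags)
```

With substring matching, every `manylinux` wheel contains the letters "any" and would be kept for all architectures; comparing whole tags avoids that, so an x86_64 `manylinux` wheel is dropped when `arch` is `aarch64`.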


coderabbitai Bot commented Apr 11, 2026

@jiridanek Thanks for the thorough write-up. The call chain clarification (prefetch-all.sh → create-requirements-lockfile.sh --download → download-pip-packages.py) is clear, and the incremental approach (parallel wget + --arch filtering) is a reasonable pragmatic step given the 5.8 GB → 1.4 GB reduction.

Regarding the addressed findings in efd17a41e — happy to do a follow-up review pass on that commit. Triggering a re-review now.

On the Hermeto pip migration: since you mentioned tracking it as a follow-up issue, would you like me to open one? I can capture the key details — switching pip to Hermeto (mirroring gomod/rpm/npm), reading arch from .tekton/*.yaml, and the potential to drop pylock.toml generation — with backlinks to this PR.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@openshift-ci openshift-ci Bot added size/xl and removed size/xl labels Apr 11, 2026

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (2)
docs/learnings/hermetic-build-architecture_.md (1)

102-105: Consider noting --arch availability for cross-compilation.

The example omits --arch, which defaults to host architecture via uname -m. This is correct for native local builds, but users doing cross-architecture prefetching (e.g., prefetching arm64 wheels on x86_64) won't know the option exists without reading --help.

A brief note or expanded example would help:

Add cross-arch note
 ### Local development
 
 ```bash
+# Native build (defaults to host architecture)
 scripts/lockfile-generators/prefetch-all.sh --component-dir codeserver/ubi9-python-3.12
+
+# Cross-architecture build (e.g., prefetch arm64 deps on x86_64 host)
+scripts/lockfile-generators/prefetch-all.sh --component-dir codeserver/ubi9-python-3.12 --arch arm64
 make codeserver-ubi9-python-3.12

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/learnings/hermetic-build-architecture_.md` around lines 102 - 105, Add a
brief note and expanded example next to the prefetch command showing that
scripts/lockfile-generators/prefetch-all.sh defaults to the host architecture
(uname -m) and supports the --arch flag for cross-architecture prefetching;
update the snippet to include a short native example and a cross-arch example
(e.g., --arch arm64) and a one-line explanation so users know to use --arch when
prefetching for a different target architecture.

Makefile (1)

101-106: Hash calculation embeds path directly in Python string literal.

Line 102 interpolates $(COMPONENT_DIR_STR) directly into the Python code string. If a path ever contains a single quote, this breaks. prefetch-all.sh line 147 avoids this by passing the path as sys.argv[1].

Current Makefile targets use controlled paths (e.g., codeserver/ubi9-python-3.12), so this is low-risk but inconsistent.

Safer approach (matches prefetch-all.sh)
-$(eval CACHI2_HASH := $(shell python3 -c "import hashlib; print(hashlib.md5('$(COMPONENT_DIR_STR)'.encode()).hexdigest())"))
+$(eval CACHI2_HASH := $(shell python3 -c "import hashlib,sys; print(hashlib.md5(sys.argv[1].encode()).hexdigest())" "$(COMPONENT_DIR_STR)"))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` around lines 101 - 106, The MD5 call in CACHI2_HASH embeds
$(COMPONENT_DIR_STR) directly into the Python literal which breaks if the path
contains a single quote; change the python invocation in the CACHI2_HASH
definition to accept the path via sys.argv (like prefetch-all.sh) instead of
inlining it. Update the command that defines CACHI2_HASH (the line that
currently uses python3 -c "import hashlib; ... '$(COMPONENT_DIR_STR)'.encode()")
to call python3 -c that imports sys and hashlib and reads sys.argv[1] for
hashing, and pass '$(COMPONENT_DIR_STR)' as the argument (properly quoted) after
the -c so paths with quotes are handled safely.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited), Repository UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 9916894c-3821-4ed3-aa49-836733d60db0

📥 Commits

Reviewing files that changed from the base of the PR and between b3dcf4c and efd17a4.

📒 Files selected for processing (12)
  • Makefile
  • docs/hermetic-guide.md
  • docs/learnings/hermetic-build-architecture_.md
  • scripts/lockfile-generators/README.md
  • scripts/lockfile-generators/create-artifact-lockfile.py
  • scripts/lockfile-generators/create-requirements-lockfile.sh
  • scripts/lockfile-generators/download-npm.sh
  • scripts/lockfile-generators/helpers/download-pip-packages.py
  • scripts/lockfile-generators/helpers/download-rpms.sh
  • scripts/lockfile-generators/helpers/hermeto-fetch-gomod.sh
  • scripts/lockfile-generators/helpers/hermeto-fetch-rpm.sh
  • scripts/lockfile-generators/prefetch-all.sh
🚧 Files skipped from review as they are similar to previous changes (5)
  • scripts/lockfile-generators/create-artifact-lockfile.py
  • scripts/lockfile-generators/create-requirements-lockfile.sh
  • scripts/lockfile-generators/helpers/hermeto-fetch-rpm.sh
  • docs/hermetic-guide.md
  • scripts/lockfile-generators/README.md


openshift-ci Bot commented Apr 18, 2026

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


Labels

do-not-merge/work-in-progress needs-rebase review-requested GitHub Bot creates notification on #pr-review-ai-ide-team slack channel size/xl

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Local hermetic builds: cachi2/output/ is a global singleton, breaks multi-image workflows

3 participants