build: optimize hermetic prefetch with arch filtering and parallel downloads#3352
build: optimize hermetic prefetch with arch filtering and parallel downloads#3352
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Repository YAML (base), Central YAML (inherited), Repository UI (inherited) Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughReplaces the global Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Actionable Issues
CWE references: CWE-22 (Path Traversal), CWE-362 (Race Condition). 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3352 +/- ##
=====================================
Coverage 3.54% 3.54%
=====================================
Files 30 30
Lines 3359 3359
Branches 537 537
=====================================
Hits 119 119
Misses 3238 3238
Partials 2 2
Flags with carried forward coverage won't be shown. Click here to find out more. Continue to review full report in Codecov by Sentry.
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
scripts/lockfile-generators/helpers/download-pip-packages.py (1)
148-151:⚠️ Potential issue | 🟠 MajorReject path separators in remote artifact names (CWE-22).
filenamecomes from remote index metadata and is concatenated intoout_dir / filenamewithout validation. A compromised custom index can return../../...here and makewget -Ooverwrite files outside the cache directory during CI or local prefetch.As per coding guidelines, "Validate file paths (prevent path traversal)".Remediation
+from pathlib import PurePosixPath +from urllib.parse import urljoin, urlparse ... - download_url, sha, filename = m.group(1), m.group(2), m.group(3).strip() + raw_url, sha = m.group(1), m.group(2) + download_url = urljoin(page_url, raw_url) + filename = PurePosixPath(urlparse(download_url).path).name + if not filename: + continue if sha in wanted_hashes: out.append((download_url, filename, sha)) ... - to_fetch.append((out_dir / filename, expected_hash, url, name, version, filename)) + target = (out_dir / filename).resolve() + if out_dir not in target.parents: + raise ValueError(f"Refusing to write outside {out_dir}: {filename}") + to_fetch.append((target, expected_hash, url, name, version, filename))Also applies to: 232-234
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/lockfile-generators/helpers/download-pip-packages.py` around lines 148 - 151, The code appends a remote-provided filename (variable filename) directly into out so it will later be joined with out_dir, allowing path traversal; before using filename (inside the loop where re.finditer yields download_url, sha, filename and the other similar block at lines 232-234), validate and sanitize it: reject or canonicalize any path containing path separators or parent segments (../ or any os.path.sep or os.path.altsep), or replace it with a safe basename (e.g., use pathlib.Path(filename).name) and explicitly check for empty or '.'/ '..' names; only append (download_url, safe_filename, sha) when the sanitized name passes these checks so no "../" can escape the cache directory.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/learnings/hermetic-build-architecture_.md`:
- Around line 22-28: Update the document to stop referencing the old singleton
host paths like "cachi2/output/deps/..." and "cachi2/output:/cachi2/output" and
instead show the new hashed mount layout used by the system; for each generator
listed (e.g. create-artifact-lockfile.py, create-requirements-lockfile.sh,
download-npm.sh, hermeto-fetch-rpm.sh, create-go-lockfile.sh) replace examples
and outputs that point to "/cachi2/output/deps/..." with the corresponding
namespaced/hashed mount paths (the hashed-volume/* or namespaced mount pattern
used by the new layout) so readers are directed at the correct host paths and
remove any implication that /cachi2/output/deps/... exists directly under the
shared root; apply the same replacements to the other affected regions mentioned
in the review.
- Around line 89-95: Update the local prefetch example to include the
architecture flag so it doesn't default to amd64; change the command example
that calls scripts/lockfile-generators/prefetch-all.sh to show --arch=$(ARCH)
(or an explicit value like --arch=arm64) and mention that the helper defaults to
amd64 when ARCH is unset, and ensure the Makefile note about auto-injecting
cachi2/output/ remains unchanged; reference the prefetch script by name
(scripts/lockfile-generators/prefetch-all.sh) and the Makefile behavior in the
same paragraph.
In `@scripts/lockfile-generators/download-npm.sh`:
- Around line 258-275: The download_file function currently logs failures but
still exits 0, so xargs treats the run as successful; update download_file (the
function named download_file) to return a non-zero exit code when wget fails
(e.g., after removing the partial file call return 1) so the failure is
propagated, and keep the xargs invocation that calls bash -c 'download_file "$1"
"$2"' _ so xargs will observe the non-zero exit and fail the overall command
when any download fails.
In `@scripts/lockfile-generators/helpers/download-pip-packages.py`:
- Around line 182-209: The current should_keep_for_arch treats any substring
"any" or "noarch" as arch-independent which falsely matches manylinux_* names;
change the logic in should_keep_for_arch to detect "any" and "noarch" as
standalone wheel/platform tags rather than raw substrings (e.g., split the
filename on '-' (and/or '.' segments) and check if 'any' or 'noarch' appears as
a complete tag), keep the existing arch_tags/all_arches checks for explicit arch
matches, and ensure the early-return uses the tag presence (not substring) so
manylinux_* files are not misclassified.
In `@scripts/lockfile-generators/helpers/hermeto-fetch-rpm.sh`:
- Around line 217-228: When ARCH is set but yq is unavailable the script
currently skips filtering silently; modify hermeto-fetch-rpm.sh so that when
ARCH (the --arch flag) is non-empty and command -v yq fails you print a clear
error and exit non-zero instead of continuing. Concretely, locate the block that
checks [[ -n "$ARCH" ]] && command -v yq and change it so the presence of ARCH
is checked first, then if yq is missing emit an error referencing yq and the
requested $ARCH (or $rpm_arch) and exit 1; only proceed to set rpm_arch, copy
PREFETCH_DIR to HERMETO_SOURCE and run yq eval on rpms.lock.yaml when yq is
available.
In `@scripts/lockfile-generators/prefetch-all.sh`:
- Around line 143-147: The script currently hashes the raw CLI value in
COMPONENT_DIR when setting CACHI2_OUT_DIR, causing mismatches for paths like
"foo/" vs "foo"; normalize COMPONENT_DIR first (e.g., strip trailing slashes
and/or resolve to an absolute/real path using realpath or similar) before
computing the MD5 so the hash matches the Makefile behavior; update the code
that exports CACHI2_OUT_DIR to compute the hash from the normalized
COMPONENT_DIR value (refer to the COMPONENT_DIR variable and the CACHI2_OUT_DIR
assignment) and ensure ARCH export remains unchanged.
---
Outside diff comments:
In `@scripts/lockfile-generators/helpers/download-pip-packages.py`:
- Around line 148-151: The code appends a remote-provided filename (variable
filename) directly into out so it will later be joined with out_dir, allowing
path traversal; before using filename (inside the loop where re.finditer yields
download_url, sha, filename and the other similar block at lines 232-234),
validate and sanitize it: reject or canonicalize any path containing path
separators or parent segments (../ or any os.path.sep or os.path.altsep), or
replace it with a safe basename (e.g., use pathlib.Path(filename).name) and
explicitly check for empty or '.'/ '..' names; only append (download_url,
safe_filename, sha) when the sanitized name passes these checks so no "../" can
escape the cache directory.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited), Repository UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 9d156257-a8e3-4008-9edb-6430b9f9e28b
📒 Files selected for processing (10)
Makefiledocs/hermetic-guide.mddocs/learnings/hermetic-build-architecture_.mdscripts/lockfile-generators/README.mdscripts/lockfile-generators/create-artifact-lockfile.pyscripts/lockfile-generators/create-requirements-lockfile.shscripts/lockfile-generators/download-npm.shscripts/lockfile-generators/helpers/download-pip-packages.pyscripts/lockfile-generators/helpers/hermeto-fetch-rpm.shscripts/lockfile-generators/prefetch-all.sh
|
I think some files here are outdated and no longer used in the prefetch-all.sh script, eg. helper/download-pip-packages.py. For pip wheel, I directly wget the URL specified in requirements.txt instead of using Hermeto like gomod, rpm, or npm. It was a quick solution when I first started working on it, but for a proper solution, I recommend switching to Hermeto and reading the value from the .tekton/*yaml file to determine which architecture it needs to fetch. So ideally, we should replicate the way it was setup and being used in Konflux. Eg. we define arch here in .tekton file. For RPM, we defined those arches in rpms.in.yaml Note: Hermeto creates a random and unique temporary directory when used, preventing it from polluting with other directories. |
…wnloads - Add --arch flag to filter out unnecessary pip wheels and RPMs for other architectures, saving GBs of download per build. - Parallelize NPM and PIP downloads using xargs and concurrent.futures. - Namespace cachi2/output/ by component hash to prevent directory collisions and enable parallel local builds (fixes #3250).
- Fix CACHI2_OUT_DIR not used by hermeto-fetch-gomod.sh and download-rpms.sh, which would write deps to wrong directory under namespaced layout - Fix npm download failures silently swallowed: add return 1 on wget failure and check xargs exit code - Fix should_keep_for_arch false positives: parse wheel platform tag instead of naive substring matching (e.g. "any" in "manylinux" was always true) - Fix --arch silently skipped when yq missing in hermeto-fetch-rpm.sh: now fails fast with clear error - Normalize COMPONENT_DIR (strip trailing slash) before hashing to match Makefile's patsubst behavior - Use symlinks instead of cp -r in hermeto-fetch-rpm.sh arch filtering - Remove dead counter variables in download-npm.sh - Update docs to reflect hashed cachi2/output/<hash>/ layout - Fix stale README claim about download-pip-packages.py not being called - Fix missing newline at end of create-requirements-lockfile.sh Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
b3dcf4c to
efd17a4
Compare
|
@ysok Thanks for the review! A few clarifications:
Hermeto for pip (Phase 2) — Agreed that switching pip to Hermeto (like gomod/rpm/npm) would be the proper long-term solution, including reading arch from Arch from Hermeto temp dirs — Right, Hermeto already namespaces its output. The MD5-hashed I've also addressed the @coderabbitai review findings in the latest commit (efd17a4) — npm failure propagation, arch filtering false positives, |
|
Regarding the addressed findings in On the Hermeto pip migration: since you mentioned tracking it as a follow-up issue, would you like me to open one? I can capture the key details — switching pip to Hermeto (mirroring gomod/rpm/npm), reading arch from ✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
🧹 Nitpick comments (2)
docs/learnings/hermetic-build-architecture_.md (1)
102-105: Consider noting--archavailability for cross-compilation.The example omits
--arch, which defaults to host architecture viauname -m. This is correct for native local builds, but users doing cross-architecture prefetching (e.g., prefetching arm64 wheels on x86_64) won't know the option exists without reading--help.A brief note or expanded example would help:
Add cross-arch note
### Local development ```bash +# Native build (defaults to host architecture) scripts/lockfile-generators/prefetch-all.sh --component-dir codeserver/ubi9-python-3.12 + +# Cross-architecture build (e.g., prefetch arm64 deps on x86_64 host) +scripts/lockfile-generators/prefetch-all.sh --component-dir codeserver/ubi9-python-3.12 --arch arm64 make codeserver-ubi9-python-3.12</details> <details> <summary>🤖 Prompt for AI Agents</summary>Verify each finding against the current code and only fix it if needed.
In
@docs/learnings/hermetic-build-architecture_.mdaround lines 102 - 105, Add a
brief note and expanded example next to the prefetch command showing that
scripts/lockfile-generators/prefetch-all.sh defaults to the host architecture
(uname -m) and supports the --arch flag for cross-architecture prefetching;
update the snippet to include a short native example and a cross-arch example
(e.g., --arch arm64) and a one-line explanation so users know to use --arch when
prefetching for a different target architecture.</details> </blockquote></details> <details> <summary>Makefile (1)</summary><blockquote> `101-106`: **Hash calculation embeds path directly in Python string literal.** Line 102 interpolates `$(COMPONENT_DIR_STR)` directly into the Python code string. If a path ever contains a single quote, this breaks. `prefetch-all.sh` line 147 avoids this by passing the path as `sys.argv[1]`. Current Makefile targets use controlled paths (e.g., `codeserver/ubi9-python-3.12`), so this is low-risk but inconsistent. <details> <summary>Safer approach (matches prefetch-all.sh)</summary> ```diff -$(eval CACHI2_HASH := $(shell python3 -c "import hashlib; print(hashlib.md5('$(COMPONENT_DIR_STR)'.encode()).hexdigest())")) +$(eval CACHI2_HASH := $(shell python3 -c "import hashlib,sys; print(hashlib.md5(sys.argv[1].encode()).hexdigest())" "$(COMPONENT_DIR_STR)"))🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Makefile` around lines 101 - 106, The MD5 call in CACHI2_HASH embeds $(COMPONENT_DIR_STR) directly into the Python literal which breaks if the path contains a single quote; change the python invocation in the CACHI2_HASH definition to accept the path via sys.argv (like prefetch-all.sh) instead of inlining it. Update the command that defines CACHI2_HASH (the line that currently uses python3 -c "import hashlib; ... '$(COMPONENT_DIR_STR)'.encode()") to call python3 -c that imports sys and hashlib and reads sys.argv[1] for hashing, and pass '$(COMPONENT_DIR_STR)' as the argument (properly quoted) after the -c so paths with quotes are handled safely.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@docs/learnings/hermetic-build-architecture_.md`:
- Around line 102-105: Add a brief note and expanded example next to the
prefetch command showing that scripts/lockfile-generators/prefetch-all.sh
defaults to the host architecture (uname -m) and supports the --arch flag for
cross-architecture prefetching; update the snippet to include a short native
example and a cross-arch example (e.g., --arch arm64) and a one-line explanation
so users know to use --arch when prefetching for a different target
architecture.
In `@Makefile`:
- Around line 101-106: The MD5 call in CACHI2_HASH embeds $(COMPONENT_DIR_STR)
directly into the Python literal which breaks if the path contains a single
quote; change the python invocation in the CACHI2_HASH definition to accept the
path via sys.argv (like prefetch-all.sh) instead of inlining it. Update the
command that defines CACHI2_HASH (the line that currently uses python3 -c
"import hashlib; ... '$(COMPONENT_DIR_STR)'.encode()") to call python3 -c that
imports sys and hashlib and reads sys.argv[1] for hashing, and pass
'$(COMPONENT_DIR_STR)' as the argument (properly quoted) after the -c so paths
with quotes are handled safely.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited), Repository UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 9916894c-3821-4ed3-aa49-836733d60db0
📒 Files selected for processing (12)
Makefiledocs/hermetic-guide.mddocs/learnings/hermetic-build-architecture_.mdscripts/lockfile-generators/README.mdscripts/lockfile-generators/create-artifact-lockfile.pyscripts/lockfile-generators/create-requirements-lockfile.shscripts/lockfile-generators/download-npm.shscripts/lockfile-generators/helpers/download-pip-packages.pyscripts/lockfile-generators/helpers/download-rpms.shscripts/lockfile-generators/helpers/hermeto-fetch-gomod.shscripts/lockfile-generators/helpers/hermeto-fetch-rpm.shscripts/lockfile-generators/prefetch-all.sh
🚧 Files skipped from review as they are similar to previous changes (5)
- scripts/lockfile-generators/create-artifact-lockfile.py
- scripts/lockfile-generators/create-requirements-lockfile.sh
- scripts/lockfile-generators/helpers/hermeto-fetch-rpm.sh
- docs/hermetic-guide.md
- scripts/lockfile-generators/README.md
|
PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Problem
Currently,
scripts/lockfile-generators/prefetch-all.shdownloads dependencies for all supported architectures (x86_64,aarch64,ppc64le,s390x) sequentially before the build begins. This results in a massive waste of network bandwidth, disk space, and CI compute time (e.g., downloading ~5.8 GB of wheels and RPMs when the target architecture only needs ~1.4 GB).Furthermore,
cachi2/output/was hardcoded as a global singleton. When building multiple different images sequentially or in parallel, the prefetch scripts would overwrite or pollute this shared directory, leading to corrupted or incorrect builds (Fixes #3250).Solution
This PR introduces massive optimizations for the local and CI hermetic prefetch process (Phase 1):
--archflag): Added toprefetch-all.shto prevent downloading GBs of unused pip wheels and RPMs for other architectures.xargsandconcurrent.futures.cachi2/output/: The output directory is now dynamically namespaced by the component hash (e.g.,cachi2/output/<hash>/deps/). This prevents directory collisions and enables safe parallel local builds. TheMakefilehas been updated to auto-detect and mount this new directory structure.cc @coderabbitai
Summary by CodeRabbit
New Features
Performance Improvements
Documentation