Skip to content

cargo+pip: Clean up extracted tarballs from output directory#960

Merged
slimreaper35 merged 2 commits into
hermetoproject:mainfrom
slimreaper35:cargo-pip
May 27, 2025
Merged

cargo+pip: Clean up extracted tarballs from output directory#960
slimreaper35 merged 2 commits into
hermetoproject:mainfrom
slimreaper35:cargo-pip

Conversation

@slimreaper35

Copy link
Copy Markdown
Member

Maintainers will complete the following section

  • Commit messages are descriptive enough
  • Code coverage from testing does not decrease and new code is covered
  • Docs updated (if applicable)
  • Docs links in the code are still valid (if docs were updated)

Note: if the contribution is external (not from an organization member), the CI
pipeline will not run automatically. After verifying that the CI is safe to run:

@a-ovchinnikov a-ovchinnikov left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with a non-blocking naming nitpick, a non-blocking question and a blocking commits arrangement suggestion.

Comment thread hermeto/core/package_managers/pip.py
Comment thread hermeto/core/package_managers/pip.py Outdated
Comment on lines -2282 to +2295
def _filter_packages_with_rust_code(
packages: list, output_dir: RootedPath, source_dir: RootedPath
) -> list:
def _filter_packages_with_rust_code(packages: list[dict[str, Any]]) -> list[CargoPackageInput]:
packages_containing_rust_code = []
tar_packages = [p for p in packages if tarfile.is_tarfile(p.get("path", ""))]
for p in tar_packages:
fname = p.get("path")
fname = p.get("path", "")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hunk with unrelated changes -> standalone commit please.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, added a tiny commit.

Comment on lines +257 to +262
def remove_extracted(packages: list[CargoPackageInput]) -> None:
"""Remove extracted tarballs in the output directory that contain Rust code."""
for pkg in packages:
# in case the Rust code was in a subdirectory of the package tarball
pip_package_root = pkg.path.parts[0]
shutil.rmtree(pip_deps_dir.join_within_root(pip_package_root), ignore_errors=True)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will do the job and is also simple when everyone has the immediate context, but presumably might not be the cleanest and scalable future proof solution because with the growing number of things we need to clean, it's harder to track what is cleaned and where, especially if it's buried this deep AND the artifacts are not cleaned at their origin, i.e. _filter_packages_with_rust_code.
I think we should think of a global cleaner passed around in the request object in an attribute which is common for all backends (e.g. named opaque) that would carry enough information (e.g. a callback) for artifacts that could not be short-lived and hence could no have been cleaned at their origin (i.e. with a simple TemporaryFile or TemporaryDirectory context manager) to clean them at the end our execution.

Take this just as an FYI, not that I expect this mechanism to be implemented as part of this PR, but I'd like us to naturally move in that direction and converge towards that as we add more and more logic.

"pip/charset-normalizer-3.2.0.tar.gz": "sha256:3bb3d25a8e6c0aedd251753a79ae98a093c7e7b471faa3aa9a93a81431987ace",
"pip/click-8.1.7.tar.gz": "sha256:ca9853ad459e787e2192211578cc907e7594e294c7ccc834310722b41b9ca6de",
"pip/cryptography-41.0.3.tar.gz": "sha256:6d192741113ef5e30d89dcb5b956ef4e1578f304708701b8b73d38e3e1461f34",
"pip/cryptography-41.0.3/CHANGELOG.rst": "sha256:4601a7b95e2757b672e736720deada09ba2f0a82c76a9832729bf90c54c8090c",

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we need to sit down and figure out if we need ^these JSON files at all, it often just pollutes the diff :( .

Add more specific types annotation. The `source_dir` parameter was
removed from the Rust package detection logic as it was unused.

Signed-off-by: Michal Šoltis <msoltis@redhat.com>
Clean up extracted tarballs (pip packages) containing Rust code after
fetching cargo dependencies to prevent leftover artifacts and increased
size of the output directory.

Improve type annonations for corresponding functions. Simplify path
management by directly using relative Rust package paths instead of
constructing them against output_dir.

Closes hermetoproject#914

Regenerate test SBOM and remove all non-related files from
fetch_deps_sha256sums.json that come from unpacked cryptography package
inside deps/pip directory.

Signed-off-by: Michal Šoltis <msoltis@redhat.com>
@slimreaper35 slimreaper35 added this pull request to the merge queue May 27, 2025
Merged via the queue into hermetoproject:main with commit 8a89584 May 27, 2025
16 checks passed
@slimreaper35 slimreaper35 deleted the cargo-pip branch May 27, 2025 12:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

3 participants