cargo+pip: Clean up extracted tarballs from output directory#960
Conversation
a-ovchinnikov
left a comment
There was a problem hiding this comment.
LGTM with a non-blocking naming nitpick, a non-blocking question and a blocking commits arrangement suggestion.
| def _filter_packages_with_rust_code( | ||
| packages: list, output_dir: RootedPath, source_dir: RootedPath | ||
| ) -> list: | ||
| def _filter_packages_with_rust_code(packages: list[dict[str, Any]]) -> list[CargoPackageInput]: | ||
| packages_containing_rust_code = [] | ||
| tar_packages = [p for p in packages if tarfile.is_tarfile(p.get("path", ""))] | ||
| for p in tar_packages: | ||
| fname = p.get("path") | ||
| fname = p.get("path", "") |
There was a problem hiding this comment.
Hunk with unrelated changes -> standalone commit please.
There was a problem hiding this comment.
Okay, added a tiny commit.
| def remove_extracted(packages: list[CargoPackageInput]) -> None: | ||
| """Remove extracted tarballs in the output directory that contain Rust code.""" | ||
| for pkg in packages: | ||
| # in case the Rust code was in a subdirectory of the package tarball | ||
| pip_package_root = pkg.path.parts[0] | ||
| shutil.rmtree(pip_deps_dir.join_within_root(pip_package_root), ignore_errors=True) |
There was a problem hiding this comment.
This will do the job and is also simple when everyone has the immediate context, but presumably might not be the cleanest and scalable future proof solution because with the growing number of things we need to clean, it's harder to track what is cleaned and where, especially if it's buried this deep AND the artifacts are not cleaned at their origin, i.e. _filter_packages_with_rust_code.
I think we should think of a global cleaner passed around in the request object in an attribute which is common for all backends (e.g. named opaque) that would carry enough information (e.g. a callback) for artifacts that could not be short-lived and hence could no have been cleaned at their origin (i.e. with a simple TemporaryFile or TemporaryDirectory context manager) to clean them at the end our execution.
Take this just as an FYI, not that I expect this mechanism to be implemented as part of this PR, but I'd like us to naturally move in that direction and converge towards that as we add more and more logic.
| "pip/charset-normalizer-3.2.0.tar.gz": "sha256:3bb3d25a8e6c0aedd251753a79ae98a093c7e7b471faa3aa9a93a81431987ace", | ||
| "pip/click-8.1.7.tar.gz": "sha256:ca9853ad459e787e2192211578cc907e7594e294c7ccc834310722b41b9ca6de", | ||
| "pip/cryptography-41.0.3.tar.gz": "sha256:6d192741113ef5e30d89dcb5b956ef4e1578f304708701b8b73d38e3e1461f34", | ||
| "pip/cryptography-41.0.3/CHANGELOG.rst": "sha256:4601a7b95e2757b672e736720deada09ba2f0a82c76a9832729bf90c54c8090c", |
There was a problem hiding this comment.
Yeah, we need to sit down and figure out if we need ^these JSON files at all, it often just pollutes the diff :( .
Add more specific types annotation. The `source_dir` parameter was removed from the Rust package detection logic as it was unused. Signed-off-by: Michal Šoltis <msoltis@redhat.com>
Clean up extracted tarballs (pip packages) containing Rust code after fetching cargo dependencies to prevent leftover artifacts and increased size of the output directory. Improve type annonations for corresponding functions. Simplify path management by directly using relative Rust package paths instead of constructing them against output_dir. Closes hermetoproject#914 Regenerate test SBOM and remove all non-related files from fetch_deps_sha256sums.json that come from unpacked cryptography package inside deps/pip directory. Signed-off-by: Michal Šoltis <msoltis@redhat.com>
Maintainers will complete the following section
Note: if the contribution is external (not from an organization member), the CI
pipeline will not run automatically. After verifying that the CI is safe to run:
/ok-to-test(as is the standard for Pipelines as Code)