Description
Is your feature request related to a problem? Please describe.
Zip files always retain an index (the central directory) located separately from each entry's possibly-compressed data. This allows performing high-level split/merge operations without de/recompressing file contents, which improves performance on benchmarks compared to serially iterating over each entry to extract, or serially iterating over each file to compress.
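To make "an index located separately" concrete, here is a minimal std-only sketch that finds the end-of-central-directory record at the tail of an archive and reads where the index lives. The names `IndexLocation` and `locate_index` are hypothetical and not part of the `zip` crate. The sketch assumes a single-disk, non-zip64 archive, and its backwards signature scan could be fooled by an archive comment that happens to contain the signature bytes.

```rust
/// Where the central directory (the index) lives, as recorded in the
/// end-of-central-directory (EOCD) record. Hypothetical helper type.
#[derive(Debug, PartialEq)]
struct IndexLocation {
    entry_count: u16,
    central_directory_size: u32,
    central_directory_offset: u32,
}

/// EOCD signature 0x06054b50, stored little-endian.
const EOCD_SIGNATURE: [u8; 4] = [0x50, 0x4b, 0x05, 0x06];
/// Size of the EOCD record when the archive comment is empty.
const EOCD_MIN_LEN: usize = 22;

fn read_u16(bytes: &[u8], at: usize) -> u16 {
    u16::from_le_bytes([bytes[at], bytes[at + 1]])
}

fn read_u32(bytes: &[u8], at: usize) -> u32 {
    u32::from_le_bytes([bytes[at], bytes[at + 1], bytes[at + 2], bytes[at + 3]])
}

/// Scan backwards from the end of the archive for the EOCD record and
/// report where the index is. No entry data is read or decompressed.
fn locate_index(archive: &[u8]) -> Option<IndexLocation> {
    if archive.len() < EOCD_MIN_LEN {
        return None;
    }
    let last_start = archive.len() - EOCD_MIN_LEN;
    (0..=last_start)
        .rev()
        .find(|&i| archive[i..i + 4] == EOCD_SIGNATURE)
        .map(|i| IndexLocation {
            entry_count: read_u16(archive, i + 10),
            central_directory_size: read_u32(archive, i + 12),
            central_directory_offset: read_u32(archive, i + 16),
        })
}
```

Calling `locate_index(&bytes)` on a whole archive held in memory returns the entry count plus the offset and size of the index, which is all a split/merge operation needs before it starts copying entry data around.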
Describe the solution you'd like
It's possible to extract zip files in parallel (see #72) as well as merge them to create archives in parallel (see discussion in #73).
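As a rough illustration of the "create in parallel, then merge" shape, the sketch below builds independent "child" buffers on separate threads and appends them to a "parent" in input order. It is only a model: `build_child` and `parallel_build_and_merge` are hypothetical names, and the children are plain byte buffers rather than real zip data, so nothing here reflects the actual implementation in #73 or medusa-zip.

```rust
use std::io::{self, Cursor};
use std::thread;

/// Hypothetical stand-in for compressing one batch of files into a child
/// archive. Here it only writes each name followed by a newline.
fn build_child(batch: &[&str]) -> Vec<u8> {
    let mut out = Vec::new();
    for name in batch {
        out.extend_from_slice(name.as_bytes());
        out.push(b'\n');
    }
    out
}

/// Build one child per batch on its own thread, then append the children
/// to the parent in the original batch order.
fn parallel_build_and_merge(batches: &[Vec<&str>]) -> io::Result<Vec<u8>> {
    let children: Vec<Vec<u8>> = thread::scope(|s| {
        // Spawn every worker before joining any, so the batches overlap.
        let handles: Vec<_> = batches
            .iter()
            .map(|batch| s.spawn(move || build_child(batch)))
            .collect();
        // Joining in spawn order keeps the output deterministic.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    });

    let mut parent = Vec::new();
    for child in children {
        // One bulk copy per child; the child's bytes are not reinterpreted.
        io::copy(&mut Cursor::new(child), &mut parent)?;
    }
    Ok(parent)
}
```

The expensive step (standing in for compression) runs concurrently, while the merge is a cheap sequential append whose order is fixed by the input and not by thread scheduling.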
Describe alternatives you've considered
While parallel zip extraction as in #72 has likely been implemented elsewhere, to my knowledge the parallel split/merge technique in #73 (researched for pex-tool/pex#2175 and prototyped in https://github.com/cosmicexplorer/medusa-zip) has not been discussed or implemented before in other zip tooling (please let me know of any prior art for this!).
Additional context
TODO:
- refactor reader wrappers to use generic type params and not concrete vtables in #207 (this gets us `Send` bounds)
- parallel/pipelined extraction in #208
- bulk copy (no de/recompression) with entry renaming, as in pex-tool/pex#2175 (consume packed wheel cache in zipapp creation)
  - as in that pex change, bulk copy with renaming enables reconstituting a "parent" zip file from an ordered sequence of "child" zips, which may be used to very quickly reconstruct large zip files from immutable cached components.
  - when renaming is not required, `ZipWriter::merge_contents()` already works with a single `io::copy()` call. Bulk copy with rename avoids de/recompression of file data, but must edit each renamed local file header and therefore requires O(n) `io::copy()` calls.
- parallel split/merge for extremely fast creation, as in https://github.com/cosmicexplorer/medusa-zip
  - this `zip` crate should probably not get into the weeds of crawling the filesystem, which keeps `medusa-zip` useful as a separate crate, and ensures we don't add too much extraneous code to this one.
  - however, the process of merging an ordered sequence of "child" zips with `ZipWriter::merge_contents()` can be parallelized, and this is something the `zip` crate should be able to do.
- this
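
For the bulk-copy-with-rename item above, this is a minimal std-only sketch of why each renamed entry needs its own header edit: the file name and its length live in the entry's local file header, so the header is rewritten while the extra field and the (possibly compressed) data are copied through untouched. `rename_local_entry` is a hypothetical helper, not an API of the `zip` crate. It assumes the slice holds exactly one local entry, and it does not update the matching central directory record, which a real implementation would also have to rename.

```rust
/// Local file header signature 0x04034b50, stored little-endian.
const LOCAL_HEADER_SIGNATURE: [u8; 4] = [0x50, 0x4b, 0x03, 0x04];
/// Length of the fixed-size part of a local file header.
const FIXED_HEADER_LEN: usize = 30;
/// Offset of the 2-byte file name length field.
const NAME_LEN_OFFSET: usize = 26;
/// Offset of the 2-byte extra field length field.
const EXTRA_LEN_OFFSET: usize = 28;

/// Copy one local entry (header, extra field, data) with a new file name.
/// Returns `None` if the input does not start with a complete local file
/// header or the new name does not fit in the 16-bit length field.
fn rename_local_entry(entry: &[u8], new_name: &str) -> Option<Vec<u8>> {
    if entry.len() < FIXED_HEADER_LEN || entry[..4] != LOCAL_HEADER_SIGNATURE {
        return None;
    }
    if new_name.len() > u16::MAX as usize {
        return None;
    }
    let old_name_len =
        u16::from_le_bytes([entry[NAME_LEN_OFFSET], entry[NAME_LEN_OFFSET + 1]]) as usize;
    // Everything after the old name: extra field plus file data.
    let rest = entry.get(FIXED_HEADER_LEN + old_name_len..)?;

    let mut out = Vec::with_capacity(FIXED_HEADER_LEN + new_name.len() + rest.len());
    // Fields before the name length (CRC, sizes, method) are unchanged.
    out.extend_from_slice(&entry[..NAME_LEN_OFFSET]);
    out.extend_from_slice(&(new_name.len() as u16).to_le_bytes());
    out.extend_from_slice(&entry[EXTRA_LEN_OFFSET..FIXED_HEADER_LEN]);
    out.extend_from_slice(new_name.as_bytes());
    // Bulk copy: the data is never decompressed or recompressed.
    out.extend_from_slice(rest);
    Some(out)
}
```

Because the name length changes the size of each header, every following entry shifts, which is consistent with needing one copy per entry instead of a single copy for the whole archive.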