Description
Hello! Not sure if this is the correct forum, so please feel free to redirect me if needed.
At present, the WARC 1.1 specification includes an informative annex D that recommends using gzip to compress individual captures if so desired.
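For readers unfamiliar with per-capture compression, here is a minimal sketch using Python's standard `gzip` module. Each record becomes its own gzip member and the members are concatenated, so a single record can be read back from a known offset. The record bodies and the helper names `compress_records` and `read_record` are made up for illustration and are not taken from the spec.

```python
import gzip

def compress_records(records):
    """Compress each record as its own gzip member and concatenate the members.

    Returns the combined bytes and the starting offset of each member.
    """
    out = bytearray()
    offsets = []
    for record in records:
        offsets.append(len(out))
        out += gzip.compress(record)
    return bytes(out), offsets

def read_record(blob, offsets, index):
    """Decompress only the member holding the record at `index`."""
    start = offsets[index]
    end = offsets[index + 1] if index + 1 < len(offsets) else len(blob)
    return gzip.decompress(blob[start:end])

# Hypothetical, heavily simplified records (not complete WARC records).
records = [
    b"WARC/1.1\r\nWARC-Type: response\r\n\r\nfirst payload\r\n\r\n",
    b"WARC/1.1\r\nWARC-Type: response\r\n\r\nsecond payload\r\n\r\n",
]
blob, offsets = compress_records(records)
second = read_record(blob, offsets, 1)
```

Because the members are independent, a reader can seek to any offset and decompress one capture without touching the rest of the file; decompressing the whole blob yields the records back to back.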
Would the IIPC be open to changing this recommendation to zstd? zstd is an open-source, non-patent-encumbered algorithm released by Facebook. It is technically superior to gzip along several axes:
- compression ratio - for a given amount of CPU time, zstd will produce a smaller output than gzip
- compression time - for a given level of compression, zstd is 3-4x faster to compress
- decompression time - regardless of compression level, zstd is ~3x faster to decompress
- CPU cost vs storage size tradeoffs - zstd supports a much wider range of compression speed/ratio choices than zlib, allowing people to tune for CPU cost vs long-term storage cost
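Anyone who wants to check these tradeoffs on their own captures could start from a sketch like the one below. It assumes the third-party `zstandard` package for the zstd side and falls back to zlib only if that package is not installed. The sample data is synthetic and highly repetitive, so the numbers it prints are not representative of real archives and are not meant to reproduce the figures above.

```python
import time
import zlib

def measure(name, compress, data):
    """Time one compression call and report the resulting ratio."""
    start = time.perf_counter()
    compressed = compress(data)
    elapsed = time.perf_counter() - start
    return {"codec": name, "ratio": len(data) / len(compressed), "seconds": elapsed}

# zlib levels from the standard library (DEFLATE, the algorithm gzip uses).
codecs = {f"zlib-{lvl}": (lambda d, lvl=lvl: zlib.compress(d, lvl)) for lvl in (1, 6, 9)}

try:
    import zstandard  # third-party package; assumed, not required

    for lvl in (1, 3, 19):
        codecs[f"zstd-{lvl}"] = (
            lambda d, lvl=lvl: zstandard.ZstdCompressor(level=lvl).compress(d)
        )
except ImportError:
    pass  # only the zlib rows are measured

# Synthetic sample; replace with real capture bytes for a meaningful comparison.
sample = b"<html><body>" + b"<p>example capture text</p>" * 20000 + b"</body></html>"

results = [measure(name, fn, sample) for name, fn in codecs.items()]
for row in results:
    print(f'{row["codec"]:>8}  ratio={row["ratio"]:.1f}  seconds={row["seconds"]:.4f}')
```

A single timed call per codec is noisy; repeating each measurement and running it over real WARC records would give more trustworthy numbers.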
It is comparable to gzip along other important axes, namely being open source and having bindings for all major languages.
Ben Wills has done some analysis on the impact of zstd vs gzip for the Common Crawl. You can read his analysis at https://github.com/benwills/proposal-warc-to-zstandard, or some discussion on the Common Crawl mailing list at https://groups.google.com/forum/?hl=en#!topic/common-crawl/bO6B6xQJnEE. For the portion of the Common Crawl that was analyzed, switching to zstd results in an ~18% decrease in storage size and ~3x throughput for readers.
Additionally, zstd provides the zlibWrapper, which transparently supports decompressing zlib or zstd streams. This should help make the migration path easier for people who have some collection of archives already stored in zlib format.
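To illustrate why a mixed collection is workable, here is a rough sketch of telling the formats apart by their leading bytes and dispatching to the right decompressor. This is not the zlibWrapper's code or API; `sniff_format` and `decompress_any` are hypothetical helpers, and the zstd branch assumes the third-party `zstandard` package.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"          # gzip member header (RFC 1952)
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"  # zstd frame magic 0xFD2FB528, little-endian

def sniff_format(data):
    """Guess the compression format from the first bytes of a stream."""
    if data[:4] == ZSTD_MAGIC:
        return "zstd"
    if data[:2] == GZIP_MAGIC:
        return "gzip"
    # zlib header (RFC 1950): method 8 (deflate) and a 16-bit check divisible by 31.
    if len(data) >= 2 and data[0] & 0x0F == 8 and int.from_bytes(data[:2], "big") % 31 == 0:
        return "zlib"
    return "unknown"

def decompress_any(data):
    """Decompress gzip, zlib, or zstd data based on the sniffed format."""
    fmt = sniff_format(data)
    if fmt == "gzip":
        return gzip.decompress(data)
    if fmt == "zlib":
        return zlib.decompress(data)
    if fmt == "zstd":
        import zstandard  # third-party package; assumed to be installed

        # One-shot decompression needs the content size stored in the frame.
        return zstandard.ZstdDecompressor().decompress(data)
    raise ValueError("unrecognized compression format")

payload = b"WARC/1.1\r\nWARC-Type: response\r\n\r\nexample\r\n\r\n"
from_gzip = decompress_any(gzip.compress(payload))
from_zlib = decompress_any(zlib.compress(payload))
```

The sketch ignores zstd skippable frames and dictionaries, and it only inspects the first frame, so a real reader would need to repeat the check at each record boundary.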