
How to compress streamed items in batches of (roughly) the same (compressed) size? #204

Open
@KrzysFR

Description


How can we batch items according to the resulting compressed (batch) size, rather than the number of items or the input buffer size, especially when the compression ratio cannot be predicted and varies wildly?

Use case

We have a list of small items of various sizes (a few hundred bytes to a few KB) that we want to compress in batches, packing as many items as possible per batch, so that the compressed size of each batch always ends up around 30 KB, plus or minus a few KB (there is no hard limit).

Each batch will be stored individually, and must contain whole items: the decompressor will read batches individually, and must end up with an array of complete items. This means we cannot simply compress the items as one giant stream and chunk the compressed output into blocks of exactly 30 KB.

We don't need each batch to be exactly the same size; we have a margin of plus or minus a few KB. Also, we don't care about ordering: each batch can be read or deleted in any order once produced.

And finally, the items have very different compression ratios, which makes it almost impossible to estimate how much input would compress down to 30 KB. The only real way is to compress them and see what we get. This means that some batches may require only a few items to output 30 KB of compressed text, while others may require a few hundred.

Example

An example of that would be compressing JSON or XML fragments, streamed from a socket, while in the background outputting blocks of ~30 KB of compressed data every now and then, and without having to try compressing 1 fragment, then 2 fragments, then 3 fragments, ... until finally we have enough fragments to fill the target compressed batch size.

Each batch produced would then be sent to some worker thread somewhere, decompressed, and passed through a filter that would see complete JSON documents (and not truncated chunks of JSON at the start or end of each batch).

Solution that does not work

We already tried estimating how much input would compress down to ~30 KB (let's say about 100 KB of "text"), but in practice this does not work because of too much variation in the data we consume. Compression ratios vary widely, and we sometimes end up with batches anywhere between 3 KB and 90 KB, which is way outside the desired result.

Solution that works but is slow

The current algorithm (which is too slow; a rough C sketch follows the list):

  • Start with an empty buffer
  • Read next item from stream
    • Append item to the compression buffer.
    • Compress the buffer, and measure the compressed size:
      • If >= 30KB, complete the block, output it, and empty the buffer
      • If not, continue on with the next item
  • If no more items, but buffer is not empty, compress it and output it.
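
For reference, this is roughly what that loop looks like in C, using the one-shot `ZSTD_compress` API. This is only a minimal sketch: `read_next_item` and `emit_batch` are hypothetical stand-ins for our stream source and batch sink, and error/bounds checks are omitted.

```c
#include <stdlib.h>
#include <string.h>
#include <zstd.h>

#define TARGET_COMPRESSED_SIZE (30 * 1024)   /* ~30 KB per batch         */
#define MAX_RAW_SIZE           (1 << 20)     /* cap on one batch's input */

/* Hypothetical item source: fills item/itemSize, returns 0 when the stream ends. */
extern int  read_next_item(const void** item, size_t* itemSize);
/* Hypothetical sink receiving one finished compressed batch. */
extern void emit_batch(const void* data, size_t size);

void batch_items_slow(void)
{
    char*  const raw    = malloc(MAX_RAW_SIZE);
    size_t const dstCap = ZSTD_compressBound(MAX_RAW_SIZE);
    char*  const dst    = malloc(dstCap);
    size_t rawSize = 0;

    const void* item; size_t itemSize;
    while (read_next_item(&item, &itemSize)) {
        memcpy(raw + rawSize, item, itemSize);   /* append the item to the buffer */
        rawSize += itemSize;

        /* Re-compress the whole buffer from scratch: this is the slow part. */
        size_t const cSize = ZSTD_compress(dst, dstCap, raw, rawSize, 1);
        if (!ZSTD_isError(cSize) && cSize >= TARGET_COMPRESSED_SIZE) {
            emit_batch(dst, cSize);              /* complete the block */
            rawSize = 0;                         /* start a new batch  */
        }
    }
    if (rawSize > 0) {                           /* flush the final partial batch */
        size_t const cSize = ZSTD_compress(dst, dstCap, raw, rawSize, 1);
        if (!ZSTD_isError(cSize)) emit_batch(dst, cSize);
    }
    free(raw); free(dst);
}
```

The problem is obvious: every new item triggers a full re-compression of everything accumulated so far, so the cost of producing one batch grows quadratically with the number of items in it.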

Ideal solution

What we would prefer:

  • Start a new "batch" compression context
  • Read next item from stream
    • "Append" the item to the compression context, and get current compressed size of the partial block
      • If the partial compression block length >= 30KB, complete the block, get the result compressed text, output it, and "reset" the compression context (to start a new block)
      • If not, continue with the next item
  • If no more items, but a compression block is still pending, complete the block and output it.

Even though the compressed block would have been created by "appending" N fragments to it, we want the compressed result to be identical to compressing the concatenation of the N fragments in a single "zstd_compress" call.
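
For illustration only, here is the kind of loop we have in mind, sketched against zstd's streaming API (`ZSTD_compressStream2` with `ZSTD_e_flush` after each item, so that `out.pos` reflects the compressed size emitted so far). This is an assumption on my part, not a confirmed answer: flushing at every item boundary ends the current block, so the output would not be byte-identical to a single one-shot compression of the concatenation, and the ratio suffers a little. `read_next_item` and `emit_batch` are the same hypothetical source and sink as in the previous sketch, and error handling is omitted.

```c
#include <stdlib.h>
#include <zstd.h>

#define TARGET_COMPRESSED_SIZE (30 * 1024)
#define DST_CAPACITY           (1 << 20)   /* generous room for one batch plus one item */

extern int  read_next_item(const void** item, size_t* itemSize);
extern void emit_batch(const void* data, size_t size);

void batch_items_streaming(void)
{
    ZSTD_CCtx* const cctx = ZSTD_createCCtx();
    char* const dst = malloc(DST_CAPACITY);
    ZSTD_outBuffer out = { dst, DST_CAPACITY, 0 };

    const void* item; size_t itemSize;
    while (read_next_item(&item, &itemSize)) {
        ZSTD_inBuffer in = { item, itemSize, 0 };
        /* "Append" the item; flushing makes out.pos the compressed size so far. */
        size_t remaining;
        do {
            remaining = ZSTD_compressStream2(cctx, &out, &in, ZSTD_e_flush);
        } while (remaining != 0);   /* 0 = fully flushed (assumes dst never fills up) */

        if (out.pos >= TARGET_COMPRESSED_SIZE) {
            ZSTD_inBuffer empty = { NULL, 0, 0 };
            size_t r;
            do {                                                   /* close the frame     */
                r = ZSTD_compressStream2(cctx, &out, &empty, ZSTD_e_end);
            } while (r != 0);
            emit_batch(dst, out.pos);                              /* one complete batch  */
            ZSTD_CCtx_reset(cctx, ZSTD_reset_session_only);        /* "reset" the context */
            out.pos = 0;
        }
    }
    if (out.pos > 0) {                                             /* pending partial batch */
        ZSTD_inBuffer empty = { NULL, 0, 0 };
        size_t r;
        do { r = ZSTD_compressStream2(cctx, &out, &empty, ZSTD_e_end); } while (r != 0);
        emit_batch(dst, out.pos);
    }
    free(dst); ZSTD_freeCCtx(cctx);
}
```

Here `out.pos` plays the role of the "current compressed size of the partial block" in the list above, and each finished batch is an independent zstd frame that can be decompressed on its own.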

Questions

I think that zstd_compress_continue could be used for this but I'm not sure about the following points:

  • Does zstd_compress_continue output a full block when called? I.e., would calling zstd_compress_continue on some input data produce the same output as calling it 4 times, each with a quarter of the same input data?
  • Let's say the current compressed result is at 28 KB (almost there), and I append the final item that would make it reach 30 KB. What would happen if this item were incompressible and large enough to overflow the output compression buffer? How would I be able to recover from the error, and output the compressed block as it was BEFORE the call?
  • What would it take to have a compression method that takes an array of input buffers and a target destination buffer + size, and returns the compressed size as well as the number of input buffers that were consumed (or fails if even the first buffer would overflow the destination buffer)?
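
To make that last point concrete, the signature we have in mind might look something like this. It is purely hypothetical: no such function exists in zstd as far as I know, and all names are made up.

```c
#include <stddef.h>

/* HYPOTHETICAL illustration only, not part of zstd.
 * Compress as many whole input buffers as will fit into dst, in order.
 * Returns the compressed size written to dst, and sets *nbConsumed to the
 * number of input buffers actually included in the compressed block.
 * Fails (returns an error code) if even srcs[0] alone would overflow dst. */
size_t compress_batch_bounded(void* dst, size_t dstCapacity,
                              const void* const* srcs, const size_t* srcSizes,
                              size_t nbSrcs,
                              size_t* nbConsumed,
                              int compressionLevel);
```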
