Description
Lee Howes asked me to post this feedback here.
To provide some context: we use zstd to compress every trade that occurs in the United States daily, which we store in a live custom database for the US SEC and the major financial institutions, as per Congressional mandate. This is a lot of data. For example, a single channel from OPRA, which supplies options and futures, might be 1-2 TB per day, and OPRA has forty-something channels, and that is just one exchange. We need to serve queries across this live dataset from as recently as a few seconds ago, but also going back several years, and we need to return the first result for any arbitrary query within a millisecond 99.9% of the time, while maintaining many sigmas of availability.
I'll firstly say, as I told Lee, that there is literally no other game in town than zstd for this problem. You're unique. This is because we have a power-law distribution of sizes of blocks to individually compress, so a tiny proportion are massive (e.g. AMZN), but there are millions which are tiny. We therefore need to train a shared dictionary for those millions of small blocks to get their aggregate size down (each has a lot of shared structure, so they share dictionaries well), but also use as many CPU cores as possible to parallelise compression. We also have a time-space tradeoff, so for some exchanges it is worth doing a fast compression at a low zstd level to get data off to S3 quickly, and later recompressing that data at a higher zstd level to achieve better storage density. zstd gives us all the runtime knobs and twiddles to manually customise our compression per exchange, and that's fabulous for us.
zstd is close to perfect for our needs, but its main weakness for us at present is dictionary training:
- You must assemble your training set into a single contiguous block of memory. This requires us to needlessly copy a hundred GB or so. If the API instead took a list of pointers, one per training sample, or better still a sequence of gather buffers (i.e. each training block might itself be distributed across memory), it would be a lot more efficient.
- There is no way for dictionary training to report progress, so the routine appears to hang for an unpredictable time. Internally, the routine loops over dictionary generation as it tries to find an optimal set. It would be very useful if it could invoke a callback per round, ideally letting the callback declare the training "good enough": e.g. we might run a timer and take whatever is best after one minute. It would also be useful if the warnings currently printed could instead be passed to the callback, so we can route them into our own logger.
- Perhaps this is already supported, but dictionary training ought to be embarrassingly parallelisable. I'd also like the ability to control the threads doing the training; for example, in C++23 all concurrency ought to act via an Executor. Breaking out the internals into public APIs, so we can compose our own training implementations, would be great.
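To make the first two points concrete, here is a rough sketch of the kind of signature we have in mind. Everything below is hypothetical: the names `ZDICT_trainFromSamples_v2` and `ZDICT_progressFn` are invented for illustration, and the "trainer" is a mock that just loops rounds until the callback asks it to stop. Internally the mock still performs the flattening copy that callers of the real contiguous-buffer API must do by hand today.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical: callback invoked once per training round.
 * Returning nonzero means "good enough, stop now". */
typedef int (*ZDICT_progressFn)(unsigned round, double score, void* opaque);

/* Hypothetical gather-style entry point: samples arrive as an array of
 * (pointer, size) pairs instead of one contiguous buffer, and progress is
 * reported through `cb`.  This mock still performs the copy the real API
 * forces on callers today, then runs a fake round loop. */
static size_t ZDICT_trainFromSamples_v2(void* dict, size_t dictCap,
                                        const void* const* samples,
                                        const size_t* sizes, unsigned nb,
                                        ZDICT_progressFn cb, void* opaque)
{
    size_t total = 0;
    for (unsigned i = 0; i < nb; i++) total += sizes[i];

    char* flat = malloc(total);          /* the copy we would like to avoid */
    if (!flat) return 0;
    size_t off = 0;
    for (unsigned i = 0; i < nb; i++) {
        memcpy(flat + off, samples[i], sizes[i]);
        off += sizes[i];
    }

    /* Mock round loop: a real trainer would refine the dictionary here. */
    size_t written = 0;
    for (unsigned round = 1; round <= 16; round++) {
        written = total < dictCap ? total : dictCap; /* placeholder result */
        if (cb && cb(round, 1.0 / round, opaque)) break;  /* early stop */
    }
    memcpy(dict, flat, written);
    free(flat);
    return written;
}

/* Example callback: stop after 4 rounds (stand-in for a wall-clock budget). */
static int stop_after_4(unsigned round, double score, void* opaque)
{
    (void)score;
    *(unsigned*)opaque = round;
    return round >= 4;
}
```

With this shape, a wall-clock budget becomes trivial: the callback compares the current time against a deadline and returns nonzero, and the caller keeps the best dictionary produced so far.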
Whilst our current use case for zstd doesn't need this, in general a scatter-gather API design is a better call than a single-buffer API. It lets one compress or decompress frames whose data is gathered from discontiguous locations in memory, or scattered discontiguously back out. As you may know, given you're under the same management structure, Eric Niebler designed Ranges for C++. Ranges naturally act upon discontiguous distributions of data across memory, composed into as-if contiguous byte sequences, and we on WG21 are going to be extending his work (along with Eric's FB team) into a wider, standardised framework for C++ in which everything naturally works in terms of scatter-gather buffer sequences, in order to avoid unnecessary memory copies. If zstd were similarly scatter-gather orientated, including its dictionary training API, that would be superb.
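To illustrate the "as-if contiguous" idea: a gather descriptor in the spirit of POSIX `struct iovec` lets a consumer walk discontiguous segments as one logical byte sequence, without ever flattening them. This is a toy sketch, not a proposal for zstd's actual API; `gather_buf` and `gather_read` are invented names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical gather descriptor, in the spirit of POSIX struct iovec:
 * one logical byte sequence made of discontiguous segments. */
typedef struct {
    const void* base;
    size_t      len;
} gather_buf;

/* Copy `n` bytes starting at logical offset `off` of the gathered sequence
 * into `dst`, without first materialising a flat copy of the whole input.
 * Returns the number of bytes actually read.  A scatter-gather codec would
 * walk its input segments the same way instead of demanding one flat
 * buffer up front. */
static size_t gather_read(const gather_buf* iov, size_t iovcnt,
                          size_t off, void* dst, size_t n)
{
    size_t done = 0;
    for (size_t i = 0; i < iovcnt && done < n; i++) {
        if (off >= iov[i].len) { off -= iov[i].len; continue; }
        size_t avail = iov[i].len - off;
        size_t take  = avail < n - done ? avail : n - done;
        memcpy((char*)dst + done, (const char*)iov[i].base + off, take);
        done += take;
        off = 0;
    }
    return done;
}
```

The same three-field descriptor would serve both directions: an array of them describes where compressed output should be scattered, just as it describes where input is gathered from.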
Finally, I'd like to thank those supporting this repo for taking the time to engage so fully with its users. On multiple occasions I have found treasures of discussion and advice in the commentary that has accumulated in this issue tracker over the years. Few engineers bother to engage so fully with users, to the detriment of users and the ecosystem alike, but you guys here do, and it's a real credit to you. So thank you!