[WIP] Minor compression ratio improvements for high compression modes on small data #2781
This PR bundles 3 modifications for high compression modes, targeting a higher compression ratio on "small" (one-block) data. The benefits may extend beyond the first block, but they vanish quickly (and so does the cost).
In summary:

1. `greedy` algorithm for collection of initial statistics in `btultra`. This step is globally positive, though benefits vary widely depending on the file. It comes at a speed cost of around -10% on one-block data (the cost rapidly becomes insignificant for large inputs, like `silesia.tar`). For this reason, this new method is not used for `btopt`.
2. `btultra2`, which uses `btultra` as an initializer. The impact is small, but consistently positive, and there is no significant cost.
3. `greedy -> btultra -> btultra2` initialization. The impact is globally positive, but not always. There are wide variations depending on file type and input size. For this reason, the proposed heuristic is to enable the new method for inputs >= 8 KB, as the impact is rather positive for larger blocks, and rather negative for small ones. The speed cost, while not null, is not significant.

### 1. Using `greedy` to collect initial statistics for `btultra`

This step is globally beneficial, by non-negligible amounts. But there is also a sizable speed cost to it, so the trade-off is debatable. I'm rather favorable to it, since speed is less of a concern at `btultra` levels, but it also explains why this technique is not extended to the faster variant `btopt`.

For benchmarking, I'm using multiple block sizes, in order to observe the impact of initialization. The baseline is previous PR #2771, which updated the initialization of `btultra` with slightly altered starting values, producing better compression ratios on small data.

**2 KB blocks, btultra, level 14**
_(benchmark tables `dev` vs `ultra_init` not reproduced here)_

At 2 KB block size, the impact is small but globally positive, saving ~100 KB. But it comes at a cost, with speed down by -8.2% on average, due to the heavier initialization process using `greedy` instead of "blind" default values. There are very wide variations in this initialization cost depending on the exact file. I'm unable to explain some of the most positive changes, such as a speed improvement for `nci` (+3.3%), but it was tested multiple times to confirm the timing, and all measurements are consistently accurate within ~0.2%, so it's not a fluke.

The best impact is achieved on `x-ray` (ratio +0.66%), but there is one negative impact, on `mr`, which is pretty large (ratio -0.59%). I haven't yet been able to find a clear explanation for this case.

followup investigation: the large negative impact on `mr` seems strongly correlated to the difference in initialization of literals. Essentially, with the "default" literals initialization "from source", the resulting literals statistics are a lot flatter than when using the outcome of `greedy` (which tends to leave a lot of literals). For some reason, flatter literals statistics seem better for `mr`. It's unclear if this is a general conclusion, or really a specificity of `mr`.

One possible explanation could be that `greedy` leaves too many literals, thus producing more squeezed statistics favoring the most frequent literals, making them cheaper hence preferable. These frequent literals might be over-represented, possibly as a consequence of `greedy` being unable to find 3-byte matches and thus eliminate the corresponding literals. This impact is of course very variable depending on the file, and it's also possible that in some other cases the literals probabilities produced by `greedy` are accurate enough, maybe because there aren't many 3-byte matches worth selecting, and trying to artificially distort these statistics, typically by flattening them, ends up hurting compression ratio (ex: `nci`).

**16 KB blocks, btultra, level 14**
_(benchmark tables `dev` vs `ultra_init` not reproduced here)_

At 16 KB block size, there are a few improvements: the compression benefit increases to 130 KB, now representing a 0.18% ratio improvement, and the speed cost is slightly better, at -6.4%.
The best case remains `x-ray`, with a ratio at +0.65%, but there are many changes among the other files. `mozilla`, which used to be neutral at 2K, is now a big positive contributor at 16K. `mr`, which was negative at 2K, is now positive at 16K. A few files which were slightly positive at 2K become slightly negative at 16K. So it's all over the place. It's difficult to draw conclusions from these details, other than the change is "globally more positive" at 16K. `nci` continues to defy expectations by running faster despite a heavier initialization process.

**128 KB blocks, btultra, level 17**
_(benchmark tables `dev` vs `ultra_init` not reproduced here)_

At 128 KB block size, the compression benefits increase even more, at > 200 KB, now representing a +0.33% ratio improvement, which is quite sizable for just a difference in statistics initialization.
The majority of the gains come from `mozilla`, `mr` and `x-ray`. `mr` is the new champion of compression ratio gains, at +0.87%. It's all the more remarkable as it was previously the only file in negative territory at 2K blocks. There are a few files which lose compression ratio (compared to default initialization), though never by large amounts. Once again, it's difficult to pinpoint a single clearly reproducible reason.

The average speed cost increases to -11.4%, with large differences depending on the exact file. The worst speed costs are on `mr` and `sao`, with a drop of almost -25%, while `nci` and `webster` end up running faster despite the much heavier initialization.

It might be possible to reduce the speed cost by making initialization faster, either by searching less or by compressing only a portion of the input, but this would likely have an impact on compression ratio, so it would have to be carefully monitored.
follow-up investigation: limiting the statistics analysis to the first 64 KB of the first block reduces the speed cost considerably (by more than half). However, it also loses part of the compression ratio gain in the process (between 1/4 and 1/3).
### 2. `btultra2`: modify statistics collection

`btultra2` runs the first block with `btultra`, then throws away the result but keeps the statistics, and restarts compression of the first block using these baseline statistics.

The second modification changes the statistics transmission logic, rebuilding them from the `seqStore`, instead of simply inheriting the ones which are naturally stored into the `optPtr` as a consequence of running `btultra`.

This results in a small, yet consistently positive, compression ratio gain for `btultra2`, at essentially no cost. Note that, in this scenario, the first round of `btultra` still uses the default "blind" initialization (same as `dev`), not the `greedy` first pass mentioned in the earlier part.

The benefit is pretty small, and barely worth mentioning (I'm not detailing benchmarks here), averaging a 0.01% ratio improvement. What makes it desirable is that it comes for free, and is a necessary ingredient for stage 3 to work properly.
### 3. `btultra2`: initialize statistics from `btultra`, itself initialized by `greedy`

So that's more or less the intended end game of this exercise. Let's see if, with a first round using `greedy`, the statistics produced by `btultra` get better, resulting in a better final compression ratio for `btultra2`.

**2K blocks, btultra2, level 19**
**16K blocks, btultra2, level 19**

**128K blocks, btultra2, level 19**
So that's interesting. For some reason, this technique doesn't work well for small blocks (2 KB). It does however tend to improve compression ratio for larger blocks, by a measurable margin, reaching an almost 300 KB gain at 128 KB blocks. That's a 0.44% ratio advantage, which is quite a lot for high compression modes.
For this reason, the logic proposed in this PR has a cutoff point at 8 KB block size: below that threshold, start with default statistics; above it, initialize with `greedy`.

Let's be clear though that this is not a universal gain: it depends on the file. In this sample set, the largest gain is achieved by `x-ray` at 128 KB, a staggering +3.7% compression ratio gain, which is almost oversized for a simple alteration of initial statistics. It's followed by `samba` (+0.52%) and `mozilla` (+0.36%), which are pretty respectable. On the flip side, `dickens` loses -0.14% and `reymont` -0.12%. So this is not universal.

Results at 16 KB are much more tame, with `x-ray` gains reduced to +0.07%, hence barely relevant, while `dickens` losses increase to -0.25%. The total of the 16 KB changes still achieves a compression ratio benefit, but only by +0.04%, which is limited.

The oversized `x-ray` gains deserve an explanation. Sadly, I'm reduced to guesses right now. It likely receives a correct hint from the initial statistics built with `greedy`, making it capable of finding a better global path. A classical example would be a path which favors repcodes over another path consisting of larger yet more costly matches. This shorter global path would essentially be rejected due to wrong cost estimation when using the more standard (dev) initialization. It adds up over the full block. However, with such an explanation, I would have expected some good results at 16 KB block size too. Why it needs a larger block size to really show up is unclear.
followup investigation: it seems `x-ray` is "a medical picture of a child's hand stored in 12-bit gray scaled image"; maybe it needs a certain size for multiple lines to be present, hence for repcodes to start making an impact (?). This is highly speculative though: as the image is "large" (8 MB) and compresses poorly, it's difficult to imagine +3.7% compression ratio gains just from repcodes targeting the previous line.

(sidenote: any speed cost due to `greedy` initialization is minimal and irrelevant at this compression level.)