Running some tests with the BitInfo codec, it seemed like a good time to revisit whether it's better to trim mutual_information at an "arbitrary" information-fraction threshold or to use a free-entropy threshold. The former appears to give decent results and better compression, but it might be discarding real information. I wanted to open this issue before submitting a PR, since I assume others have dug more deeply into this.
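To make the tradeoff concrete, here's a toy sketch of the two keepbits-selection strategies. The function names, the synthetic per-bit information array, and the noise-floor formula are all illustrative assumptions, not BitInfo's actual implementation:

```python
import numpy as np

def keepbits_info_level(info_per_bit, info_level=0.99):
    # Keep the smallest number of leading bits whose cumulative share of
    # the total information reaches `info_level` (the "arbitrary" cutoff).
    cdf = np.cumsum(info_per_bit) / np.sum(info_per_bit)
    return int(np.argmax(cdf >= info_level)) + 1

def keepbits_free_entropy(info_per_bit, n_samples):
    # Keep every bit whose information content clears a noise floor that
    # shrinks with sample size; 1.5 * sqrt(ln 2 / n) is a made-up stand-in
    # for a proper free-entropy estimate.
    floor = 1.5 * np.sqrt(np.log(2) / n_samples)
    significant = np.nonzero(info_per_bit > floor)[0]
    return int(significant.max()) + 1 if significant.size else 0

# Synthetic, rapidly decaying per-bit information (illustrative only).
info = np.array([0.5, 0.3, 0.1, 0.05, 0.03, 0.01, 0.005, 0.002, 0.001])

print(keepbits_info_level(info, 0.99))         # trims aggressively
print(keepbits_free_entropy(info, 1_000_000))  # retains low-information tail bits
```

With a long, slowly decaying tail the free-entropy rule keeps more bits than the 99% cutoff, which matches the size difference in the table below: better fidelity, worse compression.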
Here's the code:

```python
import xarray as xr
from numcodecs import Blosc, BitInfo

ds = xr.tutorial.open_dataset("air_temperature")

compressor = Blosc(cname="zstd", clevel=3)
filters = [BitInfo(info_level=0.99)]
encoding = {"air": {"compressor": compressor, "filters": filters}}
ds.to_zarr("codec.zarr", mode="w", encoding=encoding)
```

By default, `ds.to_zarr` will chunk this dataset into 730×7×27 blocks for compression.
Here are the results:
| Compression | Size |
|---|---|
| None | 17 MB |
| Zstd | 5.3 MB |
| Zstd + BitInfo (default tol w/ factor = 1.1) | 1.2 MB |
| Zstd + BitInfo (free-entropy tol) | 2.8 MB |
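Sizes like those above can be measured as total bytes on disk by walking the store directory; this helper is my own utility, not part of zarr or xarray:

```python
import os

def store_size_mb(path):
    # Sum the sizes of every chunk and metadata file under a Zarr store
    # directory, recursing through group/variable subdirectories.
    total = sum(
        os.path.getsize(os.path.join(root, name))
        for root, _dirs, files in os.walk(path)
        for name in files
    )
    return total / 1e6  # decimal megabytes
```

For example, `store_size_mb("codec.zarr")` after the `to_zarr` call above.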
(An additional half-baked thought: What about using a convolution to compute info content pixel-by-pixel rather than chunk-by-chunk?)