Thoughts about the mutual information threshold #259

@thodson-usgs

Description

I've been running some tests with the BitInfo codec, and it seemed like a good time to revisit whether it's better to trim mutual_information at an "arbitrary" threshold or to use a free-entropy threshold. The former appears to give decent results and better compression, but it might be discarding some real information. I wanted to open an issue before submitting a PR, because I assume others have dug more deeply into this.

Here's the code:

import xarray as xr
from numcodecs import Blosc, BitInfo

ds = xr.tutorial.open_dataset("air_temperature")

# Keep the mantissa bits that account for 99% of the bitwise information
# content, then compress the result with zstd.
compressor = Blosc(cname="zstd", clevel=3)
filters = [BitInfo(info_level=0.99)]

encoding = {"air": {"compressor": compressor, "filters": filters}}

ds.to_zarr("codec.zarr", mode="w", encoding=encoding)

By default, ds.to_zarr will chunk this dataset into 730x7x27 blocks for compression.
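The chunk shape matters here because BitInfo estimates keepbits chunk by chunk, so a quick sanity check on what actually landed on disk (re-opening the store with xarray should expose it through the variable's encoding):

written = xr.open_zarr("codec.zarr")
print(written["air"].encoding["chunks"])  # e.g. (730, 7, 27)

To control it explicitly, a "chunks" entry can also be added to the same encoding dict, e.g. "chunks": (730, 7, 27).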

Here are the results:

Compression                                    Size
None                                           17 MB
Zstd                                           5.3 MB
Zstd + BitInfo (default tol w/ factor = 1.1)   1.2 MB
Zstd + BitInfo (free-entropy tol)              2.8 MB
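For reference, the sizes are just the on-disk size of each zarr store; one way to reproduce the comparison is to sum the file sizes under each store directory (store_size is a hypothetical helper here; du -sh codec.zarr gives the same answer):

from pathlib import Path

def store_size(path):
    # Sum the sizes of every file inside the zarr store directory.
    return sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file())

print(f"{store_size('codec.zarr') / 1e6:.1f} MB")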

(An additional half-baked thought: What about using a convolution to compute info content pixel-by-pixel rather than chunk-by-chunk?)
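For what it's worth, here's a minimal sketch of what that could look like for a single float32 slice, using the bitwise mutual information between each value and its along-axis neighbour inside a small sliding window. The window size, the neighbour choice, and the bitwise_mi/local_keepbits helpers are all illustrative assumptions, not how the codec works today, and it glosses over exactly the free-entropy floor this issue is about:

import numpy as np

def bitwise_mi(x, y):
    # Mutual information (bits) between matching bit planes of two uint32 arrays.
    mi = np.zeros(32)
    for k in range(32):
        a = (x >> k) & 1
        b = (y >> k) & 1
        for i in (0, 1):
            for j in (0, 1):
                pij = np.mean((a == i) & (b == j))
                if pij > 0:
                    pi = np.mean(a == i)
                    pj = np.mean(b == j)
                    mi[k] += pij * np.log2(pij / (pi * pj))
    return mi

def local_keepbits(field2d, window=16, info_level=0.99):
    # Slide a window along the first axis and estimate keepbits locally,
    # rather than once per compression chunk.
    bits = np.ascontiguousarray(field2d, dtype=np.float32).view(np.uint32)
    keep = []
    for i in range(bits.shape[0] - window):
        block = bits[i : i + window]
        mi = bitwise_mi(block[:-1], block[1:])  # MI with the along-axis neighbour
        total = max(mi.sum(), 1e-12)
        cumulative = np.cumsum(mi[::-1]) / total  # accumulate from the sign bit down
        keep.append(min(32, int(np.searchsorted(cumulative, info_level)) + 1))
    return np.array(keep)

# e.g. local_keepbits(ds.air.isel(time=0).values)

The per-window keepbits could then be smoothed, or the maximum taken per chunk, before quantizing.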
