
[edge case] Zstd performs badly on 200-symbol uniform data #3162

Open

Description

@terrelln

Data generated by this script:

import random

# ~uniform bytes over 201 distinct values (randint is inclusive on both ends):
# 10 MB of seeded random data, repeated 10x for a 100 MB file.
rd = random.Random()
rd.seed(0)
HIGH_ENTROPY = bytes(rd.randint(0, 200) for _ in range(10_000_000)) * 10
with open("med.bin", "wb") as f:
    f.write(HIGH_ENTROPY)

gzip -1: 100000000 -> 96526120
zstd -1: 100000000 -> 100002299
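
For context, a back-of-the-envelope check of my own (not from the issue): rd.randint(0, 200) yields 201 distinct byte values, so each byte carries log2(201) ≈ 7.65 bits of entropy, and an ideal entropy coder tops out around 95.6% of the input size. gzip -1 gets close to that bound; zstd -1 gives up entirely.

import math

bits_per_byte = math.log2(201)            # 201 symbols, ~7.651 bits/byte
ideal = bits_per_byte / 8 * 100_000_000   # ideal size for the 100 MB input
print(f"{bits_per_byte:.3f} bits/byte -> ~{ideal:,.0f} bytes")  # ~95,638,146 bytes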

If I remove these heuristics:

if (largestTotal <= ((2 * SUSPECT_INCOMPRESSIBLE_SAMPLE_SIZE) >> 7)+4) return 0; /* heuristic : probably not compressible enough */

if (largest <= (srcSize >> 7)+4) return 0; /* heuristic : probably not compressible enough */
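
For what it's worth, a rough sketch of my own (not from the issue) of why the second check fires here: with 201 near-equiprobable symbols, the most frequent one appears only about srcSize/201 times, well below the (srcSize >> 7)+4 ≈ srcSize/128 cutoff, so the block is declared incompressible and stored raw.

srcSize = 128 * 1024            # a plausible block size, purely for illustration
largest = srcSize // 201        # expected top symbol count on uniform data
threshold = (srcSize >> 7) + 4  # the heuristic's cutoff, ~srcSize/128
print(largest, threshold, largest <= threshold)   # 652 1028 True -> return 0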

We get:

zstd -1: 100000000 -> 96449637

Zstd should do a better job of determining compressibility so we don't lose out on cases like this.
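
One possible direction, sketched here as my own illustration rather than anything zstd implements: estimate the Shannon entropy of the sampled histogram instead of looking only at the largest count, so flat-but-narrow distributions like this one still register as compressible.

import math
from collections import Counter

def probably_compressible(sample: bytes, min_gain: float = 0.02) -> bool:
    # Hypothetical check: True if an ideal entropy coder would shave at
    # least min_gain (as a fraction) off this sample.
    if not sample:
        return False
    n = len(sample)
    entropy = -sum(c / n * math.log2(c / n) for c in Counter(sample).values())
    return entropy / 8 <= 1.0 - min_gain

On the data above this returns True (7.65/8 ≈ 0.956 ≤ 0.98), whereas a test that only looks at single-symbol dominance sees nothing.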
