Return the estimated effectiveness of dictionaries in the trainer #1506

Open
@braibant

Description

Hi,

I am updating some code using zstd from 1.3.4 to 1.3.8, and encountered an issue when updating client code that seems related to the changes in dictionary training.

The use case is training a dictionary on some payload type in a server, sending the dictionary to a client, and then using the dictionary to encode messages of this type from server to client. In one degenerate test case, it looks like data that could be used to "successfully" train a dictionary in 1.3.4 now yields an "Error (generic)" in 1.3.8. I suspect that the training data is too small (most messages are a few bytes). I tried looking for guidance on how to train dictionaries and what restrictions apply, and found an upper bound in #1288 (comment), but not much about lower bounds.

(A reproduction example I am looking at is training a dictionary on the consecutive integers from 0 to 10_000_000 represented as strings, which returns successfully in 1.3.4 but not in 1.3.8.)
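Roughly, the reproduction looks like the sketch below against the stable ZDICT API (ZDICT_trainFromBuffer / ZDICT_isError). It assumes each integer's decimal string is one training sample and picks an arbitrary ~110 KB dictionary target; both are guesses at the setup rather than the exact code.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zdict.h>

int main(void)
{
    const unsigned nbSamples = 10000000;   /* the integers 0 .. 9_999_999 as strings */
    size_t* sampleSizes = malloc(nbSamples * sizeof *sampleSizes);

    /* First pass: measure each sample ("0", "1", ..., "9999999"). */
    char tmp[16];
    size_t total = 0;
    for (unsigned i = 0; i < nbSamples; i++) {
        sampleSizes[i] = (size_t)snprintf(tmp, sizeof tmp, "%u", i);
        total += sampleSizes[i];
    }

    /* Second pass: lay the samples out back to back, as ZDICT expects. */
    char* samples = malloc(total);
    size_t off = 0;
    for (unsigned i = 0; i < nbSamples; i++) {
        int n = snprintf(tmp, sizeof tmp, "%u", i);
        memcpy(samples + off, tmp, (size_t)n);
        off += (size_t)n;
    }

    size_t dictCapacity = 112640;          /* ~110 KB target, an illustrative choice */
    void*  dictBuffer   = malloc(dictCapacity);
    size_t dictSize = ZDICT_trainFromBuffer(dictBuffer, dictCapacity,
                                            samples, sampleSizes, nbSamples);
    if (ZDICT_isError(dictSize)) {
        /* Observed behavior: this path is not taken on 1.3.4, but is on 1.3.8. */
        fprintf(stderr, "training failed: %s\n", ZDICT_getErrorName(dictSize));
        return 1;
    }
    printf("trained a %zu-byte dictionary\n", dictSize);

    free(dictBuffer);
    free(samples);
    free(sampleSizes);
    return 0;
}
```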

I am wondering whether it could make sense to expose some form of dummy dictionary in the library, to cater programmatically for cases where a dictionary cannot be trained successfully; using it would revert the behavior of e.g. ZSTD_compress_usingCDict to that of ZSTD_compressCCtx. In the example described above, I would like the server/client communication to gracefully fall back to not using a dictionary if the server does not manage to build one.
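Concretely, the fallback I have in mind looks something like the sketch below; compress_message is a hypothetical application-side helper, not a zstd API.

```c
#include <zstd.h>

/* Hypothetical application-side helper: compress one server->client message,
 * falling back to plain compression when no dictionary could be trained. */
static size_t compress_message(ZSTD_CCtx* cctx,
                               void* dst, size_t dstCapacity,
                               const void* src, size_t srcSize,
                               const ZSTD_CDict* cdict, /* NULL if training failed */
                               int compressionLevel)
{
    if (cdict != NULL) {
        /* Dictionary path: the compression level is the one baked into the CDict. */
        return ZSTD_compress_usingCDict(cctx, dst, dstCapacity, src, srcSize, cdict);
    }
    /* Fallback path: behave exactly as if no dictionary existed. */
    return ZSTD_compressCCtx(cctx, dst, dstCapacity, src, srcSize, compressionLevel);
}
```

The question is whether the library itself could offer an equivalent "dummy" dictionary, so that the dictionary/no-dictionary distinction does not have to leak into the application protocol.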
