Skip to content

SparseHll inserts incorrect value for high number of leading zeros #56

@jonhehir

Description

@jonhehir

This is an edge case that may not be worth fixing.

SparseHll has a small bug that causes the wrong bucket value to be recorded when the number of leading zeros is within 6 of the maximum possible number. For example, with 4096 buckets (indexBitLength = 12), the maximum possible bucket value is 64 - indexBitLength + 1 = 53 (52 leading zeros). Values up to 46 are correctly recorded, but attempting to insert a hash with more 46 or more leading zeros records an incorrect value. This is reflected in the added unit tests in #55.

I have not thoroughly investigated why this is, but I believe the implementation of HyperLogLog++ in SparseHll differs slightly from that laid out in the original paper. (See in particular the encoding and decoding of hashes in the sparse setting, and note that 6 is equal to VALUE_BITS in SparseHll.) I suspect fixing this bug would require making breaking changes to the format of SparseHll. Meanwhile, the probability of this bug occurring is very remote, so its effects are virtually nil. Nonetheless, I figured it would be worthwhile to document this behavior for any future reference.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions