dice/jaccard single-pair scorers crash on different-length bloom hex inputs #784

@benzsevern

Found by the hypothesis property suite (#778) on its first run. Shrunk counterexample:

from goldenmatch.core.scorer import score_field
score_field("0000", "000000", "dice")
# ValueError: operands could not be broadcast together with shapes (2,) (3,)

Root cause: _dice_score_single / _jaccard_score_single in goldenmatch/core/scorer.py decode the two hex strings to byte arrays and call np.bitwise_and(bits_a, bits_b) with no length validation. The matrix variants (_dice_score_matrix / _jaccard_score_matrix) pad to max_len and are unaffected -- so the single-pair and batch paths disagree on the same inputs.
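The broadcast failure can be reproduced without goldenmatch. This is a minimal sketch that assumes the helpers decode each hex string to one uint8 per byte; decode_hex is a hypothetical stand-in, not the actual code in scorer.py.

```python
import numpy as np


def decode_hex(hex_str):
    # Hypothetical stand-in for the decoding step (an assumption):
    # one uint8 element per byte of the hex string.
    return np.frombuffer(bytes.fromhex(hex_str), dtype=np.uint8)


bits_a = decode_hex("0000")    # 2 bytes -> shape (2,)
bits_b = decode_hex("000000")  # 3 bytes -> shape (3,)

try:
    # Element-wise AND needs broadcast-compatible shapes; (2,) and (3,) are not.
    np.bitwise_and(bits_a, bits_b)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

Neither dimension is 1, so NumPy cannot broadcast the arrays and raises the ValueError reported above.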

Reachability: bloom encodings are fixed-length per PPRL config, so same-pipeline pairs are safe in practice. But score_field is public API, and cross-config or hand-fed inputs hit an unhandled NumPy broadcasting ValueError instead of either a score or a typed rejection.

Suggested fix: zero-pad the shorter array to max_len in the single-pair helpers, mirroring the matrix variants (keeps single-vs-matrix parity on the same inputs). Alternative if padding is semantically wrong for PPRL: raise a typed ValueError("bloom filter length mismatch: ...") -- but then the matrix variants should reject too, not silently pad.
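A rough sketch of the zero-padding option, written in plain Python so it runs without goldenmatch or NumPy. The names dice_padded, jaccard_padded, _decode_padded and _popcount are hypothetical. Trailing (right-side) padding and a score of 0.0 when no bits are set are both assumptions that would need checking against what the matrix variants actually do.

```python
def _decode_padded(hex_a, hex_b):
    # Decode both hex strings, then zero-pad the shorter one on the right
    # (assumption: the matrix variants pad on the same side).
    a = bytes.fromhex(hex_a)
    b = bytes.fromhex(hex_b)
    max_len = max(len(a), len(b))
    return a.ljust(max_len, b"\x00"), b.ljust(max_len, b"\x00")


def _popcount(data):
    # Number of set bits across all bytes.
    return sum(bin(byte).count("1") for byte in data)


def dice_padded(hex_a, hex_b):
    a, b = _decode_padded(hex_a, hex_b)
    intersection = _popcount(bytes(x & y for x, y in zip(a, b)))
    total = _popcount(a) + _popcount(b)
    # Assumption: two all-zero filters score 0.0 rather than raising.
    return 2.0 * intersection / total if total else 0.0


def jaccard_padded(hex_a, hex_b):
    a, b = _decode_padded(hex_a, hex_b)
    intersection = _popcount(bytes(x & y for x, y in zip(a, b)))
    union = _popcount(bytes(x | y for x, y in zip(a, b)))
    # Assumption: two all-zero filters score 0.0 rather than raising.
    return intersection / union if union else 0.0


# The shrunk counterexample now yields a score instead of raising.
print(dice_padded("0000", "000000"))
print(jaccard_padded("f0", "ff00"))
```

Because padding only appends zero bits, it never changes the intersection or the set-bit counts, so a filter and its zero-extended copy score as identical. Whether that is acceptable for PPRL is the semantic question raised above.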

Regression tests already in place: tests/test_property_invariants.py::test_dice_mismatched_length_bug and ::test_jaccard_mismatched_length_bug are @pytest.mark.xfail(strict=True, raises=ValueError) -- they flip to XPASS (suite failure) when this is fixed; remove the markers and keep the assertions.

Related: check the TS twins diceCoefficient / jaccardSimilarity (packages/typescript/goldenmatch/src/core/scorer.ts) for the same gap -- the fast-check suite (#783) tests them on a same-length hex strategy, so their mismatched-length behavior is currently unverified. Whatever semantics the fix picks, both surfaces should match.

Labels: bug
