Skip to content

Divergence between hash3_int and PHP's murmur3a (and other implementations)  #16

Open
@FraserChapman

Description

@FraserChapman

If I test the hash3_int method against php8.1.2 murmur3a it appears to have divergences.

I found this after stumbling across a divergence by chance, and then testing 300000 randomly generated keys in roughly the same format as the problematic key I'd discovered.

e.g.

Given the keys

vdegpxazixnfmxxrshidlvnqyqh20301009315950
xlznhcnqpwpmfpxihtnrgz20290705241970
kuieegiqdkciqyxpuheemjtxcb202709178016583
fdthgabmvundazjxxu202810222044209
eqtlxyiianhpkjupowha200709169794430
tzlvesrdvngxoatqowmvjwdxuey201808121086444
wexigdatedfbjvdrbnwwc202602132914266

hash3_int and murmur3a give different hash values.
All the other tested keys matched exactly.

Running: 300000 iterations
PHP version: 8.1.2-1ubuntu2.20
Seed: 0
Progress: 300000/300000
Divergent hashes found:
Array
(
    [0] => Array
        (
            [key] => vdegpxazixnfmxxrshidlvnqyqh20301009315950
            [hash3Int] => 331996093
            [murmur3a] => 2187344450
        )

    [1] => Array
        (
            [key] => xlznhcnqpwpmfpxihtnrgz20290705241970
            [hash3Int] => 2026460930
            [murmur3a] => 1939335100
        )

    [2] => Array
        (
            [key] => kuieegiqdkciqyxpuheemjtxcb202709178016583
            [hash3Int] => 1380540917
            [murmur3a] => 2945654283
        )

    [3] => Array
        (
            [key] => fdthgabmvundazjxxu202810222044209
            [hash3Int] => 2427300608
            [murmur3a] => 2986937244
        )

    [4] => Array
        (
            [key] => eqtlxyiianhpkjupowha200709169794430
            [hash3Int] => 3932939188
            [murmur3a] => 3428012532
        )

    [5] => Array
        (
            [key] => tzlvesrdvngxoatqowmvjwdxuey201808121086444
            [hash3Int] => 1240384036
            [murmur3a] => 2145074505
        )

    [6] => Array
        (
            [key] => wexigdatedfbjvdrbnwwc202602132914266
            [hash3Int] => 4187875406
            [murmur3a] => 245478873
        )

)

This is using

<?php

function hash3Int(string $key, int $seed = 0): int
{
    $key = array_values(unpack('C*', $key));
    $klen = count($key);
    $h1 = $seed < 0 ? -$seed : $seed;
    $remainder = $i = 0;
    for ($bytes = $klen - ($remainder = $klen & 3); $i < $bytes; ) {
        $k1 = $key[$i]
            | ($key[++$i] << 8)
            | ($key[++$i] << 16)
            | ($key[++$i] << 24);
        ++$i;
        $k1 = (((($k1 & 0xffff) * 0xcc9e2d51) + ((((($k1 >= 0 ? $k1 >> 16 : (($k1 & 0x7fffffff) >> 16) | 0x8000)) * 0xcc9e2d51) & 0xffff) << 16))) & 0xffffffff;
        $k1 = $k1 << 15 | ($k1 >= 0 ? $k1 >> 17 : (($k1 & 0x7fffffff) >> 17) | 0x4000);
        $k1 = (((($k1 & 0xffff) * 0x1b873593) + ((((($k1 >= 0 ? $k1 >> 16 : (($k1 & 0x7fffffff) >> 16) | 0x8000)) * 0x1b873593) & 0xffff) << 16))) & 0xffffffff;
        $h1 ^= $k1;
        $h1 = $h1 << 13 | ($h1 >= 0 ? $h1 >> 19 : (($h1 & 0x7fffffff) >> 19) | 0x1000);
        $h1b = (((($h1 & 0xffff) * 5) + ((((($h1 >= 0 ? $h1 >> 16 : (($h1 & 0x7fffffff) >> 16) | 0x8000)) * 5) & 0xffff) << 16))) & 0xffffffff;
        $h1 = ((($h1b & 0xffff) + 0x6b64) + ((((($h1b >= 0 ? $h1b >> 16 : (($h1b & 0x7fffffff) >> 16) | 0x8000)) + 0xe654) & 0xffff) << 16));
    }
    $k1 = 0;
    switch ($remainder) {
        case 3:
            $k1 ^= $key[$i + 2] << 16;
        case 2:
            $k1 ^= $key[$i + 1] << 8;
        case 1:
            $k1 ^= $key[$i];
            $k1 = ((($k1 & 0xffff) * 0xcc9e2d51) + ((((($k1 >= 0 ? $k1 >> 16 : (($k1 & 0x7fffffff) >> 16) | 0x8000)) * 0xcc9e2d51) & 0xffff) << 16)) & 0xffffffff;
            $k1 = $k1 << 15 | ($k1 >= 0 ? $k1 >> 17 : (($k1 & 0x7fffffff) >> 17) | 0x4000);
            $k1 = ((($k1 & 0xffff) * 0x1b873593) + ((((($k1 >= 0 ? $k1 >> 16 : (($k1 & 0x7fffffff) >> 16) | 0x8000)) * 0x1b873593) & 0xffff) << 16)) & 0xffffffff;
            $h1 ^= $k1;
    }
    $h1 ^= $klen;
    $h1 ^= ($h1 >= 0 ? $h1 >> 16 : (($h1 & 0x7fffffff) >> 16) | 0x8000);
    $h1 = ((($h1 & 0xffff) * 0x85ebca6b) + ((((($h1 >= 0 ? $h1 >> 16 : (($h1 & 0x7fffffff) >> 16) | 0x8000)) * 0x85ebca6b) & 0xffff) << 16)) & 0xffffffff;
    $h1 ^= ($h1 >= 0 ? $h1 >> 13 : (($h1 & 0x7fffffff) >> 13) | 0x40000);
    $h1 = (((($h1 & 0xffff) * 0xc2b2ae35) + ((((($h1 >= 0 ? $h1 >> 16 : (($h1 & 0x7fffffff) >> 16) | 0x8000)) * 0xc2b2ae35) & 0xffff) << 16))) & 0xffffffff;
    $h1 ^= ($h1 >= 0 ? $h1 >> 16 : (($h1 & 0x7fffffff) >> 16) | 0x8000);
    return $h1;
}

function murmur3a(string $key, int $seed = 0): int
{
    return (int) base_convert(hash('murmur3a', $key, false, ["seed" => $seed]), 16, 10);
}

function generateRandomKey()
{
    $stringPart = substr(str_shuffle(str_repeat("abcdefghijklmnopqrstuvwxyz", 10)), 0, rand(10, 30));
    $datePart = date("Ymd", timestamp: rand(strtotime("2000-01-01"), strtotime("2030-12-31")));
    $intPart = rand(1000, 9999999);
    return $stringPart . $datePart . $intPart;
}

function runTests($numTests, $seed)
{
    echo "Running: $numTests iterations\n";
    echo "PHP version: " . phpversion() . "\n";
    echo "Seed: $seed\n";

    $divergentHashes = [];

    for ($i = 0; $i < $numTests; $i++) {

        echo "Progress: " . ($i + 1) . "/$numTests\r";

        $key = generateRandomKey();
        $hash3Int = hash3Int($key, $seed);
        $murmur3a = murmur3a($key, $seed);
        if ($hash3Int !== $murmur3a) {
            $divergentHashes[] = [
                'key' => $key,
                'hash3Int' => $hash3Int,
                'murmur3a' => $murmur3a
            ];
        }
    }

    echo "\n";

    return $divergentHashes;
}

$numTests = isset($argv[1]) ? (int) $argv[1] : 1000;
$seed = isset($argv[2]) ? (int) $argv[2] : 0;
$divergentHashes = runTests($numTests, $seed);

if (empty($divergentHashes)) {
    echo "All hashes matched!\n";
} else {
    echo "Divergent hashes found:\n";
    print_r($divergentHashes);
}

I also tested against other versions of murmurhash3 (in various languages; go, JavaScript, Python, etc) and they all do give the same hash as murmur3a - which leads me to suspect there is a subtle bug in the hash3_int implementation within this repo.

This seems to be consistent, if I run the test for a few hundred thousand iterations it finds a small number of divergent keys.

Happy to give more context and information if you need it, and apologies if I am overlooking something here!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions