Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Divergence between hash3_int and PHP's murmur3a (and other implementations) #16

Open
FraserChapman opened this issue Jan 18, 2025 · 2 comments

Comments

@FraserChapman
Copy link

FraserChapman commented Jan 18, 2025

If I test the hash3_int method against php8.1.2 murmur3a it appears to have divergences.

I found this after stumbling across a divergence by chance, and then testing 300000 randomly generated keys in roughly the same format as the problematic key I'd discovered.

e.g.

Given the keys

vdegpxazixnfmxxrshidlvnqyqh20301009315950
xlznhcnqpwpmfpxihtnrgz20290705241970
kuieegiqdkciqyxpuheemjtxcb202709178016583
fdthgabmvundazjxxu202810222044209
eqtlxyiianhpkjupowha200709169794430
tzlvesrdvngxoatqowmvjwdxuey201808121086444
wexigdatedfbjvdrbnwwc202602132914266

hash3_int and murmur3a give different hash values.
All the other tested keys matched exactly.

Running: 300000 iterations
PHP version: 8.1.2-1ubuntu2.20
Seed: 0
Progress: 300000/300000
Divergent hashes found:
Array
(
    [0] => Array
        (
            [key] => vdegpxazixnfmxxrshidlvnqyqh20301009315950
            [hash3Int] => 331996093
            [murmur3a] => 2187344450
        )

    [1] => Array
        (
            [key] => xlznhcnqpwpmfpxihtnrgz20290705241970
            [hash3Int] => 2026460930
            [murmur3a] => 1939335100
        )

    [2] => Array
        (
            [key] => kuieegiqdkciqyxpuheemjtxcb202709178016583
            [hash3Int] => 1380540917
            [murmur3a] => 2945654283
        )

    [3] => Array
        (
            [key] => fdthgabmvundazjxxu202810222044209
            [hash3Int] => 2427300608
            [murmur3a] => 2986937244
        )

    [4] => Array
        (
            [key] => eqtlxyiianhpkjupowha200709169794430
            [hash3Int] => 3932939188
            [murmur3a] => 3428012532
        )

    [5] => Array
        (
            [key] => tzlvesrdvngxoatqowmvjwdxuey201808121086444
            [hash3Int] => 1240384036
            [murmur3a] => 2145074505
        )

    [6] => Array
        (
            [key] => wexigdatedfbjvdrbnwwc202602132914266
            [hash3Int] => 4187875406
            [murmur3a] => 245478873
        )

)

This is using

<?php

function hash3Int(string $key, int $seed = 0): int
{
    $key = array_values(unpack('C*', $key));
    $klen = count($key);
    $h1 = $seed < 0 ? -$seed : $seed;
    $remainder = $i = 0;
    for ($bytes = $klen - ($remainder = $klen & 3); $i < $bytes; ) {
        $k1 = $key[$i]
            | ($key[++$i] << 8)
            | ($key[++$i] << 16)
            | ($key[++$i] << 24);
        ++$i;
        $k1 = (((($k1 & 0xffff) * 0xcc9e2d51) + ((((($k1 >= 0 ? $k1 >> 16 : (($k1 & 0x7fffffff) >> 16) | 0x8000)) * 0xcc9e2d51) & 0xffff) << 16))) & 0xffffffff;
        $k1 = $k1 << 15 | ($k1 >= 0 ? $k1 >> 17 : (($k1 & 0x7fffffff) >> 17) | 0x4000);
        $k1 = (((($k1 & 0xffff) * 0x1b873593) + ((((($k1 >= 0 ? $k1 >> 16 : (($k1 & 0x7fffffff) >> 16) | 0x8000)) * 0x1b873593) & 0xffff) << 16))) & 0xffffffff;
        $h1 ^= $k1;
        $h1 = $h1 << 13 | ($h1 >= 0 ? $h1 >> 19 : (($h1 & 0x7fffffff) >> 19) | 0x1000);
        $h1b = (((($h1 & 0xffff) * 5) + ((((($h1 >= 0 ? $h1 >> 16 : (($h1 & 0x7fffffff) >> 16) | 0x8000)) * 5) & 0xffff) << 16))) & 0xffffffff;
        $h1 = ((($h1b & 0xffff) + 0x6b64) + ((((($h1b >= 0 ? $h1b >> 16 : (($h1b & 0x7fffffff) >> 16) | 0x8000)) + 0xe654) & 0xffff) << 16));
    }
    $k1 = 0;
    switch ($remainder) {
        case 3:
            $k1 ^= $key[$i + 2] << 16;
        case 2:
            $k1 ^= $key[$i + 1] << 8;
        case 1:
            $k1 ^= $key[$i];
            $k1 = ((($k1 & 0xffff) * 0xcc9e2d51) + ((((($k1 >= 0 ? $k1 >> 16 : (($k1 & 0x7fffffff) >> 16) | 0x8000)) * 0xcc9e2d51) & 0xffff) << 16)) & 0xffffffff;
            $k1 = $k1 << 15 | ($k1 >= 0 ? $k1 >> 17 : (($k1 & 0x7fffffff) >> 17) | 0x4000);
            $k1 = ((($k1 & 0xffff) * 0x1b873593) + ((((($k1 >= 0 ? $k1 >> 16 : (($k1 & 0x7fffffff) >> 16) | 0x8000)) * 0x1b873593) & 0xffff) << 16)) & 0xffffffff;
            $h1 ^= $k1;
    }
    $h1 ^= $klen;
    $h1 ^= ($h1 >= 0 ? $h1 >> 16 : (($h1 & 0x7fffffff) >> 16) | 0x8000);
    $h1 = ((($h1 & 0xffff) * 0x85ebca6b) + ((((($h1 >= 0 ? $h1 >> 16 : (($h1 & 0x7fffffff) >> 16) | 0x8000)) * 0x85ebca6b) & 0xffff) << 16)) & 0xffffffff;
    $h1 ^= ($h1 >= 0 ? $h1 >> 13 : (($h1 & 0x7fffffff) >> 13) | 0x40000);
    $h1 = (((($h1 & 0xffff) * 0xc2b2ae35) + ((((($h1 >= 0 ? $h1 >> 16 : (($h1 & 0x7fffffff) >> 16) | 0x8000)) * 0xc2b2ae35) & 0xffff) << 16))) & 0xffffffff;
    $h1 ^= ($h1 >= 0 ? $h1 >> 16 : (($h1 & 0x7fffffff) >> 16) | 0x8000);
    return $h1;
}

function murmur3a(string $key, int $seed = 0): int
{
    return (int) base_convert(hash('murmur3a', $key, false, ["seed" => $seed]), 16, 10);
}

function generateRandomKey()
{
    $stringPart = substr(str_shuffle(str_repeat("abcdefghijklmnopqrstuvwxyz", 10)), 0, rand(10, 30));
    $datePart = date("Ymd", timestamp: rand(strtotime("2000-01-01"), strtotime("2030-12-31")));
    $intPart = rand(1000, 9999999);
    return $stringPart . $datePart . $intPart;
}

function runTests($numTests, $seed)
{
    echo "Running: $numTests iterations\n";
    echo "PHP version: " . phpversion() . "\n";
    echo "Seed: $seed\n";

    $divergentHashes = [];

    for ($i = 0; $i < $numTests; $i++) {

        echo "Progress: " . ($i + 1) . "/$numTests\r";

        $key = generateRandomKey();
        $hash3Int = hash3Int($key, $seed);
        $murmur3a = murmur3a($key, $seed);
        if ($hash3Int !== $murmur3a) {
            $divergentHashes[] = [
                'key' => $key,
                'hash3Int' => $hash3Int,
                'murmur3a' => $murmur3a
            ];
        }
    }

    echo "\n";

    return $divergentHashes;
}

$numTests = isset($argv[1]) ? (int) $argv[1] : 1000;
$seed = isset($argv[2]) ? (int) $argv[2] : 0;
$divergentHashes = runTests($numTests, $seed);

if (empty($divergentHashes)) {
    echo "All hashes matched!\n";
} else {
    echo "Divergent hashes found:\n";
    print_r($divergentHashes);
}

I also tested against other versions of murmurhash3 (in various languages; go, JavaScript, Python, etc) and they all do give the same hash as murmur3a - which leads me to suspect there is a subtle bug in the hash3_int implementation within this repo.

This seems to be consistent, if I run the test for a few hundred thousand iterations it finds a small number of divergent keys.

Happy to give more context and information if you need it, and apologies if I am overlooking something here!

@fraser-twinkl
Copy link

fraser-twinkl commented Jan 21, 2025

After quite a bit of testing, and looking over previous versions of the hash3_int implementation in this repo, I believe I have a working version.

Fixes

  1. The incorrect handling of unsigned multiplication, especially for large numbers
  2. Inconsistent handling of 32-bit unsigned integers
  3. Bit manipulation for the "mixing" steps to account for PHP's signed integer behaviour
<?php
function murmur3a(string $key, int $seed = 0): int
{
    return (int) base_convert(hash('murmur3a', $key, false, ["seed" => $seed]), 16, 10);
}

function hash3Int_fc($key, $seed = 0)
{
    $key = array_values(unpack('C*', $key));
    $klen = count($key);
    $h1 = $seed < 0 ? -$seed : $seed;

    //  (fix #1) - correctly perform unsigned right shift
    $u32rs = function ($a, $b) {
        if ($b == 0) {
            return $a;
        }

        $shifted = $a >> $b;
        $mask = 0xFFFFFFFF >> $b;

        return $shifted & $mask;
    };

    // (fix #2) - correctly handle unsigned 32-bit multiplication
    $u32m = function ($a, $b) {
        $a &= 0xFFFFFFFF;
        $b &= 0xFFFFFFFF;
        $lo = ($a & 0xFFFF) * ($b & 0xFFFF);
        $mid = ($a >> 16) * ($b & 0xFFFF) + ($a & 0xFFFF) * ($b >> 16);
        $result = $lo + ($mid << 16);

        return $result & 0xFFFFFFFF;
    };

    for ($i = 0, $bytes = $klen - ($remainder = $klen & 3); $i < $bytes; ) {
        $k1 = (($key[$i] & 0xff))
            | (($key[++$i] & 0xff) << 8)
            | (($key[++$i] & 0xff) << 16)
            | (($key[++$i] & 0xff) << 24);
        ++$i;

        $k1 = $u32m($k1, 0xcc9e2d51);
        $k1 = (($k1 << 15) | $u32rs($k1, 17)) & 0xFFFFFFFF;
        $k1 = $u32m($k1, 0x1b873593);

        $h1 ^= $k1;
        $h1 = (($h1 << 13) | $u32rs($h1, 19)) & 0xFFFFFFFF;
        $h1 = ($u32m($h1, 5) + 0xe6546b64) & 0xFFFFFFFF;
    }

    $k1 = 0;
    switch ($remainder) {
        case 3:
            $k1 ^= ($key[$i + 2] & 0xff) << 16;
        case 2:
            $k1 ^= ($key[$i + 1] & 0xff) << 8;
        case 1:
            $k1 ^= $key[$i] & 0xff;
            $k1 = $u32m($k1, 0xcc9e2d51);
            $k1 = (($k1 << 15) | $u32rs($k1, 17)) & 0xFFFFFFFF;
            $k1 = $u32m($k1, 0x1b873593);
            $h1 ^= $k1;
    }

    $h1 ^= $klen;

    // (fix #3) - proper handling of signed/unsigned bit manipulation
    $h1 ^= $u32rs($h1, 16);
    $h1 = $u32m($h1, 0x85ebca6b);
    $h1 ^= $u32rs($h1, 13);
    $h1 = $u32m($h1, 0xc2b2ae35);
    $h1 ^= $u32rs($h1, 16);

    return $h1 & 0xFFFFFFFF;
}

function generateRandomKey()
{
    $stringPart = substr(str_shuffle(str_repeat("abcdefghijklmnopqrstuvwxyz", 10)), 0, rand(10, 30));
    $datePart = date("Ymd", timestamp: rand(strtotime("2000-01-01"), strtotime("2030-12-31")));
    $intPart = rand(1000, 9999999);
    return $stringPart . $datePart . $intPart;
}

function runTests($numTests, $seed)
{
    echo "Running: $numTests iterations\n";
    echo "PHP version: " . phpversion() . "\n";
    echo "Seed: $seed\n";
    $divergentHashes = [];
    for ($i = 0; $i < $numTests; $i++) {
        echo "Progress: " . ($i + 1) . "/$numTests\r";
        $key = generateRandomKey();
        $hash3Int = hash3Int_fc($key, $seed);
        $murmur3a = murmur3a($key, $seed);
        if ($hash3Int !== $murmur3a) {
            $divergentHashes[] = [
                'key' => $key,
                'hash3Int' => $hash3Int,
                'murmur3a' => $murmur3a
            ];
        }
    }

    echo "\n";

    return $divergentHashes;
}

$numTests = isset($argv[1]) ? (int) $argv[1] : 100000;
$seed = isset($argv[2]) ? (int) $argv[2] : 0;
$divergentHashes = runTests($numTests, $seed);

if (empty($divergentHashes)) {
    echo "All hashes matched!\n";
} else {
    echo "Divergent hashes found:\n";
    print_r($divergentHashes);
}

Finds no divergence in one million iterations.

Running: 1000000 iterations
PHP version: 8.1.2-1ubuntu2.20
Seed: 0
Progress: 1000000/10000000
All hashes matched!

@FraserChapman FraserChapman changed the title Divergence between hash3_int and PHP's murmur3a Divergence between hash3_int and PHP's murmur3a (and other implementations) Jan 29, 2025
@fraser-twinkl
Copy link

fraser-twinkl commented Feb 1, 2025

...and finally here is an optimised working implementation that has no divergences and is way faster than the previous fix.

<?php
function murmur3a(string $key, int $seed = 0): int
{
    return (int) base_convert(hash('murmur3a', $key, false, ["seed" => $seed]), 16, 10);
}

function hash3Int_fc_opt($key, $seed = 0)
{
    $key = array_values(unpack('C*', $key));
    $klen = count($key);
    $h1 = $seed < 0 ? -$seed : $seed;
    $bytes = $klen - ($remainder = $klen & 3);

    for ($i = 0; $i < $bytes; ) {
        $k1 = $key[$i] |
            ($key[++$i] << 8) |
            ($key[++$i] << 16) |
            ($key[++$i] << 24);
        ++$i;

        $k1 = (($k1 & 0xFFFF) * 0x2D51 + ((($k1 >> 16) * 0x2D51 + ($k1 & 0xFFFF) * 0xCC9E) << 16)) & 0xFFFFFFFF;
        $k1 = (($k1 << 15) | (($k1 >> 17) & 0x7FFF)) & 0xFFFFFFFF;
        $k1 = (($k1 & 0xFFFF) * 0x3593 + ((($k1 >> 16) * 0x3593 + ($k1 & 0xFFFF) * 0x1B87) << 16)) & 0xFFFFFFFF;

        $h1 ^= $k1;
        $h1 = (($h1 << 13) | (($h1 >> 19) & 0x1FFF)) & 0xFFFFFFFF;
        $h1 = (($h1 & 0xFFFF) * 5 + ((($h1 >> 16) * 5) << 16)) & 0xFFFFFFFF;
        $h1 = ($h1 + 0xe6546b64) & 0xFFFFFFFF;
    }

    if ($remainder) {
        $k1 = 0;
        switch ($remainder) {
            case 3:
                $k1 ^= $key[$i + 2] << 16;
            case 2:
                $k1 ^= $key[$i + 1] << 8;
            case 1:
                $k1 ^= $key[$i];
                $k1 = (($k1 & 0xFFFF) * 0x2D51 + ((($k1 >> 16) * 0x2D51 + ($k1 & 0xFFFF) * 0xCC9E) << 16)) & 0xFFFFFFFF;
                $k1 = (($k1 << 15) | (($k1 >> 17) & 0x7FFF)) & 0xFFFFFFFF;
                $k1 = (($k1 & 0xFFFF) * 0x3593 + ((($k1 >> 16) * 0x3593 + ($k1 & 0xFFFF) * 0x1B87) << 16)) & 0xFFFFFFFF;
                $h1 ^= $k1;
        }
    }

    $h1 ^= $klen;
    $h1 ^= ($h1 >> 16) & 0xFFFF;
    $h1 = (($h1 & 0xFFFF) * 0xCA6B + ((($h1 >> 16) * 0xCA6B + ($h1 & 0xFFFF) * 0x85EB) << 16)) & 0xFFFFFFFF;
    $h1 ^= ($h1 >> 13) & 0x7FFFF;
    $h1 = (($h1 & 0xFFFF) * 0xAE35 + ((($h1 >> 16) * 0xAE35 + ($h1 & 0xFFFF) * 0xC2B2) << 16)) & 0xFFFFFFFF;
    $h1 ^= ($h1 >> 16) & 0xFFFF;

    return $h1;
}

function generateRandomKey()
{
    $stringPart = substr(str_shuffle(str_repeat("abcdefghijklmnopqrstuvwxyz", 10)), 0, rand(10, 30));
    $datePart = date("Ymd", timestamp: rand(strtotime("2000-01-01"), strtotime("2030-12-31")));
    $intPart = rand(1000, 9999999);
    return $stringPart . $datePart . $intPart;
}

function runTests($numTests, $seed)
{
    echo "Running: $numTests iterations\n";
    echo "PHP version: " . phpversion() . "\n";
    echo "Seed: $seed\n";
    $divergentHashes = [];
    for ($i = 0; $i < $numTests; $i++) {
        echo "Progress: " . ($i + 1) . "/$numTests\r";
        $key = generateRandomKey();
        $hash3Int = hash3Int_fc_opt($key, $seed);
        $murmur3a = murmur3a($key, $seed);
        if ($hash3Int !== $murmur3a) {
            $divergentHashes[] = [
                'key' => $key,
                'hash3Int' => $hash3Int,
                'murmur3a' => $murmur3a
            ];
        }
    }

    echo "\n";

    return $divergentHashes;
}

$numTests = isset($argv[1]) ? (int) $argv[1] : 100000;
$seed = isset($argv[2]) ? (int) $argv[2] : 0;
$divergentHashes = runTests($numTests, $seed);

if (empty($divergentHashes)) {
    echo "All hashes matched!\n";
} else {
    echo "Divergent hashes found:\n";
    print_r($divergentHashes);
}

gives

Running: 1000000 iterations
PHP version: 8.1.2-1ubuntu2.20
Seed: 0
Progress: 1000000/1000000
All hashes matched!

Popped this into a PR here - #17 :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants