-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Divergence between hash3_int and PHP's murmur3a (and other implementations) #16
Comments
After quite a bit of testing, and looking over previous versions of the Fixes
<?php
function murmur3a(string $key, int $seed = 0): int
{
return (int) base_convert(hash('murmur3a', $key, false, ["seed" => $seed]), 16, 10);
}
function hash3Int_fc($key, $seed = 0)
{
$key = array_values(unpack('C*', $key));
$klen = count($key);
$h1 = $seed < 0 ? -$seed : $seed;
// (fix #1) - correctly perform unsigned right shift
$u32rs = function ($a, $b) {
if ($b == 0) {
return $a;
}
$shifted = $a >> $b;
$mask = 0xFFFFFFFF >> $b;
return $shifted & $mask;
};
// (fix #2) - correctly handle unsigned 32-bit multiplication
$u32m = function ($a, $b) {
$a &= 0xFFFFFFFF;
$b &= 0xFFFFFFFF;
$lo = ($a & 0xFFFF) * ($b & 0xFFFF);
$mid = ($a >> 16) * ($b & 0xFFFF) + ($a & 0xFFFF) * ($b >> 16);
$result = $lo + ($mid << 16);
return $result & 0xFFFFFFFF;
};
for ($i = 0, $bytes = $klen - ($remainder = $klen & 3); $i < $bytes; ) {
$k1 = (($key[$i] & 0xff))
| (($key[++$i] & 0xff) << 8)
| (($key[++$i] & 0xff) << 16)
| (($key[++$i] & 0xff) << 24);
++$i;
$k1 = $u32m($k1, 0xcc9e2d51);
$k1 = (($k1 << 15) | $u32rs($k1, 17)) & 0xFFFFFFFF;
$k1 = $u32m($k1, 0x1b873593);
$h1 ^= $k1;
$h1 = (($h1 << 13) | $u32rs($h1, 19)) & 0xFFFFFFFF;
$h1 = ($u32m($h1, 5) + 0xe6546b64) & 0xFFFFFFFF;
}
$k1 = 0;
switch ($remainder) {
case 3:
$k1 ^= ($key[$i + 2] & 0xff) << 16;
case 2:
$k1 ^= ($key[$i + 1] & 0xff) << 8;
case 1:
$k1 ^= $key[$i] & 0xff;
$k1 = $u32m($k1, 0xcc9e2d51);
$k1 = (($k1 << 15) | $u32rs($k1, 17)) & 0xFFFFFFFF;
$k1 = $u32m($k1, 0x1b873593);
$h1 ^= $k1;
}
$h1 ^= $klen;
// (fix #3) - proper handling of signed/unsigned bit manipulation
$h1 ^= $u32rs($h1, 16);
$h1 = $u32m($h1, 0x85ebca6b);
$h1 ^= $u32rs($h1, 13);
$h1 = $u32m($h1, 0xc2b2ae35);
$h1 ^= $u32rs($h1, 16);
return $h1 & 0xFFFFFFFF;
}
function generateRandomKey()
{
$stringPart = substr(str_shuffle(str_repeat("abcdefghijklmnopqrstuvwxyz", 10)), 0, rand(10, 30));
$datePart = date("Ymd", timestamp: rand(strtotime("2000-01-01"), strtotime("2030-12-31")));
$intPart = rand(1000, 9999999);
return $stringPart . $datePart . $intPart;
}
function runTests($numTests, $seed)
{
echo "Running: $numTests iterations\n";
echo "PHP version: " . phpversion() . "\n";
echo "Seed: $seed\n";
$divergentHashes = [];
for ($i = 0; $i < $numTests; $i++) {
echo "Progress: " . ($i + 1) . "/$numTests\r";
$key = generateRandomKey();
$hash3Int = hash3Int_fc($key, $seed);
$murmur3a = murmur3a($key, $seed);
if ($hash3Int !== $murmur3a) {
$divergentHashes[] = [
'key' => $key,
'hash3Int' => $hash3Int,
'murmur3a' => $murmur3a
];
}
}
echo "\n";
return $divergentHashes;
}
$numTests = isset($argv[1]) ? (int) $argv[1] : 100000;
$seed = isset($argv[2]) ? (int) $argv[2] : 0;
$divergentHashes = runTests($numTests, $seed);
if (empty($divergentHashes)) {
echo "All hashes matched!\n";
} else {
echo "Divergent hashes found:\n";
print_r($divergentHashes);
} Finds no divergence in one million iterations.
|
...and finally here is an optimised working implementation that has no divergences and is way faster than the previous fix. <?php
function murmur3a(string $key, int $seed = 0): int
{
return (int) base_convert(hash('murmur3a', $key, false, ["seed" => $seed]), 16, 10);
}
function hash3Int_fc_opt($key, $seed = 0)
{
$key = array_values(unpack('C*', $key));
$klen = count($key);
$h1 = $seed < 0 ? -$seed : $seed;
$bytes = $klen - ($remainder = $klen & 3);
for ($i = 0; $i < $bytes; ) {
$k1 = $key[$i] |
($key[++$i] << 8) |
($key[++$i] << 16) |
($key[++$i] << 24);
++$i;
$k1 = (($k1 & 0xFFFF) * 0x2D51 + ((($k1 >> 16) * 0x2D51 + ($k1 & 0xFFFF) * 0xCC9E) << 16)) & 0xFFFFFFFF;
$k1 = (($k1 << 15) | (($k1 >> 17) & 0x7FFF)) & 0xFFFFFFFF;
$k1 = (($k1 & 0xFFFF) * 0x3593 + ((($k1 >> 16) * 0x3593 + ($k1 & 0xFFFF) * 0x1B87) << 16)) & 0xFFFFFFFF;
$h1 ^= $k1;
$h1 = (($h1 << 13) | (($h1 >> 19) & 0x1FFF)) & 0xFFFFFFFF;
$h1 = (($h1 & 0xFFFF) * 5 + ((($h1 >> 16) * 5) << 16)) & 0xFFFFFFFF;
$h1 = ($h1 + 0xe6546b64) & 0xFFFFFFFF;
}
if ($remainder) {
$k1 = 0;
switch ($remainder) {
case 3:
$k1 ^= $key[$i + 2] << 16;
case 2:
$k1 ^= $key[$i + 1] << 8;
case 1:
$k1 ^= $key[$i];
$k1 = (($k1 & 0xFFFF) * 0x2D51 + ((($k1 >> 16) * 0x2D51 + ($k1 & 0xFFFF) * 0xCC9E) << 16)) & 0xFFFFFFFF;
$k1 = (($k1 << 15) | (($k1 >> 17) & 0x7FFF)) & 0xFFFFFFFF;
$k1 = (($k1 & 0xFFFF) * 0x3593 + ((($k1 >> 16) * 0x3593 + ($k1 & 0xFFFF) * 0x1B87) << 16)) & 0xFFFFFFFF;
$h1 ^= $k1;
}
}
$h1 ^= $klen;
$h1 ^= ($h1 >> 16) & 0xFFFF;
$h1 = (($h1 & 0xFFFF) * 0xCA6B + ((($h1 >> 16) * 0xCA6B + ($h1 & 0xFFFF) * 0x85EB) << 16)) & 0xFFFFFFFF;
$h1 ^= ($h1 >> 13) & 0x7FFFF;
$h1 = (($h1 & 0xFFFF) * 0xAE35 + ((($h1 >> 16) * 0xAE35 + ($h1 & 0xFFFF) * 0xC2B2) << 16)) & 0xFFFFFFFF;
$h1 ^= ($h1 >> 16) & 0xFFFF;
return $h1;
}
function generateRandomKey()
{
$stringPart = substr(str_shuffle(str_repeat("abcdefghijklmnopqrstuvwxyz", 10)), 0, rand(10, 30));
$datePart = date("Ymd", timestamp: rand(strtotime("2000-01-01"), strtotime("2030-12-31")));
$intPart = rand(1000, 9999999);
return $stringPart . $datePart . $intPart;
}
function runTests($numTests, $seed)
{
echo "Running: $numTests iterations\n";
echo "PHP version: " . phpversion() . "\n";
echo "Seed: $seed\n";
$divergentHashes = [];
for ($i = 0; $i < $numTests; $i++) {
echo "Progress: " . ($i + 1) . "/$numTests\r";
$key = generateRandomKey();
$hash3Int = hash3Int_fc_opt($key, $seed);
$murmur3a = murmur3a($key, $seed);
if ($hash3Int !== $murmur3a) {
$divergentHashes[] = [
'key' => $key,
'hash3Int' => $hash3Int,
'murmur3a' => $murmur3a
];
}
}
echo "\n";
return $divergentHashes;
}
$numTests = isset($argv[1]) ? (int) $argv[1] : 100000;
$seed = isset($argv[2]) ? (int) $argv[2] : 0;
$divergentHashes = runTests($numTests, $seed);
if (empty($divergentHashes)) {
echo "All hashes matched!\n";
} else {
echo "Divergent hashes found:\n";
print_r($divergentHashes);
} gives
Popped this into a PR here - #17 :) |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
If I test the
hash3_int
method against php8.1.2murmur3a
it appears to have divergences.I found this after stumbling across a divergence by chance, and then testing
300000
randomly generated keys in roughly the same format as the problematic key I'd discovered.e.g.
Given the keys
hash3_int
andmurmur3a
give different hash values.All the other tested keys matched exactly.
This is using
I also tested against other versions of murmurhash3 (in various languages; go, JavaScript, Python, etc) and they all do give the same hash as
murmur3a
- which leads me to suspect there is a subtle bug in thehash3_int
implementation within this repo.This seems to be consistent, if I run the test for a few hundred thousand iterations it finds a small number of divergent keys.
Happy to give more context and information if you need it, and apologies if I am overlooking something here!
The text was updated successfully, but these errors were encountered: