
WIP Switch to a full bitwidth h2 #513

Open

matthieu-m wants to merge 1 commit into master from performance/full-bitwidth-h2

Conversation

matthieu-m

Changes:

  • Use all values of h2, not just 130 of them.
  • Convert SSE2 implementation for benchmarking.
  • Add generic tests to verify that Group functions are correctly implemented.

Motivation:

Using 256 values instead of 130 could theoretically lower the number of false-positive residual matches by close to 50%.

On the other hand, it does make h2 slightly more complicated to compute, and possibly to operate on.
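As a back-of-the-envelope check of that figure (a standalone sketch, not code from this PR): a non-matching full slot passes the residual filter with probability 1/V when there are V usable h2 values, so going from 128 values to 254 cuts the false-positive rate by roughly half.

fn main() {
    // 7-bit h2 today: 128 usable values.
    // Full-bitwidth h2: 256 byte values minus the EMPTY and DELETED specials = 254.
    let reduction = 1.0 - 128.0_f64 / 254.0;
    println!("false-positive reduction: {:.1}%", reduction * 100.0); // ~49.6%
}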

Limitations:

Only the SSE2 implementation is ported at first, not the generic or NEON ones, to gauge whether the performance gain looks worth it.

Design:

The values for EMPTY and DELETED are chosen so as to play well with SSE2, which does not have unsigned vector comparisons. By using the top of the signed range, distinguishing special from non-special bytes reduces to a single comparison, whereas using the middle of the range would require two.
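For intuition, here is a minimal scalar sketch of that single-comparison trick, assuming the EMPTY = 127 / DELETED = 126 constants that appear later in this thread (the actual code operates on whole 16-byte groups via SSE2 intrinsics):

const EMPTY: u8 = 0b0111_1111;   // 127 == i8::MAX
const DELETED: u8 = 0b0111_1110; // 126

// A full byte never holds 126 or 127, so as an i8 it is either in 0..=125 or
// negative (original byte 128..=255). One signed comparison therefore
// separates full bytes from the two special ones.
fn is_full(ctrl: u8) -> bool {
    (ctrl as i8) < DELETED as i8
}

fn is_special(ctrl: u8) -> bool {
    (ctrl as i8) > DELETED as i8 - 1
}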

On the other hand, the convert_special_to_empty_and_full_to_deleted method is more complicated.

The results, on my machine, are not encouraging, but my machine is noisy so conducting proper benchmarking is tough.

matthieu-m force-pushed the performance/full-bitwidth-h2 branch from 61ed56b to 2098bd4 on March 24, 2024 at 16:54
@matthieu-m
Author

@Amanieu Would it be possible to benchmark on SSE2?

I'd like to see if it's worth it before trying to make the generic and NEON targets pass (especially the generic one; all that bit fiddling is rather involved).

@Amanieu
Member

Amanieu commented Mar 30, 2024

Here are my benchmark results:

 name                         old.txt ns/iter  new.txt ns/iter  diff ns/iter   diff %  speedup 
 clone_from_large             5,064            5,170                     106    2.09%   x 0.98 
 clone_from_small             46               46                          0    0.00%   x 1.00 
 clone_large                  5,141            5,220                      79    1.54%   x 0.98 
 clone_small                  61               62                          1    1.64%   x 0.98 
 grow_insert_ahash_highbits   19,525           20,297                    772    3.95%   x 0.96 
 grow_insert_ahash_random     19,889           20,877                    988    4.97%   x 0.95 
 grow_insert_ahash_serial     19,530           20,794                  1,264    6.47%   x 0.94 
 grow_insert_std_highbits     36,006           36,842                    836    2.32%   x 0.98 
 grow_insert_std_random       35,998           36,841                    843    2.34%   x 0.98 
 grow_insert_std_serial       35,786           36,865                  1,079    3.02%   x 0.97 
 insert                       13,911           8,763                  -5,148  -37.01%   x 1.59 
 insert_ahash_highbits        17,971           18,048                     77    0.43%   x 1.00 
 insert_ahash_random          18,071           17,950                   -121   -0.67%   x 1.01 
 insert_ahash_serial          17,980           17,867                   -113   -0.63%   x 1.01 
 insert_erase_ahash_highbits  18,830           19,945                  1,115    5.92%   x 0.94 
 insert_erase_ahash_random    19,060           19,897                    837    4.39%   x 0.96 
 insert_erase_ahash_serial    18,426           19,221                    795    4.31%   x 0.96 
 insert_erase_std_highbits    32,468           33,627                  1,159    3.57%   x 0.97 
 insert_erase_std_random      33,186           34,270                  1,084    3.27%   x 0.97 
 insert_erase_std_serial      32,487           33,589                  1,102    3.39%   x 0.97 
 insert_std_highbits          22,173           22,509                    336    1.52%   x 0.99 
 insert_std_random            22,249           22,673                    424    1.91%   x 0.98 
 insert_std_serial            22,203           22,440                    237    1.07%   x 0.99 
 insert_unique_unchecked      5,300            5,614                     314    5.92%   x 0.94 
 iter_ahash_highbits          928              927                        -1   -0.11%   x 1.00 
 iter_ahash_random            932              926                        -6   -0.64%   x 1.01 
 iter_ahash_serial            932              916                       -16   -1.72%   x 1.02 
 iter_std_highbits            930              938                         8    0.86%   x 0.99 
 iter_std_random              931              912                       -19   -2.04%   x 1.02 
 iter_std_serial              946              920                       -26   -2.75%   x 1.03 
 lookup_ahash_highbits        3,938            3,989                      51    1.30%   x 0.99 
 lookup_ahash_random          4,039            4,208                     169    4.18%   x 0.96 
 lookup_ahash_serial          3,829            3,929                     100    2.61%   x 0.97 
 lookup_fail_ahash_highbits   3,267            3,389                     122    3.73%   x 0.96 
 lookup_fail_ahash_random     3,316            3,502                     186    5.61%   x 0.95 
 lookup_fail_ahash_serial     3,367            3,332                     -35   -1.04%   x 1.01 
 lookup_fail_std_highbits     9,224            9,561                     337    3.65%   x 0.96 
 lookup_fail_std_random       9,337            9,620                     283    3.03%   x 0.97 
 lookup_fail_std_serial       9,246            9,318                      72    0.78%   x 0.99 
 lookup_std_highbits          9,996            10,278                    282    2.82%   x 0.97 
 lookup_std_random            10,077           10,681                    604    5.99%   x 0.94 
 lookup_std_serial            10,002           10,569                    567    5.67%   x 0.95 
 rehash_in_place              180,307          187,097                 6,790    3.77%   x 0.96 

Overall, this seems like a loss. My guess is that the extra logic needed when handling h2 values is slowing things down.

@matthieu-m
Author

One thing I wonder is how many collisions there are in the first place.

A perfect hash should spread values uniformly across the 0-127 range today. The birthday-paradox estimate gives an expected number of colliding pairs among 16 elements of 16 * 15 / (2 * 128) ≈ 0.94, and the exact probability of at least one collision is 1 - (127/128) * (126/128) * ... * (113/128) ≈ 62%.

Thus even with only 128 values to choose from, the chance of two elements having the same residual h2 is pretty low in the first place. Even in a full group -- which won't be the case most of the time -- the probability of at least one collision is only about 62%... and a single collision still means that 14 elements (out of 16) have no collision.

This means paying for the extra complexity for every element, but rarely ever needing it.
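For reference, here is a quick standalone check of those numbers (not code from this PR):

fn main() {
    // Exact birthday-paradox computation: 16 uniform draws from 128 values.
    let p_no_collision: f64 = (1..16).map(|k| 1.0 - k as f64 / 128.0).product();
    println!("P(at least one collision) = {:.1}%", (1.0 - p_no_collision) * 100.0); // ~62.4%

    // First-order estimate: expected number of colliding pairs.
    println!("expected colliding pairs = {:.2}", 16.0 * 15.0 / (2.0 * 128.0)); // ~0.94
}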

I tried my best to keep the cost low, but computing h2 is slightly harder, and the "magic" remapping on rehash is definitely not optimal. Perhaps someone with better insight could pick better values, and better instructions.

@JustForFun88
Contributor

JustForFun88 commented Apr 26, 2024

@matthieu-m It seems to me that you did not take into account that on the x64 platform the expression let top8 = hash >> (MIN_HASH_LEN * 8 - 7) only yields the top 7 bits, and therefore you have a slightly incorrect implementation, which is why we see bad performance.

In addition, I suggest also considering the following possible implementation (godbolt):

const EMPTY: u8 = 0b1111_1111;   // 255
const DELETED: u8 = 0b1111_1110; // 254

pub fn h2(hash: u64) -> u8 {
    // Take the low 8 bits of the hash; shift right by 1 when the byte would
    // collide with DELETED (254) or EMPTY (255), mapping both to 127.
    let bit = hash as u8;
    bit >> ((bit > 253) as u8)
}

@matthieu-m
Author

@matthieu-m It seems to me that you did not take into account that on the x64 platform the expression let top8 = hash >> (MIN_HASH_LEN * 8 - 7) only yields the top 7 bits, and therefore you have a slightly incorrect implementation, which is why we see bad performance.

I realize that this implementation is also faulty: I was aiming for the top 8 bits, but it seems we only get the top 7 here.

In addition, I suggest also considering the following possible implementation (godbolt):

const EMPTY: u8 = 0b1111_1111;   // 255
const DELETED: u8 = 0b1111_1110; // 254

#[no_mangle]
pub fn h2(hash: u64) -> u8 {
    let bit = hash as u8;
    bit >> ((bit > 253) as u8)
}

This switches from the top 8 to the bottom 8 bits, and will overlap with h1. Is there a reason to prefer the bottom 8 (and adjust h1) over the top 8?

The trick to shift by 1 to avoid colliding with special values is interesting, and may save cycles compared to an array lookup.

@JustForFun88
Contributor

This switches from the top 8 to the bottom 8 bits, and will overlap with h1. Is there a reason to prefer the bottom 8 (and adjust h1) over the top 8?

To be honest, I don't know why the top bits should be used. We just need some bits to skip obvious hash-value mismatches. It seems (as we use a good hasher) that we can take any bits for these purposes, or maybe I don't understand something.

The h1 function is used for indexing, and I think it doesn't matter that it overlaps with h2. On x64, h1 simply returns the provided value (u64).

 // value, some hash functions (such as FxHash) produce a usize result
 // instead, which means that the top 32 bits are 0 on 32-bit platforms.
 // So we use MIN_HASH_LEN constant to handle this.
-let top7 = hash >> (MIN_HASH_LEN * 8 - 7);
-(top7 & 0x7f) as u8 // truncation
+let top8 = hash >> (MIN_HASH_LEN * 8 - 7);
@JustForFun88
Contributor

JustForFun88 commented Apr 27, 2024

Suggested change:
-let top8 = hash >> (MIN_HASH_LEN * 8 - 7);
+let top8 = hash >> (MIN_HASH_LEN * 8 - 8);

Contributor

Or just:

/// Secondary hash function, saved in the control byte.
#[inline]
#[allow(clippy::cast_possible_truncation)]
fn h2(hash: u64) -> u8 {
    // Grab the low 8 bits of the hash, shifting right by 1 bit when the
    // byte would otherwise collide with the special values (DELETED, EMPTY).
    let bit = hash as u8;
    bit >> ((bit as i8 > (DELETED as i8 - 1)) as u8)
}

@JustForFun88
Contributor

JustForFun88 commented Apr 27, 2024

@matthieu-m I carefully looked at the implementation and you are right: it is not possible to use values other than 127_i8 and 126_i8. In addition to your improvements, I also suggest changing the erase function:

const EMPTY: u8 = 0b0111_1111;   // 127
const DELETED: u8 = 0b0111_1110; // 126

impl RawTableInner {
    #[inline]
    unsafe fn erase(&mut self, index: usize) {
        debug_assert!(self.is_bucket_full(index));

        let index_before = index.wrapping_sub(Group::WIDTH) & self.bucket_mask;
        let empty_before = Group::load(self.ctrl(index_before)).match_empty();
        let empty_after = Group::load(self.ctrl(index)).match_empty();

        // Branchless replacement for the `if`: since EMPTY == DELETED + 1,
        // adding the flag turns DELETED into EMPTY exactly when a nearby
        // empty byte guarantees no probe sequence passes through this group.
        let empty_group =
            (empty_before.leading_zeros() + empty_after.trailing_zeros() < Group::WIDTH) as u8;
        let ctrl = DELETED + empty_group;
        self.growth_left += empty_group as usize;

        self.set_ctrl(index, ctrl);
        self.items -= 1;
    }
}

It looks like it will improve removal performance: https://godbolt.org/z/5P9oTYe5z

Comment on lines +153 to +158
let empty = x86::_mm_set1_epi8(EMPTY as i8);
let deleted = x86::_mm_set1_epi8(DELETED as i8);

let is_full = x86::_mm_cmplt_epi8(self.0, deleted);
let is_special = x86::_mm_cmpeq_epi8(is_full, x86::_mm_set1_epi8(0));

Contributor

Suggested change:
-let empty = x86::_mm_set1_epi8(EMPTY as i8);
-let deleted = x86::_mm_set1_epi8(DELETED as i8);
-let is_full = x86::_mm_cmplt_epi8(self.0, deleted);
-let is_special = x86::_mm_cmpeq_epi8(is_full, x86::_mm_set1_epi8(0));
+// Find all special bytes. A byte is EMPTY or DELETED if it is greater than
+// or equal to DELETED in a signed comparison.
+let is_special = x86::_mm_cmpgt_epi8(self.0, x86::_mm_set1_epi8(DELETED as i8 - 1));
+// Compute the bitwise OR between the array of EMPTY bytes (representing the
+// special bytes) and an array of DELETED bytes. The logic manipulates the
+// low bit of each byte:
+//
+// - if the byte was EMPTY (0111_1111), bitwise OR with DELETED does not
+//   change its value (0111_1111 | 0111_1110 = 0111_1111);
+// - if the byte was FULL (0000_0000), bitwise OR with DELETED makes it
+//   DELETED (0000_0000 | 0111_1110 = 0111_1110)

Comment on lines +160 to +161
x86::_mm_and_si128(is_full, deleted),
x86::_mm_and_si128(is_special, empty),
@JustForFun88
Contributor

JustForFun88 commented Apr 28, 2024

Suggested change:
-x86::_mm_and_si128(is_full, deleted),
-x86::_mm_and_si128(is_special, empty),
+// Convert all bytes that represent special bytes (1111_1111) to EMPTY (0111_1111):
+// 1111_1111 & 0111_1111 = 0111_1111
+// 0000_0000 & 0111_1111 = 0000_0000
+x86::_mm_and_si128(is_special, x86::_mm_set1_epi8(EMPTY as i8)),
+// Array of DELETED bytes
+x86::_mm_set1_epi8(DELETED as i8),

Contributor

This reduces the number of function calls from 8 to 6.

@matthieu-m
Author

@JustForFun88 You seem to have many (good!) suggestions, and unfortunately I don't have much bandwidth right now.

I can probably get to them eventually (I may have some time in 2-3 weeks), but that's a bit of an eternity momentum-wise.

So, if I may suggest: why don't you check out the branch, apply your suggested changes, and run the benchmarks? You'll see quickly whether it works out, and you'll be able to try further ideas as well.

@matthieu-m
Author

This switches from the top 8 to the bottom 8 bits, and will overlap with h1. Is there a reason to prefer the bottom 8 (and adjust h1) over the top 8?

To be honest, I don't know why the top bits should be used. We just need some bits to skip obvious hash-value mismatches. It seems (as we use a good hasher) that we can take any bits for these purposes, or maybe I don't understand something.

The h1 function is used for indexing, and I think it doesn't matter that it overlaps with h2. On x64, h1 simply returns the provided value (u64).

You're correct: h1 is used for indexing via hash % table_length, which takes the bottom bits, and Swiss Table (Abseil's implementation) uses the top bits for h2 (the residual).

Now, imagine a table of 256 slots:

  • index = h1 % 256 = hash % 256.
  • h2 = hash % 256.

And therein lies the rub. The goal of h2 is to provide a quick filter across the elements of a group of 16 (contiguous) elements: if the bits of h2 are the same bits that decided which group an element went in, then the h2 values of the elements within a group are not uniformly distributed -- at all -- and h2 becomes a very poor filter, so performance will likely suffer.

Thus it's important to try to source h1 and h2 from different, non-overlapping bits as much as possible, and taking the top and the bottom is the easiest way to do so. Which of the two takes the top and which takes the bottom shouldn't matter -- with a 64-bit hash -- so if it's faster for h2 to take the bottom bits, then defining h1 as hash >> 8 would work.
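As a concrete illustration of that split (a hypothetical sketch, not code from this PR):

// h1 and h2 sourced from disjoint bits of a 64-bit hash, with h2 taking the
// bottom byte as suggested above.
fn h1(hash: u64) -> usize {
    // Drop the byte consumed by h2; the table applies its power-of-two
    // length mask to the remaining bits later.
    (hash >> 8) as usize
}

fn h2(hash: u64) -> u8 {
    // Bottom byte, disjoint from the bits h1 uses (the remapping that avoids
    // the special values is omitted for brevity).
    hash as u8
}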

@morrisonlevi

The h1 function is used for indexing and I think it doesn’t matter that it overlaps with h2. On x64 h1 simply returns provided value (u64).

Yes, because only roughly 48 bits are actually addressable on x86_64 at present, and I'm not aware of any systems that even approach that size... so when h1 gets truncated by the bucket mask, those upper bits aren't looked at. It's not worth doing any work to shrink down to the lower 57 bits because those will get filtered out by the mask anyway.

@bors
Contributor

bors commented Sep 17, 2024

☔ The latest upstream changes (presumably #558) made this pull request unmergeable. Please resolve the merge conflicts.
