-
Notifications
You must be signed in to change notification settings - Fork 291
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP Switch to a full bitwidth h2 #513
base: master
Are you sure you want to change the base?
WIP Switch to a full bitwidth h2 #513
Conversation
* Changes: - Use all values of h2, not just 130 of it. - Convert SSE2 implementation for benchmarking. * Motivation: Using 256 values instead of 130 could theoretically lower the number of false-positive residual matches by close to 50%. On the other hand, it does make h2 slightly more complicated to compute, and possibly to operate on.
61ed56b
to
2098bd4
Compare
@Amanieu Would it be possible to benchmark on SSE2? I'd like to see if it's worth it before trying to make the generic and neon targets pass (especially the generics one, all that bit fiddling is a bit involved). |
Here are my benchmark results:
Overall, this seems like a loss. My guess is that the extra logic needed when handling h2 values is slowing things down. |
One thing I wonder, is how many collisions there are in the first place. Perfect hashes should be akin to uniformly spread values across the 0-127 range today. Given the general formula of the birthday paradox, we get that the probability of at least one collision across 16 elements is 16 * 15 / (2 * 128) = 0.9375. Thus even with only 128 values to choose from the chance of two elements having the same residual h2 is pretty low in the first place. Even in a full group -- which won't be the case most of the time -- the probability of at least one collision is only 93.75%... and if there's a single collision, it means 14 elements (out of 16) have no collision. This means paying for the extra complexity for every element, but rarely ever needing it. I tried my best to keep the cost low, but computing h2 is slightly harder, and the "magic" remapping on rehash is definitely not optimal. Perhaps someone with better insight could pick better values, and better instructions. |
@matthieu-m It seems to me that you did not take into account that on the x64 platform the expression In addition, I suggest also considering the following possible implementation godbolt: const EMPTY: u8 = 0b1111_1111; // 255
const DELETED: u8 = 0b1111_1110; // 254
pub fn h2(hash: u64) -> u8 {
let bit = hash as u8;
bit >> ((bit > 253) as u8)
} |
I realize that this implementation is also faulty, I was aiming for the top 8 bits, but it seems we only get the top 7 here.
This switches from the top 8 to the bottom 8 bits, and will overlap with The trick to shift by 1 to avoid colliding with special values is interesting, and may save cycles compared to an array lookup. |
To be honest, I don’t know why to use the top bits. We just need some bits to skip obvious hash (values) mismatches. It seems (as we use good hasher) that we can take any bits for these purposes, or maybe I don’t understand something. The |
// value, some hash functions (such as FxHash) produce a usize result | ||
// instead, which means that the top 32 bits are 0 on 32-bit platforms. | ||
// So we use MIN_HASH_LEN constant to handle this. | ||
let top7 = hash >> (MIN_HASH_LEN * 8 - 7); | ||
(top7 & 0x7f) as u8 // truncation | ||
let top8 = hash >> (MIN_HASH_LEN * 8 - 7); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let top8 = hash >> (MIN_HASH_LEN * 8 - 7); | |
let top8 = hash >> (MIN_HASH_LEN * 8 - 8); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or just:
/// Secondary hash function, saved in the low 8 bits of the control byte.
#[inline]
#[allow(clippy::cast_possible_truncation)]
fn h2(hash: u64) -> u8 {
// Grab the low 8 bits of the hash. We use a 1 bit shift to the left if
// the bit is equal to special values
let bit = hash as u8;
bit >> ((bit as i8 > (DELETED as i8 - 1)) as u8)
}
@matthieu-m I carefully looked at the implementation and you are right, it is not possible to use values other than 127_i8 and 126_i8. In addition to your improvements, I also suggest changing the const EMPTY: u8 = 0b0111_1111; // 127
const DELETED: u8 = 0b0111_1110; // 126
impl RawTableInner {
#[inline]
unsafe fn erase(&mut self, index: usize) {
debug_assert!(self.is_bucket_full(index));
let index_before = index.wrapping_sub(Group::WIDTH) & self.bucket_mask;
let empty_before = Group::load(self.ctrl(index_before)).match_empty();
let empty_after = Group::load(self.ctrl(index)).match_empty();
// Removing if
let empty_group =
(empty_before.leading_zeros() + empty_after.trailing_zeros() < Group::WIDTH) as u8;
let ctrl = DELETED + empty_group;
self.growth_left += empty_group as usize;
self.set_ctrl(index, ctrl);
self.items -= 1;
}
} It looks like it will improve removal performance: https://godbolt.org/z/5P9oTYe5z |
let empty = x86::_mm_set1_epi8(EMPTY as i8); | ||
let deleted = x86::_mm_set1_epi8(DELETED as i8); | ||
|
||
let is_full = x86::_mm_cmplt_epi8(self.0, deleted); | ||
let is_special = x86::_mm_cmpeq_epi8(is_full, x86::_mm_set1_epi8(0)); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let empty = x86::_mm_set1_epi8(EMPTY as i8); | |
let deleted = x86::_mm_set1_epi8(DELETED as i8); | |
let is_full = x86::_mm_cmplt_epi8(self.0, deleted); | |
let is_special = x86::_mm_cmpeq_epi8(is_full, x86::_mm_set1_epi8(0)); | |
// Find all special bytes. A byte is EMPTY or DELETED if it is greater than or equal to DELETED. | |
let is_special = x86::_mm_cmpgt_epi8(self.0, x86::_mm_set1_epi8(DELETED as i8 - 1)); | |
// Computes the bitwise OR between array of EMPTY bytes (that represents special bytes) | |
// and array of DELETED bytes. The logic is based on manipulating by the low bit of the byte: | |
// | |
// - If the byte was equal to EMPTY (0111_1111), then bitwise OR with DELETED will | |
// not change its value (0111_1111 | 0111_1110 = 0111_1111); | |
// - If the byte was FULL (0000_0000), then bitwise OR with DELETED will make it | |
// DELETED (0000_0000 | 0111_1110 = 0111_1110) |
x86::_mm_and_si128(is_full, deleted), | ||
x86::_mm_and_si128(is_special, empty), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
x86::_mm_and_si128(is_full, deleted), | |
x86::_mm_and_si128(is_special, empty), | |
// Converting all bytes that represent special bytes (`1111_1111`) to EMPTY `0111_1111` | |
// 1111_1111 & 0111_1111 = 0111_1111 | |
// 0000_0000 & 0111_1111 = 0000_0000 | |
x86::_mm_and_si128(is_special, x86::_mm_set1_epi8(EMPTY as i8)), | |
// Array of DELETED bytes | |
x86::_mm_set1_epi8(DELETED as i8), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This reduces the number of function calls from 8 to 6.
@JustForFun88 You seem to have many (good!) suggestions, and unfortunately I don't have much bandwidth right now. I can probably get to them eventually (I may have some time in 2-3 weeks), but that's a bit of an eternity momentum-wise. So, if I may suggest, why don't you checkout the branch, apply your suggested changes, and run the benchmarks? You'll see quickly if it works out or not, and you'll be able to try further ideas as well. |
You're correct. Now, imagine a table of 256 slots:
And therein lies the rub. The goal of h2 is to provide a quick filter across the elements of a group of 16 (contiguous) elements: if you use h2 to decide which group the element goes in, then the h2s of the elements within the group are not uniformly distributed -- at all -- and thus it becomes a very poor filter, and performance will likely suffer. Thus it's important to try and source h1 and h2 from different, non-overlapping bits, as much as possible. And taking top and bottom is the easiest way to do so. Which of the two takes top and which takes bottom shouldn't matter -- with a 64-bits hash -- so if it's faster for h2 to take the bottom bits, then defining h1 as hash >> 8 would work. |
Yes, because only roughly 48 bits are actually addressable on x86_64 presently. And I'm not sure of any systems which even approach that size... so when the |
☔ The latest upstream changes (presumably #558) made this pull request unmergeable. Please resolve the merge conflicts. |
Changes:
Group
functions are correctly implemented.Motivation:
Using 256 values instead of 130 could theoretically lower the number of false-positive residual matches by close to 50%.
On the other hand, it does make h2 slightly more complicated to compute, and possibly to operate on.
Limitations:
Only the SSE2 is ported at first, not the generic or neon ones, to gauge whether the performance looks worth it.
Design:
The values for EMPTY and DELETED are chosen so as to play well with SSE2, which does not have unsigned vectors. By using the top of the signed range, operations to distinguish between special and non-special are reduced to a single comparison, whereas using the middle of the range would require 2.
On the other hand, the
convert_special_to_empty_and_full_to_deleted
method is more complicated.The results, on my machine, are not encouraging, but my machine is noisy so conducting proper benchmarking is tough.