-
Notifications
You must be signed in to change notification settings - Fork 294
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactoring proposal: cleaning up the internal APIs #545
Comments
I generally agree with this plan and would encourage you to continue working on this (though I have some thoughts about the specific, below). Most of the code in cc @JustForFun88 who has done a lot of work on documenting the internals of Some general thoughts:
Now onto more specific thoughts:
The actual control array is
This is actually
The primary motivation behind the current design of In hindsight, a better design would have been an iterator type that doesn't actually implement the Finally, we wouldn't actually need separate |
This bit actually confuses me a bit-- aren't we already aligned to a multiple of the group size, so, we don't need to add any extra padding? Or is this specifically for the case when
That's one benefit of being a published crate instead of a standard library API: we can just publish a breaking version and it's totally fine. I was assuming that ultimately, we'd end up with a breaking version anyway once we replaced the APIs, so, I'm down with this. |
We are actually using the "unaligned groups" variant that was briefly covered in the CppCon presentation since that is the variant used by the C++ implementation of SwissTable in Abseil. |
Oh, suddenly the triangular probing makes a lot more sense. Thank you for clarifying that. |
Adding in a few extra notes here re: some other conversations that were had in other places:
|
Poking at this more: one other weird question I have regarding the alignment of the actual control table. If we're using unaligned buckets, then why is the group width factored at all into the alignment of the control data? Presumably, we'd just need to align to the bucket array, right? I'm also just looking at the alignment calculations in the current Layout::from_size_align_unchecked(num_buckets + Group::SIZE, Group::SIZE).extend(Layout::array::<T>(num_buckets)) but I want to make sure that this is actually correct. Note that we'd be kind of abusing the semantics here since, while EDIT: Now realising my mistake is that the buckets are in reverse order. I'll redo that code to something that actually works. |
There's an extra Also in the past we avoided using |
Realising now that basically all of my problems were due to accidentally flipping the two fields, so, now all the alignment makes sense. Will try and take another stab at this tomorrow. Hopefully at least some of the code can be simplified. |
Fixes #21 so crate will work on Rust 1.82. We can't upgrade to `hashbrown 0.15` since it removes the `raw` table support. This is known upstream, and may just be something we have to live with: rust-lang/hashbrown#545 (comment) To align with `hashbrown`, and given this is a breaking change regardless, we remove the `raw` module just like `hashbrown` has. This should hopefully increase the chances that one day we can move to `hashbrown`'s new lower-level interface (`HashTable`) without a breaking change.
Fixes #21 so crate will work on Rust 1.82. We can't upgrade to `hashbrown 0.15` since it removes the `raw` table support. This is known upstream, and may just be something we have to live with: rust-lang/hashbrown#545 (comment) To align with `hashbrown`, and given this is a breaking change regardless, we remove the `raw` module just like `hashbrown` has. This should hopefully increase the chances that one day we can move to `hashbrown`'s new lower-level interface (`HashTable`) without a breaking change. This release also removes the `nightly` feature because it was mostly useless.
Fixes #21 so crate will work on Rust 1.82. We can't upgrade to `hashbrown 0.15` since it removes the `raw` table support. This is known upstream, and may just be something we have to live with: rust-lang/hashbrown#545 (comment) To align with `hashbrown`, and given this is a breaking change regardless, we remove the `raw` module just like `hashbrown` has. This should hopefully increase the chances that one day we can move to `hashbrown`'s new lower-level interface (`HashTable`) without a breaking change. This release also removes the `nightly` feature because it was mostly useless.
Fixes #21 so crate will work on Rust 1.82. We can't upgrade to `hashbrown 0.15` since it removes the `raw` table support. This is known upstream, and may just be something we have to live with: rust-lang/hashbrown#545 (comment) To align with `hashbrown`, and given this is a breaking change regardless, we remove the `raw` module just like `hashbrown` has. This should hopefully increase the chances that one day we can move to `hashbrown`'s new lower-level interface (`HashTable`) without a breaking change. This release also removes the `nightly` feature because it was mostly useless.
Fixes #21 so crate will work on Rust 1.82. We can't upgrade to `hashbrown 0.15` since it removes the `raw` table support. This is known upstream, and may just be something we have to live with: rust-lang/hashbrown#545 (comment) To align with `hashbrown`, and given this is a breaking change regardless, we remove the `raw` module just like `hashbrown` has. This should hopefully increase the chances that one day we can move to `hashbrown`'s new lower-level interface (`HashTable`) without a breaking change. This release also removes the `nightly` feature because it was mostly useless.
Fixes #21 so crate will work on Rust 1.82. We can't upgrade to `hashbrown 0.15` since it removes the `raw` table support. This is known upstream, and may just be something we have to live with: rust-lang/hashbrown#545 (comment) To align with `hashbrown`, and given this is a breaking change regardless, we remove the `raw` module just like `hashbrown` has. This should hopefully increase the chances that one day we can move to `hashbrown`'s new lower-level interface (`HashTable`) without a breaking change. This release also removes the `nightly` feature because it was mostly useless.
This is my writeup re: #543 (comment)
cc #290 as this should also strive to fix that as well. Ideally, by the end of the refactor, we should be able to add the following to the crate root:
Documenting the actual layout
Right now, the documentation for the crate on the actual layout of the hash table is incredibly lacking. It mostly just points to the C++ SwissTable implementation and says "we did that!" while ignoring the fact that, obviously, it is not quite that. There are a lot of subtleties to the Rust code that are completely omitted from the basic description; both the downsides and the upsides are completely undocumented, IMHO, since even if they do exist they're difficult to find in the code.
We should start out by describing the layout from its broadest points first and use this to design the API. I'm also going to be renaming some terms so we can use the more specific names for individual pieces inside the API.
So, here's how I've decided to describe it:
The main benefit of this hash table implementation is that unlike other implementations, instead of a unified bucket array, we have two separate data and control arrays. Having a small, dedicated control array which does not depend on the type inside the table makes it extremely easy to traverse the table using SIMD instructions. It also improves cache locality by keeping the control data as small as possible, so that any operation on the table can happen entirely in cache, even if the data itself is a cache miss.
While the control and data arrays are separate, we can conceptually treat them together as an array of buckets, which forms our hash table.
Control Array
The control array contains a series of tags which indicate the status of any given bucket. To make SIMD operations efficient, we split the tags into groups whose size depends on the specific SIMD implementation. This imposes a restriction on the control array that its size is at least a multiple of the group size, and its alignment is at least a multiple of the group alignment. We additionally require that the number of buckets, and hence the number of tags, be a power of two.
There are two special tags,
EMPTY
andDELETED
, which correspond to buckets which have either always been empty, or who have recently become empty. All other tags are consideredFULL
.Regardless of implementation, the tags are:
EMPTY
:0b1_1111111
DELETED
:0b1_0000000
FULL
:0b0_xxxxxxx
, wherexxxxxxx
are seven bits from the hashWe can use the term
NONFULL
to refer to buckets which are eitherEMPTY
orDELETED
.Data Array
The data array is simply an array of maybe-uninit items, where the exact value of these items depends on what the hash table is being used for. The alignment of these items further imposes alignment constraints on the control array, and similarly, the control array further constrains the alignment of these items. However, no additional padding is placed between the items in the array.
The data array is indexed in the reverse order of the control array, and located before the control array in memory. This lets us store a single pointer to the control array to represent both arrays: if cast to a tag, the pointer can offset
n
to get the control for then
th bucket, and if cast to data, the pointer can offset-n - 1
to get the data for then
th bucket.Padding
Since the alignment of the control and data arrays may differ, the very start of the allocation for the hash table may contain some padding. Specifically, if:
Then the padding will be equal to this discrepancy, to ensure that the control array is aligned.
Redoing the raw table
Conceptually, let's reframe a raw table as being parameterized by
N
, whereN
is a power-of-two number of items in the table. Then, this table (specificallyRawTableInner
) contains:bucket_mask
(N - 1
)ctrl
(pointer to the control array)items
(number of full buckets)growth_left
(N - items
)There have been some comments regarding whether the number of fields can be removed, but I'm going to focus on instead discussing the structures that are useful for operating on the table itself.
Note that these wrapper types don't have to exist for optimizing the exact contents of the raw table. Since the number of fields is small, we can think of these types as little API structures that let us encapsulate the data we need to perform a given operation, that we can then wrap in other types like iterators to do the things we want them to do. Although each individual operation will likely refer to an unsafe method which does not care about the specific types involved or the lifetimes, by having these higher-level wrappers even on the internals, we can reduce the potential to cause undefined behaviour by creating higher-level wrappers on them.
It's also worth noting that undefined data here is specifically reading uninitialized data and breaking lifetime rules. Although changing an item's hash without moving it to a different bucket or leaking an item might cause the table to work incorrectly, it shouldn't trigger UB, especially because
PartialEq
andHash
can't cause UB by themselves.Also, the below types can be made to each have a "non-generic" version, but as a general rule, we should not be interacting between these versions of the code. Specifically, the correct usage would be, for example,
One<T, A>
to interact withTwo<T, A>
, which internally callsOneInner
andTwoInner
methods, rather thanOneInner
directly interacting withTwoInner
. This helps us minimize the number of unsafe methods and hence the opportunities to cause UB, and as long as we add inline tags in the right places, it shouldn't cause any performance issues either.Tag
Right now, we have a
Group
to indicate groups of control, and I propose adding in aTag
as well to act as a newtype overu8
. The goal of this is to help make some internal APIs clearer (is this a byte because it's unknown data, or is it explicitly a control tag?) and also to have a place to put a lot of the associated constants, methods, etc., for groups.If it turns out that this just adds way too much unnecessary unsafe code (there's no safe transmute for transparent newtypes yet), we can omit it. But at least, it's an API nicety that I would like to add.
RawBuckets<T, A>
This is a wrapper over the control pointer and something to indicate its length. Its job is to do all of the pointer arithmetic we care about for the table, namely:
MaybeUninit<T>
inside the dataIt doesn't care about the number of filled buckets, the load of the table, etc., and only cares about performing these operations correctly.
OwnedBuckets<T, A>
This type owns a fixed set of buckets, but cannot be resized. Using an analogy, if
RawTable<T, A>
isVec<T, A>
, then this isBox<[T], A>
.The goal of this wrapper would be to allow deriving operations like
Clone
,Default
, andDrop
on types which own a table, like draining and into-iterators.Right now, these types simply hold onto a raw allocation triple (pointer, layout, allocator) and then free the allocation when they drop. This makes it difficult to clone these types, since the actual structs that "own" the allocation aren't part of this allocation triple. Most likely,
OwnedBuckets<T, A>
would simply wrap aRawBuckets<T, A>
and an allocator, but we could also precompute the layout and allocation start if we want to.Probe<'a, T>
andProbeMut<'a, T>
This implements all the methods we need to actually probe the table for a given hash value. Like the various
find
methods onRawTable
, it will simply fail if the table is too small to insert something.Scan<'a, T>
,ScanMut<'a, T>
These are the raw iterator types that we use for traversing the table in linear order, as we do for iterators.
Drain<'_, T>
,IntoIter<T, A>
These are the extra iterators we need for draining the table. Internally, they share the same code, but the difference is whether they own the table and are responsible for dropping it.
RawTable<T, A>
This is, of course, the actual table, and it also has the ability to verify the table load, rehash the contents, and resize the table.
Specifics of implementation
This is the final part where I attempt to create my own "FAQ" section for questions that nobody asked, in no particular order or usefulness to anyone:
Bucket<T>
which is awkward and represents sometimes a pointer and sometimes an index, we should instead operate on indices for the table and returnMaybeUninit<T>
references whenever data needs to be referenced.InsertSlot
should go. It really doesn't add any safety guarantees over just passing an index, and since methods have the wordinsert
in their names anyway, it's not like people are going to misuse it.*const Group
or a*const Tag
, since a lot of the code is sloppy and uses*const u8
everywhere. This is part of why I prefer having a dedicatedTag
type.--document-private-items
instead.Scan
,ScanMut
,Drain
, andIntoIter
would likely all use the sameRawIterRange
type similar to the existing one, but we would explicitly use the safe wrappers for their implementations onHashMap
andHashSet
andHashTable
instead of the internal one.derive(Clone)
on iterators or not add my ownDrop
implementation and have them work.Nice writeup, but what now?
I'm basically writing up my plans here for a couple reasons:
I will likely be the one to do a lot of this refactoring, unless someone else is substantially more interested. I don't mind sharing the work, just know the reality of open-source work and that generally, people aren't that excited to implement other people's proposals.
The plan is to implement several of these types incrementally rather than doing it all at once. This also allows people to prepare for API breakages if they use the raw API.
The text was updated successfully, but these errors were encountered: