Remove the `raw` feature and make `RawTable` private #546

Amanieu · 2024-08-22T23:05:47Z

This will give more freedom for the internal implementation details of hashbrown to evolve without the need for regular releases with breaking changes.

All existing users of RawTable should migrate to the HashTable API which is entirely safe while providing the same flexibility as RawTable.

This also removes the following features which were only exposed under RawTable:

RawTable::iter_hash
RawIter::reflect_insert and RawIter::reflect_remove
RawTable::clone_from_with_hasher
RawTable::insert_no_grow and RawTable::try_insert_no_grow
RawTable::allocation_info
RawTable::try_with_capacity(_in)
HashMap::raw_table(_mut) and HashSet::raw_table(_mut)

If anyone was previously relying on this functionaly, please raise a comment. It may be possible to re-introduce it as a safe API in HashTable and/or HashMap.

Amanieu · 2024-08-22T23:05:53Z

cc @clarfonthey

This will give more freedom for the internal implementation details of hashbrown to evolve without the need for regular releases with breaking changes. All existing users of `RawTable` should migrate to the `HashTable` API which is entirely safe while providing the same flexibility as `RawTable`. This also removes the following features which were only exposed under `RawTable`: - `RawTable::iter_hash` - `RawIter::reflect_insert` and `RawIter::reflect_remove` - `RawTable::clone_from_with_hasher` - `RawTable::insert_no_grow` and `RawTable::try_insert_no_grow` - `RawTable::allocation_info` - `RawTable::try_with_capacity(_in)` - `HashMap::raw_table(_mut)` and `HashSet::raw_table(_mut)`

Dandandan · 2024-08-23T00:43:11Z

I think https://github.com/Apache/arrow-datafusion

uses:

RawTable::try_insert_no_grow
RawTable::allocation_info

clarfonthey · 2024-08-23T00:57:26Z

Since you're already doing this, would you mind pushing a release after too so rust-lang/rust#128711 can go through? Please and thank you. <3

IMHO the sooner we do this, the better, since it gives people time to migrate away from the API before we start following up on the threat to improve the code. And, since it's a published crate, it's not a big deal, since the old (non-breaking) versions will continue to work correctly.

Amanieu · 2024-08-23T13:09:18Z

I think Apache/arrow-datafusion uses:

Datafusion seem to actually require the full functionality of RawTable since it directly stores bucket indices in a BinaryHeap so that it can delete arbitrary elements later (and rebuilds the heap on rehash to keep the indices valid).

This isn't something that can be supported on HashTable since it relies on low-level implementation details of RawTable such as the fact that buckets are never moved except in a rehash. Perhaps in the future the new raw API that @clarfonthey is planning (see #545) will be publicly exposed, but in the meantime datafusion can just stay on the previous hashbrown version.

Dandandan · 2024-08-23T13:11:57Z

FYI @alamb @avantgardnerio

alamb · 2024-08-23T13:23:02Z

We have also been discussing implementing our own custom hash table in apache/datafusion#7095, so perhaps this would be another potential reason to pursue that idea

I agree that using the low level details of how the hash table in hashbrown is implemented is not ideal (e.g. it constrains how hashbrown can version releases)

@Amanieu I wonder if you would consider a feature flag on hashbrown like hashbrown_unstable (following the model of tokio_unstable). That would let DataFusion keep using hashbrown internals (and it would be on us to deal with any changes during upgrade)

avantgardnerio · 2024-08-23T13:26:15Z

This will require work in DataFusion to fix TopK aggregations, and then after it is fixed it will cause significant (40% IIRC) performance regressions.

Amanieu · 2024-08-23T13:28:57Z

I don't think that's a good idea because every minor release of hashbrown could break users of your crate.

However you can just keep using the 0.14 version of hashbrown which will still have the raw feature. It's only being removed in 0.15.

alamb · 2024-08-23T13:30:35Z

However you can just keep using the 0.14 version of hashbrown which will still have the raw feature. It's only being removed in 0.15.

Another option is for us to to fork the crate which might be reasonable if we can get more performance by doing so

Dandandan · 2024-08-23T17:04:46Z

I think we should create some tickets in DataFusion for moving our usage towards HashTable / own hashtable (e.g. better vectorized) / forked code.

Amanieu · 2024-08-23T21:30:43Z

@bors r+

bors · 2024-08-23T21:30:45Z

📌 Commit 26ef4a1 has been approved by Amanieu

It is now in the queue for this repository.

bors · 2024-08-23T21:31:52Z

⌛ Testing commit 26ef4a1 with merge aa1411b...

bors · 2024-08-23T21:39:59Z

☀️ Test successful - checks-actions
Approved by: Amanieu
Pushing aa1411b to master...

the-mikedavis · 2024-08-28T20:54:49Z

I was using RawTable::iter_hash for a multi-map-like type (kind of like HashMap<K, Vec<T>>) and I don't believe there's a way to look up multiple entries with the same key from HashTable<(K, V)>: HashTable::find stops after the first matching element. Would it be possible to introduce something safe in HashTable to take its place? I can share more code if you'd like more context and maybe work on a PR if you think it's a reasonable addition.

Amanieu · 2024-08-28T21:05:40Z

I think iter_hash makes sense to expose on HashTable. Feel free to send a PR, you can do a partial revert of this one to get iter_hash back in RawTable.

Add `HashTable::iter_hash`, `HashTable::iter_hash_mut` This is a follow-up to #546 ([comment](#546 (comment))). `iter_hash` from the old raw API can be useful for reading from a "bag" / "multi map" type which allows duplicate key-value pairs. Exposing it safely in `HashTable` takes a fairly small wrapper around `RawIterHash`. This PR partially reverts #546 to restore `RawTable::iter_hash` and its associated types.

This was previously removed from `RawTable` in rust-lang#546. This is now added as a public API on `HashMap`, `HashSet` and `HashTable`.

xacrimon · 2024-09-14T14:03:35Z

@Amanieu dashmap relies on RawTable functionality. With this merged we're be effectively locked to 0.14 forever.

Amanieu · 2024-09-14T14:18:43Z

What specific functionality do you need that isn't available though HashTable which is the replacement for RawTable?

cuviper · 2024-09-14T14:25:37Z

I took a brief look, and the main roadblock that I see will be iterators and entry structs that currently contain RwLock guards with corresponding RawIter / Bucket / InsertSlot. With HashTable, those second parts all have lifetimes that would need to borrow through the guard, so they become self-referential.

clarfonthey · 2024-09-15T01:56:13Z

Added a comment to #545 mentioning to look into dashmap's internals when I go about ripping out the raw table API. It may be the case that we need to offer some kind of lifetime-erased version of the various types to get it to work, but I'm hoping we can get around that.

Re-introduce a way to get the allocation size of a table This was previously removed from `RawTable` in #546. This is now added as a public API on `HashMap`, `HashSet` and `HashTable`. Fixes #238 Fixes #506

Amanieu · 2024-10-01T08:46:06Z

I took a brief look, and the main roadblock that I see will be iterators and entry structs that currently contain RwLock guards with corresponding RawIter / Bucket / InsertSlot. With HashTable, those second parts all have lifetimes that would need to borrow through the guard, so they become self-referential.

I thought about this a bit and fundamentally it's not a problem with HashTable but rather a limitation of the MutexGuard API. I think you can work around this by using a Mutex<()> and an UnsafeCell<HashTable> where you manage the locking manually.

kyren · 2024-12-08T10:18:12Z

Hi, two projects I'm affiliated with are stuck on hashbrown 0.14 right now due to using the raw API in a way that can't be replicated in 0.15. They both use RawTable for nearly the same purpose, which I will try to explain.

https://github.com/ruffle-rs/ruffle uses the RawTable API to implement AVM2's iteration protocol for object properties. I'm really unfamiliar with this, @moulins and other people on the Ruffle team can probably explain in better detail, but the bytecode assumes that object properties are stored in a flat bucket based hash table, and the bytecode can get entries from an object based on a bucket index and also has an instruction like HasNext(object, index) -> next_index to find the next valid bucket.

https://github.com/kyren/piccolo (which is something I'm much better equipped to explain) uses the RawTable API for something similar. Lua has a function called next that can be used to find the "next" key in a Lua table, which is normally determined by its "raw" order in a hash table, or in hashbrown terms the bucket index (not all versions of PUC-Rio Lua implement tables this way so this could equivalently be some order in a red-black tree). It is used like this:

local my_table = { a = 1, b = 2, c = 3 }
local first_key = next(my_table, nil)
local next_key = next(my_table, first_key)
-- and so on...

The core issue is that both projects implement VMs for languages whose implementation assumes that a hash table has some kind of observable order that is stable ONLY when the hash table is not modified in certain ways. For piccolo (and this is the same in the internals of PUC-Rio Lua), that requirement is actually that the hash table order only be stable under no inserts or removals whatsoever, only mutating values and (transparent) modifications of keys. How this is accomplished is with something the PUC-Rio Lua internals refer to as "dead keys", having a version of a key that compares identically to a normal live key but doesn't own a value for the purposes of garbage collection. When removing an entry in a hash table, you set the key to its "dead key" equivalent and set the value to nil (an entry with a nil value in Lua is indistinguishable from an entry not being present). Ruffle's AVM2 does not use this trick right now and instead relies on the order being guaranteed not to change without a rehash, but AIUI it is possible to efficiently implement tombstone entries in the AVM2 VM in a similar way to piccolo, loosening the requirements.

We (some of the Ruffle devs and I) believe we have a minimal API that satisfies both our use cases and comes with hopefully minimal guarantees.

If it were only piccolo, the required API would be to get the next live entry after some given entry (and some way to get the first entry), in whatever the bucket order is, with no guarantees about that order being the same after any sort of mutation of the HashTable itself at all. AVM2 is slightly different due to needing to get an entry specifically by bucket index (which the AVM2 VM also caches internally in certain cases), but it never needs to get the bucket index by looking up an existing key (which piccolo must have to implement next). Therefore, a union of both requirements could look like this:

Access an entry by bucket index
Ask for the first valid bucket index greater than or equal to a given one (0 for the first valid bucket)
Ask for the bucket index corresponding to an occupied entry / find a bucket index by key and eq predicate
No guarantees whatsoever for iteration order after any kind of "table mutation", both insertion and removal

There is probably a lot of room for equivalent APIs that would work here also. The core issue is that we need to observe the bucket order for a hash table that is not being modified. To put it in maybe a less restrictive way, we really don't even care about the bucket order, what both projects really need is to be able to iterate over a HashTable in such a way that the iteration can be suspended and resumed (for piccolo, resuming from a specific key, for Ruffle, resuming from some other sort of token). It is possible ofc that both projects could create a less efficient custom data structure with auxiliary arrays for an entry order, but in both cases these are pretty much the core data structures for the implemented language, so this is valuable enough that our best alternative other than some kind of API change here is a custom fork of hashbrown (or to stay on 0.14 indefinitely).

clarfonthey · 2024-12-08T21:37:32Z

So, I just want to clarify here that the reason why RawTable is being deprecated is because it's old, not written with modern Rust features (like MaybeUninit) in mind, and is already soft-replaced by the HashTable API. So, all of this functionality you want to use seems extremely reasonable to add onto the HashTable API, since it all feels reasonable to me: the goal isn't to get rid of the lower-level APIs, but rather to offer them in a better place.

I think that the ability to keep track of bucket indices beyond just hashes seems reasonable, although I think that at least linking a few demonstrations of existing code you cannot rewrite under the HashTable API would be useful, so we know specifically what methods you're using and how they could best be replaced. I get the vague idea based upon what you're describing, but code examples would be the most helpful, since that way we can think of a suitable replacement API.

I think it would be most helpful to open a separate issue for these features (or multiple, depending on how different they are) so they can be tracked and followed better. There are a bunch of different features that people want from RawTable that haven't made their way onto HashTable yet, and it's best to separate out the discussions so we make sure nothing gets missed.

kyren · 2024-12-09T07:16:28Z

So, I just want to clarify here that the reason why RawTable is being deprecated is because it's old, not written with modern Rust features (like MaybeUninit) in mind, and is already soft-replaced by the HashTable API. So, all of this functionality you want to use seems extremely reasonable to add onto the HashTable API, since it all feels reasonable to me: the goal isn't to get rid of the lower-level APIs, but rather to offer them in a better place.

Ah I see, I may have gotten the wrong impression from reading the PR comments and from discussion with Ruffle devs, I don't think I correctly understood the primary motivation for getting rid of RawTable (implementation freedom vs just being old and needing to be rewritten). I understand better now, thank you!

I think that the ability to keep track of bucket indices beyond just hashes seems reasonable, although I think that at least linking a few demonstrations of existing code you cannot rewrite under the HashTable API would be useful, so we know specifically what methods you're using and how they could best be replaced. I get the vague idea based upon what you're describing, but code examples would be the most helpful, since that way we can think of a suitable replacement API.

Sure thing, the code I'm the most familiar with is is https://github.com/kyren/piccolo/blob/master/src/table/raw.rs, specifically the RawTable::next method. The way that Lua tables work and Lua table iteration works is a bit strange especially if you're not familiar with Lua, but hopefully it should be possible to follow.

I think it would be most helpful to open a separate issue for these features (or multiple, depending on how different they are) so they can be tracked and followed better. There are a bunch of different features that people want from RawTable that haven't made their way onto HashTable yet, and it's best to separate out the discussions so we make sure nothing gets missed.

I'd be happy to open a new issue for the features that piccolo and Ruffle need, I think they're both close enough that a single issue for the both of them could cover it?

Berrysoft · 2024-12-13T16:23:48Z

https://github.com/Berrysoft/wepoll2 I used the raw feature to make a panic-free HashMap. The insert method of HashMap calls reserve which will panic if allocation fails. I have to reimplement a try_insert method to call try_reserve to remove the panic code path. Do you have some advice on porting to v0.15?

clarfonthey · 2024-12-14T04:56:06Z

I think that falliable allocation APIs are a very reasonable addition to all the types in the crate, so, it's worth filing an issue for.

Amanieu force-pushed the remove-raw branch from 78efa0d to 26ef4a1 Compare August 22, 2024 23:07

bors merged commit aa1411b into rust-lang:master Aug 23, 2024
24 checks passed

the-mikedavis mentioned this pull request Aug 28, 2024

Add HashTable::iter_hash, HashTable::iter_hash_mut #549

Merged

cuviper mentioned this pull request Aug 30, 2024

Use hashbrown::HashTable instead of RawTable indexmap-rs/indexmap#343

Merged

cuviper mentioned this pull request Sep 5, 2024

How to calculate the size of the hashbrown::HashMap at runtime? #506

Closed

Amanieu added a commit to Amanieu/hashbrown that referenced this pull request Sep 12, 2024

Re-introduce a way to get the allocation size of a table

1089b93

This was previously removed from `RawTable` in rust-lang#546. This is now added as a public API on `HashMap`, `HashSet` and `HashTable`.

Amanieu mentioned this pull request Sep 12, 2024

Re-introduce a way to get the allocation size of a table #553

Merged

alamb mentioned this pull request Oct 1, 2024

Update hashbrown requirement from 0.14.2 to 0.15.0 apache/arrow-rs#6486

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove the `raw` feature and make `RawTable` private #546

Remove the `raw` feature and make `RawTable` private #546

Amanieu commented Aug 22, 2024

Amanieu commented Aug 22, 2024

Dandandan commented Aug 23, 2024 •

edited

Loading

clarfonthey commented Aug 23, 2024 •

edited

Loading

Amanieu commented Aug 23, 2024

Dandandan commented Aug 23, 2024

alamb commented Aug 23, 2024 •

edited

Loading

avantgardnerio commented Aug 23, 2024

Amanieu commented Aug 23, 2024

alamb commented Aug 23, 2024

Dandandan commented Aug 23, 2024

Amanieu commented Aug 23, 2024

bors commented Aug 23, 2024

bors commented Aug 23, 2024

bors commented Aug 23, 2024

the-mikedavis commented Aug 28, 2024

Amanieu commented Aug 28, 2024

xacrimon commented Sep 14, 2024 •

edited

Loading

Amanieu commented Sep 14, 2024

cuviper commented Sep 14, 2024

clarfonthey commented Sep 15, 2024

Amanieu commented Oct 1, 2024

kyren commented Dec 8, 2024 •

edited

Loading

clarfonthey commented Dec 8, 2024

kyren commented Dec 9, 2024 •

edited

Loading

Berrysoft commented Dec 13, 2024

clarfonthey commented Dec 14, 2024

Remove the raw feature and make RawTable private #546

Remove the raw feature and make RawTable private #546

Conversation

Amanieu commented Aug 22, 2024

Amanieu commented Aug 22, 2024

Dandandan commented Aug 23, 2024 • edited Loading

clarfonthey commented Aug 23, 2024 • edited Loading

Amanieu commented Aug 23, 2024

Dandandan commented Aug 23, 2024

alamb commented Aug 23, 2024 • edited Loading

avantgardnerio commented Aug 23, 2024

Amanieu commented Aug 23, 2024

alamb commented Aug 23, 2024

Dandandan commented Aug 23, 2024

Amanieu commented Aug 23, 2024

bors commented Aug 23, 2024

bors commented Aug 23, 2024

bors commented Aug 23, 2024

the-mikedavis commented Aug 28, 2024

Amanieu commented Aug 28, 2024

xacrimon commented Sep 14, 2024 • edited Loading

Amanieu commented Sep 14, 2024

cuviper commented Sep 14, 2024

clarfonthey commented Sep 15, 2024

Amanieu commented Oct 1, 2024

kyren commented Dec 8, 2024 • edited Loading

clarfonthey commented Dec 8, 2024

kyren commented Dec 9, 2024 • edited Loading

Berrysoft commented Dec 13, 2024

clarfonthey commented Dec 14, 2024

Remove the `raw` feature and make `RawTable` private #546

Remove the `raw` feature and make `RawTable` private #546

Dandandan commented Aug 23, 2024 •

edited

Loading

clarfonthey commented Aug 23, 2024 •

edited

Loading

alamb commented Aug 23, 2024 •

edited

Loading

xacrimon commented Sep 14, 2024 •

edited

Loading

kyren commented Dec 8, 2024 •

edited

Loading

kyren commented Dec 9, 2024 •

edited

Loading