Remove the `raw` feature and make `RawTable` private #546
cc @clarfonthey |
This will give more freedom for the internal implementation details of hashbrown to evolve without the need for regular releases with breaking changes.

All existing users of `RawTable` should migrate to the `HashTable` API which is entirely safe while providing the same flexibility as `RawTable`.

This also removes the following features which were only exposed under `RawTable`:

- `RawTable::iter_hash`
- `RawIter::reflect_insert` and `RawIter::reflect_remove`
- `RawTable::clone_from_with_hasher`
- `RawTable::insert_no_grow` and `RawTable::try_insert_no_grow`
- `RawTable::allocation_info`
- `RawTable::try_with_capacity(_in)`
- `HashMap::raw_table(_mut)` and `HashSet::raw_table(_mut)`
I think https://github.com/Apache/arrow-datafusion uses:
|
Since you're already doing this, would you mind pushing a release after too so rust-lang/rust#128711 can go through? Please and thank you. <3 IMHO the sooner we do this, the better, since it gives people time to migrate away from the API before we start following up on the threat to improve the code. And, since it's a published crate, it's not a big deal, since the old (non-breaking) versions will continue to work correctly. |
Datafusion seem to actually require the full functionality of This isn't something that can be supported on |
We have also been discussing implementing our own custom hash table in apache/datafusion#7095, so perhaps this would be another potential reason to pursue that idea.

I agree that using the low level details of how the hash table in hashbrown is implemented is not ideal (e.g. it constrains how hashbrown can version releases).

@Amanieu I wonder if you would consider a feature flag on hashbrown like |
This will require work in DataFusion to fix TopK aggregations, and then after it is fixed it will cause significant (40% IIRC) performance regressions. |
I don't think that's a good idea because every minor release of hashbrown could break users of your crate. However you can just keep using the 0.14 version of hashbrown which will still have the |
Another option is for us to fork the crate, which might be reasonable if we can get more performance by doing so |
I think we should create some tickets in DataFusion for moving our usage towards |
@bors r+ |
☀️ Test successful - checks-actions |
I was using |
I think |
Add `HashTable::iter_hash`, `HashTable::iter_hash_mut`

This is a follow-up to #546 (comment). `iter_hash` from the old raw API can be useful for reading from a "bag" / "multi map" type which allows duplicate key-value pairs. Exposing it safely in `HashTable` takes a fairly small wrapper around `RawIterHash`. This PR partially reverts #546 to restore `RawTable::iter_hash` and its associated types.
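To make the "bag" / "multi map" use case concrete, here is a minimal std-only sketch. It is not hashbrown's implementation or API: `Bag`, `hash_of`, and `get_all` are hypothetical names, and this `iter_hash` is only a toy stand-in that yields every entry stored under a given hash, leaving the key comparison to the caller.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash a key with the std default hasher (hypothetical helper).
fn hash_of<K: Hash>(key: &K) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    h.finish()
}

/// Toy multimap: each entry is stored next to its hash, duplicates allowed.
/// Assumes it is created with at least one bucket.
struct Bag<K, V> {
    buckets: Vec<Vec<(u64, K, V)>>,
}

impl<K: Hash + Eq, V> Bag<K, V> {
    fn new(n: usize) -> Self {
        Bag { buckets: (0..n).map(|_| Vec::new()).collect() }
    }

    /// Always appends, so the same key may be stored many times.
    fn insert(&mut self, key: K, value: V) {
        let hash = hash_of(&key);
        let idx = (hash % self.buckets.len() as u64) as usize;
        self.buckets[idx].push((hash, key, value));
    }

    /// Toy stand-in for `iter_hash`: every entry stored under `hash`.
    /// Different keys can share a hash, so callers still compare keys.
    fn iter_hash(&self, hash: u64) -> impl Iterator<Item = (&K, &V)> + '_ {
        let idx = (hash % self.buckets.len() as u64) as usize;
        self.buckets[idx]
            .iter()
            .filter(move |(h, _, _)| *h == hash)
            .map(|(_, k, v)| (k, v))
    }

    /// All values stored for `key`, built on top of `iter_hash`.
    fn get_all<'a>(&'a self, key: &'a K) -> impl Iterator<Item = &'a V> + 'a {
        self.iter_hash(hash_of(key))
            .filter(move |(k, _)| *k == key)
            .map(|(_, v)| v)
    }
}
```

The point of the sketch is that a single lookup by hash can return several matching entries, which a plain `find`-style lookup that stops at the first match cannot express.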
This was previously removed from `RawTable` in rust-lang#546. This is now added as a public API on `HashMap`, `HashSet` and `HashTable`.
What specific functionality do you need that isn't available though |
I took a brief look, and the main roadblock that I see will be iterators and entry structs that currently contain |
Added a comment to #545 mentioning to look into dashmap's internals when I go about ripping out the raw table API. It may be the case that we need to offer some kind of lifetime-erased version of the various types to get it to work, but I'm hoping we can get around that. |
I thought about this a bit and fundamentally it's not a problem with |
Hi, two projects I'm affiliated with are stuck on hashbrown 0.14 right now due to using the raw API in a way that can't be replicated in 0.15. They both use

https://github.com/ruffle-rs/ruffle uses the

https://github.com/kyren/piccolo (which is something I'm much better equipped to explain) uses the

    local my_table = { a = 1, b = 2, c = 3 }
    local first_key = next(my_table, nil)
    local next_key = next(my_table, first_key)
    -- and so on...

The core issue is that both projects implement VMs for languages whose implementation assumes that a hash table has some kind of observable order that is stable ONLY when the hash table is not modified in certain ways. For piccolo (and this is the same in the internals of PUC-Rio Lua), that requirement is actually that the hash table order only be stable under no inserts or removals whatsoever, only mutating values and (transparent) modifications of keys.

How this is accomplished is with something the PUC-Rio Lua internals refer to as "dead keys", having a version of a key that compares identically to a normal live key but doesn't own a value for the purposes of garbage collection. When removing an entry in a hash table, you set the key to its "dead key" equivalent and set the value to

We (some of the Ruffle devs and I) believe we have a minimal API that satisfies both our use cases and comes with hopefully minimal guarantees. If it were only
There is probably a lot of room for equivalent APIs that would work here also. The core issue is that we need to observe the bucket order for a hash table that is not being modified. To put it in maybe a less restrictive way, we really don't even care about the bucket order, what both projects really need is to be able to iterate over a |
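As an illustration of the "dead key" idea described above, here is a toy sketch of a Lua-style `next` over a fixed bucket array. It is not the code from piccolo, Ruffle, or PUC-Rio Lua, and every name in it (`Slot`, `ToyTable`, `drain_in_order`) is hypothetical. It assumes integer keys, a non-zero bucket count, and no resizing, so bucket order never changes.

```rust
/// One bucket of a toy open-addressing table.
#[derive(Clone, Debug, PartialEq)]
enum Slot {
    Empty,
    Live(u64, i32), // key and value
    Dead(u64),      // removed entry: key stays as a "dead key", value is gone
}

struct ToyTable {
    slots: Vec<Slot>,
}

impl ToyTable {
    /// Assumes `n > 0`; the table never grows, so bucket order is fixed.
    fn with_buckets(n: usize) -> Self {
        ToyTable { slots: vec![Slot::Empty; n] }
    }

    /// Bucket index holding `key` (live or dead), found by linear probing.
    fn find_index(&self, key: u64) -> Option<usize> {
        let n = self.slots.len();
        let start = (key % n as u64) as usize;
        for i in 0..n {
            let idx = (start + i) % n;
            match &self.slots[idx] {
                Slot::Empty => return None,
                Slot::Live(k, _) | Slot::Dead(k) if *k == key => return Some(idx),
                _ => {}
            }
        }
        None
    }

    /// Returns false if the table is full.
    fn insert(&mut self, key: u64, value: i32) -> bool {
        let n = self.slots.len();
        let start = (key % n as u64) as usize;
        for i in 0..n {
            let idx = (start + i) % n;
            let reuse = match &self.slots[idx] {
                Slot::Empty => true,
                Slot::Live(k, _) | Slot::Dead(k) => *k == key,
            };
            if reuse {
                self.slots[idx] = Slot::Live(key, value);
                return true;
            }
        }
        false
    }

    /// Removal leaves a dead key behind so a traversal can resume from it.
    fn remove(&mut self, key: u64) {
        if let Some(idx) = self.find_index(key) {
            self.slots[idx] = Slot::Dead(key);
        }
    }

    /// Lua-style `next`: `None` starts the traversal, `Some(k)` continues
    /// after the bucket of `k`. An unknown key simply ends the traversal here.
    fn next(&self, prev: Option<u64>) -> Option<(u64, i32)> {
        let start = match prev {
            None => 0,
            Some(k) => self.find_index(k)? + 1,
        };
        self.slots[start..].iter().find_map(|s| match s {
            Slot::Live(k, v) => Some((*k, *v)),
            _ => None,
        })
    }
}

/// Visit every live key in bucket order, removing each one as it is visited.
fn drain_in_order(t: &mut ToyTable) -> Vec<u64> {
    let mut seen = Vec::new();
    let mut cur = t.next(None);
    while let Some((k, _)) = cur {
        seen.push(k);
        t.remove(k); // the dead key keeps `next(Some(k))` well defined
        cur = t.next(Some(k));
    }
    seen
}
```

The traversal only needs to map a key back to its bucket position and scan forward from there, which is the kind of access the comment above is asking for.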
So, I just want to clarify here that the reason why I think that the ability to keep track of bucket indices beyond just hashes seems reasonable, although I think that at least linking a few demonstrations of existing code you cannot rewrite under the I think it would be most helpful to open a separate issue for these features (or multiple, depending on how different they are) so they can be tracked and followed better. There are a bunch of different features that people want from |
Ah I see, I may have gotten the wrong impression from reading the PR comments and from discussion with Ruffle devs, I don't think I correctly understood the primary motivation for getting rid of RawTable (implementation freedom vs just being old and needing to be rewritten). I understand better now, thank you!
Sure thing, the code I'm the most familiar with is https://github.com/kyren/piccolo/blob/master/src/table/raw.rs, specifically the
I'd be happy to open a new issue for the features that piccolo and Ruffle need, I think they're both close enough that a single issue for the both of them could cover it? |
https://github.com/Berrysoft/wepoll2 I used the |
I think that fallible allocation APIs are a very reasonable addition to all the types in the crate, so it's worth filing an issue for. |
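For readers unfamiliar with the term, a fallible allocation API reports an allocation failure as an error value instead of aborting. The sketch below uses the standard library's `HashMap::try_reserve` purely to show the shape of such an API. Which hashbrown types should expose an equivalent is exactly what the suggested issue would settle, and `build_map` is a hypothetical helper.

```rust
use std::collections::{HashMap, TryReserveError};

/// Reserve space up front and surface allocation failure to the caller
/// instead of aborting the process.
fn build_map(n: usize) -> Result<HashMap<usize, usize>, TryReserveError> {
    let mut map = HashMap::new();
    // Fails with an error if the capacity overflows or the allocator refuses.
    map.try_reserve(n)?;
    for i in 0..n {
        map.insert(i, i * i);
    }
    Ok(map)
}
```

A caller can then decide how to recover, for example by matching on the `Result` and falling back to a smaller batch size.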
If anyone was previously relying on the removed `RawTable` functionality, please raise a comment. It may be possible to re-introduce it as a safe API in `HashTable` and/or `HashMap`.