
[FEA] Look at ways to reduce memory usage in operations that use hash tables #12139

Open
revans2 opened this issue Feb 14, 2025 · 2 comments
Labels
feature request New feature or request performance A performance related task/issue

Comments

@revans2
Collaborator

revans2 commented Feb 14, 2025

Is your feature request related to a problem? Please describe.
Hash joins and aggregations are some of the most common operations that we do. Joins especially, but there are a lot of aggregations that end up using hash tables.

Currently in CUDF all hash tables are allocated with a maximum occupancy of 50%.

https://github.com/rapidsai/cudf/blob/6c281fded5590e9896a901b5e12ce3a05d510be7/cpp/include/cudf/hashing/detail/helper_functions.cuh#L23
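To put the 50% cap in memory terms, here is a small Python sketch. The helper name and the `ceil(n * 100 / occupancy)` capacity formula are the generic open-addressing sizing math, not cudf's exact code:

```python
import math

def hash_table_capacity(num_keys: int, occupancy_pct: int) -> int:
    """Slots needed so the table stays at or below the target
    occupancy (occupied_slots / total_slots)."""
    return math.ceil(num_keys * 100 / occupancy_pct)

keys = 100_000_000
# At 50% occupancy every key costs two slots of memory...
slots_50 = hash_table_capacity(keys, 50)
# ...while 90% occupancy needs only ~1.11 slots per key.
slots_90 = hash_table_capacity(keys, 90)
print(f"50% occupancy: {slots_50:,} slots")
print(f"90% occupancy: {slots_90:,} slots")
print(f"memory reduction: {1 - slots_90 / slots_50:.0%}")
```

So moving from 50% to 90% occupancy would cut the table's slot memory by roughly 44% for the same key count.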

But when talking to the cuco team, they mentioned that their hash tables, especially static_multimap, have really good performance at occupancies up to 90%:

- For a key multiplicity of 1 there is effectively no performance drop between 10% and 90% occupancy
- For a key multiplicity of 8 there is about a 30% performance drop between 10% and 90% occupancy

In some cases we know a lot about the join cardinality and are already telling CUDF if the build side has distinct keys or not. We could potentially give more hints about the average key multiplicity.

For aggregations we can use previous batches to give us hints about future aggregation batches. We also know for merge aggregations how many upstream tasks there are or how many batches we are combining together to get a worst case key multiplicity.
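One possible shape for such a hint-driven sizing policy, sketched in Python. The function and its thresholds are entirely hypothetical, not an existing cudf or cuco API; the 90% figure for unique keys comes from the cuco numbers above, the rest are placeholder values an experiment would have to tune:

```python
def pick_occupancy_pct(avg_key_multiplicity: float) -> int:
    """Hypothetical policy: distinct build-side keys tolerate a high
    occupancy, while higher multiplicity gets more headroom so the
    longer probing chains stay cheap."""
    if avg_key_multiplicity <= 1.0:
        return 90   # unique keys: effectively no perf drop at 90%
    if avg_key_multiplicity <= 4.0:
        return 80   # assumed middle ground (placeholder)
    return 65       # heavy duplication: keep probing chains short
```

The existing "build side has distinct keys" hint would map straight onto the first branch; the aggregation-batch heuristics described above would feed an estimate into the others.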

The idea would be to see if we could reduce the memory usage without impacting the performance of average-case queries. If we can reduce the memory footprint of these operations, then ideally we can reduce spilling and overall memory pressure.

So this is not really a performance improvement that we would see everywhere, but it is an experiment to see if we can improve really common operations that use lots of memory.

@revans2 revans2 added ? - Needs Triage Need team to review and classify feature request New feature or request performance A performance related task/issue labels Feb 14, 2025
@binmahone
Collaborator

binmahone commented Feb 17, 2025

- For a key multiplicity of 1 there is effectively no performance drop between 10% and 90% occupancy
- For a key multiplicity of 8 there is about a 30% performance drop between 10% and 90% occupancy

May I ask where the numbers come from? It's a little unintuitive to me that the high-cardinality case (key multiplicity of 1) is more immune to occupancy changes than the low-cardinality cases. I'm curious to learn more about this.

@sleeepyjack

sleeepyjack commented Feb 20, 2025

Sorry for the delay @binmahone.
Let me elaborate on this:

The cuCollections hash map implementation used in cudf is based on open addressing (also, counterintuitively, called closed hashing in the literature), i.e., there is a fixed number of slots in the map. Keys get assigned to an initial slot based on their hash value. If the slot is already occupied (in case of an insert) or non-matching (in case of a lookup), there is a so-called probing scheme, i.e., a deterministic sequence of alternative slots to look at for this key.

Now, given this setup, there can be two kinds of collisions:

  1. Hash collisions: Meaning that two distinct keys hash to the same slot based on their hash value. The rate of these collisions directly correlates with the map occupancy, i.e., the fuller the map, the more likely we hit a slot that's already occupied by another key. This is due to the non-injective nature of the mapping from the hash value domain to the much smaller slot index domain. Although colliding keys might hash to the same initial slot, probing schemes such as double hashing make it very unlikely that they also share the same downstream probing sequence, ultimately resolving the collision after one or more probing steps.
  2. Key collisions: These collisions occur when two or more identical keys are inserted into the map. They share the same hash value and thus follow the exact same probing sequence. For example, if a key already occurs N times in the map, inserting the same key again takes at least N+1 probing steps. That's the reason why high-multiplicity input data performs worse than unique keys.

Loosely speaking: While hash collisions can be resolved quickly by applying an appropriate probing scheme such as double hashing, key collisions cannot benefit from this strategy.
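The two effects can be seen in a toy Python model of an open-addressing multimap with double hashing (this is only a sketch of the technique, not cuco's actual implementation; the `('h1', k)` / `('h2', k)` hashing trick just derives two independent hash functions):

```python
# Toy open-addressing table with double hashing: count probe steps per
# insert to contrast hash collisions (occupancy-driven) with key
# collisions (multiplicity-driven). Duplicates occupy separate slots,
# mimicking multimap-style inserts.
def avg_probes_per_insert(keys, capacity):
    table = [None] * capacity
    total_probes = 0
    for k in keys:
        slot = hash(('h1', k)) % capacity             # initial slot
        step = hash(('h2', k)) % (capacity - 1) + 1   # step size, never 0
        for _ in range(capacity):
            total_probes += 1
            if table[slot] is None:                   # empty slot found
                table[slot] = k
                break
            slot = (slot + step) % capacity           # double-hashing probe
        else:
            raise RuntimeError("table full")
    return total_probes / len(keys)

capacity = 100_003                    # prime, so any step size cycles all slots
n = 90_000                            # ~90% occupancy after all inserts
unique = list(range(n))               # key multiplicity 1
dupes = [k // 8 for k in range(n)]    # key multiplicity 8
print("avg probes, multiplicity 1:", avg_probes_per_insert(unique, capacity))
print("avg probes, multiplicity 8:", avg_probes_per_insert(dupes, capacity))
```

Even at 90% final occupancy the unique-key case averages only a few probes per insert, while with multiplicity 8 the i-th copy of a key must walk past its i-1 predecessors, pushing the average up regardless of how good the probing scheme is.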

cuCollections partially mitigates the effects of collisions through vectorization: We don't probe one slot at a time during probing, but instead load a bucket of multiple coalesced slots in a vectorized and cooperative (in terms of cooperative groups) manner. That way we can probe through dense areas of the map faster and also mitigate the effect of longer probing chains with high-multiplicity data.
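A rough model of that bucketed probing, again as a hedged Python sketch rather than cuco's code (it uses plain linear probing over buckets for simplicity, and counts one "load" per bucket examined, standing in for one coalesced vectorized load):

```python
# Toy model of bucketed (vectorized) probing: each probing iteration
# examines a bucket of `bucket_size` contiguous slots at once instead
# of a single slot, so dense regions are traversed in fewer iterations.
def avg_iters_per_insert(keys, capacity, bucket_size):
    assert capacity % bucket_size == 0
    num_buckets = capacity // bucket_size
    table = [None] * capacity
    total_iters = 0
    for k in keys:
        start = hash(('seed', k)) % num_buckets          # initial bucket
        for step in range(num_buckets):
            total_iters += 1                             # one "load" per bucket
            base = ((start + step) % num_buckets) * bucket_size
            free = next((i for i in range(bucket_size)
                         if table[base + i] is None), None)
            if free is not None:
                table[base + free] = k
                break
        else:
            raise RuntimeError("table full")
    return total_iters / len(keys)

keys = [k // 8 for k in range(90_000)]   # multiplicity 8, ~90% occupancy
print("iters/insert, one slot per probe:", avg_iters_per_insert(keys, 100_000, 1))
print("iters/insert, bucket of 4 slots :", avg_iters_per_insert(keys, 100_000, 4))
```

The bucketed variant needs markedly fewer probing iterations for the same high-multiplicity, high-occupancy workload, which is the mitigation described above.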

May I ask where the numbers come from?

See this Slack thread (NV internal).

@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Feb 20, 2025