Re-encode dictionaries in selection kernels (take / concat_batches) #3558
Conversation
arrow-select/src/dictionary.rs (outdated)

/// containing all unique, reference values, along with mappings from the [`DictionaryArray`]
/// keys to the new keys within this values array
pub fn merge_dictionaries<K: ArrowDictionaryKeyType>(
    dictionaries: &[(&DictionaryArray<K>, Option<&[u8]>)],
I plan to use the masks for the interleave kernel
I think this is now ready for review; benchmarks suggest re-encoding dictionaries is on the order of ~10µs for 1024 rows. The difficulty is that the old method is very fast most of the time, on the order of 1µs, so always re-encoding would result in a significant performance regression in the common case. To avoid this, I've stuck in a fairly pessimistic heuristic that will only re-encode if the concatenation of the dictionary values would end up with more elements than the output has keys. This helps to avoid the worst case most of the time, and I think it is an acceptable trade-off, but I welcome other thoughts.
Interleave with the same dictionary (which won't merge)
Interleave with different dictionaries where one is sparse (will merge)
The concatenate kernel is typically much faster, and so merging does represent a non-trivial performance regression. (Note this is concatenating 2 arrays of 1024, whereas the interleave benchmarks are interleaving to produce an array of 1024 rows).
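For illustration, the heuristic described above boils down to something like the sketch below; the function name and signature are placeholders for this comment, not the PR's actual code.

/// Pessimistic heuristic: only re-encode when naively concatenating the
/// values arrays would produce more values than the output has keys, i.e.
/// when merging is very likely to shrink the dictionary.
fn should_reencode(value_lengths: &[usize], output_len: usize) -> bool {
    let concatenated: usize = value_lengths.iter().sum();
    concatenated > output_len
}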
I went through the code and tests carefully and it looks 👨🍳 👌
I was inspired by @jorgecarleitao's comment #506 (comment) and think it may be a better idea to expose the "optimizing concat" as its own kernel rather than heuristically trying to choose when to call it within the existing concat kernel.
Making it a separate kernel might better align to the current design principle of "give the user full control"
Perhaps @viirya or @jhorstmann has some thoughts on how to expose this code?
arrow-select/src/concat.rs (outdated)

for ((d, _), mapping) in dictionaries.iter().zip(merged.key_mappings) {
    for key in d.keys_iter() {
        keys.append_option(key.map(|x| mapping[x]));
    }
}
Is there any way to use the new extend function that got added? Something like (untested):
for ((d, _), mapping) in dictionaries.iter().zip(merged.key_mappings) {
    keys.extend(d.keys_iter().map(|x| mapping[x]))
}
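Note that keys_iter yields Option-wrapped keys, so the suggestion above would likely need to keep the inner map in order to compile; a possible (equally untested) variant:

for ((d, _), mapping) in dictionaries.iter().zip(merged.key_mappings) {
    // extend with Option items so null keys stay null in the rebuilt array
    keys.extend(d.keys_iter().map(|key| key.map(|x| mapping[x])));
}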
assert_eq!(actual, expected);

// Should have merged inputs together
// Not 30 as this is done on a best-effort basis
👍
arrow-select/src/dictionary.rs (outdated)

}

/// A weak heuristic of whether to merge dictionary values that aims to only
/// perform the expensive computation when is likely to yield at least
Suggested change:
/// perform the expensive merge computation when is likely to yield at least
} | ||
|
||
impl<'a, V> Interner<'a, V> { | ||
fn new(capacity: usize) -> Self { |
It might help to document the implications of capacity size here -- it seems like the idea is that the smaller the capacity, the greater the chance of collisions (and thus duplicates).
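For context, here is a minimal sketch of the kind of fixed-capacity, best-effort interner being discussed. It illustrates the trade-off only (toy hash, single-slot buckets) and is not the PR's implementation: a smaller capacity means more collisions, which means more duplicate values in the output -- still correct, just less compact.

struct Interner<'a, V> {
    // One slot per bucket: a colliding key simply overwrites the previous entry
    buckets: Vec<Option<(&'a [u8], V)>>,
}

impl<'a, V> Interner<'a, V> {
    fn new(capacity: usize) -> Self {
        // Round up to a power of two so the hash can be masked instead of modded
        let buckets = (0..capacity.next_power_of_two()).map(|_| None).collect();
        Self { buckets }
    }

    fn intern(&mut self, key: &'a [u8], make: impl FnOnce() -> V) -> &V {
        // Toy hash for illustration; real code would use a proper hasher
        let hash = key.iter().fold(0usize, |h, &b| h.wrapping_mul(31) ^ b as usize);
        let idx = hash & (self.buckets.len() - 1);
        let hit = matches!(&self.buckets[idx], Some((k, _)) if *k == key);
        if !hit {
            // Empty bucket or collision: overwrite, accepting that a previously
            // interned value may later be inserted again as a duplicate
            self.buckets[idx] = Some((key, make()));
        }
        &self.buckets[idx].as_ref().unwrap().1
    }
}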
arrow-select/src/dictionary.rs (outdated)

let a =
    DictionaryArray::<Int32Type>::from_iter(["a", "b", "a", "b", "d", "c", "e"]);
let b = DictionaryArray::<Int32Type>::from_iter(["c", "f", "c", "d", "a", "d"]);
let merged = merge_dictionary_values(&[(&a, None), (&b, None)]).unwrap();
I think we should have at least one test with NULL values in the dictionary keys
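A sketch of what such a test might look like, following the shape of the existing test above (the exact values and assertions here are hypothetical):

let a: DictionaryArray<Int32Type> =
    [Some("a"), None, Some("b"), Some("a")].into_iter().collect();
let b: DictionaryArray<Int32Type> = [Some("b"), None, Some("c")].into_iter().collect();
let merged = merge_dictionary_values(&[(&a, None), (&b, None)]).unwrap();
// A null key references no value, so only "a", "b" and "c" need to appear in
// the merged values, and the key mappings should leave the null slots as nulls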
arrow-select/src/dictionary.rs (outdated)

let mut values = Vec::with_capacity(dictionaries.len());
let mut value_slices = Vec::with_capacity(dictionaries.len());

for (dictionary, key_mask) in dictionaries {
I saw in the should_merge_dictionary_values function that dictionaries with the same pointers are considered the same (ArrayData::ptr_eq(values, first_values)). I don't know how frequent that scenario is, but is it worth keeping track of merged dictionary pointers and skipping them if they are seen again?
arrow-select/src/dictionary.rs (outdated)

let values = &data.child_data()[0];
total_values += values.len();
single_dictionary &= ArrayData::ptr_eq(values, first_values);
Why check only against the first values array? Is there no possibility of duplicate pointers at indexes other than the first?
We only care whether all the dictionary values are the same; if any dictionaries differ, we will end up concatenating.
I think I agree with @alamb's comment on a separate optimized kernel.
/// some return over the naive approach used by MutableArrayData
///
/// `len` is the total length of the merged output
pub fn should_merge_dictionary_values<K: ArrowDictionaryKeyType>(
Does this need to be public?
The module isn't public
arrow-select/src/dictionary.rs (outdated)

    !single_dictionary && is_supported && (overflow || values_exceed_length)
}

/// Given an array of dictionaries and an optional row mask compute a values array
Suggested change:
/// Given an array of dictionaries and an optional key row mask compute a values array
Converting to draft whilst I continue to think about how we should be handling this, especially in light of the RunEncodedArray work.
Closing this for now, may re-open in future if I have further ideas in this space.
Force-pushed from ea9af09 to dfe0f7b
The latest benchmarks are:
My takeaway from this is that the performance regression is not significant enough in absolute terms (on the order of microseconds) to warrant not making this the default behaviour.
With the changes in c2d4009 there is no longer a performance regression, except when re-encoding concat.
/// Performs a cheap, pointer-based comparison of two byte array
///
/// See [`ScalarBuffer::ptr_eq`]
fn bytes_ptr_eq<T: ByteArrayType>(a: &dyn Array, b: &dyn Array) -> bool {
I could see us eventually promoting this to a first-class API on array
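For reference, a pointer-based fast path along these lines could look roughly as follows. This is an illustrative sketch only: the downcast target and buffer accessors are assumptions about the surrounding arrow-array APIs, nulls are ignored for brevity, and it is not the PR's exact implementation.

use arrow_array::{types::ByteArrayType, Array, GenericByteArray};

// Compare two byte arrays by the identity of their underlying buffers rather
// than by value; equal data pointers and lengths imply they share storage.
fn bytes_ptr_eq_sketch<T: ByteArrayType>(a: &dyn Array, b: &dyn Array) -> bool {
    match (
        a.as_any().downcast_ref::<GenericByteArray<T>>(),
        b.as_any().downcast_ref::<GenericByteArray<T>>(),
    ) {
        (Some(a), Some(b)) => {
            a.len() == b.len()
                && a.values().as_ptr() == b.values().as_ptr()
                && a.offsets().as_ptr() == b.offsets().as_ptr()
        }
        _ => false,
    }
}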
let collected: Vec<_> = vc.into_iter().map(Option::unwrap).collect();
assert_eq!(&collected, &["c", "c", "c", "a", "c", "b"]);

// Should recompute dictionary
I don't understand why this one is recomputed but the other case isn't.
Because the interleave output is smaller
🎉
Re-encode dictionaries in selection kernels (take / concat_batches)
Which issue does this PR close?
Closes #506
Relates to #2832
Rationale for this change
MutableArrayData blindly concatenates the dictionary values arrays together when given multiple different dictionaries. This is straightforward and often fast, but degrades spectacularly if either of the following hold:
This leads to high memory usage, wasted CPU cycles, and potential byte array offset overflow
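As a rough, self-contained illustration of the blind concatenation described above (sizes and values are made up for this sketch; the new heuristic does not change the outcome for an input this small):

use arrow_array::{types::Int32Type, Array, DictionaryArray};
use arrow_select::concat::concat;

fn main() {
    // A dictionary-encoded column with only three distinct values
    let batch: DictionaryArray<Int32Type> = ["a", "b", "a", "c"].into_iter().collect();

    // Blindly concatenating appends the values arrays of every input, so the
    // result's dictionary holds three copies of {"a", "b", "c"} even though
    // three values would suffice for all twelve keys
    let combined = concat(&[&batch as &dyn Array, &batch, &batch]).unwrap();
    assert_eq!(combined.len(), 12);
}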
What changes are included in this PR?
Are there any user-facing changes?