"Optimize" Dictionary contents in DictionaryArray / concat_batches
#506
Comments
FYI @jhorstmann I don't know if this is similarly interesting to you, but we will likely not be able to ensure our dictionaries are always exactly the same |
A nice benefit of this is that a GROUP BY on that dictionary column afterwards would be very cheap, since it does not need another hashmap and instead could index directly into an array of accumulators with the keys. Not sure if that is the use case you are after or if this is more of a nice side effect. Ensuring sorted dictionaries is something I'm definitely interested in. |
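A minimal sketch of that accumulator-indexing idea, in plain Rust; `sum_by_dictionary_key` and its shape are invented here purely for illustration and are not DataFusion or arrow-rs APIs. Once the dictionary is deduplicated, each key is a dense index, so per-group state can live in a `Vec` indexed by key instead of a hash map keyed by value.

```rust
// Illustration: group-by aggregation over a dictionary column where the
// accumulator for each group is found by indexing with the dictionary key,
// avoiding a per-row hash lookup.
fn sum_by_dictionary_key(keys: &[u32], values: &[i64], dict_len: usize) -> Vec<i64> {
    let mut accumulators = vec![0i64; dict_len];
    for (&key, &value) in keys.iter().zip(values.iter()) {
        accumulators[key as usize] += value;
    }
    accumulators
}

fn main() {
    // Dictionary: {"A", "B", "C"}; the keys reference entries by position.
    let keys = [0u32, 2, 0, 1, 2, 2, 0];
    let values = [1i64, 2, 3, 4, 5, 6, 7];
    // Per-group sums: A = 1 + 3 + 7, B = 4, C = 2 + 5 + 6.
    assert_eq!(sum_by_dictionary_key(&keys, &values, 3), vec![11, 4, 13]);
}
```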
Great issue description @alamb 🎩 I would do it as a separate kernel, so as not to break the principle that concatenating arrays is …
In general, we have a small challenge in how we track dictionary metadata, though: our … My feeling is that we should (backward-incompatibly) extend … I also thought about a more radical approach of removing … |
I think "side effect" :)
Yes, we are definitely also interested in ensuring sorted dictionaries (we have an optimized physical representation that requires a sorted dictionary, and today it simply re-sorts the incoming dictionary, unnecessarily) |
@jorgecarleitao it certainly sounds interesting -- as long as there is some reasonable way to interact with the dictionary and its contents, this sounds good to me. |
So another option I've been thinking about is, instead of optimizing the RecordBatches after the fact, to optionally generate better RecordBatches in the first place. This is similar to what is proposed in #504 but perhaps more general. The basic idea would be to extend MutableArrayData with a flag to recompute dictionaries on freeze; I already have a fairly good idea of how to implement this. Variants of the concat, take, etc. kernels could then make use of this. An optimize batch function would effectively just be this variant of the take kernel. Additionally, DataFusion operators such as SortPreservingMerge that use MutableArrayData directly could also benefit from this. I'd be happy to implement this if people are happy for me to. |
@tustvold This sounds pretty interesting. Would the idea be to add potentially duplicated / unused values in …? I wonder if there is any way to avoid collecting the duplicated values in the first place |
The idea I had was to defer creation of the dictionary ArrayData until freeze, as opposed to in the constructor, but otherwise to keep the extend generation the same. Freeze would then scan through the generated keys in the … |
Note that I had very limited knowledge about Arrow and Rust when I initially proposed the MutableArrayData; especially the dictionary part, which was simply written to "work for dictionary arrays". Its primary goal was simply to allow us to easily construct an array out of "slices" of other arrays, which I was observing to be a recurring pattern in filter, concat, sort-merge and joins.
|
I was probably overthinking, but I was imagining that we could reduce peak memory usage by doing something smarter than "concat and then cleanup". However, given that "concat and then cleanup" would be an improvement over master, starting there and then making additional improvements as profiling warrants seems like a good idea to me |
@alamb what I'm proposing isn't "concat then cleanup". You build the keys array with the current dictionary encodings as if they had been concatenated together (but don't actually concatenate them). Then compute the new dictionary and re-encode the keys with this new dictionary. Is there some way to do better than this? |
This sounds like a good plan to me |
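A condensed sketch of the re-encoding step @tustvold describes above, using plain `Vec`s and a `HashMap` in place of Arrow buffers; `merge_and_reencode` and its types are invented here for illustration and are not an arrow-rs API. It walks the keys of every input batch, looks each referenced value up in (or adds it to) the new dictionary, and emits the remapped key.

```rust
use std::collections::HashMap;

// Sketch of the "compute a new dictionary and re-encode the keys" step,
// written against plain Vecs instead of Arrow buffers. `old_dicts` holds the
// per-batch dictionaries and `old_keys` the per-batch key arrays.
fn merge_and_reencode(
    old_dicts: &[Vec<String>],
    old_keys: &[Vec<u32>],
) -> (Vec<String>, Vec<u32>) {
    let mut new_dict: Vec<String> = Vec::new();
    let mut lookup: HashMap<String, u32> = HashMap::new();
    let mut new_keys: Vec<u32> = Vec::new();

    for (dict, keys) in old_dicts.iter().zip(old_keys.iter()) {
        for &key in keys {
            let value = &dict[key as usize];
            // Only values that are actually referenced are added, so the
            // merged dictionary is both deduplicated and pruned.
            let new_key = *lookup.entry(value.clone()).or_insert_with(|| {
                new_dict.push(value.clone());
                (new_dict.len() - 1) as u32
            });
            new_keys.push(new_key);
        }
    }
    (new_dict, new_keys)
}

fn main() {
    let dicts = vec![
        vec!["A".to_string(), "B".to_string()],
        vec!["B".to_string(), "C".to_string()],
    ];
    let keys = vec![vec![0, 1], vec![0, 1]];
    let (dict, new_keys) = merge_and_reencode(&dicts, &keys);
    assert_eq!(dict, ["A", "B", "C"]);
    assert_eq!(new_keys, [0, 1, 1, 2]);
}
```

The same single pass could in principle run at freeze time over the keys MutableArrayData has generated, which is roughly the variant sketched in the comments above.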
Another issue with the existing implementation is the DictionaryKeyOverflowError that is returned in situations where it is not reasonably expected, for example in this scenario.
This makes UInt8 dictionary keys unusable in any context where concatenation of batches could take place. |
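To illustrate the arithmetic behind that failure mode (the numbers below are made up): because concat appends the input dictionaries rather than merging them, the combined dictionary length grows with the number of batches even when the batches share the same values, so a small key type like UInt8 overflows quickly.

```rust
// Illustration only: the dictionary length after a naive concat is the sum of
// the input dictionary lengths, so a u8 key type overflows after a handful of
// batches even though the data contains few distinct values overall.
fn main() {
    let distinct_values_per_batch = 100; // every batch carries the same 100 values
    let num_batches = 3;

    // Naive concat of dictionaries: 3 * 100 = 300 entries, too many for a u8 key.
    let naive_dictionary_len = distinct_values_per_batch * num_batches;
    assert!(naive_dictionary_len > u8::MAX as usize);

    // A merged (deduplicated) dictionary still has only 100 entries.
    let merged_dictionary_len = distinct_values_per_batch;
    assert!(merged_dictionary_len <= u8::MAX as usize);
}
```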
@lquerel -- I agree that deduplication of dictionary keys is required (or at least doing so when it is needed) |
I plan to address this as part of #1523 |
This is great news. Thanks
Closed #506 as completed via #3558.
|
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Our project (IOx) uses dictionary arrays heavily to compress arrays of strings (that are mostly the same value). When DictionaryArrays are concatenated together, the contents of the dictionaries are also concatenated, resulting in duplication and in dictionary entries which are unused. Similarly, when filtering a DictionaryArray, some values may be filtered out entirely and thus the dictionary may include values that are no longer referenced.
Describe the solution you'd like
An optimize_dictionaries function which would return a potentially new array such that its dictionary has the following properties:
a) Every value in the dictionary is unique
b) Every value in the dictionary is used at least once in the array's values
So given an input like:
Array: "A, C, A, B, C, C, A"
Dictionary: {A, B, C, D, E, F, A, B, C}
Values: {0, 2, 0, 1, 2, 2, 0}
The output of optimize_dictionaries would be (note: no duplication in the dictionary):
Array: "A, C, A, B, C, C, A"
Dictionary: {A, B, C}
Values: {0, 2, 0, 1, 2, 2, 0}
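As a rough sketch of what such a function could look like: the name `optimize_dictionary`, the restriction to string values with Int8 keys, and the assumption that `DictionaryArray::try_new` accepts a keys array plus a values ArrayRef (as in recent arrow-rs) are all illustrative choices here, not a description of an existing API.

```rust
use std::collections::HashMap;
use std::sync::Arc;

use arrow::array::{Array, DictionaryArray, Int8Array, StringArray};
use arrow::datatypes::Int8Type;
use arrow::error::ArrowError;

/// Hypothetical optimize function for a string dictionary with Int8 keys:
/// rebuild the dictionary so every value is unique and referenced by at least
/// one key, then re-encode the keys against it.
fn optimize_dictionary(
    array: &DictionaryArray<Int8Type>,
) -> Result<DictionaryArray<Int8Type>, ArrowError> {
    let old_values = array
        .values()
        .as_any()
        .downcast_ref::<StringArray>()
        .expect("string dictionary values");

    let mut lookup: HashMap<&str, i8> = HashMap::new();
    let mut new_values: Vec<&str> = Vec::new();
    let mut new_keys: Vec<Option<i8>> = Vec::with_capacity(array.len());

    for i in 0..array.len() {
        if array.is_null(i) {
            new_keys.push(None);
            continue;
        }
        let value = old_values.value(array.keys().value(i) as usize);
        let key = match lookup.get(value) {
            Some(&key) => key,
            None => {
                // A real implementation must still report key overflow if the
                // deduplicated dictionary does not fit the key type.
                let key = i8::try_from(new_values.len())
                    .map_err(|_| ArrowError::DictionaryKeyOverflowError)?;
                lookup.insert(value, key);
                new_values.push(value);
                key
            }
        };
        new_keys.push(Some(key));
    }

    DictionaryArray::try_new(
        Int8Array::from(new_keys),
        Arc::new(StringArray::from(new_values)),
    )
}
```

Note that this sketch assigns new keys in first-use order, so for the example above it would produce the dictionary {A, C, B} with keys {0, 1, 0, 2, 1, 1, 0}; it satisfies properties a) and b) but does not preserve the original dictionary order.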
Possibly also a concat_and_optimize_batches function that would do the array compaction as part of concatenation (and thus avoid creating an intermediate concat result which is then optimized).
The solution described in #504 is designed to avoid copying dictionaries when they are identical (aka literally point at the same array)
Describe alternatives you've considered
An alternative might be to extend concat_batches directly to always try and compact arrays. However, since this involves a tradeoff of more CPU for potentially less memory usage, I am not sure it is appropriate to always apply.
Additional context
@tustvold has implemented a version of this code in IOx: https://github.com/influxdata/influxdb_iox/pull/1830 which could be suitable for inclusion in arrow
See also #504.