"Optimize" Dictionary contents in DictionaryArray / concat_batches
#506
Comments
FYI @jhorstmann I don't know if this is similarly interesting to you, but we will likely not be able to ensure our dictionaries are always exactly the same |
A nice benefit of this is that a GROUP BY on that dictionary column afterwards would be very cheap, since it does not need another hashmap and instead could index directly into an array of accumulators with the keys. Not sure if that is the use case you are after or if this is more of a nice side effect. Ensuring sorted dictionaries is something I'm definitely interested in. |
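A minimal sketch of that accumulator-indexing idea, in plain Rust; `sum_by_dictionary_key` and its shape are invented here purely for illustration and are not DataFusion or arrow-rs APIs. Once the dictionary is deduplicated, each key is a dense index, so per-group state can live in a `Vec` indexed by key instead of a hash map keyed by value.

```rust
// Illustration: group-by aggregation over a dictionary column where the
// accumulator for each group is found by indexing with the dictionary key,
// avoiding a per-row hash lookup.
fn sum_by_dictionary_key(keys: &[u32], values: &[i64], dict_len: usize) -> Vec<i64> {
    let mut accumulators = vec![0i64; dict_len];
    for (&key, &value) in keys.iter().zip(values.iter()) {
        accumulators[key as usize] += value;
    }
    accumulators
}

fn main() {
    // Dictionary: {"A", "B", "C"}; the keys reference entries by position.
    let keys = [0u32, 2, 0, 1, 2, 2, 0];
    let values = [1i64, 2, 3, 4, 5, 6, 7];
    // Per-group sums: A = 1 + 3 + 7, B = 4, C = 2 + 5 + 6.
    assert_eq!(sum_by_dictionary_key(&keys, &values, 3), vec![11, 4, 13]);
}
```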
Great issue description @alamb 🎩 I would do it as a separate kernel, so as not to break the principle that concatenating arrays is …
In general, we have a small challenge in how we track dictionary metadata, though: our … My feeling is that we should (backward-incompatibly) extend … I also thought about a more radical approach of removing … |
I think "side effect" :)
Yes, we are definitely also interested in ensuring sorted dictionaries (we have an optimized physical representation that requires a sorted dictionary, and today it simply re-sorts the incoming dictionary, unnecessarily) |
@jorgecarleitao it certainly sounds interesting -- as long as there is some reasonable way to interact with the dictionary and its contents, this sounds good to me. |
So another option I've been thinking about is, instead of optimizing the RecordBatches after the fact, to optionally generate better RecordBatches in the first place. This is similar to what is proposed in #504 but perhaps more general. The basic idea would be to extend MutableArrayData with a flag to recompute dictionaries on freeze; I already have a fairly good idea of how to implement this. Variants of the concat, take, etc. kernels could then make use of this. An optimize batch function would effectively just be this variant of the take kernel. Additionally, DataFusion operators such as SortPreservingMerge that use MutableArrayData directly could also benefit from this. I'd be happy to implement this if people are happy for me to. |
@tustvold This sounds pretty interesting. Would the idea be to add potentially duplicated / unused values in …? I wonder if there is any way to avoid collecting the duplicated values in the first place |
The idea I had was to defer creation of the dictionary ArrayData until freeze, as opposed to in the constructor, but otherwise to keep the extend generation the same. Freeze would then scan through the generated keys in the … |
Note that I had very limited knowledge about Arrow and Rust when I initially proposed the MutableArrayData; especially the dictionary part, which was simply written to "work for dictionary arrays". Its primary goal was simply to allow us to easily construct an array out of "slices" of other arrays, which I was observing to be a recurring pattern in filter, concat, sort-merge and joins.
|
I was probably overthinking, but I was imagining that we could reduce peak memory usage by doing something smarter than "concat and then cleanup". However, given that "concat and then cleanup" would be an improvement over master, starting there and then making additional improvements as profiling warrants seems like a good idea to me |
@alamb what I'm proposing isn't "concat then cleanup". You build the keys array with the current dictionary encodings as if they had been concatenated together (but don't actually concatenate them). Then compute the new dictionary and re-encode the keys with this new dictionary. Is there some way to do better than this? |
This sounds like a good plan to me |
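A condensed sketch of the re-encoding step @tustvold describes above, using plain `Vec`s and a `HashMap` in place of Arrow buffers; `merge_and_reencode` and its types are invented here for illustration and are not an arrow-rs API. It walks the keys of every input batch, looks each referenced value up in (or adds it to) the new dictionary, and emits the remapped key.

```rust
use std::collections::HashMap;

// Sketch of the "compute a new dictionary and re-encode the keys" step,
// written against plain Vecs instead of Arrow buffers. `old_dicts` holds the
// per-batch dictionaries and `old_keys` the per-batch key arrays.
fn merge_and_reencode(
    old_dicts: &[Vec<String>],
    old_keys: &[Vec<u32>],
) -> (Vec<String>, Vec<u32>) {
    let mut new_dict: Vec<String> = Vec::new();
    let mut lookup: HashMap<String, u32> = HashMap::new();
    let mut new_keys: Vec<u32> = Vec::new();

    for (dict, keys) in old_dicts.iter().zip(old_keys.iter()) {
        for &key in keys {
            let value = &dict[key as usize];
            // Only values that are actually referenced are added, so the
            // merged dictionary is both deduplicated and pruned.
            let new_key = *lookup.entry(value.clone()).or_insert_with(|| {
                new_dict.push(value.clone());
                (new_dict.len() - 1) as u32
            });
            new_keys.push(new_key);
        }
    }
    (new_dict, new_keys)
}

fn main() {
    let dicts = vec![
        vec!["A".to_string(), "B".to_string()],
        vec!["B".to_string(), "C".to_string()],
    ];
    let keys = vec![vec![0, 1], vec![0, 1]];
    let (dict, new_keys) = merge_and_reencode(&dicts, &keys);
    assert_eq!(dict, ["A", "B", "C"]);
    assert_eq!(new_keys, [0, 1, 1, 2]);
}
```

The same single pass could in principle run at freeze time over the keys MutableArrayData has generated, which is roughly the variant sketched in the comments above.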
Another issue with the existing implementation is the DictionaryKeyOverflowError that is returned in situations where it is not reasonably expected, for example in this scenario.
This makes UInt8 dictionary keys unusable in any context where concatenation of batches could take place. |
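To illustrate the arithmetic behind that failure mode (the numbers below are made up): because concat appends the input dictionaries rather than merging them, the combined dictionary length grows with the number of batches even when the batches share the same values, so a small key type like UInt8 overflows quickly.

```rust
// Illustration only: the dictionary length after a naive concat is the sum of
// the input dictionary lengths, so a u8 key type overflows after a handful of
// batches even though the data contains few distinct values overall.
fn main() {
    let distinct_values_per_batch = 100; // every batch carries the same 100 values
    let num_batches = 3;

    // Naive concat of dictionaries: 3 * 100 = 300 entries, too many for a u8 key.
    let naive_dictionary_len = distinct_values_per_batch * num_batches;
    assert!(naive_dictionary_len > u8::MAX as usize);

    // A merged (deduplicated) dictionary still has only 100 entries.
    let merged_dictionary_len = distinct_values_per_batch;
    assert!(merged_dictionary_len <= u8::MAX as usize);
}
```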
@lquerel -- I agree that deduplication of dictionary keys is required (or at least doing so when it is needed) |
I plan to address this as part of #1523 |
This is great news. Thanks
Closed #506 as completed via #3558.
|
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Our project (IOx) uses dictionary arrays heavily to compress arrays of strings (that are mostly the same value). When DictionaryArrays are concatenated together, the contents of the dictionaries are also concatenated, resulting in duplication and in dictionary entries which are unused. Similarly, when filtering a DictionaryArray, some values may be filtered out entirely and thus the dictionary may include values that are no longer referenced.
Describe the solution you'd like
An optimize_dictionaries function which would return a potentially new array such that its dictionary has the following properties:
a) Every value in the dictionary is unique
b) Every value in the dictionary is used at least once in the array's values
So given an input like:
Array: "A, C, A, B, C, C, A"
Dictionary: {A, B, C, D, E, F, A, B, C}
Values: {0, 2, 0, 1, 2, 2, 0}
The output of optimize_dictionaries would be (note: no duplication in the dictionary):
Array: "A, C, A, B, C, C, A"
Dictionary: {A, B, C}
Values: {0, 2, 0, 1, 2, 2, 0}
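As a rough sketch of what such a function could look like: the name `optimize_dictionary`, the restriction to string values with Int8 keys, and the assumption that `DictionaryArray::try_new` accepts a keys array plus a values ArrayRef (as in recent arrow-rs) are all illustrative choices here, not a description of an existing API.

```rust
use std::collections::HashMap;
use std::sync::Arc;

use arrow::array::{Array, DictionaryArray, Int8Array, StringArray};
use arrow::datatypes::Int8Type;
use arrow::error::ArrowError;

/// Hypothetical optimize function for a string dictionary with Int8 keys:
/// rebuild the dictionary so every value is unique and referenced by at least
/// one key, then re-encode the keys against it.
fn optimize_dictionary(
    array: &DictionaryArray<Int8Type>,
) -> Result<DictionaryArray<Int8Type>, ArrowError> {
    let old_values = array
        .values()
        .as_any()
        .downcast_ref::<StringArray>()
        .expect("string dictionary values");

    let mut lookup: HashMap<&str, i8> = HashMap::new();
    let mut new_values: Vec<&str> = Vec::new();
    let mut new_keys: Vec<Option<i8>> = Vec::with_capacity(array.len());

    for i in 0..array.len() {
        if array.is_null(i) {
            new_keys.push(None);
            continue;
        }
        let value = old_values.value(array.keys().value(i) as usize);
        let key = match lookup.get(value) {
            Some(&key) => key,
            None => {
                // A real implementation must still report key overflow if the
                // deduplicated dictionary does not fit the key type.
                let key = i8::try_from(new_values.len())
                    .map_err(|_| ArrowError::DictionaryKeyOverflowError)?;
                lookup.insert(value, key);
                new_values.push(value);
                key
            }
        };
        new_keys.push(Some(key));
    }

    DictionaryArray::try_new(
        Int8Array::from(new_keys),
        Arc::new(StringArray::from(new_values)),
    )
}
```

Note that this sketch assigns new keys in first-use order, so for the example above it would produce the dictionary {A, C, B} with keys {0, 1, 0, 2, 1, 1, 0}; it satisfies properties a) and b) but does not preserve the original dictionary order.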
Possibly also a concat_and_optimize_batches function that would do the array compaction as part of concatenation (and thus avoid creating an intermediate concat result which is then optimized).
The solution described in #504 is designed to avoid copying dictionaries when they are identical (aka literally point at the same array)
Describe alternatives you've considered
An alternative might be to extend concat_batches directly to always try and compact arrays. However, since this involves a tradeoff of more CPU for potentially less memory usage, I am not sure it is appropriate to always apply.
Additional context
@tustvold has implemented a version of this code in IOx: https://github.com/influxdata/influxdb_iox/pull/1830 which could be suitable for inclusion in arrow
See also #504.