Stateless Row Conversion #4811

tustvold · 2023-09-13T10:51:26Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Currently row conversion is stateful, relying on a separate RowConverter to maintain global state such as the interned dictionary values.

This has a number of drawbacks:

Prevents parallelizing row conversion across multiple threads
Complicates APIs which have to manage mutable access to a RowConverter - https://github.com/apache/arrow-datafusion/blob/cf229b82cb53f1d9f430a2fa4ff5191ba52e5cef/datafusion/core/src/physical_plan/sorts/stream.rs#L38
State accumulation can be problematic, necessitating workarounds like datafusion#7401 (Implement CardinalityAwareRowConverter while doing streaming merge)
The dictionary interning logic is very expensive - see #3831 (Make dictionary preservation optional in row encoding)

Describe the solution you'd like

I would like to propose removing the dictionary preservation logic, instead always hydrating the dictionary values when encoding. This in turn would allow simplifying the API to no longer have a notion of a stateful RowConverter.

This may represent a performance regression for dictionaries with a small number of values. We should definitely quantify this, but it is my expectation that this will only occur for dictionaries with a very low number of values. It is currently the case that even arrays with a low number of distinct values may have dictionaries containing a non-trivial number of values, as a result of the way dictionaries are handled by the various kernels and readers, and so I'm inclined to not weigh this very highly.
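To make the idea concrete, here is a minimal std-only sketch of what hydrating dictionary values at encode time could look like. It does not use the arrow-rs API: `DictColumn`, `encode_dictionary` and `encode_parallel` are hypothetical names invented for illustration, and the byte layout (a null sentinel followed by the raw value bytes) is a simplification that only preserves ordering for a single column, not the actual row format.

```rust
use std::thread;

/// Hypothetical stand-in for a dictionary-encoded string column:
/// `keys[i]` indexes into `values`, and `None` represents a null row.
struct DictColumn {
    keys: Vec<Option<usize>>,
    values: Vec<String>,
}

/// Encode every row by looking up ("hydrating") its dictionary value.
/// Layout (an assumption for this sketch): `[0]` for null, otherwise
/// `[1]` followed by the value's bytes, so nulls sort first and
/// non-null rows compare like their string values.
/// Panics if a key is out of range for `values`.
fn encode_dictionary(col: &DictColumn) -> Vec<Vec<u8>> {
    col.keys
        .iter()
        .map(|key| match key {
            None => vec![0u8],
            Some(k) => {
                let value = col.values[*k].as_bytes();
                let mut row = Vec::with_capacity(1 + value.len());
                row.push(1u8);
                row.extend_from_slice(value);
                row
            }
        })
        .collect()
}

/// Because encoding needs only shared references and keeps no state,
/// independent batches can be encoded on separate threads.
fn encode_parallel(batches: &[DictColumn]) -> Vec<Vec<Vec<u8>>> {
    thread::scope(|s| {
        // Spawn all encoders first, then join them in order.
        let handles: Vec<_> = batches
            .iter()
            .map(|batch| s.spawn(move || encode_dictionary(batch)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("encoder thread panicked"))
            .collect()
    })
}
```

Since the encoded bytes depend only on the hydrated value and not on the dictionary it came from, rows produced from batches with different dictionaries remain directly comparable without any shared interner.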

Describe alternatives you've considered

Additional context

alamb · 2023-09-13T14:51:14Z

I personally think this proposal is very compelling because using high cardinality dictionaries has caused us significant pain in IOx due to the memory accumulated in the row converters. We have several non-trivial PRs to properly account for this memory and to try to avoid using dictionary encoding in these cases -- see apache/datafusion#7130 and apache/datafusion#7401 for example

Also, a stateful row converter has caused other operations like apache/datafusion#7379 from @wiedld to become significantly more complicated, as it can't be shared between threads

Thus, in my mind the ideal outcome would be that we can remove the stateful row conversion and minimize the performance penalty for relatively low cardinality dictionaries. I am hopeful that @tustvold's ideas to make string encoding more efficient (e.g. #4812) could be part of this answer

alamb · 2023-09-13T14:54:10Z

The one case I can think of where a stateful row converter is likely to have a massive benefit is with low cardinality dictionaries that have very large individual values (e.g. 2MB strings for each entry). I have no idea how common that type of data is in practice.
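A rough back-of-the-envelope model of that case, as a sketch only: the function names and the cost model are assumptions made for illustration, not measurements or arrow-rs internals. It assumes hydration copies the full value into every row, while a stateful converter stores each distinct value once plus a fixed-width key per row.

```rust
/// Approximate bytes written when every row carries its full value.
/// Ignores per-row framing overhead (an assumption of this sketch).
fn hydrated_bytes(rows: u64, value_len: u64) -> u64 {
    rows * value_len
}

/// Approximate bytes held when each distinct value is stored once and
/// every row stores only a key of `key_width` bytes. The fixed-width
/// key is a simplification, not how any real interner must work.
fn interned_bytes(rows: u64, distinct: u64, value_len: u64, key_width: u64) -> u64 {
    rows * key_width + distinct * value_len
}
```

Under this model, 1,000 rows drawn from 4 distinct 2 MiB strings come to roughly 2 GB when hydrated versus roughly 8 MB when interned with 4-byte keys, which is the kind of gap described above. For short values the two estimates converge quickly.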

tustvold · 2023-09-18T14:56:44Z

label_issue.py automatically added labels {'arrow'} from #4819

tustvold · 2023-09-18T14:56:46Z

label_issue.py automatically added labels {'arrow-flight'} from #4819

tustvold added the enhancement Any new improvement worthy of a entry in the changelog label Sep 13, 2023

This was referenced Sep 13, 2023

Row Format Adapative Block Size #4812

Closed

Implement CardinalityAwareRowConverter while doing streaming merge apache/datafusion#7401

Merged

alamb mentioned this issue Sep 13, 2023

Preserve RoundTrip types in RowConverter even if preserve_dictionaries=false #4813

Closed

tustvold self-assigned this Sep 13, 2023

tustvold mentioned this issue Sep 14, 2023

Adaptive Row Block Size (#4812) #4818

Merged

tustvold added a commit to tustvold/arrow-rs that referenced this issue Sep 15, 2023

Stateless Row Format (apache#4811)

a848001

tustvold added a commit to tustvold/arrow-rs that referenced this issue Sep 15, 2023

Stateless Row Encoding / Don't Preserve Dictionaries (apache#4811)

37d5c26

tustvold added a commit to tustvold/arrow-rs that referenced this issue Sep 15, 2023

Stateless Row Encoding / Don't Preserve Dictionaries (apache#4811)

98dbedb

tustvold mentioned this issue Sep 15, 2023

Stateless Row Encoding / Don't Preserve Dictionaries in RowConverter (#4811) #4819

Merged

tustvold closed this as completed in #4819 Sep 17, 2023

tustvold added a commit that referenced this issue Sep 17, 2023

Stateless Row Encoding / Don't Preserve Dictionaries in RowConverter (#4811) (#4819)

9cb4a75

* Stateless Row Encoding / Don't Preserve Dictionaries (#4811)
* Add low cardinality benchmarks

tustvold added the arrow Changes to the arrow crate label Sep 18, 2023

tustvold added the arrow-flight Changes to the arrow-flight crate label Sep 18, 2023

alamb mentioned this issue Sep 22, 2023

Update arrow 47.0.0 in DataFusion apache/datafusion#7587

Merged

brancz mentioned this issue Sep 20, 2024

Panic on aggregations on struct of dictionaries apache/datafusion#12542

Closed
