-
Notifications
You must be signed in to change notification settings - Fork 819
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Preserve RoundTrip types in RowConverter
even if preserve_dictionaries=false
#4813
Comments
The reason this is currently the case is to avoid taking a dependency on arrow-cast. It isn't a particularly good reason, but it is the reason |
I was thinking in order to make this perform well, we would probably want to avoid using |
So I'm not really sure how to achieve this in a performant manner, taking a step-back though DataFusion uses this for hydrating group keys. Perhaps we could simply alter group by to always materialize dictionaries in the group keys 🤔 ? Certainly in the case of a single column grouping, the dictionary encoding is pure overhead, as each value will only appear once. |
I am not sure about this -- the group keys for each incoming batch are converted to Row format first, to compare them with existing group keys. However, we could potentially add a special case |
I was referring to the reverse transformation, which is what this ticket concerns, i.e. going from unique rows back to arrays |
Following apache/datafusion#8291 the motivating use-case for this functionality I think has been removed, and so I am closing this for now |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We struggle with the memory used by the RowConverter when interning values from
DictionaryArrays
. We are even proposing a special CardinalityAware wrapper on top of the RowConverter in DataFusion (see apache/datafusion#7401)At the moment, round tripping data from Array to Rows and then back to Array works like this:
In DataFusion we must maintain the same input / output types, so in our proposed improvement we needed to add a call to
cast
, which @tustvold notes is likely very expensive: https://github.com/apache/arrow-datafusion/pull/7401/files#r1324281222Describe the solution you'd like
I would like the
RowConverter
to produce the same output type as the input type onSortField
, even if preserve_dictionaries is set to falseThis would avoid a copy of the String data and likely perform much better.
Describe alternatives you've considered
We could potentially simply remove stateful row encoding: #4811
Additional context
The text was updated successfully, but these errors were encountered: