Aggregations on a struct of dictionaries produce data with a different schema than expected (plain arrays instead of dictionaries).
For example:
thread 'tokio-runtime-worker' panicked at /Users/brancz/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-array-53.0.0/src/array/struct_array.rs:90:46:
called `Result::unwrap()` on an `Err` value: InvalidArgumentError("Incorrect datatype for StructArray field \"a\", expected Dictionary(Int32, Utf8) got Utf8")
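The mismatch behind this panic can be illustrated with a small standard-library model. This is only a sketch: ToyType and check_field are hypothetical names invented for illustration, not arrow's DataType or the actual validation code in struct_array.rs. It assumes nothing more than that a struct field's declared type is compared against the type of the child array supplied for it.

```rust
/// Toy stand-in for a column data type (hypothetical, not arrow's `DataType`).
#[derive(Debug, Clone, PartialEq)]
enum ToyType {
    Utf8,
    Int32,
    Dictionary(Box<ToyType>, Box<ToyType>),
}

/// Compare the type a struct field declares with the type of the child
/// array that was actually supplied for it (hypothetical helper).
fn check_field(name: &str, declared: &ToyType, actual: &ToyType) -> Result<(), String> {
    if declared == actual {
        Ok(())
    } else {
        // Message shaped after the error quoted in the panic above.
        Err(format!(
            "Incorrect datatype for StructArray field {:?}, expected {:?} got {:?}",
            name, declared, actual
        ))
    }
}
```

In this model, declaring field "a" as Dictionary(Int32, Utf8) while handing over a plain Utf8 child yields an Err, and calling unwrap() on that Err is what turns the mismatch into a panic.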
To Reproduce
Have a schema with a struct of dictionaries, and perform an aggregation on it, like count_distinct.
Currently, the aggregation says it will emit data with the dictionaries being dictionaries. Instead, if all it did was declare it would emit plain arrays instead of dictionary-encoded ones, it would not panic.
Have RowConverter emit the same DataType as its input.
I think I'm slightly in favor of 1, because with 2 we'd either have to revert to stateful row converters which were removed intentionally, or we'd have to copy data again on emitting to turn the currently plain arrays into dictionaries again.
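To make the "copy data again" cost concrete, here is a minimal sketch of what re-encoding a plain string column into dictionary form involves. It uses only the standard library; dictionary_encode is a hypothetical helper, not an arrow or DataFusion API, and it ignores nulls for brevity.

```rust
use std::collections::HashMap;

/// Dictionary-encode a plain column (hypothetical helper).
/// Returns (keys, values): one key per input row, plus the distinct
/// values in order of first appearance.
fn dictionary_encode(plain: &[&str]) -> (Vec<i32>, Vec<String>) {
    let mut index: HashMap<&str, i32> = HashMap::new();
    let mut values: Vec<String> = Vec::new();
    let mut keys = Vec::with_capacity(plain.len());
    for &s in plain {
        // Every row is hashed, and every new distinct value is copied.
        let key = *index.entry(s).or_insert_with(|| {
            values.push(s.to_string());
            (values.len() - 1) as i32
        });
        keys.push(key);
    }
    (keys, values)
}
```

Each emitted row needs a hash lookup and each distinct value needs a copy, which is the extra work option 2 would add on every emit.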
I actually tried to understand what is different about dicts that are not in structs, and it turns out that the row converter also emits plain arrays in those cases, but something turns them back into dictionaries at some point (I'm guessing this has to be in datafusion somewhere).
I think the right fix in that case is 1, adding support for nested schemas.
I agree this sounds like it makes sense. There even seems to be an existing ticket: #7647
Note that there is a PR by @jayzhan211 to rework how grouping is done to avoid the RowConverter in many cases in #12269. I haven't reviewed it thoroughly, but I would suggest that you ensure your fix for this issue is well covered by end-to-end .slt tests (not just unit tests).
Describe the bug
Aggregations on a struct of dictionaries produce data with a different schema than expected (plain arrays instead of dictionaries).
For example:
To Reproduce
Have a schema with a struct of dictionaries, and perform an aggregation on it, like count_distinct.
Full code example here:
https://gist.github.com/brancz/fa12a3ae0f5d09620e9c274384ffd506
Expected behavior
No panic.
Additional context
I can see two ways to solve this:
Have RowConverter emit the same DataType as its input.
I think I'm slightly in favor of 1, because with 2 we'd either have to revert to stateful row converters which were removed intentionally, or we'd have to copy data again on emitting to turn the currently plain arrays into dictionaries again.