Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Panic on aggregations on struct of dictionaries #12542

Closed
brancz opened this issue Sep 20, 2024 · 3 comments · Fixed by #12586
Closed

Panic on aggregations on struct of dictionaries #12542

brancz opened this issue Sep 20, 2024 · 3 comments · Fixed by #12586
Labels
bug Something isn't working

Comments

@brancz
Copy link
Contributor

brancz commented Sep 20, 2024

Describe the bug

Aggregations on a struct of dictionaries produces data with a different schema than expected (plain arrays instead of dictionaries).

For example:

thread 'tokio-runtime-worker' panicked at /Users/brancz/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-array-53.0.0/src/array/struct_array.rs:90:46:
called `Result::unwrap()` on an `Err` value: InvalidArgumentError("Incorrect datatype for StructArray field \"a\", expected Dictionary(Int32, Utf8) got Utf8")

To Reproduce

Have a schema with a struct of dictionaries, and perform an aggregation on it, like count_distinct.

Full code example here:

https://gist.github.com/brancz/fa12a3ae0f5d09620e9c274384ffd506

Expected behavior

No panic.

Additional context

I can see two ways to solve this:

  1. Currently, the aggregation says it will emit data with the dictionaries being dictionaries. Instead, if all it did was declare it would emit plain arrays instead of dictionary-encoded ones, it would not panic.
  2. Have RowConverter emit the same DataType as its input.

I think I'm slightly in favor of 1, because with 2 we'd either have to revert to stateful row converters which were removed intentionally, or we'd have to copy data again on emitting to turn the currently plain arrays into dictionaries again.

@brancz brancz added the bug Something isn't working label Sep 20, 2024
@brancz
Copy link
Contributor Author

brancz commented Sep 20, 2024

I actually tried to understand what is different about dicts that are not in structs, and it turns out that the row converter also emits plain arrays in those cases, but something turns them back into dictionaries at some point (I'm guessing this has to be in datafusion somewhere).

Example: https://gist.github.com/brancz/9ff04f6263b710ad8215933590026500

@brancz
Copy link
Contributor Author

brancz commented Sep 20, 2024

Ok found it. This is the place where if the data is of an unexpected type, it's casted to the expected type.

I think the right fix in that case is 1, adding support for nested schemas.

@alamb
Copy link
Contributor

alamb commented Sep 20, 2024

I think the right fix in that case is 1, adding support for nested schemas.

I agree this sounds like it makes sense. There even seems to be an existing ticket: #7647

Note that there is a PR by @jayzhan211 to rework how grouping is done to avoid the RowConverter in many cases in #12269. I haven't reivewed it thoroughly, but I would suggest that you ensure your fix for this issue is well covered by end to end .slt tests (not just unit tests)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants