Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve vectorized operations of GroupColumn #13275

Open
Tracked by #13449
Rachelint opened this issue Nov 6, 2024 · 7 comments
Open
Tracked by #13449

Improve vectorized operations of GroupColumn #13275

Rachelint opened this issue Nov 6, 2024 · 7 comments
Labels
enhancement New feature or request

Comments

@Rachelint
Copy link
Contributor

Is your feature request related to a problem or challenge?

Some good points about improving vectorized operations of GroupColumn from @Dandandan and @jayzhan211 in #12996 :

#12996 (comment)
#12996 (comment)
#12996 (comment)
#12996 (comment)

This pr is for tracking such points for conintuing improving performace of multi group by.

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

@Rachelint
Copy link
Contributor Author

I tried the possible optimizations forvectorized_equal_to:
#12996 (comment)
#12996 (comment)

But unfortunately seems no obvious improvement in clickbench, tpch and so on... although some improvement can see in benchmark for the function in #13539

For take + eq, use eq in arrow kernel indeed can get performance improve, but the cost of take seems unacceptable...

For optimizations for special case of input array (like all rows in the array needed, no nulls although it is nullable), it can see some improvement in the benchmark for function, but see no improvement in e2e benchmark(clickbench, tpch). I guess they may be not the hot path...

I plan to not make further attempts recently...

cc @Dandandan @alamb @jayzhan211

@Rachelint
Copy link
Contributor Author

Rachelint commented Nov 28, 2024

I tried the possible optimizations forvectorized_equal_to: #12996 (comment) #12996 (comment)

But unfortunately seems no obvious improvement in clickbench, tpch and so on... although some improvement can see in benchmark for the function in #13539

For take + eq, use eq in arrow kernel indeed can get performance improve, but the cost of take seems unacceptable...

For optimizations for special case of input array (like all rows in the array needed, no nulls although it is nullable), it can see some improvement in the benchmark for function, but see no improvement in e2e benchmark(clickbench, tpch). I guess they may be not the hot path...

I plan to not make further attempts recently...

cc @Dandandan @alamb @jayzhan211

@LeslieKid maybe you can try it if some other possible optimizations not included above.

@Dandandan
Copy link
Contributor

Dandandan commented Nov 29, 2024

Thanks @Rachelint for summarizing, that's interesting. One big difference between take + eq in join versus grouped aggregates seems to be that the ValueBuilders already are type erased, so maybe there is nothing to gain (and using take and eq just creates an extra copy.

It seems for the join code, this could actually be used more (and joining on string columns might be not so fast at the moment).

@Rachelint
Copy link
Contributor Author

Thanks @Rachelint for summarizing, that's interesting. One big difference between take + eq in join versus grouped aggregates seems to be that the ValueBuilders already are type erased, so maybe there is nothing to gain (and using take and eq just creates an extra copy.

It seems for the join code, this could actually be used more (and joining on string columns might be not so fast at the moment).

🤔 I think maybe we can indeed try some solutions without take in join? @LeslieKid seems trying it.

@Dandandan
Copy link
Contributor

Yeah som experiments around that would be nice.

Also see this issue:
#12131

@Dandandan
Copy link
Contributor

Oh I saw @LeslieKid already commented on that issue 👍

@LeslieKid
Copy link
Contributor

🤔 I think maybe we can indeed try some solutions without take in join? @LeslieKid seems trying it.

Yes, I am working on comparing arrays with indices in a for-loop (similar to the implementation of vectorized_equal_to() in multi group-by) instead of using take in hash join. See #13607

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants