Create a Priority Queue based Aggregation with limit
#7192
Conversation
I plan to give this a look later today -- thank you @avantgardnerio
The more I think about this code / approach, the more I like it ❤️ -- I spent some time writing up how I think this basic strategy can be applied to all the various TopK type queries at #7198 (comment). My writeup assumes a slightly different packaging / way of deciding how to invoke this operator, but the basic idea is the same. Thank you for sparking this @avantgardnerio
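For readers following the thread, here is a minimal, self-contained sketch of the basic strategy under discussion: keep a bounded min-heap of the best k rows, so memory stays O(k) instead of O(number of groups). This is an illustration under assumptions, not the PR's actual `GroupedTopKAggregateStream`; the `top_k` function and its types are hypothetical, and it assumes one pre-aggregated value per group.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical sketch: keep only the top-k rows by value, in O(k) memory.
/// `BinaryHeap<Reverse<_>>` is a min-heap, so the weakest retained row
/// sits at the root.
fn top_k(rows: impl IntoIterator<Item = (i64, String)>, k: usize) -> Vec<(i64, String)> {
    let mut heap: BinaryHeap<Reverse<(i64, String)>> = BinaryHeap::with_capacity(k);
    for row in rows {
        if heap.len() < k {
            heap.push(Reverse(row));
        } else if heap.peek().map_or(false, |r| row > r.0) {
            // This row beats the current minimum: evict the root, insert the row.
            heap.pop();
            heap.push(Reverse(row));
        }
    }
    let mut out: Vec<_> = heap.into_iter().map(|Reverse(r)| r).collect();
    out.sort_by(|a, b| b.cmp(a)); // descending, like ORDER BY agg DESC LIMIT k
    out
}

fn main() {
    // One pre-aggregated (MAX(timestamp), trace_id) pair per group.
    let rows = vec![
        (3_i64, "a".to_string()),
        (9, "b".to_string()),
        (1, "c".to_string()),
        (7, "d".to_string()),
    ];
    for (ts, trace) in top_k(rows, 2) {
        println!("{trace}: {ts}"); // prints b: 9 then d: 7
    }
}
```

The min-heap orientation (via `Reverse`) is what makes a descending top-k cheap: the weakest retained row is always at the root, so most incoming rows are rejected after a single comparison.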
(force-pushed from 222c458 to 8729398)
Would anyone be able to provide advice on debugging sqllogictests? This error doesn't seem very informative -- I'd expect to see more of a diff than this:
The docs are here. Running `cargo test --test sqllogictests -- --complete` would likely save you time. I believe that diff says a new line was added to the explain plan (which makes sense if you have added a new optimizer pass).
TL;DR: with the naive, unoptimized version in place, it looks to be 2x slower according to a test with realistic data. That comparison is based on the fact that the baseline runs the normal aggregation twice, while with the rule enabled it runs one normal and one top-k aggregation.
I'm not super worried because:
No matter what, this rule is much more memory efficient. I'll pass the limit down the tree, and we'll see if I'm right and we match speed.
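To make "pass the limit down the tree" concrete, here is a hypothetical, self-contained sketch of the kind of rewrite such an optimizer rule performs. The toy `Plan` type is invented for illustration; it is not DataFusion's actual `ExecutionPlan` / physical-optimizer API. The idea: when a limit sits above a sort on an aggregate's output, fold the limit into the aggregate so it can switch to the top-k stream.

```rust
// Toy plan nodes for illustration only; the real rule operates on
// DataFusion's physical plan tree.
#[derive(Debug)]
enum Plan {
    Limit { k: usize, input: Box<Plan> },
    Sort { input: Box<Plan> },
    Aggregate { limit: Option<usize>, input: Box<Plan> },
    Scan,
}

/// Rewrite `Limit(k) -> Sort -> Aggregate` into `Aggregate { limit: k }`.
/// The real rule must also verify the sort key is the aggregate
/// expression before dropping the sort; that check is elided here.
fn push_limit(plan: Plan) -> Plan {
    match plan {
        Plan::Limit { k, input } => match *input {
            Plan::Sort { input: sorted } => match *sorted {
                Plan::Aggregate { input, .. } => Plan::Aggregate { limit: Some(k), input },
                other => Plan::Limit {
                    k,
                    input: Box::new(Plan::Sort { input: Box::new(other) }),
                },
            },
            other => Plan::Limit { k, input: Box::new(other) },
        },
        other => other,
    }
}

fn main() {
    let plan = Plan::Limit {
        k: 4,
        input: Box::new(Plan::Sort {
            input: Box::new(Plan::Aggregate { limit: None, input: Box::new(Plan::Scan) }),
        }),
    };
    // Prints: Aggregate { limit: Some(4), input: Scan }
    println!("{:?}", push_limit(plan));
}
```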
We can see it doing the right thing now:
but very slowly (debug mode is ~10x slower; divide by 10 for release):
Edit: it's almost like there is some high, fixed cost to running this stream 🤔 Welp, at least testing is in place. I'll start tracking down performance issues tomorrow.
@avantgardnerio seems best to profile it ATM and see where the most time is spent
(force-pushed from f5f70f0 to dda0c17)
The current PR is looking good to me; I think it is in good shape to be merged and then continued/extended. I have one small remaining comment about the `rand` dependency.
Let me know if there is anything I can do for this PR -- I think merging the PR and continuing to iterate would be a fine idea, given how long this one has been outstanding and how large it has grown.
Thanks, I was waiting for a non-coralogix ✅ since I introduced a bunch of |
I am backed up on reviews as I was off last week. I will try and find time to review this tomorrow.
Thank you @avantgardnerio -- I didn't review the implementation in detail, but I skimmed it and it looked solid to me (and I trust that @Dandandan and @thinkharderdev's attention is sufficient).
I think this PR is almost ready to merge, the only things I think it needs are:
- An end to end test for actually limiting the values: https://github.com/apache/arrow-datafusion/pull/7192/files#r1301686217
- The follow on work suggested by @ozankabak in https://github.com/apache/arrow-datafusion/pull/7192/files#r1308198186
Also, if someone wanted to change this code in the future, are there benchmarks that would catch any performance regressions?
(resolved review thread on datafusion/core/src/physical_plan/aggregates/topk/hash_table.rs)
query TI
select trace_id, MAX(timestamp) from traces group by trace_id order by MAX(timestamp) desc limit 4;
I do think it is important to have an end-to-end test that actually limits the number of values coming out -- as I mentioned here, I think this test only has 4 distinct groups, and thus a `limit 4` doesn't actually do any limiting.
(force-pushed from 891189f to 2552543)
There is a benchmark. I'm not sure... I think the GitHub action fails if that regresses?
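As an aside for anyone wiring up such a guard: a Rust performance-regression benchmark is commonly written with the Criterion crate. A minimal hypothetical sketch follows; the `top_k` stand-in and the benchmark name are invented, and this is not the project's actual bench harness.

```rust
// Hypothetical Criterion benchmark sketch; `top_k` is a naive stand-in
// for the code under test, not the PR's operator.
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn top_k(values: &[i64], k: usize) -> Vec<i64> {
    // Naive stand-in: sort descending, then truncate to k rows.
    let mut v = values.to_vec();
    v.sort_unstable_by(|a, b| b.cmp(a));
    v.truncate(k);
    v
}

fn bench_top_k(c: &mut Criterion) {
    let data: Vec<i64> = (0..100_000i64).map(|i| (i * 31) % 9_973).collect();
    c.bench_function("top_k 10 of 100k", |b| {
        b.iter(|| top_k(black_box(&data), 10))
    });
}

criterion_group!(benches, bench_top_k);
criterion_main!(benches);
```

Criterion compares runs against a saved baseline and flags statistically significant regressions, which is the kind of signal a CI job can gate on.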
I added some `limit 3` tests.
Which issue does this PR close?
Closes #7191.
Rationale for this change
Described in issue.
What changes are included in this PR?
- a new `GroupedTopKAggregateStream` aggregation
- a `limit` property on `AggregateExec`
- removal of the `SortExec` if applicable

Are these changes tested?
- `AggregateExec` now prints `lim=X` if there's a limit, and I added some tests to assert this
- `sqllogictest`s to compare to existing functionality

Are there any user-facing changes?
~~I probably broke other things so this is a draft~~ All the existing tests now pass

Notes
Concerns to address:
- ~~the `OwnedRow` code is not columnar, vectorized, etc~~ most queries will use a single column
- ~~use the existing Accumulators?~~ not required since this is only min/max
- ~~filters are not yet applied~~ unsupported edge case for now
- should this be a new `Exec` node, not just a new `Stream` type?
- ~~key types other than `String`~~ now supports String + all primitive keys
- replace `TreeMap` with custom index-based heap (see the sketch after this list)

Out of scope:
- `OwnedRow`
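Since one concern above mentions replacing `TreeMap` with a custom index-based heap, here is a hypothetical sketch of that idea. The `IndexHeap` type is invented for illustration and is not the PR's actual `topk` data structure: the heap orders indexes into a separate value buffer, so sift operations swap small `usize`s and other structures (e.g. a hash table mapping group keys to slots) can keep stable references to the values.

```rust
/// Illustrative index-based min-heap: `heap` stores indexes into `values`,
/// so reordering the heap never moves the values themselves. Only push
/// and peek are shown; pop/eviction are elided for brevity.
struct IndexHeap {
    values: Vec<i64>, // append-only value storage; never reordered
    heap: Vec<usize>, // binary min-heap of indexes into `values`
}

impl IndexHeap {
    fn new() -> Self {
        Self { values: Vec::new(), heap: Vec::new() }
    }

    fn push(&mut self, v: i64) {
        self.values.push(v);
        self.heap.push(self.values.len() - 1);
        self.sift_up(self.heap.len() - 1);
    }

    fn peek_min(&self) -> Option<i64> {
        self.heap.first().map(|&i| self.values[i])
    }

    fn sift_up(&mut self, mut pos: usize) {
        while pos > 0 {
            let parent = (pos - 1) / 2;
            if self.values[self.heap[pos]] < self.values[self.heap[parent]] {
                self.heap.swap(pos, parent); // swap indexes, not values
                pos = parent;
            } else {
                break;
            }
        }
    }
}

fn main() {
    let mut h = IndexHeap::new();
    for v in [5, 2, 8, 1] {
        h.push(v);
    }
    assert_eq!(h.peek_min(), Some(1));
}
```

Pairing a heap like this with a hash map from group key to buffer slot is what lets an operator both look up a group in O(1) and maintain heap order over its aggregate values.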