Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Aggregate with Limit #13729

Open
waruto210 opened this issue Dec 11, 2024 · 0 comments
Open

Improve Aggregate with Limit #13729

waruto210 opened this issue Dec 11, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

@waruto210
Copy link
Contributor

waruto210 commented Dec 11, 2024

Is your feature request related to a problem or challenge?

Currently datafusion supports topk_aggregation(#7192 ), and I've noticed two potential areas for optimization:

  1. topk_aggregation only supports ordering by aggregations (order by agg), but doesn't support ordering by columns
  2. topk_aggregation cannot be used in cases where there is no order by clause

Describe the solution you'd like

  1. For the first one, I think we can directly use the current priority queue solution

  2. For the second one:

    The following query which without an ORDER BY clause is a non-deterministic query where returning aggregate results for any 10 keys is valid. Therefore, the simplest optimization method is to make the aggregate operation work in an ordered manner.

    SELECT "UserID", MIN("AdvEngineID") FROM hits GROUP BY "UserID" order by MIN("AdvEngineID") LIMIT 10;

    When using ORDER BY, it runs faster

    > SELECT "UserID", MIN("AdvEngineID") FROM hits GROUP BY "UserID" order by MIN("AdvEngineID") LIMIT 10;
    +---------------------+-----------------------+
    | UserID              | min(hits.AdvEngineID) |
    +---------------------+-----------------------+
    | 8265925904823819813 | 0                     |
    | 4187744066815097140 | 0                     |
    | 7970073217952173230 | 0                     |
    | 1427747163043990597 | 0                     |
    | 844431718317972727  | 0                     |
    | 5777095312382298493 | 0                     |
    | 1726996514541263413 | 0                     |
    | 9126602271142633613 | 0                     |
    | 137961262398427389  | 0                     |
    | 1015195936484711824 | 0                     |
    +---------------------+-----------------------+
    10 row(s) fetched.
    Elapsed 0.389 seconds.
    
    > SELECT "UserID", MIN("AdvEngineID") FROM hits GROUP BY "UserID" LIMIT 10;
    +--------------------+-----------------------+
    | UserID             | min(hits.AdvEngineID) |
    +--------------------+-----------------------+
    | 572919489234519776 | 0                     |
    | 573305205053433738 | 0                     |
    | 573316808606264402 | 0                     |
    | 573764012887855352 | 0                     |
    | 573878311420505425 | 0                     |
    | 574390391777319344 | 0                     |
    | 574747808491966822 | 0                     |
    | 574822330152391359 | 0                     |
    | 574882842092706724 | 0                     |
    | 576132580282803970 | 0                     |
    +--------------------+-----------------------+
    10 row(s) fetched.
    Elapsed 0.632 seconds.

Describe alternatives you've considered

No response

Additional context

Just found that the TopKAggregation only support group by one column, I'm confused.

let group_key = aggr.group_expr().expr().iter().exactly_one().ok()?;

@waruto210 waruto210 added the enhancement New feature or request label Dec 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant