-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: support COUNT()
#11229
feat: support COUNT()
#11229
Conversation
|
||
/// Rewrite `Count()` to `Count(Expr:Literal(1))`. | ||
#[derive(Default)] | ||
pub struct CountEmptyRule {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is heavily inspired/copied from CountWildcardRule
. It should be possible to also combine this with that if that's a better path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As each optimizer pass does have non trivial overhead (like it walks the plan trees) I think it would be better if this was combined into the existing CountWildcardRule
if possible
I can think of two alternative approaches
datafusion/datafusion/sql/src/select.rs Line 84 in 3421b52
I think we could rely on the planner trait we recently added in #11180 , so we can rewrite it with expected args based on the expression we got in sql parser. @tshauck What do you think? |
@jayzhan211 thanks for the feedback. I like the idea because it would seem to avoid having to update the count function to accept zero args. The current approach is nice at least in that it's very clear what's happening (at least to me) and while a bit of code it's mostly a single chunk. Can you say more about new planner trait and how that'd be used? Is the idea I'd implement Also, since this is somewhat inspired from duckdb, they do a |
I think we can have a more general Planner, AggregateFunctionPlanner if let Some(fm) = self.context_provider.get_aggregate_meta(&name) {
let order_by = self.order_by_to_sort_expr(
&order_by,
schema,
planner_context,
true,
None,
)?;
let order_by = (!order_by.is_empty()).then_some(order_by);
let args = self.function_args_to_expr(args, schema, planner_context)?;
let filter: Option<Box<Expr>> = filter
.map(|e| self.sql_expr_to_logical_expr(*e, schema, planner_context))
.transpose()?
.map(Box::new);
let raw_aggregate_function = RawAggregateFunction {
fm,
args,
distinct,
filter,
order_by,
null_treatment,
}
// general planner for rewriting aggregate function
// we convert wildcard and empty args to lit(1) for count in our case
plan_aggregate_function(raw_aggregate_function)
return Ok(Expr::AggregateFunction(expr::AggregateFunction::new_udf(
fm,
args,
distinct,
filter,
order_by,
null_treatment,
)));
} datafusion/datafusion/sql/src/expr/function.rs Lines 337 to 359 in fe66daa
I'm not sure which approach are you mentioning and what do you mean about dataframe expression 😕 Are you suggesting we support count_start() dataframe API? I didn't find count_star() in duckdb, so not sure should we support this we have dataframe API |
Thanks, your original reply was a bit confusing, but I think that makes sense. Adding something new/generic was more than I had originally had in mind, but I'll take a look at adding something like you mention in Here's what I meant w.r.t. duckdb...
You're right on the dataframe you can do count with wildcard, but it doesn't seem as analogous to I'm traveling through the end of the weekend, but'll follow up when I've had a chance to try this out. |
cba60ac
to
07bee49
Compare
@jayzhan211 -- would you mind taking a quick look now to see if this approach is inline w/ what you were thinking? It's super rough, but just hoping to get feedback on the general shape. On this branch I can do:
|
It is what I'm thinking about! |
6782364
to
4550dbd
Compare
return Ok(PlannerResult::Planned(Expr::AggregateFunction( | ||
expr::AggregateFunction::new_udf( | ||
aggregate_function.udf, | ||
vec![lit(COUNT_STAR_EXPANSION).alias("*")], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quickly tried a few options for aliasing, but nothing quite works right. Need to look a little deeper unless it's obvious to someone else.
I think the idea is to reduce the complexity,
IMO, introducing a new count_start function is unnecessary. We already have the capability to handle count(1) in our AggregateStatistics module. Instead, our focus should be on converting count() and count(*) to count(1).
If you are asking dataframe API specifically, it should be something like pub fn count(expr: Expr) -> Expr {
let expr = if expr == Expr::Wildcard {
expr = lit(1)
} else {
expr
}
Expr::AggregateFunction(datafusion_expr::expr::AggregateFunction::new_udf(
count_udaf(),
vec![expr],
false,
None,
None,
None,
))
} |
I think we both agree on the outcome, I'm talking about the implementation and particular how it works for the alias issues I saw. E.g. your example would break a lot of tests though it'd produce the right answer because it changes how the plan is represented. So the additional complexity I'm talking about is what changes need to support that (if any) and the extent to which that makes it a more complex issue than something like count star which may feel verbose, but seems simple. This is also an interesting test, because it goes straight to the logical plan builder... let sql_results = ctx
.sql("select count(*) from t1")
.await?
.select(vec![col("count(*)")])?
.explain(false, false)?
.collect()
.await?; which maybe I missed it, but wouldn't include potentially being planned? |
It seems like a duplicated If the alias issue is what I think of, we can add datafusion/datafusion/expr/src/expr.rs Lines 2103 to 2105 in 16a3148
If only the function displayed name is changed, I think it is acceptable. |
The test is from here: datafusion/datafusion/core/tests/dataframe/mod.rs Lines 208 to 237 in 16a3148
select(col("count(*)")) which again seems to go directly to the logical planner builder and thus isn't an expression to be planned, or is that right?
|
I agree with it. And, I'm not sure why the test looks like this and whether we should consider it valid or not. IMO, I play around it and have a workaround solution for aliasing issue, I guess we can utilize If we need to deal with |
I found the function name issue is a lot complex than I thought. I think we should have a different name for function for displayed and planning. In DuckDB,
|
@jayzhan211 @alamb I've reverted this PR its original form. It adds a logical rule and updates the COUNT() function as a basis for discussion. I also explored a little bit the idea of having a specific |
@tshauck Is it better to merge the logic of count-wild and count empty into single rule, given the similarity of them? |
I will try and find time to review this PR tomorrow |
Thanks! A lot of the discussion is good for context, but not as relevant to the actual changes.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @tshauck -- I think this looks good to me. I would prefer that we consolidate the optimizer passes, but we could also do it as part of a follow on PR (or never)
|
||
/// Rewrite `Count()` to `Count(Expr:Literal(1))`. | ||
#[derive(Default)] | ||
pub struct CountEmptyRule {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As each optimizer pass does have non trivial overhead (like it walks the plan trees) I think it would be better if this was combined into the existing CountWildcardRule
if possible
@@ -103,10 +103,6 @@ SELECT power(1, 2, 3); | |||
# Wrong window/aggregate function signature | |||
# | |||
|
|||
# AggregateFunction with wrong number of arguments |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to come up with some other query that has an invalid arguments for count(*) but I could not
+1 on this, they are possible to be rewritten in one rule, it is ok to do it in another PR, but I hope we have only on rule at the end. |
Co-authored-by: Andrew Lamb <[email protected]>
Thanks for the feedback @alamb @jayzhan211. I was unaware there was tangible overhead for each rule, so certainly makes sense to me to combine then. Still waiting on windows tests, but otherwise I've made the requested updates. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love it -- thank you @tshauck
Thanks for your guidance and encouragement @jayzhan211 |
* feat: add count empty rewrite * feat: make count support zero args * docs: add apache license * tests: make count() valid * tests: more tests * refactor: sketch `AggregateFunctionPlanner` * refactor: cleanup `AggregateFunctionPlanner` * feat: add back rule * Revert "feat: add back rule" This reverts commit 2c4fc0a. * Revert "refactor: cleanup `AggregateFunctionPlanner`" This reverts commit 4550dbd. * Revert "refactor: sketch `AggregateFunctionPlanner`" This reverts commit 658671e. * Apply suggestions from code review Co-authored-by: Andrew Lamb <[email protected]> * refactor: PR feedback * style: fix indent --------- Co-authored-by: Andrew Lamb <[email protected]>
* feat: add count empty rewrite * feat: make count support zero args * docs: add apache license * tests: make count() valid * tests: more tests * refactor: sketch `AggregateFunctionPlanner` * refactor: cleanup `AggregateFunctionPlanner` * feat: add back rule * Revert "feat: add back rule" This reverts commit 2c4fc0a. * Revert "refactor: cleanup `AggregateFunctionPlanner`" This reverts commit 4550dbd. * Revert "refactor: sketch `AggregateFunctionPlanner`" This reverts commit 658671e. * Apply suggestions from code review Co-authored-by: Andrew Lamb <[email protected]> * refactor: PR feedback * style: fix indent --------- Co-authored-by: Andrew Lamb <[email protected]>
Which issue does this PR close?
Closes #11228
Rationale for this change
Adds an analyzer rule that roughly matches the current
CountWildcardRule
rule. I made them separate, but at first blush they could be combined.I'm not sure how best to handle expressions in rust-land however, because right now the
count
function takes an expression as an input.What changes are included in this PR?
CountEmptyRule
CountWildcardRule
that this is modeled afterAre these changes tested?
Added some slt tests, though need to add more.
Are there any user-facing changes?