-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Do not push down filter through distinct on #12943
Do not push down filter through distinct on #12943
Conversation
Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look |
Thank you @epsio-banay and @findepi and @eejbyfeldt for your help |
75a06f6
to
d1bcf2c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @epsio-banay -- this is looking close
Perhaps @gruuya would have some time to help review this and offer suggestions as well
@@ -628,6 +627,103 @@ fn infer_join_predicates( | |||
.collect::<Result<Vec<_>>>() | |||
} | |||
|
|||
/// Check whether the given expression depends only on the columns in the given schema. | |||
/// Meaning for a certain |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this sentence appears to end abruptly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also this appears to be a copy of https://github.com/apache/datafusion/blob/b098893a34f83f1a1df290168377d7622938b3f5/datafusion/core/src/datasource/listing/helpers.rs#L55-L54
Perhaps instead of copy/pasting the code, you could move it to a common location (e.g. datafusion-expr)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey,
I agree that this code is a duplicate and it should be a common function, however expr_applicable_for_cols
uses column names as argument, so in case we have two columns with the same name but with different qualifiers, the function will produce incorrect result.
IMO we would need to dig into expr_applicable_for_cols
's usage (there's only one) and understand if it should be fixed. Unfortunately I won't have much time to do it in the coming weeks.
In the mean time I moved this function to be a method of expr
.flat_map(|e| e.column_refs()) | ||
.collect(); | ||
let distinct_on_input_schema = distinct_on.input.schema(); | ||
let distinct_on_qualified_fields: Vec<_> = distinct_on_cols |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is a lot of copying going on in this code (qualified_field_from_column
etc)
Is there any way to avoid this? I wonder if we could check each reference in the filter expression and see if it had any column that wasn't in the distinct on clause
…e cardinality of the distinct on expressions.
d1bcf2c
to
a2e19ef
Compare
Are we certain that the filter should not be pushed down, and moreover that it can be pushed down in some cases but not in other cases? This is unexpected to me, as my intuition assumes that the postgres@localhost:postgres> explain select distinct on (a) a, b from foo where b = 1 order by a, b desc;
+-----------------------------------------------------------------+
| QUERY PLAN |
|-----------------------------------------------------------------|
| Unique (cost=38.44..38.50 rows=11 width=8) |
| -> Sort (cost=38.44..38.47 rows=11 width=8) |
| Sort Key: a |
| -> Seq Scan on foo (cost=0.00..38.25 rows=11 width=8) |
| Filter: (b = 1) |
+-----------------------------------------------------------------+ Granted with |
I don't think "always pushed down" is the correct description here. My understanding is that it would be planned as a filter before the aggregation and therefore would not needed to be pushed down. In DataFusion this can be seen if we do
To get the filter above the aggregation (in postgres) we can do
e.g not pushed down and if we change the filter to
so this filter gets pushed down. So postgres seems to agree with us that the filter can be pushed down in some case and not in others. |
vec![col("a")], | ||
))))? | ||
.build()?; | ||
// filter is on volatile function, so it should not be pushed down |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// filter is on volatile function, so it should not be pushed down | |
// filter is on a non-volatile function, so it should be pushed down |
@@ -1615,6 +1659,91 @@ mod tests { | |||
assert_optimized_plan_eq(plan, expected) | |||
} | |||
|
|||
#[test] | |||
fn distinct_on() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we also need a "false negative" test here, i.e. a filter on the same subquery as the distinct (so without an alias
) but not on a distinct column.
Something like (not tested):
fn distinct_on_() -> Result<()> {
let table_scan = test_table_scan()?;
let plan = LogicalPlanBuilder::from(table_scan)
.distinct_on(vec![col("a")], vec![col("a"), col("b")], None)?
.filter(col("b").eq(lit(1i64)))?
.build()?;
// filter is on the same subquery as the distinct, so it should be pushed down even though it isn't in the ON clause
let expected = "\
DistinctOn: on_expr=[[test.a]], select_expr=[[a, b]], sort_expr=[[]]\
\n TableScan: test, full_filters=[b = Int64(1)]";
assert_optimized_plan_eq(plan, expected)
}
Ahh good point. I think my misunderstanding here is that I was concentrating on the So yes, I agree we should push down only compatible expressions from outside of the |
Marking as a draft as I think this PR is no longer waiting on feedback (and I am trying to work down the review queue). Please mark it as ready for review when it is ready for another look |
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
Which issue does this PR close?
Closes #12942.
Rationale for this change
Written in issue.
What changes are included in this PR?
Fix for PushDownFilter with DistinctOn.
Are these changes tested?
Yes
Are there any user-facing changes?
No