Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not push down filter through distinct on #12943

Closed

Conversation

epsio-banay
Copy link
Contributor

Which issue does this PR close?

Closes #12942.

Rationale for this change

Written in issue.

What changes are included in this PR?

Fix for PushDownFilter with DistinctOn.

Are these changes tested?

Yes

Are there any user-facing changes?

No

@github-actions github-actions bot added the optimizer Optimizer rules label Oct 15, 2024
@alamb
Copy link
Contributor

alamb commented Oct 18, 2024

Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look

@alamb alamb marked this pull request as draft October 18, 2024 19:08
@alamb
Copy link
Contributor

alamb commented Oct 18, 2024

Thank you @epsio-banay and @findepi and @eejbyfeldt for your help

@epsio-banay epsio-banay force-pushed the fix-distinct-on-push-down-filter branch 4 times, most recently from 75a06f6 to d1bcf2c Compare October 23, 2024 13:25
@epsio-banay epsio-banay marked this pull request as ready for review October 23, 2024 15:23
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @epsio-banay -- this is looking close

Perhaps @gruuya would have some time to help review this and offer suggestions as well

@@ -628,6 +627,103 @@ fn infer_join_predicates(
.collect::<Result<Vec<_>>>()
}

/// Check whether the given expression depends only on the columns in the given schema.
/// Meaning for a certain
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this sentence appears to end abruptly

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also this appears to be a copy of https://github.com/apache/datafusion/blob/b098893a34f83f1a1df290168377d7622938b3f5/datafusion/core/src/datasource/listing/helpers.rs#L55-L54

Perhaps instead of copy/pasting the code, you could move it to a common location (e.g. datafusion-expr)?

Copy link
Contributor Author

@epsio-banay epsio-banay Oct 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey,
I agree that this code is a duplicate and it should be a common function, however expr_applicable_for_cols uses column names as argument, so in case we have two columns with the same name but with different qualifiers, the function will produce incorrect result.
IMO we would need to dig into expr_applicable_for_cols's usage (there's only one) and understand if it should be fixed. Unfortunately I won't have much time to do it in the coming weeks.

In the mean time I moved this function to be a method of expr

.flat_map(|e| e.column_refs())
.collect();
let distinct_on_input_schema = distinct_on.input.schema();
let distinct_on_qualified_fields: Vec<_> = distinct_on_cols
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a lot of copying going on in this code (qualified_field_from_column etc)

Is there any way to avoid this? I wonder if we could check each reference in the filter expression and see if it had any column that wasn't in the distinct on clause

…e cardinality of the distinct on expressions.
@epsio-banay epsio-banay force-pushed the fix-distinct-on-push-down-filter branch from d1bcf2c to a2e19ef Compare October 27, 2024 09:59
@github-actions github-actions bot added the logical-expr Logical plan and expressions label Oct 27, 2024
@gruuya
Copy link
Contributor

gruuya commented Oct 29, 2024

Are we certain that the filter should not be pushed down, and moreover that it can be pushed down in some cases but not in other cases?

This is unexpected to me, as my intuition assumes that the WHERE clause always filters the pre-aggregated data (i.e. it is always pushed down)—double checking on Postgres this seems to be confirmed

postgres@localhost:postgres> explain select distinct on (a) a, b from foo where b = 1 order by a, b desc;
+-----------------------------------------------------------------+
| QUERY PLAN                                                      |
|-----------------------------------------------------------------|
| Unique  (cost=38.44..38.50 rows=11 width=8)                     |
|   ->  Sort  (cost=38.44..38.47 rows=11 width=8)                 |
|         Sort Key: a                                             |
|         ->  Seq Scan on foo  (cost=0.00..38.25 rows=11 width=8) |
|               Filter: (b = 1)                                   |
+-----------------------------------------------------------------+

Granted with DISTINCT ON you can't use HAVING (as there's no explicit GROUP BY), but I guess one should be using a subquery with an additional filter on top of the nested DISTINCT ON in such cases.

@eejbyfeldt
Copy link
Contributor

This is unexpected to me, as my intuition assumes that the WHERE clause always filters the pre-aggregated data (i.e. it is always pushed down)

I don't think "always pushed down" is the correct description here. My understanding is that it would be planned as a filter before the aggregation and therefore would not needed to be pushed down. In DataFusion this can be seen if we do

> EXPLAIN VERBOSE select distinct on (a) a, b from foo where b = 1 order by a, b desc;

+------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type                                                  | plan                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
+------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| initial_logical_plan                                       | DistinctOn: on_expr=[[foo.a]], select_expr=[[foo.a, foo.b]], sort_expr=[[foo.a ASC NULLS FIRST, foo.b DESC NULLS LAST]]                                                                                                                                                                                                                                                                                                                                                                                                              |
|                                                            |   Filter: foo.b = Int64(1)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
|                                                            |     TableScan: foo                              
...

To get the filter above the aggregation (in postgres) we can do

> explain select * from (select distinct on (a) a, b from foo) t where b = 1 order by a, b desc;
                               QUERY PLAN                                
-------------------------------------------------------------------------
 Subquery Scan on t  (cost=158.51..172.31 rows=1 width=8)
   Filter: (t.b = 1)
   ->  Unique  (cost=158.51..169.81 rows=200 width=8)
         ->  Sort  (cost=158.51..164.16 rows=2260 width=8)
               Sort Key: foo.a
               ->  Seq Scan on foo  (cost=0.00..32.60 rows=2260 width=8)

e.g not pushed down and if we change the filter to a = 1 we get

> explain select * from (select distinct on (a) a, b from foo) t where a = 1 order by a, b desc;
                           QUERY PLAN                            
-----------------------------------------------------------------
 Sort  (cost=38.55..38.58 rows=11 width=8)
   Sort Key: foo.b DESC
   ->  Unique  (cost=0.00..38.25 rows=11 width=8)
         ->  Seq Scan on foo  (cost=0.00..38.25 rows=11 width=8)
               Filter: (a = 1)

so this filter gets pushed down.

So postgres seems to agree with us that the filter can be pushed down in some case and not in others.

vec![col("a")],
))))?
.build()?;
// filter is on volatile function, so it should not be pushed down
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// filter is on volatile function, so it should not be pushed down
// filter is on a non-volatile function, so it should be pushed down

@@ -1615,6 +1659,91 @@ mod tests {
assert_optimized_plan_eq(plan, expected)
}

#[test]
fn distinct_on() -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we also need a "false negative" test here, i.e. a filter on the same subquery as the distinct (so without an alias) but not on a distinct column.

Something like (not tested):

    fn distinct_on_() -> Result<()> {
        let table_scan = test_table_scan()?;
        let plan = LogicalPlanBuilder::from(table_scan)
            .distinct_on(vec![col("a")], vec![col("a"), col("b")], None)?
            .filter(col("b").eq(lit(1i64)))?
            .build()?;
        // filter is on the same subquery as the distinct, so it should be pushed down even though it isn't in the ON clause
        let expected = "\
        DistinctOn: on_expr=[[test.a]], select_expr=[[a, b]], sort_expr=[[]]\
        \n  TableScan: test, full_filters=[b = Int64(1)]";
        assert_optimized_plan_eq(plan, expected)
    }

@gruuya
Copy link
Contributor

gruuya commented Oct 29, 2024

To get the filter above the aggregation (in postgres) we can do

Ahh good point. I think my misunderstanding here is that I was concentrating on the WHERE clause within the DISTINCT ON (sub)query.

So yes, I agree we should push down only compatible expressions from outside of the DISTINCT ON subquery, but the filter from within that (sub)query should always be pushed down (and I think there's a test missing for that).

@alamb alamb marked this pull request as draft November 3, 2024 11:59
@alamb
Copy link
Contributor

alamb commented Nov 3, 2024

Marking as a draft as I think this PR is no longer waiting on feedback (and I am trying to work down the review queue).

Please mark it as ready for review when it is ready for another look

Copy link

github-actions bot commented Jan 3, 2025

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Jan 3, 2025
@github-actions github-actions bot closed this Jan 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions optimizer Optimizer rules Stale PR has not had any activity for some time
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PushDownFilter optimizer pushes down filters through distinct on
5 participants