Minor: add `with_estimated_selectivity` to Precision #8177

alamb · 2023-11-14T20:35:08Z

Which issue does this PR close?

Part of #8078

Rationale for this change

I am trying to consolidate uses of Precision into methods on the Precision enum so that:

1. How it is used is clearer
2. Its usage is consistent
3. We can more easily change how it is implemented in Introduce a way to represent constrained statistics / bounds on values in Statistics #8078

What changes are included in this PR?

Move the code that applies selectivity from Filter::statistics into ~~Precision::apply_filter~~ Precision::with_estimated_selectivity

Are these changes tested?

Existing tests

Are there any user-facing changes?

alamb · 2023-11-14T20:37:56Z

datafusion/physical-plan/src/filter.rs

@@ -200,15 +200,10 @@ impl ExecutionPlan for FilterExec {
            // assume filter selects 20% of rows if we cannot do anything smarter
            // tracking issue for making this configurable:
            // https://github.com/apache/arrow-datafusion/issues/8133
-            let selectivity = 0.2_f32;
+            let selectivity = 0.2_f64;


Interestingly this code from @andygrove in #8126 is different to the way the selectivity is implemented below -- it uses f32 and doesn't apply ceil()

This PR makes sure these two paths are consistent which I think is an improvement
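To illustrate how the two paths could disagree, here is a minimal sketch. Both functions are hypothetical reconstructions, not the actual DataFusion code: the first assumes the older path multiplied in `f32` and used a plain truncating cast (since it did not apply `ceil()`), and the second mirrors the `f64` plus `ceil()` approach.

```rust
/// Hypothetical reconstruction of the older path: f32 arithmetic and a
/// truncating `as usize` cast (an assumption, since no `ceil()` was applied).
fn estimate_f32_truncating(num_rows: usize) -> usize {
    let selectivity = 0.2_f32;
    (num_rows as f32 * selectivity) as usize
}

/// Hypothetical reconstruction of the consolidated path: f64 arithmetic,
/// rounded up with `ceil()` before casting.
fn estimate_f64_ceil(num_rows: usize) -> usize {
    let selectivity = 0.2_f64;
    (num_rows as f64 * selectivity).ceil() as usize
}

fn main() {
    // For small inputs the two estimates differ: truncation rounds toward
    // zero, while `ceil()` rounds any fractional result up.
    for num_rows in [1_usize, 7] {
        println!(
            "rows={num_rows}: truncating={} ceil={}",
            estimate_f32_truncating(num_rows),
            estimate_f64_ceil(num_rows)
        );
    }
}
```

With 7 input rows the truncating variant estimates 1 row and the `ceil()` variant estimates 2, which is the kind of inconsistency a single shared method avoids.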

alamb · 2023-11-14T20:38:33Z

datafusion/common/src/stats.rs

+    /// Return the estimate of applying a filter of selectivity `selectivity` to
+    /// this Precision. A selectivity of `1.0` means that all rows are selected.
+    /// A selectivity of `0.5` means half the rows are selected.
+    pub fn apply_filter(self, selectivity: f64) -> Self {


This is one of the key APIs I expect to change as part of #8078 (the output will retain information about the min/max).

I am wondering whether this function would be better located in physical_plan. There is already a multiplication function here, and this function seems more relevant in a filter context.

This function both multiplies the values and turns the statistics into inexact (which I actually missed my first time through this code).

I think the intent is much clearer and consistent in this PR where the logic is commented in a function (rather than replicated 4 times, inconsistently).

That being said, I agree the notion of 'filter' is probably more specific to the physical plan rather than a "precision"

Perhaps we can come up with a different name for the function? What about Precision::estimate_filtered or Precision::with_estimated_selectivity?

I think Precision::with_estimated_selectivity sounds more descriptive.
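The behavior discussed in this thread (scale the value, then mark the result as inexact) can be sketched as follows. The `Precision` enum here is a simplified, hypothetical stand-in specialized to `usize`; the helper names `map` and `to_inexact` are assumptions for illustration and may not match the real definitions in `datafusion/common/src/stats.rs`.

```rust
/// Simplified, hypothetical stand-in for DataFusion's `Precision`,
/// specialized to `usize` to keep the sketch short.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Precision {
    Exact(usize),
    Inexact(usize),
    Absent,
}

impl Precision {
    /// Apply `f` to the wrapped value, leaving the exactness unchanged.
    fn map(self, f: impl Fn(usize) -> usize) -> Self {
        match self {
            Precision::Exact(v) => Precision::Exact(f(v)),
            Precision::Inexact(v) => Precision::Inexact(f(v)),
            Precision::Absent => Precision::Absent,
        }
    }

    /// Demote an exact value to an inexact one; other variants pass through.
    fn to_inexact(self) -> Self {
        match self {
            Precision::Exact(v) => Precision::Inexact(v),
            other => other,
        }
    }

    /// Scale the value by `selectivity`, rounding up, and always return an
    /// inexact result because the selectivity is only an estimate.
    fn with_estimated_selectivity(self, selectivity: f64) -> Self {
        self.map(|v| (v as f64 * selectivity).ceil() as usize)
            .to_inexact()
    }
}

fn main() {
    let rows = Precision::Exact(100);
    println!("{:?}", rows.with_estimated_selectivity(0.5));
}
```

Keeping the multiplication and the demotion to inexact in one method means a caller cannot forget the second step, which is the part that was easy to miss when reading the original code.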

Dandandan · 2023-11-16T09:03:30Z

datafusion/common/src/stats.rs

+    /// rows are selected. A selectivity of `0.5` means half the rows are
+    /// selected. Will always return inexact statistics.
+    pub fn apply_filter(self, selectivity: f64) -> Self {
+        self.map(|v| ((v as f64 * selectivity).ceil()) as usize)


Why use ceil in this case rather than round?

Maybe Inexact(0) triggers some decisions somewhere?

Yeah, I think there were several cases where DataFusion was (incorrectly) optimizing away scans based on Inexact(0) -- I tried to capture some of what is going on in the #8227 ticket if you are interested

> Why use ceil in this case rather than round?

Also, specifically in this PR I used ceil() because that is what the existing code did in the older codepath.
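As a small worked example of the `ceil` versus `round` question: for small row counts, rounding can produce an estimate of zero rows while `ceil()` keeps it at one or more. The function names below are hypothetical and exist only for this sketch.

```rust
/// Estimate using round-to-nearest (hypothetical alternative).
fn estimate_round(num_rows: usize, selectivity: f64) -> usize {
    (num_rows as f64 * selectivity).round() as usize
}

/// Estimate using `ceil()`, as in the diff above.
fn estimate_ceil(num_rows: usize, selectivity: f64) -> usize {
    (num_rows as f64 * selectivity).ceil() as usize
}

fn main() {
    // 2 rows at 20% selectivity is about 0.4 rows: rounding yields 0,
    // while `ceil()` yields 1.
    println!("round: {}", estimate_round(2, 0.2));
    println!("ceil:  {}", estimate_ceil(2, 0.2));
}
```

With `ceil()`, the estimate is zero only when the input row count is zero (or the selectivity is zero), so a small but non-empty input does not turn into an estimate of zero rows.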

ozankabak

This looks good to me as a step towards simplifying the current mechanism so that we can better figure out/design the next one.

Minor: add apply_filter to Precision

a9ec65a

alamb commented Nov 14, 2023


alamb mentioned this pull request Nov 14, 2023

Introduce a way to represent constrained statistics / bounds on values in Statistics #8078

Open

fix: use inexact

6f04551

alamb requested a review from andygrove November 14, 2023 22:06

alamb marked this pull request as ready for review November 14, 2023 22:07

Dandandan reviewed Nov 16, 2023


alamb added 2 commits November 16, 2023 09:11

Rename to with_estimated_selectivity

434dbbf

Merge remote-tracking branch 'apache/main' into alamb/stats_cleanup3

080ae2c

alamb changed the title ~~Minor: add apply_filter to Precision~~ Minor: add with_estimated_selectivity to Precision Nov 16, 2023

ozankabak approved these changes Nov 17, 2023


alamb merged commit a2b9ab8 into apache:main Nov 17, 2023
24 checks passed

matthewgapp mentioned this pull request Jan 11, 2024

matt/feat/recursive ctes/config flag matthewgapp/arrow-datafusion#3

Closed


alamb Nov 14, 2023

alamb Nov 14, 2023

berkaysynnada Nov 15, 2023

alamb Nov 15, 2023

berkaysynnada Nov 16, 2023

alamb Nov 16, 2023

Dandandan Nov 16, 2023

berkaysynnada Nov 16, 2023

alamb Nov 16, 2023

alamb Nov 16, 2023

ozankabak left a comment
