Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minor: add with_estimated_selectivity to Precision #8177

Merged
merged 4 commits into from
Nov 17, 2023

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Nov 14, 2023

Which issue does this PR close?

Part of #8078

Rationale for this change

I am trying to consolidate uses of Precision into methods on the Precision enum so that

  1. how it is used is clearer
  2. Its usage is consistent
  3. So we can more easily change how it is implemented in Introduce a way to represent constrained statistics / bounds on values in Statistics #8078

What changes are included in this PR?

  1. Move the code that applies selectivity from Filter::statistics into Precision::apply_filter Precision:: with_estimated_selectivity

Are these changes tested?

Existing tests

Are there any user-facing changes?

@@ -200,15 +200,10 @@ impl ExecutionPlan for FilterExec {
// assume filter selects 20% of rows if we cannot do anything smarter
// tracking issue for making this configurable:
// https://github.com/apache/arrow-datafusion/issues/8133
let selectivity = 0.2_f32;
let selectivity = 0.2_f64;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interestingly this code from @andygrove in #8126 is different to the way the selectivity is implemented below -- it uses f32 and doesn't apply ceil()

This PR makes sure these two paths are consistent which I think is an improvement

/// Return the estimate of applying a filter of selectivity `selectivity` to
/// this Precision. A selectivity of `1.0` means that all rows are selected.
/// A selectivity of `0.5` means half the rows are selected.
pub fn apply_filter(self, selectivity: f64) -> Self {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is one of the key APIs I expect to change as part of #8078 (the output will retain information about the min/max).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering if it might be more suitable to locate this function in physical_plan's? There's already a multiplication function here, and it seems to me that this function could potentially be more relevant with filter context.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function both multiplies the values and turns the statistics into inexact (which I actually missed my first time through this code).

I think the intent is much clearer and consistent in this PR where the logic is commented in a function (rather than replicated 4 times, inconsistently).

That being said, I agree the notion of 'filter' is probably more specific to the physical plan rather than a "precision"

Perhaps we can come up with a different name for the function ? what about Precision::estimate_filtered or Precision::with_estimated_selectivity?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Precision::with_estimated_selectivity sounds more descriptive.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed

@alamb alamb requested a review from andygrove November 14, 2023 22:06
@alamb alamb marked this pull request as ready for review November 14, 2023 22:07
/// rows are selected. A selectivity of `0.5` means half the rows are
/// selected. Will always return inexact statistics.
pub fn apply_filter(self, selectivity: f64) -> Self {
self.map(|v| ((v as f64 * selectivity).ceil()) as usize)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use ceil in this case rather than round?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe Inexact(0) triggers some decisions at somewhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think there were several cases where DataFusion was (incorrectly) optimizing away scans based on Inexact(0) -- I tried to capture some of what is going on in the #8227 ticket if you are interested

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use ceil in this case rather than round?

Also, specifically in this PR I used ceil() because that is what the existing code did in the older codepath.

@alamb alamb changed the title Minor: add apply_filter to Precision Minor: add with_estimated_selectivity to Precision Nov 16, 2023
Copy link
Contributor

@ozankabak ozankabak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me as a step towards simplifying the current mechanism so that we can better figure out/design the next one.

@alamb alamb merged commit a2b9ab8 into apache:main Nov 17, 2023
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants