Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minor: add with_estimated_selectivity to Precision #8177

Merged
merged 4 commits into from
Nov 17, 2023
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
9 changes: 9 additions & 0 deletions datafusion/common/src/stats.rs
Original file line number Diff line number Diff line change
Expand Up @@ -151,6 +151,15 @@ impl Precision<usize> {
(_, _) => Precision::Absent,
}
}

/// Return the estimate of applying a filter with estimated selectivity
/// `selectivity` to this Precision. A selectivity of `1.0` means that all
/// rows are selected. A selectivity of `0.5` means half the rows are
/// selected. Will always return inexact statistics.
pub fn apply_filter(self, selectivity: f64) -> Self {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is one of the key APIs I expect to change as part of #8078 (the output will retain information about the min/max).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering if it might be more suitable to locate this function in physical_plan's? There's already a multiplication function here, and it seems to me that this function could potentially be more relevant with filter context.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function both multiplies the values and turns the statistics into inexact (which I actually missed my first time through this code).

I think the intent is much clearer and consistent in this PR where the logic is commented in a function (rather than replicated 4 times, inconsistently).

That being said, I agree the notion of 'filter' is probably more specific to the physical plan rather than a "precision"

Perhaps we can come up with a different name for the function ? what about Precision::estimate_filtered or Precision::with_estimated_selectivity?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Precision::with_estimated_selectivity sounds more descriptive.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed

self.map(|v| ((v as f64 * selectivity).ceil()) as usize)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use ceil in this case rather than round?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe Inexact(0) triggers some decisions at somewhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think there were several cases where DataFusion was (incorrectly) optimizing away scans based on Inexact(0) -- I tried to capture some of what is going on in the #8227 ticket if you are interested

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use ceil in this case rather than round?

Also, specifically in this PR I used ceil() because that is what the existing code did in the older codepath.

.to_inexact()
}
}

impl Precision<ScalarValue> {
Expand Down
31 changes: 7 additions & 24 deletions datafusion/physical-plan/src/filter.rs
Original file line number Diff line number Diff line change
Expand Up @@ -200,15 +200,10 @@ impl ExecutionPlan for FilterExec {
// assume filter selects 20% of rows if we cannot do anything smarter
// tracking issue for making this configurable:
// https://github.com/apache/arrow-datafusion/issues/8133
let selectivity = 0.2_f32;
let selectivity = 0.2_f64;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interestingly this code from @andygrove in #8126 is different to the way the selectivity is implemented below -- it uses f32 and doesn't apply ceil()

This PR makes sure these two paths are consistent which I think is an improvement

let mut stats = input_stats.clone().into_inexact();
if let Precision::Inexact(n) = stats.num_rows {
stats.num_rows = Precision::Inexact((selectivity * n as f32) as usize);
}
if let Precision::Inexact(n) = stats.total_byte_size {
stats.total_byte_size =
Precision::Inexact((selectivity * n as f32) as usize);
}
stats.num_rows = stats.num_rows.apply_filter(selectivity);
stats.total_byte_size = stats.total_byte_size.apply_filter(selectivity);
return Ok(stats);
}

Expand All @@ -222,14 +217,8 @@ impl ExecutionPlan for FilterExec {

// Estimate (inexact) selectivity of predicate
let selectivity = analysis_ctx.selectivity.unwrap_or(1.0);
let num_rows = match num_rows.get_value() {
Some(nr) => Precision::Inexact((*nr as f64 * selectivity).ceil() as usize),
None => Precision::Absent,
};
let total_byte_size = match total_byte_size.get_value() {
Some(tbs) => Precision::Inexact((*tbs as f64 * selectivity).ceil() as usize),
None => Precision::Absent,
};
let num_rows = num_rows.apply_filter(selectivity);
let total_byte_size = total_byte_size.apply_filter(selectivity);

let column_statistics = collect_new_statistics(
&input_stats.column_statistics,
Expand Down Expand Up @@ -277,16 +266,10 @@ fn collect_new_statistics(
)
};
ColumnStatistics {
null_count: match input_column_stats[idx].null_count.get_value() {
Some(nc) => Precision::Inexact(*nc),
None => Precision::Absent,
},
null_count: input_column_stats[idx].null_count.clone().to_inexact(),
max_value,
min_value,
distinct_count: match distinct_count.get_value() {
Some(dc) => Precision::Inexact(*dc),
None => Precision::Absent,
},
distinct_count: distinct_count.to_inexact(),
}
},
)
Expand Down