Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minor: add with_estimated_selectivity to Precision #8177

Merged
merged 4 commits into from
Nov 17, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
9 changes: 9 additions & 0 deletions datafusion/common/src/stats.rs
Original file line number Diff line number Diff line change
Expand Up @@ -151,6 +151,15 @@ impl Precision<usize> {
(_, _) => Precision::Absent,
}
}

/// Return the estimate of applying a filter with estimated selectivity
/// `selectivity` to this Precision. A selectivity of `1.0` means that all
/// rows are selected. A selectivity of `0.5` means half the rows are
/// selected. Will always return inexact statistics.
pub fn with_estimated_selectivity(self, selectivity: f64) -> Self {
self.map(|v| ((v as f64 * selectivity).ceil()) as usize)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use ceil in this case rather than round?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe Inexact(0) triggers some decisions at somewhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think there were several cases where DataFusion was (incorrectly) optimizing away scans based on Inexact(0) -- I tried to capture some of what is going on in the #8227 ticket if you are interested

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use ceil in this case rather than round?

Also, specifically in this PR I used ceil() because that is what the existing code did in the older codepath.

.to_inexact()
}
}

impl Precision<ScalarValue> {
Expand Down
25 changes: 8 additions & 17 deletions datafusion/physical-plan/src/filter.rs
Original file line number Diff line number Diff line change
Expand Up @@ -200,15 +200,12 @@ impl ExecutionPlan for FilterExec {
// assume filter selects 20% of rows if we cannot do anything smarter
// tracking issue for making this configurable:
// https://github.com/apache/arrow-datafusion/issues/8133
let selectivity = 0.2_f32;
let mut stats = input_stats.into_inexact();
if let Precision::Inexact(n) = stats.num_rows {
stats.num_rows = Precision::Inexact((selectivity * n as f32) as usize);
}
if let Precision::Inexact(n) = stats.total_byte_size {
stats.total_byte_size =
Precision::Inexact((selectivity * n as f32) as usize);
}
let selectivity = 0.2_f64;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interestingly this code from @andygrove in #8126 is different to the way the selectivity is implemented below -- it uses f32 and doesn't apply ceil()

This PR makes sure these two paths are consistent which I think is an improvement

let mut stats = input_stats.clone().into_inexact();
stats.num_rows = stats.num_rows.with_estimated_selectivity(selectivity);
stats.total_byte_size = stats
.total_byte_size
.with_estimated_selectivity(selectivity);
return Ok(stats);
}

Expand All @@ -222,14 +219,8 @@ impl ExecutionPlan for FilterExec {

// Estimate (inexact) selectivity of predicate
let selectivity = analysis_ctx.selectivity.unwrap_or(1.0);
let num_rows = match num_rows.get_value() {
Some(nr) => Precision::Inexact((*nr as f64 * selectivity).ceil() as usize),
None => Precision::Absent,
};
let total_byte_size = match total_byte_size.get_value() {
Some(tbs) => Precision::Inexact((*tbs as f64 * selectivity).ceil() as usize),
None => Precision::Absent,
};
let num_rows = num_rows.with_estimated_selectivity(selectivity);
let total_byte_size = total_byte_size.with_estimated_selectivity(selectivity);

let column_statistics = collect_new_statistics(
&input_stats.column_statistics,
Expand Down