-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add guarantees to simplification #7467
Conversation
45ae442
to
a6b57e3
Compare
pub struct NullableInterval { | ||
/// The interval for the values | ||
pub values: Interval, | ||
/// A boolean interval representing whether the value is certainly valid | ||
/// (not null), certainly null, or has an unknown validity. This takes | ||
/// precedence over the values in the interval: if this field is equal to | ||
/// [Interval::CERTAINLY_FALSE], then the interval is certainly null | ||
/// regardless of what `values` is. | ||
pub is_valid: Interval, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @metesynnada. You're idea of using a pair of intervals has worked quite well. Since it's a new struct, this shouldn't impact the performance of your existing cp_solver
code.
I ended up not moving the Interval
struct into datafusion-common
. First, datafusion-optimizer
already depends on datafusion-physical-epxr
, so I didn't need to move it. Plus it has some dependencies within this crate that make it not easy to move. I think if we wanted to, we might be able to move it to datafusion-expr
, but I'd rather leave that to a different PR.
cc @ozankabak
a9aeecf
to
16d78c6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really neat @wjones127 -- thank you 🦾
I went through this code and the tests thoroughly and I think it is really nice. While I have some suggestions on code structure that I think can simplify it substantially I also think this PR could be merged as is. This PR's code is both well commented, and well tested.
Also, I think this could potentially be used to finally unify the pruning predicate code with the other interval analyses as described here: #5535 (basically it would simplify the expression given the statistics for parquet row groups and if the expression evaluated to a constant we could filter out the row group (as well as potentially skip the filter entirely)
@@ -149,6 +153,76 @@ impl<S: SimplifyInfo> ExprSimplifier<S> { | |||
|
|||
expr.rewrite(&mut expr_rewrite) | |||
} | |||
|
|||
/// Input guarantees and simplify the expression. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
/// .with_schema(schema); | ||
/// let simplifier = ExprSimplifier::new(context); | ||
/// | ||
/// // Expression: (x >= 3) AND (y + 2 < 10) AND (z > 5) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is really cool @wjones127
/// // z > 5. | ||
/// assert_eq!(output, expr_z); | ||
/// ``` | ||
pub fn simplify_with_guarantees<'a>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another potential API for this might be to store the guarantees on the simplifier -- like
let expr = ExprSimplifier::new(context)
.with_guarantees(guarantees)
.simplify()?
The downside is that the guarantees would have to be owned (aka a Vec
)
So I think this API is fine, I just wanted to mention the possibility
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The downside is that the guarantees would have to be owned (aka a Vec)
That doesn't seem to bad, I think. My imagined use case is that we re-use the same simplifier with different guarantees but the same predicate. Something like:
let mut simplifier = ExprSimplifier::new(context);
for row_group in file {
let guarantees = get_guarantees(row_groups.statistics);
simplifier = simplifier.with_guarantees(guarantees);
let group_predicate = simplifier.simplify(predicate);
// Do something with the predicate
}
So my main concern is that it's performant if handled in a loop like that. I think it should be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched to this API.
/// precedence over the values in the interval: if this field is equal to | ||
/// [Interval::CERTAINLY_FALSE], then the interval is certainly null | ||
/// regardless of what `values` is. | ||
pub is_valid: Interval, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got confused about this enumeration for a while as it seems to use a boolean interval to represent one of three states. It took me a while to grok that CERTAINLY_TRUE
mean that the the interval was not null, though the concept is very well documented ❤️
I think this representation also might be more error prone as an invalid state can be encoded (for example, what happens if is_valid
contains ScalarValue::Float32
, or what if the code uses values
when is_valid is CERTAINLY_FALSE
)
Perhaps we can use an enum
and let the rust compiler and type system ensure we always have valid states for NullableInterval
with something like :
pub enum NullableInterval {
/// The value is always null in this interval
Null,
/// The value may or may not be null in this interval. If it is non null its value is within
/// the specified values interval
MaybeNull { values : Interval },
/// The value is definitely not null in this interval and is within values
NotNull { vaules: Interval },
}
I think that might make the intent of some of the code above clearer as well, rather than checking for null ness using tests for CERTAINLY_FALSE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that you say it, that enum seems like the obvious right choice. I'll try that out and see how it simplifies things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made this change, and it's generally clearer. I did make the Null
variant store a datatype, otherwise I found we could get errors in the ConstEvaluator
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
match &expr { | ||
Expr::IsNull(inner) => { | ||
if let Some(interval) = self.intervals.get(inner.as_ref()) { | ||
if interval.is_valid == Interval::CERTAINLY_FALSE { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stylistically I think I would find this easier to read if it was in a method like
if interval.always_null() {
...
} else if interval.alway_not_null() {
..
} else {
...
}
If you use the enum
proposal below for NullableInteval
this would naturally become a match statement like
match self.intervals.get(inner.as_ref()) {
Some(NullableInterval::Null) => Ok(lit(true)),
Some(NullableInterval::NotNull{..}) => Ok(lit(false)),
_ => Ok(expr)
}
Which I think may express the intent of the code more concisely and clearly
BTW I think the CI failure is due to #7523 |
I also filed #7526 for some improvements I thought of while reviewing this PR |
I'll be applying this in Lance first, but I will come back later and integrate this with Parquet scanning. We'll want the Parquet scanning piece in delta-rs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR looks great.
/// }); | ||
/// | ||
/// ``` | ||
pub fn apply_operator(&self, op: &Operator, rhs: &Self) -> Result<Self> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍🏻
use datafusion_physical_expr::intervals::{Interval, IntervalBound, NullableInterval}; | ||
|
||
/// Rewrite expressions to incorporate guarantees. | ||
pub(crate) struct GuaranteeRewriter<'a> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you write a docstring for GuaranteeRewriter
?
@@ -129,6 +139,7 @@ impl<S: SimplifyInfo> ExprSimplifier<S> { | |||
expr.rewrite(&mut const_evaluator)? | |||
.rewrite(&mut simplifier)? | |||
.rewrite(&mut or_in_list_simplifier)? | |||
.rewrite(&mut guarantee_rewriter)? | |||
// run both passes twice to try an minimize simplifications that we missed | |||
.rewrite(&mut const_evaluator)? | |||
.rewrite(&mut simplifier) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think of such a loop to cover every simplification case and make it easier to accommodate future simplifications, or would it be unnecessary?
loop {
let original_expr = expr.clone();
expr = expr
.rewrite(&mut const_evaluator)?
.rewrite(&mut simplifier)?
.rewrite(&mut or_in_list_simplifier)?
.rewrite(&mut guarantee_rewriter)?;
if expr == original_expr {
break;
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a neat idea. I think we should try it in a follow on PR.
ALso, If we did this I would also suggest adding a limit on the number of loops (to avoid a "ping-poing" infinite loop where passes rewrite an expression back and forth)
|
||
Expr::BinaryExpr(BinaryExpr { left, op, right }) => { | ||
// Check if this is a comparison | ||
match op { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may prefer this function:
if !op.is_comparison_operator() {
return Ok(expr);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Much better now :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
/// assert_eq!(output, expr_z); | ||
/// ``` | ||
pub fn with_guarantees(mut self, guarantees: Vec<(Expr, NullableInterval)>) -> Self { | ||
self.guarantees = guarantees; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
|
||
let contains = expr_interval.contains(*interval)?; | ||
|
||
if contains.is_certainly_true() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like how easy this is to read now
datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs
Outdated
Show resolved
Hide resolved
I took the liberty of merging up from main and fixing the clippy and doc errors on this branch |
Thanks @alamb! |
Which issue does this PR close?
Closes #6171.
Rationale for this change
When scanning files we sometimes have statistics about min and max values and null counts. We can translate those into statements about the possible values of each column, and then use those two simplify expressions and predicates.
What changes are included in this PR?
This PR:
NullableInterval
type, which is essentially a pair ofInterval
s, one representing the valid values and a boolean interval representing the validity.ExprSimplifier::simplify_with_guarantees()
, which is similar toExprSimplifier::simplify()
except it allows passing pairs of column expressions andNullableIntervals
to allow for even more simplification. Right now, this handles,IS (NOT) NULL
,BETWEEN
, inequalities, plain column references, andInList
.Are these changes tested?
Most of the new tests reside in
guarantees.rs
.Are there any user-facing changes?
This does not change existing APIs, only adds new ones.