-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: support inlist in LiteralGurantee for pruning #8654
Conversation
@@ -334,21 +357,33 @@ impl<'a> ColOpLit<'a> { | |||
binary_expr.op(), | |||
binary_expr.right().as_any(), | |||
); | |||
|
|||
let guarantee = match op { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since operator except =
and !=
would be evaluate in predicate_expr, I put the conversion process here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
let first_term = &terms[0]; | ||
if terms.iter().all(|term| { | ||
term.col.name() == first_term.col.name() | ||
&& term.op == first_term.op | ||
&& term.guarantee == Guarantee::In |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a != foo OR a != bar
would be filtered out here, So we don't need to check new_values.len() == 1
in aggregate_multi_conjunct
. If not, new_values.len() == 1
will confilct with the LiteralGuarantee result from InList
who may have multi literals. And I think refactored code is easier to follow for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also double checked that this will verify that first_term.guarantee
needs to be In
as well ✅
Thank you @my-vegetable-has-exploded -- I plan to review this tomorrow. cc @waynexia and @haohuaijin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @my-vegetable-has-exploded , LGTM!
9baab8d
to
e18ebba
Compare
]); | ||
let sql = | ||
"SELECT * FROM tbl WHERE \"String\" IN ('Hello_Not_Exists', 'Hello_Not_Exists2')"; | ||
let expr = sql_to_physical_plan(sql).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This previous test relied on the optimizer who will convert in_list
expr with no more 3 elements to and
expr. Now we don't need this. logical2physical
would convert logical_expr
to physical_expr
without optimizer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree -- this change makes sense
BTW I tried to improve these tests to avoid duplication in #8435 (not yet merged). I'll try and consolidate this test too when I get a chance
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed this PR carefully -- thank you @my-vegetable-has-exploded (and @haohuaijin for the review). This PR looks very nice 👌
I left some small suggestions but nothing I think is required -- we could make them as a follow on PR as well
]); | ||
let sql = | ||
"SELECT * FROM tbl WHERE \"String\" IN ('Hello_Not_Exists', 'Hello_Not_Exists2')"; | ||
let expr = sql_to_physical_plan(sql).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree -- this change makes sense
BTW I tried to improve these tests to avoid duplication in #8435 (not yet merged). I'll try and consolidate this test too when I get a chance
.as_any() | ||
.downcast_ref::<crate::expressions::InListExpr>() | ||
{ | ||
//Only support single-column inlist currently, multi-column inlist is not supported |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really nice 👍
let first_term = &terms[0]; | ||
if terms.iter().all(|term| { | ||
term.col.name() == first_term.col.name() | ||
&& term.op == first_term.op | ||
&& term.guarantee == Guarantee::In |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also double checked that this will verify that first_term.guarantee
needs to be In
as well ✅
@@ -334,21 +357,33 @@ impl<'a> ColOpLit<'a> { | |||
binary_expr.op(), | |||
binary_expr.right().as_any(), | |||
); | |||
|
|||
let guarantee = match op { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
col("b") | ||
.in_list(vec![lit(1), lit(2), lit(3)], false) | ||
.and(col("b").in_list(vec![lit(2), lit(3), lit(4)], false)), | ||
vec![in_guarantee("b", [2, 3])], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is very cool
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This type of simplification might also be valuable to do in the Expr simplifier code too (so that other optimizer passes could see simpler predicates).
However, I am not sure how common such predicates are 🤔
} | ||
|
||
#[test] | ||
fn test_inlist_with_disjunction() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also please add (negative) tests for the inlist directly used with OR
too ?
For example:
b IN (1, 2, 3) OR b = 2
b IN (1, 2, 3) OR b != 3
I think the code above will work correctly, but it would be nice to verify
col("b") | ||
.in_list(vec![lit(1), lit(2), lit(3)], false) | ||
.and(col("b").eq(lit(4)).or(col("b").eq(lit(5)))), | ||
vec![], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I theory this could be in_guarantee(1,2,3)
right? As it can only be true when b is 1,2,3.
However in this case this expression can never be true so maybe it doesn't really matter 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be invalid since intersection between [1,2,3] and [4,5] is empty.
let intersection = new_values
.into_iter()
.filter(|new_value| existing.literals.contains(*new_value))
.collect::<Vec<_>>();
// for an In guarantee, if the intersection is not empty, we can extend the guarantee
// e.g. `a IN (1,2,3) AND a IN (2,3,4)` is `a IN (2,3)`
// otherwise, we invalidate the guarantee
// e.g. `a IN (1,2,3) AND a IN (4,5,6)` is `a IN ()`, which is invalid
if !intersection.is_empty() {
existing.literals = intersection.into_iter().cloned().collect();
} else {
// at least one was not, so invalidate the guarantee
*entry = None;
}
BTW, I left a comment in #8437. Please cc when you are free.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be invalid since intersection between [1,2,3] and [4,5] is empty.
Right -- I wonder if in general for cases where the intersection is empty, can we infer that the expression can not be true ever 🤔 (there is no way to represent this today in the guarantee framework, maybe something like
enum Guarantee {
/// This predicate can not be true
CanNotBeTrue,
/// This predicate can only be true if all the `LiteralGurantee`s are valie
List(Vec<LiteralGurantee>)
}
And then return Gurantee::CanNotBeTrue
if we discover that the expression can not be true 🤔
BTW, I left a comment in #8437. Please cc when you are free.
I am not quite sure what you are referring to. #8437 doesn't seem to have had any recent activity that I can see.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can enhance eliminate_filter
optimizer rule to eliminate the filters(in where clause) that always not be true and replace by a EmptyRelation
.
like we do for where false
in
https://github.com/apache/arrow-datafusion/blob/8284371cb5dbeb5d0b1d50c420affb9be86b1599/datafusion/optimizer/src/eliminate_filter.rs#L55-L62
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I filed this idea in #8687
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great! Thanks 👍
@@ -120,7 +114,7 @@ impl LiteralGuarantee { | |||
/// expression is guaranteed to be `null` or `false`. | |||
/// | |||
/// # Notes: | |||
/// 1. `expr` must be a boolean expression. | |||
/// 1. `expr` must be a boolean expression or inlist expression. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More specifically, the boolean expr can only be in a form that can be transformed into a ColOpLit
Co-authored-by: Ruihang Xia <[email protected]> Co-authored-by: Andrew Lamb <[email protected]>
Thanks all❤️. |
@@ -780,6 +779,21 @@ mod test { | |||
.and(col("b").eq(lit(3)).or(col("b").eq(lit(4)))), | |||
vec![not_in_guarantee("b", [1, 2, 3]), in_guarantee("b", [3, 4])], | |||
); | |||
// b IN (1, 2, 3) OR b = 2 | |||
// TODO this should be in_guarantee("b", [1, 2, 3]) but currently we don't support to anylize this kind of disjunction. Only `ColOpLit OR ColOpLit` is supported. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Thanks again @my-vegetable-has-exploded |
if items more than 10, it still cant use bloom filter |
In case anyone is following along, the discussion about this is at #8436 (comment) |
Which issue does this PR close?
Closes #8436
Rationale for this change
What changes are included in this PR?
InList
support in LiteralGurantee coderow_groups.rs
Are these changes tested?
Are there any user-facing changes?