-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Combine multiple IN
lists in ExprSimplifier
#8949
Conversation
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this contribution @jayzhan211 -- I think this one is pretty close. I think the tests need to be moved with the other simplification tests and a few more cases added but all in all 🦾 -- really nice
// specific language governing permissions and limitations | ||
// under the License. | ||
|
||
//! This module implements a rule that simplifies expressions that is guaranteed to be true or false at planning time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
//! This module implements a rule that simplifies expressions that is guaranteed to be true or false at planning time | |
//! This module implements a rule that simplifies the values for `InList`s |
/// Simplify expressions that is guaranteed to be true or false to a literal boolean expression | ||
/// | ||
/// Rules: | ||
/// If both expressions are positive or negative, then we can apply intersection of both inlist expressions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the
/// If both expressions are positive or negative, then we can apply intersection of both inlist expressions | |
/// If both expressions are `IN` or `NOT IN`, then we can apply intersection of both lists |
/// | ||
/// Rules: | ||
/// If both expressions are positive or negative, then we can apply intersection of both inlist expressions | ||
/// 1. `a in (1,2,3) AND a in (4,5) -> a in (1,2,3,4,5)` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can update this example to show the lists having overlap (like a in (1,2,3) AND a in (2,3,4)
--> a in (1,2,3,4)
/// 1. `a in (1,2,3) AND a in (4,5) -> a in (1,2,3,4,5)` | ||
/// 2. `a not int (1,2,3) AND a not in (4,5,6) -> a not in (1,2,3,4,5,6)` | ||
/// If one of the expressions is negated, then we apply exception on positive expression | ||
/// 1. `a in (1,2,3) AND a not in (1,2,3,4,5) -> a in ()` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be rewritten to false
?
/// 1. `a in (1,2,3) AND a not in (1,2,3,4,5) -> a in ()` | |
/// 1. `a in (1,2,3) AND a not in (1,2,3,4,5) -> false` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I thought a in ()
is more understandable in this context
} | ||
} | ||
|
||
fn inlist_intersection(l1: &InList, l2: &InList, negated: bool) -> Result<Expr> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you could avoid a clone here potentially because the mutate function gets passed an owed
expr, like:
fn inlist_intersection(l1: InList, l2: InList, negated: bool) -> Result<Expr> {
(we could do this as a follow on PR too)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But we still need expr
or Binaray Expr
if failed to match if let
statement. I did not find out how to get the cloned value if we consume the value at first hand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I spent some time fooling with it this afternoon and here is what I came up with #8971
@@ -725,3 +725,57 @@ AggregateExec: mode=SinglePartitioned, gby=[p_partkey@2 as p_partkey], aggr=[SUM | |||
--------CoalesceBatchesExec: target_batch_size=8192 | |||
----------RepartitionExec: partitioning=Hash([ps_partkey@0], 4), input_partitions=1 | |||
------------MemoryExec: partitions=1, partition_sizes=[1] | |||
|
|||
# Inlist simplification |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend one or two tests in slt, and then adding the other tests in https://github.com/apache/arrow-datafusion/blob/795c71f3e8249847593a58dafe578ffcbcd3e012/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L1302-L1307 because:
- It would keep the tests closer to the other simplification tests
- It makes it easier to validate what code is causing the rewrite
create table t(x int) as values (1), (2), (3); | ||
|
||
query TT | ||
explain select x from t where x IN (1,2,3) AND x IN (4,5); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i have some additional suggested tests:
- Negative tests like
x IN (1,2,3) OR x IN (4,5)
(maybe that could actually be combined intoX IN (1,2,3,4,5)
🤔 - Tests with more than 2 clauses:
x IN (1,2,3) AND x IN (3,4) AND x IN (2,3))
- Tests with more than 2 clauses that are not all IN lists but the inlists could be combined
x IN (1,2,3) AND y = 2 AND x IN (2,3))
Maybe also some tests that could in theory be simplified like
`x IN (1,2,3) AND x = 2 AND x IN (2,3))` --> x=2
(we don't have to actually implement that simplification in this PR, but it would be good to document the potential further improvement in tests)
IN
lists IN
lists in ExprSimplifier
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
In the following on PR, I think we can move inlist related simplifier logic to
In this way, when the logic grows more complex, we can easily find out related conversion about specific Expr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great -- thank you @jayzhan211 very nice 👏
Great idea -- filed #8970 to track this |
Which issue does this PR close?
Closes #8687.
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?