Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Combine multiple IN lists in ExprSimplifier #8949

Merged
merged 12 commits into from
Jan 23, 2024

Conversation

jayzhan211
Copy link
Contributor

Which issue does this PR close?

Closes #8687.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
@github-actions github-actions bot added optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Jan 22, 2024
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
@jayzhan211 jayzhan211 marked this pull request as ready for review January 22, 2024 14:47
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this contribution @jayzhan211 -- I think this one is pretty close. I think the tests need to be moved with the other simplification tests and a few more cases added but all in all 🦾 -- really nice

// specific language governing permissions and limitations
// under the License.

//! This module implements a rule that simplifies expressions that is guaranteed to be true or false at planning time
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
//! This module implements a rule that simplifies expressions that is guaranteed to be true or false at planning time
//! This module implements a rule that simplifies the values for `InList`s

/// Simplify expressions that is guaranteed to be true or false to a literal boolean expression
///
/// Rules:
/// If both expressions are positive or negative, then we can apply intersection of both inlist expressions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the

Suggested change
/// If both expressions are positive or negative, then we can apply intersection of both inlist expressions
/// If both expressions are `IN` or `NOT IN`, then we can apply intersection of both lists

///
/// Rules:
/// If both expressions are positive or negative, then we can apply intersection of both inlist expressions
/// 1. `a in (1,2,3) AND a in (4,5) -> a in (1,2,3,4,5)`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can update this example to show the lists having overlap (like a in (1,2,3) AND a in (2,3,4) --> a in (1,2,3,4)

/// 1. `a in (1,2,3) AND a in (4,5) -> a in (1,2,3,4,5)`
/// 2. `a not int (1,2,3) AND a not in (4,5,6) -> a not in (1,2,3,4,5,6)`
/// If one of the expressions is negated, then we apply exception on positive expression
/// 1. `a in (1,2,3) AND a not in (1,2,3,4,5) -> a in ()`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be rewritten to false?

Suggested change
/// 1. `a in (1,2,3) AND a not in (1,2,3,4,5) -> a in ()`
/// 1. `a in (1,2,3) AND a not in (1,2,3,4,5) -> false`

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I thought a in () is more understandable in this context

}
}

fn inlist_intersection(l1: &InList, l2: &InList, negated: bool) -> Result<Expr> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could avoid a clone here potentially because the mutate function gets passed an owed expr, like:

fn inlist_intersection(l1: InList, l2: InList, negated: bool) -> Result<Expr> {

(we could do this as a follow on PR too)

Copy link
Contributor Author

@jayzhan211 jayzhan211 Jan 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we still need expr or Binaray Expr if failed to match if let statement. I did not find out how to get the cloned value if we consume the value at first hand.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spent some time fooling with it this afternoon and here is what I came up with #8971

@@ -725,3 +725,57 @@ AggregateExec: mode=SinglePartitioned, gby=[p_partkey@2 as p_partkey], aggr=[SUM
--------CoalesceBatchesExec: target_batch_size=8192
----------RepartitionExec: partitioning=Hash([ps_partkey@0], 4), input_partitions=1
------------MemoryExec: partitions=1, partition_sizes=[1]

# Inlist simplification
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend one or two tests in slt, and then adding the other tests in https://github.com/apache/arrow-datafusion/blob/795c71f3e8249847593a58dafe578ffcbcd3e012/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L1302-L1307 because:

  1. It would keep the tests closer to the other simplification tests
  2. It makes it easier to validate what code is causing the rewrite

create table t(x int) as values (1), (2), (3);

query TT
explain select x from t where x IN (1,2,3) AND x IN (4,5);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i have some additional suggested tests:

  1. Negative tests like x IN (1,2,3) OR x IN (4,5) (maybe that could actually be combined into X IN (1,2,3,4,5) 🤔
  2. Tests with more than 2 clauses: x IN (1,2,3) AND x IN (3,4) AND x IN (2,3))
  3. Tests with more than 2 clauses that are not all IN lists but the inlists could be combined x IN (1,2,3) AND y = 2 AND x IN (2,3))

Maybe also some tests that could in theory be simplified like

`x IN (1,2,3) AND x = 2 AND x IN (2,3))`   --> x=2

(we don't have to actually implement that simplification in this PR, but it would be good to document the potential further improvement in tests)

@alamb alamb changed the title Inlist simplified Combine multiple IN lists Jan 22, 2024
@alamb alamb changed the title Combine multiple IN lists Combine multiple IN lists in ExprSimplifier Jan 22, 2024
Signed-off-by: jayzhan211 <[email protected]>
@jayzhan211 jayzhan211 marked this pull request as draft January 23, 2024 01:02
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Jan 23, 2024

In the following on PR, I think we can move inlist related simplifier logic to inlist_simplifier.
Steps in inlist_simplifier

  1. early stage simplification like expr IN () --> false, null IN/NOT IN
  2. convert long OR chain or AND chain to InList expressions like what or_in_list_simplifier does
  3. shorten InList expressions like what inlist_simplifier in this PR does
  4. convert final inlist to OR chain or AND chain if len < 3. Move inlist related simplifier out from expr_simplifier

In this way, when the logic grows more complex, we can easily find out related conversion about specific Expr

@jayzhan211 jayzhan211 marked this pull request as ready for review January 23, 2024 14:03
@jayzhan211 jayzhan211 requested a review from alamb January 23, 2024 14:03
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great -- thank you @jayzhan211 very nice 👏

@alamb
Copy link
Contributor

alamb commented Jan 23, 2024

In the following on PR, I think we can move inlist related simplifier ..

Great idea -- filed #8970 to track this

@alamb alamb merged commit 558b3d6 into apache:main Jan 23, 2024
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove / simplify "impossible" predicates
2 participants