add expr::like and expr::notlike to pruning logic #508

nevi-me · 2021-06-05T10:20:08Z

Which issue does this PR close?

Closes #507.

Rationale for this change

Extending pruning to include string columns with LIKE

What changes are included in this PR?

Checks whether a LIKE or NOT LIKE pattern starts with %; if it does not, the condition is converted into an EQ filter.

Are there any user-facing changes?

No

nevi-me · 2021-06-05T10:23:17Z

@alamb is it enough to add the like and not like where I added them? Not sure of where else I need to change.

@Dandandan I'm unable to configure the TPC benchmark data (rather, converting the files to Parquet).

If you don't mind, may you please check if Q{14|16|20} perform any better with this change? They use like and not like that can be pruned.

codecov-commenter · 2021-06-05T10:39:18Z

Codecov Report

Merging #508 (1ee63dd) into master (a9d04ca) will increase coverage by 0.06%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #508      +/-   ##
==========================================
+ Coverage   76.07%   76.14%   +0.06%     
==========================================
  Files         155      155              
  Lines       26544    26626      +82     
==========================================
+ Hits        20194    20274      +80     
- Misses       6350     6352       +2

Impacted Files	Coverage Δ
datafusion/src/physical_optimizer/pruning.rs	`93.05% <100.00%> (+0.78%)`	⬆️
datafusion/src/optimizer/constant_folding.rs	`91.31% <0.00%> (-0.38%)`	⬇️
datafusion-cli/src/lib.rs	`0.00% <0.00%> (ø)`
datafusion-cli/src/main.rs	`0.00% <0.00%> (ø)`
datafusion/src/logical_plan/expr.rs	`84.96% <0.00%> (+0.36%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Dandandan · 2021-06-05T10:43:46Z

> @alamb is it enough to add the like and not like where I added them? Not sure of where else I need to change.
>
> @Dandandan I'm unable to configure the TPC benchmark data (rather, converting the files to Parquet).
>
> If you don't mind, may you please check if Q{14|16|20} perform any better with this change? They use like and not like that can be pruned.

I can try!
In my experience / to my knowledge, pruning only matters on sorted / bucketed / colocated data. The data generated by the TPC-H benchmark is very well distributed by default unless some kind of sorting is applied.
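
A minimal sketch of why the layout matters, under stated assumptions: this is not DataFusion code, the helper names (`row_group_stats`, `skippable_groups`, `sorted_keys`, `scattered_keys`) and the synthetic keys are made up for illustration, and an equality predicate stands in for LIKE to keep the check simple.

```rust
/// Compute (min, max) for each "row group" of `size` consecutive values.
/// Hypothetical helper; a real reader would take these from file metadata.
fn row_group_stats(values: &[String], size: usize) -> Vec<(String, String)> {
    values
        .chunks(size)
        .map(|group| {
            let min = group.iter().min().unwrap().clone();
            let max = group.iter().max().unwrap().clone();
            (min, max)
        })
        .collect()
}

/// Count row groups that can be skipped for `column = target`,
/// i.e. groups where `target` falls outside [min, max].
fn skippable_groups(stats: &[(String, String)], target: &str) -> usize {
    stats
        .iter()
        .filter(|(min, max)| target < min.as_str() || target > max.as_str())
        .count()
}

/// 100 keys "k000".."k099" in sorted order.
fn sorted_keys() -> Vec<String> {
    (0..100).map(|i| format!("k{:03}", i)).collect()
}

/// The same keys interleaved so that every group of 10 spans almost the full range.
fn scattered_keys() -> Vec<String> {
    (0..10)
        .flat_map(|g| (0..10).map(move |j| format!("k{:03}", j * 10 + g)))
        .collect()
}
```

With groups of 10 and the target `k042`, the sorted layout leaves a single candidate group, while in the interleaved layout every group's min/max range covers the target, so nothing can be skipped.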

Dandandan · 2021-06-05T10:50:51Z

datafusion/src/physical_optimizer/pruning.rs

+            match &**right {
+                // If the literal is a 'starts_with'
+                Expr::Literal(ScalarValue::Utf8(Some(string)))
+                    if !string.starts_with('%') =>


What about patterns like pat1%pat2?

andygrove: We should also consider escaped percent characters in the pattern. Example: LIKE '100\% %'

Dandandan: I think as we can evaluate the like expression anyway, it might be easier to support like / not like to the full extent instead of only "startswith".

nevi-me: I only focused on expressions that don't start with %, under the assumption that they would be a starts_with. I don't think we can support anything other than a starts_with because we translate the queries to min LtEq value && value LtEq max.

Or how would LIKE '100\% %' be evaluated?

Dandandan: you are right, that makes sense 👍 (I think escaping might be an issue in arrow-rs too)
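
A minimal sketch of a min/max range check for a `starts_with` prefix. This is not the PR's implementation: `truncate_bytes` and `may_contain_prefix` are hypothetical names, and the sketch assumes statistics are ordered byte-wise (which is how Rust compares `str`). It truncates min and max to the prefix length before comparing, so that a range such as [fooa, foob] is still kept for the prefix foo.

```rust
/// Truncate `s` to at most `n` bytes, returned as raw bytes so the cut
/// may fall inside a multi-byte UTF-8 character without panicking.
fn truncate_bytes(s: &str, n: usize) -> &[u8] {
    &s.as_bytes()[..n.min(s.len())]
}

/// Conservative check: could a container whose values all lie in
/// [min, max] hold a string that starts with `prefix`?
/// Returns `false` only when no such string can exist, so a `false`
/// result means the container can be skipped for `LIKE 'prefix%'`.
fn may_contain_prefix(min: &str, max: &str, prefix: &str) -> bool {
    let p = prefix.as_bytes();
    let n = p.len();
    // Strings starting with `prefix` form one contiguous lexicographic
    // interval; compare it against the truncated bounds.
    truncate_bytes(min, n) <= p && p <= truncate_bytes(max, n)
}
```

For example, `may_contain_prefix("apple", "cherry", "{")` is `false` because every value in the range sorts before `{`, whereas `may_contain_prefix("fooa", "foob", "foo")` is `true`.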

Dandandan · 2021-06-05T14:05:49Z

> @alamb is it enough to add the like and not like where I added them? Not sure of where else I need to change.
>
> @Dandandan I'm unable to configure the TPC benchmark data (rather, converting the files to Parquet).
>
> If you don't mind, may you please check if Q{14|16|20} perform any better with this change? They use like and not like that can be pruned.

Ah I see, it has some patterns that don't match any value at all... However we don't support query 14/16/20 yet:

#165
#167
#171

nevi-me · 2021-06-05T14:26:18Z

> However we don't support query 14/16/20 yet:

Aw :( it would have been great to see what impact the small change has.

Maybe @alamb will see better results in IOx given that their data would likely have patterns that could benefit from this pruning.

There's also parquet column indices that were introduced in 2.5.0. I'd like to work on them, as that's where we'll see bigger read improvements on sorted data.

alamb · 2021-06-05T15:46:16Z

Thanks @nevi-me -- this looks great. I agree IOx may very well benefit from this, as regexes are common in our query workload. I will try to review this PR carefully tomorrow.

Dandandan · 2021-06-05T16:07:30Z

@nevi-me

I did some checking with data from TPC-H:

Before:

CREATE EXTERNAL TABLE T STORED AS PARQUET LOCATION '../benchmarks/parquet/lineitem';
select l_orderkey from T where l_comment like '1%';
0 rows in set. Query took 150 milliseconds.
> 
select l_orderkey from T where l_comment like '{%';
0 rows in set. Query took 143 milliseconds.

After:

CREATE EXTERNAL TABLE T STORED AS PARQUET LOCATION '../benchmarks/parquet/lineitem';
select l_orderkey from T where l_comment like '1%';
0 rows in set. Query took 148 milliseconds.
> 
select l_orderkey from T where l_comment like '{%';
0 rows in set. Query took 43 milliseconds.

So looks like 👍

alamb

Thank you @nevi-me -- this is awesome.

I think there are a few items that need to be fixed (mentioned in comments), but overall the idea here is 💯 -- thank you so much

> Maybe @alamb will see better results in IOx given that their data would likely have patterns that could benefit from this pruning.

I think this is likely in IOx because we have several string columns that are often very low cardinality (like 8 distinct values for 100k rows) and are matched with regexp-style matches.

alamb · 2021-06-06T10:31:57Z

datafusion/src/physical_optimizer/pruning.rs

@@ -548,7 +549,7 @@ fn build_predicate_expression(
        // allow partial failure in predicate expression generation
        // this can still produce a useful predicate when multiple conditions are joined using AND
        Err(_) => {
-            return Ok(logical_plan::lit(true));
+            return Ok(unhandled);


👍 thanks I forgot that

alamb · 2021-06-06T10:42:31Z

datafusion/src/physical_optimizer/pruning.rs

+                    if !string.starts_with('%') =>
+                {
+                    let scalar_expr =
+                        Expr::Literal(ScalarValue::Utf8(Some(string.replace('%', ""))));


I am not sure if just removing % is correct:

For example, a pattern like foo%bar would be converted to foobar, and a value of fooaaabar would then be deemed "out of range" by this logic, even though it matches the original predicate foo%bar.

If instead, for foo%bar, we used foo (only the string up to the first unescaped %), I think the logic applies.

nevi-me: Thanks, fixed and changed tests
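
A small sketch contrasting the two ways of turning a pattern into a literal that are discussed in this thread. The function names are hypothetical, and escaped % characters are deliberately not handled here.

```rust
/// Removes every '%' from the pattern (the approach questioned above).
fn strip_wildcards(pattern: &str) -> String {
    pattern.replace('%', "")
}

/// Keeps only the text before the first '%'.
/// Escaped '%' characters are NOT handled in this sketch.
fn prefix_before_wildcard(pattern: &str) -> &str {
    pattern.split('%').next().unwrap_or("")
}
```

For foo%bar, `strip_wildcards` gives foobar, which the matching value fooaaabar does not start with, while `prefix_before_wildcard` gives foo, which every matching value must start with.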

alamb · 2021-06-06T10:44:09Z

datafusion/src/physical_optimizer/pruning.rs

+    #[test]
+    fn row_group_predicate_not_like() -> Result<()> {
+        let schema = Schema::new(vec![Field::new("c1", DataType::Utf8, true)]);
+        // test LIKE operator that can't be converted to a 'starts_with'


Suggested change

-        // test LIKE operator that can't be converted to a 'starts_with'
+        // test NOT LIKE operator that can't be converted to a 'starts_with'

alamb · 2021-06-06T10:44:17Z

datafusion/src/physical_optimizer/pruning.rs

+    #[test]
+    fn row_group_predicate_not_starts_with() -> Result<()> {
+        let schema = Schema::new(vec![Field::new("c1", DataType::Utf8, true)]);
+        // test LIKE operator that can't be converted to a 'starts_with'


Suggested change

-        // test LIKE operator that can't be converted to a 'starts_with'
+        // test NOT LIKE operator that can't be converted to a 'starts_with'

alamb · 2021-06-06T10:46:19Z

datafusion/src/physical_optimizer/pruning.rs

+    fn row_group_predicate_not_starts_with() -> Result<()> {
+        let schema = Schema::new(vec![Field::new("c1", DataType::Utf8, true)]);
+        // test LIKE operator that can't be converted to a 'starts_with'
+        let expr = col("c1").not().like(lit("Banana%"));


I think there is a difference between !(a LIKE b) and a NOT LIKE b -- so to test the NOT LIKE operator above this should be something like

Suggested change

-        let expr = col("c1").not().like(lit("Banana%"));
+        let expr = col("c1").not_like(lit("Banana%"));

https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/logical_plan/expr.rs#L455-L457

nevi-me: This explains why the filter was negated, thanks!
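
To make the difference concrete, here is a toy model of the two builder chains. It is not DataFusion's `Expr`: the enum, `col`, and `lit` below are simplified stand-ins, and the sketch assumes `.not()` wraps whatever expression has been built so far (the linked expr.rs has the real definitions).

```rust
/// Toy expression tree, NOT DataFusion's `Expr`; for illustration only.
#[derive(Debug, PartialEq)]
enum Expr {
    Column(String),
    Literal(String),
    Not(Box<Expr>),
    Like(Box<Expr>, Box<Expr>),
    NotLike(Box<Expr>, Box<Expr>),
}

impl Expr {
    /// Wraps the expression built so far in a logical NOT.
    fn not(self) -> Expr {
        Expr::Not(Box::new(self))
    }

    /// Builds `self LIKE pattern`.
    fn like(self, pattern: Expr) -> Expr {
        Expr::Like(Box::new(self), Box::new(pattern))
    }

    /// Builds `self NOT LIKE pattern`.
    fn not_like(self, pattern: Expr) -> Expr {
        Expr::NotLike(Box::new(self), Box::new(pattern))
    }
}

/// Stand-in for a column reference.
fn col(name: &str) -> Expr {
    Expr::Column(name.to_string())
}

/// Stand-in for a string literal.
fn lit(value: &str) -> Expr {
    Expr::Literal(value.to_string())
}
```

In this model `col("c1").not().like(lit("Banana%"))` produces a `Like` node whose left operand is `Not(c1)`, whereas `col("c1").not_like(lit("Banana%"))` produces a `NotLike` node over the plain column, so a test built with the first chain never exercises the NOT LIKE operator.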

alamb

Thanks @nevi-me

jorgecarleitao

Ready to go. Thanks a lot, @nevi-me !

jorgecarleitao · 2021-06-08T04:50:34Z

datafusion/src/physical_optimizer/pruning.rs

+                    if !string.starts_with('%') =>
+                {
+                    // Split the string to get the first part before '%'
+                    let split = string.split('%').next().unwrap().to_string();


won't this unwrap panic if the string does not contain any %? (if "like" always requires that, maybe we should throw an error instead?)

Dandandan: Like does not require %

alamb: I initially wondered the same thing -- lol! But my conclusion was "no it won't panic"

I made a quick playground that shows this working https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=049f4c1640386ff99c3b5e07085e0889

nevi-me: Yes, it won't panic because str::split() always yields at least one result: the full string if there's nothing to split by
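
The same point as a tiny standalone check, separate from the linked playground. `first_segment` is a hypothetical helper name used only for this sketch.

```rust
/// Returns the first segment produced by splitting `pattern` on '%'.
/// `str::split` with a char separator yields at least one item, so the
/// result is always `Some`, even for patterns without '%' or for "".
fn first_segment(pattern: &str) -> Option<&str> {
    pattern.split('%').next()
}
```

A pattern without any % comes back whole, and a pattern that starts with % gives an empty first segment rather than `None`.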

Dandandan · 2021-06-08T05:39:24Z

@nevi-me do you also want to address escaping the percentage character? \%

I know like_utf8 is broken in Arrow but it might be confusing to introduce this error at different parts.

\% should just match the literal % character.
E.g. nevi-\%x% should use nevi-%x as start, not nevi-\ as is the case currently.
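
One possible shape for an escape-aware prefix extraction, as a sketch only. `like_prefix` is a hypothetical name, backslash is assumed to be the escape character, and the sketch also stops at `_`, the single-character wildcard in standard SQL LIKE, which this thread does not discuss.

```rust
/// Returns the literal prefix of a LIKE pattern: everything before the
/// first unescaped wildcard, with escape backslashes removed.
fn like_prefix(pattern: &str) -> String {
    let mut prefix = String::new();
    let mut chars = pattern.chars();
    while let Some(c) = chars.next() {
        match c {
            // A backslash makes the next character literal.
            '\\' => match chars.next() {
                Some(escaped) => prefix.push(escaped),
                // Trailing backslash: keep it as a literal (an assumption).
                None => prefix.push('\\'),
            },
            // First unescaped wildcard ends the prefix.
            '%' | '_' => break,
            other => prefix.push(other),
        }
    }
    prefix
}
```

With this, the pattern nevi-\%x% yields the prefix nevi-%x, and 100\% % yields "100% " (including the trailing space).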

nevi-me · 2021-06-08T10:33:31Z

> @nevi-me do you also want to address escaping the percentage character? \%
>
> I know like_utf8 is broken in Arrow but it might be confusing to introduce this error at different parts.
>
> \% should just match the literal % character.
> E.g. nevi-\%x% should use nevi-%x as start, not nevi-\ as is the case currently.

@alamb @jorgecarleitao please don't merge this yet, so I can address the above.

Avoid accidentally merging, at Nevi's request, until the escaping of % is fixed.

alamb · 2021-06-27T10:52:44Z

Marking as draft so it is clearer from the list of PRs that there is planned work for this one

alamb · 2021-08-20T19:02:42Z

Closing stale PRs to keep PR review list manageable. Please reopen if that is a mistake

add expr::like and expr::notlike to pruning logic

1062d5c

Dandandan reviewed Jun 5, 2021

alamb reviewed Jun 6, 2021

address review feedback

1ee63dd

alamb previously approved these changes Jun 7, 2021

jorgecarleitao previously approved these changes Jun 8, 2021

jorgecarleitao reviewed Jun 8, 2021

alamb added the datafusion Changes in the datafusion crate label Jun 10, 2021

alamb marked this pull request as draft June 27, 2021 10:53

alamb added the stale-pr label Jul 13, 2021

alamb closed this Aug 20, 2021

alamb mentioned this pull request Jun 6, 2022

Update sqlparser-rs to 0.18.0 #2705

Merged
