-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add expr::like and expr::notlike to pruning logic #508
Conversation
@alamb is it enough to add the @Dandandan I'm unable to configure the TPC benchmark data (rather, converting the files to Parquet). If you don't mind, may you please check if Q{14|16|20} perform any better with this change? They use |
Codecov Report
@@ Coverage Diff @@
## master #508 +/- ##
==========================================
+ Coverage 76.07% 76.14% +0.06%
==========================================
Files 155 155
Lines 26544 26626 +82
==========================================
+ Hits 20194 20274 +80
- Misses 6350 6352 +2
Continue to review full report at Codecov.
|
I can try! |
match &**right { | ||
// If the literal is a 'starts_with' | ||
Expr::Literal(ScalarValue::Utf8(Some(string))) | ||
if !string.starts_with('%') => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about patterns like pat1%pat2
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also consider escaped percent characters in the pattern. Example: LIKE '100\% %'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think as we can evaluate the like expression anyway, it might be easier to support like / not like to the full extent instead of only "startswith".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only focused on expressions that don't start with %
, under the assumption that they would be a starts_with
. I don't think we can support anything other than a starts_with
because we translate the queries to min LtEq value && value LtEq max
.
Or how would LIKE '100\% %'
be evaluated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are right, that makes sense 👍 (I think escaping might be an issue in arrow-rs too)
Ah I see, it has some patterns that don't match any value at all... However we don't support query 14/16/20 yet: |
Aw :( it would have been great to see what impact the small change has. Maybe @alamb will see better results in iOX given that their data would likely have patterns that could benefit from this pruning. There's also parquet column indices that were introduced in 2.5.0. I'd like to work on them, as that's where we'll see bigger read improvements on sorted data. |
Thanks @nevi-me -- this looks great. I agree iOX may very well benefit from this as regex are common in our query workload. I will try and review this PR carefully tomorrow |
I did some checking with data from TPC-H: Before:
After:
So looks like 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @nevi-me -- this is awesome.
I think there are a few items that need to be fixed (mentioned in comments), but overall the idea here is 💯 -- thank you so much
Maybe @alamb will see better results in iOX given that their data would likely have patterns that could benefit from this pruning.
I think this is likely in IOx because we have several columns (string) columns that often are very very low cardinality (like 8 distinct values for 100k rows) and matched with regexp style matches
@@ -548,7 +549,7 @@ fn build_predicate_expression( | |||
// allow partial failure in predicate expression generation | |||
// this can still produce a useful predicate when multiple conditions are joined using AND | |||
Err(_) => { | |||
return Ok(logical_plan::lit(true)); | |||
return Ok(unhandled); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 thanks I forgot that
if !string.starts_with('%') => | ||
{ | ||
let scalar_expr = | ||
Expr::Literal(ScalarValue::Utf8(Some(string.replace('%', "")))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if just removing %
is correct:
For example in a pattern like foo%bar
would be converted to foobar
and when compared with a value of fooaaabar
would be deemed "out of range" by this logic, even though it matches the original predicate foo%bar
.
If instead, for foo%bar
we used foo
(only use the string up to the first unescaped %
) I think then the logic applies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, fixed and changed tests
#[test] | ||
fn row_group_predicate_not_like() -> Result<()> { | ||
let schema = Schema::new(vec![Field::new("c1", DataType::Utf8, true)]); | ||
// test LIKE operator that can't be converted to a 'starts_with' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// test LIKE operator that can't be converted to a 'starts_with' | |
// test NOT LIKE operator that can't be converted to a 'starts_with' |
#[test] | ||
fn row_group_predicate_not_starts_with() -> Result<()> { | ||
let schema = Schema::new(vec![Field::new("c1", DataType::Utf8, true)]); | ||
// test LIKE operator that can't be converted to a 'starts_with' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// test LIKE operator that can't be converted to a 'starts_with' | |
// test NOT LIKE operator that can't be converted to a 'starts_with' |
fn row_group_predicate_not_starts_with() -> Result<()> { | ||
let schema = Schema::new(vec![Field::new("c1", DataType::Utf8, true)]); | ||
// test LIKE operator that can't be converted to a 'starts_with' | ||
let expr = col("c1").not().like(lit("Banana%")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is a difference between !(a LIKE b)
and a NOT LIKE b
-- so to test the NOT LIKE
operator above this should be something like
let expr = col("c1").not().like(lit("Banana%")); | |
let expr = col("c1").not_like(lit("Banana%"); |
https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/logical_plan/expr.rs#L455-L457
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This explains why the filter was negated, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @nevi-me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ready to go. Thanks a lot, @nevi-me !
if !string.starts_with('%') => | ||
{ | ||
// Split the string to get the first part before '%' | ||
let split = string.split('%').next().unwrap().to_string(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
won't this unwrap
panic if the string does not contain any %
? (if "like" always requires that, maybe we should throw an error instead?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like does not require %
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I initially wondered the same thing -- lol! But my conclusion was "no it won't panic"
I made a quick playground that shows this working https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=049f4c1640386ff99c3b5e07085e0889
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, won't panic because String::split() will always return at least 1 result, the full string if there's nothing to spilt by
@nevi-me do you also want to address escaping the percentage character? I know
|
@alamb @jorgecarleitao please don't merge this yet, so I can address the above. |
Avoid accidentally merging, at Nevi's request, until fixed escaping of %,
Marking as draft so it is clearer from the list of PRs that there is planned work for this one |
Closing stale PRs to keep PR review list manageable. Please reopen if that is a mistake |
Which issue does this PR close?
Closes #507.
Rationale for this change
Extending pruning to include string columns with
LIKE
What changes are included in this PR?
Checks if a
LIKE
andNOT LIKE
condition don't start with%
, and converts them into aEQ
filter.Are there any user-facing changes?
No