Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify EXPR LIKE 'constant' to expr = 'constant' #13061

Closed
wants to merge 5 commits into from

Conversation

adriangb
Copy link
Contributor

@adriangb adriangb commented Oct 22, 2024

@github-actions github-actions bot added the optimizer Optimizer rules label Oct 22, 2024
@@ -41,6 +41,7 @@ async-trait = { workspace = true }
chrono = { workspace = true }
datafusion-common = { workspace = true, default-features = true }
datafusion-expr = { workspace = true }
datafusion-functions = { workspace = true }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this is okay, or if there is a way to avoid it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to make some interface for rewriting the function through 🤔

Perhaps adding a method to SimplifyInfo like

/// return a new Expr with a function call for exact string prefix max
fn make_starts_with(base: Expr, prefix: Expr) -> Option<Expr>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm not sure what you mean. Is this a pattern already in use somewhere I can learn from?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was suggest adding a method to https://docs.rs/datafusion/latest/datafusion/logical_expr/simplify/trait.SimplifyInfo.html

However i can't quite figure out how to plumb it in without getting messy

THe trick is that this rewrite is based on the semantcs of starts_with -- so I think logically it would need to live along side the implementation of starts_with because it depends on those semantics.

And similarly, to do the rewrite in the pruning predicate we need something similar (somehow to provide a per function rewrite) 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm okay I'm happy to work on this more but I fear I am not familiar enough with the relationship between all of these things to figure out how to wire everything up correctly. Maybe we pause this work until someone with more familiarity of these bits of code can suggest a way to structure this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will try and help push this forward this week. I am sorry for the delay but I have been super busy the last few days

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem 😄

@adriangb
Copy link
Contributor Author

Wow, this is now rewriting col ~ '^foo' -> col like 'foo%' -> starts_with(col, 'foo') 🤯
If we implement stats pushdown for starts_with(col, 'foo') we'll automatically get col ~ '^foo' pushed down 💪🏻

@alamb
Copy link
Contributor

alamb commented Oct 30, 2024

I thought more about this PR the other day -- given the dependecy you have identified, I wonder if we might actually want to do the opposite

Specifically rewrite

starts_with(x, 'foo')

into

x LIKE 'foo%'

With the rationale that like already has quite a bit of special case / optimization.

I'll try and find a ticket to track the idea

@adriangb
Copy link
Contributor Author

Hmm interesting. startswith does seem simpler to me in principle, but I'd understand if it's better to do the other way around.

The dependency is a bit strange to me, I'm not sure why functions would depend on the optimizer, I would think the optimizer could depend on functions.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adriangb what would you think about updating this PR to handle the no wild card case.

The more I think about the prefix case the less I am convinced that translating to some other function would be valuable. Instead I am thinking maybe we can ensure the LIKE evaluation is fast (perhaps it can detect this case and invoke a better kernel) 🤔

@adriangb
Copy link
Contributor Author

To handle ONLY the no wildcard case? Happy to do that if that's all the progress we can make for now.

if index == pattern.len() - 1 && !case_insensitive && escape_char.is_none() {
let prefix = pattern[..index].to_string();
let new_expr = Expr::ScalarFunction(ScalarFunction {
func: datafusion_functions::string::starts_with(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Today a downstream project can choose which functions to use, this is configurable in the session, right? The DF repertoire of functions serves as a convenient package, but not mandatory.
Here we're inserting implicit function call to a specific function.
Is this following a pre-existing pattern? if so, i am OK with this.
But maybe it would be better to eg consult session state whether this function was supposed to be used. What do you think?

@@ -1446,18 +1446,54 @@ impl<'a, S: SimplifyInfo> TreeNodeRewriter for Simplifier<'a, S> {

// Rules for Like
Expr::Like(Like {
expr,
pattern,
expr: ref like_expr,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep expr as expr
use like_expr when referring to the whole Expr::Like thing

Comment on lines 1458 to 1459
if pattern == "%" && !is_null(like_expr) {
return Ok(Transformed::yes(lit(!negated)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is cool idea, but is_null(like_expr) checks makes it incorrect.
This function returns "is this always null". Applied to a column reference, it will return false, so this optimization will kick in, but the column values are still nullable.

The "more correct" way would be to

  • optimize WHERE col LIKE '%' to WHERE col IS NOT NULL
  • optimize WHERE col NOT LIKE '%' to WHERE false.

However, this simplification cannot be done here. It can only be done for top-level filter conjuncts, or when expr being simplified is immediately used as a condition (eg IF(expr, ...)). In other cases, this need to be NULL-preserving boolean logic:

trino> SELECT a, a LIKE '%', a NOT LIKE '%'
    -> FROM (VALUES NULL, '', 'x') t(a);
  a   | _col1 | _col2
------+-------+-------
 NULL | NULL  | NULL
      | true  | false
 x    | true  | false
(3 rows)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The easiest solution would be to just drop this special case.
I.e. expr LIKE '%' would get simplified to starts_with(expr, '')

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See also #12637 (comment)

Comment on lines 1462 to 1463
let pct_wildcard_index = pattern.find('%');
let underscore_index = pattern.find('_');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honor escape_char


// Handle single % at end case (prefix search)
if let Some(index) = pct_wildcard_index {
if index == pattern.len() - 1 && !case_insensitive && escape_char.is_none() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

&& underscore_index.is_none()

@@ -3589,6 +3635,18 @@ mod tests {
let expr = not_ilike(col("c1"), "%");
assert_eq!(simplify(expr), lit(false));

let expr = like(col("c1"), "a%");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add test cases

  • with % pattern

  • with _ pattern (can be simplified to a length check, but that can be follow up)

  • with %% and %%% patterns (can be simplified just like single %)

  • with __ and ___ patterns (can be simplified to a length check, but that can be follow up)

  • with patterns combining % and _, eg a_b%, a%b_ (these cannot simplified) and _%, %_ (these can be simplified to a length check, but that can be follow up)

@findepi
Copy link
Member

findepi commented Nov 2, 2024

do we have equivalent of #12978 for starts_with?

@adriangb
Copy link
Contributor Author

adriangb commented Nov 2, 2024

do we have equivalent of #12978 for starts_with?

No but I think it will have a lot of code sharing with #12978 and should be relatively easy to implement. After we merge that PR I'd be happy to go back in and implement it, doing whatever refactoring is needed to get good code sharing.

@adriangb adriangb changed the title rewrite prefix and constant cases of like to startswith and equality Simplify EXPR LIKE 'constant' to expr = 'constant' Nov 3, 2024
@adriangb
Copy link
Contributor Author

adriangb commented Nov 3, 2024

@alamb @findepi I redid this PR now to only handle the equality cases and close the issue.

I do still think it'd be nice to figure out how to rewrite to startswith but it seems like that requires some larger scaler refactoring that I don't have the knowledge or time to carry out.

Comment on lines 1477 to 1478
if !is_null(&like_expr.expr) && pattern_str == "%" {
Transformed::yes(lit(!like_expr.negated))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is correct. Did you consider #13061 (comment) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code was already here (see diff above). Did I change the logic in any way?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack. Can you please confirm or disconfirm the correctness concern?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this is incorrect but this is the logic that was there before this PR. I'm happy to remove it. I guess no one has complained because the case that breaks is SELECT NULL LIKE '%' which isn't something that happens under normal usage.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this is incorrect but this is the logic that was there before this PR.

thanks for confirming and understood it's pre-existing.

Can it be fixed though?
i personally would prefer that git annotate on the file wouldn't point at me as the author of something that's known to be incorrect.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My latest commit removes that special case. I'm not sure how to fix it, your comment suggests it can't be fixed right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess no one has complained because the case that breaks is SELECT NULL LIKE '%' which isn't something that happens under normal usage.

actually the opposite.
this was correct only for this case
and wrong for all other ... LIKE '%' queries.

see #13259

Comment on lines 1479 to 1480
} else if !pattern_str.contains(['%', '_'].as_ref()) {
// If the pattern does not contain any wildcards, we can simplify the like expression to an equality expression
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add TODO to handle escaped search characters in a follow-up.
Escaped wildcards are not uncommon in certain scenarios - eg correct use of JDBC DatabaseMetaData will produce LIKE foo\_bar when inspecting object foo_bar.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made it not simplify if there is an escape character for now

let expr = like(col("c1"), "a_");
assert_eq!(simplify(expr), col("c1").like(lit("a_")));
let expr = not_like(col("c1"), "a_");
assert_eq!(simplify(expr), col("c1").not_like(lit("a_")));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a test cases where pattern is constant NULL (it won't be simplified today, that's fine)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in 6c847fa

Transformed::yes(lit(!like_expr.negated))
} else if !pattern_str.contains(['%', '_'].as_ref()) {
// If the pattern does not contain any wildcards, we can simplify the like expression to an equality expression
if like_expr.escape_char.is_some() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can this condition be checked on the same line with !pattern_str.contains(['%', '_']
it would reduce code branching.

Expr::Literal(ScalarValue::Utf8(Some(pattern_str))) if pattern_str == "%"
) =>
{
Transformed::yes(lit(!negated))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this is correct, but requires updating some test cases.
i am doing this in #13259

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome, I'll wait for your work to be merged then continue here

@adriangb
Copy link
Contributor Author

adriangb commented Nov 5, 2024

Closing in favor of #13260

@adriangb adriangb closed this Nov 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
optimizer Optimizer rules
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Simplify EXPR LIKE 'constant' to expr = 'constant'
3 participants