Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add expr::like and expr::notlike to pruning logic #508

Closed
wants to merge 2 commits into from
Closed
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
96 changes: 94 additions & 2 deletions datafusion/src/physical_optimizer/pruning.rs
Original file line number Diff line number Diff line change
Expand Up @@ -42,6 +42,7 @@ use crate::{
logical_plan::{Expr, Operator},
optimizer::utils,
physical_plan::{planner::DefaultPhysicalPlanner, ColumnarValue, PhysicalExpr},
scalar::ScalarValue,
};

/// Interface to pass statistics information to [`PruningPredicates`]
Expand Down Expand Up @@ -548,7 +549,7 @@ fn build_predicate_expression(
// allow partial failure in predicate expression generation
// this can still produce a useful predicate when multiple conditions are joined using AND
Err(_) => {
return Ok(logical_plan::lit(true));
return Ok(unhandled);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 thanks I forgot that

}
};
let corrected_op = expr_builder.correct_operator(op);
Expand Down Expand Up @@ -586,8 +587,45 @@ fn build_predicate_expression(
.min_column_expr()?
.lt_eq(expr_builder.scalar_expr().clone())
}
Operator::Like => {
match &**right {
// If the literal is a 'starts_with'
Expr::Literal(ScalarValue::Utf8(Some(string)))
if !string.starts_with('%') =>
{
let scalar_expr =
Expr::Literal(ScalarValue::Utf8(Some(string.replace('%', ""))));
// Behaves like Eq
let min_column_expr = expr_builder.min_column_expr()?;
let max_column_expr = expr_builder.max_column_expr()?;
min_column_expr
.lt_eq(scalar_expr.clone())
.and(scalar_expr.lt_eq(max_column_expr))
}
_ => unhandled,
}
}
Operator::NotLike => {
match &**right {
// If the literal is a 'starts_with'
Expr::Literal(ScalarValue::Utf8(Some(string)))
if !string.starts_with('%') =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about patterns like pat1%pat2 ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also consider escaped percent characters in the pattern. Example: LIKE '100\% %'

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think as we can evaluate the like expression anyway, it might be easier to support like / not like to the full extent instead of only "startswith".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only focused on expressions that don't start with %, under the assumption that they would be a starts_with. I don't think we can support anything other than a starts_with because we translate the queries to min LtEq value && value LtEq max.

Or how would LIKE '100\% %' be evaluated?

Copy link
Contributor

@Dandandan Dandandan Jun 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right, that makes sense 👍 (I think escaping might be an issue in arrow-rs too)

{
let scalar_expr =
Expr::Literal(ScalarValue::Utf8(Some(string.replace('%', ""))));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if just removing % is correct:

For example in a pattern like foo%bar would be converted to foobar and when compared with a value of fooaaabar would be deemed "out of range" by this logic, even though it matches the original predicate foo%bar.

If instead, for foo%bar we used foo (only use the string up to the first unescaped %) I think then the logic applies.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, fixed and changed tests

// Behaves like Eq
let min_column_expr = expr_builder.min_column_expr()?;
let max_column_expr = expr_builder.max_column_expr()?;
// Inverse of Like
min_column_expr
.gt_eq(scalar_expr.clone())
.and(scalar_expr.gt_eq(max_column_expr))
}
_ => unhandled,
}
}
// other expressions are not supported
_ => logical_plan::lit(true),
_ => unhandled,
};
Ok(statistics_expr)
}
Expand Down Expand Up @@ -1095,6 +1133,60 @@ mod tests {
Ok(())
}

#[test]
fn row_group_predicate_starts_with() -> Result<()> {
let schema = Schema::new(vec![Field::new("c1", DataType::Utf8, true)]);
// test LIKE operator that is converted to a 'starts_with'
let expr = col("c1").like(lit("Banana%"));
let expected_expr =
"#c1_min LtEq Utf8(\"Banana\") And Utf8(\"Banana\") LtEq #c1_max";
let predicate_expr =
build_predicate_expression(&expr, &schema, &mut RequiredStatColumns::new())?;
assert_eq!(format!("{:?}", predicate_expr), expected_expr);

Ok(())
}

#[test]
fn row_group_predicate_like() -> Result<()> {
let schema = Schema::new(vec![Field::new("c1", DataType::Utf8, true)]);
// test LIKE operator that can't be converted to a 'starts_with'
let expr = col("c1").like(lit("%Banana%"));
let expected_expr = "Boolean(true)";
let predicate_expr =
build_predicate_expression(&expr, &schema, &mut RequiredStatColumns::new())?;
assert_eq!(format!("{:?}", predicate_expr), expected_expr);

Ok(())
}

#[test]
fn row_group_predicate_not_starts_with() -> Result<()> {
let schema = Schema::new(vec![Field::new("c1", DataType::Utf8, true)]);
// test LIKE operator that can't be converted to a 'starts_with'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// test LIKE operator that can't be converted to a 'starts_with'
// test NOT LIKE operator that can't be converted to a 'starts_with'

let expr = col("c1").not().like(lit("Banana%"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a difference between !(a LIKE b) and a NOT LIKE b -- so to test the NOT LIKE operator above this should be something like

Suggested change
let expr = col("c1").not().like(lit("Banana%"));
let expr = col("c1").not_like(lit("Banana%");

https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/logical_plan/expr.rs#L455-L457

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This explains why the filter was negated, thanks!

let expected_expr =
"NOT #c1_min LtEq Utf8(\"Banana\") And Utf8(\"Banana\") LtEq NOT #c1_max";
let predicate_expr =
build_predicate_expression(&expr, &schema, &mut RequiredStatColumns::new())?;
assert_eq!(format!("{:?}", predicate_expr), expected_expr);

Ok(())
}

#[test]
fn row_group_predicate_not_like() -> Result<()> {
let schema = Schema::new(vec![Field::new("c1", DataType::Utf8, true)]);
// test LIKE operator that can't be converted to a 'starts_with'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// test LIKE operator that can't be converted to a 'starts_with'
// test NOT LIKE operator that can't be converted to a 'starts_with'

let expr = col("c1").not().like(lit("%Banana%"));
let expected_expr = "Boolean(true)";
let predicate_expr =
build_predicate_expression(&expr, &schema, &mut RequiredStatColumns::new())?;
assert_eq!(format!("{:?}", predicate_expr), expected_expr);

Ok(())
}

#[test]
fn row_group_predicate_required_columns() -> Result<()> {
let schema = Schema::new(vec![
Expand Down