Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

consider volatile function in simply_expression #13128

Merged
merged 10 commits into from
Nov 1, 2024

Conversation

Lordworms
Copy link
Contributor

Which issue does this PR close?

Closes #13060

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Oct 26, 2024
Copy link
Contributor

@jonathanc-n jonathanc-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this lgtm!

// Eliminate common factors in conjunctions e.g
// (A AND B) OR (A AND C) -> A AND (B OR C)
Expr::BinaryExpr(BinaryExpr {
left,
op: Or,
right,
}) if has_common_conjunction(&left, &right) => {
}) if has_common_conjunction(&left, &right)? => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem correct. It is ok to just check the presence of common non-volatile expressions in has_common_conjunction() to trigger simplification, but you need to deal with volatile expressions during simplification too and don't extract them as common.

E.g. can you please add a test like column1 = 2 AND random() = 0 OR column1 = 2 AND random() = 0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked postgresql and duckdb, they seems to merge the volatile expressions
image
image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's weird and seems incorrect to me. BTW, this is what Spark does:

scala> sql("select * from t where column1 = 2 AND rand() = 0 OR column1 = 2 AND rand() = 0").explain(true);
== Parsed Logical Plan ==
'Project [*]
+- 'Filter ((('column1 = 2) AND ('rand() = 0)) OR (('column1 = 2) AND ('rand() = 0)))
   +- 'UnresolvedRelation [t], [], false

== Analyzed Logical Plan ==
column1: int, column2: int
Project [column1#0, column2#1]
+- Filter (((column1#0 = 2) AND (rand(-1409433140814249342) = cast(0 as double))) OR ((column1#0 = 2) AND (rand(2383542950853957003) = cast(0 as double))))
   +- SubqueryAlias spark_catalog.default.t
      +- HiveTableRelation [`spark_catalog`.`default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [column1#0, column2#1], Partition Cols: []]

== Optimized Logical Plan ==
Filter (isnotnull(column1#0) AND ((column1#0 = 2) AND ((rand(-1409433140814249342) = 0.0) OR (rand(2383542950853957003) = 0.0))))
+- HiveTableRelation [`spark_catalog`.`default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [column1#0, column2#1], Partition Cols: []]

== Physical Plan ==
*(1) Filter ((isnotnull(column1#0) AND (column1#0 = 2)) AND ((rand(-1409433140814249342) = 0.0) OR (rand(2383542950853957003) = 0.0)))
+- Scan hive spark_catalog.default.t [column1#0, column2#1], HiveTableRelation [`spark_catalog`.`default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [column1#0, column2#1], Partition Cols: []]

let lhs: HashSet<&Expr> = iter_conjunction(lhs).collect();
iter_conjunction(rhs).any(|e| lhs.contains(&e))
iter_conjunction(rhs).try_fold(false, |acc, e| {
Copy link
Contributor

@peter-toth peter-toth Oct 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still looks a bit ugly. I mean, we just look for the first non-volatile expression and don't want to iterate over all expressions.

Actually, I think I made a mistake when I added is_volatile() with return type Result<bool>:

pub fn is_volatile(&self) -> Result<bool> {
self.exists(|expr| Ok(expr.is_volatile_node()))
}

As the closure we provide to exists() can't return any error, it would be safe to unwrap the result and return only bool.

@alamb, do you think we can fix Expr::is_volatile(), it would be an API breaking change though, to make it simpler and so we can keep using any() here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have refactored it, Thanks for your review

Copy link
Contributor

@peter-toth peter-toth Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thanks for the fix.

Maybe we can consider changing the return type of Expr::is_volatile() in a follow-up PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can consider changing the return type of Expr::is_volatile() in a follow-up PR.

I think this PR would be much cleaner if we fixed the signature of that first. But I guess the important thing is that we do fix it before it spreads keeps spreading.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the signature in #13206 which caused a logical conflict in this PR (failed to compile) and then I updated this PR to use the new signature

As @eejbyfeldt predicated, the PR became much cleaner




query TT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason for not adding these as unit tests in the expr_simplifier module instead?

They feel quite output place in file called explain.slt.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

let lhs: HashSet<&Expr> = iter_conjunction(lhs).collect();
iter_conjunction(rhs).any(|e| lhs.contains(&e))
iter_conjunction(rhs).try_fold(false, |acc, e| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can consider changing the return type of Expr::is_volatile() in a follow-up PR.

I think this PR would be much cleaner if we fixed the signature of that first. But I guess the important thing is that we do fix it before it spreads keeps spreading.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Lordworms @findepi and @peter-toth

I agree with @eejbyfeldt in https://github.com/apache/datafusion/pull/13128/files#r1822606751 that we should add some unit tests in datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs in addition to the explain .slt test

I also have a proposal that I think would simplify this PR: #13206

fn has_common_conjunction(lhs: &Expr, rhs: &Expr) -> Result<bool, DataFusionError> {
let lhs_set: HashSet<&Expr> = iter_conjunction(lhs).collect();
for e in iter_conjunction(rhs) {
if lhs_set.contains(&e) && !e.is_volatile()? {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked a little more at Expr::is_volatile and I actually think it can't return any errors. If we changed the signature I think this PR would be significantly simpler

@github-actions github-actions bot added the core Core DataFusion crate label Oct 31, 2024
@github-actions github-actions bot removed the core Core DataFusion crate label Oct 31, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Lordworms

.eq(lit(2))
.or(col("column1").eq(lit(2)).and(rand.clone().eq(lit(0))));

assert_eq!(simplify(expr), col("column1").eq(lit(2)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would seem to me this should still be the same thing -- namely that we shouldn't remove the rand() = 0 condition 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, this PR still seems like it is an improvement over what is on main so we could potentially fix this in a follow on PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be something wrong here, let me double check

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think here should remove the rand() = 0 since we must fulfill the column1 = 2 first? according to the description in the issue? Don't know if I got it wrong....
image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also Happy Halloween!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also Happy Halloween!

image

@alamb

This comment was marked as outdated.

@alamb
Copy link
Contributor

alamb commented Nov 1, 2024

I merged up from main to resolve a logical conflict with this PR

@github-actions github-actions bot removed the sqllogictest SQL Logic Tests (.slt) label Nov 1, 2024
@alamb alamb merged commit 6b76a35 into apache:main Nov 1, 2024
24 checks passed
@alamb
Copy link
Contributor

alamb commented Nov 1, 2024

Thanks again @Lordworms and @eejbyfeldt and @peter-toth and @jonathanc-n

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
optimizer Optimizer rules
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SimplifyExpressions should not consider volatile expressions equal for rewrites
5 participants