-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
consider volatile function in simply_expression #13128
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this lgtm!
// Eliminate common factors in conjunctions e.g | ||
// (A AND B) OR (A AND C) -> A AND (B OR C) | ||
Expr::BinaryExpr(BinaryExpr { | ||
left, | ||
op: Or, | ||
right, | ||
}) if has_common_conjunction(&left, &right) => { | ||
}) if has_common_conjunction(&left, &right)? => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem correct. It is ok to just check the presence of common non-volatile expressions in has_common_conjunction()
to trigger simplification, but you need to deal with volatile expressions during simplification too and don't extract them as common.
E.g. can you please add a test like column1 = 2 AND random() = 0 OR column1 = 2 AND random() = 0
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's weird and seems incorrect to me. BTW, this is what Spark does:
scala> sql("select * from t where column1 = 2 AND rand() = 0 OR column1 = 2 AND rand() = 0").explain(true);
== Parsed Logical Plan ==
'Project [*]
+- 'Filter ((('column1 = 2) AND ('rand() = 0)) OR (('column1 = 2) AND ('rand() = 0)))
+- 'UnresolvedRelation [t], [], false
== Analyzed Logical Plan ==
column1: int, column2: int
Project [column1#0, column2#1]
+- Filter (((column1#0 = 2) AND (rand(-1409433140814249342) = cast(0 as double))) OR ((column1#0 = 2) AND (rand(2383542950853957003) = cast(0 as double))))
+- SubqueryAlias spark_catalog.default.t
+- HiveTableRelation [`spark_catalog`.`default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [column1#0, column2#1], Partition Cols: []]
== Optimized Logical Plan ==
Filter (isnotnull(column1#0) AND ((column1#0 = 2) AND ((rand(-1409433140814249342) = 0.0) OR (rand(2383542950853957003) = 0.0))))
+- HiveTableRelation [`spark_catalog`.`default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [column1#0, column2#1], Partition Cols: []]
== Physical Plan ==
*(1) Filter ((isnotnull(column1#0) AND (column1#0 = 2)) AND ((rand(-1409433140814249342) = 0.0) OR (rand(2383542950853957003) = 0.0)))
+- Scan hive spark_catalog.default.t [column1#0, column2#1], HiveTableRelation [`spark_catalog`.`default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [column1#0, column2#1], Partition Cols: []]
let lhs: HashSet<&Expr> = iter_conjunction(lhs).collect(); | ||
iter_conjunction(rhs).any(|e| lhs.contains(&e)) | ||
iter_conjunction(rhs).try_fold(false, |acc, e| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This still looks a bit ugly. I mean, we just look for the first non-volatile expression and don't want to iterate over all expressions.
Actually, I think I made a mistake when I added is_volatile()
with return type Result<bool>
:
datafusion/datafusion/expr/src/expr.rs
Lines 1581 to 1583 in e22d231
pub fn is_volatile(&self) -> Result<bool> { | |
self.exists(|expr| Ok(expr.is_volatile_node())) | |
} |
As the closure we provide to
exists()
can't return any error, it would be safe to unwrap the result and return only bool
.
@alamb, do you think we can fix Expr::is_volatile()
, it would be an API breaking change though, to make it simpler and so we can keep using any()
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have refactored it, Thanks for your review
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, thanks for the fix.
Maybe we can consider changing the return type of Expr::is_volatile()
in a follow-up PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can consider changing the return type of
Expr::is_volatile()
in a follow-up PR.
I think this PR would be much cleaner if we fixed the signature of that first. But I guess the important thing is that we do fix it before it spreads keeps spreading.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the signature in #13206 which caused a logical conflict in this PR (failed to compile) and then I updated this PR to use the new signature
As @eejbyfeldt predicated, the PR became much cleaner
|
||
|
||
|
||
query TT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason for not adding these as unit tests in the expr_simplifier
module instead?
They feel quite output place in file called explain.slt
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
let lhs: HashSet<&Expr> = iter_conjunction(lhs).collect(); | ||
iter_conjunction(rhs).any(|e| lhs.contains(&e)) | ||
iter_conjunction(rhs).try_fold(false, |acc, e| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can consider changing the return type of
Expr::is_volatile()
in a follow-up PR.
I think this PR would be much cleaner if we fixed the signature of that first. But I guess the important thing is that we do fix it before it spreads keeps spreading.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Lordworms @findepi and @peter-toth
I agree with @eejbyfeldt in https://github.com/apache/datafusion/pull/13128/files#r1822606751 that we should add some unit tests in datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs in addition to the explain .slt test
I also have a proposal that I think would simplify this PR: #13206
fn has_common_conjunction(lhs: &Expr, rhs: &Expr) -> Result<bool, DataFusionError> { | ||
let lhs_set: HashSet<&Expr> = iter_conjunction(lhs).collect(); | ||
for e in iter_conjunction(rhs) { | ||
if lhs_set.contains(&e) && !e.is_volatile()? { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked a little more at Expr::is_volatile
and I actually think it can't return any errors. If we changed the signature I think this PR would be significantly simpler
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Lordworms
.eq(lit(2)) | ||
.or(col("column1").eq(lit(2)).and(rand.clone().eq(lit(0)))); | ||
|
||
assert_eq!(simplify(expr), col("column1").eq(lit(2))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would seem to me this should still be the same thing -- namely that we shouldn't remove the rand() = 0
condition 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, this PR still seems like it is an improvement over what is on main so we could potentially fix this in a follow on PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be something wrong here, let me double check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also Happy Halloween!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment was marked as outdated.
This comment was marked as outdated.
I merged up from main to resolve a logical conflict with this PR |
Thanks again @Lordworms and @eejbyfeldt and @peter-toth and @jonathanc-n |
Which issue does this PR close?
Closes #13060
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?