-
Notifications
You must be signed in to change notification settings - Fork 164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: ANSI support for Add #616
base: main
Are you sure you want to change the base?
Conversation
@dharanad Here is my solution based on your PR. |
This look fine, once this PR is merged. I will extend my solution to this to solve #535 . |
let boolean_array = array | ||
.as_any() | ||
.downcast_ref::<BooleanArray>() | ||
.expect("Expected BooleanArray"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we are using expect
here this function may panic, can instead return an errors ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Solved! thanks
} | ||
|
||
fn check_int_overflow(&self, batch: &RecordBatch, result: &ColumnarValue) -> Result<()> { | ||
let check_overflow_expr = Arc::new(BinaryExpr::new( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since arrow provide overflow checked kernels (apache/arrow-rs#2643) does it makes sense to use those directly rather than re-implementing it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to review it. Thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @eejbyfeldt , I've been looking at datafusion and I don't see any options to use those arrow operations from datafusion physical expressions. Do you know if this is implemented yet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@planga82 Can we make use of this arithmetic kernel https://docs.rs/arrow/latest/arrow/compute/kernels/numeric/fn.add.html to compute the addition and throw error based on the result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we do something like this ?
fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue> {
use arrow::compute::kernels::numeric::add;
let lhs = self.left.evaluate(batch)?;
let rhs = self.left.evaluate(batch)?;
match (self.op, self.eval_mode) {
(Operator::Plus, EvalMode::Ansi) => {
return apply(&lhs, &rhs, add)
},
_ => {
self.inner.evaluate(batch)
}
}
}
But the visibility of apply
fn is more restrcited pub (crate)
. We might need to raise a PR with datafusion to make it public
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My concern is that if we directly use the kernels to perform the operations instead of reusing the physical datafusion expression we may lose functionality or have to reimplement it here.
From my point of view in comet we should try to translate from Spark to Datafusuion and add in the form of Wrapper the functionality that may be missing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well put. I agree to what you are saying. I was thinking to override the implementation but your thoughts makes much more sense, safer & cleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @eejbyfeldt that we should just use the existing add_checked
or add_wrapped
kernels in arrow-rs that already provide the functionality that we need (unless we discover any compatibility issue compared to the JVM addExact logic). I will create an example to demonstrate how to use this and will post here later today
Thanks for the contribution @planga82. I am reviewing this today. |
match self.inner.evaluate(batch) { | ||
Ok(result) => { | ||
self.fail_on_overflow(batch, &result)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Evaluating the same expression twice is going to be expensive. We should just evaluate once, either checking for overflows, or not, depending on eval mode.
Arc::new(BinaryExpr::new( | ||
Arc::new(BinaryExpr::new( | ||
self.left.clone(), | ||
Operator::BitwiseXor, | ||
self.inner.clone(), | ||
)), | ||
Operator::BitwiseAnd, | ||
Arc::new(BinaryExpr::new( | ||
self.right.clone(), | ||
Operator::BitwiseXor, | ||
self.inner.clone(), | ||
)), | ||
)), | ||
Operator::Lt, | ||
Self::zero_literal(&result.data_type())?, | ||
)); | ||
match check_overflow_expr.evaluate(batch)? { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very expensive way of implementing this. We don't need to use DataFusion to perform simple math operations when we can just implement this in Rust directly as we process the arrays. I think we can delegate to arrow-rs though and avoid all of this. As stated in another comment, I will provide an example later today.
@planga82 Please see apache/datafusion#11400 that I just created against DataFusion, which possibly gives us most of what we need, although the same principals could be applied directly in Comet. They key change was adding a |
This is amazing, thanks for apache/datafusion#11400 . I have considered this idea, but I wasn't sure how to suggest this change. Correct me if i am wrong, so this change will go in the next datafusion release and we are blocked on supporting ANSI until then ? Or can we plan to update datafusion dep updated once 11400 is merged ? |
Thank you very much @andygrove for the explanation and the PR. In addition to @dharanad question In Spark we have three different behaviors ANSI mode disable --> Same behavior as Datafusion, return overflow value
Should we implement this try_add behavior in Datafusion? |
Which issue does this PR close?
Closes #536 .
Rationale for this change
This PR adds ANSI support for Add operator. This is done by adding a wrapper to BinaryExpr to add the different behavior between Spark and Datafusion.
The wrapper is based on #593 because both PRs solve similar problems.
What changes are included in this PR?
The implementation is based on Java Math.addExact(a, b) because it is the function that uses Spark to solve this problem, but in this case using datafusion operators.
This PR excludes two things that I will do in subsequent PRs to avoid make this PR more complex:
How are these changes tested?
Unit testing