
feat: ANSI support for Add #616

Open

wants to merge 10 commits into main

Conversation

@planga82 (Contributor) commented Jul 1, 2024

Which issue does this PR close?

Closes #536.

Rationale for this change

This PR adds ANSI support for the Add operator. This is done by adding a wrapper around BinaryExpr that implements the behavior that differs between Spark and DataFusion.
The wrapper approach is based on #593 because both PRs solve similar problems.

What changes are included in this PR?

The implementation is based on Java's Math.addExact(a, b), because that is the function Spark uses to solve this problem, but in this case it is implemented using DataFusion operators.

    public static int addExact(int x, int y) {
        // HD 2-12 Overflow iff both arguments have the opposite sign of the result
        int r = x + y;
        if (((x ^ r) & (y ^ r)) < 0) {
            throw new ArithmeticException("integer overflow");
        }
        return r;
    }

This PR excludes two things that I will address in subsequent PRs, to avoid making this PR more complex:

  • Support for DecimalType overflow check
  • Spark try_add mode

How are these changes tested?

Unit testing

@planga82 (Contributor Author) commented Jul 1, 2024

@dharanad Here is my solution based on your PR.

@dharanad commented Jul 2, 2024

> @dharanad Here is my solution based on your PR.

This looks fine. Once this PR is merged, I will extend my solution on top of it to solve #535.
@andygrove / @viirya Can you please help us with a review?

let boolean_array = array
    .as_any()
    .downcast_ref::<BooleanArray>()
    .expect("Expected BooleanArray");

Since we are using expect here, this function may panic; can we instead return an error?
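
For example, the panic could be avoided by propagating an error instead (a sketch only; the exact error type used in Comet may differ):

    // Hypothetical: return an internal error rather than panicking on a failed downcast.
    let boolean_array = array
        .as_any()
        .downcast_ref::<BooleanArray>()
        .ok_or_else(|| DataFusionError::Internal("expected BooleanArray".to_string()))?;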

Contributor Author

Solved! Thanks.

}

fn check_int_overflow(&self, batch: &RecordBatch, result: &ColumnarValue) -> Result<()> {
    let check_overflow_expr = Arc::new(BinaryExpr::new(
Contributor

Since arrow provides overflow-checked kernels (apache/arrow-rs#2643), does it make sense to use those directly rather than re-implementing them?

Contributor Author

I'm going to review it. Thank you!

Contributor Author

Hi @eejbyfeldt, I've been looking at DataFusion and I don't see any option to use those arrow operations from DataFusion physical expressions. Do you know if this is implemented yet?


@planga82 Can we make use of this arithmetic kernel https://docs.rs/arrow/latest/arrow/compute/kernels/numeric/fn.add.html to compute the addition and throw an error based on the result?
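
For context, the non-wrapping add kernel in arrow-rs already returns an error on integer overflow, so the check would come for free (a minimal sketch):

    use arrow::array::Int32Array;
    use arrow::compute::kernels::numeric::add;

    // `add` (unlike `add_wrapping`) returns Err on integer overflow,
    // which is the behavior ANSI mode needs.
    let a = Int32Array::from(vec![i32::MAX]);
    let b = Int32Array::from(vec![1]);
    assert!(add(&a, &b).is_err());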

@dharanad commented Jul 8, 2024

Can we do something like this?

    fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue> {
        use arrow::compute::kernels::numeric::add;
        let lhs = self.left.evaluate(batch)?;
        let rhs = self.right.evaluate(batch)?;
        match (self.op, self.eval_mode) {
            (Operator::Plus, EvalMode::Ansi) => apply(&lhs, &rhs, add),
            _ => self.inner.evaluate(batch),
        }
    }

But the visibility of the apply fn is restricted to pub(crate). We might need to raise a PR against DataFusion to make it public.

@planga82 (Contributor Author) commented Jul 8, 2024

My concern is that if we use the kernels directly to perform the operations, instead of reusing the DataFusion physical expression, we may lose functionality or have to reimplement it here.
From my point of view, in Comet we should try to translate from Spark to DataFusion and add, in the form of a wrapper, the functionality that may be missing.


Well put. I agree with what you are saying. I was thinking of overriding the implementation, but your approach makes much more sense: safer and cleaner.

Member

I agree with @eejbyfeldt that we should just use the existing add_checked or add_wrapped kernels in arrow-rs that already provide the functionality we need (unless we discover any compatibility issue compared to the JVM addExact logic). I will create an example to demonstrate how to use this and will post it here later today.

@andygrove (Member)

Thanks for the contribution @planga82. I am reviewing this today.

Comment on lines +155 to +157

match self.inner.evaluate(batch) {
    Ok(result) => {
        self.fail_on_overflow(batch, &result)?;
Member

Evaluating the same expression twice is going to be expensive. We should just evaluate once, either checking for overflows or not, depending on the eval mode.

Comment on lines +75 to +91
    Arc::new(BinaryExpr::new(
        Arc::new(BinaryExpr::new(
            self.left.clone(),
            Operator::BitwiseXor,
            self.inner.clone(),
        )),
        Operator::BitwiseAnd,
        Arc::new(BinaryExpr::new(
            self.right.clone(),
            Operator::BitwiseXor,
            self.inner.clone(),
        )),
    )),
    Operator::Lt,
    Self::zero_literal(&result.data_type())?,
));
match check_overflow_expr.evaluate(batch)? {
Member

This is a very expensive way of implementing this. We don't need to use DataFusion to perform simple math operations when we can just implement this in Rust directly as we process the arrays. I think we can delegate to arrow-rs, though, and avoid all of this. As stated in another comment, I will provide an example later today.

@andygrove (Member)

@planga82 Please see apache/datafusion#11400 that I just created against DataFusion, which possibly gives us most of what we need, although the same principles could be applied directly in Comet.

The key change was adding a fail_on_overflow config in BinaryExpr and then choosing to call either add or add_wrapped depending on that value.
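
A rough sketch of that idea (not the actual DataFusion change; kernel names as in recent arrow-rs, where the wrapping variant is called add_wrapping):

    use arrow::array::{ArrayRef, Datum};
    use arrow::compute::kernels::numeric::{add, add_wrapping};
    use arrow::error::ArrowError;

    // Pick the checked or wrapping kernel depending on a fail_on_overflow flag.
    fn add_with_mode(
        lhs: &dyn Datum,
        rhs: &dyn Datum,
        fail_on_overflow: bool,
    ) -> Result<ArrayRef, ArrowError> {
        if fail_on_overflow {
            add(lhs, rhs) // errors on integer overflow (ANSI semantics)
        } else {
            add_wrapping(lhs, rhs) // wraps around (legacy / non-ANSI semantics)
        }
    }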

@dharanad commented Jul 11, 2024

> @planga82 Please see apache/datafusion#11400 that I just created against DataFusion, which possibly gives us most of what we need, although the same principles could be applied directly in Comet.
>
> The key change was adding a fail_on_overflow config in BinaryExpr and then choosing to call either add or add_wrapped depending on that value.

This is amazing, thanks for apache/datafusion#11400. I had considered this idea, but I wasn't sure how to suggest this change.

Correct me if I am wrong: this change will go into the next DataFusion release, so are we blocked on supporting ANSI until then? Or can we plan to update the DataFusion dependency once 11400 is merged?

@planga82 (Contributor Author) commented Jul 11, 2024

Thank you very much @andygrove for the explanation and the PR. In addition to @dharanad's question:

In Spark we have three different behaviors:

ANSI mode disabled --> same behavior as DataFusion, return the overflowed value
ANSI mode enabled:

  • x + y: fail with an overflow message
  • try_add(x, y): return a null value on overflow

Should we implement this try_add behavior in DataFusion?
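
For illustration, the three behaviors map naturally onto Rust integer arithmetic (a sketch with i32, not Comet code):

    // ANSI mode disabled: wrap around, like the default DataFusion add.
    fn add_legacy(x: i32, y: i32) -> i32 {
        x.wrapping_add(y)
    }

    // ANSI mode enabled, x + y: fail on overflow.
    fn add_ansi(x: i32, y: i32) -> Result<i32, String> {
        x.checked_add(y).ok_or_else(|| "integer overflow".to_string())
    }

    // ANSI mode enabled, try_add(x, y): null (None) on overflow.
    fn try_add(x: i32, y: i32) -> Option<i32> {
        x.checked_add(y)
    }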
