Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support binary temporal arithmetic with integers #13741

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

milevin
Copy link

@milevin milevin commented Dec 12, 2024

Which issue does this PR close?

Closes #12342

Rationale for this change

> ./target/debug/datafusion-cli -c "select 4 + to_date('1970-01-03');"

DataFusion CLI v43.0.0
+----------------------------------------+
| Int64(4) + to_date(Utf8("1970-01-03")) |
+----------------------------------------+
| 1970-01-07                             |
+----------------------------------------+
1 row(s) fetched. 
Elapsed 0.018 seconds.

> ./target/debug/datafusion-cli -c "select to_date('1970-01-03') - 2;"
DataFusion CLI v43.0.0
+----------------------------------------+
| to_date(Utf8("1970-01-03")) - Int64(2) |
+----------------------------------------+
| 1970-01-01                             |
+----------------------------------------+
1 row(s) fetched. 
Elapsed 0.006 seconds.

Details

To facilitate this change, I made an update to cast_to function in expr_schema.rs. There are two unpleasant things in here:

  1. My change is not really a cast, it's a rewrite. So, cast_to now sometimes casts and sometimes rewrites.
  2. This only works for literal integers; but fails as before on things like this: select to_date('1970-01-03') - x from (select 1 as x);

The limitations I faced are these:

  1. Arrow can't cast integers to intervals -- this necessitated rewriting instead of casting
  2. I couldn't figure out how to generate a rewritten expression that didn't need the input to be a literal. I tried to play with a multiplication expression, but it choked on multiplying an interval by an integer. That's when I resorted to using literal integer values and failing on everything else.

@github-actions github-actions bot added the logical-expr Logical plan and expressions label Dec 12, 2024
datafusion/expr-common/src/type_coercion/binary.rs Outdated Show resolved Hide resolved
datafusion/expr-common/src/type_coercion/binary.rs Outdated Show resolved Hide resolved
Comment on lines +1638 to +1640
let (lhs, rhs) = get_input_types(&$LHS_TYPE, &$OP, &$RHS_TYPE)?;
assert_eq!(lhs, $RESULT_TYPE1);
assert_eq!(rhs, $RESULT_TYPE2);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for + operator symmetry is expected, so the assert function could do this for us

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The operator is symmetric, but how we cast the left and right types is not symmetric. Hence I needed an assymetric test for this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could have this

let (a, b) = get_input_types(&$LHS_TYPE, &$OP, &$RHS_TYPE)?;
assert_eq!(a, $RESULT_TYPE1);
assert_eq!(b, $RESULT_TYPE2);

// test symmetry
let (b, a) = get_input_types(&$RHS_TYPE, &$OP, &$LHS_TYPE)?;
assert_eq!(a, $RESULT_TYPE1);
assert_eq!(b, $RESULT_TYPE2);

///
/// Applies coercion rules for `$LHS_TYPE $OP $RHS_TYPE` and asserts that the
/// the result types are `$RESULT_TYPE1` and `$RESULT_TYPE2` respectively
macro_rules! test_coercion_assymetric_binary_rule {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the benefit of this being a macro? Could this be ordinary function?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know -- I just copied how the rest is done here, via macros too. This macro is a clone of test_coercion_binary_rule macro.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i assumed so. But also -- macros aren't "better functions", they come with downsides, so if something is expressible as a function, it should.

Comment on lines +458 to +477
} else if matches!(
(&this_type, cast_to_type),
(DataType::Int32 | DataType::Int64, DataType::Interval(_))
) {
// Convert integer (days) to the corresponding DayTime interval
match self {
Expr::Literal(ScalarValue::Int32(Some(days))) => {
Ok(Expr::Literal(ScalarValue::IntervalDayTime(Some(
arrow_buffer::IntervalDayTime::new(days, 0),
))))
}
Expr::Literal(ScalarValue::Int64(Some(days))) => {
Ok(Expr::Literal(ScalarValue::IntervalDayTime(Some(
arrow_buffer::IntervalDayTime::new(days as i32, 0),
))))
}
_ => plan_err!(
"Cannot automatically convert {this_type:?} to {cast_to_type:?}"
),
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works for literals only and is immediate.
For other types all the function does is create cast expression. Why would we want to special-case here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the alternative? You can't cast an integer to an interval in arrow.

You are right; it only works for literals. Do you have thoughts on how to make it work for non also?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if a Cast is supposed to work, it should work regardless how it's weaved into the plan.
Expr.cast_to should not have any additional logic (other than checks) vs creating Expr::Cast directly.

However, we should first decide whether integer should be castable to interval at all.

It's not castable in PostgreSQL

postgres=# SELECT 10::interval;
ERROR:  cannot cast type integer to interval
LINE 1: SELECT 10::interval;
                 ^

so I am inclined it should not be castable in DF either.

Which means, the date + int cannot be converted (desugared) to date + interval using ordinary cast expression (as such shouldn't exist), but it can be converted (desugared) by some other means (eg with a custom conversion function or combination of expressions).

Alternatively, we defer date + int until the execution, eg somewhere in

fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue> {

but i'd prefer to desugar earlier and avoid separate code path in the BinaryExpr physical expr

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we require a new scalar function, make_interval, which can convert integers into intervals.
Then, similar to make_array, perform transformations within the SQL planner.
Transform 4 + to_date('1970-01-03') to make_interval(4) + to_date('1970-01-03') inside plan_binary_op

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes @jonahgao something like that is the approach! I've pinged the make_interval issue:

#6951

@@ -186,6 +186,12 @@ fn signature(lhs: &DataType, op: &Operator, rhs: &DataType) -> Result<Signature>
} else if let Some(numeric) = mathematics_numerical_coercion(lhs, rhs) {
// Numeric arithmetic, e.g. Int32 + Int32
Ok(Signature::uniform(numeric))
} else if let Some((new_lhs, new_rhs, ret)) = resolve_ints_to_intervals(lhs, rhs) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we're in Plus | Minus | Multiply | Divide | Modulo => { branch
but the new logic should apply

  • to + symmetrically
  • to date - int, but likely not to int - date (at least per PostgreSQL)
  • not at all to other operators

so what about passing the operator to the new function and doing the work for + and - (so that code already identifies asymmetric treatment)


match (lhs, rhs) {
// Handle integer + temporal types cases
(Int32 | Int64, rhs) if rhs.is_temporal() => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should be date + int, not any_temporal + int:

postgres=# select now()::date + 1;
  ?column?
------------
 2024-12-13
(1 row)

postgres=# select now()::time + 1;
ERROR:  operator does not exist: time without time zone + integer
LINE 1: select now()::time + 1;
                           ^
HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.

postgres=# select now()::timestamp + 1;
ERROR:  operator does not exist: timestamp without time zone + integer
LINE 1: select now()::timestamp + 1;
                                ^
HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.

postgres=# select now()::timestamptz + 1;
ERROR:  operator does not exist: timestamp with time zone + integer
LINE 1: select now()::timestamptz + 1;
                                  ^
HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.

(i didn't check other is_temporal types: interval, duration)

///
/// Applies coercion rules for `$LHS_TYPE $OP $RHS_TYPE` and asserts that the
/// the result types are `$RESULT_TYPE1` and `$RESULT_TYPE2` respectively
macro_rules! test_coercion_assymetric_binary_rule {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i assumed so. But also -- macros aren't "better functions", they come with downsides, so if something is expressible as a function, it should.

Comment on lines +1638 to +1640
let (lhs, rhs) = get_input_types(&$LHS_TYPE, &$OP, &$RHS_TYPE)?;
assert_eq!(lhs, $RESULT_TYPE1);
assert_eq!(rhs, $RESULT_TYPE2);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could have this

let (a, b) = get_input_types(&$LHS_TYPE, &$OP, &$RHS_TYPE)?;
assert_eq!(a, $RESULT_TYPE1);
assert_eq!(b, $RESULT_TYPE2);

// test symmetry
let (b, a) = get_input_types(&$RHS_TYPE, &$OP, &$LHS_TYPE)?;
assert_eq!(a, $RESULT_TYPE1);
assert_eq!(b, $RESULT_TYPE2);

Comment on lines +458 to +477
} else if matches!(
(&this_type, cast_to_type),
(DataType::Int32 | DataType::Int64, DataType::Interval(_))
) {
// Convert integer (days) to the corresponding DayTime interval
match self {
Expr::Literal(ScalarValue::Int32(Some(days))) => {
Ok(Expr::Literal(ScalarValue::IntervalDayTime(Some(
arrow_buffer::IntervalDayTime::new(days, 0),
))))
}
Expr::Literal(ScalarValue::Int64(Some(days))) => {
Ok(Expr::Literal(ScalarValue::IntervalDayTime(Some(
arrow_buffer::IntervalDayTime::new(days as i32, 0),
))))
}
_ => plan_err!(
"Cannot automatically convert {this_type:?} to {cast_to_type:?}"
),
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if a Cast is supposed to work, it should work regardless how it's weaved into the plan.
Expr.cast_to should not have any additional logic (other than checks) vs creating Expr::Cast directly.

However, we should first decide whether integer should be castable to interval at all.

It's not castable in PostgreSQL

postgres=# SELECT 10::interval;
ERROR:  cannot cast type integer to interval
LINE 1: SELECT 10::interval;
                 ^

so I am inclined it should not be castable in DF either.

Which means, the date + int cannot be converted (desugared) to date + interval using ordinary cast expression (as such shouldn't exist), but it can be converted (desugared) by some other means (eg with a custom conversion function or combination of expressions).

Alternatively, we defer date + int until the execution, eg somewhere in

fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue> {

but i'd prefer to desugar earlier and avoid separate code path in the BinaryExpr physical expr

@alamb
Copy link
Contributor

alamb commented Dec 13, 2024

FYI @jonahgao in case you would like to weigh in here

@alamb
Copy link
Contributor

alamb commented Jan 16, 2025

What is the status of this PR? Are we waiting on upstream changes in arrow-rs?

@alamb alamb marked this pull request as draft January 17, 2025 18:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Error when adding Date32 and Int64
4 participants