Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify type signatures using TypeSignatureClass for mixed type function signature #13372

Merged
merged 23 commits into from
Dec 14, 2024

Conversation

jayzhan211
Copy link
Contributor

@jayzhan211 jayzhan211 commented Nov 12, 2024

Which issue does this PR close?

Part of #13301

Closes #.

Rationale for this change

Some functions require mixed type signature such as (string, integer) or (string, timestamp).

Time related type is added in this PR as the first step.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
@github-actions github-actions bot added logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt) common Related to common crate functions labels Nov 12, 2024
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
@jayzhan211 jayzhan211 marked this pull request as ready for review November 12, 2024 11:08

#[derive(Debug, Clone, Eq, PartialOrd)]
pub enum TypeSignatureClass {
Timestamp,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may need to treat timestamp and timestamp with zone separately 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless there is function that treat timestamp differently based on timezone, otherwise it is simpler to treat them equivalently.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think some functions currently define that they take a UTC timestamp or ANY timestamp via https://docs.rs/datafusion/latest/datafusion/logical_expr/constant.TIMEZONE_WILDCARD.html

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also feel like we should have another variant for timestamp with time zone. IMO, they are different types.

  • Timestamp(timeunit, None) for timestamp without time zone.
  • Timestamp(timeunit, Some(TIMEZONE_WILDCARD) for timestamp with time zone.

They can't interact with each other before applying some casting or coercion.

For example,

Exact(vec![
DataType::Interval(MonthDayNano),
Timestamp(array_type, None),
Timestamp(Nanosecond, None),
]),
Exact(vec![
DataType::Interval(MonthDayNano),
Timestamp(array_type, Some(TIMEZONE_WILDCARD.into())),
Timestamp(Nanosecond, Some(TIMEZONE_WILDCARD.into())),
]),

The date_bin function accepts two Timestamp arguments. However, if we try to simplify here, we may write something like

  TypeSignature::Coercible(vec![
      TypeSignatureClass::Interval,
      TypeSignatureClass::Timstamp,
      TypeSignatureClass::Timstamp,
  ]),

It means we can accept the SQL like date_bin(INTERVAL 1 HOUR, timestamp_without_timezone_col, timestamp_with_timezone_col) but I guess it's not correct. (no match the original signature)

If we have a class for timestamp with time zone, we can write

TypeSianture::one_of([
  TypeSignature::Coercible(vec![
      TypeSignatureClass::Interval,
      TypeSignatureClass::Timstamp,
      TypeSignatureClass::Timstamp,
  ]),
  TypeSignature::Coercible(vec![
      TypeSignatureClass::Interval,
      TypeSignatureClass::Timstamp_with_time_zone,
      TypeSignatureClass::Timstamp_with_time_zone,
  ]),
])

It's more close to the original signature.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about managing timestamps within the invoke function? Handling specific cases like Timestamp_with_time_zone or TIMEZONE_WILDCARD adds unnecessary complexity to the function's signature without providing much benefit. Instead, why not define a high-level Timestamp in the signature and handle the finer details elsewhere?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can refer to the behavior of Postgres or DuckDB. I listed the signature of the timestamp function age from their system catalogs for reference.

Postgres

test=# SELECT proname, proargtypes FROM pg_proc WHERE proname = 'age';
 proname | proargtypes 
---------+-------------
 age     | 28
 age     | 1114
 age     | 1184
 age     | 1114 1114
 age     | 1184 1184
(5 rows)

test=# SELECT oid, typname FROM pg_type WHERE oid IN (28, 1184, 1114);
 oid  |   typname   
------+-------------
   28 | xid
 1114 | timestamp
 1184 | timestamptz
(3 rows)

DuckDB

D SELECT function_name, parameter_types 
  FROM duckdb_functions() 
 WHERE function_name = 'age';
┌───────────────┬──────────────────────────────────────────────────────┐
│ function_name │                   parameter_types                    │
│    varcharvarchar[]                       │
├───────────────┼──────────────────────────────────────────────────────┤
│ age           │ [TIMESTAMP]                                          │
│ age           │ [TIMESTAMP, TIMESTAMP]                               │
│ age           │ [TIMESTAMP WITH TIME ZONE, TIMESTAMP WITH TIME ZONE] │
│ age           │ [TIMESTAMP WITH TIME ZONE]                           │
└───────────────┴──────────────────────────────────────────────────────┘

If you try to input an unsupported type value, you may encounter an error like the following:

D SELECT age('2001-01-01 18:00:00');
Binder Error: Could not choose a best candidate function for the function call "age(STRING_LITERAL)". In order to select one, please add explicit type casts.
    Candidate functions:
    age(TIMESTAMP WITH TIME ZONE) -> INTERVAL
    age(TIMESTAMP) -> INTERVAL

LINE 1: SELECT age('2001-01-01 18:00:00');

Both systems treat TIMESTAMP and TIMESTAMP WITH TIME ZONE as distinct types in the high level.

The advantage of separating these types is that it allows for stricter type-checking when matching a function's signature. This reduces the likelihood of developers failing to correctly handle type checks when implementing timestamp functions.

Honestly, I'm not sure how complex it would become if we separated them 🤔 . If it requires significant effort, I'm fine with keeping the current design.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both systems treat TIMESTAMP and TIMESTAMP WITH TIME ZONE as distinct types in the high level.

that's a good point, they are quite different: only the latter denotes point in time. the former denotes "wall date/time" with no zone information, so does not denote any particular point in time.

their similarity is deceptive and source of many bugs

This reduces the likelihood of developers failing to correctly handle type checks when implementing timestamp functions.

the timestamp and timestamp tz values are different arrow types, so the function implementor needs to handle them separately anyway

The point is, some functions will be applicable to one of these types but not the other.
for example, a (hypothetical) to_unix_timestamp(timestamp_tz) -> Int64 function should operate on point in time value, so it should accept the timestamp_tz.
Note that in SQL, timestamp is coercible to timestamp_tz, so such function is still going to be callable with timestamp value, but that's not something function implementor should be concerned about.

Copy link
Contributor Author

@jayzhan211 jayzhan211 Dec 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My solution so far is that we use TypeSignatureClass::Timestamp if we don't care about it has timezone or not. Fallback to TypeSignatureClass::Native() if we need to tell the difference

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My solution so far is that we use TypeSignature::Timestamp if we don't care about it has timezone or not. Fallback to TypeSignature::Native() if we need to tell the difference

Agreed. Handling Timestamps using TypeSignatureClass::Native is a good idea. I still think we should treat them as different types.

@@ -112,8 +114,9 @@ pub enum TypeSignature {
/// For example, `Coercible(vec![logical_float64()])` accepts
/// arguments like `vec![DataType::Int32]` or `vec![DataType::Float32]`
/// since i32 and f32 can be casted to f64
Coercible(Vec<LogicalTypeRef>),
/// Fixed number of arguments of arbitrary types, number should be larger than 0
Coercible(Vec<TypeSignatureClass>),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

datafusion/common/src/types/native.rs Show resolved Hide resolved
----
12.12345678

query R
SELECT date_part('millisecond', '2020-09-08T12:00:12.12345678+00:00')
SELECT date_part('millisecond', timestamp '2020-09-08T12:00:12.12345678+00:00')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why change existing test queries?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because the valid type should be timestamp but not string.

This is the result in Postgres

postgres=# SELECT date_part('millisecond', timestamp '2020-09-08T12:00:12.12345678+00:00');
 date_part 
-----------
 12123.457
(1 row)

postgres=# SELECT date_part('millisecond', '2020-09-08T12:00:12.12345678+00:00');
ERROR:  function date_part(unknown, unknown) is not unique
LINE 1: SELECT date_part('millisecond', '2020-09-08T12:00:12.1234567...
               ^
HINT:  Could not choose a best candidate function. You might need to add explicit type casts.


impl Display for TypeSignatureClass {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "TypeSignatureClass::{self:?}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Display looks more verbose than Debug. Typically it's the other way around.

the produced err msg looks a bit longish, but i don't know how to make it more readabile. thoughts?

Internal error: Expect TypeSignatureClass::Native\(LogicalType\(Native\(Int64\), Int64\)\) but received Float64

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it would be great to have a special type signature class display

Maybe something like (any int) or Integer or (any timestamp) for Timestamp 🤔

@alamb
Copy link
Contributor

alamb commented Nov 20, 2024

@jayzhan211 is this one ready for review by a committer ? I am having trouble keeping track of everything that is going on and am trying to prioritize reviews where it would be most helpful.

If you have any suggestions, please let me know

@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Nov 21, 2024

@jayzhan211 is this one ready for review by a committer ? I am having trouble keeping track of everything that is going on and am trying to prioritize reviews where it would be most helpful.

If you have any suggestions, please let me know

Yes, this is

trying to prioritize reviews where it would be most helpful.

I think we should. I will ping directly if I think it should be prioritized

@jayzhan211 jayzhan211 marked this pull request as draft November 23, 2024 06:15
@github-actions github-actions bot added documentation Improvements or additions to documentation sql SQL Planner development-process Related to development process of DataFusion physical-expr Physical Expressions optimizer Optimizer rules core Core DataFusion crate substrait catalog Related to the catalog crate execution Related to the execution crate proto Related to proto crate labels Dec 8, 2024
- Added logic to prevent unnecessary casting of string types in `native.rs`.
- Introduced `Comparable` variant in `TypeSignature` to define coercion rules for comparisons.
- Updated imports in `functions.rs` and `signature.rs` for better organization.
- Modified `date_part.rs` to improve handling of timestamp extraction and fixed query tests in `expr.slt`.
- Added `datafusion-macros` dependency in `Cargo.toml` and `Cargo.lock`.

These changes improve type handling and ensure more accurate function behavior in SQL expressions.
@jayzhan211 jayzhan211 requested a review from goldmedal December 9, 2024 13:35
Exact(vec![
Utf8,
Timestamp(Nanosecond, Some(TIMEZONE_WILDCARD.into())),
TypeSignature::Coercible(vec![
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplified signature is the main goal of this PR

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jayzhan211 -- this is looking quite good. I am worried about the regression of implicit casting from string to timestamps as well as the fact that I think this may not work for timestamps without timezones (seems like there is a gap in testing for datepart -- I will make a PR to add some more tests)


#[derive(Debug, Clone, Eq, PartialOrd)]
pub enum TypeSignatureClass {
Timestamp,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think some functions currently define that they take a UTC timestamp or ANY timestamp via https://docs.rs/datafusion/latest/datafusion/logical_expr/constant.TIMEZONE_WILDCARD.html

datafusion/expr-common/src/signature.rs Show resolved Hide resolved

impl Display for TypeSignatureClass {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "TypeSignatureClass::{self:?}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it would be great to have a special type signature class display

Maybe something like (any int) or Integer or (any timestamp) for Timestamp 🤔

.map(|logical_type| match logical_type {
TypeSignatureClass::Native(l) => get_data_types(l.native()),
TypeSignatureClass::Timestamp => {
vec![DataType::Timestamp(TimeUnit::Nanosecond, None)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should use WILDCARD to match any timestamp (not just timestamps without timezones)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually used for displaying the signature information for error message.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to provide both timestamp with time zone and timestamp without time zone here. It's used to provide the possible type for the information_schema routine and parameters table.

vec![DataType::Timestamp(TimeUnit::Nanosecond, None), DataType::Timestamp(TimeUnit::Nanosecond, Some(TIMEZONE_WILDCARD.into()))]

@@ -245,6 +245,8 @@ impl LogicalType for NativeType {
(Self::FixedSizeBinary(size), _) => FixedSizeBinary(*size),
(Self::String, LargeBinary) => LargeUtf8,
(Self::String, BinaryView) => Utf8View,
// We don't cast to another kind of string type if the origin one is already a string type
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯


query I
# Second argument should not be string, failed in postgres too.
query error
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this feels like a regression to me (even if it fails in postgres). I think automatically coercing to a string is important for our users in InfluxDB.

…trings. Update SQL logic tests to reflect changes in timestamp handling, including expected outputs for queries involving nanoseconds and seconds.
TypeSignatureClass::Timestamp if logical_type == NativeType::String => {
Ok(DataType::Timestamp(TimeUnit::Nanosecond, None))
}
TypeSignatureClass::Timestamp if logical_type.is_timestamp() => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

timestamp is matched whatever the timezone it has

…d maintainability. Update the `TypeSignatureClass` documentation to clarify its purpose in function signatures, particularly regarding coercible types. This change enhances the handling of implicit casting from strings to timestamps.
@jayzhan211 jayzhan211 marked this pull request as draft December 10, 2024 14:33
…ctions. Updated expected outputs for `date_part` and `extract` functions to reflect proper behavior with nanoseconds and seconds. This change improves the accuracy of test cases in the `expr.slt` file.
…fication. Updated the logic to include an additional DataType for timestamps with a timezone wildcard, improving flexibility in timestamp operations.
@jayzhan211 jayzhan211 marked this pull request as ready for review December 11, 2024 12:05
alamb added a commit to alamb/datafusion that referenced this pull request Dec 11, 2024
alamb added a commit to alamb/datafusion that referenced this pull request Dec 11, 2024
@alamb
Copy link
Contributor

alamb commented Dec 11, 2024

@jayzhan211 jayzhan211 marked this pull request as draft December 12, 2024 13:48
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jayzhan211 -- I know this has been a long process. Sorry for the slower pace but I think as DataFusion attempts to be more stable, more attention is warranted for potential changes.

Other than the string coercion (I left a question) this PR looks good to me given your comment on #13732 (comment)

❤️ Thanks again

query I
SELECT date_part('second', '2020-09-08T12:00:12.12345678+00:00')
----
12

query I
SELECT date_part('millisecond', '2020-09-08T12:00:12.12345678+00:00')
SELECT date_part('second', timestamp '2020-09-08T12:00:12.12345678+00:00')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these test changes still needed?

From my perspective as long as this PR doesn't change the ability to use a string as an argument to a function that requires a timestamp, it is a good change to me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't merge the code from the main

@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Dec 12, 2024

Thank you @jayzhan211 -- I know this has been a long process. Sorry for the slower pace but I think as DataFusion attempts to be more stable, more attention is warranted for potential changes.

Other than the string coercion (I left a question) this PR looks good to me given your comment on #13732 (comment)

❤️ Thanks again

I'm also thinking about the comment above about whether to differentiate timestamp timezone or not. Given that we already have NativeType::Timestamp to differentiate with tz or without cases, I still don't think it is a good idea to add another abstraction in TypeSignatureClass. For date_bin that takes 2 args, I think using TypeSignatureClass::Native() might be a better idea. We can review the design again when we deal with date_bin later on

@jayzhan211 jayzhan211 marked this pull request as ready for review December 12, 2024 14:13
alamb added a commit that referenced this pull request Dec 12, 2024
…nes (#13732)

* Add tests for date_part on columns + timestamps with / without timezones

* Add tests from #13372

* remove trailing whitespace
@jayzhan211 jayzhan211 marked this pull request as draft December 13, 2024 10:05
… not_impl_err import for better error handling
@jayzhan211 jayzhan211 marked this pull request as ready for review December 13, 2024 14:15
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love it -- thank you @jayzhan211 for pushing this forward and thank you for putting up with the long review cycle

@@ -560,7 +560,7 @@ select repeat('-1.2', arrow_cast(3, 'Int32'));
----
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lack of test changes look great to me ❤️

@alamb alamb changed the title TypeSignatureClass for mixed type function signature Simplify type signatures using TypeSignatureClass for mixed type function signature Dec 13, 2024
@jayzhan211 jayzhan211 merged commit 08d3b65 into apache:main Dec 14, 2024
28 checks passed
@jayzhan211 jayzhan211 deleted the type-class branch December 14, 2024 01:36
@jayzhan211
Copy link
Contributor Author

Thanks @alamb @goldmedal @findepi

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate functions logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants