Fix get_type for higher-order array functions #13756
base: main
Conversation
assert_eq!(
    ExprSchemable::get_type(&udf_expr, &schema).unwrap(),
    complex_type
);
This didn't pass before the change. The assertions above did pass.
The fix is covered by the recursive flatten test case in array.slt.
}
}

fn recursive_array(array_type: &DataType) -> Option<DataType> {
Can we extend the existing array function to handle nested arrays instead of creating another signature for nested arrays?
I don't know how to do this, please advise!
But this function should go away with #13757.
But this function should go away with #13757.
I don't understand -- if the goal is to remove recursive flattening, should we be adding new code to support it 🤔
- The pre-existing array signature implied recursive array-ification (replacing FixedSizeList with List, recursively); it did not imply flattening.
- The recursive type normalization matters only for flatten, because it (currently) operates recursively and would otherwise need extra code to handle FixedSizeList inputs.
- The recursive array-ification was useless for the other array functions, so it was made non-recursive.
- To compensate for that change, a new RecursiveArray signature was added for the flatten case.
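To make that distinction concrete, here is a minimal sketch of what recursive array-ification means, using only arrow's DataType and Field. The function name arrayify_recursively is made up for illustration and this is not DataFusion's signature-coercion code: every FixedSizeList, at any nesting depth, is replaced by a List, and nothing is flattened.

use std::sync::Arc;
use arrow::datatypes::{DataType, Field};

/// Illustration only: recursively replace FixedSizeList with List,
/// preserving the nesting structure (no flattening).
fn arrayify_recursively(data_type: &DataType) -> DataType {
    match data_type {
        // Recurse into existing List element types.
        DataType::List(field) => DataType::List(Arc::new(Field::new(
            field.name().clone(),
            arrayify_recursively(field.data_type()),
            field.is_nullable(),
        ))),
        // Replace FixedSizeList with List, recursing into the element type.
        DataType::FixedSizeList(field, _) => DataType::List(Arc::new(Field::new(
            field.name().clone(),
            arrayify_recursively(field.data_type()),
            field.is_nullable(),
        ))),
        // Everything else (LargeList handling is omitted from this sketch) is untouched.
        other => other.clone(),
    }
}

fn main() {
    // FixedSizeList(FixedSizeList(Int32, 2), 3) becomes List(List(Int32)):
    // the nesting is preserved, nothing is flattened.
    let inner = DataType::FixedSizeList(Arc::new(Field::new("item", DataType::Int32, true)), 2);
    let nested = DataType::FixedSizeList(Arc::new(Field::new("item", inner, true)), 3);
    println!("{:?}", arrayify_recursively(&nested));
}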
use std::collections::HashMap;

#[test]
fn test_array_element_return_type() {
I think we can add tests in the slt file that cover the array signature test cases, so we can avoid creating a Rust test here.
The Rust test allows explicitly exercising various ways of getting an expression's type.
Before I wrote it, I wasn't even sure whether this was a bug or a feature.
I can add an slt test; what would it look like?
I did try to write some slt regression tests, but I couldn't expose the bug. Yet the unit test proves the bug exists.
I trust you have better intuition for how a signature-related bug can be exposed in SLT. Please advise.
Thanks @findepi and @jayzhan211
From what I can see, the point of this PR is to make array_element_udf have different type resolution rules (non-recursive), which seems reasonable.
However, as you both mention, I don't seem to be able to trigger the problem from SQL (element access seems to work correctly): e.g. the [[20]]
isn't flattened on main:
> create table t as values ([[[10]], [[20]]]);
0 row(s) fetched.
Elapsed 0.007 seconds.
> explain select column1[2] from t;
+---------------+---------------------------------------------------------------------------+
| plan_type | plan |
+---------------+---------------------------------------------------------------------------+
| logical_plan | Projection: array_element(t.column1, Int64(2)) |
| | TableScan: t projection=[column1] |
| physical_plan | ProjectionExec: expr=[array_element(column1@0, 2) as t.column1[Int64(2)]] |
| | MemoryExec: partitions=1, partition_sizes=[1] |
| | |
+---------------+---------------------------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.001 seconds.
> select column1[2] from t;
+---------------------+
| t.column1[Int64(2)] |
+---------------------+
| [[20]] |
+---------------------+
1 row(s) fetched.
Elapsed 0.001 seconds.
And the type seems good too: list(list(int))
> select arrow_typeof(column1[2]) from t;
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| arrow_typeof(t.column1[Int64(2)]) |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| List(Field { name: "item", data_type: List(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
So this problem is quite strange. How is it working today (without this change) 🤔
}
}

fn recursive_array(array_type: &DataType) -> Option<DataType> {
But this function should go away with #13757.
I don't understand -- if the goal is to remove recursive flattening, should we be adding new code to support it 🤔
I believe the bug -- if we agree this is a bug -- is compensated by other factors. I found this bug in a case where array_element was inserted into the plan as a result of ScalarUDFImpl::simplify. At that stage, its "loose typing" is no longer OK. @alamb @jayzhan211 can you please review the attached unit test?
I am checking this out in more detail
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am still digging. This is so weird.
I messed with the test and it seems like the failure only happens when the complex type is a FixedSizeList, for some reason...
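One way to see why FixedSizeList would be the trigger, sketched with plain arrow types. The helper element_type below is a made-up stand-in for array_element's return-type rule, not DataFusion code: the array signature coercion rewrites FixedSizeList to List, so a return type derived from the coerced argument types no longer matches the element type actually stored in the data, while List and Struct element types survive coercion unchanged.

use std::sync::Arc;
use arrow::datatypes::{DataType, Field};

/// Made-up stand-in for array_element's return-type rule:
/// the return type is the element type of the list argument.
fn element_type(data_type: &DataType) -> Option<DataType> {
    match data_type {
        DataType::List(field)
        | DataType::LargeList(field)
        | DataType::FixedSizeList(field, _) => Some(field.data_type().clone()),
        _ => None,
    }
}

fn main() {
    // The unit test's complex element type (the field name and size are arbitrary).
    let complex_type =
        DataType::FixedSizeList(Arc::new(Field::new("item", DataType::Int32, true)), 3);

    // array_element is invoked on an array of that element type:
    // List(FixedSizeList(Int32, 3)).
    let arg_type = DataType::List(Arc::new(Field::new("item", complex_type.clone(), true)));

    // The signature coercion rewrites FixedSizeList to List, so the coerced
    // argument type becomes List(List(Int32)).
    let coerced_arg = DataType::List(Arc::new(Field::new(
        "item",
        DataType::List(Arc::new(Field::new("item", DataType::Int32, true))),
        true,
    )));

    // The return type derived from the actual argument is the FixedSizeList the
    // data really contains; the one derived from the coerced argument is List(Int32).
    assert_eq!(element_type(&arg_type), Some(complex_type));
    assert_ne!(element_type(&arg_type), element_type(&coerced_arg));

    // A List or Struct element type is unchanged by this coercion, which would
    // explain why those variants of the test pass.
}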
fn array(array_type: &DataType) -> Option<DataType> {
    match array_type {
So this says that if the type is a List, keep the type, but if the type is a LargeList / FixedSizeList then take the field type?
Why doesn't it also take the field type for List 🤔?
(Aka it doesn't make sense to me that List is treated differently than LargeList and FixedSizeList.)
For backwards compat I should keep LargeList as LargeList; will push shortly.
Aka it doesn't make sense to me that List is treated differently than LargeList and FixedSizeList
Not my invention, it was like this before.
I think the intention is "converge List, LL and FSL into one type... or maybe two types... to keep UDF impls simpler".
I am not attached to this approach, but I think some code may rely on it.
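For illustration, the non-recursive normalization described above might look roughly like this. The name normalize_array_type is made up and this is only a sketch of the behavior described in this thread, not the actual signature code: List and LargeList are kept as-is, and FixedSizeList is converted to a List of the same element field.

use std::sync::Arc;
use arrow::datatypes::{DataType, Field};

/// Sketch of the one-level (non-recursive) normalization described above:
/// List stays List, LargeList stays LargeList (backwards compat), and
/// FixedSizeList becomes a List with the same element field. Not the real code.
fn normalize_array_type(array_type: &DataType) -> Option<DataType> {
    match array_type {
        DataType::List(_) | DataType::LargeList(_) => Some(array_type.clone()),
        DataType::FixedSizeList(field, _) => Some(DataType::List(Arc::clone(field))),
        _ => None,
    }
}

fn main() {
    // FixedSizeList(Int32, 4) -> List(Int32); a nested FixedSizeList element type
    // would be left as-is, since there is no recursion here.
    let fsl = DataType::FixedSizeList(Arc::new(Field::new("item", DataType::Int32, true)), 4);
    println!("{:?}", normalize_array_type(&fsl));
}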
#[test]
fn test_array_element_return_type() {
    let complex_type = DataType::FixedSizeList(
When I change this complex type to DataType::List the test passes 🤔
let complex_type = DataType::List(
    Field::new("some_arbitrary_test_field", DataType::Int32, false).into(),
);
It also passes when complex_type is a Struct
let complex_type = DataType::Struct(Fields::from(vec![
    Arc::new(Field::new("some_arbitrary_test_field", DataType::Int32, false)),
]));
It seems like there is something about FixedSizeList that is causing issues to me
Weird, when I remove this line in expr schema the test passes (with FixedSizeList):
diff --git a/datafusion/expr/src/expr_schema.rs b/datafusion/expr/src/expr_schema.rs
index 3317deafb..50aeb222f 100644
--- a/datafusion/expr/src/expr_schema.rs
+++ b/datafusion/expr/src/expr_schema.rs
@@ -152,6 +152,7 @@ impl ExprSchemable for Expr {
                    .map(|e| e.get_type(schema))
                    .collect::<Result<Vec<_>>>()?;
+
                // Verify that function is invoked with correct number and type of arguments as defined in `TypeSignature`
                let new_data_types = data_types_with_scalar_udf(&arg_data_types, func)
                    .map_err(|err| {
@@ -168,7 +169,7 @@ impl ExprSchemable for Expr {
                // Perform additional function arguments validation (due to limited
                // expressiveness of `TypeSignature`), then infer return type
-                Ok(func.return_type_from_exprs(args, schema, &new_data_types)?)
+                Ok(func.return_type_from_exprs(args, schema, &arg_data_types)?)
            }
            Expr::WindowFunction(window_function) => self
                .data_type_and_nullable_with_window_function(schema, window_function)
Which basically says: pass the input data types directly to the function call rather than calling data_types_with_scalar_udf first (which claims to do type coercion).
datafusion/datafusion/expr/src/expr_schema.rs, line 171 in 68ead28:
Ok(func.return_type_from_exprs(args, schema, &new_data_types)?)
🤔 this looks like it was added in Sep via 1b3608d (before that the input types were passed directly) 🤔
It doesn't seem right to me that ExprSchema is coercing the arguments (implicitly) 🤔
It seems like there is something about FixedSizeList that is causing issues to me
correct, #13756 (comment)
Weird, when I remove this line in expr schema the test passes (with FixedSizeList):
I did the same, basically removing this block:
datafusion/datafusion/expr/src/expr_schema.rs, lines 155 to 167 in b30c200:
// Verify that function is invoked with correct number and type of arguments as defined in `TypeSignature`
let new_data_types = data_types_with_scalar_udf(&arg_data_types, func)
    .map_err(|err| {
        plan_datafusion_err!(
            "{} {}",
            err,
            utils::generate_signature_error_msg(
                func.name(),
                func.signature().clone(),
                &arg_data_types,
            )
        )
    })?;
It's enough to fix the unit test in this PR, but other things start to fail.
It doesn't seem right to me that ExprSchema is coercing the arguments (implicitly) 🤔
agreed
It doesn't seem right to me that ExprSchema is coercing the arguments (implicitly) 🤔
We need to get the return_type of the function here, and the arguments of return_type are the coerced data types, so I think new_data_types is the right choice.
In a logical plan, the function arguments should already be of the right coerced type; we should just use them.
They may not be of the required type in the early plan-building phase (when the plan is still syntactic, not semantic), which unfortunately uses the same LogicalPlan and Expr types. #12604 would address that.
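A toy sketch of how these two views reconcile. ToyArg is invented for illustration and is not a DataFusion type: once the analyzer's coercion pass has inserted a cast, the argument's own type already equals the coerced type, so both choices agree; they diverge only for an expression that never went through that pass, such as one produced by a simplify rewrite.

use std::sync::Arc;
use arrow::datatypes::{DataType, Field};

/// Toy model of a function argument; the wrapped inner expression is omitted.
enum ToyArg {
    /// A bare column reference with its schema type.
    Column(DataType),
    /// The same column wrapped in a cast inserted by the analyzer's coercion pass.
    CastTo(DataType),
}

impl ToyArg {
    /// The type the argument reports for itself, i.e. what get_type would see.
    fn data_type(&self) -> DataType {
        match self {
            ToyArg::Column(t) | ToyArg::CastTo(t) => t.clone(),
        }
    }
}

fn main() {
    let item = Arc::new(Field::new("item", DataType::Int32, true));
    let fixed_size_list = DataType::FixedSizeList(Arc::clone(&item), 3);
    let list = DataType::List(item);

    // Normal SQL path: the analyzer already inserted the cast, so the argument's
    // own type is the coerced type and both choices give the same answer.
    let analyzed = ToyArg::CastTo(list.clone());
    assert_eq!(analyzed.data_type(), list);

    // Expression produced later (e.g. by a simplify rewrite): no coercion pass ran,
    // the argument still has its original FixedSizeList type, and a return type
    // computed from separately coerced types no longer describes the actual data.
    let rewritten = ToyArg::Column(fixed_size_list.clone());
    assert_eq!(rewritten.data_type(), fixed_size_list);
    assert_ne!(rewritten.data_type(), list);
}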
Which issue does this PR close?
Fixes #13755
Rationale for this change
Fix a bug, see issue #13755
TL;DR: fix incorrect result of ExprSchemable::get_type for an array function invoked on an array of lists.
What changes are included in this PR?
Just the fix
Are these changes tested?
unit test
Are there any user-facing changes?
yes