-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make scalar and array handling for array_has consistent #13683
base: main
Are you sure you want to change the base?
Conversation
@@ -215,7 +215,11 @@ fn array_has_dispatch_for_array<O: OffsetSizeTrait>( | |||
let needle_row = Scalar::new(needle.slice(i, 1)); | |||
let eq_array = compare_with_eq(&arr, &needle_row, is_nested)?; | |||
let is_contained = eq_array.true_count() > 0; | |||
boolean_builder.append_value(is_contained) | |||
if is_contained || eq_array.null_count() == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am surprised this isn't
if is_contained || eq_array.null_count() == 0 { | |
if is_contained && eq_array.null_count() == 0 { |
I thought if the eq_array is null that means there was at least one comparison that is not known 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The semantics are basically:
- if the element is contained, return true
- otherwise if there is a null value in the array, return null
- otherwise return false
So nulls don't matter if there's match, it only matters if there's no match
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this PR @Kimahriman
I have a question
This is the behavior defined by Postgres, and used by Spark as well, not sure what other database systems use:
I think that is the right set of target systems.
@@ -5260,6 +5270,13 @@ select array_has([], null), | |||
---- | |||
NULL NULL NULL | |||
|
|||
# If lhs is has any Nulls, we return Null instead of false | |||
query BB | |||
select array_has([1, null, 2], 3), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is a correct behavior.
Checking DuckDB
D select array_has([1, null, 2], 3);
┌───────────────────────────────────────────┐
│ array_has(main.list_value(1, NULL, 2), 3) │
│ boolean │
├───────────────────────────────────────────┤
│ false │
└───────────────────────────────────────────┘
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm DuckDB is different then Spark and Postgres then
spark-sql (default)> SELECT array_contains(array(1, NULL, 3), 4);
NULL
postgres=# SELECT 4 =ANY('{1, NULL, 3}');
?column?
----------
(1 row)
Was hoping to use this to implement ArrayContains
for Comet, and this is the one behavior difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turns out the existing "correct" behavior for scalars doesn't match DuckDB either, which never returns null unless the whole array is null
DuckDB
D select array_has([null, null], 4);
┌───────────────────────────────────────────┐
│ array_has(main.list_value(NULL, NULL), 4) │
│ boolean │
├───────────────────────────────────────────┤
│ false │
└───────────────────────────────────────────┘
DF
DataFusion CLI v43.0.0
> select array_has([null, null], 4);
+-------------------------------------------+
| array_has(make_array(NULL,NULL),Int64(4)) |
+-------------------------------------------+
| |
+-------------------------------------------+
1 row(s) fetched.
Elapsed 0.001 seconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure the test gives a correct answer
We follow DuckDB for array function mostly, the best I can think of is implementing spark function in https://github.com/datafusion-contrib/datafusion-functions-extra |
I agree with @jayzhan211 that we should follow the DuckDB pattern for the array family function, as we consider the DuckDB results. |
Interesting thing is that DuckDB says they based it on the PrestoDB behavior: duckdb/duckdb#3065 But a quick look at the PrestoDB implementation suggests they use the behavior I'm suggesting, if one null is found the result is null: https://github.com/prestodb/presto/blob/b68af58583b8d58992b6ab0804d8d3618f7f402b/presto-main/src/main/java/com/facebook/presto/operator/scalar/ArrayContains.java#L52 So I wonder if DuckDBs null handling was actually intentionally or just an oversight by whoever initially implemented it |
Interesting, in this case I think changing the result is not a bad idea. |
just checked the Trino, it returns null in such case. DuckDB returns false. |
I recommend following @jayzhan211 and @Weijun-H 's suggestion that we follow DuckDB semantics for array/list functions I think it is past time to document what we think we are doing with respect to dialects, and I will attempt to do so shortly |
Can we add config setting in Something like If |
That sounds reasonable but those discrepancies tends to grow so maintain those params could be unbearable. My feeling we should follow some standard whatever it is, but one and the idea of extensibility is awesome so users can implements their own behavior |
Yeah that does seem like a messy slippery slope. And I feel like part of the critique in #13525 was that too many things required the |
Which issue does this PR close?
Closes #13682
Rationale for this change
Makes null handling for
array_has
consistent across scalars and arrays, and makes the behavior similar to other systems such as Postgres and Spark.What changes are included in this PR?
Updates both scalar and array handling for
array_has
to return null if any element on the left hand side is null and the right side is not found. This is the behavior defined by Postgres, and used by Spark as well, not sure what other database systems use:https://www.postgresql.org/docs/current/functions-comparisons.html#FUNCTIONS-COMPARISONS-ANY-SOME
Are these changes tested?
Tests are added and updated to accommodate the changes.
Are there any user-facing changes?
Slight behavior change for both scalar and array handling of
array_has
, which was already inconsistent.