Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make scalar and array handling for array_has consistent #13683

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Kimahriman
Copy link

Which issue does this PR close?

Closes #13682

Rationale for this change

Makes null handling for array_has consistent across scalars and arrays, and makes the behavior similar to other systems such as Postgres and Spark.

What changes are included in this PR?

Updates both scalar and array handling for array_has to return null if any element on the left hand side is null and the right side is not found. This is the behavior defined by Postgres, and used by Spark as well, not sure what other database systems use:

If the array expression yields a null array, the result of ANY will be null. If the left-hand expression yields null, the result of ANY is ordinarily null (though a non-strict comparison operator could possibly yield a different result). Also, if the right-hand array contains any null elements and no true comparison result is obtained, the result of ANY will be null, not false (again, assuming a strict comparison operator). This is in accordance with SQL's normal rules for Boolean combinations of null values.

https://www.postgresql.org/docs/current/functions-comparisons.html#FUNCTIONS-COMPARISONS-ANY-SOME

Are these changes tested?

Tests are added and updated to accommodate the changes.

Are there any user-facing changes?

Slight behavior change for both scalar and array handling of array_has, which was already inconsistent.

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Dec 7, 2024
@@ -215,7 +215,11 @@ fn array_has_dispatch_for_array<O: OffsetSizeTrait>(
let needle_row = Scalar::new(needle.slice(i, 1));
let eq_array = compare_with_eq(&arr, &needle_row, is_nested)?;
let is_contained = eq_array.true_count() > 0;
boolean_builder.append_value(is_contained)
if is_contained || eq_array.null_count() == 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am surprised this isn't

Suggested change
if is_contained || eq_array.null_count() == 0 {
if is_contained && eq_array.null_count() == 0 {

I thought if the eq_array is null that means there was at least one comparison that is not known 🤔

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The semantics are basically:

  • if the element is contained, return true
  • otherwise if there is a null value in the array, return null
  • otherwise return false

So nulls don't matter if there's match, it only matters if there's no match

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this PR @Kimahriman

I have a question

This is the behavior defined by Postgres, and used by Spark as well, not sure what other database systems use:

I think that is the right set of target systems.

@@ -5260,6 +5270,13 @@ select array_has([], null),
----
NULL NULL NULL

# If lhs is has any Nulls, we return Null instead of false
query BB
select array_has([1, null, 2], 3),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is a correct behavior.

Checking DuckDB

D select array_has([1, null, 2], 3);
┌───────────────────────────────────────────┐
│ array_has(main.list_value(1, NULL, 2), 3) │
│                  boolean                  │
├───────────────────────────────────────────┤
│ false                                     │
└───────────────────────────────────────────┘

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm DuckDB is different then Spark and Postgres then

spark-sql (default)> SELECT array_contains(array(1, NULL, 3), 4);
NULL
postgres=# SELECT 4 =ANY('{1, NULL, 3}');
 ?column? 
----------
 
(1 row)

Was hoping to use this to implement ArrayContains for Comet, and this is the one behavior difference.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out the existing "correct" behavior for scalars doesn't match DuckDB either, which never returns null unless the whole array is null

DuckDB

D select array_has([null, null], 4);
┌───────────────────────────────────────────┐
│ array_has(main.list_value(NULL, NULL), 4) │
│                  boolean                  │
├───────────────────────────────────────────┤
│ false                                     │
└───────────────────────────────────────────┘

DF

DataFusion CLI v43.0.0
> select array_has([null, null], 4);
+-------------------------------------------+
| array_has(make_array(NULL,NULL),Int64(4)) |
+-------------------------------------------+
|                                           |
+-------------------------------------------+
1 row(s) fetched. 
Elapsed 0.001 seconds.

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure the test gives a correct answer

@jayzhan211
Copy link
Contributor

jayzhan211 commented Dec 8, 2024

We follow DuckDB for array function mostly, the best I can think of is implementing spark function in https://github.com/datafusion-contrib/datafusion-functions-extra

@Weijun-H
Copy link
Member

Weijun-H commented Dec 8, 2024

We follow DuckDB for array function mostly, the best I can think of is implementing spark function in datafusion-contrib/datafusion-functions-extra

I agree with @jayzhan211 that we should follow the DuckDB pattern for the array family function, as we consider the DuckDB results.

@Kimahriman
Copy link
Author

Interesting thing is that DuckDB says they based it on the PrestoDB behavior: duckdb/duckdb#3065

But a quick look at the PrestoDB implementation suggests they use the behavior I'm suggesting, if one null is found the result is null: https://github.com/prestodb/presto/blob/b68af58583b8d58992b6ab0804d8d3618f7f402b/presto-main/src/main/java/com/facebook/presto/operator/scalar/ArrayContains.java#L52

So I wonder if DuckDBs null handling was actually intentionally or just an oversight by whoever initially implemented it

@jayzhan211
Copy link
Contributor

Interesting thing is that DuckDB says they based it on the PrestoDB behavior: duckdb/duckdb#3065

But a quick look at the PrestoDB implementation suggests they use the behavior I'm suggesting, if one null is found the result is null: https://github.com/prestodb/presto/blob/b68af58583b8d58992b6ab0804d8d3618f7f402b/presto-main/src/main/java/com/facebook/presto/operator/scalar/ArrayContains.java#L52

So I wonder if DuckDBs null handling was actually intentionally or just an oversight by whoever initially implemented it

Interesting, in this case I think changing the result is not a bad idea.

@comphead
Copy link
Contributor

comphead commented Dec 8, 2024

just checked the Trino, it returns null in such case. DuckDB returns false.
Not sure what should we do, @alamb WDYT to break the tie?

@alamb
Copy link
Contributor

alamb commented Dec 9, 2024

just checked the Trino, it returns null in such case. DuckDB returns false. Not sure what should we do, @alamb WDYT to break the tie?

I recommend following @jayzhan211 and @Weijun-H 's suggestion that we follow DuckDB semantics for array/list functions

I think it is past time to document what we think we are doing with respect to dialects, and I will attempt to do so shortly

@jayzhan211
Copy link
Contributor

jayzhan211 commented Dec 14, 2024

Can we add config setting in SessionConfig to tune the result of function after #13519 is done?

Something like returns_null_if_contains_null, ignore_nulls, convert_null_to_value and so on. (These should be dialect-agnostic)

If returns_null_if_contains_null is true, we can have the Spark result of array_has, otherwise we have Postgres/DuckDB result.

@comphead
Copy link
Contributor

Can we add config setting in SessionConfig to tune the result of function after #13519 is done?

Something like returns_null_if_contains_null, ignore_nulls, convert_null_to_value and so on. (These should be dialect-agnostic)

If returns_null_if_contains_null is true, we can have the Spark result of array_has, otherwise we have Postgres/DuckDB result.

That sounds reasonable but those discrepancies tends to grow so maintain those params could be unbearable. My feeling we should follow some standard whatever it is, but one and the idea of extensibility is awesome so users can implements their own behavior

@Kimahriman
Copy link
Author

Yeah that does seem like a messy slippery slope. And I feel like part of the critique in #13525 was that too many things required the SessionContext to customize.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

array_has has inconsistent null handling for scalars and arrays
5 participants