Make scalar and array handling for array_has consistent #13683

Kimahriman · 2024-12-07T14:01:08Z

Which issue does this PR close?

Rationale for this change

Makes null handling for array_has consistent across scalars and arrays, and makes the behavior similar to other systems such as Postgres and Spark.

What changes are included in this PR?

Updates both scalar and array handling for array_has to return null if any element on the left hand side is null and the right side is not found. This is the behavior defined by Postgres, and used by Spark as well, not sure what other database systems use:

If the array expression yields a null array, the result of ANY will be null. If the left-hand expression yields null, the result of ANY is ordinarily null (though a non-strict comparison operator could possibly yield a different result). Also, if the right-hand array contains any null elements and no true comparison result is obtained, the result of ANY will be null, not false (again, assuming a strict comparison operator). This is in accordance with SQL's normal rules for Boolean combinations of null values.

https://www.postgresql.org/docs/current/functions-comparisons.html#FUNCTIONS-COMPARISONS-ANY-SOME

Are these changes tested?

Tests are added and updated to accommodate the changes.

Are there any user-facing changes?

Slight behavior change for both scalar and array handling of array_has, which was already inconsistent.

alamb · 2024-12-07T20:19:41Z

datafusion/functions-nested/src/array_has.rs

@@ -215,7 +215,11 @@ fn array_has_dispatch_for_array<O: OffsetSizeTrait>(
        let needle_row = Scalar::new(needle.slice(i, 1));
        let eq_array = compare_with_eq(&arr, &needle_row, is_nested)?;
        let is_contained = eq_array.true_count() > 0;
-        boolean_builder.append_value(is_contained)
+        if is_contained || eq_array.null_count() == 0 {


I am surprised this isn't

Suggested change

if is_contained || eq_array.null_count() == 0 {

if is_contained && eq_array.null_count() == 0 {

I thought if the eq_array is null that means there was at least one comparison that is not known 🤔

The semantics are basically:

if the element is contained, return true

otherwise if there is a null value in the array, return null

otherwise return false

So nulls don't matter if there's match, it only matters if there's no match

alamb

Thank you for this PR @Kimahriman

I have a question

This is the behavior defined by Postgres, and used by Spark as well, not sure what other database systems use:

I think that is the right set of target systems.

comphead · 2024-12-07T22:26:07Z

datafusion/sqllogictest/test_files/array.slt

@@ -5260,6 +5270,13 @@ select array_has([], null),
 ----
 NULL NULL NULL

+# If lhs is has any Nulls, we return Null instead of false
+query BB
+select array_has([1, null, 2], 3),


I'm not sure this is a correct behavior.

Checking DuckDB

D select array_has([1, null, 2], 3); ┌───────────────────────────────────────────┐ │ array_has(main.list_value(1, NULL, 2), 3) │ │ boolean │ ├───────────────────────────────────────────┤ │ false │ └───────────────────────────────────────────┘

Hmm DuckDB is different then Spark and Postgres then

spark-sql (default)> SELECT array_contains(array(1, NULL, 3), 4); NULL

postgres=# SELECT 4 =ANY('{1, NULL, 3}'); ?column? ---------- (1 row)

Was hoping to use this to implement ArrayContains for Comet, and this is the one behavior difference.

Turns out the existing "correct" behavior for scalars doesn't match DuckDB either, which never returns null unless the whole array is null

DuckDB

D select array_has([null, null], 4); ┌───────────────────────────────────────────┐ │ array_has(main.list_value(NULL, NULL), 4) │ │ boolean │ ├───────────────────────────────────────────┤ │ false │ └───────────────────────────────────────────┘

DF

DataFusion CLI v43.0.0 > select array_has([null, null], 4); +-------------------------------------------+ | array_has(make_array(NULL,NULL),Int64(4)) | +-------------------------------------------+ | | +-------------------------------------------+ 1 row(s) fetched. Elapsed 0.001 seconds.

comphead

I'm not sure the test gives a correct answer

jayzhan211 · 2024-12-08T01:30:54Z

We follow DuckDB for array function mostly, the best I can think of is implementing spark function in https://github.com/datafusion-contrib/datafusion-functions-extra

Weijun-H · 2024-12-08T03:28:52Z

We follow DuckDB for array function mostly, the best I can think of is implementing spark function in datafusion-contrib/datafusion-functions-extra

I agree with @jayzhan211 that we should follow the DuckDB pattern for the array family function, as we consider the DuckDB results.

Kimahriman · 2024-12-08T03:33:00Z

Interesting thing is that DuckDB says they based it on the PrestoDB behavior: duckdb/duckdb#3065

But a quick look at the PrestoDB implementation suggests they use the behavior I'm suggesting, if one null is found the result is null: https://github.com/prestodb/presto/blob/b68af58583b8d58992b6ab0804d8d3618f7f402b/presto-main/src/main/java/com/facebook/presto/operator/scalar/ArrayContains.java#L52

So I wonder if DuckDBs null handling was actually intentionally or just an oversight by whoever initially implemented it

jayzhan211 · 2024-12-08T03:41:46Z

Interesting thing is that DuckDB says they based it on the PrestoDB behavior: duckdb/duckdb#3065

But a quick look at the PrestoDB implementation suggests they use the behavior I'm suggesting, if one null is found the result is null: https://github.com/prestodb/presto/blob/b68af58583b8d58992b6ab0804d8d3618f7f402b/presto-main/src/main/java/com/facebook/presto/operator/scalar/ArrayContains.java#L52

So I wonder if DuckDBs null handling was actually intentionally or just an oversight by whoever initially implemented it

Interesting, in this case I think changing the result is not a bad idea.

comphead · 2024-12-08T19:31:02Z

just checked the Trino, it returns null in such case. DuckDB returns false.
Not sure what should we do, @alamb WDYT to break the tie?

alamb · 2024-12-09T14:44:56Z

just checked the Trino, it returns null in such case. DuckDB returns false. Not sure what should we do, @alamb WDYT to break the tie?

I recommend following @jayzhan211 and @Weijun-H 's suggestion that we follow DuckDB semantics for array/list functions

I think it is past time to document what we think we are doing with respect to dialects, and I will attempt to do so shortly

Document the SQL dialect DataFusion attempts to follow #13704

jayzhan211 · 2024-12-14T02:06:18Z

Can we add config setting in SessionConfig to tune the result of function after #13519 is done?

Something like returns_null_if_contains_null, ignore_nulls, convert_null_to_value and so on. (These should be dialect-agnostic)

If returns_null_if_contains_null is true, we can have the Spark result of array_has, otherwise we have Postgres/DuckDB result.

comphead · 2024-12-14T05:37:15Z

Can we add config setting in SessionConfig to tune the result of function after #13519 is done?

Something like returns_null_if_contains_null, ignore_nulls, convert_null_to_value and so on. (These should be dialect-agnostic)

If returns_null_if_contains_null is true, we can have the Spark result of array_has, otherwise we have Postgres/DuckDB result.

That sounds reasonable but those discrepancies tends to grow so maintain those params could be unbearable. My feeling we should follow some standard whatever it is, but one and the idea of extensibility is awesome so users can implements their own behavior

Kimahriman · 2024-12-15T17:03:27Z

Yeah that does seem like a messy slippery slope. And I feel like part of the critique in #13525 was that too many things required the SessionContext to customize.

Make scalar and array handling for array_has consistent

828ccff

github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Dec 7, 2024

alamb reviewed Dec 7, 2024

View reviewed changes

alamb approved these changes Dec 7, 2024

View reviewed changes

comphead reviewed Dec 7, 2024

View reviewed changes

comphead requested changes Dec 7, 2024

View reviewed changes

alamb mentioned this pull request Dec 9, 2024

Document the SQL dialect DataFusion attempts to follow #13704

Open

alamb mentioned this pull request Dec 9, 2024

Document SQL dialect guidance #13706

Open

Kimahriman mentioned this pull request Dec 12, 2024

feat: add support for array_contains expression apache/datafusion-comet#1163

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make scalar and array handling for array_has consistent #13683

Make scalar and array handling for array_has consistent #13683

Kimahriman commented Dec 7, 2024

alamb Dec 7, 2024

Kimahriman Dec 7, 2024

alamb left a comment

comphead Dec 7, 2024

Kimahriman Dec 7, 2024

Kimahriman Dec 9, 2024

comphead left a comment

jayzhan211 commented Dec 8, 2024 •

edited

Loading

Weijun-H commented Dec 8, 2024

Kimahriman commented Dec 8, 2024

jayzhan211 commented Dec 8, 2024

comphead commented Dec 8, 2024

alamb commented Dec 9, 2024

jayzhan211 commented Dec 14, 2024 •

edited

Loading

comphead commented Dec 14, 2024

Kimahriman commented Dec 15, 2024

	if is_contained \|\| eq_array.null_count() == 0 {
	if is_contained && eq_array.null_count() == 0 {

Make scalar and array handling for array_has consistent #13683

Are you sure you want to change the base?

Make scalar and array handling for array_has consistent #13683

Conversation

Kimahriman commented Dec 7, 2024

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

alamb Dec 7, 2024

Choose a reason for hiding this comment

Kimahriman Dec 7, 2024

Choose a reason for hiding this comment

alamb left a comment

Choose a reason for hiding this comment

comphead Dec 7, 2024

Choose a reason for hiding this comment

Kimahriman Dec 7, 2024

Choose a reason for hiding this comment

Kimahriman Dec 9, 2024

Choose a reason for hiding this comment

comphead left a comment

Choose a reason for hiding this comment

jayzhan211 commented Dec 8, 2024 • edited Loading

Weijun-H commented Dec 8, 2024

Kimahriman commented Dec 8, 2024

jayzhan211 commented Dec 8, 2024

comphead commented Dec 8, 2024

alamb commented Dec 9, 2024

jayzhan211 commented Dec 14, 2024 • edited Loading

comphead commented Dec 14, 2024

Kimahriman commented Dec 15, 2024

jayzhan211 commented Dec 8, 2024 •

edited

Loading

jayzhan211 commented Dec 14, 2024 •

edited

Loading