Fix get_type for higher-order array functions #13756

findepi · 2024-12-13T13:32:23Z

Which issue does this PR close?

Rationale for this change

Fix a bug, see issue #13755
TL;DR: fix incorrect result of ExprSchemable::get_type for an array function invoked on array of list

What changes are included in this PR?

Just the fix

Are these changes tested?

unit test

Are there any user-facing changes?

yes

findepi · 2024-12-13T13:33:26Z

datafusion/functions-nested/src/extract.rs

+        assert_eq!(
+            ExprSchemable::get_type(&udf_expr, &schema).unwrap(),
+            complex_type
+        );


This didn't pass before the change. The assertions above did pass.

datafusion/functions-nested/src/flatten.rs

The fix is covered by recursive flatten test case in array.slt

jayzhan211 · 2024-12-13T14:11:52Z

datafusion/expr/src/type_coercion/functions.rs

+        }
+    }
+
+    fn recursive_array(array_type: &DataType) -> Option<DataType> {


Can we extend the existing array function for nested array instead of creating another signature for nested array

I don't know how to do this, please advise!
But this function should go away with #13757.

But this function should go away with #13757.

I don't understand -- if the goal is to remove recursive flattening, should we be adding new code to support it 🤔

the pre-existing array signature implied recursively array-infication (replacing FixedLengthList with List, recursively), didn't imply flattening.

the recursive type normalization matters for flatten only, cause it (currently) operates recursively and otherwise would need to gain code to handle FixedLengthList inputs

the recursive array-ification was useless for other array functions and was made non-recursive.
to compensate for this change, new RecursiveArray signature was added for flatten case.

jayzhan211 · 2024-12-13T14:12:48Z

datafusion/functions-nested/src/extract.rs

+    use std::collections::HashMap;
+
+    #[test]
+    fn test_array_element_return_type() {


I think we can add tests in slt file that cover the array signature test cases, so we can avoid creating rust test here.

The rust test allows explicitly exercising various ways of getting expression type.
Before i wrote it, I wasn't even sure whether it's a bug or a feature.

I can add slt test, how would it look like?

I did try to write some slt regression tests, but i couldn't expose the bug. Yet, the unit tests proves the bug exists.
I trust you have a better intuition how signature related bug can be exposed in SLT. Please advise.

alamb

THanks @findepi and @jayzhan211

From what I can see the point of this PR is to make array_element_udf have different type resolution rules (non recursive), which seems reasonable

However, as you both mention I don't seem to be able to trigger the problem from SQL (element access seems to work correctly): (e.g. the [[20]] isn't flattened in on main:

> create table t as values ([[[10]], [[20]]]);
0 row(s) fetched.
Elapsed 0.007 seconds.

> explain select column1[2] from t;
+---------------+---------------------------------------------------------------------------+
| plan_type     | plan                                                                      |
+---------------+---------------------------------------------------------------------------+
| logical_plan  | Projection: array_element(t.column1, Int64(2))                            |
|               |   TableScan: t projection=[column1]                                       |
| physical_plan | ProjectionExec: expr=[array_element(column1@0, 2) as t.column1[Int64(2)]] |
|               |   MemoryExec: partitions=1, partition_sizes=[1]                           |
|               |                                                                           |
+---------------+---------------------------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.001 seconds.

> select column1[2] from t;
+---------------------+
| t.column1[Int64(2)] |
+---------------------+
| [[20]]              |
+---------------------+
1 row(s) fetched.
Elapsed 0.001 seconds.

And the type seems good too list(list(int))

> select arrow_typeof(column1[2]) from t;
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| arrow_typeof(t.column1[Int64(2)])                                                                                                                                                                                           |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| List(Field { name: "item", data_type: List(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

So this problem is quite strange. How is it working today (without this change) 🤔

alamb · 2024-12-13T20:02:36Z

datafusion/expr/src/type_coercion/functions.rs

+        }
+    }
+
+    fn recursive_array(array_type: &DataType) -> Option<DataType> {


But this function should go away with #13757.

I don't understand -- if the goal is to remove recursive flattening, should we be adding new code to support it 🤔

findepi · 2024-12-13T20:40:49Z

So this problem is quite strange. How is it working today (without this change) 🤔

i believe the bug -- if we agree this is a bug -- is compensated by other factors.
For example, at early planning stage it's totally OK to change expression types.
Later, such a change triggers schema change assertion.

I found this bug in case where array_element was inserted into the plan as a result of ScalarUDFImpl::simplify. At this stage it's "loose typing" is no longer OK.

@alamb @jayzhan211 can you please review the attached unit test?
Does it look sound, ie should it pass?
Does it pass for you without other changes from the PR?

alamb · 2024-12-13T21:19:23Z

I am checking this out in more detail

alamb

I am still digging. This is so weird

I messed with the test and it seems like the failure only happens when the complex type is a FixedSizeList for some reason..

alamb · 2024-12-13T21:16:00Z

datafusion/expr/src/type_coercion/functions.rs

    fn array(array_type: &DataType) -> Option<DataType> {
+        match array_type {


so this says that if the type is a list, keep the type, but if the type is large list / fixed size list then take the field type?

Why doesn't it also take the field type for List 🤔 ? (Aka it doesn't make sense to me that List is treated differently than LargeList and FixedSizeList

for backwards compat i should keep LargeList so it stays LargeList, will push shortly

Aka it doesn't make sense to me that List is treated differently than LargeList and FixedSizeList

not my invention, it was like this before.
i think the intention is "converge List, LL and FSL into one type... or maybe two types... to keep UDF impl simpler".

i am not attached to this approach, but i think code may be reliant on that

alamb · 2024-12-13T21:24:24Z

datafusion/functions-nested/src/extract.rs

+
+    #[test]
+    fn test_array_element_return_type() {
+        let complex_type = DataType::FixedSizeList(


When I change this complex type to DataType::List the test passes 🤔

let complex_type = DataType::List( Field::new("some_arbitrary_test_field", DataType::Int32, false).into(), );

It also passes when complex_type is a Struct

let complex_type = DataType::Struct(Fields::from(vec![ Arc::new(Field::new("some_arbitrary_test_field", DataType::Int32, false)), ]));

It seems like there is something about FixedSizeList that is causing issues to me

Weird, when I remove this line in expr schema the test passes (with FixedSizedList):

diff --git a/datafusion/expr/src/expr_schema.rs b/datafusion/expr/src/expr_schema.rs index 3317deafb..50aeb222f 100644 --- a/datafusion/expr/src/expr_schema.rs +++ b/datafusion/expr/src/expr_schema.rs @@ -152,6 +152,7 @@ impl ExprSchemable for Expr { .map(|e| e.get_type(schema)) .collect::<Result<Vec<_>>>()?; + // Verify that function is invoked with correct number and type of arguments as defined in `TypeSignature` let new_data_types = data_types_with_scalar_udf(&arg_data_types, func) .map_err(|err| { @@ -168,7 +169,7 @@ impl ExprSchemable for Expr { // Perform additional function arguments validation (due to limited // expressiveness of `TypeSignature`), then infer return type - Ok(func.return_type_from_exprs(args, schema, &new_data_types)?) + Ok(func.return_type_from_exprs(args, schema, &arg_data_types)?) } Expr::WindowFunction(window_function) => self .data_type_and_nullable_with_window_function(schema, window_function)

Which basically says pass the input data types directly to the function call rather than calling data_types_with_scalar_udf first (which claims to type coercion)

datafusion/datafusion/expr/src/expr_schema.rs

Line 171 in 68ead28

Ok(func.return_type_from_exprs(args, schema, &new_data_types)?)

🤔 this looks like it was added in Sep via 1b3608d (before that the input types were passed directly) 🤔

It doesn't seem right to me that ExprSchema is coercing the arguments (implicitly) to me 🤔

It seems like there is something about FixedSizeList that is causing issues to me

correct, #13756 (comment)

Weird, when I remove this line in expr schema the test passes (with FixedSizedList):

i did the same, basically removing this block

datafusion/datafusion/expr/src/expr_schema.rs

Lines 155 to 167 in b30c200

// Verify that function is invoked with correct number and type of arguments as defined in `TypeSignature`

let new_data_types = data_types_with_scalar_udf(&arg_data_types, func)

.map_err(|err| {

plan_datafusion_err!(

"{} {}",

err,

utils::generate_signature_error_msg(

func.name(),

func.signature().clone(),

&arg_data_types,

)

)

})?;

it's enough to fix the unit test in this PR
but other things start to fail

It doesn't seem right to me that ExprSchema is coercing the arguments (implicitly) to me 🤔

agreed

the function arguments should already be of the right coerced type

I don't know the context of why we needed to apply coercion rules in the first place

The reason is because we can't guarantee the input is already coerced.

To determine the return type of a function for a given set of inputs, we follow these steps:

Input Validation: Check if the number of inputs is correct and whether their types match the expected types.

Type Coercion: If the input types don't match exactly, attempt to coerce them into compatible types.

Return Type Decision: Once coercion is complete (if applicable), decide the return type based on the resulting input types.

That is why we have coercion in get_type for return_type. We can move out the coercion in get_type to ScalarFunction::new_udf

How about we compute the return_type when the function is created, and get_type read the value.

I like the idea in principle.

It should be combined with a new ScalarUDFImpl sub-trait that doesn't have return type-related methods at all, since they are not to be used once the plan is constructed.

The reason is because we can't guarantee the input is already coerced.

in a logical plan we can.

My understanding is that coercing analyzer also calls the get_type functions.
It can be solved by changing how the coercing analyzer tracks its internal state.

But the real problem is that same types, the LogicalPlan & Expr, have two meanings: syntactic and semantic. So in the code we go back and forth about what should and what cannot be guaranteed for an Expr or LogicalPlan instance.

the LogicalPlan & Expr, have two meanings: syntactic and semantic.

Is there example about the difference of this two, especially for function. For Expr::ScalarFunction, it has no difference in LogicalPlan, we don't do anything special, but I think this is what you don't expect. What should we have in LogicalPlan, Expr::ScalarFunction but with coerced input?

since they are not to be used once the plan is constructed.

Why get_type is not supposed to be available after plan is constructed from Expr.

Is there example about the difference of this two, especially for function.

the difference is more apparent for duplicate syntax (such is IS NULL vs IS UNKNOWN), syntax sugar (order by 1, order by all, select *)
for function call the difference is about function being resolved (typed and inputs coerced) or not.

since they are not to be used once the plan is constructed.

Why get_type is not supposed to be available after plan is constructed from Expr.

for a fully resolved logical plan it's fair question to ask what is the type of an expression (and this may or may not be O(1) available answer)

however, there is no point to ask a UDF what is its type, since we already asked it

think of this as engine and UDF being implemented by independent parties, with UDF being a contract layer.
you go over a contract layer when you have to (analysis time), but going over contract layer multiple times with the same question should be avoided.

findepi · 2024-12-13T21:37:02Z

I messed with the test and it seems like the failure only happens when the complex type is a FixedSizeList for some reason..

because coerced_fixed_size_list_to_list called here is recursive

datafusion/datafusion/expr/src/type_coercion/functions.rs

Line 422 in 55e56c4

let array_type = coerced_fixed_size_list_to_list(array_type);

jayzhan211 · 2024-12-14T01:33:15Z

ExprSchemable::get_type for ScalarFunction is basically asking the return_type for the function. Given that we coerce fixed size list to list, the return type of array_element(fixed size list) makes sense to be list. Therefore, I think the unit test is expected to fail since it is coerced to List

findepi · 2024-12-14T14:39:26Z

Given that we coerce fixed size list to list, the return type of array_element(fixed size list) makes sense to be list.

in the unit test, we ask for array_element(list(fixed size list)) and we expect the return type to be fixed size list.
in the fix, we make so that array_element(list(T)) always returns T.

jayzhan211 · 2024-12-16T02:43:42Z

Given that we coerce fixed size list to list, the return type of array_element(fixed size list) makes sense to be list.

in the unit test, we ask for array_element(list(fixed size list)) and we expect the return type to be fixed size list. in the fix, we make so that array_element(list(T)) always returns T.

The idea to coerce fixed size list to list is to simplify the logic to handle both kinds of list. Unless this leads to issue otherwise I think we should keep this aggressive coercion.

findepi · 2024-12-17T12:24:55Z

The idea to coerce fixed size list to list is to simplify the logic to handle both kinds of list.

100% agreed

Unless this leads to issue otherwise I think we should keep this aggressive coercion.

it does, because the logic was too eager (recursive where only single-step is needed).
As proven by the unit test attached to the issue.

I am naturally biased towards merging this PR, as it solves a real-life problem I encountered and had to workaround.
@alamb @jayzhan211 what problem are we solving by not merging it?

jayzhan211 · 2024-12-17T13:44:20Z

it solves a real-life problem

I hope we can have an end2end test in slt if this is a real issue. I can help to find such test when I have time.

(recursive where only single-step is needed).

I expect such test shows this mentioned issue.

Can you explain more on the reason this eager coercion is an issue? The given unit test I don't think is correct, because the return type List is what I expect not FixedSizeList.

I think an example that coerce inner fixed size list to list result incorrect result in a valid sql query (from Postgres, DuckDB) would helps a lot.

alamb

I am naturally biased towards merging this PR, as it solves a real-life problem I encountered and had to workaround.
@alamb @jayzhan211 what problem are we solving by not merging it?

I had two concerns with this PR:

It introduces a new API that initially I thought was going to be removed again, which sounded confusing
It may introduce errors / other bugs or potentially mask additional problems

After more time to think about it, however, I am convinced that this PR is a step forwards.

The split between Array and RecursiveArray I think makes more sense as they are doing two fundamentally different things (aka flatten flattens some arbitrary number of levels)
While this may mask other bugs, all the existing tests pass and thus this PR seems to be a step forward. If we have a gap in test coverage we should fix that

In terms of @jayzhan211 's concerns:

Can you explain more on the reason this eager coercion is an issue? The given unit test I don't think is correct, because the return type List is what I expect not FixedSizeList.

In my mind, selecting an element of a list would return the same type as the element. For example, an element of List(FixedSizeList) is FixedSizeList which is what this PR does.

I tried quite hard to construct a List(FixedSizeList) via SQL and could not. This suggests to me we have some sort of gap / over eager conversion to List

> create table t as values (arrow_cast([1,2,3], 'FixedSizeList(3, Int64)'), arrow_cast([3,4,5], 'FixedSizeList(3, Int64)') );
0 row(s) fetched.
Elapsed 0.004 seconds.

> select * from t;
+-----------+-----------+
| column1   | column2   |
+-----------+-----------+
| [1, 2, 3] | [3, 4, 5] |
+-----------+-----------+
1 row(s) fetched.
Elapsed 0.001 seconds.

-- The elements are FixedSizedList
> select arrow_typeof(column1), arrow_typeof(column2) from t;
+------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+
| arrow_typeof(t.column1)                                                                                                      | arrow_typeof(t.column2)                                                                                                      |
+------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+
| FixedSizeList(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 3) | FixedSizeList(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 3) |
+------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.002 seconds.

> select [column1, column2] from t;
+---------------------------------+
| make_array(t.column1,t.column2) |
+---------------------------------+
| [[1, 2, 3], [3, 4, 5]]          |
+---------------------------------+
1 row(s) fetched.
Elapsed 0.003 seconds.

-- Note making a list of the two fixed sized lists converts them into lists
> select arrow_typeof([column1, column2]) from t;
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| arrow_typeof(make_array(t.column1,t.column2))                                                                                                                                                                               |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| List(Field { name: "item", data_type: List(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.003 seconds.

I will file a ticket about over eager coercion to list

alamb · 2024-12-17T17:15:27Z

datafusion/expr/src/type_coercion/functions.rs

    fn array(array_type: &DataType) -> Option<DataType> {
+        match array_type {
+            DataType::List(_) | DataType::LargeList(_) => Some(array_type.clone()),
+            DataType::FixedSizeList(field, _) => Some(DataType::List(Arc::clone(field))),


Suggested change

DataType::FixedSizeList(field, _) => Some(DataType::List(Arc::clone(field))),

// Note array functions can often change the number of elements

// so convert from FixedSize --> variable

DataType::FixedSizeList(field, _) => Some(DataType::List(Arc::clone(field))),

alamb · 2024-12-17T17:33:14Z

Filed Cannot create a List of FixedSizedList in SQL #13819 to track inability to create List(FixedSizeList)

findepi · 2024-12-17T20:57:15Z

Filed Can not create a List of FixedSizedList in SQL #13819 to track inability to create List(FixedSizeList)

Looks related indeed
thank you @alamb

findepi · 2024-12-17T21:00:53Z

since the bug turned out to be specific for list of fixed size list, i updated the test naming (and var naming inside the test)

* Minor: Use `div_ceil` * Fix hash join with sort push down (#13560) * fix: join with sort push down * chore: insert some value * apply suggestion * recover handle_costom_pushdown change * apply suggestion * add more test * add partition * Improve substr() performance by avoiding using owned string (#13688) Co-authored-by: zhangli20 <[email protected]> * reinstate down_cast_any_ref (#13705) * Optimize performance of `character_length` function (#13696) * Optimize performance of function Signed-off-by: Tai Le Manh <[email protected]> * Add pre-check array is null * Fix clippy warnings --------- Signed-off-by: Tai Le Manh <[email protected]> * Update prost-build requirement from =0.13.3 to =0.13.4 (#13698) Updates the requirements on [prost-build](https://github.com/tokio-rs/prost) to permit the latest version. - [Release notes](https://github.com/tokio-rs/prost/releases) - [Changelog](https://github.com/tokio-rs/prost/blob/master/CHANGELOG.md) - [Commits](https://github.com/tokio-rs/prost/compare/v0.13.3...v0.13.4) --- updated-dependencies: - dependency-name: prost-build dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Minor: Output elapsed time for sql logic test (#13718) * Minor: Output elapsed time for sql logic test * refactor: simplify the `make_udf_function` macro (#13712) * refactor: replace `Vec` with `IndexMap` for expression mappings in `ProjectionMapping` and `EquivalenceGroup` (#13675) * refactor: replace Vec with IndexMap for expression mappings in ProjectionMapping and EquivalenceGroup * chore * chore: Fix CI * chore: comment * chore: simplify * Handle alias when parsing sql(parse_sql_expr) (#12939) * fix: Fix parse_sql_expr not handling alias * cargo fmt * fix parse_sql_expr example(remove alias) * add testing * add SUM udaf to TestContextProvider and modify test_sql_to_expr_with_alias for function * revert change on example `parse_sql_expr` * Improve documentation for TableProvider (#13724) * Reveal implementing type and return type in simple UDF implementations (#13730) Debug trait is useful for understanding what something is and how it's configured, especially if the implementation is behind dyn trait. * minor: Extract tests for `EXTRACT` AND `date_part` to their own file (#13731) * Support unparsing `UNNEST` plan to `UNNEST` table factor SQL (#13660) * add `unnest_as_table_factor` and `UnnestRelationBuilder` * unparse unnest as table factor * fix typo * add tests for the default configs * add a static const for unnest_placeholder * fix tests * fix tests * Update to apache-avro 0.17, fix compatibility changes schema handling (#13727) * Update apache-avro requirement from 0.16 to 0.17 --- updated-dependencies: - dependency-name: apache-avro dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> * Fix compatibility changes schema handling apache-avro 0.17 - Handle ArraySchema struct - Handle MapSchema struct - Map BigDecimal => LargeBinary - Map TimestampNanos => Timestamp(TimeUnit::Nanosecond, None) - Map LocalTimestampNanos => todo!() - Add Default to FixedSchema test * Update Cargo.lock file for apache-avro 0.17 --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Marc Droogh <[email protected]> Co-authored-by: Andrew Lamb <[email protected]> * Minor: Add doc example to RecordBatchStreamAdapter (#13725) * Minor: Add doc example to RecordBatchStreamAdapter * Update datafusion/physical-plan/src/stream.rs Co-authored-by: Berkay Şahin <[email protected]> --------- Co-authored-by: Berkay Şahin <[email protected]> * Implement GroupsAccumulator for corr(x,y) aggregate function (#13581) * Implement GroupsAccumulator for corr(x,y) * feedbacks * fix CI MSRV * review * avoid collect in accumulation * add back cast * fix union serialisation order in proto (#13709) * fix union serialisation order in proto * clippy * address comments * Minor: make unsupported `nanosecond` part a real (not internal) error (#13733) * Minor: make unsupported `nanosecond` part a real (not internal) error * fmt * Improve wording to refer to date part * Add tests for date_part on columns + timestamps with / without timezones (#13732) * Add tests for date_part on columns + timestamps with / without timezones * Add tests from https://github.com/apache/datafusion/pull/13372 * remove trailing whitespace * Optimize performance of `initcap` function (~2x faster) (#13691) * Optimize performance of initcap (~2x faster) Signed-off-by: Tai Le Manh <[email protected]> * format --------- Signed-off-by: Tai Le Manh <[email protected]> * Minor: Add documentation explaining that initcap oly works for ASCII (#13749) * Support sqllogictest --complete with postgres (#13746) Before the change, the request to use PostgreSQL was simply ignored when `--complete` flag was present. * doc-gen: migrate window functions documentation to attribute based (#13739) * doc-gen: migrate window functions documentation Signed-off-by: zjregee <[email protected]> * fix: update Cargo.lock --------- Signed-off-by: zjregee <[email protected]> * Minor: Remove memory reservation in `JoinLeftData` used in HashJoin (#13751) * Refactor JoinLeftData structure by removing unused memory reservation field in hash join implementation * Add Debug and Clone derives for HashJoinStreamState and ProcessProbeBatchState enums This commit enhances the HashJoinStreamState and ProcessProbeBatchState structures by implementing the Debug and Clone traits, allowing for easier debugging and cloning of these state representations in the hash join implementation. * Update to bigdecimal 0.4.7 (#13747) * Add big decimal formatting test cases with potential trailing zeros * Rename and simplify decimal rendering functions - add `decimal` to function name - drop `precision` parameter as it is not supposed to affect the result * Update to bigdecimal 0.4.7 Utilize new `to_plain_string` function * chore: clean up dependencies (#13728) * CI: Warn on unused crates * CI: Warn on unused crates * CI: Warn on unused crates * CI: Warn on unused crates * CI: Clean up dependencies * CI: Clean up dependencies * fix: Implicitly plan `UNNEST` as lateral (#13695) * plan implicit lateral if table factor is UNNEST * check for outer references in `create_relation_subquery` * add sqllogictest * fix lateral constant test to not expect a subquery node * replace sqllogictest in favor of logical plan test * update lateral join sqllogictests * add sqllogictests * fix logical plan test * Minor: improve the Deprecation / API health guidelines (#13701) * Minor: improve the Deprecation / API health policy * prettier * Update docs/source/library-user-guide/api-health.md Co-authored-by: Jonah Gao <[email protected]> * Add version guidance and make more copy/paste friendly * prettier * better * rename to guidelines --------- Co-authored-by: Jonah Gao <[email protected]> * fix: specify roottype in substrait fieldreference (#13647) * fix: specify roottype in fieldreference Signed-off-by: MBWhite <[email protected]> * Fix formatting Signed-off-by: MBWhite <[email protected]> * review suggestion Signed-off-by: MBWhite <[email protected]> --------- Signed-off-by: MBWhite <[email protected]> * Simplify type signatures using `TypeSignatureClass` for mixed type function signature (#13372) * add type sig class Signed-off-by: jayzhan211 <[email protected]> * timestamp Signed-off-by: jayzhan211 <[email protected]> * date part Signed-off-by: jayzhan211 <[email protected]> * fmt Signed-off-by: jayzhan211 <[email protected]> * taplo format Signed-off-by: jayzhan211 <[email protected]> * tpch test Signed-off-by: jayzhan211 <[email protected]> * msrc issue Signed-off-by: jayzhan211 <[email protected]> * msrc issue Signed-off-by: jayzhan211 <[email protected]> * explicit hash Signed-off-by: jayzhan211 <[email protected]> * Enhance type coercion and function signatures - Added logic to prevent unnecessary casting of string types in `native.rs`. - Introduced `Comparable` variant in `TypeSignature` to define coercion rules for comparisons. - Updated imports in `functions.rs` and `signature.rs` for better organization. - Modified `date_part.rs` to improve handling of timestamp extraction and fixed query tests in `expr.slt`. - Added `datafusion-macros` dependency in `Cargo.toml` and `Cargo.lock`. These changes improve type handling and ensure more accurate function behavior in SQL expressions. * fix comment Signed-off-by: Jay Zhan <[email protected]> * fix signature Signed-off-by: Jay Zhan <[email protected]> * fix test Signed-off-by: Jay Zhan <[email protected]> * Enhance type coercion for timestamps to allow implicit casting from strings. Update SQL logic tests to reflect changes in timestamp handling, including expected outputs for queries involving nanoseconds and seconds. * Refactor type coercion logic for timestamps to improve readability and maintainability. Update the `TypeSignatureClass` documentation to clarify its purpose in function signatures, particularly regarding coercible types. This change enhances the handling of implicit casting from strings to timestamps. * Fix SQL logic tests to correct query error handling for timestamp functions. Updated expected outputs for `date_part` and `extract` functions to reflect proper behavior with nanoseconds and seconds. This change improves the accuracy of test cases in the `expr.slt` file. * Enhance timestamp handling in TypeSignature to support timezone specification. Updated the logic to include an additional DataType for timestamps with a timezone wildcard, improving flexibility in timestamp operations. * Refactor date_part function: remove redundant imports and add missing not_impl_err import for better error handling --------- Signed-off-by: jayzhan211 <[email protected]> Signed-off-by: Jay Zhan <[email protected]> * Minor: Add some more blog posts to the readings page (#13761) * Minor: Add some more blog posts to the readings page * prettier * prettier * Update docs/source/user-guide/concepts-readings-events.md --------- Co-authored-by: Oleks V <[email protected]> * docs: update GroupsAccumulator instead of GroupAccumulator (#13787) Fixing `GroupsAccumulator` trait name in its docs * Improve Deprecation Guidelines more (#13776) * Improve deprecation guidelines more * prettier * fix: add `null_buffer` length check to `StringArrayBuilder`/`LargeStringArrayBuilder` (#13758) * fix: add `null_buffer` check for `LargeStringArray` Add a safety check to ensure that the alignment of buffers cannot be overflowed. This introduces a panic if they are not aligned through a runtime assertion. * fix: remove value_buffer assertion These buffers can be misaligned and it is not problematic, it is the `null_buffer` which we care about being of the same length. * feat: add `null_buffer` check to `StringArray` This is in a similar vein to `LargeStringArray`, as the code is the same, except for `i32`'s instead of `i64`. * feat: use `row_count` var to avoid drift * Revert the removal of reservation in HashJoin (#13792) * fix: restore memory reservation in JoinLeftData for accurate memory accounting in HashJoin This commit reintroduces the `_reservation` field in the `JoinLeftData` structure to ensure proper tracking of memory resources during join operations. The absence of this field could lead to inconsistent memory usage reporting and potential out-of-memory issues as upstream operators increase their memory consumption. * fmt Signed-off-by: Jay Zhan <[email protected]> --------- Signed-off-by: Jay Zhan <[email protected]> * added count aggregate slt (#13790) * Update documentation guidelines for contribution content (#13703) * Update documentation guidelines for contribution content * Apply suggestions from code review Co-authored-by: Piotr Findeisen <[email protected]> Co-authored-by: Oleks V <[email protected]> * clarify discussions and remove requirements note * prettier * Update docs/source/contributor-guide/index.md Co-authored-by: Piotr Findeisen <[email protected]> --------- Co-authored-by: Piotr Findeisen <[email protected]> Co-authored-by: Oleks V <[email protected]> * Add Round trip tests for Array <--> ScalarValue (#13777) * Add Round trip tests for Array <--> ScalarValue * String dictionary test * remove unecessary value * Improve comments * fix: Limit together with pushdown_filters (#13788) * fix: Limit together with pushdown_filters * Fix format * Address new comments * Fix testing case to hit the problem * Minor: improve Analyzer docs (#13798) * Minor: cargo update in datafusion-cli (#13801) * Update datafusion-cli toml to pin home=0.5.9 * update Cargo.lock * Fix `ScalarValue::to_array_of_size` for DenseUnion (#13797) * fix: enable pruning by bloom filters for dictionary columns (#13768) * Handle empty rows for `array_distinct` (#13810) * handle empty array distinct * ignore * fix --------- Co-authored-by: Cyprien Huet <[email protected]> * Fix get_type for higher-order array functions (#13756) * Fix get_type for higher-order array functions * Fix recursive flatten The fix is covered by recursive flatten test case in array.slt * Restore "keep LargeList" in Array signature * clarify naming in the test * Chore: Do not return empty record batches from streams (#13794) * do not emit empty record batches in plans * change function signatures to Option<RecordBatch> if empty batches are possible * format code * shorten code * change list_unnest_at_level for returning Option value * add documentation take concat_batches into compute_aggregates function again * create unit test for row_hash.rs * add test for unnest * add test for unnest * add test for partial sort * add test for bounded window agg * add test for window agg * apply simplifications and fix typo * apply simplifications and fix typo * Handle possible overflows in StringArrayBuilder / LargeStringArrayBuilder (#13802) * test(13796): reproducer of overflow on capacity * fix(13796): handle overflows with proper max capacity number which is valid for MutableBuffer * refactor: use simple solution and provide panic * fix: Ignore empty files in ListingTable when listing files with or without partition filters, as well as when inferring schema (#13750) * fix: Ignore empty files in ListingTable when listing files with or without partition filters, as well as when inferring schema * clippy * fix csv and json tests * add testing for parquet * cleanup * fix parquet tests * document describe_partition, add back repartition options to one of the csv empty files tests * Support Null regex override in csv parser options. (#13228) Co-authored-by: Andrew Lamb <[email protected]> * Minor: Extend ScalarValue::new_zero() (#13828) * Update mod.rs * Update mod.rs * Update mod.rs * Update mod.rs * chore: temporarily disable windows flow (#13833) * feat: `parse_float_as_decimal` supports scientific notation and Decimal256 (#13806) * feat: `parse_float_as_decimal` supports scientific notation and Decimal256 * Fix test * Add test * Add test * Refine negative scales * Update comment * Refine bigint_to_i256 * UT for bigint_to_i256 * Add ut for parse_decimal * Replace `BooleanArray::extend` with `append_n` (#13832) * Rename `TypeSignature::NullAry` --> `TypeSignature::Nullary` and improve comments (#13817) * Rename `TypeSignature::NullAry` --> `TypeSignature::Nullary` and improve comments * Apply suggestions from code review Co-authored-by: Piotr Findeisen <[email protected]> * improve docs --------- Co-authored-by: Piotr Findeisen <[email protected]> * [bugfix] ScalarFunctionExpr does not preserve the nullable flag on roundtrip (#13830) * [test] coalesce round trip schema mismatch * [proto] added the nullable flag in PhysicalScalarUdfNode * [bugfix] propagate the nullable flag for serialized scalar UDFS * Add example of interacting with a remote catalog (#13722) * Add example of interacting with a remote catalog * Update datafusion/core/src/execution/session_state.rs Co-authored-by: Berkay Şahin <[email protected]> * Apply suggestions from code review Co-authored-by: Jonah Gao <[email protected]> Co-authored-by: Weston Pace <[email protected]> * Use HashMap to hold tables --------- Co-authored-by: Berkay Şahin <[email protected]> Co-authored-by: Jonah Gao <[email protected]> Co-authored-by: Weston Pace <[email protected]> * Update substrait requirement from 0.49 to 0.50 (#13808) * Update substrait requirement from 0.49 to 0.50 Updates the requirements on [substrait](https://github.com/substrait-io/substrait-rs) to permit the latest version. - [Release notes](https://github.com/substrait-io/substrait-rs/releases) - [Changelog](https://github.com/substrait-io/substrait-rs/blob/main/CHANGELOG.md) - [Commits](https://github.com/substrait-io/substrait-rs/compare/v0.49.0...v0.50.0) --- updated-dependencies: - dependency-name: substrait dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> * Fix compilation * Add expr test --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: jonahgao <[email protected]> * typo: remove extraneous "`" in doc comment, fix header (#13848) * typo: extraneous "`" in doc comment * Update datafusion/execution/src/runtime_env.rs * Update datafusion/execution/src/runtime_env.rs --------- Co-authored-by: Oleks V <[email protected]> * typo: remove extra "`" interfering with doc formatting (#13847) * Support n-ary monotonic functions in ordering equivalence (#13841) * Support n-ary monotonic functions in `discover_new_orderings` * Add tests for n-ary monotonic functions in `discover_new_orderings` * Fix tests * Fix non-monotonic test case * Fix unintended simplification * Minor comment changes * Fix tests * Add `preserves_lex_ordering` field * Use `preserves_lex_ordering` on `discover_new_orderings()` * Add `output_ordering` and `output_preserves_lex_ordering` implementations for `ConcatFunc` * Update tests * Move logic to UDF * Cargo fmt * Refactor * Cargo fmt * Simply use false value on default implementation * Remove unnecessary import * Clippy fix * Update Cargo.lock * Move dep to dev-dependencies * Rename output_preserves_lex_ordering to preserves_lex_ordering * minor --------- Co-authored-by: berkaysynnada <[email protected]> * Replace `execution_mode` with `emission_type` and `boundedness` (#13823) * feat: update execution modes and add bitflags dependency - Introduced `Incremental` execution mode alongside existing modes in the DataFusion execution plan. - Updated various execution plans to utilize the new `Incremental` mode where applicable, enhancing streaming capabilities. - Added `bitflags` dependency to `Cargo.toml` for better management of execution modes. - Adjusted execution mode handling in multiple files to ensure compatibility with the new structure. * add exec API Signed-off-by: Jay Zhan <[email protected]> * replace done but has stackoverflow Signed-off-by: Jay Zhan <[email protected]> * exec API done Signed-off-by: Jay Zhan <[email protected]> * Refactor execution plan properties to remove execution mode - Removed the `ExecutionMode` parameter from `PlanProperties` across multiple physical plan implementations. - Updated related functions to utilize the new structure, ensuring compatibility with the changes. - Adjusted comments and cleaned up imports to reflect the removal of execution mode handling. This refactor simplifies the execution plan properties and enhances maintainability. * Refactor execution plan to remove `ExecutionMode` and introduce `EmissionType` - Removed the `ExecutionMode` parameter from `PlanProperties` and related implementations across multiple files. - Introduced `EmissionType` to better represent the output characteristics of execution plans. - Updated functions and tests to reflect the new structure, ensuring compatibility and enhancing maintainability. - Cleaned up imports and adjusted comments accordingly. This refactor simplifies the execution plan properties and improves the clarity of memory handling in execution plans. * fix test Signed-off-by: Jay Zhan <[email protected]> * Refactor join handling and emission type logic - Updated test cases in `sanity_checker.rs` to reflect changes in expected outcomes for bounded and unbounded joins, ensuring accurate test coverage. - Simplified the `is_pipeline_breaking` method in `execution_plan.rs` to clarify the conditions under which a plan is considered pipeline-breaking. - Enhanced the emission type determination logic in `execution_plan.rs` to prioritize `Final` over `Both` and `Incremental`, improving clarity in execution plan behavior. - Adjusted join type handling in `hash_join.rs` to classify `Right` joins as `Incremental`, allowing for immediate row emission. These changes improve the accuracy of tests and the clarity of execution plan properties. * Implement emission type for execution plans - Updated multiple execution plan implementations to replace `unimplemented!()` with `EmissionType::Incremental`, ensuring that the emission type is correctly defined for various plans. - This change enhances the clarity and functionality of the execution plans by explicitly specifying their emission behavior. These updates contribute to a more robust execution plan framework within the DataFusion project. * Enhance join type documentation and refine emission type logic - Updated the `JoinType` enum in `join_type.rs` to include detailed descriptions for each join type, improving clarity on their behavior and expected results. - Modified the emission type logic in `hash_join.rs` to ensure that `Right` and `RightAnti` joins are classified as `Incremental`, allowing for immediate row emission when applicable. These changes improve the documentation and functionality of join operations within the DataFusion project. * Refactor emission type logic in join and sort execution plans - Updated the emission type determination in `SortMergeJoinExec` and `SymmetricHashJoinExec` to utilize the `emission_type_from_children` function, enhancing the accuracy of emission behavior based on input characteristics. - Clarified comments in `sort.rs` regarding the conditions under which results are emitted, emphasizing the relationship between input sorting and emission type. - These changes improve the clarity and functionality of the execution plans within the DataFusion project, ensuring more robust handling of emission types. * Refactor emission type handling in execution plans - Updated the `emission_type_from_children` function to accept an iterator instead of a slice, enhancing flexibility in how child execution plans are passed. - Modified the `SymmetricHashJoinExec` implementation to utilize the new function signature, improving code clarity and maintainability. These changes streamline the emission type determination process within the DataFusion project, contributing to a more robust execution plan framework. * Enhance execution plan properties with boundedness and emission type - Introduced `boundedness` and `pipeline_behavior` methods to the `ExecutionPlanProperties` trait, improving the handling of execution plan characteristics. - Updated the `CsvExec`, `SortExec`, and related implementations to utilize the new methods for determining boundedness and emission behavior. - Refactored the `ensure_distribution` function to use the new boundedness logic, enhancing clarity in distribution decisions. - These changes contribute to a more robust and maintainable execution plan framework within the DataFusion project. * Refactor execution plans to enhance boundedness and emission type handling - Updated multiple execution plan implementations to incorporate `Boundedness` and `EmissionType`, improving the clarity and functionality of execution plans. - Replaced instances of `unimplemented!()` with appropriate emission types, ensuring that plans correctly define their output behavior. - Refactored the `PlanProperties` structure to utilize the new boundedness logic, enhancing decision-making in execution plans. - These changes contribute to a more robust and maintainable execution plan framework within the DataFusion project. * Refactor memory handling in execution plans - Updated the condition for checking memory requirements in execution plans from `has_finite_memory()` to `boundedness().requires_finite_memory()`, improving clarity in memory management. - This change enhances the robustness of execution plans within the DataFusion project by ensuring more accurate assessments of memory constraints. * Refactor boundedness checks in execution plans - Updated conditions for checking boundedness in various execution plans to use `is_unbounded()` instead of `requires_finite_memory()`, enhancing clarity in memory management. - Adjusted the `PlanProperties` structure to reflect these changes, ensuring more accurate assessments of memory constraints across the DataFusion project. - These modifications contribute to a more robust and maintainable execution plan framework, improving the handling of boundedness in execution strategies. * Remove TODO comment regarding unbounded execution plans in `UnboundedExec` implementation - Eliminated the outdated comment suggesting a switch to unbounded execution with finite memory, streamlining the code and improving clarity. - This change contributes to a cleaner and more maintainable codebase within the DataFusion project. * Refactor execution plan boundedness and emission type handling - Updated the `is_pipeline_breaking` method to use `requires_finite_memory()` for improved clarity in determining pipeline behavior. - Enhanced the `Boundedness` enum to include detailed documentation on memory requirements for unbounded streams. - Refactored `compute_properties` methods in `GlobalLimitExec` and `LocalLimitExec` to directly use the input's boundedness, simplifying the logic. - Adjusted emission type determination in `NestedLoopJoinExec` to utilize the `emission_type_from_children` function, ensuring accurate output behavior based on input characteristics. These changes contribute to a more robust and maintainable execution plan framework within the DataFusion project, improving clarity and functionality in handling boundedness and emission types. * Refactor emission type and boundedness handling in execution plans - Removed the `OptionalEmissionType` struct from `plan_properties.rs`, simplifying the codebase. - Updated the `is_pipeline_breaking` function in `execution_plan.rs` for improved readability by formatting the condition across multiple lines. - Adjusted the `GlobalLimitExec` implementation in `limit.rs` to directly use the input's boundedness, enhancing clarity in memory management. These changes contribute to a more streamlined and maintainable execution plan framework within the DataFusion project, improving the handling of emission types and boundedness. * Refactor GlobalLimitExec and LocalLimitExec to enhance boundedness handling - Updated the `compute_properties` methods in both `GlobalLimitExec` and `LocalLimitExec` to replace `EmissionType::Final` with `Boundedness::Bounded`, reflecting that limit operations always produce a finite number of rows. - Changed the input's boundedness reference to `pipeline_behavior()` for improved clarity in execution plan properties. These changes contribute to a more streamlined and maintainable execution plan framework within the DataFusion project, enhancing the handling of boundedness in limit operations. * Review Part1 * Update sanity_checker.rs * addressing reviews * Review Part 1 * Update datafusion/physical-plan/src/execution_plan.rs * Update datafusion/physical-plan/src/execution_plan.rs * Shorten imports * Enhance documentation for JoinType and Boundedness enums - Improved descriptions for the Inner and Full join types in join_type.rs to clarify their behavior and examples. - Added explanations regarding the boundedness of output streams and memory requirements in execution_plan.rs, including specific examples for operators like Median and Min/Max. --------- Signed-off-by: Jay Zhan <[email protected]> Co-authored-by: berkaysynnada <[email protected]> Co-authored-by: Mehmet Ozan Kabak <[email protected]> * Preserve ordering equivalencies on `with_reorder` (#13770) * Preserve ordering equivalencies on `with_reorder` * Add assertions * Return early if filtered_exprs is empty * Add clarify comment * Refactor * Add comprehensive test case * Add comment for exprs_equal * Cargo fmt * Clippy fix * Update properties.rs * Update exprs_equal and add tests * Update properties.rs --------- Co-authored-by: berkaysynnada <[email protected]> * replace CASE expressions in predicate pruning with boolean algebra (#13795) * replace CASE expressions in predicate pruning with boolean algebra * fix merge * update tests * add some more tests * add some more tests * remove duplicate test case * Update datafusion/physical-optimizer/src/pruning.rs * swap NOT for != * replace comments, update docstrings * fix example * update tests * update tests * Apply suggestions from code review Co-authored-by: Andrew Lamb <[email protected]> * Update pruning.rs Co-authored-by: Chunchun Ye <[email protected]> * Update pruning.rs Co-authored-by: Chunchun Ye <[email protected]> --------- Co-authored-by: Andrew Lamb <[email protected]> Co-authored-by: Chunchun Ye <[email protected]> * enable DF's nested_expressions feature by in datafusion-substrait tests to make them pass (#13857) fixes #13854 Co-authored-by: Arttu Voutilainen <[email protected]> * Add configurable normalization for configuration options and preserve case for S3 paths (#13576) * Do not normalize values * Fix tests & update docs * Prettier * Lowercase config params * Unify transform and parse * Fix tests * Rename `default_transform` and relax boundaries * Make `compression` case-insensitive * Comment to new line * Deprecate and ignore `enable_options_value_normalization` * Update datafusion/common/src/config.rs * fix typo --------- Co-authored-by: Oleks V <[email protected]> * Improve`Signature` and `comparison_coercion` documentation (#13840) * Improve Signature documentation more * Apply suggestions from code review Co-authored-by: Piotr Findeisen <[email protected]> --------- Co-authored-by: Piotr Findeisen <[email protected]> * feat: support normalized expr in CSE (#13315) * feat: support normalized expr in CSE * feat: support normalize_eq in cse optimization * feat: support cumulative binary expr result in normalize_eq --------- Co-authored-by: Andrew Lamb <[email protected]> * Upgrade to sqlparser `0.53.0` (#13767) * chore: Udpate to sqlparser 0.53.0 * Update for new sqlparser API * more api updates * Avoid serializing query to SQL string unless it is necessary * Box wildcard options * chore: update datafusion-cli Cargo.lock * Minor: Use `resize` instead of `extend` for adding static values in SortMergeJoin logic (#13861) Thanks @Dandandan * feat(function): add `least` function (#13786) * start adding least fn * feat(function): add least function * update function name * fix scalar smaller function * add tests * run Clippy and Fmt * Generated docs using `./dev/update_function_docs.sh` * add comment why `descending: false` * update comment * Update least.rs Co-authored-by: Bruce Ritchie <[email protected]> * Update scalar_functions.md * run ./dev/update_function_docs.sh to update docs * merge greatest and least implementation to one * add header --------- Co-authored-by: Bruce Ritchie <[email protected]> Co-authored-by: Andrew Lamb <[email protected]> * Improve SortPreservingMerge::enable_round_robin_repartition docs (#13826) * Clarify SortPreservingMerge::enable_round_robin_repartition docs * tweaks * Improve comments more * clippy * fix doc link * Minor: Unify `downcast_arg` method (#13865) * Implement `SHOW FUNCTIONS` (#13799) * introduce rid for different signature * implement show functions syntax * add syntax example * avoid duplicate join * fix clippy * show function_type instead of routine_type * add some doc and comments * Update bzip2 requirement from 0.4.3 to 0.5.0 (#13740) * Update bzip2 requirement from 0.4.3 to 0.5.0 Updates the requirements on [bzip2](https://github.com/trifectatechfoundation/bzip2-rs) to permit the latest version. - [Release notes](https://github.com/trifectatechfoundation/bzip2-rs/releases) - [Commits](https://github.com/trifectatechfoundation/bzip2-rs/compare/0.4.4...v0.5.0) --- updated-dependencies: - dependency-name: bzip2 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> * Fix test * Fix CLI cargo.lock --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: jonahgao <[email protected]> * Fix build (#13869) * feat(substrait): modular substrait consumer (#13803) * feat(substrait): modular substrait consumer * feat(substrait): include Extension Rel handlers in default consumer Include SerializerRegistry based handlers for Extension Relations in the DefaultSubstraitConsumer * refactor(substrait) _selection -> _field_reference * refactor(substrait): remove SubstraitPlannerState usage from consumer * refactor: get_state() -> get_function_registry() * docs: elide imports from example * test: simplify test * refactor: remove Arc from DefaultSubstraitConsumer * doc: add ticket for API improvements * doc: link DefaultSubstraitConsumer to from_subtrait_plan * refactor: remove redundant Extensions parsing * Minor: fix: Include FetchRel when producing LogicalPlan from Sort (#13862) * include FetchRel when producing LogicalPlan from Sort * add suggested test * address review feedback * Minor: improve error message when ARRAY literals can not be planned (#13859) * Minor: improve error message when ARRAY literals can not be planned * fmt * Update datafusion/sql/src/expr/value.rs Co-authored-by: Oleks V <[email protected]> --------- Co-authored-by: Oleks V <[email protected]> * Add documentation for `SHOW FUNCTIONS` (#13868) * Support unicode character for `initcap` function (#13752) * Support unicode character for 'initcap' function Signed-off-by: Tai Le Manh <[email protected]> * Update unit tests * Fix clippy warning * Update sqllogictests - initcap * Update scalar_functions.md docs * Add suggestions change Signed-off-by: Tai Le Manh <[email protected]> --------- Signed-off-by: Tai Le Manh <[email protected]> * [minor] make recursive package dependency optional (#13778) * make recursive optional * add to default for common package * cargo update * added to readme * make test conditional * reviews * cargo update --------- Co-authored-by: Andrew Lamb <[email protected]> * Minor: remove unused async-compression `futures-io` feature (#13875) * Minor: remove unused async-compression feature * Fix cli cargo lock * Consolidate Example: dataframe_output.rs into dataframe.rs (#13877) * Restore `DocBuilder::new()` to avoid breaking API change (#13870) * Fix build * Restore DocBuilder::new(), deprecate * cmt * clippy * Improve error messages for incorrect zero argument signatures (#13881) * Improve error messages for incorrect zero argument signatures * fix errors * fix fmt * Consolidate Example: simplify_udwf_expression.rs into advanced_udwf.rs (#13883) * minor: fix typos in comments / structure names (#13879) * minor: fix typo error in datafusion * fix: fix rebase error * fix: format HashJoinExec doc * doc: recover thiserror/preemptively * fix: other typo error fixed * fix: directories to dir_entries in catalog example * Support 1 or 3 arg in generate_series() UDTF (#13856) * Support 1 or 3 args in generate_series() UDTF * address comment * Support (order by / sort) for DataFrameWriteOptions (#13874) * Support (order by / sort) for DataFrameWriteOptions * Fix fmt * Fix import * Add insert into example * Update sort_merge_join.rs (#13894) * Update join_selection.rs (#13893) * Fix `recursive-protection` feature flag (#13887) * Fix recursive-protection feature flag * rename feature flag to be consistent * Make default * taplo format * Fix visibility of swap_hash_join (#13899) * Minor: Avoid emitting empty batches in partial sort (#13895) * Update partial_sort.rs * Update partial_sort.rs * Update partial_sort.rs * Prepare for 44.0.0 release: version and changelog (#13882) * Prepare for 44.0.0 release: version and changelog * update changelog * update configs * update before release * Support unparsing implicit lateral `UNNEST` plan to SQL text (#13824) * support unparsing the implicit lateral unnest plan * cargo clippy and fmt * refactor for `check_unnest_placeholder_with_outer_ref` * add const for the prefix string of unnest and outer refernece column * fix case_column_or_null with nullable when conditions (#13886) * fix case_column_or_null with nullable when conditions * improve sqllogictests for case_column_or_null --------- Co-authored-by: zhangli20 <[email protected]> * Fixed Issue #13896 (#13903) The URL to the external website was returning a 404. Presuming recent changes in the external website's structure, the required data has been moved to a different URL. The commit ensures the new URL is used. * Introduce `UserDefinedLogicalNodeUnparser` for User-defined Logical Plan unparsing (#13880) * make ast builder public * introduce udlp unparser * add documents * add examples * add negative tests and fmt * fix the doc * rename udlp to extension * apply the first unparsing result only * improve the doc * seperate the enum for the unparsing result * fix the doc --------- Co-authored-by: Andrew Lamb <[email protected]> * Preserve constant values across union operations (#13805) * Add value tracking to ConstExpr for improved union optimization * Update PartialEq impl * Minor change * Add docstring for ConstExpr value * Improve constant propagation across union partitions * Add assertion for across_partitions * fix fmt * Update properties.rs * Remove redundant constant removal loop * Remove unnecessary mut * Set across_partitions=true when both sides are constant * Extract and use constant values in filter expressions * Add initial SLT for constant value tracking across UNION ALL * Assign values to ConstExpr where possible * Revert "Set across_partitions=true when both sides are constant" This reverts commit 3051cd470b0ad4a70cd8bd3518813f5ce0b3a449. * Temporarily take value from literal * Lint fixes * Cargo fmt * Add get_expr_constant_value * Make `with_value()` accept optional value * Add todo * Move test to union.slt * Fix changed slt after merge * Simplify constexpr * Update properties.rs --------- Co-authored-by: berkaysynnada <[email protected]> * chore(deps): update sqllogictest requirement from 0.23.0 to 0.24.0 (#13902) * fix RecordBatch size in topK (#13906) * ci improvements, update protoc (#13876) * Fix md5 return_type to only return Utf8 as per current code impl. * ci improvements * Lock taiki-e/install-action to a githash for apache action policy - Release 2.46.19 in the case of this hash. * Lock taiki-e/install-action to a githash for apache action policy - Release 2.46.19 in the case of this hash. * Revert nextest change until action is approved. * Exclude requires workspace * Fixing minor typo to verify ci caching of builds is working as expected. * Updates from PR review. * Adding issue link for disabling intel mac build * improve performance of running examples * remove cargo check * Introduce LogicalPlan invariants, begin automatically checking them (#13651) * minor(13525): perform LP validation before and after each possible mutation * minor(13525): validate unique field names on query and subquery schemas, after each optimizer pass * minor(13525): validate union after each optimizer passes * refactor: make explicit what is an invariant of the logical plan, versus assertions made after a given analyzer or optimizer pass * chore: add link to invariant docs * fix: add new invariants module * refactor: move all LP invariant checking into LP, delineate executable (valid semantic plan) vs basic LP invariants * test: update test for slight error message change * fix: push_down_filter optimization pass can push a IN(<subquery>) into a TableScan's filter clause * refactor: move collect_subquery_cols() to common utils crate * refactor: clarify the purpose of assert_valid_optimization(), runs after all optimizer passes, except in debug mode it runs after each pass. * refactor: based upon performance tests, run the maximum number of checks without impa ct: * assert_valid_optimization can run each optimizer pass * remove the recursive cehck_fields, which caused the performance regression * the full LP Invariants::Executable can only run in debug * chore: update error naming and terminology used in code comments * refactor: use proper error methods * chore: more cleanup of error messages * chore: handle option trailer to error message * test: update sqllogictests tests to not use multiline * Correct return type for initcap scalar function with utf8view (#13909) * Set utf8view as return type when input type is the same * Verify that the returned type from call to scalar function matches the return type specified in the return_type function * Match return type to utf8view * Consolidate example: simplify_udaf_expression.rs into advanced_udaf.rs (#13905) * Implement maintains_input_order for AggregateExec (#13897) * Implement maintains_input_order for AggregateExec * Update mod.rs * Improve comments --------- Co-authored-by: berkaysynnada <[email protected]> Co-authored-by: mertak-synnada <[email protected]> Co-authored-by: Mehmet Ozan Kabak <[email protected]> * Move join type input swapping to pub methods on Joins (#13910) * doc-gen: migrate scalar functions (string) documentation 3/4 (#13926) Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * Update sqllogictest requirement from 0.24.0 to 0.25.0 (#13917) * Update sqllogictest requirement from 0.24.0 to 0.25.0 Updates the requirements on [sqllogictest](https://github.com/risinglightdb/sqllogictest-rs) to permit the latest version. - [Release notes](https://github.com/risinglightdb/sqllogictest-rs/releases) - [Changelog](https://github.com/risinglightdb/sqllogictest-rs/blob/main/CHANGELOG.md) - [Commits](https://github.com/risinglightdb/sqllogictest-rs/compare/v0.24.0...v0.25.0) --- updated-dependencies: - dependency-name: sqllogictest dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> * Remove labels --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: jonahgao <[email protected]> * Consolidate Examples: memtable.rs and parquet_multiple_files.rs (#13913) * doc-gen: migrate scalar functions (crypto) documentation (#13918) * doc-gen: migrate scalar functions (crypto) documentation * doc-gen: fix typo and update function docs --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * doc-gen: migrate scalar functions (datetime) documentation 1/2 (#13920) * doc-gen: migrate scalar functions (datetime) documentation 1/2 * fix: fix typo and update function docs --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * fix RecordBatch size in hash join (#13916) * doc-gen: migrate scalar functions (array) documentation 1/3 (#13928) * doc-gen: migrate scalar functions (array) documentation 1/3 * fix: remove unsed import, fix typo and update function docs --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * doc-gen: migrate scalar functions (math) documentation 1/2 (#13922) * doc-gen: migrate scalar functions (math) documentation 1/2 * fix: fix typo --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * doc-gen: migrate scalar functions (math) documentation 2/2 (#13923) * doc-gen: migrate scalar functions (math) documentation 2/2 * fix: fix typo --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * doc-gen: migrate scalar functions (array) documentation 3/3 (#13930) * doc-gen: migrate scalar functions (array) documentation 3/3 * fix: import doc and macro, fix typo and update function docs --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * doc-gen: migrate scalar functions (array) documentation 2/3 (#13929) * doc-gen: migrate scalar functions (array) documentation 2/3 * fix: import doc and macro, fix typo and update function docs --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * doc-gen: migrate scalar functions (string) documentation 4/4 (#13927) * doc-gen: migrate scalar functions (string) documentation 4/4 * fix: fix typo and update function docs --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * Support explain query when running dfbench with clickbench (#13942) * Support explain query when running dfbench * Address comments * Consolidate example to_date.rs into dateframe.rs (#13939) * Consolidate example to_date.rs into dateframe.rs * Assert results using assert_batches_eq * clippy * Revert "Update sqllogictest requirement from 0.24.0 to 0.25.0 (#13917)" (#13945) * Revert "Update sqllogictest requirement from 0.24.0 to 0.25.0 (#13917)" This reverts commit 0989649214a6fe69ffb33ed38c42a8d3df94d6bf. * add comment * Implement predicate pruning for `like` expressions (prefix matching) (#12978) * Implement predicate pruning for like expressions * add function docstring * re-order bounds calculations * fmt * add fuzz tests * fix clippy * Update datafusion/core/tests/fuzz_cases/pruning.rs Co-authored-by: Andrew Lamb <[email protected]> --------- Co-authored-by: Andrew Lamb <[email protected]> * doc-gen: migrate scalar functions (string) documentation 1/4 (#13924) Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * consolidate dataframe_subquery.rs into dataframe.rs (#13950) * migrate btrim to user_doc macro (#13952) * doc-gen: migrate scalar functions (datetime) documentation 2/2 (#13921) * doc-gen: migrate scalar functions (datetime) documentation 2/2 * fix: fix typo and update function docs * doc: update function docs * doc-gen: remove slash --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * Add sqlite test files, progress bar, and automatic postgres container management into sqllogictests (#13936) * Fix md5 return_type to only return Utf8 as per current code impl. * Add support for sqlite test files to sqllogictest * Force version 0.24.0 of sqllogictest dependency until issue with labels is fixed. * Removed workaround for bug that was fixed. * Git submodule update ... err update, link to sqlite tests. * Git submodule update * Readd submodule --------- Co-authored-by: Andrew Lamb <[email protected]> * Supporting writing schema metadata when writing Parquet in parallel (#13866) * refactor: make ParquetSink tests a bit more readable * chore(11770): add new ParquetOptions.skip_arrow_metadata * test(11770): demonstrate that the single threaded ParquetSink is already writing the arrow schema in the kv_meta, and allow disablement * refactor(11770): replace with new method, since the kv_metadata is inherent to TableParquetOptions and therefore we should explicitly make the API apparant that you have to include the arrow schema or not * fix(11770): fix parallel ParquetSink to encode arrow schema into the file metadata, based on the ParquetOptions * refactor(11770): provide deprecation warning for TryFrom * test(11770): update tests with new default to include arrow schema * refactor: including partitioning of arrow schema inserted into kv_metdata * test: update tests for new config prop, as well as the new file partition offsets based upon larger metadata * chore: avoid cloning in tests, and update code docs * refactor: return to the WriterPropertiesBuilder::TryFrom<TableParquetOptions>, and separately add the arrow_schema to the kv_metadata on the TableParquetOptions * refactor: require the arrow_schema key to be present in the kv_metadata, if is required by the configuration * chore: update configs.md * test: update tests to handle the (default) required arrow schema in the kv_metadata * chore: add reference to arrow-rs upstream PR * chore: Create devcontainer.json (#13520) * Create devcontainer.json * update devcontainer * remove useless features * Minor: consolidate ConfigExtension example into API docs (#13954) * Update examples README.md * Minor: consolidate ConfigExtension example into API docs * more docs * Remove update * clippy * Fix issue with ExtensionsOptions docs * Parallelize pruning utf8 fuzz test (#13947) * Add swap_inputs to SMJ (#13984) * fix(datafusion-functions-nested): `arrow-distinct` now work with null rows (#13966) * added failing test * fix(datafusion-functions-nested): `arrow-distinct` now work with null rows * Update datafusion/functions-nested/src/set_ops.rs Co-authored-by: Andrew Lamb <[email protected]> * Update set_ops.rs --------- Co-authored-by: Andrew Lamb <[email protected]> * Update release instructions for 44.0.0 (#13959) * Update release instructions for 44.0.0 * update macros and order * add functions-table * Add datafusion python 43.1.0 blog post to doc. (#13974) * Include license and notice files in more crates (#13985) * Extract postgres container from sqllogictest, update datafusion-testing pin (#13971) * Add support for sqlite test files to sqllogictest * Removed workaround for bug that was fixed. * Refactor sqllogictest to extract postgres functionality into a separate file. Removed dependency on once_cell in favour of LazyLock. * Add missing license header. * Update rstest requirement from 0.23.0 to 0.24.0 (#13977) Updates the requirements on [rstest](https://github.com/la10736/rstest) to permit the latest version. - [Release notes](https://github.com/la10736/rstest/releases) - [Changelog](https://github.com/la10736/rstest/blob/master/CHANGELOG.md) - [Commits](https://github.com/la10736/rstest/compare/v0.23.0...v0.23.0) --- updated-dependencies: - dependency-name: rstest dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Move hash collision test to run only when merging to main. (#13973) * Update itertools requirement from 0.13 to 0.14 (#13965) * Update itertools requirement from 0.13 to 0.14 Updates the requirements on [itertools](https://github.com/rust-itertools/itertools) to permit the latest version. - [Changelog](https://github.com/rust-itertools/itertools/blob/master/CHANGELOG.md) - [Commits](https://github.com/rust-itertools/itertools/compare/v0.13.0...v0.13.0) --- updated-dependencies: - dependency-name: itertools dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> * Fix build * Simplify * Update CLI lock --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: jonahgao <[email protected]> * Change trigger, rename `hash_collision.yml` to `extended.yml` and add comments (#13988) * Rename hash_collision.yml to extended.yml and add comments * Adjust schedule, add comments * Update job, rerun * doc-gen: migrate scalar functions (string) documentation 2/4 (#13925) * doc-gen: migrate scalar functions (string) documentation 2/4 * doc-gen: update function docs * doc: fix related udf order for upper function in documentation * Update datafusion/functions/src/string/concat_ws.rs * Update datafusion/functions/src/string/concat_ws.rs * Update datafusion/functions/src/string/concat_ws.rs * doc-gen: update function docs --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> Co-authored-by: Oleks V <[email protected]> * Update substrait requirement from 0.50 to 0.51 (#13978) Updates the requirements on [substrait](https://github.com/substrait-io/substrait-rs) to permit the latest version. - [Release notes](https://github.com/substrait-io/substrait-rs/releases) - [Changelog](https://github.com/substrait-io/substrait-rs/blob/main/CHANGELOG.md) - [Commits](https://github.com/substrait-io/substrait-rs/compare/v0.50.0...v0.51.0) --- updated-dependencies: - dependency-name: substrait dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Update release README for datafusion-cli publishing (#13982) * Enhance LastValueAccumulator logic and add SQL logic tests for last_value function (#13980) - Updated LastValueAccumulator to include requirement satisfaction check before updating the last value. - Added SQL logic tests to verify the behavior of the last_value function with merge batches and ensure correct aggregation in various scenarios. * Improve deserialize_to_struct example (#13958) * Cleanup deserialize_to_struct example * prettier * Apply suggestions from code review Co-authored-by: Jonah Gao <[email protected]> --------- Co-authored-by: Jonah Gao <[email protected]> * Update docs (#14002) * Optimize CASE expression for "expr or expr" usage. (#13953) * Apply optimization for ExprOrExpr. * Implement optimization similar to existing code. * Add sqllogictest. * feat(substrait): introduce consume_rel and consume_expression (#13963) * feat(substrait): introduce consume_rel and consume_expression Route calls to from_substrait_rel and from_substrait_rex through the SubstraitConsumer in order to allow users to provide their own behaviour * feat(substrait): consume nulls of user-defined types * docs(substrait): consume_rel and consume_expression docstrings * Consolidate csv_opener.rs and json_opener.rs into a single example (#… (#13981) * Consolidate csv_opener.rs and json_opener.rs into a single example (#13955) * Update datafusion-examples/examples/csv_json_opener.rs Co-authored-by: Andrew Lamb <[email protected]> * Update datafusion-examples/README.md Co-authored-by: Andrew Lamb <[email protected]> * Apply code formatting with cargo fmt --------- Co-authored-by: Sergey Zhukov <[email protected]> Co-authored-by: Andrew Lamb <[email protected]> * FIX : Incorrect NULL handling in BETWEEN expression (#14007) * submodule update * FIX : Incorrect NULL handling in BETWEEN expression * Revert "submodule update" This reverts commit 72431aadeaf33a27775a88c41931572a0b66bae3. * fix incorrect unit test * move sqllogictest to expr * feat(substrait): modular substrait producer (#13931) * feat(substrait): modular substrait producer * refactor(substrait): simplify col_ref_offset handling in producer * refactor(substrait): remove column offset tracking from producer * docs(substrait): document SubstraitProducer * refactor: minor cleanup * feature: remove unused SubstraitPlanningState BREAKING CHANGE: SubstraitPlanningState is no longer available * refactor: cargo fmt * refactor(substrait): consume_ -> handle_ * refactor(substrait): expand match blocks * refactor: DefaultSubstraitProducer only needs serializer_registry * refactor: remove unnecessary warning suppression * fix(substrait): route expr conversion through handle_expr * cargo fmt * fix: Avoid re-wrapping planning errors Err(DataFusionError::Plan) for use in plan_datafusion_err (#14000) * fix: unwrapping Err(DataFusionError::Plan) for use in plan_datafusion_err * test: add tests for error formatting during planning * feat: support `RightAnti` for `SortMergeJoin` (#13680) * feat: support `RightAnti` for `SortMergeJoin` * feat: preserve session id when using cxt.enable_url_table() (#14004) * Return error message during planning when inserting into a MemTable with zero partitions. (#14011) * Minor: Rewrite LogicalPlan::max_rows for Join and Union, made it easier to understand (#14012) * Refactor max_rows for join plan, made it easier to understand * Simplified max_rows for Union * Chore: update wasm-supported crates, add tests (#14005) * Chore: update wasm-supported crates * format * Use workspace rust-version for all workspace crates (#14009) * [Minor] refactor: make ArraySort public for broader access (#14006) * refactor: make ArraySort public for broader access Changes the visibility of the ArraySort struct fromsuper to public. allows broader access to the struct, enabling its use in other modules and promoting better code reuse. * clippy and docs --------- Co-authored-by: Andrew Lamb <[email protected]> * Update sqllogictest requirement from =0.24.0 to =0.26.0 (#14017) * Update sqllogictest requirement from =0.24.0 to =0.26.0 Updates the requirements on [sqllogictest](https://github.com/risinglightdb/sqllogictest-rs) to permit the latest version. - [Release notes](https://github.com/risinglightdb/sqllogictest-rs/releases) - [Changelog](https://github.com/risinglightdb/sqllogictest-rs/blob/main/CHANGELOG.md) - [Commits](https://github.com/risinglightdb/sqllogictest-rs/compare/v0.24.0...v0.26.0) --- updated-dependencies: - dependency-name: sqllogictest dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> * remove version pin and note --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Eduard Karacharov <[email protected]> * `url` dependancy update (#14019) * `url` dependancy update * `url` version update for datafusion-cli * Minor: Improve zero partition check when inserting into `MemTable` (#14024) * Improve zero partition check when inserting into `MemTable` * update err msg * refactor: make structs public and implement Default trait (#14030) * Minor: Remove redundant implementation of `StringArrayType` (#14023) * Minor: Remove redundant implementation of StringArrayType Signed-off-by: Tai Le Manh <[email protected]> * Deprecate rather than remove StringArrayType --------- Signed-off-by: Tai Le Manh <[email protected]> Co-authored-by: Andrew Lamb <[email protected]> * Added references to IDE documentation for dev containers along with a small note about why one may choose to do development using a dev container. (#14014) * Use partial aggregation schema for spilling to avoid column mismatch in GroupedHashAggregateStream (#13995) * Refactor spill handling in GroupedHashAggregateStream to use partial aggregate schema * Implement aggregate functions with spill handling in tests * Add tests for aggregate functions with and without spill handling * Move test related imports into mod test * Rename spill pool test functions for clarity and consistency * Refactor aggregate function imports to use fully qualified paths * Remove outdated comments regarding input batch schema for spilling in GroupedHashAggregateStream * Update aggregate test to use AVG instead of MAX * assert spill count * Refactor partial aggregate schema creation to use create_schema function * Refactor partial aggregation schema creation and remove redundant function * Remove unused import of Schema from arrow::datatypes in row_hash.rs * move spill pool testing for aggregate functions to physical-plan/src/aggregates * Use Arc::clone for schema references in aggregate functions * Encapsulate fields of `EquivalenceProperties` (#14040) * Encapsulate fields of `EquivalenceGroup` (#14039) * Fix error on `array_distinct` when input is empty #13810 (#14034) * fix * add test * oops --------- Co-authored-by: Cyprien Huet <[email protected]> * Update petgraph requirement from 0.6.2 to 0.7.1 (#14045) * Update petgraph requirement from 0.6.2 to 0.7.1 Updates the requirements on [petgraph](https://github.com/petgraph/petgraph) to permit the latest version. - [Changelog](https://github.com/petgraph/petgraph/blob/master/RELEASES.rst) - [Commits](https://github.com/petgraph/petgraph/compare/[email protected]@v0.7.1) --- updated-dependencies: - dependency-name: petgraph dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> * Update datafusion-cli/Cargo.lock --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Andrew Lamb <[email protected]> * Encapsulate fields of `OrderingEquivalenceClass` (make field non pub) (#14037) * Complete encapsulatug `OrderingEquivalenceClass` (make fields non pub) * fix doc * Fix: ensure that compression type is also taken into consideration during ListingTableConfig infer_options (#14021) * chore: add test to verify that schema is inferred as expected * chore: add comment to method as suggested * chore: restructure to avoid need to clone * chore: fix flaw in rewrite * feat(optimizer): Enable filter pushdown on window functions (#14026) * feat(optimizer): Enable filter pushdown on window functions Ensures selections can be pushed past window functions similarly to what is already done with aggregations, when possible. * fix: Add missing dependency * minor(optimizer): Use 'datafusion-functions-window' as a dev dependency * docs(optimizer): Add example to filter pushdown on LogicalPlan::Window * Unparsing optimized (> 2 inputs) unions (#14031) * tests and optimizer in testing queries * unparse optimized unions * format Cargo.toml * format Cargo.toml * revert test * rewrite test to avoid cyclic dep * remove old test * cleanup * comments and error handling * handle union with lt 2 inputs * Minor: Document output schema of LogicalPlan::Aggregate and LogicalPlan::Window (#14047) * Simplify error handling in case.rs (#13990) (#14033) * Simplify error handling in case.rs (#13990) * Fix issues causing GitHub checks to fail * Update datafusion/physical-expr/src/expressions/case.rs Co-authored-by: Andrew Lamb <[email protected]> --------- Co-authored-by: Sergey Zhukov <[email protected]> Co-authored-by: Andrew Lamb <[email protected]> * feat: add `AsyncCatalogProvider` helpers for asynchronous catalogs (#13800) * Add asynchronous catalog traits to help users that have asynchronous catalogs * Apply clippy suggestions * Address PR reviews * Remove allow_unused exceptions * Update remote catalog example to demonstrate new helper structs * Move schema_name / catalog_name parameters into resolve f…

* Handle alias when parsing sql(parse_sql_expr) (#12939) * fix: Fix parse_sql_expr not handling alias * cargo fmt * fix parse_sql_expr example(remove alias) * add testing * add SUM udaf to TestContextProvider and modify test_sql_to_expr_with_alias for function * revert change on example `parse_sql_expr` * Improve documentation for TableProvider (#13724) * Reveal implementing type and return type in simple UDF implementations (#13730) Debug trait is useful for understanding what something is and how it's configured, especially if the implementation is behind dyn trait. * minor: Extract tests for `EXTRACT` AND `date_part` to their own file (#13731) * Support unparsing `UNNEST` plan to `UNNEST` table factor SQL (#13660) * add `unnest_as_table_factor` and `UnnestRelationBuilder` * unparse unnest as table factor * fix typo * add tests for the default configs * add a static const for unnest_placeholder * fix tests * fix tests * Update to apache-avro 0.17, fix compatibility changes schema handling (#13727) * Update apache-avro requirement from 0.16 to 0.17 --- updated-dependencies: - dependency-name: apache-avro dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> * Fix compatibility changes schema handling apache-avro 0.17 - Handle ArraySchema struct - Handle MapSchema struct - Map BigDecimal => LargeBinary - Map TimestampNanos => Timestamp(TimeUnit::Nanosecond, None) - Map LocalTimestampNanos => todo!() - Add Default to FixedSchema test * Update Cargo.lock file for apache-avro 0.17 --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Marc Droogh <[email protected]> Co-authored-by: Andrew Lamb <[email protected]> * Minor: Add doc example to RecordBatchStreamAdapter (#13725) * Minor: Add doc example to RecordBatchStreamAdapter * Update datafusion/physical-plan/src/stream.rs Co-authored-by: Berkay Şahin <[email protected]> --------- Co-authored-by: Berkay Şahin <[email protected]> * Implement GroupsAccumulator for corr(x,y) aggregate function (#13581) * Implement GroupsAccumulator for corr(x,y) * feedbacks * fix CI MSRV * review * avoid collect in accumulation * add back cast * fix union serialisation order in proto (#13709) * fix union serialisation order in proto * clippy * address comments * Minor: make unsupported `nanosecond` part a real (not internal) error (#13733) * Minor: make unsupported `nanosecond` part a real (not internal) error * fmt * Improve wording to refer to date part * Add tests for date_part on columns + timestamps with / without timezones (#13732) * Add tests for date_part on columns + timestamps with / without timezones * Add tests from https://github.com/apache/datafusion/pull/13372 * remove trailing whitespace * Optimize performance of `initcap` function (~2x faster) (#13691) * Optimize performance of initcap (~2x faster) Signed-off-by: Tai Le Manh <[email protected]> * format --------- Signed-off-by: Tai Le Manh <[email protected]> * Minor: Add documentation explaining that initcap oly works for ASCII (#13749) * Support sqllogictest --complete with postgres (#13746) Before the change, the request to use PostgreSQL was simply ignored when `--complete` flag was present. * doc-gen: migrate window functions documentation to attribute based (#13739) * doc-gen: migrate window functions documentation Signed-off-by: zjregee <[email protected]> * fix: update Cargo.lock --------- Signed-off-by: zjregee <[email protected]> * Minor: Remove memory reservation in `JoinLeftData` used in HashJoin (#13751) * Refactor JoinLeftData structure by removing unused memory reservation field in hash join implementation * Add Debug and Clone derives for HashJoinStreamState and ProcessProbeBatchState enums This commit enhances the HashJoinStreamState and ProcessProbeBatchState structures by implementing the Debug and Clone traits, allowing for easier debugging and cloning of these state representations in the hash join implementation. * Update to bigdecimal 0.4.7 (#13747) * Add big decimal formatting test cases with potential trailing zeros * Rename and simplify decimal rendering functions - add `decimal` to function name - drop `precision` parameter as it is not supposed to affect the result * Update to bigdecimal 0.4.7 Utilize new `to_plain_string` function * chore: clean up dependencies (#13728) * CI: Warn on unused crates * CI: Warn on unused crates * CI: Warn on unused crates * CI: Warn on unused crates * CI: Clean up dependencies * CI: Clean up dependencies * fix: Implicitly plan `UNNEST` as lateral (#13695) * plan implicit lateral if table factor is UNNEST * check for outer references in `create_relation_subquery` * add sqllogictest * fix lateral constant test to not expect a subquery node * replace sqllogictest in favor of logical plan test * update lateral join sqllogictests * add sqllogictests * fix logical plan test * Minor: improve the Deprecation / API health guidelines (#13701) * Minor: improve the Deprecation / API health policy * prettier * Update docs/source/library-user-guide/api-health.md Co-authored-by: Jonah Gao <[email protected]> * Add version guidance and make more copy/paste friendly * prettier * better * rename to guidelines --------- Co-authored-by: Jonah Gao <[email protected]> * fix: specify roottype in substrait fieldreference (#13647) * fix: specify roottype in fieldreference Signed-off-by: MBWhite <[email protected]> * Fix formatting Signed-off-by: MBWhite <[email protected]> * review suggestion Signed-off-by: MBWhite <[email protected]> --------- Signed-off-by: MBWhite <[email protected]> * Simplify type signatures using `TypeSignatureClass` for mixed type function signature (#13372) * add type sig class Signed-off-by: jayzhan211 <[email protected]> * timestamp Signed-off-by: jayzhan211 <[email protected]> * date part Signed-off-by: jayzhan211 <[email protected]> * fmt Signed-off-by: jayzhan211 <[email protected]> * taplo format Signed-off-by: jayzhan211 <[email protected]> * tpch test Signed-off-by: jayzhan211 <[email protected]> * msrc issue Signed-off-by: jayzhan211 <[email protected]> * msrc issue Signed-off-by: jayzhan211 <[email protected]> * explicit hash Signed-off-by: jayzhan211 <[email protected]> * Enhance type coercion and function signatures - Added logic to prevent unnecessary casting of string types in `native.rs`. - Introduced `Comparable` variant in `TypeSignature` to define coercion rules for comparisons. - Updated imports in `functions.rs` and `signature.rs` for better organization. - Modified `date_part.rs` to improve handling of timestamp extraction and fixed query tests in `expr.slt`. - Added `datafusion-macros` dependency in `Cargo.toml` and `Cargo.lock`. These changes improve type handling and ensure more accurate function behavior in SQL expressions. * fix comment Signed-off-by: Jay Zhan <[email protected]> * fix signature Signed-off-by: Jay Zhan <[email protected]> * fix test Signed-off-by: Jay Zhan <[email protected]> * Enhance type coercion for timestamps to allow implicit casting from strings. Update SQL logic tests to reflect changes in timestamp handling, including expected outputs for queries involving nanoseconds and seconds. * Refactor type coercion logic for timestamps to improve readability and maintainability. Update the `TypeSignatureClass` documentation to clarify its purpose in function signatures, particularly regarding coercible types. This change enhances the handling of implicit casting from strings to timestamps. * Fix SQL logic tests to correct query error handling for timestamp functions. Updated expected outputs for `date_part` and `extract` functions to reflect proper behavior with nanoseconds and seconds. This change improves the accuracy of test cases in the `expr.slt` file. * Enhance timestamp handling in TypeSignature to support timezone specification. Updated the logic to include an additional DataType for timestamps with a timezone wildcard, improving flexibility in timestamp operations. * Refactor date_part function: remove redundant imports and add missing not_impl_err import for better error handling --------- Signed-off-by: jayzhan211 <[email protected]> Signed-off-by: Jay Zhan <[email protected]> * Minor: Add some more blog posts to the readings page (#13761) * Minor: Add some more blog posts to the readings page * prettier * prettier * Update docs/source/user-guide/concepts-readings-events.md --------- Co-authored-by: Oleks V <[email protected]> * docs: update GroupsAccumulator instead of GroupAccumulator (#13787) Fixing `GroupsAccumulator` trait name in its docs * Improve Deprecation Guidelines more (#13776) * Improve deprecation guidelines more * prettier * fix: add `null_buffer` length check to `StringArrayBuilder`/`LargeStringArrayBuilder` (#13758) * fix: add `null_buffer` check for `LargeStringArray` Add a safety check to ensure that the alignment of buffers cannot be overflowed. This introduces a panic if they are not aligned through a runtime assertion. * fix: remove value_buffer assertion These buffers can be misaligned and it is not problematic, it is the `null_buffer` which we care about being of the same length. * feat: add `null_buffer` check to `StringArray` This is in a similar vein to `LargeStringArray`, as the code is the same, except for `i32`'s instead of `i64`. * feat: use `row_count` var to avoid drift * Revert the removal of reservation in HashJoin (#13792) * fix: restore memory reservation in JoinLeftData for accurate memory accounting in HashJoin This commit reintroduces the `_reservation` field in the `JoinLeftData` structure to ensure proper tracking of memory resources during join operations. The absence of this field could lead to inconsistent memory usage reporting and potential out-of-memory issues as upstream operators increase their memory consumption. * fmt Signed-off-by: Jay Zhan <[email protected]> --------- Signed-off-by: Jay Zhan <[email protected]> * added count aggregate slt (#13790) * Update documentation guidelines for contribution content (#13703) * Update documentation guidelines for contribution content * Apply suggestions from code review Co-authored-by: Piotr Findeisen <[email protected]> Co-authored-by: Oleks V <[email protected]> * clarify discussions and remove requirements note * prettier * Update docs/source/contributor-guide/index.md Co-authored-by: Piotr Findeisen <[email protected]> --------- Co-authored-by: Piotr Findeisen <[email protected]> Co-authored-by: Oleks V <[email protected]> * Add Round trip tests for Array <--> ScalarValue (#13777) * Add Round trip tests for Array <--> ScalarValue * String dictionary test * remove unecessary value * Improve comments * fix: Limit together with pushdown_filters (#13788) * fix: Limit together with pushdown_filters * Fix format * Address new comments * Fix testing case to hit the problem * Minor: improve Analyzer docs (#13798) * Minor: cargo update in datafusion-cli (#13801) * Update datafusion-cli toml to pin home=0.5.9 * update Cargo.lock * Fix `ScalarValue::to_array_of_size` for DenseUnion (#13797) * fix: enable pruning by bloom filters for dictionary columns (#13768) * Handle empty rows for `array_distinct` (#13810) * handle empty array distinct * ignore * fix --------- Co-authored-by: Cyprien Huet <[email protected]> * Fix get_type for higher-order array functions (#13756) * Fix get_type for higher-order array functions * Fix recursive flatten The fix is covered by recursive flatten test case in array.slt * Restore "keep LargeList" in Array signature * clarify naming in the test * Chore: Do not return empty record batches from streams (#13794) * do not emit empty record batches in plans * change function signatures to Option<RecordBatch> if empty batches are possible * format code * shorten code * change list_unnest_at_level for returning Option value * add documentation take concat_batches into compute_aggregates function again * create unit test for row_hash.rs * add test for unnest * add test for unnest * add test for partial sort * add test for bounded window agg * add test for window agg * apply simplifications and fix typo * apply simplifications and fix typo * Handle possible overflows in StringArrayBuilder / LargeStringArrayBuilder (#13802) * test(13796): reproducer of overflow on capacity * fix(13796): handle overflows with proper max capacity number which is valid for MutableBuffer * refactor: use simple solution and provide panic * fix: Ignore empty files in ListingTable when listing files with or without partition filters, as well as when inferring schema (#13750) * fix: Ignore empty files in ListingTable when listing files with or without partition filters, as well as when inferring schema * clippy * fix csv and json tests * add testing for parquet * cleanup * fix parquet tests * document describe_partition, add back repartition options to one of the csv empty files tests * Support Null regex override in csv parser options. (#13228) Co-authored-by: Andrew Lamb <[email protected]> * Minor: Extend ScalarValue::new_zero() (#13828) * Update mod.rs * Update mod.rs * Update mod.rs * Update mod.rs * chore: temporarily disable windows flow (#13833) * feat: `parse_float_as_decimal` supports scientific notation and Decimal256 (#13806) * feat: `parse_float_as_decimal` supports scientific notation and Decimal256 * Fix test * Add test * Add test * Refine negative scales * Update comment * Refine bigint_to_i256 * UT for bigint_to_i256 * Add ut for parse_decimal * Replace `BooleanArray::extend` with `append_n` (#13832) * Rename `TypeSignature::NullAry` --> `TypeSignature::Nullary` and improve comments (#13817) * Rename `TypeSignature::NullAry` --> `TypeSignature::Nullary` and improve comments * Apply suggestions from code review Co-authored-by: Piotr Findeisen <[email protected]> * improve docs --------- Co-authored-by: Piotr Findeisen <[email protected]> * [bugfix] ScalarFunctionExpr does not preserve the nullable flag on roundtrip (#13830) * [test] coalesce round trip schema mismatch * [proto] added the nullable flag in PhysicalScalarUdfNode * [bugfix] propagate the nullable flag for serialized scalar UDFS * Add example of interacting with a remote catalog (#13722) * Add example of interacting with a remote catalog * Update datafusion/core/src/execution/session_state.rs Co-authored-by: Berkay Şahin <[email protected]> * Apply suggestions from code review Co-authored-by: Jonah Gao <[email protected]> Co-authored-by: Weston Pace <[email protected]> * Use HashMap to hold tables --------- Co-authored-by: Berkay Şahin <[email protected]> Co-authored-by: Jonah Gao <[email protected]> Co-authored-by: Weston Pace <[email protected]> * Update substrait requirement from 0.49 to 0.50 (#13808) * Update substrait requirement from 0.49 to 0.50 Updates the requirements on [substrait](https://github.com/substrait-io/substrait-rs) to permit the latest version. - [Release notes](https://github.com/substrait-io/substrait-rs/releases) - [Changelog](https://github.com/substrait-io/substrait-rs/blob/main/CHANGELOG.md) - [Commits](https://github.com/substrait-io/substrait-rs/compare/v0.49.0...v0.50.0) --- updated-dependencies: - dependency-name: substrait dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> * Fix compilation * Add expr test --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: jonahgao <[email protected]> * typo: remove extraneous "`" in doc comment, fix header (#13848) * typo: extraneous "`" in doc comment * Update datafusion/execution/src/runtime_env.rs * Update datafusion/execution/src/runtime_env.rs --------- Co-authored-by: Oleks V <[email protected]> * typo: remove extra "`" interfering with doc formatting (#13847) * Support n-ary monotonic functions in ordering equivalence (#13841) * Support n-ary monotonic functions in `discover_new_orderings` * Add tests for n-ary monotonic functions in `discover_new_orderings` * Fix tests * Fix non-monotonic test case * Fix unintended simplification * Minor comment changes * Fix tests * Add `preserves_lex_ordering` field * Use `preserves_lex_ordering` on `discover_new_orderings()` * Add `output_ordering` and `output_preserves_lex_ordering` implementations for `ConcatFunc` * Update tests * Move logic to UDF * Cargo fmt * Refactor * Cargo fmt * Simply use false value on default implementation * Remove unnecessary import * Clippy fix * Update Cargo.lock * Move dep to dev-dependencies * Rename output_preserves_lex_ordering to preserves_lex_ordering * minor --------- Co-authored-by: berkaysynnada <[email protected]> * Replace `execution_mode` with `emission_type` and `boundedness` (#13823) * feat: update execution modes and add bitflags dependency - Introduced `Incremental` execution mode alongside existing modes in the DataFusion execution plan. - Updated various execution plans to utilize the new `Incremental` mode where applicable, enhancing streaming capabilities. - Added `bitflags` dependency to `Cargo.toml` for better management of execution modes. - Adjusted execution mode handling in multiple files to ensure compatibility with the new structure. * add exec API Signed-off-by: Jay Zhan <[email protected]> * replace done but has stackoverflow Signed-off-by: Jay Zhan <[email protected]> * exec API done Signed-off-by: Jay Zhan <[email protected]> * Refactor execution plan properties to remove execution mode - Removed the `ExecutionMode` parameter from `PlanProperties` across multiple physical plan implementations. - Updated related functions to utilize the new structure, ensuring compatibility with the changes. - Adjusted comments and cleaned up imports to reflect the removal of execution mode handling. This refactor simplifies the execution plan properties and enhances maintainability. * Refactor execution plan to remove `ExecutionMode` and introduce `EmissionType` - Removed the `ExecutionMode` parameter from `PlanProperties` and related implementations across multiple files. - Introduced `EmissionType` to better represent the output characteristics of execution plans. - Updated functions and tests to reflect the new structure, ensuring compatibility and enhancing maintainability. - Cleaned up imports and adjusted comments accordingly. This refactor simplifies the execution plan properties and improves the clarity of memory handling in execution plans. * fix test Signed-off-by: Jay Zhan <[email protected]> * Refactor join handling and emission type logic - Updated test cases in `sanity_checker.rs` to reflect changes in expected outcomes for bounded and unbounded joins, ensuring accurate test coverage. - Simplified the `is_pipeline_breaking` method in `execution_plan.rs` to clarify the conditions under which a plan is considered pipeline-breaking. - Enhanced the emission type determination logic in `execution_plan.rs` to prioritize `Final` over `Both` and `Incremental`, improving clarity in execution plan behavior. - Adjusted join type handling in `hash_join.rs` to classify `Right` joins as `Incremental`, allowing for immediate row emission. These changes improve the accuracy of tests and the clarity of execution plan properties. * Implement emission type for execution plans - Updated multiple execution plan implementations to replace `unimplemented!()` with `EmissionType::Incremental`, ensuring that the emission type is correctly defined for various plans. - This change enhances the clarity and functionality of the execution plans by explicitly specifying their emission behavior. These updates contribute to a more robust execution plan framework within the DataFusion project. * Enhance join type documentation and refine emission type logic - Updated the `JoinType` enum in `join_type.rs` to include detailed descriptions for each join type, improving clarity on their behavior and expected results. - Modified the emission type logic in `hash_join.rs` to ensure that `Right` and `RightAnti` joins are classified as `Incremental`, allowing for immediate row emission when applicable. These changes improve the documentation and functionality of join operations within the DataFusion project. * Refactor emission type logic in join and sort execution plans - Updated the emission type determination in `SortMergeJoinExec` and `SymmetricHashJoinExec` to utilize the `emission_type_from_children` function, enhancing the accuracy of emission behavior based on input characteristics. - Clarified comments in `sort.rs` regarding the conditions under which results are emitted, emphasizing the relationship between input sorting and emission type. - These changes improve the clarity and functionality of the execution plans within the DataFusion project, ensuring more robust handling of emission types. * Refactor emission type handling in execution plans - Updated the `emission_type_from_children` function to accept an iterator instead of a slice, enhancing flexibility in how child execution plans are passed. - Modified the `SymmetricHashJoinExec` implementation to utilize the new function signature, improving code clarity and maintainability. These changes streamline the emission type determination process within the DataFusion project, contributing to a more robust execution plan framework. * Enhance execution plan properties with boundedness and emission type - Introduced `boundedness` and `pipeline_behavior` methods to the `ExecutionPlanProperties` trait, improving the handling of execution plan characteristics. - Updated the `CsvExec`, `SortExec`, and related implementations to utilize the new methods for determining boundedness and emission behavior. - Refactored the `ensure_distribution` function to use the new boundedness logic, enhancing clarity in distribution decisions. - These changes contribute to a more robust and maintainable execution plan framework within the DataFusion project. * Refactor execution plans to enhance boundedness and emission type handling - Updated multiple execution plan implementations to incorporate `Boundedness` and `EmissionType`, improving the clarity and functionality of execution plans. - Replaced instances of `unimplemented!()` with appropriate emission types, ensuring that plans correctly define their output behavior. - Refactored the `PlanProperties` structure to utilize the new boundedness logic, enhancing decision-making in execution plans. - These changes contribute to a more robust and maintainable execution plan framework within the DataFusion project. * Refactor memory handling in execution plans - Updated the condition for checking memory requirements in execution plans from `has_finite_memory()` to `boundedness().requires_finite_memory()`, improving clarity in memory management. - This change enhances the robustness of execution plans within the DataFusion project by ensuring more accurate assessments of memory constraints. * Refactor boundedness checks in execution plans - Updated conditions for checking boundedness in various execution plans to use `is_unbounded()` instead of `requires_finite_memory()`, enhancing clarity in memory management. - Adjusted the `PlanProperties` structure to reflect these changes, ensuring more accurate assessments of memory constraints across the DataFusion project. - These modifications contribute to a more robust and maintainable execution plan framework, improving the handling of boundedness in execution strategies. * Remove TODO comment regarding unbounded execution plans in `UnboundedExec` implementation - Eliminated the outdated comment suggesting a switch to unbounded execution with finite memory, streamlining the code and improving clarity. - This change contributes to a cleaner and more maintainable codebase within the DataFusion project. * Refactor execution plan boundedness and emission type handling - Updated the `is_pipeline_breaking` method to use `requires_finite_memory()` for improved clarity in determining pipeline behavior. - Enhanced the `Boundedness` enum to include detailed documentation on memory requirements for unbounded streams. - Refactored `compute_properties` methods in `GlobalLimitExec` and `LocalLimitExec` to directly use the input's boundedness, simplifying the logic. - Adjusted emission type determination in `NestedLoopJoinExec` to utilize the `emission_type_from_children` function, ensuring accurate output behavior based on input characteristics. These changes contribute to a more robust and maintainable execution plan framework within the DataFusion project, improving clarity and functionality in handling boundedness and emission types. * Refactor emission type and boundedness handling in execution plans - Removed the `OptionalEmissionType` struct from `plan_properties.rs`, simplifying the codebase. - Updated the `is_pipeline_breaking` function in `execution_plan.rs` for improved readability by formatting the condition across multiple lines. - Adjusted the `GlobalLimitExec` implementation in `limit.rs` to directly use the input's boundedness, enhancing clarity in memory management. These changes contribute to a more streamlined and maintainable execution plan framework within the DataFusion project, improving the handling of emission types and boundedness. * Refactor GlobalLimitExec and LocalLimitExec to enhance boundedness handling - Updated the `compute_properties` methods in both `GlobalLimitExec` and `LocalLimitExec` to replace `EmissionType::Final` with `Boundedness::Bounded`, reflecting that limit operations always produce a finite number of rows. - Changed the input's boundedness reference to `pipeline_behavior()` for improved clarity in execution plan properties. These changes contribute to a more streamlined and maintainable execution plan framework within the DataFusion project, enhancing the handling of boundedness in limit operations. * Review Part1 * Update sanity_checker.rs * addressing reviews * Review Part 1 * Update datafusion/physical-plan/src/execution_plan.rs * Update datafusion/physical-plan/src/execution_plan.rs * Shorten imports * Enhance documentation for JoinType and Boundedness enums - Improved descriptions for the Inner and Full join types in join_type.rs to clarify their behavior and examples. - Added explanations regarding the boundedness of output streams and memory requirements in execution_plan.rs, including specific examples for operators like Median and Min/Max. --------- Signed-off-by: Jay Zhan <[email protected]> Co-authored-by: berkaysynnada <[email protected]> Co-authored-by: Mehmet Ozan Kabak <[email protected]> * Preserve ordering equivalencies on `with_reorder` (#13770) * Preserve ordering equivalencies on `with_reorder` * Add assertions * Return early if filtered_exprs is empty * Add clarify comment * Refactor * Add comprehensive test case * Add comment for exprs_equal * Cargo fmt * Clippy fix * Update properties.rs * Update exprs_equal and add tests * Update properties.rs --------- Co-authored-by: berkaysynnada <[email protected]> * replace CASE expressions in predicate pruning with boolean algebra (#13795) * replace CASE expressions in predicate pruning with boolean algebra * fix merge * update tests * add some more tests * add some more tests * remove duplicate test case * Update datafusion/physical-optimizer/src/pruning.rs * swap NOT for != * replace comments, update docstrings * fix example * update tests * update tests * Apply suggestions from code review Co-authored-by: Andrew Lamb <[email protected]> * Update pruning.rs Co-authored-by: Chunchun Ye <[email protected]> * Update pruning.rs Co-authored-by: Chunchun Ye <[email protected]> --------- Co-authored-by: Andrew Lamb <[email protected]> Co-authored-by: Chunchun Ye <[email protected]> * enable DF's nested_expressions feature by in datafusion-substrait tests to make them pass (#13857) fixes #13854 Co-authored-by: Arttu Voutilainen <[email protected]> * Add configurable normalization for configuration options and preserve case for S3 paths (#13576) * Do not normalize values * Fix tests & update docs * Prettier * Lowercase config params * Unify transform and parse * Fix tests * Rename `default_transform` and relax boundaries * Make `compression` case-insensitive * Comment to new line * Deprecate and ignore `enable_options_value_normalization` * Update datafusion/common/src/config.rs * fix typo --------- Co-authored-by: Oleks V <[email protected]> * Improve`Signature` and `comparison_coercion` documentation (#13840) * Improve Signature documentation more * Apply suggestions from code review Co-authored-by: Piotr Findeisen <[email protected]> --------- Co-authored-by: Piotr Findeisen <[email protected]> * feat: support normalized expr in CSE (#13315) * feat: support normalized expr in CSE * feat: support normalize_eq in cse optimization * feat: support cumulative binary expr result in normalize_eq --------- Co-authored-by: Andrew Lamb <[email protected]> * Upgrade to sqlparser `0.53.0` (#13767) * chore: Udpate to sqlparser 0.53.0 * Update for new sqlparser API * more api updates * Avoid serializing query to SQL string unless it is necessary * Box wildcard options * chore: update datafusion-cli Cargo.lock * Minor: Use `resize` instead of `extend` for adding static values in SortMergeJoin logic (#13861) Thanks @Dandandan * feat(function): add `least` function (#13786) * start adding least fn * feat(function): add least function * update function name * fix scalar smaller function * add tests * run Clippy and Fmt * Generated docs using `./dev/update_function_docs.sh` * add comment why `descending: false` * update comment * Update least.rs Co-authored-by: Bruce Ritchie <[email protected]> * Update scalar_functions.md * run ./dev/update_function_docs.sh to update docs * merge greatest and least implementation to one * add header --------- Co-authored-by: Bruce Ritchie <[email protected]> Co-authored-by: Andrew Lamb <[email protected]> * Improve SortPreservingMerge::enable_round_robin_repartition docs (#13826) * Clarify SortPreservingMerge::enable_round_robin_repartition docs * tweaks * Improve comments more * clippy * fix doc link * Minor: Unify `downcast_arg` method (#13865) * Implement `SHOW FUNCTIONS` (#13799) * introduce rid for different signature * implement show functions syntax * add syntax example * avoid duplicate join * fix clippy * show function_type instead of routine_type * add some doc and comments * Update bzip2 requirement from 0.4.3 to 0.5.0 (#13740) * Update bzip2 requirement from 0.4.3 to 0.5.0 Updates the requirements on [bzip2](https://github.com/trifectatechfoundation/bzip2-rs) to permit the latest version. - [Release notes](https://github.com/trifectatechfoundation/bzip2-rs/releases) - [Commits](https://github.com/trifectatechfoundation/bzip2-rs/compare/0.4.4...v0.5.0) --- updated-dependencies: - dependency-name: bzip2 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> * Fix test * Fix CLI cargo.lock --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: jonahgao <[email protected]> * Fix build (#13869) * feat(substrait): modular substrait consumer (#13803) * feat(substrait): modular substrait consumer * feat(substrait): include Extension Rel handlers in default consumer Include SerializerRegistry based handlers for Extension Relations in the DefaultSubstraitConsumer * refactor(substrait) _selection -> _field_reference * refactor(substrait): remove SubstraitPlannerState usage from consumer * refactor: get_state() -> get_function_registry() * docs: elide imports from example * test: simplify test * refactor: remove Arc from DefaultSubstraitConsumer * doc: add ticket for API improvements * doc: link DefaultSubstraitConsumer to from_subtrait_plan * refactor: remove redundant Extensions parsing * Minor: fix: Include FetchRel when producing LogicalPlan from Sort (#13862) * include FetchRel when producing LogicalPlan from Sort * add suggested test * address review feedback * Minor: improve error message when ARRAY literals can not be planned (#13859) * Minor: improve error message when ARRAY literals can not be planned * fmt * Update datafusion/sql/src/expr/value.rs Co-authored-by: Oleks V <[email protected]> --------- Co-authored-by: Oleks V <[email protected]> * Add documentation for `SHOW FUNCTIONS` (#13868) * Support unicode character for `initcap` function (#13752) * Support unicode character for 'initcap' function Signed-off-by: Tai Le Manh <[email protected]> * Update unit tests * Fix clippy warning * Update sqllogictests - initcap * Update scalar_functions.md docs * Add suggestions change Signed-off-by: Tai Le Manh <[email protected]> --------- Signed-off-by: Tai Le Manh <[email protected]> * [minor] make recursive package dependency optional (#13778) * make recursive optional * add to default for common package * cargo update * added to readme * make test conditional * reviews * cargo update --------- Co-authored-by: Andrew Lamb <[email protected]> * Minor: remove unused async-compression `futures-io` feature (#13875) * Minor: remove unused async-compression feature * Fix cli cargo lock * Consolidate Example: dataframe_output.rs into dataframe.rs (#13877) * Restore `DocBuilder::new()` to avoid breaking API change (#13870) * Fix build * Restore DocBuilder::new(), deprecate * cmt * clippy * Improve error messages for incorrect zero argument signatures (#13881) * Improve error messages for incorrect zero argument signatures * fix errors * fix fmt * Consolidate Example: simplify_udwf_expression.rs into advanced_udwf.rs (#13883) * minor: fix typos in comments / structure names (#13879) * minor: fix typo error in datafusion * fix: fix rebase error * fix: format HashJoinExec doc * doc: recover thiserror/preemptively * fix: other typo error fixed * fix: directories to dir_entries in catalog example * Support 1 or 3 arg in generate_series() UDTF (#13856) * Support 1 or 3 args in generate_series() UDTF * address comment * Support (order by / sort) for DataFrameWriteOptions (#13874) * Support (order by / sort) for DataFrameWriteOptions * Fix fmt * Fix import * Add insert into example * Update sort_merge_join.rs (#13894) * Update join_selection.rs (#13893) * Fix `recursive-protection` feature flag (#13887) * Fix recursive-protection feature flag * rename feature flag to be consistent * Make default * taplo format * Fix visibility of swap_hash_join (#13899) * Minor: Avoid emitting empty batches in partial sort (#13895) * Update partial_sort.rs * Update partial_sort.rs * Update partial_sort.rs * Prepare for 44.0.0 release: version and changelog (#13882) * Prepare for 44.0.0 release: version and changelog * update changelog * update configs * update before release * Support unparsing implicit lateral `UNNEST` plan to SQL text (#13824) * support unparsing the implicit lateral unnest plan * cargo clippy and fmt * refactor for `check_unnest_placeholder_with_outer_ref` * add const for the prefix string of unnest and outer refernece column * fix case_column_or_null with nullable when conditions (#13886) * fix case_column_or_null with nullable when conditions * improve sqllogictests for case_column_or_null --------- Co-authored-by: zhangli20 <[email protected]> * Fixed Issue #13896 (#13903) The URL to the external website was returning a 404. Presuming recent changes in the external website's structure, the required data has been moved to a different URL. The commit ensures the new URL is used. * Introduce `UserDefinedLogicalNodeUnparser` for User-defined Logical Plan unparsing (#13880) * make ast builder public * introduce udlp unparser * add documents * add examples * add negative tests and fmt * fix the doc * rename udlp to extension * apply the first unparsing result only * improve the doc * seperate the enum for the unparsing result * fix the doc --------- Co-authored-by: Andrew Lamb <[email protected]> * Preserve constant values across union operations (#13805) * Add value tracking to ConstExpr for improved union optimization * Update PartialEq impl * Minor change * Add docstring for ConstExpr value * Improve constant propagation across union partitions * Add assertion for across_partitions * fix fmt * Update properties.rs * Remove redundant constant removal loop * Remove unnecessary mut * Set across_partitions=true when both sides are constant * Extract and use constant values in filter expressions * Add initial SLT for constant value tracking across UNION ALL * Assign values to ConstExpr where possible * Revert "Set across_partitions=true when both sides are constant" This reverts commit 3051cd470b0ad4a70cd8bd3518813f5ce0b3a449. * Temporarily take value from literal * Lint fixes * Cargo fmt * Add get_expr_constant_value * Make `with_value()` accept optional value * Add todo * Move test to union.slt * Fix changed slt after merge * Simplify constexpr * Update properties.rs --------- Co-authored-by: berkaysynnada <[email protected]> * chore(deps): update sqllogictest requirement from 0.23.0 to 0.24.0 (#13902) * fix RecordBatch size in topK (#13906) * ci improvements, update protoc (#13876) * Fix md5 return_type to only return Utf8 as per current code impl. * ci improvements * Lock taiki-e/install-action to a githash for apache action policy - Release 2.46.19 in the case of this hash. * Lock taiki-e/install-action to a githash for apache action policy - Release 2.46.19 in the case of this hash. * Revert nextest change until action is approved. * Exclude requires workspace * Fixing minor typo to verify ci caching of builds is working as expected. * Updates from PR review. * Adding issue link for disabling intel mac build * improve performance of running examples * remove cargo check * Introduce LogicalPlan invariants, begin automatically checking them (#13651) * minor(13525): perform LP validation before and after each possible mutation * minor(13525): validate unique field names on query and subquery schemas, after each optimizer pass * minor(13525): validate union after each optimizer passes * refactor: make explicit what is an invariant of the logical plan, versus assertions made after a given analyzer or optimizer pass * chore: add link to invariant docs * fix: add new invariants module * refactor: move all LP invariant checking into LP, delineate executable (valid semantic plan) vs basic LP invariants * test: update test for slight error message change * fix: push_down_filter optimization pass can push a IN(<subquery>) into a TableScan's filter clause * refactor: move collect_subquery_cols() to common utils crate * refactor: clarify the purpose of assert_valid_optimization(), runs after all optimizer passes, except in debug mode it runs after each pass. * refactor: based upon performance tests, run the maximum number of checks without impa ct: * assert_valid_optimization can run each optimizer pass * remove the recursive cehck_fields, which caused the performance regression * the full LP Invariants::Executable can only run in debug * chore: update error naming and terminology used in code comments * refactor: use proper error methods * chore: more cleanup of error messages * chore: handle option trailer to error message * test: update sqllogictests tests to not use multiline * Correct return type for initcap scalar function with utf8view (#13909) * Set utf8view as return type when input type is the same * Verify that the returned type from call to scalar function matches the return type specified in the return_type function * Match return type to utf8view * Consolidate example: simplify_udaf_expression.rs into advanced_udaf.rs (#13905) * Implement maintains_input_order for AggregateExec (#13897) * Implement maintains_input_order for AggregateExec * Update mod.rs * Improve comments --------- Co-authored-by: berkaysynnada <[email protected]> Co-authored-by: mertak-synnada <[email protected]> Co-authored-by: Mehmet Ozan Kabak <[email protected]> * Move join type input swapping to pub methods on Joins (#13910) * doc-gen: migrate scalar functions (string) documentation 3/4 (#13926) Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * Update sqllogictest requirement from 0.24.0 to 0.25.0 (#13917) * Update sqllogictest requirement from 0.24.0 to 0.25.0 Updates the requirements on [sqllogictest](https://github.com/risinglightdb/sqllogictest-rs) to permit the latest version. - [Release notes](https://github.com/risinglightdb/sqllogictest-rs/releases) - [Changelog](https://github.com/risinglightdb/sqllogictest-rs/blob/main/CHANGELOG.md) - [Commits](https://github.com/risinglightdb/sqllogictest-rs/compare/v0.24.0...v0.25.0) --- updated-dependencies: - dependency-name: sqllogictest dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> * Remove labels --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: jonahgao <[email protected]> * Consolidate Examples: memtable.rs and parquet_multiple_files.rs (#13913) * doc-gen: migrate scalar functions (crypto) documentation (#13918) * doc-gen: migrate scalar functions (crypto) documentation * doc-gen: fix typo and update function docs --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * doc-gen: migrate scalar functions (datetime) documentation 1/2 (#13920) * doc-gen: migrate scalar functions (datetime) documentation 1/2 * fix: fix typo and update function docs --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * fix RecordBatch size in hash join (#13916) * doc-gen: migrate scalar functions (array) documentation 1/3 (#13928) * doc-gen: migrate scalar functions (array) documentation 1/3 * fix: remove unsed import, fix typo and update function docs --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * doc-gen: migrate scalar functions (math) documentation 1/2 (#13922) * doc-gen: migrate scalar functions (math) documentation 1/2 * fix: fix typo --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * doc-gen: migrate scalar functions (math) documentation 2/2 (#13923) * doc-gen: migrate scalar functions (math) documentation 2/2 * fix: fix typo --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * doc-gen: migrate scalar functions (array) documentation 3/3 (#13930) * doc-gen: migrate scalar functions (array) documentation 3/3 * fix: import doc and macro, fix typo and update function docs --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * doc-gen: migrate scalar functions (array) documentation 2/3 (#13929) * doc-gen: migrate scalar functions (array) documentation 2/3 * fix: import doc and macro, fix typo and update function docs --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * doc-gen: migrate scalar functions (string) documentation 4/4 (#13927) * doc-gen: migrate scalar functions (string) documentation 4/4 * fix: fix typo and update function docs --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * Support explain query when running dfbench with clickbench (#13942) * Support explain query when running dfbench * Address comments * Consolidate example to_date.rs into dateframe.rs (#13939) * Consolidate example to_date.rs into dateframe.rs * Assert results using assert_batches_eq * clippy * Revert "Update sqllogictest requirement from 0.24.0 to 0.25.0 (#13917)" (#13945) * Revert "Update sqllogictest requirement from 0.24.0 to 0.25.0 (#13917)" This reverts commit 0989649214a6fe69ffb33ed38c42a8d3df94d6bf. * add comment * Implement predicate pruning for `like` expressions (prefix matching) (#12978) * Implement predicate pruning for like expressions * add function docstring * re-order bounds calculations * fmt * add fuzz tests * fix clippy * Update datafusion/core/tests/fuzz_cases/pruning.rs Co-authored-by: Andrew Lamb <[email protected]> --------- Co-authored-by: Andrew Lamb <[email protected]> * doc-gen: migrate scalar functions (string) documentation 1/4 (#13924) Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * consolidate dataframe_subquery.rs into dataframe.rs (#13950) * migrate btrim to user_doc macro (#13952) * doc-gen: migrate scalar functions (datetime) documentation 2/2 (#13921) * doc-gen: migrate scalar functions (datetime) documentation 2/2 * fix: fix typo and update function docs * doc: update function docs * doc-gen: remove slash --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * Add sqlite test files, progress bar, and automatic postgres container management into sqllogictests (#13936) * Fix md5 return_type to only return Utf8 as per current code impl. * Add support for sqlite test files to sqllogictest * Force version 0.24.0 of sqllogictest dependency until issue with labels is fixed. * Removed workaround for bug that was fixed. * Git submodule update ... err update, link to sqlite tests. * Git submodule update * Readd submodule --------- Co-authored-by: Andrew Lamb <[email protected]> * Supporting writing schema metadata when writing Parquet in parallel (#13866) * refactor: make ParquetSink tests a bit more readable * chore(11770): add new ParquetOptions.skip_arrow_metadata * test(11770): demonstrate that the single threaded ParquetSink is already writing the arrow schema in the kv_meta, and allow disablement * refactor(11770): replace with new method, since the kv_metadata is inherent to TableParquetOptions and therefore we should explicitly make the API apparant that you have to include the arrow schema or not * fix(11770): fix parallel ParquetSink to encode arrow schema into the file metadata, based on the ParquetOptions * refactor(11770): provide deprecation warning for TryFrom * test(11770): update tests with new default to include arrow schema * refactor: including partitioning of arrow schema inserted into kv_metdata * test: update tests for new config prop, as well as the new file partition offsets based upon larger metadata * chore: avoid cloning in tests, and update code docs * refactor: return to the WriterPropertiesBuilder::TryFrom<TableParquetOptions>, and separately add the arrow_schema to the kv_metadata on the TableParquetOptions * refactor: require the arrow_schema key to be present in the kv_metadata, if is required by the configuration * chore: update configs.md * test: update tests to handle the (default) required arrow schema in the kv_metadata * chore: add reference to arrow-rs upstream PR * chore: Create devcontainer.json (#13520) * Create devcontainer.json * update devcontainer * remove useless features * Minor: consolidate ConfigExtension example into API docs (#13954) * Update examples README.md * Minor: consolidate ConfigExtension example into API docs * more docs * Remove update * clippy * Fix issue with ExtensionsOptions docs * Parallelize pruning utf8 fuzz test (#13947) * Add swap_inputs to SMJ (#13984) * fix(datafusion-functions-nested): `arrow-distinct` now work with null rows (#13966) * added failing test * fix(datafusion-functions-nested): `arrow-distinct` now work with null rows * Update datafusion/functions-nested/src/set_ops.rs Co-authored-by: Andrew Lamb <[email protected]> * Update set_ops.rs --------- Co-authored-by: Andrew Lamb <[email protected]> * Update release instructions for 44.0.0 (#13959) * Update release instructions for 44.0.0 * update macros and order * add functions-table * Add datafusion python 43.1.0 blog post to doc. (#13974) * Include license and notice files in more crates (#13985) * Extract postgres container from sqllogictest, update datafusion-testing pin (#13971) * Add support for sqlite test files to sqllogictest * Removed workaround for bug that was fixed. * Refactor sqllogictest to extract postgres functionality into a separate file. Removed dependency on once_cell in favour of LazyLock. * Add missing license header. * Update rstest requirement from 0.23.0 to 0.24.0 (#13977) Updates the requirements on [rstest](https://github.com/la10736/rstest) to permit the latest version. - [Release notes](https://github.com/la10736/rstest/releases) - [Changelog](https://github.com/la10736/rstest/blob/master/CHANGELOG.md) - [Commits](https://github.com/la10736/rstest/compare/v0.23.0...v0.23.0) --- updated-dependencies: - dependency-name: rstest dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Move hash collision test to run only when merging to main. (#13973) * Update itertools requirement from 0.13 to 0.14 (#13965) * Update itertools requirement from 0.13 to 0.14 Updates the requirements on [itertools](https://github.com/rust-itertools/itertools) to permit the latest version. - [Changelog](https://github.com/rust-itertools/itertools/blob/master/CHANGELOG.md) - [Commits](https://github.com/rust-itertools/itertools/compare/v0.13.0...v0.13.0) --- updated-dependencies: - dependency-name: itertools dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> * Fix build * Simplify * Update CLI lock --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: jonahgao <[email protected]> * Change trigger, rename `hash_collision.yml` to `extended.yml` and add comments (#13988) * Rename hash_collision.yml to extended.yml and add comments * Adjust schedule, add comments * Update job, rerun * doc-gen: migrate scalar functions (string) documentation 2/4 (#13925) * doc-gen: migrate scalar functions (string) documentation 2/4 * doc-gen: update function docs * doc: fix related udf order for upper function in documentation * Update datafusion/functions/src/string/concat_ws.rs * Update datafusion/functions/src/string/concat_ws.rs * Update datafusion/functions/src/string/concat_ws.rs * doc-gen: update function docs --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> Co-authored-by: Oleks V <[email protected]> * Update substrait requirement from 0.50 to 0.51 (#13978) Updates the requirements on [substrait](https://github.com/substrait-io/substrait-rs) to permit the latest version. - [Release notes](https://github.com/substrait-io/substrait-rs/releases) - [Changelog](https://github.com/substrait-io/substrait-rs/blob/main/CHANGELOG.md) - [Commits](https://github.com/substrait-io/substrait-rs/compare/v0.50.0...v0.51.0) --- updated-dependencies: - dependency-name: substrait dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Update release README for datafusion-cli publishing (#13982) * Enhance LastValueAccumulator logic and add SQL logic tests for last_value function (#13980) - Updated LastValueAccumulator to include requirement satisfaction check before updating the last value. - Added SQL logic tests to verify the behavior of the last_value function with merge batches and ensure correct aggregation in various scenarios. * Improve deserialize_to_struct example (#13958) * Cleanup deserialize_to_struct example * prettier * Apply suggestions from code review Co-authored-by: Jonah Gao <[email protected]> --------- Co-authored-by: Jonah Gao <[email protected]> * Update docs (#14002) * Optimize CASE expression for "expr or expr" usage. (#13953) * Apply optimization for ExprOrExpr. * Implement optimization similar to existing code. * Add sqllogictest. * feat(substrait): introduce consume_rel and consume_expression (#13963) * feat(substrait): introduce consume_rel and consume_expression Route calls to from_substrait_rel and from_substrait_rex through the SubstraitConsumer in order to allow users to provide their own behaviour * feat(substrait): consume nulls of user-defined types * docs(substrait): consume_rel and consume_expression docstrings * Consolidate csv_opener.rs and json_opener.rs into a single example (#… (#13981) * Consolidate csv_opener.rs and json_opener.rs into a single example (#13955) * Update datafusion-examples/examples/csv_json_opener.rs Co-authored-by: Andrew Lamb <[email protected]> * Update datafusion-examples/README.md Co-authored-by: Andrew Lamb <[email protected]> * Apply code formatting with cargo fmt --------- Co-authored-by: Sergey Zhukov <[email protected]> Co-authored-by: Andrew Lamb <[email protected]> * FIX : Incorrect NULL handling in BETWEEN expression (#14007) * submodule update * FIX : Incorrect NULL handling in BETWEEN expression * Revert "submodule update" This reverts commit 72431aadeaf33a27775a88c41931572a0b66bae3. * fix incorrect unit test * move sqllogictest to expr * feat(substrait): modular substrait producer (#13931) * feat(substrait): modular substrait producer * refactor(substrait): simplify col_ref_offset handling in producer * refactor(substrait): remove column offset tracking from producer * docs(substrait): document SubstraitProducer * refactor: minor cleanup * feature: remove unused SubstraitPlanningState BREAKING CHANGE: SubstraitPlanningState is no longer available * refactor: cargo fmt * refactor(substrait): consume_ -> handle_ * refactor(substrait): expand match blocks * refactor: DefaultSubstraitProducer only needs serializer_registry * refactor: remove unnecessary warning suppression * fix(substrait): route expr conversion through handle_expr * cargo fmt * fix: Avoid re-wrapping planning errors Err(DataFusionError::Plan) for use in plan_datafusion_err (#14000) * fix: unwrapping Err(DataFusionError::Plan) for use in plan_datafusion_err * test: add tests for error formatting during planning * feat: support `RightAnti` for `SortMergeJoin` (#13680) * feat: support `RightAnti` for `SortMergeJoin` * feat: preserve session id when using cxt.enable_url_table() (#14004) * Return error message during planning when inserting into a MemTable with zero partitions. (#14011) * Minor: Rewrite LogicalPlan::max_rows for Join and Union, made it easier to understand (#14012) * Refactor max_rows for join plan, made it easier to understand * Simplified max_rows for Union * Chore: update wasm-supported crates, add tests (#14005) * Chore: update wasm-supported crates * format * Use workspace rust-version for all workspace crates (#14009) * [Minor] refactor: make ArraySort public for broader access (#14006) * refactor: make ArraySort public for broader access Changes the visibility of the ArraySort struct fromsuper to public. allows broader access to the struct, enabling its use in other modules and promoting better code reuse. * clippy and docs --------- Co-authored-by: Andrew Lamb <[email protected]> * Update sqllogictest requirement from =0.24.0 to =0.26.0 (#14017) * Update sqllogictest requirement from =0.24.0 to =0.26.0 Updates the requirements on [sqllogictest](https://github.com/risinglightdb/sqllogictest-rs) to permit the latest version. - [Release notes](https://github.com/risinglightdb/sqllogictest-rs/releases) - [Changelog](https://github.com/risinglightdb/sqllogictest-rs/blob/main/CHANGELOG.md) - [Commits](https://github.com/risinglightdb/sqllogictest-rs/compare/v0.24.0...v0.26.0) --- updated-dependencies: - dependency-name: sqllogictest dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> * remove version pin and note --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Eduard Karacharov <[email protected]> * `url` dependancy update (#14019) * `url` dependancy update * `url` version update for datafusion-cli * Minor: Improve zero partition check when inserting into `MemTable` (#14024) * Improve zero partition check when inserting into `MemTable` * update err msg * refactor: make structs public and implement Default trait (#14030) * Minor: Remove redundant implementation of `StringArrayType` (#14023) * Minor: Remove redundant implementation of StringArrayType Signed-off-by: Tai Le Manh <[email protected]> * Deprecate rather than remove StringArrayType --------- Signed-off-by: Tai Le Manh <[email protected]> Co-authored-by: Andrew Lamb <[email protected]> * Added references to IDE documentation for dev containers along with a small note about why one may choose to do development using a dev container. (#14014) * Use partial aggregation schema for spilling to avoid column mismatch in GroupedHashAggregateStream (#13995) * Refactor spill handling in GroupedHashAggregateStream to use partial aggregate schema * Implement aggregate functions with spill handling in tests * Add tests for aggregate functions with and without spill handling * Move test related imports into mod test * Rename spill pool test functions for clarity and consistency * Refactor aggregate function imports to use fully qualified paths * Remove outdated comments regarding input batch schema for spilling in GroupedHashAggregateStream * Update aggregate test to use AVG instead of MAX * assert spill count * Refactor partial aggregate schema creation to use create_schema function * Refactor partial aggregation schema creation and remove redundant function * Remove unused import of Schema from arrow::datatypes in row_hash.rs * move spill pool testing for aggregate functions to physical-plan/src/aggregates * Use Arc::clone for schema references in aggregate functions * Encapsulate fields of `EquivalenceProperties` (#14040) * Encapsulate fields of `EquivalenceGroup` (#14039) * Fix error on `array_distinct` when input is empty #13810 (#14034) * fix * add test * oops --------- Co-authored-by: Cyprien Huet <[email protected]> * Update petgraph requirement from 0.6.2 to 0.7.1 (#14045) * Update petgraph requirement from 0.6.2 to 0.7.1 Updates the requirements on [petgraph](https://github.com/petgraph/petgraph) to permit the latest version. - [Changelog](https://github.com/petgraph/petgraph/blob/master/RELEASES.rst) - [Commits](https://github.com/petgraph/petgraph/compare/[email protected]@v0.7.1) --- updated-dependencies: - dependency-name: petgraph dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> * Update datafusion-cli/Cargo.lock --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Andrew Lamb <[email protected]> * Encapsulate fields of `OrderingEquivalenceClass` (make field non pub) (#14037) * Complete encapsulatug `OrderingEquivalenceClass` (make fields non pub) * fix doc * Fix: ensure that compression type is also taken into consideration during ListingTableConfig infer_options (#14021) * chore: add test to verify that schema is inferred as expected * chore: add comment to method as suggested * chore: restructure to avoid need to clone * chore: fix flaw in rewrite * feat(optimizer): Enable filter pushdown on window functions (#14026) * feat(optimizer): Enable filter pushdown on window functions Ensures selections can be pushed past window functions similarly to what is already done with aggregations, when possible. * fix: Add missing dependency * minor(optimizer): Use 'datafusion-functions-window' as a dev dependency * docs(optimizer): Add example to filter pushdown on LogicalPlan::Window * Unparsing optimized (> 2 inputs) unions (#14031) * tests and optimizer in testing queries * unparse optimized unions * format Cargo.toml * format Cargo.toml * revert test * rewrite test to avoid cyclic dep * remove old test * cleanup * comments and error handling * handle union with lt 2 inputs * Minor: Document output schema of LogicalPlan::Aggregate and LogicalPlan::Window (#14047) * Simplify error handling in case.rs (#13990) (#14033) * Simplify error handling in case.rs (#13990) * Fix issues causing GitHub checks to fail * Update datafusion/physical-expr/src/expressions/case.rs Co-authored-by: Andrew Lamb <[email protected]> --------- Co-authored-by: Sergey Zhukov <[email protected]> Co-authored-by: Andrew Lamb <[email protected]> * feat: add `AsyncCatalogProvider` helpers for asynchronous catalogs (#13800) * Add asynchronous catalog traits to help users that have asynchronous catalogs * Apply clippy suggestions * Address PR reviews * Remove allow_unused exceptions * Update remote catalog example to demonstrate new helper structs * Move schema_name / catalog_name parameters into resolve function and out of trait * Custom scalar to sql overrides support for DuckDB Unparser dialect (#13915) * Allow adding custom scalar to sql overrides for DuckDB (#68) * Add unit test: custom_scalar_overrides_duckdb * Move `with_custom_scalar_overrides` definition on `Dialect` trait level * Improve perfomance of `reverse` function (#14025) * Improve perfomance of 'reverse' function Signed-off-by: Tai Le Manh <[email protected]> * Apply sugestion change * Fix typo --------- Signed-off-by: Tai Le Manh <[email protected]> * docs(ci): use up-to-date protoc with docs.rs (#14048) * fix (#14042) Co-authored-by: Cyprien Huet <[email protected]> * Re-export TypeSignatureClass from the datafusion-expr package (#14051) * Fix clippy for Rust 1.84 (#14065) * fix: incorrect error message of function_length_check (#14056) * minor fix * add ut * remove check for 0 arg * test: Add plan execution during tests for bounded source (#14013) * Bump `ctor` to `0.2.9` (#14069) * Refactor into `LexOrdering::collapse`, `LexRequirement::collapse` avoid clone (#14038) * Move collapse_lex_ordering to Lexordering::collapse * reduce diff * avoid clone, cleanup * Introduce LexRequirement::collapse * Improve performance of collapse, from @akurmustafa https://github.com/alamb/datafusion/pull/26 fix formatting * Revert "Improve performance of collapse, from @akurmustafa" This reverts commit a44acfdb3af5bf0082c277de6ee7e09e92251a49. * remove incorrect comment --------- Co-authored-by: Mustafa Akur <[email protected]> * Bump `wasm-bindgen` and `wasm-bindgen-futures` (#14068) * update (#14070) * fix: make get_valid_types handle TypeSignature::Numeric correctly (#14060) * fix get_valid_types with TypeSignature::Numeric * f…

Fix get_type for higher-order array functions

6903259

github-actions bot added the logical-expr Logical plan and expressions label Dec 13, 2024

findepi commented Dec 13, 2024

View reviewed changes

datafusion/functions-nested/src/flatten.rs Show resolved Hide resolved

Fix recursive flatten

6d81418

The fix is covered by recursive flatten test case in array.slt

findepi force-pushed the findepi/array-get-type branch from 1bd311a to 6d81418 Compare December 13, 2024 13:55

jayzhan211 reviewed Dec 13, 2024

View reviewed changes

findepi requested review from alamb, jayzhan211 and jonahgao December 13, 2024 15:39

alamb reviewed Dec 13, 2024

View reviewed changes

Restore "keep LargeList" in Array signature

038a015

buraksenn mentioned this pull request Dec 15, 2024

flatten should be single-step, not recursive #13757

Open

findepi requested a review from alamb December 17, 2024 12:45

alamb approved these changes Dec 17, 2024

View reviewed changes

alamb mentioned this pull request Dec 17, 2024

Cannot create a List of FixedSizedList in SQL #13819

Closed

clarify naming in the test

69fcf24

findepi merged commit 7e0fc14 into apache:main Dec 18, 2024
25 checks passed

findepi deleted the findepi/array-get-type branch December 18, 2024 07:15

jayzhan211 mentioned this pull request Dec 18, 2024

Compute ScalarFunction properties including return_type and nullable on creation #13825

Open

alamb mentioned this pull request Jan 1, 2025

Jan 1, 2025: This week(s) in DataFusion #13970

Closed

11 tasks

		fn array(array_type: &DataType) -> Option<DataType> {
		match array_type {

	// Verify that function is invoked with correct number and type of arguments as defined in `TypeSignature`
	let new_data_types = data_types_with_scalar_udf(&arg_data_types, func)
	.map_err(\|err\| {
	plan_datafusion_err!(
	"{} {}",
	err,
	utils::generate_signature_error_msg(
	func.name(),
	func.signature().clone(),
	&arg_data_types,
	)
	)
	})?;

Fix get_type for higher-order array functions #13756

Fix get_type for higher-order array functions #13756

Conversation

findepi commented Dec 13, 2024

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

findepi commented Dec 13, 2024

alamb commented Dec 13, 2024

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jayzhan211 Dec 17, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

findepi commented Dec 13, 2024

jayzhan211 commented Dec 14, 2024

findepi commented Dec 14, 2024

jayzhan211 commented Dec 16, 2024

findepi commented Dec 17, 2024

jayzhan211 commented Dec 17, 2024 • edited Loading

alamb left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alamb commented Dec 17, 2024

findepi commented Dec 17, 2024

findepi commented Dec 17, 2024

jayzhan211 Dec 17, 2024 •

edited

Loading

jayzhan211 commented Dec 17, 2024 •

edited

Loading

alamb left a comment •

edited

Loading