
feat: Add GroupColumn Decimal128Array #13564

Merged: 10 commits into apache:main, Dec 4, 2024

Conversation

jonathanc-n (Contributor):

Which issue does this PR close?

Closes #13505.

Rationale for this change

What changes are included in this PR?

Added a group column for Decimal128Array.

Are these changes tested?

Yes, SLT tests.

Are there any user-facing changes?

github-actions bot added the physical-expr (Physical Expressions) and sqllogictest (SQL Logic Tests (.slt)) labels Nov 26, 2024
github-actions bot added the core (Core DataFusion crate) label Nov 26, 2024
    // Set timezone information for timestamp
    Arc::new(arr.with_data_type(data_type))
    adjust_output_array(&data_type, array_ref)
Contributor:

If we need adjust_output_array, then I think we don't need with_data_type to set the timezone information, since that is handled by adjust_output_array as well.
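
For context, a minimal sketch of what with_data_type does for a timestamp (assuming the arrow-rs API; the values here are made up): it swaps the logical type, attaching the timezone, without touching the underlying buffer.

    use arrow::array::TimestampMillisecondArray;
    use arrow::datatypes::{DataType, TimeUnit};

    fn main() {
        // Raw epoch millis; the array starts with no timezone
        let arr = TimestampMillisecondArray::from(vec![1_000_i64]);
        // Attach a timezone; the stored i64 values are unchanged
        let arr = arr.with_data_type(DataType::Timestamp(
            TimeUnit::Millisecond,
            Some("+00:00".into()),
        ));
        assert_eq!(
            arr.data_type(),
            &DataType::Timestamp(TimeUnit::Millisecond, Some("+00:00".into()))
        );
    }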

Contributor Author:

The precision and scale aren't kept in the generic when constructing the buffer, so I think I need to keep with_data_type.
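
A minimal sketch of the problem being described (assuming arrow-rs defaults; rebuild is a hypothetical helper): an array rebuilt from raw values carries Decimal128Type's default Decimal128(38, 10), so the declared precision and scale have to be re-attached.

    use std::sync::Arc;

    use arrow::array::{ArrayRef, Decimal128Array};
    use arrow::datatypes::DataType;

    // Hypothetical helper: rebuild a decimal array and restore its type
    fn rebuild(values: Vec<i128>, original_type: DataType) -> ArrayRef {
        let arr = Decimal128Array::from(values);
        // Fresh arrays default to the type's max precision and default scale
        assert_eq!(arr.data_type(), &DataType::Decimal128(38, 10));
        // with_data_type re-attaches the declared precision/scale cheaply
        Arc::new(arr.with_data_type(original_type))
    }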

Contributor:

It is kept in data_type. When you run adjust_output_array, the precision of Decimal should be set with with_precision_and_scale.

pub fn adjust_output_array(data_type: &DataType, array: ArrayRef) -> Result<ArrayRef> {
    let array = match data_type {
        DataType::Decimal128(p, s) => Arc::new(
            array
                .as_primitive::<Decimal128Type>()
                .clone()
                .with_precision_and_scale(*p, *s)?,
        ) as ArrayRef,
        // ... remaining match arms (Decimal256, timestamps, ...) elided
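
For reference, a hedged usage sketch of with_precision_and_scale (assuming the arrow-rs API): it returns a Result and errors if the requested precision or scale is out of bounds for Decimal128.

    use arrow::array::Decimal128Array;
    use arrow::datatypes::DataType;
    use arrow::error::ArrowError;

    fn demo() -> Result<(), ArrowError> {
        // 12345 at scale 2 represents 123.45
        let arr = Decimal128Array::from(vec![12345_i128])
            .with_precision_and_scale(10, 2)?;
        assert_eq!(arr.data_type(), &DataType::Decimal128(10, 2));
        Ok(())
    }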

Contributor Author:

Yeah, that is what I thought as well, but in ea6f77a it fails with precision/scale errors because the original array doesn't have .with_data_type applied to it.

Contributor:

The error you mentioned can be fixed by adding support for Decimal256

Contributor:

I tried backing out the use of adjust_data_type and I didn't see any failures locally 🤔

@alamb (Contributor) left a comment:

Thank you for working on this @jonathanc-n -- your code seems good, but something is going wonky with the data types:

  1. The fuzz test is failing intermittently
  2. The extra adjustment of the output type seems unnecessary 🤔

    @@ -87,7 +87,12 @@ impl DatasetGeneratorConfig {
        .iter()
        .filter_map(|d| {
            if d.column_type.is_numeric()
                && !matches!(d.column_type, DataType::Float32 | DataType::Float64)
                && !matches!(
Contributor:

Something is wrong here. This change effectively turns off fuzz testing for sum with decimal.

When I reverted this change, the fuzz tests fail occasionally like this:

    test fuzz_cases::aggregate_fuzz::test_sum ... FAILED
    ...
    Arrow error: Invalid argument error: column types must match schema types, expected Decimal128(21, -112) but found Decimal128(38, 10) at column index 1
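
This failure mode can be reproduced directly (a hedged sketch against the arrow-rs API; the schema and values are made up): RecordBatch::try_new checks each column's type against the schema, and a decimal array left at its default type no longer matches.

    use std::sync::Arc;

    use arrow::array::{ArrayRef, Decimal128Array};
    use arrow::datatypes::{DataType, Field, Schema};
    use arrow::record_batch::RecordBatch;

    fn main() {
        // Schema declares one decimal type...
        let schema = Arc::new(Schema::new(vec![Field::new(
            "c1",
            DataType::Decimal128(21, 2),
            false,
        )]));
        // ...but the column, built without with_precision_and_scale,
        // keeps the default Decimal128(38, 10)
        let col: ArrayRef = Arc::new(Decimal128Array::from(vec![123_i128]));
        let err = RecordBatch::try_new(schema, vec![col]).unwrap_err();
        // "Invalid argument error: column types must match schema types, ..."
        println!("{err}");
    }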

Contributor:

From my perspective, other than this change, this PR is ready to go.

Thank you @jonathanc-n

Contributor Author:

@alamb thanks for the tip, I reverted that change! For the test_sum fuzz test, I removed it the same way I did for the Float types, due to casting to a DateType. This was the error I got after executing with a backtrace (many more of the same error):

    ERROR: Cast error: Failed to convert 39087111289254881.41 to datetime for Timestamp(Millisecond, None)

I'm getting this; should I still remove it?

@jayzhan211 (Contributor), Dec 2, 2024:

The real issue is that we have decimal(38, 10), which is the fixed precision for sum, and it mismatches with the fuzz test, which uses random precision:

    fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
        match &arg_types[0] {
            DataType::Int64 => Ok(DataType::Int64),
            DataType::UInt64 => Ok(DataType::UInt64),
            DataType::Float64 => Ok(DataType::Float64),
            DataType::Decimal128(precision, scale) => {
                // in the spark, the result type is DECIMAL(min(38,precision+10), s)
                // ref: https://github.com/apache/spark/blob/fcf636d9eb8d645c24be3db2d599aba2d7e2955a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Sum.scala#L66
                let new_precision = DECIMAL128_MAX_PRECISION.min(*precision + 10);
                Ok(DataType::Decimal128(new_precision, *scale))
            }
            DataType::Decimal256(precision, scale) => {
                // in the spark, the result type is DECIMAL(min(38,precision+10), s)
                // ref: https://github.com/apache/spark/blob/fcf636d9eb8d645c24be3db2d599aba2d7e2955a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Sum.scala#L66
                let new_precision = DECIMAL256_MAX_PRECISION.min(*precision + 10);
                Ok(DataType::Decimal256(new_precision, *scale))
            }
            other => {
                exec_err!("[return_type] SUM not supported for {}", other)
            }
        }
    }
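
To make the widening rule concrete, here is a hedged sketch (sum_return_type is a hypothetical stand-in for the Decimal128 match arm above, not a DataFusion API):

    use arrow::datatypes::{DataType, DECIMAL128_MAX_PRECISION};

    // Mirror of the Decimal128 arm: widen precision by 10, capped at 38
    fn sum_return_type(precision: u8, scale: i8) -> DataType {
        let new_precision = DECIMAL128_MAX_PRECISION.min(precision.saturating_add(10));
        DataType::Decimal128(new_precision, scale)
    }

    fn main() {
        assert_eq!(sum_return_type(11, 2), DataType::Decimal128(21, 2)); // 11 + 10
        assert_eq!(sum_return_type(30, 3), DataType::Decimal128(38, 3)); // capped
    }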

Contributor:

Decimal128(38, 3) for normal precision, Decimal128(30, 3) for grouping. Not sure why there is a mismatch in the fuzz test. We should either align the precision for both cases or fix the fuzz schema check, if they don't need to have the same precision, as in this slt:

    query TT
    select arrow_typeof(sum(column1)), arrow_typeof(sum(distinct column1)) from t group by column2;
    ----
    Decimal128(38, 3) Decimal128(30, 3)
    Decimal128(38, 3) Decimal128(30, 3)

    query TT
    explain select sum(column1), sum(distinct column1) from t group by column2;
    ----
    logical_plan
    01)Projection: sum(alias2) AS sum(t.column1), sum(alias1) AS sum(DISTINCT t.column1)
    02)--Aggregate: groupBy=[[t.column2]], aggr=[[sum(alias2), sum(alias1)]]
    03)----Aggregate: groupBy=[[t.column2, t.column1 AS alias1]], aggr=[[sum(t.column1) AS alias2]]
    04)------TableScan: t projection=[column1, column2]
    physical_plan
    01)ProjectionExec: expr=[sum(alias2)@1 as sum(t.column1), sum(alias1)@2 as sum(DISTINCT t.column1)]
    02)--AggregateExec: mode=FinalPartitioned, gby=[column2@0 as column2], aggr=[sum(alias2), sum(alias1)]
    03)----CoalesceBatchesExec: target_batch_size=8192
    04)------RepartitionExec: partitioning=Hash([column2@0], 4), input_partitions=4
    05)--------AggregateExec: mode=Partial, gby=[column2@0 as column2], aggr=[sum(alias2), sum(alias1)]
    06)----------AggregateExec: mode=FinalPartitioned, gby=[column2@0 as column2, alias1@1 as alias1], aggr=[alias2]
    07)------------CoalesceBatchesExec: target_batch_size=8192
    08)--------------RepartitionExec: partitioning=Hash([column2@0, alias1@1], 4), input_partitions=4
    09)----------------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
    10)------------------AggregateExec: mode=Partial, gby=[column2@1 as column2, column1@0 as alias1], aggr=[alias2]
    11)--------------------MemoryExec: partitions=1, partition_sizes=[1]

@alamb (Contributor) commented Nov 27, 2024:

Here is a PR to deprecate adjust_output_array: #13585


jayzhan211 previously approved these changes Nov 28, 2024

@jayzhan211 (Contributor):

> For the test_sum fuzz test, I removed it the same way I did for the Float types due to casting to a DateType

I didn't see the removal you mentioned, but the change looks better to me now

jayzhan211 dismissed their stale review December 2, 2024 12:01

Decimal128 support for DatasetGeneratorConfig seems incorrect

@alamb (Contributor) commented Dec 2, 2024:

> @alamb thanks for the tip, I reverted that change! For the test_sum fuzz test, I removed it the same way I did for the Float types due to casting to a DateType. This was the error I got after executing with a backtrace (many more of this same error): ERROR: Cast error: Failed to convert 39087111289254881.41 to datetime for Timestamp(Millisecond, None)

I think I found the issue (thanks @jayzhan211 for the sleuthing):

(also shout out to @Rachelint for writing the first version of these fuzz testers. We just avoided a potential bug with it!)

github-actions bot removed the core (Core DataFusion crate) label Dec 3, 2024
@jonathanc-n (Contributor Author):

@alamb @jayzhan211 Thanks for the reviews and changes!

@jayzhan211 (Contributor) left a comment:

👍🏻

@alamb (Contributor) left a comment:

Thanks @jonathanc-n

jayzhan211 merged commit 143ef97 into apache:main Dec 4, 2024. 25 checks passed.
Labels: physical-expr (Physical Expressions), sqllogictest (SQL Logic Tests (.slt))

Successfully merging this pull request may close these issues: Implement GroupColumn Decimal128Array

3 participants