Feature Add Monotonic Definition #59

mertak-synnada · 2025-01-16T12:26:44Z

Which issue does this PR close?

Closes #.

Rationale for this change

This PR creates a definition of set-monotonicity for Aggregate expressions. Some aggregation functions create ordered results by definition (such as count, min, max). With this PR, we're adding this information to the output ordering and be able to remove some SortExecs while optimizing

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

mertak-synnada · 2025-01-16T12:30:42Z

datafusion/physical-expr/src/aggregate.rs

+        let idx = schema.index_of(self.name()).unwrap_or(0);
+        let expr = Arc::new(Column::new(self.name(), idx));
+
+        let options = SortOptions::new(!is_ascending, false);


We assumed Nulls Last by definition, I'm not fully sure about this will be as is in the future as well

mertak-synnada · 2025-01-16T12:31:53Z

datafusion/expr/src/udaf.rs

+    /// Returns None if the function is not monotonic.
+    /// If the function is monotonically decreasing returns Some(false) e.g. Min
+    /// If the function is monotonically increasing returns Some(true) e.g. Max
+    fn is_monotonic(&self) -> Option<bool> {


I think we will need to accept a data_type argument here since Sums monotonicity depends on it.

we can add it first so we don't have to break API again in the future

mertak-synnada · 2025-01-16T12:32:38Z

datafusion/physical-expr/src/aggregate.rs

+    pub fn natural_sort_expr(&self, schema: &SchemaRef) -> Option<PhysicalSortExpr> {
+        // If the aggregate expressions are monotonic, the output data is naturally ordered with it.
+        let is_ascending = self.is_monotonic()?;
+        let idx = schema.index_of(self.name()).unwrap_or(0);


I don't like the way we get the expr here but couldn't find a better way

return Result for this function and not return incorrect 0 index seems better

I tried to implement so but got too many errors on tests, the expression name does not always match the actual column name. But there should not be any need for us to search with name, I'll try to get the actual index to here and insert

mertak-synnada · 2025-01-16T12:35:59Z

datafusion/sqllogictest/test_files/aggregate.slt

-11)--------------------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
-12)----------------------CsvExec: file_groups={1 group: [[WORKSPACE_ROOT/testing/data/csv/aggregate_test_100.csv]]}, projection=[c1, c3], has_header=true
+03)----AggregateExec: mode=FinalPartitioned, gby=[c3@0 as c3, min(aggregate_test_100.c1)@1 as min(aggregate_test_100.c1)], aggr=[], ordering_mode=PartiallySorted([1])
+04)------SortExec: expr=[min(aggregate_test_100.c1)@1 ASC NULLS LAST], preserve_partitioning=[true]


This SortExec is being added, because there's another logic on AggregateExec's partition_by values. The parent AggregateExec is adding the min column as an ordering requirement, and since prefer_existing_sort is false by default RepartitionExec does not maintain input order.

# Conflicts: # datafusion/core/src/physical_optimizer/enforce_sorting.rs # datafusion/core/src/physical_optimizer/test_utils.rs

alihan-synnada · 2025-01-22T14:03:53Z

Small nit: is_monotonic returning Option<bool> is a little confusing. The documentation is clear about what the returned value means so it isn't a huge deal. I don't want to waste too much time bikeshedding but here are a few alternatives:

is_monotonically_increasing() sounds more descriptive but it's rather long
monotonicity() could return an enum with 3 variants Monotonicity(NonDecreasing, NonIncreasing, NonMonotonic)

LGTM otherwise 👍

jayzhan-synnada · 2025-01-23T03:32:05Z

datafusion/physical-optimizer/src/test_utils.rs

+        ];
+    Arc::new(
+        AggregateExec::try_new(
+            AggregateMode::Final,


Suggested change

AggregateMode::Final,

AggregateMode::Single,

Final is used in 2-step aggregation, should be Single here

jayzhan-synnada · 2025-01-23T03:34:14Z

datafusion/core/src/physical_optimizer/enforce_sorting.rs

+        let expected_input = [
+            "SortExec: expr=[avg@0 ASC], preserve_partitioning=[false]",
+            "  AggregateExec: mode=Final, gby=[], aggr=[avg]",
+            "    SortExec: expr=[a@0 ASC], preserve_partitioning=[false]",


I guess avg is monotonic for integer but not for float?

Nope, it's not monotonic for unsigned integers, for example:

1st data 100 => Avg 100
2nd data 2 => (102/2) Avg 51 (decreased)
3rd data 3 => (105/3) Avg 35 (decreased)
4th data 1003 => (1108/4) Avg 277 (increased)

To be monotonic, it needs to be always increasing/decreasing

jayzhan-synnada · 2025-01-23T10:41:16Z

datafusion/physical-expr/src/aggregate.rs

+    pub fn natural_sort_expr(&self, schema: &SchemaRef) -> Option<PhysicalSortExpr> {
+        // If the aggregate expressions are monotonic, the output data is naturally ordered with it.
+        let is_ascending = self.is_monotonic()?;
+        let idx = schema.index_of(self.name()).unwrap_or(0);


return Result for this function and not return incorrect 0 index seems better

jayzhan-synnada · 2025-01-23T11:00:31Z

datafusion/physical-optimizer/src/test_utils.rs

+        ];
+    Arc::new(
+        AggregateExec::try_new(
+            AggregateMode::Final,


Suggested change

AggregateMode::Final,

AggregateMode::Single,

jayzhan-synnada · 2025-01-23T11:02:50Z

datafusion/physical-expr/src/window/aggregate.rs

+    pub fn add_equal_orderings(&self, eq_properties: &mut EquivalenceProperties) {
+        let Some(expr) = self
+            .get_aggregate_expr()
+            .natural_sort_expr(eq_properties.schema())


Does anyone know why schema is inside EquivalenceProperties but not PlanProperties 🤔 or even not storing them inside both of them but pass as a parameter instead.

🤷‍♂️ Have no idea, but I'll remove the schema need anyways

# Conflicts: # datafusion/core/src/physical_optimizer/enforce_sorting.rs # datafusion/physical-optimizer/src/test_utils.rs

separate stubs and count_udafs

change monotonicity to return an Enum rather than Option<bool> fix indices re-add monotonicity tests

…gle_group_by (apache#14360) * refactor: switch BooleanBufferBuilder to NullBufferBuilder in single_group_by * Update datafusion/physical-plan/src/aggregates/group_values/single_group_by/primitive.rs --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> Co-authored-by: Andrew Lamb <[email protected]>

…pache#14265) * chore(deps): bump rustyline from 14.0.0 to 15.0.0 in /datafusion-cli Bumps [rustyline](https://github.com/kkawakam/rustyline) from 14.0.0 to 15.0.0. - [Release notes](https://github.com/kkawakam/rustyline/releases) - [Changelog](https://github.com/kkawakam/rustyline/blob/master/History.md) - [Commits](kkawakam/rustyline@v14.0.0...v15.0.0) --- updated-dependencies: - dependency-name: rustyline dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <[email protected]> * fix build --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: jonahgao <[email protected]> Co-authored-by: Andrew Lamb <[email protected]>

apache#14263) Bumps [aws-sdk-sts](https://github.com/awslabs/aws-sdk-rust) from 1.51.0 to 1.57.0. - [Release notes](https://github.com/awslabs/aws-sdk-rust/releases) - [Commits](https://github.com/awslabs/aws-sdk-rust/commits) --- updated-dependencies: - dependency-name: aws-sdk-sts dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Andrew Lamb <[email protected]>

apache#14259) Bumps [aws-sdk-sso](https://github.com/awslabs/aws-sdk-rust) from 1.50.0 to 1.56.0. - [Release notes](https://github.com/awslabs/aws-sdk-rust/releases) - [Commits](https://github.com/awslabs/aws-sdk-rust/commits) --- updated-dependencies: - dependency-name: aws-sdk-sso dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Andrew Lamb <[email protected]>

Bumps [korandoru/hawkeye](https://github.com/korandoru/hawkeye) from 5 to 6. - [Release notes](https://github.com/korandoru/hawkeye/releases) - [Changelog](https://github.com/korandoru/hawkeye/blob/main/CHANGELOG.md) - [Commits](korandoru/hawkeye@v5...v6) --- updated-dependencies: - dependency-name: korandoru/hawkeye dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

Bumps [aws-sdk-ssooidc](https://github.com/awslabs/aws-sdk-rust) from 1.57.1 to 1.58.0. - [Release notes](https://github.com/awslabs/aws-sdk-rust/releases) - [Commits](https://github.com/awslabs/aws-sdk-rust/commits) --- updated-dependencies: - dependency-name: aws-sdk-ssooidc dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

…e#14359) * Minor: include the number of files run in sqllogictest display * fmt * Update datafusion/sqllogictest/bin/sqllogictests.rs

* move information_schema to datafusion-catalog * fix: formatting * fix: doctests import * Remove unecessary datafuson-catalog dependency * remove some more unecessary dependencies * Update datafusion-cli/Carglo.ock * fix: doctest dependency --------- Co-authored-by: Andrew Lamb <[email protected]>

* Fix incorrect searched CASE optimization There is an optimization for searched CASE where values are of boolean type. It was converting the expression like CASE WHEN X THEN A WHEN Y THEN B .. [ ELSE D ] END into (X AND A) OR (Y AND NOT X AND B) [ OR (NOT (X OR Y) AND D) ] This had the following problems - does not work for nullable conditions. If X is nullable, we cannot use NOT (X) to compliment it. We need to use `X IS DISTINCT FROM true` - it does not work correctly when some conditions are nullable and other values are false. E.g. X=NULL, A=true, Y=NULL, B=true, D=false, the CASE should return false, but the boolean expression will simplify to `(NULL AND ..) OR (NULL AND ..) OR (false)` which is NULL, not false - thus we use `X` for truthness check of `X`, we need to test `X IS NOT DISTINCT FROM true` - it did not work correctly when default D is missing, but conditions do not evaluate to NULL. CASE's result should be NULL but was false. This commit fixes that optimization. * Fix complexity comment

…apache#13681) * draft of `MedianGroupAccumulator`. * impl `state`. * impl rest methods of `MedianGroupsAccumulator`. * improve comments. * use `MedianGroupsAccumulator`. * remove unused import. * add `group_median_table` to test group median. * complete group median test cases in aggregate slt. * fix type of state. * Clippy * Fmt * add fuzzy tests for median. * fix decimal. * fix clippy. * improve comments. * add median cases with nulls. * Update datafusion/functions-aggregate/src/median.rs Co-authored-by: Andrew Lamb <[email protected]> * use `OffsetBuffer::new_unchecked` in `convert_to_state`. * add todo. * remove assert and switch to i32 try from. * return error when try from failed. --------- Co-authored-by: Daniël Heres <[email protected]> Co-authored-by: Andrew Lamb <[email protected]>

berkaysynnada · 2025-01-31T13:00:12Z

apache#14271 merged, closing this one.

add monotonic function definitions for aggregate expressions

a2919b6

github-actions bot added core physical-expr sqllogictest logical-expr functions labels Jan 16, 2025

mertak-synnada commented Jan 16, 2025

View reviewed changes

mertak-synnada added 6 commits January 16, 2025 15:42

fix benchmark results

14109e6

set prefer_existing_sort to true in sqllogictests

b3d75ba

set prefer_existing_sort to true in sqllogictests

549502e

fix typo

623e0c5

Merge branch 'refs/heads/apache_main' into feature/monotonic-sets

6a9d24e

# Conflicts: # datafusion/core/src/physical_optimizer/enforce_sorting.rs # datafusion/core/src/physical_optimizer/test_utils.rs

re-add test_utils.rs changes to the new file

53ee3de

github-actions bot added the optimizer label Jan 20, 2025

mertak-synnada requested review from berkaysynnada, alihan-synnada and jayzhan-synnada January 20, 2025 14:36

mertak-synnada added 2 commits January 20, 2025 17:40

clone input with Arc

97d8951

Merge branch 'refs/heads/apache_main' into feature/monotonic-sets

cc33031

jayzhan-synnada reviewed Jan 23, 2025

View reviewed changes

mertak-synnada added 3 commits January 23, 2025 14:13

Merge branch 'refs/heads/apache_main' into feature/monotonic-sets

41d9430

# Conflicts: # datafusion/core/src/physical_optimizer/enforce_sorting.rs # datafusion/physical-optimizer/src/test_utils.rs

inject aggr expr indices

e988dcf

separate stubs and count_udafs

remove redundant file

906245e

github-actions bot removed the core label Jan 23, 2025

add Sum monotonicity

475fe2d

change monotonicity to return an Enum rather than Option<bool> fix indices re-add monotonicity tests

github-actions bot added the core label Jan 24, 2025

mertak-synnada added 2 commits January 24, 2025 12:19

fix sql logic tests

57e000e

fix sql logic tests

ca57f46

mertak-synnada requested a review from jayzhan-synnada January 24, 2025 12:53

Chen-Yuan-Lai and others added 10 commits January 29, 2025 15:48

Issue-14216 - Support arrays_overlap function (apache#14217)

707f673

remove aggregate changes, tests already give expected results

ba7b94f

fix clippy

2152b7f

remove one row sorts

7822613

Improve comments

5e9b2db

ozankabak force-pushed the feature/monotonic-sets branch from d7e3135 to 5e9b2db Compare January 30, 2025 12:54

dependabot bot and others added 9 commits January 30, 2025 09:14

Minor: include the number of files run in sqllogictest display (apach…

0edc3d9

…e#14359) * Minor: include the number of files run in sqllogictest display * fmt * Update datafusion/sqllogictest/bin/sqllogictests.rs

Use a short name for set monotonicity

54d62d6

removed (apache#14370)

29e9a1c

Unpin aws sdk dependencies (apache#14361)

07ee09a

Merge branch 'main' into feature/monotonic-sets

1146811

github-actions bot added documentation Improvements or additions to documentation sql optimizer substrait development-process common proto catalog labels Jan 31, 2025

berkaysynnada closed this Jan 31, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature Add Monotonic Definition #59

Feature Add Monotonic Definition #59

mertak-synnada commented Jan 16, 2025

mertak-synnada Jan 16, 2025

mertak-synnada Jan 16, 2025

jayzhan-synnada Jan 23, 2025 •

edited

Loading

mertak-synnada Jan 16, 2025

jayzhan-synnada Jan 23, 2025

mertak-synnada Jan 23, 2025

mertak-synnada Jan 16, 2025

alihan-synnada commented Jan 22, 2025

jayzhan-synnada Jan 23, 2025

jayzhan-synnada Jan 23, 2025

mertak-synnada Jan 23, 2025

jayzhan-synnada Jan 23, 2025

jayzhan-synnada Jan 23, 2025

jayzhan-synnada Jan 23, 2025

mertak-synnada Jan 23, 2025

berkaysynnada commented Jan 31, 2025

Feature Add Monotonic Definition #59

Feature Add Monotonic Definition #59

Conversation

mertak-synnada commented Jan 16, 2025

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jayzhan-synnada Jan 23, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alihan-synnada commented Jan 22, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

berkaysynnada commented Jan 31, 2025

jayzhan-synnada Jan 23, 2025 •

edited

Loading