fix: Distribution error failing in the SanityCheck, for a specific influxql plan. #58

Draft · wiedld wants to merge 4 commits into patched-df-dec25-ver44.0.0 from dlw/debug-distribution-error
Conversation

@wiedld (Collaborator) commented Feb 10, 2025

Temporary fix for https://github.com/influxdata/influxdb_iox/issues/13310.

This PR branches off our current patched DF branch. It adds a few commits to handle the above issue.

Confirmed that it fixes the bug reproducer in iox.

Changes made

  • Commit 1 = recreates the insertion of the coalesce in the EnforceDistribution optimization pass.

    • Demonstrates that we get a coalesce (not a repartition) inserted into our plan.
    • In order to show this, I had to update the test suite to handle more "real world" parameters, which enables distribution decisions based on statistics.
  • Commit 2 = reproducer of the SanityCheck failure after EnforceSorting removes the added coalesce.

  • Commit 3 = a special-cased fix that checks for an AggregateExec parent. (A more general fix will be made upstream.)

    • This fixes the test case seen in commit 2.
  • Commit 4 = unrelated. It fixes a wasm build CI error by cherry-picking the upstream fix.

@github-actions bot added the core label on Feb 10, 2025
@wiedld force-pushed the dlw/debug-distribution-error branch from af2856a to 27f2a3a on February 12, 2025 03:01
* demonstrate the insertion of coalesce after the use of column estimates, and the removal of the test scenario's forcing of rr repartitioning
…the coalesce added in the EnforceDistribution
@wiedld force-pushed the dlw/debug-distribution-error branch from 27f2a3a to 202860b on February 12, 2025 15:51
@wiedld changed the title from "test: debugging distribution enforcement bug." to "fix: Distribution error failing in the SanityCheck, for a specific influxql plan." on Feb 12, 2025
@@ -516,7 +516,7 @@ fn remove_bottleneck_in_subplan(
 ) -> Result<PlanWithCorrespondingCoalescePartitions> {
     let plan = &requirements.plan;
     let children = &mut requirements.children;
-    if is_coalesce_partitions(&children[0].plan) {
+    if is_coalesce_partitions(&children[0].plan) && !is_aggregation(plan) {
@wiedld (Collaborator, Author) commented Feb 12, 2025

This is the fix, for now.

A proper fix (if we decide to fix it at this conditional) should compare the partitioning needs of the coalesce's parent against those of the coalesce's children (sketched below). My initial attempt to do so caused other DF tests to fail.

I didn't proceed further since I'm unsure of the correct solution at a higher level.
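
For illustration only, here is a rough sketch of that comparison. The helper name and shape are hypothetical (not what this PR implements); it only captures the idea of keeping the coalesce whenever its parent still needs a single partition that the coalesce's input cannot provide on its own:

use std::sync::Arc;
use datafusion::physical_plan::{Distribution, ExecutionPlan, ExecutionPlanProperties};

// Hypothetical helper: returns true when the CoalescePartitionsExec must stay,
// i.e. its parent requires a single partition but the coalesce's input is
// still split across multiple partitions.
fn coalesce_still_needed(
    parent: &Arc<dyn ExecutionPlan>,
    coalesce_input: &Arc<dyn ExecutionPlan>,
) -> bool {
    let input_partitions = coalesce_input.output_partitioning().partition_count();
    parent
        .required_input_distribution()
        .iter()
        .any(|dist| matches!(dist, Distribution::SinglePartition) && input_partitions > 1)
}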

A collaborator replied:

Should this be checking that it is an aggregation and that the aggregation mode is SinglePartitioned? 🤔
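
For illustration, a minimal sketch of that stricter check (the helper name is hypothetical; this is not what the PR currently does):

use std::sync::Arc;
use datafusion::physical_plan::ExecutionPlan;
use datafusion::physical_plan::aggregates::{AggregateExec, AggregateMode};

// Hypothetical helper: true only for an AggregateExec that genuinely needs a
// single input partition, i.e. one running in SinglePartitioned mode.
fn is_single_partitioned_aggregation(plan: &Arc<dyn ExecutionPlan>) -> bool {
    plan.as_any()
        .downcast_ref::<AggregateExec>()
        .map(|agg| matches!(agg.mode(), AggregateMode::SinglePartitioned))
        .unwrap_or(false)
}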

@wiedld (Collaborator, Author) commented Feb 12, 2025

We have a bug where the SanityCheck fails due to mismatched distribution needs between a union and an aggregate. After chasing it down (and recreating it in this PR), it appears that a change made in the earlier EnforceDistribution optimizer run is undone in the later EnforceSorting optimizer run.

Here is the plan before the EnforceDistribution:

OutputRequirementExec
  SortExec: expr=[time@1 ASC NULLS LAST], preserve_partitioning=[false]
    ProjectionExec: expr=[t as iox::measurement, time@0 as time, sum(Value)@1 as Value]
      GapFillExec: group_expr=[time@0], aggr_expr=[sum(Value)@1], stride=IntervalMonthDayNano { months: 0, days: 0, nanoseconds: 10000000000 }, time_range=Included("0")..Included("59999999999")
        AggregateExec: mode=FinalPartitioned, gby=[time@0 as time], aggr=[sum(Value)]
          AggregateExec: mode=Partial, gby=[date_bin_wallclock(IntervalMonthDayNano { months: 0, days: 0, nanoseconds: 10000000000 }, time@0, 0) as time], aggr=[sum(Value)]
            SortExec: expr=[time@0 ASC NULLS LAST], preserve_partitioning=[false]
              ProjectionExec: expr=[time@1 as time, f@0 as Value]
                UnionExec
                  ParquetExec: 
                    file_groups={1 group: [[1/1/b862a7e9b329ee6a418cde191198eaeb1512753f19b87a81def2ae6c3d0ed237/b5e32e0e-ffa8-4831-aef2-4e16e32a1264.parquet]]}, projection=[f, time], predicate=time@2 >= 0 AND time@2 <= 59999999999 AND f@1 IS NOT NULL, pruning_predicate=time_null_count@1 != time_row_count@2 AND time_max@0 >= 0 AND time_null_count@1 != time_row_count@2 AND time_min@3 <= 59999999999 AND f_null_count@5 != f_row_count@4, required_guarantees=[]
                  FilterExec: f@0 IS NOT NULL
                    ProjectionExec: expr=[f@0 as f, time@1 as time]
                      DeduplicateExec: [x@2 ASC,time@1 ASC]
                        FilterExec: time@1 >= 0 AND time@1 <= 59999999999
                          RecordBatchesExec: chunks=1, projection=[f, time, x, __chunk_order]

Here is the plan after the EnforceDistribution. Note the insertion of the coalesce between the union and aggregate.

OutputRequirementExec
  SortExec: expr=[time@1 ASC NULLS LAST], preserve_partitioning=[false]
    ProjectionExec: expr=[t as iox::measurement, time@0 as time, sum(Value)@1 as Value]
      GapFillExec: group_expr=[time@0], aggr_expr=[sum(Value)@1], stride=IntervalMonthDayNano { months: 0, days: 0, nanoseconds: 10000000000 }, time_range=Included("0")..Included("59999999999")
        AggregateExec: mode=FinalPartitioned, gby=[time@0 as time], aggr=[sum(Value)]
          AggregateExec: mode=Partial, gby=[date_bin_wallclock(IntervalMonthDayNano { months: 0, days: 0, nanoseconds: 10000000000 }, time@0, 0) as time], aggr=[sum(Value)]
            SortExec: expr=[time@0 ASC NULLS LAST], preserve_partitioning=[false]
              CoalescePartitionsExec
                ProjectionExec: expr=[time@1 as time, f@0 as Value]
                  UnionExec
                    ParquetExec: 
                        file_groups={1 group: [[1/1/b862a7e9b329ee6a418cde191198eaeb1512753f19b87a81def2ae6c3d0ed237/b5e32e0e-ffa8-4831-aef2-4e16e32a1264.parquet]]}, projection=[f, time], predicate=time@2 >= 0 AND time@2 <= 59999999999 AND f@1 IS NOT NULL, pruning_predicate=time_null_count@1 != time_row_count@2 AND time_max@0 >= 0 AND time_null_count@1 != time_row_count@2 AND time_min@3 <= 59999999999 AND f_null_count@5 != f_row_count@4, required_guarantees=[]
                    FilterExec: f@0 IS NOT NULL
                      ProjectionExec: expr=[f@0 as f, time@1 as time]
                        DeduplicateExec: [x@2 ASC,time@1 ASC]
                          FilterExec: time@1 >= 0 AND time@1 <= 59999999999
                            RecordBatchesExec: chunks=1, projection=[f, time, x, __chunk_order]

The partial and final aggregates are later combined by the CombinePartialFinalAggregate pass. The plan then handed to EnforceSorting is:

OutputRequirementExec
  SortExec: expr=[time@1 ASC NULLS LAST], preserve_partitioning=[false]
    ProjectionExec: expr=[t as iox::measurement, time@0 as time, sum(Value)@1 as Value]
      GapFillExec: group_expr=[time@0], aggr_expr=[sum(Value)@1], stride=IntervalMonthDayNano { months: 0, days: 0, nanoseconds: 10000000000 }, time_range=Included("0")..Included("59999999999")
        AggregateExec: mode=SinglePartitioned, gby=[date_bin_wallclock(IntervalMonthDayNano { months: 0, days: 0, nanoseconds: 10000000000 }, time@0, 0) as time], aggr=[sum(Value)]
          SortExec: expr=[time@0 ASC NULLS LAST], preserve_partitioning=[false]
            CoalescePartitionsExec
              ProjectionExec: expr=[time@1 as time, f@0 as Value]
                UnionExec
                  ParquetExec: file_groups={1 group: [[1/1/b862a7e9b329ee6a418cde191198eaeb1512753f19b87a81def2ae6c3d0ed237/0d80ea75-da4d-4d01-ba58-5169be3df839.parquet]]}, projection=[f, time], predicate=time@2 >= 0 AND time@2 <= 59999999999 AND f@1 IS NOT NULL, pruning_predicate=time_null_count@1 != time_row_count@2 AND time_max@0 >= 0 AND time_null_count@1 != time_row_count@2 AND time_min@3 <= 59999999999 AND f_null_count@5 != f_row_count@4, required_guarantees=[]
                  FilterExec: f@0 IS NOT NULL
                    ProjectionExec: expr=[f@0 as f, time@1 as time]
                      DeduplicateExec: [x@2 ASC,time@1 ASC]
                        FilterExec: time@1 >= 0 AND time@1 <= 59999999999
                          RecordBatchesExec: chunks=1, projection=[f, time, x, __chunk_order]

EnforceSorting removes the needed coalesce and replaces it with an SPM (SortPreservingMergeExec). But the SPM is inserted further up the plan (above the aggregate node):

OutputRequirementExec
  SortExec: expr=[time@1 ASC NULLS LAST], preserve_partitioning=[false]
    ProjectionExec: expr=[t as iox::measurement, time@0 as time, sum(Value)@1 as Value]
      GapFillExec: group_expr=[time@0], aggr_expr=[sum(Value)@1], stride=IntervalMonthDayNano { months: 0, days: 0, nanoseconds: 10000000000 }, time_range=Included("0")..Included("59999999999")
        SortPreservingMergeExec: [time@0 ASC]
          SortExec: expr=[time@0 ASC], preserve_partitioning=[true]
            AggregateExec: mode=SinglePartitioned, gby=[date_bin_wallclock(IntervalMonthDayNano { months: 0, days: 0, nanoseconds: 10000000000 }, time@0, 0) as time], aggr=[sum(Value)]
              ProjectionExec: expr=[time@1 as time, f@0 as Value]
                UnionExec
                  ParquetExec: file_groups={1 group: [[1/1/b862a7e9b329ee6a418cde191198eaeb1512753f19b87a81def2ae6c3d0ed237/0d80ea75-da4d-4d01-ba58-5169be3df839.parquet]]}, projection=[f, time], predicate=time@2 >= 0 AND time@2 <= 59999999999 AND f@1 IS NOT NULL, pruning_predicate=time_null_count@1 != time_row_count@2 AND time_max@0 >= 0 AND time_null_count@1 != time_row_count@2 AND time_min@3 <= 59999999999 AND f_null_count@5 != f_row_count@4, required_guarantees=[]
                  FilterExec: f@0 IS NOT NULL
                    ProjectionExec: expr=[f@0 as f, time@1 as time]
                      DeduplicateExec: [x@2 ASC,time@1 ASC]
                        SortExec: expr=[x@2 ASC, time@1 ASC, __chunk_order@3 ASC], preserve_partitioning=[false]
                          FilterExec: time@1 >= 0 AND time@1 <= 59999999999
                            RecordBatchesExec: chunks=1, projection=[f, time, x, __chunk_order]

This plan then fails the SanityCheck pass.

Possible solutions

This PR demonstrates how the error occurs (based upon the changes in the two optimizer runs) and adds a temporary fix to unblock iox. As for the proper fix, I'm unclear on what to do. Ideas include, but are not limited to:

  • have the EnforceSorting selectively not remove the coalesce
  • let the EnforceSorting remove it, but update the code to add the SPM in the correct place
    • right now the decision to remove the coalesce and the decision to add the SPM happen in different transformations of the plan. We could either change the rules for SPM insertion, or combine the SPM insertion with the coalesce removal.

There could also be other solutions. I'll make an upstream PR and ping for advice.
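
Whatever the final fix is, a regression test could take the same shape as the SanityChecker usage shown further down: run EnforceDistribution and EnforceSorting over the reproducer plan (the real pipeline also runs CombinePartialFinalAggregate in between, as described above), then check that SanityCheckPlan accepts the result. A minimal sketch, assuming the optimizer rules and the reproducer plan are in scope as in the existing test helpers:

// Sketch only, inside a test returning Result<()>: `plan` is the
// union-under-aggregate reproducer plan built by the test helpers in this PR.
let plan = EnforceDistribution::new().optimize(plan, &Default::default())?;
let plan = EnforceSorting::new().optimize(plan, &Default::default())?;
// With a correct fix, the sanity checker accepts the plan instead of returning
// the "does not satisfy distribution requirements" error.
SanityCheckPlan::new().optimize(plan, &Default::default())?;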

Comment on lines -1853 to -1854
-        // Use a small batch size, to trigger RoundRobin in tests
-        config.execution.batch_size = 1;
@wiedld (Collaborator, Author) commented Feb 12, 2025

Note the removal of the forced insertion of round robin repartitioning. This was preventing the coalesce insertion (as seen in our real world reproducer).

Comment on lines +1587 to +1599
pub(crate) fn parquet_exec_with_stats() -> Arc<ParquetExec> {
    let mut statistics = Statistics::new_unknown(&schema());
    statistics.num_rows = Precision::Inexact(10);
    statistics.column_statistics = column_stats();

    let config =
        FileScanConfig::new(ObjectStoreUrl::parse("test:///").unwrap(), schema())
            .with_file(PartitionedFile::new("x".to_string(), 10000))
            .with_statistics(statistics);
    assert_eq!(config.statistics.num_rows, Precision::Inexact(10));

    ParquetExec::builder(config).build_arc()
}
@wiedld (Collaborator, Author) commented Feb 12, 2025

Note that we are now providing statistics in the parquet exec.

Previously, the missing statistics were preventing the coalesce insertion (as seen in our real-world reproducer): in the absence of parquet stats, we always got a round-robin repartition because roundrobin_beneficial_stats always evaluated to true.

.optimize(optimized, &Default::default())
.unwrap_err();
assert!(err.message().contains(" does not satisfy distribution requirements: HashPartitioned[[a@0]]). Child-0 output partitioning: UnknownPartitioning(2)"));
let checker = checker.optimize(optimized, &Default::default());
A collaborator replied:

I really like the idea of running the SanityChecker as part of the tests

@alamb (Collaborator) left a comment:

I have one suggestion about making the check even more specific, but otherwise I think this is great

Thanks @wiedld

Comment on lines +3740 to +3745
/// Same as [`repartitions_for_aggregate_after_sorted_union`], but adds a projection
/// as well between the union and aggregate. This changes the outcome:
///
/// * we no longer get repartitioning, and instead get coalescing.
#[test]
fn coalesces_for_aggregate_after_sorted_union_projection() -> Result<()> {
@wiedld (Collaborator, Author) commented Feb 28, 2025

This is how we get the coalesce inserted in the EnforceDistribution run.
