feat: Optimize `SortPreservingMergeExec` to avoid merging non-overlapping partitions #13296

suremarc · 2024-11-07T17:37:13Z

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

This PR uses the existing MinMaxStatistics to reorder the input streams of SortPreservingMergeExec into chains of non-overlapping streams, based on the statistics of its inputs. This requires some changes:

- A tentative new API, ExecutionPlan::statistics_by_partition
  - Left basically unimplemented for now, pending feedback
- Move the MinMaxStatistics type to datafusion-physical-plan
- Introduce a way to represent constrained statistics / bounds on values in Statistics #8078
  - Currently the MinMaxStatistics code assumes that the statistics give precise bounds and not (potentially overzealous) estimates.

Are these changes tested?

Yes, there is a new sqllogictest, optimize_sort_preserving_merge.slt

Are there any user-facing changes?

The MinMaxStatistics API is made public, as otherwise we can't use it in the core crate where it previously was being used.

There is a new ExecutionPlan::statistics_by_partition method. It has a default implementation, so adding it is not a breaking change.

There are also the new Statistics::merge and ColumnStatistics::merge functions.
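The thread does not show how these merge functions behave, so the following is only a sketch of what merging per-partition column statistics could look like. The `ColStats` type and its fields are hypothetical stand-ins, not the DataFusion `ColumnStatistics` type. The assumption is that a merged bound is known only when both inputs know it.

```rust
/// Hypothetical, simplified column statistics; `None` means "unknown".
/// This is NOT the DataFusion `ColumnStatistics` type.
#[derive(Debug, Clone, PartialEq)]
pub struct ColStats {
    pub min: Option<i64>,
    pub max: Option<i64>,
    pub null_count: Option<usize>,
}

impl ColStats {
    /// Combines the statistics of two partitions into statistics that
    /// describe their union.
    pub fn merge(&self, other: &ColStats) -> ColStats {
        ColStats {
            // A bound is known only if it is known on both sides.
            min: self.min.zip(other.min).map(|(a, b)| a.min(b)),
            max: self.max.zip(other.max).map(|(a, b)| a.max(b)),
            // Null counts add up, again only when both are known.
            null_count: self.null_count.zip(other.null_count).map(|(a, b)| a + b),
        }
    }
}
```

Under these assumptions, merging a partition covering 1..=5 with one covering 3..=9 yields bounds 1..=9, and an unknown bound on either side makes the merged bound unknown.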

suremarc · 2024-11-07T17:37:40Z

datafusion/core/src/datasource/physical_plan/file_scan_config.rs

-        // First Fit:
-        // * Choose the first file group that a file can be placed into.
-        // * If it fits into no existing file groups, create a new one.
-        //
-        // By sorting files by min values and then applying first-fit bin packing,
-        // we can produce the smallest number of file groups such that
-        // files within a group are in order and non-overlapping.
-        //
-        // Source: Applied Combinatorics (Keller and Trotter), Chapter 6.8
-        // https://www.appliedcombinatorics.org/book/s_posets_dilworth-intord.html


I moved this and the relevant code into a new method, MinMaxStatistics::first_fit
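For readers unfamiliar with the approach described in the removed comment, here is a minimal sketch of first-fit packing over min/max ranges. It is not the code from this PR: the `Range` type and the free function `first_fit` are hypothetical, and a single `i64` sort key stands in for the multi-column statistics that MinMaxStatistics handles.

```rust
/// Hypothetical stand-in for per-file min/max statistics on the sort key.
#[derive(Debug, Clone, Copy)]
pub struct Range {
    pub min: i64,
    pub max: i64,
}

/// Groups range indices into chains so that, within a chain, each range
/// starts strictly after the previous one ends.
pub fn first_fit(ranges: &[Range]) -> Vec<Vec<usize>> {
    // Visit ranges in ascending order of their minimum value.
    let mut order: Vec<usize> = (0..ranges.len()).collect();
    order.sort_by_key(|&i| ranges[i].min);

    let mut chains: Vec<Vec<usize>> = Vec::new();
    for i in order {
        // First fit: pick the first chain whose last range ends before
        // this range begins.
        let slot = chains.iter_mut().find(|chain| {
            let last = *chain.last().expect("chains are never empty");
            ranges[i].min > ranges[last].max
        });
        match slot {
            Some(chain) => chain.push(i),
            // No existing chain fits, so start a new one.
            None => chains.push(vec![i]),
        }
    }
    chains
}
```

With ranges 0..=10, 5..=15, and 11..=20, the sketch places the first and third in one chain and the second in its own chain, because the second overlaps the first.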

suremarc · 2024-11-07T17:38:30Z

datafusion/physical-plan/src/execution_plan.rs

+    fn statistics_by_partition(&self) -> Result<Vec<Statistics>> {
+        Ok(vec![
+            self.statistics()?;
+            self.properties().partitioning.partition_count()
+        ])
+    }
+


As stated in the PR description, this is what a proposed API would look like for statistics by partition, though it is certainly not final.

suremarc · 2024-11-07T18:19:14Z

datafusion/physical-plan/src/statistics.rs

+        // Helper function to get min/max statistics
+        let get_min_max = |i: usize| -> Result<(Vec<ScalarValue>, Vec<ScalarValue>)> {
+            Ok(projected_statistics
+                .iter()


I moved this code later so it uses the projected statistics, i.e. it only relies on stats for sorting columns. I was seeing the code error because some columns had unknown statistics. Hopefully this will reduce such cases.

alamb

Thank you @suremarc -- this looks like a great start to me

alamb · 2024-11-08T20:12:07Z

datafusion/physical-plan/src/statistics.rs

+    /// into chains, such that elements in a chain are non-overlapping and ordered
+    /// amongst one another.
+    /// This bin-packing is optimal in the sense that it has the fewest number of chains.
+    pub fn first_fit(&self) -> Vec<Vec<usize>> {


How do we know there are no overlapping ranges here? It seems like we would also have to check whether the ranges overlap, and if any do, we can't do this packing.

This may be checked elsewhere but I didn't see it in a cursory glance

suremarc · 2024-11-08

If no ranges overlapped, they could all be ordered into a single chain. If some ranges do overlap, any ranges that overlap get placed into separate chains. The check for non-overlapping-ness happens in this logic:

datafusion/datafusion/physical-plan/src/statistics.rs

Lines 286 to 294 in 31d3716

let chain_to_insert = chains.iter_mut().find(|chain| {
    // If our element is non-overlapping and comes _after_ the last element of the chain,
    // it can be added to this chain.
    min > self.max(
        *chain
            .last()
            .expect("groups should be nonempty at construction"),
    )
});

alamb · 2024-11-08

Ah, I see 👍

suremarc · 2024-11-11T22:15:13Z

datafusion/sqllogictest/test_files/optimize_sort_preserving_merge.slt

+query TT
+EXPLAIN 
+select a from t WHERE partition = 1
+UNION all
+select a from t WHERE partition = 2
+ORDER BY a;
+----
+logical_plan
+01)Sort: t.a ASC NULLS LAST
+02)--Union
+03)----TableScan: t projection=[a], full_filters=[t.partition = Int32(1)]
+04)----TableScan: t projection=[a], full_filters=[t.partition = Int32(2)]
+physical_plan
+01)SortPreservingMergeExec: [a@0 ASC NULLS LAST], partition_groups=[[2,0],[1]]
+02)--UnionExec
+03)----ParquetExec: file_groups={2 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/optimize_sort_preserving_merge/parquet_table/partition=1/1_1.parquet], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/optimize_sort_preserving_merge/parquet_table/partition=1/1_2.parquet]]}, projection=[a], output_ordering=[a@0 ASC NULLS LAST]
+04)----ParquetExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/optimize_sort_preserving_merge/parquet_table/partition=2/2_2.parquet]]}, projection=[a], output_ordering=[a@0 ASC NULLS LAST]


Hey @alamb I was able to implement the statistics_by_partition for ParquetExec and UnionExec and I wrote a little test. It seems to work 🎉

In this case, files 0 (partition=1/1_1.parquet) and 2 (partition=2/2_2.parquet) are non-overlapping, but file 1 (partition=1/1_2.parquet) overlaps file 0, so it gets placed into another chain (group).

I noticed int32 columns didn't seem to have working parquet statistics, so I used a string column. Seems like we will need to plug a lot of holes to make this feature complete.

2010YOUY01 · 2024-11-22T13:04:01Z

The implementation is really nice.
I'm wondering whether it would be convenient to move the stream concat logic into StreamingMergeBuilder, like

let result = StreamingMergeBuilder::new()
    .with_streams(inputs)
    .with_statistics_by_stream(stats)
    .build(); // Concat non-overlapping input streams here

SortExec is currently implemented in two steps: (1) sort several small runs, then (2) create an internal SortPreservingMergeStream to merge all the small runs. With the logic in the builder, sort queries could also benefit from this work with some additional effort.
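To make the "concat non-overlapping input streams" idea concrete, here is a small sketch over in-memory vectors rather than real record batch streams. The function `merge_chains` is hypothetical and unrelated to the actual StreamingMergeBuilder API. It assumes each input is already sorted and that the inputs within a chain are listed in order and do not overlap.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merges sorted inputs after concatenating the inputs of each chain.
/// `chains` lists input indices in the order they should be concatenated.
pub fn merge_chains(inputs: &[Vec<i64>], chains: &[Vec<usize>]) -> Vec<i64> {
    // Concatenate each chain into one sorted stream. No comparisons are
    // needed here because inputs within a chain do not overlap.
    let mut streams: Vec<_> = chains
        .iter()
        .map(|chain| chain.iter().flat_map(move |&i| inputs[i].iter().copied()))
        .collect();

    // Seed a min-heap with the first value of every chain.
    let mut heap = BinaryHeap::new();
    for (c, stream) in streams.iter_mut().enumerate() {
        if let Some(v) = stream.next() {
            heap.push(Reverse((v, c)));
        }
    }

    // Standard k-way merge, where k is the number of chains rather than
    // the number of inputs.
    let mut out = Vec::new();
    while let Some(Reverse((v, c))) = heap.pop() {
        out.push(v);
        if let Some(next) = streams[c].next() {
            heap.push(Reverse((next, c)));
        }
    }
    out
}
```

The point of the design is that the merge only compares across chains, so three inputs packed into two chains cost a two-way merge instead of a three-way merge.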

alamb · 2024-11-23T12:21:34Z

FYI an update here is that I don't think I am going to be able to work on Statistics for the next month or two. Though I think @mhilton from InfluxData was thinking of potentially helping (🎣 ).

suremarc · 2024-11-25T21:26:57Z

> FYI an update here is that I don't think I am going to be able to work on Statistics for the next month or two. Though I think @mhilton from InfluxData was thinking of potentially helping (🎣 ).

Ok. My team is pretty eager to get this optimization in before February-ish, so I think we may be able to spare a helper or two for the statistics-related changes. But obviously that would require someone available to review, also I think we would need a resolution in #13293 before proceeding.

suremarc · 2024-11-25T21:28:15Z

> The implementation is really nice. I'm wondering whether it would be convenient to move the stream concat logic into StreamingMergeBuilder, like
>
>     let result = StreamingMergeBuilder::new()
>         .with_streams(inputs)
>         .with_statistics_by_stream(stats)
>         .build(); // Concat non-overlapping input streams here
>
> SortExec is currently implemented in two steps: (1) sort several small runs, then (2) create an internal SortPreservingMergeStream to merge all the small runs. With the logic in the builder, sort queries could also benefit from this work with some additional effort.

This didn't occur to me but I think it would be a great change. On the other hand I'm considering if it would make sense in a follow-on PR. But in any case there's a lot of statistics-related work that will need to be done before this PR is mergeable, unfortunately.

alamb · 2024-11-25T21:50:13Z

> > FYI an update here is that I don't think I am going to be able to work on Statistics for the next month or two. Though I think @mhilton from InfluxData was thinking of potentially helping (🎣 ).
>
> Ok. My team is pretty eager to get this optimization in before February-ish, so I think we may be able to spare a helper or two for the statistics-related changes. But obviously that would require someone available to review, also I think we would need a resolution in #13293 before proceeding.

I will find time to review / help it along. Also, given you are willing to help I will help drive a resolution on #13293

berkaysynnada · 2024-11-26T08:17:22Z

> I will find time to review / help it along. Also, given you are willing to help I will help drive a resolution on #13293

We are also willing to contribute to the final design of #13293, and I can help with the review of this PR.

initial attempt at implementation

c138b24

github-actions bot added physical-expr Physical Expressions core Core DataFusion crate labels Nov 7, 2024

suremarc added 2 commits November 7, 2024 17:58

fall back to full merge on errors

b469b71

make MinMaxStatistics only care about sorting column statistics

31d3716

suremarc mentioned this pull request Nov 7, 2024

Optimized version of SortPreservingMerge that doesn't actually compare sort keys of the key ranges are ordered #10316

Open

alamb changed the title ~~initial attempt at implementation~~ initial attempt at non-overlapping range implementation Nov 8, 2024

suremarc changed the title ~~initial attempt at non-overlapping range implementation~~ feat: Optimize SortPreservingMergeExec to avoid merging non-overlapping partitions Nov 8, 2024

alamb mentioned this pull request Nov 8, 2024

Introduce a way to represent constrained statistics / bounds on values in Statistics #8078

Open

alamb reviewed Nov 8, 2024

fixes + statistics merging

2f0e74f

github-actions bot added sqllogictest SQL Logic Tests (.slt) common Related to common crate labels Nov 11, 2024

suremarc added 3 commits November 11, 2024 21:59

change language

2678db3

todo comment

6102047

fix display impl

0662eff

suremarc requested a review from alamb November 11, 2024 22:15

rename output_partitions to streams_to_merge because they're not output partitions

1c93a0d

alamb mentioned this pull request Nov 13, 2024

Review Backlog and Plan - Andrew Lamb - Nov 2024 #13386

Closed

suremarc mentioned this pull request Dec 11, 2024

RFC: Add Precision:AtLeast and Precision::AtMost for more Statistics… precision #13293

Draft

4 tasks
