Always add round robin repartitioning to leaves (data sources), benefitting unbalanced / small datasets #13707

Open
wants to merge 7 commits into main

Conversation

Dandandan
Contributor

@Dandandan Dandandan commented Dec 9, 2024

Which issue does this PR close?

Closes #.

Rationale for this change

We currently have logic that avoids introducing RoundRobin repartitioning whenever the number of child partitions is already greater than the target number of partitions.
This is, however, not optimal when the input data is not perfectly balanced, i.e. when some partitions contain fewer batches than others. Adding an extra round-robin RepartitionExec on top of the data sources helps to get more performance out of these cases at a negligible cost.
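Conceptually, the change relaxes the guard in add_roundrobin_on_top so that leaves (data sources) always receive a round-robin RepartitionExec, even when they already expose enough partitions. A minimal sketch of that decision (names below are illustrative, not the exact code in this PR):

// Illustrative sketch only; the real logic lives in add_roundrobin_on_top
// inside the EnforceDistribution physical optimizer rule.
fn should_add_roundrobin(child_partitions: usize, n_target: usize, is_data_source: bool) -> bool {
    // Previous heuristic: only repartition when the child exposes fewer
    // partitions than the target partition count.
    let too_few_partitions = child_partitions < n_target;
    // This PR (sketched): also repartition data sources, since their
    // partitions may be unbalanced or consist of many small batches.
    too_few_partitions || is_data_source
}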

The performance improvement for some cases can be quite high:

--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃     main ┃ repartition_poc_2 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 125.52ms │          105.82ms │ +1.19x faster │
│ QQuery 2     │  57.14ms │           37.76ms │ +1.51x faster │
│ QQuery 3     │  56.00ms │           54.94ms │     no change │
│ QQuery 4     │  41.75ms │           37.57ms │ +1.11x faster │
│ QQuery 5     │  78.27ms │           75.59ms │     no change │
│ QQuery 6     │  21.36ms │           21.04ms │     no change │
│ QQuery 7     │  93.01ms │           91.13ms │     no change │
│ QQuery 8     │  79.40ms │           74.41ms │ +1.07x faster │
│ QQuery 9     │ 115.29ms │          103.16ms │ +1.12x faster │
│ QQuery 10    │ 101.89ms │          102.25ms │     no change │
│ QQuery 11    │  40.69ms │           23.73ms │ +1.71x faster │
│ QQuery 12    │  54.58ms │           48.67ms │ +1.12x faster │
│ QQuery 13    │ 122.04ms │           82.29ms │ +1.48x faster │
│ QQuery 14    │  39.81ms │           39.65ms │     no change │
│ QQuery 15    │  49.66ms │           50.71ms │     no change │
│ QQuery 16    │  36.18ms │           24.67ms │ +1.47x faster │
│ QQuery 17    │ 105.28ms │           97.24ms │ +1.08x faster │
│ QQuery 18    │ 164.32ms │          153.53ms │ +1.07x faster │
│ QQuery 19    │  65.62ms │           58.05ms │ +1.13x faster │
│ QQuery 20    │  60.52ms │           66.36ms │  1.10x slower │
│ QQuery 21    │ 130.67ms │          119.25ms │ +1.10x faster │
│ QQuery 22    │  35.66ms │           28.04ms │ +1.27x faster │
└──────────────┴──────────┴───────────────────┴───────────────┘

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Dec 9, 2024
@github-actions github-actions bot added the physical-expr Physical Expressions label Dec 9, 2024
@github-actions github-actions bot removed the physical-expr Physical Expressions label Dec 9, 2024
@Dandandan Dandandan changed the title Always add round robin repartitioning to leaves Always add round robin repartitioning to leaves (data sources) Dec 9, 2024
@Dandandan Dandandan changed the title Always add round robin repartitioning to leaves (data sources) Always add round robin repartitioning to leaves (data sources), benefitting unbalanced / small datasets Dec 10, 2024
@Dandandan Dandandan added the performance Make DataFusion faster label Dec 10, 2024
@findepi
Member

findepi commented Dec 10, 2024

Would this also result in parallel execution of plans based on VALUES?

@Dandandan
Contributor Author

Would this also result in parallel execution of plans based on VALUES?

WDYM, when VALUES would be very large? I think in most cases VALUES is not very large (often not even one full batch), so I'm not sure it makes much difference; currently round robin distributes data per batch, and the default batch size is 8192 rows.

Maybe it would help for e.g. parallelizing joins or similar queries, but it would be nice to show this with some benchmarks :).

@findepi
Member

findepi commented Dec 10, 2024

@Dandandan that was a question, not a suggestion.
(For context, I recently exploited the lack of parallelism of VALUES to test some UDF behavior around arrays vs scalars.)

VALUES can turn into larger data sets -- for example with sequence + unnest -- but I don't see a need to optimize for this at this point. Just being curious whether this round robin above leaves also gets inserted above VALUES.

@Dandandan
Contributor Author

Dandandan commented Dec 10, 2024

VALUES can turn into larger data sets -- for example with sequence + unnest -- but I don't see a need to optimize for this at this point. Just being curious whether this round robin above leaves also gets inserted.

Clear! I think that might already be the case, as ValuesExec has only a single partition (though I'm not 100% sure).
The optimizer might not add it in some cases though, e.g. if statistics are available or a heuristic decides against it.

@jayzhan211
Contributor

ValuesExec is single-partition currently, but I would like to extend it to multiple partitions in #12905

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) common Related to common crate labels Dec 10, 2024
Contributor

@2010YOUY01 2010YOUY01 left a comment


The performance improvement is impressive 🚀

I feel this change benefits some workloads, but might harm locality in others:
for example, if the input is almost sorted with few inversions and is simply split evenly, locality can be exploited (during aggregation each partition can have lower cardinality), whereas round-robin shuffling would do the opposite.

However, it's fine to proceed since we can add a configuration option for it in the future.

@@ -836,7 +836,10 @@ fn add_roundrobin_on_top(
n_target: usize,
) -> Result<DistributionContext> {
// Adding repartition is helpful:

Suggested change
// Adding repartition is helpful:
// Always perform repartitioning on data sources, as they may be imbalanced or consist of too many small partitions.

A 'why' comment here can help understanding

@Dandandan
Contributor Author

The performance improvement is impressive 🚀

I feel this change benefits some workloads, but might harm locality in others: for example, if the input is almost sorted with few inversions and is simply split evenly, locality can be exploited (during aggregation each partition can have lower cardinality), whereas round-robin shuffling would do the opposite.

However, it's fine to proceed since we can add a configuration option for it in the future.

Yes, I agree it might hurt performance in those cases (lower cardinality / data correlation). Even without this change, repartitioning might hurt performance a bit (memory / CPU locality), but I couldn't find a benchmark showing this yet. Round robin can already be disabled (e.g. before executing a query) if someone would like to tweak it, and we can avoid it when the sorting/partitioning of the source is known.
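For example (just an illustration, not part of this PR), a session can be configured to skip round-robin repartitioning up front; this sketch assumes the existing with_round_robin_repartition and with_target_partitions options on SessionConfig:

use datafusion::prelude::{SessionConfig, SessionContext};

// Minimal sketch: build a context that does not insert RoundRobinBatch
// RepartitionExec nodes, while keeping an explicit parallelism target.
fn context_without_round_robin() -> SessionContext {
    let config = SessionConfig::new()
        .with_target_partitions(8)            // example value
        .with_round_robin_repartition(false); // disable round-robin repartitioning
    SessionContext::new_with_config(config)
}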

I think it might also be interesting to look into improving the round-robin RepartitionExec in the future: as long as all input partitions are still yielding batches, we could keep batches in their current partition, switching to round-robin (or similar) only once one of the partitions is exhausted. That way we benefit both from better parallelism and from lower cardinality in the case of correlated data.
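Purely as an illustration of that idea (all names below are hypothetical, not an existing API), the per-batch routing decision could look roughly like this:

// Hypothetical sketch of the hybrid strategy described above: preserve
// locality while every input partition is still producing batches, and fall
// back to round-robin once any input partition is exhausted.
fn pick_output_partition(
    input_partition: usize,
    next_round_robin: &mut usize,
    n_out: usize,
    any_input_exhausted: bool,
) -> usize {
    if !any_input_exhausted {
        // Keep the batch close to where it was produced.
        input_partition % n_out
    } else {
        // Rebalance the remaining work across all output partitions.
        let out = *next_round_robin;
        *next_round_robin = (out + 1) % n_out;
        out
    }
}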
