Always add round robin repartitioning to leaves (data sources), benefitting unbalanced / small datasets #13707

Open
wants to merge 7 commits into main

Conversation

Dandandan
Contributor

@Dandandan Dandandan commented Dec 9, 2024

Which issue does this PR close?

Closes #.

Rationale for this change

We currently have logic that avoids introducing RoundRobin repartitioning whenever the number of child partitions is already greater than the target number of partitions.
This is, however, not optimal when the input data is not perfectly balanced, i.e. when some partitions contain fewer batches than others. Adding an extra round-robin RepartitionExec on top of the data sources helps to get more performance out of these cases at a negligible cost.
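Conceptually, the change relaxes the guard in add_roundrobin_on_top so that leaves (data sources) always receive a round-robin RepartitionExec, even when they already expose enough partitions. A minimal sketch of that decision (names below are illustrative, not the exact code in this PR):

// Illustrative sketch only; the real logic lives in add_roundrobin_on_top
// inside the EnforceDistribution physical optimizer rule.
fn should_add_roundrobin(child_partitions: usize, n_target: usize, is_data_source: bool) -> bool {
    // Previous heuristic: only repartition when the child exposes fewer
    // partitions than the target partition count.
    let too_few_partitions = child_partitions < n_target;
    // This PR (sketched): also repartition data sources, since their
    // partitions may be unbalanced or consist of many small batches.
    too_few_partitions || is_data_source
}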

The performance improvement for some cases can be quite high:

--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃     main ┃ repartition_poc_2 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 125.52ms │          105.82ms │ +1.19x faster │
│ QQuery 2     │  57.14ms │           37.76ms │ +1.51x faster │
│ QQuery 3     │  56.00ms │           54.94ms │     no change │
│ QQuery 4     │  41.75ms │           37.57ms │ +1.11x faster │
│ QQuery 5     │  78.27ms │           75.59ms │     no change │
│ QQuery 6     │  21.36ms │           21.04ms │     no change │
│ QQuery 7     │  93.01ms │           91.13ms │     no change │
│ QQuery 8     │  79.40ms │           74.41ms │ +1.07x faster │
│ QQuery 9     │ 115.29ms │          103.16ms │ +1.12x faster │
│ QQuery 10    │ 101.89ms │          102.25ms │     no change │
│ QQuery 11    │  40.69ms │           23.73ms │ +1.71x faster │
│ QQuery 12    │  54.58ms │           48.67ms │ +1.12x faster │
│ QQuery 13    │ 122.04ms │           82.29ms │ +1.48x faster │
│ QQuery 14    │  39.81ms │           39.65ms │     no change │
│ QQuery 15    │  49.66ms │           50.71ms │     no change │
│ QQuery 16    │  36.18ms │           24.67ms │ +1.47x faster │
│ QQuery 17    │ 105.28ms │           97.24ms │ +1.08x faster │
│ QQuery 18    │ 164.32ms │          153.53ms │ +1.07x faster │
│ QQuery 19    │  65.62ms │           58.05ms │ +1.13x faster │
│ QQuery 20    │  60.52ms │           66.36ms │  1.10x slower │
│ QQuery 21    │ 130.67ms │          119.25ms │ +1.10x faster │
│ QQuery 22    │  35.66ms │           28.04ms │ +1.27x faster │
└──────────────┴──────────┴───────────────────┴───────────────┘

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Dec 9, 2024
@github-actions github-actions bot added the physical-expr Physical Expressions label Dec 9, 2024
@github-actions github-actions bot removed the physical-expr Physical Expressions label Dec 9, 2024
@Dandandan Dandandan changed the title Always add round robin repartitioning to leaves Always add round robin repartitioning to leaves (data sources) Dec 9, 2024
@Dandandan Dandandan changed the title Always add round robin repartitioning to leaves (data sources) Always add round robin repartitioning to leaves (data sources), benefitting unbalanced / small datasets Dec 10, 2024
@Dandandan Dandandan added the performance Make DataFusion faster label Dec 10, 2024
@findepi
Member

findepi commented Dec 10, 2024

Would this also result in parallel execution of plans based on VALUES?

@Dandandan
Contributor Author

Would this also result in parallel execution of plans based on VALUES?

WDYM, when VALUES would be very large? I think in most cases VALUES is not very large (often not even one full batch), so I'm not sure it makes much difference; currently round robin distributes data per batch, and the default batch size is 8192 rows.

Maybe it would help for e.g. parallelizing joins or similar queries, but it would be nice to show this with some benchmarks :).

@findepi
Member

findepi commented Dec 10, 2024

@Dandandan that was a question, not a suggestion.
(For context, I recently exploited the lack of parallelism of VALUES to test some UDF behavior around arrays vs scalars.)

VALUES can turn into larger data sets -- for example with sequence + unnest -- but I don't see a need to optimize for this at this point. Just being curious whether this round robin above leaves also gets inserted above VALUES.

@Dandandan
Contributor Author

Dandandan commented Dec 10, 2024

VALUES can turn into larger data sets -- for example with sequence + unnest -- but I don't see a need to optimize for this at this point. Just being curious whether this round robin above leaves also gets inserted.

Clear! I think that might already be the case, as ValuesExec has only a single partition (though I'm not 100% sure).
The optimizer might not add it in some cases though, e.g. if statistics are available or a heuristic decides against it.

@jayzhan211
Contributor

ValuesExec is single-partition currently, but I would like to extend it to multiple partitions in #12905

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) common Related to common crate labels Dec 10, 2024
Contributor

@2010YOUY01 2010YOUY01 left a comment


The performance improvement is impressive 🚀

I feel this change benefits some workloads, but might harm locality in others:
for example, if the input is almost sorted with few inversions and is simply split evenly, locality can be exploited (during aggregation each partition can have lower cardinality), whereas round-robin shuffling would do the opposite.

However, it's fine to proceed since we can add a configuration option for it in the future.

@@ -836,7 +836,10 @@ fn add_roundrobin_on_top(
n_target: usize,
) -> Result<DistributionContext> {
// Adding repartition is helpful:

Suggested change
// Adding repartition is helpful:
// Always perform repartitioning on data sources, as they may be imbalanced or consist of too many small partitions.

A 'why' comment here can help understanding

@Dandandan
Contributor Author

The performance improvement is impressive 🚀

I feel this change benefits some workloads, but might harm locality in others: for example, if the input is almost sorted with few inversions and is simply split evenly, locality can be exploited (during aggregation each partition can have lower cardinality), whereas round-robin shuffling would do the opposite.

However, it's fine to proceed since we can add a configuration option for it in the future.

Yes, I agree it might hurt performance in those cases (lower cardinality / data correlation). Even without this change, repartitioning might hurt performance a bit (memory / CPU locality), but I couldn't find a benchmark showing this yet. Round robin can already be disabled (e.g. before executing a query) if someone would like to tweak it, and we can avoid it when the sorting/partitioning of the source is known.
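For example (just an illustration, not part of this PR), a session can be configured to skip round-robin repartitioning up front; this sketch assumes the existing with_round_robin_repartition and with_target_partitions options on SessionConfig:

use datafusion::prelude::{SessionConfig, SessionContext};

// Minimal sketch: build a context that does not insert RoundRobinBatch
// RepartitionExec nodes, while keeping an explicit parallelism target.
fn context_without_round_robin() -> SessionContext {
    let config = SessionConfig::new()
        .with_target_partitions(8)            // example value
        .with_round_robin_repartition(false); // disable round-robin repartitioning
    SessionContext::new_with_config(config)
}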

I think it might also be interesting to look into improving the round-robin RepartitionExec in the future: as long as all input partitions are still yielding batches, we could keep batches in their current partition, switching to round-robin (or similar) only once one of the partitions is exhausted. That way we benefit both from better parallelism and from lower cardinality in the case of correlated data.
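Purely as an illustration of that idea (all names below are hypothetical, not an existing API), the per-batch routing decision could look roughly like this:

// Hypothetical sketch of the hybrid strategy described above: preserve
// locality while every input partition is still producing batches, and fall
// back to round-robin once any input partition is exhausted.
fn pick_output_partition(
    input_partition: usize,
    next_round_robin: &mut usize,
    n_out: usize,
    any_input_exhausted: bool,
) -> usize {
    if !any_input_exhausted {
        // Keep the batch close to where it was produced.
        input_partition % n_out
    } else {
        // Rebalance the remaining work across all output partitions.
        let out = *next_round_robin;
        *next_round_robin = (out + 1) % n_out;
        out
    }
}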
