Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[EPIC] Improved aggregate function performance (faster H20.ai benchmarks) #13548

Open
1 of 3 tasks
alamb opened this issue Nov 24, 2024 · 18 comments
Open
1 of 3 tasks

[EPIC] Improved aggregate function performance (faster H20.ai benchmarks) #13548

alamb opened this issue Nov 24, 2024 · 18 comments
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Nov 24, 2024

Is your feature request related to a problem or challenge?

The basic aggregate functions like COUNT and SUM in DataFusion are very fast (see Apache DataFusion is now the fastest single node engine for querying Apache Parquet files)

However, many of the other aggregate functions are not particularly fast, and this shows up specifically on some of the H20 benchmarks

We saw this in the results in the 2024 DataFusion SIGMOD paper
Screenshot 2024-11-24 at 8 34 35 AM

(BTW we have made median faster)

@MrPowers has also observed similar results on discord (link):

DataFusion was added to the h2o benchmarks (which are now maintained by duckdb) and DataFusion performs quite well for most of the "basic" groupby queries. It performs poorly for some of the advanced questions on the 50GB dataset. Here are the results:
https://duckdblabs.github.io/db-benchmark/

See his version of the benchmarks here
https://github.com/MrPowers/mrpowers-benchmarks

Testing

Functions

Describe the solution you'd like

DataFusion has two APIs ways to implement Aggregate functions like SUM and COUNT

  • Easy (but slow) way: Accumulator (api docs)
  • Fast (but complicated way): GroupsAccumulator (api docs)

The basic aggregates are implemented using GroupsAccumulator and are part of DataFusions performance

This ticket tracks the effort to improve the performance of these for these "more advanced" aggregate functions, likely by implementing GroupsAccumulator

Describe alternatives you've considered

For each function listed above, ideally we would:

  1. Add a new benchmark. Either add a specific one for H20 benchmarks or add a query to the ClickBench extended benchmark Documentation Here in one PR
  2. Implement GroupsAccumulator for the relevant aggregate function in a second PR (along with tests for correctness). We would use the benchmark to verify the performance

Here is a pretty good example of how @eejbyfeldt did this for STDDEV:

Additional context

No response

@alamb alamb added the enhancement New feature or request label Nov 24, 2024
@alamb
Copy link
Contributor Author

alamb commented Nov 24, 2024

For posterity, here is a link to the discord chat: https://discord.com/channels/885562378132000778/1309883046886903870/1309887744595595324

@MrPowers
Copy link

Would like to note that the DataFusion performance really starts to lag when the dataset size grows.

Take a look at this query: select id2, id4, power(corr(v1, v2), 2) as r2 from x group by id2, id4.

When the dataset is 10 million rows, then Polars takes 3 seconds and DataFusion takes 3.6 seconds, so pretty similar.

When the dataset is 100 million rows, then Polars takes 126 seconds and DataFusion takes 2,100 seconds.

@alamb
Copy link
Contributor Author

alamb commented Nov 25, 2024

When the dataset is 100 million rows, then Polars takes 126 seconds and DataFusion takes 2,100 seconds.

What version are you working with?

@Rachelint has some ideas of how to improve this:

@Dandandan
Copy link
Contributor

Hm this seems something quadratic in nature?

When the dataset is 100 million rows, then Polars takes 126 seconds and DataFusion takes 2,100 seconds.

What version are you working with?

@Rachelint has some ideas of how to improve this:

Does it fully explain the dramatic difference? @MrPowers how do you generate the 10M vs 100M rows?

@alamb
Copy link
Contributor Author

alamb commented Nov 25, 2024

I would also expect this to help (but it was merged and depends on when it is merged)

@MrPowers
Copy link

@Dandandan - thanks to the great work by @SemyonSinchenko, it's easy to generate these datasets with falsa.

Here's the command to generate the 10 million row dataset: falsa groupby --path-prefix=~/data --size SMALL --data-format PARQUET. Just use MEDIUM to generate the 100 million row dataset.

@Rachelint
Copy link
Contributor

@Dandandan - thanks to the great work by @SemyonSinchenko, it's easy to generate these datasets with falsa.

Here's the command to generate the 10 million row dataset: falsa groupby --path-prefix=~/data --size SMALL --data-format PARQUET. Just use MEDIUM to generate the 100 million row dataset.

Thanks, I will profile and see what happen about the so long time cost in datafusion.

@Rachelint
Copy link
Contributor

Rachelint commented Nov 26, 2024

When the dataset is 100 million rows, then Polars takes 126 seconds and DataFusion takes 2,100 seconds.

What version are you working with?

@Rachelint has some ideas of how to improve this:

* [Sketch for aggregation intermediate results blocked management #11943](https://github.com/apache/datafusion/pull/11943)

* [Manage group values and states by blocks in aggregation #11931](https://github.com/apache/datafusion/issues/11931)

🤔 I guess it may be caused by the similar reason of what we encountered during benchmarking in #11827

@alamb
Copy link
Contributor Author

alamb commented Nov 26, 2024

When the dataset is 100 million rows, then Polars takes 126 seconds and DataFusion takes 2,100 seconds.

What version are you working with?
@Rachelint has some ideas of how to improve this:

* [Sketch for aggregation intermediate results blocked management #11943](https://github.com/apache/datafusion/pull/11943)

* [Manage group values and states by blocks in aggregation #11931](https://github.com/apache/datafusion/issues/11931)

🤔 I guess it may be caused by the similar reason of what we encountered during benchmarking in #11827

Specifically that power and corr need to support convert_to_state?

@Rachelint
Copy link
Contributor

Rachelint commented Nov 27, 2024

When the dataset is 100 million rows, then Polars takes 126 seconds and DataFusion takes 2,100 seconds.

What version are you working with?
@Rachelint has some ideas of how to improve this:

* [Sketch for aggregation intermediate results blocked management #11943](https://github.com/apache/datafusion/pull/11943)

* [Manage group values and states by blocks in aggregation #11931](https://github.com/apache/datafusion/issues/11931)

🤔 I guess it may be caused by the similar reason of what we encountered during benchmarking in #11827

Specifically that power and corr need to support convert_to_state?

I am not sure, but I think it maybe really related to GroupAccumulatorAdapter as #11827?
I am running and profiling it to find the answer.

@2010YOUY01
Copy link
Contributor

I rerun the H2O Q9 with GroupsAccumulator for corr(). See #13581
h2o dataset is in parquet format

Result
----
main, h2o_10m: 0.8s
main, h2o_100m: 12s
pr, h2o_10m: 0.2s
pr, h2o_100m: 4s

I didn't reproduce the drastic slowdown in main branch🤔

When the dataset is 10 million rows, then Polars takes 3 seconds and DataFusion takes 3.6 seconds, so pretty similar.

When the dataset is 100 million rows, then Polars takes 126 seconds and DataFusion takes 2,100 seconds.

@MrPowers
Copy link

The h2o benchmarks are run on a Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz machine with 128 cores and 250 GB of RAM.

DataFusion groupby queries perform well on the 100 million row dataset (~5GB of data in a CSV file):

Screenshot 2024-11-27 at 12 10 41 PM

Some don't run with the 1 billion row dataset (~50GB of data in an uncompressed CSV file):

Screenshot 2024-11-27 at 12 12 20 PM

I am using a M3 Macbook with 16 GB of RAM. How much RAM does your machine have? Perhaps DataFusion only struggles with query 9 when the machine doesn't have lots of extra RAM.

@Dandandan
Copy link
Contributor

GroupsAccumulator for median/corr reduces memory usage (should be by quite a bit).

@Dandandan
Copy link
Contributor

Looking at the benchmark results, I think query 8 is worth analyzing / optimizing as well:
#13548

@2010YOUY01
Copy link
Contributor

I am using a M3 Macbook with 16 GB of RAM. How much RAM does your machine have? Perhaps DataFusion only struggles with query 9 when the machine doesn't have lots of extra RAM.

This explains 👍🏼 I ran the benchmark on a macbook with 48G of ram.
It is likely Q9 requires > 16G RAM, and OS memory swapping caused the performance regression.

We should also take a look at how much memory does DataFusion consume for those queries, comparing to other systems. Thanks for the report.

@Rachelint
Copy link
Contributor

I am using a M3 Macbook with 16 GB of RAM. How much RAM does your machine have? Perhaps DataFusion only struggles with query 9 when the machine doesn't have lots of extra RAM.

This explains 👍🏼 I ran the benchmark on a macbook with 48G of ram. It is likely Q9 requires > 16G RAM, and OS memory swapping caused the performance regression.

We should also take a look at how much memory does DataFusion consume for those queries, comparing to other systems. Thanks for the report.

Yes, I run it today, and my machine has only 16GB memory too... and I found the query very very slow due to swapping, too...

@alamb
Copy link
Contributor Author

alamb commented Dec 2, 2024

I think making DataFusion work better in lower memory situations would certainly be nice

@alamb alamb changed the title [EPIC] Improved aggregate function performance [EPIC] Improved aggregate function performance (faster H20.ai benchmarks) Dec 13, 2024
@alamb
Copy link
Contributor Author

alamb commented Dec 13, 2024

Update here is that @2010YOUY01 has made corr much better:

I also dug up and connected the task to add the H20.ai queries to bench.sh. Check out

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants