Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update ClickBench benchmarks with DataFusion 44.0.0 #13983

Closed
alamb opened this issue Jan 2, 2025 · 24 comments
Closed

Update ClickBench benchmarks with DataFusion 44.0.0 #13983

alamb opened this issue Jan 2, 2025 · 24 comments
Assignees
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Jan 2, 2025

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Now that DataFusion 44.0.0 is released, It would be great to update ClickBench https://benchmark.clickhouse.com/ with the newest version

ClickBench is a benchmark heavy on filtering and aggregation that we have used as an optimization target for the last several releases.

Describe alternatives you've considered

Additional context

I am especially interested to see the improvements after the vectorized comparison from @Rachelint @jayzhan211 @Dandandan and others in DataFusion aggregate code

@alamb alamb added the enhancement New feature or request label Jan 2, 2025
@alamb
Copy link
Contributor Author

alamb commented Jan 23, 2025

Someone pointed out to me the other day that DataFusion 43 is no longer on top of the ClickBench Parquet Leaderboard

Image

(actually it was one of the people who has spent substntial time optimizing Hyper...)

Thus I think it is that much more valuable to get some DataFusion 44 numbers on the board

@alamb
Copy link
Contributor Author

alamb commented Jan 23, 2025

I also filed a ticket to track running clickbench on DataFusion 45 once that is released in a few weeks

@Rachelint
Copy link
Contributor

Rachelint commented Jan 24, 2025

I think Q8, Q16~18, Q35 can be closer to hyper in 44.0, they are improved in #12996
And Q35 can be even much faster when #13617 is merged (unfortunately, it can just be released in 46.0 for my long delay recently...)

But Q23 is unbelievalbely fast in hyper... I think we may need to profile and think how can we improve it.

@alamb
Copy link
Contributor Author

alamb commented Jan 24, 2025

I think Q8, Q16~18, Q35 can be closer to hyper in 44.0, they are improved in #12996 And Q35 can be even much faster when #13617 is merged (unfortunately, it can just be released in 46.0 for my long delay recently...)

But Q23 is unbelievalbely fast in hyper... I think we may need to profile and think how can we improve it.

I agree -- in case anyone else wants to see hyper reported 5x faster than DataFusion and 6x faster than DuckDB

Image

I think this is Q23

SELECT "SearchPhrase", MIN("URL"), COUNT(*) AS c FROM hits WHERE "URL" LIKE '%google%' AND "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY c DESC LIMIT 10;

SELECT "SearchPhrase", MIN("URL"), MIN("Title"), COUNT(*) AS c, COUNT(DISTINCT "UserID") FROM hits WHERE "Title" LIKE '%Google%' AND "URL" NOT LIKE '%.google.%' AND "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY c DESC LIMIT 10;

Profiling it like this:

$ datafusion-cli -c "SELECT \"SearchPhrase\", MIN(\"URL\"), MIN(\"Title\"), COUNT(*) AS c, COUNT(DISTINCT \"UserID\") FROM hits_partitioned WHERE \"Title\" LIKE '%Google%' AND \"URL\" NOT LIKE '%.google.%' AND \"SearchPhrase\" <> '' GROUP BY \"SearchPhrase\" ORDER BY c DESC LIMIT 10;"

26% of the time goes to snappy decompression and 40% of the time to utf8 validation:

Image

Here is the full flamegraph.svg

So by my calculations the snappy decompression time alone in DataFusion (0.26 * 10.28s = 2.6s) takes longer than the hyper reported time of 1.8s 😕

@alamb
Copy link
Contributor Author

alamb commented Jan 24, 2025

If we wanted to juice our numbers we could turn off ut8 validation too but I feel like that would be cheating (as most/many systems would never run without validtion on)

@Dandandan
Copy link
Contributor

Q23 might be improved if it can utilize filter pushdown? I think a >5x improvement might come from that.

@alamb
Copy link
Contributor Author

alamb commented Jan 24, 2025

Q23 might be improved if it can utilize filter pushdown? I think a >5x improvement might come from that.

Running without filter pushdown (the default)

set datafusion.execution.parquet.pushdown_filters = false;

SELECT "SearchPhrase", MIN("URL"), MIN("Title"), COUNT(*) AS c, COUNT(DISTINCT "UserID") FROM hits_partitioned WHERE "Title" LIKE '%Google%' AND "URL" NOT LIKE '%.google.%' AND "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY c DESC LIMIT 10;

I get:

Elapsed 2.232 seconds.
Elapsed 2.252 seconds.
Elapsed 2.236 seconds.

When I enabled filter pushdown it goes 15% faster.

set datafusion.execution.parquet.pushdown_filters = true;

SELECT "SearchPhrase", MIN("URL"), MIN("Title"), COUNT(*) AS c, COUNT(DISTINCT "UserID") FROM hits_partitioned WHERE "Title" LIKE '%Google%' AND "URL" NOT LIKE '%.google.%' AND "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY c DESC LIMIT 10;

I get:
Elapsed 1.981 seconds.
Elapsed 1.953 seconds.
Elapsed 1.966 seconds.

Still not 5x though 🤔

Though it gives me new motivation tohelp @XiangpengHao get the pushdown improvements over the line in

apache/arrow-rs#6921

@Rachelint
Copy link
Contributor

Rachelint commented Jan 24, 2025

@alamb 🤔 Q23 seems to be SELECT * FROM hits WHERE "URL" LIKE '%google%' ORDER BY to_timestamp_seconds("EventTime") LIMIT 10 ?

@alamb
Copy link
Contributor Author

alamb commented Jan 24, 2025

@alamb 🤔 Q23 seems to be SELECT * FROM hits WHERE "URL" LIKE '%google%' ORDER BY to_timestamp_seconds("EventTime") LIMIT 10 ?

🤔 you are right indeed 🤦 -- sorry about that (I went the wrong direction)

SELECT * FROM hits WHERE "URL" LIKE '%google%' ORDER BY to_timestamp_seconds("EventTime") LIMIT 10;

I will profile that and report back

@alamb
Copy link
Contributor Author

alamb commented Jan 24, 2025

And in this case enabling predicate pushdown results in a 2x speedup

set datafusion.execution.parquet.pushdown_filters = false;
SELECT * FROM hits_partitioned WHERE "URL" LIKE '%google%' ORDER BY to_timestamp_seconds("EventTime") LIMIT 10;

Elapsed 4.108 seconds.
Elapsed 5.430 seconds.
Elapsed 4.659 seconds.

set datafusion.execution.parquet.pushdown_filters = true;
SELECT * FROM hits_partitioned WHERE "URL" LIKE '%google%' ORDER BY to_timestamp_seconds("EventTime") LIMIT 10;

Elapsed 2.415 seconds.
Elapsed 2.070 seconds.
Elapsed 2.279 seconds.

Here is the flamegraph for no pushdown:

Image

It would be cool to test with @XiangpengHao 's change to the parquet decoder here:

@Dandandan
Copy link
Contributor

It seems it could also benefit from some further utf8 validation speed up, filed it here apache/arrow-rs#7014

@Rachelint
Copy link
Contributor

Rachelint commented Jan 24, 2025

@alamb Excited to see further optmization about late materialization, it is really an important feature as I think !
I tried to use it in HoraeDB last year, and found the same problem mentioned in #6921 and it is frustrated...

I will profile again with setting datafusion.execution.parquet.pushdown_filters = true;, and see what optimizations we can do in datafusion.

@alamb
Copy link
Contributor Author

alamb commented Jan 24, 2025

@alamb Excited to see further optmization about late materialization, it is really an important feature as I think ! I tried to use it in HoraeDB last year, and found the same problem mentioned in #6921 and it is frustrated...

I will profile again with setting datafusion.execution.parquet.pushdown_filters = true;, and see what optimizations we can do in datafusion.

Thanks @Rachelint

For this case I believe the core change needs to happen in the Parquet reader. The background as I understand it is described here

@XiangpengHao has a prototype in the following PR

A good next step would be to measure how much faster DataFusion is with that PR -- the previous measurements we had a few other optimizations mixed in.

@Rachelint
Copy link
Contributor

Here is the flamegraph for pushdown, 36+% time costs in decompression.

profile

@alamb
Copy link
Contributor Author

alamb commented Jan 27, 2025

BTW @pmcgleenon -- is there any chance you might have time to run the ClickBench benchmarks for DataFusion 44? If not no worries, we'll find someone else

@pmcgleenon
Copy link
Contributor

hi @alamb I could pick this up towards the end of the week / early next week, does that work?

by the way last time I checked it (a few weeks back) I noticed that this change in ClickBench has introduced a regression. One of the Clickbench queries fails when using read -r query instead of read query in the run.sh script

the clickbench run.sh script reports the result [null, null, null] for that specific query with both single and partitioned

Do you think this a Datafusion bug?

@alamb
Copy link
Contributor Author

alamb commented Jan 30, 2025

hi @alamb I could pick this up towards the end of the week / early next week, does that work?

Yes of course! Thank you so much!

by the way last time I checked it (a few weeks back) I noticed that this change in ClickBench has introduced a regression. One of the Clickbench queries fails when using read -r query instead of read query in the run.sh script

Thanks for checking -- I was not aware of that change

the clickbench run.sh script reports the result [null, null, null] for that specific query with both single and partitioned

Do you think this a Datafusion bug?

I am not sure to be honest -- I think we would need to run the query and see what the results are now 🤔

I will put this on my list of things to look at

@Dandandan Dandandan self-assigned this Jan 30, 2025
@pmcgleenon
Copy link
Contributor

hi @Dandandan @alamb here are the results I've got. Looks like 44.0.0 has registered improvements compared with 43.0.0

Image

Attached is the updated latest Clickbench web page if you want to compare datafusion with the other DBs

clickbench-latest.html.zip

If this looks ok, I can create a PR on clickbench to update the results

@alamb
Copy link
Contributor Author

alamb commented Feb 4, 2025

If this looks ok, I can create a PR on clickbench to update the results

First of all, thank you so much @pmcgleenon -- this is super helpful as always 🙏

I looked at the results and they are inline with my expectations. Specifically most of the major improvements mirror the improvement made by @jayzhan211 #12996 (they are similar improvements as reported in #12996 (comment))

There are a few queries that appear to have gotten slower (q36 and q37)

SELECT "URL", COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= '2013-07-31' AND "DontCountHits" = 0 AND "IsRefresh" = 0 AND "URL" <> '' GROUP BY "URL" ORDER BY PageViews DESC LIMIT 10;
SELECT "Title", COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= '2013-07-31' AND "DontCountHits" = 0 AND "IsRefresh" = 0 AND "Title" <> '' GROUP BY "Title" ORDER BY PageViews DESC LIMIT 10;

I will file a separate ticket for looking into them.

But from my perspective these results are ready to go 🚀

(BTW I expect DataFusion 45.0.0 to be available in the next few days -- see #14008 -- and it has yet more improvements!)

@alamb
Copy link
Contributor Author

alamb commented Feb 4, 2025

I will file a separate ticket for looking into them.

@pmcgleenon
Copy link
Contributor

PR raised to update Clickbench results for datafusion 44.0.0 ClickHouse/ClickBench#301

@pmcgleenon
Copy link
Contributor

pmcgleenon commented Feb 5, 2025

I've re-run this again following the discussion on the ClickBench PR

Q28 needed an update, since the regex was not working correctly

Query 28: SELECT REGEXP_REPLACE(Referer, '^https?://(?:www\.)?([^/]+)/.*$', '\1') AS k, AVG(length(Referer)) AS l, COUNT(*) AS c, MIN(Referer) FROM hits WHERE Referer <> '' GROUP BY k HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25;

It needs to include an additional escape character \ for :www\\. and \\1:

SELECT REGEXP_REPLACE("Referer", '^https?://(?:www\\.)?([^/]+)/.*$', '\\1') AS k, AVG(length("Referer")) AS l, COUNT(*) AS c, MIN("Referer") FROM hits WHERE "Referer" <> '' GROUP BY k
HAVING COUNT(*) > 100000
ORDER BY l DESC LIMIT 25;

After that change, the Q28 performance for datafusion 44.0.0 has regressed compared to 43.0.0. Based on the discussion over on ClickBench I think is expected since the Q28 results for all previous datafusion releases are invalid.

Image

@Dandandan @alamb I've attached the full clickbench here if you want to take a look
clickbench-latest 2.html.zip

@pmcgleenon
Copy link
Contributor

hi folks, the Clickbench PR to update the results for 44.0.0 has been merged. Thanks @alamb for helping figure out the Q28 problems with the regex expression!

Looking forward to see further improved results with the 45.0.0 release!! 🚀

@alamb
Copy link
Contributor Author

alamb commented Feb 7, 2025

Thanks @pmcgleenon ! I am closing this one down.

It turns out we have just released datafusion-45 this morning so we are ready for the run whenever you are !

@alamb alamb closed this as completed Feb 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants