Add H2O.ai Database-like Ops benchmark to `dfbench` #7209

alamb · 2023-08-05T17:01:57Z

Is your feature request related to a problem or challenge?

Follow on to #7052
There is an interesting database benchark called "H20.ai database like benchmark" that DuckDB seems to have revived (perhaps because the original went dormant with very old with very old/ slow duckdb results). More background here: https://duckdb.org/2023/04/14/h2oai.html#results

@Dandandan added a new solution for datafusion here: duckdblabs/db-benchmark#18

However, there is no easy way to run the h2o benchmark within the datafusion repo. There is an old version of some of these benchmarks in the code: https://github.com/apache/arrow-datafusion/blob/main/benchmarks/src/bin/h2o.rs

Describe the solution you'd like

I would like someone to make it easy to run the h20.ai benchmark in the datafusion repo.

Ideally this would look like

# generate data
./benchmarks/bench.sh data h20.ai
# run 
./benchmarks/bench.sh run h20.ai

I would expect to be able to run the individual queries like this

cargo run  --bin dfbench -- h2o.ai --query=3

Some steps might be

port the existing benchmark script to dfbench following the model in Add parquet-filter and sort benchmarks to dfbench #7120
update bench.sh, following the model of existing benchmarks
Update the documentation

Describe alternatives you've considered

We could also simply remove the h20.ai benchmark script as it is not clear how important it will be long term

Additional context

I think this is a good first issue as the task is clear, and there are existing patterns in bench.sh, dfbench and in

The text was updated successfully, but these errors were encountered:

palash25 · 2023-08-07T08:45:22Z

I would like to work on this

alamb · 2023-08-07T14:46:20Z

Thank you @palash25

palash25 · 2023-09-09T18:30:16Z

sorry for the inactivity on this. my RSI came back so i was taking a break from typing, i will try to submit the PR in a day or two.

alamb · 2023-09-10T10:26:42Z

No problem -- I hope you feel better soon

drewhayward · 2024-07-29T18:49:27Z

Is this something that's still wanted? I took a look at doing this but it looks like the data isn't hosted on the benchmark repo, just data gen scripts in R.

alamb · 2024-07-31T14:06:33Z

Is this something that's still wanted? I took a look at doing this but it looks like the data isn't hosted on the benchmark repo, just data gen scripts in R.

I think it would be useful. Thank you

I think figuring out how to generate the data locally would be super valuable -- perhaps we can use a docker like approach as we do for tpch:

datafusion/benchmarks/bench.sh

Lines 286 to 292 in 89677ae

    
           FILE="${TPCH_DIR}/supplier.tbl" 
        
           if test -f "${FILE}"; then 
        
               echo " tbl files exist ($FILE exists)." 
        
           else 
        
               echo " creating tbl files with tpch_dbgen..." 
        
               docker run -v "${TPCH_DIR}":/data -it --rm ghcr.io/scalytics/tpch-docker:main -vf -s ${SCALE_FACTOR} 
        
           fi

So it would run like

./bench.sh data h2o

Which would leave data in datafusion/benchmarks/data/h2o

🤔

alamb · 2024-12-13T16:54:54Z

Now that @Rachelint and @2010YOUY01 and others have started working on

[EPIC] Improved aggregate function performance (faster H20.ai benchmarks) #13548

I think this issue is more important than ever

I think the hardest part of this task is actually generating the benchmark data
Thankfully @MrPowers has created falsa to generate the dataset (so we don't need R installed):

Here are the instructions for generating data: https://github.com/MrPowers/mrpowers-benchmarks?tab=readme-ov-file#running-the-benchmarks-on-your-machine

Here are the queries:

alamb added enhancement New feature or request good first issue Good for newcomers labels Aug 5, 2023

alamb mentioned this issue Aug 14, 2023

Consolidate datafusion benchmarks into dfbench #7052

Closed

8 tasks

alamb mentioned this issue Dec 13, 2024

[EPIC] Improved aggregate function performance (faster H20.ai benchmarks) #13548

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add H2O.ai Database-like Ops benchmark to `dfbench` #7209

Add H2O.ai Database-like Ops benchmark to `dfbench` #7209

alamb commented Aug 5, 2023

palash25 commented Aug 7, 2023 •

edited

Loading

alamb commented Aug 7, 2023

palash25 commented Sep 9, 2023

alamb commented Sep 10, 2023

drewhayward commented Jul 29, 2024

alamb commented Jul 31, 2024

alamb commented Dec 13, 2024 •

edited

Loading

Add H2O.ai Database-like Ops benchmark to dfbench #7209

Add H2O.ai Database-like Ops benchmark to dfbench #7209

Comments

alamb commented Aug 5, 2023

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

palash25 commented Aug 7, 2023 • edited Loading

alamb commented Aug 7, 2023

palash25 commented Sep 9, 2023

alamb commented Sep 10, 2023

drewhayward commented Jul 29, 2024

alamb commented Jul 31, 2024

alamb commented Dec 13, 2024 • edited Loading

Add H2O.ai Database-like Ops benchmark to `dfbench` #7209

Add H2O.ai Database-like Ops benchmark to `dfbench` #7209

palash25 commented Aug 7, 2023 •

edited

Loading

alamb commented Dec 13, 2024 •

edited

Loading