Add H2O.ai Database-like Ops benchmark to dfbench #7209

Open · Tracked by #13548
alamb opened this issue Aug 5, 2023 · 7 comments
Labels: enhancement (New feature or request), good first issue (Good for newcomers)

Comments

@alamb (Contributor) commented Aug 5, 2023

Is your feature request related to a problem or challenge?

Follow-on to #7052.
There is an interesting database benchmark called the "H2O.ai database-like ops benchmark" that DuckDB seems to have revived (perhaps because the original went dormant with very old / slow DuckDB results). More background here: https://duckdb.org/2023/04/14/h2oai.html#results

@Dandandan added a new solution for datafusion here: duckdblabs/db-benchmark#18

However, there is no easy way to run the h2o benchmark within the datafusion repo. There is an old version of some of these benchmarks in the code: https://github.com/apache/arrow-datafusion/blob/main/benchmarks/src/bin/h2o.rs

Describe the solution you'd like

I would like someone to make it easy to run the h2o.ai benchmark in the datafusion repo.

Ideally this would look like

# generate data
./benchmarks/bench.sh data h2o.ai
# run
./benchmarks/bench.sh run h2o.ai

I would expect to be able to run the individual queries like this

cargo run  --bin dfbench -- h2o.ai --query=3

Some steps might be:

  1. Port the existing benchmark script to dfbench, following the model in Add parquet-filter and sort benchmarks to dfbench #7120 (a rough sketch of what the subcommand could look like follows this list)
  2. Update bench.sh, following the model of the existing benchmarks
  3. Update the documentation
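
To make step 1 concrete, here is a minimal sketch of what such a subcommand could look like. It assumes the clap-derive option style and DataFusion's SessionContext / register_csv APIs; the RunOpt name, the option set, the dataset filename, and the single hard-coded query are illustrative only, not the actual dfbench code.

// Sketch only -- names and options here are illustrative, not real dfbench code
use datafusion::error::Result;
use datafusion::prelude::{CsvReadOptions, SessionContext};

/// Run the H2O.ai db-benchmark queries (illustrative option set)
#[derive(Debug, clap::Parser)]
pub struct RunOpt {
    /// Which query to run (a real port would run all queries if unset)
    #[clap(long)]
    query: Option<usize>,

    /// Path to the generated h2o data, e.g. benchmarks/data/h2o/G1_1e7_1e2_0_0.csv
    #[clap(long)]
    path: String,
}

impl RunOpt {
    pub async fn run(self) -> Result<()> {
        let ctx = SessionContext::new();
        // Register the groupby table under the name the benchmark queries expect
        ctx.register_csv("x", &self.path, CsvReadOptions::new()).await?;

        // Groupby q1 as an example; a real port would dispatch on self.query
        let sql = "SELECT id1, SUM(v1) AS v1 FROM x GROUP BY id1";
        let start = std::time::Instant::now();
        let batches = ctx.sql(sql).await?.collect().await?;
        println!("q1 took {:?} ({} output batches)", start.elapsed(), batches.len());
        Ok(())
    }
}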

Describe alternatives you've considered

We could also simply remove the h2o.ai benchmark script, as it is not clear how important it will be long term.

Additional context

I think this is a good first issue as the task is clear, and there are existing patterns in bench.sh and dfbench to follow.

alamb added the enhancement (New feature or request) and good first issue (Good for newcomers) labels on Aug 5, 2023
@palash25 (Contributor) commented Aug 7, 2023

I would like to work on this

@alamb (Contributor, Author) commented Aug 7, 2023

Thank you @palash25

@palash25 (Contributor) commented Sep 9, 2023

Sorry for the inactivity on this. My RSI came back, so I was taking a break from typing. I will try to submit the PR in a day or two.

@alamb (Contributor, Author) commented Sep 10, 2023

No problem -- I hope you feel better soon

@drewhayward (Contributor) commented:
Is this something that's still wanted? I took a look at doing this but it looks like the data isn't hosted on the benchmark repo, just data gen scripts in R.

@alamb (Contributor, Author) commented Jul 31, 2024

> Is this something that's still wanted? I took a look at doing this but it looks like the data isn't hosted on the benchmark repo, just data gen scripts in R.

I think it would be useful. Thank you

I think figuring out how to generate the data locally would be super valuable -- perhaps we can use a docker-like approach as we do for tpch:

FILE="${TPCH_DIR}/supplier.tbl"
if test -f "${FILE}"; then
    echo " tbl files exist ($FILE exists)."
else
    echo " creating tbl files with tpch_dbgen..."
    docker run -v "${TPCH_DIR}":/data -it --rm ghcr.io/scalytics/tpch-docker:main -vf -s ${SCALE_FACTOR}
fi

So it would run like

./bench.sh data h2o

Which would leave data in datafusion/benchmarks/data/h2o

🤔
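
To sketch how that could plug into bench.sh (assuming the same DATA_DIR convention and the "generate only if missing" pattern from the tpch snippet above; the function name, the example filename, and the generation step are placeholders):

# Sketch only: a data_h2o helper modeled on the tpch "generate if missing" pattern
data_h2o() {
    H2O_DIR="${DATA_DIR}/h2o"
    mkdir -p "${H2O_DIR}"
    FILE="${H2O_DIR}/G1_1e7_1e2_0_0.csv"   # example groupby dataset name
    if test -f "${FILE}"; then
        echo " h2o dataset exists (${FILE})."
    else
        echo " generating h2o dataset..."
        # TODO: invoke the actual data generator here (see the falsa notes below)
    fi
}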

@alamb (Contributor, Author) commented Dec 13, 2024

Now that @Rachelint and @2010YOUY01 and others have started working in this area, I think this issue is more important than ever.

I think the hardest part of this task is actually generating the benchmark data.
Thankfully @MrPowers has created falsa to generate the dataset (so we don't need R installed).

Here are the instructions for generating data: https://github.com/MrPowers/mrpowers-benchmarks?tab=readme-ov-file#running-the-benchmarks-on-your-machine
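
For anyone picking this up, a rough sketch of the setup, assuming falsa is a Python package installable from PyPI with its own command line entry point (the exact generation flags should come from the README linked above):

# Assumption: falsa is on PyPI and installs a CLI; see its README for the real
# generation options (dataset sizes, output format, output directory)
python3 -m venv .venv && source .venv/bin/activate
pip install falsa
falsa --help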

Here are the queries:
