join big tables takes too long #5736

Open
jangorecki opened this issue Nov 10, 2023 · 3 comments

@jangorecki
Member

jangorecki commented Nov 10, 2023

Detected by db-benchmark: the 1e9 join test for data.table does not finish within 4 hours. Previously this test failed with an out-of-memory error on a machine with 128 GB of memory. It now runs on a machine with 250 GB, so I would expect it to finish. Given that joins should scale well in data.table, it seems abnormal that 4 hours is not enough.

c6id.metal machine
250 GB ram
128 cores
duckdblabs/db-benchmark#61

If someone has access to a c6id.metal machine, it would be useful to run the join script with verbose=TRUE and an extended timeout, to see whether a little more time is enough for it to complete.
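
For anyone without access to that machine, a minimal sketch of what a verbose join run looks like at a much smaller scale follows. The tables `x` and `small`, their column names, and the sizes are hypothetical stand-ins, not the benchmark's actual data or queries; only the `on`, `nomatch`, and `verbose` arguments of `[.data.table` are taken from data.table itself.

```r
library(data.table)

# Hypothetical small-scale stand-in tables (NOT the db-benchmark data).
set.seed(1)
n <- 1e5
x <- data.table(id1 = sample(n, n, replace = TRUE), v1 = runif(n))
small <- data.table(id1 = seq_len(n / 10), v2 = runif(n / 10))

# Inner join on id1; verbose = TRUE makes data.table print diagnostic
# messages about the internal steps of this single call.
elapsed <- system.time(
  ans <- x[small, on = "id1", nomatch = NULL, verbose = TRUE]
)

print(elapsed)   # wall-clock and CPU time for the join
print(dim(ans))  # size of the joined result
```

Setting `options(datatable.verbose = TRUE)` instead would turn on the same diagnostics for every data.table call in the session, which may be more convenient when running an existing script unchanged.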

tdhock self-assigned this Nov 12, 2023
@tdhock
Member

tdhock commented Nov 12, 2023

@Anirban166 @DorisAmoakohene this would be an interesting example to investigate, perhaps using atime

@HughParsonage
Member

I can't quite follow that repo. What's the actual minimal code for the join that the benchmark is measuring?

@jangorecki
Member Author

https://github.com/duckdblabs/db-benchmark/blob/master/datatable/join-datatable.R
This is the test script.

The data needs to be generated with:

Rscript _data/join-datagen.R 1e9 NA 0 0
