-
Notifications
You must be signed in to change notification settings - Fork 920
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a libcudf/thrust-based TPC-H derived datagen #16294
Add a libcudf/thrust-based TPC-H derived datagen #16294
Conversation
a834875
to
e45c419
Compare
TODO:
|
1e7abc5
to
cd5ec3c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you move the function definitions/implementations from header files to .cpp/.cu files? It appears that dbgen.cu is a .cu file because of header it includes and that header contains device code which seems unnecessarily restrictive.
cpp/benchmarks/common/tpch_data_generator/random_column_generator.cu
Outdated
Show resolved
Hide resolved
cpp/benchmarks/common/tpch_data_generator/random_column_generator.hpp
Outdated
Show resolved
Hide resolved
cpp/benchmarks/common/tpch_data_generator/random_column_generator.hpp
Outdated
Show resolved
Hide resolved
cpp/benchmarks/common/cudf_tpch_datagen/random_column_generator.hpp
Outdated
Show resolved
Hide resolved
…tor.hpp Co-authored-by: Karthikeyan <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a lot to get your head around and to build cleanly, so kudos to you @JayjeetAtGithub for getting it done!
Really I would have expected us to build this in Python, and I'm not sure why we didn't, but I suppose it's a good SOL test to do it in libcudf.
Overall it looks good, I had a couple of questions, but I approve.
Aside: I think that many of the generate_*
function implementations have very high cognitive complexity so breaking them up into subfunctions might help maintainability (but I guess there's not much reuse potential). I don't think this is necessary before merging.
cpp/benchmarks/common/tpch_data_generator/random_column_generator.cu
Outdated
Show resolved
Hide resolved
cpp/benchmarks/common/tpch_data_generator/random_column_generator.cu
Outdated
Show resolved
Hide resolved
cpp/benchmarks/common/tpch_data_generator/random_column_generator.cu
Outdated
Show resolved
Hide resolved
I tried using |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the great work. This is a big step forward, and I'm really excited about the upcoming NVBench migration.
@JayjeetAtGithub, could you please document the concerns we discussed in a GitHub issue to ensure we won't miss them in the future?
cpp/benchmarks/common/tpch_data_generator/tpch_data_generator.cpp
Outdated
Show resolved
Hide resolved
auto const gather_map = cudf::table_view({indices->view(), keys->view()}); | ||
auto const gathered_table = cudf::gather( | ||
gather_map, ternary_mask->view(), cudf::out_of_bounds_policy::DONT_CHECK, stream, mr); | ||
auto gathered_table_columns = gathered_table->release(); | ||
return std::move(gathered_table_columns[1]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: indices is not required here in gather_map table.
auto const gather_map = cudf::table_view({indices->view(), keys->view()}); | ||
auto const gathered_table = cudf::gather( | ||
gather_map, mask_index_type->view(), cudf::out_of_bounds_policy::DONT_CHECK, stream, mr); | ||
auto gathered_table_columns = gathered_table->release(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: indices is not required here in gather_map table.
cpp/benchmarks/common/tpch_data_generator/tpch_data_generator.cpp
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewing just the CMake.
How do you anticipate the generator be used? Will it just be used by other benchmarks in our repository? If so, the CMake looks fine as-is. If you intend broader usage (I'd caution against that unless we have a very good reason) then there is some additional machinery.
Thanks for taking a look @vyasr. For now, we only want our internal benchmark code to use the datagen by just linking to the static library |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me 👍
Nice work! This is going to be very useful for benchmarks.
/merge |
Description
This PR adds a TPC-H (according to spec 3.0.1) inspired datagen written using
libcudf
andthrust
Implementation Status
Checklist