-
Notifications
You must be signed in to change notification settings - Fork 66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize Hive hash
computation for nested types
#2720
base: branch-25.02
Are you sure you want to change the base?
Optimize Hive hash
computation for nested types
#2720
Conversation
Signed-off-by: Yan Feng <[email protected]>
Signed-off-by: Yan Feng <[email protected]>
Hive hash
computation for nested typesHive hash
computation for nested types
Signed-off-by: Yan Feng <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you post the benchmark results here?
Maybe also how it failed on the string type and how to reproduce it if you need help on that issue.
#include <nvbench/nvbench.cuh> | ||
|
||
constexpr auto min_width = 10; | ||
constexpr auto max_width = 10; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this two be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The optimization effect is related to the schema, and I'm not sure how to write the benchmark for a good comparison.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The benchmark should support running either (or both) lists and structs. You can switch benchmark types using a boolean flag. That flag should not be constexpr
so we can compile all the code to verify syntax.
auto const bench_structs = false;
auto const data_profile = [&] {
if (bench_structs) {
...
.struct_types(...);
return ...;
} else {
...
.list_type(...);
return ...;
}();
Signed-off-by: Yan Feng <[email protected]>
Signed-off-by: Yan Feng <[email protected]>
Signed-off-by: Yan Feng <[email protected]>
@@ -486,15 +526,77 @@ std::unique_ptr<cudf::column> hive_hash(cudf::table_view const& input, | |||
|
|||
check_nested_depth(input); | |||
|
|||
// `flattened_column_views` only contains nested columns and columns that result from flattening | |||
// nested columns | |||
std::vector<cudf::column_view> flattened_column_views; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The column view constructor will calculate null count which is time consuming.
The original approach does not need to calculate null count.
We may need to find a way to avoid this column view array.
Please help check if the contiguous_copy_column_device_views
is helpful?
// `input` to the index in `flattened_column_views` | ||
std::vector<cudf::size_type> nested_column_map; | ||
|
||
for (auto i = 0; i < input.num_columns(); i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extract a function like get_flattened_device_views
Fixed the struct(string) type reports illegal memory access error: diff --git a/src/main/cpp/src/hive_hash.cu b/src/main/cpp/src/hive_hash.cu
index ca720b5..5e9bb35 100644
--- a/src/main/cpp/src/hive_hash.cu
+++ b/src/main/cpp/src/hive_hash.cu
@@ -22,6 +22,7 @@
#include <cudf/structs/structs_column_view.hpp>
#include <cudf/table/experimental/row_operators.cuh>
#include <cudf/table/table_device_view.cuh>
+#include <cudf/table/table_device_view.cuh>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/exec_policy.hpp>
@@ -566,17 +567,9 @@ std::unique_ptr<cudf::column> hive_hash(cudf::table_view const& input,
}
}
- std::vector<cudf::column_device_view> device_flattened_column_views;
- device_flattened_column_views.reserve(flattened_column_views.size());
-
- std::transform(
- flattened_column_views.begin(),
- flattened_column_views.end(),
- std::back_inserter(device_flattened_column_views),
- [&stream](auto const& col) { return *cudf::column_device_view::create(col, stream); });
+ [[maybe_unused]] auto [device_view_owners, device_flattened_column_views] =
+ cudf::contiguous_copy_column_device_views<cudf::column_device_view>(flattened_column_views, stream);
- auto flattened_column_device_views =
- cudf::detail::make_device_uvector_async(device_flattened_column_views, stream, mr);
auto first_child_index_view =
cudf::detail::make_device_uvector_async(first_child_index, stream, mr);
auto nested_column_map_view =
@@ -594,7 +587,7 @@ std::unique_ptr<cudf::column> hive_hash(cudf::table_view const& input,
output_view.end<hive_hash_value_t>(),
hive_device_row_hasher<hive_hash_function, bool>(nullable,
*input_view,
- flattened_column_device_views.data(),
+ device_flattened_column_views,
first_child_index_view.data(),
nested_column_map_view.data()));
Please apply the above patch. |
Signed-off-by: Yan Feng <[email protected]>
- [&stream](auto const& col) { return *cudf::column_device_view::create(col, stream); }); Remember to never dereference the output from |
auto first_child_index_view = | ||
cudf::detail::make_device_uvector_async(first_child_index, stream, mr); | ||
auto nested_column_map_view = | ||
cudf::detail::make_device_uvector_async(nested_column_map, stream, mr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The memory resource mr
is only passed to the function that creates the the final output (or part of it). These variables are not returned thus we only pass in cudf::get_current_device_resource_ref()
.
Currently they are the same but this is required for future evolution of libcudf.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And please check if there are similar situations in the other places that need to be changed.
Co-authored-by: Nghia Truong <[email protected]>
Hive hash
computation for nested typesHive hash
computation for nested types
I just came up with some random thought: Please correct me if I am wrong. And even if this is doable, I am happy about doing with a follow-up PR since it is a NIT improvement. |
Good idea for struct/primitive only types. For the struct/primitive nested types(without list), we can use this idea. |
Yes, you are right! In terms of list type, it seems impossible to get rid of stack operations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM in general, except the some NIT improvements
top.update_cur_hash(child_hash); | ||
} else { | ||
col_stack[stack_size++] = col_stack_frame(child_col); | ||
col_stack[stack_size++] = col_stack_frame(child_col_idx, curr_row_idx); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO, it is better to update the array element(col_stack_frame) inplacely instead of creating a new object.
if (--stack_size > 0) { col_stack[stack_size - 1].update_cur_hash(top.get_hash()); } | ||
} else { | ||
// Push the next element into the stack | ||
col_stack[stack_size++] = | ||
col_stack_frame(child_col.slice(top.get_and_inc_idx_to_process(), 1)); | ||
col_stack[stack_size++] = col_stack_frame( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above
Main Optimization:
Flatten nested columns in advance, reducing the size of the
stack_frame
.Existing Issue:
TODO:
Possible Further Optimization:
flattened_column_views
,first_child_index
andnested_column_map
to shared memory.xxhash64
does not require_cur_hash
, which presents a challenge for unification.Benchmark:
struct<BOOL8, INT32, FLOAT32>
list
with a depth oflist_depth
, with the basic type beingINT32