
Optimize Hive hash computation for nested types #2720

Draft · wants to merge 8 commits into base: branch-25.02
Conversation

Collaborator

@ustcfy ustcfy commented Dec 23, 2024

Main Optimization:

Flatten nested columns in advance, reducing the size of the stack_frame.

Existing Issue:

  • The optimization's effect depends on the schema, and I'm not sure how to write a benchmark that gives a fair comparison.

TODO:

  • Update comments.

Possible Further Optimization:

  • Move flattened_column_views, first_child_index and nested_column_map to shared memory.
  • Make the various hash functions share the nested-type flattening logic; however, xxhash64 does not need _cur_hash, which makes unification harder.

Benchmark:

  1. Environment: NVIDIA TITAN RTX
  2. Sizes: 50 MB, 100 MB, 250 MB, 500 MB, 1 GB
  • schema: struct<BOOL8, INT32, FLOAT32>

| size_bytes | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---|---|---|---|---|---|---|---|
| 50000000 | 3.063 ms | 15.07% | 1.150 ms | 19.14% | -1913.227 us | -62.47% | FAST |
| 100000000 | 5.933 ms | 0.20% | 1.892 ms | 2.09% | -4041.607 us | -68.12% | FAST |
| 250000000 | 14.623 ms | 0.15% | 4.071 ms | 8.03% | -10551.985 us | -72.16% | FAST |
| 500000000 | 29.164 ms | 0.19% | 6.832 ms | 5.45% | -22331.750 us | -76.57% | FAST |
| 1000000000 | 58.946 ms | 0.09% | 11.506 ms | 0.46% | -47439.712 us | -80.48% | FAST |
  • schema: lists nested to depth list_depth, with INT32 as the leaf type

| size_bytes | list_depth | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---|---|---|---|---|---|---|---|---|
| 50000000 | 1 | 408.665 us | 6.49% | 427.040 us | 4.44% | 18.375 us | 4.50% | SLOW |
| 100000000 | 1 | 619.601 us | 62.44% | 578.274 us | 1.10% | -41.327 us | -6.67% | FAST |
| 250000000 | 1 | 1.304 ms | 28.94% | 1.197 ms | 0.14% | -107.499 us | -8.24% | FAST |
| 500000000 | 1 | 2.424 ms | 0.34% | 2.265 ms | 11.32% | -158.439 us | -6.54% | FAST |
| 1000000000 | 1 | 4.698 ms | 0.18% | 3.666 ms | 0.10% | -1032.364 us | -21.97% | FAST |
| 50000000 | 2 | 888.976 us | 1.05% | 965.358 us | 1.66% | 76.382 us | 8.59% | SLOW |
| 100000000 | 2 | 1.856 ms | 0.64% | 1.957 ms | 1.33% | 101.652 us | 5.48% | SLOW |
| 250000000 | 2 | 4.822 ms | 0.31% | 5.332 ms | 6.01% | 509.677 us | 10.57% | SLOW |
| 500000000 | 2 | 9.576 ms | 0.24% | 10.320 ms | 0.61% | 744.076 us | 7.77% | SLOW |
| 1000000000 | 2 | 19.050 ms | 0.18% | 20.304 ms | 1.62% | 1.254 ms | 6.58% | SLOW |
| 50000000 | 3 | 999.941 us | 0.80% | 919.377 us | 0.50% | -80.564 us | -8.06% | FAST |
| 100000000 | 3 | 1.336 ms | 0.69% | 1.065 ms | 1.17% | -270.761 us | -20.27% | FAST |
| 250000000 | 3 | 4.559 ms | 0.19% | 3.831 ms | 0.41% | -728.397 us | -15.98% | FAST |
| 500000000 | 3 | 12.616 ms | 0.31% | 13.341 ms | 0.43% | 724.816 us | 5.75% | SLOW |
| 1000000000 | 3 | 28.351 ms | 0.37% | 30.913 ms | 0.43% | 2.562 ms | 9.04% | SLOW |
| 50000000 | 4 | 9.388 ms | 0.50% | 8.490 ms | 0.27% | -898.402 us | -9.57% | FAST |
| 100000000 | 4 | 9.561 ms | 0.43% | 8.476 ms | 0.10% | -1084.155 us | -11.34% | FAST |
| 250000000 | 4 | 9.590 ms | 0.18% | 8.493 ms | 0.39% | -1097.442 us | -11.44% | FAST |
| 500000000 | 4 | 10.126 ms | 0.31% | 8.831 ms | 0.83% | -1294.458 us | -12.78% | FAST |
| 1000000000 | 4 | 18.583 ms | 0.87% | 14.239 ms | 0.50% | -4344.140 us | -23.38% | FAST |
| 50000000 | 6 | 200.859 ms | 0.07% | 193.565 ms | 0.01% | -7293.741 us | -3.63% | FAST |
| 100000000 | 6 | 232.922 ms | 0.03% | 212.022 ms | 0.34% | -20899.757 us | -8.97% | FAST |
| 250000000 | 6 | 278.656 ms | 0.49% | 256.458 ms | 0.04% | -22198.219 us | -7.97% | FAST |
| 500000000 | 6 | 353.204 ms | 0.15% | 318.858 ms | 0.07% | -34346.383 us | -9.72% | FAST |
| 1000000000 | 6 | 481.230 ms | 0.06% | 435.179 ms | 0.13% | -46050.776 us | -9.57% | FAST |

@ustcfy ustcfy changed the title Optimize Hive hash computation for nested types [Draft] Optimize Hive hash computation for nested types Dec 23, 2024
Signed-off-by: Yan Feng <[email protected]>
@ustcfy ustcfy requested a review from ttnghia December 23, 2024 08:42
Collaborator

@thirtiseven thirtiseven left a comment

Could you post the benchmark results here?

Maybe also post how it failed on the string type and how to reproduce it, in case you need help with that issue.

#include <nvbench/nvbench.cuh>

constexpr auto min_width = 10;
constexpr auto max_width = 10;
Collaborator

Should these two be removed?

Collaborator Author

The optimization's effect depends on the schema, and I'm not sure how to write a benchmark that gives a fair comparison.

Collaborator

The benchmark should support running either (or both of) lists and structs. You can switch benchmark types using a boolean flag. That flag should not be constexpr, so that all the code is compiled and its syntax verified.

auto const bench_structs = false;
auto const data_profile = [&] {
  if (bench_structs) {
    ...
    .struct_types(...);
    return ...;
  } else {
    ...
    .list_type(...);
    return ...;
  }
}();

src/main/cpp/benchmarks/hash.cu (review thread resolved)
@ustcfy ustcfy requested a review from res-life December 23, 2024 10:17
@ustcfy ustcfy self-assigned this Dec 23, 2024
sperlingxx (comment marked as outdated)

@@ -486,15 +526,77 @@ std::unique_ptr<cudf::column> hive_hash(cudf::table_view const& input,

check_nested_depth(input);

// `flattened_column_views` only contains nested columns and columns that result from flattening
// nested columns
std::vector<cudf::column_view> flattened_column_views;
Collaborator

The column_view constructor calculates the null count, which is time-consuming; the original approach does not need that. We may need to find a way to avoid this column_view array. Please check whether contiguous_copy_column_device_views is helpful here.

// `input` to the index in `flattened_column_views`
std::vector<cudf::size_type> nested_column_map;

for (auto i = 0; i < input.num_columns(); i++) {
Collaborator

Extract a function for this, e.g. get_flattened_device_views.

@res-life
Collaborator

Fixed the illegal memory access error reported for the struct(string) type:

diff --git a/src/main/cpp/src/hive_hash.cu b/src/main/cpp/src/hive_hash.cu
index ca720b5..5e9bb35 100644
--- a/src/main/cpp/src/hive_hash.cu
+++ b/src/main/cpp/src/hive_hash.cu
@@ -22,6 +22,7 @@
 #include <cudf/structs/structs_column_view.hpp>
 #include <cudf/table/experimental/row_operators.cuh>
 #include <cudf/table/table_device_view.cuh>
+#include <cudf/table/table_device_view.cuh>
 
 #include <rmm/cuda_stream_view.hpp>
 #include <rmm/exec_policy.hpp>
@@ -566,17 +567,9 @@ std::unique_ptr<cudf::column> hive_hash(cudf::table_view const& input,
     }
   }
 
-  std::vector<cudf::column_device_view> device_flattened_column_views;
-  device_flattened_column_views.reserve(flattened_column_views.size());
-
-  std::transform(
-    flattened_column_views.begin(),
-    flattened_column_views.end(),
-    std::back_inserter(device_flattened_column_views),
-    [&stream](auto const& col) { return *cudf::column_device_view::create(col, stream); });
+  [[maybe_unused]] auto [device_view_owners, device_flattened_column_views] =
+    cudf::contiguous_copy_column_device_views<cudf::column_device_view>(flattened_column_views, stream);
 
-  auto flattened_column_device_views =
-    cudf::detail::make_device_uvector_async(device_flattened_column_views, stream, mr);
   auto first_child_index_view =
     cudf::detail::make_device_uvector_async(first_child_index, stream, mr);
   auto nested_column_map_view =
@@ -594,7 +587,7 @@ std::unique_ptr<cudf::column> hive_hash(cudf::table_view const& input,
     output_view.end<hive_hash_value_t>(),
     hive_device_row_hasher<hive_hash_function, bool>(nullable,
                                                      *input_view,
-                                                     flattened_column_device_views.data(),
+                                                     device_flattened_column_views,
                                                      first_child_index_view.data(),
                                                      nested_column_map_view.data()));

Please apply the above patch.

@ttnghia
Collaborator

ttnghia commented Dec 24, 2024

-    [&stream](auto const& col) { return *cudf::column_device_view::create(col, stream); });

Remember to never dereference the output of column_device_view::create or table_device_view::create directly. They return an owning pointer; if we don't store that pointer, it dangles right away, and any reference to it is invalidated.

Comment on lines +572 to +575
auto first_child_index_view =
cudf::detail::make_device_uvector_async(first_child_index, stream, mr);
auto nested_column_map_view =
cudf::detail::make_device_uvector_async(nested_column_map, stream, mr);
Collaborator

The memory resource mr should only be passed to the function that creates the final output (or part of it). These variables are not returned, so we should pass in cudf::get_current_device_resource_ref() instead.

Currently the two are the same, but this is required for the future evolution of libcudf.

Collaborator

And please check whether there are similar situations elsewhere that need the same change.

@ustcfy ustcfy changed the title [Draft] Optimize Hive hash computation for nested types Optimize Hive hash computation for nested types Dec 25, 2024
@sperlingxx
Collaborator

sperlingxx commented Dec 25, 2024

I just came up with a random thought: how about flattening the nested children through a post-order traversal? Meanwhile, replace first_child_index with parent_index, and add a virtual root at the end of flattened_column_views to store the final result. That seems to make it possible to get rid of the stack operations and compute the hash with a simple for-loop, because a post-order traversal guarantees that all children are handled before their shared parents.

Please correct me if I am wrong. And even if this is doable, I am happy to do it in a follow-up PR, since it is a NIT improvement.

@res-life
Collaborator

res-life commented Dec 25, 2024

> (quoting @sperlingxx's comment above)

Good idea for struct/primitive-only types.
But for the list type, e.g. a list(int) column:
row0: [1,2,3]
row1: [1,2]
row0 and row1 have different numbers of items, so we need a per-row stack to record which child has already been processed. Also think about multiply nested lists, like list(list(int)) with data:
[[1,2], [1,2,3]]: hash = hash(hash(1) + hash(2)) + hash(hash(1) + hash(2) + hash(3))
[[1], [1,2]]: hash = hash(hash(1)) + hash(hash(1) + hash(2))

For struct/primitive nested types (without lists), we can use this idea.
We can discuss this further.

@sperlingxx
Collaborator

> (quoting the exchange above)

Yes, you are right! For the list type, it seems impossible to get rid of the stack operations.

Collaborator

@sperlingxx sperlingxx left a comment

LGTM in general, except for some NIT improvements.

top.update_cur_hash(child_hash);
} else {
col_stack[stack_size++] = col_stack_frame(child_col);
col_stack[stack_size++] = col_stack_frame(child_col_idx, curr_row_idx);
Collaborator

IMHO, it is better to update the array element (col_stack_frame) in place instead of creating a new object.

if (--stack_size > 0) { col_stack[stack_size - 1].update_cur_hash(top.get_hash()); }
} else {
// Push the next element into the stack
col_stack[stack_size++] =
col_stack_frame(child_col.slice(top.get_and_inc_idx_to_process(), 1));
col_stack[stack_size++] = col_stack_frame(
Collaborator
Same as above
