
Optimize Hive hash computation for nested types #2720

Draft · wants to merge 8 commits into base: branch-25.02
Conversation

Collaborator

@ustcfy ustcfy commented Dec 23, 2024

Main Optimization:

Flatten nested columns in advance, reducing the size of the stack_frame.

Existing Issue:

  • The optimization's effect depends on the schema, and I'm not sure how to write a benchmark that gives a fair comparison.

TODO:

  • Update comments.

Possible Further Optimization:

  • Move flattened_column_views, first_child_index and nested_column_map to shared memory.
  • Make the various hash functions share the nested-type flattening logic; however, xxhash64 does not need _cur_hash, which makes unification harder.

Benchmark:

  1. Environment: NVIDIA TITAN RTX
  2. Sizes: 50 MB, 100 MB, 250 MB, 500 MB, 1 GB
  • schema: struct<BOOL8, INT32, FLOAT32>

| size_bytes | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---|---|---|---|---|---|---|---|
| 50000000 | 3.063 ms | 15.07% | 1.150 ms | 19.14% | -1913.227 us | -62.47% | FAST |
| 100000000 | 5.933 ms | 0.20% | 1.892 ms | 2.09% | -4041.607 us | -68.12% | FAST |
| 250000000 | 14.623 ms | 0.15% | 4.071 ms | 8.03% | -10551.985 us | -72.16% | FAST |
| 500000000 | 29.164 ms | 0.19% | 6.832 ms | 5.45% | -22331.750 us | -76.57% | FAST |
| 1000000000 | 58.946 ms | 0.09% | 11.506 ms | 0.46% | -47439.712 us | -80.48% | FAST |
  • schema: lists nested to depth list_depth, with INT32 as the leaf type

| size_bytes | list_depth | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---|---|---|---|---|---|---|---|---|
| 50000000 | 1 | 408.665 us | 6.49% | 427.040 us | 4.44% | 18.375 us | 4.50% | SLOW |
| 100000000 | 1 | 619.601 us | 62.44% | 578.274 us | 1.10% | -41.327 us | -6.67% | FAST |
| 250000000 | 1 | 1.304 ms | 28.94% | 1.197 ms | 0.14% | -107.499 us | -8.24% | FAST |
| 500000000 | 1 | 2.424 ms | 0.34% | 2.265 ms | 11.32% | -158.439 us | -6.54% | FAST |
| 1000000000 | 1 | 4.698 ms | 0.18% | 3.666 ms | 0.10% | -1032.364 us | -21.97% | FAST |
| 50000000 | 2 | 888.976 us | 1.05% | 965.358 us | 1.66% | 76.382 us | 8.59% | SLOW |
| 100000000 | 2 | 1.856 ms | 0.64% | 1.957 ms | 1.33% | 101.652 us | 5.48% | SLOW |
| 250000000 | 2 | 4.822 ms | 0.31% | 5.332 ms | 6.01% | 509.677 us | 10.57% | SLOW |
| 500000000 | 2 | 9.576 ms | 0.24% | 10.320 ms | 0.61% | 744.076 us | 7.77% | SLOW |
| 1000000000 | 2 | 19.050 ms | 0.18% | 20.304 ms | 1.62% | 1.254 ms | 6.58% | SLOW |
| 50000000 | 3 | 999.941 us | 0.80% | 919.377 us | 0.50% | -80.564 us | -8.06% | FAST |
| 100000000 | 3 | 1.336 ms | 0.69% | 1.065 ms | 1.17% | -270.761 us | -20.27% | FAST |
| 250000000 | 3 | 4.559 ms | 0.19% | 3.831 ms | 0.41% | -728.397 us | -15.98% | FAST |
| 500000000 | 3 | 12.616 ms | 0.31% | 13.341 ms | 0.43% | 724.816 us | 5.75% | SLOW |
| 1000000000 | 3 | 28.351 ms | 0.37% | 30.913 ms | 0.43% | 2.562 ms | 9.04% | SLOW |
| 50000000 | 4 | 9.388 ms | 0.50% | 8.490 ms | 0.27% | -898.402 us | -9.57% | FAST |
| 100000000 | 4 | 9.561 ms | 0.43% | 8.476 ms | 0.10% | -1084.155 us | -11.34% | FAST |
| 250000000 | 4 | 9.590 ms | 0.18% | 8.493 ms | 0.39% | -1097.442 us | -11.44% | FAST |
| 500000000 | 4 | 10.126 ms | 0.31% | 8.831 ms | 0.83% | -1294.458 us | -12.78% | FAST |
| 1000000000 | 4 | 18.583 ms | 0.87% | 14.239 ms | 0.50% | -4344.140 us | -23.38% | FAST |
| 50000000 | 6 | 200.859 ms | 0.07% | 193.565 ms | 0.01% | -7293.741 us | -3.63% | FAST |
| 100000000 | 6 | 232.922 ms | 0.03% | 212.022 ms | 0.34% | -20899.757 us | -8.97% | FAST |
| 250000000 | 6 | 278.656 ms | 0.49% | 256.458 ms | 0.04% | -22198.219 us | -7.97% | FAST |
| 500000000 | 6 | 353.204 ms | 0.15% | 318.858 ms | 0.07% | -34346.383 us | -9.72% | FAST |
| 1000000000 | 6 | 481.230 ms | 0.06% | 435.179 ms | 0.13% | -46050.776 us | -9.57% | FAST |

@ustcfy ustcfy changed the title Optimize Hive hash computation for nested types [Draft] Optimize Hive hash computation for nested types Dec 23, 2024
Signed-off-by: Yan Feng <[email protected]>
@ustcfy ustcfy requested a review from ttnghia December 23, 2024 08:42
Collaborator

@thirtiseven thirtiseven left a comment

Could you post the benchmark results here?

Maybe also post how it failed on the string type and how to reproduce it, in case you need help with that issue.

#include <nvbench/nvbench.cuh>

constexpr auto min_width = 10;
constexpr auto max_width = 10;
Collaborator

Should these two be removed?

Collaborator Author

The optimization's effect depends on the schema, and I'm not sure how to write a benchmark that gives a fair comparison.

Collaborator

The benchmark should support running either (or both of) lists and structs. You can switch benchmark types using a boolean flag. That flag should not be constexpr, so that all the code is compiled and its syntax verified.

auto const bench_structs = false;
auto const data_profile = [&] {
  if (bench_structs) {
    ...
    .struct_types(...);
    return ...;
  } else {
    ...
    .list_type(...);
    return ...;
  }
}();

src/main/cpp/benchmarks/hash.cu (review thread resolved)
@ustcfy ustcfy requested a review from res-life December 23, 2024 10:17
@ustcfy ustcfy self-assigned this Dec 23, 2024
sperlingxx (comment marked as outdated)

@@ -486,15 +526,77 @@ std::unique_ptr<cudf::column> hive_hash(cudf::table_view const& input,

check_nested_depth(input);

// `flattened_column_views` only contains nested columns and columns that result from flattening
// nested columns
std::vector<cudf::column_view> flattened_column_views;
Collaborator

The column_view constructor calculates the null count, which is time-consuming; the original approach does not need that. We may need to find a way to avoid this column_view array. Please check whether contiguous_copy_column_device_views is helpful here.

// `input` to the index in `flattened_column_views`
std::vector<cudf::size_type> nested_column_map;

for (auto i = 0; i < input.num_columns(); i++) {
Collaborator

Extract a function for this, e.g. get_flattened_device_views.

@res-life
Collaborator

Fixed the illegal memory access error reported for the struct(string) type:

diff --git a/src/main/cpp/src/hive_hash.cu b/src/main/cpp/src/hive_hash.cu
index ca720b5..5e9bb35 100644
--- a/src/main/cpp/src/hive_hash.cu
+++ b/src/main/cpp/src/hive_hash.cu
@@ -22,6 +22,7 @@
 #include <cudf/structs/structs_column_view.hpp>
 #include <cudf/table/experimental/row_operators.cuh>
 #include <cudf/table/table_device_view.cuh>
+#include <cudf/table/table_device_view.cuh>
 
 #include <rmm/cuda_stream_view.hpp>
 #include <rmm/exec_policy.hpp>
@@ -566,17 +567,9 @@ std::unique_ptr<cudf::column> hive_hash(cudf::table_view const& input,
     }
   }
 
-  std::vector<cudf::column_device_view> device_flattened_column_views;
-  device_flattened_column_views.reserve(flattened_column_views.size());
-
-  std::transform(
-    flattened_column_views.begin(),
-    flattened_column_views.end(),
-    std::back_inserter(device_flattened_column_views),
-    [&stream](auto const& col) { return *cudf::column_device_view::create(col, stream); });
+  [[maybe_unused]] auto [device_view_owners, device_flattened_column_views] =
+    cudf::contiguous_copy_column_device_views<cudf::column_device_view>(flattened_column_views, stream);
 
-  auto flattened_column_device_views =
-    cudf::detail::make_device_uvector_async(device_flattened_column_views, stream, mr);
   auto first_child_index_view =
     cudf::detail::make_device_uvector_async(first_child_index, stream, mr);
   auto nested_column_map_view =
@@ -594,7 +587,7 @@ std::unique_ptr<cudf::column> hive_hash(cudf::table_view const& input,
     output_view.end<hive_hash_value_t>(),
     hive_device_row_hasher<hive_hash_function, bool>(nullable,
                                                      *input_view,
-                                                     flattened_column_device_views.data(),
+                                                     device_flattened_column_views,
                                                      first_child_index_view.data(),
                                                      nested_column_map_view.data()));

Please apply the above patch.

@ttnghia
Collaborator

ttnghia commented Dec 24, 2024

-    [&stream](auto const& col) { return *cudf::column_device_view::create(col, stream); });

Remember to never dereference the output of column_device_view::create or table_device_view::create directly. They return an owning pointer; if we don't store that pointer, it dangles right away, and any reference to it is invalidated.

Comment on lines +572 to +575
auto first_child_index_view =
cudf::detail::make_device_uvector_async(first_child_index, stream, mr);
auto nested_column_map_view =
cudf::detail::make_device_uvector_async(nested_column_map, stream, mr);
Collaborator

The memory resource mr should only be passed to the function that creates the final output (or part of it). These variables are not returned, so we should pass in cudf::get_current_device_resource_ref() instead.

Currently the two are the same, but this is required for the future evolution of libcudf.

Collaborator

And please check whether there are similar situations elsewhere that need the same change.

@ustcfy ustcfy changed the title [Draft] Optimize Hive hash computation for nested types Optimize Hive hash computation for nested types Dec 25, 2024
@sperlingxx
Collaborator

sperlingxx commented Dec 25, 2024

I just came up with a random thought: how about flattening the nested children through a post-order traversal? Meanwhile, replace first_child_index with parent_index, and add a virtual root at the end of flattened_column_views to store the final result. That seems to make it possible to get rid of the stack operations and compute the hash with a simple for-loop, because a post-order traversal guarantees that all children are handled before their shared parents.

Please correct me if I am wrong. And even if this is doable, I am happy to do it in a follow-up PR, since it is a NIT improvement.

@res-life
Collaborator

res-life commented Dec 25, 2024

> (quoting @sperlingxx's comment above)

Good idea for struct/primitive-only types.
But for the list type, e.g. a list(int) column:
row0: [1,2,3]
row1: [1,2]
row0 and row1 have different numbers of items, so we need a per-row stack to record which child has already been processed. Also think about multiply nested lists, like list(list(int)) with data:
[[1,2], [1,2,3]]: hash = hash(hash(1) + hash(2)) + hash(hash(1) + hash(2) + hash(3))
[[1], [1,2]]: hash = hash(hash(1)) + hash(hash(1) + hash(2))

For struct/primitive nested types (without lists), we can use this idea.
We can discuss this further.

@sperlingxx
Collaborator

> (quoting the exchange above)

Yes, you are right! For the list type, it seems impossible to get rid of the stack operations.

Collaborator

@sperlingxx sperlingxx left a comment

LGTM in general, except for some NIT improvements.

top.update_cur_hash(child_hash);
} else {
col_stack[stack_size++] = col_stack_frame(child_col);
col_stack[stack_size++] = col_stack_frame(child_col_idx, curr_row_idx);
Collaborator

IMHO, it is better to update the array element (col_stack_frame) in place instead of creating a new object.

if (--stack_size > 0) { col_stack[stack_size - 1].update_cur_hash(top.get_hash()); }
} else {
// Push the next element into the stack
col_stack[stack_size++] =
col_stack_frame(child_col.slice(top.get_and_inc_idx_to_process(), 1));
col_stack[stack_size++] = col_stack_frame(
Collaborator
Same as above
