Occupancy improvement for Hash table build #15700

tgujar · 2024-05-08T01:43:42Z

Description

Implements specialized template dispatch for hash joins and mixed semi joins to fix the issue described in #15502.

At a high level, this PR typedefs some types to void, depending on the column types in the rows, to avoid the high register usage of comparator and hasher operations for more involved types (lists, structs, strings, ...). This is done by dynamic dispatch on the CPU side using std::variant and std::visit, which then dispatches to a specialized template.

This pattern can later be extended to other joins and to the groupby operation. Any operator using the row hasher and row comparator should see an improvement in occupancy for hash table build/probe operations.
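As a rough illustration of the host-side dispatch described above, here is a minimal sketch. The tag types `primitive_only` and `with_nested`, and the functions `build_hash_table`, `select_tag`, and `dispatch`, are hypothetical names invented for this example and are not taken from the PR; the string-returning functions stand in for kernel launches.

```cpp
#include <cassert>
#include <string>
#include <variant>

// Hypothetical tags naming which column types a kernel must support.
struct primitive_only {};
struct with_nested {};

// Stand-in for a templated kernel launch. Each tag selects its own
// specialization, so the primitive path never instantiates nested-type code.
template <typename Tag>
std::string build_hash_table();

template <>
std::string build_hash_table<primitive_only>()
{
  return "primitive kernel";
}

template <>
std::string build_hash_table<with_nested>()
{
  return "nested kernel";
}

using dispatch_tag = std::variant<primitive_only, with_nested>;

// The table schema is inspected at runtime on the host to pick a tag.
dispatch_tag select_tag(bool has_nested_columns)
{
  if (has_nested_columns) { return with_nested{}; }
  return primitive_only{};
}

// std::visit turns the runtime choice into a compile-time template argument.
std::string dispatch(bool has_nested_columns)
{
  return std::visit([](auto tag) { return build_hash_table<decltype(tag)>(); },
                    select_tag(has_nested_columns));
}
```

Calling `dispatch(false)` takes the primitive-only path and `dispatch(true)` takes the nested path. The point of the design is that the runtime decision happens once on the host, while each specialization is compiled separately.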

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2024-05-08T01:43:47Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

tgujar · 2024-05-08T01:55:21Z

I think the approach of specializing the type dispatcher is very cumbersome and will lead to a lot of code replication. Currently, I have the conditional dispatch working for device_row_hasher but I am unsure if there is a better way to implement this. We could introduce a macro here to generate the code, what do you think?

PointKernel · 2024-05-08T19:04:21Z

/ok to test

PointKernel · 2024-05-14T19:45:57Z

/ok to test

PointKernel · 2024-05-14T19:49:36Z

@tgujar I've updated the docs to unblock CI. Have you noticed any performance regressions for other use cases? It seems that it improves the performance for mixed join but the performance drops significantly in other cases using row hasher.

ttnghia · 2024-05-14T20:23:44Z

cpp/src/join/mixed_join_common_utils.cuh

+                                                          id_to_type<type_id::DECIMAL128>,
+                                                          id_to_type<type_id::DECIMAL64>,
+                                                          id_to_type<type_id::DECIMAL32>,


I don't think decimal types are complex types. They are just wrappers around an integer type.

Equality operator for Decimal will perform scaling which uses exponentiation.

cudf/cpp/include/cudf/fixed_point/fixed_point.hpp

Line 735 in 888e9d5

CUDF_HOST_DEVICE inline bool operator==(fixed_point<Rep1, Rad1> const& lhs,

I see a reduction in register usage if I comment out decimal types in #15502. I think we can still decide on the types excluded in the branches later on

Let me know if we could resolve this. I have addressed this here #15700 (comment)

PointKernel · 2024-05-16T02:56:52Z

/ok to test

PointKernel · 2024-05-16T14:55:44Z

@tgujar Could you take a look at the failing tests?

PointKernel · 2024-05-17T17:57:22Z

/ok to test

PointKernel · 2024-05-21T16:02:15Z

/ok to test

cpp/include/cudf/table/experimental/row_operators.cuh

davidwendt · 2024-05-30T14:35:12Z

This PR needs to be rebased on branch-24.08.

tgujar · 2024-05-30T14:36:00Z

Specializing both the comparator and the hasher drops the register usage to 54 instead of the expected 46 for the mixed semi join case. Investigating why the register pressure is different from commenting out the code paths.
The current plan is to avoid using a macro (as mentioned here) and instead do dynamic dispatch on the CPU side using std::variant and std::visit.
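To see why 54 versus 46 registers per thread matters, here is a back-of-the-envelope occupancy estimate. The hardware limits below (65536 registers per SM, 64 resident warps per SM, per-thread register counts rounded up to a multiple of 8) are assumptions for illustration and vary by GPU architecture. The model also ignores block size and shared memory, which can lower occupancy further.

```cpp
#include <cassert>

// Assumed hardware limits (illustrative; check the target GPU's specifications).
constexpr int registers_per_sm     = 65536;  // 32-bit registers per SM
constexpr int max_warps_per_sm     = 64;     // resident warp limit per SM
constexpr int threads_per_warp     = 32;
constexpr int register_alloc_round = 8;      // per-thread rounding granularity

// Upper bound on resident warps per SM imposed by register usage alone.
constexpr int warps_limited_by_registers(int registers_per_thread)
{
  int const rounded =
    ((registers_per_thread + register_alloc_round - 1) / register_alloc_round) *
    register_alloc_round;
  int const registers_per_warp = rounded * threads_per_warp;
  int const warps              = registers_per_sm / registers_per_warp;
  return warps < max_warps_per_sm ? warps : max_warps_per_sm;
}

// Register-limited occupancy as a fraction of the resident warp limit.
constexpr double register_limited_occupancy(int registers_per_thread)
{
  return static_cast<double>(warps_limited_by_registers(registers_per_thread)) /
         max_warps_per_sm;
}
```

Under these assumptions, 54 registers per thread allows 36 resident warps (about 56% occupancy), while 46 registers allows 42 warps (about 66%).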

ttnghia · 2024-08-02T04:49:29Z

/ok to test

davidwendt · 2024-08-02T13:10:37Z

/ok to test

PointKernel · 2024-08-23T00:48:54Z

/ok to test

cpp/include/cudf/detail/distinct_hash_join.cuh

PointKernel · 2024-08-26T18:18:18Z

cpp/include/cudf/table/experimental/row_operators.cuh

+                                      id_to_type<type_id::DECIMAL128>,
+                                      id_to_type<type_id::DECIMAL64>,
+                                      id_to_type<type_id::DECIMAL32>,


Decimals are fixed-width types, not compound types. If this is intended, we probably want to use another term here.

Out of curiosity, will removing the decimal types from the current compound definition impact performance?

I believe row-operators may benefit from being dispatched using the dispatch_storage_type defined here:

cudf/cpp/include/cudf/utilities/type_dispatcher.hpp

Line 212 in 44a9c10

struct dispatch_storage_type {

Since we do not compare columns of different types, the scale values are the same, so comparing decimals/fixed-point values on their storage type gives the same result with less generated code.
Here are a few examples of dispatching with the dispatch_storage_type:

cudf/cpp/src/sort/sort_column.cu

Line 46 in 44a9c10

cudf::type_dispatcher<dispatch_storage_type>(input.type(),

cudf/cpp/include/cudf/detail/gather.cuh

Line 664 in 44a9c10

cudf::type_dispatcher<dispatch_storage_type>(source_column.type(),

Maybe this can be combined with the dispatch mechanism here and eliminate any need for special handling for fixed-point types.
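The storage-type argument can be illustrated with a simplified model. The `toy_decimal` type and the helper functions below are hypothetical and are not cuDF's `fixed_point` implementation. They only show that a general equality must rescale its operands, while a same-scale comparison reduces to comparing the underlying integers. Overflow during rescaling is ignored in this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified decimal: value = rep * 10^scale.
struct toy_decimal {
  std::int64_t rep;
  int scale;
};

// Multiplies v by 10^exp; stands in for the exponentiation that the
// general comparison needs when scales differ.
constexpr std::int64_t times_pow10(std::int64_t v, int exp)
{
  for (int i = 0; i < exp; ++i) { v *= 10; }
  return v;
}

// General equality: bring both operands to the smaller scale first.
constexpr bool equal_general(toy_decimal a, toy_decimal b)
{
  int const common = a.scale < b.scale ? a.scale : b.scale;
  return times_pow10(a.rep, a.scale - common) == times_pow10(b.rep, b.scale - common);
}

// Storage-type equality: only valid when both operands share one scale,
// which holds for two values drawn from columns of the same type.
constexpr bool equal_storage(std::int64_t a_rep, std::int64_t b_rep)
{
  return a_rep == b_rep;
}
```

For example, 1.50 stored as `{150, -2}` equals 1.5 stored as `{15, -1}` only through the general path. When both values carry scale -2, comparing `150 == 150` directly gives the same answer without any rescaling code.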

I measured this by selectively commenting out the types in type_dispatcher.hpp, and I see a non-trivial speedup. In both of the measured benchmarks, the other complex types (list, struct, string, dict) are commented out.
I think for future work it would be good to see if we are still latency bound.
Decimal effect on Hash join occupancy - Sheet1.csv

(Speedup numbers in sheet because I learnt later that nvbench compare script only supports json format)

cpp/include/cudf/table/experimental/row_operators.cuh

PointKernel · 2024-08-26T18:38:26Z

cpp/src/join/mixed_join_kernels_semi_compound.cu

@@ -0,0 +1,44 @@
+/*
+ * Copyright (c) 2022-2024, NVIDIA CORPORATION.


Suggested change

* Copyright (c) 2022-2024, NVIDIA CORPORATION.

* Copyright (c) 2024, NVIDIA CORPORATION.

cpp/src/join/mixed_join_kernels_semi_nested.cu

cpp/src/join/mixed_join_semi.cu

PointKernel · 2024-08-26T18:53:28Z

@tgujar can you please resolve the merge conflicts against ToT? The build time still appears to be an issue, but we need a successful CI run to confirm.

bdice

Comments attached. Thanks for this! There's a lot of heavy templating but it's fairly readable in spite of that.

I am also interested in build time comparisons to the previous code.

bdice · 2024-08-26T21:17:16Z

cpp/CMakeLists.txt

  src/join/mixed_join_semi.cu
+  src/join/mixed_join_kernels_semi.cu


Keep these filenames alphabetized. If you like, you could rename this to mixed_join_semi_kernels.cu.

bdice · 2024-08-26T21:19:36Z

cpp/include/cudf/detail/distinct_hash_join.cuh

@@ -109,8 +130,8 @@ struct distinct_hash_join {
  std::shared_ptr<cudf::experimental::row::equality::preprocessed_table>
    _preprocessed_build;  ///< input table preprocssed for row operators
  std::shared_ptr<cudf::experimental::row::equality::preprocessed_table>
-    _preprocessed_probe;        ///< input table preprocssed for row operators
-  hash_table_type _hash_table;  ///< hash table built on `_build`
+    _preprocessed_probe;                         ///< input table preprocssed for row operators


Suggested change

_preprocessed_probe; ///< input table preprocssed for row operators

_preprocessed_probe; ///< input table preprocessed for row operators

cpp/include/cudf/table/experimental/row_operators.cuh

bdice · 2024-08-26T21:22:03Z

cpp/include/cudf/table/experimental/row_operators.cuh

+struct dispatch_void_conditional_generator {
+  /// The underlying type
+  template <typename T>
+  using type = dispatch_void_conditional_t<std::disjunction<std::is_same<T, Types>...>::value, T>;


Suggested change

using type = dispatch_void_conditional_t<std::disjunction<std::is_same<T, Types>...>::value, T>;

using type = dispatch_void_conditional_t<std::disjunction_v<std::is_same<T, Types>...>, T>;
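For context, here is a self-contained sketch of how such a generator can behave. The definition of `dispatch_void_conditional_t` below is an assumption inferred from how it is used in the diff (a bool flag plus a type), and `hide_expensive` is a hypothetical alias added for illustration.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <vector>

// Assumed definition: maps T to void when the flag is set, else keeps T.
template <bool hide, typename T>
using dispatch_void_conditional_t = std::conditional_t<hide, void, T>;

// Hides every type listed in Types... by mapping it to void.
template <typename... Types>
struct dispatch_void_conditional_generator {
  /// The underlying type
  template <typename T>
  using type = dispatch_void_conditional_t<std::disjunction_v<std::is_same<T, Types>...>, T>;
};

// Hypothetical usage: hide two "expensive" types and keep everything else.
using hide_expensive = dispatch_void_conditional_generator<std::string, std::vector<int>>;

static_assert(std::is_void_v<hide_expensive::type<std::string>>);
static_assert(std::is_same_v<hide_expensive::type<int>, int>);
```

`std::disjunction_v<...>` is the C++17 variable-template shorthand for `std::disjunction<...>::value`, so the suggested change is purely a readability improvement.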

bdice · 2024-08-26T21:28:21Z

cpp/include/cudf/table/experimental/row_operators.cuh

-  /// The type to dispatch to if the type is nested
-  using type = std::conditional_t<t == type_id::STRUCT or t == type_id::LIST, void, id_to_type<t>>;
+  /// The underlying type
+  using type = dispatch_void_if_nested_t<id_to_type<t>>;


Typically we define things the other way -- define the dispatch_void_if_nested struct, then define using dispatch_void_if_nested_t in terms of the ::type member of that struct.

Okay yep, but I think here I need dispatch_void_if_nested to be templated on cudf::type_id but I need dispatch_void_if_nested_t to be templated on some type T. Maybe they should be named differently?
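One possible way to reconcile the two requirements is sketched below. It uses toy stand-ins for `cudf::type_id` and `cudf::id_to_type`, and the names `void_if_nested` and `void_if_nested_t` are hypothetical, chosen only to show a type-keyed trait defined struct-first alongside an id-keyed dispatcher with a different name.

```cpp
#include <cassert>
#include <type_traits>

// Toy stand-ins for cudf::type_id and cudf::id_to_type (hypothetical).
enum class type_id { INT32, STRUCT, LIST };
struct struct_view {};
struct list_view {};

template <type_id t>
struct id_to_type_impl;
template <>
struct id_to_type_impl<type_id::INT32> { using type = int; };
template <>
struct id_to_type_impl<type_id::STRUCT> { using type = struct_view; };
template <>
struct id_to_type_impl<type_id::LIST> { using type = list_view; };
template <type_id t>
using id_to_type = typename id_to_type_impl<t>::type;

// Type-keyed trait: the struct comes first, then the _t alias via ::type.
template <typename T>
struct void_if_nested {
  using type =
    std::conditional_t<std::is_same_v<T, struct_view> || std::is_same_v<T, list_view>, void, T>;
};
template <typename T>
using void_if_nested_t = typename void_if_nested<T>::type;

// Id-keyed dispatcher under a different name, reusing the type-keyed trait.
template <type_id t>
struct dispatch_void_if_nested {
  using type = void_if_nested_t<id_to_type<t>>;
};
```

With distinct names, the `_t` alias follows the usual convention for its own struct, and the id-keyed dispatcher stays a thin wrapper around it.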

bdice · 2024-08-26T22:29:39Z

cpp/src/join/distinct_hash_join.cu

+
+        auto const output_begin =
+          thrust::make_transform_output_iterator(build_indices->begin(), output_fn{});
+        // TODO conditional find for nulls once `cuco::static_set::find_if` is added


This feature now exists in cuCollections, I think. Let's refactor if we can.

I think maybe we could address this in a separate MR since the change wouldn't reflect this MR description. What do you think?

cpp/src/join/mixed_join_kernel_semi_impl.cuh

cpp/src/join/mixed_join_kernels_semi.cuh

cpp/src/join/mixed_join_kernels_semi_compound.cu

tgujar · 2024-08-27T17:33:22Z

@tgujar can you please resolve the merge conflicts against ToT? The build time still appears to be an issue, but we need a successful CI run to confirm.

Unsure how to handle this. #16603 says that we would like the launch and compilation to happen in the same TU for CUDA whole compilation mode. For this PR, that means all instantiations of the kernels happen in the same TU. But this PR splits the instantiations to reduce compilation time for the mixed semi join kernels. I think multiple launch functions wouldn't be a good design.

robertmaynard

This PR reintroduces the ODR violations that were corrected in #16603.

It needs to be refactored so that all kernels are only launched from the TU that holds the implementation.

robertmaynard · 2024-08-27T18:49:42Z

@tgujar can you please resolve the merge conflicts against ToT? The build time still appears to be an issue, but we need a successful CI run to confirm.

Unsure how to handle this. #16603 says that we would like the launch and compilation to happen in the same TU for CUDA whole compilation mode. For this PR, that means all instantiations of the kernels happen in the same TU. But this PR splits the instantiations to reduce compilation time for the mixed semi join kernels. I think multiple launch functions wouldn't be a good design.

You should be able to follow the updated pattern seen in cpp/src/join/mixed_join_kernel_nulls.cu, cpp/src/join/mixed_join_kernel.cu, cpp/src/join/mixed_join_kernel.cuh, and cpp/src/join/mixed_join_kernel.hpp.

That restructuring gives us separate TUs for the mixed join kernel based on the nullability of the input. This was done by having the intermediate host launch code have a specialization in each TU.
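The shape of that pattern can be sketched in plain C++. The file names in the comments and the functions `launch_join`, `kernel_with_nulls`, `kernel_without_nulls`, and `join` are hypothetical, and the string-returning functions stand in for CUDA kernels. Everything is shown in one file here only so the sketch compiles on its own.

```cpp
#include <cassert>
#include <string>

// --- shared header (hypothetical launch.hpp) ---
// Callers see only the launch function, never the kernel itself.
template <bool has_nulls>
std::string launch_join(int num_rows);
template <>
std::string launch_join<true>(int num_rows);
template <>
std::string launch_join<false>(int num_rows);

// --- first TU (hypothetical launch_nulls.cpp) ---
// The kernel and the host code that launches it are defined together.
namespace {
std::string kernel_with_nulls(int n) { return "nulls:" + std::to_string(n); }
}  // namespace
template <>
std::string launch_join<true>(int num_rows)
{
  return kernel_with_nulls(num_rows);
}

// --- second TU (hypothetical launch_no_nulls.cpp) ---
namespace {
std::string kernel_without_nulls(int n) { return "no_nulls:" + std::to_string(n); }
}  // namespace
template <>
std::string launch_join<false>(int num_rows)
{
  return kernel_without_nulls(num_rows);
}

// --- caller TU ---
// Chooses a specialization at runtime without instantiating any kernel.
std::string join(bool has_nulls, int num_rows)
{
  return has_nulls ? launch_join<true>(num_rows) : launch_join<false>(num_rows);
}
```

Each kernel keeps internal linkage and is only referenced from the TU that defines it, so the split still reduces per-file compile time without launching a kernel across TU boundaries.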

tgujar · 2024-11-09T03:07:31Z

Splitting this MR so it's easier to review and merge.

github-actions bot added the libcudf label May 8, 2024

PointKernel added the non-breaking, 3 - Ready for Review, improvement, and Performance labels May 8, 2024

ttnghia reviewed May 14, 2024

davidwendt reviewed May 30, 2024

tgujar and others added 11 commits June 3, 2024 07:57

nested template instantiation for hiding types

3df7dce

hasher conditional type dispatch works

088042c

delete dead comment block

34bad1e

Fix docs

40f291e

fix type logic, minor refactor

b56bf75

refactor

078f53d

added template specialization for equality comparator

14785e4

added template specialized calls to comparator

b88b60b

fix for register usage discrepancy

1b89198

fix for register usage discrepancy

ca63201

revert edited comment blocks

ff5e0d4

tgujar force-pushed the hash-occupancy branch from 347fb02 to ff5e0d4 Compare June 3, 2024 16:13

tgujar and others added 2 commits August 1, 2024 12:34

Merge branch 'distinct-join-occupancy' into hash-occupancy

b30f6ed

Merge branch 'branch-24.10' into hash-occupancy

621e29f

address review comments

970a829

GregoryKimball mentioned this pull request Aug 16, 2024

Refactor distinct using static_map insert_or_apply #16484

Open

3 tasks

tgujar and others added 2 commits August 22, 2024 17:13

Merge branch 'branch-24.10' into hash-occupancy

d063dbc

Merge branch 'branch-24.10' into hash-occupancy

50341bd

fix issue with distinct hash join

470d6e2

PointKernel reviewed Aug 26, 2024

bdice reviewed Aug 26, 2024

robertmaynard requested changes Aug 27, 2024

tgujar added 11 commits September 4, 2024 00:27

address review comments

57c9b5e

refactor find_any

8e6e6e5

Merge branch 'branch-24.10' into hash-occupancy

9cd1bda

Merge branch 'branch-24.10' into hash-occupancy

7d48f52

remove redundant SFINAE check

728478f

use distance from thrust namespace

926342f

update docs

d82754d

fix spelling

c78da6b

merge branch-24.10, needs fixes

daa2b40

fail with instantiating correct type

30ab4e3

fix issue with constness

f1db848

GregoryKimball mentioned this pull request Oct 29, 2024

[FEA] Investigate fast-path for hash joins that bypasses row operators #16026

Open

tgujar mentioned this pull request Nov 9, 2024

Occupancy improvement for distinct hash join with specialized dispatch #17290

Open

3 tasks
