-
Notifications
You must be signed in to change notification settings - Fork 933
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Occupancy improvement for Hash table build #15700
base: branch-24.10
Are you sure you want to change the base?
Conversation
I think the approach of specializing the type dispatcher is very cumbersome and will lead to a lot of code replication. Currently, I have the conditional dispatch working for |
/ok to test |
1 similar comment
/ok to test |
@tgujar I've updated the docs to unblock CI. Have you noticed any performance regressions for other use cases? It seems that it improves the performance for mixed join but the performance drops significantly in other cases using row hasher. |
id_to_type<type_id::DECIMAL128>, | ||
id_to_type<type_id::DECIMAL64>, | ||
id_to_type<type_id::DECIMAL32>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think decimal types are complex type. They are just a wrapper around some integer type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Equality operator for Decimal will perform scaling which uses exponentiation.
CUDF_HOST_DEVICE inline bool operator==(fixed_point<Rep1, Rad1> const& lhs, |
I see a reduction in register usage if I comment out decimal types in #15502. I think we can still decide on the types excluded in the branches later on
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me know if we could resolve this. I have addressed this here #15700 (comment)
/ok to test |
@tgujar Could you take a look at the failing tests? |
/ok to test |
1 similar comment
/ok to test |
This PR needs to be rebased on branch-24.08. |
Specializing both the comparator and the hasher drops the register usage to 54 instead of the expected 46 for the mixed semi join case. Investigating why the register pressure is different from commenting out the code paths. |
/ok to test |
/ok to test |
/ok to test |
id_to_type<type_id::DECIMAL128>, | ||
id_to_type<type_id::DECIMAL64>, | ||
id_to_type<type_id::DECIMAL32>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
decimals are fixed-width types instead of compound. If this is intended, we probably want to use another term here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity, Will removing the decimal from the current compound definition impact performance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe row-operators may benefit being dispatched using the dispatch_storage_type
defined here:
struct dispatch_storage_type { |
Since we do not compare columns of different types, the scale values are the same and so the compare on storage type for decimals/fixed-point will have the same result with less generated code.
Here are a few examples of dispatching with the
dispatch_storage_type
:cudf/cpp/src/sort/sort_column.cu
Line 46 in 44a9c10
cudf::type_dispatcher<dispatch_storage_type>(input.type(), |
cudf/cpp/include/cudf/detail/gather.cuh
Line 664 in 44a9c10
cudf::type_dispatcher<dispatch_storage_type>(source_column.type(), |
Maybe this can be combined with the dispatch mechanism here and eliminate any need for special handling for fixed-point types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I measured this by selectively commenting out the types in type_dispatcher.hpp
and I see non-trivial speedup. In both the measured benchmarks, other complex types (list, struct, string, dict) are commented out.
I think for future work it would be good to see if we are still latency bound.
Decimal effect on Hash join occupancy - Sheet1.csv
(Speedup numbers in sheet because I learnt later that nvbench compare script only supports json format)
@@ -0,0 +1,44 @@ | |||
/* | |||
* Copyright (c) 2022-2024, NVIDIA CORPORATION. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Copyright (c) 2022-2024, NVIDIA CORPORATION. | |
* Copyright (c) 2024, NVIDIA CORPORATION. |
@tgujar can you please resolve the merge conflicts against ToT? The build time still appears to be an issue, but we need a successful CI run to confirm. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comments attached. Thanks for this! There's a lot of heavy templating but it's fairly readable in spite of that.
I am also interested in build time comparisons to the previous code.
cpp/CMakeLists.txt
Outdated
src/join/mixed_join_semi.cu | ||
src/join/mixed_join_kernels_semi.cu |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keep these filenames alphabetized. If you like, you could rename this to mixed_join_semi_kernels.cu
.
@@ -109,8 +130,8 @@ struct distinct_hash_join { | |||
std::shared_ptr<cudf::experimental::row::equality::preprocessed_table> | |||
_preprocessed_build; ///< input table preprocssed for row operators | |||
std::shared_ptr<cudf::experimental::row::equality::preprocessed_table> | |||
_preprocessed_probe; ///< input table preprocssed for row operators | |||
hash_table_type _hash_table; ///< hash table built on `_build` | |||
_preprocessed_probe; ///< input table preprocssed for row operators |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_preprocessed_probe; ///< input table preprocssed for row operators | |
_preprocessed_probe; ///< input table preprocessed for row operators |
struct dispatch_void_conditional_generator { | ||
/// The underlying type | ||
template <typename T> | ||
using type = dispatch_void_conditional_t<std::disjunction<std::is_same<T, Types>...>::value, T>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using type = dispatch_void_conditional_t<std::disjunction<std::is_same<T, Types>...>::value, T>; | |
using type = dispatch_void_conditional_t<std::disjunction_v<std::is_same<T, Types>...>, T>; |
/// The type to dispatch to if the type is nested | ||
using type = std::conditional_t<t == type_id::STRUCT or t == type_id::LIST, void, id_to_type<t>>; | ||
/// The underlying type | ||
using type = dispatch_void_if_nested_t<id_to_type<t>>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typically we define things the other way -- define the dispatch_void_if_nested
struct, then define using dispatch_void_if_nested_t
in terms of the ::type
member of that struct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay yep, but I think here I need dispatch_void_if_nested
to be templated on cudf::type_id
but I need dispatch_void_if_nested_t
to be templated on some type T. Maybe they should be named differently?
|
||
auto const output_begin = | ||
thrust::make_transform_output_iterator(build_indices->begin(), output_fn{}); | ||
// TODO conditional find for nulls once `cuco::static_set::find_if` is added |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feature now exists in cuCollections, I think. Let's refactor if we can.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think maybe we could address this in a separate MR since the change wouldn't reflect this MR description. What do you think?
Unsure how to handle this. #16603 says that we would like the launch and compilation to happen in the same TU for CUDA whole compilation mode. In this PR case, it means that all the instantiation of the kernels happen in same TU. But we split the instantiation in this PR to reduce compilation time for mixed semi join kernels. I think multiple launch functions wouldn't be good design. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR breaks ODR violations as corrected in #16603.
It needs to be refactored so that all kernels are only launched from the TU that holds the implemenation.
You should be able to follow the updated pattern seen in That restructing has us separate TU's for the mixed join kernel based on the nullability of the input. This was done by having the intermidate host launch code have a specilization in each TU. |
Splitting this MR so its easier to review and merge. |
Description
Implements specialized template dispatch for hash joins and mixed semi joins to fix issue describes in #15502.
At a high level, this PR typedef's some types to void depending on the column types in the row's to avoid high register usage for comparator and hasher operations associated with more involved types (lists, structs, string, ...). This is done by dynamic dispatch on CPU side using std::variant+std::visit and dispatching with a specialized template.
This pattern can later be extended to other joins and also to groupby operation. Any operator using row hasher and row comparator should be able to see and improvement in occupancy for hash table build/probe operation.
Checklist