Add HiveHash support for nested types #2534
Signed-off-by: ustcfy <[email protected]>
5944002 to 12e83e7
build
HiveHash support for nested types
Signed-off-by: ustcfy <[email protected]>
Hi @ttnghia, could you please take a look? Thanks in advance.
Signed-off-by: ustcfy <[email protected]>
src/main/cpp/src/hive_hash.cu
Outdated
*/
class col_stack_element {
 private:
  cudf::column_device_view _column;  // current column
Instead of storing a column in each stack element, you can first flatten the input nested column into a table (array of columns), then here just store the index of the column in that array. See cudf::experimental::row::lexicographic::preprocessed_table for an example of such a flattened table.
Some preprocessing is needed so we can retrieve the indices of the child columns in the flattened table. My initial idea is to flatten the columns level-by-level, then maintain a first_child_index array along with a num_children array. For example, with input STRUCT<INT, STRUCT<INT, STRING>>:
flattened_table = [STRUCT<>, INT, STRUCT<>, INT, STRING]
first_child_index = [1, -1, 3, -1, -1]
num_children = [2, 0, 2, 0, 0]
Then you can iterate through the columns, starting with col_index at 0.
By using an int to store the column index instead of a column_device_view, we can significantly reduce memory usage and thus allow a much larger stack.
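To make the suggestion concrete, here is a minimal host-side sketch of the level-by-level flattening and of a traversal whose stack holds only int column indices. It is not cudf code: type_node, flattened_columns, flatten_level_by_level and depth_first_order are hypothetical names invented for this illustration, and a real implementation would operate on cudf column views on the device.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a nested column type; not a cudf class.
struct type_node {
  std::string name;                 // e.g. "STRUCT<>", "INT", "STRING"
  std::vector<type_node> children;  // empty for leaf columns
};

// Output of the flattening: three parallel arrays indexed by column index.
struct flattened_columns {
  std::vector<std::string> names;
  std::vector<int> first_child_index;  // -1 for columns without children
  std::vector<int> num_children;
};

// Flatten level by level (breadth first), so that the children of any
// column occupy a contiguous index range starting at first_child_index.
flattened_columns flatten_level_by_level(type_node const& root)
{
  flattened_columns out;
  std::vector<type_node const*> order{&root};  // doubles as the BFS queue
  for (std::size_t i = 0; i < order.size(); ++i) {
    type_node const* node = order[i];
    out.names.push_back(node->name);
    out.num_children.push_back(static_cast<int>(node->children.size()));
    // The children are appended next, so the first one lands at order.size().
    out.first_child_index.push_back(
      node->children.empty() ? -1 : static_cast<int>(order.size()));
    for (type_node const& child : node->children) { order.push_back(&child); }
  }
  return out;
}

// Depth-first walk starting at col_index 0. Each stack entry is a plain int
// rather than a column view, which is the memory saving described above.
std::vector<int> depth_first_order(flattened_columns const& flat)
{
  std::vector<int> visited;
  std::vector<int> stack{0};
  while (!stack.empty()) {
    int const col_index = stack.back();
    stack.pop_back();
    visited.push_back(col_index);
    // Push children in reverse so the first child is visited first.
    for (int c = flat.num_children[col_index] - 1; c >= 0; --c) {
      stack.push_back(flat.first_child_index[col_index] + c);
    }
  }
  return visited;
}
```

For STRUCT<INT, STRUCT<INT, STRING>>, this sketch reproduces the first_child_index and num_children arrays listed above.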
Check out cudf/cpp/src/table/row_operators.cu (cudf::experimental::row::lexicographic::preprocessed_table::create) and cudf/cpp/src/structs/utilities.cpp (cudf::structs::detail::flatten_nested_columns) for example code that flattens a table.
This approach may require significant changes to the existing framework, as the calculations for these 5 columns are not independent.
Yeah! But as discussed, I'm fine with deferring this to follow-up work so that we can get this PR into 24.12.
Co-authored-by: Nghia Truong <[email protected]>
Signed-off-by: Yan Feng <[email protected]>
Great work!
build
This PR is based on wjxiz1992#9.
This PR adds support for Hive hash with struct and list types.
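
As a rough illustration of what hashing nested values involves, here is a host-side sketch that is not taken from this PR's code. It assumes the usual Hive combine rule for both struct fields and list elements, result = 31 * result + hash(child), with a null hashing to 0 and an INT hashing to its own value. The names hive_value, make_int, make_nested and hive_hash are hypothetical, and the recursion is used only for brevity.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical host-side value model; not a cudf or spark-rapids-jni type.
struct hive_value {
  enum class kind { null_value, int32, nested };
  kind type = kind::null_value;
  std::int32_t int_value = 0;
  std::vector<hive_value> children;  // struct fields or list elements
};

hive_value make_int(std::int32_t v)
{
  hive_value out;
  out.type      = hive_value::kind::int32;
  out.int_value = v;
  return out;
}

hive_value make_nested(std::vector<hive_value> children)
{
  hive_value out;
  out.type     = hive_value::kind::nested;
  out.children = std::move(children);
  return out;
}

// Assumed rule: null -> 0, INT -> the value itself, and struct/list ->
// fold the children with h = 31 * h + hash(child).
std::int32_t hive_hash(hive_value const& v)
{
  switch (v.type) {
    case hive_value::kind::null_value: return 0;
    case hive_value::kind::int32: return v.int_value;
    case hive_value::kind::nested: {
      // Unsigned arithmetic gives 32-bit wraparound without signed overflow.
      std::uint32_t h = 0;
      for (hive_value const& child : v.children) {
        h = 31u * h + static_cast<std::uint32_t>(hive_hash(child));
      }
      // Wraps modulo 2^32 on mainstream compilers (guaranteed from C++20).
      return static_cast<std::int32_t>(h);
    }
  }
  return 0;  // unreachable; keeps compilers from warning
}
```

Under these assumptions a struct holding (1, 2) hashes to 31 * 1 + 2 = 33.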