Add HiveHash support for nested types #2534

Merged
merged 26 commits into from
Nov 25, 2024

Conversation

ustcfy
Collaborator

@ustcfy ustcfy commented Oct 24, 2024

This PR is based on wjxiz1992#9.

This PR adds support for Hive hash with struct and list types.

@ustcfy ustcfy marked this pull request as ready for review October 25, 2024 10:13
@ustcfy ustcfy marked this pull request as draft October 28, 2024 09:19
@ustcfy ustcfy force-pushed the hivehash-nested-support branch from 5944002 to 12e83e7 Compare October 29, 2024 07:41
@ustcfy ustcfy requested a review from nvdbaranec October 29, 2024 07:43
@ustcfy ustcfy self-assigned this Oct 29, 2024
@ustcfy ustcfy requested a review from ttnghia October 29, 2024 07:48
@firestarman
Collaborator

build

@firestarman firestarman marked this pull request as ready for review October 30, 2024 08:09
@ustcfy ustcfy changed the title Add HiveHash support for nested types Add HiveHash support for nested types Nov 6, 2024
Signed-off-by: ustcfy <[email protected]>
@firestarman firestarman requested a review from res-life November 7, 2024 08:51
@firestarman
Collaborator

Hi @ttnghia, could you help take a look? Thanks in advance.

*/
class col_stack_element {
private:
cudf::column_device_view _column; // current column
Collaborator

@ttnghia ttnghia Nov 8, 2024

Instead of storing a column in each stack element, you can first flatten the input nested column into a table (an array of columns), then store only the index of the column in that array here. See cudf::experimental::row::lexicographic::preprocessed_table for an example of such a flattened table.

Some preprocessing is needed so that we can look up the indices of the child columns in the flattened array. My initial idea is to flatten the columns level by level, then maintain a first_child_index array along with a num_children array. For example, with input STRUCT<INT, STRUCT<INT, STRING>>:

flattened_table = [STRUCT<>, INT, STRUCT<>, INT, STRING]
first_child_index = [1, -1, 3, -1, -1]
num_children = [2, 0, 2, 0, 0]

Then you can start iterating through the columns starting with col_index from 0.
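
A minimal sketch of the level-by-level flattening described above, not the cudf implementation. `TypeNode`, `FlattenedTable`, and `flatten_level_by_level` are hypothetical names introduced only for illustration; a real version would operate on cudf column views rather than strings.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a nested column; NOT a cudf type.
struct TypeNode {
  std::string name;
  std::vector<TypeNode> children;
};

// Hypothetical result layout mirroring the three arrays in the example.
struct FlattenedTable {
  std::vector<std::string> columns;    // flattened columns, level by level
  std::vector<int> first_child_index;  // index of the first child, or -1
  std::vector<int> num_children;       // number of direct children
};

// Breadth-first flattening: the children of a node are appended
// contiguously, so (first_child_index, num_children) locates all of them.
FlattenedTable flatten_level_by_level(TypeNode const& root)
{
  FlattenedTable out;
  std::vector<TypeNode const*> order{&root};
  // `order` grows while we iterate, so use an index-based loop.
  for (std::size_t i = 0; i < order.size(); ++i) {
    TypeNode const* node = order[i];
    out.columns.push_back(node->name);
    out.num_children.push_back(static_cast<int>(node->children.size()));
    // The first child (if any) lands at the current end of `order`.
    out.first_child_index.push_back(
      node->children.empty() ? -1 : static_cast<int>(order.size()));
    for (auto const& child : node->children) {
      order.push_back(&child);
    }
  }
  return out;
}
```

Under these assumptions, building a `TypeNode` tree for STRUCT<INT, STRUCT<INT, STRING>> and passing it to `flatten_level_by_level` yields the three arrays listed in the example.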

By storing an int column index instead of a column_device_view, we can reduce the memory used per stack element significantly, which allows a much larger stack.
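
To illustrate the idea, here is a sketch of a depth-first walk whose stack elements hold only integer indices. `IndexStackElement`, `kMaxStackDepth`, and `depth_first_order` are hypothetical names, the depth limit is an arbitrary assumption, and the sketch only records the visit order; it does not compute a Hive hash.

```cpp
#include <array>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical compact stack element: two ints instead of a column view.
struct IndexStackElement {
  int col_index;   // position of the column in the flattened table
  int next_child;  // number of children already visited
};

constexpr std::size_t kMaxStackDepth = 8;  // assumed limit, illustration only

// Walks the flattened table depth-first, starting at column 0, and
// returns the column indices in the order they are visited.
std::vector<int> depth_first_order(std::vector<int> const& first_child_index,
                                   std::vector<int> const& num_children)
{
  std::vector<int> visited;
  if (first_child_index.empty()) { return visited; }

  std::array<IndexStackElement, kMaxStackDepth> stack{};
  std::size_t top = 0;
  stack[top++] = {0, 0};
  visited.push_back(0);

  while (top > 0) {
    IndexStackElement& cur = stack[top - 1];
    if (cur.next_child < num_children[cur.col_index]) {
      if (top == kMaxStackDepth) { throw std::length_error("nesting too deep"); }
      // Children are contiguous, so the k-th child is first_child + k.
      int const child = first_child_index[cur.col_index] + cur.next_child;
      ++cur.next_child;
      visited.push_back(child);
      stack[top++] = {child, 0};
    } else {
      --top;  // all children visited; pop this element
    }
  }
  return visited;
}
```

Each stack element here is two ints, so the per-element footprint no longer depends on the size of a column view.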

Collaborator

Check out cudf/cpp/src/table/row_operators.cu (cudf::experimental::row::lexicographic::preprocessed_table::create) and cudf/cpp/src/structs/utilities.cpp (cudf::structs::detail::flatten_nested_columns) for example code that flattens a table.

Collaborator Author

This approach may require significant changes to the existing framework, as the calculations for these 5 columns are not independent.

Collaborator

Yeah! But as discussed, I'm fine with deferring this to follow-up work, so we can have this PR in for 24.12.

ustcfy and others added 2 commits November 8, 2024 19:35
Collaborator

@ttnghia ttnghia left a comment

Great work!

@res-life
Collaborator

build

@ustcfy ustcfy merged commit c170ea5 into NVIDIA:branch-24.12 Nov 25, 2024
4 checks passed