-
Notifications
You must be signed in to change notification settings - Fork 920
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add multi-partition Shuffle
operation to cuDF Polars
#17744
base: branch-25.02
Are you sure you want to change the base?
Add multi-partition Shuffle
operation to cuDF Polars
#17744
Conversation
) | ||
|
||
# Split and return the partitioned result | ||
return { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@madsbk - Just curious: When we integrate with rapidsmp
, can we be able to simply call pack
on the elements of this mapping and hand it off to the shuffle service?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think we can use partition_and_pack()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay cool - That sounds good for a hash-based shuffle, but we will not want hash partitioning for sorting. Rather, we will want to pass in a pylibcudf column containing the final partition for each row. Do we need to add that utility in rapidsmp?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we would need an util function that takes a table and split+pack it into a dict[PartID, PackedColumns]
based on the values in a column.
cc @wence- - It may make sense to get this in before |
Description
This PR pulls out the
Shuffle
logic from #17518 to simplify the review process.The goal is to establish the shuffle groundwork for multi-partition
Join
andSort
operations.Checklist