Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add multi-partition Shuffle operation to cuDF Polars #17744

Open
wants to merge 9 commits into
base: branch-25.02
Choose a base branch
from

Conversation

rjzamora
Copy link
Member

Description

This PR pulls out the Shuffle logic from #17518 to simplify the review process.

The goal is to establish the shuffle groundwork for multi-partition Join and Sort operations.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@rjzamora rjzamora added feature request New feature or request 3 - Ready for Review Ready for review by team non-breaking Non-breaking change cudf.polars Issues specific to cudf.polars labels Jan 15, 2025
@rjzamora rjzamora self-assigned this Jan 15, 2025
@rjzamora rjzamora requested review from a team as code owners January 15, 2025 17:26
@rjzamora rjzamora requested review from bdice and mroeschke January 15, 2025 17:26
@github-actions github-actions bot added the Python Affects Python cuDF API. label Jan 15, 2025
)

# Split and return the partitioned result
return {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@madsbk - Just curious: When we integrate with rapidsmp, can we be able to simply call pack on the elements of this mapping and hand it off to the shuffle service?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think we can use partition_and_pack()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay cool - That sounds good for a hash-based shuffle, but we will not want hash partitioning for sorting. Rather, we will want to pass in a pylibcudf column containing the final partition for each row. Do we need to add that utility in rapidsmp?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we would need an util function that takes a table and split+pack it into a dict[PartID, PackedColumns] based on the values in a column.

@rjzamora
Copy link
Member Author

cc @wence- - It may make sense to get this in before Join or Groupby support. The Join logic largely depends on shuffling, and GroupBy may take a bit longer to clean up. (I can also push on Sort once the shuffling foundation is in place).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team cudf.polars Issues specific to cudf.polars feature request New feature or request non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

2 participants