Add multi-partition `Shuffle` operation to cuDF Polars #17744

rjzamora · 2025-01-15T17:26:29Z

Description

This PR pulls out the Shuffle logic from #17518 to simplify the review process.

The goal is to establish the shuffle groundwork for multi-partition Join and Sort operations.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

rjzamora · 2025-01-15T17:36:08Z

python/cudf_polars/cudf_polars/experimental/shuffle.py

+    )
+
+    # Split and return the partitioned result
+    return {


@madsbk - Just curious: When we integrate with rapidsmp, will we be able to simply call pack on the elements of this mapping and hand them off to the shuffle service?

Yes, I think we can use partition_and_pack()

Okay cool - That sounds good for a hash-based shuffle, but we will not want hash partitioning for sorting. Rather, we will want to pass in a pylibcudf column containing the final partition for each row. Do we need to add that utility in rapidsmp?

Yes, we would need a utility function that takes a table and splits and packs it into a dict[PartID, PackedColumns] based on the values in a column.
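To make the utility discussed above concrete, here is a minimal pure-Python sketch of the splitting semantics only. It is not rapidsmp or pylibcudf code: `split_by_partition_column`, `hash_partition_ids`, and the `Table` alias are hypothetical names, plain lists stand in for GPU columns, and the packing step is left out entirely. The point is that the caller supplies the destination partition of every row, so hash partitioning becomes just one way of producing that column.

```python
from collections import defaultdict

# Hypothetical stand-in for a GPU table: column name -> list of values.
Table = dict[str, list]


def split_by_partition_column(table: Table, partition_ids: list[int]) -> dict[int, Table]:
    """Split `table` row-wise so that row i lands in partition `partition_ids[i]`."""
    n_rows = len(partition_ids)
    if any(len(col) != n_rows for col in table.values()):
        raise ValueError("partition_ids must have one entry per table row")

    # Group row indices by destination partition, preserving input order.
    rows_for: dict[int, list[int]] = defaultdict(list)
    for row, part in enumerate(partition_ids):
        rows_for[part].append(row)

    # Gather each partition's rows from every column.
    return {
        part: {name: [col[r] for r in rows] for name, col in table.items()}
        for part, rows in rows_for.items()
    }


def hash_partition_ids(keys: list[int], n_partitions: int) -> list[int]:
    """Hash-based shuffle: derive each row's partition from its key."""
    return [hash(k) % n_partitions for k in keys]


table = {"key": [5, 1, 9, 3], "val": ["a", "b", "c", "d"]}

# Sort-style shuffle: the partition of each row is chosen by the caller
# (for example from range boundaries), not by hashing.
sort_parts = split_by_partition_column(table, [1, 0, 2, 0])

# Hash-style shuffle reuses the same splitter with derived partition ids.
hash_parts = split_by_partition_column(table, hash_partition_ids(table["key"], 2))
```

In this sketch a partition that receives no rows is simply absent from the result; whether a real utility should emit empty partitions instead is a design choice the thread does not settle.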

rjzamora · 2025-01-15T17:39:34Z

cc @wence- - It may make sense to get this in before Join or Groupby support. The Join logic largely depends on shuffling, and GroupBy may take a bit longer to clean up. (I can also push on Sort once the shuffling foundation is in place).

rjzamora added 8 commits January 9, 2025 07:25

try importing dask_expr from dask.dataframe

89392c0

Merge remote-tracking branch 'upstream/branch-25.02' into dask-expr-migration

2a6821d

Merge branch 'branch-25.02' into dask-expr-migration

5743030

Merge remote-tracking branch 'upstream/branch-25.02' into dask-expr-migration

7d36d3b

update the error message

88e078d

add basic shuffle support

1f77ec4

major revision

8c52fde

Merge remote-tracking branch 'upstream/branch-25.02' into cudf-polars-multi-shuffle

0886ab7

rjzamora added the feature request, 3 - Ready for Review, non-breaking, and cudf.polars labels Jan 15, 2025

rjzamora self-assigned this Jan 15, 2025

rjzamora requested review from a team as code owners January 15, 2025 17:26

rjzamora requested review from bdice and mroeschke January 15, 2025 17:26

github-actions bot added the Python label Jan 15, 2025

roll back unrelated changes

f714a51
