Commit
Avoid double MultiIndex factorization in groupby index result (#17644)
From an offline discussion, it was discovered that a `groupby` with multiple keys experienced a slowdown somewhere between 24.02 and 24.12.

An identified hot spot was the creation of the resulting `MultiIndex`, where each resulting level undergoes factorization upon construction (this changed somewhere between these releases to fix consistency bugs).

While the eager factorization is a new performance penalty, I identified that this factorization was being done twice unnecessarily when performing an `agg`.

This PR ensures we avoid creating this `MultiIndex` until necessary and cache the `MultiIndex` result to avoid double factorization in the future.

```python
# PR
In [1]: import cudf
   ...:
   ...: df_train = cudf.datasets.randomdata(nrows=50_000_000, dtypes={"label": int, "weekday": int, "cat_2": int, "brand": int})

In [2]: target = "label"
   ...: col = ['weekday', 'cat_2', 'brand']

In [3]: %%time
   ...: df_train[col + [target]].groupby(col).agg(['mean', 'count'])
   ...:
   ...:

# PR
CPU times: user 366 ms, sys: 109 ms, total: 474 ms
Wall time: 482 ms

# branch 25.02
CPU times: user 547 ms, sys: 112 ms, total: 659 ms
Wall time: 658 ms
```

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #17644
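The "build lazily, then cache" idea described in the commit message can be illustrated with a minimal, library-free sketch. This is not cuDF's implementation: `GroupedKeys`, `factorize`, and the `factorize_calls` counter are hypothetical names invented for the illustration, and the plain-Python `factorize` only mimics the general idea of mapping values to integer codes.

```python
from functools import cached_property


def factorize(values):
    """Map each value to an integer code; return (codes, uniques).

    Hypothetical helper: uniques are kept in order of first appearance.
    """
    uniques = {}
    codes = [uniques.setdefault(v, len(uniques)) for v in values]
    return codes, list(uniques)


class GroupedKeys:
    """Hypothetical stand-in for a groupby result that holds key columns."""

    def __init__(self, key_columns):
        self._key_columns = key_columns
        self.factorize_calls = 0  # instrumentation for this demo only

    @cached_property
    def index(self):
        # Built on first access only; later accesses reuse the cached value,
        # so each key column is factorized once rather than once per access.
        levels = []
        for column in self._key_columns:
            self.factorize_calls += 1
            levels.append(factorize(column))
        return levels


keys = GroupedKeys([[3, 1, 3], ["a", "b", "a"]])
first = keys.index   # triggers factorization of both key columns
second = keys.index  # served from the cache, no further factorization
```

After both accesses, `keys.factorize_calls` stays at 2 (one call per key column) and `first is second` holds, which is the property the PR relies on to avoid factorizing the same levels twice during an `agg`.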