Avoid double MultiIndex factorization in groupby index result #17644

mroeschke · 2024-12-20T19:19:48Z

Description

From an offline discussion, it was discovered that a groupby with multiple keys experienced a slowdown somewhere between 24.02 and 24.12. An identified hot spot was the creation of the resulting MultiIndex where each resulting level undergoes factorization upon construction (which changed somewhere in between these releases to fix consistency bugs)

While the eager factorization is a new performance penalty, I identified that this factorization was being done twice unnecessarily when performing an agg. This PR ensures we avoid creating this MultiIndex until necessary and cache this MultiIndex result to avoid double factorization in the future

# PR
In [1]: import cudf
   ...: 
   ...: df_train = cudf.datasets.randomdata(nrows=50_000_000, dtypes={"label": int, "weekday": int, "cat_2": int, "brand": int})

In [2]:     target = "label"
   ...:     col = ['weekday', 'cat_2', 'brand']

In [3]: %%time
   ...: df_train[col + [target]].groupby(col).agg(['mean', 'count'])
   ...: 
   ...: 
# PR
CPU times: user 366 ms, sys: 109 ms, total: 474 ms
Wall time: 482 ms

#branch 25.02
CPU times: user 547 ms, sys: 112 ms, total: 659 ms
Wall time: 658 ms

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

…y/mi

mroeschke · 2024-12-30T22:48:09Z

/merge

Noticed while working on #17644 that `diff` and `fillna` were make some unnecessary shallow copies of the `grouping.value` object. Also noticed that `_cov_or_corr` just pulled the column names out of `grouping.value` object, so made a separate API, `values_column_names` to just create the column names without pulling out the actual columns. Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - Bradley Dice (https://github.com/bdice) URL: #17646

Follow up to #17644 This PR changes `MultiIndex` to delay computing `self._level` and `self._codes` via factorization of `self._column` until needed by certain methods. Before, for state consistency, those attributes were always eagerly computed. As discussed offline the performance benefit of not eagerly computing those attributes is more desirable. ```python import cudf df_train = cudf.datasets.randomdata(nrows=50_000_000, dtypes={"label": int, "weekday": int, "cat_2": int, "brand": int}) target = "label" col = ['weekday', 'cat_2', 'brand'] df_gb= df_train[col + [target]].groupby(col) %%time df_gb.agg(['mean', 'count']) # PR CPU times: user 144 ms, sys: 23.9 ms, total: 168 ms Wall time: 166 ms _ ._ __/__ _ _ _ _ _/_ Recorded: 12:12:46 Samples: 4 /_//_/// /_\ / //_// / //_'/ // Duration: 0.159 CPU time: 0.160 / _/ v5.0.0 Cell [3] 0.158 <module> <ipython-input-3-ee51998a643a>:1 `- 0.158 wrapper cudf/utils/performance_tracking.py:30 `- 0.158 DataFrameGroupBy.agg cudf/core/groupby/groupby.py:879 |- 0.129 DataFrameGroupBy._aggregate cudf/core/groupby/groupby.py:789 `- 0.029 NumericalColumn.astype cudf/core/column/column.py:1126 `- 0.029 NumericalColumn.as_numerical_column cudf/core/column/numerical.py:428 `- 0.029 inner contextlib.py:78 `- 0.029 cast cudf/core/_internals/unary.py:40 # branch 25.02 head CPU times: user 369 ms, sys: 105 ms, total: 474 ms Wall time: 478 ms _ ._ __/__ _ _ _ _ _/_ Recorded: 12:11:20 Samples: 79 /_//_/// /_\ / //_// / //_'/ // Duration: 0.480 CPU time: 0.478 / _/ v5.0.0 Cell [3] 0.479 <module> <ipython-input-3-ee51998a643a>:1 `- 0.478 wrapper cudf/utils/performance_tracking.py:30 `- 0.478 DataFrameGroupBy.agg cudf/core/groupby/groupby.py:879 |- 0.267 cached_property.__get__ functools.py:979 | `- 0.267 _Grouping.keys cudf/core/groupby/groupby.py:3534 | `- 0.267 wrapper cudf/utils/performance_tracking.py:30 | `- 0.267 MultiIndex._from_data cudf/core/multiindex.py:344 | `- 0.265 _compute_levels_and_codes cudf/core/multiindex.py:67 | |- 0.230 factorize cudf/core/algorithms.py:22 | | |- 0.168 NumericalColumn._label_encoding cudf/core/column/column.py:1516 | | | |- 0.097 [self] cudf/core/column/column.py | | | |- 0.031 as_column cudf/core/column/column.py:1948 | | | | `- 0.028 NumericalColumn.astype cudf/core/column/column.py:1126 | | | | `- 0.028 NumericalColumn.as_numerical_column cudf/core/column/numerical.py:428 | | | | `- 0.028 inner contextlib.py:78 | | | | `- 0.028 cast cudf/core/_internals/unary.py:40 | | | |- 0.030 inner contextlib.py:78 | | | | `- 0.030 sort_by_key cudf/core/_internals/sorting.py:160 | | | `- 0.007 NumericalColumn.take cudf/core/column/column.py:943 | | | `- 0.007 inner contextlib.py:78 | | | `- 0.007 gather cudf/core/_internals/copying.py:18 | | |- 0.051 NumericalColumn.unique cudf/core/column/column.py:1342 | | | |- 0.036 inner contextlib.py:78 | | | | `- 0.036 drop_duplicates cudf/core/_internals/stream_compaction.py:82 | | | `- 0.015 NumericalColumn.is_unique cudf/core/column/column.py:1056 | | | `- 0.015 NumericalColumn.distinct_count cudf/core/column/column.py:1108 | | `- 0.006 NumericalColumn.dropna cudf/core/column/column.py:294 | | `- 0.006 NumericalColumn.copy cudf/core/column/column.py:481 | `- 0.028 _compile_module_with_cache cupy/cuda/compiler.py:473 | [4 frames hidden] cupy, <built-in> |- 0.133 DataFrameGroupBy._aggregate cudf/core/groupby/groupby.py:789 `- 0.079 wrapper cudf/utils/performance_tracking.py:30 `- 0.079 MultiIndex._from_columns_like_self cudf/core/frame.py:194 `- 0.079 wrapper cudf/utils/performance_tracking.py:30 |- 0.040 MultiIndex._copy_type_metadata cudf/core/multiindex.py:2101 | `- 0.038 _compute_levels_and_codes cudf/core/multiindex.py:67 | `- 0.038 factorize cudf/core/algorithms.py:22 | |- 0.021 NumericalColumn._label_encoding cudf/core/column/column.py:1516 | | |- 0.010 [self] cudf/core/column/column.py | | `- 0.007 inner contextlib.py:78 | | `- 0.007 sort_by_key cudf/core/_internals/sorting.py:160 | `- 0.011 NumericalColumn.unique cudf/core/column/column.py:1342 | `- 0.010 inner contextlib.py:78 | `- 0.010 drop_duplicates cudf/core/_internals/stream_compaction.py:82 `- 0.039 MultiIndex._from_data cudf/core/multiindex.py:344 `- 0.038 _compute_levels_and_codes cudf/core/multiindex.py:67 `- 0.038 factorize cudf/core/algorithms.py:22 |- 0.021 NumericalColumn._label_encoding cudf/core/column/column.py:1516 | |- 0.010 [self] cudf/core/column/column.py | `- 0.007 inner contextlib.py:78 | `- 0.007 sort_by_key cudf/core/_internals/sorting.py:160 `- 0.010 NumericalColumn.unique cudf/core/column/column.py:1342 `- 0.010 inner contextlib.py:78 `- 0.010 drop_duplicates cudf/core/_internals/stream_compaction.py:82 ``` Authors: - Matthew Roeschke (https://github.com/mroeschke) - Matthew Murray (https://github.com/Matt711) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) URL: #17728

mroeschke added 2 commits December 20, 2024 10:16

Use less grouping.keys

6fd6eab

ensure keys is a cached property

0f3b693

mroeschke added Python Affects Python cuDF API. improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Dec 20, 2024

mroeschke self-assigned this Dec 20, 2024

mroeschke requested a review from a team as a code owner December 20, 2024 19:19

mroeschke requested review from vyasr and brandon-b-miller December 20, 2024 19:19

mroeschke added 2 commits December 20, 2024 12:08

Merge remote-tracking branch 'upstream/branch-25.02' into perf/groupb…

bfc4e59

…y/mi

needs keys to pass to __getitem__

e663c01

mroeschke mentioned this pull request Dec 20, 2024

Avoid shallow copies in groupby methods #17646

Merged

3 tasks

vyasr approved these changes Dec 21, 2024

View reviewed changes

rapids-bot bot merged commit 88d9251 into rapidsai:branch-25.02 Dec 30, 2024
106 checks passed

mroeschke deleted the perf/groupby/mi branch December 30, 2024 22:48

mroeschke mentioned this pull request Jan 13, 2025

Delay setting MultiIndex.level/codes until needed #17728

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Avoid double MultiIndex factorization in groupby index result #17644

Avoid double MultiIndex factorization in groupby index result #17644

mroeschke commented Dec 20, 2024

mroeschke commented Dec 30, 2024

Avoid double MultiIndex factorization in groupby index result #17644

Avoid double MultiIndex factorization in groupby index result #17644

Conversation

mroeschke commented Dec 20, 2024

Description

Checklist

mroeschke commented Dec 30, 2024