Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid double MultiIndex factorization in groupby index result #17644

Merged
merged 4 commits into from
Dec 30, 2024

Conversation

mroeschke
Copy link
Contributor

Description

From an offline discussion, it was discovered that a groupby with multiple keys experienced a slowdown somewhere between 24.02 and 24.12. An identified hot spot was the creation of the resulting MultiIndex where each resulting level undergoes factorization upon construction (which changed somewhere in between these releases to fix consistency bugs)

While the eager factorization is a new performance penalty, I identified that this factorization was being done twice unnecessarily when performing an agg. This PR ensures we avoid creating this MultiIndex until necessary and cache this MultiIndex result to avoid double factorization in the future

# PR
In [1]: import cudf
   ...: 
   ...: df_train = cudf.datasets.randomdata(nrows=50_000_000, dtypes={"label": int, "weekday": int, "cat_2": int, "brand": int})

In [2]:     target = "label"
   ...:     col = ['weekday', 'cat_2', 'brand']

In [3]: %%time
   ...: df_train[col + [target]].groupby(col).agg(['mean', 'count'])
   ...: 
   ...: 
# PR
CPU times: user 366 ms, sys: 109 ms, total: 474 ms
Wall time: 482 ms

#branch 25.02
CPU times: user 547 ms, sys: 112 ms, total: 659 ms
Wall time: 658 ms

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@mroeschke mroeschke added Python Affects Python cuDF API. improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Dec 20, 2024
@mroeschke mroeschke self-assigned this Dec 20, 2024
@mroeschke mroeschke requested a review from a team as a code owner December 20, 2024 19:19
@mroeschke
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 88d9251 into rapidsai:branch-25.02 Dec 30, 2024
106 checks passed
@mroeschke mroeschke deleted the perf/groupby/mi branch December 30, 2024 22:48
rapids-bot bot pushed a commit that referenced this pull request Jan 4, 2025
Noticed while working on #17644 that `diff` and `fillna` were make some unnecessary shallow copies of the `grouping.value` object. Also noticed that `_cov_or_corr` just pulled the column names out of `grouping.value` object, so made a separate API, `values_column_names` to just create the column names without pulling out the actual columns.

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #17646
rapids-bot bot pushed a commit that referenced this pull request Jan 15, 2025
Follow up to #17644

This PR changes `MultiIndex` to delay computing `self._level` and `self._codes` via factorization of `self._column` until needed by certain methods. Before, for state consistency, those attributes were always eagerly computed. As discussed offline the performance benefit of not eagerly computing those attributes is more desirable.

```python
import cudf
df_train = cudf.datasets.randomdata(nrows=50_000_000, dtypes={"label": int, "weekday": int, "cat_2": int, "brand": int})
target = "label"
col = ['weekday', 'cat_2', 'brand']
df_gb= df_train[col + [target]].groupby(col)
%%time
df_gb.agg(['mean', 'count'])


# PR
CPU times: user 144 ms, sys: 23.9 ms, total: 168 ms
Wall time: 166 ms

  _     ._   __/__   _ _  _  _ _/_   Recorded: 12:12:46  Samples:  4
 /_//_/// /_\ / //_// / //_'/ //     Duration: 0.159     CPU time: 0.160
/   _/                      v5.0.0

Cell [3]

0.158 <module>  <ipython-input-3-ee51998a643a>:1
`- 0.158 wrapper  cudf/utils/performance_tracking.py:30
   `- 0.158 DataFrameGroupBy.agg  cudf/core/groupby/groupby.py:879
      |- 0.129 DataFrameGroupBy._aggregate  cudf/core/groupby/groupby.py:789
      `- 0.029 NumericalColumn.astype  cudf/core/column/column.py:1126
         `- 0.029 NumericalColumn.as_numerical_column  cudf/core/column/numerical.py:428
            `- 0.029 inner  contextlib.py:78
               `- 0.029 cast  cudf/core/_internals/unary.py:40


# branch 25.02 head
CPU times: user 369 ms, sys: 105 ms, total: 474 ms
Wall time: 478 ms


  _     ._   __/__   _ _  _  _ _/_   Recorded: 12:11:20  Samples:  79
 /_//_/// /_\ / //_// / //_'/ //     Duration: 0.480     CPU time: 0.478
/   _/                      v5.0.0

Cell [3]

0.479 <module>  <ipython-input-3-ee51998a643a>:1
`- 0.478 wrapper  cudf/utils/performance_tracking.py:30
   `- 0.478 DataFrameGroupBy.agg  cudf/core/groupby/groupby.py:879
      |- 0.267 cached_property.__get__  functools.py:979
      |  `- 0.267 _Grouping.keys  cudf/core/groupby/groupby.py:3534
      |     `- 0.267 wrapper  cudf/utils/performance_tracking.py:30
      |        `- 0.267 MultiIndex._from_data  cudf/core/multiindex.py:344
      |           `- 0.265 _compute_levels_and_codes  cudf/core/multiindex.py:67
      |              |- 0.230 factorize  cudf/core/algorithms.py:22
      |              |  |- 0.168 NumericalColumn._label_encoding  cudf/core/column/column.py:1516
      |              |  |  |- 0.097 [self]  cudf/core/column/column.py
      |              |  |  |- 0.031 as_column  cudf/core/column/column.py:1948
      |              |  |  |  `- 0.028 NumericalColumn.astype  cudf/core/column/column.py:1126
      |              |  |  |     `- 0.028 NumericalColumn.as_numerical_column  cudf/core/column/numerical.py:428
      |              |  |  |        `- 0.028 inner  contextlib.py:78
      |              |  |  |           `- 0.028 cast  cudf/core/_internals/unary.py:40
      |              |  |  |- 0.030 inner  contextlib.py:78
      |              |  |  |  `- 0.030 sort_by_key  cudf/core/_internals/sorting.py:160
      |              |  |  `- 0.007 NumericalColumn.take  cudf/core/column/column.py:943
      |              |  |     `- 0.007 inner  contextlib.py:78
      |              |  |        `- 0.007 gather  cudf/core/_internals/copying.py:18
      |              |  |- 0.051 NumericalColumn.unique  cudf/core/column/column.py:1342
      |              |  |  |- 0.036 inner  contextlib.py:78
      |              |  |  |  `- 0.036 drop_duplicates  cudf/core/_internals/stream_compaction.py:82
      |              |  |  `- 0.015 NumericalColumn.is_unique  cudf/core/column/column.py:1056
      |              |  |     `- 0.015 NumericalColumn.distinct_count  cudf/core/column/column.py:1108
      |              |  `- 0.006 NumericalColumn.dropna  cudf/core/column/column.py:294
      |              |     `- 0.006 NumericalColumn.copy  cudf/core/column/column.py:481
      |              `- 0.028 _compile_module_with_cache  cupy/cuda/compiler.py:473
      |                    [4 frames hidden]  cupy, <built-in>
      |- 0.133 DataFrameGroupBy._aggregate  cudf/core/groupby/groupby.py:789
      `- 0.079 wrapper  cudf/utils/performance_tracking.py:30
         `- 0.079 MultiIndex._from_columns_like_self  cudf/core/frame.py:194
            `- 0.079 wrapper  cudf/utils/performance_tracking.py:30
               |- 0.040 MultiIndex._copy_type_metadata  cudf/core/multiindex.py:2101
               |  `- 0.038 _compute_levels_and_codes  cudf/core/multiindex.py:67
               |     `- 0.038 factorize  cudf/core/algorithms.py:22
               |        |- 0.021 NumericalColumn._label_encoding  cudf/core/column/column.py:1516
               |        |  |- 0.010 [self]  cudf/core/column/column.py
               |        |  `- 0.007 inner  contextlib.py:78
               |        |     `- 0.007 sort_by_key  cudf/core/_internals/sorting.py:160
               |        `- 0.011 NumericalColumn.unique  cudf/core/column/column.py:1342
               |           `- 0.010 inner  contextlib.py:78
               |              `- 0.010 drop_duplicates  cudf/core/_internals/stream_compaction.py:82
               `- 0.039 MultiIndex._from_data  cudf/core/multiindex.py:344
                  `- 0.038 _compute_levels_and_codes  cudf/core/multiindex.py:67
                     `- 0.038 factorize  cudf/core/algorithms.py:22
                        |- 0.021 NumericalColumn._label_encoding  cudf/core/column/column.py:1516
                        |  |- 0.010 [self]  cudf/core/column/column.py
                        |  `- 0.007 inner  contextlib.py:78
                        |     `- 0.007 sort_by_key  cudf/core/_internals/sorting.py:160
                        `- 0.010 NumericalColumn.unique  cudf/core/column/column.py:1342
                           `- 0.010 inner  contextlib.py:78
                              `- 0.010 drop_duplicates  cudf/core/_internals/stream_compaction.py:82
```

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)
  - Matthew Murray (https://github.com/Matt711)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #17728
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

2 participants