-
Notifications
You must be signed in to change notification settings - Fork 921
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid double MultiIndex factorization in groupby index result #17644
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
mroeschke
added
Python
Affects Python cuDF API.
improvement
Improvement / enhancement to an existing function
non-breaking
Non-breaking change
labels
Dec 20, 2024
3 tasks
vyasr
approved these changes
Dec 21, 2024
/merge |
rapids-bot bot
pushed a commit
that referenced
this pull request
Jan 4, 2025
Noticed while working on #17644 that `diff` and `fillna` were make some unnecessary shallow copies of the `grouping.value` object. Also noticed that `_cov_or_corr` just pulled the column names out of `grouping.value` object, so made a separate API, `values_column_names` to just create the column names without pulling out the actual columns. Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - Bradley Dice (https://github.com/bdice) URL: #17646
3 tasks
rapids-bot bot
pushed a commit
that referenced
this pull request
Jan 15, 2025
Follow up to #17644 This PR changes `MultiIndex` to delay computing `self._level` and `self._codes` via factorization of `self._column` until needed by certain methods. Before, for state consistency, those attributes were always eagerly computed. As discussed offline the performance benefit of not eagerly computing those attributes is more desirable. ```python import cudf df_train = cudf.datasets.randomdata(nrows=50_000_000, dtypes={"label": int, "weekday": int, "cat_2": int, "brand": int}) target = "label" col = ['weekday', 'cat_2', 'brand'] df_gb= df_train[col + [target]].groupby(col) %%time df_gb.agg(['mean', 'count']) # PR CPU times: user 144 ms, sys: 23.9 ms, total: 168 ms Wall time: 166 ms _ ._ __/__ _ _ _ _ _/_ Recorded: 12:12:46 Samples: 4 /_//_/// /_\ / //_// / //_'/ // Duration: 0.159 CPU time: 0.160 / _/ v5.0.0 Cell [3] 0.158 <module> <ipython-input-3-ee51998a643a>:1 `- 0.158 wrapper cudf/utils/performance_tracking.py:30 `- 0.158 DataFrameGroupBy.agg cudf/core/groupby/groupby.py:879 |- 0.129 DataFrameGroupBy._aggregate cudf/core/groupby/groupby.py:789 `- 0.029 NumericalColumn.astype cudf/core/column/column.py:1126 `- 0.029 NumericalColumn.as_numerical_column cudf/core/column/numerical.py:428 `- 0.029 inner contextlib.py:78 `- 0.029 cast cudf/core/_internals/unary.py:40 # branch 25.02 head CPU times: user 369 ms, sys: 105 ms, total: 474 ms Wall time: 478 ms _ ._ __/__ _ _ _ _ _/_ Recorded: 12:11:20 Samples: 79 /_//_/// /_\ / //_// / //_'/ // Duration: 0.480 CPU time: 0.478 / _/ v5.0.0 Cell [3] 0.479 <module> <ipython-input-3-ee51998a643a>:1 `- 0.478 wrapper cudf/utils/performance_tracking.py:30 `- 0.478 DataFrameGroupBy.agg cudf/core/groupby/groupby.py:879 |- 0.267 cached_property.__get__ functools.py:979 | `- 0.267 _Grouping.keys cudf/core/groupby/groupby.py:3534 | `- 0.267 wrapper cudf/utils/performance_tracking.py:30 | `- 0.267 MultiIndex._from_data cudf/core/multiindex.py:344 | `- 0.265 _compute_levels_and_codes cudf/core/multiindex.py:67 | |- 0.230 factorize cudf/core/algorithms.py:22 | | |- 0.168 NumericalColumn._label_encoding cudf/core/column/column.py:1516 | | | |- 0.097 [self] cudf/core/column/column.py | | | |- 0.031 as_column cudf/core/column/column.py:1948 | | | | `- 0.028 NumericalColumn.astype cudf/core/column/column.py:1126 | | | | `- 0.028 NumericalColumn.as_numerical_column cudf/core/column/numerical.py:428 | | | | `- 0.028 inner contextlib.py:78 | | | | `- 0.028 cast cudf/core/_internals/unary.py:40 | | | |- 0.030 inner contextlib.py:78 | | | | `- 0.030 sort_by_key cudf/core/_internals/sorting.py:160 | | | `- 0.007 NumericalColumn.take cudf/core/column/column.py:943 | | | `- 0.007 inner contextlib.py:78 | | | `- 0.007 gather cudf/core/_internals/copying.py:18 | | |- 0.051 NumericalColumn.unique cudf/core/column/column.py:1342 | | | |- 0.036 inner contextlib.py:78 | | | | `- 0.036 drop_duplicates cudf/core/_internals/stream_compaction.py:82 | | | `- 0.015 NumericalColumn.is_unique cudf/core/column/column.py:1056 | | | `- 0.015 NumericalColumn.distinct_count cudf/core/column/column.py:1108 | | `- 0.006 NumericalColumn.dropna cudf/core/column/column.py:294 | | `- 0.006 NumericalColumn.copy cudf/core/column/column.py:481 | `- 0.028 _compile_module_with_cache cupy/cuda/compiler.py:473 | [4 frames hidden] cupy, <built-in> |- 0.133 DataFrameGroupBy._aggregate cudf/core/groupby/groupby.py:789 `- 0.079 wrapper cudf/utils/performance_tracking.py:30 `- 0.079 MultiIndex._from_columns_like_self cudf/core/frame.py:194 `- 0.079 wrapper cudf/utils/performance_tracking.py:30 |- 0.040 MultiIndex._copy_type_metadata cudf/core/multiindex.py:2101 | `- 0.038 _compute_levels_and_codes cudf/core/multiindex.py:67 | `- 0.038 factorize cudf/core/algorithms.py:22 | |- 0.021 NumericalColumn._label_encoding cudf/core/column/column.py:1516 | | |- 0.010 [self] cudf/core/column/column.py | | `- 0.007 inner contextlib.py:78 | | `- 0.007 sort_by_key cudf/core/_internals/sorting.py:160 | `- 0.011 NumericalColumn.unique cudf/core/column/column.py:1342 | `- 0.010 inner contextlib.py:78 | `- 0.010 drop_duplicates cudf/core/_internals/stream_compaction.py:82 `- 0.039 MultiIndex._from_data cudf/core/multiindex.py:344 `- 0.038 _compute_levels_and_codes cudf/core/multiindex.py:67 `- 0.038 factorize cudf/core/algorithms.py:22 |- 0.021 NumericalColumn._label_encoding cudf/core/column/column.py:1516 | |- 0.010 [self] cudf/core/column/column.py | `- 0.007 inner contextlib.py:78 | `- 0.007 sort_by_key cudf/core/_internals/sorting.py:160 `- 0.010 NumericalColumn.unique cudf/core/column/column.py:1342 `- 0.010 inner contextlib.py:78 `- 0.010 drop_duplicates cudf/core/_internals/stream_compaction.py:82 ``` Authors: - Matthew Roeschke (https://github.com/mroeschke) - Matthew Murray (https://github.com/Matt711) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) URL: #17728
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
improvement
Improvement / enhancement to an existing function
non-breaking
Non-breaking change
Python
Affects Python cuDF API.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Description
From an offline discussion, it was discovered that a
groupby
with multiple keys experienced a slowdown somewhere between 24.02 and 24.12. An identified hot spot was the creation of the resultingMultiIndex
where each resulting level undergoes factorization upon construction (which changed somewhere in between these releases to fix consistency bugs)While the eager factorization is a new performance penalty, I identified that this factorization was being done twice unnecessarily when performing an
agg
. This PR ensures we avoid creating this MultiIndex until necessary and cache this MultiIndex result to avoid double factorization in the futureChecklist