Avoid double MultiIndex factorization in groupby index result #17644
Description
From an offline discussion, it was discovered that a `groupby` with multiple keys experienced a slowdown somewhere between 24.02 and 24.12. An identified hot spot was the creation of the resulting `MultiIndex`, where each resulting level undergoes factorization upon construction (this changed somewhere in between these releases to fix consistency bugs).

While the eager factorization is a new performance penalty, I identified that this factorization was being done twice unnecessarily when performing an `agg`. This PR avoids creating the `MultiIndex` until it is necessary and caches the result, so the levels are not factorized a second time.
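
The description above relies on a general pattern: build the index lazily and cache it, so the expensive step runs once. The snippet below is a minimal sketch of that pattern in plain Python, not the code changed in this PR. `factorize`, `ToyMultiIndex`, and `ToyGroupByResult` are hypothetical names invented for illustration, and the use of `functools.cached_property` is an assumption about one way to cache the result.

```python
from functools import cached_property

factorize_calls = 0  # counts how often the expensive step runs


def factorize(values):
    """Toy stand-in for level factorization: return (codes, uniques)."""
    global factorize_calls
    factorize_calls += 1
    uniques = sorted(set(values))
    positions = {value: i for i, value in enumerate(uniques)}
    return [positions[value] for value in values], uniques


class ToyMultiIndex:
    """Hypothetical index that factorizes every level on construction."""

    def __init__(self, levels):
        self.levels = [factorize(level) for level in levels]


class ToyGroupByResult:
    """Hypothetical holder for the grouped key columns."""

    def __init__(self, key_columns):
        self._key_columns = key_columns

    @cached_property
    def index(self):
        # Built on first access only; later accesses reuse the cached object.
        return ToyMultiIndex(self._key_columns)


result = ToyGroupByResult([["a", "b", "a"], [1, 1, 2]])
calls_before_access = factorize_calls  # nothing has been built yet
first = result.index
second = result.index
calls_after_access = factorize_calls  # one call per level, from the first access
same_object = first is second
```

In this sketch, constructing `ToyGroupByResult` does no factorization at all, and reading `index` repeatedly factorizes each level only once because the second access returns the cached object.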