
Improve pandas groupby corr benchmark #37

Merged · 3 commits into duckdblabs:master · Oct 4, 2023

Conversation

@mroeschke

Hello!

I was reviewing the pandas groupby benchmarks and noticed that the groupby correlation benchmark could be made more efficient by dropping the per-group Series construction inside apply and renaming the result column afterwards:

In [1]: import warnings; import pandas as pd; import numpy as np

In [2]: warnings.filterwarnings("ignore", category=FutureWarning)

In [3]: warnings.filterwarnings("ignore", category=RuntimeWarning)

In [4]: pd.__version__
Out[4]: '2.1.1'

In [5]: n = 10

In [6]: k = 10

In [7]: np.random.seed(123)

In [8]: df = pd.DataFrame({"key": np.random.randint(0, k, n), "x": np.random.rand(n), "y": np.random.rand(n)})

In [9]: df
Out[9]: 
   key         x         y
0    2  0.480932  0.531551
1    2  0.392118  0.531828
2    6  0.343178  0.634401
3    1  0.729050  0.849432
4    3  0.438572  0.724455
5    9  0.059678  0.611024
6    6  0.398044  0.722443
7    1  0.737995  0.322959
8    0  0.182492  0.361789
9    1  0.175452  0.228263

# existing benchmark
In [10]: df.groupby(["key"], as_index=False, sort=False, observed=True, dropna=False).apply(lambda x: pd.Series({'r2': x.corr(numeric_only=True)['x']['y']**2}))
Out[10]: 
   key        r2
0    2  1.000000
1    6  1.000000
2    1  0.367864
3    3       NaN
4    9       NaN
5    0       NaN

# proposed benchmark
In [11]: df.groupby(["key"], as_index=False, sort=False, observed=True, dropna=False).apply(lambda x: (x['x'].corr(x['y'])**2)).rename(columns={None: "r2"})
Out[11]: 
   key        r2
0    2  1.000000
1    6  1.000000
2    1  0.367864
3    3       NaN
4    9       NaN
5    0       NaN

In [12]: n = 10_000

In [13]: k = 100

In [14]: df = pd.DataFrame({"key": np.random.randint(0, k, n), "x": np.random.rand(n), "y": np.random.rand(n)})

# existing benchmark
In [15]: %timeit df.groupby(["key"], as_index=False, sort=False, observed=True, dropna=False).apply(lambda x: pd.Series({'r2': x.corr(numeric_only=True)['x']['y']**2}))
10.3 ms ± 27.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# proposed benchmark
In [16]: %timeit df.groupby(["key"], as_index=False, sort=False, observed=True, dropna=False).apply(lambda x: (x['x'].corr(x['y'])**2)).rename(columns={None: "r2"})
7.39 ms ± 16.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
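Outside the benchmark harness, the equivalence of the two apply bodies can be sketched as follows. The data, group count, and function names here are illustrative, not taken from the benchmark itself; the point is that the proposed body computes a single pairwise correlation per group instead of the full correlation matrix plus a throwaway Series:

```python
import numpy as np
import pandas as pd

# Illustrative data: 5 groups, 100 rows (not the benchmark's sizes).
rng = np.random.default_rng(123)
df = pd.DataFrame({
    "key": rng.integers(0, 5, 100),
    "x": rng.random(100),
    "y": rng.random(100),
})

# Existing body: build the full pairwise correlation matrix for each
# group, then wrap a single cell of it in a fresh pd.Series.
def old_body(g):
    return pd.Series({"r2": g.corr(numeric_only=True)["x"]["y"] ** 2})

# Proposed body: one Series.corr call per group; no matrix, no Series.
def new_body(g):
    return g["x"].corr(g["y"]) ** 2

old = df.groupby("key").apply(old_body)["r2"]
new = df.groupby("key").apply(new_body)

# Both paths produce the same per-group r2 values.
assert np.allclose(old.to_numpy(), new.to_numpy(), equal_nan=True)
```

Since the proposed body returns a scalar, the result column from apply comes back unnamed, which is why the benchmark tacks on rename(columns={None: "r2"}) to restore the r2 label.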

@jangorecki

Thanks, also benchplot dict needs to be updated

@mroeschke

> Thanks, also benchplot dict needs to be updated

Thanks, updated it as well

@mroeschke

@jangorecki any other changes needed for this PR?

@jangorecki

I have not run the code, but it looks OK.

@Tmonster Tmonster self-requested a review October 4, 2023 10:56
@Tmonster

Tmonster commented Oct 4, 2023

Hi Matthew, just going to wait for the tests to pass and I'll merge 👍

@Tmonster Tmonster merged commit 00c4fdd into duckdblabs:master Oct 4, 2023
@mroeschke mroeschke deleted the perf/groupby_pandas_corr branch October 4, 2023 16:28
@mroeschke

Thanks for merging!
