[Python] pyarrow.compute.skew(skip_nulls=True) still counts NULL as an observation? #45733

Open
mroeschke opened this issue Mar 10, 2025 · 5 comments

Comments

@mroeschke
Contributor

Describe the bug, including details regarding any error messages, version, and platform.

>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> import pandas as pd, numpy as np
>>> pd.Series(np.array([1.0, 2.0, 3.0, 40.0, np.nan])).skew(skipna=True)
np.float64(1.988947740397821)
>>> pc.skew(pa.array([1.0, 2.0, 3.0, 40.0, None]), skip_nulls=True)
<pyarrow.DoubleScalar: 1.14831951332278>

If skew always computes the unbiased skew, then pyarrow's value being lower than pandas's suggests that pyarrow is counting None as an observation, whereas pandas does not count its missing value as an observation.

cc @pitrou xref #45677

Component(s)

Python

@pitrou
Member

pitrou commented Mar 11, 2025

> If skew is always calculating the unbiased skew

No, it's computing the biased skew. Would unbiased be more/less useful? Or should we just add an option to make both variants available?

> with pyarrow's value being lower than pandas's value it appears pyarrow might be counting None as an observation while pandas is not considering its missing value as an observation.

No, it's always the same value regardless of the number of nulls:

>>> pc.skew([1.0, 2.0, 3.0, 40.0], skip_nulls=True)
<pyarrow.DoubleScalar: 1.14831951332278>
>>> pc.skew([1.0, 2.0, 3.0, 40.0, None], skip_nulls=True)
<pyarrow.DoubleScalar: 1.14831951332278>
>>> pc.skew([1.0, 2.0, 3.0, 40.0, None, None], skip_nulls=True)
<pyarrow.DoubleScalar: 1.14831951332278>
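
For reference, the value above is consistent with the population (biased) skewness g1 = m3 / m2^(3/2), computed after nulls are dropped. The following is a minimal pure-Python sketch of that formula, not pyarrow's actual implementation; the helper name `biased_skew` is hypothetical.

```python
import math

def biased_skew(values):
    """Population (biased) skewness g1 = m3 / m2**1.5, ignoring None entries."""
    # Drop nulls first, so they never count as observations.
    data = [v for v in values if v is not None]
    n = len(data)
    mean = sum(data) / n
    # Second and third central moments, both divided by n (no bias correction).
    m2 = sum((v - mean) ** 2 for v in data) / n
    m3 = sum((v - mean) ** 3 for v in data) / n
    return m3 / m2 ** 1.5

no_nulls = biased_skew([1.0, 2.0, 3.0, 40.0])
two_nulls = biased_skew([1.0, 2.0, 3.0, 40.0, None, None])
print(no_nulls, two_nulls)
```

Both calls give the same result (about 1.1483), because the nulls are filtered out before `n` and the moments are computed.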

@pitrou
Member

pitrou commented Mar 11, 2025

cc @icexelloss

@mroeschke
Contributor Author

> Or should we just add an option to make both variants available?

This would be a nice option, and it aligns with what scipy does: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.skew.html. It would also allow us over in pandas to use an unbiased skew option for pandas.ArrowDtype data, matching the pandas default implementation.
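
As a rough illustration of how the two variants relate: the biased value can be converted to the sample-size-adjusted one with the standard factor sqrt(n(n-1)) / (n-2), where n is the number of non-null observations. This is a sketch only; `adjust_skew` is a hypothetical helper, not an existing pyarrow or pandas function.

```python
import math

def adjust_skew(g1, n):
    """Convert biased skewness g1 into the sample-size-adjusted skewness.

    n is the number of non-null observations; the correction needs n >= 3.
    """
    if n < 3:
        raise ValueError("adjusted skewness requires at least 3 observations")
    return g1 * math.sqrt(n * (n - 1)) / (n - 2)

# Biased value reported by pc.skew for [1.0, 2.0, 3.0, 40.0] (n = 4).
adjusted = adjust_skew(1.14831951332278, 4)
print(adjusted)
```

With n = 4 this gives about 1.9889, which agrees with the pandas result in the original report, so the gap between the two libraries is explained by the bias correction rather than by null handling.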

@pitrou
Member

pitrou commented Mar 11, 2025

Would you be willing to try and contribute it? This should be relatively easy if you know a bit of C++ and have already touched the Arrow C++ codebase.

@mroeschke
Contributor Author

I can attempt it at a later date, but I cannot guarantee I will get to it before Arrow 20 is released.
