[Python] `pyarrow.compute.skew(skip_nulls=True)` still counts NULL as an observation? #45733

mroeschke · 2025-03-10T23:46:47Z

Describe the bug, including details regarding any error messages, version, and platform.

>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> import pandas as pd, numpy as np
>>> pd.Series(np.array([1.0, 2.0, 3.0, 40.0, np.nan])).skew(skipna=True)
np.float64(1.988947740397821)
>>> pc.skew(pa.array([1.0, 2.0, 3.0, 40.0, None]), skip_nulls=True)
<pyarrow.DoubleScalar: 1.14831951332278>

If skew always calculates the unbiased skew, then since pyarrow's value is lower than pandas's, it appears pyarrow might be counting None as an observation, while pandas does not count its missing value as an observation.

cc @pitrou xref #45677

Component(s)

Python

pitrou · 2025-03-11T08:23:47Z

> If skew is always calculating the unbiased skew

No, it's computing the biased skew. Would unbiased be more/less useful? Or should we just add an option to make both variants available?

> with pyarrow's value being lower than pandas's value it appears pyarrow might be counting None as an observation while pandas is not considering its missing value as an observation.

No, it's always the same value regardless of the number of nulls:

>>> pc.skew([1.0, 2.0, 3.0, 40.0], skip_nulls=True)
<pyarrow.DoubleScalar: 1.14831951332278>
>>> pc.skew([1.0, 2.0, 3.0, 40.0, None], skip_nulls=True)
<pyarrow.DoubleScalar: 1.14831951332278>
>>> pc.skew([1.0, 2.0, 3.0, 40.0, None, None], skip_nulls=True)
<pyarrow.DoubleScalar: 1.14831951332278>
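
For readers comparing the two numbers in this thread: the gap appears to be fully explained by biased versus sample-adjusted skewness, not by null handling. The sketch below is plain Python, not Arrow's or pandas's actual implementation. The helper names `biased_skew` and `adjusted_skew` are hypothetical, and it assumes the pandas default uses the usual adjusted Fisher-Pearson correction factor sqrt(n(n-1))/(n-2), which is consistent with the values reported above.

```python
import math


def biased_skew(values):
    """Biased (population) skewness g1 = m3 / m2**1.5, with None entries dropped."""
    data = [v for v in values if v is not None]
    n = len(data)
    mean = sum(data) / n
    m2 = sum((v - mean) ** 2 for v in data) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in data) / n  # third central moment
    return m3 / m2 ** 1.5


def adjusted_skew(values):
    """Sample-adjusted skewness G1 = g1 * sqrt(n * (n - 1)) / (n - 2)."""
    data = [v for v in values if v is not None]
    n = len(data)  # only non-null observations are counted
    return biased_skew(data) * math.sqrt(n * (n - 1)) / (n - 2)


with_null = [1.0, 2.0, 3.0, 40.0, None]
print(biased_skew(with_null))    # approx. 1.1483, the value pyarrow returned
print(adjusted_skew(with_null))  # approx. 1.9889, the value pandas returned
```

With four non-null observations the correction factor is sqrt(12)/2, roughly 1.73, which accounts for the difference between the two results.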

pitrou · 2025-03-11T08:23:54Z

cc @icexelloss

mroeschke · 2025-03-11T16:31:45Z

> Or should we just add an option to make both variants available?

This would be a nice option, and it aligns with what scipy does: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.skew.html. It would also allow us in pandas to use an unbiased skew option for pandas.ArrowDtype data, matching the pandas default implementation.

pitrou · 2025-03-11T17:15:19Z

Would you be willing to try and contribute it? This should be relatively easy if you know a bit of C++ and have already touched the Arrow C++ codebase.

mroeschke · 2025-03-11T17:23:06Z

I can attempt it at a later date, but I cannot guarantee I will get to it before Arrow 20 is released.

mroeschke added the Type: bug label Mar 10, 2025

github-actions bot added the Component: Python label Mar 10, 2025

pitrou added the good-second-issue label Mar 11, 2025
