
Support adjust=True in ewm_mean_by #21015

Open
2 tasks done
ancri opened this issue Jan 30, 2025 · 7 comments
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars

Comments

@ancri

ancri commented Jan 30, 2025

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

from datetime import date, timedelta
import polars as pl

tbl = pl.DataFrame({
    'x': [1, 3],
    'date': [date(2025, 1, 1), date(2025, 1, 2)],
})
print(tbl.with_columns(x_ewm=pl.col('x').ewm_mean_by('date', half_life='1d')))

tbl_pandas = tbl.to_pandas()
print(tbl_pandas['x'].ewm(halflife=timedelta(days=1), times=tbl_pandas['date']).mean())

Log output

shape: (2, 3)
┌─────┬────────────┬───────┐
│ x   ┆ date       ┆ x_ewm │
│ --- ┆ ---        ┆ ---   │
│ i64 ┆ date       ┆ f64   │
╞═════╪════════════╪═══════╡
│ 1   ┆ 2025-01-01 ┆ 1.0   │
│ 3   ┆ 2025-01-02 ┆ 2.0   │
└─────┴────────────┴───────┘
0    1.000000
1    2.333333
Name: x, dtype: float64

Issue description

In the exponential moving average formula, you've forgotten to divide by the sum of weights. In the example above, this results in the average for 2025/1/2 being calculated as 2.0, which means it weighs the two observations equally, even though one is current and the other is as old as the halflife so it should get half the relative weight. The value should instead be 2.333. Note that pandas gets this right.
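For reference, the 2.333 value can be reproduced by hand with a weight-normalized average (a sketch, assuming each observation is weighted by `0.5 ** (age_in_days / half_life)`):

```python
# Adjusted (weight-normalized) EWM for the two-row example above:
# weight of each observation = 0.5 ** (age_in_days / half_life_days)
xs = [1.0, 3.0]
ages = [1.0, 0.0]          # days before the latest observation (2025-01-02)
half_life = 1.0

weights = [0.5 ** (a / half_life) for a in ages]   # [0.5, 1.0]
y = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
print(y)  # 2.3333..., matching the pandas output above
```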

Expected behavior

The formula should be corrected to divide by sum(a_i) as the denominator.

Installed versions

--------Version info---------
Polars:              1.21.0
Index type:          UInt32
Platform:            Linux-5.4.0-204-generic-x86_64-with-glibc2.31
Python:              3.10.11 (main, May 17 2023, 11:25:03) [GCC 9.4.0]
LTS CPU:             False

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  <not installed>
altair               5.4.1
azure.identity       <not installed>
boto3                1.34.74
cloudpickle          2.2.1
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2024.3.1
gevent               24.2.1
google.auth          2.37.0
great_tables         <not installed>
matplotlib           3.9.3
numpy                1.26.4
openpyxl             3.1.2
pandas               2.2.3
pyarrow              12.0.0
pydantic             2.7.1
pyiceberg            <not installed>
sqlalchemy           2.0.19
torch                2.3.1+cu121
xlsx2csv             <not installed>
xlsxwriter           <not installed>
@ancri ancri added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Jan 30, 2025
@MarcoGorelli
Collaborator

MarcoGorelli commented Jan 30, 2025

pandas uses adjust=True by default and doesn't support adjust=False. In fact, pandas' adjust=True is flawed, see pandas-dev/pandas#54328

Polars supports adjust in ewm_mean, but ewm_mean_by is unadjusted and uses the formula documented here: https://docs.pola.rs/api/python/stable/reference/series/api/polars.Series.ewm_mean_by.html
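For reference, the unadjusted recursion from those docs reproduces the 2.0 in the report (a pure-Python sketch, assuming `alpha_i = 1 - 0.5 ** (dt / half_life)`):

```python
# Unadjusted recursion: y_i = alpha_i * x_i + (1 - alpha_i) * y_{i-1},
# with alpha_i = 1 - 0.5 ** (dt_days / half_life_days)
xs = [1.0, 3.0]
dts = [None, 1.0]          # days between consecutive observations
half_life = 1.0

y = xs[0]
for x, dt in zip(xs[1:], dts[1:]):
    alpha = 1.0 - 0.5 ** (dt / half_life)
    y = alpha * x + (1.0 - alpha) * y
print(y)  # 2.0, matching the ewm_mean_by output above
```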


You can match pandas' output like this:

In [30]: tbl.with_columns(x_ewm=pl.col('x').ewm_mean(half_life=1))
Out[30]:
shape: (2, 3)
┌─────┬────────────┬──────────┐
│ x   ┆ date       ┆ x_ewm    │
│ --- ┆ ---        ┆ ---      │
│ i64 ┆ date       ┆ f64      │
╞═════╪════════════╪══════════╡
│ 1   ┆ 2025-01-01 ┆ 1.0      │
│ 3   ┆ 2025-01-02 ┆ 2.333333 │
└─────┴────────────┴──────────┘

In [31]: tbl.with_columns(x_ewm=pl.col('x').ewm_mean(half_life=1, adjust=False))
Out[31]:
shape: (2, 3)
┌─────┬────────────┬───────┐
│ x   ┆ date       ┆ x_ewm │
│ --- ┆ ---        ┆ ---   │
│ i64 ┆ date       ┆ f64   │
╞═════╪════════════╪═══════╡
│ 1   ┆ 2025-01-01 ┆ 1.0   │
│ 3   ┆ 2025-01-02 ┆ 2.0   │
└─────┴────────────┴───────┘

I think this issue may need repurposing as a request for a properly implemented adjust=True in ewm_mean_by.

@MarcoGorelli MarcoGorelli changed the title Series.ewm_mean_by calcs are incorrect - missing a denominator enh: support adjus=True in ewm_mean_by Jan 30, 2025
@MarcoGorelli MarcoGorelli added enhancement New feature or an improvement of an existing feature and removed bug Something isn't working needs triage Awaiting prioritization by a maintainer labels Jan 30, 2025
@ancri
Author

ancri commented Jan 30, 2025

Thank you!

Until the feature is implemented, do you have a suggested workaround for obtaining adjusted ewm_mean_by behavior, i.e. one that lets me reference a date(time) column to decay along?

By the way, while I understand the desire to match pandas, from first principles it makes no sense (adjusted or not) to ever weigh two data points equally when one is current and the other is as old as the halflife. Under no interpretation of what someone wants from an exponential moving average is that behavior sensible.

To drive the point home, in the example below (using the same table as the original post) the average comes out to be almost 1. Instead, it should be close to 2.

tbl.with_columns(
  x_ewm = pl.col('x').ewm_mean_by('date', half_life='1000d')
)
# shape: (2, 3)
# ┌─────┬────────────┬──────────┐
# │ x   ┆ date       ┆ x_ewm    │
# │ --- ┆ ---        ┆ ---      │
# │ i64 ┆ date       ┆ f64      │
# ╞═════╪════════════╪══════════╡
# │ 1   ┆ 2025-01-01 ┆ 1.0      │
# │ 3   ┆ 2025-01-02 ┆ 1.001386 │
# └─────┴────────────┴──────────┘
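For comparison, a small pure-Python sketch of what a weight-normalized (adjusted) average would give here, under the assumption that each observation is weighted by `0.5 ** (age / half_life)`:

```python
# First-principles adjusted average for the example above (half_life = 1000d):
xs = [1.0, 3.0]
ages = [1.0, 0.0]          # days before the latest observation
half_life = 1000.0

weights = [0.5 ** (a / half_life) for a in ages]   # both very close to 1.0
y = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
print(y)  # ~2.0003: near-equal weights give a value close to the plain mean 2
```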

@MarcoGorelli
Collaborator

Do you have a formula reference for how ewm_mean_by with adjust=True should be implemented?

@MarcoGorelli
Collaborator

There is a suggestion in pandas-dev/pandas#54328 (comment), but I'm not expert enough in this area to say whether it's a justified approach

@alexander-beedie any chance I could get you or any quant you know to weigh in on this please?

@alexander-beedie alexander-beedie changed the title enh: support adjus=True in ewm_mean_by Support adjust=True in ewm_mean_by Jan 31, 2025
@ancri
Author

ancri commented Jan 31, 2025

Without looking through that entire thread, my proposal is this. If you'd like to use a non-recursive formula, just use the one from the pandas documentation for adjust=True:

y_t = (x_t + (1 - a) x_{t-1} + (1 - a)^2 x_{t-2} + ... + (1 - a)^t x_0)
      / (1 + (1 - a) + (1 - a)^2 + ... + (1 - a)^t)

If you instead prefer to do this recursively, I would use this formula:

y_0 = x_0
y_i = (x_i + (1 - a_i) * y_{i-1}) / (2 - a_i)

The results should match between these two approaches.

The explanation is this:

  • x_i gets a raw weight of 1
  • y_{i-1} gets a raw weight of (1 - a_i)
  • the raw weights are normalized to 1, so we divide the whole thing by (2 - a_i), which is the sum of the raw weights (1 + 1 - a_i)
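Applied to the two-row example from the top of the thread, this recursion reproduces pandas' adjusted value (a quick sketch; `a_i = 1 - 0.5 ** (dt / half_life)` is my assumption for the time-based decay):

```python
# Proposed adjusted recursion, applied to the example from the top:
# y_i = (x_i + (1 - a_i) * y_{i-1}) / (2 - a_i)
xs = [1.0, 3.0]
dts = [None, 1.0]          # days between consecutive observations
half_life = 1.0

y = xs[0]
for x, dt in zip(xs[1:], dts[1:]):
    a = 1.0 - 0.5 ** (dt / half_life)
    y = (x + (1.0 - a) * y) / (2.0 - a)
print(y)  # 2.3333..., agreeing with pandas' adjusted output
```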

@MarcoGorelli
Collaborator

Polars already uses that for adjust=True in ewm_mean

It's for the time-based one that I'm asking what should be done. As highlighted in pandas-dev/pandas#54328, pandas' time-based adjust=True is flawed, so I wouldn't want to introduce it into Polars

The suggestion for time-based adjusted ewm in pandas-dev/pandas#54328 (comment) seems reasonable, I'd just appreciate it if an expert in the field could confirm that

@ancri
Author

ancri commented Jan 31, 2025

Thank you. Looking at the proposed discrete formula of:

y(t_n) = sum_i [ delta_t_i * 0.5^((t_n - t_i)/lambda) * x_i ]
         / sum_i [ delta_t_i * 0.5^((t_n - t_i)/lambda) ]

Perhaps I'm being obtuse, but I don't understand the rationale for including delta_t in the product, in both the numerator and the denominator. The 0.5^((t1-t2)/lambda) factor already takes care of the appropriate time decay (assuming here that lambda means half_life).

I can see a justification for the proposed formula if you assume that the time series should be considered to "retain" its previous value for the duration of each interval. That seems rather strange. A more direct and natural interpretation is that you simply observe different values at discrete points in time and, when averaging, decay each observation by how long ago it was made.
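To make the disagreement concrete, here is a pure-Python sketch contrasting the two readings on hypothetical irregularly spaced data (the delta_t-weighted form is my reading of the linked proposal, with the first interval taken equal to the second as one possible convention):

```python
half_life = 1.0
ts = [0.0, 0.1, 1.0]       # observation times in days (irregular on purpose)
xs = [1.0, 2.0, 3.0]
now = ts[-1]

decay = [0.5 ** ((now - t) / half_life) for t in ts]

# (a) plain adjusted average: weight = time decay only
plain = sum(w * x for w, x in zip(decay, xs)) / sum(decay)

# (b) delta_t-weighted: weight also multiplied by the preceding interval
dts = [ts[1] - ts[0]] + [ts[i] - ts[i - 1] for i in range(1, len(ts))]
dtw = sum(d * w * x for d, w, x in zip(dts, decay, xs)) / sum(
    d * w for d, w in zip(dts, decay)
)
print(plain, dtw)  # the two interpretations disagree on irregular spacing
```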
