
Support adjust=True in ewm_mean_by #21015

Open
2 tasks done
ancri opened this issue Jan 30, 2025 · 7 comments
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars

Comments

@ancri

ancri commented Jan 30, 2025

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

from datetime import date, timedelta
import polars as pl

tbl = pl.DataFrame({
    'x': [1, 3],
    'date': [date(2025, 1, 1), date(2025, 1, 2)],
})
print(tbl.with_columns(x_ewm=pl.col('x').ewm_mean_by('date', half_life='1d')))

tbl_pandas = tbl.to_pandas()
print(tbl_pandas['x'].ewm(halflife=timedelta(days=1), times=tbl_pandas['date']).mean())

Log output

shape: (2, 3)
┌─────┬────────────┬───────┐
│ x   ┆ date       ┆ x_ewm │
│ --- ┆ ---        ┆ ---   │
│ i64 ┆ date       ┆ f64   │
╞═════╪════════════╪═══════╡
│ 1   ┆ 2025-01-01 ┆ 1.0   │
│ 3   ┆ 2025-01-02 ┆ 2.0   │
└─────┴────────────┴───────┘
0    1.000000
1    2.333333
Name: x, dtype: float64

Issue description

In the exponential moving average formula, you've forgotten to divide by the sum of weights. In the example above, this results in the average for 2025/1/2 being calculated as 2.0, which means it weighs the two observations equally, even though one is current and the other is as old as the halflife so it should get half the relative weight. The value should instead be 2.333. Note that pandas gets this right.
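For reference, the 2.333 value can be reproduced by hand with a weight-normalized average (a sketch, assuming each observation is weighted by `0.5 ** (age_in_days / half_life)`):

```python
# Adjusted (weight-normalized) EWM for the two-row example above:
# weight of each observation = 0.5 ** (age_in_days / half_life_days)
xs = [1.0, 3.0]
ages = [1.0, 0.0]          # days before the latest observation (2025-01-02)
half_life = 1.0

weights = [0.5 ** (a / half_life) for a in ages]   # [0.5, 1.0]
y = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
print(y)  # 2.3333..., matching the pandas output above
```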

Expected behavior

The formula should be corrected to divide by sum(a_i) as the denominator.

Installed versions

--------Version info---------
Polars:              1.21.0
Index type:          UInt32
Platform:            Linux-5.4.0-204-generic-x86_64-with-glibc2.31
Python:              3.10.11 (main, May 17 2023, 11:25:03) [GCC 9.4.0]
LTS CPU:             False

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  <not installed>
altair               5.4.1
azure.identity       <not installed>
boto3                1.34.74
cloudpickle          2.2.1
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2024.3.1
gevent               24.2.1
google.auth          2.37.0
great_tables         <not installed>
matplotlib           3.9.3
numpy                1.26.4
openpyxl             3.1.2
pandas               2.2.3
pyarrow              12.0.0
pydantic             2.7.1
pyiceberg            <not installed>
sqlalchemy           2.0.19
torch                2.3.1+cu121
xlsx2csv             <not installed>
xlsxwriter           <not installed>
@ancri ancri added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Jan 30, 2025
@MarcoGorelli
Collaborator

MarcoGorelli commented Jan 30, 2025

pandas uses adjust=True by default and doesn't support adjust=False. In fact, pandas' adjust=True is flawed, see pandas-dev/pandas#54328

Polars supports adjust in ewm_mean, but ewm_mean_by is unadjusted and uses the formula documented here: https://docs.pola.rs/api/python/stable/reference/series/api/polars.Series.ewm_mean_by.html
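For reference, the unadjusted recursion from those docs reproduces the 2.0 in the report (a pure-Python sketch, assuming `alpha_i = 1 - 0.5 ** (dt / half_life)`):

```python
# Unadjusted recursion: y_i = alpha_i * x_i + (1 - alpha_i) * y_{i-1},
# with alpha_i = 1 - 0.5 ** (dt_days / half_life_days)
xs = [1.0, 3.0]
dts = [None, 1.0]          # days between consecutive observations
half_life = 1.0

y = xs[0]
for x, dt in zip(xs[1:], dts[1:]):
    alpha = 1.0 - 0.5 ** (dt / half_life)
    y = alpha * x + (1.0 - alpha) * y
print(y)  # 2.0, matching the ewm_mean_by output above
```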


You can match pandas' output like this:

In [30]: tbl.with_columns(x_ewm=pl.col('x').ewm_mean(half_life=1))
Out[30]:
shape: (2, 3)
┌─────┬────────────┬──────────┐
│ x   ┆ date       ┆ x_ewm    │
│ --- ┆ ---        ┆ ---      │
│ i64 ┆ date       ┆ f64      │
╞═════╪════════════╪══════════╡
│ 1   ┆ 2025-01-01 ┆ 1.0      │
│ 3   ┆ 2025-01-02 ┆ 2.333333 │
└─────┴────────────┴──────────┘

In [31]: tbl.with_columns(x_ewm=pl.col('x').ewm_mean(half_life=1, adjust=False))
Out[31]:
shape: (2, 3)
┌─────┬────────────┬───────┐
│ x   ┆ date       ┆ x_ewm │
│ --- ┆ ---        ┆ ---   │
│ i64 ┆ date       ┆ f64   │
╞═════╪════════════╪═══════╡
│ 1   ┆ 2025-01-01 ┆ 1.0   │
│ 3   ┆ 2025-01-02 ┆ 2.0   │
└─────┴────────────┴───────┘

I think this issue may need repurposing as a request for a properly implemented adjust=True in ewm_mean_by.

@MarcoGorelli MarcoGorelli changed the title Series.ewm_mean_by calcs are incorrect - missing a denominator enh: support adjus=True in ewm_mean_by Jan 30, 2025
@MarcoGorelli MarcoGorelli added enhancement New feature or an improvement of an existing feature and removed bug Something isn't working needs triage Awaiting prioritization by a maintainer labels Jan 30, 2025
@ancri
Author

ancri commented Jan 30, 2025

Thank you!

Until the feature is implemented, do you have a suggested workaround for obtaining adjusted ewm_mean_by behavior, i.e. one that lets me reference a date(time) column to decay along?

By the way, while I understand the desire to match pandas, from first principles it makes no sense (adjusted or not) to ever weigh two data points equally when one is current and the other is as old as the halflife. Under no interpretation of what someone wants from an exponential moving average is that behavior sensible.

To drive the point home, in the example below (using the same table as the original post) the average comes out to be almost 1. Instead, it should be close to 2.

tbl.with_columns(
  x_ewm = pl.col('x').ewm_mean_by('date', half_life='1000d')
)
# shape: (2, 3)
# ┌─────┬────────────┬──────────┐
# │ x   ┆ date       ┆ x_ewm    │
# │ --- ┆ ---        ┆ ---      │
# │ i64 ┆ date       ┆ f64      │
# ╞═════╪════════════╪══════════╡
# │ 1   ┆ 2025-01-01 ┆ 1.0      │
# │ 3   ┆ 2025-01-02 ┆ 1.001386 │
# └─────┴────────────┴──────────┘
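For comparison, a small pure-Python sketch of what a weight-normalized (adjusted) average would give here, under the assumption that each observation is weighted by `0.5 ** (age / half_life)`:

```python
# First-principles adjusted average for the example above (half_life = 1000d):
xs = [1.0, 3.0]
ages = [1.0, 0.0]          # days before the latest observation
half_life = 1000.0

weights = [0.5 ** (a / half_life) for a in ages]   # both very close to 1.0
y = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
print(y)  # ~2.0003: near-equal weights give a value close to the plain mean 2
```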

@MarcoGorelli
Collaborator

Do you have a formula reference for how ewm_mean_by with adjust=True should be implemented?

@MarcoGorelli
Collaborator

There is a suggestion in pandas-dev/pandas#54328 (comment), but I'm not expert enough in this area to say whether it's a justified approach

@alexander-beedie any chance I could get you or any quant you know to weigh in on this please?

@alexander-beedie alexander-beedie changed the title enh: support adjus=True in ewm_mean_by Support adjust=True in ewm_mean_by Jan 31, 2025
@ancri
Author

ancri commented Jan 31, 2025

Without looking through that entire thread, my proposal is this. If you'd like to use a non-recursive formula, just use the one from the pandas documentation for adjust=True:

y_t = (x_t + (1 - a) x_{t-1} + (1 - a)^2 x_{t-2} + ... + (1 - a)^t x_0)
      / (1 + (1 - a) + (1 - a)^2 + ... + (1 - a)^t)

If you instead prefer to do this recursively, I would use this formula:

y_0 = x_0
y_i = (x_i + (1 - a_i) * y_{i-1}) / (2 - a_i)

The results should match between these two approaches.

The explanation is this:

  • x_i gets a raw weight of 1
  • y_{i-1} gets a raw weight of (1 - a_i)
  • the raw weights are normalized to 1, so we divide the whole thing by (2 - a_i), which is the sum of the raw weights (1 + 1 - a_i)
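Applied to the two-row example from the top of the thread, this recursion reproduces pandas' adjusted value (a quick sketch; `a_i = 1 - 0.5 ** (dt / half_life)` is my assumption for the time-based decay):

```python
# Proposed adjusted recursion, applied to the example from the top:
# y_i = (x_i + (1 - a_i) * y_{i-1}) / (2 - a_i)
xs = [1.0, 3.0]
dts = [None, 1.0]          # days between consecutive observations
half_life = 1.0

y = xs[0]
for x, dt in zip(xs[1:], dts[1:]):
    a = 1.0 - 0.5 ** (dt / half_life)
    y = (x + (1.0 - a) * y) / (2.0 - a)
print(y)  # 2.3333..., agreeing with pandas' adjusted output
```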

@MarcoGorelli
Collaborator

Polars already uses that for adjust=True in ewm_mean

It's for the time-based one that I'm asking what should be done. As highlighted in pandas-dev/pandas#54328, pandas' time-based adjust=True is flawed, so I wouldn't want to introduce it into Polars

The suggestion for time-based adjusted ewm in pandas-dev/pandas#54328 (comment) seems reasonable, I'd just appreciate it if an expert in the field could confirm that

@ancri
Author

ancri commented Jan 31, 2025

Thank you. Looking at the proposed discrete formula of:

y(t_n) = sum_i [ delta_t_i * 0.5^((t_n - t_i)/lambda) * x_i ]
         / sum_i [ delta_t_i * 0.5^((t_n - t_i)/lambda) ]

Perhaps I'm being obtuse, but I don't understand the rationale for including delta_t in the product, in both the numerator and the denominator. The 0.5^((t1-t2)/lambda) factor already takes care of the appropriate time decay (assuming here that lambda means half_life).

I can see a justification for the proposed formula if you assume that the time series should be considered to "retain" its previous value for the duration of each interval. That seems rather strange. A more direct and natural interpretation is that you simply observe different values at discrete points in time and, when averaging, decay each observation by how long ago it was made.
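To make the disagreement concrete, here is a pure-Python sketch contrasting the two readings on hypothetical irregularly spaced data (the delta_t-weighted form is my reading of the linked proposal, with the first interval taken equal to the second as one possible convention):

```python
half_life = 1.0
ts = [0.0, 0.1, 1.0]       # observation times in days (irregular on purpose)
xs = [1.0, 2.0, 3.0]
now = ts[-1]

decay = [0.5 ** ((now - t) / half_life) for t in ts]

# (a) plain adjusted average: weight = time decay only
plain = sum(w * x for w, x in zip(decay, xs)) / sum(decay)

# (b) delta_t-weighted: weight also multiplied by the preceding interval
dts = [ts[1] - ts[0]] + [ts[i] - ts[i - 1] for i in range(1, len(ts))]
dtw = sum(d * w * x for d, w, x in zip(dts, decay, xs)) / sum(
    d * w for d, w in zip(dts, decay)
)
print(plain, dtw)  # the two interpretations disagree on irregular spacing
```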
