Add SeasonGrouper, SeasonResampler #9524

dcherian · 2024-09-20T03:04:37Z

These two groupers allow defining custom seasons, and dropping incomplete seasons from the output. Both cases are treated by adjusting the factorization -- conversion from group labels to integer codes -- appropriately.
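
To make "adjusting the factorization" concrete, here is a minimal sketch of the idea, not the implementation in this PR. It assumes seasons are written as runs of consecutive month initials; the helper names `season_to_months` and `factorize_months` are hypothetical. Months that belong to no season get the code -1 (the usual convention for "no group"), which is one way a grouper can leave them out of the result.

```python
import numpy as np

MONTH_INITIALS = "JFMAMJJASOND"


def season_to_months(season: str) -> list[int]:
    """Map a season string such as "DJF" to 1-based month numbers."""
    # Search a doubled calendar so that seasons may wrap around December.
    start = (MONTH_INITIALS * 2).index(season)
    return [(start + i) % 12 + 1 for i in range(len(season))]


def factorize_months(months: np.ndarray, seasons: list[str]) -> np.ndarray:
    """Return one integer code per month; -1 marks "not in any season"."""
    codes = np.full(len(months), -1, dtype=int)
    for code, season in enumerate(seasons):
        codes[np.isin(months, season_to_months(season))] = code
    return codes


months = np.arange(1, 13)
# December shares code 0 with January and February.
codes = factorize_months(months, ["DJF", "MAM", "JJAS", "ON"])
# With a single season, every other month is coded -1.
subsampled = factorize_months(months, ["MAM"])
```

Because each time step carries exactly one code, this simple scheme cannot represent a month that belongs to two seasons; later seasons would overwrite earlier ones.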

The last piece from #8509

Closes Ordered Groupby Keys #757
Closes Computing 'seasonal means' spanning 4 months (with resample or groupy) #6180
Closes xarray_ Group by season #6664
Closes How to change or reorder grouping by 'time.season'? #5134
Closes Error: Changing months combination in xarray season for India Region #6012
Closes Calculating ONDJFM mean for each season #6865
Tests added
User visible changes (including notable bug fixes) are documented in whats-new.rst
New functions/methods are listed in api.rst

Example:

import xarray as xr
from xarray.groupers import SeasonGrouper, SeasonResampler

ds = xr.tutorial.open_dataset("air_temperature")

# custom seasons! 
ds.air.groupby(time=SeasonGrouper(["JF", "MAM", "JJAS", "OND"])).mean()

ds.air.resample(time=SeasonResampler(["DJF", "MAM", "JJAS", "ON"])).count()

TODO:

Needs a boatload of tests
narrative docs
support drop_incomplete in SeasonGrouper
more cftime calendar support

cc @tomvothecoder do you have time to contribute some tests? I bet we'll simplify a bunch of xcdat this way, and you probably already have tests :)

oliviermarti · 2024-09-20T09:08:02Z

First comments, though I have only run quick tests.

1 -

In your short example, it's probably :
from xarray.groupers import SeasonGrouper, SeasonResampler
and not :
from xarray.core.groupers import SeasonGrouper, SeasonResampler

Or my Github knowledge is too limited, and I'm not testing the right branch.

2 - Season grouper

Seems OK for everything I have tested. In particular, I can:

Use it for one season only (subsampling):
ds.air.groupby(time=SeasonGrouper(["MAM"]))['MAM']
Use it for 4-month seasons, as we often do in paleoclimate with calendar shifting (oversampling):
ds.air.groupby(time=SeasonGrouper(["DJFM", "MAMJ", "JJAS", "SOND"])).mean()

3 - Season resampler

Works as expected from the example.
I would really appreciate having access to subsampling:
ds.air.resample(time=SeasonResampler(["DJF", "MAM", "JJA", "ON"]))
or :
ds.air.resample(time=SeasonResampler(["JJAS"]))
and to oversampling :
ds.air.resample(time=SeasonResampler(["DJFM", "MAMJ", "JJAS", "SOND"]))

It could be useful to have a NaN value for an incomplete season: the first DJF cannot be computed, and is not. This means that the first value is not a DJF value but a MAM value, which could be a bit misleading.

4 - cftime

I have tested it with cftime calendars instead of datetime. It works with the traditional calendars (gregorian, standard), but not with others such as 360_day, 365_day, julian, or proleptic_gregorian:
TypeError: cannot compute the time difference between dates with different calendars

5 - Simple data

I've built a dataset with the month number as a variable, so I can be sure that the computation is correct.
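
The month-number trick can be sketched without xarray. The snippet below is a hypothetical stand-in for the dataset described above, using plain NumPy: when each value is its own month number, the expected mean of any season is simply the mean of its month numbers, which makes wrong groupings easy to spot.

```python
import numpy as np

n_years = 3
# Each monthly value is the month number itself (1..12), repeated per year.
months = np.tile(np.arange(1, 13), n_years)
values = months.astype(float)

# Expected seasonal means follow directly from the month numbers.
mam_mean = values[np.isin(months, [3, 4, 5])].mean()
jjas_mean = values[np.isin(months, [6, 7, 8, 9])].mean()
```

Here `mam_mean` is 4.0 and `jjas_mean` is 7.5, so any other value returned for those seasons would point to a grouping problem.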

Thanks for these features. They are quite easy and straightforward to use.

In particular, they make it possible to work on individual variables, whereas xcdat features work on Datasets only, which leads to a more complicated syntax.

I'm gonna try to imagine further tests.

Olivier

dcherian · 2024-09-20T15:24:17Z

Thanks @oliviermarti! This is incredibly helpful.

In your short example, it's probably : from xarray.groupers import SeasonGrouper, SeasonResampler

Yes, my mistake. I fixed the snippet.

ds.air.groupby(time=SeasonGrouper(["DJFM", "MAMJ", "JJAS", "SOND"])).mean()

This should not work. Did you really get correct results?

It could be useful to have a NaN value for an incomplete season

The drop_incomplete option should let you control this.

oliviermarti · 2024-09-20T15:43:37Z

ds.air.groupby(time=SeasonGrouper(["DJFM", "MAMJ", "JJAS", "SOND"])).mean()

This should not work. Did you really get correct results?

In fact, no! Only the first value is correct. It is a bit dangerous that it returns a result rather than an error.
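
Overlapping seasons mean that some months belong to two groups at once, which a single label per time step cannot express. The sketch below only detects which months are shared; it is an illustration, not code from this PR, and `season_to_months` and `shared_months` are hypothetical helpers that assume seasons are runs of consecutive month initials.

```python
from collections import defaultdict

MONTH_INITIALS = "JFMAMJJASOND"


def season_to_months(season: str) -> list[int]:
    """Map a season string such as "DJFM" to 1-based month numbers."""
    start = (MONTH_INITIALS * 2).index(season)
    return [(start + i) % 12 + 1 for i in range(len(season))]


def shared_months(seasons: list[str]) -> dict[int, list[str]]:
    """Return {month: [seasons]} for months claimed by more than one season."""
    owners = defaultdict(list)
    for season in seasons:
        for month in season_to_months(season):
            owners[month].append(season)
    return {m: s for m, s in sorted(owners.items()) if len(s) > 1}


overlaps = shared_months(["DJFM", "MAMJ", "JJAS", "SOND"])
no_overlaps = shared_months(["DJF", "MAM", "JJA", "SON"])
```

For the four-month seasons above, March, June, September and December are each claimed by two seasons, while the standard three-month seasons share nothing.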

It could be useful to have a NaN value for an incomplete season
The drop_incomplete option should let you control this.

drop_incomplete computes a value using fewer months. This is the behaviour of xcdat. I would like to get a NaN instead (I should ask xcdat for that too).
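
The behaviour requested here, a NaN for any season that is missing months, can be sketched in plain Python. This illustrates the idea only; it is not an option that this PR or xcdat is known to provide. `seasonal_means_with_nan` is a hypothetical name, and each season instance is keyed by the calendar year in which it starts.

```python
import math
from collections import defaultdict


def seasonal_means_with_nan(records, season_months):
    """Mean per season instance; NaN when any month of the season is missing.

    records: iterable of (year, month, value), one entry per month.
    season_months: month numbers in season order, e.g. [12, 1, 2] for DJF.
    """
    groups = defaultdict(list)
    for year, month, value in records:
        if month not in season_months:
            continue
        # Months before the season's first month belong to the season
        # that started in the previous calendar year.
        start_year = year - 1 if month < season_months[0] else year
        groups[start_year].append(value)

    means = {}
    for start_year, vals in sorted(groups.items()):
        complete = len(vals) == len(season_months)
        means[start_year] = sum(vals) / len(vals) if complete else math.nan
    return means


# Two years of monthly data where each value is its month number.
records = [(y, m, float(m)) for y in (2001, 2002) for m in range(1, 13)]
djf = seasonal_means_with_nan(records, [12, 1, 2])
```

With data covering 2001 and 2002, the DJF seasons starting in 2000 and 2002 are incomplete and come out as NaN, while the one starting in 2001 has the mean 5.0.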

Olivier

dcherian · 2024-09-20T23:48:59Z

ds.air.groupby(time=SeasonGrouper(["DJFM", "MAMJ", "JJAS", "SOND"])).mean()

I went back to my dev notebook, turns out I figured this out already!

tomvothecoder · 2024-09-30T16:09:44Z

cc @tomvothecoder do you have time to contribute some tests? I bet we'll simplify a bunch of xcdat this way, and you probably already have tests :)

Hi @dcherian, thank you for this PR! I've been looking forward to having this feature in Xarray. No guarantees on a timeline, but I plan to start looking at this PR this week. I'll experiment with this feature and see how I can leverage it to simplify xCDAT PR #423 for custom seasons. I'll also try to contribute any useful tests.


tomvothecoder · 2024-11-13T19:25:27Z

Hey @dcherian, quick question. Will this PR add support for using SeasonGrouper along with Datetime components (e.g., ds.time.dt.year)?

For example, if I wanted to perform grouped averaging on year and custom seasons it might look like:

ds.air.groupby(time=[ds.time.dt.year, SeasonGrouper(["JF", "MAM", "JJAS", "OND"])]).mean()

dcherian · 2024-11-13T19:45:30Z

Yes, I think so. This is conceptually similar to groupby(["time.month", "time.year"]), but we require dict inputs to be explicit:

ds.air.groupby(
    {"time.year": UniqueGrouper(), "time": SeasonGrouper(["JF", "MAM", "JJAS", "OND"])}
).mean()

Here's an example that is hopefully easy to interpret.

Is this right?

One trouble I'm having here is defining good tests for arbitrary seasons. Do you have suggestions, or existing tests in xcdat that I can generalize?

tomvothecoder · 2024-11-13T22:45:57Z

Yes, I think so this is conceptually similar to groupby(["time.month", "time.year"]) but we require dict inputs to be explicit
ds.coords["year"] = ds.time.dt.year
ds.air.groupby(year=UniqueGrouper(), time=SeasonGrouper(["JF", "MAM", "JJAS", "OND"])).mean()
Here's an example, that is hopefully easy to interpret

Is this right?

Yeah that's exactly what I was looking for. Thanks!

One trouble I'm having here is defining good tests for arbitrary seasons. Do you have suggestions or existing ones in xcdat that i can generalize?

Sure I have some in xcdat that you can adapt for this PR. I will share them with you shortly.

tomvothecoder · 2024-11-13T23:56:04Z

Another question: if we're defining custom seasons with months that span the calendar year, those months are from the previous year, correct?

For example for "NDJFM", "ND" should be from the previous year.

air.groupby(year=UniqueGrouper(), time=SeasonGrouper(["NDJFM"]))

dcherian · 2024-11-14T02:10:38Z

Yes, it tries to be that smart.

dcherian · 2024-11-14T19:36:09Z

@tomvothecoder @oliviermarti I fixed the existing tests now, please try it out!

FWIW the need to support seasons=["JJAS"] is adding quite some complexity. We should consider not supporting it.

tomvothecoder · 2024-11-15T00:16:20Z

I'm writing a few tests right now. How do you want me to add them to your fork branch?

Another question: If we're defining custom seasons with months that span the calendar year, those months are from the previous year correct?

For example for "NDJFM", "ND" should be from the previous year.
air.groupby(year=UniqueGrouper(), time=SeasonGrouper(["NDJFM"])) 

I noticed in a test I'm writing for the above code that "ND" is being taken from the same year, not the previous year. I think we expect the previous year "ND" to be used instead. I will show a clear example once I add the test.

dcherian · 2024-11-15T00:30:03Z

Ah nice find. A PR to this branch should be the easiest

* main:
  fix cf decoding of grid_mapping (pydata#9765)
  Allow wrapping `np.ndarray` subclasses (pydata#9760)
  Optimize polyfit (pydata#9766)
  Use `map_overlap` for rolling reductions with Dask (pydata#9770)
  fix html repr indexes section (pydata#9768)


tomvothecoder · 2024-11-18T19:40:27Z

Ah nice find. A PR to this branch should be the easiest

Gotcha, will do.

RE: My comment above about annual seasonal averaging.

I've attached the Python script that compares the annual seasonal averages between Xarray and xCDAT. The custom seasons are "NDJFM", "AMJ". "NDJFM" spans the calendar year, so we expect the previous year "ND" to be used for grouping.

Results

Xarray (actual) uses the same year "ND" for grouping, while xCDAT (expected, PR #423) uses the previous year "ND". I manually verified that the averages for xCDAT are correct (here).

import numpy as np
import xarray as xr
import xcdat as xc  # noqa: F401
from xarray.groupers import SeasonGrouper, UniqueGrouper

# Create a sample dataset from 2001-01-01 to 2002-12-30
time = xr.cftime_range("2001-01-01", "2002-12-30", freq="MS", calendar="standard")
data = np.array(
    [
        1.0,
        1.25,
        1.5,
        1.75,
        2.0,
        1.1,
        1.35,
        1.6,
        1.85,
        1.2,
        1.45,
        1.7,
        1.95,
        1.05,
        1.3,
        1.55,
        1.8,
        1.15,
        1.4,
        1.65,
        1.9,
        1.25,
        1.5,
        1.75,
    ]
)
da = xr.DataArray(name="air", data=data, dims="time", coords={"time": time})
da["year"] = da.time.dt.year

# Actual (Xarray groupby with custom seasons)
# -------------------------------------------
actual = da.groupby(year=UniqueGrouper(), time=SeasonGrouper(["NDJFM", "AMJ"])).mean()

print(actual)

"""
Xarray uses the same year "ND" for "NDJFM" grouping (not expected).

<xarray.DataArray 'air' (year: 2, season: 2)> Size: 32B
array([[1.61666667, 1.38      ],
       [1.5       , 1.51      ]])
Coordinates:
  * year     (year) int64 16B 2001 2002
  * season   (season) object 16B 'AMJ' 'NDJFM'
"""

# Expected (xCDAT groupby with custom seasons)
# --------------------------------------------
ds = da.to_dataset()

custom_seasons = [["Nov", "Dec", "Jan", "Feb", "Mar"], ["Apr", "May", "Jun"]]
expected = ds.temporal.group_average(
    "air",
    weighted=False,
    freq="season",
    season_config={"custom_seasons": custom_seasons},
)

print(expected)
"""
xCDAT uses the previous year "ND" for "NDJFM" grouping (expected).

<xarray.DataArray 'air' (time: 5)> Size: 40B
array([1.25      , 1.61666667, 1.49      , 1.5       , 1.625     ])
Coordinates:
  * time     (time) object 40B 2001-01-01 00:00:00 ... 2003-01-01 00:00:00
Attributes:
    operation:                temporal_avg
    mode:                     group_average
    freq:                     season
    weighted:                 False
    drop_incomplete_seasons:  False
    custom_seasons:           ['NovDecJanFebMar', 'AprMayJun']
"""

print(expected.time)
"""
xCDAT represents time coords with cftime, with the middle month representing
the season.

<xarray.DataArray 'time' (time: 5)> Size: 40B
array([cftime.DatetimeGregorian(2001, 1, 1, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(2001, 5, 1, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(2002, 1, 1, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(2002, 5, 1, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(2003, 1, 1, 0, 0, 0, 0, has_year_zero=False)],
      dtype=object)
Coordinates:
  * time     (time) object 40B 2001-01-01 00:00:00 ... 2003-01-01 00:00:00
"""

tomvothecoder · 2024-11-19T16:28:29Z

Xarray (actual) uses the same year "ND" for grouping, while xCDAT (expected, PR #423) uses the previous year "ND". I manually verified that the averages for xCDAT are correct (here).

In xCDAT, I get the indices of all time coords with months that span the calendar year and shift them over by a year (+1) before grouping with Xarray (since Xarray uses same-year months for grouping). I haven't looked at the Xarray grouping code yet, but there is probably a cleaner way to support seasons that span years.
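
The year shift described above can be sketched in plain Python using the 24 monthly values from the script. This is an illustration of the idea, not xCDAT's or Xarray's code, and `season_year` is a hypothetical helper. November and December are assigned to the following year's "NDJFM" season before grouping.

```python
from collections import defaultdict

data = [
    1.0, 1.25, 1.5, 1.75, 2.0, 1.1, 1.35, 1.6, 1.85, 1.2, 1.45, 1.7,  # 2001
    1.95, 1.05, 1.3, 1.55, 1.8, 1.15, 1.4, 1.65, 1.9, 1.25, 1.5, 1.75,  # 2002
]
seasons = {"NDJFM": [11, 12, 1, 2, 3], "AMJ": [4, 5, 6]}


def season_year(year: int, month: int, months: list[int]) -> int:
    """Label a season by the year it ends in: shift its leading months by +1."""
    wraps = months[0] > months[-1]
    return year + 1 if wraps and month >= months[0] else year


groups = defaultdict(list)
for i, value in enumerate(data):
    year, month = 2001 + i // 12, i % 12 + 1
    for name, months in seasons.items():
        if month in months:
            groups[(season_year(year, month, months), name)].append(value)

means = {key: round(sum(vals) / len(vals), 8) for key, vals in groups.items()}
```

The five resulting means (1.25, 1.61666667, 1.49, 1.5, 1.625) match the xCDAT output shown in the script. The labels here use the year in which each season ends, whereas a resampler may label the same seasons by their start date.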


dcherian · 2024-11-21T04:31:48Z

@tomvothecoder my mistake. That is a "resampling" operation, so

da.resample(time=SeasonResampler(["NDJFM", "AMJ"], drop_incomplete=False)).mean()

gives what you want:

<xarray.DataArray (time: 5)> Size: 40B
array([1.25      , 1.61666667, 1.49      , 1.5       , 1.625     ])
Coordinates:
  * time     (time) object 40B 2000-11-01 00:00:00 ... 2002-11-01 00:00:00

We can't handle grouping by year and season separately.

This was referenced Sep 20, 2024

Add support for custom seasons spanning calendar years xCDAT/xcdat#423

Merged

[Feature]: custom seasons that span calendar years xCDAT/xcdat#416

Closed

dcherian force-pushed the custom-groupers branch from 54b2ef1 to 594d4a7 Compare September 20, 2024 03:14

TomNicholas added the topic-groupby label Sep 20, 2024

dcherian added 6 commits November 12, 2024 13:32

Add SeasonGrouper, SeasonResampler

7e3a6a4

These two groupers allow defining custom seasons, and dropping incomplete seasons from the output. Both cases are treated by adjusting the factorization -- conversion from group labels to integer codes -- appropriately.

Allow sliding seasons

879b496

cftime support

8268c46

Add skeleton tests

31cc519

Support "subsampled" seasons

96ae241

small edits

77dc5e0

dcherian force-pushed the custom-groupers branch from 9180536 to 77dc5e0 Compare November 12, 2024 20:36

Add reset

d68b1e4

dcherian added 3 commits November 14, 2024 09:18

Fix tests

1b7a9fc

Raise if seasons are not sorted for resampling

be5f933

fix Self import

bd21b48

dcherian force-pushed the custom-groupers branch from 7aaafb2 to c66ad96 Compare November 14, 2024 19:37

tomvothecoder mentioned this pull request Nov 14, 2024

[Enhancement]: Refactor custom_seasons to use Xarray's SeasonGrouper() API [WIP] xCDAT/xcdat#714

Open

dcherian added 3 commits November 15, 2024 11:26

Redo calendar fixtures

09640b7

fix test

8773faf

cftime tests

879af59

dcherian force-pushed the custom-groupers branch from c66ad96 to 879af59 Compare November 15, 2024 18:28

dcherian added 5 commits November 15, 2024 21:07

Fix doctest

2ca67da

typing

f5191e5

fix test

2512d53

Merge branch 'main' into custom-groupers

b9507fe

* main:
  Add download stats badges (pydata#9786)
  Fix open_mfdataset for list of fsspec files (pydata#9785)
  add 'User-Agent'-header to pooch.retrieve (pydata#9782)
  Optimize `ffill`, `bfill` with dask when `limit` is specified (pydata#9771)

tomvothecoder mentioned this pull request Nov 18, 2024

Add tests for SeasonGrouper API (PR #9524) dcherian/xarray#40

Merged

Add tests for SeasonGrouper API (PR pydata#9524) (#40)

b385532

* Add tests for SeasonalGrouper API
* Add more tests
