Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Average time #211

Merged
merged 8 commits into from
Feb 10, 2022
Merged

Average time #211

merged 8 commits into from
Feb 10, 2022

Conversation

ellesmith88
Copy link
Collaborator

Pull Request Checklist:

  • What kind of change does this PR introduce?:
    Adds average_time in clisops.core and clisops.ops and tests.

  • Does this PR introduce a breaking change?:
    No

@cehbrecht
Copy link
Collaborator

Only a question ... do we want to remove average_over_dims in the next release? Or is it useful on its own?

@ellesmith88
Copy link
Collaborator Author

@cehbrecht @agstephens I think we said to remove it at some point, I can do that in a new PR if we are ready to remove it?

Copy link
Collaborator

@agstephens agstephens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ellesmith88,
Thanks for putting this PR together. There is one issue for us to resolve before we press ahead - and it's not a problem with your code, it might be a problem with xarray and the decisions it makes.

@@ -217,3 +217,110 @@ def test_aux_variables():
)

assert "do_i_get_written" in result[0].variables


def test_average_over_years():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ellesmith88 this is all looking good and the tests seem to cover everything we'd need.
There is one issue that we might need to consider philosophically (about what xarray is doing):

In the test test_average_over_years():

  • the input data starts on 16/12/2005 and ends 16/12/2299.
  • the output data starts on 01/01/2005 and ends on 01/01/2299

In CF terms, we can think of the new time array as being a series of time periods rather than time points (i.e. time point 1 represents the entire year of 2005). In which case, the expected output array would typically be defined with a time axis and time_bounds bounds array.

And typically the first time point would be:

time[0] = 2005-07-01 (i.e. halfway(ish) through the year)
time_bounds[0] = [2005-01-01, 2299-12-31]

However, that is not a definitive rule. I think the first thing to do is to test the values of the bounds. If the bounds come out correctly then we can probably just accept what xarray is doing.

Please can you extend this test (and some of the others) to include a check for bounds values for time in a way that is logical.

Note that xarray has all these different frequencies:

https://github.com/pydata/xarray/blob/main/xarray/coding/cftime_offsets.py#L605-L608

I'm not sure if the YearEnd and YearBegin options might be relevant to this issue in some way.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@agstephens having a look at the output arrays, it seems that there are no time bounds.

For example, from the test test_average_over_years(), the output dataset is:

<xarray.Dataset>
Dimensions:   (time: 295, lat: 2, bnds: 2, lon: 2)
Coordinates:
  * time      (time) object 2005-01-01 00:00:00 ... 2299-01-01 00:00:00
    height    float64 1.5
  * lat       (lat) float64 -90.0 35.0
  * lon       (lon) float64 0.0 187.5
Dimensions without coordinates: bnds
Data variables:
    lat_bnds  (time, lat, bnds) float64 dask.array<chunksize=(1, 2, 2), meta=np.ndarray>
    lon_bnds  (time, lon, bnds) float64 dask.array<chunksize=(1, 2, 2), meta=np.ndarray>
    tas       (time, lat, lon) float32 dask.array<chunksize=(1, 2, 2), meta=np.ndarray>

The frequencies I'm using for calculating the time average are:
month: 1MS
year: 1AS

and I think the 'S' indicates using the start of the year or start of the month for resampling. I think this is probably what we want it to do, but I might be confused!
Other options include one such as AS-JUN, but I think this means the average is calculated over June - May rather than over Jan - Dec.

There are options in xarray resample (https://xarray.pydata.org/en/stable/generated/xarray.Dataset.resample.html) to change the label for the output times, but only to the left or right of the interval, so this isn't very helpful.

I can't see anything else about changing the label/ indicating the time period each time point covers, but will have another look to see what other options there might be

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ellesmith88, thanks for looking into this.

I wonder if they have looked into this within the CDS code. Please can you get in touch with James (ECMWF) to ask for his thoughts on this issue. I wonder if they have come across and tackled this problem - or whether they are just happy to accept what xarray produces.

Thanks

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ellesmith88 Just looking at your work so far on this. I would propose that we might want to re-use a create_time_bounds(...) function in other code (in future). So please can you move out that functionality into a separate function. I think it might get used outside of clisops.core and clisops.ops so you could put them in a new module: clisops.utils.time_utils.

@agstephens agstephens merged commit d7c3d7a into master Feb 10, 2022
@agstephens agstephens deleted the average_time branch February 10, 2022 09:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants