
Check nan handling in aggregation #26

Open
delgadom opened this issue Aug 13, 2019 · 4 comments

@delgadom
Member

Proposed spec:

For a given region to be aggregated:

  • If there are no NaNs in a region, you're good. Just aggregate the data.
  • If a NaN is present in a region, but the region is not entirely NaNs, aggregate the non-NaN pixels using only the weight in those pixels (i.e., renormalize the weights over the non-NaN pixels so that no valid data is dropped and the missing pixels don't dilute the average)
  • If an entire region is NaN, the output should be NaN

Note that no interpolation is done in time, space, or any other dimensions with these tools

This behavior should be verified with tests on the aggregated output
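A minimal sketch of this spec for a single region, assuming NumPy (the `weighted_aggregate` helper is hypothetical, not part of this package):

```python
import numpy as np

def weighted_aggregate(values, weights):
    """Aggregate one region's pixels per the proposed spec:
    skip NaN pixels, renormalize weights over the valid ones,
    and return NaN only if every pixel is NaN."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    valid = ~np.isnan(values)
    if not valid.any():
        return np.nan  # entire region is NaN -> output NaN
    return np.sum(values[valid] * weights[valid]) / np.sum(weights[valid])

print(weighted_aggregate([1.0, 3.0], [0.5, 0.5]))          # no NaNs: plain weighted mean, 2.0
print(weighted_aggregate([1.0, np.nan, 3.0], [0.2, 0.6, 0.2]))  # reweights over valid pixels, ~2.0
print(weighted_aggregate([np.nan, np.nan], [0.5, 0.5]))    # all NaN: nan
```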

@delgadom
Member Author

@atrisovic @andyhultgren here is the proposed NaN handling spec. Does this seem right?

@andyhultgren

@delgadom @atrisovic Yes this is right. Here is the full set of NaN conditions:

    1) Defaults to area weights when user-defined weights are missing/zero
       for an entire adm polygon.
    2) Maintains NaN gridcell values (in climate data) instead of defaulting
       to zero.
    3) Reweights around NaN gridcells in climate data when aggregating
       spatially (only outputs a NaN value per region-day if all pixels are
       missing).
    4) Assigns NaN values in climate data to a given unit of temporal
       aggregation (month or year) if any day has a NaN value.

For example, our transforms of the climate data are structured to pass NaNs through like this:
# apply the polynomial term cell-wise, leaving NaN cells untouched
my_poly = lambda x: x if np.isnan(x) else pow(x, term_val)
vfunc = np.vectorize(my_poly)
transdata = vfunc(raw_data)

And then when we spatially collapse gridcells to adm units, we take the weighted average while ignoring missing values (unless they are all missing for a given adm unit, in which case NaN is returned):
num = grp._get_numeric_data()
# weighted sum over non-null values / total weight of the non-null values
return (num.multiply(num[weight_col], axis=0).sum()
        / num.notnull().astype('int').multiply(num[weight_col], axis=0).sum())
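For anyone reading along, here is the same math on a toy DataFrame (the `tas` column name and the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# one adm unit's gridcells: a climate column plus the weight column
grp = pd.DataFrame({'tas':    [1.0, np.nan, 3.0],
                    'weight': [0.2, 0.6,    0.2]})
num = grp  # stands in for grp._get_numeric_data()
weight_col = 'weight'

out = (num.multiply(num[weight_col], axis=0).sum()
       / num.notnull().astype('int').multiply(num[weight_col], axis=0).sum())
print(out['tas'])  # ~2.0 -- the NaN pixel's weight drops out of the denominator
```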

However when aggregating over time (for an adm unit) if e.g. one day has missing data for the entire adm unit, then the temporally-aggregated output (say a month containing that day) is also reported as missing. As in:
return grp.sum(skipna=False, axis=1)

All of these code snippets are from the master branch of the merge_transform_average.py script.
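As a quick sanity check of that skipna=False behavior (toy frame with made-up column names, not from the script):

```python
import numpy as np
import pandas as pd

# daily values to collapse to one temporal total per row (toy layout)
grp = pd.DataFrame({'day1': [1.0, 1.0],
                    'day2': [2.0, np.nan]})

monthly = grp.sum(skipna=False, axis=1)
print(monthly.tolist())  # [3.0, nan] -- any NaN day poisons the whole month
```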

@delgadom
Member Author

delgadom commented Aug 29, 2019

@andyhultgren thanks! You're absolutely right - @atrisovic the aggregation math in aggregations.py line 92 does need to be changed:

    weighted = xr.Dataset({
        variable: (
            (
                (ds[variable]*ds[aggwt])
                .groupby(agglev)
                .sum(dim='reshape_index')) /
            (
                ds[aggwt]
                .groupby(agglev)
                .sum(dim='reshape_index')))})

should be

    weighted = xr.Dataset({
        variable: (
            (
                (ds[variable]*ds[aggwt])
                .groupby(agglev)
                .sum(dim='reshape_index')) /
            (
                ds[aggwt]
                .where(ds[variable].notnull())
                .groupby(agglev)
                .sum(dim='reshape_index')))})
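If it helps, here's a quick check of the corrected expression on a toy dataset (the data and the `region` grouping coordinate are made up for illustration):

```python
import numpy as np
import xarray as xr

# toy data: 4 pixels across 2 regions; one pixel in region 'a' is NaN
ds = xr.Dataset(
    {'tas':   ('reshape_index', [1.0, np.nan, 3.0, 5.0]),
     'aggwt': ('reshape_index', [0.5, 0.5, 0.5, 0.5])},
    coords={'region': ('reshape_index', ['a', 'a', 'b', 'b'])})

variable, aggwt, agglev = 'tas', 'aggwt', 'region'

weighted = xr.Dataset({
    variable: (
        (
            (ds[variable] * ds[aggwt])
            .groupby(agglev)
            .sum(dim='reshape_index')) /
        (
            ds[aggwt]
            .where(ds[variable].notnull())
            .groupby(agglev)
            .sum(dim='reshape_index')))})

# region 'a' reweights around its NaN pixel; region 'b' is a plain weighted mean
print(weighted[variable].values)  # [1. 4.]
```

An all-NaN region comes out as 0/0 = NaN under this expression, which also matches the spec.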

also @andyhultgren, sorry for the nitpicky off-topic point, but just so you know, np.vectorize is super slow compared with actual vector math operations. It's essentially just a for loop, so this is a pure-Python cell-wise operation:

my_poly = lambda x: x if np.isnan(x) else pow(x, term_val)
vfunc = np.vectorize(my_poly)
transdata = vfunc(raw_data)

You'd see major speedups with something like this:

transdata = np.where(np.isnan(raw_data), raw_data, raw_data ** term_val)

Although exponentiation preserves NaNs anyway, so the np.where doesn't actually do anything here and it could just be written:

transdata = raw_data ** term_val
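For instance, on a toy array:

```python
import numpy as np

raw_data = np.array([2.0, np.nan, 3.0])
term_val = 2

transdata = raw_data ** term_val  # NaN entries stay NaN; no masking needed
print(transdata)
```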

@delgadom
Member Author

Oh, and to respond completely: the aggregation code only aggregates in space; aggregation in time is defined within each transformation. Mortality polynomials preserve daily resolution, ag GDDs/KDDs sum to the month, and energy is annual. We'll make sure the transformations fit the current spec, but just so you know, it will ultimately be up to the user to make sure the transformations are done right.
