Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add spatial fill nearest neighbor function #14

Open
delgadom opened this issue Oct 3, 2018 · 7 comments
Open

Add spatial fill nearest neighbor function #14

delgadom opened this issue Oct 3, 2018 · 7 comments

Comments

@delgadom
Copy link
Member

delgadom commented Oct 3, 2018

prototype implementation

import numpy as np
from scipy.spatial import cKDTree

def spatial_fillna_nearest_neighbor(
        da,
        x_dim='longitude',
        y_dim='latitude',
        distance_upper_bound=np.inf,
        inplace=False):

    xy_dims = [x_dim, y_dim]
    not_xy_dims = [d for d in da.dims if d not in xy_dims]

    not_all_nans = da.notnull().any(dim=not_xy_dims)

    # get vectors of isnull, notnull flags
    stacked_isnull_flag = (~not_all_nans).stack(obs=xy_dims)
    notnull_flag = (~stacked_isnull_flag.values)
    isnull_flag = stacked_isnull_flag.values
    
    # get full set of xy points
    xy_full = np.vstack([stacked_isnull_flag[x_dim], stacked_isnull_flag[y_dim]]).T
    
    # get set of isnull, notnull xy points
    xy_isnull = xy_full[isnull_flag]
    xy_notnull = xy_full[notnull_flag]
    
    # build kdtree from valid points
    tree = cKDTree(xy_notnull)
    _, null_nn_notnull_indices = tree.query(
        xy_isnull, k=1, distance_upper_bound=distance_upper_bound)
    
    nearest_neighbor_valid = (null_nn_notnull_indices != len(xy_notnull))
    
    # build a mask for null values that have been successfully mapped to nearest neighbors
    isnull_and_filled_flag = isnull_flag.copy()
    isnull_and_filled_flag[isnull_flag] = nearest_neighbor_valid
    
    # build an indexing array with filled values pointing to their nearest neighbors
    isnull_nn_indices = np.arange(xy_full.shape[0])
    isnull_nn_indices[isnull_and_filled_flag] = (
        isnull_nn_indices[notnull_flag][null_nn_notnull_indices[nearest_neighbor_valid]])

    if not inplace:
        da = da.copy()
        
    all_dims = (not_xy_dims + xy_dims)
    dim_inds = [da.dims.index(d) for d in all_dims]
    res_shapes = [da.shape[i] for i in dim_inds]
    dim_sorter = [all_dims.index(d) for d in da.dims]

    da.values = (
        da
        .stack(obs=xy_dims)
        .transpose(*tuple(list(not_xy_dims) + ['obs']))
        .values[..., isnull_nn_indices]
        .reshape(res_shapes)
        .transpose(*dim_sorter))
    
    if not inplace:
        return da
@delgadom
Copy link
Member Author

delgadom commented Oct 3, 2018

@kemccusker here's my implementation of a spatial nearest neighbor function. it assumes that the spatial pattern of nearest neighbors shouldn't change with time (it interpolates cells that are always NaN to cells that have at least one non-NaN value).

thinking about putting this in climate toolbox. any requests?

@dgergel
Copy link
Member

dgergel commented Feb 4, 2020

@delgadom I think it would be great if we could add this to the climate toolbox. One edit that I needed to make this function run:

notnull_flag = (~stacked_isnull_flag).values

@brews
Copy link
Member

brews commented Feb 4, 2020

Cool!

If used ~globally, do we care about bias from using rectangular, latlon grids at higher |lat|? (And maybe I'm being too nerdy and this isn't really an issue for target use cases)

The goal is just to interpolate-out NaNs, right?

@dgergel
Copy link
Member

dgergel commented Feb 5, 2020

I think for now we don't need to worry too much about bias at higher latitudes.

Yep - the goal is just to interpolate out NaNs, because the interpolate_na function in xarray only works on 1-d arrays and doesn't like a multiindex if you stack dims, so functionality outside of xarray is necessary. Scipy's cKDTree function has a tolerance option that its interpolate module doesn't have so that's the motivation for using it. I also considered switching cKDTree to BallTree but decided that was too nerdy (unless you have thoughts on this @brews?)

@brews
Copy link
Member

brews commented Feb 5, 2020

Do we need it to run fast/small and be super awesome? I wouldn't be surprised if someone already has balltree with haversine distance or something.

Edit:
Mature, Grown-up Brewster says: Seriously though, Mike's solution might be good enough. I wouldn't sweat it unless its been shown to be a bottleneck.

@dgergel
Copy link
Member

dgergel commented Feb 5, 2020

that's actually why I considered switching to balltree because it has the haversine option.

But then got pressed for time so I stuck with this. this actually is pretty fast, I used it on global data and it was significantly faster than my 1-d solution that involved stacking and then interpolating.

@delgadom
Copy link
Member Author

delgadom commented Mar 5, 2020

Yeah these are all great caveats for this function for sure. We currently just use it to map near-coastal pixels to coastal pixels for areas with a landmask mismatch (e.g. comparing NASA/NEX and BEST). I'm not super worried about the error introduced from slightly too frequently grabbing values from cells above/below rather than left/right, and the intention is to explicitly prevent interpolating over large distances with the distance_upper_bound argument, so I think we're good. Nearest neighbor is waaay faster than anything messing with haversine distance.

delgadom added a commit to ClimateImpactLab/impactlab-tools that referenced this issue Jun 25, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants