idxmin / idxmax is not parallel friendly #9425

Open
dcherian opened this issue Sep 3, 2024 · Discussed in #9421 · 2 comments · May be fixed by #9800
Labels
topic-chunked-arrays (managing different chunked backends, e.g. dask), topic-dask

Comments

@dcherian
Contributor

dcherian commented Sep 3, 2024

Discussed in #9421

Originally posted by KBodolai September 2, 2024
Hi there! I have a question about the chunking behaviour when using idxmin / idxmax on a chunked array.

What is the expected behaviour of the chunks after running idxmin over one of the dimensions? Naively I'd expect the chunks along the other dimensions to be preserved, but that doesn't seem to be what happens. (Example below with dimensions time, x, y.)

import numpy as np
import xarray as xr

# create some dummy data and chunk it: one chunk along time,
# 256-element chunks along x and y
x, y, t = 1000, 1000, 57
data = np.arange(t * x * y)
da = xr.DataArray(
    data.reshape(t, x, y),
    coords={'time': range(t), 'x': range(x), 'y': range(y)},
)
da = da.chunk(dict(time=-1, x=256, y=256))

Now when I look at the array, it looks something like this:

[screenshot: da's dask repr, showing chunks of shape (57, 256, 256)]
da.idxmin('time')

But after doing idxmin I get the output below:

[screenshot: repr of da.idxmin('time'), showing a different chunk layout along x and y]
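
A quick way to see the same thing without screenshots is to compare .chunks directly (a minimal sketch; the exact post-idxmin chunk tuples depend on your dask version):

print(da.chunks)                 # ((57,), (256, 256, 256, 232), (256, 256, 256, 232))
result = da.idxmin('time')
print(result.chunks)             # unexpectedly rechunked along x and y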

My understanding is that it seems to be trying to keep the total number of chunks the same. But oddly, when we do it for floats:

da = da.astype('float32')

the reprs before and after idxmin look like this:

[screenshots: the float32 array before and after idxmin, showing different chunking behaviour than the integer case]
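
The same comparison can be run directly after the cast above (a sketch; no particular output is implied):

print(da.chunks)                   # chunks of the float32 array
print(da.idxmin('time').chunks)    # compare with the integer case above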

Is this the expected behaviour for this operation? I'm guessing the relevant reshaping in the source code happens here, but I haven't been able to figure out how yet.

Thanks!
K.

@dcherian
Contributor Author

dcherian commented Sep 3, 2024

Yes, this looks bad.

The code should use .vindex, though I think the best approach is to use Variable.__getitem__, which will handle all the complexity. The current implementation is:

if is_chunked_array(array.data):
    chunkmanager = get_chunked_array_type(array.data)
    chunks = dict(zip(array.dims, array.chunks))
    dask_coord = chunkmanager.from_array(array[dim].data, chunks=chunks[dim])
    # ravel the integer indices, index the coordinate pointwise, then reshape
    # back: this is what loses the original chunk structure
    data = dask_coord[duck_array_ops.ravel(indx.data)]
    res = indx.copy(data=duck_array_ops.reshape(data, indx.shape))
    # we need to attach back the dim name
    res.name = dim
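
As a rough sketch of that suggestion (reusing the names from the snippet above; this is an illustration, not the actual change in #9800):

import xarray as xr

# Wrap the coordinate in a Variable and index it with the integer-index
# Variable. Variable.__getitem__ picks the appropriate indexing mode
# (vectorized indexing, i.e. dask's .vindex for chunked data), so the result
# keeps indx's chunk structure with no explicit ravel/reshape.
coord = xr.Variable((dim,), array[dim].data)
res = indx.copy(data=coord[indx.variable].data)
res.name = dim  # attach back the dim name, as before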

dcherian added the topic-dask and topic-chunked-arrays labels Sep 3, 2024
dcherian added a commit to dcherian/xarray that referenced this issue Nov 19, 2024
@KBodolai

Hey, I absolutely missed this issue! I haven't yet had the pleasure of contributing to xarray, but do let me know if there's anything I can help with here, @dcherian. I'd love to give a hand.
