idxmin / idxmax is not parallel friendly #9425

Open
dcherian opened this issue Sep 3, 2024 · Discussed in #9421 · 2 comments · May be fixed by #9800
Labels
topic-chunked-arrays (managing different chunked backends, e.g. dask), topic-dask

Comments

@dcherian
Contributor

dcherian commented Sep 3, 2024

Discussed in #9421

Originally posted by KBodolai September 2, 2024
Hi there! I have a question about the chunking behaviour when using idxmin / idxmax on a chunked array.

What is the expected behaviour of the chunks after running idxmin over one of the dimensions? Naively I'd expect the chunks along the other dimensions to be preserved, but that doesn't seem to be what happens. (Example below with dimensions time, x, y.)

import numpy as np
import xarray as xr

# create some dummy data and chunk it: one chunk along time,
# 256-element chunks along x and y
x, y, t = 1000, 1000, 57
data = np.arange(t * x * y)
da = xr.DataArray(
    data.reshape(t, x, y),
    coords={'time': range(t), 'x': range(x), 'y': range(y)},
)
da = da.chunk(dict(time=-1, x=256, y=256))

Now when I look at the array, it looks something like this:

[screenshot: da's dask repr, showing chunks of shape (57, 256, 256)]
da.idxmin('time')

But after doing idxmin I get the output below:

[screenshot: repr of da.idxmin('time'), showing a different chunk layout along x and y]
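
A quick way to see the same thing without screenshots is to compare .chunks directly (a minimal sketch; the exact post-idxmin chunk tuples depend on your dask version):

print(da.chunks)                 # ((57,), (256, 256, 256, 232), (256, 256, 256, 232))
result = da.idxmin('time')
print(result.chunks)             # unexpectedly rechunked along x and y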

My understanding is that it seems to be trying to keep the total number of chunks the same. But oddly, when we do it for floats:

da = da.astype('float32')

the reprs before and after idxmin look like this:

[screenshots: the float32 array before and after idxmin, showing different chunking behaviour than the integer case]
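
The same comparison can be run directly after the cast above (a sketch; no particular output is implied):

print(da.chunks)                   # chunks of the float32 array
print(da.idxmin('time').chunks)    # compare with the integer case above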

Is this the expected behaviour for this operation? I'm guessing the relevant reshaping in the source code happens here, but I haven't been able to figure out how yet.

Thanks!
K.

@dcherian
Contributor Author

dcherian commented Sep 3, 2024

Yes, this looks bad.

The code should use .vindex, though I think the best approach is to use Variable.__getitem__, which will handle all the complexity. The current implementation is:

if is_chunked_array(array.data):
    chunkmanager = get_chunked_array_type(array.data)
    chunks = dict(zip(array.dims, array.chunks))
    dask_coord = chunkmanager.from_array(array[dim].data, chunks=chunks[dim])
    # ravel the integer indices, index the coordinate pointwise, then reshape
    # back: this is what loses the original chunk structure
    data = dask_coord[duck_array_ops.ravel(indx.data)]
    res = indx.copy(data=duck_array_ops.reshape(data, indx.shape))
    # we need to attach back the dim name
    res.name = dim
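
As a rough sketch of that suggestion (reusing the names from the snippet above; this is an illustration, not the actual change in #9800):

import xarray as xr

# Wrap the coordinate in a Variable and index it with the integer-index
# Variable. Variable.__getitem__ picks the appropriate indexing mode
# (vectorized indexing, i.e. dask's .vindex for chunked data), so the result
# keeps indx's chunk structure with no explicit ravel/reshape.
coord = xr.Variable((dim,), array[dim].data)
res = indx.copy(data=coord[indx.variable].data)
res.name = dim  # attach back the dim name, as before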

dcherian added the topic-dask and topic-chunked-arrays labels Sep 3, 2024
dcherian added a commit to dcherian/xarray that referenced this issue Nov 19, 2024
@KBodolai

Hey, I absolutely missed this issue! I haven't yet had the pleasure of contributing to xarray, but do let me know if there's anything I can help with here, @dcherian. I'd love to give a hand.
