
Computing/storing a MultiscaleSpatialImage with dask array data #78

Open
m-albert opened this issue Aug 7, 2023 · 2 comments

m-albert commented Aug 7, 2023

Hi, I recently discovered spatial-image and multiscale-spatial-image, and I find it super useful to have an xarray-based representation of NGFF datasets (and of multiscale images in general!).

While exploring how best to use this approach, I came across the following observation. Consider this example:

import spatial_image as si
import multiscale_spatial_image as msi
import dask.array as da
from dask.diagnostics import ProgressBar

sim = si.to_spatial_image(da.zeros((10,10,10)))

msim = msi.to_multiscale(sim, scale_factors=[1, 2])
print(msim.groups)

with ProgressBar():
    # msim.to_zarr('file.zarr')
    msim.compute()

('/', '/scale0', '/scale1', '/scale2')
[########################################] | 100% Completed | 104.17 ms
[########################################] | 100% Completed | 105.15 ms
[########################################] | 100% Completed | 115.48 ms

When computing (or writing) a MultiscaleSpatialImage, the scale datasets are computed sequentially and independently, as indicated by the three separate progress bars in the output above. If I understand correctly, xarray's DataTree implements compute by calling it independently on each tree node. This means that although the dask arrays of the downsampled scales all depend on the full-resolution data, that data is recomputed from scratch for each scale (chunk caching is probably not relevant for sufficiently large data).

In my current understanding this leads to redundant computation, which becomes especially relevant for large or expensive underlying dask graphs. It seems more desirable to derive the downsampled data directly from the full-resolution data as its chunks are computed.

Is there currently a good way to obtain this behaviour for MultiscaleSpatialImages?

I guess that in principle one would want to bundle the computation of all tree nodes into a single call to dask.compute, so that the scheduler can take care of sharing work between scales (see the sketch below). A reasonable alternative to the workflow shown above would probably be to stream a SpatialImage to file first, and then stream that file into a MultiscaleSpatialImage; however, this involves an additional I/O step.
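
For illustration, something along these lines might already work (just a sketch, reusing msim and ProgressBar from above; it assumes the default "image" variable name created by to_spatial_image, and the exact tree accessors may differ between datatree versions):

import dask

# Collect the (still lazy) dask arrays of all scales.
arrays = [msim[f"scale{i}"]["image"].data for i in range(3)]

# A single compute call lets the scheduler share the full-resolution chunks
# between all downsampled scales instead of recomputing them per tree node.
with ProgressBar():
    results = dask.compute(*arrays)

The same should apply to writing: collecting the delayed objects returned by to_zarr(..., compute=False) for each scale dataset and computing them together in one go.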

Please excuse the lengthy issue and thanks a lot for any feedback / discussion :)

m-albert changed the title from "Computing MultiscaleSpatialImage" to "Computing/storing a MultiscaleSpatialImage with dask array data" on Aug 7, 2023
thewtex (Contributor) commented Oct 4, 2023

Hi,

Thanks for opening this.

Yes, we can avoid redundant computation if we know we want to compute all the content: when serializing, e.g. to zarr, serialize the first scale, then base the dask task graph for the second scale on the serialized first scale, and so on.

Furthermore, for very large datasets, we only want to create task graphs per region at the higher resolutions.

There is work in progress on these fronts in ngff-zarr that will be ported to multiscale-spatial-image.
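
To illustrate the approach (a rough sketch with plain xarray/dask, not the ngff-zarr implementation; it assumes the default "image" variable and ('z', 'y', 'x') dims from the example above, and uses simple factor-2 mean-coarsening between consecutive scales):

import xarray as xr

# Write the full-resolution scale first.
msim["scale0"].to_dataset().to_zarr("file.zarr", group="scale0", mode="w")

# Build each following scale from the scale that was just written, so its
# task graph starts at the zarr store instead of at the original dask graph.
for i, factor in enumerate([2, 2], start=1):
    previous = xr.open_zarr("file.zarr", group=f"scale{i - 1}")
    downsampled = previous.coarsen(
        {dim: factor for dim in ("z", "y", "x")}, boundary="trim"
    ).mean()
    downsampled.to_zarr("file.zarr", group=f"scale{i}", mode="w")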

m-albert (Author) commented Oct 5, 2023

Hi @thewtex, thanks a lot for commenting on this.

There is work in progress on these fronts in ngff-zarr to be ported to multiscale-spatial-image.

Great to hear you're working on this in ngff-zarr. I'm very interested in progress on this front and would be happy to test or help out if that's of use.
