Sparse chunk memory layout #48

Open
alimanfoo opened this issue Aug 1, 2019 · 1 comment
Labels
protocol-extension Protocol extension related issue

Comments

alimanfoo commented Aug 1, 2019

This is a placeholder for a potential protocol extension to define sparse memory layouts for chunks.

The idea is to enable use of a sparse memory layout (e.g., CSR, CSC or COO) within each chunk of a Zarr array. I.e., a Zarr array has a regular chunk grid as normal, but instead of using a dense C contiguous or F contiguous layout for the data within each chunk, use a sparse memory layout.

E.g., in the case of COO the memory layout would comprise two memory blocks, one storing the coordinates, the other storing the data values. For the purposes of encoding and storage, these two memory blocks could be concatenated into a single memory block, which could then be passed down through filter and compressor codecs and stored as normal. When retrieving and decoding the chunk, the coordinates and the data values could be presented as views of different regions of the memory block, to avoid extra memory copies.
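The concatenate-then-view idea above can be sketched with NumPy. This is only an illustration under stated assumptions: the function names, the int64 coordinate dtype, and the 8-byte header holding the number of stored values are all hypothetical and not part of the proposal. The header is included because the decoder needs some way to know where the coordinates end and the values begin.

```python
import numpy as np

def encode_coo_chunk(coords, data):
    """Concatenate COO coordinates and values into a single bytes block.

    Assumed (hypothetical) layout: an 8-byte nnz header, then the
    coordinates as little-endian int64 with shape (ndim, nnz), then the values.
    """
    coords = np.ascontiguousarray(coords, dtype="<i8")
    data = np.ascontiguousarray(data)
    header = np.array([data.shape[0]], dtype="<i8")
    return header.tobytes() + coords.tobytes() + data.tobytes()

def decode_coo_chunk(buf, ndim, dtype):
    """Return (coords, data) as views onto regions of buf."""
    nnz = int(np.frombuffer(buf, dtype="<i8", count=1)[0])
    coords = np.frombuffer(buf, dtype="<i8", count=ndim * nnz, offset=8)
    coords = coords.reshape(ndim, nnz)
    data = np.frombuffer(buf, dtype=dtype, count=nnz, offset=8 + coords.nbytes)
    return coords, data

# A 2D chunk with three stored values.
coords = np.array([[0, 1, 3], [2, 0, 1]])
data = np.array([1.5, 2.5, 3.5])

buf = encode_coo_chunk(coords, data)  # would be passed to filters/compressor
coords2, data2 = decode_coo_chunk(buf, ndim=2, dtype="f8")
```

`np.frombuffer` does not copy the underlying bytes, so `coords2` and `data2` refer to different regions of the same decoded block. In this sketch they are read-only because `buf` is an immutable `bytes` object.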

In terms of the Zarr v3 core protocol, this could be specified as a protocol extension, defining new memory layouts that could be used within the chunk_memory_layout array metadata property.
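As a rough illustration, array metadata using such an extension might look like the fragment below. Only the chunk_memory_layout property name comes from the description above; the value "COO" and the other key shown are placeholders, not anything defined by the protocol or by an extension.

```python
import json

# Hypothetical array metadata fragment. The "COO" value is an illustrative
# placeholder for a layout that a protocol extension would have to define.
array_metadata = {
    "shape": [10000, 10000],       # placeholder, included only for context
    "chunk_memory_layout": "COO",  # instead of a dense C or F contiguous layout
}

metadata_json = json.dumps(array_metadata, indent=2)
print(metadata_json)
```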

An implementation in Python could be relatively straightforward, by using an existing sparse array library like SciPy (for 2D chunks) or sparse (for ND chunks) to manage the chunks, instead of NumPy.
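For the 2D case, a minimal sketch of handing a decoded chunk to SciPy might look like this. The helper name `coo_blocks_to_scipy` is hypothetical, and the sketch assumes the chunk has already been decoded into a coordinates block of shape (2, nnz) and a values block of length nnz.

```python
import numpy as np
from scipy import sparse

def coo_blocks_to_scipy(coords, data, chunk_shape):
    """Wrap decoded COO blocks for a 2D chunk as a scipy.sparse.coo_matrix."""
    rows, cols = coords
    return sparse.coo_matrix((data, (rows, cols)), shape=chunk_shape)

# Decoded blocks for a 4x4 chunk with three stored values.
coords = np.array([[0, 1, 3], [2, 0, 1]])
data = np.array([1.5, 2.5, 3.5])

chunk = coo_blocks_to_scipy(coords, data, chunk_shape=(4, 4))
csr = chunk.tocsr()  # convert if row slicing or fast arithmetic is needed
```

For ND chunks, the `COO` class in the sparse package accepts a similar coordinates/values pair plus a shape, so the same pattern would likely carry over.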

This could also integrate nicely with blocked parallel computing frameworks like Dask, because each chunk would be presented as a sparse array, and so any computational steps within the task graph that could operate directly on the sparse representation could do so, rather than forcing data into a dense representation.

Note that this is different from discussions about defining conventions for storing sparse arrays in Zarr, where a collection of two or more Zarr arrays is used to store a single sparse array (e.g., for a COO array, the coords would be stored in one Zarr array, and the data in a second Zarr array). That may be equally worthwhile to pursue, but is a different concept and probably serves slightly different use cases.

@alimanfoo

cc @ryan-williams
