Chunk Spec #7
I hope I represented the discussions we had on this correctly; if not, please everyone feel free to jump in.
In this context, we should also keep in mind the netcdf data model.
One point I want to chime in on:
I don't see that variable-sized chunks ⟹ chunk-headers, if that's what the last sentence means.

**Mini-proposal: "heterogeneous grid" variable-chunk-sizing**

I was imagining variably-sized chunks still conforming to a hyper-rectangular grid, where the chunk-size along a given dimension depends only on the chunk's index along that axis. Notable benefits:

**Example**

Suppose we have a 3-D dataset where chunks have size (1,10,100,10,1) along dimension 1, (200,20,2,22) along dimension 2, and (3,30,300,30,3) along dimension 3. There are 100 (5×4×5) chunks, but we only require 14 (5+4+5) integers in the metadata to describe the grid.

**In general**

With n dimensions, where kᵢ is the number of chunks along dimension i, the metadata would contain (k₁ + k₂ + … + kₙ) integers in total, which is much smaller than the number of chunks (k₁k₂…kₙ) and would always be quite manageable. Even in the worst case (chunk size 1 along every axis), the number of integers is only the sum of the array's dimension lengths.

**Motivation: distributed processing**

I was part of discussions last year where we brainstormed this kind of "heterogeneous grid" layout, but I can't find it on the Zarr repo; it may have been just with @tomwhite / @laserson or some CZI folks (@mckinsel?). The impetus was that, in a Spark/Hadoop context, we might load a (1-D or 2-D, say) dataset, filter out some elements (1-D) or rows/cols (2-D), and want to save the resulting dataset to disk, but would incur an expensive "shuffle" stage to get back to evenly-spaced chunks. Requiring perfectly-evenly-shaped chunks would add significant overhead in such settings.

**"Heterogeneous-grid" chunking is orthogonal to other possible approaches**

Supporting a "heterogeneous grid" like I'm describing should be orthogonal to supporting other, more flexible variable-chunk-sizing schemes.

**What is the alternative?**

Thinking about it more, though, I don't see how you can really do any variable-chunk-sizing that doesn't conform to a "grid" like I'm describing… indexing/slicing quickly become undefined, unless I'm missing something? Interested in others' thoughts! Sorry if some of this is covered elsewhere already. (xref: #40)
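The "heterogeneous grid" lookup described above can be sketched in a few lines. This is a hypothetical illustration, not part of any Zarr or N5 API: given the per-dimension chunk-size lists from the example, a global element coordinate maps to a chunk index and an in-chunk offset via per-axis cumulative sums and binary search.

```python
# Hypothetical sketch of "heterogeneous grid" chunk lookup.
# Chunk sizes taken from the example above; names are illustrative.
from bisect import bisect_right
from itertools import accumulate

chunk_sizes = [
    (1, 10, 100, 10, 1),   # dimension 1: 5 chunks
    (200, 20, 2, 22),      # dimension 2: 4 chunks
    (3, 30, 300, 30, 3),   # dimension 3: 5 chunks
]

# Cumulative chunk end boundaries per dimension, e.g. (1, 11, 111, 121, 122).
boundaries = [tuple(accumulate(sizes)) for sizes in chunk_sizes]

def locate(coord):
    """Map a global element coordinate to (chunk index, offset within chunk)."""
    chunk_idx, offset = [], []
    for c, bounds in zip(coord, boundaries):
        i = bisect_right(bounds, c)          # which chunk along this axis
        start = bounds[i - 1] if i else 0    # global start of that chunk
        chunk_idx.append(i)
        offset.append(c - start)
    return tuple(chunk_idx), tuple(offset)
```

Note that the full grid (100 chunks) is described by just the 14 integers in `chunk_sizes`, and each lookup is O(log kᵢ) per axis.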
Yes, you're correct. I changed the sentence to "Variable chunk size would require header or information about chunks in the metadata." Also, the heterogeneous-grid approach is very interesting. This could also be very useful.
Thanks @constantinpape for raising this, sorry for coming late to the party. Just wanted to xref this issue: https://github.com/zarr-developers/zarr/issues/245 - @jakirkham raised a requirement for the heterogeneous grid (a.k.a. non-uniform chunking), the use case being to store dask arrays without having to rechunk to a uniform grid. I'd also suggest we aim to break this issue up into a number of separate issues, each of which can come to a decision point. I don't have any concrete suggestions for how to do that right now, but will give it some thought.
Following up on today's call and #3, define a specification for how chunks are represented in memory before going through (compression) filters and storage.
Minimum requirement: a chunk can store nd-tensors of primitive datatypes.
There was also a consensus to support big and little endian data (and C/F layout where appropriate).
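The big/little endian point above can be illustrated with the standard library alone: the same 32-bit integer serialized in both byte orders (a minimal, stdlib-only illustration, not spec text).

```python
# Illustration of the endianness question: one int32 value, two byte orders.
import struct

value = 1
little = struct.pack("<i", value)  # little-endian: b'\x01\x00\x00\x00'
big = struct.pack(">i", value)     # big-endian:    b'\x00\x00\x00\x01'

# Reading data back with the wrong byte order silently yields a wrong value,
# which is why the chunk spec must record the byte order explicitly.
wrong = struct.unpack("<i", big)[0]  # 16777216, not 1
```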
On top of that, we discussed the following questions:
Regarding 2:
Use case 1: storing edge chunks that are not fully covered.
@axtimwalde pointed out that this allows direct mapping to memory without copying data in the n5-imglib implementation.
Use case 2: appending / prepending to datasets. This could be used to implement prepending to datasets without modifying existing chunks. Note that one of @alimanfoo's motivations to NOT implement variable chunk size was to always have valid chunks when appending to a dataset.
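Use case 1 above (edge chunks that are not fully covered) can be sketched as follows. With a uniform nominal chunk shape, the last chunk along each axis may extend past the array boundary; a helper can compute the actually-covered extent. This is a hypothetical sketch, with an illustrative function name, not part of any implementation discussed here.

```python
# Hypothetical sketch of the edge-chunk case: the actual extent of a chunk,
# clipped at the array boundary (names are illustrative).

def chunk_extent(array_shape, chunk_shape, chunk_index):
    """Return the shape of the data actually covered by the given chunk."""
    return tuple(
        min(c, n - i * c)  # full chunk size, or whatever remains at the edge
        for n, c, i in zip(array_shape, chunk_shape, chunk_index)
    )
```

For a (10, 10) array with nominal (4, 4) chunks, the interior chunks cover (4, 4) elements while the corner edge chunk covers only (2, 2); writing the actual extent avoids padding, while writing the nominal extent allows the direct memory mapping mentioned above.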
Regarding 3:
The n5 use cases we discussed were simple examples like storing unique values in the spatial block corresponding to a chunk and more complicated examples like the n5-label-multiset.
Also, this could be useful to define non-primitive datatypes, e.g. strings encoded via offsets and values. See also 4.
Regarding 4:
During the discussion, several additional datatypes that could be supported were discussed:
More generally, there is the question of how we could provide a mechanism for extensions to the spec that define new datatypes.
In the current zarr implementation, numpy arrays of objects can be stored via a special filter, see #6.
In the current n5 implementation, non-primitive datatypes can be encoded into a varlength chunk and must be decoded again with a separate library (i.e. not part of n5-core).