@rly this is a proposal.
The goal here is to support datasets with compound dtypes in both LindiH5Store and LindiClient such that the round trip leaves compound dtypes intact. In particular, you can slice the dataset through the LindiClient in the same way you would slice the original with h5py.
(btw, I didn't look too carefully at how this is handled in hdmf-zarr, but I think the below proposal is consistent with how things are represented in numpy and hdf5)
To illustrate, a test was added that round-trips a compound-dtype dataset and checks the slicing behavior.
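Roughly, the round trip it exercises looks like the sketch below. The lindi calls (`LindiH5Store.from_file`, `LindiClient.from_zarr_store`, and field access on slices) are assumptions based on this description, not necessarily the exact API; the h5py/numpy parts are standard.

```python
import h5py
import numpy as np
import lindi  # the lindi method names below are assumptions

# Create an HDF5 dataset with a compound dtype (field names are illustrative).
dt = np.dtype([("id", "i4"), ("value", "f8")])
with h5py.File("example.h5", "w") as f:
    f.create_dataset("table", data=np.array([(1, 3.14), (2, 6.28)], dtype=dt))

# Hypothetical round trip through the zarr representation.
store = lindi.LindiH5Store.from_file("example.h5")
client = lindi.LindiClient.from_zarr_store(store)
x = client["table"]

# Slicing through the client is expected to match h5py slicing of the original.
with h5py.File("example.h5", "r") as f:
    expected = f["table"][1]
assert x[1]["id"] == expected["id"]
assert x[1]["value"] == expected["value"]
```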
To summarize this test: a dataset with a compound dtype survives the LindiH5Store → LindiClient round trip, and slicing it through the client gives the same results as h5py slicing of the original file.
Internally this is represented using the `_COMPOUND_DTYPE` attribute on the zarr array, which records the names and dtypes of the compound fields. The data for the array is a JSON-encoded array of the form `[[1, 3.14], [2, 6.28]]`.
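As a sketch of how that attribute and the inline data relate to a numpy structured array (the field names and the exact [name, dtype] pair layout are assumptions for illustration; see the source for the real format):

```python
import json
import numpy as np

# Structured (compound) dtype as numpy/h5py would see it; field names are illustrative.
dt = np.dtype([("id", "<i4"), ("value", "<f8")])
data = np.array([(1, 3.14), (2, 6.28)], dtype=dt)

# A plausible _COMPOUND_DTYPE attribute: a list of [field name, dtype] pairs.
compound_dtype_attr = [[name, data.dtype[name].str] for name in data.dtype.names]
print(compound_dtype_attr)  # [['id', '<i4'], ['value', '<f8']]

# The inline data is JSON-encoded row by row.
encoded = json.dumps([[int(row["id"]), float(row["value"])] for row in data])
print(encoded)  # [[1, 3.14], [2, 6.28]]
```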
See comments in the source code for more details.
A side note: ideally we wouldn't need JSON encoding at all and could instead point to the chunks in the original hdf5 file, but that would be a lot trickier to get working with zarr, and we're going to want special encoding for references anyway. One caveat of the JSON-encoded inline data approach is that it can make the .zarr.json file very large for large compound-dtype datasets.