empty/null datatype #318

Open
flying-sheep opened this issue Oct 17, 2024 · 18 comments

Comments

@flying-sheep

h5py has h5py.Empty, and @ivirshup said something like that was debated here as well.

What’s the status of that discussion?

@d-v-b
Contributor

d-v-b commented Nov 27, 2024

Can you say more about what h5py.Empty does? I suspect we don't have anything like it in zarr, but I'm not aware of any discussions about it.

@flying-sheep
Author

flying-sheep commented Nov 29, 2024

It represents the concept of “something exists under that path, but it has neither dtype nor size”.

Like null in JavaScript or None in Python.

@d-v-b
Contributor

d-v-b commented Nov 29, 2024

We definitely don't have anything like that! Maybe it's useful when you want to partially initialize an array? But I'm not aware of a use case for this in Zarr.

@flying-sheep
Author

flying-sheep commented Nov 29, 2024

Sorry, maybe “datatype” was misleading. I think what you’re referring to is a nullable dtype, which definitely has a lot of uses, but is a different question.

This is about storing an empty dataset. E.g. {"array": np.array([1, 2], dtype=np.int64)} can easily be represented as a Zarr store, but {"array": None} can’t.

@d-v-b
Contributor

d-v-b commented Nov 29, 2024

maybe I don't understand what an "empty dataset" is? It sounds like it's a dataset, but doesn't have certain properties specified, hence my reference to partial initialization. In Zarr, arrays have to have a complete metadata document, and the shape / dtype fields are mandatory, so it's not possible to have something that's a zarr array with an indeterminate or unset shape and dtype.
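
The constraint described above can be sketched in plain Python. This is only an illustration, not the zarr-python API: the helper name `missing_required_fields` is hypothetical, and the field list is reduced to the two fields named in this thread (real Zarr metadata documents require more).

```python
# Hypothetical helper, not part of any Zarr library.
# Only the two fields named in this thread are checked; real metadata
# documents have additional mandatory fields.
REQUIRED_FIELDS = ("shape", "dtype")


def missing_required_fields(metadata: dict) -> list:
    """Return the required field names that are absent from a metadata dict."""
    return [name for name in REQUIRED_FIELDS if name not in metadata]


complete = {"shape": [2], "dtype": "<i8"}
no_shape_or_dtype = {}

print(missing_required_fields(complete))           # []
print(missing_required_fields(no_shape_or_dtype))  # ['shape', 'dtype']
```

Under this simplified model, a node with neither shape nor dtype is not a valid array, which is the gap the request is about.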

@flying-sheep
Author

Yup, that’s why we’re asking to add support for that.

@d-v-b
Contributor

d-v-b commented Nov 29, 2024

Would you want certain fields of the zarr array / group metadata documents to be optional? What's the use case (besides mapping on to hdf5)?

@flying-sheep
Author

I wrote down the use case here: #318 (comment)

The idea would be to have a marker that represents an empty array. Not 0 dimensions (which means 1 scalar value) but nothing. I don’t know what the best representation would be, but I don’t think making fields optional individually makes sense. E.g. it wouldn’t make sense to have an array without shape but with dtype. Either both or neither.
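
The distinction between "0 dimensions" and "nothing" can be made concrete by counting elements. The sketch below uses only the standard library; the function name `element_count` and the use of `None` to stand for "no shape at all" are assumptions made for illustration.

```python
import math
from typing import Optional, Tuple


def element_count(shape: Optional[Tuple[int, ...]]) -> Optional[int]:
    """Number of elements implied by a shape; None stands for 'no shape at all'."""
    if shape is None:
        return None
    return math.prod(shape)  # the empty product is 1


print(element_count(()))      # 1: zero dimensions, one scalar value
print(element_count((0,)))    # 0: one dimension of length zero
print(element_count((2, 3)))  # 6
print(element_count(None))    # None: the "nothing" case requested here
```

A shape of `()` still describes one value, and `(0,)` describes an array that happens to hold no elements; neither expresses the absence of a shape.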

@d-v-b
Contributor

d-v-b commented Nov 29, 2024

Can you explain why you want to represent an empty array? I'm not sure I understand what this is for. The example in #318 (comment) doesn't really explain why you want this.

If the goal is to model a partial zarr / hdf5 hierarchy that has "holes", where details of arrays / groups are undefined (e.g. because you want to initialize a hierarchy but you don't know everything yet about your arrays / groups), then I think this can only be done in-memory, e.g. in a software model of a Zarr hierarchy. It's a pretty important aspect of Zarr that the stored representation of arrays (i.e., the contents of the actual metadata documents) is complete and valid, and I don't think there's any way that could change.

@flying-sheep
Author

I just want to represent a null value. No hole, nothing undefined or invalid. Just a dataset that contains no data. Or a placeholder thereof.

Empty groups are able to exist, but empty datasets aren’t.

@d-v-b
Contributor

d-v-b commented Nov 29, 2024

You can make Zarr arrays that contain no data, but they have to have a defined dtype and shape. I thought you were proposing relaxing this restriction, i.e. allowing arrays to have an undefined dtype and shape, but maybe I misinterpreted your request?

@flying-sheep
Author

> I just want to represent a null value

No matter how it’s represented. I was told that there was a discussion about that, but apparently not.

If you’re telling me that there will be nothing like it, we’ll make up our own convention, but if there is going to be a canonical way to express that, I’d like to know how that’ll look.

@flying-sheep
Author

@ilan-gold pointed out that h5py.Empty does have an associated dtype (but no shape or data)

> HDF5 has the concept of Empty or Null datasets and attributes. These are not the same as an array with a shape of (), or a scalar dataspace in HDF5 terms. Instead, it is a dataset with an associated type, no data, and no shape.

@d-v-b
Contributor

d-v-b commented Jan 7, 2025

And what is the purpose of this thing? For example: what does hdf5 use empty datasets for? Once I know the answer to that question, we can sketch out how to achieve the same outcome with Zarr.

@flying-sheep
Author

flying-sheep commented Jan 7, 2025

We serialize user-defined Python dictionaries with string keys as Zarr/HDF5 groups.

The presence of a sub-group or dataset in that group with name x means that there is a dict entry with name x. We support dict values of certain types like arrays or scalars. The only common Python type we don’t support is None.

There is a difference between having a dict entry with name floob and a None value, and not having that dict entry. We would like to know if you’ll ever add first-class support for named-but-empty datasets / null datasets in any shape, way or form.

If you do, we’ll use this type to represent Nones, if you won’t, we’ll use a convention like a Dataset with shape ().
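
The difference between an entry whose value is None and a missing entry can be shown with plain dictionaries. In this sketch, `drop_none` is a hypothetical stand-in for a serializer that has no way to store None and therefore skips such entries.

```python
def drop_none(d: dict) -> dict:
    """Mimic a serializer that cannot store None and skips those entries."""
    return {key: value for key, value in d.items() if value is not None}


original = {"floob": None, "n": 3}
restored = drop_none(original)

print("floob" in original)   # True
print("floob" in restored)   # False
print(restored == original)  # False: the round trip lost the key
```

Without some stored placeholder for None, the key itself disappears on write, so the round trip is lossy.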

@d-v-b
Contributor

d-v-b commented Jan 7, 2025

A zarr array would not be my first choice for serializing the literal value None, but if that's your application, then I think adopting your own convention makes sense. You could also use a special value in the array attributes to encode that the array should be interpreted as the value None.
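
The attribute-based convention suggested above could look roughly like the following. This is a sketch using nested dicts as an in-memory stand-in for a group, not the zarr-python API; the marker key and value (`"encoding-type": "null"`), the node layout, and the helper names are all assumptions, and only scalar values are handled.

```python
from typing import Any

# Hypothetical attribute marking a node as "this stands for None".
NULL_MARKER = {"encoding-type": "null"}


def write_entry(group: dict, name: str, value: Any) -> None:
    """Store a scalar under name; None becomes a placeholder node with a marker."""
    if value is None:
        # Placeholder: a ()-shaped node whose stored data is irrelevant.
        group[name] = {"shape": (), "data": 0, "attrs": dict(NULL_MARKER)}
    else:
        group[name] = {"shape": (), "data": value, "attrs": {}}


def read_entry(group: dict, name: str) -> Any:
    """Return the stored scalar, or None if the node carries the marker."""
    node = group[name]
    if node["attrs"].get("encoding-type") == "null":
        return None
    return node["data"]


group = {}
write_entry(group, "floob", None)
write_entry(group, "n", 3)

print("floob" in group)            # True: the key survives
print(read_entry(group, "floob"))  # None
print(read_entry(group, "n"))      # 3
```

The node stays a fully specified ()-shaped array, so nothing about the stored metadata is left undefined; only readers that know the convention interpret it as None.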

@flying-sheep
Author

flying-sheep commented Jan 7, 2025

> A zarr array would not be my first choice for serializing the literal value None

Thus this feature request. Unless you have another idea of something that’s a better fit than a ()-shaped array.

> You could also use a special value in the array attributes to encode that the array should be interpreted as the value None.

That’s what I currently do in the draft PR that waits for this feature request to receive a definite yes or no!

@d-v-b
Contributor

d-v-b commented Jan 7, 2025

broadly speaking, I would only use Zarr arrays to represent other n-dimensional arrays, and I would use something else entirely to model python primitives like None.

and I can't give a definite "no" to a feature request, because this repo doesn't really take feature requests? this is the repo for the zarr specifications, which have an extremely long release cycle (we have been working on just implementing v3 of the spec for over a year, and we don't have any plans for a v4 spec). So I would push ahead with your implementation.
