Skip to content

Typing of internal datatypes #7457

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
headtr1ck opened this issue Jan 19, 2023 · 5 comments
Open

Typing of internal datatypes #7457

headtr1ck opened this issue Jan 19, 2023 · 5 comments

Comments

@headtr1ck
Copy link
Collaborator

Is your feature request related to a problem?

Currently there is no static typing of the underlying data structures used in DataArrays.
Simply running
reveal_type(da.data) returns Any.

Adding static typing support to that is unfortunately non-trivial since xarray supports a wide variety of duck-types.

This also comes with internal typing difficulties.

Describe the solution you'd like

I think the way to go is making the DataArray class generic in it's underlying data type.
Something like DataArray[np.ndarray] or DataArray[dask.array].

The implementation would require a TypeVar that is bound to some minimal required Protocol for internal consistency (I think at least it needs dtype and shape attributes).

Datasets would have to be typed the same way, this means only one datatype for all variables is possible, when you mix it it will fall back to the common ancestor which will be the before mentioned protocol. This is basically the same restriction that a dict has.

Now to the main issue that I see with this approach:
I don't know how to type coordinates. They have the same problems than mentioned above for Datasets.
I think it is very common to have dask arrays in the variables but simple numpy arrays in the coordinates, so either one excludes them from the typing or in such cases the common generic typing falls back to the protocol again.
Not sure what is the best approach here.

Describe alternatives you've considered

Since the most common workflow for beginners and intermediate-advanced users is to stick with the DataArrays themself and never touch the underlying data, I am not sure if this change is as beneficial as I want it to be. Maybe it just complicates things and leaving it as Any is easier to solve for advanced users that then have to cast or ignore this.

Additional context

It came up in this discussion:
#7020 (comment)_

@Illviljan
Copy link
Contributor

I've been thinking that a good and safe start on this issue is to replace all these raw Any's with T_DuckArray where T_DuckArray = Any.
It won't help in the typing but it gives some traceability and makes it easier to determine what the intention was.

@TomNicholas
Copy link
Member

Doesn't the python array api standard effort have some type of duck array protocol we could import? I feel like this has been mentioned before. Then we would start with @Illviljan 's suggestion and replace it with the correct duck array protocol later.

We might also consider that in the context of different distributed array backends dask arrays define a more specific API that includes methods like .chunk() too. Standardisation of that is a long-term dream though, not an immediate problem.

@headtr1ck
Copy link
Collaborator Author

Doesn't the python array api standard effort have some type of duck array protocol we could import? I feel like this has been mentioned before. Then we would start with @Illviljan 's suggestion and replace it with the correct duck array protocol later.

But then da.data will be of this protocol type and not the array class that you assume it has.
For internal type checking this is what we want but for the user this will be confusing.

@TomNicholas
Copy link
Member

But then da.data will be of this protocol type and not the array class that you assume it has.
For internal type checking this is what we want but for the user this will be confusing.

When will the user be using this type annotation? Isn't all this typing stuff basically a dev feature (internally and downstream)?

@Illviljan
Copy link
Contributor

I feel like this has been mentioned before.

Yes, now I recall, it was here #6894.

I think it could be interesting to try out: https://github.com/pmeier/array-protocol

Another good read:
data-apis/array-api#229

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants