Typing of internal datatypes #7457
Comments
I've been thinking that a good and safe start on this issue is to replace all these raw |
Doesn't the Python array API standard effort have some type of duck array protocol we could import? I feel like this has been mentioned before. Then we would start with @Illviljan's suggestion and replace it with the correct duck array protocol later. We might also consider that, in the context of different distributed array backends, dask arrays define a more specific API that includes methods like |
But then |
When will the user be using this type annotation? Isn't all this typing stuff basically a dev feature (internally and downstream)? |
Yes, now I recall, it was here #6894. I think it could be interesting to try out: https://github.com/pmeier/array-protocol Another good read: |
Is your feature request related to a problem?
Currently there is no static typing of the underlying data structures used in `DataArray`s. Simply running `reveal_type(da.data)` returns `Any`.
Adding static typing support to that is unfortunately non-trivial since xarray supports a wide variety of duck-types.
This also comes with internal typing difficulties.
Describe the solution you'd like
I think the way to go is making the `DataArray` class generic in its underlying data type. Something like `DataArray[np.ndarray]` or `DataArray[dask.array.Array]`.
The implementation would require a TypeVar that is bound to some minimal required Protocol for internal consistency (I think it needs at least `dtype` and `shape` attributes).
Datasets would have to be typed the same way. This means only one data type is possible for all variables; when you mix types, it will fall back to the common ancestor, which will be the aforementioned protocol. This is basically the same restriction that a dict has.
Now to the main issue that I see with this approach:
I don't know how to type coordinates. They have the same problems as mentioned above for Datasets.
I think it is very common to have dask arrays in the variables but simple numpy arrays in the coordinates, so either one excludes them from the typing or in such cases the common generic typing falls back to the protocol again.
Not sure what is the best approach here.
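The fallback described above can be sketched as follows. This is only an illustration under assumptions: `TypedDataset`, `EagerArray` and `LazyArray` are hypothetical toy classes (the last two standing in for numpy-like and dask-like arrays), and the mixed case is annotated explicitly because type checkers may differ in what they infer on their own.

```python
from dataclasses import dataclass
from typing import Any, Dict, Generic, Protocol, Tuple, TypeVar


class DuckArray(Protocol):
    """Hypothetical minimal protocol shared by all supported array types."""

    @property
    def dtype(self) -> Any: ...

    @property
    def shape(self) -> Tuple[int, ...]: ...


T_Array = TypeVar("T_Array", bound=DuckArray)


@dataclass
class EagerArray:
    """Toy stand-in for an in-memory array (for example a coordinate)."""

    dtype: str
    shape: Tuple[int, ...]


@dataclass
class LazyArray:
    """Toy stand-in for a chunked, lazily evaluated array."""

    dtype: str
    shape: Tuple[int, ...]
    chunks: Tuple[int, ...]


class TypedDataset(Generic[T_Array]):
    """Toy container with one array type parameter for all entries."""

    def __init__(self, arrays: Dict[str, T_Array]) -> None:
        self._arrays = dict(arrays)

    def __getitem__(self, name: str) -> T_Array:
        return self._arrays[name]


# Homogeneous case: the parameter is the concrete type, so .chunks is visible.
lazy_only = TypedDataset({"temperature": LazyArray("float64", (4, 4), (2, 2))})
chunks = lazy_only["temperature"].chunks

# Mixed case: lazy data plus an eager coordinate. The only common type is the
# protocol, so statically only dtype and shape remain available.
mixed: TypedDataset[DuckArray] = TypedDataset(
    {
        "temperature": LazyArray("float64", (4, 4), (2, 2)),
        "time": EagerArray("datetime64[ns]", (4,)),
    }
)
time_shape = mixed["time"].shape
# mixed["temperature"].chunks would work at runtime, but a type checker
# should reject it because DuckArray does not declare chunks.
```

The trade-off is visible in the last comment: once coordinates and variables share one type parameter, the more specific API of the lazy array is hidden from the type checker.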
Describe alternatives you've considered
Since the most common workflow for beginners and intermediate-advanced users is to stick with the DataArrays themselves and never touch the underlying data, I am not sure whether this change is as beneficial as I want it to be. Maybe it just complicates things, and leaving it as `Any` is the easier option, with advanced users then having to cast or ignore it.
Additional context
It came up in this discussion: #7020 (comment)