Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Running xarray unit tests with cubed #8

Open
tomwhite opened this issue Sep 20, 2023 · 7 comments
Open

Running xarray unit tests with cubed #8

tomwhite opened this issue Sep 20, 2023 · 7 comments

Comments

@tomwhite
Copy link
Member

To test the coverage and correctness of cubed-xarray we would ideally run the xarray unit test suite with the cubed chunk manager.

@tomwhite
Copy link
Member Author

The xarray test suite isn't really set up to do this yet, so as a first step to get a rough impression of how it does I replaced Dask with Cubed in this branch: https://github.com/tomwhite/xarray/tree/cubed-unit-tests.

Running the tests then gave:

% pytest --color=yes | tee test-output.log
% grep FAILED test-output.log | cut -d' ' -f 4- | sort | uniq -c | sort -nr
 318 TypeError: Only numeric dtypes are allowed in sum
 206 AttributeError: 'IndexVariable' object has no attribute 'npartitions'
 198 ValueError: missing object_codec for object array
 179 ModuleNotFoundError: No module named 'dask'
  92 AttributeError: 'Array' object has no attribute 'sum'
  69 AssertionError: assert False
  64 AttributeError: 'Array' object has no attribute 'reshape'. Did you mean: '_...
  48 AttributeError: 'Array' object has no attribute 'ravel'
  33 TypeError: Vectorized indexing is not supported
  27 NameError: name 'da' is not defined
  27 AssertionError: Left and right Dataset objects are not identical
  25 NotImplementedError: cubed.apply_gufunc doesn't support allow_rechunk
  19 AttributeError: 'Array' object has no attribute 'astype'. Did you mean: '_d...
  15 TypeError: cannot pickle '_thread.lock' object
  12 AttributeError: module 'cubed.array_api' has no attribute 'var'
  10 AttributeError: 'Array' object has no attribute 'any'
...

With a summary of:

1464 failed, 13590 passed, 1539 skipped, 229 xfailed, 46 xpassed, 572 warnings, 31 errors in 657.12s (0:10:57)

A few comments on some of these:

  • Cubed has strict rules for type support in various functions (TypeError: Only numeric dtypes are allowed in sum)
    • For example, sum doesn't allow bool types, and mean doesn't allow int types. It would be possible to loosen Cubed to allow these, since the Python array API spec doesn't say you can't support more types than it describes (see this comment for more detail: Array API implementation dask/dask#8750 (comment)). However, we might want to change xarray for some (e.g. cast bool types to int before summing?) and Cubed for others (e.g. mean of int seems reasonable).
  • Cubed doesn't support object types (ValueError: missing object_codec for object array)
    • There seem to be two cases: strings, and mixtures of e.g. bool and NaN
    • Again, the Python API spec is silent on these, so perhaps we could pass them through
  • Cubed unify_chunks doesn't coerce non-Array types like lists (AttributeError: 'IndexVariable' object has no attribute 'npartitions')
    • We should probably do this in cubed-xarray.
  • Xarray still has several places where it's not using the array API (AttributeError: 'Array' object has no attribute ...)
    • We know this, but it would be fairly easy to patch up the cases found here (sum, reshape, ravel, etc)

cc @TomNicholas @keewis

@TomNicholas
Copy link
Member

Xarray still has several places where it's not using the array API

Perhaps a good way to find these cases quickly would be to use a similar test-case-array-library-substitution trick like you've used here, but instead use the numpy minimal implementation of the array API mentioned here.

@tomwhite
Copy link
Member Author

Rerunning with all the changes (including cubed-dev/cubed#550) using https://github.com/tomwhite/xarray/tree/cubed-unit-tests-aug-2024:

% pytest --color=yes | tee test-output.log
% grep FAILED test-output.log | cut -d' ' -f 4- | sort | uniq -c | sort -nr
 164 ModuleNotFoundError: No module named 'dask'
 159 ValueError: missing object_codec for object array
 108 ValueError: dimensions ('dim_2',) must have the same length as the number o...
  71 AssertionError: assert False
  55 AssertionError: Left and right Dataset objects are not identical
  37 ValueError: dimensions ('dim_1',) must have the same length as the number o...
  36 ValueError: Array must have regular chunks, but found chunks=((1, 1, 1, 2, ...
  29 NotImplementedError: Delayed compute is not supported.
  28 NameError: name 'da' is not defined
  27 NotImplementedError: cubed.apply_gufunc doesn't support allow_rechunk
  19 AttributeError: 'numpy.ndarray' object has no attribute 'name'
  16 zarr.errors.ArrayNotFoundError: array not found at path %r' ''
  16 ValueError: dimensions ('dim_-3',) must have the same length as the number ...
  16 NotImplementedError: Slice step must be >= 1: (slice(-1, 1, -1),)
  15 TypeError: Only numeric or boolean dtypes are allowed in sum
  13 AttributeError: 'Array' object has no attribute 'astype'. Did you mean: '_d...
  12 AttributeError: module 'cubed.array_api' has no attribute 'var'
  12 AttributeError: 'str' object has no attribute 'ndim'
  12 AttributeError: 'numpy.ndarray' object has no attribute 'chunks'
  10 AttributeError: 'Array' object has no attribute 'any'
...
1022 failed, 16483 passed, 1777 skipped, 228 xfailed, 11 xpassed, 958 warnings, 34 errors in 780.18s (0:13:00)

This is an improvement: 5.8% of tests are failing, compared to 11.3% last October.

@TomNicholas
Copy link
Member

Progress! But that's still a lot of errors 😕 Could you post the full output somewhere (e.g. you can collapse long output in markdown on github) so that we can look for the highest-priority failures to fix first?

@tomwhite
Copy link
Member Author

Here's the output log (it's got garbled characters in it for some reason, but should otherwise be OK):
test-output.log

@TomNicholas
Copy link
Member

Interesting, thanks. There's a mixture of issues here, but your summary above still seems accurate.

AttributeError: 'Array' object has no attribute 'astype'

That comes up a lot and should have been fixed already (to call xp.astype instead).

ModuleNotFoundError: No module named 'dask

NameError: name 'da' is not defined

Presumably these are just an artifact of the of the you hacked these tests to always use cubed in stead of dask?

NotImplementedError: cubed.apply_gufunc doesn't support allow_rechunk

This is interesting and perhaps non-trivial for cubed.

ValueError: dimensions ('x',) must have the same length as the number of da...

This could be a bug in the namedarray code.

AttributeError: 'numpy.ndarray' object has no attribute 'chunks'

This could be a general issue with the chunkmanager dispatch

zarr.errors.ArrayNotFoundError: array not found at path %r' ''

This seems like possibly a real bug in cubed.

@tomwhite
Copy link
Member Author

zarr.errors.ArrayNotFoundError: array not found at path %r' ''

This seems like possibly a real bug in cubed.

Reproduced in cubed-dev/cubed#556

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants