Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimsed lazy indexing - h5netcdf backend - Active storage reductions #805

Merged
merged 158 commits into from
Oct 30, 2024

Conversation

davidhassell
Copy link
Collaborator

@davidhassell davidhassell commented Aug 22, 2024

Fixes #501

This PR has three parts, which ideally would have been in three separate PRs, but they needed to be co-developed, and splitting them out retrospectively turned out to be to hard. The three parts are:

  • Optimised lazy indexing

    • For data in files, a layer in the Dask graph which indexes the array object (such as a NetCDF4Array object) no longer triggers a read from disk. This has to be done within another layer by converting the indexed array object to a numpy array (typically with the new cf_asnyarray function). When an already-indexed array object is indexed again, the two indices are effectively combined to create a new index. E.g. If the array has shape (10, 10), the first index is [:, :], the second index is [::2, [1, 5, 3, 7]], and the third index is [::-2, 1::2], then this is equivalent to the single index [8::-4, 5:8:2]. Only the data for the combined index needs to retrieved from disk, which is a huge performance improvement, as the amount of IO is typically vastly reduced. The same happens for array objects in memory (such as numpy arrays), but for these the performance improvements are negligible.
  • h5netcdf

    • Introducing the ability to use h5netcdf as a backend. Essentially making cf-python compatible with h5netcdf read cfdm#307.
  • Active storage reductions

    • Implement the option to do active storage reductions - i.e. to move selected collapse calculations from the client's CPU to a remote server that is "close" (in a network sense) to the data being collapsed.

To help alleviate the difficulties of reviewing a 3-in-1 PR, I have marked up every code change with which of the three parts it relates to, and provided a checklist for all of the changes.

Optimised lazy indexing

  • cf/docstring/docstring.py:637: # REVIEW: getitem: _docstring_substitution_definitions: 'asanyarray'

  • cf/test/test_NetCDF4Array.py:138: # REVIEW: getitem: test_NetCDF4Array: test NetCDF4Array.shape
  • cf/test/test_NetCDF4Array.py:148: # REVIEW: getitem: test_NetCDF4Array: test NetCDF4Array.index

  • cf/test/test_Data.py:1482: # REVIEW: getitem: test_Data__getitem__: Chained subspaces reading from disk
  • cf/test/test_Data.py:3293: # REVIEW: getitem: test_Data_rechunk: rechunking after a getitem
  • cf/test/test_Data.py:4523: # # REVIEW: getitem: test_Data_active_storage: test Data.active_storage
  • cf/test/test_Data.py:4571: # REVIEW: getitem: test_Data_cull_graph: prevent new asanyarray layer
  • cf/test/test_Data.py:4830: # REVIEW: getitem: test_Data_is_masked: test Data.is_masked

  • cf/test/test_Field.py:1469: # REVIEW: getitem: test_Field_indices: make sure works when 'g.array' is not masked
  • cf/test/test_Field.py:1488: # REVIEW: getitem: test_Field_indices: make sure works when 'g.array' is not masked

  • cf/test/test_FullArray.py:1:# REVIEW: getitem: test_FullArray: new test module

  • cf/mixin/propertiesdata.py:4695: # REVIEW: getitem: to_dask_array: new keyword 'asanyarray'

  • cf/read_write/netcdf/netcdfwrite.py:752: # REVIEW: getitem: _cfa_write_non_standard_terms: set 'asanyarray'
  • cf/read_write/netcdf/netcdfwrite.py:813: # REVIEW: getitem: _cfa_unique: convert a to a usable array
  • cf/read_write/netcdf/netcdfwrite.py:969: # REVIEW: getitem: _cfa_aggregation_instructions: set 'asanyarray'

  • cf/data/dask_regrid.py:5:# REVIEW: getitem: regrid.py: import cf_asanyarray
  • cf/data/dask_regrid.py:180: # REVIEW: getitem: regrid: convert a to a usable array

  • cf/data/data.py:47:# REVIEW: getitem: data.py: import cf_asanyarray, cf_filled, cf_is_masked
  • cf/data/data.py:378: # REVIEW: getitem: __init__: set 'asanyarray'
  • cf/data/data.py:482: # REVIEW: getitem: __init__: Set whether or not to call np.asanyarray on chunks to convert them to numpy arrays.
  • cf/data/data.py:516: # REVIEW: getitem: __init__: set 'asanyarray'
  • cf/data/data.py:639: # REVIEW: getitem: cf_contains: set 'asanyarray'
  • cf/data/data.py:792: # REVIEW: getitem: __len__: set 'asanyarray'
  • cf/data/data.py:901: # REVIEW: getitem: __getitem__: set 'asanyarray'
  • cf/data/data.py:961: # REVIEW: getitem: __getitem__: set 'asanyarray=True' because subspaced chunks might not be in memory
  • cf/data/data.py:1184: # REVIEW: getitem: __asanyarray__: new property __asanyarray__
  • cf/data/data.py:1417: # REVIEW: getitem: _set_dask: new keyword 'asanyarray'
  • cf/data/data.py:1476: # REVIEW: getitem: _set_dask: set 'asanyarray'
  • cf/data/data.py:2552: # REVIEW: getitem: percentile: set 'asanyarray'
  • cf/data/data.py:3052: # REVIEW: getitem: percentile: rectify comment
  • cf/data/data.py:3240: # REVIEW: getitem: rechunk: set 'asanyarray'
  • cf/data/data.py:3299: # REVIEW: getitem: _asdatetime: set 'asanyarray'
  • cf/data/data.py:3357: # REVIEW: getitem: _asreftime: set 'asanyarray'
  • cf/data/data.py:3970: # REVIEW: getitem: _regrid: set 'asanyarray'
  • cf/data/data.py:4215: # REVIEW: getitem: concatenate: set 'asanyarray'
  • cf/data/data.py:4248: # REVIEW: getitem: concatenate: define the asanyarray status
  • cf/data/data.py:4259: # REVIEW: getitem: concatenate: set 'asanyarray'
  • cf/data/data.py:4905: # REVIEW: getitem: chunks: set 'asanyarray'
  • cf/data/data.py:4963: # REVIEW: getitem: Units: set 'asanyarray'
  • cf/data/data.py:5033: # REVIEW: getitem: dtype: set 'asanyarray'
  • cf/data/data.py:5149: # REVIEW: getitem: is_masked: set 'asanyarray'
  • cf/data/data.py:5195: # REVIEW: getitem: nbytes: set 'asanyarray'
  • cf/data/data.py:5232: # REVIEW: getitem: ndim: set 'asanyarray'
  • cf/data/data.py:5257: # REVIEW: getitem: npartitions: set 'asanyarray'
  • cf/data/data.py:5281: # REVIEW: getitem: numblocks set 'asanyarray'
  • cf/data/data.py:5314: # REVIEW: getitem: shape: set 'asanyarray'
  • cf/data/data.py:5356: # REVIEW: getitem: size set 'asanyarray'
  • cf/data/data.py:6555: # REVIEW: getitem: convert_reference_time: set 'asanyarray'
  • cf/data/data.py:6636: # REVIEW: getitem: get_deterministic_name: set 'asanyarray'
  • cf/data/data.py:6706: # REVIEW: getitem: get_filenames: set 'asanyarray'
  • cf/data/data.py:6843: # REVIEW: getitem: add_file_location: set 'asanyarray'
  • cf/data/data.py:8287: # REVIEW: getitem: unique: set 'asanyarray'
  • cf/data/data.py:9029: # REVIEW: getitem: hardmask: set 'asanyarray'
  • cf/data/data.py:9152: # REVIEW: getitem: soften_mask: set 'asanyarray'
  • cf/data/data.py:9184: # REVIEW: getitem: file_locations: set 'asanyarray'
  • cf/data/data.py:9245: # REVIEW: getitem: filled: set 'asanyarray'
  • cf/data/data.py:9877: # REVIEW: getitem: to_dask_array: new keyword 'asanyarray'
  • cf/data/data.py:10209: # REVIEW: getitem: del_file_location: set 'asanyarray'
  • cf/data/data.py:11394: # REVIEW: getitem: where: set 'asanyarray'
  • cf/data/data.py:11412: # REVIEW: getitem: where: set 'asanyarray'
  • cf/data/data.py:11458: # REVIEW: getitem: where: 'da.asanyarray' is no longer required
  • cf/data/data.py:11685: # REVIEW: getitem: cull_graph: set 'asanyarray'
  • cf/data/data.py:11957: # REVIEW: getitem: todict: new keywords 'apply_mask_hardness', 'asanyarray'

  • cf/data/array/h5netcdfarray.py:53: # REVIEW: getitem: _get_array: new method to convert subspace to numpy array.

  • cf/data/array/mixin/init.py:8:# REVIEW: getitem: __init__.py: import IndexMixin

  • cf/data/array/mixin/indexmixin.py:10:# REVIEW: getitem: IndexMixin: new mixin class IndexMixin

  • cf/data/array/netcdf4array.py:50: # REVIEW: getitem: _get_array: Ignore this for h5 review
  • cf/data/array/netcdf4array.py:51: # REVIEW: getitem: _get_array: new method to convert subspace to numpy array

  • cf/data/array/umarray.py:174: # REVIEW: getitem: _get_array: new method to convert subspace to numpy array
  • cf/data/array/umarray.py:275: # REVIEW: getitem: _set_FillValue: record _FillValue in attributes
  • cf/data/array/umarray.py:370: # REVIEW: getitem: _set_units: record units in attributes
  • cf/data/array/umarray.py:373: # REVIEW: getitem: _set_unpack: record unpack in attributes

  • cf/data/array/fullarray.py:124: # REVIEW: getitem: _get_array: new method to convert subspace to numpy array
  • cf/data/array/fullarray.py:159: # REVIEW: getitem: array: New property to convert subspace to numpy array

  • cf/data/dask_utils.py:130: # REVIEW: getitem: cf_contains: convert a to a usable array
  • cf/data/dask_utils.py:166: # REVIEW: getitem: cf_convolve1d: convert a to a usable array
  • cf/data/dask_utils.py:210: # REVIEW: getitem: cf_harden_mask: convert a to a usable array
  • cf/data/dask_utils.py:282: # REVIEW: getitem: cf_percentile: convert a to a usable array
  • cf/data/dask_utils.py:378: # REVIEW: getitem: cf_soften_mask: convert a to a usable array
  • cf/data/dask_utils.py:436: # REVIEW: getitem: cf_where: convert array, condition, x, y to usable arrays
  • cf/data/dask_utils.py:573: # REVIEW: getitem: cf_rt2dt: convert a to a usable array
  • cf/data/dask_utils.py:629: # REVIEW: getitem: cf_dt2rt: convert a to a usable array
  • cf/data/dask_utils.py:671: # REVIEW: getitem: cf_units: convert a to a usable array
  • cf/data/dask_utils.py:696: # REVIEW: getitem: cf_is_masked: convert a to a usable array
  • cf/data/dask_utils.py:730: # REVIEW: getitem: cf_filled: convert a to a usable array
  • cf/data/dask_utils.py:735:# REVIEW: getitem: cf_asanyarray: convert a to a usable array
  • cf/data/dask_utils.py:755: # REVIEW: getitem: cf_asanyarray: convert a to a usable array

  • cf/data/fragment/mixin/fragmentarraymixin.py:15: # REVIEW: getitem: _get_array: new method to convert subspace to numpy array
  • cf/data/fragment/mixin/fragmentarraymixin.py:169: # REVIEW: getitem: _size_1_axis: refactor to use original_shape

  • cf/data/fragment/netcdffragmentarray.py:10:# REVIEW: getitem: NetCDFFragmentArray: new inheritance to allow for different netCDF backends
  • cf/data/fragment/netcdffragmentarray.py:179: # REVIEW: getitem: _get_array: new method to convert subspace to numpy array

  • cf/data/creation.py:63: # REVIEW: getitem: to_dask: set 'asanyarray'
  • cf/data/creation.py:87: # REVIEW: getitem: to_dask: The file lock is now on the Array object (in its _get_array method), rather than being set on the Dask array itself.

  • cf/data/utils.py:877: # REVIEW: getitem: collapse: set 'asanyarray'

  • cf/functions.py:2449:# REVIEW: getitem: get_subspace: remove deprecated function

h5netcdf

  • cf/test/test_active_storage.py:1:# REVIEW: h5: test_active_storage.py: new test module

  • cf/test/test_NetCDF4Array.py:15:# REVIEW: h5: test_NetCDF4Array.py: renamed 'NetCDFArray' to 'NetCDF4Array'
  • cf/test/test_NetCDF4Array.py:44: # REVIEW: h5: test_NetCDF4Array: renamed 'NetCDFArray' to 'NetCDF4Array'
  • cf/test/test_NetCDF4Array.py:65: # REVIEW: h5: test_NetCDF4Array: renamed 'NetCDFArray' to 'NetCDF4Array'
  • cf/test/test_NetCDF4Array.py:76: # REVIEW: h5: test_NetCDF4Array: renamed 'NetCDFArray' to 'NetCDF4Array'
  • cf/test/test_NetCDF4Array.py:112: # REVIEW: h5: test_NetCDF4Array: renamed 'NetCDFArray' to 'NetCDF4Array'
  • cf/test/test_NetCDF4Array.py:120: # REVIEW: h5: test_NetCDF4Array: renamed 'NetCDFArray' to 'NetCDF4Array'

  • cf/test/test_read_write.py:82: # REVIEW: h5: test_read_mask: rename numpy to np
  • cf/test/test_read_write.py:564: # REVIEW: h5: test_write_datatype: rename numpy to np

  • cf/read_write/um/umread.py:1976: # REVIEW: h5: create_data: replace units/calendar API with attributes

  • cf/read_write/write.py:16:# REVIEW: h5: write: docstring improvements

  • cf/read_write/netcdf/netcdfread.py:212: # REVIEW: h5: _create_data: control caching
  • cf/read_write/netcdf/netcdfread.py:257: # REVIEW: h5: _create_data: replace units/calendar API with attributes
  • cf/read_write/netcdf/netcdfread.py:266: # REVIEW: h5: _create_data: don't cache data from CFA variables
  • cf/read_write/netcdf/netcdfread.py:627: # REVIEW: h5: _create_cfanetcdfarray: docstring/comment improvements
  • cf/read_write/netcdf/netcdfread.py:702: # REVIEW: h5: _create_cfanetcdfarray: choose the correct netCDF backend
  • cf/read_write/netcdf/netcdfread.py:756: # REVIEW: h5: _create_cfanetcdfarray_term: fix unknown fragment shape
  • cf/read_write/netcdf/netcdfread.py:775: # REVIEW: h5: _create_cfanetcdfarray_term: choose the correct netCDF backend
  • cf/read_write/netcdf/netcdfread.py:965: # REVIEW: h5: _cfa_parse_aggregated_data: use cfdm.netcdf_indexer to get data
  • cf/read_write/netcdf/netcdfwrite.py:582: # REVIEW: h5: Deleted function _convert_to_builtin_type was a CFA-0.4 thing

  • cf/read_write/read.py:61: # REVIEW: h5: read: new 'unpack' parameter to control auto-unpacking (previously always True)
  • cf/read_write/read.py:67: # REVIEW: h5: read: new 'netcdf_backend' parameter to control how to read files
  • cf/read_write/read.py:69: # REVIEW: h5: read: new 'storage_options' parameter to control access to S3
  • cf/read_write/read.py:71: # REVIEW: h5: read: 'cache' parameter to control whether or not to get to cache selected data elements

  • cf/data/array/init.py:4:# REVIEW: h5: __init__.py: import CFAH5netcdfArray
  • cf/data/array/init.py:7:# REVIEW: h5: __init__.py: import CFAH5netcdfArray
  • cf/data/array/init.py:12:# REVIEW: h5: __init__.py: import H5netcdfArray
  • cf/data/array/init.py:16:# REVIEW: h5: __init__.py: import NetCDF4Array

  • cf/data/array/h5netcdfarray.py:1:# REVIEW: h5: H5netcdfArray: New class to access netCDF with h5netcdf
  • cf/data/array/h5netcdfarray.py:52: # REVIEW: h5: _get_array: Ignore this for h5 review

  • cf/data/array/mixin/cfamixin.py:39: # REVIEW: h5: __init__: replace units/calendar API with attributes
  • cf/data/array/mixin/cfamixin.py:228: # REVIEW: h5: _parse_cfa: Refactoring of code that used to be in __init__
  • cf/data/array/mixin/cfamixin.py:469: # REVIEW: h5: get_storage_options: new method to get file access options

  • cf/data/array/netcdfarray.py:1:# REVIEW: h5: NetCDFArray: Replaced by NetCDF4Array

  • cf/data/array/cfah5netcdfarray.py:1:# REVIEW: h5: CFAH5netcdfArray: New class for accessing CFA with h5netcdf

  • cf/data/array/locks.py:1:# REVIEW: h5: locks.py: New module to provide file locks

  • cf/data/array/netcdf4array.py:1:# REVIEW: h5: NetCDF4Array: New class to access netCDF with netCDF4, replaces NetCDFArray

  • cf/data/array/umarray.py:15: # REVIEW: h5: __init__: replace units/calendar API with attributes

  • cf/data/array/cfanetcdf4array.py:1:# REVIEW: h5: CFAnetCDF4Array: New class for accessing CFA with netCDF4

  • cf/data/array/fullarray.py:19: # REVIEW: h5: __init__: replace units/calendar API with attributes

  • cf/data/fragment/init.py:3:# REVIEW: h5: __init__.py: import H5netcdfFragmentArray
  • cf/data/fragment/init.py:7:# REVIEW: h5: __init__.py: import NetCDF4FragmentArray

  • cf/data/fragment/fullfragmentarray.py:12: # REVIEW: h5: __init__: replace units/calendar API with attributes

  • cf/data/fragment/netcdffragmentarray.py:27: # REVIEW: h5: __init__: replace units/calendar API with attributes

  • cf/data/fragment/h5netcdffragmentarray.py:5:# REVIEW: h5: H5netcdfFragmentArray: New class to access netCDF fragment with h5netcdf

  • cf/data/fragment/umfragmentarray.py:12: # REVIEW: h5: __init__: replace units/calendar API with attributes

  • cf/data/fragment/umfragmentarray.py:13: # REVIEW: h5: __init__: new keyword 'storage_options'

  • cf/data/fragment/netcdf4fragmentarray.py:5:# REVIEW: h5: NetCDF4FragmentArray: New class to access netCDF fragment with netCDF4

  • cf/functions.py:2900: # REVIEW: h5: relpath: minor refactor
  • cf/functions.py:2939: # REVIEW: h5: relpath: minor refactor
  • cf/functions.py:2979: # REVIEW: h5: relpath: minor refactor
  • cf/functions.py:3350:# REVIEW: h5: environment: new dependencies

  • cf/cfimplementation.py:30:# REVIEW: h5: cfimplementation.py: import CFAH5netcdfArray, CFANetCDF4Array, H5netcdfArray,NetCDF4Array
  • cf/cfimplementation.py:119: # REVIEW: h5: initialise_CFANetCDF4Array: new method to initialise CFANetCDF4Array
  • cf/cfimplementation.py:136: # REVIEW: h5: initialise_CFAH5netcdfArray: new method to initialise CFAH5netcdfArray

  • cf/regrid/regridoperator.py:730: # REVIEW: h5: new name and location of file lock
  • cf/regrid/regrid.py:2469: # REVIEW: h5: new name and location of file lock

Active storage

  • cf/test/test_functions.py:50: # REVIEW: active: test_configuration: test cf.active_storage, cf.active_storage_url, cf.active_storage_max_requests

  • cf/data/collapse/collapse_active.py:1:# REVIEW: active: collapse_active.py: new module for active storage functionality

  • cf/data/collapse/dask_collapse.py:1:# REVIEW: active: dask_collapse.py: all unlabelled changes in this module are general tidying, and should be reviewed at the same time as active storage
  • cf/data/collapse/dask_collapse.py:235:# REVIEW: active: cf_mean_chunk: active storage decoration
  • cf/data/collapse/dask_collapse.py:382:# REVIEW: active: cf_max_chunk: active storage decoration
  • cf/data/collapse/dask_collapse.py:537:# REVIEW: active: cf_min_chunk: active storage decoration
  • cf/data/collapse/dask_collapse.py:644:# REVIEW: active: cf_range_chunk: active storage decoration
  • cf/data/collapse/dask_collapse.py:758:# REVIEW: active: cf_rms_chunk: active storage decoration
  • cf/data/collapse/dask_collapse.py:843:# REVIEW: active: cf_sample_size_chunk: active storage decoration
  • cf/data/collapse/dask_collapse.py:957:# REVIEW: active: cf_sum_chunk: active storage decoration
  • cf/data/collapse/dask_collapse.py:1093:# REVIEW: active: cf_sum_of_weights_chunk: active storage decoration
  • cf/data/collapse/dask_collapse.py:1137:# REVIEW: active: cf_sum_of_weights2_chunk: active storage decoration
  • cf/data/collapse/dask_collapse.py:1183:# REVIEW: active: cf_unique_chunk: active storage decoration
  • cf/data/collapse/dask_collapse.py:1248:# REVIEW: active: cf_var_chunk: active storage decoration

  • cf/data/array/mixin/init.py:1:# REVIEW: active: __init__.py: import ActiveStorageMixin

  • cf/data/array/mixin/arraymixin.py:21: # REVIEW: active: _meta: Moved to here from FileArrayMixin

  • cf/data/array/mixin/activestoragemixin.py:1:# REVIEW: active: ActiveStorageMixin: new mixin class ActiveStorageMixin

  • cf/data/creation.py:86: # REVIEW: active: to_dask: '_dask_meta' renamed to '_meta' for consistency with Dask

  • cf/data/utils.py:866: # REVIEW: active: collapse: pass the active storage status onto the collapse functions
  • cf/data/utils.py:995: # REVIEW: active: parse_weights: minor refactor

  • cf/functions.py:166:# REVIEW: active: configuration: new keywords 'active_storage', 'active_storage_url', 'active_storage_max_requests'
  • cf/functions.py:428:# REVIEW: active: _configuration: new keywords 'active_storage', 'active_storage_url', 'active_storage_max_requests'
  • cf/functions.py:587:# REVIEW: active: regrid_logging: new examples
  • cf/functions.py:719:# REVIEW: active: relaxed_identities: new examples
  • cf/functions.py:853:# REVIEW: active: tempdir: new examples
  • cf/functions.py:1205:# REVIEW: active: active_storage: new function
  • cf/functions.py:1275:# REVIEW: active: active_storage_url: new function
  • cf/functions.py:1340:# REVIEW: active: active_storage_max_requests: new function

  • cf/field.py:5211: # REVIEW: active: active storage docstring
  • cf/field.py:7092: # REVIEW: active: collapse: include size 1 axes in collapse

  • cf/constants.py:66: # REVIEW: active: CONSTANTS: new constants 'active_storage', 'active_storage_url', 'active_storage_max_requests'

@davidhassell
Copy link
Collaborator Author

From the review:

ERROR: test_configuration (test_functions.functionTest.test_configuration)

Traceback (most recent call last):
File "/home/slb93/git-repos/cf-python/cf/functions.py", line 1265, in _parse
from activestorage import Active # noqa: F401
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ModuleNotFoundError: No module named 'activestorage'

Now nicely handled: 0509135

@davidhassell
Copy link
Collaborator Author

From the #805 (review):

I get eight test_Field_regrid_* tests failing on these assertions:

These test all pass for me ... I wonder what's going on, here?

@davidhassell
Copy link
Collaborator Author

davidhassell commented Oct 22, 2024

From the #805 (review):

======================================================================
ERROR: test_GENERAL (test_general.generalTest.test_GENERAL)

File "/home/slb93/git-repos/cfdm/cfdm/read_write/netcdf/netcdfwrite.py", line 2476, in _createVariable
g["nc"][ncvar] = g["netcdf"].createVariable(**kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "src/netCDF4/_netCDF4.pyx", line 2967, in netCDF4._netCDF4.Dataset.createVariable
File "src/netCDF4/_netCDF4.pyx", line 4305, in netCDF4._netCDF4.Variable.init
ValueError: chunksize cannot exceed dimension size

This, it turns out, is a bug in cfdm relating to the setting of HDF5 chunksizes for scalar arrays, which shall be fixed over there ...

EDIT: No, it's a bug in cf-python, afterall! Fixed in b72c17b. (Note that all will get refactored (for the better) when we daskify cfdm.)

@davidhassell
Copy link
Collaborator Author

From the #805 (review):

Consistency

This wasn't introduced in this PR, but when checking the inheritance structure I noticed that one of the collapse methods has an inconsistent argument ordering relative to the others namely means_abs having weights before axis unlike all others which accept weights (see below), so it would be good to make that consistent here.

I agree. That's one for 4.0.0 perhaps (as it would be a breaking API change). Do you agree?

@davidhassell
Copy link
Collaborator Author

From the #805 (review):

Names wise, I would prefer we rename the classes with H5netcdf in to H5NetCDF which follows the others in using 'NetCDF' capitalisation and follow Python Pascal Case style:

I would agree with you in general! Here, h5netcdf and netCDF4 are proper nouns, being the names of libraries, so I in this case I would say that they camel case to H5netcdf and NetCDF4 respectively.

@davidhassell
Copy link
Collaborator Author

davidhassell commented Oct 23, 2024

From #805 (review):

More generally, there are a lot of methods missing from the new Array classes API docs:

Fixed: 885a67d

./check_docs_api_coverage now only flags up the 8 methods that we expect (relating to UGRID, and we are still deciding what do about!)

@davidhassell
Copy link
Collaborator Author

Hi Sadie - I think I've responding to everything - over to you :)

@sadielbartholomew
Copy link
Member

Thanks for addressing my feedback. I'll reply to your response comments here - though in general if I have put the 'thumbs up' emoji on a comment I am happy and won't comment further!

These test all pass for me ... I wonder what's going on, here?

This time round (after fetching your new commits) they all pass, so either something fixed them or we can treat them as flaky and ignore those failures. Flaky tests are of course a problem in themselves, but not to concern us for this PR. :)

EDIT: No, it's a bug in cf-python, afterall! Fixed in b72c17b. (Note that all will get refactored (for the better) when we daskify cfdm.)

Great, always good to fix a bug! test_general now all passes for me locally.

I agree. That's one for 4.0.0 perhaps (as it would be a breaking API change). Do you agree?

Sure, shall I open an Issue for it and add it to #776?

./check_docs_api_coverage now only flags up the 8 methods that we expect (relating to UGRID, and we are still deciding what do about!)

Nice, thanks for sorting the rest. Do you know what the '6 missing .rst files' warning is about?:

$ ./check_docs_api_coverage                                                           ─╯
Method cf.Field.del_mesh_id not in docs/source/class/cf.Field.rst
Method cf.Field.get_mesh_id not in docs/source/class/cf.Field.rst
Method cf.Field.has_mesh_id not in docs/source/class/cf.Field.rst
Method cf.Field.set_mesh_id not in docs/source/class/cf.Field.rst
Method cf.Domain.del_mesh_id not in docs/source/class/cf.Domain.rst
Method cf.Domain.get_mesh_id not in docs/source/class/cf.Domain.rst
Method cf.Domain.has_mesh_id not in docs/source/class/cf.Domain.rst
Method cf.Domain.set_mesh_id not in docs/source/class/cf.Domain.rst
File {rst_file} does not exist
File {rst_file} does not exist
File {rst_file} does not exist
File {rst_file} does not exist
File {rst_file} does not exist
File {rst_file} does not exist
Traceback (most recent call last):
  File "/home/slb93/git-repos/cf-python/docs/source/check_docs_api_coverage.py", line 78, in <module>
    raise ValueError(
ValueError: Found 8 undocumented methods and 6 missing .rst files

It may not concern additions from this PR, and can always be sorted at release-time, but in case it is to do with the PR it would be good to do a little investigation.

I would agree with you in general! Here, h5netcdf and netCDF4 are proper nouns, being the names of libraries, so I in this case I would say that they camel case to H5netcdf and NetCDF4 respectively.

Fair point. In which case, they can stay that way, no problem. I respect a grammatical justification!

As for the in-line comments, I have a small number left to respond to, and I will mark a small number you migjt have missed as well. Then we should be good to merge, though I'll approve the PR again to confirm :)

@sadielbartholomew
Copy link
Member

I have responded to everything now, back to you @davidhassell...

cf/data/data.py Outdated
@@ -1415,7 +1415,7 @@ def _clear_after_dask_update(self, clear=_ALL):
self._cfa_del_write()

# REVIEW: getitem: `_set_dask`: new keyword 'asanyarray'
def _set_dask(self, dx, copy=False, clear=_ALL, asanyarray=False):
def _set_dask(self, dx, copy=False, clear=_ALL, __asanyarray__=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know we spoke about this today and agreed that a leading underscore to indicate internal-use intent for asanyarray is a good idea, but having both trailing and leading dunderscores for a name is meant to be reserved for special Python methods, so I don't think it is a good idea to have __asanyarray__ for anything, unless you want to elevate asanyarray in these cases to the level of a 'magic' method? I personally think it is misleading so best keep it as _anyanyarray in these cases too.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree - __asanyarray__= has gone!

Copy link
Member

@sadielbartholomew sadielbartholomew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing all of my feedback (and occasionally entertaining my silly comments when I thought there was an issue but there wasn't). I believe there is one comment left to look at, #805 (comment), and I've made a further comment RE the recent naming changes, but after that all is good! I am happy for you to merge after you've considered those.

And whilst we are here, do you know what this is about:

/home/slb93/git-repos/cf-python/cf/docstring/docstring.py:306: SyntaxWarning: invalid escape sequence '\_'
  "{{split_every: `int` or `dict`, optional}}": """split_every: `int` or `dict`, optional

since it is raised when I run the test suite (though not from a line change of this PR, but a recent one I believe). I can't see anything wrong with that particular line, so not sure what it is barking about?

davidhassell and others added 2 commits October 28, 2024 13:51
Co-authored-by: Sadie L. Bartholomew <[email protected]>
@davidhassell
Copy link
Collaborator Author

Hi Sadie - all done, now, I think!

I'm not sure about:

/home/slb93/git-repos/cf-python/cf/docstring/docstring.py:306: SyntaxWarning: invalid escape sequence '\_'
  "{{split_every: `int` or `dict`, optional}}": """split_every: `int` or `dict`, optional

as I don't see it when I run pre-commit, I'm using Python 3.12.2 - what about you?

@sadielbartholomew
Copy link
Member

Excellent, thanks @davidhassell.

I don't see it when I run pre-commit, I'm using Python 3.12.2 - what about you?

I am using 3.12.0. No worries, we can investigate separately since it points to a line you didn't touch so is very likely unrelated to this PR. Everything is ready I think so please merge!

@davidhassell
Copy link
Collaborator Author

A great many thanks to @sadielbartholomew for her customary careful and insightful review of this the huge PR. Much appreciated. Merging now.

@davidhassell davidhassell merged commit c734c0c into NCAS-CMS:main Oct 30, 2024
@davidhassell davidhassell added this to the NEXT VERSION milestone Oct 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implementing active storage reduction operations
2 participants