-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimsed lazy indexing - h5netcdf backend - Active storage reductions #805
Optimsed lazy indexing - h5netcdf backend - Active storage reductions #805
Conversation
…on into dask-active-storage
Dask active storage version 0
From the review:
Now nicely handled: 0509135 |
From the #805 (review):
These test all pass for me ... I wonder what's going on, here? |
From the #805 (review):
This, it turns out, is a bug in EDIT: No, it's a bug in cf-python, afterall! Fixed in b72c17b. (Note that all will get refactored (for the better) when we daskify |
From the #805 (review):
I agree. That's one for 4.0.0 perhaps (as it would be a breaking API change). Do you agree? |
From the #805 (review):
I would agree with you in general! Here, |
From #805 (review):
Fixed: 885a67d
|
Hi Sadie - I think I've responding to everything - over to you :) |
Thanks for addressing my feedback. I'll reply to your response comments here - though in general if I have put the 'thumbs up' emoji on a comment I am happy and won't comment further!
This time round (after fetching your new commits) they all pass, so either something fixed them or we can treat them as flaky and ignore those failures. Flaky tests are of course a problem in themselves, but not to concern us for this PR. :)
Great, always good to fix a bug!
Sure, shall I open an Issue for it and add it to #776?
Nice, thanks for sorting the rest. Do you know what the '6 missing .rst files' warning is about?: $ ./check_docs_api_coverage ─╯
Method cf.Field.del_mesh_id not in docs/source/class/cf.Field.rst
Method cf.Field.get_mesh_id not in docs/source/class/cf.Field.rst
Method cf.Field.has_mesh_id not in docs/source/class/cf.Field.rst
Method cf.Field.set_mesh_id not in docs/source/class/cf.Field.rst
Method cf.Domain.del_mesh_id not in docs/source/class/cf.Domain.rst
Method cf.Domain.get_mesh_id not in docs/source/class/cf.Domain.rst
Method cf.Domain.has_mesh_id not in docs/source/class/cf.Domain.rst
Method cf.Domain.set_mesh_id not in docs/source/class/cf.Domain.rst
File {rst_file} does not exist
File {rst_file} does not exist
File {rst_file} does not exist
File {rst_file} does not exist
File {rst_file} does not exist
File {rst_file} does not exist
Traceback (most recent call last):
File "/home/slb93/git-repos/cf-python/docs/source/check_docs_api_coverage.py", line 78, in <module>
raise ValueError(
ValueError: Found 8 undocumented methods and 6 missing .rst files It may not concern additions from this PR, and can always be sorted at release-time, but in case it is to do with the PR it would be good to do a little investigation.
Fair point. In which case, they can stay that way, no problem. I respect a grammatical justification! As for the in-line comments, I have a small number left to respond to, and I will mark a small number you migjt have missed as well. Then we should be good to merge, though I'll approve the PR again to confirm :) |
I have responded to everything now, back to you @davidhassell... |
cf/data/data.py
Outdated
@@ -1415,7 +1415,7 @@ def _clear_after_dask_update(self, clear=_ALL): | |||
self._cfa_del_write() | |||
|
|||
# REVIEW: getitem: `_set_dask`: new keyword 'asanyarray' | |||
def _set_dask(self, dx, copy=False, clear=_ALL, asanyarray=False): | |||
def _set_dask(self, dx, copy=False, clear=_ALL, __asanyarray__=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know we spoke about this today and agreed that a leading underscore to indicate internal-use intent for asanyarray
is a good idea, but having both trailing and leading dunderscores for a name is meant to be reserved for special Python methods, so I don't think it is a good idea to have __asanyarray__
for anything, unless you want to elevate asanyarray
in these cases to the level of a 'magic' method? I personally think it is misleading so best keep it as _anyanyarray
in these cases too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree - __asanyarray__=
has gone!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing all of my feedback (and occasionally entertaining my silly comments when I thought there was an issue but there wasn't). I believe there is one comment left to look at, #805 (comment), and I've made a further comment RE the recent naming changes, but after that all is good! I am happy for you to merge after you've considered those.
And whilst we are here, do you know what this is about:
/home/slb93/git-repos/cf-python/cf/docstring/docstring.py:306: SyntaxWarning: invalid escape sequence '\_'
"{{split_every: `int` or `dict`, optional}}": """split_every: `int` or `dict`, optional
since it is raised when I run the test suite (though not from a line change of this PR, but a recent one I believe). I can't see anything wrong with that particular line, so not sure what it is barking about?
Co-authored-by: Sadie L. Bartholomew <[email protected]>
Hi Sadie - all done, now, I think! I'm not sure about:
as I don't see it when I run |
Excellent, thanks @davidhassell.
I am using |
A great many thanks to @sadielbartholomew for her customary careful and insightful review of this the huge PR. Much appreciated. Merging now. |
Fixes #501
This PR has three parts, which ideally would have been in three separate PRs, but they needed to be co-developed, and splitting them out retrospectively turned out to be to hard. The three parts are:
Optimised lazy indexing
NetCDF4Array
object) no longer triggers a read from disk. This has to be done within another layer by converting the indexed array object to anumpy
array (typically with the newcf_asnyarray
function). When an already-indexed array object is indexed again, the two indices are effectively combined to create a new index. E.g. If the array has shape(10, 10)
, the first index is[:, :]
, the second index is[::2, [1, 5, 3, 7]]
, and the third index is[::-2, 1::2]
, then this is equivalent to the single index[8::-4, 5:8:2]
. Only the data for the combined index needs to retrieved from disk, which is a huge performance improvement, as the amount of IO is typically vastly reduced. The same happens for array objects in memory (such asnumpy
arrays), but for these the performance improvements are negligible.h5netcdf
h5netcdf
as a backend. Essentially making cf-python compatible with h5netcdf read cfdm#307.Active storage reductions
To help alleviate the difficulties of reviewing a 3-in-1 PR, I have marked up every code change with which of the three parts it relates to, and provided a checklist for all of the changes.
Optimised lazy indexing
_docstring_substitution_definitions
: 'asanyarray'test_NetCDF4Array
: testNetCDF4Array.shape
test_NetCDF4Array
: testNetCDF4Array.index
test_Data__getitem__
: Chained subspaces reading from disktest_Data_rechunk
: rechunking after a getitemtest_Data_active_storage
: testData.active_storage
test_Data_cull_graph
: prevent new asanyarray layertest_Data_is_masked
: testData.is_masked
test_Field_indices
: make sure works when 'g.array' is not maskedtest_Field_indices
: make sure works when 'g.array' is not maskedtest_FullArray
: new test moduleto_dask_array
: new keyword 'asanyarray'_cfa_write_non_standard_terms
: set 'asanyarray'_cfa_unique
: convert a to a usable array_cfa_aggregation_instructions
: set 'asanyarray'regrid.py
: importcf_asanyarray
regrid
: convert a to a usable arraydata.py
: import cf_asanyarray, cf_filled, cf_is_masked__init__
: set 'asanyarray'__init__
: Set whether or not to callnp.asanyarray
on chunks to convert them to numpy arrays.__init__
: set 'asanyarray'cf_contains
: set 'asanyarray'__len__
: set 'asanyarray'__getitem__
: set 'asanyarray'__getitem__
: set 'asanyarray=True' because subspaced chunks might not be in memory__asanyarray__
: new property__asanyarray__
_set_dask
: new keyword 'asanyarray'_set_dask
: set 'asanyarray'percentile
: set 'asanyarray'percentile
: rectify commentrechunk
: set 'asanyarray'_asdatetime
: set 'asanyarray'_asreftime
: set 'asanyarray'_regrid
: set 'asanyarray'concatenate
: set 'asanyarray'concatenate
: define the asanyarray statusconcatenate
: set 'asanyarray'chunks
: set 'asanyarray'Units
: set 'asanyarray'dtype
: set 'asanyarray'is_masked
: set 'asanyarray'nbytes
: set 'asanyarray'ndim
: set 'asanyarray'npartitions
: set 'asanyarray'numblocks
set 'asanyarray'shape
: set 'asanyarray'size
set 'asanyarray'convert_reference_time
: set 'asanyarray'get_deterministic_name
: set 'asanyarray'get_filenames
: set 'asanyarray'add_file_location
: set 'asanyarray'unique
: set 'asanyarray'hardmask
: set 'asanyarray'soften_mask
: set 'asanyarray'file_locations
: set 'asanyarray'filled
: set 'asanyarray'to_dask_array
: new keyword 'asanyarray'del_file_location
: set 'asanyarray'where
: set 'asanyarray'where
: set 'asanyarray'where
: 'da.asanyarray' is no longer requiredcull_graph
: set 'asanyarray'todict
: new keywords 'apply_mask_hardness', 'asanyarray'_get_array
: new method to convert subspace to numpy array.__init__.py
: importIndexMixin
IndexMixin
: new mixin classIndexMixin
_get_array
: Ignore this for h5 review_get_array
: new method to convert subspace to numpy array_get_array
: new method to convert subspace to numpy array_set_FillValue
: record _FillValue in attributes_set_units
: record units in attributes_set_unpack
: record unpack in attributes_get_array
: new method to convert subspace to numpy arrayarray
: New property to convert subspace to numpy arraycf_contains
: convert a to a usable arraycf_convolve1d
: convert a to a usable arraycf_harden_mask
: convert a to a usable arraycf_percentile
: convert a to a usable arraycf_soften_mask
: convert a to a usable arraycf_where
: convert array, condition, x, y to usable arrayscf_rt2dt
: convert a to a usable arraycf_dt2rt
: convert a to a usable arraycf_units
: convert a to a usable arraycf_is_masked
: convert a to a usable arraycf_filled
: convert a to a usable arraycf_asanyarray
: convert a to a usable arraycf_asanyarray
: convert a to a usable array_get_array
: new method to convert subspace to numpy array_size_1_axis
: refactor to useoriginal_shape
NetCDFFragmentArray
: new inheritance to allow for different netCDF backends_get_array
: new method to convert subspace to numpy arrayto_dask
: set 'asanyarray'to_dask
: The file lock is now on theArray
object (in its_get_array
method), rather than being set on the Dask array itself.collapse
: set 'asanyarray'get_subspace
: remove deprecated functionh5netcdf
test_active_storage.py
: new test moduletest_NetCDF4Array.py
: renamed 'NetCDFArray' to 'NetCDF4Array'test_NetCDF4Array
: renamed 'NetCDFArray' to 'NetCDF4Array'test_NetCDF4Array
: renamed 'NetCDFArray' to 'NetCDF4Array'test_NetCDF4Array
: renamed 'NetCDFArray' to 'NetCDF4Array'test_NetCDF4Array
: renamed 'NetCDFArray' to 'NetCDF4Array'test_NetCDF4Array
: renamed 'NetCDFArray' to 'NetCDF4Array'test_read_mask
: rename numpy to nptest_write_datatype
: rename numpy to npcreate_data
: replace units/calendar API with attributeswrite
: docstring improvements_create_data
: control caching_create_data
: replace units/calendar API with attributes_create_data
: don't cache data from CFA variables_create_cfanetcdfarray
: docstring/comment improvements_create_cfanetcdfarray
: choose the correct netCDF backend_create_cfanetcdfarray_term
: fix unknown fragment shape_create_cfanetcdfarray_term
: choose the correct netCDF backend_cfa_parse_aggregated_data
: usecfdm.netcdf_indexer
to get dataread
: new 'unpack' parameter to control auto-unpacking (previously always True)read
: new 'netcdf_backend' parameter to control how to read filesread
: new 'storage_options' parameter to control access to S3read
: 'cache' parameter to control whether or not to get to cache selected data elements__init__.py
: importCFAH5netcdfArray
__init__.py
: importCFAH5netcdfArray
__init__.py
: importH5netcdfArray
__init__.py
: importNetCDF4Array
H5netcdfArray
: New class to access netCDF withh5netcdf
_get_array
: Ignore this for h5 review__init__
: replace units/calendar API with attributes_parse_cfa
: Refactoring of code that used to be in__init__
get_storage_options
: new method to get file access optionsNetCDFArray
: Replaced byNetCDF4Array
CFAH5netcdfArray
: New class for accessing CFA withh5netcdf
locks.py
: New module to provide file locksNetCDF4Array
: New class to access netCDF withnetCDF4
, replacesNetCDFArray
__init__
: replace units/calendar API with attributesCFAnetCDF4Array
: New class for accessing CFA withnetCDF4
__init__
: replace units/calendar API with attributes__init__.py
: importH5netcdfFragmentArray
__init__.py
: importNetCDF4FragmentArray
__init__
: replace units/calendar API with attributes__init__
: replace units/calendar API with attributesH5netcdfFragmentArray
: New class to access netCDF fragment withh5netcdf
__init__
: replace units/calendar API with attributes__init__
: new keyword 'storage_options'NetCDF4FragmentArray
: New class to access netCDF fragment withnetCDF4
relpath
: minor refactorrelpath
: minor refactorrelpath
: minor refactorenvironment
: new dependenciescfimplementation.py
: importCFAH5netcdfArray
,CFANetCDF4Array
,H5netcdfArray
,NetCDF4Array
initialise_CFANetCDF4Array
: new method to initialiseCFANetCDF4Array
initialise_CFAH5netcdfArray
: new method to initialiseCFAH5netcdfArray
Active storage
test_configuration
: testcf.active_storage
, cf.active_storage_url, cf.active_storage_max_requests
collapse_active.py
: new module for active storage functionalitydask_collapse.py
: all unlabelled changes in this module are general tidying, and should be reviewed at the same time as active storagecf_mean_chunk
: active storage decorationcf_max_chunk
: active storage decorationcf_min_chunk
: active storage decorationcf_range_chunk
: active storage decorationcf_rms_chunk
: active storage decorationcf_sample_size_chunk
: active storage decorationcf_sum_chunk
: active storage decorationcf_sum_of_weights_chunk
: active storage decorationcf_sum_of_weights2_chunk
: active storage decorationcf_unique_chunk
: active storage decorationcf_var_chunk
: active storage decoration__init__.py
: importActiveStorageMixin
_meta
: Moved to here fromFileArrayMixin
ActiveStorageMixin
: new mixin classActiveStorageMixin
to_dask
: '_dask_meta' renamed to '_meta' for consistency with Daskcollapse
: pass the active storage status onto the collapse functionsparse_weights
: minor refactorconfiguration
: new keywords 'active_storage', 'active_storage_url', 'active_storage_max_requests'_configuration
: new keywords 'active_storage', 'active_storage_url', 'active_storage_max_requests'regrid_logging
: new examplesrelaxed_identities
: new examplestempdir
: new examplesactive_storage
: new functionactive_storage_url
: new functionactive_storage_max_requests
: new functioncollapse
: include size 1 axes in collapseCONSTANTS
: new constants 'active_storage', 'active_storage_url', 'active_storage_max_requests'