Make NetCDF file cache handling compatible with dask distributed #2822
Not that this has to be handled in your PR, but if I remember correctly this `Dataset`-level `set_auto_maskandscale` was added to netcdf4-python quite a while ago. It seems error-prone and confusing to silently call the method only if it exists and not log/inform the user that it wasn't used when it was expected. Maybe we should remove this method on the file handler class and always call `file_handle.set_auto_maskandscale` no matter what. Your wrapper does it already.

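A minimal sketch of the unconditional call being suggested; the wrapper shape here is an assumption for illustration, not the PR's actual code:

```python
import netCDF4
from xarray.backends import CachingFileManager

# Hypothetical sketch: call set_auto_maskandscale unconditionally instead of
# guarding with hasattr(). The Dataset-level method has been in netcdf4-python
# for a long time, so a missing method should fail loudly rather than be
# silently skipped.
manager = CachingFileManager(netCDF4.Dataset, "/path/to/file.nc", mode="r")
file_handle = manager.acquire()
file_handle.set_auto_maskandscale(False)
```
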
I'm not sure how I feel about this function name. Obviously it makes sense in this PR because it solves this specific problem, but it feels like there is a (shorter) more generic name that gets the point across. Another thing is that `distributed_friendly` is mentioned here, but that friendliness is a side effect of the "serializable" nature of the way you're accessing the data here, right? `get_serializable_dask_array`? I don't feel super strongly about this, but the name was distracting to me, so I thought I'd say something.

Renamed to `get_serializable_dask_array`.

We should check how the docs render this. If the argument type isn't "clickable" to go directly to the xarray docs for the `CachingFileManager`, then we could wrap the mention of it in the description with a cross-reference.

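The exact snippet didn't survive the export; presumably it was a Sphinx cross-reference role along these lines (an assumption, including the hypothetical signature, not the original comment's text):

```python
def get_serializable_dask_array(manager, varname, chunks, dtype):
    """Get a dask array that can be safely serialized.

    Referencing :class:`~xarray.backends.CachingFileManager` in the
    docstring should render as a link into the xarray API docs, assuming
    intersphinx is configured for the project.
    """
```
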
The argument type was already clickable, but in the description it was not. I have now made it clickable in both cases (screenshot from local doc build omitted).

The `chunks` argument is never used here. The current call from the file handler accesses the full shape of the variable, so this is fine, but only for now; I mean that `map_blocks` will only ever call this function once. However, if you added a `block_info` kwarg to the function signature (or whatever the `map_blocks` special keyword argument is), then you could change the `[:]` to access a specific subset of the NetCDF file variable and only do a partial load. This should improve performance a lot (I think 🤞) if it was actually used in the file handler.

Hm? I'm passing `chunks=chunks` when I call `da.map_blocks`. What do you mean, it is never used? Do you mean I could be using `chunk-location` and `num-chunks` from a `block_info` dictionary passed to `get_chunk`? I will try to wrap my head around this ☺

Yes, I think that's what I'm saying. I think the result of `get_chunk()` right now is broken for any chunk size other than the full shape of the array, because you never do any slicing of the NetCDF variable inside `get_chunk()`. So, if you had a full array of 100x100 and a chunk size of 50x50, then `map_blocks` would call this function four times ((0-50, 0-50), (0-50, 50-100), (50-100, 0-50), (50-100, 50-100)), BUT each call would return the full 100x100 variable. So I think this would be a case where the dask array would say "yeah, I have shape 100x100", but then once you computed it you'd get a 200x200 array back.

Fixed it now, I think.
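For reference, a minimal sketch of the kind of partial read this thread converges on. The helper name and signature are assumptions for illustration, not the PR's actual code; the `block_info` structure is the standard one `dask.array.map_blocks` passes to its function:

```python
import dask.array as da
import netCDF4
import numpy as np
from xarray.backends import CachingFileManager


def get_chunk(manager, varname, block_info=None):
    """Read only the slice of the NetCDF variable that this block covers."""
    # block_info[None] describes the *output* block this call must produce;
    # "array-location" is a list of (start, stop) pairs, one per dimension.
    location = block_info[None]["array-location"]
    slices = tuple(slice(start, stop) for start, stop in location)
    dataset = manager.acquire()  # picklable manager reopens the file on workers
    return dataset[varname][slices]


manager = CachingFileManager(netCDF4.Dataset, "/path/to/file.nc", mode="r")
# A 100x100 variable read in 50x50 blocks: map_blocks calls get_chunk four
# times, and each call now returns only its own 50x50 slice instead of the
# full variable.
arr = da.map_blocks(
    get_chunk,
    manager,
    "some_variable",
    chunks=((50, 50), (50, 50)),
    dtype=np.float32,
    meta=np.array((), dtype=np.float32),
)
```
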
CodeScene Delta Analysis / CodeScene Cloud Delta Analysis (main), check warning on line 527 in satpy/readers/utils.py: ❌ New issue: Excess Number of Function Arguments.