Auto-chunking for ERA5 loading #1608
base: main
Conversation
It's not clear to me that this is a win overall. The only case that benefits from this is xesmf, but the high-level API regression is really bad. I also like the previous version because it is the default.
```diff
@@ -72,6 +72,7 @@ def rechunk_map_blocks(
     # Load dataset
     ds = xr.open_zarr(
         "gs://weatherbench2/datasets/era5/1959-2023_01_10-wb13-6h-1440x721.zarr",
+        chunks={"time": "auto"},
```
This is a case where I think the naive user code is what we want. This may be slightly better, but we want to optimize what most users will have in practice.
EDIT: Whoops, I missed @fjetter's comment, which is basically the same as mine.
To be clear: the default should be auto chunking. It is not totally clear to me how we get there, but this is what we should aim for.
The default is
It is, as far as I understand; `auto` used to mean whatever chunks zarr has, because our auto-chunking was fairly naive in the past.
We don't have to merge this PR for now, but it showcases what @phofl and I believe would be a better default in general. This PR can be used to highlight benefits or complications on the way to establishing that default. The biggest benefit is likely in graph sizes at larger scales, but we don't measure this, and we still run into a variety of other errors, so workloads don't yet run reliably end-to-end.
Ah, I see. I hadn't realized this was being used to help motivate a new default. Thanks for clarifying.
You guys are seeing the classic earth system analytics problem: "spatial analysis" vs. "timeseries analysis". There's no single chunking that works well for both; indeed, I'd recommend running each workload with two orthogonal chunking strategies. The thing to fix is to make sure the workload runs in any case, even if it's slow.
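The two orthogonal strategies mentioned above can be sketched with plain chunk arithmetic (a minimal illustration; the dimension sizes approximate a 6-hourly, 0.25° ERA5 grid, and the chunk shapes are hypothetical, not taken from the benchmark):

```python
import math

# Approximate ERA5 dimension sizes (6-hourly steps for 1959-2023, 721x1440 grid).
dims = {"time": 94_000, "latitude": 721, "longitude": 1440}

def n_chunks(dims, chunk_shape):
    """Number of chunks (i.e. load tasks per variable) a chunking produces."""
    total = 1
    for name, size in dims.items():
        total *= math.ceil(size / chunk_shape[name])
    return total

# Strategy A: short time slabs, full spatial fields -- suits spatial analysis.
spatial = {"time": 24, "latitude": 721, "longitude": 1440}

# Strategy B: full time series, small spatial tiles -- suits timeseries analysis.
timeseries = {"time": 94_000, "latitude": 30, "longitude": 30}

print(n_chunks(dims, spatial))     # one chunk per 24-step time slab
print(n_chunks(dims, timeseries))  # one chunk per 30x30 spatial tile
```

Either layout makes one access pattern cheap and the orthogonal one expensive, which is why no single default chunking can serve both workloads well.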
This PR defaults to using `auto` chunks along the time dimension when loading the ERA5 dataset in geospatial benchmarks. Due to the small size of the native input chunks (4 MiB), this results in smaller graphs and (mostly) runtime improvements. Here are the runtimes for the small scale:

The performance regression on `test_highlevel_api` is curious and should be investigated. Unfortunately, we do not measure task graph sizes, so I can't provide any numbers there.
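A back-of-the-envelope view of why auto chunking shrinks the graph (a sketch: the 128 MiB figure is dask's default `array.chunk-size` target, the 4 MiB native chunk size comes from the discussion above, and the 10,000-chunk variable is an illustrative number, not a measurement):

```python
# Native input chunks are ~4 MiB; dask's "auto" chunking targets roughly
# 128 MiB per chunk by default (the `array.chunk-size` config value).
native_chunk_mib = 4
auto_target_mib = 128

# Roughly how many native chunks get coalesced into one auto chunk.
merge_factor = auto_target_mib // native_chunk_mib
print(merge_factor)  # -> 32

# Illustrative: a variable stored as 10_000 native chunks along time
# would need only ~1/32 as many load tasks after auto-chunking.
native_tasks = 10_000
auto_tasks = -(-native_tasks // merge_factor)  # ceiling division
print(auto_tasks)  # -> 313
```

The reduction applies per variable, so on a multi-variable dataset the total graph-size savings compound; measuring this directly would make the benefit concrete.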