Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Auto-chunking for ERA5 loading #1608

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Auto-chunking for ERA5 loading #1608

wants to merge 1 commit into from

Conversation

hendrikmakait
Copy link
Member

This PR defaults to using auto chunks along the time dimension when loading the ERA5 dataset in geospatial benchmarks. Due to the small size of native inputs (4 MiB), this results in smaller graphs and (mostly) runtime improvements. Here are the runtimes for the small scale:

auto-chunking

The performance regression on test_highlevel_api is curious and should be investigated. Unfortunately, we do not measure task graph sizes, so I can't provide any numbers here.

@fjetter
Copy link
Member

fjetter commented Nov 15, 2024

It's not clear to me that this is a win overall. The only case that benefits from this is xesmf but the highlevel API thing is really bad. I also like the previous version because it is the default

@@ -72,6 +72,7 @@ def rechunk_map_blocks(
# Load dataset
ds = xr.open_zarr(
"gs://weatherbench2/datasets/era5/1959-2023_01_10-wb13-6h-1440x721.zarr",
chunks={"time": "auto"},
Copy link
Member

@jrbourbeau jrbourbeau Nov 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This the a case where I think the naive user-code is what we want. This may be slightly better but we want to optimize what most users will have in practice

EDIT: Whoops, I missing @fjetter's comment, which is basically the same as mine

@phofl
Copy link
Contributor

phofl commented Nov 15, 2024

To be clear: The default should be auto chunking. It is not totally clear to me how we get there, but this is what we should aim for

@jrbourbeau
Copy link
Member

The default is chunks="auto" -- looks like that's somehow different than chunks={"time": "auto"} (though it's not clear to me if that's intended or not)

@phofl
Copy link
Contributor

phofl commented Nov 15, 2024

it is as far as I understand, auto means whatever chunks zarr has because our auto chunking was kind of stupid in the past

@hendrikmakait
Copy link
Member Author

We don't have to merge this PR for now, but it showcases what @phofl and I believe would be a better default in general. This PR can be used to highlight benefits or complications on the way to establishing this. The biggest benefit is likely in graph sizes at larger scales, but we don't measure this and still run into a variety of other errors so that workloads still don't work reliably end-to-end.

@jrbourbeau
Copy link
Member

it showcases what @phofl and I believe would be a better default in general

Ah, I see. I hadn't realized this was being used to help motivate a new default. Thanks for clarifying that

@dcherian
Copy link
Contributor

dcherian commented Nov 15, 2024

You guys are seeing the classic earth system analytics problem: "spatial analysis" vs "timeseries analysis". There's no good chunking that works well for both problems, indeed i'd recommend running each workload with two orthogonal chunking strategies. The thing to fix is to make sure the workload runs in any case, even if its slow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants