Auto-chunking for ERA5 loading #1608
base: main
Conversation
It's not clear to me that this is a win overall. The only case that benefits from this is xesmf, but the high-level API regression is really bad. I also like the previous version because it is the default.
```diff
@@ -72,6 +72,7 @@ def rechunk_map_blocks(
     # Load dataset
     ds = xr.open_zarr(
         "gs://weatherbench2/datasets/era5/1959-2023_01_10-wb13-6h-1440x721.zarr",
+        chunks={"time": "auto"},
```
This is a case where I think the naive user code is what we want. This may be slightly better, but we want to optimize what most users will have in practice.
EDIT: Whoops, I missed @fjetter's comment, which is basically the same as mine.
To be clear: the default should be auto chunking. It is not totally clear to me how we get there, but this is what we should aim for.
The default is
It is, as far as I understand; `auto` used to mean whatever chunks zarr has, because our auto-chunking was fairly naive in the past.
We don't have to merge this PR for now, but it showcases what @phofl and I believe would be a better default in general. This PR can be used to highlight benefits or complications on the way to establishing that default. The biggest benefit is likely in graph sizes at larger scales, but we don't measure this, and we still run into a variety of other errors, so workloads don't yet run reliably end-to-end.
Ah, I see. I hadn't realized this was being used to help motivate a new default. Thanks for clarifying.
You guys are seeing the classic earth system analytics problem: "spatial analysis" vs. "timeseries analysis". There's no single chunking that works well for both; indeed, I'd recommend running each workload with two orthogonal chunking strategies. The thing to fix is to make sure the workload runs in any case, even if it's slow.
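The two orthogonal strategies mentioned above can be sketched with plain chunk arithmetic (a minimal illustration; the dimension sizes approximate a 6-hourly, 0.25° ERA5 grid, and the chunk shapes are hypothetical, not taken from the benchmark):

```python
import math

# Approximate ERA5 dimension sizes (6-hourly steps for 1959-2023, 721x1440 grid).
dims = {"time": 94_000, "latitude": 721, "longitude": 1440}

def n_chunks(dims, chunk_shape):
    """Number of chunks (i.e. load tasks per variable) a chunking produces."""
    total = 1
    for name, size in dims.items():
        total *= math.ceil(size / chunk_shape[name])
    return total

# Strategy A: short time slabs, full spatial fields -- suits spatial analysis.
spatial = {"time": 24, "latitude": 721, "longitude": 1440}

# Strategy B: full time series, small spatial tiles -- suits timeseries analysis.
timeseries = {"time": 94_000, "latitude": 30, "longitude": 30}

print(n_chunks(dims, spatial))     # one chunk per 24-step time slab
print(n_chunks(dims, timeseries))  # one chunk per 30x30 spatial tile
```

Either layout makes one access pattern cheap and the orthogonal one expensive, which is why no single default chunking can serve both workloads well.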
This PR defaults to using `auto` chunks along the time dimension when loading the ERA5 dataset in geospatial benchmarks. Due to the small size of the native input chunks (4 MiB), this results in smaller graphs and (mostly) runtime improvements. Here are the runtimes for the small scale:

The performance regression on `test_highlevel_api` is curious and should be investigated. Unfortunately, we do not measure task graph sizes, so I can't provide any numbers there.
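A back-of-the-envelope view of why auto chunking shrinks the graph (a sketch: the 128 MiB figure is dask's default `array.chunk-size` target, the 4 MiB native chunk size comes from the discussion above, and the 10,000-chunk variable is an illustrative number, not a measurement):

```python
# Native input chunks are ~4 MiB; dask's "auto" chunking targets roughly
# 128 MiB per chunk by default (the `array.chunk-size` config value).
native_chunk_mib = 4
auto_target_mib = 128

# Roughly how many native chunks get coalesced into one auto chunk.
merge_factor = auto_target_mib // native_chunk_mib
print(merge_factor)  # -> 32

# Illustrative: a variable stored as 10_000 native chunks along time
# would need only ~1/32 as many load tasks after auto-chunking.
native_tasks = 10_000
auto_tasks = -(-native_tasks // merge_factor)  # ceiling division
print(auto_tasks)  # -> 313
```

The reduction applies per variable, so on a multi-variable dataset the total graph-size savings compound; measuring this directly would make the benefit concrete.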