Add benchmark for NetCDF --> Zarr cloud-optimization #1551
Conversation
# Get netCDF data files -- see https://registry.opendata.aws/nex-gddp-cmip6
# for dataset details.
file_list = []
for model in models:
    for variable in variables:
        source_directory = f"s3://nex-gddp-cmip6/NEX-GDDP-CMIP6/{model}/historical/r1i1p1f1/{variable}/*.nc"
        file_list += [f"s3://{path}" for path in s3.glob(source_directory)]
files = [s3.open(f) for f in file_list]
print(f"Processing {len(files)} NetCDF files")

ds = xr.open_mfdataset(
    files,
    engine="h5netcdf",
    combine="nested",
    concat_dim="time",
    parallel=True,
)
print(f"Converting {format_bytes(ds.nbytes)} from NetCDF to Zarr")
ds.to_zarr(s3_url)
@maxrjones thanks for pointing to this dataset over in #1545 (comment). Does this look like what you've seen in the wild?
I've gone ahead and added a rechunking step (from "pancake" to "pencil" chunks), which seems to be pretty common when cloud-optimizing a NetCDF dataset to Zarr.
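For reference, a minimal sketch of what that rechunking step could look like before writing to Zarr (the dimension names and chunk sizes below are illustrative, not the benchmark's actual values):

# Assumes `ds` is the Dataset opened with xr.open_mfdataset above, with
# "pancake" chunks: one chunk per time step spanning the full spatial grid.
# Rechunk to "pencil" chunks: long along time, small in space, which reads
# much better for time-series queries at a single point.
ds = ds.chunk({"time": -1, "lat": 100, "lon": 100})  # illustrative sizes
ds.to_zarr(s3_url)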
Yeah, I think it's right that when cloud-optimizing it's usually best to include a rechunking step. But there are many cases in which people avoid the cloud-optimization step and do subsequent analyses on the data loaded directly from the original NetCDF files.
Yeah, I think it's right that when cloud-optimizing it's usually best to include a rechunking step.
Awesome, thanks for confirming that's usually the case 👍
But there are many cases in which people avoid the cloud-optimization step and do subsequent analyses on the data loaded directly from the original NetCDF files
I guess in this case, the "subsequent analyses" is just writing to Zarr. Do you think this still captures user pain well? FWIW, in my experience, using xr.open_mfdataset(..., parallel=True) + any other step performs very poorly
I guess in this case, the "subsequent analyses" is just writing to Zarr. Do you think this still captures user pain well? FWIW, in my experience, using xr.open_mfdataset(..., parallel=True) + any other step performs very poorly
Yeah, it could make sense to focus first on the simplest operations that perform poorly. You mention that going from pancake to pencil chunks is a common step that performs poorly. That's true and is the motivation for the rechunker library. The worst performance would likely be seen when combining operations that perform optimally on pancake oriented chunks with operations that perform optimally on churro oriented chunks.
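For context, a hedged sketch of how the rechunker library is typically wired up for this pancake-to-pencil rewrite (the variable name, chunk sizes, and store URLs below are hypothetical placeholders, not the benchmark code):

from rechunker import rechunk

plan = rechunk(
    ds,                                                  # the xarray Dataset opened above
    target_chunks={"tasmax": {"time": 3650, "lat": 10, "lon": 10}},  # per-variable "pencil" chunks
    max_mem="1GB",                                       # memory budget per task
    target_store="s3://my-bucket/pencil-chunks.zarr",    # hypothetical output store
    temp_store="s3://my-bucket/rechunk-scratch.zarr",    # intermediate scratch store
)
plan.execute()  # runs the two-pass rechunking plan (with Dask by default)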
pancake oriented chunks...churro oriented chunks
💯
    engine="h5netcdf",
    combine="nested",
    concat_dim="time",
    parallel=True,
Suggested change:
-    parallel=True,
+    parallel=True, data_vars="minimal", coords="minimal", compat="override",
These may be needed for decent perf; I haven't looked at the files to be sure.
Ah, thanks for pointing those out. Are these extra kwargs decent defaults when reading in lots of NetCDF files? Just trying to get a sense of how often the different configurations are used.
^ yes, they are basically mandatory
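For concreteness, here's how the open step from the snippet above looks with those kwargs folded in (a sketch, assuming the same files list and engine):

ds = xr.open_mfdataset(
    files,
    engine="h5netcdf",
    combine="nested",
    concat_dim="time",
    parallel=True,
    data_vars="minimal",   # only concatenate data variables that actually vary along "time"
    coords="minimal",      # likewise for coordinates, instead of broadcasting everything
    compat="override",     # skip equality checks and take values from the first dataset
)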
Cool, I'll take them for a spin and update here. Also, should those be the default in xarray?
Thanks, @jrbourbeau, this is good to go once the workspace has been adjusted. I have one question regarding scaling this benchmark but that's non-blocking.
# 715 files. One model and all variables.
# Currently fails after hitting 20 minute idle timeout
# sending `to_zarr` graph to the scheduler.
models = models[:1]
Is there a difference here in how we scale with respect to models and variables? I'm wondering if it would make more sense to pick a subset (larger than one) of both models and variables instead.
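For example, scaling along both axes could look something like this (the subset sizes are hypothetical):

# Instead of one model with all variables, take a small subset of each,
# so the benchmark scales along both models and variables.
models = models[:3]
variables = variables[:3]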
👍
Co-authored-by: Hendrik Makait <[email protected]>
NetCDF datasets being slow/not scaling well has come up a lot. This PR adds a new benchmark that loads the nex-gddp-cmip6 dataset (https://registry.opendata.aws/nex-gddp-cmip6/) from AWS, which is stored as a bunch of .nc files, and converts it to Zarr, a more modern, cloud-optimized format.

This uses xr.open_mfdataset(..., parallel=True), which is both common and really slow when opening lots of NetCDF files; I like that because I've seen it with many users in practice.

One thing I'm not sure about is how representative this benchmark is as-is. I don't know whether folks do this NetCDF --> Zarr conversion in isolation, or always in conjunction with other "cloud-optimizing" steps like rechunking.
EDIT: Here's a cluster link for the "small" version of this test: https://cloud.coiled.io/clusters/594106/account/dask-engineering/information. It takes ~20 minutes and costs ~$0.75.