
Add benchmark for NetCDF --> Zarr cloud-optimization #1551

Merged: 7 commits, Sep 20, 2024
Conversation

@jrbourbeau jrbourbeau commented Sep 18, 2024

NetCDF datasets being slow/not scaling well has come up a lot. This PR adds a new benchmark that loads the nex-gddp-cmip6 dataset (https://registry.opendata.aws/nex-gddp-cmip6/) from AWS, which is stored as a bunch of .nc files, and converts that dataset to Zarr, a more modern, cloud-optimized format.

This uses xr.open_mfdataset(..., parallel=True), which is both common and really slow when opening lots of NetCDF files. I like that because I've seen it happen to many users in practice.

One thing I'm not sure about is how representative this benchmark is as is. I don't know if folks do this NetCDF --> Zarr conversion in isolation, or always in conjunction with other "cloud optimizing" steps like rechunking.

EDIT: Here's a cluster link for the "small" version of this test https://cloud.coiled.io/clusters/594106/account/dask-engineering/information. It takes ~20 minutes and costs ~$0.75

Comment on lines 75 to 93
# Get netCDF data files -- see https://registry.opendata.aws/nex-gddp-cmip6
# for dataset details.
file_list = []
for model in models:
    for variable in variables:
        source_directory = f"s3://nex-gddp-cmip6/NEX-GDDP-CMIP6/{model}/historical/r1i1p1f1/{variable}/*.nc"
        file_list += [f"s3://{path}" for path in s3.glob(source_directory)]
files = [s3.open(f) for f in file_list]
print(f"Processing {len(files)} NetCDF files")

ds = xr.open_mfdataset(
    files,
    engine="h5netcdf",
    combine="nested",
    concat_dim="time",
    parallel=True,
)
print(f"Converting {format_bytes(ds.nbytes)} from NetCDF to Zarr")
ds.to_zarr(s3_url)
Member Author

@maxrjones thanks for pointing to this dataset over in #1545 (comment). Does this look like what you've seen in the wild?

Member Author

I've gone ahead and added a rechunking step (from "pancake" to "pencil" chunks), which seems to be pretty common when cloud-optimizing a NetCDF dataset to Zarr.
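For illustration only (this is not the benchmark's actual code), a pancake-to-pencil rechunk with xarray might look like the sketch below. The variable name, shapes, and chunk sizes are all made up:

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for a climate dataset (names/shapes are illustrative)
ds = xr.Dataset(
    {"tasmax": (("time", "lat", "lon"), np.zeros((365, 60, 120)))}
)

# "Pancake" chunks: one full spatial slice per time step, which is roughly
# how data lands when concatenating many per-timestep NetCDF files
pancake = ds.chunk({"time": 1})

# "Pencil" chunks: the whole time axis per chunk with small spatial tiles,
# which is much friendlier to time-series access
pencil = ds.chunk({"time": -1, "lat": 20, "lon": 20})

print(pancake["tasmax"].data.chunksize)  # (1, 60, 120)
print(pencil["tasmax"].data.chunksize)   # (365, 20, 20)
```

In a real pipeline the rechunked dataset would then be written out with to_zarr, as in the benchmark.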


Yeah, I think it's right that when cloud-optimizing it's usually best to include a rechunking step. But there are many cases in which people skip the cloud-optimization step and do subsequent analyses on the data loaded directly from the original NetCDF files.

Member Author

Yeah, I think it's right that when cloud-optimizing it's usually best to include a rechunking step.

Awesome, thanks for confirming that's usually the case 👍

But there are many cases in which people skip the cloud-optimization step and do subsequent analyses on the data loaded directly from the original NetCDF files

I guess in this case, the "subsequent analyses" is just writing to Zarr. Do you think this still captures user pain well? FWIW, in my experience xr.open_mfdataset(..., parallel=True) plus any other step performs very poorly.


I guess in this case, the "subsequent analyses" is just writing to Zarr. Do you think this still captures user pain well? FWIW, in my experience xr.open_mfdataset(..., parallel=True) plus any other step performs very poorly

Yeah, it could make sense to focus first on the simplest operations that perform poorly. You mention that going from pancake to pencil chunks is a common step that performs poorly. That's true, and it's the motivation for the rechunker library. The worst performance would likely be seen when combining operations that perform optimally on pancake-oriented chunks with operations that perform optimally on churro-oriented chunks.
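The access-pattern tradeoff here can be sketched with plain dask.array (the shapes and chunk sizes below are made up for illustration, not taken from the benchmark):

```python
import dask.array as da

# A year of daily global fields, "pancake"-chunked: one time step per chunk
pancake = da.zeros((365, 600, 1440), chunks=(1, 600, 1440))

# The same array rechunked into "pencils": full time axis, small spatial tiles
pencil = pancake.rechunk((365, 50, 60))

# A single-pixel time series touches every pancake chunk...
print(pancake[:, 0, 0].npartitions)  # 365
# ...but only one pencil chunk
print(pencil[:, 0, 0].npartitions)   # 1

# Conversely, one full spatial slice is one pancake chunk but many pencils
print(pancake[0].npartitions)  # 1
print(pencil[0].npartitions)   # (600/50) * (1440/60) = 288
```

Mixing operations that want the first layout with operations that want the second is what forces the expensive all-to-all rechunk.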

Member Author

pancake oriented chunks...churro oriented chunks

💯

@jrbourbeau jrbourbeau changed the title Add benchmark for converting NetCDF to Zarr Add benchmark for NetCDF --> Zarr cloud-optimization Sep 19, 2024
    engine="h5netcdf",
    combine="nested",
    concat_dim="time",
    parallel=True,
Contributor

Suggested change:
-    parallel=True,
+    parallel=True, data_vars="minimal", coords="minimal", compat="override",

These may be needed for decent perf, I haven't looked at the files to be sure

Member Author

Ah, thanks for pointing those out. Are these extra kwargs decent defaults when reading in lots of NetCDF files? Just trying to get a sense for how often the different configurations are used.

Contributor

^ yes, they are basically mandatory
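To illustrate what those kwargs buy (a toy sketch on in-memory datasets, not the benchmark code; the variable names and values are made up): by default xarray compares data variables and non-concatenated coordinates across every input to decide how to combine them, which forces eager reads when the inputs are lazy NetCDF files. data_vars="minimal", coords="minimal", compat="override" skip that comparison work. The same keywords exist on xr.concat, which is used here for brevity:

```python
import numpy as np
import xarray as xr

# Two toy "files" sharing an identical static lat coordinate
a = xr.Dataset(
    {"t": (("time", "lat"), np.zeros((2, 3)))},
    coords={"time": [0, 1], "lat": [10.0, 20.0, 30.0]},
)
b = xr.Dataset(
    {"t": (("time", "lat"), np.ones((2, 3)))},
    coords={"time": [2, 3], "lat": [10.0, 20.0, 30.0]},
)

# "minimal" only concatenates variables/coords that actually contain the
# concat dimension; "override" takes shared coords from the first dataset
# instead of checking them for equality across all inputs
fast = xr.concat(
    [a, b],
    dim="time",
    data_vars="minimal",
    coords="minimal",
    compat="override",
)
print(dict(fast.sizes))  # {'time': 4, 'lat': 3}
```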

Member Author

Cool, I'll take them for a spin and update here. Also, should those be the default in xarray?

@jrbourbeau jrbourbeau self-assigned this Sep 20, 2024

@hendrikmakait hendrikmakait left a comment


Thanks, @jrbourbeau, this is good to go once the workspace has been adjusted. I have one question regarding scaling this benchmark but that's non-blocking.

Comment on lines 65 to 68
# 715 files. One model and all variables.
# Currently fails after hitting 20 minute idle timeout
# sending `to_zarr` graph to the scheduler.
models = models[:1]
Member

Is there a difference here in how we scale with respect to models and variables? I'm wondering if it would make more sense to pick a subset (larger than one) of both models and variables instead.

Member Author

👍

@jrbourbeau jrbourbeau merged commit c87787c into main Sep 20, 2024
5 checks passed
@jrbourbeau jrbourbeau deleted the netcdf-to-zarr branch September 20, 2024 17:35