Add benchmark for NetCDF --> Zarr cloud-optimization #1551
Conversation
# Get netCDF data files -- see https://registry.opendata.aws/nex-gddp-cmip6
# for dataset details.
file_list = []
for model in models:
    for variable in variables:
        source_directory = f"s3://nex-gddp-cmip6/NEX-GDDP-CMIP6/{model}/historical/r1i1p1f1/{variable}/*.nc"
        file_list += [f"s3://{path}" for path in s3.glob(source_directory)]
files = [s3.open(f) for f in file_list]
print(f"Processing {len(files)} NetCDF files")

ds = xr.open_mfdataset(
    files,
    engine="h5netcdf",
    combine="nested",
    concat_dim="time",
    parallel=True,
)
print(f"Converting {format_bytes(ds.nbytes)} from NetCDF to Zarr")
ds.to_zarr(s3_url)
@maxrjones thanks for pointing to this dataset over in #1545 (comment). Does this look like what you've seen in the wild?
I've gone ahead and added a rechunking step (from "pancake" to "pencil" chunks), which seems to be pretty common when cloud-optimizing a NetCDF dataset to Zarr.
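For reference, a minimal sketch of what that rechunking step could look like before writing to Zarr (the dimension names and chunk sizes below are illustrative, not the benchmark's actual values):

# Assumes `ds` is the Dataset opened with xr.open_mfdataset above, with
# "pancake" chunks: one chunk per time step spanning the full spatial grid.
# Rechunk to "pencil" chunks: long along time, small in space, which reads
# much better for time-series queries at a single point.
ds = ds.chunk({"time": -1, "lat": 100, "lon": 100})  # illustrative sizes
ds.to_zarr(s3_url)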
Yeah, I think it's right that when cloud-optimizing it's usually best to include a rechunking step. But there are many cases in which people avoid the cloud-optimization step and do subsequent analyses on the data loaded directly from the original NetCDF files.
Yeah, I think it's right that when cloud-optimizing it's usually best to include a rechunking step.
Awesome, thanks for confirming that's usually the case 👍
But there are many cases in which people avoid the cloud-optimization step and do subsequent analyses on the data loaded directly from the original NetCDF files
I guess in this case, the "subsequent analyses" is just writing to Zarr. Do you think this still captures user pain well? FWIW, in my experience, using xr.open_mfdataset(..., parallel=True) + any other step performs very poorly
I guess in this case, the "subsequent analyses" is just writing to Zarr. Do you think this still captures user pain well? FWIW, in my experience, using xr.open_mfdataset(..., parallel=True) + any other step performs very poorly
Yeah, it could make sense to focus first on the simplest operations that perform poorly. You mention that going from pancake to pencil chunks is a common step that performs poorly. That's true and is the motivation for the rechunker library. The worst performance would likely be seen when combining operations that perform optimally on pancake oriented chunks with operations that perform optimally on churro oriented chunks.
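For context, a hedged sketch of how the rechunker library is typically wired up for this pancake-to-pencil rewrite (the variable name, chunk sizes, and store URLs below are hypothetical placeholders, not the benchmark code):

from rechunker import rechunk

plan = rechunk(
    ds,                                                  # the xarray Dataset opened above
    target_chunks={"tasmax": {"time": 3650, "lat": 10, "lon": 10}},  # per-variable "pencil" chunks
    max_mem="1GB",                                       # memory budget per task
    target_store="s3://my-bucket/pencil-chunks.zarr",    # hypothetical output store
    temp_store="s3://my-bucket/rechunk-scratch.zarr",    # intermediate scratch store
)
plan.execute()  # runs the two-pass rechunking plan (with Dask by default)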
pancake oriented chunks...churro oriented chunks
💯
    engine="h5netcdf",
    combine="nested",
    concat_dim="time",
    parallel=True,
Suggested change:
-    parallel=True,
+    parallel=True, data_vars="minimal", coords="minimal", compat="override",
These may be needed for decent perf; I haven't looked at the files to be sure.
Ah, thanks for pointing those out. Are these extra kwargs decent defaults when reading in lots of NetCDF files? Just trying to get a sense of how often the different configurations are used.
^ yes, they are basically mandatory
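For concreteness, here's how the open step from the snippet above looks with those kwargs folded in (a sketch, assuming the same files list and engine):

ds = xr.open_mfdataset(
    files,
    engine="h5netcdf",
    combine="nested",
    concat_dim="time",
    parallel=True,
    data_vars="minimal",   # only concatenate data variables that actually vary along "time"
    coords="minimal",      # likewise for coordinates, instead of broadcasting everything
    compat="override",     # skip equality checks and take values from the first dataset
)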
Cool, I'll take them for a spin and update here. Also, should those be the default in xarray?
Thanks, @jrbourbeau, this is good to go once the workspace has been adjusted. I have one question regarding scaling this benchmark but that's non-blocking.
# 715 files. One model and all variables.
# Currently fails after hitting 20 minute idle timeout
# sending `to_zarr` graph to the scheduler.
models = models[:1]
Is there a difference here in how we scale with respect to models and variables? I'm wondering if it would make more sense to pick a subset (larger than one) of both models and variables instead.
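For example, scaling along both axes could look something like this (the subset sizes are hypothetical):

# Instead of one model with all variables, take a small subset of each,
# so the benchmark scales along both models and variables.
models = models[:3]
variables = variables[:3]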
👍
Co-authored-by: Hendrik Makait <[email protected]>
NetCDF datasets being slow/not scaling well has come up a lot. This PR adds a new benchmark that loads the nex-gddp-cmip6 dataset (https://registry.opendata.aws/nex-gddp-cmip6/) from AWS, which is stored as a bunch of .nc files, and converts it to Zarr, a more modern, cloud-optimized format.

This uses xr.open_mfdataset(..., parallel=True), which is both common and really slow when opening lots of NetCDF files; I like that because I've seen it with many users in practice.

One thing I'm not sure about is how representative this benchmark is as-is. I don't know whether folks do this NetCDF --> Zarr conversion in isolation, or always in conjunction with other "cloud-optimizing" steps like rechunking.
EDIT: Here's a cluster link for the "small" version of this test: https://cloud.coiled.io/clusters/594106/account/dask-engineering/information. It takes ~20 minutes and costs ~$0.75.