Aspirational use case: [C]Worthy mCDR OAE Atlas dataset #132

Open
15 of 21 tasks
TomNicholas opened this issue Jun 8, 2024 · 1 comment
Labels: documentation, usage example


TomNicholas commented Jun 8, 2024

The reason I made this package is to handle one particularly challenging use case - the [C]Worthy mCDR Atlas - which I still haven't done. Once it's done I plan to write a blog post talking about it, and maybe add it as a usage example to this repository.

This dataset has some characteristics that make it really challenging to kerchunk/virtualize¹:

  • It's ~50TB compressed on-disk,
  • It has ~500,000 netCDF files(!), each with about 40 variables,
  • The largest variables are 3-dimensional, and require concatenation along an additional 3 dimensions, so the resulting variables are 6-dimensional,
  • It requires merging in lower-dimensional variables too, not just concatenation,
  • It has time encoding on some coordinates.

This dataset is therefore comparable to some of the largest datasets already available in Zarr (at least in terms of the number of chunks and variables, if not on-disk size), and is very similar to the pathological case described in #104:

24MB per array means that even a really big store with 100 variables, each with a million chunks, still only takes up 2.4GB in memory - i.e. your xarray "virtual" dataset would be ~2.4GB to represent the entire store.
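
For concreteness, here is the arithmetic behind that estimate as a quick back-of-the-envelope check. The ~24 bytes of reference metadata per chunk is a figure implied by the quote rather than a measurement:

```python
# Back-of-the-envelope memory estimate for an in-memory chunk manifest
# (assumed figures, consistent with the quote above):
bytes_per_chunk_ref = 24          # ~24 bytes of path/offset/length metadata per chunk reference
chunks_per_variable = 1_000_000   # "a million chunks" per array
n_variables = 100                 # "a really big store with 100 variables"

per_array = bytes_per_chunk_ref * chunks_per_variable   # 24,000,000 bytes ≈ 24 MB
total = per_array * n_variables                          # 2,400,000,000 bytes ≈ 2.4 GB
print(f"{per_array / 1e6:.0f} MB per array, {total / 1e9:.1f} GB for the whole store")
```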

If we can virtualize this we should be able to virtualize most things 💪
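
To make the shape of the problem concrete, here is a rough sketch of the kind of workflow involved. The file names and the concatenation dimension are made-up placeholders, and the kwargs are the ones VirtualiZarr's docs suggest for combining virtual datasets without loading data, so treat this as illustrative rather than the final recipe:

```python
import xarray as xr
from virtualizarr import open_virtual_dataset

# Illustrative file list only; the real dataset has ~500,000 netCDF files
# nested along three extra dimensions (injection site, injection date, etc.).
files = ["OAE_atlas_member_000.nc", "OAE_atlas_member_001.nc"]  # made-up names

# Each call reads only the netCDF metadata and byte ranges, not the array data.
virtual_datasets = [open_virtual_dataset(f) for f in files]

# Concatenating ManifestArrays is cheap: only the chunk manifests are combined.
# coords="minimal" / compat="override" avoid eagerly comparing (virtual) values.
combined = xr.concat(
    virtual_datasets,
    dim="injection_member",  # placeholder name for one of the three new dimensions
    coords="minimal",
    compat="override",
    combine_attrs="override",
)

# Persist the references, e.g. as a kerchunk-format JSON file.
combined.virtualize.to_kerchunk("cworthy_oae_atlas.json", format="json")
```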


Getting this done requires many features to be implemented:

Additionally, once zarr-python actually understands some kind of chunk manifest, I want to go back and create an actual Zarr store for this dataset. That will also require:

Footnotes

  1. In fact, pretty much the only ways this dataset could be worse would be if it had differences in encoding between netCDF files, variable-length chunks, or netCDF groups, but thankfully it has none of those 😅


TomNicholas commented Jan 30, 2025

This notebook shows where I'm at with this so far. The notebook is intended to be linked from a blog post, and it explains why we tackle the problem the way we do.

The most interesting part is under the subtitle "The Whole Dataset". Although that code has not yet been executed, that's what we're going for.

It's got a bunch of holes in it at the moment though:

  • It uses open_virtual_mfdataset (added in #349), and I haven't actually tested deploying lithops in the cloud with that PR yet (though I do have a passing test for local lithops) - see the sketch after this list,
  • To deploy lithops I have to work out Docker, which is where I lost a bit of momentum,
  • It also uses Icechunk, zarr-python v3, and kerchunk all together, so it's subject to all the churn going on with those dependencies right now (see #392, "Planning explicit dependency on Zarr v3"), and it would be easier to sort all that out before continuing with this notebook,
  • It requires the append_dim kwarg to .to_icechunk, which isn't really documented yet (see #311, "Support appending to icechunk store").
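
For reference, a sketch of what the serverless metadata-read plus Icechunk append might look like, based on the PRs above. The open_virtual_mfdataset / to_icechunk interfaces are still moving targets, so the names, placeholders, and kwargs here are assumptions rather than a working recipe:

```python
import lithops
import xarray as xr
from virtualizarr import open_virtual_dataset

file_paths = ["..."]  # illustrative placeholder for the ~500,000 netCDF paths

def open_one(path):
    # Each worker reads only netCDF metadata and byte ranges for one file.
    return open_virtual_dataset(path)

# Fan the metadata-reading out over serverless workers (executor config assumed
# to be set up separately; the Docker runtime is the sticking point mentioned above).
fexec = lithops.FunctionExecutor()
futures = fexec.map(open_one, file_paths)
virtual_datasets = fexec.get_result(futures)

# Combine the cheap in-memory manifests back on the client.
vds = xr.concat(virtual_datasets, dim="time", coords="minimal", compat="override")

# Append the references to an existing Icechunk store (per #311);
# `icechunk_store` is assumed to exist, and the exact kwargs may change.
vds.virtualize.to_icechunk(icechunk_store, append_dim="time")
```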

Just FYI @matt-long @andersy005
