Dataclass for "VirtualZarrStore" #375

TomNicholas · 2023-10-16T20:03:56Z

Problem

Kerchunk user code currently passes around an obscure multiply-nested "reference dict" object. This is hard to read, interrogate, validate, or reason about.

Suggestion

Instead create a new VirtualZarrStore dataclass, which contains all the same information that is currently stored in the reference dict but in a more structured manner. This would then be the principal object that gets passed around between user calls to the kerchunk API.
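To make the suggestion concrete, here is a minimal sketch of what such a dataclass might look like. Everything in it is an assumption for illustration: the field names (metadata, inline, chunks), the ChunkRef helper, and the from_dict/to_dict methods are hypothetical, and the input is assumed to follow the version 1 reference layout, where metadata keys hold JSON strings and chunk keys hold either an inline string or a [path, offset, length] list.

```python
from __future__ import annotations

import json
from dataclasses import dataclass, field
from typing import Optional

# Zarr v2 metadata key names; everything else is treated as chunk data.
_METADATA_NAMES = (".zgroup", ".zarray", ".zattrs", ".zmetadata")


@dataclass(frozen=True)
class ChunkRef:
    """Hypothetical helper: where one chunk lives (whole file if no byte range)."""
    path: str
    offset: Optional[int] = None
    length: Optional[int] = None


@dataclass
class VirtualZarrStore:
    """Hypothetical structured view of a reference dict (names are illustrative)."""
    metadata: dict[str, dict] = field(default_factory=dict)    # e.g. "a/.zarray"
    inline: dict[str, str] = field(default_factory=dict)       # chunks stored as strings
    chunks: dict[str, ChunkRef] = field(default_factory=dict)  # e.g. "a/0.0"

    @classmethod
    def from_dict(cls, refs: dict) -> "VirtualZarrStore":
        store = cls()
        # Accept either the full {"version": ..., "refs": {...}} dict or the bare refs.
        for key, value in refs.get("refs", refs).items():
            name = key.rsplit("/", 1)[-1]
            if name in _METADATA_NAMES:
                store.metadata[key] = json.loads(value) if isinstance(value, str) else value
            elif isinstance(value, str):
                store.inline[key] = value
            else:
                path, *rest = value
                offset, length = rest if rest else (None, None)
                store.chunks[key] = ChunkRef(path, offset, length)
        return store

    def to_dict(self) -> dict:
        refs = {key: json.dumps(meta) for key, meta in self.metadata.items()}
        refs.update(self.inline)
        for key, ref in self.chunks.items():
            refs[key] = [ref.path] if ref.offset is None else [ref.path, ref.offset, ref.length]
        return {"version": 1, "refs": refs}
```

With this shape, a question like "which byte range backs chunk a/0.0?" becomes store.chunks["a/0.0"].offset rather than positional indexing into a nested list.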

Advantages

Easier to read and interrogate than multiply-nested dicts
Allows direct validation
Serializes in obvious ways (via .to_json, .to_parquet, .to_dict or similar.)
Easier to write tests, by using fixtures to generate VirtualZarrStore objects
Concentrates concerns over changes/enhancements to Zarr Spec in one class
A v2->v3 converter could act directly on these objects
Possibly easier to understand whenever anyone reimplements kerchunk in other languages?

Implementation ideas

Implementation could subclass Zarr Object Model classes (where .to_json is analogous to the ZOM's .serialize), which then would be solidified as the recommended abstract representation once ZEP006 is accepted
Can't use a bare ZOM class because we need to add some extra attributes for byte ranges etc. However, information on where to find chunks is essentially a "Chunk Manifest", a generalizable idea that @jhamman has also been working on (for a nascent ZEP007??)
Attributes of this dataclass need to always be serializable, so the VirtualZarrStore should be basically a JSON schema (see #373)
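As a rough illustration of the "chunk manifest" idea and the serializability requirement, the sketch below keeps chunk locations in a small validated record that round-trips through JSON. The names ChunkEntry, manifest_to_json and manifest_from_json are hypothetical, and the on-disk layout shown is an assumption, not a format defined by any ZEP or by kerchunk.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ChunkEntry:
    """Hypothetical manifest record: a byte range inside a file."""
    path: str
    offset: int
    length: int

    def __post_init__(self):
        # Validate at construction time so bad entries never enter a manifest.
        if not isinstance(self.path, str) or not self.path:
            raise ValueError("path must be a non-empty string")
        for name in ("offset", "length"):
            value = getattr(self, name)
            if isinstance(value, bool) or not isinstance(value, int) or value < 0:
                raise ValueError(f"{name} must be a non-negative integer, got {value!r}")


def manifest_to_json(manifest: dict) -> str:
    """Serialize a {chunk_key: ChunkEntry} mapping to a JSON string."""
    return json.dumps({key: asdict(entry) for key, entry in manifest.items()}, sort_keys=True)


def manifest_from_json(text: str) -> dict:
    """Rebuild the mapping; each entry is re-validated by ChunkEntry.__post_init__."""
    return {key: ChunkEntry(**fields) for key, fields in json.loads(text).items()}
```

Because every field is a plain string or integer, the record stays trivially serializable, and validation runs both when entries are created and when they are loaded back.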

Questions

Is it possible to do this in a broadly backwards-compatible manner?


martindurant · 2023-10-18T14:23:30Z

I will want to spend some time thinking about this.

There are two objections that immediately come to mind:

most operations within kerchunk work on the content of keys, so they will always be working at the dict level to directly set values. The mapper and zarr views necessarily prevent this.
during combine, we now support writing directly to parquet. The interface is still store-like, but the access pattern is very different; so it's not a case of "build the dicts, then serialise to parquet", but "serialise to parquet on the fly" (in order to save memory).

So maybe it could be the other way around: the reference sets, dict-like stores, acquire .to_zarr and .to_mapper methods which use the information already contained within, but the primary representation is still dicts.
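A minimal sketch of this "other way around" design, assuming the flat reference mapping stays the primary representation and only gains derived, read-only views. The class name ReferenceSet and its methods are hypothetical; the .to_zarr and .to_mapper methods mentioned above would delegate to zarr and fsspec and are left out to keep the example dependency-free. Metadata values are assumed to be JSON strings.

```python
import json


class ReferenceSet(dict):
    """Hypothetical dict subclass: keys and values remain directly settable."""

    def array_names(self):
        """Names of all arrays, found by looking for '<name>/.zarray' keys."""
        suffix = "/.zarray"
        return sorted(key[: -len(suffix)] for key in self if key.endswith(suffix))

    def array_metadata(self, name):
        """Decoded .zarray metadata for one array."""
        return json.loads(self[f"{name}/.zarray"])

    def chunk_keys(self, name):
        """Chunk keys directly under an array, skipping metadata and nested paths."""
        prefix = f"{name}/"
        keys = []
        for key in self:
            if not key.startswith(prefix):
                continue
            rest = key[len(prefix):]
            if rest and not rest.startswith(".") and "/" not in rest:
                keys.append(key)
        return sorted(keys)
```

Code that sets values at the dict level keeps working unchanged, for example refs["a/0.1"] = ["f.nc", 50, 50], while callers that only want to inspect the store can use the helper methods.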

TomNicholas mentioned this issue Oct 16, 2023

Refactor file format backend openers #376

Open

ivirshup mentioned this issue Oct 16, 2023

Refactor MultiZarrToZarr into multiple functions #377

Open

TomNicholas mentioned this issue Feb 3, 2024

JSON not properly decoded by backends #415

Open
