Problem
Kerchunk user code currently passes around an obscure, multiply-nested "reference dict" object. This is hard to read, interrogate, validate, or reason about.
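For concreteness, here is a minimal, invented example of the current format (the kerchunk version-1 reference spec); the file path, byte ranges, and variable name are made up for illustration:

```python
import json

# An invented example of the current kerchunk "reference dict" (version 1):
# metadata keys hold inline JSON strings, while chunk keys hold
# [path, offset, length] byte ranges into the original file.
reference_dict = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "temp/.zarray": json.dumps(
            {"shape": [10, 20], "chunks": [10, 20], "dtype": "<f8",
             "compressor": None, "fill_value": None, "filters": None,
             "order": "C", "zarr_format": 2}
        ),
        "temp/0.0": ["s3://bucket/file.nc", 30125, 1600],
    },
}

# Interrogating it means chasing nested keys and remembering positional
# conventions, e.g. that index 1 of a chunk entry is the byte offset:
offset = reference_dict["refs"]["temp/0.0"][1]
```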
Suggestion
Instead, create a new VirtualZarrStore dataclass containing all the same information that is currently stored in the reference dict, but in a more structured manner. This would then be the principal object passed around between user calls to the kerchunk API.
Advantages
Easier to read and interrogate than multiply-nested dicts
Allows direct validation
Serializes in obvious ways (via .to_json, .to_parquet, .to_dict, or similar)
Easier to write tests, using fixtures to generate VirtualZarrStore objects
Concentrates concerns over changes/enhancements to Zarr Spec in one class
Possibly easier to understand whenever anyone reimplements kerchunk in other languages?
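A minimal sketch of what such a dataclass could look like; the field and method names here are illustrative placeholders, not a proposed API:

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class VirtualZarrStore:
    """Illustrative sketch only: field and method names are placeholders."""
    # Zarr metadata documents, keyed by store path (e.g. "temp/.zarray")
    metadata: dict = field(default_factory=dict)
    # chunk key -> (path, offset, length) byte range into the original file
    chunks: dict = field(default_factory=dict)

    def to_dict(self) -> dict:
        return asdict(self)

    def to_json(self) -> str:
        # Serialization falls out "for free" from the structured attributes.
        return json.dumps(self.to_dict())

store = VirtualZarrStore(
    metadata={".zgroup": {"zarr_format": 2}},
    chunks={"temp/0.0": ("file.nc", 30125, 1600)},
)
```

Compared with the raw reference dict, the same information is now addressed by named attributes rather than positional conventions, which is what makes validation and testing straightforward.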
Implementation ideas
The implementation could subclass the Zarr Object Model classes (where .to_json is analogous to the ZOM's .serialize), which would then be solidified as the recommended abstract representation once ZEP006 is accepted
A bare ZOM class can't be used directly because we need to add some extra attributes for byte ranges etc. However, the information on where to find chunks is essentially a "Chunk Manifest", a generalizable idea that @jhamman has also been working on (for a nascent ZEP007??)
The attributes of this dataclass need to always be serializable, so the VirtualZarrStore should essentially conform to a JSON schema (see #373)
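The chunk-manifest idea together with the serializability requirement could be sketched as follows (names are hypothetical and not ZEP007's actual design):

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class ChunkRef:
    """One manifest entry: where to find the bytes of one chunk.
    (Hypothetical names; not ZEP007's actual design.)"""
    path: str
    offset: int
    length: int

def validate_manifest(manifest: dict) -> None:
    """Directly validate a {chunk_key: ChunkRef} manifest -- the kind of
    check that is awkward against a raw multiply-nested reference dict."""
    for key, ref in manifest.items():
        if ref.offset < 0 or ref.length <= 0:
            raise ValueError(f"bad byte range for chunk {key!r}: {ref}")
        # Every entry must stay JSON-serializable (cf. #373).
        json.dumps([ref.path, ref.offset, ref.length])

manifest = {"temp/0.0": ChunkRef("file.nc", 0, 1600)}
validate_manifest(manifest)  # passes silently
```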
Questions
Is it possible to do this in a broadly backwards-compatible manner?
I will want to spend some time thinking about this.
There are two objections that immediately come to mind:
most operations within kerchunk work on the content of keys, so they will always be working at the dict level to directly set values. The mapper and zarr views necessarily prevent this.
during combine, we now support writing directly to parquet. The interface is still store-like, but the access pattern is very different; so it's not a case of "build the dicts, then serialise to parquet", but "serialise to parquet on the fly" (in order to save memory).
So maybe it could be the other way around: the reference sets (dict-like stores) acquire .to_zarr and .to_mapper methods which use the information already contained within, but the primary representation is still dicts.
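That inverted design could look roughly like this (method names hypothetical; a real implementation would hand off to fsspec's reference filesystem rather than building views by hand):

```python
import json

class ReferenceSet(dict):
    """Hypothetical sketch: the primary representation stays a plain dict,
    so existing kerchunk code that sets keys directly keeps working, and
    convenience views are layered on top rather than replacing the dict."""

    def to_mapper(self):
        # Stand-in for something like fsspec.get_mapper("reference://", ...):
        # here we just expose the flat "refs" table as a mapping.
        return dict(self.get("refs", {}))

    def metadata(self, key):
        # Parse an inline JSON metadata entry such as ".zgroup".
        return json.loads(self["refs"][key])

refs = ReferenceSet({"version": 1, "refs": {".zgroup": '{"zarr_format": 2}'}})
refs["refs"]["temp/0.0"] = ["file.nc", 0, 1600]  # dict-level mutation still works
```

This keeps the access pattern that the combine/parquet machinery relies on, since the dict remains the thing being written to directly.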