Dataclass for "VirtualZarrStore" #375

Open
TomNicholas opened this issue Oct 16, 2023 · 1 comment

@TomNicholas

Problem

Kerchunk user code currently passes around an obscure multiply-nested "reference dict" object. This is hard to read, interrogate, validate, or reason about.

Suggestion

Instead, create a new VirtualZarrStore dataclass that contains all the information currently stored in the reference dict, but in a more structured manner. This would then be the principal object passed around between user calls to the kerchunk API.

Advantages

  • Easier to read and interrogate than multiply-nested dicts
  • Allows direct validation
  • Serializes in obvious ways (via .to_json, .to_parquet, .to_dict or similar)
  • Easier to write tests, by using fixtures to generate VirtualZarrStore objects
  • Concentrates concerns over changes/enhancements to Zarr Spec in one class
  • A v2->v3 converter could act directly on these objects
  • Possibly easier to understand whenever anyone reimplements kerchunk in other languages?

Implementation ideas

  • Implementation could subclass Zarr Object Model classes (where .to_json is analogous to the ZOM's .serialize), which then would be solidified as the recommended abstract representation once ZEP006 is accepted
  • Can't use a bare ZOM class because we need to add some extra attributes for byte ranges etc. However, the information on where to find chunks is essentially a "Chunk Manifest", a generalizable idea that @jhamman has also been working on (for a nascent ZEP007??)
  • Attributes of this dataclass need to always be serializable, so the VirtualZarrStore should be basically a json schema (see #373)
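
To make the idea concrete, here is a minimal sketch of what such a dataclass might look like. It is only an illustration, not a proposed API: the `ChunkRef` class, the field names, and the `from_reference_dict` constructor are all hypothetical, and it does not subclass any Zarr Object Model class. It assumes the version 1 reference layout, in which `refs` maps each key either to a string (JSON-encoded metadata or inlined data) or to a list of `[url]` or `[url, offset, length]`.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ChunkRef:
    """One chunk manifest entry (hypothetical name): a byte range in a file."""
    path: str
    offset: Optional[int] = None
    length: Optional[int] = None

    def __post_init__(self):
        # Validation happens at construction, not when the chunk is first read.
        if (self.offset is None) != (self.length is None):
            raise ValueError("offset and length must be given together")
        if self.offset is not None and (self.offset < 0 or self.length < 0):
            raise ValueError("offset and length must be non-negative")


@dataclass
class VirtualZarrStore:
    """Structured view of a reference dict (hypothetical field names)."""
    metadata: dict = field(default_factory=dict)  # key -> parsed .zgroup/.zarray/.zattrs
    chunks: dict = field(default_factory=dict)    # key -> ChunkRef
    inline: dict = field(default_factory=dict)    # key -> inlined string data

    @classmethod
    def from_reference_dict(cls, reference_dict):
        store = cls()
        for key, value in reference_dict["refs"].items():
            if isinstance(value, list):
                store.chunks[key] = ChunkRef(*value)
            elif key.rsplit("/", 1)[-1] in (".zgroup", ".zarray", ".zattrs"):
                # Assumes metadata values are JSON-encoded strings.
                store.metadata[key] = json.loads(value)
            else:
                store.inline[key] = value
        return store

    def to_dict(self):
        refs = {key: json.dumps(meta) for key, meta in self.metadata.items()}
        refs.update(self.inline)
        for key, ref in self.chunks.items():
            if ref.offset is None:
                refs[key] = [ref.path]
            else:
                refs[key] = [ref.path, ref.offset, ref.length]
        return {"version": 1, "refs": refs}

    def to_json(self):
        return json.dumps(self.to_dict())


# Example with made-up paths and byte ranges.
example = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "a/.zarray": json.dumps({"shape": [2], "chunks": [1], "dtype": "<f8"}),
        "a/0": ["s3://bucket/file.nc", 100, 50],
        "a/1": ["s3://bucket/other.nc"],
    },
}
store = VirtualZarrStore.from_reference_dict(example)
```

Separating metadata, chunk references, and inlined data is one possible split; a real implementation would need to decide how parquet output and Zarr v3 metadata fit in.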

Questions

  • Is it possible to do this in a broadly backwards-compatible manner?
@martindurant
Member

I will want to spend some time thinking about this.

There are two objections that immediately come to mind:

  • most operations within kerchunk work on the content of keys, so they will always be working at the dict level to directly set values. The mapper and zarr views necessarily prevent this.
  • during combine, we now support writing directly to parquet. The interface is still store-like, but the access pattern is very different; so it's not a case of "build the dicts, then serialise to parquet", but "serialise to parquet on the fly" (in order to save memory).

So maybe it could be the other way around: the reference sets (dict-like stores) acquire .to_zarr and .to_mapper methods, which use the information they already contain, while the primary representation is still dicts.
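
A rough sketch of that alternative, under stated assumptions: `ReferenceSet` and its helper methods are hypothetical names, and the actual `.to_zarr` and `.to_mapper` methods are not shown because they would depend on zarr and fsspec. The point is only that the flat dict stays the primary representation, so existing code can keep setting keys directly, while structured views are derived on demand.

```python
import json


class ReferenceSet(dict):
    """Hypothetical dict subclass holding the inner "refs" mapping.

    Existing code can keep setting keys directly; the helper methods
    only derive read-only views from whatever keys are present.
    """

    def array_names(self):
        # An array is any prefix that has a ".zarray" metadata key.
        suffix = "/.zarray"
        return sorted(key[: -len(suffix)] for key in self if key.endswith(suffix))

    def array_meta(self, name):
        # Assumes metadata values are JSON-encoded strings.
        return json.loads(self[f"{name}/.zarray"])

    def chunk_keys(self, name):
        # Simplification: assumes no nested groups or arrays below `name`.
        prefix = name + "/"
        return sorted(
            key
            for key in self
            if key.startswith(prefix)
            and not key.rsplit("/", 1)[-1].startswith(".z")
        )


# Example with made-up paths and byte ranges.
refs = ReferenceSet(
    {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "a/.zarray": json.dumps({"shape": [2], "chunks": [1], "dtype": "<f8"}),
        "a/0": ["s3://bucket/file.nc", 100, 50],
    }
)
# Dict-level mutation still works, and the derived views reflect it.
refs["a/1"] = ["s3://bucket/file.nc", 150, 50]
```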
