Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zarr Python 3 tracking issue #514

Open
4 tasks
jhamman opened this issue Oct 6, 2024 · 8 comments
Open
4 tasks

Zarr Python 3 tracking issue #514

jhamman opened this issue Oct 6, 2024 · 8 comments

Comments

@jhamman
Copy link
Contributor

jhamman commented Oct 6, 2024

Zarr-Python 3.0 is getting close to a full release. This issue tracks the integration of the 3.0 release with Kerchunk.

Here's a running list of issues we expect we need to solve either here or upstream:

  • Develop an abstraction over Zarr Python's store interface (Zarr-Python 3.0's store is no longer a mutable mapping)
  • Develop a reference-filesystem like store for Zarr

Eventually, we may also want to:

  • construct Zarr's metadata classes directly rather than using Zarr's top-level API w/ memory stores
  • support writing v3 metadata (Kerchunk and Zarr V3 #235 is the issue for that)

xref: #504

@mpiannucci
Copy link
Contributor

mpiannucci commented Oct 23, 2024

I have been working on the zarr python support in kerchunk in #516, but i need to cycle off of it for a bit. So I am going to list out what is left to be done here and how it can be split off as best I can in hopes that someone else can help me to move this along in the meantime:

Heres where the functionality is at in that PR.

Generating references:

  • HDF
  • netcdf3
  • grib2
  • zarr (not tested, might work)
  • tiff (not tested, might work)
  • others (not sure)

Reading with xarray

Currently creating a store works, however there are caveats:

  • fsspec ReferenceFilesystem async cat_file is broken Fix broken async reference file system _cat_file method filesystem_spec#1734
  • zarr 3 RemoteStore issues: zarr python 3 requires that an fsspec filesystem used for a RemoteStore is an AsyncFilesystem. ReferenceFilesystem supports this, however this also means that the remote filesystem within the ReferenceFilesystem must be async. This means that all usage where data files are on a non async filesystem will not work. This blocks most of the tests in kerchunk as it stands today because LocalFilesystem is not async

Codecs

Codecs (filters, compressors) are treated differently in zarr 3. zarr 2 stores read with zarr 3 will still use numcodecs codecs, however if users want to ever use these codecs with another store (say to read a grib virtual dataset with Icechunk) the codecs from this package need to conform to zarr's Codec abc

  • Codecs support zarr 3 Codec API

I think that this is mostly it, and im happy to help anyone who is interested in helping drive this forward

@martindurant
Copy link
Member

remote filesystem within the ReferenceFilesystem must be async

where data files are on a non async filesystem will not work

In a side channel, I commented that such a wrapper could be written, but would be a pretty ugly piece of code.

@ap--
Copy link

ap-- commented Oct 23, 2024

Sorry for jumping into the discussion, I've been following along for a bit.
Would the asynclocal filesystem in morefs be a good workaround for enabling tests?

https://github.com/iterative/morefs/blob/ea30fa8d3fa81a2b739b20810a81aede490e4571/src/morefs/asyn_local.py#L30

@martindurant
Copy link
Member

Sorry, @ap-- , only just noticed what you wrote. Your async wrapper for local is useful and should be more comprehensive than what I have done (below); but we might want to be able to wrap other sync filesystem too.

#524 is a sketch of how you might wrap a synchronous filesystem to make it work with async calls.

In [1]: import fsspec
   ...: import kerchunk.sync_wrapper
   ...: fs = fsspec.filesystem("file")
   ...: sfs = kerchunk.sync_wrapper.SyncronousWrapper(fs)
   ...: await sfs._ls(".")
   ...:
Out[1]:
[files, ...]

@moradology
Copy link

moradology commented Oct 30, 2024

I'm taking a crack at some of the outstanding tasks here though I'm not certain I have all the context that might be useful, so please disabuse me of any confusions that are obvious. As of now, cutting changes against @mpiannucci's v3 branch. Here's support for an abstraction over Zarr Python's store interface

So far, I've tested things on a locally generated zarr store produced from a couple of concatenated netcdfs. Any wisdom regarding your local testing and debugging workflows might prove useful going forward

EDIT: rereading things and thinking about the intent here, I'm wondering if my approach in the above PR (which is just to incorporate special logic for handling Store instances where relevant in kerchunk) is off. Might the 'abstraction over Zarr python's store interface' better be addressed by simply implementing a MutableMapping instance that wraps Zarr-Python's Store within zarr python itself?

@mpiannucci
Copy link
Contributor

With this PR merged fsspec/filesystem_spec#1745 we are one step closer to getting the tests updated to work against zarr 3. Great work @moradology

I dont think that zarr-developers/zarr-python#2468 is necessary to get the tests working here, as you can always create the filesystem and pass it to the RemoteStore manually without using a url

@TomNicholas
Copy link

Is it just the codecs support left to do here now? I'm wondering if the references generation can be punted on by e.g. warning / raising if you try to call ZarrToZarr with zarr-python v3 installed...

@mpiannucci
Copy link
Contributor

@moradology is working on the tests, but yes AFAIK the codecs are the main thing left other than zarr to zarr and maybe multizarrtozarr

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants