
Appending to references on disk #21

Closed

TomNicholas opened this issue Mar 10, 2024 · 11 comments
Labels
Icechunk 🧊 (Relates to Icechunk library / spec)
references formats (Storing byte range info on disk)

Comments

@TomNicholas
Member

Kerchunk has some support for appending to references stored on disk as parquet. This pattern makes sense when you have operational data that gets new days of data appended to it, but you don't want to risk mutating the existing data.

Is there a sensible way to handle this kind of use case in the context of VirtualiZarr's API?

@forrestfwilliams

I'd be curious to know if this is possible (or planned) as well!

@TomNicholas
Member Author

@forrestfwilliams as it's possible in kerchunk, it should be possible here! The basic pattern to follow might look like xarray's .to_zarr method when using the append_dim kwarg (see the docs on modifying existing Zarr stores).
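For reference, the analogous xarray pattern looks roughly like this (toy data and store path, purely illustrative):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Stand-ins for "existing" and "new day" data
ds_initial = xr.Dataset(
    {"temp": ("time", np.zeros(3))},
    coords={"time": pd.date_range("2024-03-01", periods=3)},
)
ds_new_day = xr.Dataset(
    {"temp": ("time", np.ones(1))},
    coords={"time": pd.date_range("2024-03-04", periods=1)},
)

# Write the initial store, then append along the existing "time" dimension
ds_initial.to_zarr("store.zarr", mode="w")
ds_new_day.to_zarr("store.zarr", mode="a", append_dim="time")
```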

Personally this isn't the highest-priority feature for me, so if you're interested in it I'd be happy to help you think it through / contribute here 😄

@forrestfwilliams

Hey @TomNicholas, thanks. I work on a project called ITS_LIVE, which monitors glacier velocities for the entire globe, and we're trying to find a way to efficiently represent our entire archive of NetCDF files (hosted in an open S3 bucket) as a Zarr store.

We're creating new NetCDF files every day, so we'd like to find a way to use VirtualiZarr that doesn't involve re-creating the entire dataset every time.

@jhkennedy and @betolink (who are also on the ITS_LIVE team) may also be interested in this issue.

@TomNicholas
Member Author

@forrestfwilliams cool!

This pattern of appending could become quite neat if we use zarr chunk manifests instead of kerchunk's format. See this comment zarr-developers/zarr-specs#287 (comment)

@TomNicholas added the "references formats" label on May 16, 2024
@TomNicholas
Member Author

TomNicholas commented May 16, 2024

Instead of trying to append to the kerchunk parquet references on disk, what we could do is simply re-open the existing references as a virtual dataset, find byte ranges in the new files using open_virtual_dataset, concatenate the new and the old together, and write out the complete set of references again.

The advantage of this is that it works without trying to update part of the on-disk representation in-place (you simply re-write the entire thing in one go instead), and it doesn't require re-finding all the byte range information in all the files you already indexed. The disadvantage is that you are re-writing on-disk references that you already created. I think this could be a nice solution in the short term though.

To allow this we need a way to open existing kerchunk references as a virtual dataset, i.e.

vds = xr.open_virtual_dataset('kerchunk_refs.json', filetype='kerchunk_json')

which I've opened #118 to track.
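A rough sketch of that round-trip, assuming VirtualiZarr's open_virtual_dataset gains the ability to read Kerchunk references (which is what #118 tracks); the file names and concat kwargs here are illustrative:

```python
import xarray as xr
from virtualizarr import open_virtual_dataset

# Re-open the references already written to disk as a virtual dataset
existing = open_virtual_dataset("kerchunk_refs.json", filetype="kerchunk")

# Find byte ranges only in the newly arrived file
new = open_virtual_dataset("new_day.nc")

# Concatenate along the growing dimension and re-write the full reference set
combined = xr.concat([existing, new], dim="time", coords="minimal", compat="override")
combined.virtualize.to_kerchunk("kerchunk_refs.json", format="json")
```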

@TomNicholas
Member Author

TomNicholas commented Oct 23, 2024

To allow this we need a way to open existing kerchunk references as a virtual dataset

This was added in #251!


The basic pattern to follow might look like xarray's .to_zarr method when using the append_dim kwarg

This would be a pain with Kerchunk-formatted references, but with Icechunk it should be straightforward! We can simply add append_dim to virtualizarr's new .to_icechunk method. See earth-mover/icechunk#104 (comment) for more explanation.

This would be incredibly useful but it's not something I personally need right now, so if someone wants to have a crack at adding append_dim in the meantime then please go for it and I will advise 🙂
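To make the idea concrete, appending could look something like the sketch below once append_dim lands on .to_icechunk; the Icechunk repository/session/store calls are assumptions and the exact API has shifted between releases:

```python
import icechunk
from virtualizarr import open_virtual_dataset

# Open an existing Icechunk repository (path is illustrative)
storage = icechunk.local_filesystem_storage("/tmp/its_live_repo")
repo = icechunk.Repository.open(storage)
session = repo.writable_session("main")

# Index only the new file, then append its chunk references along "time"
vds = open_virtual_dataset("new_day.nc")
vds.virtualize.to_icechunk(session.store, append_dim="time")

session.commit("Append one more day of data")
```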

@rsignell
Collaborator

rsignell commented Oct 23, 2024

Whoa, this will be awesome! Maybe I can find someone to do this!

@TomNicholas
Member Author

Does anyone actually need a special option to append to kerchunk reference files?

Considering:
a) for the JSON case, that's functionally the same as opening it, concatenating virtual chunks, and re-saving a new refs.json file (you can do all of that with virtualizarr today), and
b) the Icechunk version of this feature was already completed by #272,

I think we should close this unless someone really wants to be able to specifically append to kerchunk parquet files in place.

@rsignell
Collaborator

rsignell commented Feb 5, 2025

Append-to-Icechunk is certainly enough for my use cases.

@betolink

betolink commented Feb 5, 2025

Hey @TomNicholas, we have a meeting with ITS_LIVE tomorrow @ 4PM ET and I'd love it if you could join us for 30 minutes to talk about this.

@douglatornell
Contributor

I've got a project brewing that will need appending, but I've been waiting and watching developments here to see what emerges as the recommended approach. Looks like that will be append-to-Icechunk, and that's okay with me!
