
Appending to references on disk #21

Closed

TomNicholas opened this issue Mar 10, 2024 · 11 comments
Labels
Icechunk 🧊 (Relates to Icechunk library / spec)
references formats (Storing byte range info on disk)

Comments

@TomNicholas
Member

Kerchunk has some support for appending to references stored on disk as parquet. This pattern makes sense when you have operational data that gets new days of data appended to it, but you don't want to risk mutating the existing data.

Is there a sensible way to handle this kind of use case in the context of VirtualiZarr's API?

@forrestfwilliams

I'd be curious to know if this is possible (or planned) as well!

@TomNicholas
Member Author

@forrestfwilliams as it's possible in kerchunk, it should be possible here! The basic pattern to follow might look like xarray's .to_zarr method when using the append_dim kwarg (see the docs on modifying existing Zarr stores).
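For reference, the analogous xarray pattern looks roughly like this (toy data and store path, purely illustrative):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Stand-ins for "existing" and "new day" data
ds_initial = xr.Dataset(
    {"temp": ("time", np.zeros(3))},
    coords={"time": pd.date_range("2024-03-01", periods=3)},
)
ds_new_day = xr.Dataset(
    {"temp": ("time", np.ones(1))},
    coords={"time": pd.date_range("2024-03-04", periods=1)},
)

# Write the initial store, then append along the existing "time" dimension
ds_initial.to_zarr("store.zarr", mode="w")
ds_new_day.to_zarr("store.zarr", mode="a", append_dim="time")
```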

Personally this isn't the highest-priority feature for me, so if you're interested in it I'd be happy to help you think it through / contribute here 😄

@forrestfwilliams

Hey @TomNicholas, thanks. I work on a project called ITS_LIVE, which monitors glacier velocities for the entire globe, and we're trying to find a way to efficiently represent our entire archive of NetCDF files (hosted in an open S3 bucket) as a Zarr store.

We're creating new NetCDF files every day, so we'd like to find a way to use VirtualiZarr that doesn't involve re-creating the entire dataset every time.

@jhkennedy and @betolink (who are also on the ITS_LIVE team) may also be interested in this issue.

@TomNicholas
Member Author

@forrestfwilliams cool!

This pattern of appending could become quite neat if we use zarr chunk manifests instead of kerchunk's format. See this comment zarr-developers/zarr-specs#287 (comment)

@TomNicholas added the "references formats" label on May 16, 2024
@TomNicholas
Member Author

TomNicholas commented May 16, 2024

Instead of trying to append to the kerchunk parquet references on disk, what we could do is simply re-open the existing references as a virtual dataset, find byte ranges in the new files using open_virtual_dataset, concatenate the new and the old together, and write out the complete set of references again.

The advantage of this is that it works without trying to update part of the on-disk representation in-place (you simply re-write the entire thing in one go instead), and it doesn't require re-finding all the byte range information in all the files you already indexed. The disadvantage is that you are re-writing on-disk references that you already created. I think this could be a nice solution in the short term though.

To allow this we need a way to open existing kerchunk references as a virtual dataset, i.e.

vds = xr.open_virtual_dataset('kerchunk_refs.json', filetype='kerchunk_json')

which I've opened #118 to track.
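A rough sketch of that round-trip, assuming VirtualiZarr's open_virtual_dataset gains the ability to read Kerchunk references (which is what #118 tracks); the file names and concat kwargs here are illustrative:

```python
import xarray as xr
from virtualizarr import open_virtual_dataset

# Re-open the references already written to disk as a virtual dataset
existing = open_virtual_dataset("kerchunk_refs.json", filetype="kerchunk")

# Find byte ranges only in the newly arrived file
new = open_virtual_dataset("new_day.nc")

# Concatenate along the growing dimension and re-write the full reference set
combined = xr.concat([existing, new], dim="time", coords="minimal", compat="override")
combined.virtualize.to_kerchunk("kerchunk_refs.json", format="json")
```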

@TomNicholas
Member Author

TomNicholas commented Oct 23, 2024

To allow this we need a way to open existing kerchunk references as a virtual dataset

This was added in #251!


The basic pattern to follow might look like xarray's .to_zarr method when using the append_dim kwarg

This would be a pain with Kerchunk-formatted references, but with Icechunk it should be straightforward! We can simply add append_dim to virtualizarr's new .to_icechunk method. See earth-mover/icechunk#104 (comment) for more explanation.

This would be incredibly useful but it's not something I personally need right now, so if someone wants to have a crack at adding append_dim in the meantime then please go for it and I will advise 🙂
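To make the idea concrete, appending could look something like the sketch below once append_dim lands on .to_icechunk; the Icechunk repository/session/store calls are assumptions and the exact API has shifted between releases:

```python
import icechunk
from virtualizarr import open_virtual_dataset

# Open an existing Icechunk repository (path is illustrative)
storage = icechunk.local_filesystem_storage("/tmp/its_live_repo")
repo = icechunk.Repository.open(storage)
session = repo.writable_session("main")

# Index only the new file, then append its chunk references along "time"
vds = open_virtual_dataset("new_day.nc")
vds.virtualize.to_icechunk(session.store, append_dim="time")

session.commit("Append one more day of data")
```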

@rsignell
Collaborator

rsignell commented Oct 23, 2024

Whoa, this will be awesome! Maybe I can find someone to do this!

@TomNicholas
Member Author

Does anyone actually need a special option to append to kerchunk reference files?

Considering:
a) for the JSON case, that's functionally the same as opening it, concatenating virtual chunks, and re-saving a new refs.json file (you can do all of that with virtualizarr today), and
b) the Icechunk version of this feature was already completed by #272,

I think we should close this unless someone really wants to be able to specifically append to kerchunk parquet files in place.

@rsignell
Collaborator

rsignell commented Feb 5, 2025

Append-to-Icechunk is certainly enough for my use cases.

@betolink

betolink commented Feb 5, 2025

Hey @TomNicholas, we have a meeting with ITS_LIVE tomorrow @ 4PM ET and I'd love it if you could join us for 30 minutes to talk about this.

@douglatornell
Contributor

I've got a project brewing that will need appending, but I've been waiting and watching developments here to see what emerges as the recommended approach. Looks like that will be append-to-Icechunk, and that's okay with me!
