Appending to references on disk #21
I'd be curious to know if this is possible (or planned) as well!
@forrestfwilliams as it's possible in kerchunk, it should be possible here! The basic pattern to follow might look like xarray's append-to-Zarr workflow. Personally this is not the highest-priority feature for me, so if you were interested in it I would be happy to help you think it through / contribute here 😄
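For readers unfamiliar with it, here is a minimal sketch of the xarray append pattern referred to above, which is presumably the model to mirror; the file names and the `"time"` dimension are illustrative:

```python
import xarray as xr

# Initial write creates the Zarr store.
ds = xr.open_dataset("velocity_2024-01-01.nc")
ds.to_zarr("archive.zarr", mode="w")

# Each subsequent day is appended along the record dimension.
ds_new = xr.open_dataset("velocity_2024-01-02.nc")
ds_new.to_zarr("archive.zarr", mode="a", append_dim="time")
```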
Hey @TomNicholas, thanks. I work on a project called ITS_LIVE which monitors glacier velocities for the entire globe, and we're trying to find a way to efficiently represent our entire archive of NetCDF files (hosted in an open S3 bucket) as a Zarr. We're creating new NetCDF files every day, so we'd like to find a way to append the new references without regenerating the whole index. @jhkennedy and @betolink (who are also on the ITS_LIVE team) may also be interested in this issue.
@forrestfwilliams cool! This pattern of appending could become quite neat if we use zarr chunk manifests instead of kerchunk's format. See this comment: zarr-developers/zarr-specs#287 (comment)
Instead of trying to append to the kerchunk parquet references on-disk, what we could do is simply re-open the existing references as a virtual dataset, find byte ranges in the new files using `open_virtual_dataset`, concatenate the two, and write the combined references back out.

The advantage of this is that it works without trying to update part of the on-disk representation in-place (you simply re-write the entire thing in one go instead), and it doesn't require re-finding all the byte range information in all the files you already indexed. The disadvantage is that you are re-writing on-disk references that you already created. I think this could be a nice solution in the short term though.

To allow this we need a way to open existing kerchunk references as a virtual dataset, i.e. `vds = xr.open_virtual_dataset('kerchunk_refs.json', filetype='kerchunk_json')`, which I've opened #118 to track.
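A rough sketch of that workflow, assuming the kerchunk-reading capability tracked in #118; the file paths, `filetype` value, and concat kwargs are illustrative, not the final API:

```python
import xarray as xr
from virtualizarr import open_virtual_dataset

# 1. Re-open the references you already wrote as a virtual dataset
#    (this is the capability #118 asks for).
existing = open_virtual_dataset("kerchunk_refs.json", filetype="kerchunk")

# 2. Find byte ranges in the new file only.
new = open_virtual_dataset("s3://bucket/velocity_2024-01-02.nc")

# 3. Concatenate along the record dimension and re-write the whole
#    reference set in one go (no in-place mutation of the old refs).
combined = xr.concat(
    [existing, new], dim="time", coords="minimal", compat="override"
)
combined.virtualize.to_kerchunk("kerchunk_refs.json", format="json")
```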
This was added in #251!
This would be a pain with Kerchunk-formatted references, but with Icechunk it should be straightforward! We can simply add an `append_dim` argument to the Icechunk writer, mirroring xarray's `to_zarr`. This would be incredibly useful but it's not something I personally need right now, so if someone wants to have a crack at adding it, please go ahead!
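If someone does pick this up, usage could look something like the sketch below. The session calls follow icechunk-python's documented API; the `append_dim` argument to `to_icechunk` is the proposed addition, and the paths and commit message are illustrative:

```python
import icechunk
from virtualizarr import open_virtual_dataset

# Open the existing repo and start a writable session.
storage = icechunk.local_filesystem_storage("./its_live_repo")
repo = icechunk.Repository.open(storage)
session = repo.writable_session("main")

# Virtualize only the new file and append its references.
vds = open_virtual_dataset("velocity_2024-01-02.nc")
vds.virtualize.to_icechunk(session.store, append_dim="time")  # proposed kwarg
session.commit("Append 2024-01-02 velocities")
```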
Whoa, this will be awesome! Maybe I can find someone to do this!
Does anyone actually really need a special option to append to kerchunk reference files? Considering the alternatives above (re-opening and re-writing the references in one go, or appending via Icechunk), I think we should close this unless someone really wants to be able to specifically append to kerchunk parquet files in place.
The append-to-icechunk is certainly enough for my use cases. |
Hey @TomNicholas, we have a meeting with ITS_LIVE tomorrow at 4 PM ET and I'd love it if you could join us for 30 minutes to talk about this.
I've got a project brewing that will need appending, but I've been waiting and watching developments here to see what emerges as the recommended approach. Looks like that will be append-to-Icechunk, and that's okay with me!
Kerchunk has some support for appending to references stored on disk as parquet. This pattern makes sense when you have operational data to which new days of data are regularly appended, but you don't want to risk mutating the existing data.
Is there a sensible way to handle this kind of use case in the context of VirtualiZarr's API?
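For context, the kerchunk pattern being referred to is (roughly, from memory; argument names may not be exact) the `MultiZarrToZarr.append` classmethod, which folds references for new files into an existing on-disk parquet reference store:

```python
from kerchunk.combine import MultiZarrToZarr

# Append references for a new file into an existing parquet reference set.
mzz = MultiZarrToZarr.append(
    ["refs_2024-01-02.json"],       # references for the new file(s)
    original_refs="combined.parq",  # existing lazy parquet reference store
    concat_dims=["time"],
    remote_protocol="s3",
)
mzz.translate()
```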