-
Notifications
You must be signed in to change notification settings - Fork 373
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tar index as json file #1807
base: master
Are you sure you want to change the base?
tar index as json file #1807
Conversation
Still a draft, I need to tweak a few things, I'll mark it as ready soon. |
It might be sensible to also check against a "Truthy" value here and default to a filename that is similar to the name of the actual tar file. At the moment, I assumed that the user was giving a file-like string. |
Alright, this seems to work the way I would imagine it to. I'd be very happy to get some feedback if this is useful or not 🙂 |
Thanks for having a go at this! Please ping me next week for a thorough review. Before proceeding, I wonder if there are any existing TAR index formats and/or index file naming conventions that we should attempt to follow. I agree that JSON is a sensible alternative if not. You might also be interested in https://github.com/pauldmccarthy/indexed_gzip , which allows for starting to read at checkpoints throughout the GZIP stream rather than always reading from the beginning as normal. If the two ideas were combined, you could random-access to the beginning of any file in the archive. |
I wasn't able to find anything, and perhaps the most interesting answer was here, which "long story short" seems to indicate that indexed tar with standard GNU tools is not supported, and therefore, there is also no standard for naming such an index.
Perhaps also this? https://github.com/colon3ltocard/pyindexedtar I would have a try at adding some of the ideas there, and let you know later in the week. |
(from pyindexedtar) I don't think this is a reasonable requirement. We normally are operating in the regime where the original data can't be changed, so the index needs to go elsewhere, possibly in a totally different storage location from the original. This is the kind of thing that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the implementation is fine aside a couple of comments.
Would appreciate a test (that the index file gets created and looks correct) and text to describe what happens.
Extension to tar.gz or other types can follow later, if there is interest.
@@ -84,21 +86,39 @@ def __init__( | |||
self.tar = tarfile.TarFile(fileobj=self.fo) | |||
self.dir_cache = None | |||
|
|||
self.index_store = index_store | |||
if isinstance(index_store, (str, pathlib.Path)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's reasonable to only allow str here. That would allow for the index to be stored anywhere (in another storage backend), whereas Path is local specific. If you use fsspec.open(), then you could include anything that it can handle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that idea. Does fsspec
have some way to check if a file exists, akin to pathlib.Path(...).exists()
? I could do try/except and catch for a FileNotFoundError
, unless there is already a way to check that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be two-step, create a filesystem instance (fsspec.url_to_fs), and use .exists on that - so probably fine to use try/except here. We have discussed elsewhere whether we should have top-level operations like exists(url) that dispatch to the implementations like fsspec.open, but nothing has been done yet (cf fsspec.generic)
# NOTE(PG): Not sure if JSON is the best way to go here, but it's | ||
# simple and human-readable. | ||
logger.debug(f"Reloading from {self.index_store}") | ||
with self.index_store.open("r") as f: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fsspec.open(self.index_store)
|
||
self.index = out | ||
if self.index_store is not None: | ||
with self.index_store.open("w") as f: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fsspec.open(self.index_store, "wt")
This adds an index file caching for tar filesystems, useful when you have listed the tar already and don't want to re-examine it.
At the moment I use JSON for this, but I'm not sure if that is the best idea or if something else would be better.