Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tar index as json file #1807

Open
wants to merge 8 commits into
base: master
Choose a base branch
from
Open

Conversation

pgierz
Copy link

@pgierz pgierz commented Mar 14, 2025

This adds an index file caching for tar filesystems, useful when you have listed the tar already and don't want to re-examine it.

At the moment I use JSON for this, but I'm not sure if that is the best idea or if something else would be better.

@pgierz pgierz marked this pull request as draft March 14, 2025 09:38
@pgierz
Copy link
Author

pgierz commented Mar 14, 2025

Still a draft, I need to tweak a few things, I'll mark it as ready soon.

@pgierz pgierz marked this pull request as ready for review March 14, 2025 09:42
@pgierz
Copy link
Author

pgierz commented Mar 14, 2025

It might be sensible to also check against a "Truthy" value here and default to a filename that is similar to the name of the actual tar file. At the moment, I assumed that the user was giving a file-like string.

@pgierz pgierz marked this pull request as draft March 14, 2025 09:46
@pgierz pgierz marked this pull request as ready for review March 14, 2025 10:12
@pgierz
Copy link
Author

pgierz commented Mar 14, 2025

Alright, this seems to work the way I would imagine it to. I'd be very happy to get some feedback if this is useful or not 🙂

@martindurant
Copy link
Member

Thanks for having a go at this! Please ping me next week for a thorough review.

Before proceeding, I wonder if there are any existing TAR index formats and/or index file naming conventions that we should attempt to follow. I agree that JSON is a sensible alternative if not.

You might also be interested in https://github.com/pauldmccarthy/indexed_gzip , which allows for starting to read at checkpoints throughout the GZIP stream rather than always reading from the beginning as normal. If the two ideas were combined, you could random-access to the beginning of any file in the archive.

@pgierz
Copy link
Author

pgierz commented Mar 17, 2025

Before proceeding, I wonder if there are any existing TAR index formats and/or index file naming conventions that we should attempt to follow. I agree that JSON is a sensible alternative if not.

I wasn't able to find anything, and perhaps the most interesting answer was here, which "long story short" seems to indicate that indexed tar with standard GNU tools is not supported, and therefore, there is also no standard for naming such an index.

You might also be interested in https://github.com/pauldmccarthy/indexed_gzip , which allows for starting to read at checkpoints throughout the GZIP stream rather than always reading from the beginning as normal. If the two ideas were combined, you could random-access to the beginning of any file in the archive.

Perhaps also this? https://github.com/colon3ltocard/pyindexedtar

I would have a try at adding some of the ideas there, and let you know later in the week.

@martindurant
Copy link
Member

No additional index file, the archive should contain the index and be 'all inclusive'

(from pyindexedtar)

I don't think this is a reasonable requirement. We normally are operating in the regime where the original data can't be changed, so the index needs to go elsewhere, possibly in a totally different storage location from the original. This is the kind of thing that kerchunk does, and indeed the idea of making a filesystem out of buffers within archived files has been a longstanding aim there - similar but different from the case here of the offsets to files.

Copy link
Member

@martindurant martindurant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the implementation is fine aside a couple of comments.

Would appreciate a test (that the index file gets created and looks correct) and text to describe what happens.

Extension to tar.gz or other types can follow later, if there is interest.

@@ -84,21 +86,39 @@ def __init__(
self.tar = tarfile.TarFile(fileobj=self.fo)
self.dir_cache = None

self.index_store = index_store
if isinstance(index_store, (str, pathlib.Path)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's reasonable to only allow str here. That would allow for the index to be stored anywhere (in another storage backend), whereas Path is local specific. If you use fsspec.open(), then you could include anything that it can handle.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that idea. Does fsspec have some way to check if a file exists, akin to pathlib.Path(...).exists()? I could do try/except and catch for a FileNotFoundError, unless there is already a way to check that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be two-step, create a filesystem instance (fsspec.url_to_fs), and use .exists on that - so probably fine to use try/except here. We have discussed elsewhere whether we should have top-level operations like exists(url) that dispatch to the implementations like fsspec.open, but nothing has been done yet (cf fsspec.generic)

# NOTE(PG): Not sure if JSON is the best way to go here, but it's
# simple and human-readable.
logger.debug(f"Reloading from {self.index_store}")
with self.index_store.open("r") as f:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fsspec.open(self.index_store)


self.index = out
if self.index_store is not None:
with self.index_store.open("w") as f:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fsspec.open(self.index_store, "wt")

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants