
rework github _open() implementation to support LFS #1810


Merged: 9 commits into fsspec:master, Mar 17, 2025

Conversation

@thomasgilgenast (Contributor) commented Mar 14, 2025

#1438 raised the question of whether it should be possible to read Git LFS files from GitHub via the github implementation.

This PR is an unsolicited proposal of one possible path forward to supporting Git LFS files hosted on GitHub via the github implementation. It reworks the way the github implementation accesses content from github repos, enabling Git LFS file support in the process.

Before this PR, the content of github files was always fetched using the rurl (the https://raw.githubusercontent.com/ URL). This URL works well for fetching normal files tracked by Git, but does not work for Git LFS files.

In this PR, the _open() function in the github implementation is reworked to instead use the GitHub API to fetch the content. This endpoint is documented in the GitHub REST API docs (the repository contents endpoint).

The way this API works is unfortunately more complicated than the rurl. For files <1MB in size, the file's contents will be base64 encoded and included in the JSON response under the content key. For files >1MB in size, the content key will have an empty string value, and the only way to get its content is to make an additional GET request to a specific URL provided under the download_url key of the JSON response.
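To make the two code paths concrete, here is a minimal requests-based sketch of the logic described above (illustrative only, not the PR's actual code; it omits authentication and error handling, and fetch_github_file is a hypothetical helper name):

import base64
import io

import requests

def fetch_github_file(org, repo, path, sha="main"):
    # Hypothetical helper mirroring the two-path logic described above
    url = f"https://api.github.com/repos/{org}/{repo}/contents/{path}"
    resp = requests.get(url, params={"ref": sha})
    resp.raise_for_status()
    meta = resp.json()
    if meta["content"]:
        # Files <1MB: contents arrive base64-encoded in the JSON response
        return io.BytesIO(base64.b64decode(meta["content"]))
    # Files >1MB (and Git LFS files): a second GET against download_url
    # returns the actual bytes
    data = requests.get(meta["download_url"])
    data.raise_for_status()
    return io.BytesIO(data.content)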

The downsides of the reworked _open() implementation are:

  1. More logic is required - sometimes the content can be found in content and other times we need to use download_url
  2. Two requests are needed for files >1MB in size (one to the content API and another to hit download_url)

The upside is that the download_url points to the actual file contents for files tracked by Git LFS. This means that the same github implementation will work for normal files (both smaller and larger than 1MB) and also for Git LFS files.

I would be happy to take feedback on whether or not these downsides are worth the improved support for LFS, or if this should be moved to a new, separate "github_lfs" implementation to avoid complicating the current github implementation.

In addition to reworking the _open() function, I also removed the implementation of cat(), which was also referencing rurl. I am not super familiar with why cat() was overridden in the github implementation. I believe that if we still want to use the http implementation of cat() here, then that should still be possible. Let me know if this would be required to move forward with this PR.

Finally, I've added a few basic tests to cover the added/modified functionality. These tests interact with the real GitHub API, so it may not be desirable to leave them in the test suite as-is - let me know what would be preferred here and I would be happy to adjust.

@martindurant (Member):

To clarify: this does not use LFS directly, so there's no way to enable git-lfs for normal (non github) git repos, correct?

@thomasgilgenast (Contributor Author) commented Mar 14, 2025

Yes, this PR does not use LFS directly, and therefore it only works for getting Git LFS file content from GitHub and cannot be used with Git LFS repos that are not hosted on GitHub. I see it as a narrow "upgrade" to the github implementation (which was already limited to repos hosted on GitHub, but did not work as expected for LFS files).

On the other hand, this PR does not require managing a local clone of the Git LFS repo and does not require the user to have git or git-lfs installed.

I think it's worth asking if the best path forward is to instead add LFS support to the git implementation, or to add a new git_lfs implementation that would similarly interact with a clone on local disk.

If I understand correctly, the git implementation is for dealing with local git repos that have already been cloned to disk, while the github implementation feels a lot more like http, to the extent that it treats GitHub as "just another source of remote files" (albeit one with versioning powers via sha). For my specific use case, I would strongly prefer to avoid cloning the entire Git LFS repo to local disk (e.g., if I just need one or two files from the repo, and the repo contains multiple large non-LFS files that would make the initial clone slow). I also strongly prefer not to have to manually manage the local clone of the repo. But I could understand if my personal use case does not match the intent behind the original issue.

def test_github_open_large_file():
    # test opening a large file >1 MB
    with fsspec.open("github://mwaskom:seaborn-data@83bfba7/brain_networks.csv") as f:
        assert f.readline().startswith(b"network,1,1,2,2")
@martindurant (Member):

Question: this will download the whole file in one shot, even though it's "large". Do you think we can generate a buffered file (like an HTTPFile), given that we expect the remote to accept HTTP range requests?

@thomasgilgenast (Contributor Author):

This sounds like a great idea - thank you for mentioning this. I will rework this and then rewrite this test to use a range request to avoid fetching the whole file over the network.

@thomasgilgenast (Contributor Author):

Do you have thoughts on whether it would also be important to introduce other concepts from the http implementation, such as support for async, reuse of an aiohttp ClientSession for the multiple calls to the GitHub API, etc.? Or would it be fine to just use HTTPFile and ensure that range requests are supported, but otherwise not reach for full feature parity with the http implementation in this first PR?

@thomasgilgenast (Contributor Author):

On reflection, my question may not make sense if the relevant async support is already "enabled" just by virtue of the asynchronous kwarg on the HTTPFile constructor. I should probably take a closer look before asking this question 😅 .

@martindurant (Member):

I did not think about the async-ness. We probably don't want this for github, since there is a fairly onerous API rate limit, even if you come with a developer token.
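For reference, a sketch of supplying a developer token to raise the rate limit; the org/repo/username/token parameters follow fsspec's GithubFileSystem signature, but verify against your fsspec version ("ghp_..." is a placeholder, not a real credential):

import fsspec

# Authenticated access raises the GitHub API rate limit
fs = fsspec.filesystem(
    "github",
    org="mwaskom",
    repo="seaborn-data",
    username="your-github-username",  # placeholder
    token="ghp_...",                  # placeholder personal access token
)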

@thomasgilgenast (Contributor Author):

In a751042 I tweaked _open() to return an HTTPFile wrapping the download_url in cases where the initial response from the GitHub API does not include the file content already. I believe this should enable streaming and range queries.
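As an illustrative usage sketch (the file path is taken from the test above), this is the behavior the change should enable:

import fsspec

# With _open() returning an HTTPFile for large files, a seek plus a small
# read should issue a ranged GET rather than downloading the whole file
with fsspec.open("github://mwaskom:seaborn-data@83bfba7/brain_networks.csv") as f:
    f.seek(500_000)
    chunk = f.read(1024)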

@martindurant (Member):

You are quite right, LFS for local git would be different, then. I think your "upgrade" is a fine scope to keep this PR to.

Comment on lines 257 to 267
return HTTPFile(
    self.http_fs,
    content_json["download_url"],
    session=self.session,
    block_size=block_size,
    autocommit=autocommit,
    cache_options=cache_options,
    size=content_json["size"],
    loop=self.loop,
    **kwargs,
)
@thomasgilgenast (Contributor Author):

I think there might be an alternative here to instead just do something like

Suggested change (replacing the HTTPFile construction above with):

return self.http_fs.open(
    content_json["download_url"],
    mode=mode,
    block_size=block_size,
    cache_options=cache_options,
    autocommit=autocommit,
    **kwargs,
)

which avoids some of the complexity of creating the HTTPFile directly.

@thomasgilgenast (Contributor Author):

I've gone ahead and made this change in 75a6f4e but would be happy to revert if it's going in the wrong direction.

pyproject.toml (outdated)
@@ -56,7 +56,7 @@ full = [
fuse = ["fusepy"]
gcs = ["gcsfs"]
git = ["pygit2"]
-github = ["requests"]
+github = ["fsspec[http]", "requests"]
@thomasgilgenast (Contributor Author):

I've added the http extra here since now the github backend depends on HTTPFileSystem which requires aiohttp.

@martindurant (Member):

Can a package depend on itself? Maybe list aiohttp here instead.

But actually it's optional - for normal small files, you don't need it. Do you think there's an easy way to defer the import and use of HTTPFileSystem (+ appropriate error message if it fails)?

@thomasgilgenast (Contributor Author):

I will defer to you on this. In my experience, I have used this syntax to specify that one extra depends on another extra. But in this case, the http extra has only one dependency (aiohttp), and new deps are probably added to that extra very rarely, so the degree of "extra repetition" and "risk of forgetting to add any new http dep to the github extra too" is low.

But you also bring up a good point that we could just make it optional by deferring the import.

I took a stab at this in 8e7fb56 - let me know what you think about this approach!

@thomasgilgenast (Contributor Author):

One quirk with 8e7fb56 is that it does not create an HTTPFileSystem at the instance level to be re-used for all file downloads from this GithubFileSystem. Instead it creates a new HTTPFileSystem for every large file to be downloaded.

It's possible to rework this so that self.http_fs is set to either None or HTTPFileSystem(**kwargs) depending on whether or not the import succeeds, and then we only raise the exception when a large-file download is attempted. Let me know if this would be preferred.
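For concreteness, a minimal sketch of that instance-level variant (illustrative, not the actual code in 65e09cf; _open_large is a hypothetical helper name):

from fsspec import AbstractFileSystem

class GithubFileSystem(AbstractFileSystem):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        try:
            from fsspec.implementations.http import HTTPFileSystem
            self.http_fs = HTTPFileSystem(**kwargs)
        except ImportError:
            # aiohttp is unavailable; small files still work via the API
            self.http_fs = None

    def _open_large(self, download_url, **kwargs):
        # Only raise when a large-file (or LFS) download is actually attempted
        if self.http_fs is None:
            raise ImportError(
                "aiohttp is required to read files >1MB (or LFS files) from GitHub"
            )
        return self.http_fs.open(download_url, **kwargs)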

@thomasgilgenast (Contributor Author):

I went ahead and tried this out in 65e09cf - let me know what you think!

@martindurant (Member):

That will do nicely, thank you

@martindurant merged commit 961412d into fsspec:master on Mar 17, 2025
10 checks passed
@thomasgilgenast (Contributor Author):

Thank you @martindurant for the helpful comments, patience with my meandering, and quick merge!
