
rework github _open() implementation to support LFS #1810


Merged: 9 commits into fsspec:master, Mar 17, 2025

Conversation

@thomasgilgenast (Contributor) commented Mar 14, 2025

#1438 raised the question of whether it should be possible to read Git LFS files from GitHub via the github implementation.

This PR is an unsolicited proposal of one possible path forward to supporting Git LFS files hosted on GitHub via the github implementation. It reworks the way the github implementation accesses content from github repos, enabling Git LFS file support in the process.

Before this PR, the content of github files was always fetched using the rurl (the https://raw.githubusercontent.com/ URL). This URL works well for fetching normal files tracked by Git, but does not work for Git LFS files.

In this PR, the _open() function in the github implementation is reworked to instead use the GitHub API to fetch the content. This endpoint is documented in the GitHub REST API docs (the repository contents endpoint).

The way this API works is unfortunately more complicated than the rurl. For files <1MB in size, the file's contents will be base64 encoded and included in the JSON response under the content key. For files >1MB in size, the content key will have an empty string value, and the only way to get its content is to make an additional GET request to a specific URL provided under the download_url key of the JSON response.
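To make the two code paths concrete, here is a minimal requests-based sketch of the logic described above (illustrative only, not the PR's actual code; it omits authentication and error handling, and fetch_github_file is a hypothetical helper name):

import base64
import io

import requests

def fetch_github_file(org, repo, path, sha="main"):
    # Hypothetical helper mirroring the two-path logic described above
    url = f"https://api.github.com/repos/{org}/{repo}/contents/{path}"
    resp = requests.get(url, params={"ref": sha})
    resp.raise_for_status()
    meta = resp.json()
    if meta["content"]:
        # Files <1MB: contents arrive base64-encoded in the JSON response
        return io.BytesIO(base64.b64decode(meta["content"]))
    # Files >1MB (and Git LFS files): a second GET against download_url
    # returns the actual bytes
    data = requests.get(meta["download_url"])
    data.raise_for_status()
    return io.BytesIO(data.content)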

The downsides of the reworked _open() implementation are:

  1. More logic is required - sometimes the content can be found in content and other times we need to use download_url
  2. Two requests are needed for files >1MB in size (one to the content API and another to hit download_url)

The upside is that the download_url points to the actual file contents for files tracked by Git LFS. This means that the same github implementation will work for normal files (both smaller and larger than 1MB) and also for Git LFS files.

I would be happy to take feedback on whether or not these downsides are worth the improved support for LFS, or if this should be moved to a new, separate "github_lfs" implementation to avoid complicating the current github implementation.

In addition to reworking the _open() function, I also removed the implementation of cat(), which was also referencing rurl. I am not super familiar with why cat() was overridden in the github implementation. I believe that if we still want to use the http implementation of cat() here, then that should still be possible. Let me know if this would be required to move forward with this PR.

Finally, I've added a few basic tests to cover the added/modified functionality. These tests interact with the real GitHub API, so it may not be desirable to leave them in the test suite as-is - let me know what would be preferred here and I would be happy to adjust.

@martindurant (Member):

To clarify: this does not use LFS directly, so there's no way to enable git-lfs for normal (non github) git repos, correct?

@thomasgilgenast (Contributor Author) commented Mar 14, 2025

Yes, this PR does not use LFS directly, and therefore it only works for getting Git LFS file content from GitHub and cannot be used with Git LFS repos that are not hosted on GitHub. I see it as a narrow "upgrade" to the github implementation (which was already limited to repos hosted on GitHub, but did not work as expected for LFS files).

On the other hand, this PR does not require managing a local clone of the Git LFS repo and does not require the user to have git or git-lfs installed.

I think it's worth asking if the best path forward is to instead add LFS support to the git implementation, or to add a new git_lfs implementation that would similarly interact with a clone on local disk.

If I understand correctly, the git implementation is for dealing with local git repos that have already been cloned to disk, while the github implementation feels a lot more like http, to the extent that it treats GitHub as "just another source of remote files" (albeit one with versioning powers via sha). For my specific use case, I would strongly prefer to avoid cloning the entire Git LFS repo to local disk (e.g., if I just need one or two files from the repo, and the repo contains multiple large non-LFS files that would make the initial clone slow). I also strongly prefer not to have to manually manage the local clone of the repo. But I could understand if my personal use case does not match the intent behind the original issue.

def test_github_open_large_file():
    # test opening a large file >1 MB
    with fsspec.open("github://mwaskom:seaborn-data@83bfba7/brain_networks.csv") as f:
        assert f.readline().startswith(b"network,1,1,2,2")
@martindurant (Member):

Question: this will download the whole file in one shot, even though it's "large". Do you think we can generate a buffered file (like an HTTPFile), given that we expect the remote to accept HTTP range requests?

@thomasgilgenast (Contributor Author):

This sounds like a great idea - thank you for mentioning this. I will rework this and then rewrite this test to use a range request to avoid fetching the whole file over the network.

@thomasgilgenast (Contributor Author):

Do you have thoughts on whether it would also be important to introduce other concepts from the http implementation, such as support for async, reuse of an aiohttp ClientSession for the multiple calls to the GitHub API, etc.? Or would it be fine to just use HTTPFile and ensure that range requests are supported, but otherwise not reach for full feature parity with the http implementation in this first PR?

@thomasgilgenast (Contributor Author):

On reflection, my question may not make sense if the relevant async support is already "enabled" just by virtue of the asynchronous kwarg on the HTTPFile constructor. I should probably take a closer look before asking this question 😅 .

@martindurant (Member):

I did not think about the async-ness. We probably don't want this for github, since there is a fairly onerous API rate limit, even if you come with a developer token.
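For reference, a sketch of supplying a developer token to raise the rate limit; the org/repo/username/token parameters follow fsspec's GithubFileSystem signature, but verify against your fsspec version ("ghp_..." is a placeholder, not a real credential):

import fsspec

# Authenticated access raises the GitHub API rate limit
fs = fsspec.filesystem(
    "github",
    org="mwaskom",
    repo="seaborn-data",
    username="your-github-username",  # placeholder
    token="ghp_...",                  # placeholder personal access token
)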

@thomasgilgenast (Contributor Author):

In a751042 I tweaked _open() to return an HTTPFile wrapping the download_url in cases where the initial response from the GitHub API does not include the file content already. I believe this should enable streaming and range queries.
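As an illustrative usage sketch (the file path is taken from the test above), this is the behavior the change should enable:

import fsspec

# With _open() returning an HTTPFile for large files, a seek plus a small
# read should issue a ranged GET rather than downloading the whole file
with fsspec.open("github://mwaskom:seaborn-data@83bfba7/brain_networks.csv") as f:
    f.seek(500_000)
    chunk = f.read(1024)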

@martindurant (Member):

You are quite right, LFS for local git would be different, then. I think your "upgrade" is a fine scope to keep this PR to.

Comment on lines 257 to 267
return HTTPFile(
    self.http_fs,
    content_json["download_url"],
    session=self.session,
    block_size=block_size,
    autocommit=autocommit,
    cache_options=cache_options,
    size=content_json["size"],
    loop=self.loop,
    **kwargs,
)
@thomasgilgenast (Contributor Author):

I think there might be an alternative here to instead just do something like

Suggested change (replacing the HTTPFile construction above with):

return self.http_fs.open(
    content_json["download_url"],
    mode=mode,
    block_size=block_size,
    cache_options=cache_options,
    autocommit=autocommit,
    **kwargs,
)

which avoids some of the complexity of creating the HTTPFile directly.

@thomasgilgenast (Contributor Author):

I've gone ahead and made this change in 75a6f4e but would be happy to revert if it's going in the wrong direction.

pyproject.toml (outdated)
@@ -56,7 +56,7 @@ full = [
fuse = ["fusepy"]
gcs = ["gcsfs"]
git = ["pygit2"]
-github = ["requests"]
+github = ["fsspec[http]", "requests"]
@thomasgilgenast (Contributor Author):

I've added the http extra here since now the github backend depends on HTTPFileSystem which requires aiohttp.

@martindurant (Member):

Can a package depend on itself? Maybe list aiohttp here instead.

But actually it's optional - for normal small files, you don't need it. Do you think there's an easy way to defer the import and use of HTTPFileSystem (+ appropriate error message if it fails)?

@thomasgilgenast (Contributor Author):

I will defer to you on this. In my experience, I have used this syntax to specify that one extra depends on another extra. But in this case, the http extra has only one dependency (aiohttp), and new deps are probably added to that extra very rarely, so the degree of "extra repetition" and "risk of forgetting to add any new http dep to the github extra too" is low.

But you also bring up a good point that we could just make it optional by deferring the import.

I took a stab at this in 8e7fb56 - let me know what you think about this approach!

@thomasgilgenast (Contributor Author):

One quirk with 8e7fb56 is that it does not create an HTTPFileSystem at the instance level to be re-used for all file downloads from this GithubFileSystem. Instead it creates a new HTTPFileSystem for every large file to be downloaded.

It's possible to rework this so that self.http_fs is set to either None or HTTPFileSystem(**kwargs) depending on whether or not the import succeeds, and then we only raise the exception when a large-file download is attempted. Let me know if this would be preferred.
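For concreteness, a minimal sketch of that instance-level variant (illustrative, not the actual code in 65e09cf; _open_large is a hypothetical helper name):

from fsspec import AbstractFileSystem

class GithubFileSystem(AbstractFileSystem):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        try:
            from fsspec.implementations.http import HTTPFileSystem
            self.http_fs = HTTPFileSystem(**kwargs)
        except ImportError:
            # aiohttp is unavailable; small files still work via the API
            self.http_fs = None

    def _open_large(self, download_url, **kwargs):
        # Only raise when a large-file (or LFS) download is actually attempted
        if self.http_fs is None:
            raise ImportError(
                "aiohttp is required to read files >1MB (or LFS files) from GitHub"
            )
        return self.http_fs.open(download_url, **kwargs)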

@thomasgilgenast (Contributor Author):

I went ahead and tried this out in 65e09cf - let me know what you think!

@martindurant (Member):

That will do nicely, thank you

@martindurant merged commit 961412d into fsspec:master on Mar 17, 2025
10 checks passed
@thomasgilgenast (Contributor Author):

Thank you @martindurant for the helpful comments, patience with my meandering, and quick merge!
