rework github _open() implementation to support LFS #1810
Conversation
To clarify: this does not use LFS directly, so there's no way to enable git-lfs for normal (non-GitHub) git repos, correct?
Yes, this PR does not use LFS directly, and therefore it only works for getting Git LFS file content from GitHub; it cannot be used with Git LFS repos that are not hosted on GitHub. I see it as a narrow "upgrade" to the github implementation (which was already limited to repos hosted on GitHub, but would not work as expected for LFS files). On the other hand, this PR does not require managing a local clone of the Git LFS repo and does not require the user to have git or git-lfs installed.

I think it's worth asking whether the best path forward is instead to add LFS support to the git implementation, or to add a new git_lfs implementation that would similarly interact with a clone on local disk. If I understand correctly, the git implementation is for dealing with local git repos that have already been cloned to local disk, while the github implementation feels a lot more like http, to the extent that it treats github as "just another source of remote files" (albeit one with versioning powers).
def test_github_open_large_file():
    # test opening a large file >1 MB
    with fsspec.open("github://mwaskom:seaborn-data@83bfba7/brain_networks.csv") as f:
        assert f.readline().startswith(b"network,1,1,2,2")
Question: this will download the whole file in one shot, even though it's "large". Do you think we can generate a buffered file (like an HTTPFile), given that we expect the remote to accept HTTP range requests?
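For context, buffered files like HTTPFile avoid one-shot downloads by issuing HTTP Range requests for individual blocks. A minimal, self-contained sketch of that mechanism is below; the toy server and data are stand-ins, not GitHub's actual endpoints:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

DATA = bytes(range(256)) * 64  # 16 KiB standing in for a "large" remote file

class RangeHandler(BaseHTTPRequestHandler):
    """Toy server honoring single 'bytes=start-end' Range requests."""

    def do_GET(self):
        rng = self.headers.get("Range")
        if rng and rng.startswith("bytes="):
            start, end = (int(x) for x in rng[len("bytes="):].split("-"))
            chunk = DATA[start:end + 1]
            self.send_response(206)  # Partial Content
            self.send_header("Content-Range", f"bytes {start}-{end}/{len(DATA)}")
        else:
            chunk = DATA
            self.send_response(200)
        self.send_header("Content-Length", str(len(chunk)))
        self.end_headers()
        self.wfile.write(chunk)

    def log_message(self, *args):
        pass  # keep request logging quiet

server = HTTPServer(("127.0.0.1", 0), RangeHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

# A buffered reader fetches only the block it needs, not the whole body:
req = Request(f"http://127.0.0.1:{port}/file", headers={"Range": "bytes=0-15"})
with urlopen(req) as resp:
    status, first_block = resp.status, resp.read()
server.shutdown()
```

The 206 status and `Content-Range` header are what a buffered file implementation relies on to read one block at a time.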
This sounds like a great idea - thank you for mentioning this. I will rework this and then rewrite this test to use a range request to avoid fetching the whole file over the network.
Do you have thoughts on whether it would also be important to introduce other concepts from the http implementation, such as support for async, reuse of an aiohttp ClientSession across the multiple calls to the GitHub API, etc.? Or would it be fine to just use HTTPFile and ensure that range requests are supported, but otherwise not reach for full feature parity with the http implementation in this first PR?
I guess my question may not make sense if the relevant async support is already "enabled" just by virtue of the asynchronous kwarg on the HTTPFile constructor. I should probably take a closer look before asking this question 😅.
I did not think about the async-ness. We probably don't want this for github, since there is a fairly onerous API rate limit, even if you come with a developer token.
In a751042 I tweaked _open() to return an HTTPFile wrapping the download_url in cases where the initial response from the GitHub API does not include the file content already. I believe this should enable streaming and range queries.
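The decision being made here can be sketched roughly as follows. This is a hypothetical helper, not the actual fsspec code; the function name and arguments are illustrative:

```python
import base64
import io

def open_from_contents_response(content_json, http_fs=None, **kwargs):
    # Hypothetical sketch: the GitHub contents API inlines base64-encoded
    # bytes under "content" for small (<1MB) files; for larger files
    # "content" is an empty string and the bytes must be fetched from
    # "download_url" instead.
    if content_json.get("content"):
        return io.BytesIO(base64.b64decode(content_json["content"]))
    # Large file: hand back a buffered HTTP file so reads become
    # range requests against the download_url.
    return http_fs.open(content_json["download_url"], mode="rb", **kwargs)

small = {"content": base64.b64encode(b"col1,col2\n1,2\n").decode()}
assert open_from_contents_response(small).read() == b"col1,col2\n1,2\n"
```

Only the small-file branch is exercised above; the large-file branch requires a live HTTP filesystem.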
You are quite right, LFS for local git would be different, then. I think your "upgrade" is a fine scope to keep this PR to.
fsspec/implementations/github.py (outdated)
return HTTPFile(
    self.http_fs,
    content_json["download_url"],
    session=self.session,
    block_size=block_size,
    autocommit=autocommit,
    cache_options=cache_options,
    size=content_json["size"],
    loop=self.loop,
    **kwargs,
)
I think there might be an alternative here to instead just do something like
return self.http_fs.open(
    content_json["download_url"],
    mode=mode,
    block_size=block_size,
    cache_options=cache_options,
    autocommit=autocommit,
    **kwargs,
)
which avoids some of the complexity of creating the HTTPFile directly.
I've gone ahead and made this change in 75a6f4e but would be happy to revert if it's going in the wrong direction.
pyproject.toml (outdated)
@@ -56,7 +56,7 @@ full = [
 fuse = ["fusepy"]
 gcs = ["gcsfs"]
 git = ["pygit2"]
-github = ["requests"]
+github = ["fsspec[http]", "requests"]
I've added the http extra here since now the github backend depends on HTTPFileSystem which requires aiohttp.
Can a package depend on itself? Maybe list aiohttp here instead.
But actually it's optional - for normal small files, you don't need it. Do you think there's an easy way to defer the import and use of HTTPFileSystem (+ appropriate error message if it fails)?
I will defer to you on this. In my experience, I have used this syntax to specify that one extra depends on another extra. But in this case, the http extra has only one dependency (aiohttp), and new deps are probably added to that extra very rarely, so the degree of "extra repetition" and "risk of forgetting to add any new http dep to the github extra too" is low.

But you also bring up a good point that we could just make it optional by deferring the import. I took a stab at this in 8e7fb56 - let me know what you think about this approach!
One quirk with 8e7fb56 is that it does not create an HTTPFileSystem at the instance level to be re-used for all file downloads from this GithubFileSystem. Instead it creates a new HTTPFileSystem for every large file to be downloaded.
It's possible to rework this so that self.http_fs is set to either None or HTTPFileSystem(**kwargs) depending on whether or not the import succeeds, and then we only raise the exception when a large-file download is attempted. Let me know if this would be preferred.
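That instance-level variant could be sketched like this; the class and method names are schematic stand-ins for the real GithubFileSystem, not its actual code:

```python
class GithubFSSketch:
    """Schematic: resolve the optional HTTPFileSystem dependency once at
    construction time; raise only when a large download is attempted."""

    def __init__(self, **http_kwargs):
        try:
            from fsspec.implementations.http import HTTPFileSystem
            self.http_fs = HTTPFileSystem(**http_kwargs)
        except ImportError:
            self.http_fs = None  # aiohttp not installed; small files still work

    def open_large(self, download_url, **kwargs):
        if self.http_fs is None:
            raise ImportError(
                "aiohttp (the fsspec 'http' extra) is required to open "
                "large or Git LFS files from GitHub"
            )
        return self.http_fs.open(download_url, mode="rb", **kwargs)
```

This keeps one HTTPFileSystem per GithubFileSystem instance instead of constructing a fresh one per large-file download.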
I went ahead and tried this out in 65e09cf - let me know what you think!
That will do nicely, thank you

Thank you @martindurant for the helpful comments, patience with my meandering, and quick merge!
#1438 raised the question of whether it should be possible to read Git LFS files from GitHub via the github implementation.
This PR is an unsolicited proposal of one possible path forward to supporting Git LFS files hosted on GitHub via the github implementation. It reworks the way the github implementation accesses content from github repos, enabling Git LFS file support in the process.
Before this PR, the content of github files was always fetched using the rurl (the https://raw.githubusercontent.com/ URL). This URL works well for fetching normal files tracked by Git, but does not work for Git LFS files.

In this PR, the _open() function in the github implementation is reworked to instead use the GitHub API to fetch the content. This API endpoint is documented in the GitHub API docs here.

The way this API works is unfortunately more complicated than the rurl. For files <1MB in size, the file's contents will be base64 encoded and included in the JSON response under the content key. For files >1MB in size, the content key will have an empty string value, and the only way to get the content is to make an additional GET request to a specific URL provided under the download_url key of the JSON response.

The downsides of the new reworked implementation of _open() are:

- added complexity, since sometimes the file content is included in the initial response under content and other times we need to use download_url
- an extra network round trip for large files (the additional GET request to the download_url)

The upside is that the download_url points to the actual file contents for files tracked by Git LFS. This means that the same github implementation will work for normal files (both smaller and larger than 1MB) and also for Git LFS files. I would be happy to take feedback on whether or not these downsides are worth the improved support for LFS, or if this should be moved to a new, separate "github_lfs" implementation to avoid complicating the current github implementation.

In addition to reworking the _open() function, I also removed the implementation of cat(), which was also referencing rurl. I am not super familiar with why cat() was overridden in the github implementation. I believe that if we still want to use the http implementation of cat() here, then that should still be possible. Let me know if this would be required to move forward with this PR.

Finally, I've added a few basic tests to cover the added/modified functionality. These tests interact with the real GitHub API, so it may not be desirable to leave them in the test suite as-is - let me know what would be preferred here and I would be happy to adjust.
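For reference, the API endpoint described above has roughly this shape. The URL form follows GitHub's "Get repository content" REST endpoint; the helper name is illustrative:

```python
def github_contents_url(org, repo, path, sha):
    # GitHub "Get repository content" endpoint; ?ref= pins the revision
    # (branch, tag, or commit sha). Responses inline base64 content for
    # files <1MB and point to download_url for larger or LFS files.
    return f"https://api.github.com/repos/{org}/{repo}/contents/{path}?ref={sha}"

url = github_contents_url("mwaskom", "seaborn-data", "brain_networks.csv", "83bfba7")
```

The example arguments reuse the repo from the test above; they are illustrative, not special-cased anywhere.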