Allow reading from remote resources over http #19

Merged
merged 7 commits into from
Mar 13, 2024

Conversation

davemfish
Collaborator

@davemfish davemfish commented Mar 7, 2024

This PR uses fsspec to open files. fsspec.open will detect the correct protocol to use based on the filepath string.

fsspec supports a large number of protocols:
https://filesystem-spec.readthedocs.io/en/latest/api.html#built-in-implementations
https://filesystem-spec.readthedocs.io/en/latest/api.html#other-known-implementations

Availability depends on which extra dependencies are installed (see extras_require).

So far, this PR adds requests and aiohttp as requirements in order to support the http and https protocols. Are there others we know we want to support?

If a user is creating metadata for a remote dataset, GDAL drivers handle reading the dataset itself, while geometamaker, via fsspec, checks for and reads any existing remote MCF for that dataset. If a user wishes to write metadata docs for a remote resource, they will have to use the new workspace arg in MetadataControl.write to give a local directory where the files can be written.
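
A minimal sketch of the read path described above, not the code in this PR. It assumes the existing MCF sits next to the dataset at the dataset path plus `.yml` (the pattern of the `awc.tif.yml` URL in this thread); `read_sidecar_text` is a hypothetical helper name, not part of geometamaker.

```python
def read_sidecar_text(dataset_path, suffix=".yml"):
    """Return the text of the sidecar file next to dataset_path, or None.

    Hypothetical helper for illustration; the suffix convention is an
    assumption based on the example URL in this thread.
    """
    # Imported lazily so a missing fsspec install surfaces at call time.
    import fsspec

    sidecar_path = f"{dataset_path}{suffix}"
    try:
        # fsspec.open infers the protocol (local, http, https, ...) from
        # the path string; the file is only opened on entering the block.
        with fsspec.open(sidecar_path, mode="r") as sidecar:
            return sidecar.read()
    except FileNotFoundError:
        # No existing metadata document was found for this dataset.
        return None
```

The same call should accept a local path or an https URL, provided the dependencies for that protocol are installed. How a missing remote file is reported may depend on the server and the fsspec version, so the `FileNotFoundError` handling here is a starting point rather than a guarantee.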

Fixes #18

@davemfish davemfish marked this pull request as ready for review March 7, 2024 14:52
@davemfish davemfish self-assigned this Mar 7, 2024
@davemfish davemfish requested review from phargogh and empavia March 7, 2024 14:53
Member

@phargogh phargogh left a comment

This looks great! Thanks for finding out about fsspec, it looks like a really promising package for a consistent interface for filesystem accesses.

To answer your question about other protocols, I think GCS and GDrive are the two main ones that come to mind.

One thing that I'm not clear about is reconciling the drivers that fsspec offers with the virtual filesystems that GDAL supports. The worst case there is that there's some storage backend that fsspec supports that GDAL does not. But I'm also cautiously optimistic that we might be able to find a workaround, like if we can pass GDAL an open file object or something. Anyways, what do you think about this challenge?

@@ -2,10 +2,13 @@
# --------------------
# This file records the packages and requirements needed in order for
# the library to work as expected. And to run tests.
aiohttp
Member

Is aiohttp implicitly required by fsspec? Or does it change the behavior of fsspec if it's available at runtime?

Collaborator Author

It changes the behavior. The http protocol did not seem to be supported at all without it:

>>> of = fsspec.open('https://storage.googleapis.com/gef-ckan-public-data/awc-isric-soilgrids/awc.tif.yml')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\dmf\projects\geometamaker\env-test\Lib\site-packages\fsspec\core.py", line 459, in open
    out = open_files(
          ^^^^^^^^^^^
  File "C:\Users\dmf\projects\geometamaker\env-test\Lib\site-packages\fsspec\core.py", line 283, in open_files
    fs, fs_token, paths = get_fs_token_paths(
                          ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\dmf\projects\geometamaker\env-test\Lib\site-packages\fsspec\core.py", line 623, in get_fs_token_paths
    chain = _un_chain(urlpath0, storage_options or {})
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\dmf\projects\geometamaker\env-test\Lib\site-packages\fsspec\core.py", line 332, in _un_chain
    cls = get_filesystem_class(protocol)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\dmf\projects\geometamaker\env-test\Lib\site-packages\fsspec\registry.py", line 238, in get_filesystem_class
    raise ImportError(bit["err"]) from e
ImportError: HTTPFileSystem requires "requests" and "aiohttp" to be installed
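
One way to check for this before calling fsspec.open is to look up the modules named in the error message. This is a small sketch using only the standard library; `missing_modules` is a hypothetical helper, and the module list is taken from the ImportError text above rather than from fsspec's own metadata.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found."""
    # find_spec returns None when a top-level module is not installed.
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


# Modules named by the HTTPFileSystem ImportError in the traceback above.
HTTP_MODULES = ("requests", "aiohttp")

if __name__ == "__main__":
    missing = missing_modules(HTTP_MODULES)
    if missing:
        print("http(s) paths are likely unsupported; missing:", missing)
    else:
        print("modules for http(s) support appear to be installed")
```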

Member

Got it. It's strange to me that aiohttp is listed as an extra requirement in fsspec's setup.py but not requests. Oh well!

@davemfish
Collaborator Author

> To answer your question about other protocols, I think GCS and GDrive are the two main ones that come to mind.

I agree. Do you think it makes sense to always install those dependencies (gcsfs, gdrivefs)? Or manage them as extra_requires in a similar way that fsspec does?

If I understand correctly, files on GCS or GDrive could be referenced with https when they have public URLs, but using the gcs:// or gdrivefs protocols instead would allow authentication and probably even write access. But we would need to build out our geometamaker interface to accommodate all that, by passing through kwargs to fsspec.open, or using those other ***fs APIs directly. What do you think @phargogh


@empavia empavia left a comment

This looks great from my end! It seems to cover the items we wanted. Happy to test it out once it's ready.

@phargogh
Member

phargogh commented Mar 8, 2024

> > To answer your question about other protocols, I think GCS and GDrive are the two main ones that come to mind.
>
> I agree. Do you think it makes sense to always install those dependencies (gcsfs, gdrivefs)? Or manage them as extra_requires in a similar way that fsspec does?

I personally love the idea of extras! But it's much easier w/r/t maintenance to just install everything and I don't think we gain much by adding extras at this point in the project. In practice there's some user annoyance when we forget to install a specific extra, plus the headache of having to deal with friendly exceptions like what fsspec provides when a dependency is missing, tests to make sure we're handling that behavior well, etc. All of that leads to more frustration than I think extras are worth at this point in the project. If this library really takes off and we have a whole pile of dependencies we may want to section off, I think that'd be the point where we should consider adding extras.

> If I understand correctly, files on GCS or GDrive could be referenced with https when they have public URLs, but using the gcs:// or gdrivefs protocols instead would allow authentication and probably even write access. But we would need to build out our geometamaker interface to accommodate all that, by passing through kwargs to fsspec.open, or using those other ***fs APIs directly. What do you think @phargogh

That is really awesome that public files could already be accessed over https! I think that's a great feature to start with.

I think we should explore this option a little more before making a decision on being able to read from private filesystems and write to these filesystems. Open questions on my mind include:

  • Can we read all fsspec-required auth parameters from GDAL's auth configuration?
  • How will we read a GDAL-supported dataset from Google Drive (which is not supported by GDAL's VSIs) without downloading the whole dataset?
  • Are there any other cases where GDAL will handle a URI with the same protocol prefix (e.g. gs://) as fsspec, or will we have to pass GDAL VSI-style path strings?
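
To make the last question concrete, here is a rough sketch of translating an fsspec-style URI into a GDAL VSI-style path string. It uses GDAL's documented `/vsicurl/`, `/vsigs/`, and `/vsis3/` prefixes; the function name `to_gdal_path` and the choice of which protocols to map are assumptions for illustration, not anything in geometamaker.

```python
from urllib.parse import urlsplit

# fsspec protocol name -> GDAL virtual filesystem prefix (illustrative subset).
_VSI_PREFIXES = {
    "gs": "/vsigs/",
    "gcs": "/vsigs/",
    "s3": "/vsis3/",
}


def to_gdal_path(uri):
    """Translate an fsspec-style URI to a GDAL VSI-style path string."""
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    if scheme in ("http", "https"):
        # /vsicurl/ takes the full URL, scheme included.
        return f"/vsicurl/{uri}"
    if scheme in _VSI_PREFIXES:
        # Bucket-style stores: prefix + bucket + key, scheme dropped.
        return f"{_VSI_PREFIXES[scheme]}{parts.netloc}{parts.path}"
    if len(scheme) <= 1:
        # No scheme, or a one-letter "scheme" that is really a Windows
        # drive letter: treat as a local path and pass it through.
        return uri
    # Backends with no mapping here (e.g. Google Drive) need another plan.
    raise ValueError(f"no GDAL VSI mapping known for protocol {scheme!r}")
```

Under this sketch, `gs://bucket/dir/a.tif` becomes `/vsigs/bucket/dir/a.tif`, while a Google Drive URI raises `ValueError`, which is where the second question above would need its own answer.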

@davemfish
Collaborator Author

> One thing that I'm not clear about is reconciling the drivers that fsspec offers with the virtual filesystems that GDAL supports. The worst case there is that there's some storage backend that fsspec supports that GDAL does not. But I'm also cautiously optimistic that we might be able to find a workaround, like if we can pass GDAL an open file object or something. Anyways, what do you think about this challenge?

Yeah good question. I guess it goes hand in hand with whether we want to support other file protocols besides HTTP. If so we would need to figure out how to have GDAL open files on those other protocols.

@davemfish
Collaborator Author

davemfish commented Mar 8, 2024

> That is really awesome that public files could already be accessed over https! I think that's a great feature to start with.

Okay, great. In that case I think we don't need to add any other dependencies right now and maybe this PR is complete enough for this case.

> I think we should explore this option a little more before making a decision on being able to read from private filesystems and write to these filesystems. Open questions on my mind include:
>
>   • Can we read all fsspec-required auth parameters from GDAL's auth configuration?
>   • How will we read a GDAL-supported dataset from Google Drive (which is not supported by GDAL's VSIs) without downloading the whole dataset?
>   • Are there any other cases where GDAL will handle a URI with the same protocol prefix (e.g. gs://) as fsspec, or will we have to pass GDAL VSI-style path strings?

Great points, thanks for thinking about this! I'll put these notes in another issue.

@davemfish davemfish requested a review from phargogh March 8, 2024 21:35
@phargogh phargogh merged commit a0e788d into natcap:main Mar 13, 2024
15 checks passed
Successfully merging this pull request may close these issues.

add support for reading from remote resources
3 participants