[FEAT] Create obstore store in fsspec on demand #198
base: main
Conversation
Construct store with from_url using protocol and bucket name
obstore/python/obstore/fsspec.py
Outdated
@@ -45,6 +47,9 @@ def __init__(
        self,
        store: obs.store.ObjectStore,
        *args,
        config: dict[str, Any] = {},
If we allow these, should store be optional?
And before merging we should add typing overloads for better typing. You can see how from_url is implemented.
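For illustration, a minimal sketch of what such overloads could look like (the names and signatures here are assumptions, not the PR's actual code):

from typing import Any, overload

import obstore as obs

class AsyncFsspecStore:
    @overload
    def __init__(self, store: obs.store.ObjectStore, *args: Any) -> None: ...
    @overload
    def __init__(self, store: None = None, *args: Any, config: dict[str, Any]) -> None: ...
    def __init__(self, store=None, *args, config=None, **kwargs):
        # Either an existing store instance is passed, or enough config to build one
        ...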
I use store here to decide the store interface (whether it is S3Store, GCSStore, ...), so that AsyncFsspecStore doesn't need to decide the interface based on the protocol.
Maybe there's a better way of deciding the store interface?
from datetime import timedelta

import fsspec

from obstore.fsspec import AsyncFsspecStore
from obstore.store import S3Store

obstore_fs: AsyncFsspecStore = fsspec.filesystem(
    "s3",
    store=S3Store,
    config={
        "endpoint": "http://localhost:30002",
        "access_key_id": "minio",
        "secret_access_key": "miniostorage",
        "virtual_hosted_style_request": True,  # path contains the bucket name
    },
    client_options={"timeout": "99999s", "allow_http": "true"},
    retry_config={
        "max_retries": 2,
        "backoff": {
            "base": 2,
            "init_backoff": timedelta(seconds=2),
            "max_backoff": timedelta(seconds=16),
        },
        "retry_timeout": timedelta(minutes=3),
    },
)
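One hypothetical alternative to passing store explicitly would be to map each protocol to a store class inside the filesystem; a sketch (names assumed, not the PR's code):

from obstore.store import AzureStore, GCSStore, S3Store

# Hypothetical mapping from fsspec protocol to obstore store class
PROTOCOL_TO_STORE = {
    "s3": S3Store,
    "gs": GCSStore,
    "az": AzureStore,
}

def store_class_for(protocol: str):
    # Look up the store interface from the protocol instead of a `store` kwarg
    try:
        return PROTOCOL_TO_STORE[protocol]
    except KeyError:
        raise ValueError(f"Unsupported protocol: {protocol}") from None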
I'll have a look at the typing later on
Oh, that's confusing, because store is the class itself and not an instance.
We should be able to use the from_url top-level function directly here?
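For reference, a sketch of using the top-level from_url to build a store directly from a URL (the exact keyword arguments here are assumptions based on the config example above):

from obstore.store import from_url

# from_url picks the store type (S3Store, GCSStore, ...) from the URL scheme;
# the config keys below are assumed, mirroring the earlier MinIO example
store = from_url(
    "s3://mybucket",
    config={"endpoint": "http://localhost:30002"},
    client_options={"allow_http": "true"},
)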
    file_path = "/".join(path_li[1:])
    return (bucket, file_path)

@lru_cache(maxsize=10)
It would be nice if this cache size could be user-specified, but we can come back to it.
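One possible shape for a user-specified cache size (a hypothetical helper, not in the PR):

from functools import lru_cache
from typing import Callable

def make_cached_constructor(constructor: Callable, maxsize: int = 10):
    # Wrap the store constructor in an LRU cache of a caller-chosen size,
    # instead of the hard-coded maxsize=10 decorator above
    return lru_cache(maxsize=maxsize)(constructor)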
Would there be one fsspec instance per cloud provider? So if you wanted to use s3 and gcs you'd make two separate instances?
Based on what I know, to use fsspec, we will do:

fsspec.register_implementation("s3", AsyncFsspecStore)
fsspec.register_implementation("gs", AsyncFsspecStore)

Each will have its own AsyncFsspecStore instance already. To configure, we can use (based on my current implementation):

s3_fs: AsyncFsspecStore = fsspec.filesystem(
    "s3",
    store=S3Store,
    config={...},
)
gcs_fs: AsyncFsspecStore = fsspec.filesystem(
    "gs",
    store=GCSStore,
    config={...},
)
It would be nice to take out the
Force-pushed from 34f79f0 to 29464a7
Specify protocols s3, gs, and abfs
I use this: obstore/python/obstore/fsspec.py, lines 272 to 279 in a0d9e1d
Is it true that a single fsspec class can't be associated with more than one protocol? E.g. Azure has three different protocols.
The latest PRs allow you to access the
I think we can, if those protocols refer to the same object instance. s3fs does have two protocols ("s3", "s3a"); see: https://github.com/fsspec/s3fs/blob/023aecf00b5c6243ff5f8a016dac8b6af3913c6b/s3fs/core.py#L277

I think abfs, adlfs, and az have different implementations, so they export different classes. If we use them in obstore, I think we can define a class with protocol (abfs, adlfs, az), but we need to test whether they all work.
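A sketch of what that registration could look like (assuming AsyncFsspecStore from obstore.fsspec; untested, as noted above):

import fsspec

from obstore.fsspec import AsyncFsspecStore

# Register the same class under each Azure protocol alias
for proto in ("abfs", "adlfs", "az"):
    fsspec.register_implementation(proto, AsyncFsspecStore, clobber=True)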
obstore/python/obstore/fsspec.py
Outdated
@@ -104,6 +104,12 @@ def _split_path(self, path: str) -> Tuple[str, str]:
            # no bucket name in path
            return "", path

        if path.startswith(self.protocol + "://"):
Assuming that this function will always receive something like a URL, e.g. s3://mybucket/path/to/file, I'm inclined for this function to use urlparse instead of manually handling the parts of the URL.
It will not always be s3://mybucket/path/to/file; it may come without a protocol, like mybucket/path/to/file.
I use urlparse like this here, which works for both s3://mybucket/path/to/file and mybucket/path/to/file:

obstore/python/obstore/fsspec.py, lines 108 to 112 in 75c738e

res = urlparse(path)
if res.scheme:
    if res.scheme != self.protocol:
        raise ValueError(f"Expect protocol to be {self.protocol}. Got {res.scheme}")
    path = res.netloc + res.path
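For reference, the standard-library behavior this relies on:

from urllib.parse import urlparse

# With a scheme, the bucket lands in netloc:
urlparse("s3://mybucket/path/to/file")
# ParseResult(scheme='s3', netloc='mybucket', path='/path/to/file', ...)

# Without a scheme, everything lands in path, so netloc + path still works:
urlparse("mybucket/path/to/file")
# ParseResult(scheme='', netloc='', path='mybucket/path/to/file', ...)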
Oh cool! That seems to indicate that we could have a single class that defines supported protocols as:

protocol = ("s3", "s3a", "gs", "az", "abfs", "adlfs")

Because the fsspec class used for each is the same? It's just custom kwargs that would need to be passed down for each?
I don't think we can put all the protocols together into a class, as when using:

obstore/python/obstore/fsspec.py, lines 122 to 124 in 6614906

I think the better way is to create
I took a quick look through your PR; it's really good progress, but a few thoughts:
Thanks for the suggestion! I just added the ruff linter and removed the error for path. I also added a check validating that the bucket names from the two paths are the same.
For
Yes! I will update the test in the next few days.
Check if AsyncFsspecStore is registered and test invalid types passed into register
Bucket for https is the netloc of the URL (e.g. in https://www.google.com/path, www.google.com is the bucket)
Hi @kylebarron, I would like to confirm if changing the output for … The problem I faced is that when I call … For example, call …

UPDATE: this is how I do it:

obstore/python/obstore/fsspec.py, lines 297 to 325 in f6ba27c
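Given the discussion above about including the bucket in ls() output, a hypothetical sketch of re-attaching the bucket prefix to results (names assumed, not the PR's actual code; assumes detail=True dict entries):

def _with_bucket(bucket: str, entries: list[dict]) -> list[dict]:
    # fsspec expects ls() output in the same form as its input paths,
    # so prepend the bucket back onto each entry name
    return [{**e, "name": f"{bucket}/{e['name']}"} for e in entries]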
To solve the error when _walk is called recursively with the previous result from ls
path with bucket name
I think that's a question for @martindurant; I don't know what's standard for fsspec. It appears that, based on s3fs, the output of ls() does include the bucket name:

import s3fs

fs = s3fs.S3FileSystem(anon=True)
fs.ls("sentinel-cogs")
# ['sentinel-cogs/sentinel-s2-l2a-cogs']
@machichima I merged a PR that updates our use of ruff, a fast Python linter. As part of this, there were some minor updates to the existing fsspec code. Would you be able to merge in main? I also wanted to have ruff merged so that we could run the lints on this PR's changes.
It would be great if we could have integration tests with s3fs. We could reuse the existing test setup we have in
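A rough sketch of such an integration test (hypothetical bucket name; reuses the MinIO endpoint and credentials from the config example earlier in this thread, and assumes AsyncFsspecStore is registered for "s3"):

import fsspec
import s3fs

def test_ls_matches_s3fs():
    # obstore-backed filesystem (assumes AsyncFsspecStore is registered for "s3")
    fs = fsspec.filesystem(
        "s3",
        config={
            "endpoint": "http://localhost:30002",
            "access_key_id": "minio",
            "secret_access_key": "miniostorage",
        },
        client_options={"allow_http": "true"},
    )
    # s3fs reference implementation pointed at the same MinIO endpoint
    reference = s3fs.S3FileSystem(
        key="minio",
        secret="miniostorage",
        client_kwargs={"endpoint_url": "http://localhost:30002"},
    )
    assert sorted(fs.ls("mybucket", detail=False)) == sorted(
        reference.ls("mybucket", detail=False)
    )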
Yes, fsspec expects that the paths returned by ls() are in the same form you can then use in further open/get/put etc. operations. Since the input paths contain the bucket (and optional protocol), the output of ls() should too. The alternative would be to have the filesystem instance not expect the bucket in the input. You could still get this to work by extracting the bucket once (as passed in fsspec.open/url_to_fs). This is essentially what the "prefix" filesystem does:
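For context, an illustration of fsspec's prefix-style DirFileSystem ("dir" protocol), which scopes all paths under a fixed prefix (the bucket name here is hypothetical):

import fsspec

# All paths are resolved relative to the fixed prefix "s3://mybucket"
fs = fsspec.filesystem(
    "dir",
    path="s3://mybucket",
    target_protocol="s3",
    target_options={"anon": True},
)
fs.ls("")  # lists the bucket contents without repeating the bucket in each call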
Construct the obstore store instance on demand in fsspec when calling methods. This allows automatic store creation for reads and writes across different buckets, aligning usage with fsspec conventions.