[FEAT] Create obstore store in fsspec on demand #198

Open · wants to merge 20 commits into main

Conversation

machichima

Construct the obstore store instance on demand in fsspec when calling methods. This allows automatic store creation for reads/writes across different buckets, aligning usage with fsspec conventions.

construct store with from_url using protocol and bucket name
@@ -45,6 +47,9 @@ def __init__(
        self,
        store: obs.store.ObjectStore,
        *args,
        config: dict[str, Any] = {},

Member

If we allow these, store should be optional?

And before merge we should enable typing overloads for better typing. You can see how from_url is implemented.

Author

I use store here for deciding the store interface (whether it is S3Store, GCSStore, ...), so that in AsyncFsspecStore we don't need to decide the interface based on the protocol.

Maybe there's a better way of deciding the store interface?

obstore_fs: AsyncFsspecStore = fsspec.filesystem(
    "s3",
    store=S3Store,
    config={
        "endpoint": "http://localhost:30002",
        "access_key_id": "minio",
        "secret_access_key": "miniostorage",
        "virtual_hosted_style_request": True,  # path contain bucket name
    },
    client_options={"timeout": "99999s", "allow_http": "true"},
    retry_config={
        "max_retries": 2,
        "backoff": {
            "base": 2,
            "init_backoff": timedelta(seconds=2),
            "max_backoff": timedelta(seconds=16),
        },
        "retry_timeout": timedelta(minutes=3),
    },
)

Author

I'll have a look at the typing later on.

Member

Oh that's confusing because store is the type of the class and not an instance.

We should be able to use the from_url top-level function directly here?

file_path = "/".join(path_li[1:])
return (bucket, file_path)

@lru_cache(maxsize=10)

Member

It would be nice if this cache size could be user-specified, but we can come back to it.
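
A minimal sketch of one way to make this configurable, assuming a hypothetical cache_size constructor argument (a per-instance lru_cache instead of the class-level decorator):

from functools import lru_cache

from fsspec.asyn import AsyncFileSystem

class AsyncFsspecStore(AsyncFileSystem):
    def __init__(self, *args, cache_size: int = 10, **kwargs):
        super().__init__(*args, **kwargs)
        # Wrap the uncached builder per instance, so each filesystem
        # can carry its own cache size
        self._construct_store = lru_cache(maxsize=cache_size)(
            self._construct_store_uncached
        )

    def _construct_store_uncached(self, bucket: str):
        ...  # build the obstore store for this bucket, as in the PR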

@kylebarron
Member

Would there be one fsspec instance per cloud provider? So if you wanted to use s3 and gcs you'd make two separate instances?

@machichima
Author

machichima commented Feb 3, 2025

Would there be one fsspec instance per cloud provider? So if you wanted to use s3 and gcs you'd make two separate instances?

Based on what I know, to use fsspec, we will do:

fsspec.register_implementation("s3", AsyncFsspecStore)
fsspec.register_implementation("gs", AsyncFsspecStore)

Each will have its own AsyncFsspecStore instance already. To configure them, we can use (based on my current implementation):

s3_fs: AsyncFsspecStore = fsspec.filesystem(
    "s3",
    store=S3Store,
    config={...}
)

gcs_fs: AsyncFsspecStore = fsspec.filesystem(
    "gs",
    store=GCSStore,
    config={...}
)

@kylebarron
Member

It would be nice to take out the store arg and use from_url directly. from_url will automatically construct the correct store based on the URL protocol.
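
For reference, a minimal sketch of that pattern, assuming obstore's top-level from_url accepts the same config keywords as the store constructors:

from obstore.store import from_url

# from_url infers the store implementation (S3Store, GCSStore, ...) from
# the URL scheme, so no explicit store class is needed
store = from_url(
    url="s3://mybucket",
    config={"access_key_id": "minio", "secret_access_key": "miniostorage"},
)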

@machichima force-pushed the obstore-instance-in-fsspec branch from 34f79f0 to 29464a7 on February 4, 2025 14:06
@machichima
Author

I use from_url and removed store in the newest commit. However, by doing this, we need to specify the protocol by inheriting the AsyncFsspecStore class for each store instance. I added this here:

class S3FsspecStore(AsyncFsspecStore):
    protocol = "s3"

class GCSFsspecStore(AsyncFsspecStore):
    protocol = "gs"

class AzureFsspecStore(AsyncFsspecStore):
    protocol = "abfs"

@kylebarron
Member

Is it true that a single fsspec class can't be associated with more than one protocol? E.g. Azure has three different protocols abfs, adlfs and az, but it looks like adlfs exports three separate classes.

@kylebarron
Member

The latest PRs allow you to access the config back out of a store, which may be useful to you? You can validate that you already have a store in your cache for a specific bucket.

@machichima
Author

machichima commented Feb 6, 2025

Is it true that a single fsspec class can't be associated with more than one protocol? E.g. Azure has three different protocols abfs, adlfs and az, but it looks like adlfs exports three separate classes.

I think we can if those protocols refer to the same object instance. s3fs does have two protocols ("s3", "s3a"); see: https://github.com/fsspec/s3fs/blob/023aecf00b5c6243ff5f8a016dac8b6af3913c6b/s3fs/core.py#L277

I think abfs, adlfs, and az have different implementations, so they export different classes. If we use them in obstore, I think we can define one class with the protocols (abfs, adlfs, az), but we need to test if they all work.

@@ -104,6 +104,12 @@ def _split_path(self, path: str) -> Tuple[str, str]:
            # no bucket name in path
            return "", path

        if path.startswith(self.protocol + "://"):

Member

Assuming that this function will always receive a URL like s3://mybucket/path/to/file, I'm inclined for this function to use urlparse instead of manually handling the parts of the URL.

Author

It will not always be s3://mybucket/path/to/file; it may come without a protocol, like mybucket/path/to/file.

Author

machichima commented Feb 7, 2025

I use urlparse like this here, which works for both s3://mybucket/path/to/file and mybucket/path/to/file:

res = urlparse(path)
if res.scheme:
    if res.scheme != self.protocol:
        raise ValueError(f"Expect protocol to be {self.protocol}. Got {res.scheme}")
    path = res.netloc + res.path
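
For context, this relies on standard-library urlparse behavior: when there is no scheme, everything lands in path, so res.netloc + res.path yields the bucket-prefixed path in both cases:

>>> from urllib.parse import urlparse
>>> urlparse("s3://mybucket/path/to/file")
ParseResult(scheme='s3', netloc='mybucket', path='/path/to/file', params='', query='', fragment='')
>>> urlparse("mybucket/path/to/file")
ParseResult(scheme='', netloc='', path='mybucket/path/to/file', params='', query='', fragment='')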

@kylebarron
Member

I think we can if those protocols refer to the same object instance. s3fs does have two protocols ("s3", "s3a"); see: fsspec/s3fs@023aecf/s3fs/core.py#L277

Oh cool! That seems to indicate that we could have a single class that defines supported protocols as:

    protocol = ("s3", "s3a", "gs", "az", "abfs", "adlfs")

Because the fsspec class used for each is the same? It's just custom kwargs that would need to be passed down for each?

@machichima
Author

Oh cool! That seems to indicate that we could have a single class that defines supported protocols as:

    protocol = ("s3", "s3a", "gs", "az", "abfs", "adlfs")

Because the fsspec class used for each is the same? It's just custom kwargs that would need to be passed down for each?

I don't think we can put all the protocols together into one class: when using fsspec.register_implementation("s3", AsyncFsspecStore), fsspec doesn't tell AsyncFsspecStore which protocol it was registered under, so when constructing the store instance we cannot get the protocol:

def _construct_store(self, bucket: str):
    return from_url(
        url=f"{self.protocol}://{bucket}",

I think a better way is to create obstore.fsspec.register("protocol"), which wraps around fsspec's register_implementation and directly sets the protocol for AsyncFsspecStore (as mentioned in this comment); then we do not need more classes. Let me have a try.
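
A minimal sketch of such a helper (the register name and the dynamic-subclass trick are hypothetical):

import fsspec

def register(protocol: str) -> None:
    """Hypothetical obstore.fsspec.register: bind AsyncFsspecStore to one protocol."""
    fsspec.register_implementation(
        protocol,
        # Create a subclass on the fly whose class-level `protocol` is set,
        # so _construct_store can build the right URL
        type(f"AsyncFsspecStore_{protocol}", (AsyncFsspecStore,), {"protocol": protocol}),
        clobber=False,
    )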

@kylebarron
Member

I did a quick look through your PR; it's really good progress but a few thoughts:

  • There are a bunch of cases where bucket, path = self._split_path(path) doesn't work because path is not in scope, e.g. in _cp_file, where path1 and path2 are in scope
  • In _cp_file we need to validate that the buckets of the source and destination paths are the same (see the sketch after this list)
  • We need some tests for the edits that happen in this PR
  • It's not clear how BufferedFileSimple works, because that subclasses from an upstream fsspec.spec.AbstractBufferedFile but doesn't touch obstore APIs at all
  • If you don't already, I'd highly suggest using a linter like https://docs.astral.sh/ruff/ in your editor, so that you can catch some of these issues before hitting CI
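
On the second bullet, a hedged sketch of the cross-bucket check, assuming obstore's copy_async plus the PR's _split_path and _construct_store helpers:

import obstore as obs

async def _cp_file(self, path1, path2, **kwargs):
    bucket1, p1 = self._split_path(path1)
    bucket2, p2 = self._split_path(path2)
    if bucket1 != bucket2:
        # A store instance is scoped to a single bucket, so a cross-bucket
        # copy cannot go through a single obs.copy_async call
        raise ValueError(f"Bucket mismatch: {bucket1!r} != {bucket2!r}")
    store = self._construct_store(bucket1)
    await obs.copy_async(store, p1, p2)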

@machichima
Author

  • If you don't already, I'd highly suggest using a linter like https://docs.astral.sh/ruff/ in your editor, so that you can catch some of these issues before hitting CI

Thanks for the suggestion! I just added the ruff linter and removed the path scoping errors. I also added the check validating that the bucket names from the two paths are the same.

  • It's not clear how BufferedFileSimple works, because that subclasses from an upstream fsspec.spec.AbstractBufferedFile but doesn't touch obstore APIs at all

For BufferedFileSimple, when self.fs.cat_file() is called, it dispatches to the _cat_file() method of AsyncFsspecStore.

  • We need some tests for the edits that happen in this PR

Yes! I will update the tests in the next few days.

@machichima
Author

machichima commented Feb 8, 2025

Hi @kylebarron,

I would like to confirm: is it fine to change the output of _ls to paths that include the bucket name?

The problem I faced is that when I call find(), it goes into fsspec's _find(), which then calls _walk() (see here). In _walk(), _ls is called (see here). Because _walk() runs recursively when the path is a directory (see here), if _ls returns paths without the bucket name, the recursive call receives a path without the bucket name, which causes an error.

For example, if _walk("bucket/") is called first and its _ls returns "dir/", then walk is called again as _walk("dir/"), which raises an error.

UPDATE: this is how I do it:

async def _ls(self, path, detail=True, **kwargs):
    bucket, path = self._split_path(path)
    store = self._construct_store(bucket)

    result = await obs.list_with_delimiter_async(store, path)
    objects = result["objects"]
    prefs = result["common_prefixes"]
    if detail:
        return [
            {
                "name": self._fill_bucket_name(object["path"], bucket),
                "size": object["size"],
                "type": "file",
                "e_tag": object["e_tag"],
            }
            for object in objects
        ] + [
            {
                "name": self._fill_bucket_name(pref, bucket),
                "size": 0,
                "type": "directory",
            }
            for pref in prefs
        ]
    else:
        return sorted(
            [self._fill_bucket_name(object["path"], bucket) for object in objects]
            + [self._fill_bucket_name(pref, bucket) for pref in prefs]
        )

@machichima changed the title from "[WIP] [FEAT] Create obstore store in fsspec on demand" to "[FEAT] Create obstore store in fsspec on demand" on Feb 9, 2025
@kylebarron
Member

kylebarron commented Feb 11, 2025

I would like to confirm: is it fine to change the output of _ls to paths that include the bucket name?

I think that's a question for @martindurant. I don't know what's standard for fsspec.

It appears that based on s3fs behavior, the bucket name is always returned in list results:

import s3fs

fs = s3fs.S3FileSystem(anon=True)
fs.ls("sentinel-cogs")
# ['sentinel-cogs/sentinel-s2-l2a-cogs']

@kylebarron
Member

@machichima I merged a PR that updates our use of ruff, a fast Python linter. As part of this, there were some minor updates to the existing fsspec code. Would you be able to merge in main?

I also wanted to have ruff merged so that we could run the lints on this PR's changes.

@kylebarron
Member

It appears that, based on s3fs behavior, the bucket name is always returned in list results.

It would be great if we could have integration tests with s3fs. We could reuse the existing test setup we have in test_fsspec.py, but then validate that we get the same results for all methods across both the obstore-backed store and the s3fs implementation.

@martindurant
Contributor

Yes, fsspec expects that the paths returned by ls() are in the same form that you can then use in further open/get/put etc. operations. Since the input paths contain the bucket (and optional protocol), the output of ls() should too.

The alternative would be to have the filesystem instance not expect the bucket in the input. You could still get this to work by extracting the bucket once (as it is passed in fsspec.open/url_to_fs). This is essentially what the "prefix" filesystem does:

In [10]: fs, path = fsspec.url_to_fs("dir::file:///Users")

In [11]: path
Out[11]: ''

In [12]: fs.ls("", detail=False) # contents of /Users
Out[12]: ['moxie', '.localized', 'Shared', 'mdurant']
