Minor: Improve ObjectStoreUrl docs + examples #10619

Merged
merged 2 commits into from
May 26, 2024

Conversation

alamb

@alamb alamb commented May 22, 2024

Which issue does this PR close?

related to #10616

Rationale for this change

While working on #10549 I kept getting confused about what an ObjectStoreUrl is.

Thus I felt some documentation and examples would help.

What changes are included in this PR?

Add more docs + examples

Are these changes tested?

new doc examples

Are there any user-facing changes?

Docs + examples

No functional changes
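
For context, the idea the new docs describe can be sketched without DataFusion at all. The snippet below is a toy model, not DataFusion's implementation: `StoreKey` and `split_location` are hypothetical names used only for illustration. It assumes, as the `ObjectStoreUrl` docs describe, that an object store is identified by scheme plus authority (for example `s3://bucket`), and that the rest of a location is a path within that store.

```rust
/// Hypothetical stand-in for the concept behind `ObjectStoreUrl`:
/// only the scheme and authority, which together identify one store.
#[derive(Debug, PartialEq)]
struct StoreKey(String);

/// Splits a full location into (store key, path within the store).
/// Returns `None` when the input has no non-empty `scheme://` prefix.
fn split_location(location: &str) -> Option<(StoreKey, String)> {
    let (scheme, rest) = location.split_once("://")?;
    if scheme.is_empty() {
        return None;
    }
    // The authority ends at the first '/'; the remainder is the object path.
    let (authority, path) = rest.split_once('/').unwrap_or((rest, ""));
    Some((StoreKey(format!("{scheme}://{authority}/")), path.to_string()))
}

fn main() {
    let (key, path) = split_location("s3://bucket/data/file.parquet").unwrap();
    assert_eq!(key, StoreKey("s3://bucket/".to_string()));
    assert_eq!(path, "data/file.parquet");

    // Local files have an empty authority, so the key is just `file:///`.
    let (key, path) = split_location("file:///tmp/foo.parquet").unwrap();
    assert_eq!(key, StoreKey("file:///".to_string()));
    assert_eq!(path, "tmp/foo.parquet");

    // Without a scheme there is nothing to identify a store.
    assert_eq!(split_location("/tmp/foo.parquet"), None);
}
```

The point of the split is that many files can share one store key, so a store only needs to be configured once per scheme and authority.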

@goldmedal goldmedal left a comment

Thanks @alamb LGTM.
This example clearly explains how to use ObjectStoreUrl.

@goldmedal

On a related note, I found that the usage flow of ObjectStore in DataFusion involves registering the source as a table and then querying the table. This approach makes sense. However, I think it might be possible to provide something like a table function or other methods to allow users to dynamically decide which path they want to scan.

There are similar features in DuckDB and Spark:

  • DuckDB: SELECT * FROM read_parquet('file-path/test.parquet')
  • Spark: spark.read.parquet('xxxx')

Does DataFusion already have this feature, or did I miss it? If not, I think it would be a valuable feature for users of object storage.

@alamb

alamb commented May 26, 2024

Does DataFusion already have this feature, or did I miss it? If not, I think it would be a valuable feature for users of object storage.

This feature is in datafusion-cli (not DataFusion core) via a dynamic schema provider:

andrewlamb@Andrews-MacBook-Pro-2:~/Software/influxdb_iox$ datafusion-cli
DataFusion CLI v38.0.0
> select * from '/tmp/foo.parquet';
+---------+
| column1 |
+---------+
| 1       |
+---------+
1 row(s) fetched.
Elapsed 0.015 seconds.

Automatically allowing reads of arbitrary files / URLs is likely not appropriate for DataFusion by default; however, making it an easier feature to add would be good.

Maybe this would be a good feature to add as an example or maybe even move the dynamic table provider into datafusion core (not enabled by default) if this is a feature others want
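
To make the trade-off concrete, here is a minimal sketch of an opt-in fallback, assuming a lookup that prefers registered tables and only treats an unknown name as a file path or URL when explicitly enabled. It is a toy model and not the datafusion-cli code: `Catalog`, `TableSource`, and `allow_dynamic` are hypothetical names.

```rust
use std::collections::HashMap;

/// Where a resolved table name points. Hypothetical, for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum TableSource {
    /// Explicitly registered by the user under a name.
    Registered(String),
    /// Not registered: the table name itself was treated as a path or URL.
    Dynamic(String),
}

/// Toy catalog; `allow_dynamic` models the "not enabled by default" switch.
#[derive(Default)]
struct Catalog {
    tables: HashMap<String, String>,
    allow_dynamic: bool,
}

impl Catalog {
    fn register(&mut self, name: &str, location: &str) {
        self.tables.insert(name.to_string(), location.to_string());
    }

    /// Registered tables win; otherwise fall back to the name-as-location
    /// interpretation, but only when dynamic lookup was switched on.
    fn resolve(&self, name: &str) -> Option<TableSource> {
        if let Some(location) = self.tables.get(name) {
            return Some(TableSource::Registered(location.clone()));
        }
        if self.allow_dynamic {
            return Some(TableSource::Dynamic(name.to_string()));
        }
        None
    }
}

fn main() {
    let mut catalog = Catalog::default();
    catalog.register("t", "s3://bucket/data/file.parquet");

    // Default: only registered names resolve.
    assert_eq!(catalog.resolve("/tmp/foo.parquet"), None);

    // Opt in, and an unregistered name is treated as a location.
    catalog.allow_dynamic = true;
    assert_eq!(
        catalog.resolve("/tmp/foo.parquet"),
        Some(TableSource::Dynamic("/tmp/foo.parquet".to_string()))
    );
    assert_eq!(
        catalog.resolve("t"),
        Some(TableSource::Registered("s3://bucket/data/file.parquet".to_string()))
    );
}
```

Keeping the switch off by default means a query can never reach a file the embedding application did not register, which is the concern raised above.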

@goldmedal

goldmedal commented May 26, 2024

Maybe this would be a good feature to add as an example or maybe even move the dynamic table provider into datafusion core (not enabled by default) if this is a feature others want

Cool! Thanks for the information!
I think some users may want to use it directly through the core. It would make it very convenient to build a system that can query object storage dynamically.

@comphead comphead left a comment

lgtm thanks @alamb

@comphead comphead merged commit 3a795a4 into apache:main May 26, 2024
23 checks passed
@alamb alamb deleted the alamb/Objectstore_url_example branch May 27, 2024 09:45
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024