Add support for loading files from s3 #416

Closed
jonmmease opened this issue Nov 9, 2023 · 5 comments

Comments

@jonmmease
Collaborator

VegaFusion could support loading files from S3-compatible object storage.

The DataFusion connection would use the object_store crate as in https://docs.rs/object_store/latest/object_store/aws/struct.AmazonS3Builder.html and https://github.com/apache/arrow-datafusion/blob/main/datafusion-examples/examples/query-http-csv.rs.

The DuckDB connection would use the httpfs extension as in https://duckdb.org/docs/guides/import/s3_import.html.
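As a rough sketch of what the DuckDB side might issue, the helper below only builds the SQL strings for enabling httpfs and setting the S3 options; it does not connect to DuckDB. The function names `sql_quote` and `httpfs_setup_statements` are hypothetical and not part of VegaFusion. The optional `endpoint` and `use_ssl` parameters are an assumption to cover S3-compatible stores that are not AWS.

```python
from typing import List, Optional


def sql_quote(value: str) -> str:
    """Wrap a value in single quotes, doubling any embedded single quote."""
    return "'" + value.replace("'", "''") + "'"


def httpfs_setup_statements(
    region: str,
    access_key_id: str,
    secret_access_key: str,
    endpoint: Optional[str] = None,  # assumption: only needed for non-AWS stores
    use_ssl: bool = True,
) -> List[str]:
    """Return the SQL statements that enable httpfs and configure S3 access.

    Hypothetical helper for illustration; it builds strings only.
    """
    statements = [
        "INSTALL httpfs;",
        "LOAD httpfs;",
        f"SET s3_region={sql_quote(region)};",
        f"SET s3_access_key_id={sql_quote(access_key_id)};",
        f"SET s3_secret_access_key={sql_quote(secret_access_key)};",
    ]
    if endpoint is not None:
        # S3-compatible servers are usually addressed path-style.
        statements.append(f"SET s3_endpoint={sql_quote(endpoint)};")
        statements.append("SET s3_url_style='path';")
    if not use_ssl:
        statements.append("SET s3_use_ssl=false;")
    return statements


def scan_csv_query(url: str) -> str:
    """Build a query that reads a CSV file from an s3:// URL."""
    return f"SELECT * FROM read_csv_auto({sql_quote(url)});"
```

Each returned statement could then be executed in order on a DuckDB connection before running the scan query.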

In both cases, AWS credentials would be loaded from the standard environment variables, and the Vega spec could contain S3 URLs like s3://<bucket>/<path>.
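To make the two inputs concrete, here is a minimal sketch that splits an s3://<bucket>/<path> URL into bucket and key and collects credentials from the environment. `S3Location`, `parse_s3_url`, and `aws_credentials_from_env` are hypothetical names for illustration. The set of variables read is an assumption based on the common AWS names, and the sketch assumes object keys contain no `?` or `#`.

```python
import os
from dataclasses import dataclass
from typing import Dict, Mapping, Optional
from urllib.parse import urlparse

# Assumption: these are the "standard" variables meant in the issue.
AWS_ENV_NAMES = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_SESSION_TOKEN",
    "AWS_DEFAULT_REGION",
)


@dataclass(frozen=True)
class S3Location:
    bucket: str
    key: str


def parse_s3_url(url: str) -> S3Location:
    """Split s3://<bucket>/<path> into its bucket and object key."""
    parsed = urlparse(url)
    if parsed.scheme != "s3":
        raise ValueError(f"not an s3 URL: {url!r}")
    if not parsed.netloc:
        raise ValueError(f"missing bucket in {url!r}")
    # Drop the leading slash so the key matches the object name.
    return S3Location(bucket=parsed.netloc, key=parsed.path.lstrip("/"))


def aws_credentials_from_env(
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return whichever of the AWS variables are set in the environment."""
    env = os.environ if env is None else env
    return {name: env[name] for name in AWS_ENV_NAMES if name in env}
```

Passing an explicit mapping instead of relying on `os.environ` keeps the credential lookup easy to exercise in tests.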

This would work for VegaFusion server as well as VegaFusion Python.

In terms of implementation, the Connection.scan_* methods would need special handling for S3 URLs, somewhere around here for scan_csv:

https://github.com/hex-inc/vegafusion/blob/6d352b78df1a2fca4db0b9a29aae2e5283df9a43/vegafusion-sql/src/connection/datafusion_conn.rs#L144
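The special handling amounts to branching on the URL scheme before the file is read. The real method is Rust code in datafusion_conn.rs; the Python below is only an illustration of that branching, and `classify_source` is a hypothetical name. What each branch would then do inside the connection is not shown.

```python
from urllib.parse import urlparse


def classify_source(url: str) -> str:
    """Return which kind of reader a scan_* method would need for this URL.

    Hypothetical helper: "s3" for object storage, "http" for plain web URLs,
    and "local" for everything else (treated as a filesystem path).
    """
    scheme = urlparse(url).scheme.lower()
    if scheme == "s3":
        return "s3"
    if scheme in ("http", "https"):
        return "http"
    return "local"
```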

Cross-reference #87 for adding scan_parquet support; Parquet is particularly well suited to storage on S3.

@jonmmease
Collaborator Author

We should be able to test this locally and on CI with MinIO, which is available from conda-forge as minio-server.

@jonmmease
Collaborator Author

Partial implementation in progress in #417. This does not include the DuckDB connection support discussed above.

@kszlim

kszlim commented Dec 2, 2023

Would be cool to support Polars too, via scan_parquet.

@jonmmease
Collaborator Author

Polars support is possible, but would be a pretty big project. I'm hoping we'll eventually get support for Ibis, which can wrap Polars along with a bunch of other backends.

@jonmmease
Collaborator Author

DuckDB Parquet + S3 support was added in 1.5.0.
