Add support for loading files from s3 #416
Comments
We should be able to test this locally and on CI with minio, which is available from conda-forge.
Partial implementation in progress in #417. This does not include the duckdb connection support discussed above.
Would be cool to support polars too.
Polars support is possible, but it would be a pretty big project. I'm hoping we'll eventually get support for Ibis, which can wrap Polars along with a bunch of other backends.
DuckDB parquet + s3 support was added in 1.5.0.
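For reference, the httpfs flow on the DuckDB side boils down to a couple of SQL statements. Here is a std-only sketch that just builds the SQL text; the SQL itself (`INSTALL httpfs`, `LOAD httpfs`, `read_parquet`) is real DuckDB syntax, while the helper function names are illustrative, not VegaFusion's actual API:

```rust
// Std-only sketch of the SQL a DuckDB connection would run to read
// parquet from s3 via the httpfs extension. The SQL is real DuckDB
// syntax; the helper functions are illustrative, not VegaFusion's API.
fn duckdb_s3_setup_sql() -> &'static str {
    // httpfs provides s3:// support; credentials come from the environment.
    "INSTALL httpfs; LOAD httpfs;"
}

fn duckdb_scan_parquet_sql(url: &str) -> String {
    format!("SELECT * FROM read_parquet('{url}')")
}

fn main() {
    println!("{}", duckdb_s3_setup_sql());
    println!("{}", duckdb_scan_parquet_sql("s3://my-bucket/data.parquet"));
}
```

The real connection would execute these statements against an open DuckDB session rather than just formatting them.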
VegaFusion could support loading files from s3-compatible object storage.

The DataFusion connection would use the `object_store` crate as in https://docs.rs/object_store/latest/object_store/aws/struct.AmazonS3Builder.html and https://github.com/apache/arrow-datafusion/blob/main/datafusion-examples/examples/query-http-csv.rs. The DuckDB connection would use the httpfs extension as in https://duckdb.org/docs/guides/import/s3_import.html.

In both cases, AWS credentials would be loaded from the standard environment variables, and the Vega spec could contain s3 URLs like `s3://<bucket>/<path>`. This would work for VegaFusion server as well as VegaFusion Python.
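A minimal std-only sketch of the credential-loading step described above; the helper name is hypothetical, and in practice the DataFusion connection would likely just call `object_store`'s `AmazonS3Builder::from_env`, which reads the same variables:

```rust
use std::env;

// Hypothetical helper: read the standard AWS environment variables,
// mirroring what object_store's AmazonS3Builder::from_env does.
// Returns None when credentials are not configured.
fn aws_credentials_from_env() -> Option<(String, String)> {
    let access_key = env::var("AWS_ACCESS_KEY_ID").ok()?;
    let secret_key = env::var("AWS_SECRET_ACCESS_KEY").ok()?;
    Some((access_key, secret_key))
}

fn main() {
    // Simulate a configured environment for the sketch.
    env::set_var("AWS_ACCESS_KEY_ID", "AKIAEXAMPLE");
    env::set_var("AWS_SECRET_ACCESS_KEY", "secret");
    match aws_credentials_from_env() {
        Some((key, _)) => println!("found credentials for access key {}", key),
        None => println!("no AWS credentials in environment"),
    }
}
```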
In terms of implementation, the `Connection.scan_*` methods would need special handling for s3 URLs, somewhere around here for `scan_csv`: https://github.com/hex-inc/vegafusion/blob/6d352b78df1a2fca4db0b9a29aae2e5283df9a43/vegafusion-sql/src/connection/datafusion_conn.rs#L144
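The special handling would start by recognizing s3 URLs before falling through to the existing local-file path. A std-only sketch, assuming a hypothetical helper (not VegaFusion's actual API):

```rust
// Hypothetical helper sketching how the Connection::scan_* methods
// could detect s3 URLs. Splits an `s3://<bucket>/<path>` URL into
// (bucket, key); returns None for local paths and malformed URLs.
fn parse_s3_url(url: &str) -> Option<(&str, &str)> {
    let rest = url.strip_prefix("s3://")?;
    let (bucket, key) = rest.split_once('/')?;
    if bucket.is_empty() || key.is_empty() {
        return None;
    }
    Some((bucket, key))
}

fn main() {
    // An s3 URL would route to an object_store-backed reader...
    println!("{:?}", parse_s3_url("s3://my-bucket/data/file.csv"));
    // ...while a local path falls through to the existing code path.
    println!("{:?}", parse_s3_url("/local/path/file.csv"));
}
```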
Cross reference #87 for adding `scan_parquet` support, which is particularly well suited for storage on s3.