Add support for loading files from s3 #416

jonmmease · 2023-11-09T14:17:14Z

VegaFusion could support loading files from S3-compatible object storage.

The DataFusion connection would use the object_store crate as in https://docs.rs/object_store/latest/object_store/aws/struct.AmazonS3Builder.html and https://github.com/apache/arrow-datafusion/blob/main/datafusion-examples/examples/query-http-csv.rs.

The DuckDB connection would use the httpfs extension as in https://duckdb.org/docs/guides/import/s3_import.html.
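As a rough sketch of what the DuckDB side might issue, the helper below builds the SQL statements that load httpfs and set S3 credentials, plus a query that reads a Parquet file by URL. The function names (`sql_quote`, `duckdb_s3_setup`, `parquet_scan_query`) are hypothetical and not part of VegaFusion; the `s3_region`, `s3_access_key_id`, `s3_secret_access_key`, and `s3_endpoint` settings are the ones the httpfs extension documents.

```python
def sql_quote(value: str) -> str:
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def duckdb_s3_setup(region, access_key_id, secret_access_key, endpoint=None):
    """Return the statements that enable httpfs and configure S3 access.

    Hypothetical helper: it only builds SQL strings and does not run them.
    """
    statements = [
        "INSTALL httpfs",
        "LOAD httpfs",
        f"SET s3_region={sql_quote(region)}",
        f"SET s3_access_key_id={sql_quote(access_key_id)}",
        f"SET s3_secret_access_key={sql_quote(secret_access_key)}",
    ]
    # An explicit endpoint is only needed for non-AWS, S3-compatible stores.
    if endpoint is not None:
        statements.append(f"SET s3_endpoint={sql_quote(endpoint)}")
    return statements


def parquet_scan_query(s3_url: str) -> str:
    """Build a query that reads a Parquet file directly from an s3:// URL."""
    return f"SELECT * FROM read_parquet({sql_quote(s3_url)})"
```

With the `duckdb` Python package installed, these statements could be passed in order to `execute` on a single connection from `duckdb.connect()`, followed by the scan query.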

In both cases, AWS credentials would be loaded from the standard environment variables, and the Vega spec could contain s3 URLs like s3://<bucket>/<path>.
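To make the URL and credential conventions concrete, here is a minimal sketch that splits an `s3://<bucket>/<path>` URL and collects credentials from the standard AWS environment variables. Both helper names are hypothetical, and which variables VegaFusion ends up reading is an assumption; the ones below are the commonly used AWS names.

```python
import os
from urllib.parse import urlparse


def parse_s3_url(url: str):
    """Split s3://<bucket>/<path> into (bucket, key)."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3 URL: {url!r}")
    # The path component keeps its leading slash; the object key does not.
    return parsed.netloc, parsed.path.lstrip("/")


def aws_credentials_from_env(env=None):
    """Read AWS credentials from the environment; missing values are None."""
    env = os.environ if env is None else env
    return {
        "access_key_id": env.get("AWS_ACCESS_KEY_ID"),
        "secret_access_key": env.get("AWS_SECRET_ACCESS_KEY"),
        "session_token": env.get("AWS_SESSION_TOKEN"),
        "region": env.get("AWS_DEFAULT_REGION"),
    }
```

Passing a plain dict as `env` keeps the lookup testable without touching the real process environment.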

This would work for VegaFusion server as well as VegaFusion Python.

In terms of implementation, the Connection.scan_* methods would need special handling for S3 URLs. For scan_csv, that would go somewhere around here:

https://github.com/hex-inc/vegafusion/blob/6d352b78df1a2fca4db0b9a29aae2e5283df9a43/vegafusion-sql/src/connection/datafusion_conn.rs#L144
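The kind of routing this implies can be sketched in Python, independent of the actual Rust code: classify a URL by where it lives (local, HTTP, or S3) and pick a scan method from its extension. The `classify_source` function and the `SCAN_METHODS` table are hypothetical illustrations, and `scan_parquet` is included only on the assumption that #87 adds it.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical mapping from file extension to a Connection.scan_* method name.
SCAN_METHODS = {".csv": "scan_csv", ".parquet": "scan_parquet"}


def classify_source(url: str):
    """Return (storage_kind, scan_method_name) for a dataset URL."""
    parsed = urlparse(url)
    if parsed.scheme == "s3":
        storage = "s3"
    elif parsed.scheme in ("http", "https"):
        storage = "http"
    else:
        # Anything without a recognized scheme is treated as a local path.
        storage = "local"

    suffix = PurePosixPath(parsed.path).suffix.lower()
    method = SCAN_METHODS.get(suffix)
    if method is None:
        raise ValueError(f"unsupported file type: {url!r}")
    return storage, method
```

In this sketch only the storage kind changes for S3; the file format still determines which scan method handles the data.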

Cross-reference: #87 covers adding scan_parquet support; Parquet is particularly well suited to storage on S3.

jonmmease · 2023-11-09T14:19:38Z

We should be able to test this locally and on CI with MinIO, which is available from conda-forge as minio-server.
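A test setup along these lines might point the AWS environment variables at a local MinIO server. The sketch below assumes MinIO's defaults (port 9000, `minioadmin` credentials) and uses the variable names recognized by `AmazonS3Builder::from_env` in the object_store crate; whether VegaFusion reads exactly these is an assumption, and `minio_test_env` is a hypothetical helper.

```python
def minio_test_env(
    endpoint="http://localhost:9000",
    access_key="minioadmin",
    secret_key="minioadmin",
    region="us-east-1",
):
    """Build environment variables that direct S3 clients to a local MinIO."""
    return {
        "AWS_ACCESS_KEY_ID": access_key,
        "AWS_SECRET_ACCESS_KEY": secret_key,
        "AWS_DEFAULT_REGION": region,
        # Non-AWS endpoint; plain HTTP must be allowed explicitly.
        "AWS_ENDPOINT": endpoint,
        "AWS_ALLOW_HTTP": "true",
    }
```

A test fixture could merge this dict into `os.environ` (or pass it as `env=` to `subprocess.run`) before starting the code under test.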

jonmmease · 2023-11-12T12:30:45Z

Partial implementation in progress in #417. This does not include the DuckDB connection support discussed above.

kszlim · 2023-12-02T08:26:31Z

Would be cool to support polars too via scan_parquet

jonmmease · 2023-12-02T12:29:27Z

Polars support is possible, but would be a pretty big project. I'm hoping we'll eventually get support for Ibis, which can wrap Polars along with a bunch of other backends.

jonmmease · 2023-12-12T01:28:44Z

DuckDB Parquet + S3 support added in 1.5.0.

jonmmease closed this as completed Dec 12, 2023
