
Interactions with S3 buckets (Parquet/Iceberg export, object reads) #321

Closed
alanpaulkwan opened this issue Dec 25, 2024 · 2 comments
alanpaulkwan commented Dec 25, 2024

The architecture many are going for is a data lake without lock-in. One issue with ApeCloud's approach, much as I love DuckDB's internal storage, is its reliance on the DuckDB file format (albeit a very open one).

Thus, it would be great to support Parquet exports as DuckDB does, including writing out to S3 buckets if possible. Later on, as Iceberg becomes easier to implement, Iceberg support as well.

In other words, support

select * from parquet_scan('s3://bucket/file.parquet')

Ideally, even if it deviates from DuckDB's syntax, it would also allow you to specify endpoints with a mnemonic prefix, such as

s3:endpoint_mnemonic://

and the same for exports.

Right now one can FUSE-mount an S3 bucket, but there is still no direct object export.
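
For reference, the closest thing in stock DuckDB today seems to be the secrets manager, which can scope credentials and an endpoint to a URL prefix. A rough sketch of what that looks like (the secret name, bucket, keys, and endpoint below are placeholders, and I haven't checked whether MyDuck exposes this):

CREATE SECRET lake_endpoint (
    TYPE S3,
    KEY_ID 'xxxxxxxxxxxxxxxxx',
    SECRET 'xxxxxxxxxxxxxxxxxxx',
    REGION 'us-east-1',
    ENDPOINT 'my-object-store.example.com',
    SCOPE 's3://lake-bucket/'
);

-- Reads under the scoped prefix pick up that endpoint and those credentials
SELECT * FROM parquet_scan('s3://lake-bucket/file.parquet');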

TianyuZhang1214 (Contributor) commented Dec 26, 2024

@alanpaulkwan
Thanks for your feedback! Let me answer your questions:

  1. Export/Import and Scan Data in Parquet Format:
    We already support this feature, just as DuckDB does. You can connect to MyDuck Server via the Postgres protocol and execute the EXPORT/IMPORT DATABASE commands (a single-table COPY variant is sketched after this list). Here's an example:
SET s3_region='ap-northeast-1';
SET s3_access_key_id='xxxxxxxxxxxxxxxxx';
SET s3_secret_access_key='xxxxxxxxxxxxxxxxxxx';
SET s3_endpoint='s3.ap-northeast-1.amazonaws.com';

-- Export data into Parquet format on S3
EXPORT DATABASE 's3://your-bucket-name/your-path-name/' (FORMAT PARQUET);

-- Import data from directory on S3
IMPORT DATABASE 's3://your-bucket-name/your-path-name/';

-- Read data in Parquet format on S3
SELECT * FROM parquet_scan('s3://your-bucket-name/your-file.parquet');
  2. Export Data in Iceberg Format:
    I'm currently investigating the implementation of #276 (Add S3 support for writing Iceberg-format files), including INSERT/UPDATE/DELETE operations. While append-only writes are relatively straightforward, modifications will take more time. As you mentioned, exporting files in Iceberg format is not particularly complex, so I plan to implement an Iceberg export strategy first. Issue #324 (Export Data into Iceberg Format on Object Storage) has been created for this task.

alanpaulkwan (Author) commented

Ah, my bad. I think I was confused by some things I read in the documentation, and I was working in the MySQL interface, where a lot of this wasn't working. But this is great!
