-
Notifications
You must be signed in to change notification settings - Fork 12
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
* add docs for running extension tests * Delta Lake docs (#313) * Create delta.mdx * Update delta.mdx * Update index.mdx * Update delta.mdx * Update delta.mdx * Fixes --------- Co-authored-by: Prashanth Rao <[email protected]> Co-authored-by: prrao87 <[email protected]> * Add Iceberg Extension Documentation (#314) * add ice_berg docu * Update src/content/docs/extensions/iceberg.mdx Co-authored-by: Guodong Jin <[email protected]> * Update src/content/docs/extensions/iceberg.mdx Co-authored-by: Guodong Jin <[email protected]> * restructure * restructure * restructure * update table * update table * Apply suggestions from code review * update table * Fixes --------- Co-authored-by: Guodong Jin <[email protected]> Co-authored-by: Prashanth Rao <[email protected]> Co-authored-by: prrao87 <[email protected]> * Fix file extension * Fix header * Update sidebar * Minor fixes * bump version --------- Co-authored-by: sterling <[email protected]> Co-authored-by: ziyi chen <[email protected]> Co-authored-by: Sterling Shi <[email protected]> Co-authored-by: Guodong Jin <[email protected]>
- Loading branch information
1 parent
252c198
commit c0f5ff4
Showing
6 changed files
with
448 additions
and
18 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,145 @@ | ||
--- | ||
title: "Delta Lake" | ||
--- | ||
|
||
## Usage | ||
|
||
The `delta` extension adds support for scanning/copying from the [`Delta Lake open-source storage format`](https://delta.io/). | ||
Delta Lake is an open-source storage framework that enables building a format agnostic Lakehouse architecture. | ||
Using this extension, you can interact with Delta tables from within Kùzu using the `LOAD FROM` and `COPY FROM` clauses. | ||
|
||
The Delta functionality is not available by default, so you would first need to install the `DELTA` | ||
extension by running the following commands: | ||
|
||
```sql | ||
INSTALL DELTA; | ||
LOAD EXTENSION DELTA; | ||
``` | ||
|
||
### Example dataset | ||
|
||
Let's look at an example dataset to demonstrate how the Delta extension can be used. | ||
Firstly, let's create a Delta table containing student information using Python and save the Delta table in the `'/tmp/student'` directory: | ||
Before running the script, make sure the `deltalake` Python package is properly installed (we will also use Pandas). | ||
```shell | ||
pip install deltalake pandas | ||
``` | ||
|
||
```python | ||
# create_delta_table.py | ||
import pandas as pd | ||
from deltalake import DeltaTable, write_deltalake | ||
|
||
student = { | ||
"name": ["Alice", "Bob", "Carol"], | ||
"ID": [0, 3, 7] | ||
} | ||
|
||
write_deltalake(f"/tmp/student", pd.DataFrame.from_dict(student)) | ||
``` | ||
|
||
In the following sections, we will first scan the Delta table to query its contents in Cypher, and | ||
then proceed to copy the data and construct a node table. | ||
|
||
### Scan the Delta table | ||
`LOAD FROM` is a Cypher clause that scans a file or object element by element, but doesn’t actually | ||
move the data into a Kùzu table. | ||
|
||
To scan the Delta table created above, you can do the following: | ||
|
||
```cypher | ||
LOAD FROM '/tmp/student' (file_format='delta') RETURN *; | ||
``` | ||
``` | ||
┌────────┬───────┐ | ||
│ name │ ID │ | ||
│ STRING │ INT64 │ | ||
├────────┼───────┤ | ||
│ Alice │ 0 │ | ||
│ Bob │ 3 │ | ||
│ Carol │ 7 │ | ||
└────────┴───────┘ | ||
``` | ||
:::note[Note] | ||
Note: The `file_format` parameter is used to explicitly specify the file format of the given file instead of letting Kùzu autodetect the file format at runtime. | ||
When scanning from the Delta table, `file_format` option must be provided since Kùzu is not capable of autodetecting Delta tables. | ||
::: | ||
|
||
### Copy the Delta table into a node table | ||
You can then use a `COPY FROM` statement to directly copy the contents of the Delta table into a Kùzu node table. | ||
|
||
```cypher | ||
CREATE NODE TABLE student (name STRING, ID INT64, PRIMARY KEY(ID)); | ||
COPY student FROM '/tmp/student' (file_format='delta') | ||
``` | ||
|
||
Just like above in `LOAD FROM`, the `file_format` parameter is mandatory when specifying the `COPY FROM` clause as well. | ||
|
||
```cypher | ||
// First, create the node table | ||
CREATE NODE TABLE student (name STRING, ID INT64, PRIMARY KEY(ID)); | ||
``` | ||
``` | ||
┌─────────────────────────────────┐ | ||
│ result │ | ||
│ STRING │ | ||
├─────────────────────────────────┤ | ||
│ Table student has been created. │ | ||
└─────────────────────────────────┘ | ||
``` | ||
```cypher | ||
COPY student FROM '/tmp/student' (file_format='delta'); | ||
``` | ||
``` | ||
┌─────────────────────────────────────────────────┐ | ||
│ result │ | ||
│ STRING │ | ||
├─────────────────────────────────────────────────┤ | ||
│ 3 tuples have been copied to the student table. │ | ||
└─────────────────────────────────────────────────┘ | ||
``` | ||
|
||
### Access Delta tables hosted on S3 | ||
Kùzu also supports scanning/copying a Delta table hosted on S3 in the same way as from a local file system. | ||
Before reading and writing from S3, you have to configure the connection using the [CALL](https://kuzudb.com/docusaurus/cypher/configuration) statement. | ||
|
||
#### Supported options | ||
|
||
| Option name | Description | | ||
|----------|----------| | ||
| `s3_access_key_id` | S3 access key id | | ||
| `s3_secret_access_key` | S3 secret access key | | ||
| `s3_endpoint` | S3 endpoint | | ||
| `s3_url_style` | Uses [S3 url style](https://docs.aws.amazon.com/AmazonS3/latest/userguide/VirtualHosting.html) (should either be vhost or path) | | ||
| `s3_region` | S3 region | | ||
|
||
#### Requirements on the S3 server API | ||
|
||
| Feature | Required S3 API features | | ||
|----------|----------| | ||
| Public file reads | HTTP Range request | | ||
| Private file reads | Secret key authentication| | ||
|
||
#### Scan Delta table from S3 | ||
Reading or scanning a Delta table that's on S3 is as simple as reading from regular files: | ||
|
||
```sql | ||
LOAD FROM 's3://kuzu-sample/sample-delta' (file_format='delta') | ||
RETURN * | ||
``` | ||
|
||
#### Copy Delta table hosted on S3 into a local node table | ||
|
||
Copying from Delta tables on S3 is also as simple as copying from regular files: | ||
|
||
```cypher | ||
CREATE NODE TABLE student (name STRING, ID INT64, PRIMARY KEY(ID)); | ||
COPY student FROM 's3://kuzu-sample/student-delta' (file_format='delta') | ||
``` | ||
|
||
## Limitations | ||
|
||
When using the Delta Lake extension in Kùzu, keep the following limitations in mind. | ||
|
||
- Writing (i.e., exporting to) Delta files from Kùzu is currently not supported. | ||
|
Oops, something went wrong.