From ef69029c1eded31e72aea02351cb82b0f9248db8 Mon Sep 17 00:00:00 2001 From: ziyi chen Date: Mon, 9 Dec 2024 15:37:54 -0500 Subject: [PATCH 1/6] Create delta.mdx --- src/content/docs/extensions/delta.mdx | 94 +++++++++++++++++++++++++++ 1 file changed, 94 insertions(+) create mode 100644 src/content/docs/extensions/delta.mdx diff --git a/src/content/docs/extensions/delta.mdx b/src/content/docs/extensions/delta.mdx new file mode 100644 index 00000000..f9c3b891 --- /dev/null +++ b/src/content/docs/extensions/delta.mdx @@ -0,0 +1,94 @@ +--- +title: "DELTA extension" +--- + +import { Tabs, TabItem } from '@astrojs/starlight/components'; + +## Usage + +The `delta` extension adds support for scanning/copying from the [`Delta Lake open-source storage format`](https://delta.io/). Using this extension, you can +interact with DELTA tables using [`LOAD FROM`](/cypher/query-clauses/load-from), +[`COPY FROM`](/import/copy-from-query-results), similar to how you would +with CSV files. + +The DELTA functionality is not available by default, so you would first need to install the DELTA +extension by running the following commands: + +```sql +INSTALL DELTA; +LOAD EXTENSION DELTA; +``` + +### Example dataset + +Let's look at an example dataset to demonstrate how the DELTA extension can be used. +Firstly, let's create a DELTA table containing student information using python and save the delta table in the `'/tmp/student'` directory: +```python +import pandas as pd +from deltalake import DeltaTable, write_deltalake + +student = { + "name": ["Alice", "Bob", "Carol"], + "ID": [0, 3, 7] +} + +write_deltalake(f"/tmp/student", pd.DataFrame.from_dict(student)) +``` + +In the following sections, we will first scan the DELTA table to query its contents in Cypher, and +then proceed to copy the data and construct a node table. + +### Scan the DELTA table +`LOAD FROM` is a Cypher query that scans a file or object element by element, but doesn’t actually +move the data into a Kùzu table. 
+ +To scan the delta table created above, you can do the following: + +```cypher +LOAD FROM '/tmp/student'(file_format='delta') RETURN *; +``` +Note: The `file_format` parameter is used to explicitly specify the file format of the given file instead of letting kuzu sniff the file format at runtime. When scanning from the DELTA table, `file_format` option must be provided since kuzu is not capable of sniffing delta tables. + +### Copy the DELTA table into a node table +You can then use a `COPY FROM` statement to directly copy the contents of the DELTA table into a node table. + +```cypher +CREATE NODE TABLE student (name STRING, ID INT64, PRIMARY KEY(ID)); +COPY student FROM '/tmp/student' (file_format='delta') +``` +Note: The `file_format` parameter is also needed in the copy from clause as mentioned in the `LOAD FROM` section. + +### Access the DELTA table hosted on S3 +Kùzu also supports scanning/copying a DELTA table hosted on S3 in the same way as from a local file system. +Before reading and writing from S3, users have to configure using the [CALL](https://kuzudb.com/docusaurus/cypher/configuration) statement. 
+ +### Supported options: + +| Option name | Description | +|----------|----------| +| `s3_access_key_id` | S3 access key id | +| `s3_secret_access_key` | S3 secret access key | +| `s3_endpoint` | S3 endpoint | +| `s3_url_style` | Uses [S3 url style](https://docs.aws.amazon.com/AmazonS3/latest/userguide/VirtualHosting.html) (should either be vhost or path) | +| `s3_region` | S3 region | + +### Requirements on the S3 server API + +| Feature | Required S3 API features | +|----------|----------| +| Public file reads | HTTP Range request | +| Private file reads | Secret key authentication| + +### Read DELTA table from S3: +Reading from S3 is as simple as reading from regular files: + +```sql +LOAD FROM 's3://kuzu-sample/sample-delta' (file_format='delta') +RETURN *; +``` + +### Copy DELTA table hosted on S3 into a local node table +```cypher +CREATE NODE TABLE student (name STRING, ID INT64, PRIMARY KEY(ID)); +COPY student FROM 's3://kuzu-sample/student-delta' (file_format='delta') +``` From b81dcf1c4ebe112733fc376e7f74ea88c949026e Mon Sep 17 00:00:00 2001 From: ziyi chen Date: Mon, 9 Dec 2024 15:42:44 -0500 Subject: [PATCH 2/6] Update delta.mdx --- src/content/docs/extensions/delta.mdx | 32 +++++++++++++++++++++++++++ 1 file changed, 32 insertions(+) diff --git a/src/content/docs/extensions/delta.mdx b/src/content/docs/extensions/delta.mdx index f9c3b891..55c2d3b6 100644 --- a/src/content/docs/extensions/delta.mdx +++ b/src/content/docs/extensions/delta.mdx @@ -49,6 +49,19 @@ LOAD FROM '/tmp/student'(file_format='delta') RETURN *; ``` Note: The `file_format` parameter is used to explicitly specify the file format of the given file instead of letting kuzu sniff the file format at runtime. When scanning from the DELTA table, `file_format` option must be provided since kuzu is not capable of sniffing delta tables. 
+Result: +```cypher +kuzu> LOAD FROM '/tmp/student'(file_format='delta') RETURN *; +┌────────┬───────┐ +│ name │ ID │ +│ STRING │ INT64 │ +├────────┼───────┤ +│ Alice │ 0 │ +│ Bob │ 3 │ +│ Carol │ 7 │ +└────────┴───────┘ +``` + ### Copy the DELTA table into a node table You can then use a `COPY FROM` statement to directly copy the contents of the DELTA table into a node table. @@ -58,6 +71,25 @@ COPY student FROM '/tmp/student' (file_format='delta') ``` Note: The `file_format` parameter is also needed in the copy from clause as mentioned in the `LOAD FROM` section. +Result: +```cypher +kuzu> CREATE NODE TABLE student (name STRING, ID INT64, PRIMARY KEY(ID)); +┌─────────────────────────────────┐ +│ result │ +│ STRING │ +├─────────────────────────────────┤ +│ Table student has been created. │ +└─────────────────────────────────┘ + +kuzu> COPY student FROM '/tmp/student' (file_format='delta'); +┌─────────────────────────────────────────────────┐ +│ result │ +│ STRING │ +├─────────────────────────────────────────────────┤ +│ 3 tuples have been copied to the student table. │ +└─────────────────────────────────────────────────┘ +``` + ### Access the DELTA table hosted on S3 Kùzu also supports scanning/copying a DELTA table hosted on S3 in the same way as from a local file system. Before reading and writing from S3, users have to configure using the [CALL](https://kuzudb.com/docusaurus/cypher/configuration) statement. 
From c5a0729ca2c65d8214872b23f1ba2174c37577c4 Mon Sep 17 00:00:00 2001 From: ziyi chen Date: Wed, 18 Dec 2024 10:46:21 -0500 Subject: [PATCH 3/6] Update index.mdx --- src/content/docs/extensions/index.mdx | 1 + 1 file changed, 1 insertion(+) diff --git a/src/content/docs/extensions/index.mdx b/src/content/docs/extensions/index.mdx index a0554b9f..54ab9b72 100644 --- a/src/content/docs/extensions/index.mdx +++ b/src/content/docs/extensions/index.mdx @@ -36,6 +36,7 @@ The following extensions are implemented, with more to come: | [postgres](/extensions/attach) | Scan data from an attached PostgreSQL database | | [sqlite](/extensions/attach) | Scan data from an attached SQLite database | | [json](/extensions/json) | Scan and manipulate JSON data | +| [delta](/extensions/delta) | Scan data from a delta table | ## Using Extensions in Kùzu From 7065712bbe761e4571da5a51d3363afff4a1be29 Mon Sep 17 00:00:00 2001 From: ziyi chen Date: Wed, 18 Dec 2024 10:50:46 -0500 Subject: [PATCH 4/6] Update delta.mdx --- src/content/docs/extensions/delta.mdx | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/src/content/docs/extensions/delta.mdx b/src/content/docs/extensions/delta.mdx index 55c2d3b6..8a75f99a 100644 --- a/src/content/docs/extensions/delta.mdx +++ b/src/content/docs/extensions/delta.mdx @@ -23,6 +23,11 @@ LOAD EXTENSION DELTA; Let's look at an example dataset to demonstrate how the DELTA extension can be used. 
Firstly, let's create a DELTA table containing student information using python and save the delta table in the `'/tmp/student'` directory: +Before running the script, make sure the deltalake package is properly installed by: +```shell +pip3 install deltalake +``` + ```python import pandas as pd from deltalake import DeltaTable, write_deltalake From 08dd57702daa9c617092df253636464903024dd6 Mon Sep 17 00:00:00 2001 From: ziyi chen Date: Wed, 18 Dec 2024 16:27:36 -0500 Subject: [PATCH 5/6] Update delta.mdx --- src/content/docs/extensions/delta.mdx | 3 +++ 1 file changed, 3 insertions(+) diff --git a/src/content/docs/extensions/delta.mdx b/src/content/docs/extensions/delta.mdx index 8a75f99a..bceb7b6d 100644 --- a/src/content/docs/extensions/delta.mdx +++ b/src/content/docs/extensions/delta.mdx @@ -129,3 +129,6 @@ RETURN *; CREATE NODE TABLE student (name STRING, ID INT64, PRIMARY KEY(ID)); COPY student FROM 's3://kuzu-sample/student-delta' (file_format='delta') ``` + +### Limitations +Writing (i.e., exporting to) Delta lake is currently not supported. From 488437d4db9a28849a21fd9a180dc4c202e4d9ca Mon Sep 17 00:00:00 2001 From: prrao87 Date: Thu, 19 Dec 2024 16:28:20 -0500 Subject: [PATCH 6/6] Fixes --- src/content/docs/extensions/delta.mdx | 82 ++++++++++++++++----------- 1 file changed, 48 insertions(+), 34 deletions(-) diff --git a/src/content/docs/extensions/delta.mdx b/src/content/docs/extensions/delta.mdx index bceb7b6d..e26cd1a6 100644 --- a/src/content/docs/extensions/delta.mdx +++ b/src/content/docs/extensions/delta.mdx @@ -1,5 +1,5 @@ --- -title: "DELTA extension" +title: "Delta Lake" --- import { Tabs, TabItem } from '@astrojs/starlight/components'; @@ -7,11 +7,11 @@ import { Tabs, TabItem } from '@astrojs/starlight/components'; ## Usage The `delta` extension adds support for scanning/copying from the [`Delta Lake open-source storage format`](https://delta.io/). 
Using this extension, you can -interact with DELTA tables using [`LOAD FROM`](/cypher/query-clauses/load-from), +interact with Delta tables using [`LOAD FROM`](/cypher/query-clauses/load-from), [`COPY FROM`](/import/copy-from-query-results), similar to how you would with CSV files. -The DELTA functionality is not available by default, so you would first need to install the DELTA +The Delta functionality is not available by default, so you would first need to install the `DELTA` extension by running the following commands: ```sql @@ -21,14 +21,15 @@ LOAD EXTENSION DELTA; ### Example dataset -Let's look at an example dataset to demonstrate how the DELTA extension can be used. -Firstly, let's create a DELTA table containing student information using python and save the delta table in the `'/tmp/student'` directory: -Before running the script, make sure the deltalake package is properly installed by: +Let's look at an example dataset to demonstrate how the Delta extension can be used. +Firstly, let's create a Delta table containing student information using Python and save the Delta table in the `'/tmp/student'` directory: +Before running the script, make sure the `deltalake` Python package is properly installed (we will also use Pandas). ```shell -pip3 install deltalake +pip install deltalake pandas ``` ```python +# create_delta_table.py import pandas as pd from deltalake import DeltaTable, write_deltalake @@ -40,23 +41,19 @@ student = { write_deltalake(f"/tmp/student", pd.DataFrame.from_dict(student)) ``` -In the following sections, we will first scan the DELTA table to query its contents in Cypher, and +In the following sections, we will first scan the Delta table to query its contents in Cypher, and then proceed to copy the data and construct a node table. 
-### Scan the DELTA table -`LOAD FROM` is a Cypher query that scans a file or object element by element, but doesn’t actually +### Scan the Delta table +`LOAD FROM` is a Cypher clause that scans a file or object element by element, but doesn’t actually move the data into a Kùzu table. -To scan the delta table created above, you can do the following: +To scan the Delta table created above, you can do the following: ```cypher -LOAD FROM '/tmp/student'(file_format='delta') RETURN *; +LOAD FROM '/tmp/student' (file_format='delta') RETURN *; +``` ``` -Note: The `file_format` parameter is used to explicitly specify the file format of the given file instead of letting kuzu sniff the file format at runtime. When scanning from the DELTA table, `file_format` option must be provided since kuzu is not capable of sniffing delta tables. - -Result: -```cypher -kuzu> LOAD FROM '/tmp/student'(file_format='delta') RETURN *; ┌────────┬───────┐ │ name │ ID │ │ STRING │ INT64 │ @@ -66,27 +63,37 @@ kuzu> LOAD FROM '/tmp/student'(file_format='delta') RETURN *; │ Carol │ 7 │ └────────┴───────┘ ``` +:::note[Note] +The `file_format` parameter explicitly specifies the file format of the given file instead of letting Kùzu autodetect it at runtime. +When scanning a Delta table, the `file_format` option must be provided because Kùzu cannot autodetect Delta tables. +::: -### Copy the DELTA table into a node table -You can then use a `COPY FROM` statement to directly copy the contents of the DELTA table into a node table. +### Copy the Delta table into a node table +You can then use a `COPY FROM` statement to directly copy the contents of the Delta table into a Kùzu node table. ```cypher CREATE NODE TABLE student (name STRING, ID INT64, PRIMARY KEY(ID)); COPY student FROM '/tmp/student' (file_format='delta') ``` -Note: The `file_format` parameter is also needed in the copy from clause as mentioned in the `LOAD FROM` section. 
-Result: +As with `LOAD FROM`, the `file_format` parameter is mandatory in the `COPY FROM` statement. + ```cypher -kuzu> CREATE NODE TABLE student (name STRING, ID INT64, PRIMARY KEY(ID)); +// First, create the node table +CREATE NODE TABLE student (name STRING, ID INT64, PRIMARY KEY(ID)); +``` +``` ┌─────────────────────────────────┐ │ result │ │ STRING │ ├─────────────────────────────────┤ │ Table student has been created. │ └─────────────────────────────────┘ - -kuzu> COPY student FROM '/tmp/student' (file_format='delta'); +``` +```cypher +COPY student FROM '/tmp/student' (file_format='delta'); +``` +``` ┌─────────────────────────────────────────────────┐ │ result │ │ STRING │ @@ -95,11 +102,11 @@ kuzu> COPY student FROM '/tmp/student' (file_format='delta'); └─────────────────────────────────────────────────┘ ``` -### Access the DELTA table hosted on S3 -Kùzu also supports scanning/copying a DELTA table hosted on S3 in the same way as from a local file system. -Before reading and writing from S3, users have to configure using the [CALL](https://kuzudb.com/docusaurus/cypher/configuration) statement. +### Access Delta tables hosted on S3 +Kùzu also supports scanning/copying a Delta table hosted on S3 in the same way as from a local file system. +Before reading from S3, you have to configure the connection using the [CALL](https://kuzudb.com/docusaurus/cypher/configuration) statement. 
-### Supported options: +#### Supported options | Option name | Description | |----------|----------| @@ -109,26 +116,33 @@ Before reading and writing from S3, users have to configure using the [CALL](htt | `s3_url_style` | Uses [S3 url style](https://docs.aws.amazon.com/AmazonS3/latest/userguide/VirtualHosting.html) (should either be vhost or path) | | `s3_region` | S3 region | -### Requirements on the S3 server API +#### Requirements on the S3 server API | Feature | Required S3 API features | |----------|----------| | Public file reads | HTTP Range request | | Private file reads | Secret key authentication| -### Read DELTA table from S3: -Reading from S3 is as simple as reading from regular files: +#### Scan Delta table from S3 +Reading or scanning a Delta table that's on S3 is as simple as reading from regular files: ```sql LOAD FROM 's3://kuzu-sample/sample-delta' (file_format='delta') -RETURN *; +RETURN * ``` -### Copy DELTA table hosted on S3 into a local node table +#### Copy Delta table hosted on S3 into a local node table + +Copying from Delta tables on S3 is also as simple as copying from regular files: + ```cypher CREATE NODE TABLE student (name STRING, ID INT64, PRIMARY KEY(ID)); COPY student FROM 's3://kuzu-sample/student-delta' (file_format='delta') ``` ### Limitations -Writing (i.e., exporting to) Delta lake is currently not supported. + +When using the Delta Lake extension in Kùzu, keep the following limitations in mind. + +- Writing (i.e., exporting to) Delta files is currently not supported. +- We currently do not support scanning/copying nested data (i.e., of type `STRUCT`) in the Delta table columns.
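The S3 section in this patch lists the supported options and shows the scan and copy statements, but not how the pieces fit together. The sketch below is one possible way to assemble those statements as strings in Python. It is illustrative only and not part of the patch: the helper names (`s3_config_statements`, `delta_scan_query`, `delta_copy_query`) are hypothetical, the `CALL <option>=<value>;` form is an assumption based on the linked configuration page, and the region value is a placeholder.

```python
# Illustrative helpers, not part of the Kùzu API.
# Option names are taken from the "Supported options" table in delta.mdx.
SUPPORTED_S3_OPTIONS = (
    "s3_access_key_id",
    "s3_secret_access_key",
    "s3_endpoint",
    "s3_url_style",
    "s3_region",
)


def _quote(value: str) -> str:
    """Wrap a value in single quotes; reject embedded quotes instead of guessing escape rules."""
    if "'" in value:
        raise ValueError(f"value must not contain a single quote: {value!r}")
    return f"'{value}'"


def s3_config_statements(options: dict[str, str]) -> list[str]:
    """Build one statement per option (assumed form: CALL <option>=<value>;)."""
    statements = []
    for name, value in options.items():
        if name not in SUPPORTED_S3_OPTIONS:
            raise ValueError(f"unsupported S3 option: {name}")
        if name == "s3_url_style" and value not in ("vhost", "path"):
            raise ValueError("s3_url_style should be either 'vhost' or 'path'")
        statements.append(f"CALL {name}={_quote(value)};")
    return statements


def delta_scan_query(path: str) -> str:
    """Scan a Delta table; file_format is always set because it cannot be autodetected."""
    return f"LOAD FROM {_quote(path)} (file_format='delta') RETURN *;"


def delta_copy_query(table: str, path: str) -> str:
    """Copy a Delta table into an existing node table."""
    return f"COPY {table} FROM {_quote(path)} (file_format='delta');"


if __name__ == "__main__":
    # Placeholder configuration values, for illustration only.
    for stmt in s3_config_statements({"s3_region": "us-east-1", "s3_url_style": "vhost"}):
        print(stmt)
    print(delta_scan_query("s3://kuzu-sample/sample-delta"))
    print(delta_copy_query("student", "s3://kuzu-sample/student-delta"))
```

Each generated string could then be passed to `execute` on a `kuzu.Connection`, after the extension has been installed and loaded. Keeping `file_format='delta'` inside the helpers means a caller cannot omit the one parameter the docs describe as mandatory.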