v0.7.1 release (#321)
* add docs for running extension tests

* Delta Lake docs (#313)

* Create delta.mdx

* Update delta.mdx

* Update index.mdx

* Update delta.mdx

* Update delta.mdx

* Fixes

---------

Co-authored-by: Prashanth Rao <[email protected]>
Co-authored-by: prrao87 <[email protected]>

* Add Iceberg Extension Documentation (#314)

* add ice_berg docu

* Update src/content/docs/extensions/iceberg.mdx

Co-authored-by: Guodong Jin <[email protected]>

* Update src/content/docs/extensions/iceberg.mdx

Co-authored-by: Guodong Jin <[email protected]>

* restructure

* restructure

* restructure

* update table

* update table

* Apply suggestions from code review

* update table

* Fixes

---------

Co-authored-by: Guodong Jin <[email protected]>
Co-authored-by: Prashanth Rao <[email protected]>
Co-authored-by: prrao87 <[email protected]>

* Fix file extension

* Fix header

* Update sidebar

* Minor fixes

* bump version

---------

Co-authored-by: sterling <[email protected]>
Co-authored-by: ziyi chen <[email protected]>
Co-authored-by: Sterling Shi <[email protected]>
Co-authored-by: Guodong Jin <[email protected]>
5 people authored Dec 20, 2024
1 parent 252c198 commit c0f5ff4
Showing 6 changed files with 448 additions and 18 deletions.
2 changes: 2 additions & 0 deletions astro.config.mjs
@@ -196,6 +196,8 @@ export default defineConfig({
]
},
{ label: 'JSON', link: '/extensions/json' },
{ label: 'Iceberg', link: '/extensions/iceberg', badge: { text: 'New' }},
{ label: 'Delta Lake', link: '/extensions/delta', badge: { text: 'New' }},
],
autogenerate: { directory: 'reference' },
},
45 changes: 40 additions & 5 deletions src/content/docs/developer-guide/testing-framework.md
@@ -15,7 +15,7 @@
you must specify the dataset to be used and other optional
parameters such as `BUFFER_POOL_SIZE`.

:::caution[Note]
Avoid using the character `-` in test file names and case names. In the Google Test Framework, `-` has a special meaning that can inadvertently exclude a test case, leading to the test file being silently skipped. To prevent this issue, our `e2e_test` framework will throw an exception if a test file name contains `-`.
:::

Here is a basic example of a test:
@@ -36,15 +36,18 @@
The first three lines represent the header, separated by `--`. The testing
framework will parse the file and register a [GTest
programmatically](http://google.github.io/googletest/advanced.html#registering-tests-programmatically).
All e2e tests are registered with the prefix `e2e_test_`, which distinguishes them from other internal tests: an e2e test named `BasicTest` is registered as a GTest named `e2e_test_BasicTest`. In terms of the test case name, the example above is therefore equivalent to:

```
TEST_F(basic, e2e_test_BasicTest) {
...
}
```

For the main source code tests, the test group name will be the relative path of the file under the `test/test_files` directory, delimited by `~`, followed by a dot and the test case name.

For the extension code tests, the test group name will be the relative path of the file under the `extension/name_of_extension/test/test_files` directory, delimited by `~`, followed by a dot and the test case name.
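As a sketch, the naming scheme described above can be mimicked in a few lines of Python (`gtest_full_name` is a hypothetical helper for illustration, not part of the framework):

```python
from pathlib import PurePosixPath

def gtest_full_name(relative_path: str, case_name: str) -> str:
    # Join the path components under test_files with '~' (dropping the
    # .test suffix), then append '.' and the test case name.
    group = "~".join(PurePosixPath(relative_path).with_suffix("").parts)
    return f"{group}.{case_name}"

# A case DifferentTypesCheck defined in test/test_files/common/types/interval.test:
print(gtest_full_name("common/types/interval.test", "DifferentTypesCheck"))
# common~types~interval.DifferentTypesCheck
```

The resulting name is exactly what you pass to `ctest -R` to run a single case, as shown in the commands below.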

The testing framework will test each logical plan created from the prepared
statements and assert the result.
Expand Down Expand Up @@ -81,10 +84,29 @@ $ ctest -V -R common~types~interval.DifferentTypesCheck
$ ctest -j 10
```

To switch between the main tests and the extension tests, set the environment variable `E2E_TEST_FILES_DIRECTORY=extension` when calling `ctest`.

Example:

```
# First cd to build/relwithdebinfo/test (after running make extension-test)
$ cd build/relwithdebinfo/test
# Run all the extension tests (-R e2e_test is used to filter the extension tests, as all extension tests are e2e tests)
$ E2E_TEST_FILES_DIRECTORY=extension ctest -R e2e_test
```

:::caution[Note]
Windows uses a different syntax for setting environment variables. To run all extension tests on Windows, run:
```
$ set "E2E_TEST_FILES_DIRECTORY=extension" && ctest -R e2e_test
```
:::
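If you drive `ctest` from a Python script rather than the shell, the same variable can be set portably across POSIX and Windows (a hypothetical sketch, not part of the framework):

```python
import os

def extension_ctest_invocation():
    # Same effect as the shell one-liners above: copy the current
    # environment and add E2E_TEST_FILES_DIRECTORY=extension.
    env = dict(os.environ)
    env["E2E_TEST_FILES_DIRECTORY"] = "extension"
    return ["ctest", "-R", "e2e_test"], env

cmd, env = extension_ctest_invocation()
# Pass these to subprocess.run(cmd, env=env, cwd="build/relwithdebinfo/test")
print(cmd, env["E2E_TEST_FILES_DIRECTORY"])
```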

#### 2. Running directly from the `e2e_test` binary

The test binaries are available in the `build/relwithdebinfo[or debug or release]/test/runner`
folder. To run any of the main tests, run `e2e_test`, specifying the relative path of the file inside
`test_files`:

```
@@ -98,6 +120,19 @@
$ ./e2e_test long_string_pk/long_string_pk.test
$ ./e2e_test .
```

To run any of the extension tests, run `e2e_test` with the environment variable `E2E_TEST_FILES_DIRECTORY=extension` set, specifying the relative path of the file inside
`extension`:
```
# Run all tests inside extension/duckdb
$ E2E_TEST_FILES_DIRECTORY=extension ./e2e_test duckdb
# Run all tests from extension/json/test/copy_to_json.test file
$ E2E_TEST_FILES_DIRECTORY=extension ./e2e_test json/test/copy_to_json.test
# Run all extension tests
$ E2E_TEST_FILES_DIRECTORY=extension ./e2e_test .
```

:::caution[Note]
Some test files contain multiple test cases, and sometimes it is not easy
to find the output from a failed test. In this situation, the flag
145 changes: 145 additions & 0 deletions src/content/docs/extensions/delta.md
@@ -0,0 +1,145 @@
---
title: "Delta Lake"
---

## Usage

The `delta` extension adds support for scanning and copying from the [Delta Lake](https://delta.io/) open-source storage format.
Delta Lake is an open-source storage framework that enables building a format-agnostic Lakehouse architecture.
Using this extension, you can interact with Delta tables from within Kùzu using the `LOAD FROM` and `COPY FROM` clauses.

The Delta functionality is not available by default, so you first need to install the `DELTA`
extension by running the following commands:

```sql
INSTALL DELTA;
LOAD EXTENSION DELTA;
```

### Example dataset

Let's look at an example dataset to demonstrate how the Delta extension can be used.
First, let's create a Delta table containing student information using Python and save it in the `/tmp/student` directory.
Before running the script, make sure the `deltalake` and `pandas` Python packages are installed:
```shell
pip install deltalake pandas
```

```python
# create_delta_table.py
import pandas as pd
from deltalake import DeltaTable, write_deltalake

student = {
"name": ["Alice", "Bob", "Carol"],
"ID": [0, 3, 7]
}

write_deltalake("/tmp/student", pd.DataFrame.from_dict(student))
```

In the following sections, we will first scan the Delta table to query its contents in Cypher, and
then proceed to copy the data and construct a node table.

### Scan the Delta table
`LOAD FROM` is a Cypher clause that scans a file or object element by element, but doesn’t actually
move the data into a Kùzu table.

To scan the Delta table created above, you can do the following:

```cypher
LOAD FROM '/tmp/student' (file_format='delta') RETURN *;
```
```
┌────────┬───────┐
│ name │ ID │
│ STRING │ INT64 │
├────────┼───────┤
│ Alice │ 0 │
│ Bob │ 3 │
│ Carol │ 7 │
└────────┴───────┘
```
:::note[Note]
The `file_format` parameter explicitly specifies the file format of the given file instead of letting Kùzu autodetect it at runtime.
When scanning a Delta table, the `file_format` option must be provided, since Kùzu cannot autodetect Delta tables.
:::

### Copy the Delta table into a node table
You can then use a `COPY FROM` statement to directly copy the contents of the Delta table into a Kùzu node table.

```cypher
CREATE NODE TABLE student (name STRING, ID INT64, PRIMARY KEY(ID));
COPY student FROM '/tmp/student' (file_format='delta');
```

As with `LOAD FROM`, the `file_format` parameter is also mandatory in the `COPY FROM` clause.

```cypher
// First, create the node table
CREATE NODE TABLE student (name STRING, ID INT64, PRIMARY KEY(ID));
```
```
┌─────────────────────────────────┐
│ result │
│ STRING │
├─────────────────────────────────┤
│ Table student has been created. │
└─────────────────────────────────┘
```
```cypher
COPY student FROM '/tmp/student' (file_format='delta');
```
```
┌─────────────────────────────────────────────────┐
│ result │
│ STRING │
├─────────────────────────────────────────────────┤
│ 3 tuples have been copied to the student table. │
└─────────────────────────────────────────────────┘
```
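As a quick sanity check (a sketch, assuming the `student` table created above), you can query the copied rows:

```cypher
MATCH (s:student) RETURN s.name, s.ID ORDER BY s.ID;
```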

### Access Delta tables hosted on S3
Kùzu also supports scanning or copying a Delta table hosted on S3 in the same way as from a local file system.
Before reading from S3, you have to configure the connection using the [CALL](https://kuzudb.com/docusaurus/cypher/configuration) statement.

#### Supported options

| Option name | Description |
|----------|----------|
| `s3_access_key_id` | S3 access key id |
| `s3_secret_access_key` | S3 secret access key |
| `s3_endpoint` | S3 endpoint |
| `s3_url_style` | [S3 URL style](https://docs.aws.amazon.com/AmazonS3/latest/userguide/VirtualHosting.html) (either `vhost` or `path`) |
| `s3_region` | S3 region |
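For example, the options above can be set as follows before scanning (a sketch: the values are placeholders to replace with your own credentials, and the endpoint and region shown are assumptions for illustration):

```cypher
CALL s3_access_key_id='<your_access_key_id>';
CALL s3_secret_access_key='<your_secret_access_key>';
CALL s3_endpoint='s3.us-east-1.amazonaws.com';
CALL s3_url_style='vhost';
CALL s3_region='us-east-1';
```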

#### Requirements on the S3 server API

| Feature | Required S3 API features |
|----------|----------|
| Public file reads | HTTP Range request |
| Private file reads | Secret key authentication|

#### Scan Delta table from S3
Reading or scanning a Delta table that's on S3 is as simple as reading from regular files:

```cypher
LOAD FROM 's3://kuzu-sample/sample-delta' (file_format='delta')
RETURN *
```

#### Copy Delta table hosted on S3 into a local node table

Copying from Delta tables on S3 is also as simple as copying from regular files:

```cypher
CREATE NODE TABLE student (name STRING, ID INT64, PRIMARY KEY(ID));
COPY student FROM 's3://kuzu-sample/student-delta' (file_format='delta')
```

## Limitations

When using the Delta Lake extension in Kùzu, keep the following limitation in mind.

- Writing (i.e., exporting to) Delta files from Kùzu is currently not supported.
