Delta Lake docs #313

Merged 7 commits, Dec 19, 2024
126 changes: 126 additions & 0 deletions src/content/docs/extensions/delta.mdx
@@ -0,0 +1,126 @@
---
title: "DELTA extension"
---

import { Tabs, TabItem } from '@astrojs/starlight/components';

## Usage

The `delta` extension adds support for scanning and copying from the [Delta Lake open-source storage format](https://delta.io/). Using this extension, you can
interact with DELTA tables using [`LOAD FROM`](/cypher/query-clauses/load-from) and
[`COPY FROM`](/import/copy-from-query-results), similar to how you would
with CSV files.

The DELTA functionality is not available by default, so you first need to install and load the DELTA
extension by running the following commands:

```sql
INSTALL DELTA;
LOAD EXTENSION DELTA;
```

### Example dataset

Let's look at an example dataset to demonstrate how the DELTA extension can be used.
First, let's create a DELTA table containing student information using Python and save it in the `/tmp/student` directory:
```python
import pandas as pd
from deltalake import write_deltalake

# Student records to store in the DELTA table
student = {
    "name": ["Alice", "Bob", "Carol"],
    "ID": [0, 3, 7],
}

# Write the DataFrame to /tmp/student as a DELTA table
write_deltalake("/tmp/student", pd.DataFrame.from_dict(student))
```

In the following sections, we will first scan the DELTA table to query its contents in Cypher, and
then proceed to copy the data and construct a node table.
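
For orientation, a Delta table on disk is a directory rather than a single file: it holds Parquet data files plus a `_delta_log` subdirectory of JSON commit files. The sketch below uses only the Python standard library to check whether a path has that layout. The helper name `looks_like_delta_table` is hypothetical and is not part of Kùzu or the `deltalake` package; treat it as a rough heuristic, not a validator.

```python
from pathlib import Path


def looks_like_delta_table(path: str) -> bool:
    """Heuristic check: a Delta table directory contains a `_delta_log`
    folder with at least one JSON commit file."""
    log_dir = Path(path) / "_delta_log"
    return log_dir.is_dir() and any(log_dir.glob("*.json"))


if __name__ == "__main__":
    # '/tmp/student' is the directory written by the example above.
    print(looks_like_delta_table("/tmp/student"))
```

A check like this can be handy before pointing a query at a path, since the table is identified by its directory and not by a file extension.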

### Scan the DELTA table
`LOAD FROM` is a Cypher clause that scans a file or object element by element without
moving the data into a Kùzu table.

To scan the DELTA table created above, you can do the following:

```cypher
LOAD FROM '/tmp/student'(file_format='delta') RETURN *;
```
Note: The `file_format` parameter explicitly specifies the file format instead of letting Kùzu detect it at runtime. When scanning a DELTA table, the `file_format` option must be provided, because Kùzu cannot detect DELTA tables automatically.

Result:
```cypher
kuzu> LOAD FROM '/tmp/student'(file_format='delta') RETURN *;
┌────────┬───────┐
│ name   │ ID    │
│ STRING │ INT64 │
├────────┼───────┤
│ Alice  │ 0     │
│ Bob    │ 3     │
│ Carol  │ 7     │
└────────┴───────┘
```
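
Because `file_format='delta'` is mandatory for DELTA scans, it can help to generate the statement in one place so the option is never forgotten. The following is a minimal sketch in plain Python; the helper name `delta_scan_query` is hypothetical, and rejecting paths that contain single quotes is a simplifying assumption rather than a rule taken from Kùzu.

```python
def delta_scan_query(path: str) -> str:
    """Build a LOAD FROM statement that scans a DELTA table.

    The file_format option is always included, since the format of a
    DELTA table is not detected from the path.
    """
    # Simplifying assumption: refuse paths that would break the string literal.
    if "'" in path:
        raise ValueError("path must not contain single quotes")
    return f"LOAD FROM '{path}' (file_format='delta') RETURN *;"


if __name__ == "__main__":
    print(delta_scan_query("/tmp/student"))
```

The resulting string can then be passed to whichever Kùzu client you use to run queries.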

### Copy the DELTA table into a node table
You can then use a `COPY FROM` statement to directly copy the contents of the DELTA table into a node table.

```cypher
CREATE NODE TABLE student (name STRING, ID INT64, PRIMARY KEY(ID));
COPY student FROM '/tmp/student' (file_format='delta');
```
Note: As with `LOAD FROM`, the `file_format` parameter is required in the `COPY FROM` statement.

Result:
```cypher
kuzu> CREATE NODE TABLE student (name STRING, ID INT64, PRIMARY KEY(ID));
┌─────────────────────────────────┐
│ result                          │
│ STRING                          │
├─────────────────────────────────┤
│ Table student has been created. │
└─────────────────────────────────┘

kuzu> COPY student FROM '/tmp/student' (file_format='delta');
┌─────────────────────────────────────────────────┐
│ result                                          │
│ STRING                                          │
├─────────────────────────────────────────────────┤
│ 3 tuples have been copied to the student table. │
└─────────────────────────────────────────────────┘
```

### Access the DELTA table hosted on S3
Kùzu also supports scanning and copying a DELTA table hosted on S3 in the same way as from a local file system.
Before accessing S3, you must configure the connection options using the [`CALL`](https://kuzudb.com/docusaurus/cypher/configuration) statement.

### Supported options

| Option name | Description |
|----------|----------|
| `s3_access_key_id` | S3 access key ID |
| `s3_secret_access_key` | S3 secret access key |
| `s3_endpoint` | S3 endpoint |
| `s3_url_style` | [S3 URL style](https://docs.aws.amazon.com/AmazonS3/latest/userguide/VirtualHosting.html) to use; must be either `vhost` or `path` |
| `s3_region` | S3 region |
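
As a rough illustration of how these options might be applied, the sketch below renders one configuration statement per option from a Python dictionary. The statement form `CALL <option>='<value>';` is an assumption that this page does not show, so check the linked configuration page before relying on it. The helper name `s3_config_statements` and the sample values are hypothetical.

```python
from typing import Dict, List

# Option names taken from the table above.
SUPPORTED_S3_OPTIONS = (
    "s3_access_key_id",
    "s3_secret_access_key",
    "s3_endpoint",
    "s3_url_style",
    "s3_region",
)


def s3_config_statements(options: Dict[str, str]) -> List[str]:
    """Render one configuration statement per option.

    ASSUMPTION: the form CALL <option>='<value>'; is not confirmed by
    this page; verify it against the configuration documentation.
    """
    statements = []
    for name, value in options.items():
        if name not in SUPPORTED_S3_OPTIONS:
            raise ValueError(f"unsupported option: {name}")
        if name == "s3_url_style" and value not in ("vhost", "path"):
            raise ValueError("s3_url_style must be 'vhost' or 'path'")
        if "'" in value:
            raise ValueError("values must not contain single quotes")
        statements.append(f"CALL {name}='{value}';")
    return statements


if __name__ == "__main__":
    # Placeholder values only; never hard-code real credentials.
    sample = {"s3_region": "us-east-1", "s3_url_style": "vhost"}
    for statement in s3_config_statements(sample):
        print(statement)
```

Validating option names up front means a typo surfaces as a Python error instead of a failed query.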

### Requirements on the S3 server API

| Feature | Required S3 API features |
|----------|----------|
| Public file reads | HTTP Range requests |
| Private file reads | Secret key authentication |
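
To make the first requirement concrete: an HTTP Range request asks the server for a byte slice of an object instead of the whole object. The sketch below prepares such a request with the Python standard library without sending it. The URL is a placeholder and the helper name `build_range_request` is hypothetical; how Kùzu issues these requests internally is not covered here.

```python
import urllib.request


def build_range_request(url: str, start: int, end: int) -> urllib.request.Request:
    """Prepare (but do not send) a GET request for bytes start..end, inclusive."""
    if start < 0 or end < start:
        raise ValueError("expected 0 <= start <= end")
    request = urllib.request.Request(url, method="GET")
    # The Range header selects an inclusive byte interval of the object.
    request.add_header("Range", f"bytes={start}-{end}")
    return request


if __name__ == "__main__":
    req = build_range_request("https://example.com/data.parquet", 0, 1023)
    print(req.get_header("Range"))
```

A server that supports range requests answers with status `206 Partial Content` and only the requested bytes.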

### Read DELTA table from S3
Reading from S3 is as simple as reading from regular files:

```cypher
LOAD FROM 's3://kuzu-sample/sample-delta' (file_format='delta')
RETURN *;
```
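
The `s3_url_style` option decides how an `s3://` path maps to an HTTPS endpoint. The sketch below shows the two AWS conventions, virtual-hosted (`vhost`) and path style, for a given bucket and key. The region `us-east-1` is an assumed placeholder, since this page does not state the bucket's region, and a custom `s3_endpoint` would produce different hosts. The helper name `s3_https_url` is hypothetical.

```python
from urllib.parse import urlsplit


def s3_https_url(s3_uri: str, region: str, url_style: str) -> str:
    """Map an s3://bucket/key URI to an AWS HTTPS URL in the given style."""
    parts = urlsplit(s3_uri)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError("expected a URI of the form s3://bucket/key")
    bucket = parts.netloc
    key = parts.path.lstrip("/")
    if url_style == "vhost":
        # Virtual-hosted style: the bucket is part of the host name.
        return f"https://{bucket}.s3.{region}.amazonaws.com/{key}"
    if url_style == "path":
        # Path style: the bucket is the first path segment.
        return f"https://s3.{region}.amazonaws.com/{bucket}/{key}"
    raise ValueError("url_style must be 'vhost' or 'path'")


if __name__ == "__main__":
    # The region is an assumption made for illustration only.
    print(s3_https_url("s3://kuzu-sample/sample-delta", "us-east-1", "vhost"))
    print(s3_https_url("s3://kuzu-sample/sample-delta", "us-east-1", "path"))
```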

### Copy DELTA table hosted on S3 into a local node table
```cypher
CREATE NODE TABLE student (name STRING, ID INT64, PRIMARY KEY(ID));
COPY student FROM 's3://kuzu-sample/student-delta' (file_format='delta');
```