From 56d451cc61adc25f6acfdd771234ebd6f09e3fa4 Mon Sep 17 00:00:00 2001 From: ziyi chen Date: Thu, 19 Dec 2024 16:29:55 -0500 Subject: [PATCH 01/15] Delta Lake docs (#313) * Create delta.mdx * Update delta.mdx * Update index.mdx * Update delta.mdx * Update delta.mdx * Fixes --------- Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com> Co-authored-by: prrao87 --- src/content/docs/extensions/delta.mdx | 148 ++++++++++++++++++++++++++ 1 file changed, 148 insertions(+) create mode 100644 src/content/docs/extensions/delta.mdx diff --git a/src/content/docs/extensions/delta.mdx b/src/content/docs/extensions/delta.mdx new file mode 100644 index 00000000..e26cd1a6 --- /dev/null +++ b/src/content/docs/extensions/delta.mdx @@ -0,0 +1,148 @@ +--- +title: "Delta Lake" +--- + +import { Tabs, TabItem } from '@astrojs/starlight/components'; + +## Usage + +The `delta` extension adds support for scanning/copying from the [`Delta Lake open-source storage format`](https://delta.io/). Using this extension, you can +interact with Delta tables using [`LOAD FROM`](/cypher/query-clauses/load-from), +[`COPY FROM`](/import/copy-from-query-results), similar to how you would +with CSV files. + +The Delta functionality is not available by default, so you would first need to install the `DELTA` +extension by running the following commands: + +```sql +INSTALL DELTA; +LOAD EXTENSION DELTA; +``` + +### Example dataset + +Let's look at an example dataset to demonstrate how the Delta extension can be used. +Firstly, let's create a Delta table containing student information using Python and save the Delta table in the `'/tmp/student'` directory: +Before running the script, make sure the `deltalake` Python package is properly installed (we will also use Pandas). 
+```shell +pip install deltalake pandas +``` + +```python +# create_delta_table.py +import pandas as pd +from deltalake import DeltaTable, write_deltalake + +student = { + "name": ["Alice", "Bob", "Carol"], + "ID": [0, 3, 7] +} + +write_deltalake(f"/tmp/student", pd.DataFrame.from_dict(student)) +``` + +In the following sections, we will first scan the Delta table to query its contents in Cypher, and +then proceed to copy the data and construct a node table. + +### Scan the Delta table +`LOAD FROM` is a Cypher clause that scans a file or object element by element, but doesn’t actually +move the data into a Kùzu table. + +To scan the Delta table created above, you can do the following: + +```cypher +LOAD FROM '/tmp/student' (file_format='delta') RETURN *; +``` +``` +┌────────┬───────┐ +│ name │ ID │ +│ STRING │ INT64 │ +├────────┼───────┤ +│ Alice │ 0 │ +│ Bob │ 3 │ +│ Carol │ 7 │ +└────────┴───────┘ +``` +:::note[Note] +Note: The `file_format` parameter is used to explicitly specify the file format of the given file instead of letting Kùzu autodetect the file format at runtime. +When scanning from the Delta table, `file_format` option must be provided since Kùzu is not capable of autodetecting Delta tables. +::: + +### Copy the Delta table into a node table +You can then use a `COPY FROM` statement to directly copy the contents of the Delta table into a Kùzu node table. + +```cypher +CREATE NODE TABLE student (name STRING, ID INT64, PRIMARY KEY(ID)); +COPY student FROM '/tmp/student' (file_format='delta') +``` + +Just like above in `LOAD FROM`, the `file_format` parameter is mandatory when specifying the `COPY FROM` clause as well. + +```cypher +// First, create the node table +CREATE NODE TABLE student (name STRING, ID INT64, PRIMARY KEY(ID)); +``` +``` +┌─────────────────────────────────┐ +│ result │ +│ STRING │ +├─────────────────────────────────┤ +│ Table student has been created. 
│ +└─────────────────────────────────┘ +``` +```cypher +COPY student FROM '/tmp/student' (file_format='delta'); +``` +``` +┌─────────────────────────────────────────────────┐ +│ result │ +│ STRING │ +├─────────────────────────────────────────────────┤ +│ 3 tuples have been copied to the student table. │ +└─────────────────────────────────────────────────┘ +``` + +### Access Delta tables hosted on S3 +Kùzu also supports scanning/copying a Delta table hosted on S3 in the same way as from a local file system. +Before reading and writing from S3, you have to configure the connection using the [CALL](https://kuzudb.com/docusaurus/cypher/configuration) statement. + +#### Supported options + +| Option name | Description | +|----------|----------| +| `s3_access_key_id` | S3 access key id | +| `s3_secret_access_key` | S3 secret access key | +| `s3_endpoint` | S3 endpoint | +| `s3_url_style` | Uses [S3 url style](https://docs.aws.amazon.com/AmazonS3/latest/userguide/VirtualHosting.html) (should either be vhost or path) | +| `s3_region` | S3 region | + +#### Requirements on the S3 server API + +| Feature | Required S3 API features | +|----------|----------| +| Public file reads | HTTP Range request | +| Private file reads | Secret key authentication| + +#### Scan Delta table from S3 +Reading or scanning a Delta table that's on S3 is as simple as reading from regular files: + +```sql +LOAD FROM 's3://kuzu-sample/sample-delta' (file_format='delta') +RETURN * +``` + +#### Copy Delta table hosted on S3 into a local node table + +Copying from Delta tables on S3 is also as simple as copying from regular files: + +```cypher +CREATE NODE TABLE student (name STRING, ID INT64, PRIMARY KEY(ID)); +COPY student FROM 's3://kuzu-sample/student-delta' (file_format='delta') +``` + +### Limitations + +When using the Delta Lake extension in Kùzu, keep the following limitations in mind. + +- Writing (i.e., exporting to) Delta files is currently not supported. 
+- We currently do not support scanning/copying nested data (i.e., of type `STRUCT`) in the Delta table columns. From 7d075a7bf5e7369113b453d7fbc319eafbd5b3a5 Mon Sep 17 00:00:00 2001 From: Sterling Shi <156466823+SterlingT3485@users.noreply.github.com> Date: Thu, 19 Dec 2024 17:44:02 -0500 Subject: [PATCH 02/15] Add Iceberg Extension Documentation (#314) * add ice_berg docu * Update src/content/docs/extensions/iceberg.mdx Co-authored-by: Guodong Jin * Update src/content/docs/extensions/iceberg.mdx Co-authored-by: Guodong Jin * restructure * restructure * restructure * update table * update table * Apply suggestions from code review * update table * Fixes --------- Co-authored-by: Guodong Jin Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com> Co-authored-by: prrao87 --- src/content/docs/extensions/iceberg.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/src/content/docs/extensions/iceberg.md b/src/content/docs/extensions/iceberg.md index e8a17b28..53c98e2f 100644 --- a/src/content/docs/extensions/iceberg.md +++ b/src/content/docs/extensions/iceberg.md @@ -242,5 +242,9 @@ COPY student FROM 's3://path/to/iceberg_table' (file_format='iceberg', allow_mov When using the Iceberg extension in Kùzu, keep the following limitations in mind. +<<<<<<< HEAD - Writing (i.e., exporting to) Iceberg tables from Kùzu is currently not supported. +======= +- Writing (i.e., exporting to) Iceberg tables is currently not supported. +>>>>>>> 6bf35c9 (Add Iceberg Extension Documentation (#314)) - We currently do not support scanning/copying nested data (i.e., of type `STRUCT`) in the Iceberg table columns. 
From 7a0dff1a581d335bba08fa900ac7dbfefa27d5ba Mon Sep 17 00:00:00 2001 From: prrao87 Date: Thu, 19 Dec 2024 17:46:30 -0500 Subject: [PATCH 03/15] Fix file extension --- src/content/docs/extensions/delta.mdx | 148 -------------------------- 1 file changed, 148 deletions(-) delete mode 100644 src/content/docs/extensions/delta.mdx diff --git a/src/content/docs/extensions/delta.mdx b/src/content/docs/extensions/delta.mdx deleted file mode 100644 index e26cd1a6..00000000 --- a/src/content/docs/extensions/delta.mdx +++ /dev/null @@ -1,148 +0,0 @@ ---- -title: "Delta Lake" ---- - -import { Tabs, TabItem } from '@astrojs/starlight/components'; - -## Usage - -The `delta` extension adds support for scanning/copying from the [`Delta Lake open-source storage format`](https://delta.io/). Using this extension, you can -interact with Delta tables using [`LOAD FROM`](/cypher/query-clauses/load-from), -[`COPY FROM`](/import/copy-from-query-results), similar to how you would -with CSV files. - -The Delta functionality is not available by default, so you would first need to install the `DELTA` -extension by running the following commands: - -```sql -INSTALL DELTA; -LOAD EXTENSION DELTA; -``` - -### Example dataset - -Let's look at an example dataset to demonstrate how the Delta extension can be used. -Firstly, let's create a Delta table containing student information using Python and save the Delta table in the `'/tmp/student'` directory: -Before running the script, make sure the `deltalake` Python package is properly installed (we will also use Pandas). 
-```shell -pip install deltalake pandas -``` - -```python -# create_delta_table.py -import pandas as pd -from deltalake import DeltaTable, write_deltalake - -student = { - "name": ["Alice", "Bob", "Carol"], - "ID": [0, 3, 7] -} - -write_deltalake(f"/tmp/student", pd.DataFrame.from_dict(student)) -``` - -In the following sections, we will first scan the Delta table to query its contents in Cypher, and -then proceed to copy the data and construct a node table. - -### Scan the Delta table -`LOAD FROM` is a Cypher clause that scans a file or object element by element, but doesn’t actually -move the data into a Kùzu table. - -To scan the Delta table created above, you can do the following: - -```cypher -LOAD FROM '/tmp/student' (file_format='delta') RETURN *; -``` -``` -┌────────┬───────┐ -│ name │ ID │ -│ STRING │ INT64 │ -├────────┼───────┤ -│ Alice │ 0 │ -│ Bob │ 3 │ -│ Carol │ 7 │ -└────────┴───────┘ -``` -:::note[Note] -Note: The `file_format` parameter is used to explicitly specify the file format of the given file instead of letting Kùzu autodetect the file format at runtime. -When scanning from the Delta table, `file_format` option must be provided since Kùzu is not capable of autodetecting Delta tables. -::: - -### Copy the Delta table into a node table -You can then use a `COPY FROM` statement to directly copy the contents of the Delta table into a Kùzu node table. - -```cypher -CREATE NODE TABLE student (name STRING, ID INT64, PRIMARY KEY(ID)); -COPY student FROM '/tmp/student' (file_format='delta') -``` - -Just like above in `LOAD FROM`, the `file_format` parameter is mandatory when specifying the `COPY FROM` clause as well. - -```cypher -// First, create the node table -CREATE NODE TABLE student (name STRING, ID INT64, PRIMARY KEY(ID)); -``` -``` -┌─────────────────────────────────┐ -│ result │ -│ STRING │ -├─────────────────────────────────┤ -│ Table student has been created. 
│ -└─────────────────────────────────┘ -``` -```cypher -COPY student FROM '/tmp/student' (file_format='delta'); -``` -``` -┌─────────────────────────────────────────────────┐ -│ result │ -│ STRING │ -├─────────────────────────────────────────────────┤ -│ 3 tuples have been copied to the student table. │ -└─────────────────────────────────────────────────┘ -``` - -### Access Delta tables hosted on S3 -Kùzu also supports scanning/copying a Delta table hosted on S3 in the same way as from a local file system. -Before reading and writing from S3, you have to configure the connection using the [CALL](https://kuzudb.com/docusaurus/cypher/configuration) statement. - -#### Supported options - -| Option name | Description | -|----------|----------| -| `s3_access_key_id` | S3 access key id | -| `s3_secret_access_key` | S3 secret access key | -| `s3_endpoint` | S3 endpoint | -| `s3_url_style` | Uses [S3 url style](https://docs.aws.amazon.com/AmazonS3/latest/userguide/VirtualHosting.html) (should either be vhost or path) | -| `s3_region` | S3 region | - -#### Requirements on the S3 server API - -| Feature | Required S3 API features | -|----------|----------| -| Public file reads | HTTP Range request | -| Private file reads | Secret key authentication| - -#### Scan Delta table from S3 -Reading or scanning a Delta table that's on S3 is as simple as reading from regular files: - -```sql -LOAD FROM 's3://kuzu-sample/sample-delta' (file_format='delta') -RETURN * -``` - -#### Copy Delta table hosted on S3 into a local node table - -Copying from Delta tables on S3 is also as simple as copying from regular files: - -```cypher -CREATE NODE TABLE student (name STRING, ID INT64, PRIMARY KEY(ID)); -COPY student FROM 's3://kuzu-sample/student-delta' (file_format='delta') -``` - -### Limitations - -When using the Delta Lake extension in Kùzu, keep the following limitations in mind. - -- Writing (i.e., exporting to) Delta files is currently not supported. 
-- We currently do not support scanning/copying nested data (i.e., of type `STRUCT`) in the Delta table columns. From 2ea4ce4888d1056c4edddf6b61db9c7dcc41f924 Mon Sep 17 00:00:00 2001 From: prrao87 Date: Fri, 20 Dec 2024 08:34:03 -0500 Subject: [PATCH 04/15] Minor fixes --- src/content/docs/extensions/iceberg.md | 4 ---- 1 file changed, 4 deletions(-) diff --git a/src/content/docs/extensions/iceberg.md b/src/content/docs/extensions/iceberg.md index 53c98e2f..e8a17b28 100644 --- a/src/content/docs/extensions/iceberg.md +++ b/src/content/docs/extensions/iceberg.md @@ -242,9 +242,5 @@ COPY student FROM 's3://path/to/iceberg_table' (file_format='iceberg', allow_mov When using the Iceberg extension in Kùzu, keep the following limitations in mind. -<<<<<<< HEAD - Writing (i.e., exporting to) Iceberg tables from Kùzu is currently not supported. -======= -- Writing (i.e., exporting to) Iceberg tables is currently not supported. ->>>>>>> 6bf35c9 (Add Iceberg Extension Documentation (#314)) - We currently do not support scanning/copying nested data (i.e., of type `STRUCT`) in the Iceberg table columns. 
From 227f91b2ad102c78206334c16d87fe8f2bc8a71a Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E5=9B=A7=E5=9B=A7?= Date: Mon, 13 Jan 2025 23:28:47 +0800 Subject: [PATCH 05/15] Create wasm.mdx Update wasm.mdx Update docs (#331) Fix demo script Starting merge for 0.8.0 --- astro.config.mjs | 1 + src/content/docs/client-apis/wasm.mdx | 146 ++++++++++++++++++ .../docs/import/copy-from-dataframe.md | 6 +- src/content/docs/import/index.mdx | 8 +- 4 files changed, 156 insertions(+), 5 deletions(-) create mode 100644 src/content/docs/client-apis/wasm.mdx diff --git a/astro.config.mjs b/astro.config.mjs index cde86425..8d034755 100644 --- a/astro.config.mjs +++ b/astro.config.mjs @@ -148,6 +148,7 @@ export default defineConfig({ { label: 'Go', link: '/client-apis/go' }, { label: 'C++', link: '/client-apis/cpp' }, { label: 'C', link: '/client-apis/c' }, + { label: 'WebAssembly', link: '/client-apis/wasm' }, { label: '.NET', link: '/client-apis/net', badge: { text: 'Community', variant: 'caution'}}, { label: 'Elixir', link: '/client-apis/elixir', badge: { text: 'Community', variant: 'caution'}} ], diff --git a/src/content/docs/client-apis/wasm.mdx b/src/content/docs/client-apis/wasm.mdx new file mode 100644 index 00000000..7cd1b54f --- /dev/null +++ b/src/content/docs/client-apis/wasm.mdx @@ -0,0 +1,146 @@ +--- +title: WebAssembly (Wasm) +--- + +[WebAssembly](https://webassembly.org/), a.k.a. _Wasm_, is a standard that defines a low-level +compilation target for programming languages, enabling deployment of software within web browsers on a variety +of devices. This page describes Kùzu's Wasm API, which enables Kùzu databases to run inside Wasm-capable +browsers. 
+ +## Benefits of WASM + +Kùzu-Wasm enables the following: + +- Fast, in-browser graph analysis without ever sending data to a server +- Strong data privacy guarantees, as the data never leaves the browser +- Real-time interactive dashboards +- Lightweight, portable solutions that leverage graphs within web applications + +## Installation + +```bash +npm i kuzu-wasm +``` + +## Example usage + +The example shown below sets up a graph database in the browser using Kùzu's WebAssembly implementation. + +```js +// Import library +import kuzu from './index.js'; + +(async () => { + // Write the data into WASM filesystem + const userCSV = `Adam,30 +Karissa,40 +Zhang,50 +Noura,25`; + const cityCSV = `Waterloo,150000 +Kitchener,200000 +Guelph,75000`; + const followsCSV = `Adam,Karissa,2020 +Adam,Zhang,2020 +Karissa,Zhang,2021 +Zhang,Noura,2022`; + const livesInCSV = `Adam,Waterloo +Karissa,Waterloo +Zhang,Kitchener +Noura,Guelph`; + + await kuzu.FS.writeFile("/user.csv", userCSV); + await kuzu.FS.writeFile("/city.csv", cityCSV); + await kuzu.FS.writeFile("/follows.csv", followsCSV); + await kuzu.FS.writeFile("/lives-in.csv", livesInCSV); + + // Create an empty database and connect to it + const db = new kuzu.Database("./test"); + const conn = new kuzu.Connection(db); + + // Create the tables + await conn.query( + "CREATE NODE TABLE User(name STRING, age INT64, PRIMARY KEY (name))" + ); + await conn.query( + "CREATE NODE TABLE City(name STRING, population INT64, PRIMARY KEY (name))" + ); + await conn.query("CREATE REL TABLE Follows(FROM User TO User, since INT64)"); + await conn.query("CREATE REL TABLE LivesIn(FROM User TO City)"); + + // Load the data + await conn.query('COPY User FROM "user.csv"'); + await conn.query('COPY City FROM "city.csv"'); + await conn.query('COPY Follows FROM "follows.csv"'); + await conn.query('COPY LivesIn FROM "lives-in.csv"'); + const queryResult = await conn.query("MATCH (u:User) -[l:LivesIn]-> (c:City) RETURN u.name, c.name"); + + // Get all 
rows from the query result + const rows = await queryResult.getAllObjects(); + + // Print the rows + for (const row of rows) { + console.log(`User ${row['u.name']} lives in ${row['c.name']}`); + } +})(); +``` + +This script can be directly embedded in an HTML file, for example: + +```html + + + +

Welcome to WASM Test Server

+ + + +``` + +## Understanding the package + +In this package, three different variants of WebAssembly modules are provided: +- **Default**: This is the default build of the WebAssembly module. It does not support multi-threading and uses Emscripten's default filesystem. This build has the smallest size and works in both Node.js and browser environments. It has the best compatibility and does not require cross-origin isolation. However, the performance may be limited due to the lack of multi-threading support. This build is located at the root level of the package. +- **Multi-threaded**: This build supports multi-threading and uses Emscripten's default filesystem. This build has a larger size compared to the default build and only requires [cross-origin isolation](https://web.dev/articles/cross-origin-isolation-guide) in the browser environment. This build is located in the `multithreaded` directory. +- **Node.js**: This build is optimized for Node.js and uses Node.js's filesystem instead of Emscripten's default filesystem (the `NODEFS` flag is enabled). This build also supports multi-threading. It is distributed as a CommonJS module rather than an ES module to maximize compatibility. This build is located in the `nodejs` directory. Note that this build only works in Node.js and does not work in the browser environment. + +In each variant, there are two different versions of the WebAssembly module: +- **Async**: This version of the module is the default version and each function call returns a Promise. This version dispatches all the function calls to the WebAssembly module to a Web Worker or Node.js worker thread to prevent blocking the main thread. However, this version may have a slight overhead due to the serialization and deserialization of the data required by the worker threads. This version is located at the root level of each variant (e.g., `kuzu-wasm`, `kuzu-wasm/multithreaded`, `kuzu-wasm/nodejs`). 
+- **Sync**: This version of the module is synchronous and does not require any callbacks (other than the module initialization). This version is good for scripting / CLI / prototyping purposes but is not recommended for GUI applications or web servers because it may block the main thread and cause unexpected freezes. This alternative version is located in the `sync` directory of each variant (e.g., `kuzu-wasm/sync`, `kuzu-wasm/multithreaded/sync`, `kuzu-wasm/nodejs/sync`). + +Note that you cannot mix and match the variants and versions. For example, a `Database` object created with the default variant cannot be passed to a function in the multithreaded variant. Similarly, a `Database` object created with the async version cannot be passed to a function in the sync version. + +### Loading the Worker script (for async versions) +In each variant, the main module is bundled as one script file. However, the worker script is located in a separate file. The worker script is required to run the WebAssembly module in a Web Worker or Node.js worker thread. If you are using a build tool like Webpack, the worker script needs to be copied to the output directory. For example, in Webpack, you can use the `copy-webpack-plugin` to copy the worker script to the output directory. + +By default, the worker script is resolved under the same directory / URL prefix as the main module. If you want to change the location of the worker script, you can pass the worker path to the `setWorkerPath` function. For example: +```javascript +import { setWorkerPath } from 'kuzu-wasm'; +setWorkerPath('path/to/worker.js'); +``` + +Note that this function must be called before any other function calls to the WebAssembly module. Once initialization has started, the worker script path cannot be changed, and a missing worker script will cause an error. 
+ +For the Node.js variant, the worker script can be resolved automatically and you do not need to set the worker path. + +## API documentation +The API documentation can be found [here](https://kuzudb.com/api-docs/wasm/). + +## Local development + +This section is relevant if you are interested in contributing to Kùzu's Wasm API. + +First, build the WebAssembly module: + +```bash +npm run build +``` + +This will build the WebAssembly module in the `release` directory and create a tarball ready for publishing under the current directory. + +You can run the tests as follows: + +```bash +npm test +``` diff --git a/src/content/docs/import/copy-from-dataframe.md b/src/content/docs/import/copy-from-dataframe.md index f3b8416c..84ba99ad 100644 --- a/src/content/docs/import/copy-from-dataframe.md +++ b/src/content/docs/import/copy-from-dataframe.md @@ -33,7 +33,7 @@ conn.execute("COPY Person FROM df") ## Polars -You can utilize an existing Polars DataFrame to copy data directly into Kùzu. +You can utilize an existing Polars DataFrame to copy data directly into Kùzu. ```python import kuzu @@ -73,3 +73,7 @@ pa_table = pa.table({ conn.execute("COPY Person FROM pa_table") ``` + +## Ignore erroneous rows + +See the [Ignore erroneous rows](/import#ignore-erroneous-rows) section for more details. diff --git a/src/content/docs/import/index.mdx b/src/content/docs/import/index.mdx index d23fa87d..cc388cc2 100644 --- a/src/content/docs/import/index.mdx +++ b/src/content/docs/import/index.mdx @@ -202,12 +202,12 @@ CALL warning_limit=1024; ``` ### Skippable Errors By Source -Currently `IGNORE_ERRORS` option works when scanning files (and not in-memory data frames or when running -`COPY/LOAD FROM` on sub-queries). For different files, the errors that can be skipped can be different. -If the error is not skippable in a file format, `COPY/LOAD FROM` will instead error and fail. 
+Currently `IGNORE_ERRORS` option works when scanning files or in-memory data frames (but not when running +`COPY/LOAD FROM` on sub-queries). For different sources, the errors that can be skipped can be different. +If the error is not skippable for a specific source, `COPY/LOAD FROM` will instead error and fail. Below is a table that shows the errors that are skippable by each source. ||Parsing Errors|Casting Errors|Duplicate/Null/Missing Primary-Key errors| |----|----|----|----| |CSV| X | X | X | -|JSON/Numpy/Parquet|||X| +|JSON/Numpy/Parquet/PyArrow/Pandas/Polars Dataframes|||X| From 9f4d9d0f1c86ec7edfc1f5da3bb9f37bb19f9765 Mon Sep 17 00:00:00 2001 From: Howe Wang <104328541+WWW0030@users.noreply.github.com> Date: Mon, 20 Jan 2025 11:39:35 -0500 Subject: [PATCH 06/15] remove progress_bar_time from docs (#337) --- src/content/docs/cypher/configuration.md | 1 - 1 file changed, 1 deletion(-) diff --git a/src/content/docs/cypher/configuration.md b/src/content/docs/cypher/configuration.md index a970cb53..e52a7964 100644 --- a/src/content/docs/cypher/configuration.md +++ b/src/content/docs/cypher/configuration.md @@ -17,7 +17,6 @@ configuration **cannot** be used with other query clauses, such as `RETURN`. | `HOME_DIRECTORY`| system home directory | user home directory | | `FILE_SEARCH_PATH`| file search path | N/A | | `PROGRESS_BAR` | enable progress bar in CLI | false | -| `PROGRESS_BAR_TIME` | show progress bar after time in ms | 1000 | | `CHECKPOINT_THRESHOLD` | the WAL size threshold in bytes at which to automatically trigger a checkpoint | 16777216 (16MB) | | `WARNING_LIMIT` | maximum number of [warnings](/import#warnings-table-inspect-skipped-rows) that can be stored in a single connection. 
| 8192 | | `SPILL_TO_DISK` | spill data disk if there is not enough memory when running `COPY FROM (cannot be set to TRUE under in-memory or read-only mode) | true | From 604954a67bc7c549404b5749635965bea4381d98 Mon Sep 17 00:00:00 2001 From: Prashanth Rao <35005448+prrao87@users.noreply.github.com> Date: Tue, 21 Jan 2025 09:59:21 -0500 Subject: [PATCH 07/15] Fix ignore errors in DataFrame section (#338) --- .../docs/import/copy-from-dataframe.md | 57 ++++++++++++++++++- src/content/docs/import/index.mdx | 13 ++++- 2 files changed, 66 insertions(+), 4 deletions(-) diff --git a/src/content/docs/import/copy-from-dataframe.md b/src/content/docs/import/copy-from-dataframe.md index 84ba99ad..40b66ea1 100644 --- a/src/content/docs/import/copy-from-dataframe.md +++ b/src/content/docs/import/copy-from-dataframe.md @@ -76,4 +76,59 @@ conn.execute("COPY Person FROM pa_table") ## Ignore erroneous rows -See the [Ignore erroneous rows](/import#ignore-erroneous-rows) section for more details. +When copying from DataFrames, you can ignore rows that contain duplicate, null, +or missing primary keys. + +:::note[Note] +Currently, you cannot ignore parsing or type-casting errors when copying from DataFrames (the +underlying data must be parseable and type-castable). +::: + +Let's understand this with an example. + +```py +import pandas as pd + +persons = ["Rhea", "Alice", "Rhea", None] +age = [25, 23, 25, 24] + +df = pd.DataFrame({"name": persons, "age": age}) +print(df) +``` +The given DataFrame is as follows: +``` + name age +0 Rhea 25 +1 Alice 23 +2 Rhea 25 +3 None 24 +``` +As can be seen, the Pandas DataFrame has a duplicate name "Rhea" and a null value (`None`) +for `name`, which is the desired primary key field. We can ignore the erroneous rows during import +by setting the `ignore_errors` parameter to `true` in the `COPY FROM` command. 
+ +```py +import kuzu + +db = kuzu.Database("test_db") +conn = kuzu.Connection(db) + +# Create a Person node table with name as the primary key +conn.execute("CREATE NODE TABLE Person(name STRING PRIMARY KEY, age INT64)") +# Enable the `ignore_errors` parameter below to ignore the erroneous rows +conn.execute("COPY Person FROM df (ignore_errors=true)") + +# Display results +res = conn.execute("MATCH (p:Person) RETURN p.name, p.age") +print(res.get_as_df()) +``` +This is the resulting DataFrame after ignoring errors: +``` + p.name p.age +0 Rhea 25 +1 Alice 23 +``` +If the `ignore_errors` parameter is not set, the import operation will fail with an error. + +See the [Ignore erroneous rows](/import#ignore-erroneous-rows) section for details on +which kinds of errors can be ignored when copying from Pandas or Polars DataFrames. diff --git a/src/content/docs/import/index.mdx b/src/content/docs/import/index.mdx index cc388cc2..ea7d3a0a 100644 --- a/src/content/docs/import/index.mdx +++ b/src/content/docs/import/index.mdx @@ -208,6 +208,13 @@ If the error is not skippable for a specific source, `COPY/LOAD FROM` will inste Below is a table that shows the errors that are skippable by each source. 
||Parsing Errors|Casting Errors|Duplicate/Null/Missing Primary-Key errors| -|----|----|----|----| -|CSV| X | X | X | -|JSON/Numpy/Parquet/PyArrow/Pandas/Polars Dataframes|||X| +|---|:---:|:---:|:---:| +|CSV| ✅ | ✅ | ✅ | +|JSON| ❌ | ❌ | ✅ | +|Numpy| ❌ | ❌ | ✅ | +|Parquet| ❌ | ❌ | ✅ | +|PyArrow tables| ❌ | ❌ | ✅ | +|Pandas DataFrames| ❌ | ❌ | ✅ | +|Polars DataFrames| ❌ | ❌ | ✅ | + + From 51ef56acea8171b27d36453dd33dc3e09c25f21c Mon Sep 17 00:00:00 2001 From: ziyi chen Date: Tue, 21 Jan 2025 14:54:10 -0500 Subject: [PATCH 08/15] Add doc for `show_indexes`, `show_official_extensions` (#339) * Add doc for `show_indexes`, `show_official_extensions` and `show_loaded_extensions` * Apply suggestions from code review * Update src/content/docs/cypher/query-clauses/call.md --------- Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com> --- src/content/docs/cypher/query-clauses/call.md | 83 +++++++++++++++++++ 1 file changed, 83 insertions(+) diff --git a/src/content/docs/cypher/query-clauses/call.md b/src/content/docs/cypher/query-clauses/call.md index 23fd4401..806bca5e 100644 --- a/src/content/docs/cypher/query-clauses/call.md +++ b/src/content/docs/cypher/query-clauses/call.md @@ -19,6 +19,9 @@ The following tables lists the built-in schema functions you can use with the `C | `SHOW_WARNINGS()` | returns the contents of the [Warnings Table](/import#warnings-table-inspecting-skipped-rows) | | `CLEAR_WARNINGS()` | clears all warnings in the [Warnings Table](/import#warnings-table-inspecting-skipped-rows) | | `TABLE_INFO('tableName')` | returns metadata information of the given table | +| `SHOW_OFFICIAL_EXTENSIONS` | returns all official [extensions](/extensions) which can be installed by `INSTALL ` | +| `SHOW_LOADED_EXTENSIONS` | returns all loaded extensions | +| `SHOW_INDEXES` | returns all indexes built in the system | ### TABLE_INFO @@ -198,3 +201,83 @@ This function has no output. 
```cypher CALL clear_warnings(); ``` + +### SHOW_OFFICIAL_EXTENSIONS +If you would like to know all official [extensions](../../extensions) available in Kùzu, you can run the `SHOW_OFFICIAL_EXTENSIONS` function. + +| Column | Description | Type | +| ------ | ----------- | ---- | +| name | name of the extension | STRING | +| description | description of the extension | STRING | + +```cypher +CALL SHOW_OFFICIAL_EXTENSIONS() RETURN *; +``` + +Output: +``` +┌──────────┬─────────────────────────────────────────────────────────────────────────┐ +│ name │ description │ +│ STRING │ STRING │ +├──────────┼─────────────────────────────────────────────────────────────────────────┤ +│ SQLITE │ Adds support for reading from SQLITE tables │ +│ JSON │ Adds support for JSON operations │ +│ ICEBERG │ Adds support for reading from iceberg tables │ +│ HTTPFS │ Adds support for reading and writing files over a HTTP(S)/S3 filesystem │ +│ DELTA │ Adds support for reading from delta tables │ +│ POSTGRES │ Adds support for reading from POSTGRES tables │ +│ FTS │ Adds support for full-text search indexes │ +│ DUCKDB │ Adds support for reading from duckdb tables │ +└──────────┴─────────────────────────────────────────────────────────────────────────┘ +``` + +### SHOW_LOADED_EXTENSIONS +If you would like to know information about loaded extensions in Kùzu, you can run the `SHOW_LOADED_EXTENSIONS` function. 
+ +| Column | Description | Type | +| ------ | ----------- | ---- | +| extension name | name of the extension | STRING | +| extension source | whether the extension is officially supported by Kùzu Inc., or developed by a third-party | STRING | +| extension path | the path to the extension | STRING | + +```cypher +CALL SHOW_LOADED_EXTENSIONS() RETURN *; +``` + +``` +┌────────────────┬──────────────────┬─────────────────────────────────────────────────────────────────────────────┐ +│ extension name │ extension source │ extension path │ +│ STRING │ STRING │ STRING │ +├────────────────┼──────────────────┼─────────────────────────────────────────────────────────────────────────────┤ +│ FTS │ OFFICIAL │ extension/fts/build/libfts.kuzu_extension │ +└────────────────┴──────────────────┴─────────────────────────────────────────────────────────────────────────────┘ +``` + +### SHOW_INDEXES +If you would like to know information about indexes built in Kùzu, you can run the `SHOW_INDEXES` function. + +| Column | Description | Type | +| ------ | ----------- | ---- | +| table name | the table which the index is built on | STRING | +| index name | the name of the index | STRING | +| index type | the type of the index (e.g. FTS, HNSW) | STRING | +| property names | the properties which the index is built on | STRING[] | +| extension loaded | whether the extension that the index depends on has been loaded | BOOL | +| index definition | the Cypher query to create the index | STRING | + +Note: +Some indexes are implemented within extensions. If a required extension is not loaded, the `extension loaded` field will be false, and the `index definition` field will be null. 
+ +```cypher +CALL SHOW_INDEXES() RETURN *; +``` + +``` +┌────────────┬────────────┬────────────┬─────────────────────────┬──────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────┐ +│ table name │ index name │ index type │ property names │ extension loaded │ index definition │ +│ STRING │ STRING │ STRING │ STRING[] │ BOOL │ STRING │ +├────────────┼────────────┼────────────┼─────────────────────────┼──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────┤ +│ book │ bookIdx │ FTS │ [abstract,author,title] │ True │ CALL CREATE_FTS_INDEX('book', 'bookIdx', ['abstract', 'author', 'title' ], stemmer := 'porter'); │ +└────────────┴────────────┴────────────┴─────────────────────────┴──────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────┘ +``` + From ca49c2ff6c98d0da52a753554398f0a255136c5c Mon Sep 17 00:00:00 2001 From: ziyi chen Date: Wed, 22 Jan 2025 09:38:13 -0500 Subject: [PATCH 09/15] FTS index (#332) * Create full-text-search.md * Update full-text-search.md * Update FTS docs --------- Co-authored-by: prrao87 --- astro.config.mjs | 5 +- .../docs/extensions/full-text-search.md | 201 ++++++++++++++++++ 2 files changed, 204 insertions(+), 2 deletions(-) create mode 100644 src/content/docs/extensions/full-text-search.md diff --git a/astro.config.mjs b/astro.config.mjs index 8d034755..bdfb694d 100644 --- a/astro.config.mjs +++ b/astro.config.mjs @@ -198,8 +198,9 @@ export default defineConfig({ ] }, { label: 'JSON', link: '/extensions/json' }, - { label: 'Iceberg', link: '/extensions/iceberg', badge: { text: 'New' }}, - { label: 'Delta Lake', link: '/extensions/delta', badge: { text: 'New' }}, + { label: 'Iceberg', link: '/extensions/iceberg' }, + { label: 'Delta Lake', link: '/extensions/delta' }, + { label: 'Full-text search', link: '/extensions/full-text-search', badge: { text: 'New' }}, 
], autogenerate: { directory: 'reference' }, }, diff --git a/src/content/docs/extensions/full-text-search.md b/src/content/docs/extensions/full-text-search.md new file mode 100644 index 00000000..e24a08b6 --- /dev/null +++ b/src/content/docs/extensions/full-text-search.md @@ -0,0 +1,201 @@ +--- +title: "Full Text Search" +--- + +## Usage + +The `FTS` (full-text search) extension adds support for matching within the content of a string property +while returning the documents with a proximity score to the query. It is enabled by building an index +on string properties in a table and allows searching through the strings via a keyword query. +Currently, Kùzu supports only indexing on a node table's `STRING` properties. + +The FTS functionality is not available by default, so you would first need to install the `FTS` +extension by running the following commands: + +```sql +INSTALL FTS; +LOAD EXTENSION FTS; +``` + +### Example dataset + +Let's look at an example dataset to demonstrate how the FTS extension can be used. +First, let's create a `Book` table containing each book's information, including the title, author and abstract. 
+ +```cypher +CREATE NODE TABLE Book (ID SERIAL, abstract STRING, author STRING, title STRING, PRIMARY KEY (ID)); +CREATE (b:Book {abstract: 'An exploration of quantum mechanics.', author: 'Alice Johnson', title: 'The Quantum World'}); +CREATE (b:Book {abstract: 'A magic journey through time and space.', author: 'John Smith', title: 'Chronicles of the Universe'}); +CREATE (b:Book {abstract: 'An introduction to machine learning techniques.', author: 'Emma Brown', title: 'Learning Machines'}); +CREATE (b:Book {abstract: 'A deep dive into the history of ancient civilizations.', author: 'Michael Lee', title: 'Echoes of the Past'}); +CREATE (b:Book {abstract: 'A fantasy tale of dragons and magic.', author: 'Charlotte Harris', title: 'The Dragon\'s Call'}); +``` + +In the following sections, we will build a full-text search index on the `Book` table, and demonstrate how to search for books relevant to a keyword query. + +### Create FTS index + +Kùzu provides a function `CREATE_FTS_INDEX` to create the full-text search index on a table: + +```cypher +CALL CREATE_FTS_INDEX('TABLE_NAME', 'INDEX_NAME', ['PROP1', 'PROP2', 'PROP3'...], OPTIONAL_PARAM1 := 'OPTIONAL_VAL1') +``` +- `TABLE_NAME`: The name of the table on which to build the FTS index. +- `INDEX_NAME`: The name of the FTS index to create. +- `PROPERTIES`: A list of properties in the table on which to build the FTS index. Full-text search only covers the properties included in the index. + +The following optional parameters are supported: + +- `stemmer`: The text normalization technique to use. Should be one of: `arabic`, `basque`, `catalan`, `danish`, `dutch`, `english`, `finnish`, `french`, `german`, `greek`, `hindi`, `hungarian`, `indonesian`, `irish`, `italian`, `lithuanian`, `nepali`, `norwegian`, `porter`, `portuguese`, `romanian`, `russian`, `serbian`, `spanish`, `swedish`, `tamil`, `turkish`, or `none` if no stemming is to be used. Defaults to `english`, +which uses a Snowball stemmer. 
+ +The example below shows how to create an FTS index on the `Book` table with the `abstract`, `author` and `title` properties using the `porter` stemmer. + +:::caution[Note] +Kùzu uses special syntax for optional parameters. Note how the `:=` operator is used to assign a value +to an optional parameter in the example below. +::: + +```cypher +CALL CREATE_FTS_INDEX( + 'Book', // Table name + 'book_index', // Index name + ['abstract', 'author', 'title'], // Properties to build FTS index on + stemmer := 'porter' // Stemmer to use (optional) +) +``` + +Depending on the size of the dataset, the index creation may take some time. Once the index creation is complete, +the index will be ready to use for full-text search. + +### Query FTS index + +Kùzu provides a table function `QUERY_FTS_INDEX` to query the FTS index on a table: + +```cypher +CALL QUERY_FTS_INDEX( + 'TABLE_NAME', + 'INDEX_NAME', + 'QUERY', + OPTIONAL_PARAM1 := 'OPTIONAL_VAL1'... +) +``` +- `TABLE_NAME`: The name of the table to query +- `INDEX_NAME`: The name of the FTS index to query +- `QUERY`: The query string + +The following optional parameters are supported: + +1. `conjunctive`: Whether all keywords in the query must appear in a document for it to be retrieved. Defaults to `false`. +2. `K`: Controls the influence of term frequency saturation by limiting the effect of additional occurrences of a term within a document. Defaults to 1.2. +3. `B`: Controls the degree of length normalization by adjusting the influence of document length. Defaults to 0.75. 
+ +The below example shows how to query books related to the `quantum machine` and order the books by their scores: +```cypher +CALL QUERY_FTS_INDEX('Book', 'book_index', 'quantum machine') +RETURN _node.title, score +ORDER BY score DESC; +``` + +Result: +``` +┌───────────────────┬──────────┐ +│ _node.title │ score │ +│ STRING │ DOUBLE │ +├───────────────────┼──────────┤ +│ The Quantum World │ 0.857996 │ +│ Learning Machines │ 0.827832 │ +└───────────────────┴──────────┘ +``` + +The `conjunctive` option can be used when you want to retrieve only the books containing _all_ the keywords in the query. +```cypher +CALL QUERY_FTS_INDEX('Book', 'book_index', 'dragon magic', conjunctive := true) +RETURN _node.title, score +ORDER BY score DESC; +``` + +Result: +``` +┌───────────────────┬──────────┐ +│ _node.title │ score │ +│ STRING │ DOUBLE │ +├───────────────────┼──────────┤ +│ The Dragon's Call │ 1.208044 │ +└───────────────────┴──────────┘ +``` + +If you want to retrieve books with either the `dragon` OR `magic` keywords, set `conjunctive` to `false` +```cypher +CALL QUERY_FTS_INDEX('Book', 'book_index', 'dragon magic', conjunctive := false) +RETURN _node.title, score +ORDER BY score DESC; +``` + +Result: +``` +┌────────────────────────────┬──────────┐ +│ _node.title │ score │ +│ STRING │ DOUBLE │ +├────────────────────────────┼──────────┤ +│ The Dragon's Call │ 1.208044 │ +│ Chronicles of the Universe │ 0.380211 │ +└────────────────────────────┴──────────┘ +``` + +### Drop FTS index + +Use the function `DROP_FTS_INDEX` to drop the FTS index on a table: + +```cypher +CALL DROP_FTS_INDEX('TABLE_NAME', 'INDEX_NAME') +``` + +The example below shows how to drop the `book_index` index from the `Book` table: + +```cypher +CALL DROP_FTS_INDEX('Book', 'book_index') +``` + +### Show FTS indexes + +There is no function specifically to show FTS indexes, but there is a general function [`SHOW_INDEXES`](/cypher/query-clauses/call) that +can be used to show all the indexes available 
in the database. + +```cypher +CALL SHOW_INDEXES() RETURN *; +``` +This will return a list of all the indexes available in the database, while also listing the type of each +index. Scan the table to find the FTS indexes that are currently available. + +``` +┌────────────┬─────────────┬────────────┬─────────────────────────┬──────────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────┐ +│ table name │ index nam │ index type │ property names │ extension loaded │ index definition │ +│ STRING │ STRING │ STRING │ STRING[] │ BOOL │ STRING │ +├────────────┼─────────────┼────────────┼─────────────────────────┼──────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────┤ +│ book │ book_index │ FTS │ [abstract,author,title] │ True │ CALL CREATE_FTS_INDEX('book', 'book_index', ['abstract', 'author', 'title' ], stemmer := 'porter'); │ +└────────────┴─────────────┴────────────┴─────────────────────────┴──────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────┘ +``` + +### Prepared statement + +[Prepared statements](/get-started/prepared-statements) allows you to execute a query with different parameter values without rebinding the same query. +A typical use case where parameters are useful is when you want to find books with different contents. + +Example: +Let's start with preparing a cypher statement which queries the `book_index`. +```c++ +auto preparedStatement = conn->prepare("CALL QUERY_FTS_INDEX('Book', 'book_index', $q) RETURN _node.ID, score;"); +``` +Now, we can find books with different contents using the prepared statement without rebinding. 
+ +#### Find books related to `machine learning` +```c++ +auto result = conn->execute(preparedStatement.get(), std::make_pair(std::string("q"), std::string("machine learning"))); +``` + +#### Find books related to `dragons` +```c++ +auto result = conn->execute(preparedStatement.get(), std::make_pair(std::string("q"), std::string("dragons"))); +``` + From 789dd4cf489fb1890f97bf9d031e7d3133421a22 Mon Sep 17 00:00:00 2001 From: ziyi chen Date: Thu, 23 Jan 2025 15:29:14 -0500 Subject: [PATCH 10/15] Document the behaviour of import/export database with indexes (#340) --- src/content/docs/migrate/index.md | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/src/content/docs/migrate/index.md b/src/content/docs/migrate/index.md index ba7c0d23..23a710f1 100644 --- a/src/content/docs/migrate/index.md +++ b/src/content/docs/migrate/index.md @@ -34,6 +34,10 @@ For more compact storage, you can export the data files in Parquet format as fol EXPORT DATABASE '/path/to/export' (format="parquet"); ``` +:::note[Note] +The `EXPORT DATABASE` command also exports all indexes, regardless of whether their dependent extensions have been loaded or not. +::: + ## Import database The `IMPORT DATABASE` command imports the contents of the database from a specific directory to which @@ -48,8 +52,9 @@ IMPORT DATABASE '/path/to/export'; ``` :::note[Note] -The `IMPORT DATABASE` command can only be executed on an empty database. +1. The `IMPORT DATABASE` command can only be executed on an empty database. Currently, in case of a failure during the execution of the `IMPORT DATABASE` command, automatic rollback is not supported. Therefore, if the `IMPORT DATABASE` command fails, you will need to delete the database directory you are connected to and reload it again. +2. The `IMPORT DATABASE` command also imports all indexes, regardless of whether their dependent extensions were loaded during export. 
If an index's dependent extension was loaded at the time of export, it will be automatically loaded during import. However, if the dependent extension was not loaded during export, it will not be automatically loaded during import. In such cases, users must manually load the dependent extensions before querying the index. ::: From b347ae51cd564bbefd1625692a90f955a034848a Mon Sep 17 00:00:00 2001 From: Prashanth Rao <35005448+prrao87@users.noreply.github.com> Date: Fri, 24 Jan 2025 17:53:56 -0500 Subject: [PATCH 11/15] Add doc for file-format option (#342) (#343) * Add doc for file-format * Update index.mdx * Apply suggestions from code review --------- Co-authored-by: ziyi chen --- .../docs/cypher/query-clauses/load-from.md | 15 ++++++++++++++- src/content/docs/import/index.mdx | 9 +++++++++ 2 files changed, 23 insertions(+), 1 deletion(-) diff --git a/src/content/docs/cypher/query-clauses/load-from.md b/src/content/docs/cypher/query-clauses/load-from.md index 756ae208..44753bc0 100644 --- a/src/content/docs/cypher/query-clauses/load-from.md +++ b/src/content/docs/cypher/query-clauses/load-from.md @@ -144,7 +144,20 @@ You can also see the details of any warnings generated by the skipped lines usin See the [ignoring erroneous rows section of `COPY FROM`](import#ignore-erroneous-rows) for more details. ## Scan Data Formats -Load from can scan several raw or in-memory file formats, such as CSV, Parquet, Pandas, Polars, Arrow tables, and JSON. +`LOAD FROM` can scan several raw or in-memory file formats, such as CSV, Parquet, Pandas, Polars, Arrow tables, and JSON. + +### File format detection +`Load from` determines the file format based on the file extension if the `file_format` option is not given. For instance, files with a `.csv` extension are automatically recognized as CSV format. + +If the file format cannot be inferred from the extension, or if you need to override the default sniffing behaviour, the `file_format` option can be used. 
+ +For example, to load a CSV file that has a `.tsv` extension (for tab-separated data), you must explicitly specify the file format using the `file_format` option, as shown below: +``` +LOAD FROM 'data.tsv' (file_format='csv') +RETURN * +``` + + Below we give examples of using `LOAD FROM` to scan data from each of these formats. We assume `WITH HEADERS` is not used in the examples below, so we discuss how Kùzu infers the variable names and data types of that bind to the scanned tuples. diff --git a/src/content/docs/import/index.mdx b/src/content/docs/import/index.mdx index ea7d3a0a..e1b60651 100644 --- a/src/content/docs/import/index.mdx +++ b/src/content/docs/import/index.mdx @@ -52,6 +52,15 @@ The following sections show how to bulk import data using `COPY FROM` on various +:::caution[Note] +Similar to the [LOAD FROM](/cypher/query-clauses/load-from.md) clause, the `COPY FROM` clause determines the file format based on the file extension if the `file_format` option is not provided. Alternatively, the `file_format` option can be used in the `COPY FROM` clause to explicitly specify the file format. + +Example: To copy from a file ending with an arbitrary extension like `.dsv`, use the `file_format = 'csv'` option to explicitly tell Kùzu to treat the file as a `CSV` file. 
+``` +COPY person FROM 'person.dsv' (file_format = 'csv') +``` +::: + ## `COPY FROM` a partial subset In certain cases, you may only want to partially fill your Kùzu table using the data from your input From 9db6bb9748f838aac0b53720b49b2870067fb398 Mon Sep 17 00:00:00 2001 From: prrao87 Date: Fri, 24 Jan 2025 17:02:24 -0600 Subject: [PATCH 12/15] Fix typos and improve formatting --- .../docs/cypher/query-clauses/load-from.md | 64 +++++++++---------- 1 file changed, 31 insertions(+), 33 deletions(-) diff --git a/src/content/docs/cypher/query-clauses/load-from.md b/src/content/docs/cypher/query-clauses/load-from.md index 44753bc0..c03ba39d 100644 --- a/src/content/docs/cypher/query-clauses/load-from.md +++ b/src/content/docs/cypher/query-clauses/load-from.md @@ -147,12 +147,12 @@ See the [ignoring erroneous rows section of `COPY FROM`](import#ignore-erroneous `LOAD FROM` can scan several raw or in-memory file formats, such as CSV, Parquet, Pandas, Polars, Arrow tables, and JSON. ### File format detection -`Load from` determines the file format based on the file extension if the `file_format` option is not given. For instance, files with a `.csv` extension are automatically recognized as CSV format. +`LOAD FROM` determines the file format based on the file extension if the `file_format` option is not given. For instance, files with a `.csv` extension are automatically recognized as CSV format. If the file format cannot be inferred from the extension, or if you need to override the default sniffing behaviour, the `file_format` option can be used. For example, to load a CSV file that has a `.tsv` extension (for tab-separated data), you must explicitly specify the file format using the `file_format` option, as shown below: -``` +```cypher LOAD FROM 'data.tsv' (file_format='csv') RETURN * ``` @@ -170,7 +170,7 @@ See the ](/import/csv#ignoring-erroneous-rows) documentation pages for the `COPY FROM` file. 
The configurations documented in those pages can also be specified after the `LOAD FROM` statement inside `()` when scanning CSV files. For example, you can indicate that the first line should -be interpreted as a header line by setting `(haders = true)` or that the CSV delimiter is '|' by setting `(DELIM="|")`. +be interpreted as a header line by setting `(headers = true)` or that the CSV delimiter is '|' by setting `(DELIM="|")`. Some of these configurations are also by default [automatically detected](/import/csv#auto-detecting-configurations) by Kùzu when scanning CSV files. These configurations determine the names and data types of the variables that bind to the fields scanned from CSV files. @@ -186,7 +186,7 @@ provide the names of the columns. The data types are always automatically inferr if `LOAD WITH HEADERS (...) FROM` is used, in which case the data types provided inside the `(...)` are used as described [above](#bound-variable-names-and-data-types)). -Suppose user.csv is a CSV file with the following contents: +Suppose `user.csv` is a CSV file with the following contents: ``` name,age Adam,30 @@ -198,15 +198,14 @@ Then if you run the following query, Kùzu will infer the column names `name` an ```cypher LOAD FROM "user.csv" (header = true) RETURN *; ------------------ -| name | age | ------------------ -| Adam | 30 | ------------------ -| Karissa | 40 | ------------------ -| Zhang | 50 | ------------------ +┌─────────┬───────┐ +│ name │ age │ +│ STRING │ INT64 │ +├─────────┼───────┤ +│ Adam │ 30 │ +│ Karissa │ 40 │ +│ Zhang │ 50 │ +└─────────┴───────┘ ``` @@ -220,15 +219,15 @@ Zhang,50 ```cypher LOAD FROM "user.csv" (header = false) RETURN *; ---------------------- -| column0 | column1 | ---------------------- -| Adam | 30 | ---------------------- -| Karissa | 40 | ---------------------- -| Zhang | 50 | ---------------------- +┌─────────┬─────────┐ +│ column0 │ column1 │ +│ STRING │ STRING │ +├─────────┼─────────┤ +│ name │ age │ +│ Adam │ 30 │ +│ 
Karissa │ 40 │ +│ Zhang │ 50 │ +└─────────┴─────────┘ ``` ### Parquet @@ -240,15 +239,14 @@ and the same content as in the `user.csv` file above. Then the query below will ```cypher LOAD FROM "user.parquet" RETURN *; ----------------- -| f0 | f1 | ----------------- -| Adam | 30 | ----------------- -| Karissa | 40 | ----------------- -| Zhang | 50 | ----------------- +┌─────────┬───────┐ +│ f0 │ f1 │ +│ STRING │ INT64 │ +├─────────┼───────┤ +│ Adam │ 30 │ +│ Karissa │ 40 │ +│ Zhang │ 50 │ +└─────────┴───────┘ ``` ### Pandas @@ -350,5 +348,5 @@ age: [[30,40,50]] ``` ### JSON -Kùzu can scan JSON files using `LOAD FROM`. -All JSON-related features are part of the JSON extension. See the documentation on the [JSON extension](/extensions/json#load-from) for details. +Kùzu can scan JSON files using `LOAD FROM`, but only upon installation of the JSON extension. +See the documentation on the [JSON extension](/extensions/json#load-from) for details. From 123c28d3e228c3b7604cf523e8daa905473e56cc Mon Sep 17 00:00:00 2001 From: ziyi chen Date: Sat, 25 Jan 2025 14:49:58 -0500 Subject: [PATCH 13/15] Add doc for yield clause (#347) * Add doc for yield clause * Apply suggestions from code review --------- Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com> --- src/content/docs/cypher/query-clauses/call.md | 68 +++++++++++++++++++ 1 file changed, 68 insertions(+) diff --git a/src/content/docs/cypher/query-clauses/call.md b/src/content/docs/cypher/query-clauses/call.md index 806bca5e..f2bb8192 100644 --- a/src/content/docs/cypher/query-clauses/call.md +++ b/src/content/docs/cypher/query-clauses/call.md @@ -281,3 +281,71 @@ CALL SHOW_INDEXES() RETURN *; └────────────┴────────────┴────────────┴─────────────────────────┴──────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────┘ ``` +### Using yield +The `YIELD` clause in Kuzu is used to rename the return columns of a CALL function to avoid naming 
conflicts and improve readability. +Usage: +``` +CALL FUNC() +YIELD COLUMN0 [AS ALIAS0], COLUMN1 [AS ALIAS1] +RETURN ALIAS0, ALIAS1 +``` + +Example: +To rename the output column name of `current_setting('threads')` from `threads` to `threads_num`, you can use the following query: +``` +CALL current_setting('threads') +YIELD threads as threads_num +RETURN *; +``` + +Result: +``` +┌─────────────┐ +│ threads_num │ +│ STRING │ +├─────────────┤ +│ 10 │ +└─────────────┘ +``` + +Another useful scenario is avoiding naming conflicts when two `CALL` functions in the same query return a column with the same name. +``` +CALL table_info('person') +YIELD `property id` as person_id, name as person_name, type as person_type, `default expression` as person_default, `primary key` as person_pk +CALL table_info('student') +YIELD `property id` as student_id, name as student_name, type as student_type, `default expression` as student_default, `primary key` as student_pk +RETURN *; +``` + +Result: +``` +┌───────────┬─────────────┬─────────────┬────────────────┬───────────┬────────────┬──────────────┬──────────────┬─────────────────┬────────────┐ +│ person_id │ person_name │ person_type │ person_default │ person_pk │ student_id │ student_name │ student_type │ student_default │ student_pk │ +│ INT32 │ STRING │ STRING │ STRING │ BOOL │ INT32 │ STRING │ STRING │ STRING │ BOOL │ +├───────────┼─────────────┼─────────────┼────────────────┼───────────┼────────────┼──────────────┼──────────────┼─────────────────┼────────────┤ +│ 0 │ id │ INT64 │ NULL │ True │ 0 │ id │ INT64 │ NULL │ True │ +└───────────┴─────────────┴─────────────┴────────────────┴───────────┴────────────┴──────────────┴──────────────┴─────────────────┴────────────┘ +``` + +:::caution[Note] +1. If the `YIELD` clause is used after a `CALL` function, **all** return columns of the function must appear in the `YIELD` clause. 
+ +For example: +``` +CALL table_info('person') +YIELD `property id` as person_id +RETURN person_id +``` +The query throws an exception since not all returns columns of the `table_info` function appear in the yield clause. + +2. The column names to yield must match the original return column names of the call function. +For example: +``` +CALL current_setting('threads') +YIELD thread as threads_num +RETURN *; +``` +The query throws an exception since the column name to yield is `thread` which doesn't match the return column name(`threads`) of the call function. + +3. The syntax in Kùzu Cypher is different from other systems like Neo4j. In Kùzu, the `YIELD` clause must be followed by a return clause. `YIELD *` is not allowed in Kùzu. +::: From 6847d34c88a2e46052c711352b401c0176316c1b Mon Sep 17 00:00:00 2001 From: ziyi chen Date: Wed, 29 Jan 2025 15:36:34 -0500 Subject: [PATCH 14/15] skip/limit doc (#341) * skip/limit doc * Update limit.md * Update limit.md * Update skip.md --- .../docs/cypher/query-clauses/limit.md | 59 ++++++++++++++++--- src/content/docs/cypher/query-clauses/skip.md | 57 +++++++++++++++--- 2 files changed, 100 insertions(+), 16 deletions(-) diff --git a/src/content/docs/cypher/query-clauses/limit.md b/src/content/docs/cypher/query-clauses/limit.md index cbb0935e..acb9fe20 100644 --- a/src/content/docs/cypher/query-clauses/limit.md +++ b/src/content/docs/cypher/query-clauses/limit.md @@ -18,17 +18,58 @@ LIMIT 3; ``` Result: ``` ------------ -| u.name | ------------ -| Zhang | ------------ -| Karissa | ------------ -| Adam | ------------ +┌─────────┐ +│ u.name │ +│ STRING │ +├─────────┤ +│ Zhang │ +│ Karissa │ +│ Adam │ +└─────────┘ ``` If you omit the `ORDER BY`, you would get some k tuples in a `LIMIT k` query but you have no guarantee about which ones will be selected. + +The number of rows to limit can either be: +1. 
A parameter expression when used with prepared statement: + +Prepare: +```c++ +auto prepared = conn->prepare("MATCH (u:User) RETURN u.name limit $lt") +``` +Execution: +The number of rows to limit can be given at the time of execution. +```c++ +conn->execute(prepared.get(), std::make_pair(std::string{"lt"}, 1)) +``` + +Result: +``` +┌────────┐ +│ u.name │ +│ STRING │ +├────────┤ +│ Adam │ +└────────┘ +``` +2. A literal expression which can be evaluated at compile time. +```cypher +MATCH (u:User) +RETURN u.name +limit 1+2 +``` +Result: + +``` +┌─────────┐ +│ u.name │ +│ STRING │ +├─────────┤ +│ Adam │ +│ Karissa │ +│ Zhang │ +└─────────┘ +``` + diff --git a/src/content/docs/cypher/query-clauses/skip.md b/src/content/docs/cypher/query-clauses/skip.md index 393f2953..1b5d4491 100644 --- a/src/content/docs/cypher/query-clauses/skip.md +++ b/src/content/docs/cypher/query-clauses/skip.md @@ -20,14 +20,57 @@ SKIP 2; ``` Result: ``` ------------ -| u.name | ------------ -| Karissa | ------------ -| Zhang | ------------ +┌─────────┐ +│ u.name │ +│ STRING │ +├─────────┤ +│ Karissa │ +│ Zhang │ +└─────────┘ ``` If you omit the `ORDER BY`, you would skip some k tuples in a `SKIP` k query but you have no guarantee about which ones will be skipped. + + +The number of rows to skip can either be: +1. A parameter expression when used with prepared statement: + +Prepare: +```c++ +auto prepared = conn->prepare("MATCH (u:User) RETURN u.name skip $sp") +``` + +Execution: + +The number of rows to skip can be given at the time of execution. +```c++ +conn->execute(prepared.get(), std::make_pair(std::string{"sp"}, 2)) +``` + +Result: +``` +┌────────┐ +│ u.name │ +│ STRING │ +├────────┤ +│ Zhang │ +│ Noura │ +└────────┘ +``` +2. A literal expression which can be evaluated at compile time. 
+```cypher +MATCH (u:User) +RETURN u.name +skip 2+1 +``` +Result: + +``` +┌────────┐ +│ u.name │ +│ STRING │ +├────────┤ +│ Noura │ +└────────┘ +``` From bf980bc3eedd25a6a9083c9f072900edc5758464 Mon Sep 17 00:00:00 2001 From: Royi Luo Date: Fri, 31 Jan 2025 17:20:11 -0500 Subject: [PATCH 15/15] Add documentation on special behaviour for query result getNext() (#351) * Add docs on query result getNext() behaviour * Add manual frees in C API example * Apply suggestions from code review --------- Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com> --- src/content/docs/client-apis/c.mdx | 60 +++++++++++++++++++++++- src/content/docs/client-apis/cpp.mdx | 67 +++++++++++++++++++++++++-- src/content/docs/client-apis/java.mdx | 60 ++++++++++++++++++++++++ 3 files changed, 183 insertions(+), 4 deletions(-) diff --git a/src/content/docs/client-apis/c.mdx b/src/content/docs/client-apis/c.mdx index fb4167c6..892dd0f8 100644 --- a/src/content/docs/client-apis/c.mdx +++ b/src/content/docs/client-apis/c.mdx @@ -73,4 +73,62 @@ And then link against `/libkuzu.so` (or `libkuzu.dylib`/`libkuzu.l The static library is more complicated (as noted above, it's recommended that you use CMake to handle the details) and is not installed by default, but all static libraries will be available in the build directory. -You need to define `KUZU_STATIC_DEFINE`, and link against the static kuzu library in `build/src`, as well as `antlr4_cypher`, `antlr4_runtime`, `brotlidec`, `brotlicommon`, `utf8proc`, `re2`, `serd`, `fastpfor`, `miniparquet`, `zstd`, `miniz`, `mbedtls`, `lz4` (all of which can be found in the third_party subdirectory of the CMake build directory. E.g. `build/third_party/zstd/libzstd.a`) and whichever standard library you're using. 
\ No newline at end of file +You need to define `KUZU_STATIC_DEFINE`, and link against the static Kùzu library in `build/src`, as well as `antlr4_cypher`, `antlr4_runtime`, `brotlidec`, `brotlicommon`, `utf8proc`, `re2`, `serd`, `fastpfor`, `miniparquet`, `zstd`, `miniz`, `mbedtls`, `lz4` (all of which can be found in the third_party subdirectory of the CMake build directory. E.g. `build/third_party/zstd/libzstd.a`) and whichever standard library you're using. + +## Handling Kùzu output using `kuzu_query_result_get_next()` + +For the examples in this section we will be using the following schema: +```cypher +CREATE NODE TABLE person(id INT64 PRIMARY KEY); +``` + +The `kuzu_query_result_get_next()` function returns a reference to the resulting flat tuple. Additionally, to reduce resource allocation all calls to `kuzu_query_result_get_next()` reuse the same +flat tuple object. This means that for a query result, each call to `kuzu_query_result_get_next()` actually overwrites the flat tuple previously returned by the previous call. 
+ +Thus, we recommend processing each tuple immediately before making the next call to `kuzu_query_result_get_next()`: + +```c +kuzu_query_result result; +kuzu_connection_query(conn, "MATCH (p:person) RETURN p.*", &result); +while (kuzu_query_result_has_next(&result)) { + kuzu_flat_tuple tuple; + kuzu_query_result_get_next(&result, &tuple); + do_something(tuple); +} +``` + +If you wish to process the tuples later, you must explicitly make a copy of each tuple: +```c +static kuzu_value* copy_flat_tuple(kuzu_flat_tuple* tuple, uint32_t tupleLen) { + kuzu_value* ret = malloc(sizeof(kuzu_value) * tupleLen); + for (uint32_t i = 0; i < tupleLen; i++) { + kuzu_flat_tuple_get_value(tuple, i, &ret[i]); + } + return ret; +} + +void mainFunction() { + kuzu_query_result result; + kuzu_connection_query(conn, "MATCH (p:person) RETURN p.*", &result); + + uint64_t num_tuples = kuzu_query_result_get_num_tuples(&result); + kuzu_value** tuples = (kuzu_value**)malloc(sizeof(kuzu_value*) * num_tuples); + for (uint64_t i = 0; i < num_tuples; ++i) { + kuzu_flat_tuple tuple; + kuzu_query_result_get_next(&result, &tuple); + tuples[i] = copy_flat_tuple(&tuple, kuzu_query_result_get_num_columns(&result)); + kuzu_flat_tuple_destroy(&tuple); + } + + for (uint64_t i = 0; i < num_tuples; ++i) { + for (uint64_t j = 0; j < kuzu_query_result_get_num_columns(&result); ++j) { + doSomething(tuples[i][j]); + kuzu_value_destroy(&tuples[i][j]); + } + free(tuples[i]); + } + + free((void*)tuples); + kuzu_query_result_destroy(&result); +} +``` diff --git a/src/content/docs/client-apis/cpp.mdx b/src/content/docs/client-apis/cpp.mdx index 0a60502f..4f65ff47 100644 --- a/src/content/docs/client-apis/cpp.mdx +++ b/src/content/docs/client-apis/cpp.mdx @@ -11,10 +11,71 @@ See the following link for the full documentation of the C++ API. 
href="https://kuzudb.com/api-docs/cpp/annotated.html" /> +## Handling Kùzu output using `getNext()` + +For the examples in this section we will be using the following schema: +```cypher +CREATE NODE TABLE person(id INT64 PRIMARY KEY); +``` + +The `getNext()` function in a `QueryResult` returns a reference to the resulting `FlatTuple`. Additionally, to reduce resource allocation, all calls to `getNext()` reuse the same +`FlatTuple` object. This means that for a `QueryResult`, each call to `getNext()` actually overwrites the `FlatTuple` returned by the previous call to `getNext()`. + +Thus, we don't recommend using `QueryResult` like this: + +```cpp +std::unique_ptr<kuzu::main::QueryResult> result = conn->query("MATCH (p:person) RETURN p.*"); +std::vector<std::shared_ptr<kuzu::processor::FlatTuple>> tuples; +while (result->hasNext()) { + // Each call to getNext() actually returns a pointer to the same tuple object + tuples.emplace_back(result->getNext()); +} + +// This is wrong! +// The vector stores a bunch of pointers to the same underlying tuple object +for (const auto& resultTuple: tuples) { + doSomething(resultTuple); +} +``` + +Instead, we recommend processing each tuple immediately before making the next call to `getNext()`: +```cpp +std::unique_ptr<kuzu::main::QueryResult> result = conn->query("MATCH (p:person) RETURN p.*"); +std::vector<std::shared_ptr<kuzu::processor::FlatTuple>> tuples; +while (result->hasNext()) { + auto tuple = result->getNext(); + doSomething(tuple); +} +``` + +If you wish to process the tuples later, you must explicitly make a copy of each tuple: +```cpp +static decltype(auto) copyFlatTuple(kuzu::processor::FlatTuple* tuple) { + std::vector<std::unique_ptr<kuzu::common::Value>> ret; + for (uint32_t i = 0; i < tuple->len(); i++) { + ret.emplace_back(tuple->getValue(i)->copy()); + } + return ret; +} + +void mainFunction() { + std::unique_ptr<kuzu::main::QueryResult> result = conn->query("MATCH (p:person) RETURN p.*"); + std::vector<std::vector<std::unique_ptr<kuzu::common::Value>>> tuples; + while (result->hasNext()) { + auto tuple = result->getNext(); + tuples.emplace_back(copyFlatTuple(tuple.get())); + } + for (const auto& tuple : tuples) { + doSomething(tuple); + } +} +``` + +## 
UDF API + In addition to interfacing with the database, the C++ API offers users the ability to define custom functions via User Defined Functions (UDFs), described below. -## UDF API Kùzu provides two interfaces that enable you to define your own custom scalar and vectorized functions. ### Scalar functions @@ -211,7 +272,7 @@ conn->createVectorizedFunction("addFour", &addFour); conn->query("MATCH (p:person) return addFour(p.age)"); ``` -#### Option 2. Vectorized function with input and return type in Cypher +#### Option 2. Vectorized function with input and return type in Cypher Create a vectorized function with input and return type in Cypher. ```cpp @@ -263,4 +324,4 @@ conn->query("MATCH (p:person) return addDate(p.birthdate, p.age)"); ## Linking -See the [C API Documentation](/client-apis/c#linking) for details as linking to the C++ API is more or less identical. \ No newline at end of file +See the [C API Documentation](/client-apis/c#linking) for details as linking to the C++ API is more or less identical. diff --git a/src/content/docs/client-apis/java.mdx b/src/content/docs/client-apis/java.mdx index a8dd9011..ac4c120f 100644 --- a/src/content/docs/client-apis/java.mdx +++ b/src/content/docs/client-apis/java.mdx @@ -10,3 +10,63 @@ See the following link for the full documentation of the Java API. title="Java API documentation" href="https://kuzudb.com/api-docs/java" /> + +## Handling Kùzu output using `getNext()` + +For the examples in this section we will be using the following schema: +```cypher +CREATE NODE TABLE person(id INT64 PRIMARY KEY); +``` + +The `getNext()` function in a `QueryResult` returns a reference to the resulting `FlatTuple`. Additionally, to reduce resource allocation all calls to `getNext()` reuse the same +FlatTuple object. This means that for a `QueryResult`, each call to `getNext()` actually overwrites the `FlatTuple` previously returned by the previous call to `getNext()`. 
+ +Thus, we don't recommend using `QueryResult` like this: + +```java +QueryResult result = conn.query("MATCH (p:person) RETURN p.*"); +List<FlatTuple> tuples = new ArrayList<FlatTuple>(); +while (result.hasNext()) { + // Each call to getNext() actually returns a reference to the same tuple object + tuples.add(result.getNext()); +} + +// This is wrong! +// The list stores a bunch of references to the same underlying tuple object +for (FlatTuple resultTuple: tuples) { + doSomething(resultTuple); +} +``` + +Instead, we recommend processing each tuple immediately, before making the next call to `getNext()`: +```java +QueryResult result = conn.query("MATCH (p:person) RETURN p.*"); +while (result.hasNext()) { + FlatTuple tuple = result.getNext(); + doSomething(tuple); +} +``` + +If you wish to process the tuples later, you must explicitly make a copy of each tuple: +```java +List<Value> copyFlatTuple(FlatTuple tuple, long tupleLen) throws ObjectRefDestroyedException { + List<Value> ret = new ArrayList<Value>(); + for (int i = 0; i < tupleLen; i++) { + ret.add(tuple.getValue(i).clone()); + } + return ret; +} + +void mainFunction() throws ObjectRefDestroyedException { + QueryResult result = conn.query("MATCH (p:person) RETURN p.*"); + List<List<Value>> tuples = new ArrayList<List<Value>>(); + while (result.hasNext()) { + FlatTuple tuple = result.getNext(); + tuples.add(copyFlatTuple(tuple, result.getNumColumns())); + } + + for (List<Value> tuple: tuples) { + doSomething(tuple); + } +} +```