From 5680a0c9c30875c8cdfdbe7ca9e56c9dab0a4002 Mon Sep 17 00:00:00 2001 From: TanZiYen <104113819+TanZiYen@users.noreply.github.com> Date: Thu, 26 Oct 2023 14:01:45 +0800 Subject: [PATCH 1/6] Docs: Update-openmldbsql-folder --- docs/en/openmldb_sql/index.rst | 18 ++ docs/en/openmldb_sql/sql_difference.md | 257 ++++++++++++++++++++++ docs/en/openmldb_sql/udf_develop_guide.md | 216 ++++++++++++++++++ 3 files changed, 491 insertions(+) create mode 100644 docs/en/openmldb_sql/index.rst create mode 100644 docs/en/openmldb_sql/sql_difference.md create mode 100644 docs/en/openmldb_sql/udf_develop_guide.md diff --git a/docs/en/openmldb_sql/index.rst b/docs/en/openmldb_sql/index.rst new file mode 100644 index 00000000000..6380c333c9d --- /dev/null +++ b/docs/en/openmldb_sql/index.rst @@ -0,0 +1,18 @@ +============================= +OpenMLDB SQL +============================= + + +.. toctree:: + :maxdepth: 1 + + sql_difference + language_structure/index + data_types/index + functions_and_operators/index + dql/index + dml/index + ddl/index + deployment_manage/index + task_manage/index + udf_develop_guide diff --git a/docs/en/openmldb_sql/sql_difference.md b/docs/en/openmldb_sql/sql_difference.md new file mode 100644 index 00000000000..c0c4b26a5a9 --- /dev/null +++ b/docs/en/openmldb_sql/sql_difference.md @@ -0,0 +1,257 @@ +# Main Differences from Standard SQL + +This article provides a comparison between the main usage of OpenMLDB SQL (SELECT query statements) and standard SQL (using MySQL-supported syntax as an example). It aims to help developers with SQL experience quickly adapt to OpenMLDB SQL. + +Unless otherwise specified, the default version is OpenMLDB: >= v0.7.1 + +## Support Overview + +The table below summarizes the differences in overall performance between OpenMLDB SQL and standard SQL based on SELECT statement elements across three execution modes (for execution mode details, please refer to Using Process and Execution Mode). 
OpenMLDB SQL is currently partially compatible with standard SQL, with additional syntax introduced to accommodate specific business scenarios. New syntax is indicated in bold in the table. + +Note: ✓ indicates that the statement is supported, while ✕ indicates that it is not. + +| | **OpenMLDB SQL**
**Offline Mode** | **OpenMLDB SQL**
**Online Preview Mode** | **OpenMLDB SQL**
**Online Request Mode** | **Standard SQL** | **Remarks** | +| -------------- | ---------------------------- | -------------------------------- | -------------------------------- | ------------ | ------------------------------------------------------------ | +| WHERE Clause | ✓ | ✓ | ✕ | ✓ | Some functionalities can be achieved through built-in functions with the `_where` suffix. | +| HAVING Clause | ✓ | ✓ | X | ✓ | | +| JOIN Clause | ✓ | ✕ | ✓ | ✓ | OpenMLDB only supports the unique **LAST JOIN**. | +| GROUP BY Grouping | ✓ | ✕ | ✕ | ✓ | | +| ORDER BY Keyword | ✓ | ✓ | ✓ | ✓ | Support is limited to usage within the `WINDOW` and `LAST JOIN` clauses; it does not support reverse sorting in `DESC`. | +| LIMIT the Number of Rows | ✓ | ✓ | ✕ | ✓ | | +| WINDOW Clause | ✓ | ✓ | ✓ | ✓ | OpenMLDB introduces unique **WINDOW ... UNION** and **WINDOW ATTRIBUTES** syntax. | +| WITH Clause | ✕ | ✕ | ✕ | ✓ | OpenMLDB support begins from version v0.7.2. | +| Aggregate Function | ✓ | ✓ | ✓ | ✓ | OpenMLDB offers a variety of extension functions. | + + + +## Explanation of Differences + +### Difference Dimension + +Compared to standard SQL, the differences in OpenMLDB SQL can be explained from three main perspectives: + +1. **Execution Mode**: OpenMLDB SQL has varying support for different SQL statements in three distinct execution modes: offline mode, online preview mode, and online request mode. The choice of execution mode depends on specific requirements. In general, for real-time computations in SQL, business SQL must adhere to the constraints of the online request mode. +2. **Clause Combinations**: The combination of different clauses can introduce additional limitations. In these scenarios, one clause operates on the result set of another clause. For example, when LIMIT is applied to WHERE, the SQL would resemble `SELECT * FROM (SELECT * FROM t1 WHERE id >= 2) LIMIT 2`. 
The term 'table reference' used here refers to `FROM TableRef`, which does not represent a subquery or a complex FROM clause involving JOIN or UNION. +3. **Special Restrictions**: Unique restrictions that do not fit the previous categories are explained separately. These restrictions are usually due to incomplete functionality or known program issues. + +### Configuration of Scanning Limits + +To prevent user errors from affecting online performance, OpenMLDB has introduced relevant parameters that limit the number of full table scans in offline mode and online preview mode. If these limitations are enabled, certain operations involving scans of multiple records (such as SELECT *, aggregation operations, etc.) may result in truncated results and, consequently, incorrect outcomes. It's essential to note that these parameters do not affect the accuracy of results in online request mode. + +The configuration of these parameters is done within the tablet configuration file `conf/tablet.flags`, as detailed in the document on Configuration File. The parameters affecting scan limits include: + +- Maximum Number of Scans: `--max_traverse_cnt` +- Maximum Number of Scanned Keys: `--max_traverse_pk_cnt` +- Size Limit for Returned Results: `--scan_max_bytes_size` + +In versions from v0.7.3 onwards, it's expected that the default values for these parameters will be set to 0, implying there will be no related restrictions. Users of earlier versions should take note of the parameter settings. + +### WHERE Clause + +| **Apply To** | **Offline Mode** | **Online Preview Mode** | **Online Request Mode** | +| ------------------ | ------------ | ---------------- | ---------------- | +| Table References | ✓ | ✓ | ✕ | +| LAST JOIN | ✓ | ✓ | ✕ | +| Subquery/ With Clause | ✓ | ✓ | ✕ | + +In the online request mode, the `WHERE` clause isn't supported. 
However, some functionalities can be achieved through computation functions with the `_where` suffix, like `count_where` and `avg_where`, among others. For detailed information, please consult the [Built-In Computation Function Document](https://chat.openai.com/c/functions_and_operators/Files/udfs_8h.md). + +### LIMIT Clause + +LIMIT is followed by an INT literal, and it does not support other expressions. It indicates the maximum number of rows for returned data. However, LIMIT is not supported in the online mode. + +| **Apply to** | **Offline Mode** | **Online Preview Mode** | **Online Request Mode** | +| ----------------- | ---------------- | ----------------------- | ----------------------- | +| Table Reference | ✓ | ✓ | ✕ | +| WHERE | ✓ | ✓ | ✕ | +| WINDOW | ✓ | ✓ | ✕ | +| LAST JOIN | ✓ | ✓ | ✕ | +| GROUP BY & HAVING | ✕ | ✓ | X | + +### WINDOW Clause + +The WINDOW clause and the GROUP BY & HAVING clause cannot be used simultaneously. When transitioning to the online mode, the input table for the WINDOW clause must be either a physical table or a simple column filtering, along with LAST JOIN concatenation of the physical table. Simple column filtering entails a select list containing only column references or renaming columns, without additional expressions. You can refer to the table below for specific support scenarios. If a scenario is not listed, it means that it's not supported. + +| **Apply to** | **Offline Mode** | **Online Preview Mode** | **Online Request Mode** | +| ------------------------------------------------------------ | ---------------- | ----------------------- | ----------------------- | +| Table Reference | ✓ | ✓ | ✓ | +| GROUP BY & HAVING | ✕ | ✕ | ✕ | +| LAST JOIN | ✓ | ✓ | ✓ | +| Subqueries are only allowed under these conditions:
1. Simple column filtering from a single table
2. Multi-table LAST JOIN
3. Simple column filtering after a dual-table LAST JOIN
| ✓ | ✓ | ✓ | + +Special Restrictions: + +- In online request mode, the input for WINDOW can be a LAST JOIN or a LAST JOIN within a subquery. It's important to note that the columns for `PARTITION BY` and `ORDER BY` in the window definition must all originate from the leftmost table of the JOIN. + +### GROUP BY & HAVING Clause + +The GROUP BY statement is still considered an experimental feature and only supports a physical table as the input table. It's not supported in other scenarios. GROUP BY is also not available in the online mode. + +| **Apply to** | **Offline Mode** | **Online Preview Mode** | **Online Request Mode** | +| --------------- | ---------------- | ----------------------- | ----------------------- | +| Table Reference | ✓ | ✓ | ✕ | +| WHERE | ✕ | ✕ | ✕ | +| LAST JOIN | ✕ | ✕ | ✕ | +| Subquery | ✕ | ✕ | ✕ | + +### JOIN Clause(LAST JOIN) + +OpenMLDB exclusively supports the LAST JOIN syntax. For a detailed description, please refer to the section on LAST JOIN in the extended syntax. A JOIN consists of two inputs, the left and right. In the online request mode, it supports two inputs as physical tables or specific subqueries. You can refer to the table for specific details. If a scenario is not listed, it means it's not supported. + +| **Apply to** | **Offline Mode** | **Online Preview Mode** | **Online Request Mode** | +| ------------------------------------------------------------ | ---------------- | ----------------------- | ----------------------- | +| Two Table References | ✓✓ | ✓✕ | ✕✓ | +| As for WHERE subqueries, they are only allowed in the following cases:
- When both the left and right tables are simple column filters
- When both the left and right tables are the result of WINDOW or LAST JOIN operations | ✕✓ | ✕✓ | ✕✓ | + +Special Restrictions: + +- Launching LAST JOIN for specific subqueries involves additional requirements. For more information, please refer to [Launch Requirements](https://chat.openai.com/openmldb_sql/deployment_manage/ONLINE_REQUEST_REQUIREMENTS.md#usagespecifications-last-join-inonlinerequestmode). +- LAST JOIN is currently not supported in online preview mode. + +### WITH Clause + +OpenMLDB (>= v0.7.2) supports non-recursive WITH clauses. The WITH clause functions equivalently to how other clauses work when applied to subqueries. To understand how the WITH statement is supported, please refer to its corresponding subquery writing methods as explained in the table above. + +No special restrictions apply in this case. + +### ORDER BY Keyword + +The sorting keyword `ORDER BY` is only supported within the `WINDOW` and `LAST JOIN` clauses in the window definition, and the reverse sorting keyword `DESC` is not supported. Detailed guidance on these clauses can be found in the WINDOW and LAST JOIN sections. + +### Aggregate Function + +Aggregation functions can be applied to all tables or windows. Window aggregation queries are supported in all three modes. Full table aggregation queries are only supported in online preview mode and are not available in offline and online request modes. + +- Regarding full table aggregation, OpenMLDB v0.6.0 began supporting this feature in online preview mode. However, it's essential to pay attention to the described [Scanning Limit Configuration](https://openmldb.feishu.cn/wiki/wikcnhBl4NsKcAX6BO9NDtKAxDf#doxcnLWICKzccMuPiWwdpVjSaIe). + +- OpenMLDB offers various extensions for aggregation functions. To find the specific functions supported, please consult the product documentation in [OpenMLDB Built-In Function](https://chat.openai.com/openmldb_sql/functions_and_operators/Files/udfs_8h.md). 
+ +## Extended Syntax + +OpenMLDB has focused on deep customization of the `WINDOW` and `LAST JOIN` statements, and this section will provide an in-depth explanation of these two statements. + +### WINDOW Clause + +A typical WINDOW statement in OpenMLDB generally includes the following elements: + +- Data Definition: Defines the data within the window using `PARTITION BY`. +- Data Sorting: Defines the data sorting within the window using `ORDER BY`. +- Scope Definition: Determines the direction of time extension through `PRECEDING`, `CURRENT ROW`, and `UNBOUNDED`. +- Range Unit: Utilizes `ROWS` and `ROWS_RANGE` to specify the unit of window sliding range. +- Window Attributes: Includes OpenMLDB-specific window attribute definitions, such as `MAXSIZE`, `EXCLUDE CURRENT_ROW`, `EXCLUDE CURRENT_TIME`, and `INSTANCE_NOT_IN_WINDOW`. +- Multi-table Definition: Uses the extended syntax `WINDOW ... UNION` to determine whether concatenation of cross-table data sources is required. + +For a detailed syntax of the WINDOW statement, please refer to the [WINDOW Documentation](link to the documentation)(../openmldb_sql/dql/WINDOW_CLAUSE.md) + +| **Statement Element** | **Support Syntax** | **Description** | Required? | +| ---------------------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | --------- | +| Data Definition | PARTITION BY | OpenMLDB supports multiple column data types: bool, int16, int32, int64, string, date, timestamp. | ✓ | +| Data Sorting | ORDER BY | - It only supports sorting on a single column.
- Supported data types for sorting include int16, int32, int64, and timestamp.
- Reverse order (`DESC`) is not supported. | ✓ | +| Scope Definition | Here's a summary of window definition syntax: | - You must specify both upper and lower boundaries.
- The boundary keyword `FOLLOWING` is not supported.
- In online request mode, the "current row" represents the present request line. From a table perspective, the current row is conceptually inserted into the appropriate position in the table based on the `ORDER BY` criteria. | ✓ | +| Scope Unit | - For basic upper and lower bounds definition, you can use ROWS/ROWS_RANGE BETWEEN ... AND...
- Scope definition is supported with keywords like PRECEDING, OPEN PRECEDING, CURRENT ROW, and UNBOUNDED.ROWS
ROWS_RANGE (Extended) | - ROWS_RANGE is an extended syntax for defining window boundaries, similar to standard SQL RANGE-type windows. It allows window boundaries to be defined with either numerical values or values with time units.
- Window ranges defined in time units are equivalent to window definitions where time is converted into milliseconds. For example, `ROWS_RANGE 10s PRECEDING` and `ROWS_RANGE 10000 PRECEDING` are equivalent. | ✓ | +| Window Properties (Extended) | MAXSIZE
EXCLUDE CURRENT_ROW
EXCLUDE CURRENT_TIME
INSTANCE_NOT_IN_WINDOW | MAXSIZE is only valid for ROWS_RANGE windows | - | +| Multi-Table Definition (Extended) | In practical use, the syntax form is relatively complex. Please refer to:
[Cross Table Feature Development Tutorial](../tutorial/tutorial_sql_2.md)
[WINDOW UNION Syntax Documentation](../openmldb_sql/dql/WINDOW_CLAUSE.md#1-window--union) | - It permits the merging of multiple tables.
- It allows the union of simple subqueries.
- It is commonly used in combination with aggregation functions for cross-table aggregation operations. | - | +| Anonymous Window | - | A complete window definition must include `PARTITION BY`, `ORDER BY`, and a window range definition. | - | + +#### Special Restrictions + +In online preview mode or offline mode, there are known issues when LIMIT or WHERE clauses are used as inputs to the WINDOW clause, so this combination is generally not recommended. + +#### Example of Window Definition + +Define a `ROWS` type window with a range from the preceding 1000 rows to the current row. + +```SQL +SELECT + sum(col2) OVER w1 as w1_col2_sum +FROM + t1 WINDOW w1 AS ( + PARTITION BY col1 + ORDER BY + col5 ROWS BETWEEN 1000 PRECEDING + AND CURRENT ROW + ); +``` + +Define a `ROWS_RANGE` type window covering all rows in the 10 seconds preceding the current row, including the current row. + +```SQL +SELECT + sum(col2) OVER w1 as w1_col2_sum +FROM + t1 WINDOW w1 AS ( + PARTITION BY col1 + ORDER BY + col5 ROWS_RANGE BETWEEN 10s PRECEDING + AND CURRENT ROW + ); +``` + +Define a `ROWS` type window with a range from the preceding 1000 rows to the current row that includes the current row but no other rows with the same timestamp. + +```SQL +SELECT + sum(col2) OVER w1 as w1_col2_sum +FROM + t1 WINDOW w1 AS ( + PARTITION BY col1 + ORDER BY + col5 ROWS BETWEEN 1000 PRECEDING + AND CURRENT ROW EXCLUDE CURRENT_TIME + ); +``` + +Define a `ROWS_RANGE` type window with a range from the current time to the past 10 seconds, excluding the current request row. 
+ +```SQL +SELECT + sum(col2) OVER w1 as w1_col2_sum +FROM + t1 WINDOW w1 AS ( + PARTITION BY col1 + ORDER BY + col5 ROWS_RANGE BETWEEN 10s PRECEDING + AND CURRENT ROW EXCLUDE CURRENT_ROW + ); +``` + +Anonymous window: + +```SQL +SELECT + id, + pk1, + col1, + std_ts, + sum(col1) OVER ( + PARTITION BY pk1 + ORDER BY + std_ts ROWS BETWEEN 1 PRECEDING + AND CURRENT ROW + ) as w1_col1_sum from t1; +``` + +#### Example of WINDOW ... UNION + +In practical development, many applications store data in multiple tables. In such cases, the syntax `WINDOW ... UNION` is commonly used for cross-table aggregation operations. Please refer to the "Multi-Table Aggregation Features" section in the [Cross-Table Feature Development Tutorial](https://chat.openai.com/tutorial/tutorial_sql_2.md). + +### LAST JOIN Clause + +For detailed syntax specifications for LAST JOIN, please refer to the [LAST JOIN Documentation](https://chat.openai.com/openmldb_sql/dql/JOIN_CLAUSE.md). + +| **Statement Element** | **Support Syntax** | **Description** | Required? | +| --------------------- | ------------------ | ------------------------------------------------------------ | --------- | +| ON | ✓ | Supported column types include: BOOL, INT16, INT32, INT64, STRING, DATE, TIMESTAMP. | ✓ | +| USING | X | - | - | +| ORDER BY | ✓ | - Only the following column types can be used: INT16, INT32, INT64, TIMESTAMP.
- The reverse order keyword DESC is not supported. | - | + +#### Example of LAST JOIN + +```SQL +SELECT + * +FROM + t1 +LAST JOIN t2 ON t1.col1 = t2.col1; +``` + diff --git a/docs/en/openmldb_sql/udf_develop_guide.md b/docs/en/openmldb_sql/udf_develop_guide.md new file mode 100644 index 00000000000..7153af27be9 --- /dev/null +++ b/docs/en/openmldb_sql/udf_develop_guide.md @@ -0,0 +1,216 @@ +# UDF Function Development Guideline +## 1. Background +Although OpenMLDB already provides hundreds of built-in functions, they cannot cover every need. In the past, the only option was to develop a new built-in function, which has a relatively long cycle because the binaries must be recompiled and users have to wait for a new release. +To help users quickly develop computing functions that OpenMLDB does not provide, we developed a mechanism for dynamically registering user-defined functions. OpenMLDB loads the compiled library that contains the user-defined function when the `Create Function` statement is executed. + +SQL functions can be categorised into scalar functions and aggregate functions. An introduction to scalar functions and aggregate functions can be found [here](./built_in_function_develop_guide.md). +## 2. Development Procedures +### 2.1 Develop UDF functions +#### 2.1.1 Naming Specification of C++ Built-in Function +- The names of C++ built-in functions should follow the [snake_case](https://en.wikipedia.org/wiki/Snake_case) style. +- The name should clearly express the function's purpose. +- The name of a function should not be the same as the name of a built-in function or other custom functions. The list of all built-in functions can be found [here](../reference/sql/functions_and_operators/Files/udfs_8h.md). + +#### 2.1.2 +The types of the built-in C++ functions' parameters should be BOOL, NUMBER, TIMESTAMP, DATE, or STRING. 
+The SQL types corresponding to C++ types are shown as follows: + +| SQL Type | C/C++ Type | +|:----------|:------------| +| BOOL | `bool` | +| SMALLINT | `int16_t` | +| INT | `int32_t` | +| BIGINT | `int64_t` | +| FLOAT | `float` | +| DOUBLE | `double` | +| STRING | `StringRef` | +| TIMESTAMP | `Timestamp` | +| DATE | `Date` | + + +#### 2.1.3 Parameters and Return Values + +**Return Value**: + +* If the output type of the UDF is a basic type and is not nullable, it is returned as the function's return value. +* If the output type of the UDF is a basic type and is nullable, it is returned through a function parameter. +* If the output type of the UDF is STRING, TIMESTAMP or DATE, it is returned through the last parameter of the function. + +**Parameters**: + +* If the parameter is a basic type, it will be passed by value. +* If the parameter type is STRING, TIMESTAMP or DATE, it will be passed by pointer. +* The first parameter must be `UDFContext* ctx`. The definition of [UDFContext](../../../include/udf/openmldb_udf.h) is: + +```c++ + struct UDFContext { + ByteMemoryPool* pool; // Used for memory allocation. + void* ptr; // Used for the storage of temporary variables for aggregate functions. + }; +``` + +**Note**: +- If an input value is nullable, an additional `is_null` parameter is added after it to indicate whether the value is null. +- If the return value is nullable, it is returned through an output argument, followed by another `is_null` parameter. + +For instance, the following declares a UDF whose inputs and return value are all nullable. +```c++ +extern "C" +void sum(::openmldb::base::UDFContext* ctx, int64_t input1, bool input1_is_null, int64_t input2, bool input2_is_null, int64_t* output, bool* is_null); +``` + +**Function Declaration**: + +* The functions must be declared by extern "C". 
+ +#### 2.1.4 Memory Management + +- It is not allowed to use the `new` operator or the `malloc` function to allocate memory for input and output arguments in UDF functions. 
+- If you use the `new` operator or the `malloc` function to allocate memory for UDFContext::ptr in a UDAF init function, it needs to be freed manually in the output function. +- If you need to request additional memory space dynamically, please use the memory management interface provided by OpenMLDB. OpenMLDB will automatically free the memory space after the function is executed. + +```c++ + char *buffer = ctx->pool->Alloc(size); +``` + +- The maximum size of the space allocated at a time cannot exceed 2M bytes. + + +#### 2.1.5 Implement the UDF Function +- The header file `udf/openmldb_udf.h` should be included. +- Develop the logic of the function. + +```c++ +#include "udf/openmldb_udf.h" // The header file + +// Develop a UDF which slices the first 2 characters of a given string. +extern "C" +void cut2(::openmldb::base::UDFContext* ctx, ::openmldb::base::StringRef* input, ::openmldb::base::StringRef* output) { + if (input == nullptr || output == nullptr) { + return; + } + uint32_t size = input->size_ <= 2 ? input->size_ : 2; + // To allocate memory in UDF functions, please use ctx->pool. + char *buffer = ctx->pool->Alloc(size); + memcpy(buffer, input->data_, size); + output->size_ = size; + output->data_ = buffer; +} +``` + + +#### 2.1.6 Implement the UDAF Function +- The header file `udf/openmldb_udf.h` should be included. +- Develop the logic of the function. + +Three functions need to be developed: +- init function. Performs initialization work such as allocating memory or initializing variables. The function name should be "xxx_init" +- update function. Updates the aggregate value. The function name should be "xxx_update" +- output function. Extracts the aggregate value and returns it. The function name should be "xxx_output" + +**Note**: The init and update functions should return `UDFContext*`. 
+ +```c++ +#include "udf/openmldb_udf.h" + +extern "C" +::openmldb::base::UDFContext* special_sum_init(::openmldb::base::UDFContext* ctx) { + // allocate memory from the memory pool + ctx->ptr = ctx->pool->Alloc(sizeof(int64_t)); + // init the value + *(reinterpret_cast<int64_t*>(ctx->ptr)) = 10; + // return the pointer of UDFContext + return ctx; +} + +extern "C" +::openmldb::base::UDFContext* special_sum_update(::openmldb::base::UDFContext* ctx, int64_t input) { + // get the value from ptr in UDFContext + int64_t cur = *(reinterpret_cast<int64_t*>(ctx->ptr)); + cur += input; + *(reinterpret_cast<int64_t*>(ctx->ptr)) = cur; + // return the pointer of UDFContext + return ctx; +} + +// get the result from ptr in UDFContext and return +extern "C" +int64_t special_sum_output(::openmldb::base::UDFContext* ctx) { + return *(reinterpret_cast<int64_t*>(ctx->ptr)) + 5; +} + +``` + + +For more UDF implementations, see [here](../../../src/examples/test_udf.cc). + + +### 2.2 Compile the Dynamic Library + +- Copy the `include` directory (`https://github.com/4paradigm/OpenMLDB/tree/main/include`) to a certain path (like `/work/OpenMLDB/`) for later compiling. +- Run the compiling command. `-I` specifies the path of the `include` directory. `-o` specifies the name of the dynamic library. + +```shell +g++ -shared -o libtest_udf.so examples/test_udf.cc -I /work/OpenMLDB/include -std=c++11 -fPIC +``` + +### 2.3 Copy the Dynamic Library +The compiled dynamic libraries should be copied into the `udf` directories for both TaskManager and tablets. Please create a new `udf` directory if it does not exist. +- The `udf` directory of a tablet is `path_to_tablet/udf`. +- The `udf` directory of TaskManager is `path_to_taskmanager/taskmanager/bin/udf`. 
+ +For example, if the deployment paths of a tablet and TaskManager are both `/work/openmldb`, the structure of the directory is shown below: + +``` + /work/openmldb/ + ├── bin + ├── conf + ├── taskmanager + │   ├── bin + │   │   ├── taskmanager.sh + │   │   └── udf + │   │   └── libtest_udf.so + │   ├── conf + │   └── lib + ├── tools + └── udf +    └── libtest_udf.so +``` + +```{note} +- Note that, for multiple tablets, the library needs to be copied to every one. +- Moreover, dynamic libraries should not be deleted before the execution of `DROP FUNCTION`. +``` + + +### 2.4 Register, Drop and Show the Functions +For registering, please use [CREATE FUNCTION](../reference/sql/ddl/CREATE_FUNCTION.md). +```sql +CREATE FUNCTION cut2(x STRING) RETURNS STRING OPTIONS (FILE='libtest_udf.so'); +``` + +Create an udaf function that input value and return value support null. +```sql +CREATE AGGREGATE FUNCTION third(x BIGINT) RETURNS BIGINT OPTIONS (FILE='libtest_udf.so', ARG_NULLABLE=true, RETURN_NULLABLE=true); +``` + +```{note} +- The types of parameters and return values must be consistent with the implementation of the code. +- `FILE` specifies the file name of the dynamic library. It is not necessary to include a path. +- A UDF function can only work on one type. Please create multiple functions for multiple types. +``` + +After successful registration, the function can be used. +```sql +SELECT cut2(c1) FROM t1; +``` + +You can view registered functions through `SHOW FUNCTIONS`. +```sql +SHOW FUNCTIONS; +``` + +Please use the `DROP FUNCTION` to delete a registered function. 
+```sql +DROP FUNCTION cut2; +``` From e648f88fa5947edafedebd2c2cb8ce57dd41bbaf Mon Sep 17 00:00:00 2001 From: Siqi Wang Date: Wed, 29 Nov 2023 16:40:41 +0800 Subject: [PATCH 2/6] Update sql_difference.md --- docs/en/openmldb_sql/sql_difference.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/docs/en/openmldb_sql/sql_difference.md b/docs/en/openmldb_sql/sql_difference.md index c0c4b26a5a9..72f31fa75e9 100644 --- a/docs/en/openmldb_sql/sql_difference.md +++ b/docs/en/openmldb_sql/sql_difference.md @@ -14,13 +14,13 @@ Note: ✓ indicates that the statement is supported, while ✕ indicates that it | -------------- | ---------------------------- | -------------------------------- | -------------------------------- | ------------ | ------------------------------------------------------------ | | WHERE Clause | ✓ | ✓ | ✕ | ✓ | Some functionalities can be achieved through built-in functions with the `_where` suffix. | | HAVING Clause | ✓ | ✓ | X | ✓ | | -| JOIN Clause | ✓ | ✕ | ✓ | ✓ | OpenMLDB only supports the unique **LAST JOIN**. | -| GROUP BY Grouping | ✓ | ✕ | ✕ | ✓ | | -| ORDER BY Keyword | ✓ | ✓ | ✓ | ✓ | Support is limited to usage within the `WINDOW` and `LAST JOIN` clauses; it does not support reverse sorting in `DESC`. | -| LIMIT the Number of Rows | ✓ | ✓ | ✕ | ✓ | | -| WINDOW Clause | ✓ | ✓ | ✓ | ✓ | OpenMLDB introduces unique **WINDOW ... UNION** and **WINDOW ATTRIBUTES** syntax. | -| WITH Clause | ✕ | ✕ | ✕ | ✓ | OpenMLDB support begins from version v0.7.2. | -| Aggregate Function | ✓ | ✓ | ✓ | ✓ | OpenMLDB offers a variety of extension functions. | +| JOIN Clause | ✓ | ✕ | ✓ | ✓ | OpenMLDB only supports **LAST JOIN**. | +| GROUP BY | ✓ | ✕ | ✕ | ✓ | | +| ORDER BY | ✓ | ✓ | ✓ | ✓ | Support is limited to usage within the `WINDOW` and `LAST JOIN` clauses; it does not support reverse sorting in `DESC`. 
| +| LIMIT | ✓ | ✓ | ✕ | ✓ | | +| WINDOW Clause | ✓ | ✓ | ✓ | ✓ | OpenMLDB includes new syntac **WINDOW UNION** and **WINDOW ATTRIBUTES**. | +| WITH Clause | ✕ | ✕ | ✕ | ✓ | OpenMLDB supports begins from version v0.8.0. | +| Aggregate Function | ✓ | ✓ | ✓ | ✓ | OpenMLDB has more extension functions. | From 7a76d0ecb5d783b69365807174fbefb50d110670 Mon Sep 17 00:00:00 2001 From: Siqi Wang Date: Wed, 13 Dec 2023 17:32:34 +0800 Subject: [PATCH 3/6] Update sql_difference.md --- docs/en/openmldb_sql/sql_difference.md | 65 +++++++++++++++----------- 1 file changed, 37 insertions(+), 28 deletions(-) diff --git a/docs/en/openmldb_sql/sql_difference.md b/docs/en/openmldb_sql/sql_difference.md index 72f31fa75e9..feee7e6a9c4 100644 --- a/docs/en/openmldb_sql/sql_difference.md +++ b/docs/en/openmldb_sql/sql_difference.md @@ -6,7 +6,7 @@ Unless otherwise specified, the default version is OpenMLDB: >= v0.7.1 ## Support Overview -The table below summarizes the differences in overall performance between OpenMLDB SQL and standard SQL based on SELECT statement elements across three execution modes (for execution mode details, please refer to Using Process and Execution Mode). OpenMLDB SQL is currently partially compatible with standard SQL, with additional syntax introduced to accommodate specific business scenarios. New syntax is indicated in bold in the table. +The table below summarizes the differences in overall performance between OpenMLDB SQL and standard SQL based on SELECT statement elements across three execution modes (for execution mode details, please refer to [Workflow and Execution Modes](../quickstart/concepts/modes.md)). OpenMLDB SQL is currently partially compatible with standard SQL, with additional syntax introduced to accommodate specific business scenarios. New syntax is indicated in bold in the table. Note: ✓ indicates that the statement is supported, while ✕ indicates that it is not. 
@@ -14,11 +14,11 @@ Note: ✓ indicates that the statement is supported, while ✕ indicates that it | -------------- | ---------------------------- | -------------------------------- | -------------------------------- | ------------ | ------------------------------------------------------------ | | WHERE Clause | ✓ | ✓ | ✕ | ✓ | Some functionalities can be achieved through built-in functions with the `_where` suffix. | | HAVING Clause | ✓ | ✓ | X | ✓ | | -| JOIN Clause | ✓ | ✕ | ✓ | ✓ | OpenMLDB only supports **LAST JOIN**. | +| JOIN Clause | ✓ | ✕ | ✓ | ✓ | OpenMLDB only supports **LAST JOIN** and **LEFT JOIN**. | | GROUP BY | ✓ | ✕ | ✕ | ✓ | | | ORDER BY | ✓ | ✓ | ✓ | ✓ | Support is limited to usage within the `WINDOW` and `LAST JOIN` clauses; it does not support reverse sorting in `DESC`. | | LIMIT | ✓ | ✓ | ✕ | ✓ | | -| WINDOW Clause | ✓ | ✓ | ✓ | ✓ | OpenMLDB includes new syntac **WINDOW UNION** and **WINDOW ATTRIBUTES**. | +| WINDOW Clause | ✓ | ✓ | ✓ | ✓ | OpenMLDB includes new syntax **WINDOW UNION** and **WINDOW ATTRIBUTES**. | | WITH Clause | ✕ | ✕ | ✕ | ✓ | OpenMLDB supports begins from version v0.8.0. | | Aggregate Function | ✓ | ✓ | ✓ | ✓ | OpenMLDB has more extension functions. | @@ -38,7 +38,7 @@ Compared to standard SQL, the differences in OpenMLDB SQL can be explained from To prevent user errors from affecting online performance, OpenMLDB has introduced relevant parameters that limit the number of full table scans in offline mode and online preview mode. If these limitations are enabled, certain operations involving scans of multiple records (such as SELECT *, aggregation operations, etc.) may result in truncated results and, consequently, incorrect outcomes. It's essential to note that these parameters do not affect the accuracy of results in online request mode. -The configuration of these parameters is done within the tablet configuration file `conf/tablet.flags`, as detailed in the document on Configuration File. 
The parameters affecting scan limits include: +The configuration of these parameters is done within the tablet configuration file `conf/tablet.flags`, as detailed in the document on [Configuration File](../deploy/conf.md#the-configuration-file-for-tablet-conftabletflags). The parameters affecting scan limits include: - Maximum Number of Scans: `--max_traverse_cnt` - Maximum Number of Scanned Keys: `--max_traverse_pk_cnt` @@ -52,9 +52,9 @@ In versions from v0.7.3 onwards, it's expected that the default values for these | ------------------ | ------------ | ---------------- | ---------------- | | Table References | ✓ | ✓ | ✕ | | LAST JOIN | ✓ | ✓ | ✕ | -| Subquery/ With Clause | ✓ | ✓ | ✕ | +| Subquery/ WITH Clause | ✓ | ✓ | ✕ | -In the online request mode, the `WHERE` clause isn't supported. However, some functionalities can be achieved through computation functions with the `_where` suffix, like `count_where` and `avg_where`, among others. For detailed information, please consult the [Built-In Computation Function Document](https://chat.openai.com/c/functions_and_operators/Files/udfs_8h.md). +In the online request mode, the `WHERE` clause isn't supported. However, some functionalities can be achieved through computation functions with the `_where` suffix, like `count_where` and `avg_where`, among others. For detailed information, please refer to [Built-In Functions](./udfs_8h.md). ### LIMIT Clause @@ -66,7 +66,7 @@ LIMIT is followed by an INT literal, and it does not support other expressions. | WHERE | ✓ | ✓ | ✕ | | WINDOW | ✓ | ✓ | ✕ | | LAST JOIN | ✓ | ✓ | ✕ | -| GROUP BY & HAVING | ✕ | ✓ | X | +| GROUP BY & HAVING | ✕ | ✓ | ✕ | ### WINDOW Clause @@ -94,19 +94,22 @@ The GROUP BY statement is still considered an experimental feature and only supp | LAST JOIN | ✕ | ✕ | ✕ | | Subquery | ✕ | ✕ | ✕ | -### JOIN Clause(LAST JOIN) +### JOIN Clause -OpenMLDB exclusively supports the LAST JOIN syntax. 
For a detailed description, please refer to the section on LAST JOIN in the extended syntax. A JOIN consists of two inputs, the left and right. In the online request mode, it supports two inputs as physical tables or specific subqueries. You can refer to the table for specific details. If a scenario is not listed, it means it's not supported. +OpenMLDB exclusively supports the LAST JOIN and LEFT JOIN syntax. For a detailed description, please refer to the section on JOIN in the extended syntax. A JOIN consists of two inputs, the left and right. In the online request mode, it supports two inputs as physical tables or specific subqueries. You can refer to the table for specific details. If a scenario is not listed, it means it's not supported. -| **Apply to** | **Offline Mode** | **Online Preview Mode** | **Online Request Mode** | -| ------------------------------------------------------------ | ---------------- | ----------------------- | ----------------------- | -| Two Table References | ✓✓ | ✓✕ | ✕✓ | -| As for WHERE subqueries, they are only allowed in the following cases:
- When both the left and right tables are simple column filters
- When both the left and right tables are the result of WINDOW or LAST JOIN operations | ✕✓ | ✕✓ | ✕✓ | +| **Apply to** | **Offline Mode** | **Online Preview Mode** | **Online Request Mode** | +| ---------------------------------------------- | ------------ | ---------------- | ---------------- | +| LAST JOIN + two table reference | ✓ | ✕ | ✓ | +| LAST JOIN + simple column filtering for both tables| ✓ | ✕ | ✓ | +| LAST JOIN + left table is filtering with WHERE | ✓ | ✕ | ✓ | +| LAST JOIN one of the table is WINDOW or LAST JOIN | ✓ | ✕ | ✓ | +| LAST JOIN + right table is LEFT JOIN subquery | ✕ | ✕ | ✓ | +| LEFT JOIN | ✕ | ✕ | ✕ | Special Restrictions: - -- Launching LAST JOIN for specific subqueries involves additional requirements. For more information, please refer to [Launch Requirements](https://chat.openai.com/openmldb_sql/deployment_manage/ONLINE_REQUEST_REQUIREMENTS.md#usagespecifications-last-join-inonlinerequestmode). -- LAST JOIN is currently not supported in online preview mode. +- Launching LAST JOIN for specific subqueries involves additional requirements. For more information, please refer to [Online Requirements](../openmldb_sql/deployment_manage/ONLINE_REQUEST_REQUIREMENTS.md#specifications-of-last-join-under-online-request-mode). +- LAST JOIN and LEFT JOIN is currently not supported in online preview mode. ### WITH Clause @@ -124,11 +127,11 @@ Aggregation functions can be applied to all tables or windows. Window aggregatio - Regarding full table aggregation, OpenMLDB v0.6.0 began supporting this feature in online preview mode. However, it's essential to pay attention to the described [Scanning Limit Configuration](https://openmldb.feishu.cn/wiki/wikcnhBl4NsKcAX6BO9NDtKAxDf#doxcnLWICKzccMuPiWwdpVjSaIe). -- OpenMLDB offers various extensions for aggregation functions. 
To find the specific functions supported, please consult the product documentation in [OpenMLDB Built-In Function](https://chat.openai.com/openmldb_sql/functions_and_operators/Files/udfs_8h.md). +- OpenMLDB offers various extensions for aggregation functions. To find the specific functions supported, please consult the product documentation in [OpenMLDB Built-In Function](../openmldb_sql/udfs_8h.md). ## Extended Syntax -OpenMLDB has focused on deep customization of the `WINDOW` and `LAST JOIN` statements, and this section will provide an in-depth explanation of these two statements. +OpenMLDB has focused on deep customization of the `WINDOW` and `LAST JOIN` statements and this section will provide an in-depth explanation of these two statements. ### WINDOW Clause @@ -141,17 +144,17 @@ A typical WINDOW statement in OpenMLDB generally includes the following elements - Window Attributes: Includes OpenMLDB-specific window attribute definitions, such as `MAXSIZE`, `EXCLUDE CURRENT_ROW`, `EXCLUDE CURRENT_TIME`, and `INSTANCE_NOT_IN_WINDOW`. - Multi-table Definition: Uses the extended syntax `WINDOW ... UNION` to determine whether concatenation of cross-table data sources is required. -For a detailed syntax of the WINDOW statement, please refer to the [WINDOW Documentation](link to the documentation)(../openmldb_sql/dql/WINDOW_CLAUSE.md) +For a detailed syntax of the WINDOW statement, please refer to the [WINDOW Documentation](../openmldb_sql/dql/WINDOW_CLAUSE.md) | **Statement Element** | **Support Syntax** | **Description** | Required? | | ---------------------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | --------- | | Data Definition | PARTITION BY | OpenMLDB supports multiple column data types: bool, int16, int32, int64, string, date, timestamp. | ✓ | -| Data Sorting | ORDER BY | - It only supports sorting on a single column.
- Supported data types for sorting include int16, int32, int64, and timestamp.
- Reverse order (`DESC`) is not supported. | ✓ | -| Scope Definition | Here's a summary of window definition syntax: | - You must specify both upper and lower boundaries.
- The boundary keyword `FOLLOWING` is not supported.
- In online request mode, the "current row" represents the present request line. From a table perspective, the current row is conceptually inserted into the appropriate position in the table based on the `ORDER BY` criteria. | ✓ | -| Scope Unit | - For basic upper and lower bounds definition, you can use ROWS/ROWS_RANGE BETWEEN ... AND...
- Scope definition is supported with keywords like PRECEDING, OPEN PRECEDING, CURRENT ROW, and UNBOUNDED.ROWS
ROWS_RANGE (Extended) | - ROW_RANGE is an extended syntax for defining window boundaries similar to standard SQL RANGE-type windows. It allows defining window boundaries with either numerical values or values with time units. This is an extended syntax.
- Window ranges defined in time units are equivalent to window definitions where time is converted into milliseconds. For example, `ROWS_RANGE 10s PRECEDING` and `ROWS_RANGE 10000 PRECEDING` are equivalent. | ✓ | -| Window Properties (Extended) | MAXSIZE
EXCLUDE CURRENT_ROW
EXCLUDE CURRENT_TIME
INSTANCE_NOT_IN_WINDOW | MAXSIZE is only valid to ROWS_RANGE | - | -| Multi Table Definition (Extension) | In practical use, the syntax form is relatively complex. Please refer to:
[Cross Table Feature Development Tutorial](../tutorial/tutorial_sql_2.md)
[WINDOW UNION Syntax Documentation](../openmldb_sql/dql/WINDOW_CLAUSE.md#1-window--union) | - It permits the merging of multiple tables.
- It allows the union of simple subqueries.
- It is commonly used in combination with aggregation functions for cross-table aggregation operations. | - | -| Incognito Window | - | It's essential to note that a complete window definition must include `PARTITION BY`, `ORDER BY`, and window range definition. | - | +| Data Sorting | ORDER BY | - It only supports sorting on a single column.
- Supported data types for sorting include int16, int32, int64, and timestamp.
- Reverse order (`DESC`) is not supported.
- Required in versions before v0.8.4. | - | +| Scope Definition | Basic upper and lower bounds definition: ROWS/ROWS_RANGE BETWEEN ... AND ... Boundaries can be specified with the keywords PRECEDING, OPEN PRECEDING, CURRENT ROW, and UNBOUNDED. | - Both upper and lower boundaries must be specified.
- The boundary keyword `FOLLOWING` is not supported.
- In online request mode, `CURRENT ROW` refers to the current request row. From a table perspective, the current row is virtually inserted into the appropriate position in the table based on the `ORDER BY` criteria. | ✓ | +| Scope Unit | ROWS
ROWS_RANGE (Extended) | - ROWS_RANGE is an extended syntax for defining window boundaries, similar to standard SQL RANGE-type windows. It allows window boundaries to be defined with either plain numerical values or values with time units.
- Window ranges defined in time units are equivalent to window definitions where time is converted into milliseconds. For example, `ROWS_RANGE 10s PRECEDING ...` and `ROWS_RANGE 10000 PRECEDING...` are equivalent. | ✓ | +| Window Properties (Extended) | MAXSIZE
EXCLUDE CURRENT_ROW
EXCLUDE CURRENT_TIME
INSTANCE_NOT_IN_WINDOW | MAXSIZE is only valid for ROWS_RANGE windows. A window without ORDER BY cannot be used together with EXCLUDE CURRENT_TIME. | - | +| Multi Table Definition (Extension) | In practical use, the syntax form is relatively complex. Please refer to:
[Cross Table Feature Development Tutorial](../tutorial/tutorial_sql_2.md)
[WINDOW UNION Syntax Documentation](../openmldb_sql/dql/WINDOW_CLAUSE.md#1-window--union) | - Merging of multiple tables is allowed
- Union of simple subqueries is allowed
- It is commonly used in combination with aggregation functions for cross-table aggregation operations. | - | +| Incognito Window | - | Complete window definition must include `PARTITION BY`, `ORDER BY`, and window range definition. | - | #### Special Restrictions @@ -233,17 +236,17 @@ SELECT #### Example of WINDOW ... UNION -In practical development, many applications store data in multiple tables. In such cases, the syntax `WINDOW ... UNION` is commonly used for cross-table aggregation operations. Please refer to the "Multi-Table Aggregation Features" section in the [Cross-Table Feature Development Tutorial](https://chat.openai.com/tutorial/tutorial_sql_2.md). +In practical development, many applications store data in multiple tables. In such cases, the syntax `WINDOW ... UNION` is commonly used for cross-table aggregation operations. Please refer to the "Multi-Table Aggregation Features" section in the [Cross-Table Feature Development Tutorial](../tutorial/tutorial_sql_2.md). ### LAST JOIN Clause -For detailed syntax specifications for LAST JOIN, please refer to the [LAST JOIN Documentation](https://chat.openai.com/openmldb_sql/dql/JOIN_CLAUSE.md). +For detailed syntax specifications for LAST JOIN, please refer to the [LAST JOIN Documentation](../openmldb_sql/dql/JOIN_CLAUSE.md#join-clause). | **Statement Element** | **Support Syntax** | **Description** | Required? | | --------------------- | ------------------ | ------------------------------------------------------------ | --------- | | ON | ✓ | Supported column types include: BOOL, INT16, INT32, INT64, STRING, DATE, TIMESTAMP. | ✓ | | USING | X | - | - | -| ORDER BY | ✓ | - Only the following column types can be used: INT16, INT32, INT64, TIMESTAMP.
- The reverse order keyword DESC is not supported. | - | +| ORDER BY | ✓ | - An extension specific to LAST JOIN; LEFT JOIN does not support it.
- Only the following column types can be used: INT16, INT32, INT64, TIMESTAMP.
- The reverse order keyword DESC is not supported. | - | #### Example of LAST JOIN @@ -253,5 +256,11 @@ SELECT FROM t1 LAST JOIN t2 ON t1.col1 = t2.col1; + +SELECT + * +FROM + t1 +LEFT JOIN t2 ON t1.col1 = t2.col1; ``` From 260a5565996acd3ef4b187ff22e2ba531a387a93 Mon Sep 17 00:00:00 2001 From: Siqi Wang Date: Wed, 13 Dec 2023 18:10:00 +0800 Subject: [PATCH 4/6] Update udf_develop_guide.md --- docs/en/openmldb_sql/udf_develop_guide.md | 138 ++++++++++++---------- 1 file changed, 76 insertions(+), 62 deletions(-) diff --git a/docs/en/openmldb_sql/udf_develop_guide.md b/docs/en/openmldb_sql/udf_develop_guide.md index 7153af27be9..25032220398 100644 --- a/docs/en/openmldb_sql/udf_develop_guide.md +++ b/docs/en/openmldb_sql/udf_develop_guide.md @@ -1,17 +1,19 @@ -# UDF Function Development Guideline -## 1. Background -Although there are already hundreds of built-in functions, they can not satisfy the needs in some cases. In the past, this could only be done by developing new built-in functions. Built-in function development requires a relatively long cycle because it needs to recompile binary files and users have to wait for new version release. -In order to help users to quickly develop computing functions that are not provided by OpenMLDB, we develop the mechanism of user dynamic registration function. OpenMLDB will load the compiled library contains user defined function when executing `Create Function` statement. - -SQL functions can be categorised into scalar functions and aggregate functions. An introduction to scalar functions and aggregate functions can be seen [here](./built_in_function_develop_guide.md). -## 2. 
Development Procedures -### 2.1 Develop UDF functions -#### 2.1.1 Naming Specification of C++ Built-in Function +# UDF Development Guideline +## Background +Although OpenMLDB provides over a hundred built-in functions for data scientists to perform data analysis and feature extraction, there are scenarios where these functions might not fully meet the requirements. To facilitate users in quickly and flexibly implementing specific feature computation needs, we have introduced support for user-defined functions (UDFs) based on C++ development. Additionally, we enable the loading of dynamically generated user-defined function libraries. + +```{seealso} +Users can also extend OpenMLDB's computation function library using the method of developing built-in functions. However, developing built-in functions requires modifying the source code and recompiling. If users wish to contribute extended functions to the OpenMLDB codebase, they can refer to [Built-in Function Develop Guide](./built_in_function_develop_guide.md). +``` + +## Development Procedures +### Develop UDF functions +#### Naming Specification of C++ Built-in Function - The naming of C++ built-in function should follow the [snake_case](https://en.wikipedia.org/wiki/Snake_case) style. - The name should clearly express the function's purpose. -- The name of a function should not be the same as the name of a built-in function or other custom functions. The list of all built-in functions can be seen [here](../reference/sql/functions_and_operators/Files/udfs_8h.md). +- The name of a function should not be the same as the name of a built-in function or other custom functions. The list of all built-in functions can be seen [here](../openmldb_sql/udfs_8h.md). -#### 2.1.2 +#### C++ Type and SQL Type Correlation The types of the built-in C++ functions' parameters should be BOOL, NUMBER, TIMESTAMP, DATE, or STRING. 
The SQL types corresponding to C++ types are shown as follows: @@ -28,69 +30,70 @@ The SQL types corresponding to C++ types are shown as follows: | DATE | `Date` | -#### 2.1.3 Parameters and Return Values +#### Parameters and Return Values **Return Value**: -* If the output type of the UDF is a basic type and not support null, it will be processed as a return value. -* If the output type of the UDF is a basic type and support null, it will be processed as function parameter. -* If the output type of the UDF is STRING, TIMESTAMP or DATE, it will return through the last parameter of the function. +* If the output type of the UDF is a basic type and `return_nullable` set to false, it will be processed as a return value. +* If the output type of the UDF is a basic type and `return_nullable` set to true, it will be processed as a function parameter. +* If the output type of the UDF is STRING, TIMESTAMP or DATE, it will return through the **last parameter** of the function. **Parameters**: * If the parameter is a basic type, it will be passed by value. -* If the output type of the UDF is STRING, TIMESTAMP or DATE, it will be passed by pointer. +* If the output type of the UDF is STRING, TIMESTAMP or DATE, it will be passed by a pointer. * The first parameter must be `UDFContext* ctx`. The definition of [UDFContext](../../../include/udf/openmldb_udf.h) is: ```c++ struct UDFContext { ByteMemoryPool* pool; // Used for memory allocation. - void* ptr; // Used for the storage of temporary variables for aggregrate functions. + void* ptr; // Used for the storage of temporary variables for aggregate functions. }; ``` -**Note**: -- if the input value is nullable, there are added `is_null` parameter to lable whether is null -- if the return value is nullable, it should be return by argument and add another `is_null` parameter - -For instance, declare a UDF function that input is nullable and return value is nullable. 
-```c++ -extern "C" -void sum(::openmldb::base::UDFContext* ctx, int64_t input1, bool is_null, int64_t input2, bool is_null, int64_t* output, bool* is_null); -``` - **Function Declaration**: * The functions must be declared by extern "C". -#### 2.1.4 Memory Management +#### Memory Management + +- In scalar functions, the use of 'new' and 'malloc' to allocate space for input and output parameters is not allowed. However, temporary space allocation using 'new' and 'malloc' is permissible within the function, and the allocated space must be freed before the function returns. -- It is not allowed to use `new` operator or `malloc` function to allocate memory for input and output argument in UDF functions. -- If you use `new` operator or `malloc` function to allocate memory for UDFContext::ptr in UDAF init functions, it need to be freed in output function mannually. -- If you need to request additional memory space dynamically, please use the memory management interface provided by OpenMLDB. OpenMLDB will automatically free the memory space after the function is executed. +- In aggregate functions, space allocation using 'new' or 'malloc' can be performed in the 'init' function but must be released in the 'output' function. The final return value, if it is a string, needs to be stored in the space allocated by mempool. +- If dynamic memory allocation is required, OpenMLDB provides memory management interfaces. Upon function execution completion, OpenMLDB will automatically release the memory. ```c++ - char *buffer = ctx->pool->Alloc(size); +char *buffer = ctx->pool->Alloc(size); ``` +- The maximum size allocated at once cannot exceed 2M. -- The maximum size of the space allocated at a time cannot exceed 2M bytes. - +**Note**: +- If the parameters are declared as nullable, then all parameters are nullable, and each input parameter will have an additional `is_null` parameter. 
+- If the return value is declared as nullable, it will be returned through a parameter, and an additional `is_null` parameter will indicate whether the return value is null. + +For instance, a scalar UDF `sum` that takes two parameters, with nullable inputs and a nullable return value, is declared as follows: +```c++ +extern "C" +void sum(::openmldb::base::UDFContext* ctx, int64_t input1, bool is_null1, int64_t input2, bool is_null2, int64_t* output, bool* is_null); +``` +#### Scalar Function Implementation -#### 2.1.5 Implement the UDF Function +Scalar functions process individual data rows and return a single value, such as abs, sin, cos, date, year. +The process is as follows: - The header file `udf/openmldb_udf.h` should be included. - Develop the logic of the function. ```c++ -#include "udf/openmldb_udf.h" // The headfile +#include "udf/openmldb_udf.h" // must include this header file -// Develop a UDF which slices the first 2 characters of a given string. +// Develop a UDF that slices the first 2 characters of a given string. extern "C" void cut2(::openmldb::base::UDFContext* ctx, ::openmldb::base::StringRef* input, ::openmldb::base::StringRef* output) { if (input == nullptr || output == nullptr) { return; } uint32_t size = input->size_ <= 2 ? input->size_ : 2; - //To apply memory space in UDF functions, please use ctx->pool. + //use ctx->pool for memory allocation char *buffer = ctx->pool->Alloc(size); memcpy(buffer, input->data_, size); output->size_ = size; @@ -99,27 +102,33 @@ void cut2(::openmldb::base::UDFContext* ctx, ::openmldb::base::StringRef* input, ``` -#### 2.1.5 Implement the UDAF Function +#### Aggregation Function Implementation + +Aggregate functions process a dataset (such as a column of data) and perform computations, returning a single value, such as sum, avg, max, min, count. +The process is as follows: - The header file `udf/openmldb_udf.h` should be included. - Develop the logic of the function.
-It need to develop three functions as below: -- init function. do some init works in this function such as alloc memory or init variables. The function name should be "xxx_init" -- update function. Update the aggretrate value. The function name should be "xxx_update" -- output function. Extract the aggregrate value and return. The function name should be "xxx_output" +To develop an aggregate function, you need to implement the following three C++ functions: -**Node**: It should return `UDFContext*` as return value in init and update function. +- init function: Perform initialization tasks such as allocating space for intermediate variables. Function naming format: 'aggregate_function_name_init'. -```c++ -#include "udf/openmldb_udf.h" +- update function: Implement the logic for processing each row of the respective field. Function naming format: 'aggregate_function_name_update'. +- output function: Process the final aggregated value and return the result. Function naming format: 'aggregate_function_name_output'. + +**Note**: The init and update functions must return `UDFContext*`. + +```c++ +#include "udf/openmldb_udf.h" //must include this header file +// implementation of aggregation function special_sum extern "C" ::openmldb::base::UDFContext* special_sum_init(::openmldb::base::UDFContext* ctx) { - // allocte memory by memory poll + // allocate space for intermediate variables and assign to 'ptr' in UDFContext.
ctx->ptr = ctx->pool->Alloc(sizeof(int64_t)); // init the value *(reinterpret_cast<int64_t*>(ctx->ptr)) = 10; - // return the pointer of UDFContext + // return pointer of UDFContext, cannot be omitted return ctx; } @@ -129,11 +138,11 @@ extern "C" int64_t cur = *(reinterpret_cast<int64_t*>(ctx->ptr)); cur += input; *(reinterpret_cast<int64_t*>(ctx->ptr)) = cur; - // return the pointer of UDFContext + // return the pointer of UDFContext, cannot be omitted return ctx; } -// get the result from ptr in UDFcontext and return +// get the aggregation result from ptr in UDFcontext and return extern "C" int64_t special_sum_output(::openmldb::base::UDFContext* ctx) { return *(reinterpret_cast<int64_t*>(ctx->ptr)) + 5; @@ -145,16 +154,16 @@ int64_t special_sum_output(::openmldb::base::UDFContext* ctx) { For more UDF implementation, see [here](../../../src/examples/test_udf.cc). -### 2.2 Compile the Dynamic Library +### Compile the Dynamic Library - Copy the `include` directory (`https://github.com/4paradigm/OpenMLDB/tree/main/include`) to a certain path (like `/work/OpenMLDB/`) for later compiling. -- Run the compiling command. `-I` specifies the path of `include` directory. `-o` specifies the name of the dynamic library. +- Run the compiling command. `-I` specifies the path of the `include` directory. `-o` specifies the name of the dynamic library. ```shell g++ -shared -o libtest_udf.so examples/test_udf.cc -I /work/OpenMLDB/include -std=c++11 -fPIC ``` -### 2.3 Copy the Dynamic Library +### Copy the Dynamic Library The compiled dynamic libraries should be copied into the `udf` directories for both TaskManager and tablets. Please create a new `udf` directory if it does not exist. - The `udf` directory of a tablet is `path_to_tablet/udf`. - The `udf` directory of TaskManager is `path_to_taskmanager/taskmanager/bin/udf`. @@ -178,7 +187,32 @@ For example, if the deployment paths of a tablet and TaskManager are both `/work ``` ```{note} -- Note that, for multiple tablets, the library needs to be copied to every one.
-- Moreover, dynamic libraries should not be deleted before the execution of `DROP FUNCTION`. +- For multiple tablets, the library needs to be copied to every tablet. +- Dynamic libraries should not be deleted before the execution of `DROP FUNCTION`. ``` -### 2.4 Register, Drop and Show the Functions -For registering, please use [CREATE FUNCTION](../reference/sql/ddl/CREATE_FUNCTION.md). +### Register, Drop and Show the Functions +For registering, please use [CREATE FUNCTION](../openmldb_sql/ddl/CREATE_FUNCTION.md). + +Register a scalar function: ```sql CREATE FUNCTION cut2(x STRING) RETURNS STRING OPTIONS (FILE='libtest_udf.so'); ``` - -Create an udaf function that input value and return value support null. +Register an aggregation function: +```sql +CREATE AGGREGATE FUNCTION special_sum(x BIGINT) RETURNS BIGINT OPTIONS (FILE='libtest_udf.so'); +``` +Register an aggregation function whose input and return values are nullable: ```sql CREATE AGGREGATE FUNCTION third(x BIGINT) RETURNS BIGINT OPTIONS (FILE='libtest_udf.so', ARG_NULLABLE=true, RETURN_NULLABLE=true); ``` -```{note} +**Note**: - The types of parameters and return values must be consistent with the implementation of the code. - `FILE` specifies the file name of the dynamic library. It is not necessary to include a path. - A UDF function can only work on one type. Please create multiple functions for multiple types. -``` + After successful registration, the function can be used. ```sql @@ -210,7 +224,7 @@ You can view registered functions through `SHOW FUNCTIONS`. SHOW FUNCTIONS; ``` -Please use the `DROP FUNCTION` to delete a registered function. +Use `DROP FUNCTION` to delete a registered function.
```sql DROP FUNCTION cut2; ``` From 93442b0f36283a2d9ed6dfa2e642c27e05d57914 Mon Sep 17 00:00:00 2001 From: Siqi Wang Date: Wed, 13 Dec 2023 18:11:31 +0800 Subject: [PATCH 5/6] Update udf_develop_guide.md --- docs/en/openmldb_sql/udf_develop_guide.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/en/openmldb_sql/udf_develop_guide.md b/docs/en/openmldb_sql/udf_develop_guide.md index 25032220398..1a2d73335a8 100644 --- a/docs/en/openmldb_sql/udf_develop_guide.md +++ b/docs/en/openmldb_sql/udf_develop_guide.md @@ -8,7 +8,7 @@ Users can also extend OpenMLDB's computation function library using the method o ## Development Procedures ### Develop UDF functions -#### Naming Specification of C++ Built-in Function +#### Naming Convention of C++ Built-in Function - The naming of C++ built-in function should follow the [snake_case](https://en.wikipedia.org/wiki/Snake_case) style. - The name should clearly express the function's purpose. - The name of a function should not be the same as the name of a built-in function or other custom functions. The list of all built-in functions can be seen [here](../openmldb_sql/udfs_8h.md). @@ -154,7 +154,7 @@ int64_t special_sum_output(::openmldb::base::UDFContext* ctx) { For more UDF implementation, see [here](../../../src/examples/test_udf.cc). -### Compile the Dynamic Library +### Compile Dynamic Library - Copy the `include` directory (`https://github.com/4paradigm/OpenMLDB/tree/main/include`) to a certain path (like `/work/OpenMLDB/`) for later compiling. - Run the compiling command. `-I` specifies the path of the `include` directory. `-o` specifies the name of the dynamic library. @@ -163,7 +163,7 @@ For more UDF implementation, see [here](../../../src/examples/test_udf.cc). 
g++ -shared -o libtest_udf.so examples/test_udf.cc -I /work/OpenMLDB/include -std=c++11 -fPIC ``` -### Copy the Dynamic Library +### Copy Dynamic Library The compiled dynamic libraries should be copied into the `udf` directories for both TaskManager and tablets. Please create a new `udf` directory if it does not exist. - The `udf` directory of a tablet is `path_to_tablet/udf`. - The `udf` directory of TaskManager is `path_to_taskmanager/taskmanager/bin/udf`. From d5f078ed150c4d9cf677b137c7da29eb95c7c8ab Mon Sep 17 00:00:00 2001 From: Siqi Wang Date: Wed, 13 Dec 2023 18:13:37 +0800 Subject: [PATCH 6/6] Update udf_develop_guide.md remove numbering --- docs/zh/openmldb_sql/udf_develop_guide.md | 26 +++++++++++------------ 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/docs/zh/openmldb_sql/udf_develop_guide.md b/docs/zh/openmldb_sql/udf_develop_guide.md index 7fe4e81988d..e5ac1c94434 100644 --- a/docs/zh/openmldb_sql/udf_develop_guide.md +++ b/docs/zh/openmldb_sql/udf_develop_guide.md @@ -1,18 +1,18 @@ # User-Defined Function (UDF) Development -## 1. Background +## Background Although OpenMLDB has over a hundred built-in functions for data scientists to perform data analysis and feature extraction, they cannot fully meet the requirements in some scenarios. To help users implement specific feature computation needs quickly and flexibly, we support C++-based user-defined function (UDF) development, as well as the loading of dynamic user-defined function libraries. ```{seealso} Users can also extend OpenMLDB's computation function library by developing built-in functions. However, developing built-in functions requires modifying the source code and recompiling. Users who wish to contribute extended functions to the OpenMLDB codebase can refer to the [Built-in Function Development Guide](../developer/built_in_function_develop_guide.md). ``` -## 2.
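Development Procedures -### 2.1 Develop UDF Functions -#### 2.1.1 C++ Function Naming Convention +## Development Procedures +### Develop UDF Functions +#### C++ Function Naming Convention - C++ built-in function names should consistently follow the [snake_case](https://en.wikipedia.org/wiki/Snake_case) style. - The name should clearly express the function's purpose. - Function names must be unique: a name must not duplicate that of a built-in function or another user-defined function. For the list of all built-in functions, see [here](../openmldb_sql/functions_and_operators/Files/udfs_8h.md). -#### 2.1.2 C++ Type and SQL Type Correlation +#### C++ Type and SQL Type Correlation The parameter types of built-in C++ functions are limited to BOOL, numeric, timestamp and date, and string types. The correspondence between C++ types and SQL types is as follows: | SQL Type | C/C++ Type | @@ -26,7 +26,7 @@ | STRING | `StringRef` | | TIMESTAMP | `Timestamp` | | DATE | `Date` | -#### 2.1.3 Parameters and Return Values +#### Parameters and Return Values Return value: * If the output type of the UDF is a basic type and `return_nullable` is set to false, the result is returned as the function's return value. * If the output type of the UDF is a basic type and `return_nullable` is set to true, the result is returned through a function parameter. @@ -46,7 +46,7 @@ Function declaration: * Functions must be declared with extern "C". -#### 2.1.4 Memory Management +#### Memory Management - In scalar functions, using `new` or `malloc` to allocate space for input and output parameters is not allowed. `new` and `malloc` may be used inside the function for temporary space, which must be freed before the function returns. - In aggregate functions, space may be allocated with `new`/`malloc` in the init function, but it must be released in the output function. If the final return value is a string, it must be stored in space allocated from the mempool. @@ -67,7 +67,7 @@ extern "C" void sum(::openmldb::base::UDFContext* ctx, int64_t input1, bool is_null, int64_t input2, bool is_null, int64_t* output, bool* is_null) { ``` -#### 2.1.5 Scalar Function Development +#### Scalar Function Development A scalar function processes a single row of data and returns a single value, for example `abs`, `sin`, `cos`, `date`, and `year`. @@ -94,7 +94,7 @@ void cut2(::openmldb::base::UDFContext* ctx, ::openmldb::base::StringRef* input, } ``` -#### 2.1.6 Aggregate Function Development +#### Aggregate Function Development An aggregate function performs a computation over a dataset (such as a column of data) and returns a single value, for example `sum`, `avg`, `max`, `min`, and `count`. @@ -144,15 +144,15 @@ int64_t special_sum_output(::openmldb::base::UDFContext* ctx) { For more UDF/UDAF implementations, see [here](../../../src/examples/test_udf.cc). -### 2.2 Compile the Dynamic Library +### Compile the Dynamic Library - Copy the include directory `https://github.com/4paradigm/OpenMLDB/tree/main/include` to some path, such as /work/OpenMLDB/; it is needed for compilation in the next step. - Run the compile command, where -I specifies the path of the include directory and -o specifies the name of the output dynamic library. -- + ```shell g++ -shared -o libtest_udf.so examples/test_udf.cc -I /work/OpenMLDB/include -std=c++17 -fPIC ``` -### 2.3 Copy the Dynamic Library +### Copy the Dynamic Library The compiled dynamic library must be copied to TaskManager and the tablets. If the `udf` directory does not exist in TaskManager or the tablets, create it first and restart those processes (so that the environment variables take effect). - The UDF directory of a tablet is `path_to_tablet/udf`. -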
The UDF directory of TaskManager is `path_to_taskmanager/taskmanager/bin/udf`. @@ -181,7 +181,7 @@ g++ -shared -o libtest_udf.so examples/test_udf.cc -I /work/OpenMLDB/include -st - Do not delete the dynamic library before executing 'DROP FUNCTION'. ``` -### 2.4 Register, Drop and Show Functions +### Register, Drop and Show Functions To register a function, use [CREATE FUNCTION](../openmldb_sql/ddl/CREATE_FUNCTION.md). Register a scalar function: