From 29e93aa92c7084bdf9df40a6b7546f808e6ccb8b Mon Sep 17 00:00:00 2001 From: CookiePieWw Date: Tue, 6 Aug 2024 06:55:07 +0800 Subject: [PATCH 01/16] docs: json datatype rfc --- docs/rfcs/2024-08-06-json-datatype.md | 141 ++++++++++++++++++++++++++ 1 file changed, 141 insertions(+) create mode 100644 docs/rfcs/2024-08-06-json-datatype.md diff --git a/docs/rfcs/2024-08-06-json-datatype.md b/docs/rfcs/2024-08-06-json-datatype.md new file mode 100644 index 000000000000..b34a7d01a7ba --- /dev/null +++ b/docs/rfcs/2024-08-06-json-datatype.md @@ -0,0 +1,141 @@ +--- +Feature Name: Json Datatype +Tracking Issue: https://github.com/GreptimeTeam/greptimedb/issues/4230 +Date: 2024-8-6 +Author: "Yuhan Wang " +--- + +# Summary +This RFC proposes a method for storing and querying JSON data in the database. + +# Motivation +JSON is widely used across various scenarios. Direct support for writing and querying JSON can significantly enhance the database's flexibility. + +# Details + +## User Interface +The feature introduces a new data type for the database, similar to the common JSON type. Data is written as JSON strings and can be queried using functions. + +For example: +```SQL +CREATE TABLE IF NOT EXISTS test ( + ts TIMESTAMP TIME INDEX, + a INT, + b JSON +); + +INSERT INTO test VALUES( + 0, + 0, + '{ + "name": "jHl2oDDnPc1i2OzlP5Y", + "timestamp": "2024-07-25T04:33:11.369386Z", + "attributes": { "event_attributes": 48.28667 } + }' +); + +SELECT json_get(b, 'name') FROM test; ++---------------------+ +| b.name | ++---------------------+ +| jHl2oDDnPc1i2OzlP5Y | ++---------------------+ + +SELECT json_get_by_paths(b, 'attributes', 'event_attributes') + 1 FROM test; ++-------------------------------+ +| b.attributes.event_attributes | ++-------------------------------+ +| 49.28667 | ++-------------------------------+ + +``` + +## Storage + +### Schema Inference +Unlike other types, the schema of JSON data is inconsistent. 
For different JSON columns, we introduce a dynamic schema inference method for storing the data. + +For example: +```JSON +{ + "a": "jHl2oDDnPc1i2OzlP5Y", + "b": "2024-07-25T04:33:11.369386Z", + "c": { "d": 48.28648 } +} +``` +This will be parsed at runtime and stored as a corresponding `Struct` type in Arrow: +```Rust +Struct( + Field("a", Utf8), + Field("b", Utf8), + Field("c", Struct(Field("d", Float64))), +) +``` + +Dynamic schema inference helps achieve compression in some scenarios. See [benchmark](https://github.com/CookiePieWw/json-format-in-parquet-benchmark/) for more information. + +## Schema Change +The schema must remain consistent for a column within a table. When inserting data with different schemas, schema changes may occur. There are two types of schema changes: + +1. Field Addition + + Newly added fields can be incorporated into the schema, treating added fields in previously inserted data as null: + ```Rust + Struct( + Field("a", Utf8), + ) + + + Struct( + Field("a", Utf8), + Field("e", Int32) + ) + = + Struct( + Field("a", Utf8), + Field("e", Int32) + ) + ``` + +2. Field Modification + + Compatible fields can be altered to the widest type, similar to integral promotion in C: + ```Rust + Struct( + Field("a", Int16), + ) + + + Struct( + Field("a", Int32), + ) + = + Struct( + Field("a", Int32), + ) + ``` + + Non-compatible fields will fallback to a binary array to store the JSONB encoding: + ```Rust + Struct( + Field("a", Struct(Field("b", Float64))), + ) + + + Struct( + Field("a", Int32), + ) + = + Struct( + Field("a", BinaryArray), // JSONB + ) + ``` + +Like schema inference, schema changes are performed automatically without manual configuration. + +# Drawbacks + +1. This datatype is best suited for data with similar schemas. Varying schemas can lead to frequent schema changes and fallback to JSONB. +2. Schema inference and change bring additional writing overhead in favor of better compression rate. + +# Alternatives + +1. 
JSONB, a widely used binary representation format of json. +2. JSONC: A tape representation format for JSON with similar writing and query performance and better compression in some cases. See [discussion](https://github.com/apache/datafusion/issues/7845#issuecomment-2068061465) and [repo](https://github.com/CookiePieWw/jsonc) for more information. From ccf415bb06ddc7607d59b7d2a89eec744ec001ec Mon Sep 17 00:00:00 2001 From: CookiePieWw Date: Mon, 12 Aug 2024 14:13:30 +0800 Subject: [PATCH 02/16] docs: turn to a jsonb proposal --- docs/rfcs/2024-08-06-json-datatype.md | 86 ++------------------------- 1 file changed, 4 insertions(+), 82 deletions(-) diff --git a/docs/rfcs/2024-08-06-json-datatype.md b/docs/rfcs/2024-08-06-json-datatype.md index b34a7d01a7ba..8cd6777805aa 100644 --- a/docs/rfcs/2024-08-06-json-datatype.md +++ b/docs/rfcs/2024-08-06-json-datatype.md @@ -50,92 +50,14 @@ SELECT json_get_by_paths(b, 'attributes', 'event_attributes') + 1 FROM test; ``` -## Storage +## Storage and Querying -### Schema Inference -Unlike other types, the schema of JSON data is inconsistent. For different JSON columns, we introduce a dynamic schema inference method for storing the data. - -For example: -```JSON -{ - "a": "jHl2oDDnPc1i2OzlP5Y", - "b": "2024-07-25T04:33:11.369386Z", - "c": { "d": 48.28648 } -} -``` -This will be parsed at runtime and stored as a corresponding `Struct` type in Arrow: -```Rust -Struct( - Field("a", Utf8), - Field("b", Utf8), - Field("c", Struct(Field("d", Float64))), -) -``` - -Dynamic schema inference helps achieve compression in some scenarios. See [benchmark](https://github.com/CookiePieWw/json-format-in-parquet-benchmark/) for more information. - -## Schema Change -The schema must remain consistent for a column within a table. When inserting data with different schemas, schema changes may occur. There are two types of schema changes: - -1. 
Field Addition - - Newly added fields can be incorporated into the schema, treating added fields in previously inserted data as null: - ```Rust - Struct( - Field("a", Utf8), - ) - + - Struct( - Field("a", Utf8), - Field("e", Int32) - ) - = - Struct( - Field("a", Utf8), - Field("e", Int32) - ) - ``` - -2. Field Modification - - Compatible fields can be altered to the widest type, similar to integral promotion in C: - ```Rust - Struct( - Field("a", Int16), - ) - + - Struct( - Field("a", Int32), - ) - = - Struct( - Field("a", Int32), - ) - ``` - - Non-compatible fields will fallback to a binary array to store the JSONB encoding: - ```Rust - Struct( - Field("a", Struct(Field("b", Float64))), - ) - + - Struct( - Field("a", Int32), - ) - = - Struct( - Field("a", BinaryArray), // JSONB - ) - ``` - -Like schema inference, schema changes are performed automatically without manual configuration. +Data of JSON type is stored as JSONB format in the database. For storage layer, data is represented as a binary array and can be queried through pre-defined JSON functions. For clients, data is shown as strings and can be casted to other types if needed. # Drawbacks -1. This datatype is best suited for data with similar schemas. Varying schemas can lead to frequent schema changes and fallback to JSONB. -2. Schema inference and change bring additional writing overhead in favor of better compression rate. +As a general purpose data type, JSONB may not be as efficient as specialized data types for specific scenarios. # Alternatives -1. JSONB, a widely used binary representation format of json. -2. JSONC: A tape representation format for JSON with similar writing and query performance and better compression in some cases. See [discussion](https://github.com/apache/datafusion/issues/7845#issuecomment-2068061465) and [repo](https://github.com/CookiePieWw/jsonc) for more information. +Extract and flatten JSON schema to store in a structured format throught pipeline. 
For nested data, we can provide nested types like `STRUCT` or `ARRAY`. From bacbe15ecb8f0c620a541a8beb50341abef3fe1b Mon Sep 17 00:00:00 2001 From: CookiePieWw Date: Mon, 12 Aug 2024 14:16:49 +0800 Subject: [PATCH 03/16] chore: fix typo --- docs/rfcs/2024-08-06-json-datatype.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/rfcs/2024-08-06-json-datatype.md b/docs/rfcs/2024-08-06-json-datatype.md index 8cd6777805aa..c9fc7c1b895a 100644 --- a/docs/rfcs/2024-08-06-json-datatype.md +++ b/docs/rfcs/2024-08-06-json-datatype.md @@ -60,4 +60,4 @@ As a general purpose data type, JSONB may not be as efficient as specialized dat # Alternatives -Extract and flatten JSON schema to store in a structured format throught pipeline. For nested data, we can provide nested types like `STRUCT` or `ARRAY`. +Extract and flatten JSON schema to store in a structured format through pipeline. For nested data, we can provide nested types like `STRUCT` or `ARRAY`. From c2580e420cc834cef05299df1436fabde7e4e392 Mon Sep 17 00:00:00 2001 From: CookiePieWw Date: Mon, 12 Aug 2024 16:28:05 +0800 Subject: [PATCH 04/16] feat: add store and query process --- docs/rfcs/2024-08-06-json-datatype.md | 29 ++++++++++++++++++++++++++- 1 file changed, 28 insertions(+), 1 deletion(-) diff --git a/docs/rfcs/2024-08-06-json-datatype.md b/docs/rfcs/2024-08-06-json-datatype.md index c9fc7c1b895a..42d9db21c6db 100644 --- a/docs/rfcs/2024-08-06-json-datatype.md +++ b/docs/rfcs/2024-08-06-json-datatype.md @@ -52,7 +52,34 @@ SELECT json_get_by_paths(b, 'attributes', 'event_attributes') + 1 FROM test; ## Storage and Querying -Data of JSON type is stored as JSONB format in the database. For storage layer, data is represented as a binary array and can be queried through pre-defined JSON functions. For clients, data is shown as strings and can be casted to other types if needed. +Data of JSON type is stored as JSONB format in the database. 
For storage layer, data is represented as a binary and can be queried through pre-defined JSON functions. For clients, data is shown as strings and can be deserialized to other types if needed. + +Insertions of JSON data goes through following steps: + +1. Client gets JSON strings and sends it to the frontend. +2. Frontend serializes JSON strings as JSONB format and sends it to the datanode. +3. Datanode stores binary data in the database. + +Queries of JSON data goes through following steps: + +1. Client sends query to the frontend. +2. Frontend sends distributed query plans to the datanode. +3. Datanode executes distributed query plans and returns results of JSON format to the frontend. +4. Frontend executes non-distributed query plans and then deserializes results of JSONB format to strings, and returns to the client. + +``` +Insertion: + Serialize Store + JSON Strings ┌────────────┐ JSONB Data ┌────────────┐ + client ------------->│ Frontend │----------->│ Datanode │--> Storage + └────────────┘ └────────────┘ + +Queries: + Query + Deserialize Query + JSON Strings ┌────────────┐ JSONB Data ┌────────────┐ + client <-------------│ Frontend │<-----------│ Datanode │<-- Storage + └────────────┘ └────────────┘ +``` # Drawbacks From 787fdf52304d0ecba8ccf1d755b5e6301dd465d5 Mon Sep 17 00:00:00 2001 From: CookiePieWw Date: Mon, 12 Aug 2024 16:29:29 +0800 Subject: [PATCH 05/16] fix: typo --- docs/rfcs/2024-08-06-json-datatype.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/rfcs/2024-08-06-json-datatype.md b/docs/rfcs/2024-08-06-json-datatype.md index 42d9db21c6db..be044f3082fe 100644 --- a/docs/rfcs/2024-08-06-json-datatype.md +++ b/docs/rfcs/2024-08-06-json-datatype.md @@ -52,7 +52,7 @@ SELECT json_get_by_paths(b, 'attributes', 'event_attributes') + 1 FROM test; ## Storage and Querying -Data of JSON type is stored as JSONB format in the database. 
For storage layer, data is represented as a binary and can be queried through pre-defined JSON functions. For clients, data is shown as strings and can be deserialized to other types if needed. +Data of JSON type is stored as JSONB format in the database. For storage layer, data is represented as a binary array and can be queried through pre-defined JSON functions. For clients, data is shown as strings and can be deserialized to other types if needed. Insertions of JSON data goes through following steps: From dee656481b320d931c8d633365bd8234a5068912 Mon Sep 17 00:00:00 2001 From: CookiePieWw Date: Mon, 12 Aug 2024 16:38:51 +0800 Subject: [PATCH 06/16] fix: use query nodes instead of query plans --- docs/rfcs/2024-08-06-json-datatype.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/rfcs/2024-08-06-json-datatype.md b/docs/rfcs/2024-08-06-json-datatype.md index be044f3082fe..ac38107a7332 100644 --- a/docs/rfcs/2024-08-06-json-datatype.md +++ b/docs/rfcs/2024-08-06-json-datatype.md @@ -63,9 +63,9 @@ Insertions of JSON data goes through following steps: Queries of JSON data goes through following steps: 1. Client sends query to the frontend. -2. Frontend sends distributed query plans to the datanode. -3. Datanode executes distributed query plans and returns results of JSON format to the frontend. -4. Frontend executes non-distributed query plans and then deserializes results of JSONB format to strings, and returns to the client. +2. Frontend sends distributed query nodes to the datanode. +3. Datanode executes distributed query nodes and returns results of JSON format to the frontend. +4. Frontend executes non-distributed query nodes and then deserializes results of JSONB format to strings, and returns to the client. 
``` Insertion: From e14a9ae41c5b9359a7d2f36fb7abcc653d6d4f7b Mon Sep 17 00:00:00 2001 From: CookiePieWw Date: Mon, 12 Aug 2024 18:29:11 +0800 Subject: [PATCH 07/16] feat: a detailed overview of query --- docs/rfcs/2024-08-06-json-datatype.md | 50 +++++++++++++++++---------- 1 file changed, 32 insertions(+), 18 deletions(-) diff --git a/docs/rfcs/2024-08-06-json-datatype.md b/docs/rfcs/2024-08-06-json-datatype.md index ac38107a7332..c7a27f3e1fa5 100644 --- a/docs/rfcs/2024-08-06-json-datatype.md +++ b/docs/rfcs/2024-08-06-json-datatype.md @@ -41,7 +41,7 @@ SELECT json_get(b, 'name') FROM test; | jHl2oDDnPc1i2OzlP5Y | +---------------------+ -SELECT json_get_by_paths(b, 'attributes', 'event_attributes') + 1 FROM test; +SELECT json_get_by_paths_int(b, 'attributes', 'event_attributes') + 1 FROM test; +-------------------------------+ | b.attributes.event_attributes | +-------------------------------+ @@ -50,35 +50,49 @@ SELECT json_get_by_paths(b, 'attributes', 'event_attributes') + 1 FROM test; ``` -## Storage and Querying +## Storage and Query -Data of JSON type is stored as JSONB format in the database. For storage layer, data is represented as a binary array and can be queried through pre-defined JSON functions. For clients, data is shown as strings and can be deserialized to other types if needed. +Data of JSON type is stored as JSONB format in the database. For storage layer, data is represented as a binary array and can be queried through pre-defined JSON functions. For clients, data is shown as strings. -Insertions of JSON data goes through following steps: +Insertions of JSON goes through following steps: 1. Client gets JSON strings and sends it to the frontend. -2. Frontend serializes JSON strings as JSONB format and sends it to the datanode. +2. Frontend encode JSON strings to JSONB format and sends it to the datanode. 3. Datanode stores binary data in the database. -Queries of JSON data goes through following steps: - -1. Client sends query to the frontend. 
-2. Frontend sends distributed query nodes to the datanode. -3. Datanode executes distributed query nodes and returns results of JSON format to the frontend. -4. Frontend executes non-distributed query nodes and then deserializes results of JSONB format to strings, and returns to the client. - ``` Insertion: - Serialize Store + Encode Store JSON Strings ┌────────────┐ JSONB Data ┌────────────┐ client ------------->│ Frontend │----------->│ Datanode │--> Storage └────────────┘ └────────────┘ +``` -Queries: - Query + Deserialize Query - JSON Strings ┌────────────┐ JSONB Data ┌────────────┐ - client <-------------│ Frontend │<-----------│ Datanode │<-- Storage - └────────────┘ └────────────┘ +The data of JSON type is represented by `Binary` data type in arrow. There are 2 types of JSON queries: get json elements through keys and compute over json elements. + +For the former, the query engine performs queries directly over binary data. We provide functions like `json_get` and `json_get_by_paths` to extract json elements. + +For the latter, users need to manually specify the data type of the json elements. Before computing, and the query engine will decode the binary data in JSONB format into the specified data type. We provide functions like `json_get_int` and `json_get_by_paths_double` to extract json elements and convert them for further computation. + +Queries of JSON data goes through following steps: + +1. Client sends query to frontend, and frontend sends it to datafusion, which is the query engine of GreptimeDB. +2. Datafusion performs query over JSON data, and returns binary data to frontend. +3. If no computation is needed, frontend directly decodes it to JSON strings and return it to clients. +4. If computation is needed, the binary data is decoded and converted to the specified data type to perform computation. Since the data type is specified, there's no need for further decoding in the frontend. 
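The two decoding paths in steps 3 and 4 above can be sketched as follows. Plain UTF-8 bytes stand in for the JSONB payload here, and `decode_to_string`/`decode_as_double` are illustrative names only, not GreptimeDB's actual functions.

```rust
// Illustrative sketch of step 3 (decode to string) vs step 4 (decode to
// a user-specified type), over a toy payload, NOT the real JSONB format.

#[derive(Debug, PartialEq)]
enum Decoded {
    Text(String),
    Double(f64),
}

// Step 3: no computation requested -> hand the value back as a JSON string.
fn decode_to_string(payload: &[u8]) -> Decoded {
    Decoded::Text(String::from_utf8_lossy(payload).into_owned())
}

// Step 4: computation requested -> the caller must name the target type,
// and a value that cannot be read as that type is an error.
fn decode_as_double(payload: &[u8]) -> Result<Decoded, String> {
    std::str::from_utf8(payload)
        .ok()
        .and_then(|s| s.parse::<f64>().ok())
        .map(Decoded::Double)
        .ok_or_else(|| "value is not a double".to_string())
}

fn main() {
    assert_eq!(decode_to_string(b"48.28667"), Decoded::Text("48.28667".into()));
    assert_eq!(decode_as_double(b"48.28667"), Ok(Decoded::Double(48.28667)));
    // A string element cannot silently take part in arithmetic.
    assert!(decode_as_double(b"\"jHl2oDDnPc1i2OzlP5Y\"").is_err());
}
```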
+ +``` +Queries without computation: + Decode Query + JSON Strings ┌────────────┐ JSONB Data ┌──────────────┐ + client <-------------│ Frontend │<-----------│ Datafusion │<-- Storage + └────────────┘ └──────────────┘ + +Queries with computation: + Query + Data of Specified Type ┌────────────┐ Data of Certain Type ┌──────────────┐ + client <-----------------------│ Frontend │<---------------------│ Datafusion │<-- Storage + └────────────┘ └──────────────┘ ``` # Drawbacks From 8dbf05011f1ce95f069468cfa235f4b02b461790 Mon Sep 17 00:00:00 2001 From: CookiePieWw Date: Mon, 12 Aug 2024 19:05:16 +0800 Subject: [PATCH 08/16] fix: grammar --- docs/rfcs/2024-08-06-json-datatype.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/rfcs/2024-08-06-json-datatype.md b/docs/rfcs/2024-08-06-json-datatype.md index c7a27f3e1fa5..4604b5f06a9f 100644 --- a/docs/rfcs/2024-08-06-json-datatype.md +++ b/docs/rfcs/2024-08-06-json-datatype.md @@ -52,7 +52,7 @@ SELECT json_get_by_paths_int(b, 'attributes', 'event_attributes') + 1 FROM test; ## Storage and Query -Data of JSON type is stored as JSONB format in the database. For storage layer, data is represented as a binary array and can be queried through pre-defined JSON functions. For clients, data is shown as strings. +Data of JSON type is stored as JSONB format in the database. For storage layer and query engine, data is represented as a binary array and can be queried through pre-defined JSON functions. For clients, data is shown as strings. Insertions of JSON goes through following steps: @@ -70,16 +70,16 @@ Insertion: The data of JSON type is represented by `Binary` data type in arrow. There are 2 types of JSON queries: get json elements through keys and compute over json elements. -For the former, the query engine performs queries directly over binary data. We provide functions like `json_get` and `json_get_by_paths` to extract json elements. 
+For the former, the query engine performs queries directly over binary data. We provide functions like `json_get` and `json_get_by_paths` to extract json elements through keys. -For the latter, users need to manually specify the data type of the json elements. Before computing, and the query engine will decode the binary data in JSONB format into the specified data type. We provide functions like `json_get_int` and `json_get_by_paths_double` to extract json elements and convert them for further computation. +For the latter, users need to manually specify the data type of the json elements for computing. Before computing, the query engine will decode the binary data in JSONB format into the specified data type. We provide functions like `json_get_int` and `json_get_by_paths_double` to extract json elements and convert them for further computation. -Queries of JSON data goes through following steps: +Queries of JSON goes through following steps: 1. Client sends query to frontend, and frontend sends it to datafusion, which is the query engine of GreptimeDB. 2. Datafusion performs query over JSON data, and returns binary data to frontend. 3. If no computation is needed, frontend directly decodes it to JSON strings and return it to clients. -4. If computation is needed, the binary data is decoded and converted to the specified data type to perform computation. Since the data type is specified, there's no need for further decoding in the frontend. +4. If computation is needed, the binary data is decoded and converted to the specified data type to perform computation. There's no need for further decoding in the frontend. 
``` Queries without computation: @@ -89,7 +89,7 @@ Queries without computation: └────────────┘ └──────────────┘ Queries with computation: - Query + Query Data of Specified Type ┌────────────┐ Data of Certain Type ┌──────────────┐ client <-----------------------│ Frontend │<---------------------│ Datafusion │<-- Storage └────────────┘ └──────────────┘ From 1a4538244cfb3f877075a710166c7c42d4326708 Mon Sep 17 00:00:00 2001 From: CookiePieWw Date: Tue, 13 Aug 2024 01:31:36 +0800 Subject: [PATCH 09/16] fix: use independent cast function --- docs/rfcs/2024-08-06-json-datatype.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/rfcs/2024-08-06-json-datatype.md b/docs/rfcs/2024-08-06-json-datatype.md index 4604b5f06a9f..bb2f804996b2 100644 --- a/docs/rfcs/2024-08-06-json-datatype.md +++ b/docs/rfcs/2024-08-06-json-datatype.md @@ -72,7 +72,7 @@ The data of JSON type is represented by `Binary` data type in arrow. There are 2 For the former, the query engine performs queries directly over binary data. We provide functions like `json_get` and `json_get_by_paths` to extract json elements through keys. -For the latter, users need to manually specify the data type of the json elements for computing. Before computing, the query engine will decode the binary data in JSONB format into the specified data type. We provide functions like `json_get_int` and `json_get_by_paths_double` to extract json elements and convert them for further computation. +For the latter, users need to manually specify the data type of the json elements for computing. We provide functions like `as_int` and `as_double` to decode the binary data into data with specified data type for further computation. Queries of JSON goes through following steps: @@ -82,13 +82,13 @@ Queries of JSON goes through following steps: 4. If computation is needed, the binary data is decoded and converted to the specified data type to perform computation. 
There's no need for further decoding in the frontend. ``` -Queries without computation: +Queries without computation, decoding in frontend: Decode Query JSON Strings ┌────────────┐ JSONB Data ┌──────────────┐ client <-------------│ Frontend │<-----------│ Datafusion │<-- Storage └────────────┘ └──────────────┘ -Queries with computation: +Queries with computation, decoding in datafusion: Query Data of Specified Type ┌────────────┐ Data of Certain Type ┌──────────────┐ client <-----------------------│ Frontend │<---------------------│ Datafusion │<-- Storage From c5aa96352a4a8745f1bd462f45b91fe6f9d39ae5 Mon Sep 17 00:00:00 2001 From: CookiePieWw Date: Tue, 13 Aug 2024 18:30:13 +0800 Subject: [PATCH 10/16] fix: unify cast function --- docs/rfcs/2024-08-06-json-datatype.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/rfcs/2024-08-06-json-datatype.md b/docs/rfcs/2024-08-06-json-datatype.md index bb2f804996b2..e172cb23ca6a 100644 --- a/docs/rfcs/2024-08-06-json-datatype.md +++ b/docs/rfcs/2024-08-06-json-datatype.md @@ -41,7 +41,7 @@ SELECT json_get(b, 'name') FROM test; | jHl2oDDnPc1i2OzlP5Y | +---------------------+ -SELECT json_get_by_paths_int(b, 'attributes', 'event_attributes') + 1 FROM test; +SELECT CAST(json_get_by_paths(b, 'attributes', 'event_attributes') AS DOUBLE) + 1 FROM test; +-------------------------------+ | b.attributes.event_attributes | +-------------------------------+ @@ -72,7 +72,7 @@ The data of JSON type is represented by `Binary` data type in arrow. There are 2 For the former, the query engine performs queries directly over binary data. We provide functions like `json_get` and `json_get_by_paths` to extract json elements through keys. -For the latter, users need to manually specify the data type of the json elements for computing. We provide functions like `as_int` and `as_double` to decode the binary data into data with specified data type for further computation. 
+For the latter, users need to manually specify the data type of the json elements for computing. Users can use `CAST` to convert the binary data to the specified data type. Computation without explicit conversion will result in an error. Queries of JSON goes through following steps: From 803a2ab2f94a91a7aa769391f0dd77c59831a1be Mon Sep 17 00:00:00 2001 From: CookiePieWw Date: Tue, 13 Aug 2024 21:46:04 +0800 Subject: [PATCH 11/16] fix: refine, make statements clear --- docs/rfcs/2024-08-06-json-datatype.md | 44 +++++++++++++-------------- 1 file changed, 22 insertions(+), 22 deletions(-) diff --git a/docs/rfcs/2024-08-06-json-datatype.md b/docs/rfcs/2024-08-06-json-datatype.md index e172cb23ca6a..e3c596a78dda 100644 --- a/docs/rfcs/2024-08-06-json-datatype.md +++ b/docs/rfcs/2024-08-06-json-datatype.md @@ -14,7 +14,7 @@ JSON is widely used across various scenarios. Direct support for writing and que # Details ## User Interface -The feature introduces a new data type for the database, similar to the common JSON type. Data is written as JSON strings and can be queried using functions. +The feature introduces a new data type, `JSON`, for the database. Similar to the common JSON type, data is written as JSON strings and can be queried using functions. For example: ```SQL @@ -52,47 +52,47 @@ SELECT CAST(json_get_by_paths(b, 'attributes', 'event_attributes') AS DOUBLE) + ## Storage and Query -Data of JSON type is stored as JSONB format in the database. For storage layer and query engine, data is represented as a binary array and can be queried through pre-defined JSON functions. For clients, data is shown as strings. +Data of `JSON` type is stored as JSONB format in the database. For storage layer and query engine, data is represented as a binary array and can be queried through pre-defined JSON functions. For clients, data is shown as strings. -Insertions of JSON goes through following steps: +Insertions of `JSON` goes through following steps: 1. 
Client gets JSON strings and sends them to the frontend.
2. Frontend encodes JSON strings to binary data of JSONB format and sends it to the datanode.
3. Datanode stores binary data in the database.

```
Insertion:
                Encode     Store
 JSON Strings ┌────────────┐ JSONB ┌────────────┐ JSONB
 client ------------->│ Frontend │------>│ Datanode │------> Storage
              └────────────┘       └────────────┘
```

The data of `JSON` type is represented by the `Binary` data type in arrow. There are 2 types of JSON queries: getting JSON elements through keys, and computing over JSON elements.

For the former, the query engine performs queries directly over binary data. We provide functions like `json_get` and `json_get_by_paths` to extract JSON elements through keys.

For the latter, users need to manually specify the data type of the JSON elements for computing. Users can use `CAST` to convert the JSON elements to the specified data type. Computation without explicit conversion will result in an error.

Queries of `JSON` go through the following steps:

1. 
Client sends a query to the frontend, and the frontend sends it to datafusion, which is the query engine of GreptimeDB.
2. Datafusion performs the query over binary data of JSONB format, and returns binary data to the frontend.
3. If no computation is needed, the frontend directly decodes the binary data to JSON strings and returns them to clients.
4. If computation is needed, the binary data is decoded and converted to the specified data type to perform computation. There's no need for further decoding in the frontend.

```
Queries without computation, decoding in frontend:
             Decode       Query
 JSON Strings ┌────────────┐ JSONB ┌──────────────┐ JSONB
 client <-------------│ Frontend │<------│ Datafusion │<------ Storage
              └────────────┘       └──────────────┘

Queries with computation, decoding in datafusion:
                                 Query
 Data of Specified Type ┌────────────┐ Data of Specified Type ┌──────────────┐ JSONB
 client <-----------------------│ Frontend │<-----------------------│ Datafusion │<------ Storage
                        └────────────┘                        └──────────────┘
```

# Drawbacks

From e2d36054fe38c6191b5c8d5b845b2ccfdb9bbc83 Mon Sep 17 00:00:00 2001
From: CookiePieWw
Date: Tue, 17 Sep 2024 23:10:43 +0800
Subject: [PATCH 12/16] docs: update rfc according to impl

---
 docs/rfcs/2024-08-06-json-datatype.md | 116 +++++++++++++++-----------
 1 file changed, 66 insertions(+), 50 deletions(-)

diff --git a/docs/rfcs/2024-08-06-json-datatype.md 
b/docs/rfcs/2024-08-06-json-datatype.md
index e3c596a78dda..52e7f1310207 100644
--- a/docs/rfcs/2024-08-06-json-datatype.md
+++ b/docs/rfcs/2024-08-06-json-datatype.md
@@ -13,10 +13,21 @@ JSON is widely used across various scenarios. Direct support for writing and que

 # Details

-## User Interface
-The feature introduces a new data type, `JSON`, for the database. Similar to the common JSON type, data is written as JSON strings and can be queried using functions.
+## Storage and Query

-For example:
+The type system of GreptimeDB is based on the types of arrow/datafusion; each type has a corresponding physical type in arrow/datafusion. Thus, the JSON type is built on top of the `Binary` type, reusing its existing `Value` and `Vector` implementations. The JSON type behaves the same as the Binary type inside the storage layer and query engine.
+
+This brings 2 problems: the insertion interface and the query interface.
+
+## Insertion
+
+Users commonly write JSON data as strings, so we need to convert between strings and binary data. There are 2 ways to do this:
+
+1. MySQL and PostgreSQL servers provide auto-conversion between string and JSON data. When a string is inserted into a JSON column, the server will try to parse the string as JSON data and convert it to binary data of JSON type. A non-JSON string will be rejected.
+
+2. A function `parse_json` is provided to convert a string to JSON data. The function will return binary data of JSON type. If the string is not a valid JSON string, the function will return an error.
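The dispatch between the two write paths can be roughly sketched as follows. `looks_like_json` is a deliberately crude placeholder (a real server would fully parse the string), and `auto_convert` is a hypothetical name, not the actual server API.

```rust
// Illustrative sketch of the two insertion paths described above.
// `looks_like_json` is a crude placeholder check, not a real JSON parser.

fn looks_like_json(s: &str) -> bool {
    // Only peeks at the first character; real validation parses the whole string.
    matches!(s.trim_start().chars().next(), Some('{' | '[' | '"'))
}

// Path 1: server-side auto-conversion -- a non-JSON string is rejected.
fn auto_convert(s: &str) -> Result<Vec<u8>, String> {
    if looks_like_json(s) {
        Ok(s.as_bytes().to_vec()) // stand-in for real JSONB encoding
    } else {
        Err(format!("invalid JSON literal: {s}"))
    }
}

fn main() {
    assert!(auto_convert(r#"{"name": "jHl2oDDnPc1i2OzlP5Y"}"#).is_ok());
    assert!(auto_convert("not json at all").is_err());
}
```

Path 2 (`parse_json`) differs only in where the conversion runs: the query engine performs it explicitly rather than the server doing it implicitly.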
+ +For example, in MySQL client: ```SQL CREATE TABLE IF NOT EXISTS test ( ts TIMESTAMP TIME INDEX, @@ -34,70 +45,75 @@ INSERT INTO test VALUES( }' ); -SELECT json_get(b, 'name') FROM test; -+---------------------+ -| b.name | -+---------------------+ -| jHl2oDDnPc1i2OzlP5Y | -+---------------------+ - -SELECT CAST(json_get_by_paths(b, 'attributes', 'event_attributes') AS DOUBLE) + 1 FROM test; -+-------------------------------+ -| b.attributes.event_attributes | -+-------------------------------+ -| 49.28667 | -+-------------------------------+ - +INSERT INTO test VALUES( + 0, + 0, + parse_json('{ + "name": "jHl2oDDnPc1i2OzlP5Y", + "timestamp": "2024-07-25T04:33:11.369386Z", + "attributes": { "event_attributes": 48.28667 } + }') +); ``` +Are both valid. -## Storage and Query +For former the conversion is done by the server, while for the latter the conversion is done by the query engine. -Data of `JSON` type is stored as JSONB format in the database. For storage layer and query engine, data is represented as a binary array and can be queried through pre-defined JSON functions. For clients, data is shown as strings. +## Query Interface -Insertions of `JSON` goes through following steps: +Correspondingly, users prefer to display JSON data as strings. Thus we need to make conversion between binary data and string data. There are alsol 2 ways to do this: auto-conversions on MySQL and PostgreSQL servers, and function `json_to_string`. -1. Client gets JSON strings and sends it to the frontend. -2. Frontend encode JSON strings to binary data of JSONB format and sends it to the datanode. -3. Datanode stores binary data in the database. +For example, in MySQL client: +```SQL +SELECT b FROM test; +SELECT json_to_string(b) FROM test; ``` -Insertion: - Encode Store - JSON Strings ┌────────────┐ JSONB ┌────────────┐ JSONB - client ------------->│ Frontend │------>│ Datanode │------> Storage - └────────────┘ └────────────┘ -``` +Will both return the JSON string. 
+ +Specifically, we attach a message to the binary data of JSON type in the `metadata` of `Field` in arrow/datafusion schema. Frontend servers could identify the type of the binary data and convert it to string data if necessary. But for functions with a JSON return type, the metadata method is not applicable. Thus the functions of JSON type should specify the return type explicitly, such as `json_get_int` and `json_get_float` which return `INT` and `FLOAT` respectively. -The data of `JSON` type is represented by `Binary` data type in arrow. There are 2 types of JSON queries: get JSON elements through keys and compute over JSON elements. +## Functions +Similar to the common JSON type, data is written as JSON strings and can be queried with functions. -For the former, the query engine performs queries directly over binary data. We provide functions like `json_get` and `json_get_by_paths` to extract JSON elements through keys. +For example: +```SQL +CREATE TABLE IF NOT EXISTS test ( + ts TIMESTAMP TIME INDEX, + a INT, + b JSON +); -For the latter, users need to manually specify the data type of the JSON elements for computing. Users can use `CAST` to convert the JSON elements to the specified data type. Computation without explicit conversion will result in an error. +INSERT INTO test VALUES( + 0, + 0, + '{ + "name": "jHl2oDDnPc1i2OzlP5Y", + "timestamp": "2024-07-25T04:33:11.369386Z", + "attributes": { "event_attributes": 48.28667 } + }' +); -Queries of `JSON` goes through following steps: +SELECT json_get_int(b, 'name') FROM test; ++---------------------+ +| b.name | ++---------------------+ +| jHl2oDDnPc1i2OzlP5Y | ++---------------------+ -1. Client sends query to frontend, and frontend sends it to datafusion, which is the query engine of GreptimeDB. -2. Datafusion performs query over binray data of JSONB format, and returns binary data to frontend. -3. If no computation is needed, frontend directly decodes the binary data to JSON strings and return it to clients. 
-4. If computation is needed, the binary data is decoded and converted to the specified data type to perform computation. There's no need for further decoding in the frontend. +SELECT json_get_float(b, 'attributes.event_attributes') FROM test; ++--------------------------------+ +| b.attributes.event_attributes | ++--------------------------------+ +| 48.28667 | ++--------------------------------+ ``` -Queries without computation, decoding in frontend: - Decode Query - JSON Strings ┌────────────┐ JSONB ┌──────────────┐ JSONB - client <-------------│ Frontend │<------│ Datafusion │<------ Storage - └────────────┘ └──────────────┘ - -Queries with computation, decoding in datafusion: - Query - Data of Specified Type ┌────────────┐ Data of Specified Type ┌──────────────┐ JSONB - client <-----------------------│ Frontend │<-----------------------│ Datafusion │<------ Storage - └────────────┘ └──────────────┘ -``` +And more functions can be added in the future. # Drawbacks -As a general purpose data type, JSONB may not be as efficient as specialized data types for specific scenarios. +As a general purpose JSON data type, JSONB may not be as efficient as specialized data types for specific scenarios. # Alternatives From 79a5c505ea381d401fd41137b3789fe2b31d707d Mon Sep 17 00:00:00 2001 From: CookiePieWw Date: Wed, 18 Sep 2024 19:18:18 +0800 Subject: [PATCH 13/16] docs: refine --- docs/rfcs/2024-08-06-json-datatype.md | 86 +++++++++++++++++++++++---- 1 file changed, 75 insertions(+), 11 deletions(-) diff --git a/docs/rfcs/2024-08-06-json-datatype.md b/docs/rfcs/2024-08-06-json-datatype.md index 52e7f1310207..c7ed5dbc162d 100644 --- a/docs/rfcs/2024-08-06-json-datatype.md +++ b/docs/rfcs/2024-08-06-json-datatype.md @@ -15,17 +15,17 @@ JSON is widely used across various scenarios. Direct support for writing and que ## Storage and Query -The type system of GreptimeDB is based on the types of arrow/datafusion, each type has a corresponding physical type from arrow/datafusion. 
Thus, the json type is built on top of the `Binary` type, utilizing current implementation of both `Value` and `Vector` of it. JSON type performs the same as Binary type inside the storage layer and query engine.
+GreptimeDB's type system is built on Arrow/DataFusion, where each data type in GreptimeDB corresponds to a data type in Arrow/DataFusion. The proposed JSON type will be implemented on top of the existing `Binary` type, leveraging the current `datatype::value::Value` and `datatype::vectors::BinaryVector` implementations and using the JSONB format as the encoding for JSON data. JSON data is stored and processed similarly to binary data within the storage layer and query engine.

-This also brings 2 problems: insertion and query interface.
+This approach brings problems when dealing with insertions and queries of JSON columns.

 ## Insertion

-User commonly write JSON data as strings. Thus we need to make conversion between string and binary data. There are 2 ways to do this:
+Users commonly write JSON data as strings. Thus we need to make conversions between strings and JSONB. There are 2 ways to do this:

-1. MySQL and PostgreSQL servers provide auto-conversion between string and JSON data. When a string is inserted into a JSON column, the server will try to parse the string as JSON data and convert it to binary data of JSON type. The non-JSON string will be rejected.
+1. MySQL and PostgreSQL servers provide auto-conversions between strings and JSONB. When a string is inserted into a JSON column, the server will try to parse the string as JSON and convert it to JSONB. Non-JSON strings will be rejected.

-2. A function `parse_json` is provided to convert string to JSON data. The function will return a binary data of JSON type. If the string is not a valid JSON string, the function will return an error.
+2. A function `parse_json` is provided to convert a string to JSONB. If the string is not a valid JSON string, the function will return an error.
For example, in MySQL client: ```SQL @@ -57,11 +57,55 @@ INSERT INTO test VALUES( ``` Are both valid. -For former the conversion is done by the server, while for the latter the conversion is done by the query engine. +The dataflow of the insertion process is as follows: +``` +Insert JSON strings directly through client: + (Server identifies JSON type and performs auto-conversion) + Encode Insert + JSON Strings ┌──────────┐JSONB ┌──────────────┐JSONB + Client ------------>│ Server │----->│ Query Engine │-----> Storage + └──────────┘ └──────────────┘ +Insert JSON strings through parse_json function: + (Conversion is performed by function inside Query Engine) + Encode & Insert + JSON Strings ┌──────────┐JSON Strings ┌──────────────┐JSONB + Client ------------>│ Server │------------>│ Query Engine │-----> Storage + └──────────┘ └──────────────┘ +``` -## Query Interface +However, insertions through prepared statements in MySQL clients will not trigger the auto-conversion since the prepared plan of datafusion cannot identify JSON type from binary type. The server will directly insert the input string into the JSON column as string bytes instead of converting it to JSONB. This may cause problems when the string is not a valid JSON string. + +``` +For insertions through prepared statements in MySQL clients: + Prepare stmt ┌──────────┐ +Client ------------>│ Server │ -----> Cached Plan ───────┐ + └──────────┘ │ + (Cached plan erased type info of JSON │ + and treat it as binary) │ + ┌─────────────────────────────────┘ + ↓ + Execute stmt ┌──────────┐ +Client ------------>│ Server │ (Cannot perform auto-conversion here) + └──────────┘ +``` + +Thus, following codes may not work as expected: +```Rust +// sqlx first prepare a statement and then execute it. 
+sqlx::query(create table test (ts TIMESTAMP TIME INDEX, b JSON)) + .execute(&pool) + .await?; +sqlx::query("insert into demo values(?, ?)") + .bind(0) + .bind(r#"{"name": "jHl2oDDnPc1i2OzlP5Y", "timestamp": "2024-07-25T04:33:11.369386Z", "attributes": { "event_attributes": 48.28667 }}"#) + .execute(&pool) + .await?; +``` +The JSON will be inserted as string bytes instead of JSONB. Also happens when using `PREPARE` and `EXECUTE` in MySQL client. Among these scenarios, we need to use `parse_json` function explicitly to convert the string to JSONB. -Correspondingly, users prefer to display JSON data as strings. Thus we need to make conversion between binary data and string data. There are alsol 2 ways to do this: auto-conversions on MySQL and PostgreSQL servers, and function `json_to_string`. +## Query + +Correspondingly, users prefer to display JSON data as strings. Thus we need to make conversions between JSON data and strings before presenting JSON data. There are also 2 ways to do this: auto-conversions on MySQL and PostgreSQL servers, and function `json_to_string`. For example, in MySQL client: ```SQL @@ -69,12 +113,30 @@ SELECT b FROM test; SELECT json_to_string(b) FROM test; ``` -Will both return the JSON string. +Will both return the JSON as human-readable strings. + +Specifically, to perform auto-conversions, we attach a message to JSON data in the `metadata` of `Field` in Arrow/Datafusion schema when scanning a JSON column. Frontend servers could identify JSON data and convert it to strings. 
+ +The dataflow of the query process is as follows: +``` +Query directly through client: + (Server identifies JSON type and performs auto-conversion based on column metadata) + Decode Scan + JSON Strings ┌──────────┐JSONB ┌──────────────┐JSONB + Client ------------>│ Server │----->│ Query Engine │<----- Storage + └──────────┘ └──────────────┘ +Query through json_to_string function: + (Conversion is performed by function inside Query Engine) + Scan & Decode + JSON Strings ┌──────────┐JSON Strings ┌──────────────┐JSONB + Client ------------>│ Server │------------>│ Query Engine │-----> Storage + └──────────┘ └──────────────┘ +``` -Specifically, we attach a message to the binary data of JSON type in the `metadata` of `Field` in arrow/datafusion schema. Frontend servers could identify the type of the binary data and convert it to string data if necessary. But for functions with a JSON return type, the metadata method is not applicable. Thus the functions of JSON type should specify the return type explicitly, such as `json_get_int` and `json_get_float` which return `INT` and `FLOAT` respectively. +However, if a function uses JSON type as its return type, the metadata method mentioned above is not applicable. Thus the functions of JSON type should specify the return type explicitly instead of returning a JSON type, such as `json_get_int` and `json_get_float` which return corresponding data of `INT` and `FLOAT` type respectively. ## Functions -Similar to the common JSON type, data is written as JSON strings and can be queried with functions. +Similar to the common JSON type, JSON data can be queried with functions. For example: ```SQL @@ -115,6 +177,8 @@ And more functions can be added in the future. As a general purpose JSON data type, JSONB may not be as efficient as specialized data types for specific scenarios. +The auto-conversion mechanism is not supported in all scenarios. We need to find workarounds for these scenarios. 
+ # Alternatives Extract and flatten JSON schema to store in a structured format through pipeline. For nested data, we can provide nested types like `STRUCT` or `ARRAY`. From 104d5cf94e68d05f92cfb9ccee42a1dd6072f3bb Mon Sep 17 00:00:00 2001 From: CookiePieWw Date: Wed, 18 Sep 2024 19:20:53 +0800 Subject: [PATCH 14/16] docs: fix wrong arrows --- docs/rfcs/2024-08-06-json-datatype.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/rfcs/2024-08-06-json-datatype.md b/docs/rfcs/2024-08-06-json-datatype.md index c7ed5dbc162d..f23e99176736 100644 --- a/docs/rfcs/2024-08-06-json-datatype.md +++ b/docs/rfcs/2024-08-06-json-datatype.md @@ -123,13 +123,13 @@ Query directly through client: (Server identifies JSON type and performs auto-conversion based on column metadata) Decode Scan JSON Strings ┌──────────┐JSONB ┌──────────────┐JSONB - Client ------------>│ Server │----->│ Query Engine │<----- Storage + Client <------------│ Server │<-----│ Query Engine │<----- Storage └──────────┘ └──────────────┘ Query through json_to_string function: (Conversion is performed by function inside Query Engine) Scan & Decode JSON Strings ┌──────────┐JSON Strings ┌──────────────┐JSONB - Client ------------>│ Server │------------>│ Query Engine │-----> Storage + Client <------------│ Server │<------------│ Query Engine │<----- Storage └──────────┘ └──────────────┘ ``` From 93181a2513fe9d3409f21f66b5524842b0c3a688 Mon Sep 17 00:00:00 2001 From: CookiePieWw Date: Wed, 18 Sep 2024 21:46:28 +0800 Subject: [PATCH 15/16] docs: refine --- docs/rfcs/2024-08-06-json-datatype.md | 99 +++++++++++++++------------ 1 file changed, 56 insertions(+), 43 deletions(-) diff --git a/docs/rfcs/2024-08-06-json-datatype.md b/docs/rfcs/2024-08-06-json-datatype.md index f23e99176736..a241d0db0674 100644 --- a/docs/rfcs/2024-08-06-json-datatype.md +++ b/docs/rfcs/2024-08-06-json-datatype.md @@ -60,48 +60,59 @@ Are both valid. 
The dataflow of the insertion process is as follows: ``` Insert JSON strings directly through client: + Parse Insert + String(Serialized JSON)┌──────────┐Arrow Binary(JSONB)┌──────┐Arrow Binary(JSONB) + Client ---------------------->│ Server │------------------>│ Mito │------------------> Storage + └──────────┘ └──────┘ (Server identifies JSON type and performs auto-conversion) - Encode Insert - JSON Strings ┌──────────┐JSONB ┌──────────────┐JSONB - Client ------------>│ Server │----->│ Query Engine │-----> Storage - └──────────┘ └──────────────┘ + Insert JSON strings through parse_json function: - (Conversion is performed by function inside Query Engine) - Encode & Insert - JSON Strings ┌──────────┐JSON Strings ┌──────────────┐JSONB - Client ------------>│ Server │------------>│ Query Engine │-----> Storage - └──────────┘ └──────────────┘ + Parse Insert + String(Serialized JSON)┌──────────┐String(Serialized JSON)┌─────┐Arrow Binary(JSONB)┌──────┐Arrow Binary(JSONB) + Client ---------------------->│ Server │---------------------->│ UDF │------------------>│ Mito │------------------> Storage + └──────────┘ └─────┘ └──────┘ + (Conversion is performed by UDF inside Query Engine) ``` -However, insertions through prepared statements in MySQL clients will not trigger the auto-conversion since the prepared plan of datafusion cannot identify JSON type from binary type. The server will directly insert the input string into the JSON column as string bytes instead of converting it to JSONB. This may cause problems when the string is not a valid JSON string. 
- -``` -For insertions through prepared statements in MySQL clients: - Prepare stmt ┌──────────┐ -Client ------------>│ Server │ -----> Cached Plan ───────┐ - └──────────┘ │ - (Cached plan erased type info of JSON │ - and treat it as binary) │ - ┌─────────────────────────────────┘ - ↓ - Execute stmt ┌──────────┐ -Client ------------>│ Server │ (Cannot perform auto-conversion here) - └──────────┘ -``` +Servers identify JSON column through column schema and perform auto-conversions. But when using prepared statements and binding parameters, the corresponding cached plans in datafusion generated by prepared statements cannot identify JSON columns. Under this circumstance, the servers identify JSON columns through the given parameters and perform auto-conversions. -Thus, following codes may not work as expected: +The following is an example of inserting JSON data through prepared statements: ```Rust -// sqlx first prepare a statement and then execute it. -sqlx::query(create table test (ts TIMESTAMP TIME INDEX, b JSON)) +sqlx::query( + "create table test(ts timestamp time index, j json)", +) +.execute(&pool) +.await +.unwrap(); + +let json = serde_json::json!({ + "code": 200, + "success": true, + "payload": { + "features": [ + "serde", + "json" + ], + "homepage": null + } +}); + +// Valid, can identify serde_json::Value as JSON type +sqlx::query("insert into test values($1, $2)") + .bind(i) + .bind(json) .execute(&pool) - .await?; -sqlx::query("insert into demo values(?, ?)") - .bind(0) - .bind(r#"{"name": "jHl2oDDnPc1i2OzlP5Y", "timestamp": "2024-07-25T04:33:11.369386Z", "attributes": { "event_attributes": 48.28667 }}"#) + .await + .unwrap(); + +// Invalid, cannot identify String as JSON type +sqlx::query("insert into test values($1, $2)") + .bind(i) + .bind(json.to_string()) .execute(&pool) - .await?; + .await + .unwrap(); ``` -The JSON will be inserted as string bytes instead of JSONB. Also happens when using `PREPARE` and `EXECUTE` in MySQL client. 
Among these scenarios, we need to use `parse_json` function explicitly to convert the string to JSONB. ## Query @@ -120,17 +131,19 @@ Specifically, to perform auto-conversions, we attach a message to JSON data in t The dataflow of the query process is as follows: ``` Query directly through client: - (Server identifies JSON type and performs auto-conversion based on column metadata) - Decode Scan - JSON Strings ┌──────────┐JSONB ┌──────────────┐JSONB - Client <------------│ Server │<-----│ Query Engine │<----- Storage - └──────────┘ └──────────────┘ + Decode Scan + String(Serialized JSON)┌──────────┐Arrow Binary(JSONB)┌──────────────┐Arrow Binary(JSONB) + Client <----------------------│ Server │<------------------│ Query Engine │<----------------- Storage + └──────────┘ └──────────────┘ +(Server identifies JSON type and performs auto-conversion based on column metadata) + Query through json_to_string function: - (Conversion is performed by function inside Query Engine) - Scan & Decode - JSON Strings ┌──────────┐JSON Strings ┌──────────────┐JSONB - Client <------------│ Server │<------------│ Query Engine │<----- Storage - └──────────┘ └──────────────┘ + Scan & Decode + String(Serialized JSON)┌──────────┐Arrow Binary(JSONB)┌──────────────┐Arrow Binary(JSONB) + Client <----------------------│ Server │<------------------│ Query Engine │<----- Storage + └──────────┘ └──────────────┘ + (Conversion is performed by UDF inside Query Engine) + ``` However, if a function uses JSON type as its return type, the metadata method mentioned above is not applicable. Thus the functions of JSON type should specify the return type explicitly instead of returning a JSON type, such as `json_get_int` and `json_get_float` which return corresponding data of `INT` and `FLOAT` type respectively. 
From 22c201211b0ffbca1fed65ec4c39a37d4d27df3e Mon Sep 17 00:00:00 2001 From: CookiePieWw Date: Thu, 19 Sep 2024 13:39:47 +0800 Subject: [PATCH 16/16] docs: fix some errors qaq --- docs/rfcs/2024-08-06-json-datatype.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/rfcs/2024-08-06-json-datatype.md b/docs/rfcs/2024-08-06-json-datatype.md index a241d0db0674..8a617a170909 100644 --- a/docs/rfcs/2024-08-06-json-datatype.md +++ b/docs/rfcs/2024-08-06-json-datatype.md @@ -138,11 +138,11 @@ Query directly through client: (Server identifies JSON type and performs auto-conversion based on column metadata) Query through json_to_string function: - Scan & Decode - String(Serialized JSON)┌──────────┐Arrow Binary(JSONB)┌──────────────┐Arrow Binary(JSONB) - Client <----------------------│ Server │<------------------│ Query Engine │<----- Storage - └──────────┘ └──────────────┘ - (Conversion is performed by UDF inside Query Engine) + Scan & Decode + String(Serialized JSON)┌──────────┐String(Serialized JSON)┌──────────────┐Arrow Binary(JSONB) + Client <----------------------│ Server │<----------------------│ Query Engine │<----------------- Storage + └──────────┘ └──────────────┘ + (Conversion is performed by UDF inside Query Engine) ``` @@ -169,7 +169,7 @@ INSERT INTO test VALUES( }' ); -SELECT json_get_int(b, 'name') FROM test; +SELECT json_get_string(b, 'name') FROM test; +---------------------+ | b.name | +---------------------+