From 8dd33b4039324d67cf17eeb743e9130fcbbe0bd6 Mon Sep 17 00:00:00 2001 From: Yiran Date: Wed, 11 Dec 2024 17:39:33 +0800 Subject: [PATCH 1/7] refactor: refine continuous aggregation documentation --- docs/getting-started/quick-start.md | 2 +- docs/reference/sql/create.md | 4 +- .../concepts/features-that-you-concern.md | 2 +- docs/user-guide/concepts/key-concepts.md | 2 +- .../continuous-aggregation/overview.md | 150 ------------------ .../continuous-aggregation.md} | 145 ++++++++++++++++- .../expressions.md} | 2 +- .../manage-flow.md | 15 +- docs/user-guide/flow-computation/overview.md | 124 +++++++++++++++ docs/user-guide/overview.md | 4 +- sidebars.ts | 12 +- 11 files changed, 284 insertions(+), 178 deletions(-) delete mode 100644 docs/user-guide/continuous-aggregation/overview.md rename docs/user-guide/{continuous-aggregation/usecase-example.md => flow-computation/continuous-aggregation.md} (58%) rename docs/user-guide/{continuous-aggregation/expression.md => flow-computation/expressions.md} (98%) rename docs/user-guide/{continuous-aggregation => flow-computation}/manage-flow.md (88%) create mode 100644 docs/user-guide/flow-computation/overview.md diff --git a/docs/getting-started/quick-start.md b/docs/getting-started/quick-start.md index aa8f09faa..e5c0dd923 100644 --- a/docs/getting-started/quick-start.md +++ b/docs/getting-started/quick-start.md @@ -313,7 +313,7 @@ ORDER BY ### Continuous aggregation -For further analysis or reduce the scan cost when aggregating data frequently, you can save the aggregation results to another tables. This can be implemented by using the [continuous aggregation](/user-guide/continuous-aggregation/overview.md) feature of GreptimeDB. +For further analysis or reduce the scan cost when aggregating data frequently, you can save the aggregation results to another tables. This can be implemented by using the [continuous aggregation](/user-guide/flow-computation/overview.md) feature of GreptimeDB. For example, aggregate the API error number by 5-second and save the data to table `api_error_count`. diff --git a/docs/reference/sql/create.md b/docs/reference/sql/create.md index ac08bcabd..79c1e884e 100644 --- a/docs/reference/sql/create.md +++ b/docs/reference/sql/create.md @@ -138,7 +138,7 @@ The `ttl` value can be one of the following: - `months`, `month`, `M` – defined as 30.44 days - `years`, `year`, `y` – defined as 365.25 days - `forever`, `NULL`, an empty string `''` and `0s` (or any zero length duration, like `0d`), means the data will never be deleted. -- `instant`, note that database's TTL can't be set to `instant`. `instant` means the data will be deleted instantly when inserted, useful if you want to send input to a flow task without saving it, see more details in [flow management documents](/user-guide/continuous-aggregation/manage-flow.md#manage-flows). +- `instant`, note that database's TTL can't be set to `instant`. `instant` means the data will be deleted instantly when inserted, useful if you want to send input to a flow task without saving it, see more details in [flow management documents](/user-guide/flow-computation/manage-flow.md#manage-flows). - Unset, `ttl` can be unset by using `ALTER TABLE UNSET 'ttl'`, which means the table will inherit the database's ttl policy (if any). If a table has its own TTL policy, it will take precedence over the database TTL policy. 
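To make the `ttl` values listed above concrete, here is a minimal sketch of setting and later unsetting a table-level TTL. The table `sensor_log` and its columns are hypothetical and used only for illustration; the option names follow the documentation above.

```sql
-- Hypothetical table whose rows are kept for 7 days and then removed automatically.
CREATE TABLE sensor_log (
    sensor_id INT,
    temperature DOUBLE,
    ts TIMESTAMP TIME INDEX,
    PRIMARY KEY (sensor_id)
) WITH ('ttl' = '7d');

-- Remove the table-level policy so the table falls back to the database's TTL policy (if any).
ALTER TABLE sensor_log UNSET 'ttl';
```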
@@ -420,7 +420,7 @@ AS ; ``` -For the statement to create or update a flow, please read the [flow management documents](/user-guide/continuous-aggregation/manage-flow.md#create-a-flow). +For the statement to create or update a flow, please read the [flow management documents](/user-guide/flow-computation/manage-flow.md#create-a-flow). ## CREATE VIEW diff --git a/docs/user-guide/concepts/features-that-you-concern.md b/docs/user-guide/concepts/features-that-you-concern.md index 9e8c99576..22a9d7529 100644 --- a/docs/user-guide/concepts/features-that-you-concern.md +++ b/docs/user-guide/concepts/features-that-you-concern.md @@ -41,7 +41,7 @@ GreptimeDB resolves this issue by: ## Does GreptimeDB support continuous aggregate or downsampling? -Since 0.8, GreptimeDB added a new function called `Flow`, which is used for continuous aggregation. Please read the [user guide](/user-guide/continuous-aggregation/overview.md). +Since 0.8, GreptimeDB added a new function called `Flow`, which is used for continuous aggregation. Please read the [user guide](/user-guide/flow-computation/overview.md). ## Can I store data in object storage in the cloud? diff --git a/docs/user-guide/concepts/key-concepts.md b/docs/user-guide/concepts/key-concepts.md index 829662b84..0e7ccf1b1 100644 --- a/docs/user-guide/concepts/key-concepts.md +++ b/docs/user-guide/concepts/key-concepts.md @@ -52,4 +52,4 @@ The data displayed in a view is retrieved dynamically from the underlying tables ## Flow -A `flow` in GreptimeDB refers to a [continuous aggregation](/user-guide/continuous-aggregation/overview.md) process that continuously updates and materializes aggregated data based on incoming data. +A `flow` in GreptimeDB refers to a [continuous aggregation](/user-guide/flow-computation/overview.md) process that continuously updates and materializes aggregated data based on incoming data. diff --git a/docs/user-guide/continuous-aggregation/overview.md b/docs/user-guide/continuous-aggregation/overview.md deleted file mode 100644 index 6b17e13e7..000000000 --- a/docs/user-guide/continuous-aggregation/overview.md +++ /dev/null @@ -1,150 +0,0 @@ ---- -description: Introduction to GreptimeDB's continuous aggregation feature, including real-time data aggregation, use cases, and a quick start example. ---- - -# Overview - -GreptimeDB provides a continuous aggregation feature that allows you to aggregate data in real-time. This feature is useful when you need to calculate and query the sum, average, or other aggregations on the fly. The continuous aggregation feature is provided by the Flow engine. It continuously updates the aggregated data based on the incoming data and materialize it. So you can think of it as a clever materialized views that know when to update result view table and how to update it with minimal effort. Some common use case include: - -- Downsampling the data point using i.e. average pooling to reduce amount of data for storage and analysis -- Real-time analytics that provide actionable information in near real-time - -When you insert data into the source table, the data is also sent to and stored in the Flow engine. -The Flow engine calculate the aggregation by time windows and store the result in the sink table. -The entire process is illustrated in the following image: - -![Continuous Aggregation](/flow-ani.svg) - -## Quick start with an example - -Here is a complete example of how a continuous aggregation query looks like. 
- -This use case is to calculate the total number of logs, the minimum size, the maximum size, the average size, and the number of packets with the size greater than 550 for each status code in a 1-minute fixed window for access logs. -First, create a source table `ngx_access_log` and a sink table `ngx_statistics` with following clauses: - -```sql -CREATE TABLE `ngx_access_log` ( - `client` STRING NULL, - `ua_platform` STRING NULL, - `referer` STRING NULL, - `method` STRING NULL, - `endpoint` STRING NULL, - `trace_id` STRING NULL FULLTEXT, - `protocol` STRING NULL, - `status` SMALLINT UNSIGNED NULL, - `size` DOUBLE NULL, - `agent` STRING NULL, - `access_time` TIMESTAMP(3) NOT NULL, - TIME INDEX (`access_time`) -) -WITH( - append_mode = 'true' -); -``` - -```sql -CREATE TABLE `ngx_statistics` ( - `status` SMALLINT UNSIGNED NULL, - `total_logs` BIGINT NULL, - `min_size` DOUBLE NULL, - `max_size` DOUBLE NULL, - `avg_size` DOUBLE NULL, - `high_size_count` BIGINT NULL, - `time_window` TIMESTAMP time index, - `update_at` TIMESTAMP NULL, - PRIMARY KEY (`status`) -); -``` - -Then create the flow `ngx_aggregation` to aggregate a series of aggregate functions, including `count`, `min`, `max`, `avg` of the `size` column, and the sum of all packets of size great than 550. The aggregation is calculated in 1-minute fixed windows of `access_time` column and also grouped by the `status` column. So you can be made aware in real time the information about packet size and action upon it, i.e. if the `high_size_count` became too high at a certain point, you can further examine if anything goes wrong, or if the `max_size` column suddenly spike in a 1 minute time window, you can then trying to locate that packet and further inspect it. - -```sql -CREATE FLOW ngx_aggregation -SINK TO ngx_statistics -AS -SELECT - status, - count(client) AS total_logs, - min(size) as min_size, - max(size) as max_size, - avg(size) as avg_size, - sum(case when `size` > 550 then 1 else 0 end) as high_size_count, - date_bin(INTERVAL '1 minutes', access_time) as time_window, -FROM ngx_access_log -GROUP BY - status, - time_window; -``` - -To observe the outcome of the continuous aggregation in the `ngx_statistics` table, insert some data into the source table `ngx_access_log`. 
- -```sql -INSERT INTO ngx_access_log -VALUES - ("android", "Android", "referer", "GET", "/api/v1", "trace_id", "HTTP", 200, 1000, "agent", "2021-07-01 00:00:01.000"), - ("ios", "iOS", "referer", "GET", "/api/v1", "trace_id", "HTTP", 200, 500, "agent", "2021-07-01 00:00:30.500"), - ("android", "Android", "referer", "GET", "/api/v1", "trace_id", "HTTP", 200, 600, "agent", "2021-07-01 00:01:01.000"), - ("ios", "iOS", "referer", "GET", "/api/v1", "trace_id", "HTTP", 404, 700, "agent", "2021-07-01 00:01:01.500"); -``` - -Then the sink table `ngx_statistics` will be incremental updated and contain the following data: - -```sql -SELECT * FROM ngx_statistics; -``` - -```sql - status | total_logs | min_size | max_size | avg_size | high_size_count | time_window | update_at ---------+------------+----------+----------+----------+-----------------+----------------------------+---------------------------- - 200 | 2 | 500 | 1000 | 750 | 1 | 2021-07-01 00:00:00.000000 | 2024-07-24 08:36:17.439000 - 200 | 1 | 600 | 600 | 600 | 1 | 2021-07-01 00:01:00.000000 | 2024-07-24 08:36:17.439000 - 404 | 1 | 700 | 700 | 700 | 1 | 2021-07-01 00:01:00.000000 | 2024-07-24 08:36:17.439000 -(3 rows) -``` - -Try to insert more data into the `ngx_access_log` table: - -```sql -INSERT INTO ngx_access_log -VALUES - ("android", "Android", "referer", "GET", "/api/v1", "trace_id", "HTTP", 200, 500, "agent", "2021-07-01 00:01:01.000"), - ("ios", "iOS", "referer", "GET", "/api/v1", "trace_id", "HTTP", 404, 800, "agent", "2021-07-01 00:01:01.500"); -``` - -The sink table `ngx_statistics` now have corresponding rows updated, notes how `max_size`, `avg_size` and `high_size_count` are updated: - -```sql -SELECT * FROM ngx_statistics; -``` - -```sql - status | total_logs | min_size | max_size | avg_size | high_size_count | time_window | update_at ---------+------------+----------+----------+----------+-----------------+----------------------------+---------------------------- - 200 | 2 | 500 | 1000 | 750 | 1 | 2021-07-01 00:00:00.000000 | 2024-07-24 08:36:17.439000 - 200 | 2 | 500 | 600 | 550 | 1 | 2021-07-01 00:01:00.000000 | 2024-07-24 08:36:46.495000 - 404 | 2 | 700 | 800 | 750 | 2 | 2021-07-01 00:01:00.000000 | 2024-07-24 08:36:46.495000 -(3 rows) -``` - -Here is the explanation of the columns in the `ngx_statistics` table: - -- `status`: The status code of the HTTP response. -- `total_logs`: The total number of logs with the same status code. -- `min_size`: The minimum size of the packets with the same status code. -- `max_size`: The maximum size of the packets with the same status code. -- `avg_size`: The average size of the packets with the same status code. -- `high_size_count`: The number of packets with the size greater than 550. -- `time_window`: The time window of the aggregation. -- `update_at`: The time when the aggregation is updated. - - - - -## Next Steps - -Congratulations you already have a preliminary understanding of the continuous aggregation feature. -Please refer to the following sections to learn more: - -- [Usecase Examples](./usecase-example.md) provides more examples of how to use continuous aggregation in real-time analytics, monitoring, and dashboard. -- [Manage Flows](./manage-flow.md) describes how to create and delete a flow. Each of your continuous aggregation query is a flow. -- [Expression](./expression.md) is a reference of available expressions in the continuous aggregation query. 
diff --git a/docs/user-guide/continuous-aggregation/usecase-example.md b/docs/user-guide/flow-computation/continuous-aggregation.md similarity index 58% rename from docs/user-guide/continuous-aggregation/usecase-example.md rename to docs/user-guide/flow-computation/continuous-aggregation.md index 88b5d447b..38df73c4d 100644 --- a/docs/user-guide/continuous-aggregation/usecase-example.md +++ b/docs/user-guide/flow-computation/continuous-aggregation.md @@ -1,8 +1,14 @@ --- -description: Provides major use case examples for continuous aggregation in GreptimeDB, including real-time analytics, monitoring, and dashboards. It includes SQL queries and examples to demonstrate how to set up and use continuous aggregation for various scenarios. +description: Explore continuous aggregation in GreptimeDB for real-time insights. This guide explains how to perform continuous aggregations using the Flow engine, including calculating sums, averages, and other metrics within specified time windows. Learn through examples of real-time analytics, monitoring, and dashboard use cases. Understand how to create source and sink tables, define flows, and write SQL queries for continuous aggregation. Discover how to calculate log statistics, retrieve distinct countries by time window, and monitor sensor data in real-time. This comprehensive guide is essential for developers looking to leverage GreptimeDB's continuous aggregation capabilities for efficient data processing and real-time analytics. --- -# Usecase Examples +# Continuous Aggregation + +Continuous aggregation is a crucial aspect of processing time-series data to deliver real-time insights. +The Flow engine empowers developers to perform continuous aggregations, +such as calculating sums, averages, and other metrics, seamlessly. +It efficiently updates the aggregated data within specified time windows, making it an invaluable tool for analytics. + Following are three major usecase examples for continuous aggregation: 1. **Real-time Analytics**: A real-time analytics platform that continuously aggregates data from a stream of events, delivering immediate insights while optionally downsampling the data to a lower resolution. For instance, this system can compile data from a high-frequency stream of log events (e.g., occurring every millisecond) to provide up-to-the-minute insights such as the number of requests per minute, average response times, and error rates per minute. @@ -13,9 +19,130 @@ Following are three major usecase examples for continuous aggregation: In all these usecases, the continuous aggregation system continuously aggregates data from a stream of events and provides real-time insights and alerts based on the aggregated data. The system can also downsample the data to a lower resolution to reduce the amount of data stored and processed. This allows the system to provide real-time insights and alerts while keeping the data storage and processing costs low. -## Real-time analytics example -See [Overview](/user-guide/continuous-aggregation/overview.md#quick-start-with-an-example) for an example of real-time analytics. Which is to calculate the total number of logs, the minimum size, the maximum size, the average size, and the number of packets with the size greater than 550 for each status code in a 1-minute fixed window for access logs. 
+## Real-time Analytics Example
+
+### Calculate the Log Statistics
+
+This use case is to calculate the total number of logs, the minimum size, the maximum size, the average size, and the number of packets with the size greater than 550 for each status code in a 1-minute fixed window for access logs.
+First, create a source table `ngx_access_log` and a sink table `ngx_statistics` with the following clauses:
+
+```sql
+CREATE TABLE `ngx_access_log` (
+    `client` STRING NULL,
+    `ua_platform` STRING NULL,
+    `referer` STRING NULL,
+    `method` STRING NULL,
+    `endpoint` STRING NULL,
+    `trace_id` STRING NULL FULLTEXT,
+    `protocol` STRING NULL,
+    `status` SMALLINT UNSIGNED NULL,
+    `size` DOUBLE NULL,
+    `agent` STRING NULL,
+    `access_time` TIMESTAMP(3) NOT NULL,
+    TIME INDEX (`access_time`)
+)
+WITH(
+    append_mode = 'true'
+);
+```
+
+```sql
+CREATE TABLE `ngx_statistics` (
+    `status` SMALLINT UNSIGNED NULL,
+    `total_logs` BIGINT NULL,
+    `min_size` DOUBLE NULL,
+    `max_size` DOUBLE NULL,
+    `avg_size` DOUBLE NULL,
+    `high_size_count` BIGINT NULL,
+    `time_window` TIMESTAMP time index,
+    `update_at` TIMESTAMP NULL,
+    PRIMARY KEY (`status`)
+);
+```
+
+Then create the flow `ngx_aggregation`, which computes a series of aggregations: the `count`, `min`, `max`, and `avg` of the `size` column, as well as the number of packets whose size is greater than 550. The aggregations are calculated in 1-minute fixed windows of the `access_time` column and are also grouped by the `status` column. This gives you real-time information about packet sizes that you can act upon: for example, if `high_size_count` becomes too high at some point, you can investigate whether anything is going wrong, and if `max_size` suddenly spikes within a 1-minute window, you can try to locate that packet and inspect it further.
+
+```sql
+CREATE FLOW ngx_aggregation
+SINK TO ngx_statistics
+AS
+SELECT
+    status,
+    count(client) AS total_logs,
+    min(size) as min_size,
+    max(size) as max_size,
+    avg(size) as avg_size,
+    sum(case when `size` > 550 then 1 else 0 end) as high_size_count,
+    date_bin(INTERVAL '1 minutes', access_time) as time_window,
+FROM ngx_access_log
+GROUP BY
+    status,
+    time_window;
+```
+
+To observe the outcome of the continuous aggregation in the `ngx_statistics` table, insert some data into the source table `ngx_access_log`.
+
+```sql
+INSERT INTO ngx_access_log
+VALUES
+    ("android", "Android", "referer", "GET", "/api/v1", "trace_id", "HTTP", 200, 1000, "agent", "2021-07-01 00:00:01.000"),
+    ("ios", "iOS", "referer", "GET", "/api/v1", "trace_id", "HTTP", 200, 500, "agent", "2021-07-01 00:00:30.500"),
+    ("android", "Android", "referer", "GET", "/api/v1", "trace_id", "HTTP", 200, 600, "agent", "2021-07-01 00:01:01.000"),
+    ("ios", "iOS", "referer", "GET", "/api/v1", "trace_id", "HTTP", 404, 700, "agent", "2021-07-01 00:01:01.500");
+```
+
+The sink table `ngx_statistics` is then incrementally updated and contains the following data:
+
+```sql
+SELECT * FROM ngx_statistics;
+```
+
+```sql
+ status | total_logs | min_size | max_size | avg_size | high_size_count | time_window                | update_at
+--------+------------+----------+----------+----------+-----------------+----------------------------+----------------------------
+    200 |          2 |      500 |     1000 |      750 |               1 | 2021-07-01 00:00:00.000000 | 2024-07-24 08:36:17.439000
+    200 |          1 |      600 |      600 |      600 |               1 | 2021-07-01 00:01:00.000000 | 2024-07-24 08:36:17.439000
+    404 |          1 |      700 |      700 |      700 |               1 | 2021-07-01 00:01:00.000000 | 2024-07-24 08:36:17.439000
+(3 rows)
+```
+
+Try to insert more data into the `ngx_access_log` table:
+
+```sql
+INSERT INTO ngx_access_log
+VALUES
+    ("android", "Android", "referer", "GET", "/api/v1", "trace_id", "HTTP", 200, 500, "agent", "2021-07-01 00:01:01.000"),
+    ("ios", "iOS", "referer", "GET", "/api/v1", "trace_id", "HTTP", 404, 800, "agent", "2021-07-01 00:01:01.500");
+```
+
+The corresponding rows in the sink table `ngx_statistics` are now updated; note how `max_size`, `avg_size`, and `high_size_count` have changed:
+
+```sql
+SELECT * FROM ngx_statistics;
+```
+
+```sql
+ status | total_logs | min_size | max_size | avg_size | high_size_count | time_window                | update_at
+--------+------------+----------+----------+----------+-----------------+----------------------------+----------------------------
+    200 |          2 |      500 |     1000 |      750 |               1 | 2021-07-01 00:00:00.000000 | 2024-07-24 08:36:17.439000
+    200 |          2 |      500 |      600 |      550 |               1 | 2021-07-01 00:01:00.000000 | 2024-07-24 08:36:46.495000
+    404 |          2 |      700 |      800 |      750 |               2 | 2021-07-01 00:01:00.000000 | 2024-07-24 08:36:46.495000
+(3 rows)
+```
+
+Here is the explanation of the columns in the `ngx_statistics` table:
+
+- `status`: The status code of the HTTP response.
+- `total_logs`: The total number of logs with the same status code.
+- `min_size`: The minimum size of the packets with the same status code.
+- `max_size`: The maximum size of the packets with the same status code.
+- `avg_size`: The average size of the packets with the same status code.
+- `high_size_count`: The number of packets with the size greater than 550.
+- `time_window`: The time window of the aggregation.
+- `update_at`: The time when the aggregation is updated.
+
+### Retrieve Distinct Countries by Time Window
 
 Another example of real-time analytics is to retrieve all distinct countries from the `ngx_access_log` table. You can use the following query to group countries by time window:
@@ -91,7 +218,7 @@ select * from ngx_country;
 +---------+---------------------+----------------------------+
 ```
 
-## Real-time monitoring example
+## Real-Time Monitoring Example
 
 Consider a usecase where you have a stream of sensor events from a network of temperature sensors that you want to monitor in real-time. The sensor events contain information such as the sensor ID, the temperature reading, the timestamp of the reading, and the location of the sensor.
You want to continuously aggregate this data to provide real-time alerts when the temperature exceeds a certain threshold. Then the query for continuous aggregation would be: @@ -178,7 +305,7 @@ SELECT * FROM temp_alerts; +-----------+-------+----------+---------------------+----------------------------+ ``` -## Real-time dashboard +## Real-Time Dashboard Consider a usecase in which you need a bar graph that show the distribution of packet sizes for each status code to monitor the health of the system. The query for continuous aggregation would be: @@ -253,6 +380,8 @@ SELECT * FROM ngx_distribution; +------+-------------+------------+---------------------+----------------------------+ ``` -## Conclusion +## Next Steps + +- [Manage Flow](manage-flow.md): Gain insights into the mechanisms of the Flow engine and the SQL syntax for defining a Flow. +- [Expressions](expressions.md): Learn about the expressions supported by the Flow engine for data transformation. -Continuous aggregation is a powerful tool for real-time analytics, monitoring, and dashboarding. It allows you to continuously aggregate data from a stream of events and provide real-time insights and alerts based on the aggregated data. By downsampling the data to a lower resolution, you can reduce the amount of data stored and processed, making it easier to provide real-time insights and alerts while keeping the data storage and processing costs low. Continuous aggregation is a key component of any real-time data processing system and can be used in a wide range of usecases to provide real-time insights and alerts based on streaming data. \ No newline at end of file diff --git a/docs/user-guide/continuous-aggregation/expression.md b/docs/user-guide/flow-computation/expressions.md similarity index 98% rename from docs/user-guide/continuous-aggregation/expression.md rename to docs/user-guide/flow-computation/expressions.md index 2b274c907..b28d8b19b 100644 --- a/docs/user-guide/continuous-aggregation/expression.md +++ b/docs/user-guide/flow-computation/expressions.md @@ -2,7 +2,7 @@ description: Lists supported aggregate and scalar functions in GreptimeDB's flow, including count, sum, avg, min, max, and various scalar functions. It provides links to detailed documentation for each function. --- -# Expression +# Expressions ## Aggregate functions diff --git a/docs/user-guide/continuous-aggregation/manage-flow.md b/docs/user-guide/flow-computation/manage-flow.md similarity index 88% rename from docs/user-guide/continuous-aggregation/manage-flow.md rename to docs/user-guide/flow-computation/manage-flow.md index 030043f28..946f17135 100644 --- a/docs/user-guide/continuous-aggregation/manage-flow.md +++ b/docs/user-guide/flow-computation/manage-flow.md @@ -1,5 +1,5 @@ --- -description: Describes how to manage flows in GreptimeDB, including creating, updating, and deleting flows. It explains the syntax for creating flows, the importance of sink tables, and how to use the EXPIRE AFTER clause. Examples of SQL queries for managing flows are provided. +description: Learn how to manage flows in GreptimeDB, including creating, updating, and deleting flows. This guide covers the syntax for creating flows, the importance of sink tables, and how to use the EXPIRE AFTER clause. It provides examples of SQL queries for managing flows, creating source and sink tables, and defining continuous aggregation queries. Understand the significance of column order, time index, and tags in sink tables. 
Discover how to manually trigger flow processing with the ADMIN FLUSH_FLOW command and how to delete flows using the DROP FLOW clause. This comprehensive guide is essential for users looking to leverage GreptimeDB's flow computation capabilities for real-time analytics, monitoring, and dashboards. --- # Manage Flows @@ -8,9 +8,10 @@ Each `flow` is a continuous aggregation query in GreptimeDB. It continuously updates the aggregated data based on the incoming data. This document describes how to create, and delete a flow. -## Create input table +## Create a Source Table Before creating a flow, you need to create a source table to store the raw data. Like this: + ```sql CREATE TABLE temp_sensor_data ( sensor_id INT, @@ -21,6 +22,7 @@ CREATE TABLE temp_sensor_data ( ); ``` However, if you don't want to store the raw data, you can use a temporary table as the source table by creating table using `WITH ('ttl' = 'instant')` table option: + ```sql CREATE TABLE temp_sensor_data ( sensor_id INT, @@ -30,9 +32,10 @@ CREATE TABLE temp_sensor_data ( PRIMARY KEY(sensor_id, loc) ) WITH ('ttl' = 'instant'); ``` + Setting `'ttl'` to `'instant'` will make the table a temporary table, which means it will automatically discard all inserted data and the table will always be empty, only sending them to flow task for computation. -## Create a sink table +## Create a Sink Table Before creating a flow, you need a sink table to store the aggregated data generated by the flow. While it is the same to a regular time series table, there are a few important considerations: @@ -152,11 +155,11 @@ Only data timestamped from 09:00:00 onwards will be used in the aggregation. The `SQL` part of the flow is similar to a standard `SELECT` clause with a few differences. The syntax of the query is as follows: ```sql -SELECT AGGR_FUNCTION(column1, column2,..), TIME_WINDOW_FUNCTION() as time_window FROM GROUP BY time_window; +SELECT AGGR_FUNCTION(column1, column2,..) [, TIME_WINDOW_FUNCTION() as time_window] FROM GROUP BY {time_window | column1, column2,.. }; ``` Only the following types of expressions are allowed after the `SELECT` keyword: -- Aggregate functions: Refer to the [Expression](./expression.md) documentation for details. +- Aggregate functions: Refer to the [Expressions](expressions.md) documentation for details. - Time window functions: Refer to the [define time window](#define-time-window) section for details. - Scalar functions: Such as `col`, `to_lowercase(col)`, `col + 1`, etc. This part is the same as in a standard `SELECT` clause in GreptimeDB. @@ -176,7 +179,7 @@ The following points should be noted about the rest of the query syntax: Other expressions in `GROUP BY` can include literals, columns, or scalar expressions. - `ORDER BY`, `LIMIT`, and `OFFSET` are not supported. -Refer to [Usecase Examples](./usecase-example.md) for more examples of how to use continuous aggregation in real-time analytics, monitoring, and dashboards. +Refer to [Continuous Aggregation](continuous-aggregation.md) for more examples of how to use continuous aggregation in real-time analytics, monitoring, and dashboards. ### Define time window diff --git a/docs/user-guide/flow-computation/overview.md b/docs/user-guide/flow-computation/overview.md new file mode 100644 index 000000000..cc2a7bc42 --- /dev/null +++ b/docs/user-guide/flow-computation/overview.md @@ -0,0 +1,124 @@ +--- +description: Discover how GreptimeDB's Flow engine enables real-time computation of data streams for ETL processes and on-the-fly calculations. 
Learn about its programming model, use cases, and a quick start example for calculating user agent statistics from nginx logs.
+---
+
+# Overview
+
+GreptimeDB's Flow engine enables real-time computation of data streams.
+It is particularly beneficial for Extract-Transform-Load (ETL) processes or for performing on-the-fly calculations and queries such as sum, average, and other aggregations.
+The Flow engine ensures that data is processed incrementally and continuously,
+updating the final results as new streaming data arrives.
+It functions similarly to materialized views,
+determining when and how to update the result view table with minimal effort.
+
+Use cases include:
+
+- Real-time analytics that deliver actionable insights almost instantaneously.
+- Downsampling data points, such as using average pooling, to reduce the volume of data for storage and analysis.
+
+## Programming Model
+
+Upon data insertion into the source table,
+the data is concurrently ingested to the Flow engine.
+At each trigger interval (one second),
+the Flow engine executes the specified computations and updates the sink table with the results.
+Both the source and sink tables are time-series tables within GreptimeDB.
+Before creating a Flow,
+it is crucial to define the schemas for these tables and design the Flow to specify the computation logic.
+This process is visually represented in the following image:
+
+![Continuous Aggregation](/flow-ani.svg)
+
+## Quick Start Example
+
+To illustrate the capabilities of GreptimeDB's Flow engine,
+consider the task of calculating user agent statistics from nginx logs.
+The source table is `ngx_access_log`,
+and the sink table is `user_agent_statistics`.
+
+First, create the source table `ngx_access_log`.
+To optimize performance for counting the `user_agent` field,
+specify it as a `TAG` column type using the `PRIMARY KEY` keyword.
+
+```sql
+CREATE TABLE ngx_access_log (
+    ip_address STRING,
+    http_method STRING,
+    request STRING,
+    status_code INT16,
+    body_bytes_sent INT32,
+    user_agent STRING,
+    response_size INT32,
+    ts TIMESTAMP TIME INDEX,
+    PRIMARY KEY (ip_address, http_method, user_agent, status_code)
+) WITH ('append_mode'='true');
+```
+
+Next, create the sink table `user_agent_statistics`.
+Note that all tables in GreptimeDB are time-series tables,
+hence the inclusion of the `__ts_placeholder` column as a timestamp placeholder.
+
+```sql
+CREATE TABLE user_agent_statistics (
+    user_agent STRING,
+    total_count INT32,
+    __ts_placeholder TIMESTAMP TIME INDEX,
+    update_at TIMESTAMP,
+    PRIMARY KEY (user_agent)
+);
+```
+
+Finally, create the Flow `user_agent_flow` to count the occurrences of each user agent in the `ngx_access_log` table.
+
+```sql
+CREATE FLOW user_agent_flow
+SINK TO user_agent_statistics
+AS
+SELECT
+    user_agent,
+    COUNT(user_agent) AS total_count
+FROM
+    ngx_access_log
+GROUP BY
+    user_agent;
+```
+
+Once the Flow is created,
+the Flow engine will continuously process data from the `ngx_access_log` table and update the `user_agent_statistics` table with the computed results.
+
+To observe the continuous aggregation results,
+insert sample data into the `ngx_access_log` table.
+ +```sql +INSERT INTO ngx_access_log +VALUES + ('192.168.1.1', 'GET', '/index.html', 200, 512, 'Mozilla/5.0', 1024, '2023-10-01T10:00:00Z'), + ('192.168.1.2', 'POST', '/submit', 201, 256, 'curl/7.68.0', 512, '2023-10-01T10:01:00Z'), + ('192.168.1.1', 'GET', '/about.html', 200, 128, 'Mozilla/5.0', 256, '2023-10-01T10:02:00Z'), + ('192.168.1.3', 'GET', '/contact', 404, 64, 'curl/7.68.0', 128, '2023-10-01T10:03:00Z'); +``` + +After inserting the data, +query the `user_agent_statistics` table to view the results. + +```sql +SELECT * FROM user_agent_statistics; +``` + +The query results will display the total count of each user agent in the `user_agent_statistics` table. + +```sql ++-----------------+-------------+ +| user_agent | total_count | ++-----------------+-------------+ +| Mozilla/5.0 | 2 | +| curl/7.68.0 | 2 | ++-----------------+-------------+ +``` + +## Next Steps + +- [Continuous Aggregation](./continuous-aggregation.md): Explore the primary scenario in time-series data processing, with three common use cases for continuous aggregation. +- [Manage Flow](manage-flow.md): Gain insights into the mechanisms of the Flow engine and the SQL syntax for defining a Flow. +- [Expressions](expressions.md): Learn about the expressions supported by the Flow engine for data transformation. + diff --git a/docs/user-guide/overview.md b/docs/user-guide/overview.md index 12c62e12c..62d45540a 100644 --- a/docs/user-guide/overview.md +++ b/docs/user-guide/overview.md @@ -54,7 +54,7 @@ Next, let's analyze the key features of GreptimeDB demonstrated by this query ex - **Unified Storage:** GreptimeDB is the time series database to store and analyze both metrics and [logs](/user-guide/logs/overview.md). The simplified architecture and data consistency enhances the ability to analyze and troubleshoot issues, and can lead to cost savings and improved system performance. - **Unique Data Model:** The unique [data model](/user-guide/concepts/data-model.md) with time index and full-text index greatly improves query performance and has stood the test of large data sets. It not only supports metric [insertion](/user-guide/ingest-data/overview.md) and [query](/user-guide/query-data/overview.md), but also provides a very friendly way to [write](/user-guide/logs/write-logs.md) and [query](/user-guide/logs/query-logs.md) logs, as well as handle [vector type data](/user-guide/vectors/vector-type.md). -- **Range Queries:** GreptimeDB supports [range queries](/user-guide/query-data/sql.md#aggregate-data-by-time-window) to evaluate [expressions](/reference/sql/functions/overview.md) over time, providing insights into metric trends. You can also [continuously aggregate](/user-guide/continuous-aggregation/overview.md) data for further analysis. +- **Range Queries:** GreptimeDB supports [range queries](/user-guide/query-data/sql.md#aggregate-data-by-time-window) to evaluate [expressions](/reference/sql/functions/overview.md) over time, providing insights into metric trends. You can also [continuously aggregate](/user-guide/flow-computation/overview.md) data for further analysis. - **SQL and Multiple Protocols:** GreptimeDB uses SQL as the main query language and supports [multiple protocols](/user-guide/protocols/overview.md), which greatly reduces the learning curve and development cost. You can easily migrate from Prometheus or [Influxdb to GreptimeDB](/user-guide/migrate-to-greptimedb/migrate-from-influxdb.md), or just start with GreptimeDB. 
- **JOIN Operations:** The data model of GreptimeDB's time series tables enables it to support [JOIN operations](/reference/sql/join.md) on metrics and logs. @@ -69,6 +69,6 @@ Having understood these features, you can now go directly to exploring the featu * [Manage Data](./manage-data/overview.md) * [Integrations](./integrations/overview.md) * [Protocols](./protocols/overview.md) -* [Continuous Aggregation](./continuous-aggregation/overview.md) +* [Continuous Aggregation](./flow-computation/overview.md) * [Operations](./administration/overview.md) diff --git a/sidebars.ts b/sidebars.ts index 5f806a65b..d2e5f70c5 100644 --- a/sidebars.ts +++ b/sidebars.ts @@ -136,13 +136,13 @@ const sidebars: SidebarsConfig = { }, { type: 'category', - label: 'Continuous Aggregation', + label: 'Flow Computation', items: [ - 'user-guide/continuous-aggregation/overview', - 'user-guide/continuous-aggregation/usecase-example', - 'user-guide/continuous-aggregation/manage-flow', - 'user-guide/continuous-aggregation/expression', - ], + 'user-guide/flow-computation/overview', + 'user-guide/flow-computation/continuous-aggregation', + 'user-guide/flow-computation/manage-flow', + 'user-guide/flow-computation/expressions', + ] }, { type: 'category', From 0ce99574af7b3d60c4662d2b24cf2c769156b276 Mon Sep 17 00:00:00 2001 From: Yiran Date: Thu, 12 Dec 2024 15:06:52 +0800 Subject: [PATCH 2/7] test SQLs --- .../flow-computation/manage-flow.md | 2 +- docs/user-guide/flow-computation/overview.md | 21 ++++++++++--------- 2 files changed, 12 insertions(+), 11 deletions(-) diff --git a/docs/user-guide/flow-computation/manage-flow.md b/docs/user-guide/flow-computation/manage-flow.md index 946f17135..d92c21de0 100644 --- a/docs/user-guide/flow-computation/manage-flow.md +++ b/docs/user-guide/flow-computation/manage-flow.md @@ -42,7 +42,7 @@ While it is the same to a regular time series table, there are a few important c - **Column order and type**: Ensure the order and type of the columns in the sink table match the query result of the flow. - **Time index**: Specify the `TIME INDEX` for the sink table, typically using the time window column generated by the time window function. -- **Specify `update_at` as the last column of the schema**: The flow automatically writes the update time of the data to the `update_at` column. Ensure this column is the last one in the sink table schema. +- **Update time**: The Flow engine automatically appends the update time to the end of each computation result row. This update time is stored in the `updated_at` column. Ensure that this column is included in the sink table schema. - **Tags**: Use `PRIMARY KEY` to specify Tags, which together with the time index serves as a unique identifier for row data and optimizes query performance. For example: diff --git a/docs/user-guide/flow-computation/overview.md b/docs/user-guide/flow-computation/overview.md index cc2a7bc42..e9907e48e 100644 --- a/docs/user-guide/flow-computation/overview.md +++ b/docs/user-guide/flow-computation/overview.md @@ -55,15 +55,16 @@ CREATE TABLE ngx_access_log ( ``` Next, create the sink table `user_agent_statistics`. -Note that all tables in GreptimeDB are time-series tables, -hence the inclusion of the `__ts_placeholder` column as a timestamp placeholder. +The `update_at` column tracks the last update time of the record, which is automatically updated by the Flow engine. +Although all tables in GreptimeDB are time-series tables, this computation does not require time windows. 
+Therefore, the `__ts_placeholder` column is included as a time index placeholder. ```sql CREATE TABLE user_agent_statistics ( user_agent STRING, - total_count INT32, - __ts_placeholder TIMESTAMP TIME INDEX, + total_count INT64, update_at TIMESTAMP, + __ts_placeholder TIMESTAMP TIME INDEX, PRIMARY KEY (user_agent) ); ``` @@ -108,12 +109,12 @@ SELECT * FROM user_agent_statistics; The query results will display the total count of each user agent in the `user_agent_statistics` table. ```sql -+-----------------+-------------+ -| user_agent | total_count | -+-----------------+-------------+ -| Mozilla/5.0 | 2 | -| curl/7.68.0 | 2 | -+-----------------+-------------+ ++-------------+-------------+----------------------------+---------------------+ +| user_agent | total_count | update_at | __ts_placeholder | ++-------------+-------------+----------------------------+---------------------+ +| Mozilla/5.0 | 2 | 2024-12-12 06:45:33.228000 | 1970-01-01 00:00:00 | +| curl/7.68.0 | 2 | 2024-12-12 06:45:33.228000 | 1970-01-01 00:00:00 | ++-------------+-------------+----------------------------+---------------------+ ``` ## Next Steps From ec141bcfa0ed410efff10c3cb1bcc5b2a5a362a5 Mon Sep 17 00:00:00 2001 From: Yiran Date: Thu, 12 Dec 2024 15:50:32 +0800 Subject: [PATCH 3/7] the zh docs --- docs/user-guide/flow-computation/overview.md | 5 +- .../current.json | 6 +- .../continuous-aggregation/overview.md | 151 ------------------ .../continuous-aggregation.md} | 145 +++++++++++++++-- .../expressions.md} | 0 .../manage-flow.md | 4 +- .../user-guide/flow-computation/overview.md | 124 ++++++++++++++ 7 files changed, 261 insertions(+), 174 deletions(-) delete mode 100644 i18n/zh/docusaurus-plugin-content-docs/current/user-guide/continuous-aggregation/overview.md rename i18n/zh/docusaurus-plugin-content-docs/current/user-guide/{continuous-aggregation/usecase-example.md => flow-computation/continuous-aggregation.md} (61%) rename i18n/zh/docusaurus-plugin-content-docs/current/user-guide/{continuous-aggregation/expression.md => flow-computation/expressions.md} (100%) rename i18n/zh/docusaurus-plugin-content-docs/current/user-guide/{continuous-aggregation => flow-computation}/manage-flow.md (97%) create mode 100644 i18n/zh/docusaurus-plugin-content-docs/current/user-guide/flow-computation/overview.md diff --git a/docs/user-guide/flow-computation/overview.md b/docs/user-guide/flow-computation/overview.md index e9907e48e..153af9ccc 100644 --- a/docs/user-guide/flow-computation/overview.md +++ b/docs/user-guide/flow-computation/overview.md @@ -8,8 +8,7 @@ GreptimeDB's Flow engine enables real-time computation of data streams. It is particularly beneficial for Extract-Transform-Load (ETL) processes or for performing on-the-fly calculations and queries such as sum, average, and other aggregations. The Flow engine ensures that data is processed incrementally and continuously, updating the final results as new streaming data arrives. -It functions similarly to materialized views, -determining when and how to update the result view table with minimal effort. +You can think of it as a clever materialized views that know when to update result view table and how to update it with minimal effort. Use cases include: @@ -87,7 +86,7 @@ GROUP BY Once the Flow is created, the Flow engine will continuously process data from the `ngx_access_log` table and update the `user_agent_statistics` table with the computed results. 
-To observe the continuous aggregation results, +To observe the results, insert sample data into the `ngx_access_log` table. ```sql diff --git a/i18n/zh/docusaurus-plugin-content-docs/current.json b/i18n/zh/docusaurus-plugin-content-docs/current.json index f7b0de804..f77803b56 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/current.json +++ b/i18n/zh/docusaurus-plugin-content-docs/current.json @@ -31,9 +31,9 @@ "message": "读取数据", "description": "The label for category Query Data in sidebar docs" }, - "sidebar.docs.category.Continuous Aggregation": { - "message": "持续聚合", - "description": "The label for category Continuous Aggregation in sidebar docs" + "sidebar.docs.category.Flow Computation": { + "message": "流计算", + "description": "The label for category Flow Computation in sidebar docs" }, "sidebar.docs.category.Logs": { "message": "日志", diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/continuous-aggregation/overview.md b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/continuous-aggregation/overview.md deleted file mode 100644 index 7d3b552d9..000000000 --- a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/continuous-aggregation/overview.md +++ /dev/null @@ -1,151 +0,0 @@ ---- -description: GreptimeDB 提供持续聚合功能,允许实时聚合数据,适用于实时计算和查询总和、平均值等。本文介绍了持续聚合的基本概念、快速开始示例以及如何检查和验证聚合结果。 ---- - -# 概述 - -GreptimeDB 提供了一个持续聚合功能,允许你实时聚合数据。这个功能在当你需要实时计算和查询总和、平均值或其他聚合时非常有用。持续聚合功能由 Flow 引擎提供。它不断地基于传入的数据更新聚合数据并将其实现。因此,你可以将其视为一个聪明的物化视图,它知道何时更新结果视图表以及如何以最小的努力更新它。一些常见的用例包括: - -- 下采样数据点,使用如平均池化等方法减少存储和分析的数据量 -- 提供近实时分析,提供可操作的信息 - - -当你将数据插入 source 表时,数据也会被发送到 Flow 引擎并存储在其中。 -Flow 引擎通过时间窗口计算聚合并将结果存储在目标表中。 -整个过程如下图所示: - -![Continuous Aggregation](/flow-ani.svg) - -## 快速开始示例 - -以下是持续聚合查询的一个完整示例。 - - -这个例子是根据输入表中的数据计算一系列统计数据,包括一分钟时间窗口内的总日志数、最小大小、最大大小、平均大小以及大小大于 550 的数据包数。 - -首先,创建一个 source 表 `ngx_access_log` 和一个 sink 表 `ngx_statistics`,如下所示: - -```sql -CREATE TABLE `ngx_access_log` ( - `client` STRING NULL, - `ua_platform` STRING NULL, - `referer` STRING NULL, - `method` STRING NULL, - `endpoint` STRING NULL, - `trace_id` STRING NULL FULLTEXT, - `protocol` STRING NULL, - `status` SMALLINT UNSIGNED NULL, - `size` DOUBLE NULL, - `agent` STRING NULL, - `access_time` TIMESTAMP(3) NOT NULL, - TIME INDEX (`access_time`) -) -WITH( - append_mode = 'true' -); -``` - -```sql -CREATE TABLE `ngx_statistics` ( - `status` SMALLINT UNSIGNED NULL, - `total_logs` BIGINT NULL, - `min_size` DOUBLE NULL, - `max_size` DOUBLE NULL, - `avg_size` DOUBLE NULL, - `high_size_count` BIGINT NULL, - `time_window` TIMESTAMP time index, - `update_at` TIMESTAMP NULL, - PRIMARY KEY (`status`) -); -``` - -然后创建名为 `ngx_aggregation` 的 flow 任务,包括 `count`、`min`、`max`、`avg` `size` 列的聚合函数,以及大于 550 的所有数据包的大小总和。聚合是在 `access_time` 列的 1 分钟固定窗口中计算的,并且还按 `status` 列分组。因此,你可以实时了解有关数据包大小和对其的操作的信息,例如,如果 `high_size_count` 在某个时间点变得太高,你可以进一步检查是否有任何问题,或者如果 `max_size` 列在 1 分钟时间窗口内突然激增,你可以尝试定位该数据包并进一步检查。 - -```sql -CREATE FLOW ngx_aggregation -SINK TO ngx_statistics -AS -SELECT - status, - count(client) AS total_logs, - min(size) as min_size, - max(size) as max_size, - avg(size) as avg_size, - sum(case when `size` > 550 then 1 else 0 end) as high_size_count, - date_bin(INTERVAL '1 minutes', access_time) as time_window, -FROM ngx_access_log -GROUP BY - status, - time_window; -``` - -要检查持续聚合是否正常工作,首先插入一些数据到源表 `ngx_access_log` 中。 - -```sql -INSERT INTO ngx_access_log -VALUES - ("android", "Android", "referer", "GET", "/api/v1", "trace_id", "HTTP", 200, 1000, "agent", "2021-07-01 00:00:01.000"), - ("ios", "iOS", "referer", "GET", 
"/api/v1", "trace_id", "HTTP", 200, 500, "agent", "2021-07-01 00:00:30.500"), - ("android", "Android", "referer", "GET", "/api/v1", "trace_id", "HTTP", 200, 600, "agent", "2021-07-01 00:01:01.000"), - ("ios", "iOS", "referer", "GET", "/api/v1", "trace_id", "HTTP", 404, 700, "agent", "2021-07-01 00:01:01.500"); -``` - -则 `ngx_access_log` 表将被增量更新以包含以下数据: - -```sql -SELECT * FROM ngx_statistics; -``` - -```sql - status | total_logs | min_size | max_size | avg_size | high_size_count | time_window | update_at ---------+------------+----------+----------+----------+-----------------+----------------------------+---------------------------- - 200 | 2 | 500 | 1000 | 750 | 1 | 2021-07-01 00:00:00.000000 | 2024-07-24 08:36:17.439000 - 200 | 1 | 600 | 600 | 600 | 1 | 2021-07-01 00:01:00.000000 | 2024-07-24 08:36:17.439000 - 404 | 1 | 700 | 700 | 700 | 1 | 2021-07-01 00:01:00.000000 | 2024-07-24 08:36:17.439000 -(3 rows) -``` - -尝试向 `ngx_access_log` 表中插入更多数据: - -```sql -INSERT INTO ngx_access_log -VALUES - ("android", "Android", "referer", "GET", "/api/v1", "trace_id", "HTTP", 200, 500, "agent", "2021-07-01 00:01:01.000"), - ("ios", "iOS", "referer", "GET", "/api/v1", "trace_id", "HTTP", 404, 800, "agent", "2021-07-01 00:01:01.500"); -``` - -结果表 `ngx_statistics` 将被增量更新,注意 `max_size`、`avg_size` 和 `high_size_count` 是如何更新的: - -```sql -SELECT * FROM ngx_statistics; -``` - -```sql - status | total_logs | min_size | max_size | avg_size | high_size_count | time_window | update_at ---------+------------+----------+----------+----------+-----------------+----------------------------+---------------------------- - 200 | 2 | 500 | 1000 | 750 | 1 | 2021-07-01 00:00:00.000000 | 2024-07-24 08:36:17.439000 - 200 | 2 | 500 | 600 | 550 | 1 | 2021-07-01 00:01:00.000000 | 2024-07-24 08:36:46.495000 - 404 | 2 | 700 | 800 | 750 | 2 | 2021-07-01 00:01:00.000000 | 2024-07-24 08:36:46.495000 -(3 rows) -``` - -`ngx_statistics` 表中的列解释如下: - -- `status`: HTTP 响应的状态码。 -- `total_logs`: 相同状态码的日志总数。 -- `min_size`: 相同状态码的数据包的最小大小。 -- `max_size`: 相同状态码的数据包的最大大小。 -- `avg_size`: 相同状态码的数据包的平均大小。 -- `high_size_count`: 包大小大于 550 的数据包数。 -- `time_window`: 聚合的时间窗口。 -- `update_at`: 聚合结果更新的时间。 - -## 下一步 - -恭喜你已经初步了解了持续聚合功能。 -请参考以下章节了解更多: - - -- [用例](./usecase-example.md) 提供了更多关于如何在实时分析、监控和仪表板中使用持续聚合的示例。 -- [管理 Flow](./manage-flow.md) 描述了如何创建和删除 flow。你的每个持续聚合查询都是一个 flow。 -- [表达式](./expression.md) 是持续聚合查询中可用表达式。 diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/continuous-aggregation/usecase-example.md b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/flow-computation/continuous-aggregation.md similarity index 61% rename from i18n/zh/docusaurus-plugin-content-docs/current/user-guide/continuous-aggregation/usecase-example.md rename to i18n/zh/docusaurus-plugin-content-docs/current/user-guide/flow-computation/continuous-aggregation.md index 6b110fa7f..a800b4aa0 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/continuous-aggregation/usecase-example.md +++ b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/flow-computation/continuous-aggregation.md @@ -1,22 +1,141 @@ ---- -description: 持续聚合的用例示例,包括实时分析、实时监控和实时仪表盘的详细示例。 ---- +# 持续聚合 + +持续聚合是处理时间序列数据以提供实时洞察的关键方面。 +Flow 引擎使开发人员能够无缝地执行持续聚合,例如计算总和、平均值和其他指标。 +它在指定的时间窗口内高效地更新聚合数据,使其成为分析的宝贵工具。 -# 用例 持续聚合的三个主要用例示例如下: 1. **实时分析**:一个实时分析平台,不断聚合来自事件流的数据,提供即时洞察,同时可选择将数据降采样到较低分辨率。例如,此系统可以编译来自高频日志事件流(例如,每毫秒发生一次)的数据,以提供每分钟的请求数、平均响应时间和每分钟的错误率等最新洞察。 - 2. **实时监控**:一个实时监控系统,不断聚合来自事件流的数据,根据聚合数据提供实时警报。例如,此系统可以处理来自传感器事件流的数据,以提供当温度超过某个阈值时的实时警报。 - 3. 
**实时仪表盘**:一个实时仪表盘,显示每分钟的请求数、平均响应时间和每分钟的错误数。此仪表板可用于监控系统的健康状况,并检测系统中的任何异常。 在所有这些用例中,持续聚合系统不断聚合来自事件流的数据,并根据聚合数据提供实时洞察和警报。系统还可以将数据降采样到较低分辨率,以减少存储和处理的数据量。这使得系统能够提供实时洞察和警报,同时保持较低的数据存储和处理成本。 ## 实时分析示例 -请参考[概述](/user-guide/continuous-aggregation/overview.md#quick-start-with-an-example)中的实时分析示例。 -该示例用于计算日志的总数、数据包大小的最小、最大和平均值,以及大小大于 550 的数据包数量按照每个状态码在 1 分钟固定窗口中的实时分析。 +### 日志统计 + +这个例子是根据输入表中的数据计算一系列统计数据,包括一分钟时间窗口内的总日志数、最小大小、最大大小、平均大小以及大小大于 550 的数据包数。 + +首先,创建一个 source 表 `ngx_access_log` 和一个 sink 表 `ngx_statistics`,如下所示: + +```sql +CREATE TABLE `ngx_access_log` ( + `client` STRING NULL, + `ua_platform` STRING NULL, + `referer` STRING NULL, + `method` STRING NULL, + `endpoint` STRING NULL, + `trace_id` STRING NULL FULLTEXT, + `protocol` STRING NULL, + `status` SMALLINT UNSIGNED NULL, + `size` DOUBLE NULL, + `agent` STRING NULL, + `access_time` TIMESTAMP(3) NOT NULL, + TIME INDEX (`access_time`) +) +WITH( + append_mode = 'true' +); +``` + +```sql +CREATE TABLE `ngx_statistics` ( + `status` SMALLINT UNSIGNED NULL, + `total_logs` BIGINT NULL, + `min_size` DOUBLE NULL, + `max_size` DOUBLE NULL, + `avg_size` DOUBLE NULL, + `high_size_count` BIGINT NULL, + `time_window` TIMESTAMP time index, + `update_at` TIMESTAMP NULL, + PRIMARY KEY (`status`) +); +``` + +然后创建名为 `ngx_aggregation` 的 flow 任务,包括 `count`、`min`、`max`、`avg` `size` 列的聚合函数,以及大于 550 的所有数据包的大小总和。聚合是在 `access_time` 列的 1 分钟固定窗口中计算的,并且还按 `status` 列分组。因此,你可以实时了解有关数据包大小和对其的操作的信息,例如,如果 `high_size_count` 在某个时间点变得太高,你可以进一步检查是否有任何问题,或者如果 `max_size` 列在 1 分钟时间窗口内突然激增,你可以尝试定位该数据包并进一步检查。 + +```sql +CREATE FLOW ngx_aggregation +SINK TO ngx_statistics +AS +SELECT + status, + count(client) AS total_logs, + min(size) as min_size, + max(size) as max_size, + avg(size) as avg_size, + sum(case when `size` > 550 then 1 else 0 end) as high_size_count, + date_bin(INTERVAL '1 minutes', access_time) as time_window, +FROM ngx_access_log +GROUP BY + status, + time_window; +``` + +要检查持续聚合是否正常工作,首先插入一些数据到源表 `ngx_access_log` 中。 + +```sql +INSERT INTO ngx_access_log +VALUES + ("android", "Android", "referer", "GET", "/api/v1", "trace_id", "HTTP", 200, 1000, "agent", "2021-07-01 00:00:01.000"), + ("ios", "iOS", "referer", "GET", "/api/v1", "trace_id", "HTTP", 200, 500, "agent", "2021-07-01 00:00:30.500"), + ("android", "Android", "referer", "GET", "/api/v1", "trace_id", "HTTP", 200, 600, "agent", "2021-07-01 00:01:01.000"), + ("ios", "iOS", "referer", "GET", "/api/v1", "trace_id", "HTTP", 404, 700, "agent", "2021-07-01 00:01:01.500"); +``` + +则 `ngx_access_log` 表将被增量更新以包含以下数据: + +```sql +SELECT * FROM ngx_statistics; +``` + +```sql + status | total_logs | min_size | max_size | avg_size | high_size_count | time_window | update_at +--------+------------+----------+----------+----------+-----------------+----------------------------+---------------------------- + 200 | 2 | 500 | 1000 | 750 | 1 | 2021-07-01 00:00:00.000000 | 2024-07-24 08:36:17.439000 + 200 | 1 | 600 | 600 | 600 | 1 | 2021-07-01 00:01:00.000000 | 2024-07-24 08:36:17.439000 + 404 | 1 | 700 | 700 | 700 | 1 | 2021-07-01 00:01:00.000000 | 2024-07-24 08:36:17.439000 +(3 rows) +``` + +尝试向 `ngx_access_log` 表中插入更多数据: + +```sql +INSERT INTO ngx_access_log +VALUES + ("android", "Android", "referer", "GET", "/api/v1", "trace_id", "HTTP", 200, 500, "agent", "2021-07-01 00:01:01.000"), + ("ios", "iOS", "referer", "GET", "/api/v1", "trace_id", "HTTP", 404, 800, "agent", "2021-07-01 00:01:01.500"); +``` + +结果表 `ngx_statistics` 将被增量更新,注意 `max_size`、`avg_size` 和 `high_size_count` 是如何更新的: + +```sql +SELECT * FROM ngx_statistics; +``` + 
+```sql + status | total_logs | min_size | max_size | avg_size | high_size_count | time_window | update_at +--------+------------+----------+----------+----------+-----------------+----------------------------+---------------------------- + 200 | 2 | 500 | 1000 | 750 | 1 | 2021-07-01 00:00:00.000000 | 2024-07-24 08:36:17.439000 + 200 | 2 | 500 | 600 | 550 | 1 | 2021-07-01 00:01:00.000000 | 2024-07-24 08:36:46.495000 + 404 | 2 | 700 | 800 | 750 | 2 | 2021-07-01 00:01:00.000000 | 2024-07-24 08:36:46.495000 +(3 rows) +``` + +`ngx_statistics` 表中的列解释如下: + +- `status`: HTTP 响应的状态码。 +- `total_logs`: 相同状态码的日志总数。 +- `min_size`: 相同状态码的数据包的最小大小。 +- `max_size`: 相同状态码的数据包的最大大小。 +- `avg_size`: 相同状态码的数据包的平均大小。 +- `high_size_count`: 包大小大于 550 的数据包数。 +- `time_window`: 聚合的时间窗口。 +- `update_at`: 聚合结果更新的时间。 + +### 按时间窗口查询国家 另一个实时分析的示例是从 `ngx_access_log` 表中查询所有不同的国家。 你可以使用以下查询按时间窗口对国家进行分组: @@ -254,12 +373,8 @@ SELECT * FROM ngx_distribution; +------+-------------+------------+---------------------+----------------------------+ ``` -## 总结 +## 下一步 -持续聚合是实时分析、监控和仪表盘的强大工具。 -它允许你不断聚合来自事件流的数据,并根据聚合数据提供实时洞察和警报。 -通过将数据降采样到较低分辨率,你可以减少存储和处理的数据量, -从而更容易提供实时洞察和警报,同时保持较低的数据存储和处理成本。 -持续聚合是任何实时数据处理系统的关键组件,可以在各种用例中使用, -以提供基于流数据的实时洞察和警报。 +- [管理 Flow](manage-flow.md):深入了解 Flow 引擎的机制和定义 Flow 的 SQL 语法。 +- [表达式](expressions.md):了解 Flow 引擎支持的数据转换表达式。 diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/continuous-aggregation/expression.md b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/flow-computation/expressions.md similarity index 100% rename from i18n/zh/docusaurus-plugin-content-docs/current/user-guide/continuous-aggregation/expression.md rename to i18n/zh/docusaurus-plugin-content-docs/current/user-guide/flow-computation/expressions.md diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/continuous-aggregation/manage-flow.md b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/flow-computation/manage-flow.md similarity index 97% rename from i18n/zh/docusaurus-plugin-content-docs/current/user-guide/continuous-aggregation/manage-flow.md rename to i18n/zh/docusaurus-plugin-content-docs/current/user-guide/flow-computation/manage-flow.md index 40a134438..0db681d03 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/continuous-aggregation/manage-flow.md +++ b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/flow-computation/manage-flow.md @@ -2,7 +2,7 @@ description: 介绍如何在 GreptimeDB 中创建和删除 flow,包括创建 sink 表、flow 的 SQL 语法和示例。 --- -# 管理 Flows +# 管理 Flow 每一个 `flow` 是 GreptimeDB 中的一个持续聚合查询。 它根据传入的数据持续更新并聚合数据。 @@ -40,7 +40,7 @@ CREATE TABLE temp_sensor_data ( - **列的顺序和类型**:确保 sink 表中列的顺序和类型与 flow 查询结果匹配。 - **时间索引**:为 sink 表指定 `TIME INDEX`,通常使用时间窗口函数生成的时间列。 -- **将 `update_at` 指定为 schema 的最后一列**:flow 会自动将数据的更新时间写入 `update_at` 列。请确保此列是 sink 表模式中的最后一列。 +- **更新时间**:Flow 引擎会自动将更新时间附加到每个计算结果行的末尾。此更新时间存储在 `updated_at` 列中。请确保在 sink 表的 schema 中包含此列。 - **Tag**:使用 `PRIMARY KEY` 指定 Tag,与 time index 一起作为行数据的唯一标识,并优化查询性能。 例如: diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/flow-computation/overview.md b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/flow-computation/overview.md new file mode 100644 index 000000000..5b610e7d1 --- /dev/null +++ b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/flow-computation/overview.md @@ -0,0 +1,124 @@ +--- +description: 了解 GreptimeDB 的 Flow 引擎如何实现数据流的实时计算,如何用于 ETL 过程和即时计算。了解其程序模型、使用案例以及从 nginx 日志计算 user_agent 统计信息的快速入门示例。 +--- + +# 概述 + +GreptimeDB 的 Flow 引擎实现了数据流的实时计算。 +它特别适用于提取-转换-加载 (ETL) 
过程或执行即时的计算和查询,例如求和、平均值和其他聚合。 +Flow 引擎确保数据被增量和连续地处理, +根据到达的新的流数据更新最终结果。 +你可以将其视为一个聪明的物化视图, +它知道何时更新结果视图表以及如何以最小的努力更新它。 + +使用案例包括: + +- 下采样数据点,使用如平均池化等方法减少存储和分析的数据量 +- 提供近实时分析、可操作的信息 + +## 程序模型 + +在将数据插入 source 表后, +数据会同时被写入到 Flow 引擎中。 +在每个触发间隔(一秒)时, +Flow 引擎执行指定的计算并将结果更新到 sink 表中。 +source 表和 sink 表都是 GreptimeDB 中的时间序列表。 +在创建 Flow 之前, +定义这些表的 schema 并设计 Flow 以指定计算逻辑是至关重要的。 +此过程在下图中直观地表示: + +![连续聚合](/flow-ani.svg) + +## 快速入门示例 + +为了说明 GreptimeDB 的 Flow 引擎的功能, +考虑从 nginx 日志计算 user_agent 统计信息的任务。 +source 表是 `nginx_access_log`, +sink 表是 `user_agent_statistics`。 + +首先,创建 source 表 `nginx_access_log`。 +为了优化计算 `user_agent` 字段的性能, +使用 `PRIMARY KEY` 关键字将其指定为 `TAG` 列类型。 + +```sql +CREATE TABLE ngx_access_log ( + ip_address STRING, + http_method STRING, + request STRING, + status_code INT16, + body_bytes_sent INT32, + user_agent STRING, + response_size INT32, + ts TIMESTAMP TIME INDEX, + PRIMARY KEY (ip_address, http_method, user_agent, status_code) +) WITH ('append_mode'='true'); +``` + +接下来,创建 sink 表 `user_agent_statistics`。 +`update_at` 列跟踪数据的最后更新时间,由 Flow 引擎自动更新。 +尽管 GreptimeDB 中的所有表都是时间序列表,但此计算不需要时间窗口。 +因此增加了 `__ts_placeholder` 列作为时间索引占位列。 + +```sql +CREATE TABLE user_agent_statistics ( + user_agent STRING, + total_count INT64, + update_at TIMESTAMP, + __ts_placeholder TIMESTAMP TIME INDEX, + PRIMARY KEY (user_agent) +); +``` + +最后,创建 Flow `user_agent_flow` 以计算 `nginx_access_log` 表中每个 user_agent 的出现次数。 + +```sql +CREATE FLOW user_agent_flow +SINK TO user_agent_statistics +AS +SELECT + user_agent, + COUNT(user_agent) AS total_count +FROM + ngx_access_log +GROUP BY + user_agent; +``` + +一旦创建了 Flow, +Flow 引擎将持续处理 `nginx_access_log` 表中的数据,并使用计算结果更新 `user_agent_statistics` 表。 + +要观察 Flow 的结果, +将示例数据插入 `nginx_access_log` 表。 + +```sql +INSERT INTO ngx_access_log +VALUES + ('192.168.1.1', 'GET', '/index.html', 200, 512, 'Mozilla/5.0', 1024, '2023-10-01T10:00:00Z'), + ('192.168.1.2', 'POST', '/submit', 201, 256, 'curl/7.68.0', 512, '2023-10-01T10:01:00Z'), + ('192.168.1.1', 'GET', '/about.html', 200, 128, 'Mozilla/5.0', 256, '2023-10-01T10:02:00Z'), + ('192.168.1.3', 'GET', '/contact', 404, 64, 'curl/7.68.0', 128, '2023-10-01T10:03:00Z'); +``` + +插入数据后, +查询 `user_agent_statistics` 表以查看结果。 + +```sql +SELECT * FROM user_agent_statistics; +``` + +查询结果将显示 `user_agent_statistics` 表中每个 user_agent 的总数。 + +```sql ++-------------+-------------+----------------------------+---------------------+ +| user_agent | total_count | update_at | __ts_placeholder | ++-------------+-------------+----------------------------+---------------------+ +| Mozilla/5.0 | 2 | 2024-12-12 06:45:33.228000 | 1970-01-01 00:00:00 | +| curl/7.68.0 | 2 | 2024-12-12 06:45:33.228000 | 1970-01-01 00:00:00 | ++-------------+-------------+----------------------------+---------------------+ +``` + +## 下一步 + +- [持续聚合](./continuous-aggregation.md):探索时间序列数据处理中的主要场景,了解持续聚合的三种常见使用案例。 +- [管理 Flow](manage-flow.md):深入了解 Flow 引擎的机制和定义 Flow 的 SQL 语法。 +- [表达式](expressions.md):了解 Flow 引擎支持的数据转换表达式。 From b633d20f942401182f80560821ac1107b598cab3 Mon Sep 17 00:00:00 2001 From: Yiran Date: Thu, 12 Dec 2024 16:12:17 +0800 Subject: [PATCH 4/7] fix dead links --- .../current/getting-started/quick-start.md | 2 +- .../current/reference/sql/create.md | 4 ++-- .../current/user-guide/concepts/features-that-you-concern.md | 2 +- .../current/user-guide/concepts/key-concepts.md | 2 +- .../current/user-guide/flow-computation/manage-flow.md | 4 ++-- .../current/user-guide/overview.md | 4 ++-- 6 files changed, 9 insertions(+), 9 deletions(-) diff --git 
a/i18n/zh/docusaurus-plugin-content-docs/current/getting-started/quick-start.md b/i18n/zh/docusaurus-plugin-content-docs/current/getting-started/quick-start.md index e833a8e29..f1d2ce71b 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/current/getting-started/quick-start.md +++ b/i18n/zh/docusaurus-plugin-content-docs/current/getting-started/quick-start.md @@ -306,7 +306,7 @@ ORDER BY ### 持续聚合 -为了进一步分析或在频繁聚合数据时减少扫描成本,你可以将聚合结果保存到另一个表中。这可以通过使用 GreptimeDB 的[持续聚合](/user-guide/continuous-aggregation/overview.md)功能来实现。 +为了进一步分析或在频繁聚合数据时减少扫描成本,你可以将聚合结果保存到另一个表中。这可以通过使用 GreptimeDB 的[持续聚合](/user-guide/flow-computation/overview.md)功能来实现。 例如,按照 5 秒钟的时间窗口聚合 API 错误数量,并将数据保存到 `api_error_count` 表中。 diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/reference/sql/create.md b/i18n/zh/docusaurus-plugin-content-docs/current/reference/sql/create.md index 9558af4f0..7ae2f2fb3 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/current/reference/sql/create.md +++ b/i18n/zh/docusaurus-plugin-content-docs/current/reference/sql/create.md @@ -139,7 +139,7 @@ CREATE TABLE IF NOT EXISTS temperatures( - `months`, `month`, `M` – 月,定义为 30.44 天 - `years`, `year`, `y` – 年,定义为 365.25 天 - `forever`, `NULL`, `0s` (或任何长度为 0 的时间范围,如 `0d`)或空字符串 `''`,表示数据永远不会被删除。 -- `instant`, 注意数据库的 TTL 不能设置为 `instant`。`instant` 表示数据在插入时立即删除,如果你想将输入发送到流任务而不保存它,可以使用 `instant`,请参阅[流管理文档](/user-guide/continuous-aggregation/manage-flow.md#manage-flows)了解更多细节。 +- `instant`, 注意数据库的 TTL 不能设置为 `instant`。`instant` 表示数据在插入时立即删除,如果你想将输入发送到流任务而不保存它,可以使用 `instant`,请参阅[流管理文档](/user-guide/flow-computation/manage-flow.md#manage-flows)了解更多细节。 - 未设置,可以使用 `ALTER TABLE UNSET 'ttl'` 来取消表的 `ttl` 设置,这样表将继承数据库的 `ttl` 策略(如果有的话)。 如果一张表有自己的 TTL 策略,那么它将使用该 TTL 策略。否则,数据库的 TTL 策略将被应用到表上。 @@ -425,7 +425,7 @@ AS ; ``` -用于创建或更新 Flow 任务,请阅读[Flow 管理文档](/user-guide/continuous-aggregation/manage-flow.md#创建-flow)。 +用于创建或更新 Flow 任务,请阅读[Flow 管理文档](/user-guide/flow-computation/manage-flow.md#创建-flow)。 ## 创建 View diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/concepts/features-that-you-concern.md b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/concepts/features-that-you-concern.md index 90963bd52..985bf91a2 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/concepts/features-that-you-concern.md +++ b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/concepts/features-that-you-concern.md @@ -40,7 +40,7 @@ GreptimeDB 通过以下方式解决这个问题: ## GreptimeDB 支持持续聚合或降采样吗? -从 0.8 版本开始,GreptimeDB 添加了一个名为 `Flow` 的新功能,用于持续聚合和降采样等场景。请阅读[用户指南](/user-guide/continuous-aggregation/overview.md)获取更多信息。 +从 0.8 版本开始,GreptimeDB 添加了一个名为 `Flow` 的新功能,用于持续聚合和降采样等场景。请阅读[用户指南](/user-guide/flow-computation/overview.md)获取更多信息。 ## 我可以在云的对象存储中存储数据吗? 
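The `instant` TTL described in the `create.md` hunk above is easiest to see as a concrete pairing of a zero-retention source table with a flow. The following is an illustrative sketch only, not part of this patch: the table and flow names (`temp_sensor_data`, `temp_monitoring`, `temp_alerts`) are assumptions, and the sink table is presumed to already exist with a matching schema.

```sql
-- Source table whose rows are handed to the Flow engine but never persisted,
-- because its TTL is set to `instant`.
CREATE TABLE temp_sensor_data (
    sensor_id INT,
    loc STRING,
    temperature DOUBLE,
    ts TIMESTAMP TIME INDEX,
    PRIMARY KEY (sensor_id, loc)
) WITH ('ttl' = 'instant');

-- A flow reading from this table still receives every inserted row,
-- so aggregated results keep flowing into the sink table.
CREATE FLOW temp_monitoring
SINK TO temp_alerts
AS
SELECT sensor_id, loc, max(temperature) AS max_temp
FROM temp_sensor_data
GROUP BY sensor_id, loc;
```

Dropping the rows at insert time keeps storage costs down when only the aggregated output matters.
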
diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/concepts/key-concepts.md b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/concepts/key-concepts.md index 35096fc3c..23fdb4cf4 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/concepts/key-concepts.md +++ b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/concepts/key-concepts.md @@ -47,4 +47,4 @@ GreptimeDB 使用[倒排索引](/contributor-guide/datanode/data-persistence-ind ## Flow -GreptimeDB 中的 Flow 是指[持续聚合](/user-guide/continuous-aggregation/overview.md)过程,该过程根据传入数据持续更新和聚合数据。 +GreptimeDB 中的 Flow 是指[持续聚合](/user-guide/flow-computation/overview.md)过程,该过程根据传入数据持续更新和聚合数据。 diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/flow-computation/manage-flow.md b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/flow-computation/manage-flow.md index 0db681d03..508aa5867 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/flow-computation/manage-flow.md +++ b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/flow-computation/manage-flow.md @@ -146,7 +146,7 @@ SELECT AGGR_FUNCTION(column1, column2,..), TIME_WINDOW_FUNCTION() as time_window ``` 在 `SELECT` 关键字之后只允许以下类型的表达式: -- 聚合函数:有关详细信息,请参阅[表达式](./expression.md)文档。 +- 聚合函数:有关详细信息,请参阅[表达式](./expressions.md)文档。 - 时间窗口函数:有关详细信息,请参阅[定义时间窗口](#define-time-window)部分。 - 标量函数:例如 `col`、`to_lowercase(col)`、`col + 1` 等。这部分与 GreptimeDB 中的标准 `SELECT` 子句相同。 @@ -158,7 +158,7 @@ SELECT AGGR_FUNCTION(column1, column2,..), TIME_WINDOW_FUNCTION() as time_window `GROUP BY` 中的其他表达式可以是 literal、列名或 scalar 表达式。 - 不支持`ORDER BY`、`LIMIT` 和 `OFFSET`。 -有关如何在实时分析、监控和仪表板中使用持续聚合的更多示例,请参阅[用例示例](./usecase-example.md)。 +有关如何在实时分析、监控和仪表板中使用持续聚合的更多示例,请参阅[持续聚合](./continuous-aggregation.md)。 ### 定义时间窗口 diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/overview.md b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/overview.md index f56b41fa3..156fd6143 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/overview.md +++ b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/overview.md @@ -54,7 +54,7 @@ ALIGN '5s' BY (host) FILL PREV - **统一存储:** GreptimeDB 是支持同时存储和分析指标及[日志](/user-guide/logs/overview.md)的时序数据库。简化的架构和数据一致性增强了分析和解决问题的能力,并可节省成本且提高系统性能。 - **独特的数据模型:** 独特的[数据模型](/user-guide/concepts/data-model.md)搭配时间索引和全文索引,大大提升了查询性能,并在超大数据集上也经受住了考验。它不仅支持[数据指标的插入](/user-guide/ingest-data/overview.md)和[查询](/user-guide/query-data/overview.md),也提供了非常友好的方式便于日志的[写入](/user-guide/logs/write-logs.md)和[查询](/user-guide/logs/query-logs.md),以及[向量类型数据](/user-guide/vectors/vector-type.md)的处理。 -- **范围查询:** GreptimeDB 支持[范围查询](/user-guide/query-data/sql.md#aggregate-data-by-time-window)来计算一段时间内的[表达式](/reference/sql/functions/overview.md),从而了解指标趋势。你还可以[持续聚合](/user-guide/continuous-aggregation/overview.md)数据以进行进一步分析。 +- **范围查询:** GreptimeDB 支持[范围查询](/user-guide/query-data/sql.md#aggregate-data-by-time-window)来计算一段时间内的[表达式](/reference/sql/functions/overview.md),从而了解指标趋势。你还可以[持续聚合](/user-guide/flow-computation/overview.md)数据以进行进一步分析。 - **SQL 和多种协议:** GreptimeDB 使用 SQL 作为主要查询语言,并支持[多种协议](/user-guide/protocols/overview.md),大大降低了学习曲线和接入成本。你可以轻松从 Prometheus 或 [Influxdb 迁移](/user-guide/migrate-to-greptimedb/migrate-from-influxdb.md)至 GreptimeDB,或者从 0 接入 GreptimeDB。 - **JOIN 操作:** GreptimeDB 的时间序列表的数据模型,使其具备了支持[JOIN](/reference/sql/join.md)数据指标和日志的能力。 @@ -69,5 +69,5 @@ ALIGN '5s' BY (host) FILL PREV * [数据管理](./manage-data/overview.md) * [集成](./integrations/overview.md) * [协议](./protocols/overview.md) -* 
[持续聚合](./continuous-aggregation/overview.md)
+* [持续聚合](./flow-computation/overview.md)
 * [运维操作](./administration/overview.md)

From 9b3e3e64c54963c9650572c761ff9af091f45ae8 Mon Sep 17 00:00:00 2001
From: Yiran
Date: Thu, 12 Dec 2024 16:34:11 +0800
Subject: [PATCH 5/7] the select syntax

---
 .../current/user-guide/flow-computation/manage-flow.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/flow-computation/manage-flow.md b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/flow-computation/manage-flow.md
index 508aa5867..175d27cb3 100644
--- a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/flow-computation/manage-flow.md
+++ b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/flow-computation/manage-flow.md
@@ -142,7 +142,7 @@ GROUP BY time_window;
 flow 的 `SQL` 部分类似于标准的 `SELECT` 子句,但有一些不同之处。查询的语法如下:
 
 ```sql
-SELECT AGGR_FUNCTION(column1, column2,..), TIME_WINDOW_FUNCTION() as time_window FROM <source_table> GROUP BY time_window;
+SELECT AGGR_FUNCTION(column1, column2,..) [, TIME_WINDOW_FUNCTION() as time_window] FROM <source_table> GROUP BY {time_window | column1, column2,.. };
 ```
 
 在 `SELECT` 关键字之后只允许以下类型的表达式:

From 87287951ab130ec10e46a840b4a63a737565a4a4 Mon Sep 17 00:00:00 2001
From: Yiran
Date: Tue, 17 Dec 2024 11:38:28 +0800
Subject: [PATCH 6/7] Apply suggestions from code review

---
 .../current/user-guide/flow-computation/overview.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/flow-computation/overview.md b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/flow-computation/overview.md
index 5b610e7d1..2de2b7890 100644
--- a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/flow-computation/overview.md
+++ b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/flow-computation/overview.md
@@ -13,7 +13,7 @@ description: 了解 GreptimeDB 的 Flow 引擎如何实现数据流的实时计
 
 使用案例包括:
 
-- 下采样数据点,使用如平均池化等方法减少存储和分析的数据量
+- 降采样数据点,使用如平均池化等方法减少存储和分析的数据量
 - 提供近实时分析、可操作的信息
 
 ## 程序模型

From 23d19b988d1169372d16ed429dd5eb7ae661f263 Mon Sep 17 00:00:00 2001
From: Yiran
Date: Tue, 17 Dec 2024 17:16:09 +0800
Subject: [PATCH 7/7] apply review suggestions

---
 docs/user-guide/flow-computation/continuous-aggregation.md | 2 +-
 docs/user-guide/flow-computation/manage-flow.md             | 2 +-
 docs/user-guide/flow-computation/overview.md                | 2 +-
 .../user-guide/flow-computation/continuous-aggregation.md   | 4 ++++
 .../current/user-guide/flow-computation/overview.md         | 2 +-
 5 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/docs/user-guide/flow-computation/continuous-aggregation.md b/docs/user-guide/flow-computation/continuous-aggregation.md
index 38df73c4d..88327ceac 100644
--- a/docs/user-guide/flow-computation/continuous-aggregation.md
+++ b/docs/user-guide/flow-computation/continuous-aggregation.md
@@ -1,5 +1,5 @@
 ---
-description: Explore continuous aggregation in GreptimeDB for real-time insights. This guide explains how to perform continuous aggregations using the Flow engine, including calculating sums, averages, and other metrics within specified time windows. Learn through examples of real-time analytics, monitoring, and dashboard use cases. Understand how to create source and sink tables, define flows, and write SQL queries for continuous aggregation. Discover how to calculate log statistics, retrieve distinct countries by time window, and monitor sensor data in real-time. 
This comprehensive guide is essential for developers looking to leverage GreptimeDB's continuous aggregation capabilities for efficient data processing and real-time analytics. +description: Learn how to use GreptimeDB's continuous aggregation for real-time analytics. Master Flow engine basics, time-window calculations, and SQL queries through practical examples of log analysis and sensor monitoring. --- # Continuous Aggregation diff --git a/docs/user-guide/flow-computation/manage-flow.md b/docs/user-guide/flow-computation/manage-flow.md index d92c21de0..d169bc796 100644 --- a/docs/user-guide/flow-computation/manage-flow.md +++ b/docs/user-guide/flow-computation/manage-flow.md @@ -1,5 +1,5 @@ --- -description: Learn how to manage flows in GreptimeDB, including creating, updating, and deleting flows. This guide covers the syntax for creating flows, the importance of sink tables, and how to use the EXPIRE AFTER clause. It provides examples of SQL queries for managing flows, creating source and sink tables, and defining continuous aggregation queries. Understand the significance of column order, time index, and tags in sink tables. Discover how to manually trigger flow processing with the ADMIN FLUSH_FLOW command and how to delete flows using the DROP FLOW clause. This comprehensive guide is essential for users looking to leverage GreptimeDB's flow computation capabilities for real-time analytics, monitoring, and dashboards. +description: Describes how to manage flows in GreptimeDB, including creating, updating, and deleting flows. It explains the syntax for creating flows, the importance of sink tables, and how to use the EXPIRE AFTER clause. Examples of SQL queries for managing flows are provided. --- # Manage Flows diff --git a/docs/user-guide/flow-computation/overview.md b/docs/user-guide/flow-computation/overview.md index 153af9ccc..7d89c6333 100644 --- a/docs/user-guide/flow-computation/overview.md +++ b/docs/user-guide/flow-computation/overview.md @@ -5,7 +5,7 @@ description: Discover how GreptimeDB's Flow engine enables real-time computation # Overview GreptimeDB's Flow engine enables real-time computation of data streams. -It is particularly beneficial for Extract-Transform-Load (ETL) processes or for performing on-the-fly calculations and queries such as sum, average, and other aggregations. +It is particularly beneficial for Extract-Transform-Load (ETL) processes or for performing on-the-fly filtering, calculations and queries such as sum, average, and other aggregations. The Flow engine ensures that data is processed incrementally and continuously, updating the final results as new streaming data arrives. You can think of it as a clever materialized views that know when to update result view table and how to update it with minimal effort. 
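The description updates above add "filtering" to what the Flow engine can do on the fly. A minimal sketch of a flow that filters before aggregating is shown below; it is not part of this patch, the flow name and the sink table `ngx_error_stats` (and its columns) are assumptions, and the sink table would need to be created first with a matching schema plus the automatically maintained `update_at` column.

```sql
-- Hypothetical flow: keep only error responses from the quick-start table
-- `ngx_access_log`, then count them per status code in 5-minute windows.
CREATE FLOW ngx_error_stats_flow
SINK TO ngx_error_stats
AS
SELECT
    status_code,
    count(status_code) AS error_count,
    date_bin(INTERVAL '5 minutes', ts) AS time_window
FROM ngx_access_log
WHERE status_code >= 400
GROUP BY status_code, time_window;
```

The `WHERE` clause is evaluated against each incoming row before aggregation, so rows that never match do not contribute to the results materialized in the sink table.
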
diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/flow-computation/continuous-aggregation.md b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/flow-computation/continuous-aggregation.md index a800b4aa0..3c577bda8 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/flow-computation/continuous-aggregation.md +++ b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/flow-computation/continuous-aggregation.md @@ -1,3 +1,7 @@ +--- +description: 持续聚合是处理时间序列数据以提供实时洞察的关键方面。本文介绍了持续聚合的三个主要用例:实时分析、实时监控和实时仪表盘,并提供了详细的 SQL 示例。 +--- + # 持续聚合 持续聚合是处理时间序列数据以提供实时洞察的关键方面。 diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/flow-computation/overview.md b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/flow-computation/overview.md index 2de2b7890..287321da9 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/flow-computation/overview.md +++ b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/flow-computation/overview.md @@ -5,7 +5,7 @@ description: 了解 GreptimeDB 的 Flow 引擎如何实现数据流的实时计 # 概述 GreptimeDB 的 Flow 引擎实现了数据流的实时计算。 -它特别适用于提取-转换-加载 (ETL) 过程或执行即时的计算和查询,例如求和、平均值和其他聚合。 +它特别适用于提取-转换-加载 (ETL) 过程或执行即时的过滤、计算和查询,例如求和、平均值和其他聚合。 Flow 引擎确保数据被增量和连续地处理, 根据到达的新的流数据更新最终结果。 你可以将其视为一个聪明的物化视图,