chore: update logs data format #1066

Merged · 5 commits · Jul 18, 2024

Changes from 1 commit:
26 changes: 13 additions & 13 deletions docs/nightly/en/user-guide/logs/pipeline-config.md
@@ -1,15 +1,15 @@
# Pipeline Configuration

-Pipeline is a mechanism in GreptimeDB for transforming log data. It consists of a unique name and a set of configuration rules that define how log data is formatted, split, and transformed. Currently, we support JSON (`application/json`) and plain text (`text/plain`) formats as input for log data.
+Pipeline is a mechanism in GreptimeDB for parsing and transforming log data. It consists of a unique name and a set of configuration rules that define how log data is formatted, split, and transformed. Currently, we support JSON (`application/json`) and plain text (`text/plain`) formats as input for log data.

These configurations are provided in YAML format, allowing the Pipeline to process data during the log writing process according to the defined rules and store the processed data in the database for subsequent structured queries.

-## The overall structure
+## Overall structure

Pipeline consists of two parts: Processors and Transform, both of which are in array format. A Pipeline configuration can contain multiple Processors and multiple Transforms. The data type described by Transform determines the table structure when storing log data in the database.

- Processors are used for preprocessing log data, such as parsing time fields and replacing fields.
-- Transform is used for converting log data formats, such as converting string types to numeric types.
+- Transform is used for converting data formats, such as converting string types to numeric types.

Here is an example of a simple configuration that includes Processors and Transform:
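The example itself is collapsed in this diff view. Purely as an illustration, with field names, pattern, and types that are assumptions rather than the documented example, such a configuration could look like:

```yaml
processors:
  # split the raw line into named fields (pattern is illustrative)
  - dissect:
      fields:
        - message
      patterns:
        - '%{ip_address} [%{timestamp}] %{status_code}'
  # parse the extracted time string
  - date:
      fields:
        - timestamp
      formats:
        - '%d/%b/%Y:%H:%M:%S %z'

transform:
  # Transform syntax here is assumed; it converts the extracted
  # status_code string into a numeric column when stored
  - fields:
      - status_code
    type: int32
```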

@@ -40,15 +40,15 @@ The Processor is used for preprocessing log data, and its configuration is locat

We currently provide the following built-in Processors:

-- `date`: Used to parse formatted time string fields, such as `2024-07-12T16:18:53.048`.
-- `epoch`: Used to parse numeric timestamp fields, such as `1720772378893`.
-- `dissect`: Used to split log data fields.
-- `gsub`: Used to replace log data fields.
-- `join`: Used to merge array-type fields in logs.
-- `letter`: Used to convert log data fields to letters.
-- `regex`: Used to perform regular expression matching on log data fields.
-- `urlencoding`: Used to perform URL encoding/decoding on log data fields.
-- `csv`: Used to parse CSV data fields in logs.
+- `date`: parses formatted time string fields, such as `2024-07-12T16:18:53.048`.
+- `epoch`: parses numeric timestamp fields, such as `1720772378893`.
+- `dissect`: splits log data fields.
+- `gsub`: replaces log data fields.
+- `join`: merges array-type fields in logs.
+- `letter`: converts log data fields to letters.
+- `regex`: performs regular expression matching on log data fields.
+- `urlencoding`: performs URL encoding/decoding on log data fields.
+- `csv`: parses CSV data fields in logs.

### `date`

@@ -68,7 +68,7 @@ processors:
In the above example, the configuration of the `date` processor includes the following fields:

- `fields`: A list of time field names to be parsed.
-- `formats`: Time format strings, supporting multiple format strings. Parsing is attempted in the order provided until successful.
+- `formats`: Time format strings, supporting multiple format strings. Parsing is attempted in the order provided until successful. See the [chrono documentation](https://docs.rs/chrono/latest/chrono/format/strftime/index.html) for the formatting syntax.
- `ignore_missing`: Ignores the case when the field is missing. Defaults to `false`. If the field is missing and this configuration is set to `false`, an exception will be thrown.
- `timezone`: Time zone. Use the time zone identifiers from the [tz_database](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones) to specify the time zone. Defaults to `UTC`.
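Since the example referenced above is collapsed in this diff, here is an illustrative sketch of a `date` entry using all four fields (the `logtime` field name is an assumption):

```yaml
processors:
  - date:
      fields:
        - logtime
      # chrono strftime format strings, tried in order until one parses
      formats:
        - '%Y-%m-%dT%H:%M:%S%.3f'
      ignore_missing: true
      timezone: 'Asia/Shanghai'
```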

6 changes: 3 additions & 3 deletions docs/nightly/en/user-guide/logs/quick-start.md
@@ -101,7 +101,7 @@ curl -X "POST" "http://localhost:4000/v1/events/pipelines/nginx_pipeline" -F "fi

After successfully executing this command, a pipeline named `nginx_pipeline` will be created, and the result will be returned as:

-```shell
+```json
{"name":"nginx_pipeline","version":"2024-06-27 12:02:34.257312110Z"}.
```

@@ -126,7 +126,7 @@ curl -X "POST" "http://localhost:4000/v1/events/logs?db=public&table=pipeline_lo

You will see the following output if the command is successful:

-```shell
+```json
{"output":[{"affectedrows":4}],"execution_time_ms":79}
```

@@ -182,7 +182,7 @@ Of course, if you need keyword searching within large text blocks, you must use

## Query logs

-The `pipeline_logs` as the example to query logs.
+We use the `pipeline_logs` table as an example to query logs.

### Query logs by tags
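The example queries are collapsed in this diff; a tag-based lookup could look like the following sketch, where the `status_code` column is a hypothetical tag produced by the pipeline:

```sql
SELECT * FROM pipeline_logs WHERE status_code = 200 LIMIT 10;
```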

79 changes: 76 additions & 3 deletions docs/nightly/en/user-guide/logs/write-logs.md
@@ -14,7 +14,7 @@ curl -X "POST" "http://localhost:4000/v1/events/logs?db=<db-name>&table=<table-n
-d "$<log-items>"
```

-## Query parameters
+## Request parameters

This interface accepts the following parameters:

@@ -23,9 +23,82 @@ This interface accepts the following parameters:
- `pipeline_name`: The name of the [pipeline](./pipeline-config.md).
- `version`: The version of the pipeline. Optional; defaults to the latest version.
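For instance, a request that writes JSON logs to the `pipeline_logs` table in the `public` database, processed by the latest version of `nginx_pipeline` (table and pipeline names reused from the quick-start), could look like:

```shell
curl -X "POST" \
  "http://localhost:4000/v1/events/logs?db=public&table=pipeline_logs&pipeline_name=nginx_pipeline" \
  -H "Content-Type: application/json" \
  -d "$<log-items>"
```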

-## Body data format
+## `Content-Type` and body format

-The request body supports NDJSON and JSON Array formats, where each JSON object represents a log entry.
+GreptimeDB uses the `Content-Type` header to decide how to decode the payload body. Currently the following two formats are supported:
+- `application/json`: this includes both regular JSON and NDJSON.
+- `text/plain`: multiple log lines separated by line breaks.

### `application/json` format

Here is an example of a JSON-format body payload:

```JSON
[
{"message":"127.0.0.1 - - [25/May/2024:20:16:37 +0000] \\"GET /index.html HTTP/1.1\\" 200 612 \\"-\\" \\"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36\\""},
{"message":"192.168.1.1 - - [25/May/2024:20:17:37 +0000] \\"POST /api/login HTTP/1.1\\" 200 1784 \\"-\\" \\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36\\""},
{"message":"10.0.0.1 - - [25/May/2024:20:18:37 +0000] \\"GET /images/logo.png HTTP/1.1\\" 304 0 \\"-\\" \\"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0\\""},
{"message":"172.16.0.1 - - [25/May/2024:20:19:37 +0000] \\"GET /contact HTTP/1.1\\" 404 162 \\"-\\" \\"Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1\\""}
]
```

Note that the whole JSON is an array of four objects (log lines). Each JSON object represents one line to be processed by the Pipeline engine.

The name of the key in the JSON objects, which is `message` here, is used as the field name in Pipeline processors. For example:

```yaml
processors:
  - dissect:
      fields:
        - message
      patterns:
        - '%{ip_address} - - [%{timestamp}] "%{http_method} %{request_line}" %{status_code} %{response_size} "-" "%{user_agent}"'
      ignore_missing: true

# rest of the file is ignored
```

We can also rewrite the payload into NDJSON format, as follows:

```JSON
{"message":"127.0.0.1 - - [25/May/2024:20:16:37 +0000] \\"GET /index.html HTTP/1.1\\" 200 612 \\"-\\" \\"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36\\""}
{"message":"192.168.1.1 - - [25/May/2024:20:17:37 +0000] \\"POST /api/login HTTP/1.1\\" 200 1784 \\"-\\" \\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36\\""}
{"message":"10.0.0.1 - - [25/May/2024:20:18:37 +0000] \\"GET /images/logo.png HTTP/1.1\\" 304 0 \\"-\\" \\"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0\\""}
{"message":"172.16.0.1 - - [25/May/2024:20:19:37 +0000] \\"GET /contact HTTP/1.1\\" 404 162 \\"-\\" \\"Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1\\""}
```

Note that the outer array is eliminated and the objects are separated by line breaks instead of commas.

### `text/plain` format

Logs in plain text format are widely used throughout the ecosystem. GreptimeDB also supports `text/plain` as a log data input format, enabling logs to be ingested directly from log producers.

The equivalent body payload of the previous example is as follows:

```plain
127.0.0.1 - - [25/May/2024:20:16:37 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
192.168.1.1 - - [25/May/2024:20:17:37 +0000] "POST /api/login HTTP/1.1" 200 1784 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"
10.0.0.1 - - [25/May/2024:20:18:37 +0000] "GET /images/logo.png HTTP/1.1" 304 0 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
172.16.0.1 - - [25/May/2024:20:19:37 +0000] "GET /contact HTTP/1.1" 404 162 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1"
```

Sending a log ingestion request to GreptimeDB then requires only setting the `Content-Type` header to `text/plain`, and you are good to go (see the sketch below).
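A minimal sketch of such a request, with the database, table, and pipeline names carried over from the quick-start as assumptions, and the log line shortened for brevity:

```shell
curl -X "POST" \
  "http://localhost:4000/v1/events/logs?db=public&table=pipeline_logs&pipeline_name=nginx_pipeline" \
  -H "Content-Type: text/plain" \
  -d '127.0.0.1 - - [25/May/2024:20:16:37 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "Mozilla/5.0"'
```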

Unlike the JSON format, where the input data already has key names that serve as field names for Pipeline processors, the `text/plain` format gives the whole line as input to the Pipeline engine. In this case, we use `line` as the field name to refer to the input line, for example:

```yaml
processors:
  - dissect:
      fields:
        - line
      patterns:
        - '%{ip_address} - - [%{timestamp}] "%{http_method} %{request_line}" %{status_code} %{response_size} "-" "%{user_agent}"'
      ignore_missing: true

# rest of the file is ignored
```

It is recommended to use the `dissect` or `regex` processor to split the input line into fields first, and then process those fields accordingly, as in the sketch below.
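For example, a pipeline could dissect the line and then have Transform convert the extracted `status_code` string into a number before storage. In this sketch the Transform syntax (`fields`/`type`) is an assumption; consult pipeline-config.md for the exact format:

```yaml
processors:
  - dissect:
      fields:
        - line
      patterns:
        - '%{ip_address} - - [%{timestamp}] "%{http_method} %{request_line}" %{status_code} %{response_size} "-" "%{user_agent}"'
      ignore_missing: true

# Transform syntax below is assumed; see pipeline-config.md
transform:
  - fields:
      - status_code
    type: int32
```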

## Example
