chore: update logs data format #1066

Merged · 5 commits · Jul 18, 2024

Changes from 1 commit:
26 changes: 13 additions & 13 deletions docs/nightly/en/user-guide/logs/pipeline-config.md
@@ -1,15 +1,15 @@
# Pipeline Configuration

-Pipeline is a mechanism in GreptimeDB for transforming log data. It consists of a unique name and a set of configuration rules that define how log data is formatted, split, and transformed. Currently, we support JSON (`application/json`) and plain text (`text/plain`) formats as input for log data.
+Pipeline is a mechanism in GreptimeDB for parsing and transforming log data. It consists of a unique name and a set of configuration rules that define how log data is formatted, split, and transformed. Currently, we support JSON (`application/json`) and plain text (`text/plain`) formats as input for log data.

These configurations are provided in YAML format, allowing the Pipeline to process data during the log writing process according to the defined rules and store the processed data in the database for subsequent structured queries.

-## The overall structure
+## Overall structure

Pipeline consists of two parts: Processors and Transform, both of which are in array format. A Pipeline configuration can contain multiple Processors and multiple Transforms. The data type described by Transform determines the table structure when storing log data in the database.

- Processors are used for preprocessing log data, such as parsing time fields and replacing fields.
-- Transform is used for converting log data formats, such as converting string types to numeric types.
+- Transform is used for converting data formats, such as converting string types to numeric types.

Here is an example of a simple configuration that includes Processors and Transform:
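The example itself is collapsed in this diff view. Purely as an illustration, with field names, pattern, and types that are assumptions rather than the documented example, such a configuration could look like:

```yaml
processors:
  # split the raw line into named fields (pattern is illustrative)
  - dissect:
      fields:
        - message
      patterns:
        - '%{ip_address} [%{timestamp}] %{status_code}'
  # parse the extracted time string
  - date:
      fields:
        - timestamp
      formats:
        - '%d/%b/%Y:%H:%M:%S %z'

transform:
  # Transform syntax here is assumed; it converts the extracted
  # status_code string into a numeric column when stored
  - fields:
      - status_code
    type: int32
```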

@@ -40,15 +40,15 @@ The Processor is used for preprocessing log data, and its configuration is locat

We currently provide the following built-in Processors:

-- `date`: Used to parse formatted time string fields, such as `2024-07-12T16:18:53.048`.
-- `epoch`: Used to parse numeric timestamp fields, such as `1720772378893`.
-- `dissect`: Used to split log data fields.
-- `gsub`: Used to replace log data fields.
-- `join`: Used to merge array-type fields in logs.
-- `letter`: Used to convert log data fields to letters.
-- `regex`: Used to perform regular expression matching on log data fields.
-- `urlencoding`: Used to perform URL encoding/decoding on log data fields.
-- `csv`: Used to parse CSV data fields in logs.
+- `date`: parses formatted time string fields, such as `2024-07-12T16:18:53.048`.
+- `epoch`: parses numeric timestamp fields, such as `1720772378893`.
+- `dissect`: splits log data fields.
+- `gsub`: replaces log data fields.
+- `join`: merges array-type fields in logs.
+- `letter`: converts log data fields to letters.
+- `regex`: performs regular expression matching on log data fields.
+- `urlencoding`: performs URL encoding/decoding on log data fields.
+- `csv`: parses CSV data fields in logs.

### `date`

@@ -68,7 +68,7 @@ processors:
In the above example, the configuration of the `date` processor includes the following fields:

- `fields`: A list of time field names to be parsed.
-- `formats`: Time format strings, supporting multiple format strings. Parsing is attempted in the order provided until successful.
+- `formats`: Time format strings, supporting multiple format strings. Parsing is attempted in the order provided until successful. See the [chrono documentation](https://docs.rs/chrono/latest/chrono/format/strftime/index.html) for the formatting syntax.
- `ignore_missing`: Ignores the case when the field is missing. Defaults to `false`. If the field is missing and this configuration is set to `false`, an exception will be thrown.
- `timezone`: Time zone. Use the time zone identifiers from the [tz_database](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones) to specify the time zone. Defaults to `UTC`.
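Since the example referenced above is collapsed in this diff, here is an illustrative sketch of a `date` entry using all four fields (the `logtime` field name is an assumption):

```yaml
processors:
  - date:
      fields:
        - logtime
      # chrono strftime format strings, tried in order until one parses
      formats:
        - '%Y-%m-%dT%H:%M:%S%.3f'
      ignore_missing: true
      timezone: 'Asia/Shanghai'
```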

6 changes: 3 additions & 3 deletions docs/nightly/en/user-guide/logs/quick-start.md
@@ -101,7 +101,7 @@ curl -X "POST" "http://localhost:4000/v1/events/pipelines/nginx_pipeline" -F "fi

After successfully executing this command, a pipeline named `nginx_pipeline` will be created, and the result will be returned as:

-```shell
+```json
{"name":"nginx_pipeline","version":"2024-06-27 12:02:34.257312110Z"}.
```

@@ -126,7 +126,7 @@ curl -X "POST" "http://localhost:4000/v1/events/logs?db=public&table=pipeline_lo

You will see the following output if the command is successful:

-```shell
+```json
{"output":[{"affectedrows":4}],"execution_time_ms":79}
```

@@ -182,7 +182,7 @@ Of course, if you need keyword searching within large text blocks, you must use

## Query logs

-The `pipeline_logs` as the example to query logs.
+We use the `pipeline_logs` table as an example to query logs.

### Query logs by tags
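The example queries are collapsed in this diff; a tag-based lookup could look like the following sketch, where the `status_code` column is a hypothetical tag produced by the pipeline:

```sql
SELECT * FROM pipeline_logs WHERE status_code = 200 LIMIT 10;
```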

79 changes: 76 additions & 3 deletions docs/nightly/en/user-guide/logs/write-logs.md
@@ -14,7 +14,7 @@ curl -X "POST" "http://localhost:4000/v1/events/logs?db=<db-name>&table=<table-n
-d "$<log-items>"
```

-## Query parameters
+## Request parameters

This interface accepts the following parameters:

@@ -23,9 +23,82 @@ This interface accepts the following parameters:
- `pipeline_name`: The name of the [pipeline](./pipeline-config.md).
- `version`: The version of the pipeline. Optional; defaults to the latest version.
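For instance, a request that writes JSON logs to the `pipeline_logs` table in the `public` database, processed by the latest version of `nginx_pipeline` (table and pipeline names reused from the quick-start), could look like:

```shell
curl -X "POST" \
  "http://localhost:4000/v1/events/logs?db=public&table=pipeline_logs&pipeline_name=nginx_pipeline" \
  -H "Content-Type: application/json" \
  -d "$<log-items>"
```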

-## Body data format
+## `Content-Type` and body format

-The request body supports NDJSON and JSON Array formats, where each JSON object represents a log entry.
+GreptimeDB uses the `Content-Type` header to decide how to decode the payload body. Currently the following two formats are supported:
+- `application/json`: this includes both regular JSON and NDJSON.
+- `text/plain`: multiple log lines separated by line breaks.

### `application/json` format

Here is an example of a JSON-format body payload:

```JSON
[
{"message":"127.0.0.1 - - [25/May/2024:20:16:37 +0000] \\"GET /index.html HTTP/1.1\\" 200 612 \\"-\\" \\"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36\\""},
{"message":"192.168.1.1 - - [25/May/2024:20:17:37 +0000] \\"POST /api/login HTTP/1.1\\" 200 1784 \\"-\\" \\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36\\""},
{"message":"10.0.0.1 - - [25/May/2024:20:18:37 +0000] \\"GET /images/logo.png HTTP/1.1\\" 304 0 \\"-\\" \\"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0\\""},
{"message":"172.16.0.1 - - [25/May/2024:20:19:37 +0000] \\"GET /contact HTTP/1.1\\" 404 162 \\"-\\" \\"Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1\\""}
]
```

Note that the whole JSON is an array of four objects (log lines). Each JSON object represents one line to be processed by the Pipeline engine.

The name of the key in the JSON objects, which is `message` here, is used as the field name in Pipeline processors. For example:

```yaml
processors:
  - dissect:
      fields:
        - message
      patterns:
        - '%{ip_address} - - [%{timestamp}] "%{http_method} %{request_line}" %{status_code} %{response_size} "-" "%{user_agent}"'
      ignore_missing: true

# rest of the file is ignored
```

We can also rewrite the payload into NDJSON format, as follows:

```JSON
{"message":"127.0.0.1 - - [25/May/2024:20:16:37 +0000] \\"GET /index.html HTTP/1.1\\" 200 612 \\"-\\" \\"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36\\""}
{"message":"192.168.1.1 - - [25/May/2024:20:17:37 +0000] \\"POST /api/login HTTP/1.1\\" 200 1784 \\"-\\" \\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36\\""}
{"message":"10.0.0.1 - - [25/May/2024:20:18:37 +0000] \\"GET /images/logo.png HTTP/1.1\\" 304 0 \\"-\\" \\"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0\\""}
{"message":"172.16.0.1 - - [25/May/2024:20:19:37 +0000] \\"GET /contact HTTP/1.1\\" 404 162 \\"-\\" \\"Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1\\""}
```

Note that the outer array is eliminated and the objects are separated by line breaks instead of commas.

### `text/plain` format

Logs in plain text format are widely used throughout the ecosystem. GreptimeDB also supports `text/plain` as a log data input format, enabling logs to be ingested directly from log producers.

The equivalent body payload of the previous example is as follows:

```plain
127.0.0.1 - - [25/May/2024:20:16:37 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
192.168.1.1 - - [25/May/2024:20:17:37 +0000] "POST /api/login HTTP/1.1" 200 1784 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"
10.0.0.1 - - [25/May/2024:20:18:37 +0000] "GET /images/logo.png HTTP/1.1" 304 0 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
172.16.0.1 - - [25/May/2024:20:19:37 +0000] "GET /contact HTTP/1.1" 404 162 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1"
```

Sending a log ingestion request to GreptimeDB then requires only setting the `Content-Type` header to `text/plain`, and you are good to go (see the sketch below).
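A minimal sketch of such a request, with the database, table, and pipeline names carried over from the quick-start as assumptions, and the log line shortened for brevity:

```shell
curl -X "POST" \
  "http://localhost:4000/v1/events/logs?db=public&table=pipeline_logs&pipeline_name=nginx_pipeline" \
  -H "Content-Type: text/plain" \
  -d '127.0.0.1 - - [25/May/2024:20:16:37 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "Mozilla/5.0"'
```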

Unlike the JSON format, where the input data already has key names that serve as field names for Pipeline processors, the `text/plain` format gives the whole line as input to the Pipeline engine. In this case, we use `line` as the field name to refer to the input line, for example:

```yaml
processors:
  - dissect:
      fields:
        - line
      patterns:
        - '%{ip_address} - - [%{timestamp}] "%{http_method} %{request_line}" %{status_code} %{response_size} "-" "%{user_agent}"'
      ignore_missing: true

# rest of the file is ignored
```

It is recommended to use the `dissect` or `regex` processor to split the input line into fields first, and then process those fields accordingly, as in the sketch below.
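For example, a pipeline could dissect the line and then have Transform convert the extracted `status_code` string into a number before storage. In this sketch the Transform syntax (`fields`/`type`) is an assumption; consult pipeline-config.md for the exact format:

```yaml
processors:
  - dissect:
      fields:
        - line
      patterns:
        - '%{ip_address} - - [%{timestamp}] "%{http_method} %{request_line}" %{status_code} %{response_size} "-" "%{user_agent}"'
      ignore_missing: true

# Transform syntax below is assumed; see pipeline-config.md
transform:
  - fields:
      - status_code
    type: int32
```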

## Example
