### Function

  **grok** — parse a string using a Grok pattern

### Synopsis

```
grok(p: string, s: string) -> record
grok(p: string, s: string, definitions: string) -> record
```

### Description

The _grok_ function parses a string `s` using Grok pattern `p` and returns
a record containing the parsed fields. The syntax for pattern `p`
is `%{pattern:field_name}` where _pattern_ is the name of the pattern
to match in `s` and _field_name_ is the resultant field name of the captured
value.
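
For instance, a minimal sketch (not one of this page's tested examples; the
input string and field names are made up for illustration) might pair the
built-in `WORD` and `GREEDYDATA` patterns with the field names `user` and
`action`:

```
# Illustrative only: a made-up input string parsed with two built-in patterns.
echo '"alice logged in"' |
  zq -z 'yield grok("%{WORD:user} %{GREEDYDATA:action}", this)' -
```

which we would expect to produce `{user:"alice",action:"logged in"}`, with both
values stored as strings.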

When provided with three arguments, `definitions` is a string
of named patterns in the format `PATTERN_NAME PATTERN`, each separated by
newlines (`\n`). The named patterns can then be referenced in argument `p`.
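
As another minimal sketch (again not one of this page's tested examples; the
pattern name `USER_NAME` is invented for illustration), a single custom pattern
can be defined and then referenced:

```
# Illustrative only: USER_NAME is a custom pattern supplied via the third argument.
echo '"user=alice"' |
  zq -z 'yield grok("user=%{USER_NAME:name}", this, "USER_NAME [a-z]+")' -
```

which we would expect to produce `{name:"alice"}`.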

### Included Patterns

The `grok` function by default includes a set of built-in named patterns
that can be referenced in any pattern. The included named patterns can be seen
[here](https://raw.githubusercontent.com/brimdata/zed/main/pkg/grok/base.go).

### Comparison to Other Implementations

Although Grok functionality appears in many open source tools, it lacks a
formal specification. As a result, example parsing configurations found via
web searches may not all plug seamlessly into Zed's `grok` function without
modification.

[Logstash](https://www.elastic.co/logstash) was the first tool to widely
promote the approach via its
[Grok filter plugin](https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html),
so it serves as the de facto reference implementation. Many articles have
been published by Elastic and others that provide helpful guidance on becoming
proficient in Grok. To help you adapt what you learn from these resources to
the use of Zed's `grok` function, review the tips below.

:::tip Note
As these represent areas of possible future Zed enhancement, links to open
issues are provided. If you find that a functional gap significantly impacts
your ability to use Zed's `grok` function, please add a comment to the relevant
issue describing your use case.
:::

1. Logstash's Grok offers an optional data type conversion syntax,
   e.g.,
   ```
   %{NUMBER:num:int}
   ```
   to store `num` as an integer type instead of as a
   string. Zed currently accepts this trailing `:type` syntax but effectively
   ignores it and stores all parsed values as strings. Zed's
   [`cast` function](cast.md) can instead be used downstream for data type
   conversion, as shown in the sketch after this list.
   ([zed/4928](https://github.com/brimdata/zed/issues/4928))

2. Some Logstash Grok examples use an optional square bracket syntax for
   storing a parsed value in a nested field, e.g.,
   ```
   %{GREEDYDATA:[nested][field]}
   ```
   to store a value into `{"nested": {"field": ... }}`. In Zed the more common
   dot-separated field naming convention `nested.field` can be combined
   with downstream use of the [`nest_dotted` function](nest_dotted.md) to
   store values in nested fields, as shown in the sketch after this list.
   ([zed/4929](https://github.com/brimdata/zed/issues/4929))

3. Zed's regular expression syntax does not currently support the
   "named capture" syntax shown in the
   [Logstash docs](https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html#_custom_patterns).
   ([zed/4899](https://github.com/brimdata/zed/issues/4899))

   Instead, use the approach shown later in that section of the Logstash
   docs by including a custom pattern in the `definitions` argument, e.g.,

   ```mdtest-command
   echo '"Jan 1 06:25:43 mailserver14 postfix/cleanup[21403]: BEF25A72965: message-id=<[email protected]>"' |
     zq -Z 'yield grok("%{SYSLOGBASE} %{POSTFIX_QUEUEID:queue_id}: %{GREEDYDATA:syslog_message}",
       this,
       "POSTFIX_QUEUEID [0-9A-F]{10,11}")' -
   ```

   produces

   ```mdtest-output
   {
       timestamp: "Jan 1 06:25:43",
       logsource: "mailserver14",
       program: "postfix/cleanup",
       pid: "21403",
       queue_id: "BEF25A72965",
       syslog_message: "message-id=<[email protected]>"
   }
   ```

4. The Grok implementation for Logstash uses the
   [Oniguruma](https://github.com/kkos/oniguruma) regular expression library
   while Zed's `grok` uses Go's [regexp](https://pkg.go.dev/regexp) and
   [RE2 syntax](https://github.com/google/re2/wiki/Syntax). These
   implementations share the same basic syntax, which should suffice for most
   parsing needs. However, per a detailed
   [comparison](https://en.wikipedia.org/wiki/Comparison_of_regular_expression_engines),
   Oniguruma does provide some advanced syntax not available in RE2,
   such as recursion, look-ahead, look-behind, and backreferences. To
   avoid compatibility issues, we recommend building configurations starting
   from the RE2-based [included patterns](#included-patterns).
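
As a sketch of the workarounds mentioned in points 1 and 2 above (not one of
this page's tested examples; it assumes the two-argument form of the
[`cast` function](cast.md) and uses the dot-separated field naming convention
described in point 2 with a made-up input string), parsed values can be cast
and nested downstream of `grok`:

```
# Illustrative only: cast the captured number, then nest the dotted field names.
echo '"order 42 from nyc"' |
  zq -z 'yield nest_dotted(grok("order %{NUMBER:order.num} from %{WORD:order.site}", this))
         | order.num := cast(order.num, <int64>)' -
```

which we would expect to produce `{order:{num:42,site:"nyc"}}`.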

:::tip Note
If you absolutely require features of Logstash's Grok that are not currently
present in Zed's implementation, you can create a Logstash-based preprocessing
pipeline that uses its
[Grok filter plugin](https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html)
and send its output as JSON to Zed tools. Issue
[zed/3151](https://github.com/brimdata/zed/issues/3151) provides some tips for
getting started. If you pursue this approach, please add a comment to the
issue describing your use case or come talk to us on
[community Slack](https://www.brimdata.io/join-slack/).
:::

### Debugging

Much like creating complex regular expressions, building sophisticated Grok
configurations can be frustrating: single-character mistakes can make the
difference between perfect parsing and total failure.

A recommended workflow is to start by successfully parsing a small/simple
portion of your target data and then
[incrementally](https://www.elastic.co/blog/slow-and-steady-how-to-build-custom-grok-patterns-incrementally)
adding more parsing logic, re-testing at each step.
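
For example, reusing the log line from the [examples](#examples) below, a first
pass might capture just the leading timestamp and sweep up the remainder with
`GREEDYDATA`, then split the remainder into finer-grained fields once the first
pattern matches (a sketch, not one of this page's tested examples):

```
# Step 1: match the leading timestamp and capture everything else as "rest".
echo '"2020-09-16T04:20:42.45+01:00 DEBUG This is a sample debug log message"' |
  zq -Z 'yield grok("%{TIMESTAMP_ISO8601:timestamp} %{GREEDYDATA:rest}", this)' -

# Step 2: once that matches, break "rest" into its own captures.
echo '"2020-09-16T04:20:42.45+01:00 DEBUG This is a sample debug log message"' |
  zq -Z 'yield grok("%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:message}", this)' -
```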

To aid in this workflow, you may find an
[interactive Grok debugger](https://grokdebugger.com/) helpful. However, note
that such debuggers have their own
[differences and limitations](https://github.com/cjslack/grok-debugger).
If you devise a working Grok config in such a tool, be sure to incrementally
test it with Zed's `grok`, and be mindful of necessary adjustments such as
those described [above](#comparison-to-other-implementations) and in the
[examples](#examples).

### Need Help?

If you have difficulty with your Grok configurations, please come talk to us
on the [community Slack](https://www.brimdata.io/join-slack/).

### Examples

Parsing a simple log line using the built-in named patterns:
```mdtest-command
echo '"2020-09-16T04:20:42.45+01:00 DEBUG This is a sample debug log message"' |
  zq -Z 'yield grok("%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:message}",
    this)' -
```
=>
```mdtest-output
{
    timestamp: "2020-09-16T04:20:42.45+01:00",
    level: "DEBUG",
    message: "This is a sample debug log message"
}
```

Per Zed's handling of [string literals](../expressions.md#literals), the
leading backslash in escape sequences in string arguments must be doubled,
such as changing `\d` to `\\d` if we repurpose the
[included pattern](#included-patterns) for `NUMTZ` as a `definitions` argument:

```mdtest-command
echo '"+7000"' |
  zq -z 'yield grok("%{MY_NUMTZ:tz}",
    this,
    "MY_NUMTZ [+-]\\d{4}")' -
```
=>
```mdtest-output
{tz:"+7000"}
```

In addition to using `\n` newline escapes to separate multiple named patterns
in the `definitions` argument, string concatenation via `+` may further enhance
readability.

```mdtest-command
echo '"(555)-1212"' |
  zq -z 'yield grok("\\(%{PH_PREFIX:prefix}\\)-%{PH_LINE_NUM:line_number}",
    this,
    "PH_PREFIX \\d{3}\n" +
    "PH_LINE_NUM \\d{4}")' -
```
=>
```mdtest-output
{prefix:"555",line_number:"1212"}
```