### Function

  **grok** — parse a string using a Grok pattern

### Synopsis

```
grok(p: string, s: string) -> record
grok(p: string, s: string, definitions: string) -> record
```

### Description

The _grok_ function parses a string `s` using Grok pattern `p` and returns
a record containing the parsed fields. The syntax for pattern `p`
is `%{pattern:field_name}` where _pattern_ is the name of the pattern
to match in `s` and _field_name_ is the resultant field name of the captured
value.
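
For instance, a minimal sketch (not one of this page's tested examples; the
input string and field names are made up for illustration) might pair the
built-in `WORD` and `GREEDYDATA` patterns with the field names `user` and
`action`:

```
# Illustrative only: a made-up input string parsed with two built-in patterns.
echo '"alice logged in"' |
  zq -z 'yield grok("%{WORD:user} %{GREEDYDATA:action}", this)' -
```

which we would expect to produce `{user:"alice",action:"logged in"}`, with both
values stored as strings.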

When provided with three arguments, `definitions` is a string
of named patterns in the format `PATTERN_NAME PATTERN`, each separated by
newlines (`\n`). The named patterns can then be referenced in argument `p`.
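
As another minimal sketch (again not one of this page's tested examples; the
pattern name `USER_NAME` is invented for illustration), a single custom pattern
can be defined and then referenced:

```
# Illustrative only: USER_NAME is a custom pattern supplied via the third argument.
echo '"user=alice"' |
  zq -z 'yield grok("user=%{USER_NAME:name}", this, "USER_NAME [a-z]+")' -
```

which we would expect to produce `{name:"alice"}`.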

### Included Patterns

The `grok` function by default includes a set of built-in named patterns
that can be referenced in any pattern. The included named patterns can be seen
[here](https://raw.githubusercontent.com/brimdata/zed/main/pkg/grok/base.go).

### Comparison to Other Implementations

Although Grok functionality appears in many open source tools, it lacks a
formal specification. As a result, example parsing configurations found via
web searches may not all plug seamlessly into Zed's `grok` function without
modification.

[Logstash](https://www.elastic.co/logstash) was the first tool to widely
promote the approach via its
[Grok filter plugin](https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html),
so it serves as the de facto reference implementation. Many articles have
been published by Elastic and others that provide helpful guidance on becoming
proficient in Grok. To help you adapt what you learn from these resources to
the use of Zed's `grok` function, review the tips below.

:::tip Note
As these represent areas of possible future Zed enhancement, links to open
issues are provided. If you find that a functional gap significantly impacts
your ability to use Zed's `grok` function, please add a comment to the relevant
issue describing your use case.
:::

1. Logstash's Grok offers an optional data type conversion syntax,
   e.g.,
   ```
   %{NUMBER:num:int}
   ```
   to store `num` as an integer type instead of as a
   string. Zed currently accepts this trailing `:type` syntax but effectively
   ignores it and stores all parsed values as strings. Zed's
   [`cast` function](cast.md) can instead be used downstream for data type
   conversion, as shown in the sketch after this list.
   ([zed/4928](https://github.com/brimdata/zed/issues/4928))

2. Some Logstash Grok examples use an optional square bracket syntax for
   storing a parsed value in a nested field, e.g.,
   ```
   %{GREEDYDATA:[nested][field]}
   ```
   to store a value into `{"nested": {"field": ... }}`. In Zed the more common
   dot-separated field naming convention `nested.field` can be combined
   with downstream use of the [`nest_dotted` function](nest_dotted.md) to
   store values in nested fields, as shown in the sketch after this list.
   ([zed/4929](https://github.com/brimdata/zed/issues/4929))

3. Zed's regular expression syntax does not currently support the
   "named capture" syntax shown in the
   [Logstash docs](https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html#_custom_patterns).
   ([zed/4899](https://github.com/brimdata/zed/issues/4899))

   Instead, use the approach shown later in that section of the Logstash
   docs by including a custom pattern in the `definitions` argument, e.g.,

   ```mdtest-command
   echo '"Jan 1 06:25:43 mailserver14 postfix/cleanup[21403]: BEF25A72965: message-id=<[email protected]>"' |
     zq -Z 'yield grok("%{SYSLOGBASE} %{POSTFIX_QUEUEID:queue_id}: %{GREEDYDATA:syslog_message}",
       this,
       "POSTFIX_QUEUEID [0-9A-F]{10,11}")' -
   ```

   produces

   ```mdtest-output
   {
       timestamp: "Jan 1 06:25:43",
       logsource: "mailserver14",
       program: "postfix/cleanup",
       pid: "21403",
       queue_id: "BEF25A72965",
       syslog_message: "message-id=<[email protected]>"
   }
   ```

4. The Grok implementation for Logstash uses the
   [Oniguruma](https://github.com/kkos/oniguruma) regular expression library
   while Zed's `grok` uses Go's [regexp](https://pkg.go.dev/regexp) and
   [RE2 syntax](https://github.com/google/re2/wiki/Syntax). These
   implementations share the same basic syntax, which should suffice for most
   parsing needs. However, per a detailed
   [comparison](https://en.wikipedia.org/wiki/Comparison_of_regular_expression_engines),
   Oniguruma does provide some advanced syntax not available in RE2,
   such as recursion, look-ahead, look-behind, and backreferences. To
   avoid compatibility issues, we recommend building configurations starting
   from the RE2-based [included patterns](#included-patterns).
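
As a sketch of the workarounds mentioned in points 1 and 2 above (not one of
this page's tested examples; it assumes the two-argument form of the
[`cast` function](cast.md) and uses the dot-separated field naming convention
described in point 2 with a made-up input string), parsed values can be cast
and nested downstream of `grok`:

```
# Illustrative only: cast the captured number, then nest the dotted field names.
echo '"order 42 from nyc"' |
  zq -z 'yield nest_dotted(grok("order %{NUMBER:order.num} from %{WORD:order.site}", this))
         | order.num := cast(order.num, <int64>)' -
```

which we would expect to produce `{order:{num:42,site:"nyc"}}`.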

:::tip Note
If you absolutely require features of Logstash's Grok that are not currently
present in Zed's implementation, you can create a Logstash-based preprocessing
pipeline that uses its
[Grok filter plugin](https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html)
and send its output as JSON to Zed tools. Issue
[zed/3151](https://github.com/brimdata/zed/issues/3151) provides some tips for
getting started. If you pursue this approach, please add a comment to the
issue describing your use case or come talk to us on
[community Slack](https://www.brimdata.io/join-slack/).
:::

### Debugging

Much like creating complex regular expressions, building sophisticated Grok
configurations can be frustrating: single-character mistakes can make the
difference between perfect parsing and total failure.

A recommended workflow is to start by successfully parsing a small/simple
portion of your target data and then
[incrementally](https://www.elastic.co/blog/slow-and-steady-how-to-build-custom-grok-patterns-incrementally)
adding more parsing logic, re-testing at each step.
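
For example, reusing the log line from the [examples](#examples) below, a first
pass might capture just the leading timestamp and sweep up the remainder with
`GREEDYDATA`, then split the remainder into finer-grained fields once the first
pattern matches (a sketch, not one of this page's tested examples):

```
# Step 1: match the leading timestamp and capture everything else as "rest".
echo '"2020-09-16T04:20:42.45+01:00 DEBUG This is a sample debug log message"' |
  zq -Z 'yield grok("%{TIMESTAMP_ISO8601:timestamp} %{GREEDYDATA:rest}", this)' -

# Step 2: once that matches, break "rest" into its own captures.
echo '"2020-09-16T04:20:42.45+01:00 DEBUG This is a sample debug log message"' |
  zq -Z 'yield grok("%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:message}", this)' -
```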

To aid in this workflow, you may find an
[interactive Grok debugger](https://grokdebugger.com/) helpful. However, note
that such debuggers have their own
[differences and limitations](https://github.com/cjslack/grok-debugger).
If you devise a working Grok config in such a tool, be sure to incrementally
test it with Zed's `grok`, and be mindful of necessary adjustments such as
those described [above](#comparison-to-other-implementations) and in the
[examples](#examples).

### Need Help?

If you have difficulty with your Grok configurations, please come talk to us
on the [community Slack](https://www.brimdata.io/join-slack/).

### Examples

Parsing a simple log line using the built-in named patterns:
```mdtest-command
echo '"2020-09-16T04:20:42.45+01:00 DEBUG This is a sample debug log message"' |
  zq -Z 'yield grok("%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:message}",
    this)' -
```
=>
```mdtest-output
{
    timestamp: "2020-09-16T04:20:42.45+01:00",
    level: "DEBUG",
    message: "This is a sample debug log message"
}
```

Per Zed's handling of [string literals](../expressions.md#literals), the
leading backslash in escape sequences in string arguments must be doubled,
such as changing `\d` to `\\d` if we repurpose the
[included pattern](#included-patterns) for `NUMTZ` as a `definitions` argument:

```mdtest-command
echo '"+7000"' |
  zq -z 'yield grok("%{MY_NUMTZ:tz}",
    this,
    "MY_NUMTZ [+-]\\d{4}")' -
```
=>
```mdtest-output
{tz:"+7000"}
```

In addition to using `\n` newline escapes to separate multiple named patterns
in the `definitions` argument, string concatenation via `+` may further enhance
readability.

```mdtest-command
echo '"(555)-1212"' |
  zq -z 'yield grok("\\(%{PH_PREFIX:prefix}\\)-%{PH_LINE_NUM:line_number}",
    this,
    "PH_PREFIX \\d{3}\n" +
    "PH_LINE_NUM \\d{4}")' -
```
=>
```mdtest-output
{prefix:"555",line_number:"1212"}
```