Reference JSON in places we previously said NDJSON #5360

Merged 5 commits on Oct 23, 2024
8 changes: 4 additions & 4 deletions CHANGELOG.md
@@ -2,7 +2,7 @@
* Improve the error message shown to a user when a `zed` command is run but there's no pool/branch set for use (#5198)
* Improve the performance of the [`load` operator](docs/language/operators/load.md) by removing an unnecessary/inefficient merge (#5200)
* Improve the [`sort` operator](docs/language/operators/sort.md) to allow different ordering for each key (#5203, #5262)
* Update the [Zeek reference shaper docs](docs/integrations/zeek/shaping-zeek-ndjson.md#reference-shaper-contents) to incorporate changes for [Zeek v7.0.0](https://github.com/zeek/zeek/releases/tag/v7.0.0) logs (#5212)
* Update the [Zeek reference shaper docs](docs/integrations/zeek/shaping-zeek-json.md#reference-shaper-contents) to incorporate changes for [Zeek v7.0.0](https://github.com/zeek/zeek/releases/tag/v7.0.0) logs (#5212)
* Update the [`summarize` operator docs](docs/language/operators/summarize.md) to show the use of `by` without an [aggregate function](docs/language/aggregates/README.md) (#5216)
* Update the [`grok` function docs](docs/language/functions/grok.md) with additional examples and guidance (#5243)
* Update the [Lateral Subquery docs](docs/language/lateral-subqueries.md) with an emphasis on when primitive values or arrays are returned by [Lateral Expressions](docs/language/lateral-subqueries.md#lateral-expressions) (#5264)
@@ -28,7 +28,7 @@
## v1.16.0
* Improve ZNG scanning performance (#5101, #5103)
* Improve the error message shown when `zq` is invoked with a single argument that's not a valid query and doesn't contain a source (#5119)
* Update [Zeek integration docs](docs/integrations/zeek/README.md), including [reference shaper](docs/integrations/zeek/shaping-zeek-ndjson.md) changes for [Zeek v6.2.0](https://github.com/zeek/zeek/releases/tag/v6.2.0) data (#5106)
* Update [Zeek integration docs](docs/integrations/zeek/README.md), including [reference shaper](docs/integrations/zeek/shaping-zeek-json.md) changes for [Zeek v6.2.0](https://github.com/zeek/zeek/releases/tag/v6.2.0) data (#5106)
* [String literals](docs/language/expressions.md#formatted-string-literals) now use the "f-string" format `f"{ <expr> }"` instead of the previous `${ <expr> }` (#5123)
* Prototype SQL support has been dropped from the Zed language (full SQL support is expected at a later date) (#5109)
* Empty objects and arrays in JSON output are now consistently printed on a single line (#5127)
@@ -376,7 +376,7 @@ questions.
* Add an `unflatten()` function that turns fields with dot-separated names into fields of nested records (#2277)
* Fix an issue where querying an index in a Zed lake did not return all matched records (#2273)
* Accept type definition names and aliases in shaper functions (#2289)
* Add a reference [shaper for Zeek data](docs/integrations/zeek/shaping-zeek-ndjson.md) (#2300, #2368, #2448, #2489, #2601)
* Add a reference [shaper for Zeek data](docs/integrations/zeek/shaping-zeek-json.md) (#2300, #2368, #2448, #2489, #2601)
* Fix an issue where accessing a `null` array element in a `by` grouping caused a panic (#2310)
* Add support for parsing timestamps with offset format `±[hh][mm]` (#2297)
* Remove cropping from `shape()` (#2309)
@@ -493,7 +493,7 @@ questions.
* Fix an issue where temporary spill-to-disk directories were not being deleted upon exit (#3009, #3010)
* Fix a ZSON issue with `union` types with alias decorators (#3015, #3016)
* The ZSON format has been changed such that integer type IDs are no longer output (#3017)
* Update the reference Zed shaper for Zeek ([docs](docs/integrations/zeek/shaping-zeek-ndjson.md)) to reflect changes in Zeek release v4.1.0 (#3021)
* Update the reference Zed shaper for Zeek ([docs](docs/integrations/zeek/shaping-zeek-json.md)) to reflect changes in Zeek release v4.1.0 (#3021)
* Fix an issue where backslash escapes in Zed regular expressions were not accepted (#3040)
* The ZST format has been updated to work for typedef'd outer records (#3047)
* Fix an issue where an empty string could not be output as a JSON field name (#3054)
4 changes: 2 additions & 2 deletions docs/integrations/fluentd.md
@@ -195,14 +195,14 @@ richer data typing options, including some types well-suited to Zeek data such
as `ip`, `time`, and `duration`. In Zed, the task of cleaning up data to
improve its typing is known as [shaping](../language/shaping.md).

For Zeek data specifically, a [reference shaper](zeek/shaping-zeek-ndjson.md#reference-shaper-contents)
For Zeek data specifically, a [reference shaper](zeek/shaping-zeek-json.md#reference-shaper-contents)
is available that reflects the field and type information in the logs
generated by a recent Zeek release. To improve the quality of our data, we
next created an expanded configuration that applies the shaper before loading
the data into our pool.

First we saved the contents of the shaper from
[here](zeek/shaping-zeek-ndjson.md#reference-shaper-contents) to a file
[here](zeek/shaping-zeek-json.md#reference-shaper-contents) to a file
`shaper.zed`. Then in the same directory we created the following
`fluentd-shaped.conf`:

2 changes: 1 addition & 1 deletion docs/integrations/zeek/README.md
@@ -7,4 +7,4 @@ docs may be of interest to you.

* [Reading Zeek Log Formats](reading-zeek-log-formats.md)
* [Zed/Zeek Data Type Compatibility](data-type-compatibility.md)
* [Shaping Zeek NDJSON](shaping-zeek-ndjson.md)
* [Shaping Zeek JSON](shaping-zeek-json.md)
4 changes: 2 additions & 2 deletions docs/integrations/zeek/data-type-compatibility.md
@@ -10,8 +10,8 @@ As the [Zed data model](../../formats/zed.md) was in many ways inspired by the
the rich Zed storage formats ([ZSON](../../formats/zson.md),
[ZNG](../../formats/zng.md), etc.) maintain comprehensive interoperability
with Zeek. When Zeek is configured to output its logs in
NDJSON format, much of the rich type information is lost in translation, but
this can be restored by following the guidance for [shaping Zeek NDJSON](shaping-zeek-ndjson.md).
JSON format, much of the rich type information is lost in translation, but
this can be restored by following the guidance for [shaping Zeek JSON](shaping-zeek-json.md).
On the other hand, Zeek TSV can be converted to Zed storage formats and back to
Zeek TSV without any loss of information.

26 changes: 13 additions & 13 deletions docs/integrations/zeek/reading-zeek-log-formats.md
@@ -76,32 +76,32 @@ once they've been read in as is. The
provides further detail on how the rich data types in Zeek TSV map to the
equivalent [rich types in Zed](../../formats/zed.md#1-primitive-types).

## Zeek NDJSON
## Zeek JSON

As an alternative to the default TSV format, there are two common ways that
Zeek may instead generate logs in [NDJSON](https://en.wikipedia.org/wiki/JSON_streaming#NDJSON) format.
Zeek may instead generate logs in JSON format.

1. Using the [JSON Streaming Logs](https://github.com/corelight/json-streaming-logs)
package (recommended for use with Zed)
2. Using the built-in [ASCII logger](https://docs.zeek.org/en/current/scripts/base/frameworks/logging/writers/ascii.zeek.html)
configured with `redef LogAscii::use_json = T;`

In both cases, Zed tools such as `zq` can read these NDJSON logs automatically
In both cases, Zed tools such as `zq` can read these logs automatically
as is, but with caveats.

Let's revisit the same `conn` record we just examined from the Zeek TSV
log, but now as NDJSON generated using the JSON Streaming Logs package.
log, but now as generated using the JSON Streaming Logs package.

#### conn.ndjson
#### conn.json

```mdtest-input conn.ndjson
```mdtest-input conn.json
{"_path":"conn","_write_ts":"2018-03-24T17:15:21.400275Z","ts":"2018-03-24T17:15:21.255387Z","uid":"C8Tful1TvM3Zf5x8fl","id.orig_h":"10.164.94.120","id.orig_p":39681,"id.resp_h":"10.47.3.155","id.resp_p":3389,"proto":"tcp","duration":0.004266023635864258,"orig_bytes":97,"resp_bytes":19,"conn_state":"RSTR","missed_bytes":0,"history":"ShADTdtr","orig_pkts":10,"orig_ip_bytes":730,"resp_pkts":6,"resp_ip_bytes":342}
```

#### Example

```mdtest-command
super -Z -c 'head 1' conn.ndjson
super -Z -c 'head 1' conn.json
```

#### Output
@@ -139,33 +139,33 @@ all follow from the records having been previously output as JSON.
3. The connection `duration` is printed as a floating point number rather than
the Zed `duration` type.
4. The keys for the null-valued fields in the record read from
TSV are not present in the record read from NDJSON.
TSV are not present in the record read from JSON.

If you're familiar with the limitations of the JSON data types, it makes sense
that Zeek chose to output these values in NDJSON as it did. Furthermore, if
that Zeek chose to output these values as it did. Furthermore, if
you were just seeking to do quick searches on the string values or simple math
on the numbers, these limitations may be acceptable. However, if you intended
to perform operations like
[aggregations with time-based grouping](../../language/functions/bucket.md)
or [CIDR matches](../../language/functions/network_of.md)
on IP addresses, you would likely want to restore the rich Zed data types as
the records are being read. The document on [shaping Zeek NDJSON](shaping-zeek-ndjson.md)
the records are being read. The document on [shaping Zeek JSON](shaping-zeek-json.md)
provides details on how this can be done.
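
To make the type loss concrete, here is a minimal Python sketch (not Zed, and not how `zq` or the reference shaper work internally). It parses a trimmed copy of the `conn` record shown above and converts a few fields to richer types. The helper name `restore_types` and the `10.164.0.0/16` network are assumptions chosen purely for illustration.

```python
import ipaddress
import json
from datetime import datetime, timedelta, timezone

# A trimmed copy of the conn record shown earlier in this document.
line = (
    '{"_path":"conn","ts":"2018-03-24T17:15:21.255387Z",'
    '"id.orig_h":"10.164.94.120","id.resp_p":3389,'
    '"duration":0.004266023635864258}'
)

raw = json.loads(line)

# As parsed, the timestamp and IP address are plain strings and the
# duration is a plain float, because JSON has no richer types to offer.
assert isinstance(raw["ts"], str)
assert isinstance(raw["id.orig_h"], str)
assert isinstance(raw["duration"], float)


def restore_types(rec):
    """Hypothetical helper: convert a few fields to richer Python types."""
    out = dict(rec)
    out["ts"] = datetime.strptime(
        rec["ts"], "%Y-%m-%dT%H:%M:%S.%fZ"
    ).replace(tzinfo=timezone.utc)
    out["id.orig_h"] = ipaddress.ip_address(rec["id.orig_h"])
    out["duration"] = timedelta(seconds=rec["duration"])
    return out


typed = restore_types(raw)

# With richer types, a CIDR match and time-based bucketing become simple.
in_net = typed["id.orig_h"] in ipaddress.ip_network("10.164.0.0/16")
minute_bucket = typed["ts"].replace(second=0, microsecond=0)
print(in_net, minute_bucket.isoformat())
```

A shaper does the equivalent work declaratively, so the conversion rules live in type definitions rather than in per-field code like this.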

## The Role of `_path`

Zeek's `_path` field plays an important role in differentiating between its
different [log types](https://docs.zeek.org/en/master/script-reference/log-files.html)
(`conn`, `dns`, etc.). For instance,
[shaping Zeek NDJSON](shaping-zeek-ndjson.md) relies on the value of
the `_path` field to know which Zed type to apply to an input NDJSON
[shaping Zeek JSON](shaping-zeek-json.md) relies on the value of
the `_path` field to know which Zed type to apply to an input JSON
record.
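
The idea of selecting a type by `_path` can be sketched in Python as a simple lookup. This is only an illustration of the dispatch concept, not the shaper's actual logic. The `SCHEMAS` table, its field lists, and the `pick_schema` helper are hypothetical, and the real reference shaper defines far more detailed Zed types.

```python
# Hypothetical per-log-type field lists, for illustration only.
SCHEMAS = {
    "conn": ["_path", "ts", "uid", "proto"],
    "dns": ["_path", "ts", "uid", "query"],
}


def pick_schema(record):
    """Return the schema for a record's `_path`, or None if unknown or absent."""
    return SCHEMAS.get(record.get("_path"))


known = pick_schema({"_path": "conn", "uid": "C8Tful1TvM3Zf5x8fl"})
unknown = pick_schema({"_path": "made_up_log"})
missing = pick_schema({"uid": "C8Tful1TvM3Zf5x8fl"})  # no `_path` at all
```

A record with an unrecognized or absent `_path` gets no schema in this sketch, which is why the presence of that field in every record matters.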

If reading Zeek TSV logs or logs generated by the JSON Streaming Logs
package, this `_path` value is provided within the Zeek logs. However, if the
log was generated by Zeek's built-in ASCII logger when using the
`redef LogAscii::use_json = T;` configuration, the value that would be used for
`_path` is present in the log _file name_ but is not in the NDJSON log
`_path` is present in the log _file name_ but is not in the JSON log
records. In this case you could adjust your Zeek configuration by following the
[Log Extension Fields example](https://docs.zeek.org/en/master/frameworks/logging.html#log-extension-fields)
from the Zeek docs. If you enter `path` in the locations where the example
@@ -1,11 +1,11 @@
---
sidebar_position: 3
sidebar_label: Shaping Zeek NDJSON
sidebar_label: Shaping Zeek JSON
---

# Shaping Zeek NDJSON
# Shaping Zeek JSON

When [reading Zeek NDJSON format logs](reading-zeek-log-formats.md#zeek-ndjson),
When [reading Zeek JSON format logs](reading-zeek-log-formats.md#zeek-json),
much of the rich data typing that was originally present inside Zeek is at risk
of being lost. This detail can be restored using a Zed
[shaper](../../language/shaping.md), such as the
@@ -14,7 +14,7 @@ of being lost. This detail can be restored using a Zed
## Zeek Version/Configuration

The fields and data types in the reference shaper reflect the default
NDJSON-format logs output by Zeek releases up to the version number referenced
JSON-format logs output by Zeek releases up to the version number referenced
in the comments at the top. They have been revisited periodically
as new Zeek versions have been released.

@@ -40,8 +40,8 @@ The following reference `shaper.zed` may seem large, but ultimately it follows a
fairly simple pattern that repeats across the many [Zeek log types](https://docs.zeek.org/en/master/script-reference/log-files.html).

```mdtest-input shaper.zed
// This reference Zed shaper for Zeek NDJSON logs was most recently tested with
// Zeek v7.0.0. The fields and data types reflect the default NDJSON
// This reference Zed shaper for Zeek JSON logs was most recently tested with
// Zeek v7.0.0. The fields and data types reflect the default JSON
// logs output by that Zeek version when using the JSON Streaming Logs package.
// (https://github.com/corelight/json-streaming-logs).

@@ -146,10 +146,10 @@ yield nest_dotted(this)
### Configurable Options

The shaper begins with some configurable boolean constants that control how
the shaper will behave when the NDJSON data does not precisely match the Zeek
the shaper will behave when the JSON data does not precisely match the Zeek
type definitions.

* `_crop_records` (default: `true`) - Fields in the NDJSON records whose names
* `_crop_records` (default: `true`) - Fields in the JSON records whose names
are not referenced in the type definitions will be removed. If set to `false`,
such a field would be maintained and assigned an inferred type.

@@ -159,7 +159,7 @@ original input record will be
along with the shaped and cropped variations.

At these default settings, the shaper is well-suited for an iterative workflow
with a goal of establishing full coverage of the NDJSON data with rich Zed
with a goal of establishing full coverage of the JSON data with rich Zed
types. For instance, the [`has_error` function](../../language/functions/has_error.md)
can be applied on the shaped output and any error values surfaced will point
to fields that can be added to the type definitions in the shaper.
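
The iterative workflow can be pictured with a loose Python analogy. This is a sketch under stated assumptions, not the shaper's implementation: `WEIRD_TYPE`, `shape_record`, and the sample record are all made up for illustration, and real Zed shaping reports problems as error values rather than as a returned list.

```python
# Hypothetical type definition: field name -> converter.
WEIRD_TYPE = {"_path": str, "name": str, "notice": bool}


def shape_record(rec, type_def, crop_records=True):
    """Apply converters and report fields the type definition does not cover."""
    shaped, uncovered = {}, []
    for key, value in rec.items():
        if key in type_def:
            shaped[key] = type_def[key](value)
        else:
            uncovered.append(key)
            if not crop_records:
                shaped[key] = value  # keep the field with its inferred type
    return shaped, uncovered


# Made-up sample record with one field the type definition lacks.
rec = {"_path": "weird", "name": "example_weird", "notice": False, "new_field": 1}

shaped, uncovered = shape_record(rec, WEIRD_TYPE)
kept, _ = shape_record(rec, WEIRD_TYPE, crop_records=False)
```

Each name that lands in `uncovered` plays the role of a surfaced error: it identifies a field to add to the type definition before the next pass.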
@@ -182,7 +182,7 @@ record.
### Type Definitions Per Zeek Log `_path`

The bulk of this Zed shaper consists of detailed per-field data type
definitions for each record in the default set of NDJSON logs output by Zeek.
definitions for each record in the default set of JSON logs output by Zeek.
These type definitions reference the types we defined above, such as `port`
and `conn_id`. The syntax for defining primitive and complex types follows the
relevant sections of the [ZSON Format](../../formats/zson.md#2-the-zson-format)
@@ -198,7 +198,7 @@ specification.
:::tip note
See [the role of `_path`](reading-zeek-log-formats.md#the-role-of-_path)
for important details if you're using Zeek's built-in [ASCII logger](https://docs.zeek.org/en/current/scripts/base/frameworks/logging/writers/ascii.zeek.html)
to generate NDJSON rather than the [JSON Streaming Logs](https://github.com/corelight/json-streaming-logs) package.
rather than the [JSON Streaming Logs](https://github.com/corelight/json-streaming-logs) package.
:::

### Zed Pipeline
@@ -237,7 +237,7 @@ Picking this apart, it transforms each record as it's being read in several
steps.

1. The [`nest_dotted` function](../../language/functions/nest_dotted.md)
reverses the Zeek NDJSON logger's "flattening" of nested records, e.g., how
reverses the Zeek JSON logger's "flattening" of nested records, e.g., how
it populates a field named `id.orig_h` rather than creating a field `id` with
sub-field `orig_h` inside it. Restoring the original nesting now gives us
the option to reference the embedded record named `id` in the Zed language
@@ -247,17 +247,17 @@ steps.
2. The [`switch` operator](../../language/operators/switch.md) is used to flag
any problems encountered when applying the shaper logic, e.g.,

* An incoming Zeek NDJSON record has a `_path` value for which the shaper
* An incoming Zeek JSON record has a `_path` value for which the shaper
lacks a type definition.
* A field in an incoming Zeek NDJSON record is located in our type
* A field in an incoming Zeek JSON record is located in our type
definitions but cannot be successfully [cast](../../language/functions/cast.md)
to the target type defined in the shaper.
* An incoming Zeek NDJSON record has additional field(s) beyond those in
* An incoming Zeek JSON record has additional field(s) beyond those in
the target type definition and the [configurable options](#configurable-options)
are set such that this should be treated as an error.

3. Each [`shape` function](../../language/functions/shape.md) call applies an
appropriate type definition based on the nature of the incoming Zeek NDJSON
appropriate type definition based on the nature of the incoming Zeek JSON
record. The logic of `shape` includes:

* For any fields referenced in the type definition that aren't present in
@@ -271,9 +271,9 @@ steps.

A shaper is typically invoked via the `-I` option of `zq`.

For example, if we assume this input file `weird.ndjson`
For example, if we assume this input file `weird.json`

```mdtest-input weird.ndjson
```mdtest-input weird.json
{
"_path": "weird",
"_write_ts": "2018-03-24T17:15:20.600843Z",
@@ -292,7 +292,7 @@ For example, if we assume this input file `weird.ndjson`
applying the reference shaper via

```mdtest-command
super -Z -I shaper.zed weird.ndjson
super -Z -I shaper.zed weird.json
```

produces
@@ -317,7 +317,7 @@ produces
} (=weird)
```

If working in a directory containing many NDJSON logs, the
If working in a directory containing many JSON logs, the
reference shaper can be applied to all the records they contain and
output them all in a single binary [ZNG](../../formats/zng.md) file as
follows:
@@ -340,8 +340,8 @@ super -Z -I shaper.zed -c '| has_error(this)' *.log

## Importing Shaped Data Into Zui

If you wish to shape your Zeek NDJSON data in [Zui](https://zui.brimdata.io/),
drag the NDJSON files into the app and then paste the contents of the
If you wish to shape your Zeek JSON data in [Zui](https://zui.brimdata.io/),
drag the files into the app and then paste the contents of the
[`shaper.zed` shown above](#reference-shaper-contents) into the
**Shaper Editor** of the [**Preview & Load**](https://zui.brimdata.io/docs/features/Preview-Load)
screen.