Update docs for lateral subqueries and over operator #5264

Merged 3 commits on Sep 16, 2024
73 changes: 58 additions & 15 deletions docs/language/lateral-subqueries.md
@@ -7,11 +7,19 @@ sidebar_label: Lateral Subqueries

Lateral subqueries provide a powerful means to apply a Zed query
to each subsequence of values generated from an outer sequence of values.
The inner query may be _any Zed query_ and may refer to values from
The inner query may be _any_ dataflow operator sequence (excluding
[`from` operators](operators/from.md)) and may refer to values from
the outer sequence.

:::tip Note
This pattern rhymes with the SQL pattern of a "lateral
join", which runs a subquery for each row of the outer query's results.
:::

Lateral subqueries are created using the scoped form of the
[`over` operator](operators/over.md) and may be nested to arbitrary depth.
[`over` operator](operators/over.md). They may be nested to arbitrary depth,
and accesses to variables in parent lateral query bodies follow lexical
scoping.

For example,
```mdtest-command
@@ -24,7 +32,7 @@ produces
{name:"foo",elem:2}
{name:"bar",elem:3}
```
Here the lateral scope, described below, creates a subquery
Here the [lateral scope](#lateral-scope), described below, creates a subquery
```
yield {name,elem:this}
```
@@ -41,7 +49,7 @@ The first subquery thus operates on the input values `1, 2` with the variable
{name:"foo",elem:1}
{name:"foo",elem:2}
```
and the second subquery operators on the input value `3` with the variable
and the second subquery operates on the input value `3` with the variable
`name` set to "bar", emitting
```
{name:"bar",elem:3}
@@ -81,17 +89,23 @@ between each `<expr>` evaluated in the outer scope and each `<var>`, which
represents a new symbol in the inner scope of the `<query>`.
In the field reference form, a single identifier `<field>` refers to a field
in the parent scope and makes that field's value available in the lateral scope
with the same name.
via the same name.

Note that any such variable definitions override [implied field references](dataflow-model.md#implied-field-references) of
`this`. If both a field named `x` and a variable named `x` need to be
referenced in the lateral scope, the field reference should be qualified as
`this.x` while the variable is referenced simply as `x`.
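
The shadowing rule above can be modeled in a few lines of Python. This is only a sketch of the lookup order, not Zed's implementation: `resolve`, `variables`, and the sample values are hypothetical names chosen for illustration, and a missing field returns `None` here rather than whatever Zed produces.

```python
def resolve(name, this, variables):
    """Resolve a bare identifier inside a lateral scope (simplified model)."""
    # A variable bound by `with` takes precedence over a field of `this`.
    if name in variables:
        return variables[name]
    # Otherwise the identifier is an implied field reference, i.e. this.<name>.
    return this.get(name)


# Hypothetical lateral scope: variable `x` shadows the field `x` of `this`.
variables = {"x": "from the outer scope"}
this = {"x": "field of this", "y": 2}

bare_x = resolve("x", this, variables)   # the variable wins
qualified_x = this["x"]                  # `this.x` still reaches the field
bare_y = resolve("y", this, variables)   # no variable `y`, so the field is used
```

Qualifying the field as `this.x` bypasses the variable lookup entirely, which is why both values remain reachable.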

The `<query>`, which may be any Zed query, is evaluated once per outer value
The `<query>` is evaluated once per outer value
on the sequence generated by the `over` expression. In the lateral scope,
the value `this` refers to the inner sequence generated from the `over` expressions.
This query runs to completion for each inner sequence and emits
each subquery result as each inner sequence traversal completes.
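
As a rough illustration of this evaluation order, the Python sketch below models a lateral scope as a function that receives one complete inner sequence plus the variables bound by `with`. It is not how Zed is implemented. The helper `over_with`, the input records, and their field names `s` and `a` are assumptions made for the example, and only array values are traversed for brevity.

```python
def over_with(outer_values, over_expr, bindings, subquery):
    """Model of `over <expr> with <var>=<expr> => ( <query> )`."""
    for outer in outer_values:
        derived = over_expr(outer)
        # Non-array values are treated as a one-element sequence.
        inner = derived if isinstance(derived, list) else [derived]
        # `with` expressions are evaluated in the outer scope.
        scope = {var: expr(outer) for var, expr in bindings.items()}
        # The subquery consumes the whole inner sequence before the
        # next outer value is processed.
        yield from subquery(inner, scope)


# Hypothetical input records.
records = [{"s": "foo", "a": [1, 2]}, {"s": "bar", "a": [3]}]


def body(inner, scope):
    # Plays the role of `yield {name,elem:this}`.
    return [{"name": scope["name"], "elem": elem} for elem in inner]


results = list(
    over_with(records, lambda r: r["a"], {"name": lambda r: r["s"]}, body)
)

# A subquery that needs its entire input, such as a sort, runs once per
# inner sequence rather than once over all values.
sorted_results = list(
    over_with([{"a": [3, 1, 2]}, {"a": [9, 8]}], lambda r: r["a"], {},
              lambda inner, scope: sorted(inner))
)
```

In the second call the values come out as `1, 2, 3, 8, 9`: each subsequence is sorted independently and emitted before the next one starts.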

This structure is powerful because _any_ Zed query can appear in the body of
the lateral scope. In contrast to the `yield` example, a sort could be
applied to each subsequence in the subquery, where sort
This structure is powerful because _any_ dataflow operator sequence (excluding
[`from` operators](operators/from.md)) can appear in the body of
the lateral scope. In contrast to the [`yield`](operators/yield.md) example above, a [`sort`](operators/sort.md) could be
applied to each subsequence in the subquery, where `sort`
reads all values of the subsequence, sorts them, emits them, then
repeats the process for the next subsequence. For example,
```mdtest-command
@@ -112,13 +126,12 @@ parenthesized form:
```
( over <expr> [, <expr>...] [with <var>=<expr> [, ... <var>[=<expr>]] | <lateral> )
```
> Note that the parentheses disambiguate a lateral expression from a lateral
> dataflow operator.

This form must always include a lateral scope as indicated by `<lateral>`,
which can be any dataflow operator sequence excluding [`from` operators](operators/from.md).
As with the `over` operator, values from the outer scope can be brought into
the lateral scope using the `with` clause.
:::tip
The parentheses disambiguate a lateral expression from a [lateral dataflow operator](operators/over.md).
:::

This form must always include a [lateral scope](#lateral-scope) as indicated by `<lateral>`.

The lateral expression is evaluated by evaluating each `<expr>` and feeding
the results as inputs to the `<lateral>` dataflow operators. Each time the
@@ -148,3 +161,33 @@ produces
{sorted:[1,4,7],sum:12}
{sorted:[1,2,3],sum:6}
```
Because Zed expressions evaluate to a single result, if multiple values remain
at the conclusion of the lateral dataflow, they are automatically wrapped in
an array, e.g.,
```mdtest-command
echo '{x:1} {x:[2]} {x:[3,4]}' |
zq -z 'yield {s:(over x | yield this+1)}' -
```
produces
```mdtest-output
{s:2}
{s:3}
{s:[4,5]}
```
To handle such dynamic input data, you can ensure your downstream dataflow
always receives consistently packaged values by explicitly wrapping the result
of the lateral scope, e.g.,
```mdtest-command
echo '{x:1} {x:[2]} {x:[3,4]}' |
zq -z 'yield {s:(over x | yield this+1 | collect(this))}' -
```
produces
```mdtest-output
{s:[2]}
{s:[3]}
{s:[4,5]}
```
Similarly, a primitive value may be consistently produced by concluding the
lateral scope with an operator such as [`head`](operators/head.md) or
[`tail`](operators/tail.md), or by applying certain [aggregate functions](aggregates/README.md),
as done with [`sum`](aggregates/sum.md) above.
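
The wrapping behavior described above can be mimicked with a short Python model. This is a sketch under stated assumptions, not Zed's implementation: `wrap_result` and `over` are hypothetical helpers, and the case where the lateral scope produces no values is not covered by the text, so the sketch raises instead of guessing.

```python
def over(value):
    """Arrays yield their elements; any other value yields itself."""
    return value if isinstance(value, list) else [value]


def wrap_result(values):
    """Model how a lateral expression collapses its results into one value."""
    values = list(values)
    if len(values) == 1:
        return values[0]   # a single result is returned as is
    if len(values) > 1:
        return values      # multiple results are wrapped in an array
    # Assumption: the empty case is out of scope for this sketch.
    raise ValueError("empty result is not modeled here")


inputs = [1, [2], [3, 4]]

# Models `(over x | yield this+1)`: the result shape depends on the input.
implicit = [wrap_result(v + 1 for v in over(x)) for x in inputs]

# Models `(over x | yield this+1 | collect(this))`: always an array.
explicit = [[v + 1 for v in over(x)] for x in inputs]
```

Here `implicit` mixes scalars and arrays while `explicit` is uniformly arrays, mirroring the two outputs shown above.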
34 changes: 5 additions & 29 deletions docs/language/operators/over.md
@@ -12,45 +12,21 @@ The `over` operator traverses complex values to create a new sequence
of derived values (e.g., the elements of an array) and either
(in the first form) sends the new values directly to its output or
(in the second form) sends the values to a scoped computation as indicated
by `<lateral>`, which may represent any Zed subquery operating on the
derived sequence of values as `this`.
by `<lateral>`, which may represent any Zed [subquery](../lateral-subqueries.md) operating on the
derived sequence of values as [`this`](../dataflow-model.md#the-special-value-this).

Each expression `<expr>` is evaluated in left-to-right order and derived sequences are
generated from each such result depending on its type:
* an array value generates each of its element,
* an array value generates each of its elements,
* a map value generates a sequence of records of the form `{key:<key>,value:<value>}` for each
entry in the map, and
* all other values generate a single value equal to itself.
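
The three rules above can be sketched in Python as follows. This is a model for illustration only: `derive` is a hypothetical helper, a Python `list` stands in for a Zed array, and a Python `dict` stands in for a Zed map (not a record, which would fall under the last rule).

```python
def derive(value):
    """Return the sequence that `over` would generate for one value (model)."""
    if isinstance(value, list):
        # An array generates each of its elements.
        return list(value)
    if isinstance(value, dict):
        # A map generates one {key, value} record per entry.
        return [{"key": k, "value": v} for k, v in value.items()]
    # Any other value generates a single value equal to itself.
    return [value]


from_array = derive([1, 2, 3])
from_map = derive({"a": 1, "b": 2})
from_scalar = derive("hello")
```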

Records can be converted to maps with the [_flatten_ function](../functions/flatten.md)
Records can be converted to maps with the [`flatten` function](../functions/flatten.md)
Review comment from the PR author:

Records can be converted to maps...

Strictly speaking, it doesn't seem like flatten converts them to maps. Rather, it generates a sequence of values similar to what over does with maps, but even then, there's still the difference that flatten creates an array for the path of the record keys whereas over on a map outputs the keys as is. Does this bother anyone? It doesn't quite bother me enough to insist on a major expansion of this text. However, user confusion about maps has come up a few times (e.g., the subtle difference between true Zed map types and what I've been calling "JSON-compatible pseudo-maps" (#4565)) which is why I perked up when I saw a possible oversimplified reference to "maps" here.

Anyone have reactions?

resulting in a map that can be traversed,
e.g., if `this` is a record, it can be traversed with `over flatten(this)`.

The nested subquery depicted as `<lateral>` is called a "lateral query" as the
outer query operates on the top-level sequence of values while the lateral
query operates on subsequences of values derived from each input value.
This pattern rhymes with the SQL pattern of a "lateral join", which runs a
SQL subquery for each row of the outer query's table.

In a Zed lateral query, each input value induces a derived subsequence and
for each such input, the lateral query runs to completion and yields its results.
In this way, operators like `sort` and `summarize`, which operate on their
entire input, run to completion for each subsequence and yield to the output the
lateral result set for each outer input as a sequence of values.

Within the lateral query, `this` refers to the values of the subsequence thereby
preventing lateral expressions from accessing the outer `this`.
To accommodate such references, the _over_ operator includes a _with_ clause
that binds arbitrary expressions evaluated in the outer scope
to variables that may be referenced by name in the lateral scope.

> Note that any such variable definitions override implied field references
> of `this`. If a both a field named "x" and a variable named "x" need be
> referenced in the lateral scope, the field reference should be qualified as `this.x`
> while the variable is referenced simply as `x`.

Lateral queries may be nested to arbitrary depth and accesses to variables
in parent lateral query bodies follows lexical scoping.
The nested subquery depicted as `<lateral>` is called a [lateral subquery](../lateral-subqueries.md).

### Examples
