Add more "where" coverage in the summarize doc #5316

philrz · 2024-10-03T22:48:25Z

What's Changing

More examples are being proposed in the user-facing doc for the summarize operator to show some subtleties of combining where filtering with an aggregation.

Why

As part of benchmarking work, I was recently converting some SQL queries to their Zed equivalents and came across the effects shown in these examples. I'm not certain if SQL users learning Zed might be tripped up by the same, but I figure it can't hurt to call it out in the docs just in case.

Details

Here's a separate example I showed to the team at a group sync using the attached sample.csv data.

In essence, I can see that it's possible in both SQL and Zed to create an aggregation result that includes what I'll call "empty buckets":

	D select _path,count(*) filter (where len(_path) < 4) from 'sample.csv' group by _path;
	┌──────────────┬──────────────────────────────────────────────┐
	│    _path     │ count_star() FILTER (WHERE (len(_path) < 4)) │
	│   varchar    │                    int64                     │
	├──────────────┼──────────────────────────────────────────────┤
	│ conn         │                                            0 │
	│ files        │                                            0 │
	│ capture_loss │                                            0 │
	│ dns          │                                            2 │
	│ weird        │                                            0 │
	│ stats        │                                            0 │
	│ x509         │                                            0 │
	│ ssl          │                                            1 │
	└──────────────┴──────────────────────────────────────────────┘
	
	$ zq -i csv 'count() where len(_path) < 4 by _path' sample.csv
	{_path:"dns",count:2(uint64)}
	{_path:"weird",count:0(uint64)}
	{_path:"capture_loss",count:0(uint64)}
	{_path:"stats",count:0(uint64)}
	{_path:"conn",count:0(uint64)}
	{_path:"files",count:0(uint64)}
	{_path:"x509",count:0(uint64)}
	{_path:"ssl",count:1(uint64)}

Likewise, I can also create results in both SQL and Zed without the empty buckets:

	D select _path,count(*) from 'sample.csv' where len(_path) < 4 group by _path;
	┌─────────┬──────────────┐
	│  _path  │ count_star() │
	│ varchar │    int64     │
	├─────────┼──────────────┤
	│ ssl     │            1 │
	│ dns     │            2 │
	└─────────┴──────────────┘
	
	$ zq -i csv 'len(_path) < 4 | count() by _path' sample.csv
	{_path:"ssl",count:1(uint64)}
	{_path:"dns",count:2(uint64)}

Here's my concern, though. I expect SQL users are accustomed to seeing the pattern SELECT... [aggregate function(s)]... GROUP BY as "an aggregation", so when such a user comes to learn Zed, they may look for a similar pattern and see summarize... [aggregate function(s)]... BY as an equivalent way to express "an aggregation". And since the where filtering in the SQL sits in the middle of "an aggregation", I suspect they may try putting the where in the middle of the summarize in Zed. But that would give them the "empty buckets" behavior, which they might not expect. Getting the "without empty buckets" behavior in Zed requires moving the filter to a separate pipeline element before the summarize, so this seems like something they'll want to know early in their learning of Zed.
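
To make the difference concrete, here's a rough Python model of the two behaviors. It's only a sketch of the semantics as I understand them from the outputs above, not how zq or the SQL engine implements them. The rows are a made-up stand-in for sample.csv, and the function names are hypothetical.

```python
# Hypothetical stand-in rows; the real sample.csv is not reproduced here.
rows = [
    {"_path": "dns"},
    {"_path": "dns"},
    {"_path": "ssl"},
    {"_path": "conn"},
    {"_path": "files"},
]


def count_where_inside(rows, pred):
    """Model of `count() where <pred> by _path`.

    Every group key seen in the input gets a bucket, but only rows
    that pass pred are counted, so some buckets stay at 0.
    """
    counts = {}
    for row in rows:
        key = row["_path"]
        counts.setdefault(key, 0)
        if pred(row):
            counts[key] += 1
    return counts


def count_filter_first(rows, pred):
    """Model of `<pred> | count() by _path`.

    Rows are dropped before grouping, so keys with no passing rows
    never get a bucket at all.
    """
    counts = {}
    for row in rows:
        if pred(row):
            key = row["_path"]
            counts[key] = counts.get(key, 0) + 1
    return counts


def short_path(row):
    # Same predicate as the examples above: len(_path) < 4
    return len(row["_path"]) < 4


inside = count_where_inside(rows, short_path)
before = count_filter_first(rows, short_path)
print(inside)  # includes "conn" and "files" with a count of 0
print(before)  # only "dns" and "ssl"
```

The only difference between the two functions is whether the bucket gets created before or after the predicate is checked, which is the same thing that moving the filter ahead of the summarize does.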

nwt · 2024-10-07T20:11:49Z

docs/language/operators/summarize.md

+Results are included for `by` groupings that generate null results when `where`
+filters are used inside `summarize`.


"`where` clause" is used everywhere else. And the other example descriptions end with a colon.

Suggested change

-Results are included for `by` groupings that generate null results when `where`
-filters are used inside `summarize`.
+Results are included for `by` groupings that generate null results when `where`
+clauses are used inside `summarize`:

nwt · 2024-10-07T20:14:21Z

docs/language/operators/summarize.md

+{key:"foo",sum:null}
+```
+
+To avoid null results for `by` groupings like just shown, filter before `summarize`.


Suggested change

-To avoid null results for `by` groupings like just shown, filter before `summarize`.
+To avoid null results for `by` groupings as just shown, filter before `summarize`:
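
For reference, the null case discussed in this thread can be modeled the same way in Python. This is a sketch under assumptions: the input records and the `key == "bar"` predicate are hypothetical (only the `{key:"foo",sum:null}` output line appears in the diff above), and `None` stands in for Zed's null.

```python
# Hypothetical input; only the {key:"foo",sum:null} output is from the diff.
rows = [
    {"key": "foo", "val": 1},
    {"key": "bar", "val": 2},
]


def sum_where_inside(rows, pred):
    """Model of a sum with a `where` clause inside summarize.

    Each key gets a bucket; a bucket with no passing rows stays None.
    """
    totals = {}
    for row in rows:
        key = row["key"]
        totals.setdefault(key, None)
        if pred(row):
            current = totals[key]
            totals[key] = row["val"] if current is None else current + row["val"]
    return totals


def sum_filter_first(rows, pred):
    """Model of filtering before summarize: no bucket for dropped keys."""
    totals = {}
    for row in rows:
        if pred(row):
            totals[row["key"]] = totals.get(row["key"], 0) + row["val"]
    return totals


def is_bar(row):
    return row["key"] == "bar"


with_null = sum_where_inside(rows, is_bar)
without_null = sum_filter_first(rows, is_bar)
print(with_null)
print(without_null)
```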

Add more "where" coverage in the summarize doc

7dc614a

philrz self-assigned this Oct 3, 2024

nwt approved these changes Oct 7, 2024

PR feedback: filter->clause, use colons for examples

30ee306

philrz added skip-autoperf skip-notify-downstream labels Oct 7, 2024

philrz merged commit b0853e6 into main Oct 7, 2024
4 checks passed

philrz deleted the summarize-where branch October 7, 2024 20:29

philrz mentioned this pull request Oct 21, 2024

Add count() docs examples about absent/zero output #5359

Merged
