Add more "where" coverage in the summarize doc #5316

Merged
merged 2 commits into from
Oct 7, 2024


@philrz philrz commented Oct 3, 2024

What's Changing

More examples are being proposed in the user-facing doc for the summarize operator to show some subtleties of combining `where` filtering with an aggregation.

Why

As part of benchmarking work, I was recently converting some SQL queries to their Zed equivalents and came across the effects shown in these examples. I'm not certain if SQL users learning Zed might be tripped up by the same, but I figure it can't hurt to call it out in the docs just in case.

Details

Here's a separate example I showed to the team at a group sync using the attached sample.csv data.

In essence, I can see that it's possible in both SQL and Zed to create an aggregation result that includes what I'll call "empty buckets":

	D select _path,count(*) filter (where len(_path) < 4) from 'sample.csv' group by _path;
	┌──────────────┬──────────────────────────────────────────────┐
	│    _path     │ count_star() FILTER (WHERE (len(_path) < 4)) │
	│   varchar    │                    int64                     │
	├──────────────┼──────────────────────────────────────────────┤
	│ conn         │                                            0 │
	│ files        │                                            0 │
	│ capture_loss │                                            0 │
	│ dns          │                                            2 │
	│ weird        │                                            0 │
	│ stats        │                                            0 │
	│ x509         │                                            0 │
	│ ssl          │                                            1 │
	└──────────────┴──────────────────────────────────────────────┘
	
	$ zq -i csv 'count() where len(_path) < 4 by _path' sample.csv
	{_path:"dns",count:2(uint64)}
	{_path:"weird",count:0(uint64)}
	{_path:"capture_loss",count:0(uint64)}
	{_path:"stats",count:0(uint64)}
	{_path:"conn",count:0(uint64)}
	{_path:"files",count:0(uint64)}
	{_path:"x509",count:0(uint64)}
	{_path:"ssl",count:1(uint64)}

Likewise, I can also create results in both SQL and Zed without the empty buckets:

	D select _path,count(*) from 'sample.csv' where len(_path) < 4 group by _path;
	┌─────────┬──────────────┐
	│  _path  │ count_star() │
	│ varchar │    int64     │
	├─────────┼──────────────┤
	│ ssl     │            1 │
	│ dns     │            2 │
	└─────────┴──────────────┘
	
	$ zq -i csv 'len(_path) < 4 | count() by _path' sample.csv
	{_path:"ssl",count:1(uint64)}
	{_path:"dns",count:2(uint64)}

Here's my concern, though. I expect SQL users are accustomed to seeing the pattern SELECT... [aggregate function(s)]... GROUP BY as "an aggregation", so when such a user comes to learn Zed, they may look for a similar pattern and see summarize... [aggregate function(s)]... BY as the equivalent way to express "an aggregation". And since in SQL the where filtering happens in the middle of "an aggregation", I suspect they may try putting the where in the middle of the summarize in Zed. But that would give them the "empty buckets" behavior, which they might not expect. Since getting the "without empty buckets" behavior in Zed requires moving the filter to a separate pipeline element before the summarize, this seems like something they'll want to know early in their learning of Zed.
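For anyone who wants to reason about the difference outside of SQL or Zed, here is a rough Python model of the two behaviors. It is only a sketch: the rows are hypothetical stand-ins rather than the contents of sample.csv, and the function names `count_where_inside` and `count_filter_first` are made up for illustration. It mirrors what the outputs above show and makes no claim about how either engine is implemented.

```python
from collections import Counter

# Hypothetical rows standing in for sample.csv (not the real data).
rows = [
    {"_path": "dns"},
    {"_path": "dns"},
    {"_path": "ssl"},
    {"_path": "conn"},
    {"_path": "weird"},
]


def keep(row):
    """The filter used in the examples above: len(_path) < 4."""
    return len(row["_path"]) < 4


def count_where_inside(rows):
    """Models `count() where len(_path) < 4 by _path`.

    Every group key produces a bucket; the filter only decides
    whether a row contributes to that bucket's count.
    """
    counts = {}
    for row in rows:
        counts.setdefault(row["_path"], 0)  # bucket exists even if nothing passes
        if keep(row):
            counts[row["_path"]] += 1
    return counts


def count_filter_first(rows):
    """Models `len(_path) < 4 | count() by _path`.

    Rows are dropped before grouping, so keys with no surviving
    rows never produce a bucket.
    """
    return dict(Counter(row["_path"] for row in rows if keep(row)))


print(count_where_inside(rows))  # {'dns': 2, 'ssl': 1, 'conn': 0, 'weird': 0}
print(count_filter_first(rows))  # {'dns': 2, 'ssl': 1}
```

The only difference between the two functions is whether the filter runs before or after the group key is registered, which is the same distinction as placing the filter inside or ahead of `summarize`.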

@philrz philrz self-assigned this Oct 3, 2024
Comment on lines 131 to 132
Results are included for `by` groupings that generate null results when `where`
filters are used inside `summarize`.
Member

"`where` clause" is used everywhere else. And the other example descriptions end with a colon.

Suggested change
Results are included for `by` groupings that generate null results when `where`
filters are used inside `summarize`.
Results are included for `by` groupings that generate null results when `where`
clauses are used inside `summarize`:

{key:"foo",sum:null}
```

To avoid null results for `by` groupings like just shown, filter before `summarize`.
Member

Suggested change
To avoid null results for `by` groupings like just shown, filter before `summarize`.
To avoid null results for `by` groupings as just shown, filter before `summarize`:

@philrz philrz merged commit b0853e6 into main Oct 7, 2024
4 checks passed
@philrz philrz deleted the summarize-where branch October 7, 2024 20:29