Merge pull request #3578 from vespa-engine/kkraune-patch-1

link to sample app + typos / grammar
vespa-engine · Jan 10, 2025 · 87a84a7
2 parents 40879c1 + fee94eb
commit 87a84a7
Showing 1 changed file with 42 additions and 38 deletions.
diff --git a/en/grouping.html b/en/grouping.html
@@ -96,11 +96,11 @@
 <br/>
 
 <p>
-  The Vespa grouping language is a list processing language
-  which describes how the query hits should be grouped, aggregated and presented in result sets.
+  The Vespa grouping language is a list-processing language
+  which describes how the query hits should be grouped, aggregated, and presented in result sets.
   A grouping statement takes the list of all matches to a query as input and groups/aggregates
   it, possibly in multiple nested and parallel ways to produce the output.
-  This is a logical specification, and does not indicate how it is executed,
+  This is a logical specification and does not indicate how it is executed,
   as instantiating the list of all matches to the query somewhere would be too expensive,
   and execution is distributed instead.
 </p>
@@ -111,6 +111,10 @@
   Fields used in grouping must be defined as <a href="attributes.html">attribute</a> in the document schema.
   Grouping supports continuation objects for <a href="#pagination">pagination</a>.
 </p>
+<p>
+  The <a href="https://github.com/vespa-engine/sample-apps/tree/master/examples/part-purchases-demo">Grouping Results</a>
+  sample application is a practical example.
+</p>
 
 
 
@@ -122,17 +126,17 @@ <h2 id="the-grouping-language-structure">The grouping language structure</h2>
   <li><code>all(statement)</code>: Execute the nested statement once on the input list as a whole.</li>
   <li><code>each(statement)</code>: Execute the nested statement on each element of the input list.</li>
   <li><code>group(specification)</code>:
-    Turn the input list into a list of list according to the grouping specification.</li>
+    Turn the input list into a list of lists according to the grouping specification.</li>
   <li><code>output</code>: Output some value(s) at the current location in the structure.</li>
 </ul>
 <p>
   The parallel and nested collection of these operations defines both the structure of the computation
   and of the result it produces.
   For example, <code>all(group(customer) each(output(count())))</code>
-  will take all matches, group them by customer id, and for each group output the count of hits in the group.
+  will take all matches, group them by customer id, and for each group, output the count of hits in the group.
 </p>
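As a rough illustration of the logical semantics only (not of how Vespa executes the statement), `all(group(customer) each(output(count())))` can be mimicked in Python over an in-memory list of hits. The `hits` data and the `group_and_count` helper are hypothetical names made up for this sketch.

```python
from collections import Counter

def group_and_count(hits, field):
    """Group hits by `field` and return {group value: number of hits}."""
    return dict(Counter(hit[field] for hit in hits))

# Hypothetical matches to a query; real hits come from the content nodes.
hits = [
    {"customer": "Smith", "price": 10},
    {"customer": "Jones", "price": 20},
    {"customer": "Smith", "price": 5},
]

counts = group_and_count(hits, "customer")
print(counts)
```

The sketch materializes the full hit list in one place, which is exactly what the distributed execution avoids.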
 <p>
-  Vespa distributes and executes the grouping program on content nodes, and merges results on container nodes -
+  Vespa distributes and executes the grouping program on content nodes and merges results on container nodes -
   in multiple phases, as needed.
   As realizing such programs over a distributed data set requires more network round-trips than a regular search query,
   these queries may be more expensive than regular queries -
@@ -423,7 +427,7 @@ <h2 id="basic-grouping">Basic Grouping</h2>
 <h2 id="ordering-and-limiting-groups">Ordering and Limiting Groups</h2>
 <p>
 In many scenarios, a large collection of groups is produced, possibly too large to display or process.
-This is handled by ordering groups, then limiting, the number of groups to return.
+This is handled by ordering groups, then limiting the number of groups to return.
 </p><p>
 The <code>order</code> clause accepts a list of one or more expressions.
 Each argument to <code>order</code> is prefixed by a plus or a minus, for ascending or descending order respectively.
@@ -435,7 +439,7 @@ <h2 id="ordering-and-limiting-groups">Ordering and Limiting Groups</h2>
 <p>
   An implicit limit can be specified through the
   <a href="reference/query-api-reference.html#grouping.defaultmaxgroups">grouping.defaultMaxGroups</a> query parameter.
-  This value will always be overridden if <code>max</code> explicitly specified in the query.
+  This value will always be overridden if <code>max</code> is explicitly specified in the query.
   Use <code>max(inf)</code> to retrieve all groups when the query parameter is set.
 </p>
 <p>
@@ -449,7 +453,7 @@ <h2 id="ordering-and-limiting-groups">Ordering and Limiting Groups</h2>
   many samples are needed to fetch from each node in order to get the right groups.
   This is the <code>precision</code>.
   An initial factor of 3 has proven to be quite good in most use cases.
-  If however the data for customer 'Jones' was spread on 3 different content nodes,
+  If, however, the data for customer 'Jones' was spread across 3 different content nodes,
   'Jones' might be among the 2 best on only one node.
   But based on the distribution of the data,
   we have concluded by earlier tests that if we fetch 5.67 times as many groups as we need to,
@@ -459,7 +463,7 @@ <h2 id="ordering-and-limiting-groups">Ordering and Limiting Groups</h2>
 <p>
 However, there is one exception.
 Without an <code>order</code> constraint, <code>precision</code> is not required.
-Then local ordering will be the same as global ordering.
+Then, local ordering will be the same as global ordering.
 Ordering will not change after a merge operation.
 </p>
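The role of `precision` when groups are ordered can be sketched with a toy simulation in Python. This is a simplified model and an assumption about the general idea, not Vespa's implementation: each simulated content node returns only its best `max_groups * precision` groups by local sum, and the container merges whatever it receives. All function names and data are hypothetical.

```python
from collections import Counter

def node_top_groups(local_sums, limit):
    """Return the `limit` groups with the highest local sum on one node."""
    return dict(Counter(local_sums).most_common(limit))

def merged_top_groups(nodes, max_groups, precision):
    """Merge per-node partial results and keep the best `max_groups` groups."""
    merged = Counter()
    for local_sums in nodes:
        # Each node only ships its locally best groups to the container.
        merged.update(node_top_groups(local_sums, max_groups * precision))
    return [name for name, _ in merged.most_common(max_groups)]

# 'Jones' is never the best group on any single node,
# but has the highest sum across all nodes (3 * 40 = 120).
nodes = [
    {"Jones": 40, "Smith": 52, "Brown": 45},
    {"Jones": 40, "Lee": 51, "Kim": 45},
    {"Jones": 40, "Wong": 50, "Diaz": 45},
]

low = merged_top_groups(nodes, max_groups=1, precision=1)
high = merged_top_groups(nodes, max_groups=1, precision=3)
print(low, high)
```

With the low setting the globally best group is lost before the merge; fetching more groups per node recovers it at the cost of more data to transfer and merge.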
 <h3 id="ordering-and-limiting-groups-example">Example</h3>
@@ -499,7 +503,7 @@ <h2 id="hits-per-group">Hits per Group</h2>
 <p>
   An implicit limit can be specified through the
   <a href="reference/query-api-reference.html#grouping.defaultmaxhits">grouping.defaultMaxHits</a> query parameter.
-  This value will always be overridden if <code>max</code> explicitly specified in the query.
+  This value will always be overridden if <code>max</code> is explicitly specified in the query.
   Use <code>max(inf)</code> to retrieve all hits when the query parameter is set.
 </p>
 
@@ -714,7 +718,7 @@ <h3 id="global-limit-combining">Combining with default limits for groups/hits</h
   The <code>grouping.globalMaxGroups</code> restriction will utilize the
   <a href="reference/query-api-reference.html#grouping.defaultmaxgroups">grouping.defaultMaxGroups</a>/
   <a href="reference/query-api-reference.html#grouping.defaultmaxhits">grouping.defaultMaxHits</a>
-  values for grouping statements without a <code>max</code>. The two queries below are identical assuming
+  values for grouping statements without a <code>max</code>. The two queries below are identical, assuming
   <code>defaultMaxGroups=5</code> and <code>defaultMaxHits=7</code>, and both will be rejected when
   <code>globalMaxGroups &lt; 5+5*7</code>.
 </p>
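The bound `5+5*7` appears to count the groups plus the hits under each group. A small Python sketch of that arithmetic, under that assumed reading; `worst_case_size` is a hypothetical helper, not part of any Vespa API.

```python
def worst_case_size(max_groups, max_hits):
    """Groups returned plus the hits returned under every group."""
    return max_groups + max_groups * max_hits

# defaultMaxGroups=5 and defaultMaxHits=7, as in the text above.
needed = worst_case_size(max_groups=5, max_hits=7)
print(needed)  # 40
```

Under this reading, a `globalMaxGroups` value below 40 would cause both queries to be rejected.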
@@ -732,7 +736,7 @@ <h3 id="global-limit-combining">Combining with default limits for groups/hits</h
 <p>
   A grouping without <code>max</code> combined with <code>defaultMaxGroups=-1</code>/<code>defaultMaxHits=-1</code>
   will be rejected unless <code>globalMaxGroups=-1</code>. This is because the query produces an unbounded result,
-  infinite number of groups if <code>defaultMaxGroups=-1</code> or infinite number of summaries if
+  an infinite number of groups if <code>defaultMaxGroups=-1</code> or an infinite number of summaries if
   <code>defaultMaxHits=-1</code>.
   An unintentional DoS (Denial of Service) could be the ultimate consequence if a query returns thousands of groups and summaries.
   This is why setting <code>globalMaxGroups=-1</code> is risky.
@@ -757,18 +761,18 @@ <h3 id="global-limit-recommendation">Recommended settings</h3>
 
 <h2 id="performance-and-correctness">Performance and Correctness</h2>
 <p>
-  Grouping is by default tuned to favour performance over correctness.
+  Grouping is, by default, tuned to favor performance over correctness.
   Perfect correctness may not be achievable; results of queries using <a href="#ordering-and-limiting-groups">non-default ordering</a>
-  can be approximate and correctness can only be partially achieved by a larger <code>precision</code> value that sacrifices performance.
+  can be approximate, and correctness can only be partially achieved by a larger <code>precision</code> value that sacrifices performance.
 </p>
 <p>
   The <a href="reference/grouping-syntax.html#grouping-session-cache">grouping session cache</a> is enabled by default.
   Disabling it will improve correctness, especially for queries using <code>order</code> and <code>max</code>.
-  The cost of multi-level grouping expressions will though increase.
+  The cost of multi-level grouping expressions will increase, though.
 </p>
 <p>
   Consider increasing the <a href="#ordering-and-limiting-groups">precision</a> value when using <code>max</code> in combination with <code>order</code>.
-  The default precision may not achieve the required correctness for your use-case.
+  The default precision may not achieve the required correctness for your use case.
 </p>
 
 <h2 id="nested-groups">Nested Groups</h2>
@@ -1442,12 +1446,12 @@ <h2 id="nested-groups">Nested Groups</h2>
 
 <h2 id="structured-grouping">Structured grouping</h2>
 <p>
-Structured grouping is nested grouping over array of struct or maps.
+Structured grouping is nested grouping over an array of structs or maps.
 </p>
 
 <h2 id="range-grouping">Range grouping</h2>
 <p>
-In examples above, results are grouped on distinct values, like customer or date.
+In the examples above, results are grouped on distinct values, like customer or date.
 To group on price:
 </p>
 <pre>
@@ -1557,7 +1561,7 @@ <h2 id="pagination">Pagination</h2>
 </pre>
 <p>
   The <code>continuations</code> annotation is an ordered list of continuation strings.
-  These are combined by replacement,
+  These are combined by replacement
   so that a continuation given later will replace any shared state with a continuation given before.
   Also, when using the <code>continuations</code> annotation,
   always pass the <em>this</em>-continuation as its first element.
@@ -1567,7 +1571,7 @@ <h2 id="pagination">Pagination</h2>
 <a href='reference/grouping-syntax.html#order'>ordering</a>.
 Adding a tie-breaker might be needed - like <a href='reference/rank-features.html#random'>random.match</a>
 or a random double value stored in each document -
-to keep the ordering stable in case of multiple documents that would otherwise get the same rank score,
+to keep the ordering stable in case of multiple documents that would otherwise get the same rank score
 or the same value used for ordering."%}
 
 
@@ -1587,7 +1591,7 @@ <h2 id="expressions">Expressions</h2>
     <li>Concatenation of the results of sub-expressions</li>
 </ul>
 <p>
-  Sum the prices of purchases on per-hour-of-day basis:
+  Sum the prices of purchases on a per-hour-of-day basis:
 </p>
 <pre>
 select (&hellip;) | all(group(mod(div(date,mul(60,60)),24)) each(output(sum(price))))
@@ -1654,7 +1658,7 @@ <h2 id="expressions">Expressions</h2>
 </table>
 <p>
 Note that the validity of an expression depends on the current nesting level.
-E.g. while <code>sum(price)</code> would be a valid expression for a group of hits, <code>price</code> would not.
+For example, while <code>sum(price)</code> would be a valid expression for a group of hits, <code>price</code> would not.
 As a general rule, each operator within an expression either applies to a single hit or aggregates values across a group.
 </p>
 
@@ -1706,7 +1710,7 @@ <h2 id="more-examples">More examples</h2>
 
 <h3 id="topn-full-corpus">TopN / Full corpus</h3>
 <p>
-Simple grouping, count the number of documents in each group:
+Simple grouping that counts the number of documents in each group:
 </p>
 <pre>all( group(a) each(output(count())) )</pre>
 <p>Two parallel groupings:</p>
@@ -1747,13 +1751,13 @@ <h3 id="ordering-groups">Ordering groups</h3>
 
 <h3 id="collecting-aggregates">Collecting aggregates</h3>
 <p>
-Simple grouping to count number of documents in each group and return the best hit in each group:
+Simple grouping to count the number of documents in each group and return the best hit in each group:
 </p>
 <pre>all( group(a) each(max(1) each(output(summary()))) )</pre>
 <p>Also return the sum of attribute "b":</p>
 <pre>all( group(a) each(max(1) output(count(), sum(b)) each(output(summary()))) )</pre>
 <p>
-  Also return an XOR of the 64 most significant bits of an MD5
+  Also, return an XOR of the 64 most significant bits of an MD5
   over the concatenation of attributes "a", "b" and "c":
 </p>
 <pre>all(group(a) each(max(1) output(count(), sum(b), xor(md5(cat(a, b, c), 64)))
@@ -1762,14 +1766,14 @@ <h3 id="collecting-aggregates">Collecting aggregates</h3>
 
 <h3 id="grouping">Grouping</h3>
 <p>
-  Single level grouping on "a" attribute, returning at most 5 groups with full hit count as well as the 69 best hits.
+  Single-level grouping on the "a" attribute, returning at most 5 groups with full hit count as well as the 69 best hits.
 </p>
 <pre>all( group(a) max(5) each(max(69) output(count()) each(output(summary()))) )</pre>
 <p>Two-level grouping on the "a" and "b" attributes:</p>
 <pre>all( group(a) max(5) each(output(count())
      all(group(b) max(5) each(max(69) output(count())
          each(output(summary()))))) )</pre>
-<p>Three level grouping on "a", "b" and "c" attribute:</p>
+<p>Three-level grouping on the "a", "b" and "c" attributes:</p>
 <pre>all( group(a) max(5) each(output(count())
      all(group(b) max(5) each(output(count())
          all(group(c) max(5) each(max(69) output(count())
@@ -1833,10 +1837,10 @@ <h3 id="time-and-date">Time and date</h3>
 
 <h3 id="counting-unique-groups">Counting unique groups</h3>
 <p>
-  The <code>count</code> aggregator can be applied on list of groups to determine the number of unique groups
+  The <code>count</code> aggregator can be applied on a list of groups to determine the number of unique groups
   without having to explicitly retrieve all groups.
   Note that this count is an estimate using HyperLogLog++ which is an algorithm for the count-distinct problem.
-  To get an accurate count one needs to explicitly retrieve all groups
+  To get an accurate count, one needs to explicitly retrieve all groups
   and count them in a custom component or in the middle tier calling out to Vespa.
   This is network intensive and might not be feasible in cases with many unique groups.
 </p>
@@ -1860,11 +1864,11 @@ <h3 id="counting-unique-groups">Counting unique groups</h3>
   The query outputs the sum for the 3 best groups.
   The <code>count</code> clause outputs the estimated number of groups (potentially &gt;3).
   The <code>count</code> becomes an estimate here as the number of groups is limited by max,
-  while in the above example it's not limited by max:
+  while in the above example, it's not limited by max:
 </p>
 <pre>all( group(a) max(3) output(count()) each(output(sum(b))) )</pre>
 <p>
-  Output the number of top level groups, and for the 10 best groups,
+  Output the number of top-level groups, and for the 10 best groups,
   output the number of unique values for attribute "b":
 </p>
 <pre>all( group(a) max(10) output(count()) each(group(b) output(count())) )</pre>
@@ -1898,11 +1902,11 @@ <h3 id="impression-forecasting">Impression forecasting</h3>
 can be used to figure out how many times a given advertisement would have been shown to this particular user.
 </p><p>
 So if the rank score is 0.420 for a specific user/ad/bid combination,
-then <code>interpolatedlookup(impressions,relevance())</code> would return 5.0.
+then <code>interpolatedlookup(impressions, relevance())</code> would return 5.0.
 If the bid is increased so the rank score gets to 0.490,
 it would get 5.5 as the return value instead.
 </p><p>
-In this context a count of 5.5 isn't meaningful for the past of a single user,
+In this context, a count of 5.5 isn't meaningful for the past of a single user,
 but it gives more information that may be used as a forecast.
 Summing this across more, different users may then be used to forecast
 the total of future impressions for the advertisement.
@@ -1911,7 +1915,7 @@ <h3 id="impression-forecasting">Impression forecasting</h3>
 
 <h3 id="aggregating-over-all-documents">Aggregating over all documents</h3>
 <p>
-  Grouping is useful to analyse data.
+  Grouping is useful for analyzing data.
   To aggregate over the full document set, create <em>one</em> group (which will have <em>all</em> documents)
   by using a constant (here 1) - example:
 </p>
@@ -1920,7 +1924,7 @@ <h3 id="aggregating-over-all-documents">Aggregating over all documents</h3>
   hits=0 \
   ranking=unranked
 </pre>
-<p>Make sure all documents have a value for the given field, if not, NaN is used and the final result is also NaN:</p>
+<p>Make sure all documents have a value for the given field; if not, NaN is used, and the final result is also NaN:</p>
 <pre>{% highlight json %}
 {
     "id": "group:long:1",
@@ -1937,7 +1941,7 @@ <h3 id="count-fields-with-nan">Count fields with NaN</h3>
 <p>
   Count the number of documents missing a value for an <a href="/en/attributes.html">attribute</a> field
   (actually, in this example, unset or less than 0, see the bucket expression below).
-  Set a higher query timeout just in case.
+  Set a higher query timeout, just in case.
   Example, analyzing a field called <em>rating</em>:
 </p>
 <pre>
@@ -2009,7 +2013,7 @@ <h3 id="list-fields-with-nan">List fields with NaN</h3>
   </li>
   <li>
     Results can grow large when listing all fields.
-    To limit to the information required, define a summary class like the below::
+    To limit the output to the required information, define a summary class like the one below:
 <pre>
 document-summary name_only {
     summary name {}