From dca54943d8f4606756c441ebf3872c687b6d2ddf Mon Sep 17 00:00:00 2001
From: Ani <bloodraven166@gmail.com>
Date: Sat, 6 Jan 2024 05:37:12 -0800
Subject: [PATCH] Improvements to the introductory vignette (#5836)

* Added my improvements to the intro vignette

* Removed two extra lines I had added by mistake earlier

* Requested changes
---
 vignettes/datatable-intro.Rmd | 80 +++++++++++++++++------------------
 1 file changed, 39 insertions(+), 41 deletions(-)

diff --git a/vignettes/datatable-intro.Rmd b/vignettes/datatable-intro.Rmd
index 3624a7c5b..81d4ca131 100644
--- a/vignettes/datatable-intro.Rmd
+++ b/vignettes/datatable-intro.Rmd
@@ -21,19 +21,19 @@ knitr::opts_chunk$set(
 .old.th = setDTthreads(1)
 ```
 
-This vignette introduces the `data.table` syntax, its general form, how to *subset* rows, *select and compute* on columns, and perform aggregations *by group*. Familiarity with `data.frame` data structure from base R is useful, but not essential to follow this vignette.
+This vignette introduces the `data.table` syntax, its general form, how to *subset* rows, *select and compute* on columns, and perform aggregations *by group*. Familiarity with the `data.frame` data structure from base R is useful, but not essential to follow this vignette.
 
 ***
 
 ## Data analysis using `data.table`
 
-Data manipulation operations such as *subset*, *group*, *update*, *join* etc., are all inherently related. Keeping these *related operations together* allows for:
+Data manipulation operations such as *subset*, *group*, *update*, *join*, etc. are all inherently related. Keeping these *related operations together* allows for:
 
 * *concise* and *consistent* syntax irrespective of the set of operations you would like to perform to achieve your end goal.
 
 * performing analysis *fluidly* without the cognitive burden of having to map each operation to a particular function from a potentially huge set of functions available before performing the analysis.
 
-* *automatically* optimising operations internally, and very effectively, by knowing precisely the data required for each operation, leading to very fast and memory efficient code.
+* *automatically* optimising operations internally and very effectively by knowing precisely the data required for each operation, leading to very fast and memory-efficient code.
 
 Briefly, if you are interested in reducing *programming* and *compute* time tremendously, then this package is for you. The philosophy that `data.table` adheres to makes this possible. Our goal is to illustrate it through this series of vignettes.
 
@@ -58,13 +58,13 @@ flights
 dim(flights)
 ```
 
-Aside: `fread` accepts `http` and `https` URLs directly as well as operating system commands such as `sed` and `awk` output. See `?fread` for examples.
+Aside: `fread` accepts `http` and `https` URLs directly, as well as operating system commands such as `sed` and `awk` output. See `?fread` for examples.
 
 ## Introduction
 
 In this vignette, we will
 
-1. Start with basics - what is a `data.table`, its general form, how to *subset* rows, how to *select and compute* on columns;
+1. Start with the basics - what is a `data.table`, its general form, how to *subset* rows, how to *select and compute* on columns;
 
 2. Then we will look at performing data aggregations by group
 
@@ -72,7 +72,7 @@ In this vignette, we will
 
 ### a) What is `data.table`? {#what-is-datatable-1a}
 
-`data.table` is an R package that provides **an enhanced version** of `data.frame`s, which are the standard data structure for storing data in `base` R. In the [Data](#data) section above, we already created a `data.table` using `fread()`. We can also create one using the `data.table()` function. Here is an example:
+`data.table` is an R package that provides **an enhanced version** of a `data.frame`, the standard data structure for storing data in `base` R. In the [Data](#data) section above, we saw how to create a `data.table` using `fread()`; we can also create one using the `data.table()` function. Here is an example:
 
 ```{r}
 DT = data.table(
@@ -85,13 +85,13 @@ DT
 class(DT$ID)
 ```
 
-You can also convert existing objects to a `data.table` using `setDT()` (for `data.frame`s and `list`s) and `as.data.table()` (for other structures); the difference is beyond the scope of this vignette, see `?setDT` and `?as.data.table` for more details.
+You can also convert existing objects to a `data.table` using `setDT()` (for `data.frame` and `list` structures) or `as.data.table()` (for other structures). The difference between the two is beyond the scope of this vignette; see `?setDT` and `?as.data.table` for more details.
 
 #### Note that:
 
 * Row numbers are printed with a `:` in order to visually separate the row number from the first column.
 
-* When the number of rows to print exceeds the global option `datatable.print.nrows` (default = `r getOption("datatable.print.nrows")`), it automatically prints only the top 5 and bottom 5 rows (as can be seen in the [Data](#data) section). If you've had a lot of experience with `data.frame`s, you may have found yourself waiting around while larger tables print-and-page, sometimes seemingly endlessly. You can query the default number like so:
+* When the number of rows to print exceeds the global option `datatable.print.nrows` (default = `r getOption("datatable.print.nrows")`), it automatically prints only the top 5 and bottom 5 rows (as can be seen in the [Data](#data) section). With a large `data.frame`, you may have found yourself waiting around while the table prints and pages, sometimes seemingly endlessly. This restriction helps with that, and you can query the default number like so:
 
     ```{.r}
     getOption("datatable.print.nrows")
@@ -101,7 +101,7 @@ You can also convert existing objects to a `data.table` using `setDT()` (for `da
 
 ### b) General form - in what way is a `data.table` *enhanced*? {#enhanced-1b}
 
-In contrast to a `data.frame`, you can do *a lot more* than just subsetting rows and selecting columns within the frame of a `data.table`, i.e., within `[ ... ]` (NB: we might also refer to writing things inside `DT[...]` as "querying `DT`", in analogy to SQL). To understand it we will have to first look at the *general form* of `data.table` syntax, as shown below:
+In contrast to a `data.frame`, you can do *a lot more* than just subsetting rows and selecting columns within the frame of a `data.table`, i.e., within `[ ... ]` (NB: we might also refer to writing things inside `DT[...]` as "querying `DT`", by analogy with SQL). To understand it we will have to first look at the *general form* of the `data.table` syntax, as shown below:
 
 ```{r eval = FALSE}
 DT[i, j, by]
@@ -110,7 +110,7 @@ DT[i, j, by]
 ## SQL:  where | order by   select | update  group by
 ```
 
-Users who have an SQL background might perhaps immediately relate to this syntax.
+Users with an SQL background might perhaps immediately relate to this syntax.
 
 #### The way to read it (out loud) is:
 
@@ -131,7 +131,7 @@ head(ans)
 
 * The *row indices* that satisfy the condition `origin == "JFK" & month == 6L` are computed, and since there is nothing else left to do, all columns from `flights` at rows corresponding to those *row indices* are simply returned as a `data.table`.
 
-* A comma after the condition in `i` is not required. But `flights[origin == "JFK" & month == 6L, ]` would work just fine. In `data.frame`s, however, the comma is necessary.
+* A comma after the condition in `i` is not required. But `flights[origin == "JFK" & month == 6L, ]` would work just fine. In a `data.frame`, however, the comma is necessary.
 
 #### -- Get the first two rows from `flights`. {#subset-rows-integer}
 
@@ -153,9 +153,9 @@ head(ans)
 
 #### `order()` is internally optimised
 
-* We can use "-" on a `character` columns within the frame of a `data.table` to sort in decreasing order.
+* We can use "-" on `character` columns within the frame of a `data.table` to sort in decreasing order.
 
-* In addition, `order(...)` within the frame of a `data.table` uses `data.table`'s internal fast radix order `forder()`. This sort provided such a compelling improvement over R's `base::order` that the R project adopted the `data.table` algorithm as its default sort in 2016 for R 3.3.0, see `?sort` and the [R Release NEWS](https://cran.r-project.org/doc/manuals/r-release/NEWS.pdf).
+* In addition, `order(...)` within the frame of a `data.table` uses `data.table`'s internal fast radix order `forder()`. This sort provided such a compelling improvement over R's `base::order` that the R project adopted the `data.table` algorithm as its default sort in 2016 for R 3.3.0 (for reference, check `?sort` and the [R Release NEWS](https://cran.r-project.org/doc/manuals/r-release/NEWS.pdf)).
 
 We will discuss `data.table`'s fast order in more detail in the *`data.table` internals* vignette.
 
@@ -168,7 +168,7 @@ ans <- flights[, arr_delay]
 head(ans)
 ```
 
-* Since columns can be referred to as if they are variables within the frame of `data.table`s, we directly refer to the *variable* we want to subset. Since we want *all the rows*, we simply skip `i`.
+* Since columns can be referred to as if they are variables within the frame of a `data.table`, we directly refer to the *variable* we want to subset. Since we want *all the rows*, we simply skip `i`.
 
 * It returns *all* the rows for the column `arr_delay`.
 
@@ -183,7 +183,7 @@ head(ans)
 
 * `data.table` also allows wrapping columns with `.()` instead of `list()`. It is an *alias* to `list()`; they both mean the same. Feel free to use whichever you prefer; we have noticed most users seem to prefer `.()` for conciseness, so we will continue to use `.()` hereafter.
 
-`data.table`s (and `data.frame`s) are internally `list`s as well, with the stipulation that each element has the same length and the `list` has a `class` attribute. Allowing `j` to return a `list` enables converting and returning `data.table` very efficiently.
+A `data.table` (and a `data.frame` too) is internally a `list` as well, with the stipulation that each element has the same length and the `list` has a `class` attribute. Allowing `j` to return a `list` enables converting and returning `data.table` very efficiently.
 
 #### Tip: {#tip-1}
 
@@ -210,8 +210,6 @@ ans <- flights[, .(delay_arr = arr_delay, delay_dep = dep_delay)]
 head(ans)
 ```
 
-That's it.
-
 ### e) Compute or *do* in `j`
 
 #### -- How many trips have had total delay < 0?
@@ -239,7 +237,7 @@ ans
 
 * Now, we look at `j` and find that it uses only *two columns*. And what we have to do is to compute their `mean()`. Therefore we subset just those columns corresponding to the matching rows, and compute their `mean()`.
 
-Because the three main components of the query (`i`, `j` and `by`) are *together* inside `[...]`, `data.table` can see all three and optimise the query altogether *before evaluation*, not each separately. We are able to therefore avoid the entire subset (i.e., subsetting the columns _besides_ `arr_delay` and `dep_delay`), for both speed and memory efficiency.
+Because the three main components of the query (`i`, `j` and `by`) are *together* inside `[...]`, `data.table` can see all three and optimise the query altogether *before evaluation*, rather than optimising each separately. We are therefore able to avoid the entire subset (i.e., subsetting the columns _besides_ `arr_delay` and `dep_delay`), for both speed and memory efficiency.
 
 #### -- How many trips have been made in 2014 from "JFK" airport in the month of June?
 
@@ -248,7 +246,7 @@ ans <- flights[origin == "JFK" & month == 6L, length(dest)]
 ans
 ```
 
-The function `length()` requires an input argument. We just needed to compute the number of rows in the subset. We could have used any other column as input argument to `length()` really. This approach is reminiscent of `SELECT COUNT(dest) FROM flights WHERE origin = 'JFK' AND month = 6` in SQL.
+The function `length()` requires an input argument. We just need to compute the number of rows in the subset. We could have used any other column as the input argument to `length()`. This approach is reminiscent of `SELECT COUNT(dest) FROM flights WHERE origin = 'JFK' AND month = 6` in SQL.
 
 This type of operation occurs quite frequently, especially while grouping (as we will see in the next section), to the point where `data.table` provides a *special symbol* `.N` for it.
 
@@ -256,7 +254,7 @@ This type of operation occurs quite frequently, especially while grouping (as we
 
 `.N` is a special built-in variable that holds the number of observations _in the current group_. It is particularly useful when combined with `by` as we'll see in the next section. In the absence of group by operations, it simply returns the number of rows in the subset.
 
-So we can now accomplish the same task by using `.N` as follows:
+We can now accomplish the same task by using `.N` as follows:
 
 ```{r}
 ans <- flights[origin == "JFK" & month == 6L, .N]
@@ -273,7 +271,7 @@ We could have accomplished the same operation by doing `nrow(flights[origin == "
 
 ### g) Great! But how can I refer to columns by names in `j` (like in a `data.frame`)? {#refer_j}
 
-If you're writing out the column names explicitly, there's no difference vis-a-vis `data.frame` (since v1.9.8).
+If you're writing out the column names explicitly, there's no difference compared to a `data.frame` (since v1.9.8).
 
 #### -- Select both `arr_delay` and `dep_delay` columns the `data.frame` way.
 
@@ -291,7 +289,7 @@ select_cols = c("arr_delay", "dep_delay")
 flights[ , ..select_cols]
 ```
 
-For those familiar with the Unix terminal, the `..` prefix should be reminiscent of the "up-one-level" command, which is analogous to what's happening here -- the `..` signals to `data.table` to look for the `select_cols` variable "up-one-level", i.e., in the global environment in this case.
+For those familiar with the Unix terminal, the `..` prefix should be reminiscent of the "up-one-level" command, which is analogous to what's happening here -- the `..` signals to `data.table` to look for the `select_cols` variable "up-one-level", i.e., within the global environment in this case.
 
 #### -- Select columns named in a variable using `with = FALSE`
 
@@ -362,11 +360,11 @@ ans
 
 * We know `.N` [is a special variable](#special-N) that holds the number of rows in the current group. Grouping by `origin` obtains the number of rows, `.N`, for each group.
 
-* By doing `head(flights)` you can see that the origin airports occur in the order *"JFK"*, *"LGA"* and *"EWR"*. The original order of grouping variables is preserved in the result. _This is important to keep in mind!_
+* By doing `head(flights)` you can see that the origin airports occur in the order *"JFK"*, *"LGA"*, and *"EWR"*. The original order of grouping variables is preserved in the result. _This is important to keep in mind!_
 
-* Since we did not provide a name for the column returned in `j`, it was named `N`  automatically by recognising the special symbol `.N`.
+* Since we did not provide a name for the column returned in `j`, it was named `N` automatically by recognising the special symbol `.N`.
 
-* `by` also accepts a character vector of column names. This is particularly useful for  coding programmatically, e.g., designing a function with the grouping columns as a (`character` vector) function argument.
+* `by` also accepts a character vector of column names. This is particularly useful for coding programmatically, e.g., when designing a function that takes the grouping columns (as a `character` vector) as an argument.
 
 * When there's only one column or expression to refer to in `j` and `by`, we can drop the `.()` notation. This is purely for convenience. We could instead do:
 
@@ -430,7 +428,7 @@ ans <- flights[carrier == "AA",
 ans
 ```
 
-* All we did was to change `by` to `keyby`. This automatically orders the result by the grouping variables in increasing order. In fact, due to the internal implementation of `by` first requiring a sort before recovering the original table's order, `keyby` is typically faster than `by` because it doesn't require this second step.
+* All we did was change `by` to `keyby`. This automatically orders the result by the grouping variables in increasing order. In fact, due to the internal implementation of `by` first requiring a sort before recovering the original table's order, `keyby` is typically faster than `by` because it doesn't require this second step.
 
 **Keys:** Actually `keyby` does a little more than *just ordering*. It also *sets a key* after ordering by setting an `attribute` called `sorted`. 
 
@@ -453,7 +451,7 @@ ans <- ans[order(origin, -dest)]
 head(ans)
 ```
 
-* Recall that we can use `-` on a `character` column in `order()` within the frame of a `data.table`. This is possible to due `data.table`'s internal query optimisation.
+* Recall that we can use `-` on a `character` column in `order()` within the frame of a `data.table`. This is possible due to `data.table`'s internal query optimisation.
 
 * Also recall that `order(...)` with the frame of a `data.table` is *automatically optimised* to use `data.table`'s internal fast radix order `forder()` for speed. 
 
@@ -488,7 +486,7 @@ ans
 
 * The last row corresponds to `dep_delay > 0 = TRUE` and `arr_delay > 0 = FALSE`. We can see that `r flights[!is.na(arr_delay) & !is.na(dep_delay), .N, .(dep_delay>0, arr_delay>0)][, N[4L]]` flights started late but arrived early (or on time).
 
-* Note that we did not provide any names to `by-expression`. Therefore, names have been automatically assigned in the result. As with `j`, you can name these expressions as you would elements of any `list`, e.g. `DT[, .N, .(dep_delayed = dep_delay>0, arr_delayed = arr_delay>0)]`.
+* Note that we did not provide any names to `by-expression`. Therefore, names have been automatically assigned in the result. As with `j`, you can name these expressions as you would for elements of any `list`, e.g., `DT[, .N, .(dep_delayed = dep_delay>0, arr_delayed = arr_delay>0)]`.
 
 * You can provide other columns along with expressions, for example: `DT[, .N, by = .(a, b>0)]`.
 
@@ -498,11 +496,11 @@ ans
 
 It is of course not practical to have to type `mean(myCol)` for every column one by one. What if you had 100 columns to average `mean()`?
 
-How can we do this efficiently, concisely? To get there, refresh on [this tip](#tip-1) - *"As long as the `j`-expression returns a `list`, each element of the `list` will be converted to a column in the resulting `data.table`"*. Suppose we can refer to the *data subset for each group* as a variable *while grouping*, then we can loop through all the columns of that variable using the already- or soon-to-be-familiar base function `lapply()`. No new names to learn specific to `data.table`.
+How can we do this efficiently and concisely? To get there, refresh on [this tip](#tip-1) - *"As long as the `j`-expression returns a `list`, each element of the `list` will be converted to a column in the resulting `data.table`"*. If we can refer to the *data subset for each group* as a variable *while grouping*, we can then loop through all the columns of that variable using the already- or soon-to-be-familiar base function `lapply()`. No new names to learn specific to `data.table`.
 
 #### Special symbol `.SD`: {#special-SD}
 
-`data.table` provides a *special* symbol, called `.SD`. It stands for **S**ubset of **D**ata. It by itself is a `data.table` that holds the data for *the current group* defined using `by`.
+`data.table` provides a *special* symbol called `.SD`. It stands for **S**ubset of **D**ata. It by itself is a `data.table` that holds the data for *the current group* defined using `by`.
 
 Recall that a `data.table` is internally a `list` as well with all its columns of equal length.
 
@@ -530,7 +528,7 @@ DT[, lapply(.SD, mean), by = ID]
 
 * Since `lapply()` returns a `list`, so there is no need to wrap it with an additional `.()` (if necessary, refer to [this tip](#tip-1)).
 
-We are almost there. There is one little thing left to address. In our `flights` `data.table`, we only wanted to calculate the `mean()` of two columns `arr_delay` and `dep_delay`. But `.SD` would contain all the columns other than the grouping variables by default.
+We are almost there. There is one little thing left to address. In our `flights` `data.table`, we only wanted to calculate the `mean()` of the two columns `arr_delay` and `dep_delay`. But `.SD` would contain all the columns other than the grouping variables by default.
 
 #### -- How can we specify just the columns we would like to compute the `mean()` on?
 
@@ -538,7 +536,7 @@ We are almost there. There is one little thing left to address. In our `flights`
 
 Using the argument `.SDcols`. It accepts either column names or column indices. For example, `.SDcols = c("arr_delay", "dep_delay")` ensures that `.SD` contains only these two columns for each group.
 
-Similar to [part g)](#refer_j), you can also provide the columns to remove instead of columns to keep using `-` or `!` sign as well as select consecutive columns as `colA:colB` and deselect consecutive columns as `!(colA:colB)` or `-(colA:colB)`.
+Similar to [part g)](#refer_j), you can also specify the columns to remove instead of columns to keep using `-` or `!`. Additionally, you can select consecutive columns as `colA:colB` and deselect them as `!(colA:colB)` or `-(colA:colB)`.
 
 Now let us try to use `.SD` along with `.SDcols` to get the `mean()` of `arr_delay` and `dep_delay` columns grouped by `origin`, `dest` and `month`.
 
@@ -564,7 +562,7 @@ head(ans)
 
 ### g) Why keep `j` so flexible?
 
-So that we have a consistent syntax and keep using already existing (and familiar) base functions instead of learning new functions. To illustrate, let us use the `data.table` `DT` that we created at the very beginning under [What is a data.table?](#what-is-datatable-1a) section.
+So that we have a consistent syntax and keep using already existing (and familiar) base functions instead of learning new functions. To illustrate, let us use the `data.table` `DT` that we created at the very beginning under the section [What is a data.table?](#what-is-datatable-1a).
 
 #### -- How can we concatenate columns `a` and `b` for each group in `ID`?
 
@@ -582,18 +580,18 @@ DT[, .(val = list(c(a,b))), by = ID]
 
 * Here, we first concatenate the values with `c(a,b)` for each group, and wrap that with `list()`. So for each group, we return a list of all concatenated values.
 
-* Note those commas are for display only. A list column can contain any object in each cell, and in this example, each cell is itself a vector and some cells contain longer vectors than others.
+* Note that those commas are for display only. A list column can contain any object in each cell, and in this example, each cell is itself a vector and some cells contain longer vectors than others.
 
 Once you start internalising usage in `j`, you will realise how powerful the syntax can be. A very useful way to understand it is by playing around, with the help of `print()`.
 
 For example:
 
 ```{r}
-## (1) look at the difference between
-DT[, print(c(a,b)), by = ID]
+## look at the difference between
+DT[, print(c(a,b)), by = ID] # (1)
 
-## (2) and
-DT[, print(list(c(a,b))), by = ID]
+## and
+DT[, print(list(c(a,b))), by = ID] # (2)
 ```
 
 In (1), for each group, a vector is returned, with length = 6,4,2 here. However (2) returns a list of length 1 for each group, with its first element holding vectors of length 6,4,2. Therefore (1) results in a length of ` 6+4+2 = `r 6+4+2``, whereas (2) returns `1+1+1=`r 1+1+1``.
@@ -612,9 +610,9 @@ We have seen so far that,
 
 * We can subset rows similar to a `data.frame`- except you don't have to use `DT$` repetitively since columns within the frame of a `data.table` are seen as if they are *variables*.
 
-* We can also sort a `data.table` using `order()`, which internally uses `data.table`'s fast order for performance.
+* We can also sort a `data.table` using `order()`, which internally uses `data.table`'s fast order for better performance.
 
-We can do much more in `i` by keying a `data.table`, which allows blazing fast subsets and joins. We will see this in the *"Keys and fast binary search based subsets"* and *"Joins and rolling joins"* vignette.
+We can do much more in `i` by keying a `data.table`, which allows for blazing fast subsets and joins. We will see this in the *"Keys and fast binary search based subsets"* and *"Joins and rolling joins"* vignettes.
 
 #### Using `j`:
 
@@ -630,7 +628,7 @@ We can do much more in `i` by keying a `data.table`, which allows blazing fast s
 
 #### Using `by`:
 
-* Using `by`, we can group by columns by specifying a *list of columns* or a *character vector of column names* or even *expressions*. The flexibility of `j`, combined with `by` and `i` makes for a very powerful syntax.
+* Using `by`, we can group by columns by specifying a *list of columns* or a *character vector of column names* or even *expressions*. The flexibility of `j`, combined with `by` and `i`, makes for a very powerful syntax.
 
 * `by` can handle multiple columns and also *expressions*.