diff --git a/vignettes/datatable-benchmarking.Rmd b/vignettes/datatable-benchmarking.Rmd index da580764b..66788d46d 100644 --- a/vignettes/datatable-benchmarking.Rmd +++ b/vignettes/datatable-benchmarking.Rmd @@ -33,7 +33,7 @@ sudo lshw -class disk sudo hdparm -t /dev/sda ``` -When comparing `fread` to non-R solutions be aware that R requires values of character columns to be added to _R's global string cache_. This takes time when reading data but later operations benefit since the character strings have already been cached. Consequently as well timing isolated tasks (such as `fread` alone), it's a good idea to benchmark a pipeline of tasks such as reading data, computing operators and producing final output and report the total time of the pipeline. +When comparing `fread` to non-R solutions be aware that R requires values of character columns to be added to _R's global string cache_. This takes time when reading data but later operations benefit since the character strings have already been cached. Consequently, in addition to timing isolated tasks (such as `fread` alone), it's a good idea to benchmark the total time of an end-to-end pipeline of tasks such as reading data, manipulating it, and producing final output. # subset: threshold for index optimization on compound queries @@ -68,7 +68,7 @@ options(datatable.auto.index=TRUE) options(datatable.use.index=TRUE) ``` -- `use.index=FALSE` will force query not to use indices even if they exists, but existing keys are still used for optimization. +- `use.index=FALSE` will force the query not to use indices even if they exist, but existing keys are still used for optimization. - `auto.index=FALSE` disables building index automatically when doing subset on non-indexed data, but if indices were created before this option was set, or explicitly by calling `setindex` they still will be used for optimization. Two other options control optimization globally, including use of indices: @@ -77,27 +77,27 @@ options(datatable.optimize=2L) options(datatable.optimize=3L) ``` `options(datatable.optimize=2L)` will turn off optimization of subsets completely, while `options(datatable.optimize=3L)` will switch it back on. -Those options affects much more optimizations thus should not be used when only control of index is needed. Read more in `?datatable.optimize`. +Those options affect many more optimizations and thus should not be used when only control of indices is needed. Read more in `?datatable.optimize`. # _by reference_ operations -When benchmarking `set*` functions it make sense to measure only first run. Those functions updates data.table by reference thus in subsequent runs they get already processed `data.table` on input. +When benchmarking `set*` functions it only makes sense to measure the first run. These functions update their input by reference, so subsequent runs will use the already-processed `data.table`, biasing the results. Protecting your `data.table` from being updated by reference operations can be achieved using `copy` or `data.table:::shallow` functions. Be aware `copy` might be very expensive as it needs to duplicate whole object. It is unlikely we want to include duplication time in time of the actual task we are benchmarking. # try to benchmark atomic processes If your benchmark is meant to be published it will be much more insightful if you will split it to measure time of atomic processes. This way your readers can see how much time was spent on reading data from source, cleaning, actual transformation, exporting results. -Of course if your benchmark is meant to present _full workflow_ then it perfectly make sense to present total timing, still splitting timings might give good insight into bottlenecks in such workflow. -There are another cases when it might not be desired, for example when benchmarking _reading csv_, followed by _grouping_. R requires to populate _R's global string cache_ which adds extra overhead when importing character data to R session. On the other hand _global string cache_ might speed up processes like _grouping_. In such cases when comparing R to other languages it might be useful to include total timing. +Of course if your benchmark is meant to present to present an _end-to-end workflow_, then it makes perfect sense to present the overall timing. Nevertheless, separating out timing of individual steps is useful for understanding which steps are the main bottlenecks of a workflow. +There are other cases when atomic benchmarking might not be desirable, for example when _reading a csv_, followed by _grouping_. R requires populating _R's global string cache_ which adds extra overhead when importing character data to an R session. On the other hand, the _global string cache_ might speed up processes like _grouping_. In such cases when comparing R to other languages it might be useful to include total timing. # avoid class coercion -Unless this is what you truly want to measure you should prepare input objects for every tools you are benchmarking in expected class. +Unless this is what you truly want to measure you should prepare input objects of the expected class for every tool you are benchmarking. # avoid `microbenchmark(..., times=100)` -Repeating benchmarking many times usually does not fit well for data processing tools. Of course it perfectly make sense for more atomic calculations. It does not well represent use case for common data processing tasks, which rather consists of batches sequentially provided transformations, each run once. +Repeating a benchmark many times usually does not give the clearest picture for data processing tools. Of course, it makes perfect sense for more atomic calculations, but this is not a good representation of the most common way these tools will actually be used, namely for data processing tasks, which consist of batches of sequentially provided transformations, each run once. Matt once said: > I'm very wary of benchmarks measured in anything under 1 second. Much prefer 10 seconds or more for a single run, achieved by increasing data size. A repetition count of 500 is setting off alarm bells. 3-5 runs should be enough to convince on larger data. Call overhead and time to GC affect inferences at this very small scale. @@ -106,8 +106,8 @@ This is very valid. The smaller time measurement is the relatively bigger noise # multithreaded processing -One of the main factor that is likely to impact timings is number of threads in your machine. In recent versions of `data.table` some of the functions has been parallelized. -You can control how much threads you want to use with `setDTthreads`. +One of the main factors that is likely to impact timings is the number of threads available to your R session. In recent versions of `data.table`, some functions are parallelized. +You can control the number of threads you want to use with `setDTthreads`. ```r setDTthreads(0) # use all available cores (default) diff --git a/vignettes/datatable-importing.Rmd b/vignettes/datatable-importing.Rmd index c37cd6f75..f1d1e9257 100644 --- a/vignettes/datatable-importing.Rmd +++ b/vignettes/datatable-importing.Rmd @@ -21,11 +21,11 @@ Importing `data.table` is no different from importing other R packages. This vig ## Why to import `data.table` -One of the biggest features of `data.table` is its concise syntax which makes exploratory analysis faster and easier to write and perceive; this convenience can drive packages authors to use `data.table` in their own packages. Another maybe even more important reason is high performance. When outsourcing heavy computing tasks from your package to `data.table`, you usually get top performance without needing to re-invent any high of these numerical optimization tricks on your own. +One of the biggest features of `data.table` is its concise syntax which makes exploratory analysis faster and easier to write and perceive; this convenience can drive package authors to use `data.table`. Another, perhaps more important reason is high performance. When outsourcing heavy computing tasks from your package to `data.table`, you usually get top performance without needing to re-invent any of these numerical optimization tricks on your own. ## Importing `data.table` is easy -It is very easy to use `data.table` as a dependency due to the fact that `data.table` does not have any of its own dependencies. This statement is valid for both operating system dependencies and R dependencies. It means that if you have R installed on your machine, it already has everything needed to install `data.table`. This also means that adding `data.table` as a dependency of your package will not result in a chain of other recursive dependencies to install, making it very convenient for offline installation. +It is very easy to use `data.table` as a dependency due to the fact that `data.table` does not have any of its own dependencies. This applies both to operating system and to R dependencies. It means that if you have R installed on your machine, it already has everything needed to install `data.table`. It also means that adding `data.table` as a dependency of your package will not result in a chain of other recursive dependencies to install, making it very convenient for offline installation. ## `DESCRIPTION` file {DESCRIPTION} diff --git a/vignettes/datatable-intro.Rmd b/vignettes/datatable-intro.Rmd index 81d4ca131..c783c3fa7 100644 --- a/vignettes/datatable-intro.Rmd +++ b/vignettes/datatable-intro.Rmd @@ -39,7 +39,7 @@ Briefly, if you are interested in reducing *programming* and *compute* time trem ## Data {#data} -In this vignette, we will use [NYC-flights14](https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv) data obtained by [flights](https://github.com/arunsrinivasan/flights) package (available on GitHub only). It contains On-Time flights data from the Bureau of Transporation Statistics for all the flights that departed from New York City airports in 2014 (inspired by [nycflights13](https://github.com/tidyverse/nycflights13)). The data is available only for Jan-Oct'14. +In this vignette, we will use [NYC-flights14](https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv) data obtained from the [flights](https://github.com/arunsrinivasan/flights) package (available on GitHub only). It contains On-Time flights data from the Bureau of Transportation Statistics for all the flights that departed from New York City airports in 2014 (inspired by [nycflights13](https://github.com/tidyverse/nycflights13)). The data is available only for Jan-Oct'14. We can use `data.table`'s fast-and-friendly file reader `fread` to load `flights` directly as follows: @@ -179,7 +179,7 @@ ans <- flights[, list(arr_delay)] head(ans) ``` -* We wrap the *variables* (column names) within `list()`, which ensures that a `data.table` is returned. In case of a single column name, not wrapping with `list()` returns a vector instead, as seen in the [previous example](#select-j-1d). +* We wrap the *variables* (column names) within `list()`, which ensures that a `data.table` is returned. In the case of a single column name, not wrapping with `list()` returns a vector instead, as seen in the [previous example](#select-j-1d). * `data.table` also allows wrapping columns with `.()` instead of `list()`. It is an *alias* to `list()`; they both mean the same. Feel free to use whichever you prefer; we have noticed most users seem to prefer `.()` for conciseness, so we will continue to use `.()` hereafter. @@ -235,7 +235,7 @@ ans * We first subset in `i` to find matching *row indices* where `origin` airport equals `"JFK"`, and `month` equals `6L`. We *do not* subset the _entire_ `data.table` corresponding to those rows _yet_. -* Now, we look at `j` and find that it uses only *two columns*. And what we have to do is to compute their `mean()`. Therefore we subset just those columns corresponding to the matching rows, and compute their `mean()`. +* Now, we look at `j` and find that it uses only *two columns*. And what we have to do is to compute their `mean()`. Therefore, we subset just those columns corresponding to the matching rows, and compute their `mean()`. Because the three main components of the query (`i`, `j` and `by`) are *together* inside `[...]`, `data.table` can see all three and optimise the query altogether *before evaluation*, rather than optimizing each separately. We are able to therefore avoid the entire subset (i.e., subsetting the columns _besides_ `arr_delay` and `dep_delay`), for both speed and memory efficiency. @@ -263,9 +263,9 @@ ans * Once again, we subset in `i` to get the *row indices* where `origin` airport equals *"JFK"*, and `month` equals *6*. -* We see that `j` uses only `.N` and no other columns. Therefore the entire subset is not materialised. We simply return the number of rows in the subset (which is just the length of row indices). +* We see that `j` uses only `.N` and no other columns. Therefore, the entire subset is not materialised. We simply return the number of rows in the subset (which is just the length of row indices). -* Note that we did not wrap `.N` with `list()` or `.()`. Therefore a vector is returned. +* Note that we did not wrap `.N` with `list()` or `.()`. Therefore, a vector is returned. We could have accomplished the same operation by doing `nrow(flights[origin == "JFK" & month == 6L])`. However, it would have to subset the entire `data.table` first corresponding to the *row indices* in `i` *and then* return the rows using `nrow()`, which is unnecessary and inefficient. We will cover this and other optimisation aspects in detail under the *`data.table` design* vignette. @@ -311,7 +311,7 @@ DF[with(DF, x > 1), ] * Using `with()` in (2) allows using `DF`'s column `x` as if it were a variable. - Hence the argument name `with` in `data.table`. Setting `with = FALSE` disables the ability to refer to columns as if they are variables, thereby restoring the "`data.frame` mode". + Hence, the argument name `with` in `data.table`. Setting `with = FALSE` disables the ability to refer to columns as if they are variables, thereby restoring the "`data.frame` mode". * We can also *deselect* columns using `-` or `!`. For example: @@ -364,7 +364,7 @@ ans * Since we did not provide a name for the column returned in `j`, it was named `N` automatically by recognising the special symbol `.N`. -* `by` also accepts a character vector of column names. This is particularly useful for coding programmatically. For e.g., designing a function with the grouping columns (in the form of a `character` vector) as a function argument. +* `by` also accepts a character vector of column names. This is particularly useful for coding programmatically, e.g., designing a function with the grouping columns (in the form of a `character` vector) as a function argument. * When there's only one column or expression to refer to in `j` and `by`, we can drop the `.()` notation. This is purely for convenience. We could instead do: @@ -594,7 +594,7 @@ DT[, print(c(a,b)), by = ID] # (1) DT[, print(list(c(a,b))), by = ID] # (2) ``` -In (1), for each group, a vector is returned, with length = 6,4,2 here. However (2) returns a list of length 1 for each group, with its first element holding vectors of length 6,4,2. Therefore (1) results in a length of ` 6+4+2 = `r 6+4+2``, whereas (2) returns `1+1+1=`r 1+1+1``. +In (1), for each group, a vector is returned, with length = 6,4,2 here. However, (2) returns a list of length 1 for each group, with its first element holding vectors of length 6,4,2. Therefore, (1) results in a length of ` 6+4+2 = `r 6+4+2``, whereas (2) returns `1+1+1=`r 1+1+1``. ## Summary diff --git a/vignettes/datatable-keys-fast-subset.Rmd b/vignettes/datatable-keys-fast-subset.Rmd index e73b71b92..3ec50640c 100644 --- a/vignettes/datatable-keys-fast-subset.Rmd +++ b/vignettes/datatable-keys-fast-subset.Rmd @@ -20,7 +20,7 @@ knitr::opts_chunk$set( .old.th = setDTthreads(1) ``` -This vignette is aimed at those who are already familiar with *data.table* syntax, its general form, how to subset rows in `i`, select and compute on columns, add/modify/delete columns *by reference* in `j` and group by using `by`. If you're not familiar with these concepts, please read the *"Introduction to data.table"* and *"Reference semantics"* vignettes first. +This vignette is aimed at those who are already familiar with *data.table* syntax, its general form, how to subset rows in `i`, select and compute on columns, add/modify/delete columns *by reference* in `j` and group by using `by`. If you're not familiar with these concepts, please read the *"Introduction to data.table"* and *"Reference semantics"* vignettes first. *** @@ -147,7 +147,7 @@ head(flights) #### set* and `:=`: -In *data.table*, the `:=` operator and all the `set*` (e.g., `setkey`, `setorder`, `setnames` etc..) functions are the only ones which modify the input object *by reference*. +In *data.table*, the `:=` operator and all the `set*` (e.g., `setkey`, `setorder`, `setnames` etc.) functions are the only ones which modify the input object *by reference*. Once you *key* a *data.table* by certain columns, you can subset by querying those key columns using the `.()` notation in `i`. Recall that `.()` is an *alias to* `list()`. @@ -239,7 +239,7 @@ flights[.(unique(origin), "MIA")] #### What's happening here? -* Read [this](#multiple-key-point) again. The value provided for the second key column *"MIA"* has to find the matching values in `dest` key column *on the matching rows provided by the first key column `origin`*. We can not skip the values of key columns *before*. Therefore we provide *all* unique values from key column `origin`. +* Read [this](#multiple-key-point) again. The value provided for the second key column *"MIA"* has to find the matching values in `dest` key column *on the matching rows provided by the first key column `origin`*. We can not skip the values of key columns *before*. Therefore, we provide *all* unique values from key column `origin`. * *"MIA"* is automatically recycled to fit the length of `unique(origin)` which is *3*. @@ -308,7 +308,7 @@ key(flights) * And on those row indices, we replace the `key` column with the value `0`. -* Since we have replaced values on the *key* column, the *data.table* `flights` isn't sorted by `hour` any more. Therefore, the key has been automatically removed by setting to NULL. +* Since we have replaced values on the *key* column, the *data.table* `flights` isn't sorted by `hour` anymore. Therefore, the key has been automatically removed by setting to NULL. Now, there shouldn't be any *24* in the `hour` column. diff --git a/vignettes/datatable-programming.Rmd b/vignettes/datatable-programming.Rmd index 4e8a99879..0536f11d6 100644 --- a/vignettes/datatable-programming.Rmd +++ b/vignettes/datatable-programming.Rmd @@ -88,7 +88,7 @@ my_subset = function(data, col, val) { my_subset(iris, Species, "setosa") ``` -We have to use `deparse(substitute(...))` to catch the actual names of objects passed to function so we can construct the `subset` function call using those original names. Although ths provides unlimited flexibility with relatively low complexity, **use of `eval(parse(...))` should be avoided**. The main reasons are: +We have to use `deparse(substitute(...))` to catch the actual names of objects passed to function, so we can construct the `subset` function call using those original names. Although this provides unlimited flexibility with relatively low complexity, **use of `eval(parse(...))` should be avoided**. The main reasons are: - lack of syntax validation - [vulnerability to code injection](https://github.com/Rdatatable/data.table/issues/2655#issuecomment-376781159) @@ -151,7 +151,7 @@ substitute2( ) ``` -We can see in the output that both the functions names, as well as the names of the variables passed to those functions, have been replaced.. We used `substitute2` for convenience. In this simple case, base R's `substitute` could have been used as well, though it would've required usage of `lapply(env, as.name)`. +We can see in the output that both the functions names, as well as the names of the variables passed to those functions, have been replaced. We used `substitute2` for convenience. In this simple case, base R's `substitute` could have been used as well, though it would've required usage of `lapply(env, as.name)`. Now, to use substitution inside `[.data.table`, we don't need to call the `substitute2` function. As it is now being used internally, all we have to do is to provide `env` argument, the same way as we've provided it to the `substitute2` function in the example above. Substitution can be applied to the `i`, `j` and `by` (or `keyby`) arguments of the `[.data.table` method. Note that setting the `verbose` argument to `TRUE` can be used to print expressions after substitution is applied. This is very useful for debugging. diff --git a/vignettes/datatable-reshape.Rmd b/vignettes/datatable-reshape.Rmd index d282bc7de..d05cdb10c 100644 --- a/vignettes/datatable-reshape.Rmd +++ b/vignettes/datatable-reshape.Rmd @@ -133,7 +133,7 @@ Check `?dcast` for other useful arguments and additional examples. ## 2. Limitations in current `melt/dcast` approaches -So far we've seen features of `melt` and `dcast` that are implemented efficiently for `data.table`s, using internal `data.table` machinery (*fast radix ordering*, *binary search* etc..). +So far we've seen features of `melt` and `dcast` that are implemented efficiently for `data.table`s, using internal `data.table` machinery (*fast radix ordering*, *binary search* etc.). However, there are situations we might run into where the desired operation is not expressed in a straightforward manner. For example, consider the `data.table` shown below: @@ -162,7 +162,7 @@ str(DT.c1) ## gender column is character type now! #### Issues -1. What we wanted to do was to combine all the `dob` and `gender` type columns together respectively. Instead we are combining *everything* together, and then splitting them again. I think it's easy to see that it's quite roundabout (and inefficient). +1. What we wanted to do was to combine all the `dob` and `gender` type columns together respectively. Instead, we are combining *everything* together, and then splitting them again. I think it's easy to see that it's quite roundabout (and inefficient). As an analogy, imagine you've a closet with four shelves of clothes and you'd like to put together the clothes from shelves 1 and 2 together (in 1), and 3 and 4 together (in 3). What we are doing is more or less to combine all the clothes together, and then split them back on to shelves 1 and 3! diff --git a/vignettes/datatable-sd-usage.Rmd b/vignettes/datatable-sd-usage.Rmd index ae0b5a84a..5f0348e4f 100644 --- a/vignettes/datatable-sd-usage.Rmd +++ b/vignettes/datatable-sd-usage.Rmd @@ -197,7 +197,7 @@ Pitching[rank_in_team == 1, team_performance := Teams[.SD, Rank, on = c('teamID', 'yearID')]] ``` -Note that the `x[y]` syntax returns `nrow(y)` values (i.e., it's a right join), which is why `.SD` is on the right in `Teams[.SD]` (since the RHS of `:=` in this case requires `nrow(Pitching[rank_in_team == 1])` values. +Note that the `x[y]` syntax returns `nrow(y)` values (i.e., it's a right join), which is why `.SD` is on the right in `Teams[.SD]` (since the RHS of `:=` in this case requires `nrow(Pitching[rank_in_team == 1])` values). # Grouped `.SD` operations diff --git a/vignettes/datatable-secondary-indices-and-auto-indexing.Rmd b/vignettes/datatable-secondary-indices-and-auto-indexing.Rmd index ff50ba97e..7d631fe93 100644 --- a/vignettes/datatable-secondary-indices-and-auto-indexing.Rmd +++ b/vignettes/datatable-secondary-indices-and-auto-indexing.Rmd @@ -114,7 +114,7 @@ b) reordering the entire data.table, by reference, based on the order vector com # -Computing the order isn't the time consuming part, since data.table uses true radix sorting on integer, character and numeric vectors. However reordering the data.table could be time consuming (depending on the number of rows and columns). +Computing the order isn't the time consuming part, since data.table uses true radix sorting on integer, character and numeric vectors. However, reordering the data.table could be time consuming (depending on the number of rows and columns). Unless our task involves repeated subsetting on the same column, fast key based subsetting could effectively be nullified by the time to reorder, depending on our data.table dimensions. @@ -148,7 +148,7 @@ As we will see in the next section, the `on` argument provides several advantage * allows for a cleaner syntax by having the columns on which the subset is performed as part of the syntax. This makes the code easier to follow when looking at it at a later point. - Note that `on` argument can also be used on keyed subsets as well. In fact, we encourage to provide the `on` argument even when subsetting using keys for better readability. + Note that `on` argument can also be used on keyed subsets as well. In fact, we encourage providing the `on` argument even when subsetting using keys for better readability. # @@ -277,7 +277,7 @@ flights[.(c("LGA", "JFK", "EWR"), "XNA"), mult = "last", on = c("origin", "dest" ## 3. Auto indexing -First we looked at how to fast subset using binary search using *keys*. Then we figured out that we could improve performance even further and have more cleaner syntax by using secondary indices. +First we looked at how to fast subset using binary search using *keys*. Then we figured out that we could improve performance even further and have cleaner syntax by using secondary indices. That is what *auto indexing* does. At the moment, it is only implemented for binary operators `==` and `%in%`. An index is automatically created *and* saved as an attribute. That is, unlike the `on` argument which computes the index on the fly each time (unless one already exists), a secondary index is created here.