
Commit

review fixes
davidbudzynski committed Jun 5, 2022
1 parent 4975bfe commit fff9f7c
Showing 4 changed files with 16 additions and 16 deletions.
20 changes: 10 additions & 10 deletions vignettes/datatable-benchmarking.Rmd
@@ -24,7 +24,7 @@ sudo lshw -class disk
sudo hdparm -t /dev/sda
```

When comparing `fread` to non-R solutions be aware that R requires values of character columns to be added to _R's global string cache_. This takes time when reading data but later operations benefit since the character strings have already been cached. Consequently, as well timing isolated tasks (such as `fread` alone), it's a good idea to benchmark a pipeline of tasks such as reading data, computing operators and producing final output and report the total time of the pipeline.
When comparing `fread` to non-R solutions be aware that R requires values of character columns to be added to _R's global string cache_. This takes time when reading data but later operations benefit since the character strings have already been cached. Consequently, as well as timing isolated tasks (such as `fread` alone), it's a good idea to benchmark a pipeline of tasks such as reading data, computing operations and producing final output, and report the total time of the pipeline.

# subset: threshold for index optimization on compound queries

@@ -59,7 +59,7 @@ options(datatable.auto.index=TRUE)
options(datatable.use.index=TRUE)
```

- `use.index=FALSE` will force query not to use indices even if they exist, but existing keys are still used for optimization.
- `use.index=FALSE` will force the query not to use indices even if they exist, but existing keys are still used for optimization.
- `auto.index=FALSE` disables building index automatically when doing subset on non-indexed data, but if indices were created before this option was set, or explicitly by calling `setindex` they still will be used for optimization.
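As a hedged sketch of how these two options behave (the table and column names here are illustrative, not from the vignette):

```r
library(data.table)
DT = data.table(id = sample(letters, 1e5, TRUE), grp = sample(10L, 1e5, TRUE))

setindex(DT, id)                      # build an index explicitly
options(datatable.use.index = FALSE)
DT[id == "a"]                         # the "id" index is ignored (vector scan)
options(datatable.use.index = TRUE)

options(datatable.auto.index = FALSE)
DT[grp == 1L]                         # no index on "grp" is built as a side effect
indices(DT)                           # still just "id"
options(datatable.auto.index = TRUE)
```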

Two other options control optimization globally, including use of indices:
@@ -68,27 +68,27 @@ options(datatable.optimize=2L)
options(datatable.optimize=3L)
```
`options(datatable.optimize=2L)` will turn off optimization of subsets completely, while `options(datatable.optimize=3L)` will switch it back on.
Those options affect many more optimizations thus should not be used when only control of index is needed. Read more in `?datatable.optimize`.
Those options affect many more optimizations and thus should not be used when only control of index is needed. Read more in `?datatable.optimize`.

# _by reference_ operations

When benchmarking `set*` functions it makes sense to measure only first run. Those functions update data.table by reference thus in subsequent runs they get already processed `data.table` on input.
When benchmarking `set*` functions it makes sense to measure the first run only. These functions update their input by reference, so subsequent runs operate on an already-processed `data.table`, biasing the results.

Protecting your `data.table` from being updated by reference operations can be achieved using the `copy` or `data.table:::shallow` functions. Be aware that `copy` might be very expensive, as it needs to duplicate the whole object. It is unlikely we want to include duplication time in the timing of the actual task we are benchmarking.
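A minimal sketch of that pattern (object names are illustrative):

```r
library(data.table)
DT = data.table(x = sample(1e6))

input = copy(DT)                # duplication happens outside the timed region
system.time(setkey(input, x))   # first run: the real work

# A second run on the same object is much cheaper and not representative,
# because the data is already keyed by reference:
system.time(setkey(input, x))
```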

# try to benchmark atomic processes

If your benchmark is meant to be published it will be much more insightful if you split it up to measure the time of atomic processes. This way your readers can see how much time was spent on reading the data from source, cleaning, the actual transformation, and exporting the results.
Of course if your benchmark is meant to present _full workflow_ then it perfectly makes sense to present total timing, still splitting timings might give good insight into bottlenecks in such workflow.
There are another cases when it might not be desired, for example when benchmarking _reading csv_, followed by _grouping_. R requires populating _R's global string cache_ which adds extra overhead when importing character data to R session. On the other hand _global string cache_ might speed up processes like _grouping_. In such cases when comparing R to other languages it might be useful to include total timing.
Of course, if your benchmark is meant to present an _end-to-end workflow_, then it makes perfect sense to present the overall timing. Nevertheless, separating out the timings of individual steps is useful for understanding which steps are the main bottlenecks of a workflow.
There are other cases when it might not be desirable, for example when _reading a csv_, followed by _grouping_. R requires populating _R's global string cache_ which adds extra overhead when importing character data to an R session. On the other hand, the _global string cache_ might speed up processes like _grouping_. In such cases when comparing R to other languages it might be useful to include total timing.

# avoid class coercion

Unless this is what you truly want to measure you should prepare input objects for every tool you are benchmarking in expected class.
Unless this is what you truly want to measure, you should prepare input objects of the expected class for every tool you are benchmarking.
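For example, if the tool under test expects a `data.table`, coerce the input before the timed region rather than inside it (a sketch with illustrative names):

```r
library(data.table)
df = data.frame(g = sample(10L, 1e6, TRUE), v = runif(1e6))

dt = as.data.table(df)              # coercion kept outside the measurement
system.time(dt[, sum(v), by = g])   # times only the aggregation itself
```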

# avoid `microbenchmark(..., times=100)`

Repeating benchmarking many times usually does not fit well for data processing tools. Of course, it perfectly makes sense for more atomic calculations. It does not well represent use case for common data processing tasks, which rather consists of batches sequentially provided transformations, each run once.
Repeating a benchmark many times usually does not fit well for data processing tools. Of course, it makes perfect sense for more atomic calculations, but it is not a good representation of the common use case of data processing tasks, which consist of batches of sequentially provided transformations, each run once.
Matt once said:

> I'm very wary of benchmarks measured in anything under 1 second. Much prefer 10 seconds or more for a single run, achieved by increasing data size. A repetition count of 500 is setting off alarm bells. 3-5 runs should be enough to convince on larger data. Call overhead and time to GC affect inferences at this very small scale.
@@ -97,8 +97,8 @@ This is very valid. The smaller the time measurement, the bigger the relative noise
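Following that advice, a single large run can be sketched as follows (sizes are illustrative; increase `N` until one run takes seconds):

```r
library(data.table)
N = 1e8
DT = data.table(g = sample(1e5, N, TRUE), v = rnorm(N))
system.time(DT[, mean(v), by = g])   # one run on large data
# rather than, say, microbenchmark(DT[, mean(v), by = g], times = 100) on tiny data
```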

# multithreaded processing

One of the main factor that is likely to impact timings is a number of threads in your machine. In recent versions of `data.table` some functions have been parallelized.
You can control how many threads you want to use with `setDTthreads`.
One of the main factors likely to impact timings is the number of threads on your machine. In recent versions of `data.table` some functions have been parallelized.
You can control the number of threads you want to use with `setDTthreads`.

```r
setDTthreads(0) # use all available cores (default)
2 changes: 1 addition & 1 deletion vignettes/datatable-importing.Rmd
@@ -21,7 +21,7 @@ Importing `data.table` is no different from importing other R packages. This vig

## Why to import `data.table`

One of the biggest features of `data.table` is its concise syntax which makes exploratory analysis faster and easier to write and perceive; this convenience can drive packages authors to use `data.table` in their own packages. Another maybe even more important reason is high performance. When outsourcing heavy computing tasks from your package to `data.table`, you usually get top performance without needing to re-invent any high of these numerical optimization tricks on your own.
One of the biggest features of `data.table` is its concise syntax which makes exploratory analysis faster and easier to write and perceive; this convenience can drive package authors to use `data.table` in their own packages. Another perhaps more important reason is high performance. When outsourcing heavy computing tasks from your package to `data.table`, you usually get top performance without needing to re-invent any of these numerical optimization tricks on your own.

## Importing `data.table` is easy

6 changes: 3 additions & 3 deletions vignettes/datatable-intro.Rmd
@@ -38,7 +38,7 @@ Briefly, if you are interested in reducing *programming* and *compute* time trem

## Data {#data}

In this vignette, we will use [NYC-flights14](https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv) data obtained by [flights](https://github.com/arunsrinivasan/flights) package (available on GitHub only). It contains On-Time flights data from the Bureau of Transportation Statistics for all the flights that departed from New York City airports in 2014 (inspired by [nycflights13](https://github.com/tidyverse/nycflights13)). The data is available only for Jan-Oct'14.
In this vignette, we will use [NYC-flights14](https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv) data obtained from the [flights](https://github.com/arunsrinivasan/flights) package (available on GitHub only). It contains On-Time flights data from the Bureau of Transportation Statistics for all the flights that departed from New York City airports in 2014 (inspired by [nycflights13](https://github.com/tidyverse/nycflights13)). The data is available only for Jan-Oct'14.

We can use `data.table`'s fast-and-friendly file reader `fread` to load `flights` directly as follows:

@@ -185,7 +185,7 @@ head(ans)

#### {.bs-callout .bs-callout-info}

* We wrap the *variables* (column names) within `list()`, which ensures that a `data.table` is returned. In case of a single column name, not wrapping with `list()` returns a vector instead, as seen in the [previous example](#select-j-1d).
* We wrap the *variables* (column names) within `list()`, which ensures that a `data.table` is returned. In the case of a single column name, not wrapping with `list()` returns a vector instead, as seen in the [previous example](#select-j-1d).

* `data.table` also allows wrapping columns with `.()` instead of `list()`. It is an *alias* to `list()`; they both mean the same. Feel free to use whichever you prefer; we have noticed most users seem to prefer `.()` for conciseness, so we will continue to use `.()` hereafter.
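For instance, with the `flights` data used in this vignette, the two forms are interchangeable:

```r
flights[, .(arr_delay, dep_delay)]      # returns a data.table
flights[, list(arr_delay, dep_delay)]   # identical result
```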

@@ -477,7 +477,7 @@ head(ans)

#### {.bs-callout .bs-callout-info}

* Recall that we can use `-` on a `character` column in `order()` within the frame of a `data.table`. This is possible to due to `data.table`'s internal query optimisation.
* Recall that we can use `-` on a `character` column in `order()` within the frame of a `data.table`. This is possible due to `data.table`'s internal query optimisation.

* Also recall that `order(...)` within the frame of a `data.table` is *automatically optimised* to use `data.table`'s internal fast radix order `forder()` for speed.

4 changes: 2 additions & 2 deletions vignettes/datatable-keys-fast-subset.Rmd
@@ -108,7 +108,7 @@ Instead, in *data.tables* we set and use `keys`. Think of a `key` as **superchar

#### Keys and their properties {#key-properties}

1. We can set keys on *multiple columns* and the column can be of *different types* -- *integer*, *numeric*, *character*, *factor*, *integer64* etc. *list* and *complex* types are not supported yet.
1. We can set keys on *multiple columns* and the columns can be of *different types* -- *integer*, *numeric*, *character*, *factor*, *integer64* etc. *list* and *complex* types are not supported yet.

2. Uniqueness is not enforced, i.e., duplicate key values are allowed. Since rows are sorted by key, any duplicates in the key columns will appear consecutively.

@@ -146,7 +146,7 @@ head(flights)

#### set* and `:=`:

In *data.table*, the `:=` operator and all the `set*` (e.g., `setkey`, `setorder`, `setnames` etc...) functions are the only ones which modify the input object *by reference*.
In *data.table*, the `:=` operator and all the `set*` (e.g., `setkey`, `setorder`, `setnames` etc.) functions are the only ones which modify the input object *by reference*.

Once you *key* a *data.table* by certain columns, you can subset by querying those key columns using the `.()` notation in `i`. Recall that `.()` is an *alias to* `list()`.
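For example, using the `origin` column of the `flights` data from this vignette:

```r
setkey(flights, origin)
flights[.("JFK")]   # binary-search subset on the key column
# same rows as flights[origin == "JFK"], found via the key instead of a scan
```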

