Apply suggestions from code review benchmarking
Co-authored-by: Michael Chirico <[email protected]>
davidbudzynski and MichaelChirico authored Nov 3, 2023
1 parent 5b8b1dc commit 3088c0b
Showing 2 changed files with 3 additions and 3 deletions.
4 changes: 2 additions & 2 deletions vignettes/datatable-benchmarking.Rmd
@@ -80,7 +80,7 @@ Protecting your `data.table` from being updated by reference operations can be a

If your benchmark is meant to be published, it will be much more insightful if you split it up to measure the time of atomic processes. This way your readers can see how much time was spent on reading data from the source, cleaning, the actual transformation, and exporting the results.
Of course, if your benchmark is meant to present an _end-to-end workflow_, then it makes perfect sense to present the overall timing. Nevertheless, separating out timing of individual steps is useful for understanding which steps are the main bottlenecks of a workflow.
-There are another cases when it might not be desirable, for example when _reading a csv_, followed by _grouping_. R requires populating _R's global string cache_ which adds extra overhead when importing character data to an R session. On the other hand, the _global string cache_ might speed up processes like _grouping_. In such cases when comparing R to other languages it might be useful to include total timing.
+There are other cases when it might not be desirable, for example when _reading a csv_, followed by _grouping_. R requires populating _R's global string cache_ which adds extra overhead when importing character data to an R session. On the other hand, the _global string cache_ might speed up processes like _grouping_. In such cases when comparing R to other languages it might be useful to include total timing.
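
For illustration, a minimal sketch of timing atomic steps separately, assuming a hypothetical file `data.csv` with columns `id` and `v`:

```r
# a sketch of step-wise timing; "data.csv", id and v are hypothetical
library(data.table)
t_read  <- system.time(DT  <- fread("data.csv"))                  # reading
t_group <- system.time(res <- DT[, .(mean_v = mean(v)), by = id]) # transforming
t_write <- system.time(fwrite(res, "result.csv"))                 # exporting
rbind(read = t_read, group = t_group, write = t_write)            # report each step
```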

# avoid class coercion

@@ -97,7 +97,7 @@ This is very valid. The smaller time measurement is the relatively bigger noise

# multithreaded processing

-One of the main factors that is likely to impact timings is a number of threads in your machine. In recent versions of `data.table` some functions have been parallelized.
+One of the main factors that is likely to impact timings is the number of threads in your machine. In recent versions of `data.table`, some functions have been parallelized.
You can control the number of threads you want to use with `setDTthreads`.

```r
setDTthreads(0)    # use all available cores (default)
getDTthreads()     # check how many cores are currently used
```
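
A minimal sketch of how such timings can be compared across thread counts, assuming a hypothetical table `DT` with columns `id` and `v`:

```r
# a sketch comparing the same grouping at 1 thread vs. all threads;
# DT, id and v are hypothetical
library(data.table)
DT <- data.table(id = sample(1e5L, 1e7L, TRUE), v = rnorm(1e7))
setDTthreads(1)                        # force single-threaded run
t_single <- system.time(DT[, sum(v), by = id])
setDTthreads(0)                        # 0 = use all available cores
t_all <- system.time(DT[, sum(v), by = id])
rbind(single = t_single, all = t_all)  # compare the two timings
```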
2 changes: 1 addition & 1 deletion vignettes/datatable-importing.Rmd
@@ -21,7 +21,7 @@ Importing `data.table` is no different from importing other R packages. This vig

## Why import `data.table`

-One of the biggest features of `data.table` is its concise syntax which makes exploratory analysis faster and easier to write and perceive; this convenience can drive packages authors to use `data.table` in their own packages. Another perhaps more important reason is high performance. When outsourcing heavy computing tasks from your package to `data.table`, you usually get top performance without needing to re-invent any high of these numerical optimization tricks on your own.
+One of the biggest features of `data.table` is its concise syntax which makes exploratory analysis faster and easier to write and perceive; this convenience can drive package authors to use `data.table` in their own packages. Another, perhaps more important reason is high performance. When outsourcing heavy computing tasks from your package to `data.table`, you usually get top performance without needing to re-invent any of these numerical optimization tricks on your own.
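
A minimal sketch of what such outsourcing can look like, assuming a hypothetical helper `agg_by` whose name and arguments are illustrative only:

```r
# a hypothetical package function delegating heavy aggregation to data.table
agg_by <- function(df, by_col, val_col) {
  DT <- data.table::as.data.table(df)
  # by accepts a character vector of column names; get() resolves val_col in j
  DT[, .(total = sum(get(val_col))), by = by_col]
}

agg_by(mtcars, "cyl", "mpg")   # e.g. total mpg per cylinder count
```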

## Importing `data.table` is easy

