diff --git a/DESCRIPTION b/DESCRIPTION
index 405b7a009..6756db8ae 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -3,7 +3,7 @@ Version: 1.14.9
 Title: Extension of `data.frame`
 Depends: R (>= 3.1.0)
 Imports: methods
-Suggests: bit64 (>= 4.0.0), bit (>= 4.0.4), curl, R.utils, xts, zoo (>= 1.8-1), yaml, knitr, rmarkdown, markdown
+Suggests: bit64 (>= 4.0.0), bit (>= 4.0.4), curl, R.utils, xts, zoo (>= 1.8-1), yaml, knitr, markdown
 Description: Fast aggregation of large data (e.g. 100GB in RAM), fast ordered joins, fast add/modify/delete of columns by group using no copies at all, list columns, friendly and fast character-separated-value read/write. Offers a natural and flexible syntax, for faster development.
 License: MPL-2.0 | file LICENSE
 URL: https://r-datatable.com, https://Rdatatable.gitlab.io/data.table, https://github.com/Rdatatable/data.table
diff --git a/vignettes/css/toc.css b/vignettes/css/toc.css
new file mode 100644
index 000000000..86adaba5b
--- /dev/null
+++ b/vignettes/css/toc.css
@@ -0,0 +1,6 @@
+#TOC {
+  border: 1px solid #ccc;
+  border-radius: 5px;
+  padding-left: 1em;
+  background: #f6f6f6;
+}
diff --git a/vignettes/datatable-benchmarking.Rmd b/vignettes/datatable-benchmarking.Rmd
index 7614a27d5..da580764b 100644
--- a/vignettes/datatable-benchmarking.Rmd
+++ b/vignettes/datatable-benchmarking.Rmd
@@ -2,15 +2,24 @@
 title: "Benchmarking data.table"
 date: "`r Sys.Date()`"
 output:
-  rmarkdown::html_vignette:
-    toc: true
-    number_sections: true
+  markdown::html_format:
+    options:
+      toc: true
+      number_sections: true
+    meta:
+      css: [default, css/toc.css]
 vignette: >
   %\VignetteIndexEntry{Benchmarking data.table}
-  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEngine{knitr::knitr}
   \usepackage[utf8]{inputenc}
 ---
 
+
+
 This document is meant to guide on measuring performance of `data.table`. Single place to document best practices and traps to avoid.
 
 # fread: clear caches
diff --git a/vignettes/datatable-faq.Rmd b/vignettes/datatable-faq.Rmd
index 4b0645e6b..f1deaba78 100644
--- a/vignettes/datatable-faq.Rmd
+++ b/vignettes/datatable-faq.Rmd
@@ -2,12 +2,15 @@
 title: "Frequently Asked Questions about data.table"
 date: "`r Sys.Date()`"
 output:
-  rmarkdown::html_vignette:
-    toc: true
-    number_sections: true
+  markdown::html_format:
+    options:
+      toc: true
+      number_sections: true
+    meta:
+      css: [default, css/toc.css]
 vignette: >
   %\VignetteIndexEntry{Frequently Asked Questions about data.table}
-  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEngine{knitr::knitr}
   \usepackage[utf8]{inputenc}
 ---
 
@@ -94,13 +97,13 @@ As [highlighted above](#j-num), `j` in `[.data.table` is fundamentally different
 
 Furthermore, data.table _inherits_ from `data.frame`. It _is_ a `data.frame`, too. A data.table can be passed to any package that only accepts `data.frame` and that package can use `[.data.frame` syntax on the data.table. See [this answer](https://stackoverflow.com/a/10529888/403310) for how that is achieved.
 
-We _have_ proposed enhancements to R wherever possible, too. One of these was accepted as a new feature in R 2.12.0 :
+We _have_ proposed enhancements to R wherever possible, too. One of these was accepted as a new feature in R 2.12.0:
 
 > `unique()` and `match()` are now faster on character vectors where all elements are in the global CHARSXP cache and have unmarked encoding (ASCII). Thanks to Matt Dowle for suggesting improvements to the way the hash code is generated in unique.c.
 
 A second proposal was to use `memcpy` in duplicate.c, which is much faster than a for loop in C. This would improve the _way_ that R copies data internally (on some measures by 13 times). The thread on r-devel is [here](https://stat.ethz.ch/pipermail/r-devel/2010-April/057249.html).
 
-A third more significant proposal that was accepted is that R now uses data.table's radix sort code as from R 3.3.0 :
+A third more significant proposal that was accepted is that R now uses data.table's radix sort code as from R 3.3.0:
 
 > The radix sort algorithm and implementation from data.table (forder) replaces the previous radix (counting) sort and adds a new method for order(). Contributed by Matt Dowle and Arun Srinivasan, the new algorithm supports logical, integer (even with large values), real, and character vectors. It outperforms all other methods, but there are some caveats (see ?sort).
 
@@ -236,7 +239,7 @@ Then you are using a version prior to 1.5.3. Prior to 1.5.3 `[.data.table` detec
 
 ## What are the scoping rules for `j` expressions?
 
-Think of the subset as an environment where all the column names are variables. When a variable `foo` is used in the `j` of a query such as `X[Y, sum(foo)]`, `foo` is looked for in the following order :
+Think of the subset as an environment where all the column names are variables. When a variable `foo` is used in the `j` of a query such as `X[Y, sum(foo)]`, `foo` is looked for in the following order:
 
 1. The scope of `X`'s subset; _i.e._, `X`'s column names.
 2. The scope of each row of `Y`; _i.e._, `Y`'s column names (_join inherited scope_)
@@ -295,18 +298,18 @@ The `Z[Y]` part is not a single name so that is evaluated within the frame of `X
 
 ## Can you explain further why data.table is inspired by `A[B]` syntax in `base`?
 
-Consider `A[B]` syntax using an example matrix `A` :
+Consider `A[B]` syntax using an example matrix `A`:
 ```{r}
 A = matrix(1:12, nrow = 4)
 A
 ```
 
-To obtain cells `(1, 2) = 5` and `(3, 3) = 11` many users (we believe) may try this first :
+To obtain cells `(1, 2) = 5` and `(3, 3) = 11` many users (we believe) may try this first:
 ```{r}
 A[c(1, 3), c(2, 3)]
 ```
 
-However, this returns the union of those rows and columns. To reference the cells, a 2-column matrix is required. `?Extract` says :
+However, this returns the union of those rows and columns. To reference the cells, a 2-column matrix is required. `?Extract` says:
 
 > When indexing arrays by `[` a single argument `i` can be a matrix with as many columns as there are dimensions of `x`; the result is then a vector with elements corresponding to the sets of indices in each row of `i`.
 
@@ -354,7 +357,7 @@ Furthermore, matrices, especially sparse matrices, are often stored in a 3-colum
 data.table _inherits_ from `data.frame`. It _is_ a `data.frame`, too. A data.table _can_ be passed to any package that _only_ accepts `data.frame`. When that package uses `[.data.frame` syntax on the data.table, it works. It works because `[.data.table` looks to see where it was called from. If it was called from such a package, `[.data.table` diverts to `[.data.frame`.
 
 ## I've heard that data.table syntax is analogous to SQL.
-Yes :
+Yes:
 
 - `i` $\Leftrightarrow$ where
 - `j` $\Leftrightarrow$ select
@@ -367,7 +370,7 @@ Yes :
 - `mult = "first"|"last"` $\Leftrightarrow$ N/A because SQL is inherently unordered
 - `roll = TRUE` $\Leftrightarrow$ N/A because SQL is inherently unordered
 
-The general form is :
+The general form is:
 
 ```{r, eval = FALSE}
 DT[where, select|update, group by][order by][...] ... [...]
@@ -447,7 +450,7 @@ Many thanks to the R core team for fixing the issue in Sep 2019. data.table v1.1
 
 This comes up quite a lot but it's really earth-shatteringly simple. A function such as `merge` is _generic_ if it consists of a call to `UseMethod`. When you see people talking about whether or not functions are _generic_ functions they are merely typing the function without `()` afterwards, looking at the program code inside it and if they see a call to `UseMethod` then it is _generic_. What does `UseMethod` do? It literally slaps the function name together with the class of the first argument, separated by period (`.`) and then calls that function, passing along the same arguments. It's that simple. For example, `merge(X, Y)` contains a `UseMethod` call which means it then _dispatches_ (i.e. calls) `paste("merge", class(X), sep = ".")`. Functions with dots in their name may or may not be methods. The dot is irrelevant really, other than dot being the separator that `UseMethod` uses. Knowing this background should now highlight why, for example, it is obvious to R folk that `as.data.table.data.frame` is the `data.frame` method for the `as.data.table` generic function. Further, it may help to elucidate that, yes, you are correct, it is not obvious from its name alone that `ls.fit` is not the fit method of the `ls` generic function. You only know that by typing `ls` (not `ls()`) and observing it isn't a single call to `UseMethod`.
 
-You might now ask: where is this documented in R? Answer: it's quite clear, but, you need to first know to look in `?UseMethod` and _that_ help file contains :
+You might now ask: where is this documented in R? Answer: it's quite clear, but, you need to first know to look in `?UseMethod` and _that_ help file contains:
 
 > When a function calling `UseMethod('fun')` is applied to an object with class attribute `c('first', 'second')`, the system searches for a function called `fun.first` and, if it finds it, applies it to the object. If no such function is found a function called `fun.second` is tried. If no class name produces a suitable function, the function `fun.default` is used, if it exists, or an error results.
 
@@ -481,7 +484,7 @@ copied in bulk (`memcpy` in C) rather than looping in C.
 ## What are primary and secondary indexes in data.table?
 
 Manual: [`?setkey`](https://www.rdocumentation.org/packages/data.table/functions/setkey)
-S.O. : [What is the purpose of setting a key in data.table?](https://stackoverflow.com/questions/20039335/what-is-the-purpose-of-setting-a-key-in-data-table/20057411#20057411)
+S.O.: [What is the purpose of setting a key in data.table?](https://stackoverflow.com/questions/20039335/what-is-the-purpose-of-setting-a-key-in-data-table/20057411#20057411)
 
 `setkey(DT, col1, col2)` orders the rows by column `col1` then within each group of `col1` it orders by `col2`. This is a _primary index_. The row order is changed _by reference_ in RAM. Subsequent joins and groups on those key columns then take advantage of the sort order for efficiency. (Imagine how difficult looking for a phone number in a printed telephone directory would be if it wasn't sorted by surname then forename. That's literally all `setkey` does. It sorts the rows by the columns you specify.) The index doesn't use any RAM. It simply changes the row order in RAM and marks the key columns. Analogous to a _clustered index_ in SQL.
 
@@ -521,7 +524,7 @@ DT[ , { mySD = copy(.SD)
 
 Please upgrade to v1.8.1 or later. From this version, if `.N` is returned by `j` it is renamed to `N` to avoid any ambiguity in any subsequent grouping between the `.N` special variable and a column called `".N"`.
 
-The old behaviour can be reproduced by forcing `.N` to be called `.N`, like this :
+The old behaviour can be reproduced by forcing `.N` to be called `.N`, like this:
 ```{r}
 DT = data.table(a = c(1,1,2,2,2), b = c(1,2,2,2,1))
 DT
@@ -533,7 +536,7 @@ cat(try(
 
 If you are already running v1.8.1 or later then the error message is now more helpful than the "cannot change value of locked binding" error, as you can see above, since this vignette was produced using v1.8.1 or later.
 
-The more natural syntax now works :
+The more natural syntax now works:
 ```{r}
 if (packageVersion("data.table") >= "1.8.1") {
   DT[ , .N, by = list(a, b)][ , unique(N), by = a]
@@ -555,7 +558,7 @@ Hopefully, this is self explanatory. The full message is:
 
 Coerced numeric RHS to integer to match the column's type; may have truncated precision. Either change the column to numeric first by creating a new numeric vector length 5 (nrows of entire table) yourself and assigning that (i.e. 'replace' column), or coerce RHS to integer yourself (e.g. 1L or as.integer) to make your intent clear (and for speed). Or, set the column type correctly up front when you create the table and stick to it, please.
 
-To generate it, try :
+To generate it, try:
 
 ```{r}
 DT = data.table(a = 1:5, b = 1:5)
diff --git a/vignettes/datatable-importing.Rmd b/vignettes/datatable-importing.Rmd
index 41a3d629a..c37cd6f75 100644
--- a/vignettes/datatable-importing.Rmd
+++ b/vignettes/datatable-importing.Rmd
@@ -2,10 +2,10 @@
 title: "Importing data.table"
 date: "`r Sys.Date()`"
 output:
-  rmarkdown::html_vignette
+  markdown::html_format
 vignette: >
   %\VignetteIndexEntry{Importing data.table}
-  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEngine{knitr::knitr}
   \usepackage[utf8]{inputenc}
 ---
 
diff --git a/vignettes/datatable-intro.Rmd b/vignettes/datatable-intro.Rmd
index 3a5eda34c..5bd36437a 100644
--- a/vignettes/datatable-intro.Rmd
+++ b/vignettes/datatable-intro.Rmd
@@ -2,10 +2,10 @@
 title: "Introduction to data.table"
 date: "`r Sys.Date()`"
 output:
-  rmarkdown::html_vignette
+  markdown::html_format
 vignette: >
   %\VignetteIndexEntry{Introduction to data.table}
-  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEngine{knitr::knitr}
   \usepackage[utf8]{inputenc}
 ---
 
diff --git a/vignettes/datatable-keys-fast-subset.Rmd b/vignettes/datatable-keys-fast-subset.Rmd
index 465052d94..3e9a4f23c 100644
--- a/vignettes/datatable-keys-fast-subset.Rmd
+++ b/vignettes/datatable-keys-fast-subset.Rmd
@@ -2,10 +2,10 @@
 title: "Keys and fast binary search based subset"
 date: "`r Sys.Date()`"
 output:
-  rmarkdown::html_vignette
+  markdown::html_format
 vignette: >
   %\VignetteIndexEntry{Keys and fast binary search based subset}
-  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEngine{knitr::knitr}
   \usepackage[utf8]{inputenc}
 ---
 
diff --git a/vignettes/datatable-programming.Rmd b/vignettes/datatable-programming.Rmd
index bf481f06f..d63b1bccc 100644
--- a/vignettes/datatable-programming.Rmd
+++ b/vignettes/datatable-programming.Rmd
@@ -2,10 +2,10 @@
 title: "Programming on data.table"
 date: "`r Sys.Date()`"
 output:
-  rmarkdown::html_vignette
+  markdown::html_format
 vignette: >
   %\VignetteIndexEntry{Programming on data.table}
-  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEngine{knitr::knitr}
   \usepackage[utf8]{inputenc}
 ---
 
diff --git a/vignettes/datatable-reference-semantics.Rmd b/vignettes/datatable-reference-semantics.Rmd
index 33da89bb9..220a2a19a 100644
--- a/vignettes/datatable-reference-semantics.Rmd
+++ b/vignettes/datatable-reference-semantics.Rmd
@@ -2,10 +2,10 @@
 title: "Reference semantics"
 date: "`r Sys.Date()`"
 output:
-  rmarkdown::html_vignette
+  markdown::html_format
 vignette: >
   %\VignetteIndexEntry{Reference semantics}
-  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEngine{knitr::knitr}
   \usepackage[utf8]{inputenc}
 ---
 
diff --git a/vignettes/datatable-reshape.Rmd b/vignettes/datatable-reshape.Rmd
index 3f94392fc..c26d5510d 100644
--- a/vignettes/datatable-reshape.Rmd
+++ b/vignettes/datatable-reshape.Rmd
@@ -2,10 +2,10 @@
 title: "Efficient reshaping using data.tables"
 date: "`r Sys.Date()`"
 output:
-  rmarkdown::html_vignette
+  markdown::html_format
 vignette: >
   %\VignetteIndexEntry{Efficient reshaping using data.tables}
-  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEngine{knitr::knitr}
   \usepackage[utf8]{inputenc}
 ---
 
diff --git a/vignettes/datatable-sd-usage.Rmd b/vignettes/datatable-sd-usage.Rmd
index f84fd6ea6..48ccf2193 100644
--- a/vignettes/datatable-sd-usage.Rmd
+++ b/vignettes/datatable-sd-usage.Rmd
@@ -2,12 +2,15 @@
 title: "Using .SD for Data Analysis"
 date: "`r Sys.Date()`"
 output:
-  rmarkdown::html_vignette:
-    toc: true
-    number_sections: true
+  markdown::html_format:
+    options:
+      toc: true
+      number_sections: true
+    meta:
+      css: [default, css/toc.css]
 vignette: >
   %\VignetteIndexEntry{Using .SD for Data Analysis}
-  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEngine{knitr::knitr}
   \usepackage[utf8]{inputenc}
 ---
 
diff --git a/vignettes/datatable-secondary-indices-and-auto-indexing.Rmd b/vignettes/datatable-secondary-indices-and-auto-indexing.Rmd
index ef506605c..374ccd66b 100644
--- a/vignettes/datatable-secondary-indices-and-auto-indexing.Rmd
+++ b/vignettes/datatable-secondary-indices-and-auto-indexing.Rmd
@@ -2,10 +2,10 @@
 title: "Secondary indices and auto indexing"
 date: "`r Sys.Date()`"
 output:
-  rmarkdown::html_vignette
+  markdown::html_format
 vignette: >
   %\VignetteIndexEntry{Secondary indices and auto indexing}
-  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEngine{knitr::knitr}
   \usepackage[utf8]{inputenc}
 ---