diff --git a/vignettes/css/bootstrap.css b/vignettes/css/bootstrap.css deleted file mode 100644 index 1453f27bf..000000000 --- a/vignettes/css/bootstrap.css +++ /dev/null @@ -1,118 +0,0 @@ -code, -kbd, -pre, -samp { - font-family: Source Code Pro, Inconsolata, Monaco, Consolas, Menlo, Courier New, monospace; -} - -code { - padding: 0px 2px; - font-size: 90%; - color: #c7254e; - white-space: nowrap; - background-color: #f9f2f4; - border-radius: 3px; - border: 0px; -} - -pre { - display: block; - padding: 9.5px; - margin: 0 0 10px; - font-size: 14px; - line-height: 1.428571429; - color: #c7254e; - background-color: #f9f2f4 - word-break: break-all; - word-wrap: break-word; - border: 0px ; - border-radius: 3px; - /*background-color: #FDF6E3;*/ - /*background-color: #f5f5f5; */ - /*border: 1px solid #FDF6E3;*/ -} - -pre code { - padding: 0; - font-size: inherit; - color: inherit; - white-space: pre-wrap; - background-color: transparent; - border-radius: 0; -} - -.bs-callout { - margin:20px 0; - padding:20px; - border-left:3px solid #eee -} - -.bs-callout h4 { - margin-top:0; - margin-bottom:5px -} - -.bs-callout p:last-child { - margin-bottom:0 -} - -.bs-callout code { - background-color:#fff; - border-radius:3px -} - -.bs-callout pre code { - background-color:transparent; - border-radius:3px -} - -.bs-callout-danger { - background-color:#fdf7f7; - border-color:#d9534f -} - -.bs-callout-danger h4 { - color:#d9534f -} - -.bs-callout-warning { - background-color:#fcf8f2; - border-color:#f0ad4e -} - -.bs-callout-warning h4 { - color:#f0ad4e -} - -.bs-callout-info { - background-color:#f4f8fa; - border-color:#5bc0de -} - -.bs-callout-info h4 { - color:#5bc0de -} - -// KeyWordTok -.sourceCode .kw { color: #268BD2; } -// DataTypeTok -.sourceCode .dt { color: #268BD2; } - -// DecValTok (decimal value), BaseNTok, FloatTok -.sourceCode .dv, .sourceCode .bn, .sourceCode .fl { color: #D33682; } -// CharTok -.sourceCode .ch { color: #DC322F; } -// StringTok -.sourceCode .st { 
color: #2AA198; } -// CommentTok -.sourceCode .co { color: #93A1A1; } -// OtherTok -.sourceCode .ot { color: #A57800; } -// AlertTok -.sourceCode .al { color: #CB4B16; font-weight: bold; } -// FunctionTok -.sourceCode .fu { color: #268BD2; } -// RegionMarkerTok -.sourceCode .re { } -// ErrorTok -.sourceCode .er { color: #D30102; font-weight: bold; } diff --git a/vignettes/datatable-intro.Rmd b/vignettes/datatable-intro.Rmd index 5bd36437a..a810edbb6 100644 --- a/vignettes/datatable-intro.Rmd +++ b/vignettes/datatable-intro.Rmd @@ -86,7 +86,7 @@ class(DT$ID) You can also convert existing objects to a `data.table` using `setDT()` (for `data.frame`s and `list`s) and `as.data.table()` (for other structures); the difference is beyond the scope of this vignette, see `?setDT` and `?as.data.table` for more details. -#### Note that: {.bs-callout .bs-callout-info} +#### Note that: * Row numbers are printed with a `:` in order to visually separate the row number from the first column. @@ -111,7 +111,7 @@ DT[i, j, by] Users who have an SQL background might perhaps immediately relate to this syntax. -#### The way to read it (out loud) is: {.bs-callout .bs-callout-info} +#### The way to read it (out loud) is: Take `DT`, subset/reorder rows using `i`, then calculate `j`, grouped by `by`. @@ -126,8 +126,6 @@ ans <- flights[origin == "JFK" & month == 6L] head(ans) ``` -#### {.bs-callout .bs-callout-info} - * Within the frame of a `data.table`, columns can be referred to *as if they are variables*, much like in SQL or Stata. Therefore, we simply refer to `origin` and `month` as if they are variables. We do not need to add the prefix `flights$` each time. Nevertheless, using `flights$origin` and `flights$month` would work just fine. * The *row indices* that satisfy the condition `origin == "JFK" & month == 6L` are computed, and since there is nothing else left to do, all columns from `flights` at rows corresponding to those *row indices* are simply returned as a `data.table`. 
@@ -140,7 +138,6 @@ head(ans) ans <- flights[1:2] ans ``` -#### {.bs-callout .bs-callout-info} * In this case, there is no condition. The row indices are already provided in `i`. We therefore return a `data.table` with all columns from `flights` at rows for those *row indices*. @@ -153,7 +150,7 @@ ans <- flights[order(origin, -dest)] head(ans) ``` -#### `order()` is internally optimised {.bs-callout .bs-callout-info} +#### `order()` is internally optimised * We can use "-" on a `character` columns within the frame of a `data.table` to sort in decreasing order. @@ -170,8 +167,6 @@ ans <- flights[, arr_delay] head(ans) ``` -#### {.bs-callout .bs-callout-info} - * Since columns can be referred to as if they are variables within the frame of `data.table`s, we directly refer to the *variable* we want to subset. Since we want *all the rows*, we simply skip `i`. * It returns *all* the rows for the column `arr_delay`. @@ -183,15 +178,13 @@ ans <- flights[, list(arr_delay)] head(ans) ``` -#### {.bs-callout .bs-callout-info} - * We wrap the *variables* (column names) within `list()`, which ensures that a `data.table` is returned. In case of a single column name, not wrapping with `list()` returns a vector instead, as seen in the [previous example](#select-j-1d). * `data.table` also allows wrapping columns with `.()` instead of `list()`. It is an *alias* to `list()`; they both mean the same. Feel free to use whichever you prefer; we have noticed most users seem to prefer `.()` for conciseness, so we will continue to use `.()` hereafter. `data.table`s (and `data.frame`s) are internally `list`s as well, with the stipulation that each element has the same length and the `list` has a `class` attribute. Allowing `j` to return a `list` enables converting and returning `data.table` very efficiently. 
-#### Tip: {.bs-callout .bs-callout-warning #tip-1} +#### Tip: {#tip-1} As long as `j-expression` returns a `list`, each element of the list will be converted to a column in the resulting `data.table`. This makes `j` quite powerful, as we will see shortly. It is also very important to understand this for when you'd like to make more complicated queries!! @@ -205,8 +198,6 @@ head(ans) # ans <- flights[, list(arr_delay, dep_delay)] ``` -#### {.bs-callout .bs-callout-info} - * Wrap both columns within `.()`, or `list()`. That's it. #### -- Select both `arr_delay` and `dep_delay` columns *and* rename them to `delay_arr` and `delay_dep`. @@ -229,7 +220,7 @@ ans <- flights[, sum( (arr_delay + dep_delay) < 0 )] ans ``` -#### What's happening here? {.bs-callout .bs-callout-info} +#### What's happening here? * `data.table`'s `j` can handle more than just *selecting columns* - it can handle *expressions*, i.e., *computing on columns*. This shouldn't be surprising, as columns can be referred to as if they are variables. Then we should be able to *compute* by calling functions on those variables. And that's what precisely happens here. @@ -243,8 +234,6 @@ ans <- flights[origin == "JFK" & month == 6L, ans ``` -#### {.bs-callout .bs-callout-info} - * We first subset in `i` to find matching *row indices* where `origin` airport equals `"JFK"`, and `month` equals `6L`. We *do not* subset the _entire_ `data.table` corresponding to those rows _yet_. * Now, we look at `j` and find that it uses only *two columns*. And what we have to do is to compute their `mean()`. Therefore we subset just those columns corresponding to the matching rows, and compute their `mean()`. @@ -262,7 +251,7 @@ The function `length()` requires an input argument. We just needed to compute th This type of operation occurs quite frequently, especially while grouping (as we will see in the next section), to the point where `data.table` provides a *special symbol* `.N` for it. 
-#### Special symbol `.N`: {.bs-callout .bs-callout-info #special-N} +#### Special symbol `.N`: {#special-N} `.N` is a special built-in variable that holds the number of observations _in the current group_. It is particularly useful when combined with `by` as we'll see in the next section. In the absence of group by operations, it simply returns the number of rows in the subset. @@ -273,8 +262,6 @@ ans <- flights[origin == "JFK" & month == 6L, .N] ans ``` -#### {.bs-callout .bs-callout-info} - * Once again, we subset in `i` to get the *row indices* where `origin` airport equals *"JFK"*, and `month` equals *6*. * We see that `j` uses only `.N` and no other columns. Therefore the entire subset is not materialised. We simply return the number of rows in the subset (which is just the length of row indices). @@ -372,8 +359,6 @@ ans # ans <- flights[, .(.N), by = "origin"] ``` -#### {.bs-callout .bs-callout-info} - * We know `.N` [is a special variable](#special-N) that holds the number of rows in the current group. Grouping by `origin` obtains the number of rows, `.N`, for each group. * By doing `head(flights)` you can see that the origin airports occur in the order *"JFK"*, *"LGA"* and *"EWR"*. The original order of grouping variables is preserved in the result. _This is important to keep in mind!_ @@ -400,8 +385,6 @@ ans <- flights[carrier == "AA", .N, by = origin] ans ``` -#### {.bs-callout .bs-callout-info} - * We first obtain the row indices for the expression `carrier == "AA"` from `i`. * Using those *row indices*, we obtain the number of rows while grouped by `origin`. Once again no columns are actually materialised here, because the `j-expression` does not require any columns to be actually subsetted and is therefore fast and memory efficient. @@ -416,8 +399,6 @@ head(ans) # ans <- flights[carrier == "AA", .N, by = c("origin", "dest")] ``` -#### {.bs-callout .bs-callout-info} - * `by` accepts multiple columns. 
We just provide all the columns by which to group by. Note the use of `.()` again in `by` -- again, this is just shorthand for `list()`, and `list()` can be used here as well. Again, we'll stick with `.()` in this vignette. #### -- How can we get the average arrival and departure delay for each `orig,dest` pair for each month for carrier code `"AA"`? {#origin-dest-month} @@ -429,8 +410,6 @@ ans <- flights[carrier == "AA", ans ``` -#### {.bs-callout .bs-callout-info} - * Since we did not provide column names for the expressions in `j`, they were automatically generated as `V1` and `V2`. * Once again, note that the input order of grouping columns is preserved in the result. @@ -450,8 +429,6 @@ ans <- flights[carrier == "AA", ans ``` -#### {.bs-callout .bs-callout-info} - * All we did was to change `by` to `keyby`. This automatically orders the result by the grouping variables in increasing order. In fact, due to the internal implementation of `by` first requiring a sort before recovering the original table's order, `keyby` is typically faster than `by` because it doesn't require this second step. **Keys:** Actually `keyby` does a little more than *just ordering*. It also *sets a key* after ordering by setting an `attribute` called `sorted`. @@ -475,8 +452,6 @@ ans <- ans[order(origin, -dest)] head(ans) ``` -#### {.bs-callout .bs-callout-info} - * Recall that we can use `-` on a `character` column in `order()` within the frame of a `data.table`. This is possible to due `data.table`'s internal query optimisation. * Also recall that `order(...)` with the frame of a `data.table` is *automatically optimised* to use `data.table`'s internal fast radix order `forder()` for speed. @@ -488,8 +463,6 @@ ans <- flights[carrier == "AA", .N, by = .(origin, dest)][order(origin, -dest)] head(ans, 10) ``` -#### {.bs-callout .bs-callout-info} - * We can tack expressions one after another, *forming a chain* of operations, i.e., `DT[ ... ][ ... ][ ... ]`. 
* Or you can also chain them vertically: @@ -512,8 +485,6 @@ ans <- flights[, .N, .(dep_delay>0, arr_delay>0)] ans ``` -#### {.bs-callout .bs-callout-info} - * The last row corresponds to `dep_delay > 0 = TRUE` and `arr_delay > 0 = FALSE`. We can see that `r flights[!is.na(arr_delay) & !is.na(dep_delay), .N, .(dep_delay>0, arr_delay>0)][, N[4L]]` flights started late but arrived early (or on time). * Note that we did not provide any names to `by-expression`. Therefore, names have been automatically assigned in the result. As with `j`, you can name these expressions as you would elements of any `list`, e.g. `DT[, .N, .(dep_delayed = dep_delay>0, arr_delayed = arr_delay>0)]`. @@ -528,7 +499,7 @@ It is of course not practical to have to type `mean(myCol)` for every column one How can we do this efficiently, concisely? To get there, refresh on [this tip](#tip-1) - *"As long as the `j`-expression returns a `list`, each element of the `list` will be converted to a column in the resulting `data.table`"*. Suppose we can refer to the *data subset for each group* as a variable *while grouping*, then we can loop through all the columns of that variable using the already- or soon-to-be-familiar base function `lapply()`. No new names to learn specific to `data.table`. -#### Special symbol `.SD`: {.bs-callout .bs-callout-info #special-SD} +#### Special symbol `.SD`: {#special-SD} `data.table` provides a *special* symbol, called `.SD`. It stands for **S**ubset of **D**ata. It by itself is a `data.table` that holds the data for *the current group* defined using `by`. @@ -542,8 +513,6 @@ DT DT[, print(.SD), by = ID] ``` -#### {.bs-callout .bs-callout-info} - * `.SD` contains all the columns *except the grouping columns* by default. * It is also generated by preserving the original order - data corresponding to `ID = "b"`, then `ID = "a"`, and then `ID = "c"`. 
@@ -554,8 +523,6 @@ To compute on (multiple) columns, we can then simply use the base R function `la DT[, lapply(.SD, mean), by = ID] ``` -#### {.bs-callout .bs-callout-info} - * `.SD` holds the rows corresponding to columns `a`, `b` and `c` for that group. We compute the `mean()` on each of these columns using the already-familiar base function `lapply()`. * Each group returns a list of three elements containing the mean value which will become the columns of the resulting `data.table`. @@ -566,7 +533,7 @@ We are almost there. There is one little thing left to address. In our `flights` #### -- How can we specify just the columns we would like to compute the `mean()` on? -#### .SDcols {.bs-callout .bs-callout-info} +#### .SDcols Using the argument `.SDcols`. It accepts either column names or column indices. For example, `.SDcols = c("arr_delay", "dep_delay")` ensures that `.SD` contains only these two columns for each group. @@ -590,8 +557,6 @@ ans <- flights[, head(.SD, 2), by = month] head(ans) ``` -#### {.bs-callout .bs-callout-info} - * `.SD` is a `data.table` that holds all the rows for *that group*. We simply subset the first two rows as we have seen [here](#subset-rows-integer) already. * For each group, `head(.SD, 2)` returns the first two rows as a `data.table`, which is also a `list`, so we do not have to wrap it with `.()`. @@ -606,8 +571,6 @@ So that we have a consistent syntax and keep using already existing (and familia DT[, .(val = c(a,b)), by = ID] ``` -#### {.bs-callout .bs-callout-info} - * That's it. There is no special syntax required. All we need to know is the base function `c()` which concatenates vectors and [the tip from before](#tip-1). #### -- What if we would like to have all the values of column `a` and `b` concatenated, but returned as a list column? 
@@ -616,8 +579,6 @@ DT[, .(val = c(a,b)), by = ID] DT[, .(val = list(c(a,b))), by = ID] ``` -#### {.bs-callout .bs-callout-info} - * Here, we first concatenate the values with `c(a,b)` for each group, and wrap that with `list()`. So for each group, we return a list of all concatenated values. * Note those commas are for display only. A list column can contain any object in each cell, and in this example, each cell is itself a vector and some cells contain longer vectors than others. @@ -646,7 +607,7 @@ DT[i, j, by] We have seen so far that, -#### Using `i`: {.bs-callout .bs-callout-info} +#### Using `i`: * We can subset rows similar to a `data.frame`- except you don't have to use `DT$` repetitively since columns within the frame of a `data.table` are seen as if they are *variables*. @@ -654,7 +615,7 @@ We have seen so far that, We can do much more in `i` by keying a `data.table`, which allows blazing fast subsets and joins. We will see this in the *"Keys and fast binary search based subsets"* and *"Joins and rolling joins"* vignette. -#### Using `j`: {.bs-callout .bs-callout-info} +#### Using `j`: 1. Select columns the `data.table` way: `DT[, .(colA, colB)]`. @@ -666,7 +627,7 @@ We can do much more in `i` by keying a `data.table`, which allows blazing fast s 5. Combine with `i`: `DT[colA > value, sum(colB)]`. -#### Using `by`: {.bs-callout .bs-callout-info} +#### Using `by`: * Using `by`, we can group by columns by specifying a *list of columns* or a *character vector of column names* or even *expressions*. The flexibility of `j`, combined with `by` and `i` makes for a very powerful syntax. @@ -682,7 +643,7 @@ We can do much more in `i` by keying a `data.table`, which allows blazing fast s 3. `DT[col > val, head(.SD, 1), by = ...]` - combine `i` along with `j` and `by`. 
-#### And remember the tip: {.bs-callout .bs-callout-warning} +#### And remember the tip: As long as `j` returns a `list`, each element of the list will become a column in the resulting `data.table`. diff --git a/vignettes/datatable-reference-semantics.Rmd b/vignettes/datatable-reference-semantics.Rmd index 220a2a19a..c96ed090f 100644 --- a/vignettes/datatable-reference-semantics.Rmd +++ b/vignettes/datatable-reference-semantics.Rmd @@ -71,7 +71,7 @@ both (1) and (2) resulted in deep copy of the entire data.frame in versions of ` Great performance improvements were made in `R v3.1` as a result of which only a *shallow* copy is made for (1) and not *deep* copy. However, for (2) still, the entire column is *deep* copied even in `R v3.1+`. This means the more columns one subassigns to in the *same query*, the more *deep* copies R does. -#### *shallow* vs *deep* copy {.bs-callout .bs-callout-info} +#### *shallow* vs *deep* copy A *shallow* copy is just a copy of the vector of column pointers (corresponding to the columns in a *data.frame* or *data.table*). The actual data is not physically copied in memory. @@ -86,31 +86,27 @@ It can be used in `j` in two ways: (a) The `LHS := RHS` form - ```{r eval = FALSE} - DT[, c("colA", "colB", ...) := list(valA, valB, ...)] +```{r eval = FALSE} +DT[, c("colA", "colB", ...) := list(valA, valB, ...)] - # when you have only one column to assign to you - # can drop the quotes and list(), for convenience - DT[, colA := valA] - ``` +# when you have only one column to assign to you +# can drop the quotes and list(), for convenience +DT[, colA := valA] +``` (b) The functional form - ```{r eval = FALSE} - DT[, `:=`(colA = valA, # valA is assigned to colA - colB = valB, # valB is assigned to colB - ... - )] - ``` - -#### {.bs-callout .bs-callout-warning} +```{r eval = FALSE} +DT[, `:=`(colA = valA, # valA is assigned to colA + colB = valB, # valB is assigned to colB + ... +)] +``` Note that the code above explains how `:=` can be used. 
They are not working examples. We will start using them on `flights` *data.table* from the next section. # -#### {.bs-callout .bs-callout-info} - * In (a), `LHS` takes a character vector of column names and `RHS` a *list of values*. `RHS` just needs to be a `list`, irrespective of how its generated (e.g., using `lapply()`, `list()`, `mget()`, `mapply()` etc.). This form is usually easy to program with and is particularly useful when you don't know the columns to assign values to in advance. * On the other hand, (b) is handy if you would like to jot some comments down for later. @@ -140,7 +136,7 @@ head(flights) # flights[, c("speed", "delay") := list(distance/(air_time/60), arr_delay + dep_delay)] ``` -#### Note that {.bs-callout .bs-callout-info} +#### Note that * We did not have to assign the result back to `flights`. @@ -166,8 +162,6 @@ We see that there are totally `25` unique values in the data. Both *0* and *24* flights[hour == 24L, hour := 0L] ``` -#### {.bs-callout .bs-callout-info} - * We can use `i` along with `:=` in `j` the very same way as we have already seen in the *"Introduction to data.table"* vignette. * Column `hour` is replaced with `0` only on those *row indices* where the condition `hour == 24L` specified in `i` evaluates to `TRUE`. @@ -186,7 +180,7 @@ Let's look at all the `hours` to verify. flights[, sort(unique(hour))] ``` -#### Exercise: {.bs-callout .bs-callout-warning #update-by-reference-question} +#### Exercise: {#update-by-reference-question} What is the difference between `flights[hour == 24L, hour := 0L]` and `flights[hour == 24L][, hour := 0L]`? Hint: The latter needs an assignment (`<-`) if you would want to use the result later. @@ -204,7 +198,7 @@ head(flights) # flights[, `:=`(delay = NULL)] ``` -#### {.bs-callout .bs-callout-info #delete-convenience} +#### {#delete-convenience} * Assigning `NULL` to a column *deletes* that column. And it happens *instantly*. 
@@ -229,8 +223,6 @@ flights[, max_speed := max(speed), by = .(origin, dest)] head(flights) ``` -#### {.bs-callout .bs-callout-info} - * We add a new column `max_speed` using the `:=` operator by reference. * We provide the columns to group by the same way as shown in the *Introduction to data.table* vignette. For each group, `max(speed)` is computed, which returns a single value. That value is recycled to fit the length of the group. Once again, no copies are being made at all. `flights` *data.table* is modified *in-place*. @@ -249,7 +241,6 @@ out_cols = c("max_dep_delay", "max_arr_delay") flights[, c(out_cols) := lapply(.SD, max), by = month, .SDcols = in_cols] head(flights) ``` -#### {.bs-callout .bs-callout-info} * We use the `LHS := RHS` form. We store the input column names and the new columns to add in separate variables and provide them to `.SDcols` and for `LHS` (for better readability). @@ -283,7 +274,6 @@ ans = foo(flights) head(flights) head(ans) ``` -#### {.bs-callout .bs-callout-info} * Note that the new column `speed` has been added to `flights` *data.table*. This is because `:=` performs operations by reference. Since `DT` (the function argument) and `flights` refer to the same object in memory, modifying `DT` also reflects on `flights`. @@ -293,8 +283,6 @@ head(ans) In the previous section, we used `:=` for its side effect. But of course this may not be always desirable. Sometimes, we would like to pass a *data.table* object to a function, and might want to use the `:=` operator, but *wouldn't* want to update the original object. We can accomplish this using the function `copy()`. -#### {.bs-callout .bs-callout-info} - The `copy()` function *deep* copies the input object and therefore any subsequent update by reference operations performed on the copied object will not affect the original object. 
# @@ -321,8 +309,6 @@ There are two particular places where `copy()` function is essential: head(ans) ``` -#### {.bs-callout .bs-callout-info} - * Using `copy()` function did not update `flights` *data.table* by reference. It doesn't contain the column `speed`. * And `ans` contains the maximum speed corresponding to each month. @@ -354,7 +340,7 @@ However we could improve this functionality further by *shallow* copying instead ## Summary -#### The `:=` operator {.bs-callout .bs-callout-info} +#### The `:=` operator * It is used to *add/update/delete* columns by reference. diff --git a/vignettes/datatable-reshape.Rmd b/vignettes/datatable-reshape.Rmd index c26d5510d..0b5d7a57d 100644 --- a/vignettes/datatable-reshape.Rmd +++ b/vignettes/datatable-reshape.Rmd @@ -77,8 +77,6 @@ DT.m1 str(DT.m1) ``` -#### {.bs-callout .bs-callout-info} - * `measure.vars` specify the set of columns we would like to collapse (or combine) together. * We can also specify column *indices* instead of *names*. @@ -98,8 +96,6 @@ DT.m1 = melt(DT, measure.vars = c("dob_child1", "dob_child2", "dob_child3"), DT.m1 ``` -#### {.bs-callout .bs-callout-info} - * By default, when one of `id.vars` or `measure.vars` is missing, the rest of the columns are *automatically assigned* to the missing argument. * When neither `id.vars` nor `measure.vars` are specified, as mentioned under `?melt`, all *non*-`numeric`, `integer`, `logical` columns will be assigned to `id.vars`. @@ -118,8 +114,6 @@ That is, we'd like to collect all *child* observations corresponding to each `fa dcast(DT.m1, family_id + age_mother ~ child, value.var = "dob") ``` -#### {.bs-callout .bs-callout-info} - * `dcast` uses *formula* interface. The variables on the *LHS* of formula represents the *id* vars and *RHS* the *measure* vars. * `value.var` denotes the column to be filled in with while casting to wide format. @@ -165,7 +159,7 @@ DT.c1 str(DT.c1) ## gender column is character type now! 
``` -#### Issues {.bs-callout .bs-callout-info} +#### Issues 1. What we wanted to do was to combine all the `dob` and `gender` type columns together respectively. Instead we are combining *everything* together, and then splitting them again. I think it's easy to see that it's quite roundabout (and inefficient). @@ -198,8 +192,6 @@ DT.m2 str(DT.m2) ## col type is preserved ``` -#### {.bs-callout .bs-callout-info} - * We can remove the `variable` column if necessary. * The functionality is implemented entirely in C, and is therefore both *fast* and *memory efficient* in addition to being *straightforward*. @@ -210,7 +202,7 @@ Usually in these problems, the columns we'd like to melt can be distinguished by ```{r} DT.m2 = melt(DT, measure = patterns("^dob", "^gender"), value.name = c("dob", "gender")) -print(DT.m2, class=TRUE) +DT.m2 ``` #### - Using `measure()` to specify `measure.vars` via separator or pattern @@ -260,7 +252,7 @@ is used to convert the `child` string values to integers: ```{r} DT.m3 = melt(DT, measure = measure(value.name, child=as.integer, sep="_child")) -print(DT.m3, class=TRUE) +DT.m3 ``` In the code above we used `sep="_child"` which results in melting only @@ -288,12 +280,12 @@ groups, two numeric output columns, and an anonymous type conversion function, ```{r} -print(melt(who, measure.vars = measure( +melt(who, measure.vars = measure( diagnosis, gender, ages, ymin=as.numeric, ymax=function(y)ifelse(y=="", Inf, as.numeric(y)), pattern="new_?(.*)_(.)(([0-9]{2})([0-9]{0,2}))" -)), class=TRUE) +)) ``` ### b) Enhanced `dcast` @@ -312,15 +304,13 @@ DT.c2 = dcast(DT.m2, family_id + age_mother ~ variable, value.var = c("dob", "ge DT.c2 ``` -#### {.bs-callout .bs-callout-info} - * Attributes are preserved in result wherever possible. * Everything is taken care of internally, and efficiently. In addition to being fast, it is also very memory efficient. 
# -#### Multiple functions to `fun.aggregate`: {.bs-callout .bs-callout-info} +#### Multiple functions to `fun.aggregate`: You can also provide *multiple functions* to `fun.aggregate` to `dcast` for *data.tables*. Check the examples in `?dcast` which illustrates this functionality. diff --git a/vignettes/datatable-sd-usage.Rmd b/vignettes/datatable-sd-usage.Rmd index 8e7919f34..8a89b99fe 100644 --- a/vignettes/datatable-sd-usage.Rmd +++ b/vignettes/datatable-sd-usage.Rmd @@ -202,7 +202,7 @@ Note that the `x[y]` syntax returns `nrow(y)` values (i.e., it's a right join), Often, we'd like to perform some operation on our data _at the group level_. When we specify `by =` (or `keyby = `), the mental model for what happens when `data.table` processes `j` is to think of your `data.table` as being split into many component sub-`data.table`s, each of which corresponds to a single value of your `by` variable(s): -![Grouping, Illustrated](plots/grouping_illustration.png 'A visual depiction of how grouping works. On the left is a grid. The first column is titled "ID COLUMN" with values the capital letters A through G, and the rest of the data is unlabelled, but is in a darker color and simply has "Data" written to indicate that's arbitrary. A right arrow shows how this data is split into groups. Each capital letter A through G has a grid on the right-hand side; the grid on the left has been subdivided to create that on the right.') +![Grouping, Illustrated](plots/grouping_illustration.png) In the case of grouping, `.SD` is multiple in nature -- it refers to _each_ of these sub-`data.table`s, _one-at-a-time_ (slightly more accurately, the scope of `.SD` is a single sub-`data.table`). This allows us to concisely express an operation that we'd like to perform on _each sub-`data.table`_ before the re-assembled result is returned to us. 
diff --git a/vignettes/datatable-secondary-indices-and-auto-indexing.Rmd b/vignettes/datatable-secondary-indices-and-auto-indexing.Rmd index 374ccd66b..6f2474c11 100644 --- a/vignettes/datatable-secondary-indices-and-auto-indexing.Rmd +++ b/vignettes/datatable-secondary-indices-and-auto-indexing.Rmd @@ -105,7 +105,7 @@ setkey(flights, origin) flights["JFK"] # or flights[.("JFK")] ``` -#### `setkey()` requires: {.bs-callout .bs-callout-info} +#### `setkey()` requires: a) computing the order vector for the column(s) provided, here, `origin`, and @@ -139,7 +139,7 @@ Since there can be multiple secondary indices, and creating an index is as simpl As we will see in the next section, the `on` argument provides several advantages: -#### `on` argument {.bs-callout .bs-callout-info} +#### `on` argument * enables subsetting by computing secondary indices on the fly. This eliminates having to do `setindex()` every time.