Skip to content

Commit

Permalink
Merge branch 'master' into hindi_a
Browse files Browse the repository at this point in the history
  • Loading branch information
amansingh1011 authored Nov 21, 2024
2 parents 7581e65 + 546259d commit 4ad7706
Show file tree
Hide file tree
Showing 7 changed files with 30 additions and 30 deletions.
8 changes: 4 additions & 4 deletions vignettes/datatable-intro.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -101,7 +101,7 @@ You can also convert existing objects to a `data.table` using `setDT()` (for `da
getOption("datatable.print.nrows")
```
* `data.table` doesn't set or use *row names*, ever. We will see why in the *"Keys and fast binary search based subset"* vignette.
* `data.table` doesn't set or use *row names*, ever. We will see why in the [`vignette("datatable-keys-fast-subset", package="data.table")`](datatable-keys-fast-subset.html) vignette.
### b) General form - in what way is a `data.table` *enhanced*? {#enhanced-1b}
Expand Down Expand Up @@ -479,7 +479,7 @@ ans

**Keys:** Actually `keyby` does a little more than *just ordering*. It also *sets a key* after ordering by setting an `attribute` called `sorted`.

We'll learn more about `keys` in the `vignette("datatable-keys-fast-subset", package="data.table")`; for now, all you have to know is that you can use `keyby` to automatically order the result by the columns specified in `by`.
We'll learn more about `keys` in the [`vignette("datatable-keys-fast-subset", package="data.table")`](datatable-keys-fast-subset.html) vignette; for now, all you have to know is that you can use `keyby` to automatically order the result by the columns specified in `by`.

### c) Chaining

Expand Down Expand Up @@ -659,7 +659,7 @@ We have seen so far that,

* We can also sort a `data.table` using `order()`, which internally uses data.table's fast order for better performance.

We can do much more in `i` by keying a `data.table`, which allows for blazing fast subsets and joins. We will see this in the `vignette("datatable-keys-fast-subset", package="data.table")` and the `vignette("datatable-joins", package="data.table")`.
We can do much more in `i` by keying a `data.table`, which allows for blazing fast subsets and joins. We will see this in the vignettes [`vignette("datatable-keys-fast-subset", package="data.table")`](datatable-keys-fast-subset.html) and [`vignette("datatable-joins", package="data.table")`](datatable-joins.html).

#### Using `j`:

Expand Down Expand Up @@ -693,7 +693,7 @@ We can do much more in `i` by keying a `data.table`, which allows for blazing fa

As long as `j` returns a `list`, each element of the list will become a column in the resulting `data.table`.

We will see how to *add/update/delete* columns *by reference* and how to combine them with `i` and `by` in the next vignette (`vignette("datatable-reference-semantics", package="data.table")`).
We will see how to *add/update/delete* columns *by reference* and how to combine them with `i` and `by` in the [next vignette (`vignette("datatable-reference-semantics", package="data.table")`)](datatable-reference-semantics.html).

***

Expand Down
6 changes: 3 additions & 3 deletions vignettes/datatable-joins.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -26,9 +26,9 @@ In this vignette you will learn how to perform any join operation using resource

It assumes familiarity with the `data.table` syntax. If that is not the case, please read the following vignettes:

- `vignette("datatable-intro", package="data.table")`
- `vignette("datatable-reference-semantics", package="data.table")`
- `vignette("datatable-keys-fast-subset", package="data.table")`
- [`vignette("datatable-intro", package="data.table")`](datatable-intro.html)
- [`vignette("datatable-reference-semantics", package="data.table")`](datatable-reference-semantics.html)
- [`vignette("datatable-keys-fast-subset", package="data.table")`](datatable-keys-fast-subset.html)

***

Expand Down
14 changes: 7 additions & 7 deletions vignettes/datatable-keys-fast-subset.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -24,13 +24,13 @@ knitr::opts_chunk$set(
.old.th = setDTthreads(1)
```

This vignette is aimed at those who are already familiar with *data.table* syntax, its general form, how to subset rows in `i`, select and compute on columns, add/modify/delete columns *by reference* in `j` and group by using `by`. If you're not familiar with these concepts, please read the `vignette("datatable-intro", package="data.table")` and the `vignette("datatable-reference-semantics", package="data.table")` first.
This vignette is aimed at those who are already familiar with *data.table* syntax, its general form, how to subset rows in `i`, select and compute on columns, add/modify/delete columns *by reference* in `j` and group by using `by`. If you're not familiar with these concepts, please read the vignettes [`vignette("datatable-intro", package="data.table")`](datatable-intro.html) and [`vignette("datatable-reference-semantics", package="data.table")`](datatable-reference-semantics.html) first.

***

## Data {#data}

We will use the same `flights` data as in the `vignette("datatable-intro", package="data.table")`.
We will use the same `flights` data as in the [`vignette("datatable-intro", package="data.table")`](datatable-intro.html) vignette.

```{r echo = FALSE}
options(width = 100L)
Expand Down Expand Up @@ -58,7 +58,7 @@ In this vignette, we will

### a) What is a *key*?

In the `vignette("datatable-intro", package="data.table")`, we saw how to subset rows in `i` using logical expressions, row numbers and using `order()`. In this section, we will look at another way of subsetting incredibly fast - using *keys*.
In the [`vignette("datatable-intro", package="data.table")`](datatable-intro.html) vignette, we saw how to subset rows in `i` using logical expressions, row numbers and using `order()`. In this section, we will look at another way of subsetting incredibly fast - using *keys*.

But first, let's start by looking at *data.frames*. All *data.frames* have a row names attribute. Consider the *data.frame* `DF` below.

Expand Down Expand Up @@ -143,7 +143,7 @@ head(flights)

* Alternatively you can pass a character vector of column names to the function `setkeyv()`. This is particularly useful while designing functions to pass columns to set key on as function arguments.

* Note that we did not have to assign the result back to a variable. This is because like the `:=` function we saw in the `vignette("datatable-reference-semantics", package="data.table")`, `setkey()` and `setkeyv()` modify the input *data.table* *by reference*. They return the result invisibly.
* Note that we did not have to assign the result back to a variable. This is because like the `:=` function we saw in the [`vignette("datatable-reference-semantics", package="data.table")`](datatable-reference-semantics.html) vignette, `setkey()` and `setkeyv()` modify the input *data.table* *by reference*. They return the result invisibly.

* The *data.table* is now reordered (or sorted) by the column we provided - `origin`. Since we reorder by reference, we only require additional memory of one column of length equal to the number of rows in the *data.table*, and is therefore very memory efficient.

Expand Down Expand Up @@ -262,7 +262,7 @@ flights[.("LGA", "TPA"), .(arr_delay)]

* The *row indices* corresponding to `origin == "LGA"` and `dest == "TPA"` are obtained using *key based subset*.

* Once we have the row indices, we look at `j` which requires only the `arr_delay` column. So we simply select the column `arr_delay` for those *row indices* in the exact same way as we have seen in `vignette("datatable-intro", package="data.table")`.
* Once we have the row indices, we look at `j` which requires only the `arr_delay` column. So we simply select the column `arr_delay` for those *row indices* in the exact same way as we have seen in the [`vignette("datatable-intro", package="data.table")`](datatable-intro.html) vignette.

* We could have returned the result by using `with = FALSE` as well.

Expand Down Expand Up @@ -290,7 +290,7 @@ flights[.("LGA", "TPA"), max(arr_delay)]

### d) *sub-assign* by reference using `:=` in `j`

We have seen this example already in the `vignette("datatable-reference-semantics", package="data.table")`. Let's take a look at all the `hours` available in the `flights` *data.table*:
We have seen this example already in the [`vignette("datatable-reference-semantics", package="data.table")`](datatable-reference-semantics.html) vignette. Let's take a look at all the `hours` available in the `flights` *data.table*:

```{r}
# get all 'hours' in flights
Expand Down Expand Up @@ -498,7 +498,7 @@ In this vignette, we have learnt another method to subset rows in `i` by keying

* combine key based subsets with `j` and `by`. Note that the `j` and `by` operations are exactly the same as before.

Key based subsets are **incredibly fast** and are particularly useful when the task involves *repeated subsetting*. But it may not be always desirable to set key and physically reorder the *data.table*. In the next `vignette("datatable-secondary-indices-and-auto-indexing", package="data.table")`, we will address this using a *new* feature -- *secondary indexes*.
Key based subsets are **incredibly fast** and are particularly useful when the task involves *repeated subsetting*. But it may not be always desirable to set key and physically reorder the *data.table*. In the next [next vignette (`vignette("datatable-secondary-indices-and-auto-indexing", package="data.table")`)](datatable-secondary-indices-and-auto-indexing.html), we will address this using a *new* feature -- *secondary indexes*.


```{r, echo=FALSE}
Expand Down
14 changes: 7 additions & 7 deletions vignettes/datatable-reference-semantics.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -23,13 +23,13 @@ knitr::opts_chunk$set(
collapse = TRUE)
.old.th = setDTthreads(1)
```
This vignette discusses *data.table*'s reference semantics which allows to *add/update/delete* columns of a *data.table by reference*, and also combine them with `i` and `by`. It is aimed at those who are already familiar with *data.table* syntax, its general form, how to subset rows in `i`, select and compute on columns, and perform aggregations by group. If you're not familiar with these concepts, please read the `vignette("datatable-intro", package="data.table")` first.
This vignette discusses *data.table*'s reference semantics which allows to *add/update/delete* columns of a *data.table by reference*, and also combine them with `i` and `by`. It is aimed at those who are already familiar with *data.table* syntax, its general form, how to subset rows in `i`, select and compute on columns, and perform aggregations by group. If you're not familiar with these concepts, please read the [`vignette("datatable-intro", package="data.table")`](datatable-intro.html) vignette first.

***

## Data {#data}

We will use the same `flights` data as in the `vignette("datatable-intro", package="data.table")`.
We will use the same `flights` data as in the [`vignette("datatable-intro", package="data.table")`](datatable-intro.html) vignette.

```{r echo = FALSE}
options(width = 100L)
Expand Down Expand Up @@ -169,7 +169,7 @@ We see that there are totally `25` unique values in the data. Both *0* and *24*
flights[hour == 24L, hour := 0L]
```

* We can use `i` along with `:=` in `j` the very same way as we have already seen in the `vignette("datatable-intro", package="data.table")`.
* We can use `i` along with `:=` in `j` the very same way as we have already seen in the [`vignette("datatable-intro", package="data.table")`](datatable-intro.html) vignette.

* Column `hour` is replaced with `0` only on those *row indices* where the condition `hour == 24L` specified in `i` evaluates to `TRUE`.

Expand Down Expand Up @@ -234,7 +234,7 @@ head(flights)

* We provide the columns to group by the same way as shown in the *Introduction to data.table* vignette. For each group, `max(speed)` is computed, which returns a single value. That value is recycled to fit the length of the group. Once again, no copies are being made at all. `flights` *data.table* is modified *in-place*.

* We could have also provided `by` with a *character vector* as we saw in the `vignette("datatable-intro", package="data.table")`, e.g., `by = c("origin", "dest")`.
* We could have also provided `by` with a *character vector* as we saw in the [`vignette("datatable-intro", package="data.table")`](datatable-intro.html) vignette, e.g., `by = c("origin", "dest")`.

#

Expand All @@ -253,7 +253,7 @@ head(flights)

* Note that since we allow assignment by reference without quoting column names when there is only one column as explained in [Section 2c](#delete-convenience), we can not do `out_cols := lapply(.SD, max)`. That would result in adding one new column named `out_cols`. Instead we should do either `c(out_cols)` or simply `(out_cols)`. Wrapping the variable name with `(` is enough to differentiate between the two cases.

* The `LHS := RHS` form allows us to operate on multiple columns. In the RHS, to compute the `max` on columns specified in `.SDcols`, we make use of the base function `lapply()` along with `.SD` in the same way as we have seen before in the `vignette("datatable-intro", package="data.table")`. It returns a list of two elements, containing the maximum value corresponding to `dep_delay` and `arr_delay` for each group.
* The `LHS := RHS` form allows us to operate on multiple columns. In the RHS, to compute the `max` on columns specified in `.SDcols`, we make use of the base function `lapply()` along with `.SD` in the same way as we have seen before in the [`vignette("datatable-intro", package="data.table")`](datatable-intro.html) vignette. It returns a list of two elements, containing the maximum value corresponding to `dep_delay` and `arr_delay` for each group.

#
Before moving on to the next section, let's clean up the newly created columns `speed`, `max_speed`, `max_dep_delay` and `max_arr_delay`.
Expand Down Expand Up @@ -369,7 +369,7 @@ However we could improve this functionality further by *shallow* copying instead
* It is used to *add/update/delete* columns by reference.
* We have also seen how to use `:=` along with `i` and `by` the same way as we have seen in the `vignette("datatable-intro", package="data.table")`. We can in the same way use `keyby`, chain operations together, and pass expressions to `by` as well all in the same way. The syntax is *consistent*.
* We have also seen how to use `:=` along with `i` and `by` the same way as we have seen in the [`vignette("datatable-intro", package="data.table")`](datatable-intro.html) vignette. We can in the same way use `keyby`, chain operations together, and pass expressions to `by` as well all in the same way. The syntax is *consistent*.
* We can use `:=` for its side effect or use `copy()` to not modify the original object while updating by reference.
Expand All @@ -379,6 +379,6 @@ setDTthreads(.old.th)

#

So far we have seen a whole lot in `j`, and how to combine it with `by` and little of `i`. Let's turn our attention back to `i` in the next vignette `vignette("datatable-keys-fast-subset", package="data.table")` to perform *blazing fast subsets* by *keying data.tables*.
So far we have seen a whole lot in `j`, and how to combine it with `by` and little of `i`. Let's turn our attention back to `i` in the [next vignette (`vignette("datatable-keys-fast-subset", package="data.table")`)](datatable-keys-fast-subset.html) to perform *blazing fast subsets* by *keying data.tables*.

***
4 changes: 2 additions & 2 deletions vignettes/datatable-reshape.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -135,7 +135,7 @@ dcast(DT.m1, family_id ~ ., fun.agg = function(x) sum(!is.na(x)), value.var = "d

Check `?dcast` for other useful arguments and additional examples.

## 2. Limitations in current `melt/dcast` approaches
## 2. Limitations in previous `melt/dcast` approaches

So far we've seen features of `melt` and `dcast` that are implemented efficiently for `data.table`s, using internal `data.table` machinery (*fast radix ordering*, *binary search* etc.).

Expand Down Expand Up @@ -170,7 +170,7 @@ str(DT.c1) ## gender column is class IDate now!

As an analogy, imagine you've a closet with four shelves of clothes and you'd like to put together the clothes from shelves 1 and 2 together (in 1), and 3 and 4 together (in 3). What we are doing is more or less to combine all the clothes together, and then split them back on to shelves 1 and 3!

2. The columns to `melt` may be of different types, as in this case (`character` and `integer` types). By `melt`ing them all together, the columns will be coerced in result, as explained by the warning message above and shown from output of `str(DT.c1)`, where `gender` has been converted to *`character`* type.
2. The columns to `melt` may be of different types. By `melt`ing them all together, the columns will be coerced in result.

3. We are generating an additional column by splitting the `variable` column into two columns, whose purpose is quite cryptic. We do it because we need it for *casting* in the next step.

Expand Down
2 changes: 1 addition & 1 deletion vignettes/datatable-sd-usage.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -124,7 +124,7 @@ head(unique(Teams[[fkt[1L]]]))
Note:


1. The `:=` is an assignment operator to update the `data.table` in place without making a copy. See `vignette("datatable-reference-semantics", package="data.table")` for more.
1. The `:=` is an assignment operator to update the `data.table` in place without making a copy. See [`vignette("datatable-reference-semantics", package="data.table")`](datatable-reference-semantics.html) for more.
2. The LHS, `names(.SD)`, indicates which columns we are updating - in this case we update the entire `.SD`.
3. The RHS, `lapply()`, loops through each column of the `.SD` and converts the column to a factor.
4. We use the `.SDcols` to only select columns that have pattern of `teamID`.
Expand Down
Loading

0 comments on commit 4ad7706

Please sign in to comment.