data.table provides a high-performance version of base R's data.frame with syntax and feature enhancements for ease of use, convenience and programming speed.
Tuesday 22nd October 2019
Matt Dowle will be in New York for H2O World.
Please Ask-Me-Anything starting now: click http://sli.do and enter event code "askmattdowle".
I'll answer the most voted questions during my session: https://h2o.ai/h2oworldny-livestream-reg
- concise syntax: fast to type, fast to read
- fast speed
- memory efficient
- careful API lifecycle management
- community
- feature rich
- fast and friendly delimited file reader: ?fread, see also convenience features for small data
- fast and feature rich delimited file writer: ?fwrite
- low-level parallelism: many common operations are internally parallelized to use multiple CPU threads
- fast and scalable aggregations; e.g. 100GB in RAM (see benchmarks on up to two billion rows)
- fast and feature rich joins: ordered joins (e.g. rolling forwards, backwards, nearest and limited staleness), overlapping range joins (similar to IRanges::findOverlaps), non-equi joins (i.e. joins using operators >, >=, <, <=), aggregate on join (by=.EACHI), update on join
- fast add/update/delete columns by reference by group using no copies at all
- fast and feature rich reshaping data: ?dcast (pivot/wider/spread) and ?melt (unpivot/longer/gather)
- any R function from any R package can be used in queries, not just the subset of functions made available by a database backend; columns of type list are also supported
- has no dependencies at all other than base R itself, for simpler production/maintenance
- the R dependency is as old as possible for as long as possible and we continuously test against that version; e.g. v1.11.0 released on 5 May 2018 bumped the dependency up from 5 year old R 3.0.0 to 4 year old R 3.1.0
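As a rough illustration of several features listed above, the sketch below runs a file round trip, a rolling join, an update by reference by group, and a reshape. The tables trades and quotes and all of their column names are hypothetical, invented only for this example; the calls used (fwrite, fread, the on= and roll= arguments, :=, melt, dcast) are part of data.table.

```r
library(data.table)

# Hypothetical toy data, invented for illustration only
trades = data.table(sym = c("A", "A", "B"), time = c(3L, 7L, 5L), qty = c(10L, 20L, 5L))
quotes = data.table(sym = c("A", "A", "B"), time = c(1L, 5L, 4L), px = c(100, 101, 50))

# File writer and reader: round trip through a temporary CSV file
tmp = tempfile(fileext = ".csv")
fwrite(trades, tmp)
trades2 = fread(tmp)
unlink(tmp)

# Rolling join: for each trade, take the latest quote at or before the trade time
joined = quotes[trades, on = .(sym, time), roll = TRUE]

# Add a column by reference, computed by group; 'joined' is modified in place
joined[, total_qty := sum(qty), by = sym]

# Reshape: melt to long format, then dcast back to wide
long = melt(joined, id.vars = c("sym", "time"), measure.vars = c("qty", "total_qty"))
wide = dcast(long, sym + time ~ variable, value.var = "value")
```

With roll = TRUE the join matches on sym exactly and rolls the last earlier quote forward on time, so a trade with no quote at its exact timestamp still picks up a price.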
install.packages("data.table")
# latest development version:
install.packages("data.table", repos="https://Rdatatable.gitlab.io/data.table")
# or update only if a newer revision is available
data.table::update.dev.pkg()
See the Installation wiki for more details.
Use the data.table subset [ operator the same way you would use the data.frame one, but...
- no need to prefix each column with DT$ (like subset() and with() but built-in)
- any R expression using any package is allowed in the j argument, not just a list of columns
- extra argument by to compute the j expression by group
library(data.table)
DT = as.data.table(iris)
# FROM[WHERE, SELECT, GROUP BY]
# DT  [i,     j,      by]
DT[Petal.Width > 1.0, mean(Petal.Length), by = Species]
#      Species       V1
#1: versicolor 4.362791
#2:  virginica 5.552000
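The same i, j, by pattern extends to richer j expressions. The following is a minimal sketch on the iris data used above; the result names summ, med and n_wide and the summary column names are illustrative choices, not part of the package.

```r
library(data.table)
DT = as.data.table(iris)

# j can be any R expression: here a named list of summaries, computed per Species
summ = DT[Petal.Width > 1.0,
          .(n = .N, mean_len = mean(Petal.Length), max_len = max(Petal.Length)),
          by = Species]

# Functions from other packages can be called in j; stats::median is used here
med = DT[, stats::median(Sepal.Length), by = Species]

# Columns are referenced directly in i, without the DT$ prefix
n_wide = DT[Sepal.Width > 3.5, .N]
```

Wrapping j in .() names the output columns; an unnamed single expression, as in med, yields the default column name V1.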
- Introduction to data.table vignette
- Getting started wiki page
data.table is widely used by the R community. As of July 2019, it was used by over 680 CRAN and Bioconductor packages and was the 9th most starred R package on GitHub. If you need help, the data.table community is active on StackOverflow, with nearly 9,000 questions.
- click the Watch button at the top right of the GitHub project page
- read the NEWS file
- follow #rdatatable on Twitter
- watch recent Presentations
- read recent Articles
Guidelines for filing issues / pull requests: Contribution Guidelines.