If you are viewing this file on CRAN, please check latest news on GitHub where the formatting is also better.
Benchmarks are regularly updated: here
data.table v1.14.3 (in development)
-
nafill() now applies fill= to the front/back of the vector when type="locf|nocb", #3594. Thanks to @ben519 for the feature request. It also now returns a named object based on the input names. Note that if you are considering joining and then using nafill(...,type='locf|nocb') afterwards, please review roll=/rollends= which should achieve the same result in one step more efficiently. nafill() is for when filling-while-joining (i.e. roll=/rollends=/nomatch=) cannot be applied.
-
mean(na.rm=TRUE) by group is now GForce optimized, #4849. Thanks to the h2oai/db-benchmark project for spotting this issue. The 1 billion row example in the issue shows 48s reduced to 14s. The optimization also applies to type integer64 resulting in a difference to the bit64::mean.integer64 method: data.table returns a double result whereas bit64 rounds the mean to the nearest integer.
-
fwrite() now writes UTF-8 or native csv files by specifying the encoding= argument, #1770. Thanks to @shrektan for the request and the PR.
-
data.table() no longer fills empty vectors with NA with warning. Instead a 0-row data.table is returned, #3727. Since data.table() is used internally by .(), this brings the following examples in line with expectations in most cases. Thanks to @shrektan for the suggestion and PR.

    DT = data.table(A=1:3, B=letters[1:3])
    DT[A>3, .(ITEM='A>3', A, B)]    # (1)
    DT[A>3][, .(ITEM='A>3', A, B)]  # (2)
    # the above are now equivalent as expected and return:
    Empty data.table (0 rows and 3 cols): ITEM,A,B
    # Previously, (2) returned :
          ITEM     A      B
        <char> <int> <char>
    1:     A>3    NA   <NA>
    Warning messages:
    1: In as.data.table.list(jval, .named = NULL) :
      Item 2 has 0 rows but longest item has 1; filled with NA
    2: In as.data.table.list(jval, .named = NULL) :
      Item 3 has 0 rows but longest item has 1; filled with NA

    DT = data.table(A=1:3, B=letters[1:3], key="A")
    DT[.(1:3, double()), B]
    # new result :
    character(0)
    # old result :
    [1] "a" "b" "c"
    Warning message:
    In as.data.table.list(i) :
      Item 2 has 0 rows but longest item has 3; filled with NA
-
%like% on factors with a large number of levels is now faster, #4748. The example in the PR shows 2.37s reduced to 0.86s on a factor of length 100 million containing 1 million unique 10-character strings. Thanks to @statquant for reporting, and @shrektan for implementing.
-
keyby= now accepts TRUE/FALSE together with by=, #4307. The primary motivation is benchmarking where by= vs keyby= is varied across a set of queries. Thanks to Jan Gorecki for the request and the PR.

    DT[, sum(colB), keyby="colA"]
    DT[, sum(colB), by="colA", keyby=TRUE]   # same
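
The benchmarking motivation can be sketched as a loop over the keyby= setting. This is a minimal illustration, assuming a data.table version that includes this feature; the toy data and the loop variable `sorted` are made up for the example.

```r
library(data.table)

DT = data.table(colA = c("b", "a", "b", "a"), colB = 1:4)

# Run the same grouped query with and without keying the result.
# `sorted` is a hypothetical loop variable standing in for a benchmark setting.
results = lapply(c(unkeyed = FALSE, keyed = TRUE), function(sorted) {
  DT[, sum(colB), by = "colA", keyby = sorted]
})

key(results$unkeyed)  # no key: groups stay in order of first appearance
key(results$keyed)    # keyed by the grouping column, so groups are sorted
```

Only the logical value changes between runs, so the query text itself can stay fixed across a benchmark grid.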
-
fwrite() gains a new datatable.fwrite.sep option to change the default separator, still "," by default. Thanks to Tony Fischetti for the PR. As is good practice in R in general, we usually resist new global options for the reason that a user changing the option for their own code can inadvertently change the behaviour of any package using data.table too. However, in this case, the global option affects file output rather than code behaviour. In fact, the very reason the user may wish to change the default separator is that they know a different separator is more appropriate for their data being passed to the package using fwrite but cannot otherwise change the fwrite call within that package.
-
melt() now supports NA entries when specifying a list of measure.vars, which translate into runs of missing values in the output. Useful for melting wide data with some missing columns, #4027. Thanks to @vspinu for reporting, and @tdhock for implementing.
-
melt() now supports multiple output variable columns via the variable_table attribute of measure.vars, #3396 #2575 #2551, #4998. It should be a data.table with one row that describes each element of the measure.vars vector(s). These data/columns are copied to the output instead of the usual variable column. This is backwards compatible since the previous behavior (one output variable column) is used when there is no variable_table. New functions measure() and measurev() use either a separator or a regex to create a measure.vars list/vector with variable_table attribute; they are useful for melting data that has several distinct pieces of information encoded in each column name. See new ?measure and new section in reshape vignette. Thanks to Matthias Gomolka, Ananda Mahto, Hugh Parsonage, Mark Fairbanks for reporting, and to Toby Dylan Hocking for implementing. Thanks to @keatingw for testing before release, requesting measure() accept single groups too #5065, and Toby for implementing.
-
A new interface for programming on data.table has been added, closing #2655 and many other linked issues. It is built using base R's substitute-like interface via a new env argument to [.data.table. For details see the new vignette programming on data.table, and the new ?substitute2 manual page. Thanks to numerous users for filing requests, and Jan Gorecki for implementing.

    DT = data.table(x = 1:5, y = 5:1)

    # parameters
    in_col_name = "x"
    fun = "sum"
    fun_arg1 = "na.rm"
    fun_arg1val = TRUE
    out_col_name = "sum_x"

    # parameterized query
    #DT[, .(out_col_name = fun(in_col_name, fun_arg1=fun_arg1val))]

    # desired query
    DT[, .(sum_x = sum(x, na.rm=TRUE))]

    # new interface
    DT[, .(out_col_name = fun(in_col_name, fun_arg1=fun_arg1val)),
      env = list(
        in_col_name = "x",
        fun = "sum",
        fun_arg1 = "na.rm",
        fun_arg1val = TRUE,
        out_col_name = "sum_x"
      )]
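
The env argument is mainly useful inside functions, where the column and function names arrive as strings. The following is a minimal sketch, assuming a data.table version that provides env=; the helper agg_col() and its argument names are hypothetical and not part of data.table.

```r
library(data.table)

# Hypothetical helper (not part of data.table): aggregate one column with a
# function chosen at run time and name the result column.
agg_col = function(dt, col, fn, out) {
  dt[, .(out_name = fn_name(col_name, na.rm = TRUE)),
     env = list(out_name = out, fn_name = fn, col_name = col)]
}

DT = data.table(x = 1:5, y = 5:1)
res_sum = agg_col(DT, "x", "sum", "sum_x")  # same query as DT[, .(sum_x = sum(x, na.rm=TRUE))]
res_max = agg_col(DT, "y", "max", "max_y")  # same query as DT[, .(max_y = max(y, na.rm=TRUE))]
```

The placeholders out_name, fn_name and col_name exist only in the query template; each is replaced by the corresponding string from the env list before the query is evaluated.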
-
DT[, if (...) .(a=1L) else .(a=1L, b=2L), by=group] now returns a 1-column result with warning "j may not evaluate to the same number of columns for each group", rather than error "'names' attribute [2] must be the same length as the vector", #4274. Thanks to @robitalec for reporting, and Michael Chirico for the PR.
-
Typo checking in i available since 1.11.4 is extended to work in non-English sessions, #4989. Thanks to Michael Chirico for the PR.
-
fifelse() now coerces logical NA to other types and the na argument supports vectorized input, #4277 #4286 #4287. Thanks to @michaelchirico and @shrektan for reporting, and @shrektan for implementing.
-
.datatable.aware is now recognized in the calling environment in addition to the namespace of the calling package, dtplyr#184. Thanks to Hadley Wickham for the idea and PR.
-
New convenience function %plike% maps to like(..., perl=TRUE), #3702. %plike% uses Perl-compatible regular expressions (PCRE) which extend TRE, and may be more efficient in some cases. Thanks @KyleHaynes for the suggestion and PR.
-
fwrite() now accepts sep="", #4817. The motivation is an example where the result of paste0() needs to be written to file but paste0() takes 40 minutes due to constructing a very large number of unique long strings in R's global character cache. Allowing fwrite(, sep="") avoids the paste0 and saves 40 mins. Thanks to Jan Gorecki for the request, and Ben Schwen for the PR.
-
data.table printing now supports customizable methods for both columns and list column row items, part of #1523. format_col is S3-generic for customizing how to print whole columns and by default defers to the S3 format method for the column's class if one exists; e.g. format.sfc for geometry columns from the sf package, #2273. Similarly, format_list_item is S3-generic for customizing how to print each row of list columns (which lack a format method at a column level) and also by default defers to the S3 format method for that item's class if one exists. Thanks to @mllg who initially filed #3338 with the seed of the idea, @franknarf1 who earlier suggested the idea of providing custom formatters, @fparages who submitted a patch to improve the printing of timezones for #2842, @RichardRedding for pointing out an error relating to printing wide expression columns in #3011, @JoshOBrien for improving the output for geometry columns, and @MichaelChirico for implementing. See ?print.data.table for examples.
-
tstrsplit(,type.convert=) now accepts a named list of functions to apply to each part, #5094. Thanks to @Kamgang-B for the request and implementing.
-
as.data.table(DF, keep.rownames=key='keyCol') now works, #4468. Thanks to Michael Chirico for the idea and the PR.
-
dcast() now supports complex values in value.var, #4855. This extends earlier support for complex values in formula. Thanks Elio Campitelli for the request, and Michael Chirico for the PR.
-
melt() was pseudo generic in that melt(DT) would dispatch to the melt.data.table method but melt(not-DT) would explicitly redirect to reshape2. Now melt() is standard generic so that methods can be developed in other packages, #4864. Thanks to @odelmarcelle for suggesting and implementing.
-
DT(i, j, by, ...) has been added, i.e. functional form of a data.table query, #641 #4872. Thanks to Yike Lu and Elio Campitelli for filing requests, many others for comments and suggestions, and Matt Dowle for the PR. This enables the data.table general form query to be invoked on a data.frame without converting it to a data.table first. The class of the input object is retained. Thanks to Mark Fairbanks and Boniface Kamgang for testing and reporting problems that have been fixed before release, #5106 #5107.

    mtcars |> DT(mpg>20, .(mean_hp=mean(hp)), by=cyl)

When data.table queries (either [...] or |> DT(...)) receive a data.table, the operations maintain data.table's attributes such as its key and any indices. For example, if a data.table is reordered by data.table, or a key column has a value changed by := in data.table, its key and indices will either be dropped or reordered appropriately. Some data.table operations automatically add and store an index on a data.table for reuse in future queries, if options(datatable.auto.index=TRUE), which is TRUE by default. data.tables are also over-allocated, which means there are spare column pointer slots allocated in advance so that a data.table in the .GlobalEnv can have a column added to it truly by reference, like an in-memory database with multiple client sessions connecting to one server R process, as a data.table video has shown in the past. But because R and other packages don't maintain data.table's attributes or over-allocation (e.g. a subset or reorder by R or another package will create invalid data.table attributes), data.table cannot use these attributes when it detects that base R or another package has touched the data.table in the meantime, even if the attributes may sometimes still be valid. So, please be aware that DT() on a data.table should achieve better speed and memory usage than DT() on a data.frame. DT() on a data.frame may still be useful to use data.table's syntax (e.g. sub-queries within group: |> DT(i, .SD[sub-query], by=grp)) without needing to convert to a data.table first.
-
DT[i, nomatch=NULL] where i contains row numbers now excludes NA and any outside the range [1,nrow], #3109 #3666. Before, NA rows were returned always for such values; i.e. nomatch=0|NULL was ignored. Thanks Michel Lang and Hadley Wickham for the requests, and Jan Gorecki for the PR. Using nomatch=0 in this case when i is row numbers generates the warning "Please use nomatch=NULL instead of nomatch=0; see news item 5 in v1.12.0 (Jan 2019)".

    DT = data.table(A=1:3)
    DT[c(1L, NA, 3L, 5L)]  # default nomatch=NA
    #        A
    #    <int>
    # 1:     1
    # 2:    NA
    # 3:     3
    # 4:    NA
    DT[c(1L, NA, 3L, 5L), nomatch=NULL]
    #        A
    #    <int>
    # 1:     1
    # 2:     3
-
DT[, head(.SD,n), by=grp] and tail are now optimized when n>1, #5060 #523. n==1 was already optimized. Thanks to Jan Gorecki and Michael Young for requesting, and Benjamin Schwendinger for the PR.
-
setcolorder() gains before= and after=, #4385. Thanks to Matthias Gomolka for the request, and both Benjamin Schwendinger and Xianghui Dong for implementing.
-
base::droplevels() gains a fast method for data.table, #647. Thanks to Steve Lianoglou for requesting, Boniface Kamgang and Martin Binder for testing, and Jan Gorecki and Benjamin Schwendinger for the PR. fdroplevels() for use on vectors has also been added.
-
shift() now also supports type="cyclic", #4451. Values that are normally pushed out by type="lag" or type="lead" are re-introduced with this type at the first/last positions. Thanks to @RicoDiel for requesting, and Benjamin Schwendinger for the PR.

    # Usage
    shift(1:5, n=-1:1, type="cyclic")
    # [[1]]
    # [1] 2 3 4 5 1
    #
    # [[2]]
    # [1] 1 2 3 4 5
    #
    # [[3]]
    # [1] 5 1 2 3 4

    # Benchmark
    x = sample(1e9) # 3.7 GB
    microbenchmark::microbenchmark(
      shift(x, 1, type="cyclic"),
      c(tail(x, 1), head(x,-1)),
      times = 10L,
      unit = "s"
    )
    # Unit: seconds
    #                          expr  min   lq mean median   uq  max neval
    #  shift(x, 1, type = "cyclic") 1.57 1.67 1.71   1.68 1.70 2.03    10
    #    c(tail(x, 1), head(x, -1)) 6.96 7.16 7.49   7.32 7.64 8.60    10
-
fread() now supports "0" and "1" in na.strings, #2927. Previously this was not permitted since "0" and "1" can be recognized as boolean values. Note that it is still not permitted to use "0" and "1" in na.strings in combination with logical01 = TRUE. Thanks to @msgoussi for the request, and Benjamin Schwendinger for the PR.
-
setkey() now supports type raw as value columns (not as key columns), #5100. Thanks Hugh Parsonage for requesting, and Benjamin Schwendinger for the PR.
-
shift() is now optimised by group, #1534. Thanks to Gerhard Nachtmann for requesting, and Benjamin Schwendinger for the PR.

    N = 1e7
    DT = data.table(x=sample(N), y=sample(1e6,N,TRUE))
    shift_no_opt = shift  # different name not optimised as a way to compare
    microbenchmark(
      DT[, c(NA, head(x,-1)), y],
      DT[, shift_no_opt(x, 1, type="lag"), y],
      DT[, shift(x, 1, type="lag"), y],
      times=10L, unit="s")
    # Unit: seconds
    #                                       expr     min      lq    mean  median      uq     max neval
    #                DT[, c(NA, head(x, -1)), y]  8.7620  9.0240  9.1870  9.2800  9.3700  9.4110    10
    #  DT[, shift_no_opt(x, 1, type = "lag"), y] 20.5500 20.9000 21.1600 21.3200 21.4400 21.5200    10
    #         DT[, shift(x, 1, type = "lag"), y]  0.4865  0.5238  0.5463  0.5446  0.5725  0.5982    10

Example from stackoverflow

    set.seed(1)
    mg = data.table(expand.grid(year=2012:2016, id=1:1000), value=rnorm(5000))
    microbenchmark(v1.9.4  = mg[, c(value[-1], NA), by=id],
                   v1.9.6  = mg[, shift_no_opt(value, n=1, type="lead"), by=id],
                   v1.14.4 = mg[, shift(value, n=1, type="lead"), by=id],
                   unit="ms")
    # Unit: milliseconds
    #     expr     min      lq    mean  median      uq    max neval
    #   v1.9.4  3.6600  3.8250  4.4930  4.1720  4.9490 11.700   100
    #   v1.9.6 18.5400 19.1800 21.5100 20.6900 23.4200 29.040   100
    #  v1.14.4  0.4826  0.5586  0.6586  0.6329  0.7348  1.318   100
-
rbind() and rbindlist() now support fill=TRUE with use.names=FALSE instead of issuing the warning "use.names= cannot be FALSE when fill is TRUE. Setting use.names=TRUE."

    DT1 = data.table(A=1:2, B=5:6)
    DT1
    #        A     B
    #    <int> <int>
    # 1:     1     5
    # 2:     2     6

    DT2 = data.table(foo=3:4)
    DT2
    #      foo
    #    <int>
    # 1:     3
    # 2:     4

    rbind(DT1, DT2, fill=TRUE)   # no change
    #        A     B   foo
    #    <int> <int> <int>
    # 1:     1     5    NA
    # 2:     2     6    NA
    # 3:    NA    NA     3
    # 4:    NA    NA     4

    rbind(DT1, DT2, fill=TRUE, use.names=FALSE)

    # was:
    #        A     B   foo
    #    <int> <int> <int>
    # 1:     1     5    NA
    # 2:     2     6    NA
    # 3:    NA    NA     3
    # 4:    NA    NA     4
    # Warning message:
    # In rbindlist(l, use.names, fill, idcol) :
    #   use.names= cannot be FALSE when fill is TRUE. Setting use.names=TRUE.

    # now:
    #        A     B
    #    <int> <int>
    # 1:     1     5
    # 2:     2     6
    # 3:     3    NA
    # 4:     4    NA
-
fread() already made a good guess as to whether column names are present by comparing the type of the fields in row 1 to the type of the fields in the sample. This guess is now improved when a column contains a string in row 1 (i.e. a potential column name) but all blank in the sample rows, #2526. Thanks @st-pasha for reporting, and @ben-schwen for the PR.
-
fread() can now read .zip and .tar directly, #3834. Moreover, if a compressed file name is missing its extension, fread() now attempts to infer the correct filetype from its magic bytes. Thanks to Michael Chirico for the idea, and Benjamin Schwendinger for the PR.
-
DT[, let(...)] is a new alias for the functional form of :=; i.e. DT[, ':='(...)], #3795. Thanks to Elio Campitelli for requesting, and Benjamin Schwendinger for the PR.

    DT = data.table(A=1:2)
    DT[, let(B=3:4, C=letters[1:2])]
    DT
    #        A     B      C
    #    <int> <int> <char>
    # 1:     1     3      a
    # 2:     2     4      b
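
Because let() is described as an alias for the functional form of :=, it should combine with by= in the same way. A small sketch, assuming a data.table version that provides let(); the column names are made up for illustration.

```r
library(data.table)

DT = data.table(g = c("a", "a", "b"), v = 1:3)

# Add two columns by reference, computed within each group of g.
DT[, let(total = sum(v), n = .N), by = g]
DT
```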
-
weighted.mean() is now optimised by group, #3977. Thanks to @renkun-ken for requesting, and Benjamin Schwendinger for the PR.
-
as.xts.data.table() now supports non-numeric xts coredata matrixes, #5268. Existing numeric only functionality is supported by a new numeric.only parameter, which defaults to TRUE for backward compatibility and the most common use case. To convert non-numeric columns, set this parameter to FALSE. Conversions of data.table columns to a matrix now use data.table::as.matrix, with all its performance benefits. Thanks to @ethanbsmith for the report and fix.
-
unique.data.table() gains cols to specify a subset of columns to include in the resulting data.table, #5243. This saves the memory overhead of subsetting unneeded columns, and provides a cleaner API for a common operation previously needing more convoluted code. Thanks to @MichaelChirico for the suggestion & implementation.
-
:= is now optimized by group, #1414. Thanks to Arun Srinivasan for suggesting, and Benjamin Schwendinger for the PR. Thanks to @clerousset, @dcaseykc, @OfekShilon, and @SeanShao98 for testing dev and filing detailed bug reports which were fixed before release and their tests added to the test suite.
-
.I is now available in by for rowwise operations, #1732. Thanks to Rafael H. M. Pereira for requesting, and Benjamin Schwendinger for the PR.

    DT = data.table(V1=3:4, V2=5:6)
    DT
    #       V1    V2
    #    <int> <int>
    # 1:     3     5
    # 2:     4     6

    DT[, sum(.SD), by=.I]
    #        I    V1
    #    <int> <int>
    # 1:     1     8
    # 2:     2    10
-
New functions yearmon() and yearqtr() give a combined representation of year() and month()/quarter(). These and also yday, wday, mday, week, month and year are now optimized for memory and compute efficiency by removing the POSIXlt dependency, #649. Thanks to Matt Dowle for the request, and Benjamin Schwendinger for the PR.
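
A short sketch of the new date helpers next to the existing component functions, assuming a data.table version that provides yearmon() and yearqtr(). The item does not state the exact numeric encoding of the combined values, so no expected output is shown; the sample dates are arbitrary.

```r
library(data.table)

d = as.IDate(c("2021-03-15", "2021-07-01", "2021-11-30"))

# Separate components, available in earlier releases as well.
parts = data.table(date = d, year = year(d), month = month(d), quarter = quarter(d))

# Combined representations; both functions are new in this release.
parts[, ym := yearmon(date)]
parts[, yq := yearqtr(date)]
parts
```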
-
by=.EACHI when i is keyed but on= different columns than i's key could create an invalidly keyed result, #4603 #4911. Thanks to @myoung3 and @adamaltmejd for reporting, and @ColeMiller1 for the PR. An invalid key is where a data.table is marked as sorted by the key columns but the data is not sorted by those columns, leading to incorrect results from subsequent queries.
-
print(DT, trunc.cols=TRUE) and the corresponding datatable.print.trunc.cols option (new feature 3 in v1.13.0) could incorrectly display an extra column, #4266. Thanks to @tdhock for the bug report and @MichaelChirico for the PR.
-
fread(..., nrows=0L) now works as intended and the same as nrows=0; i.e. returning the column names and typed empty columns determined by the large sample, #4686, #4029. Thanks to @hongyuanjia and @michaelpaulhirsch for reporting, and Benjamin Schwendinger for the PR.
-
Passing .SD to frankv() with ties.method='random' or with na.last=NA failed with ".SD is locked", #4429. Thanks @smarches for the report.
-
Filtering data.table using which=NA to return non-matching indices will now properly work for non-optimized subsetting as well, closes #4411.
-
When j returns an object whose class "X" inherits from data.table; i.e. class c("X", "data.table", "data.frame"), the derived class "X" is no longer incorrectly dropped from the class of the data.table returned, #4324. Thanks to @HJAllen for reporting and @shrektan for the PR.
-
as.data.table() failed with ".subset2(x, i, exact = exact): attempt to select less than one element in get1index" when passed an object inheriting from data.table with a different [[ method, such as the class dfidx from the dfidx package, #4526. Thanks @RicoDiel for the report, and Michael Chirico for the PR.
-
rbind() and rbindlist() of length-0 ordered factors failed with "Internal error: savetl_init checks failed", #4795 #4823. Thanks to @shrektan and @dbart79 for reporting, and @shrektan for fixing.
-
data.table(NULL)[, firstCol:=1L] created data.table(firstCol=1L) ok but did not update the internal row.names attribute, causing "Error in '$<-.data.frame'(x, name, value) : replacement has 1 row, data has 0" when passed to packages like ggplot which use DT as if it is a data.frame, #4597. Thanks to Matthew Son for reporting, and Cole Miller for the PR.
-
X[Y, .SD, by=] (joining and grouping in the same query) could segfault if i) by= is supplied custom data (i.e. not simple expressions of columns), and ii) some rows of Y do not match to any rows in X, #4892. Thanks to @Kodiologist for reporting, @ColeMiller1 for investigating, and @tlapak for the PR.
-
Assigning a set of 2 or more all-NA values to a factor column could segfault, #4824. Thanks to @clerousset for reporting and @shrektan for fixing.
-
as.data.table(table(NULL)) now returns data.table(NULL) rather than error "attempt to set an attribute on NULL", #4179. The result differs slightly to as.data.frame(table(NULL)) (0-row, 1-column) because 0-column works better with other data.table functions like rbindlist(). Thanks to Michael Chirico for the report and fix.
-
melt with a list for measure.vars would output variable inconsistently between na.rm=TRUE and FALSE, #4455. Thanks to @tdhock for reporting and fixing.
-
by=...get()... could fail with "object not found", #4873 #4981. Thanks to @sindribaldur for reporting, and @OfekShilon for fixing.
-
print(x, col.names='none') now removes the column names as intended for wide data.tables whose column names don't fit on a single line, #4270. Thanks to @tdhock for the report, and Michael Chirico for fixing.
-
DT[, min(colB), by=colA] when colB is type character would miss blank strings ("") at the beginning of a group and return the smallest non-blank instead of blank, #4848. Thanks to Vadim Khotilovich for reporting and for the PR fixing it.
-
Assigning a wrong-length or non-list vector to a list column could segfault, #4166 #4667 #4678 #4729. Thanks to @fklirono, Kun Ren, @kevinvzandvoort and @peterlittlejohn for reporting, and to Václav Tlapák for the PR.
-
as.data.table() on xts objects containing a column named x would return an index of type plain integer rather than POSIXct, #4897. Thanks to Emil Sjørup for reporting, and Jan Gorecki for the PR.
-
A fix to as.Date(c("", ...)) in R 4.0.3, 17909, has been backported to data.table::as.IDate() so that it too now returns NA for the first item when it is blank, even in older versions of R back to 3.1.0, rather than the incorrect error "character string is not in a standard unambiguous format", #4676. Thanks to Arun Srinivasan for reporting, and Michael Chirico both for the data.table PR and for submitting the patch to R that was accepted and included in R 4.0.3.
-
uniqueN(DT, by=character()) is now equivalent to uniqueN(DT) rather than internal error "'by' is either not integer or is length 0", #4594. Thanks Marco Colombo for the report, and Michael Chirico for the PR. Similarly for unique(), duplicated() and anyDuplicated().
-
melt() on a data.table with list columns for measure.vars would silently ignore na.rm=TRUE, #5044. Now the same logic as is.na() from base R is used; i.e. if a list element is scalar NA then it is considered missing and removed. Thanks to Toby Dylan Hocking for the PRs.
-
fread(fill=TRUE) could segfault if the input contained an improperly quoted character field, #4774 #5041. Thanks to @AndeolEvain and @e-nascimento for reporting and to Václav Tlapák for the PR.
-
fread(fill=TRUE, verbose=TRUE) would segfault on the out-of-sample type bump verbose output if the input did not contain column names, #5046. Thanks to Václav Tlapák for the PR.
-
.SDcols=-V2:-V1 and .SDcols=(-1) could error with "xcolAns does not pass checks" and "argument specifying columns specify non existing column(s)", #4231. Thanks to Jan Gorecki for reporting and the PR.
-
.SDcols=<logical vector> is now documented in ?data.table and it is now an error if the logical vector's length is not equal to the number of columns (consistent with data.table's no-recycling policy; see new feature 1 in v1.12.2 Apr 2019), #4115. Thanks to @Henrik-P for reporting and Jan Gorecki for the PR.
-
melt() now outputs scalar logical NA instead of NULL in rows corresponding to missing list columns, for consistency with non-list columns when using na.rm=TRUE, #5053. Thanks to Toby Dylan Hocking for the PR.
-
as.data.frame(DT), setDF(DT) and as.list(DT) now remove the "index" attribute which contains any indices (a.k.a. secondary keys), as they already did for other data.table-only attributes such as the primary key stored in the "sorted" attribute. When indices were left intact, a subsequent subset, assign, or reorder of the data.frame by data.frame-code in base R or other packages would not update the indices, causing incorrect results if then converted back to data.table, #4889. Thanks @OfekShilon for the report and the PR.
-
dplyr::arrange(DT) uses vctrs::vec_slice which retains data.table's class but uses C to bypass [ method dispatch and does not adjust data.table's attributes containing the index row numbers, #5042. data.table's long-standing .internal.selfref mechanism to detect such operations by other packages was not being checked by data.table when using indexes, causing data.table filters and joins to use invalid indexes and return incorrect results after a dplyr::arrange(DT). Thanks to @Waldi73 for reporting; @avimallu, @tlapak, @MichaelChirico, @jangorecki and @hadley for investigating and suggestions; and @mattdowle for the PR. The intended way to use data.table is data.table::setkey(DT, col1, col2, ...) which reorders DT by reference in parallel, sets the primary key for automatic use by subsequent data.table queries, and permits rowname-like usage such as DT["foo",] which returns the now-contiguous-in-memory block of rows where the first column of DT's key contains "foo". Multi-column-rownames (i.e. a primary key of more than one column) can be looked up using DT[.("foo",20210728L), ]. Using == in i is also optimized to use the key or indices, if you prefer using column names explicitly and ==. An alternative to setkey(DT) is returning a new ordered result using DT[order(col1, col2, ...), ].
-
A segfault occurred when nrow/throttle < nthread, #5077. With the default throttle of 1024 rows (see ?setDTthreads), at least 64 threads would be needed to trigger the segfault since there needed to be more than 65,535 rows too. It occurred on a server with 256 logical cores where data.table uses 128 threads by default. Thanks to Bennet Becker for reporting, debugging at C level, and fixing. It also occurred when the throttle was increased so as to use fewer threads; e.g. at the limit setDTthreads(throttle=nrow(DT)).
-
fread(file=URL) now works rather than error "does not exist or is non-readable", #4952. fread(URL) and fread(input=URL) worked before and continue to work. Thanks to @pnacht for reporting and @ben-schwen for the PR.
-
fwrite(DF, row.names=TRUE) where DF has specific integer rownames (e.g. using rownames(DF) <- c(10L,20L,30L)) would ignore the integer rownames and write the row numbers instead, #4957. Thanks to @dgarrimar for reporting and @ColeMiller1 for the PR. Further, when quote='auto' (default) and the rownames are integers (either default or specific), they are no longer quoted.
-
test.data.table() would fail on test 1894 if the variable z was defined by the user, #3705. The test suite already ran in its own separate environment. That environment's parent is no longer .GlobalEnv to isolate it further. Thanks to Michael Chirico for reporting, and Matt Dowle for the PR.
-
fread(text="a,b,c") (where input data contains no \n but text= has been used) now works instead of error "file not found: a,b,c", #4689. Thanks to @trainormg for reporting, and @ben-schwen for the PR.
-
na.omit(DT) did not remove NA in nanotime columns, #4744. Thanks Jean-Mathieu Vermosen for reporting, and Michael Chirico for the PR.
-
DT[, min(intCol, na.rm=TRUE), by=grp] would return Inf for any groups containing all NAs, with a type change from integer to numeric to hold the Inf, and with warning. Similarly max would return -Inf. Now NA is returned for such all-NA groups, without warning or type change. This is almost-surely less surprising, more convenient, consistent, and efficient. There was no user request for this, likely because our desire to be consistent with base R in this regard was known (base::min(x, na.rm=TRUE) returns Inf with warning for all-NA input). Matt Dowle made this change when reworking internals, #5105. The old behavior seemed so bad, and since there was a warning too, it seemed appropriate to treat it as a bug.

    DT = data.table(A=c("a","a","b","b"), B=c(1L,NA,2L,NA))
    DT
    #         A     B
    #    <char> <int>
    # 1:      a     1
    # 2:      a    NA
    # 3:      b     2
    # 4:      b    NA

    DT[, min(B,na.rm=TRUE), by=A]  # no change in behavior (no all-NA groups yet)
    #         A    V1
    #    <char> <int>
    # 1:      a     1
    # 2:      b     2

    DT[3, B:=NA]                   # make an all-NA group
    DT
    #         A     B
    #    <char> <int>
    # 1:      a     1
    # 2:      a    NA
    # 3:      b    NA
    # 4:      b    NA

    DT[, min(B,na.rm=TRUE), by=A]  # old result
    #         A    V1
    #    <char> <num>              # V1's type changed to numeric (inconsistent)
    # 1:      a     1
    # 2:      b   Inf              # Inf surprising
    # Warning message:             # warning inconvenient
    # In gmin(B, na.rm = TRUE) :
    #   No non-missing values found in at least one group. Coercing to numeric
    #   type and returning 'Inf' for such groups to be consistent with base

    DT[, min(B,na.rm=TRUE), by=A]  # new result
    #         A    V1
    #    <char> <int>              # V1's type remains integer (consistent)
    # 1:      a     1
    # 2:      b    NA              # NA because there are no non-NA, naturally
    #                              # no inconvenient warning

On the same basis, min and max methods for empty IDate input now return NA_integer_ of class IDate, rather than NA_double_ of class IDate together with base R's warning "no non-missing arguments to min; returning Inf", #2256. The type change and warning would cause an error in grouping, see example below. Since NA was returned before it seems clear that still returning NA but of the correct type and with no warning is appropriate, backwards compatible, and a bug fix. Thanks to Frank Narf for reporting, and Matt Dowle for fixing.

    DT = data.table(d=as.IDate(c("2020-01-01","2020-01-02","2019-12-31")), g=c("a","a","b"))
    DT
    #             d      g
    #        <IDat> <char>
    # 1: 2020-01-01      a
    # 2: 2020-01-02      a
    # 3: 2019-12-31      b

    DT[, min(d[d>"2020-01-01"]), by=g]

    # was:
    # Error in `[.data.table`(DT, , min(d[d > "2020-01-01"]), by = g) :
    #   Column 1 of result for group 2 is type 'double' but expecting type
    #   'integer'. Column types must be consistent for each group.
    # In addition: Warning message:
    # In min.default(integer(0), na.rm = FALSE) :
    #   no non-missing arguments to min; returning Inf

    # now :
    #         g         V1
    #    <char>     <IDat>
    # 1:      a 2020-01-02
    # 2:      b       <NA>
-
DT[, min(int64Col), by=grp] (and max) would return incorrect results for bit64::integer64 columns, #4444. Thanks to @go-see for reporting, and Michael Chirico for the PR.
-
fread(dec=',') was able to guess sep=',' and return an incorrect result, #4483. Thanks to Michael Chirico for reporting and fixing. It was already an error to provide both sep=',' and dec=',' manually.

    fread('A|B|C\n1|0,4|a\n2|0,5|b\n', dec=',')  # no problem
    #        A     B      C
    #    <int> <num> <char>
    # 1:     1   0.4      a
    # 2:     2   0.5      b

    fread('A|B,C\n1|0,4\n2|0,5\n', dec=',')
    #       A|B     C    # old result guessed sep=',' despite dec=','
    #    <char> <int>
    # 1:    1|0     4
    # 2:    2|0     5

    #        A   B,C     # now detects sep='|' correctly
    #    <int> <num>
    # 1:     1   0.4
    # 2:     2   0.5
-
IDateTime() ignored the tz= and format= arguments because ... was not passed through to submethods, #2402. Thanks to Frank Narf for reporting, and Jens Peder Meldgaard for the PR.

    IDateTime("20171002095500", format="%Y%m%d%H%M%S")

    # was :
    # Error in charToDate(x) :
    #   character string is not in a standard unambiguous format

    # now :
    #         idate    itime
    #        <IDat>  <ITime>
    # 1: 2017-10-02 09:55:00
-
DT[i, sum(b), by=grp] (and other optimized-by-group aggregates: mean, var, sd, median, prod, min, max, first, last, head and tail) could segfault if i contained row numbers and one or more were NA, #1994. Thanks to Arun Srinivasan for reporting, and Benjamin Schwendinger for the PR.
-
identical(fread(text="A\n0.8060667366\n")$A, 0.8060667366) is now TRUE, #4461. This is one of 13 numbers in the set of 100,000 between 0.80606 and 0.80607 in 0.0000000001 increments that were not already identical. In all 13 cases R's parser (same as read.table) and fread straddled the true value by a very similar small amount. fread now uses /10^n rather than *10^-n to match R identically in all cases. Thanks to Gabe Becker for requesting consistency, and Michael Chirico for the PR.

    for (i in 0:99999) {
      s = sprintf("0.80606%05d", i)
      r = eval(parse(text=s))
      f = fread(text=paste0("A\n",s,"\n"))$A
      if (!identical(r, f))
        cat(s, sprintf("%1.18f", c(r, f, r)), "\n")
    }
    #        input    eval & read.table         fread before            fread now
    # 0.8060603509 0.806060350899999944 0.806060350900000055 0.806060350899999944
    # 0.8060614740 0.806061473999999945 0.806061474000000056 0.806061473999999945
    # 0.8060623757 0.806062375699999945 0.806062375700000056 0.806062375699999945
    # 0.8060629084 0.806062908399999944 0.806062908400000055 0.806062908399999944
    # 0.8060632774 0.806063277399999945 0.806063277400000056 0.806063277399999945
    # 0.8060638101 0.806063810099999944 0.806063810100000055 0.806063810099999944
    # 0.8060647118 0.806064711799999944 0.806064711800000055 0.806064711799999944
    # 0.8060658349 0.806065834899999945 0.806065834900000056 0.806065834899999945
    # 0.8060667366 0.806066736599999945 0.806066736600000056 0.806066736599999945
    # 0.8060672693 0.806067269299999944 0.806067269300000055 0.806067269299999944
    # 0.8060676383 0.806067638299999945 0.806067638300000056 0.806067638299999945
    # 0.8060681710 0.806068170999999944 0.806068171000000055 0.806068170999999944
    # 0.8060690727 0.806069072699999944 0.806069072700000055 0.806069072699999944
    #
    # remaining 99,987 of these 100,000 were already identical
-
`dcast(empty-DT)` now returns an empty `data.table` rather than error `Cannot cast an empty data.table`, #1215. Thanks to Damian Betebenner for reporting, and Matt Dowle for fixing.
-
`DT[factor("id")]` now works rather than error `i has evaluated to type integer. Expecting logical, integer or double`, #1632. `DT["id"]` has worked forever by automatically converting to `DT[.("id")]` for convenience, and joins have worked forever between char/fact, fact/char and fact/fact even when levels mismatch, so it was unfortunate that `DT[factor("id")]` managed to escape the simple automatic conversion to `DT[.(factor("id"))]` which is now in place. Thanks to @aushev for reporting, and Matt Dowle for the fix.
-
All-NA character key columns could segfault, #5070. Thanks to @JorisChau for reporting and Benjamin Schwendinger for the fix.
-
In v1.13.2 a version of an old bug was reintroduced where during a grouping operation list columns could retain a pointer to the last group. This affected only attributes of list elements and only if those were updated during the grouping operation, #4963. Thanks to @fujiaxiang for reporting and @avimallu and Václav Tlapák for investigating and the PR.
-
`shift(xInt64, fill=0)` and `shift(xInt64, fill=as.integer64(0))` (but not `shift(xInt64, fill=0L)`) would error with `INTEGER() can only be applied to a 'integer', not a 'double'` where `xInt64` conveys `bit64::integer64`, `0` is type `double` and `0L` is type integer, #4865. Thanks to @peterlittlejohn for reporting and Benjamin Schwendinger for the PR.
-
`DT[i, strCol:=classVal]` did not coerce using the `as.character` method for the class, resulting in either an unexpected string value or an error such as `To assign integer64 to a target of type character, please use as.character() for clarity`. Discovered during work on the previous issue, #5189.

    DT
    #         A
    #    <char>
    # 1:      a
    # 2:      b
    # 3:      c
    DT[2, A:=as.IDate("2021-02-03")]
    DT[3, A:=bit64::as.integer64("4611686018427387906")]
    DT
    #                      A
    #                 <char>
    # 1:                   a
    # 2:          2021-02-03  # was 18661
    # 3: 4611686018427387906  # was error 'please use as.character'
-
`tables()` failed with `argument "..." is missing` when called from within a function taking `...`; e.g. `function(...) { tables() }`, #5197. Thanks @greg-minshall for the report and @michaelchirico for the fix.
-
`DT[, prod(int64Col), by=grp]` produced wrong results for `bit64::integer64` due to incorrect optimization, #5225. Thanks to Benjamin Schwendinger for reporting and fixing.
-
`fintersect(..., all=TRUE)` and `fsetdiff(..., all=TRUE)` could return incorrect results when the inputs had columns named `x` and `y`, #5255. Thanks @Fpadt for the report, and @ben-schwen for the fix.
-
`fwrite()` could produce non-ISO-compliant timestamps such as `2023-03-08T17:22:32.:00Z` when under a whole second by less than numerical tolerance of one microsecond, #5238. Thanks to @avraam-inside for the report and Václav Tlapák for the fix.
-
`merge.data.table()` silently ignored the `incomparables` argument, #2587. It is now implemented and any other ignored arguments (e.g. misspellings) are now warned about. Thanks to @GBsuperman for the report and @ben-schwen for the fix.
-
`DT[, c('z','x') := {x=NULL; list(2,NULL)}]` now removes column `x` as expected rather than incorrectly assigning `2` to `x` as well as `z`, #5284. The `x=NULL` is superfluous while the `list(2,NULL)` is the final value of `{}` whose items correspond to `c('z','x')`. Thanks @eutwt for the report, and @ben-schwen for the fix.
-
`as.data.frame(DT, row.names=)` no longer silently ignores `row.names`, #5319. Thanks to @dereckdemezquita for the fix and PR, and @ben-schwen for guidance.
-
New feature 29 in v1.12.4 (Oct 2019) introduced zero-copy coercion. Our thinking is that requiring you to get the type right in the case of `0` (type double) vs `0L` (type integer) is too inconvenient for you the user. So such coercions happen in `data.table` automatically without warning. Thanks to zero-copy coercion there is no speed penalty, even when calling `set()` many times in a loop, so there's no speed penalty to warn you about either. However, we believe that assigning a character value such as `"2"` into an integer column is more likely to be a user mistake that you would like to be warned about. The type difference (character vs integer) may be the only clue that you have selected the wrong column, or typed the wrong variable to be assigned to that column. For this reason we view character to numeric-like coercion differently and will warn about it. If it is correct, then the warning is intended to nudge you to wrap the RHS with `as.<type>()` so that it is clear to readers of your code that a coercion from character to that type is intended. For example:

    x = c(2L,NA,4L,5L)
    nafill(x, fill=3)                 # no warning; requiring 3L too inconvenient
    nafill(x, fill="3")               # warns in case either x or "3" was a mistake
    nafill(x, fill=3.14)              # warns that precision has been lost
    nafill(x, fill=as.integer(3.14))  # no warning; the as.<type> conveys intent
-
`CsubsetDT` exported C function has been renamed to `DT_subsetDT`. This requires `R_GetCCallable("data.table", "CsubsetDT")` to be updated to `R_GetCCallable("data.table", "DT_subsetDT")`. Additionally there is now a dedicated header file for data.table C exports `include/datatableAPI.h`, #4643, thanks to @eddelbuettel, which makes it easier to import data.table C functions.
-
In v1.12.4, fractional `fread(..., stringsAsFactors=)` was added. For example if `stringsAsFactors=0.2`, any character column with fewer than 20% unique strings would be cast as `factor`. This is now documented in `?fread` as well, #4706. Thanks to @markderry for the PR.
-
`cube(DT, by="a")` now gives a more helpful error that `j` is missing, #4282.
-
v1.13.0 (July 2020) fixed a segfault/corruption/error (depending on version of R and circumstances) in `dcast()` when `fun.aggregate` returned `NA` (type `logical`) in an otherwise `character` result, #2394. This fix was the result of other internal rework and there was no news item at the time. A new test to cover this case has now been added. Thanks Vadim Khotilovich for reporting, and Michael Chirico for investigating, pinpointing when the fix occurred and adding the test.
-
`DT[subset]` where `DT[(subset)]` or `DT[subset==TRUE]` was intended; i.e., subsetting by a logical column whose name conflicts with an existing function, now gives a friendlier error message, #5014. Thanks @michaelchirico for the suggestion and PR, and @ColeMiller1 for helping with the fix.
-
Grouping by a `list` column has its error message improved stating this is unsupported, #4308. Thanks @sindribaldur for filing, and @michaelchirico for the PR. Please add your vote and especially use cases to the #1597 feature request.
-
OpenBSD 6.9 released May 2021 uses a 16 year old version of zlib (v1.2.3 from 2005) plus cherry-picked bug fixes (i.e. a semi-fork of zlib) which induces `Compress gzip error: -9` from `fwrite()`, #5048. Thanks to Philippe Chataignon for investigating and fixing. Matt asked on OpenBSD's mailing list if zlib could be upgraded to 4 year old zlib 1.2.11 but forgot his tin hat: https://marc.info/?l=openbsd-misc&m=162455479311886&w=1.
-
`?"."`, `?".."`, `?".("`, and `?".()"` now point to `?data.table`, #4385 #4407, to help users find the documentation for these convenience features available inside `DT[...]`. Recall that `.` is an alias for `list`, and `..var` tells `data.table` to look for `var` in the calling environment as opposed to a column of the table.
-
`DT[, lhs:=rhs]` and `set(DT, , lhs, rhs)` no longer raise a warning on zero length `lhs`, #4086. Thanks to Jan Gorecki for the suggestion and PR. For example, `DT[, grep("foo", names(DT)) := NULL]` no longer warns if there are no column names containing `"foo"`.
-
`melt()`'s internal C code is now more memory efficient, #5054. Thanks to Toby Dylan Hocking for the PR.
-
`?merge` and `?setkey` have been updated to clarify that the row order is retained when `sort=FALSE`, and why `NA`s are always first when `sort=TRUE`, #2574 #2594. Thanks to Davor Josipovic and Markus Bonsch for the reports, and Jan Gorecki for the PR.
-
`datatable.[dll|so]` has changed name to `data_table.[dll|so]`, #4442. Thanks to Jan Gorecki for the PR. We had previously removed the `.` since `.` is not allowed by the following paragraph in the Writing-R-Extensions manual. Replacing `.` with `_` instead now seems more consistent with the last sentence.

> ... the basename of the DLL needs to be both a valid file name and valid as part of a C entry point (e.g. it cannot contain ‘.’): for portable code it is best to confine DLL names to be ASCII alphanumeric plus underscore. If entry point R_init_lib is not found it is also looked for with ‘.’ replaced by ‘_’.
-
For nearly two years, since v1.12.4 (Oct 2019) (note 11 below in this NEWS file), using `options(datatable.nomatch=0)` has produced the following message:

    The option 'datatable.nomatch' is being used and is not set to the default NA. This option is still honored for now but will be deprecated in future. Please see NEWS for 1.12.4 for detailed information and motivation. To specify inner join, please specify `nomatch=NULL` explicitly in your calls rather than changing the default using this option.

The message is now upgraded to a warning that the option is now ignored.
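For readers migrating away from the option, here is a minimal sketch of the per-call form the message recommends. The table `X`, its columns and the lookup values are invented for the illustration.

```r
library(data.table)

# Hypothetical lookup table and ids; "z" has no match in X.
X <- data.table(id = c("a", "b", "c"), v = 1:3)
ids <- c("a", "z")

# Default nomatch=NA: unmatched rows of i are kept, with NA in X's columns.
with_na <- X[.(id = ids), on = "id"]

# Explicit inner join in the call itself, instead of options(datatable.nomatch=0).
inner <- X[.(id = ids), on = "id", nomatch = NULL]

print(with_na)
print(inner)
```

Stating `nomatch=NULL` at the call site keeps the join type visible to whoever reads the code, which is the motivation given for retiring the global option.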
-
Many thanks to Kurt Hornik for investigating potential impact of a possible future change to `base::intersect()` on empty input, providing a patch so that `data.table` won't break if the change is made to R, and giving us plenty of notice, #5183.
-
The options `datatable.print.class` and `datatable.print.keys` are now `TRUE` by default. They have been available since v1.9.8 (Nov 2016) and v1.11.0 (May 2018) respectively.
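If the new printing defaults are not wanted, the same two options can be set back per session, or the corresponding `print()` arguments can be used for a single call. The following is a small sketch with a made-up keyed table; it only uses the `class=` and `print.keys=` arguments of the `data.table` print method and the two options named above.

```r
library(data.table)

# Hypothetical keyed table for illustration.
DT <- data.table(id = c("b", "a"), v = 1:2, key = "id")

# Restore the previous look for this session, remembering the old settings.
old <- options(datatable.print.class = FALSE, datatable.print.keys = FALSE)
print(DT)
options(old)

# Or control it for one call only, regardless of the options.
out <- capture.output(print(DT, class = TRUE, print.keys = TRUE))
cat(out, sep = "\n")
```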
data.table v1.14.2 (27 Sep 2021)
- clang 13.0.0 (Sep 2021) requires the system header `omp.h` to be included before R's headers, #5122. Many thanks to Prof Ripley for testing and providing a patch file.
data.table v1.14.0 (21 Feb 2021)
-
In v1.13.0 (July 2020) native parsing of datetime was added to `fread` by Michael Chirico which dramatically improved performance. Before then datetime was read as type character by default which was slow. Since v1.13.0, UTC-marked datetime (e.g. `2020-07-24T10:11:12.134Z` where the final `Z` is present) has been read automatically as POSIXct and quickly. We provided the migration option `datatable.old.fread.datetime.character` to revert to the previous slow character behavior. We also added the `tz=` argument to control unmarked datetime; i.e. where the `Z` (or equivalent UTC postfix) is missing in the data. The default `tz=""` reads unmarked datetime as character as before, slowly. We gave you the ability to set `tz="UTC"` to turn on the new behavior and read unmarked datetime as UTC, quickly. R sessions that are running in UTC by setting the TZ environment variable, as is good practice and common in production, have also been reading unmarked datetime as UTC since v1.13.0, much faster. Note 1 of v1.13.0 (below in this file) ended `In addition to convenience, fread is now significantly faster in the presence of dates, UTC-marked datetimes, and unmarked datetime when tz="UTC" is provided.`.

At `rstudio::global(2021)`, Neal Richardson, Director of Engineering at Ursa Labs, compared Arrow CSV performance to `data.table` CSV performance in his talk "Bigger Data With Ease Using Apache Arrow". He opened by comparing to `data.table` as his main point. Arrow was presented as 3 times faster than `data.table`. He talked at length about this result. However, no reproducible code was provided and we were not contacted in advance in case we had any comments. He mentioned New York Taxi data in his talk which is a dataset known to us as containing unmarked datetime. Rebuttal.

`tz=`'s default is now changed from `""` to `"UTC"`. If you have been using `tz=` explicitly then there should be no change. The change to read UTC-marked datetime as POSIXct rather than character already happened in v1.13.0. The change now is that unmarked datetimes are now read as UTC too by default without needing to set `tz="UTC"`. None of the 1,017 CRAN packages directly using `data.table` are affected. As before, the migration option `datatable.old.fread.datetime.character` can still be set to TRUE to revert to the old character behavior. This migration option is temporary and will be removed in the near future.

The community was consulted in this tweet before release.
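A short sketch of what the `tz=` argument controls, using a made-up two-column CSV whose datetime has no `Z` or offset (an unmarked datetime). Passing `tz=` explicitly, as below, gives the same result whichever default is in force.

```r
library(data.table)

# Hypothetical input: the datetime is unmarked (no trailing Z or UTC offset).
csv <- "id,when\n1,2020-07-24T10:11:12\n"

# Read the unmarked datetime as UTC; this is the fast native parser.
as_utc <- fread(text = csv, tz = "UTC")
class(as_utc$when)

# tz="" keeps the older treatment of unmarked datetime. The resulting class
# depends on the session timezone, so it is printed here rather than assumed.
as_before <- fread(text = csv, tz = "")
class(as_before$when)
```

Code that relies on unmarked datetimes arriving as character can pass `tz=""` explicitly rather than depending on the default.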
-
If `fread()` discards a single line footer, the warning message which includes the discarded text now displays any non-ASCII characters correctly on Windows, #4747. Thanks to @shrektan for reporting and the PR.
-
`fintersect()` now retains the order of the first argument as reasonably expected, rather than retaining the order of the second argument, #4716. Thanks to Michel Lang for reporting, and Ben Schwen for the PR.
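The ordering described for `fintersect()` can be checked with two small tables. The data below is invented for the example, and the commented expectation assumes a version of `data.table` that includes this fix.

```r
library(data.table)

# Hypothetical inputs: the same values appear in a different order in each table.
x <- data.table(a = c(3L, 1L, 2L))
y <- data.table(a = c(1L, 2L, 3L, 4L))

res <- fintersect(x, y)
res$a   # rows follow x's order (3, 1, 2), not y's (1, 2, 3)
```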
-
Compiling from source no longer requires `zlib` header files to be available, #4844. The output suggests installing `zlib` headers, and how (e.g. `zlib1g-dev` on Ubuntu) as before, but now proceeds with `gzip` compression disabled in `fwrite`. Upon calling `fwrite(DT, "file.csv.gz")` at runtime, an error message suggests to reinstall `data.table` with `zlib` headers available. This does not apply to users on Windows or Mac who install the pre-compiled binary package from CRAN.
-
`r-datatable.com` continues to be the short, canonical and long-standing URL which forwards to the current homepage. The homepage domain has changed a few times over the years but those using `r-datatable.com` did not need to change their links. For example, we use `r-datatable.com` in messages (and translated messages) in preference to the word 'homepage' to save users time in searching for the current homepage. The web forwarding was provided by Domain Monster but they do not support `https://r-datatable.com`, only `http://r-datatable.com`, despite the homepage it forwards to having been `https:` for many years. Meanwhile, CRAN submission checks now require all URLs to be `https:`, rejecting `http:`. Therefore we have moved to gandi.net who do support `https:` web forwarding and so https://r-datatable.com now forwards correctly. Thanks to Dirk Eddelbuettel for suggesting Gandi. Further, Gandi allows the web-forward to be marked 301 (permanent) or 302 (temporary). Since the very point of `https://r-datatable.com` is to be a forward, 302 is appropriate in this case. This enables us to link to it in DESCRIPTION, README, and this NEWS item. Otherwise, CRAN submission checks would require the 301 forward to be followed; i.e. the forward replaced with where it points to and the package resubmitted. Thanks to Uwe Ligges for explaining this distinction.
data.table v1.13.6 (30 Dec 2020)
-
Grouping could throw an error `Failed to allocate counts or TMP` with more than 1e9 rows even with sufficient RAM due to an integer overflow, #4295 #4818. Thanks to @renkun-ken and @jangorecki for reporting, and @shrektan for fixing.
-
`fwrite()`'s multithreaded `gzip` compression failed on Solaris with Z_STREAM_ERROR, #4099. Since this feature was released in Oct 2019 (see item 3 in v1.12.4 below in this news file) there have been no known problems with it on Linux, Windows or Mac. For Solaris, we have been successively adding more and more detailed tracing to the output in each release, culminating in tracing `zlib` internals at byte level by reading `zlib`'s source. The problem did not manifest itself on R-hub's Solaris instances, so we had to work via CRAN output. If `zlib`'s `z_stream` structure is declared inside a parallel region but before a parallel for, it appears that the particular OpenMP implementation used by CRAN's Solaris moves the structure to a new address on entering the parallel for. Ordinarily this memory move would not matter, however, `zlib` internals have a self reference pointer to the parent, and check that the pointers match. This mismatch caused the -2 (Z_STREAM_ERROR). Allocating an array of structures, one for each thread, before the parallel region avoids the memory move with no cost.

It should be carefully noted that we cannot be sure it really is a problem unique to CRAN's Solaris, even if it seems that way after one year of observations. For example, it could be compiler flags, or particular memory circumstances, either of which could occur on other operating systems too. However, we are unaware of why it would make sense for the OpenMP implementation to move the structure at that point. Any optimizations such as aligning the set of structures to cache line boundaries could be performed at the start of the parallel region, not after the parallel for. If anyone reading this knows more, please let us know.
- The last release took place at the same time as several breaking changes were made to R-devel. The CRAN submissions process runs against latest daily R-devel so we had to keep up with those latest changes by making several resubmissions. Then each resubmission reruns against the new latest R-devel again. Overall it took 7 days. For example, we added the new `environments=FALSE` to our `all.equal` call. Then about 4 hours after 1.13.4 was accepted, the `s` was dropped and we now need to resubmit with `environment=FALSE`. In any case, we have suggested that the default should be FALSE first to give packages some notice, as opposed to generating errors in the CRAN submissions process within hours. Then the default for `environment=` could be TRUE in 6 months time after packages have had some time to update in advance of the default change. Readers of this NEWS file will be familiar with `data.table`'s approach to change control and know that we do this ourselves.
data.table v1.13.4 (08 Dec 2020)
-
`as.matrix(<empty DT>)` now retains the column type for the empty matrix result, #4762. Thus, for example, `min(DT[0])` where DT's columns are numeric, is now consistent with non-empty all-NA input and returns `Inf` with R's warning `no non-missing arguments to min; returning Inf` rather than R's error `only defined on a data frame with all numeric[-alike] variables`. Thanks to @mb706 for reporting.
-
`fsort()` could crash when compiled using `clang-11` (Oct 2020), #4786. Multithreaded debugging revealed that threads are no longer assigned iterations monotonically by the dynamic schedule. Although never guaranteed by the OpenMP standard, in practice monotonicity could be relied on as far as we knew, until now. We rely on monotonicity in the `fsort` implementation. Happily, a schedule modifier `monotonic:dynamic` was added in OpenMP 4.5 (Nov 2015) which we now use if available (e.g. gcc 6+, clang 3.9+). If you have an old compiler which does not support OpenMP 4.5, it's probably the case that the unmodified dynamic schedule is monotonic anyway, so `fsort` now checks that threads are receiving iterations monotonically and emits a graceful error if not. It may be that `clang` prior to version 11, and `gcc` too, exhibit the same crash. It was just that `clang-11` was the first report. To know which version of OpenMP `data.table` is using, `getDTthreads(verbose=TRUE)` now reports the `YYYYMM` value `_OPENMP`; e.g. 201511 corresponds to v4.5, and 201811 corresponds to v5.0. Oddly, the `x.y` version number is not provided by the OpenMP API. OpenMP 4.5 may be enabled in some compilers using `-fopenmp-version=45`. Otherwise, if you need to upgrade compiler, https://www.openmp.org/resources/openmp-compilers-tools/ may be helpful.
-
Columns containing functions that don't inherit the class `'function'` would fail to group, #4814. Thanks @mb706 for reporting, @ecoRoland2 for helping investigate, and @Coorsaa for a follow-up example involving environments.
-
Continuous daily testing by CRAN using latest daily R-devel revealed, within one day of the change to R-devel, that a future version of R would break one of our tests, #4769. The characters "-alike" were added into one of R's error messages, so our too-strict test which expected the error `only defined on a data frame with all numeric variables` will fail when it sees the new error message `only defined on a data frame with all numeric-alike variables`. We have relaxed the pattern the test looks for to `data.*frame.*numeric` well in advance of the future version of R being released. Readers are reminded that CRAN is not just a host for packages. It is also a giant test suite for R-devel. For more information, see behind the scenes of cran, 2016.
-
`as.Date.IDate` is no longer exported as a function to solve a new error in R-devel `S3 method lookup found 'as.Date.IDate' on search path`, #4777. The S3 method is still exported; i.e. `as.Date(x)` will still invoke the `as.Date.IDate` method when `x` is class `IDate`. The function had been exported, in addition to exporting the method, to solve a compatibility issue with `zoo` (and `xts` which uses `zoo`) because `zoo` exports `as.Date` which masks `base::as.Date`. Happily, since zoo 1.8-1 (Jan 2018) made a change to its `as.IDate`, the workaround is no longer needed.
-
Thanks to @fredguinog for testing `fcase` in development before 1.13.0 was released and finding a segfault, #4378. It was found separately by the `rchk` tool (which uses static code analysis) in release procedures and fixed before `fcase` was released, but the reproducible example has now been added to the test suite for completeness. Thanks also to @shrektan for investigating, proposing a very similar fix at C level, and a different reproducible example which has also been added to the test suite.
data.table v1.13.2 (19 Oct 2020)
-
`test.data.table()` could fail the 2nd time it is run by a user in the same R session on Windows due to not resetting locale properly after testing Chinese translation, #4630. Thanks to Cole Miller for investigating and fixing.
-
A regression in v1.13.0 resulted in installation on Mac often failing with `shared object 'datatable.so' not found`, and FreeBSD always failing with `expr: illegal option -- l`, #4652 #4640 #4650. Thanks to many for assistance including Simon Urbanek, Brian Ripley, Wes Morgan, and @ale07alvarez. There were no installation problems on Windows or Linux.
-
Operating on columns of type `list`, e.g. `dt[, listCol[[1]], by=id]`, suffered a performance regression in v1.13.0, #4646 #4658. Thanks to @fabiocs8 and @sandoronodi for the detailed reports, and to Cole Miller for substantial debugging, investigation and proposals at C level which enabled the root cause to be fixed. Related, and also fixed, was a segfault revealed by package POUMM, #4746, when grouping a list column where each item has an attribute; e.g., `coda::mcmc.list`. Detected thanks to CRAN's ASAN checks, and thanks to Venelin Mitov for assistance in tracing the memory fault. Thanks also to Hongyuan Jia and @ben-schwen for assistance in debugging the fix in dev to pass reverse dependency testing which highlighted, before release, that package `eplusr` would fail. Its good usage has been added to `data.table`'s test suite.
-
`fread("1.2\n", colClasses='integer')` (note no column names in the data) would segfault when creating a warning message, #4644. It now warns with `Attempt to override column 1 of inherent type 'float64' down to 'int32' ignored.` When column names are present however, the warning message includes the name as before; i.e., `fread("A\n1.2\n", colClasses='integer')` produces `Attempt to override column 1 <<A>> of inherent type 'float64' down to 'int32' ignored.`. Thanks to Kun Ren for reporting.
-
`dplyr::mutate(setDT(as.list(1:64)), V1=11)` threw error `can't set ALTREP truelength`, #4734. Thanks to @etryn for the reproducible example, and to Cole Miller for refinements.
-
`bit64` v4.0.2 and `bit` v4.0.3, both released on 30th July, correctly broke `data.table`'s tests. Like other packages on our `Suggest` list, we check `data.table` works with `bit64` in our tests. The first break was because `all.equal` always returned `TRUE` in previous versions of `bit64`. Now that `all.equal` works for `integer64`, the incorrect test comparison was revealed. If you use `bit64`, or `nanotime` which uses `bit64`, it is highly recommended to upgrade to the latest `bit64` version. Thanks to Cole Miller for the PR to accommodate `bit64`'s update.

The second break caused by `bit` was the addition of a `copy` function. We did not ask, but the `bit` package kindly offered to change to a different name since `data.table::copy` is long standing. `bit` v4.0.4 released 4th August renamed `copy` to `copy_vector`. Otherwise, users of `data.table` would have needed to prefix every occurrence of `copy` with `data.table::copy` if they use `bit64` too, since `bit64` depends on (rather than importing) `bit`. Again, this impacted `data.table`'s tests which mimic a user's environment; not `data.table` itself per se.

We have requested that CRAN policy be modified to require that reverse dependency testing include packages which `Suggest` the package. Had this been the case, reverse dependency testing of `bit64` would have caught the impact on `data.table` before release.
-
`?.NGRP` now displays the help page as intended, #4946. Thanks to @KyleHaynes for posting the issue, and Cole Miller for the fix. `.NGRP` is a symbol new in v1.13.0; see below in this file.
-
`test.data.table()` failed in non-English locales such as `LC_TIME=fr_FR.UTF-8` due to `Jan` vs `janv.` in tests 168 and 2042, #3450. Thanks to @shrektan for reporting, and @tdhock for making the tests locale-aware.
-
User-supplied `PKG_LIBS` and `PKG_CFLAGS` are now retained and the suggestion in https://mac.r-project.org/openmp/; i.e., `PKG_CPPFLAGS='-Xclang -fopenmp' PKG_LIBS=-lomp R CMD INSTALL data.table_<ver>.tar.gz` has a better chance of working on Mac.
data.table v1.13.0 (24 Jul 2020)
-
`fread` now supports native parsing of `%Y-%m-%d`, and ISO 8601 `%Y-%m-%dT%H:%M:%OS%z`, #4464. Dates are returned as `data.table`'s `integer`-backed `IDate` class (see `?IDate`), and datetimes are returned as `POSIXct` provided either `Z` or the offset from `UTC` is present; e.g. `fwrite()` outputs UTC by default including the final `Z`. Reminder that `IDate` inherits from R's `Date` and is identical other than it uses the `integer` type where (oddly) R uses the `double` type for dates (8 bytes instead of 4). `fread()` gains a `tz` argument to control datetime values that are missing a Z or UTC-offset (now referred to as unmarked datetimes); e.g. as written by `write.csv`. By default `tz=""` means, as in R, read the unmarked datetime in local time. Unless the timezone of the R session is UTC (e.g. the TZ environment variable is set to `"UTC"`, or `""` on non-Windows), unmarked datetime will then be read by `fread` as character, as before. If you have been using `colClasses="POSIXct"` that will still work using R's `as.POSIXct()` which will interpret the unmarked datetime in local time, as before, and still slowly. You can tell `fread` to read unmarked datetime as UTC, and quickly, by passing `tz="UTC"` which may be appropriate in many circumstances. Note that the default behaviour of R to read and write csv using unmarked datetime can lead to different research results when the csv file has been saved in one timezone and read in another due to observations being shifted to a different date. If you have been using `colClasses="POSIXct"` for UTC-marked datetime (e.g. as written by `fwrite` including the final `Z`) then it will automatically speed up with no changes needed.

Since this is a potentially breaking change, i.e. existing code may depend on dates and datetimes being read as type character as before, a temporary option is provided to restore the old behaviour: `options(datatable.old.fread.datetime.character=TRUE)`. However, in most cases, we expect existing code to still work with no changes.

The minor version number is bumped from 12 to 13, i.e. `v1.13.0`, where the `.0` conveys 'be-aware' as is common practice. As with any new feature, there may be bugs to fix and changes to defaults required in future. In addition to convenience, `fread` is now significantly faster in the presence of dates, UTC-marked datetimes, and unmarked datetime when tz="UTC" is provided.
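A minimal sketch of the native parsing described above, using a made-up CSV with one date column and one UTC-marked datetime column. It assumes the migration option `datatable.old.fread.datetime.character` has not been set.

```r
library(data.table)

# Hypothetical input: an ISO date and a UTC-marked (trailing Z) datetime.
csv <- "day,stamp\n2020-07-24,2020-07-24T10:11:12Z\n"
DT <- fread(text = csv)

class(DT$day)    # IDate, which inherits from Date
typeof(DT$day)   # integer storage (4 bytes per value)
class(DT$stamp)  # POSIXct, because the Z marks the value as UTC
```

Code that previously converted such columns itself (for example with `as.IDate()` on a character column) should continue to work, since those conversions also accept values that are already dates.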
-
`%chin%` and `chmatch(x, table)` are faster when `x` is length 1, `table` is long, and `x` occurs near the start of `table`. Thanks to Michael Chirico for the suggestion, #4117.
-
`CsubsetDT` C function is now exported for use by other packages, #3751. Thanks to Leonardo Silvestri for the request and the PR. This uses R's `R_RegisterCCallable` and `R_GetCCallable` mechanism, R-exts§5.4.3 and `?cdt`. Note that organization of our C interface will be changed in future.
-
`print` method for `data.table` gains `trunc.cols` argument (and corresponding option `datatable.print.trunc.cols`, default `FALSE`), #1497, part of #1523. This prints only as many columns as fit in the console without wrapping to new lines (e.g., the first 5 of 80 columns) and a message that states the count and names of the variables not shown. When `class=TRUE` the message also contains the classes of the variables. `data.table` has always automatically truncated rows of a table for efficiency (e.g. printing 10 rows instead of 10 million); in the future, we may do the same for columns (e.g., 10 columns instead of 20,000) by changing the default for this argument. Thanks to @nverno for the initial suggestion and to @TysonStanley for the PR.
-
`setnames(DT, new=new_names)` (i.e. explicitly named `new=` argument) now works as expected rather than an error message requesting that `old=` be supplied too, #4041. Thanks @Kodiologist for the suggestion.
-
`nafill` and `setnafill` gain `nan` argument to say whether `NaN` should be considered the same as `NA` for filling purposes, #4020. Prior versions had an implicit value of `nan=NaN`; the default is now `nan=NA`, i.e., `NaN` is treated as if it's missing. Thanks @AnonymousBoba for the suggestion. Also, while `nafill` still respects `getOption('datatable.verbose')`, the `verbose` argument has been removed.
-
New function `fcase(...,default)` implemented in C by Morgan Jacob, #3823, is inspired by SQL `CASE WHEN` which is a common tool in SQL for e.g. building labels or cutting age groups based on conditions. `fcase` is comparable to R function `dplyr::case_when` however it evaluates its arguments in a lazy way (i.e. only when needed) as shown below. Please see `?fcase` for more details.

    # Lazy evaluation
    x = 1:10
    data.table::fcase(
      x < 5L, 1L,
      x >= 5L, 3L,
      x == 5L, stop("provided value is an unexpected one!")
    )
    # [1] 1 1 1 1 3 3 3 3 3 3

    dplyr::case_when(
      x < 5L ~ 1L,
      x >= 5L ~ 3L,
      x == 5L ~ stop("provided value is an unexpected one!")
    )
    # Error in eval_tidy(pair$rhs, env = default_env) :
    #  provided value is an unexpected one!

    # Benchmark
    x = sample(1:100, 3e7, replace = TRUE) # 114 MB
    microbenchmark::microbenchmark(
      dplyr::case_when(
        x < 10L ~ 0L,
        x < 20L ~ 10L,
        x < 30L ~ 20L,
        x < 40L ~ 30L,
        x < 50L ~ 40L,
        x < 60L ~ 50L,
        x > 60L ~ 60L
      ),
      data.table::fcase(
        x < 10L, 0L,
        x < 20L, 10L,
        x < 30L, 20L,
        x < 40L, 30L,
        x < 50L, 40L,
        x < 60L, 50L,
        x > 60L, 60L
      ),
      times = 5L,
      unit = "s")
    # Unit: seconds
    #               expr   min    lq  mean   median    uq    max neval
    # dplyr::case_when   11.57 11.71 12.22    11.82 12.00  14.02     5
    # data.table::fcase   1.49  1.55  1.67     1.71  1.73   1.86     5
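The `default` argument named in the signature is not exercised by the example above. A small sketch of it, with invented thresholds and labels: conditions are checked in order, the first `TRUE` one supplies the value, and `default` covers whatever is left.

```r
library(data.table)

x <- 1:10
# Hypothetical cut points; values 8, 9 and 10 match neither condition.
labels <- fcase(
  x < 4L, "low",
  x < 8L, "mid",
  default = "high"
)
labels
```

Without `default`, elements that match no condition come back as `NA`.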
-
.SDcols=is.numeric
now works; i.e.,SDcols=
accepts a function which is used to select the columns of.SD
, #3950. Any function (even ad hoc) that returns scalarTRUE
/FALSE
for each column will do; e.g.,.SDcols=!is.character
will return non-character columns (a laNegate()
). Note that.SDcols=patterns(...)
can still be used for filtering based on the column names. -
Compiler support for OpenMP is now detected during installation, which allows
data.table
to compile from source (in single threaded mode) on macOS which, frustratingly, does not include OpenMP support by default, #2161, unlike Windows and Linux. A helpful message is emitted during installation from source, and on package startup as before. Many thanks to @jimhester for the PR. -
rbindlist
now supports columns of typeexpression
, #546. Thanks @jangorecki for the report. -
The dimensions of objects in a
list
column are now displayed, #3671. Thanks to @randomgambit for the request, and Tyson Barrett for the PR. -
frank
gainsties.method='last'
, paralleling the same inbase::order
which has been available since R 3.3.0 (April 2016), #1689. Thanks @abudis for the encouragement to accommodate this. -
The
keep.rownames
argument inas.data.table.xts
now accepts a string, which can be used for specifying the column name of the index of the xts input, #4232. Thanks to @shrektan for the request and the PR. -
New symbol .NGRP available in j, #1206. .GRP (the group number) was already available taking values from 1 to .NGRP. The number of groups, .NGRP, might be useful in j to calculate a percentage of groups processed so far, or to do something different for the last or penultimate group, for example. -
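A minimal sketch of the two uses mentioned above (progress as a percentage of groups, and special handling of the last group). The table and the column names g, v, pct_groups_done and is_last are invented for the illustration.

```r
library(data.table)

# Hypothetical table with three groups
DT = data.table(g = c("a", "a", "b", "c", "c", "c"), v = 1:6)

res = DT[, .(
  n               = .N,
  pct_groups_done = 100 * .GRP / .NGRP,  # share of groups processed so far
  is_last         = .GRP == .NGRP        # TRUE only for the final group
), by = g]

print(res)
```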
Added support for round() and trunc() to extend functionality of ITime. round() and trunc() can be used with argument units: "hours" or "minutes". Thanks to @JensPederM for the suggestion and PR. -
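The following sketch shows how these methods might be called. The unit is passed positionally as the second argument on the assumption that this works for both methods; the times are arbitrary.

```r
library(data.table)

# Arbitrary example times
tm = as.ITime(c("10:45:30", "23:10:05"))

rounded_h = round(tm, "hours")    # nearest hour
trunc_h   = trunc(tm, "hours")    # drop minutes and seconds
trunc_m   = trunc(tm, "minutes")  # drop seconds only

print(rounded_h)
print(trunc_h)
print(trunc_m)
```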
A new throttle feature has been introduced to speed up small data tasks that are repeated in a loop, #3175 #3438 #3205 #3735 #3739 #4284 #4527 #4294 #1120. The default throttle of 1024 means that a single thread will be used when nrow<=1024, two threads when nrow<=2048, etc. To change the default, use setDTthreads(throttle=). Or use the new environment variable R_DATATABLE_THROTTLE. If you use Sys.setenv() in a running R session to change this environment variable, be sure to run an empty setDTthreads() call afterwards for the change to take effect; see ?setDTthreads. The word throttle is used to convey that the number of threads is restricted (throttled) for small data tasks. Reducing throttle to 1 will turn off throttling and should revert behaviour to past versions (i.e. using many threads even for small data). Increasing throttle to, say, 65536 will utilize multi-threading only for larger datasets. The value 1024 is a guess. We welcome feedback and test results indicating what the best default should be.
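A short sketch of the two ways of changing the throttle described above. The values 65536 and 1 are just the examples from the text, not recommendations, and the final call restores the documented default of 1024.

```r
library(data.table)

# Option 1: set the throttle directly; the previous thread count is returned invisibly
old_threads = setDTthreads(throttle = 65536L)  # multi-threading only for larger inputs
getDTthreads(verbose = TRUE)                   # prints the current threading settings

# Option 2: via the environment variable, followed by an empty setDTthreads()
# call so that the new value is picked up in the running session
Sys.setenv(R_DATATABLE_THROTTLE = "1")         # 1 turns throttling off
setDTthreads()

# Restore the documented default
Sys.unsetenv("R_DATATABLE_THROTTLE")
setDTthreads(throttle = 1024L)
```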
-
A NULL timezone on POSIXct was interpreted by as.IDate and as.ITime as UTC rather than the session's default timezone (tz=""), #4085. -
DT[i]
could segfault wheni
is a zero-columndata.table
, #4060. Thanks @shrektan for reporting and fixing. -
Dispatch of
first
andlast
functions now properly works again forxts
objects, #4053. Thanks to @ethanbsmith for reporting. -
If
.SD
is returned as-is during grouping, it is now unlocked for downstream usage, part of #4159. Thanks also to @mllg for detecting a problem with the initial fix here during the dev release #4173. -
GForce
is deactivated for[[
on non-atomic input, part of #4159. Thanks @hongyuanjia and @ColeMiller1 for helping debug an issue in dev with the original fix before release, #4612. -
all.equal(DT, y)
no longer errors wheny
is not a data.table, #4042. Thanks to @d-sci for reporting and the PR. -
A length 1
colClasses=NA_character_
would causefread
to incorrectly coerce all columns to character, #4237. -
An
fwrite
error message could include a garbled number and cause test 1737.5 to fail, #3492. Thanks to @QuLogic for debugging the issue on ARMv7hl, and the PR fixing it. -
fread
improves handling of very small (<1e-300) or very large (>1e+300) floating point numbers on non-x86 architectures (specifically ppc64le and armv7hl). Thanks to @QuLogic for reporting and fixing, PR#4165. -
When updating by reference, the use of
get
could result in columns being re-ordered silently, #4089. Thanks to @dmongin for reporting and Cole Miller for the fix. -
copy()
now overallocates deeply nested lists ofdata.table
s, #4205. Thanks to @d-sci for reporting and the PR. -
rbindlist
no longer errors when coercing complex vectors to character vectors, #4202. Thanks to @sritchie73 for reporting and the PR. -
A relatively rare case of segfault when combining non-equi joins with
by=.EACHI
is now fixed, closes #4388. -
Selecting key columns could incur a large speed penalty, #4498. Thanks to @Jesper on Stack Overflow for the report.
-
all.equal(DT1, DT2, ignore.row.order=TRUE)
could return TRUE incorrectly in the presence of NAs, #4422. -
Non-equi joins now automatically set allow.cartesian=TRUE, #4489. Thanks to @Henrik-P for reporting. -
X[Y, on=character(0)]
andmerge(X, Y, by.x=character(0), by.y=character(0))
no longer crash, #4272. Thanks to @tlapak for the PR. -
by=col1:col4
gave an incorrect result ifkey(DT)==c("col1","col4")
, #4285. Thanks to @cbilot for reporting, and Cole Miller for the PR. -
Matrices resulting from logical operators or comparisons on
data.table
s, e.g. indta == dtb
, can no longer have their colnames changed by reference later, #4323. Thanks to @eyherabh for reporting and @tlapak for the PR. -
The environment variable
R_DATATABLE_NUM_THREADS
was being limited byR_DATATABLE_NUM_PROCS_PERCENT
(by default 50%), #4514. It is now consistent withsetDTthreads()
and only limited by the full number of logical CPUs. For example, on a machine with 8 logical CPUs,R_DATATABLE_NUM_THREADS=6
now results in 6 threads rather than 4 (50% of 8).
-
Retrospective license change permission was sought from and granted by 4 contributors who were missed in PR#2456, #4140. We had used GitHub's contributor page which omits 3 of these due to invalid email addresses, unlike GitLab's contributor page which includes the ids. The 4th omission was a PR to a script which should not have been excluded; a script is code too. We are sorry these contributors were not properly credited before. They have now been added to the contributors list as displayed on CRAN. All the contributors of code to data.table hold its copyright jointly; your contributions belong to you. You contributed to data.table when it had a particular license at that time, and you contributed on that basis. This is why in the last license change, all contributors of code were consulted and each had a veto.
-
as.IDate
,as.ITime
,second
,minute
, andhour
now recognize UTC equivalents for speed: GMT, GMT-0, GMT+0, GMT0, Etc/GMT, and Etc/UTC, #4116. -
set2key
,set2keyv
, andkey2
have been removed, as they have been warning since v1.9.8 (Nov 2016) and halting with helpful message since v1.11.0 (May 2018). When they were introduced in version 1.9.4 (Oct 2014) they were marked as 'experimental' and quickly superseded bysetindex
andindices
. -
data.table now supports messaging in simplified Chinese (locale zh_CN). This was the result of a monumental collaboration to translate data.table's roughly 1400 warnings, errors, and verbose messages (about 16,000 words/100,000 characters) over the course of two months from volunteer translators in at least 4 time zones, most of whom are first-time data.table contributors and many of whom are first-time OSS contributors!

A big thanks goes out to @fengqifang, @hongyuanjia, @biobai, @zhiiiyang, @Leo-Lee15, @soappp9527, @amy17519, @Zachary-Wu, @caiquanyou, @dracodoc, @JulianYlli12, @renkun-ken, @Xueliang24, @koohoko, @KingdaShi, @gaospecial, @shrektan, @sunshine1126, @shawnchen1996, @yc0802, @HesperusArcher, and @Emberwhirl, all of whom took time from their busy schedules to translate and review others' translations. Special thanks go to @zhiiiyang and @hongyuanjia who went above and beyond in helping to push the project over the finish line, and to @GuangchuangYu who helped to organize the volunteer pool.
data.table joins lubridate and nlme as the only packages among the top 200 most-downloaded community packages on CRAN to offer non-English messaging, and is the only one of the top 50 packages to offer complete support of all messaging. We hope this is a first step in broadening the reach and accessibility of the R ecosystem to more users globally and look forward to working with other maintainers looking to bolster the portability of their packages by offering advice on learnings from this undertaking.

We would be remiss not to mention the laudable lengths to which the R core team goes to maintain the much larger repository (about 6,000 messages in more than 10 languages) of translations for R itself.
We will evaluate the feasibility (in terms of maintenance difficulty and CRAN package size limits) of offering support for other languages in later releases.
-
fifelse
andfcase
now notify users that S4 objects (exceptnanotime
) are not supported #4135. Thanks to @torema-ed for bringing it to our attention and Morgan Jacob for the PR. -
frank(..., ties.method="random", na.last=NA)
now returns the same random ordering thatbase::rank
does, #4243. -
The error message when mistakenly using := in i instead of j has been much improved, #4227. Thanks to Hugh Parsonage for the detailed suggestion.

    > DT = data.table(A=1:2)
    > DT[B:=3]
    Error: Operator := detected in i, the first argument inside DT[...], but is only valid in
           the second argument, j. Most often, this happens when forgetting the first comma
           (e.g. DT[newvar:=5] instead of DT[, new_var:=5]). Please double-check the
           syntax. Run traceback(), and debugger() to get a line number.
    > DT[, B:=3]
    > DT
           A     B
       <int> <num>
    1:     1     3
    2:     2     3
-
Added more explanation/examples to
?data.table
for how to use.BY
, #1363. -
Changes upstream in R have been accommodated; e.g. c.POSIXct now raises 'origin' must be supplied which impacted foverlaps, #4428. -
data.table::update.dev.pkg()
now unloads thedata.table
namespace to alleviate a DLL lock issue on Windows, #4403. Thanks to @drag5 for reporting. -
data.table package binaries built by R version 3 (R3) should only be installed in R3, and similarly data.table package binaries built by R4 should only be installed in R4. Otherwise, a package ‘data.table’ was built under R version... warning will occur which should not be ignored. This is due to a very welcome change to rbind and cbind in R 4.0.0 which enabled us to remove workarounds, see news item in v1.12.6 below in this file. To continue to support both R3 and R4, data.table's NAMESPACE file contains a condition on the R major version (3 or 4) and this is what gives rise to the requirement that the major version used to build data.table must match the major version used to install it. Thanks to @vinhdizzo for reporting, #4528. -
Internal function
shallow()
no longer makes a deep copy of secondary indices. This eliminates a relatively small time and memory overhead when indices are present that added up significantly when performing many operations, such as joins, in a loop or when joining inj
by group, #4311. Many thanks to @renkun-ken for the report, and @tlapak for the investigation and PR. -
The
datatable.old.unique.by.key
option has been removed as per the 4 year schedule detailed in note 10 of v1.12.4 (Oct 2019), note 10 of v1.11.0 (May 2018), and note 1 of v1.9.8 (Nov 2016). It has been generating a helpful warning for 2 years, and helpful error for 1 year.
data.table v1.12.8 (09 Dec 2019)
DT[, {...; .(A,B)}]
(i.e. when.()
is the final item of a multi-statement{...}
) now auto-names the columnsA
andB
(just likeDT[, .(A,B)]
) rather thanV1
andV2
, #2478 #609. Similarly,DT[, if (.N>1) .(B), by=A]
now auto-names the columnB
rather thanV1
. Explicit names are unaffected; e.g.DT[, {... y= ...; .(A=C+y)}, by=...]
named the columnA
before, and still does. Thanks also to @renkun-ken for his go-first strong testing which caught an issue not caught by the test suite or by revdep testing, related to NULL being the last item, #4061.
-
frollapply
could segfault and exceed R's C protect limits, #3993. Thanks to @DavisVaughan for reporting and fixing. -
DT[, sum(grp), by=grp]
(i.e. aggregating the same column being grouped) could error withobject 'grp' not found
, #3103. Thanks to @cbailiss for reporting.
-
Links in the manual were creating warnings when installing HTML, #4000. Thanks to Morgan Jacob.
-
Adjustments for R-devel (R 4.0.0) which now has reference counting turned on, #4058 #4093. This motivated early release to CRAN because every day CRAN tests every package using the previous day's changes in R-devel; a much valued feature of the R ecosystem. It helps R-core if packages can pass changes in R-devel as soon as possible. Thanks to Luke Tierney for the notice, and for implementing reference counting which we look forward to very much.
-
C internals have been standardized to use
PRI[u|d]64
to print[u]int64_t
. This solves new warnings fromgcc-8
on Windows with%lld
, #4062, in many cases already working aroundsnprintf
on Windows not supporting%zu
. Release procedures have been augmented to prevent any internal use ofllu
,lld
,zu
orzd
. -
test.data.table()
gainsshowProgress=interactive()
to suppress the thousands ofRunning test id <num> ...
lines displayed by CRAN checks when there are warnings or errors.
data.table v1.12.6 (18 Oct 2019)
-
shift()
on ananotime
with the defaultfill=NA
now fills ananotime
missing value correctly, #3945. Thanks to @mschubmehl for reporting and fixing in PR #3942. -
Compilation failed on CRAN's MacOS due to an older version of
zlib.h/zconf.h
which did not havez_const
defined, #3939. Other open-source projects unrelated to R have experienced this problem on MacOS too. We have followed the common practice of removingz_const
to support the olderzlib
versions, and data.table's release procedures have gained agrep
to ensurez_const
isn't used again by accident in future. The libraryzlib
is used forfwrite
's new feature of multithreaded compression on-the-fly; see item 3 of 1.12.4 below. -
A runtime error in
fwrite
's compression, but only observed so far on Solaris 10 32bit with zlib 1.2.8 (Apr 2013), #3931:Error -2: one or more threads failed to allocate buffers or there was a compression error.
In case it happens again, this area has been made more robust and the error more detailed. As is often the case, investigating the Solaris problem revealed secondary issues in the same area of the code. In this case, some%d
in verbose output should have been%lld
. This obliquity that CRAN's Solaris provides is greatly appreciated. -
A leak could occur in the event of an unsupported column type error, or if working memory could only partially be allocated; #3940. Found thanks to
clang
's Leak Sanitizer (prompted by CRAN's diligent use of latest tools), and two tests in the test suite which tested the unsupported-type error.
- Many thanks to Kurt Hornik for fixing R's S3 dispatch of
rbind
andcbind
methods, #3948. WithR>=4.0.0
(current R-devel),data.table
now registers the S3 methodscbind.data.table
andrbind.data.table
, and no longer applies the workaround documented in FAQ 2.24.
data.table v1.12.4 (03 Oct 2019)
-
rleid()
functions now support long vectors (length > 2 billion). -
fread():
- now skips embedded NUL (\0), #3400. Thanks to Marcus Davy for reporting with examples, Roy Storey for the initial PR, and Bingjie Qian for testing this feature on a very complicated real-world file.
- colClasses now supports 'complex', 'raw', 'Date', 'POSIXct', and user-defined classes (so long as an as. method exists), #491 #1634 #2610. Any error during coercion results in a warning and the column is left as the default type (probably "character"). Thanks to @hughparsonage for the PR.
- stringsAsFactors=0.10 will factorize any character column containing under 0.10*nrow unique strings, #2025. Thanks to @hughparsonage for the PR.
- colClasses=list(numeric=20:30, numeric="ID") will apply the numeric type to column numbers 20:30 as before and now also column name "ID"; i.e. all duplicate class names are now respected rather than only the first. This need may arise when specifying some columns by name and others by number, as in this example. Thanks to @hughparsonage for the PR.
- gains yaml (default FALSE) and the ability to parse CSVY-formatted input files; i.e., csv files with metadata in a header formatted as YAML (https://csvy.org/), #1701. See ?fread and files in /inst/tests/csvy/ for sample formats. Please provide feedback if you find this feature useful and would like extended capabilities. For now, consider it experimental, meaning the API/arguments may change. Thanks to @leeper at rio for the inspiration and @MichaelChirico for implementing.
- select can now be used to specify types for just the columns selected, #1426. Just like colClasses it can be a named vector of colname=type pairs, or a named list of type=col(s) pairs. For example:
    fread(file, select=c(colD="character",     # returns 2 columns: colD,colA
                         colA="integer64"))
    fread(file, select=list(character="colD",  # returns 5 columns: colD,8,9,10,colA
                            integer= 8:10,
                            character="colA"))
- gains tmpdir= argument which is passed to tempfile() whenever a temporary file is needed. Thanks to @mschubmehl for the PR. As before, setting TMPDIR (to /dev/shm for example) before starting the R session still works too; see ?base::tempdir.
-
fwrite():
- now writes compressed .gz files directly, #2016. Compression, like fwrite(), is multithreaded and compresses each chunk on-the-fly (a full size intermediate file is not created). Use a ".gz" extension, or the new compress= option. Many thanks to Philippe Chataignon for the significant PR. For example:
    DT = data.table(A=rep(1:2, 100e6), B=rep(1:4, 50e6))
    fwrite(DT, "data.csv")      # 763MB; 1.3s
    fwrite(DT, "data.csv.gz")   #   2MB; 1.6s
    identical(fread("data.csv.gz"), DT)
Note that compression is handled using the zlib library. In the unlikely event of a missing zlib.h on a machine that is compiling data.table from source, one may get the fwrite.c compilation error zlib.h: No such file or directory. As of now, the easiest solution is to install the missing library using sudo apt install zlib1g-dev (Debian/Ubuntu). Installing R (r-base-dev) depends on zlib1g-dev so this should be rather uncommon. If it happens to you please upvote related issue #3872. -
Gains
yaml
argument matching that offread
, #3534. See the item infread
for a bit more detail; here, we'd like to reiterate that feedback is appreciated in the initial phase of rollout for this feature. -
Gains
bom
argument to add a byte order mark (BOM) at the beginning of the file to signal that the file is encoded in UTF-8, #3488. Thanks to Stefan Fleck for requesting and Philippe Chataignon for implementing. -
Now supports type
complex
, #3690. -
Gains
scipen
#2020, the number 1 most-requested feature #3189. The default isgetOption("scipen")
so thatfwrite
will now respect R's option in the same way asbase::write.csv
andbase::format
, as expected. The parameter and option name have been kept the same as base R'sscipen
for consistency and to aid online search. It stands for 'scientific penalty'; i.e., the number of characters to add to the width within which non-scientific number format is used if it will fit. A high penalty essentially turns off scientific format. We believe that common practice is to use a value of 999, however, if you do use 999, because your data might include very long numbers such as10^300
,fwrite
needs to account for the worst case field width in its buffer allocation per thread. This may impact space or time. If you experience slowdowns or unacceptable memory usage, please passverbose=TRUE
tofwrite
, inspect the output, and report the issue. A workaround, until we can determine the best strategy, may be to pass a smaller value toscipen
, such as 50. We have observed thatfwrite(DT, scipen=50)
appears to write10^50
accurately, unlike base R. However, this may be a happy accident and not apply generally. Further work may be needed in this area.
    DT = data.table(a=0.0001, b=1000000)
    fwrite(DT)
    # a,b
    # 1e-04,1e+06
    fwrite(DT,scipen=1)
    # a,b
    # 0.0001,1e+06
    fwrite(DT,scipen=2)
    # a,b
    # 0.0001,1000000

    10^50
    # [1] 1e+50
    options(scipen=50)
    10^50
    # [1] 100000000000000007629769841091887003294964970946560
    fwrite(data.table(A=10^50))
    # A
    # 100000000000000000000000000000000000000000000000000
-
Assigning to one item of a list column no longer requires the RHS to be wrapped with list or .(), #950.

    > DT = data.table(A=1:3, B=list(1:2,"foo",3:5))
    > DT
           A      B
       <int> <list>
    1:     1    1,2
    2:     2    foo
    3:     3  3,4,5
    > # The following all accomplish the same assignment:
    > DT[2, B:=letters[9:13]]           # was error, now works
    > DT[2, B:=.(letters[9:13])]        # was error, now works
    > DT[2, B:=.(list(letters[9:13]))]  # .(list()) was needed, still works
    > DT
           A         B
       <int>    <list>
    1:     1       1,2
    2:     2 i,j,k,l,m
    3:     3     3,4,5
-
print.data.table()
gains an option to display the timezone ofPOSIXct
columns when available, #2842. Thanks to Michael Chirico for reporting and Felipe Parages for the PR. -
New functions nafill and setnafill, #854. Thanks to Matthieu Gomez for the request and Jan Gorecki for implementing.

    DT = setDT(lapply(1:100, function(i) sample(c(rnorm(9e6), rep(NA_real_, 1e6)))))
    format(object.size(DT), units="GB")   ## 7.5 Gb
    zoo::na.locf(DT, na.rm=FALSE)         ## zoo           53.518s
    setDTthreads(1L)
    nafill(DT, "locf")                    ## DT 1 thread    7.562s
    setDTthreads(0L)
    nafill(DT, "locf")                    ## DT 40 threads  0.605s
    setnafill(DT, "locf")                 ## DT in-place    0.367s
-
New variable .Last.updated (similar to R's .Last.value) contains the number of rows affected by the most recent := or set(), #1885. For details see ?.Last.updated. -
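A small sketch of reading .Last.updated after an update by reference. The table and values are made up; the counts are copied into ordinary variables (hypothetical names n_after_assign and n_after_set) immediately after each update.

```r
library(data.table)

DT = data.table(a = 1:5, b = 0L)

DT[a > 3L, b := 1L]                         # updates the two rows where a is 4 or 5
n_after_assign = as.integer(.Last.updated + 0L)

set(DT, i = 1:3, j = "b", value = 9L)       # updates the first three rows
n_after_set = as.integer(.Last.updated + 0L)

print(c(n_after_assign, n_after_set))
```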
between() and %between% are faster for POSIXct, #3519, and now support the .() alias, #2315. Thanks to @Henrik-P for the reports. There is now also support for bit64's integer64 class and more robust coercion of types, #3517. between() gains check= which checks any(lower>upper); off by default for speed in particular for type character. -
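The sketch below illustrates the .() alias with per-row bounds and the new check= argument. The table DT and its columns v, lo and hi are invented for the example; the check=TRUE call is wrapped in tryCatch because it is expected to stop when a lower bound exceeds the upper bound.

```r
library(data.table)

x = c(1L, 5L, 10L, 15L)
in_range = x %between% c(5L, 10L)   # scalar bounds, inclusive by default

# Per-row bounds supplied through the .() alias inside a data.table query
DT = data.table(v = x, lo = c(0L, 6L, 0L, 0L), hi = c(2L, 8L, 20L, 10L))
kept = DT[v %between% .(lo, hi), v]

# check=TRUE verifies that no lower bound exceeds its upper bound
msg = tryCatch(between(x, 10L, 5L, check = TRUE),
               error = function(e) conditionMessage(e))

print(in_range)
print(kept)
print(msg)
```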
New convenience functions %ilike% and %flike% which map to new like() arguments ignore.case and fixed respectively, #3333. %ilike% is for case-insensitive pattern matching. %flike% is for more efficient matching of fixed strings. Thanks to @andreasLD for providing most of the core code. -
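To show how the three operators differ, here is a minimal sketch on a made-up character vector. Note that "a.b" is a regular expression for %like% but a literal string for %flike%.

```r
library(data.table)

x = c("Data.Table", "data.frame", "a.b", "axb")

m_like  = x %like%  "data"   # case-sensitive regular expression
m_ilike = x %ilike% "data"   # same pattern, ignoring case
m_regex = x %like%  "a.b"    # "." matches any character
m_fixed = x %flike% "a.b"    # "." is matched literally

print(rbind(m_like, m_ilike, m_regex, m_fixed))
```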
on=.NATURAL (or alternatively X[on=Y] #3621) joins two tables on their common column names, so called natural join, #629. Thanks to David Kulp for the request. As before, when on= is not provided, X must have a key and the key columns are used to join (like rownames, but multi-column and multi-type). -
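A minimal sketch of a natural join, compared with spelling out the shared columns by hand. The tables X and Y and their columns are hypothetical.

```r
library(data.table)

# Two made-up tables sharing the columns id and grp
X = data.table(id = c(1L, 2L, 3L), grp = c("a", "b", "c"), x = c(10, 20, 30))
Y = data.table(id = c(2L, 3L, 4L), grp = c("b", "c", "d"), y = c(200, 300, 400))

natural  = X[Y, on = .NATURAL]          # joins on all common column names
explicit = X[Y, on = c("id", "grp")]    # the same join written out

print(natural)
print(isTRUE(all.equal(natural, explicit)))
```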
as.data.table
gainskey
argument mirroring its use insetDT
anddata.table
, #890. As a byproduct, the arguments ofas.data.table.array
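For illustration, a sketch of converting and keying in one step; the data.frame df and its columns are made up.

```r
library(data.table)

df = data.frame(id = c(3L, 1L, 2L), v = c("c", "a", "b"))

# Convert and set the key in a single call
DT = as.data.table(df, key = "id")

print(key(DT))
print(DT)
```

Because setting a key sorts the table, the rows of DT come out ordered by id.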
have changed order, which could affect code relying on positional arguments to this method. Thanks @cooldome for the suggestion and @MichaelChirico for implementation. -
merge.data.table
is now exported, #2618. We realize that S3 methods should not ordinarily be exported. Rather, the method should be invoked via S3 dispatch. But users continue to request its export, perhaps because of intricacies relating to the fact that data.table inherits from data.frame, there are two arguments tomerge()
but S3 dispatch applies just to the first, and a desire to explicitly calldata.table::merge.data.table
from package code. Thanks to @AndreMikulec for the most recent request. -
New rolling function to calculate rolling sum has been implemented and exported, see
?frollsum
, #2778. -
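A minimal usage sketch of frollsum on a toy vector with a window of 3, using the default right alignment and then align="left".

```r
library(data.table)

x = c(1, 2, 3, 4, 5)

right = frollsum(x, 3)                  # window ends at the current element
left  = frollsum(x, 3, align = "left")  # window starts at the current element

print(right)
print(left)
```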
setkey
to an existing index now uses the index, #2889. Thanks to @MichaelChirico for suggesting and @saraswatmks for the PR. -
DT[order(col)[1:5], ...]
(i.e. wherei
is a compound expression involvingorder()
) is now optimized to usedata.table
's multithreadedforder
, #1921. This example is not a fully optimal top-N query since the full ordering is still computed. The improvement is that the call toorder()
is computed faster for anyi
expression usingorder
. -
as.data.table
now unpacks columns in adata.frame
which are themselves adata.frame
ormatrix
. This need arises when parsing JSON, a corollary in #3369. Bug fix 19 in v1.12.2 (see below) added a helpful error (rather than segfault) to detect such invaliddata.table
, and promised thatas.data.table()
would unpack these columns in the next release (i.e. this release) so that the invaliddata.table
is not created in the first place. Further,setDT
now warns if it observes such columns and suggests usingas.data.table
instead, #3760. -
CJ has been ported to C and parallelized, thanks to a PR by Michael Chirico, #3596. All types benefit, but, as in many data.table operations, factors benefit more than character.

    # default 4 threads on a laptop with 16GB RAM and 8 logical CPU
    ids = as.vector(outer(LETTERS, LETTERS, paste0))
    system.time( CJ(ids, 1:500000) )  # 3.9GB; 340m rows
    #   user  system elapsed (seconds)
    #  3.000   0.817   3.798  # was
    #  1.800   0.832   2.190  # now

    # ids = as.factor(ids)
    system.time( CJ(ids, 1:500000) )  # 2.6GB; 340m rows
    #   user  system elapsed (seconds)
    #  1.779   0.534   2.293  # was
    #  0.357   0.763   0.292  # now
-
New function
fcoalesce(...)
has been written in C, and is multithreaded fornumeric
andfactor
. It replaces missing values according to a prioritized list of candidates (as per SQL COALESCE,dplyr::coalesce
, andhutils::coalesce
), #3424. It accepts any number of vectors in several forms. For example, given three vectorsx
,y
, andz
, where eachNA
inx
is to be replaced by the corresponding value iny
if that is non-NA, else the corresponding value inz
, the following equivalent forms are all accepted:fcoalesce(x,y,z)
,fcoalesce(x,list(y,z))
, andfcoalesce(list(x,y,z))
. Being a new function, its behaviour is subject to change particularly for type list, #3712.

    # default 4 threads on a laptop with 16GB RAM and 8 logical CPU
    N = 100e6
    x = replicate(5, {x=sample(N); x[sample(N, N/2)]=NA; x}, simplify=FALSE)  # 2GB
    y1 = do.call(dplyr::coalesce, x)
    y2 = do.call(hutils::coalesce, x)
    y3 = do.call(data.table::fcoalesce, x)
    #   user  system elapsed (seconds)
    #  4.935   1.876   6.810  # dplyr::coalesce
    #  3.122   0.831   3.956  # hutils::coalesce
    #  0.915   0.099   0.379  # data.table::fcoalesce
    identical(y1,y2) && identical(y1,y3)
    # TRUE
-
Type
complex
is now supported bysetkey
,setorder
,:=
,by=
,keyby=
,shift
,dcast
,frank
,rowid
,rleid
,CJ
,fcoalesce
,unique
, anduniqueN
, #3690. Thanks to Gareth Ward and Elio Campitelli for their reports and input. Sortingcomplex
is achieved the same way as base R; i.e., first by the real part then by the imaginary part (as if thecomplex
column were two separate columns ofdouble
). There is no plan to support joining/merging oncomplex
columns until a user demonstrates a need for that. -
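The ordering rule (real part first, then imaginary part) can be seen in a small sketch; the values are arbitrary.

```r
library(data.table)

# Three arbitrary complex values: 2+0i, 1+5i, 1-1i
DT = data.table(z = complex(real = c(2, 1, 1), imaginary = c(0, 5, -1)))

setorder(DT, z)   # sorts by Re(z), then by Im(z)

print(DT)
```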
setkey
,[key]by=
andon=
in verbose mode (options(datatable.verbose=TRUE)
) now detect any columns inheriting fromDate
which are stored as 8 byte double, test if any fractions are present, and if not suggest using a 4 byte integer instead (such asdata.table::IDate
) to save space and time, #1738. In future this could be upgraded tomessage
orwarning
depending on feedback. -
New function
fifelse(test, yes, no, na)
has been implemented in C by Morgan Jacob, #3657 and #3753. It is comparable tobase::ifelse
,dplyr::if_else
,hutils::if_else
, and (forthcoming)vctrs::if_else()
. It returns a vector of the same length astest
but unlikebase::ifelse
the output type is consistent with those ofyes
andno
. Please see?data.table::fifelse
for more details.

    # default 4 threads on a laptop with 16GB RAM and 8 logical CPU
    x = sample(c(TRUE,FALSE), 3e8, replace=TRUE)  # 1GB
    microbenchmark::microbenchmark(
      base::ifelse(x, 7L, 11L),
      dplyr::if_else(x, 7L, 11L),
      hutils::if_else(x, 7L, 11L),
      data.table::fifelse(x, 7L, 11L),
      times = 5L, unit="s"
    )
    # Unit: seconds
    #                            expr  min  med  max neval
    #        base::ifelse(x, 7L, 11L)  8.5  8.6  8.8     5
    #      dplyr::if_else(x, 7L, 11L)  9.4  9.5  9.7     5
    #     hutils::if_else(x, 7L, 11L)  2.6  2.6  2.7     5
    # data.table::fifelse(x, 7L, 11L)  1.5  1.5  1.6     5  # setDTthreads(1)
    # data.table::fifelse(x, 7L, 11L)  0.8  0.8  0.9     5  # setDTthreads(2)
    # data.table::fifelse(x, 7L, 11L)  0.4  0.4  0.5     5  # setDTthreads(4)
-
transpose
gainskeep.names=
andmake.names=
arguments, #1886. Previously, column names were dropped and there was no way to keep them.keep.names="rn"
keeps the column names and puts them in the"rn"
column of the result. Similarly,make.names="rn"
uses column"rn"
as the column names of the result. Both arguments areNULL
by default for backwards compatibility. As these arguments are new, they are subject to change in future according to community feedback. Thanks to @ghost for the request. -
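As an illustration of the two arguments together, here is a sketch on a made-up table. The result column name "col" is arbitrary; "rn" is the column whose values become the new column names.

```r
library(data.table)

DT = data.table(rn = c("r1", "r2"), a = 1:2, b = 3:4)

# keep.names: store the old column names (a, b) in a new column called "col"
# make.names: use the values of column "rn" as the names of the result columns
tDT = transpose(DT, keep.names = "col", make.names = "rn")

print(tDT)
```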
Added a
data.table
method forutils::edit
to ensure adata.table
is returned, for convenience, #593. -
More efficient optimization of many columns in
j
(e.g. from.SD
), #1470. Thanks @Jorges1000 for the report. -
setnames(DT, old, new)
now omits anyold==new
to save redundant key and index name updates, #3783.setnames(DT, new)
(i.e. not providingold
) already omitted any column name updates wherenames(DT)==new
; e.g.setnames(DT, gsub('^_', '', names(DT)))
exits early if no columns start with_
. -
[[
by group is now optimized for regular vectors (not type list), #3209. Thanks @renkun-ken for the suggestion.[
by group was already optimized. Please file a feature request if you would like this optimization for list columns. -
New function frollapply for rolling computation of arbitrary R functions (caveat: input x is coerced to numeric beforehand, and the function must return a scalar numeric value). The API is consistent with the existing rolling functions frollmean and frollsum; note that it will generally be slower than those functions because (1) the known functions use our optimized internal C implementation and (2) there is no thread-safe API to R's C eval. Nevertheless frollapply is faster than corresponding base-only and zoo versions:

    set.seed(108)
    x = rnorm(1e6); n = 1e3
    base_rollapply = function(x, n, FUN) {
      nx = length(x)
      ans = rep(NA_real_, nx)
      for (i in n:nx) ans[i] = FUN(x[(i-n+1):i])
      ans
    }
    system.time(base_rollapply(x, n, mean))
    system.time(zoo::rollapplyr(x, n, function(x) mean(x), fill=NA))
    system.time(zoo::rollmeanr(x, n, fill=NA))
    system.time(frollapply(x, n, mean))
    system.time(frollmean(x, n))

    ### fun             mean     sum  median
    # base_rollapply   8.815   5.151  60.175
    # zoo::rollapply  34.373  27.837  88.552
    # zoo::roll[fun]   0.215   0.185      NA   ## median not fully supported
    # frollapply       5.404   1.419  56.475
    # froll[fun]       0.003   0.002      NA   ## median not yet supported
-
setnames() now accepts functions in old= and new=, #3703. Thanks @smingerson for the feature request and @shrektan for the PR.

    DT = data.table(a=1:3, b=4:6, c=7:9)
    setnames(DT, toupper)
    names(DT)
    # [1] "A" "B" "C"
    setnames(DT, c(1,3), tolower)
    names(DT)
    # [1] "a" "B" "c"
-
:=
andset()
now use zero-copy type coercion. Accordingly,DT[..., integerColumn:=0]
andset(DT,i,j,0)
no longer warn about the0
('numeric') needing to be0L
('integer') because there is no longer any time or space used for this coercion. The old long warning was off-putting to new users ("what and why L?"), whereas advanced users appreciated the old warning so they could avoid the coercion. Although the time and space for one coercion in a single call is unmeasurably small, when placed in a loop the small overhead of any allocation on R's heap could start to become noticeable (more so forset()
whose purpose is low-overhead looping). Further, when assigning a value across columns of varying types, it could be inconvenient to supply the correct type for every column. Hence, zero-copy coercion was introduced to satisfy all these requirements. A warning is still issued, as before, when fractional data is discarded; e.g. when 3.14 is assigned to an integer column. Zero-copy coercion applies to length>1 vectors as well as length-1 vectors.
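The behaviour described above can be sketched as follows. The table is made up, and the tryCatch wrappers are only there to record whether a warning was raised.

```r
library(data.table)

DT = data.table(i = 1:3)

# Assigning the double 0 to an integer column: no warning is expected any more
no_warning = tryCatch({ DT[2L, i := 0]; TRUE }, warning = function(w) FALSE)

# The column keeps its integer type
col_class = class(DT$i)

# Assigning fractional data still warns because precision would be lost
frac_msg = tryCatch({ DT[3L, i := 3.14]; NA_character_ },
                    warning = function(w) conditionMessage(w))

print(no_warning)
print(col_class)
print(frac_msg)
```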
-
first
,last
,head
andtail
by group no longer error in some cases, #2030 #3462. Thanks to @franknarf1 for reporting. -
keyby=colName
could use the wrong index and return incorrect results if bothcolName
andcolNameExtra
(wherecolName
is a leading subset of characters ofcolNameExtra
) are column names and an index exists oncolNameExtra
, #3498. Thanks to Xianying Tan for the detailed report and pinpointing the source line at fault. -
A missing item in
j
such asj=.(colA, )
now gives a helpful error (Item 2 of the .() or list() passed to j is missing
) rather than the unhelpful errorargument "this_jsub" is missing, with no default
(v1.12.2) orargument 2 is empty
(v1.12.0 and before), #3507. Thanks to @eddelbuettel for the report. -
fwrite()
could crash when writing very long strings such as 30 million characters, #2974, and could be unstable in memory constrained environments, #2612. Thanks to @logworthy and @zachokeeffe for reporting and Philippe Chataignon for fixing in PR #3288. -
fread()
could crash ifquote=""
(i.e. ignore quotes), the last line is too short, andfill=TRUE
, #3524. Thanks to Jiucang Hao for the report and reproducible example. -
Printing could occur unexpectedly when code is run with
source
, #2369. Thanks to @jan-glx for the report and reproducible example. -
Grouping by NULL on a zero-row data.table now behaves consistently with a non-zero-row data.table, #3530. Thanks to @SymbolixAU for the report and reproducible example. -
GForce optimization of
median
did not retain the class; e.g.median
ofDate
orPOSIXct
would return a raw number rather than retain the date class, #3079. Thanks to @Henrik-P for reporting. -
DT[, format(mean(date), "%b-%Y"), by=group]
could fail withinvalid 'trim' argument
, #1876. Thanks to Ross Holmberg for reporting. -
externalVar=1:5; DT[, mean(externalVar), by=group]
could return incorrect results rather than a constant (3
in this example) for each group, #875. GForce optimization was being applied incorrectly to themean
without realizingexternalVar
was not a column. -
test.data.table()
now passes in non-English R sessions, #630 #3039. Each test still checks that the number of warnings and/or errors produced is correct. However, a message is displayed suggesting to restart R withLANGUAGE=en
in order to test that the text of the warning and/or error messages are as expected, too. -
Joining a double column in
i
containing say 1.3, with an integer column inx
containing say 1, would result in the 1.3 matching to 1, #2592, and joining a factor column to an integer column would match the factor's integers rather than error. The type coercion logic has been revised and strengthened. Many thanks to @MarkusBonsch for reporting and fixing. Joining a character column ini
to a factor column inx
is now faster and retains the character column in the result rather than coercing it to factor. Joining an integer column ini
to a double column inx
now retains the integer type in the result rather than coercing the integers into the double type. Logical columns may now only be joined to logical columns, other than all-NA columns which are coerced to the matching column's type. All coercions are reported in verbose mode:options(datatable.verbose=TRUE)
. -
Attempting to recycle 2 or more items into an existing
list
column now gives the intended helpful error rather thanInternal error: recycle length error not caught earlier.
, #3543. Thanks to @MichaelChirico for finding and reporting. -
Subassigning using
$<-
to adata.table
embedded in a list column of a single-rowdata.table
could fail, #3474. Note that$<-
is not recommended; please use:=
instead which already worked in this case. Thanks to Jakob Richter for reporting. -
rbind
andrbindlist
of zero-row items now retain (again) the unused levels of any (zero-length) factor columns, #3508. This was a regression in v1.12.2 just for zero-row items. Unused factor levels were already retained for items havingnrow>=1
. Thanks to Gregory Demin for reporting. -
rbind
andrbindlist
of an item containing an ordered factor with levels containing anNA
(as opposed to an NA integer) could segfault, #3601. This was a regression in v1.12.2. Thanks to Damian Betebenner for reporting. A related segfault when recycling a length-1 factor column has also been fixed, #3662. -
example(":=", local=TRUE)
now works rather than error, #2972. Thanks @vlulla for the report. -
rbind.data.frame
onIDate
columns changed the column frominteger
todouble
, #2008. Thanks to @rmcgehee for reporting. -
merge.data.table
now retains any custom classes of the first argument, #1378. Thanks to @michaelquinn32 for reopening. -
c
,seq
andmean
ofITime
objects now retain theITime
class via newITime
methods, #3628. Thanks @UweBlock for reporting. Thecut
andsplit
methods forITime
have been removed since the default methods work, #3630. -
as.data.table.array
now handles the case when some of the array's dimension names areNULL
, #3636. -
Adding a
list
column usingcbind
,as.data.table
, ordata.table
now works rather than treating thelist
as if it were a set of columns and introducing an invalid NA column name, #3471. However, please note that using:=
to add columns is preferred.

    cbind( data.table(1:2), list(c("a","b"),"a") )
    #       V1     V2     NA   # v1.12.2 and before
    #    <int> <char> <char>
    # 1:     1      a      a
    # 2:     2      b      a
    #
    #       V1     V2          # v1.12.4+
    #    <int> <list>
    # 1:     1    a,b
    # 2:     2      a
-
Incorrect sorting/grouping results due to a bug in Intel's
icc
compiler 2019 (Version 19.0.4.243 Build 20190416) has been worked around thanks to a report and fix by Sebastian Freundt, #3647. Please rundata.table::test.data.table()
. If that passes, your installation does not have the problem. -
column not found
could incorrectly occur in rare non-equi-join cases, #3635. Thanks to @UweBlock for the report. -
Slight fix to the logic for auto-naming the by clause: a by expression that calls a custom function such as evaluate is now named evaluate rather than after its first symbolic argument, #3758. -
Column binding of zero column
data.table
will now work as expected, #3334. Thanks to @kzenstratus for the report. -
integer64
sum-by-group is now properly optimized, #1647, #3464. Thanks to @mlandry22-h2o for the report. -
From v1.12.0
between()
and%between%
interpret missing values inlower=
orupper=
as unlimited bounds. A new parameterNAbounds
has been added to achieve the old behaviour of returningNA
, #3522. Thanks @cguill95 for reporting. This is now consistent for character input, #3667 (thanks @AnonymousBoba), and classnanotime
is now supported too. -
integer64
defined on a subset of a new column would leave "gibberish" on the remaining rows, #3723. A bug inrbindlist
with the same root cause was also fixed, #1459. Thanks @shrektan and @jangorecki for the reports. -
groupingsets
functions now properly handle special symbols used alone when grouping by an empty set, #3653. Thanks to @Henrik-P for the report. -
A
data.table
created usingsetDT()
on adata.frame
containing identical columns referencing each other would causesetkey()
to return incorrect results, #3496 and #3766. Thanks @kirillmayantsev and @alex46015 for reporting, and @jaapwalhout and @Atrebas for helping to debug and isolate the issue. -
x[, round(.SD, 1)]
and similar operations on the whole of.SD
could return a locked result, incorrectly preventing:=
on the result, #2245. Thanks @grayskripko for raising. -
DT[, i-1L, with=FALSE]
would misinterpret the minus sign and return an incorrect result, #2019. Thanks @cguill95 for the report. -
DT[id==1, DT2[.SD, on="id"]]
(i.e. joining from.SD
inj
) could incorrectly fail in some cases due to.SD
being locked, #1926, and when updating-on-join with factors #3559 #2099. Thanks @franknarf1 and @Henrik-P for the reports and for diligently tracking use cases for almost 3 years! -
as.IDate.POSIXct
returnedNA
for UTC times before Dec 1901 and after Jan 2038, #3780. Thanks @gschett for the report. -
rbindlist
now returns correct idcols for lists with different length vectors, #3785, #3786. Thanks to @shrektan for the report and fix. -
DT[ , !rep(FALSE, ncol(DT)), with=FALSE]
correctly returns the full table, #3013 and #2917. Thanks @alexnss and @DavidArenburg for the reports. -
shift(x, 0:1, type='lead', give.names=TRUE)
useslead
in all returned column names, #3832. Thanks @daynefiler for the report. -
Subtracting two
POSIXt
objects by group could lead to incorrect results because thebase
method internally callsdifftime
withunits='auto'
;data.table
does not notice if the chosen units differ by group and only the last group'sunits
attribute was retained, #3694 and #761. To surmount this, we now internally forceunits='secs'
on allPOSIXt-POSIXt
calls (reported whenverbose=TRUE
); generally we recommend callingdifftime
directly instead. Thanks @oliver-oliver and @boethian for the reports. -
Using
get
/mget
inj
could cause.SDcols
to be ignored or reordered, #1744, #1965, #2036, and #2946. Thanks @franknarf1, @MichaelChirico, @TonyBonen, and Steffen J. (StackOverflow) for the reports. -
DT[...,by={...}]
now handles expressions in{
, #3156. Thanks to @tdhock for the report. -
:=
could change adata.table
creation statement in the body of the function calling it, or a variable in calling scope, #3890. Many thanks to @kirillmayantsev for the detailed reports. -
Grouping could create a
malformed factor
and/or segfault when the factors returned by each group did not have identical levels, #2199 and #2522. Thanks to Václav Hausenblas, @franknarf1, @ben519, and @Henrik-P for reporting. -
rbindlist
(and printing adata.table
with over 100 rows because that usesrbindlist(head, tail)
) could error withmalformed factor
for unordered factor columns containing a usedNA_character_
level, #3915. This is an unusual input for unordered factors because NA_integer_ is recommended by default in R. Thanks to @sindribaldur for reporting. -
Adding a
list
column containing an item of typelist
to a one rowdata.table
could fail, #3626. Thanks to Jakob Richter for reporting.
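
Two of the fixes listed above can be sketched as follows. This is only an illustration: the data, the variable `externalVar` and the bounds are made up for the example.

```r
library(data.table)

# (1) A variable that is not a column is evaluated as-is within each group,
#     so every group gets the same constant.
DT = data.table(group = c("a", "a", "b"), v = 1:3)
externalVar = 1:5
res = DT[, mean(externalVar), by = group]
res$V1                                   # mean(1:5) repeated for each group

# (2) A missing bound in between() is treated as unlimited by default.
x = 1:5
open_lower = between(x, NA_integer_, 3L) # behaves like x <= 3

# NAbounds = NA requests the older NA-returning behaviour instead.
old_style = between(x, NA_integer_, 3L, NAbounds = NA)
```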
-
rbindlist
'suse.names="check"
now emits its message for automatic column names ("V[0-9]+"
) too, #3484. See news item 5 of v1.12.2 below. -
Adding a new column by reference using
set()
on adata.table
loaded from binary file now gives a more helpful error message, #2996. Thanks to Joseph Burling for reporting.

    This data.table has either been loaded from disk (e.g. using readRDS()/load()) or constructed
    manually (e.g. using structure()). Please run setDT() or alloc.col() on it first (to pre-allocate
    space for new columns) before adding new columns by reference to it.
-
setorder
on a superset of a keyeddata.table
's key now retains its key, #3456. For example, ifa
is the key ofDT
,setorder(DT, a, -v)
will leaveDT
keyed bya
. -
New option
options(datatable.quiet = TRUE)
turns off the package startup message, #3489.suppressPackageStartupMessages()
continues to work too. Thanks to @leobarlach for the suggestion inspired byoptions(tidyverse.quiet = TRUE)
. We don't know of a way to make a package respect thequietly=
option oflibrary()
andrequire()
because thequietly=
isn't passed through for use by the package's own.onAttach
. If you can see how to do that, please submit a patch to R. -
When loading a
data.table
from disk (e.g. withreadRDS
), best practice is to runsetDT()
on the new object to assure it is correctly allocated memory for new column pointers. Barring this, unexpected behavior can follow; for example, if you assign a new column toDT
from a functionf
, the new columns will only be assigned withinf
andDT
will be unchanged. Theverbose
messaging in this situation is now more helpful, #1729. Thanks @vspinu for sharing his experience to spur this. -
New vignette Using
.SD
for Data Analysis, a deep dive into use cases for the.SD
variable to help illuminate this topic which we've found to be a sticking point for beginning and intermediatedata.table
users, #3412. -
Added a note to
?frank
clarifying that ranking is being done according to C sorting (i.e., likeforder
), #2328. Thanks to @cguill95 for the request. -
Historically,
dcast
andmelt
were built as enhancements toreshape2
's owndcast
/melt
. We removed dependency onreshape2
in v1.9.6 but maintained some backward compatibility. As that package has been superseded since December 2017, we will begin to formally complete the split fromreshape2
by removing some last vestiges. In particular we now warn when redirecting toreshape2
methods and will later error before ultimately completing the split; see #3549 and #3633. We thank thereshape2
authors for their original inspiration for these functions, and @ProfFancyPants for testing and reporting regressions in dev which have been fixed before release. -
DT[col]
wherecol
is a column containing row numbers of itself to select, now suggests the correct syntax (DT[(col)]
orDT[DT$col]
), #697. This expands the message introduced in #1884 for the case wherecol
is typelogical
andDT[col==TRUE]
is suggested. -
The
datatable.old.unique.by.key
option has been warning for 1 year that it is deprecated:... Please stop using it and pass by=key(DT) instead for clarity ...
. This warning is now upgraded to error as per the schedule in note 10 of v1.11.0 (May 2018), and note 1 of v1.9.8 (Nov 2016). In June 2020 the option will be removed. -
We intend to deprecate the
datatable.nomatch
option, more info. A message is now printed upon use of the option (once per session) as a first step. It asks you to please stop using the option and to passnomatch=NULL
explicitly if you require inner join. Outer join (nomatch=NA
) has always been the default because it is safer; it does not drop missing data silently. The problem is that the option is global; i.e., if a user changes the default using this option for their own use, that can change the behavior of joins inside packages that usedata.table
too. This is the onlydata.table
option with this concern. -
The test suite of 9k tests now runs with three R options on:
warnPartialMatchArgs
,warnPartialMatchAttr
, andwarnPartialMatchDollar
. This ensures that we don't rely on partial argument matching in internal code, for robustness and efficiency, and so that users can turn these options on for their code in production, #3664. Thanks to Vijay Lulla for the suggestion, and Michael Chirico for fixing 48 internal calls toattr()
which were missingexact=TRUE
, for example. Thanks to R-core for adding these options to R 2.6.0 (Oct 2007). -
test.data.table()
could fail if thedatatable.integer64
user option was set, #3683. Thanks @xiaguoxin for reporting. -
The warning message when using
keyby=
together with:=
is clearer, #2763. Thanks to @eliocamp. -
first
andlast
gain an explicitn=1L
argument so that it's clear the default is 1, and their almost identical manual pages have been merged into one. -
Rolling functions (
?froll
) coercelogical
input tonumeric
(instead of failing) to mimic the behavior ofinteger
input. -
The warning message when using
strptime
inj
has been improved, #2068. Thanks to @tdhock for the report. -
Added a note to
?setkey
clarifying thatsetkey
always uses C-locale sorting (as has been noted in?setorder
). Thanks @JBreidaks for the report in #2114. -
hour()
/minute()
/second()
are much faster forITime
input, #3518. -
New alias
setalloccol
foralloc.col
, #3475. For consistency withset*
prefixes for functions that operate in-place (likesetkey
,setorder
, etc.).alloc.col
is not going to be deprecated but we recommend usingsetalloccol
. -
dcast
no longer emits a message whenvalue.var
is missing butfun.aggregate
is explicitly set tolength
(sincevalue.var
is arbitrary in this case), #2980. -
Optimized
mean
ofinteger
columns no longer warns about a coercion to numeric, #986. Thanks @dgrtwo for his YouTube tutorial at 3:01 where the warning occurs. -
Using first and last on a POSIXct object no longer loads the xts namespace, #3857. first on an empty data.table now returns an empty data.table, #3858. -
Added some clarifying details about what happens when a shell command is used in
fread
, #3877. Thanks Brian for the StackOverflow question which highlighted the lack of explanation here. -
We continue to encourage packages to
Import
rather thanDepend
ondata.table
, #3076. To prevent the growth rate in new packages usingDepend
, we have requested that CRAN apply a small patch we provided to prevent new submissions usingDepend
. If this is accepted, the error under--as-cran
will be as follows. The existing 73 packages usingDepend
will continue to pass OK until they next update, at which point they will be required to change fromDepend
toImport
.

    R CMD check <pkg> --as-cran
    ...
    * checking package dependencies ... ERROR
    data.table should be in Imports not Depends. Please contact its maintainer for more information.

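
The notes above about a data.table loaded from disk suggest the following pattern. It is a minimal sketch under stated assumptions: the temporary file and the column names are invented for the example.

```r
library(data.table)

# Write a small data.table to a temporary file and read it back.
f = tempfile(fileext = ".rds")
saveRDS(data.table(a = 1:3), f)
DT = readRDS(f)

# Re-allocate space for new column pointers before adding by reference.
setDT(DT)

# Adding a column by reference with set() is now safe.
set(DT, j = "b", value = DT$a * 2L)
names(DT)

unlink(f)
```

Calling `setDT()` straight after `readRDS()` or `load()` keeps later by-reference additions, including those made inside functions, visible on the original object.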
data.table v1.12.2 (07 Apr 2019)
-
:=
no longer recycles length>1 RHS vectors. There was a warning when recycling left a remainder but no warning when the LHS length was an exact multiple of the RHS length (the same behaviour as base R). Consistent feedback for several years has been that recycling is more often a bug. In rare cases where you need to recycle a length>1 vector, please userep()
explicitly. Single values are still recycled silently as before. Early warning was given in this tweet. The 774 CRAN and Bioconductor packages usingdata.table
were tested and the maintainers of the 16 packages affected (2%) were consulted before going ahead, #3310. Upon agreement we went ahead. Many thanks to all those maintainers for already updating on CRAN, #3347. -
foverlaps
now supportstype="equal"
, #3416 and part of #3002. -
The number of logical CPUs used by default has been reduced from 100% to 50%. The previous 100% default was reported to cause significant slow downs when other non-trivial processes were also running, #3395 #3298. Two new optional environment variables (
R_DATATABLE_NUM_PROCS_PERCENT
&R_DATATABLE_NUM_THREADS
) control this default.setDTthreads()
gainspercent=
and?setDTthreads
has been significantly revised. The output ofgetDTthreads(verbose=TRUE)
has been expanded. The environment variableOMP_THREAD_LIMIT
is now respected (#3300) in addition toOMP_NUM_THREADS
as before. -
rbind
andrbindlist
now retain the position of duplicate column names rather than grouping them together #3373, fill length 0 columns (including NULL) with NA with warning #1871, and recycle length-1 columns #524. Thanks to Kun Ren for the requests which arose when parsing JSON. -
rbindlist
'suse.names=
default has changed fromFALSE
to"check"
. This emits a message if the column names of each item are not identical and then proceeds as ifuse.names=FALSE
for backwards compatibility; i.e., bind by column position not by column name. Therbind
method fordata.table
already setsuse.names=TRUE
so this change affectsrbindlist
only and notrbind.data.table
. To stack differently named columns together silently (the previous default behavior ofrbindlist
), it is now necessary to specifyuse.names=FALSE
for clarity to readers of your code. Thanks to Clayton Stanley who first raised the issue here. To aid pinpointing the calls torbindlist
that need attention, the message can be turned to error usingoptions(datatable.rbindlist.check="error")
. This option also accepts"warning"
,"message"
and"none"
. In this release the message is suppressed for default column names ("V[0-9]+"
); the next release will emit the message for those too. In 6 months the default will be upgraded from message to warning. There are two slightly different messages. They are helpful, include context and point to this news item:

    Column %d ['%s'] of item %d is missing in item %d. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. See news item 5 in v1.12.2 for options to control this message.

    Column %d ['%s'] of item %d appears in position %d in item %d. Set use.names=TRUE to match by column name, or use.names=FALSE to ignore column names. See news item 5 in v1.12.2 for options to control this message.
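
To make the difference between binding by name and binding by position concrete, here is a small sketch. The tables `a` and `b` are made up for the example; being explicit about `use.names=` avoids the check altogether.

```r
library(data.table)

# Same column names, different order.
a = data.table(x = 1:2, y = 3:4)
b = data.table(y = 5:6, x = 7:8)

# Bind by column name: b's x values land under x.
by_name = rbindlist(list(a, b), use.names = TRUE)

# Bind by column position: b's first column (y) lands under x.
by_pos = rbindlist(list(a, b), use.names = FALSE)

by_name$x
by_pos$x
```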
-
fread
gainskeepLeadingZeros
, #2999. By defaultFALSE
so that, as before, a field containing001
is interpreted as the integer 1, otherwise the character string"001"
. The default may be changed usingoptions(datatable.keepLeadingZeros=TRUE)
. Many thanks to @marc-outins for the PR.
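
Two of the changes in this list, `keepLeadingZeros` and the end of silent recycling in `:=`, can be sketched as below. The inline CSV text and the column names are assumptions made for the example.

```r
library(data.table)

# keepLeadingZeros: read the same text with and without the new argument.
csv = "id,v\n001,10\n002,20\n"
as_int  = fread(text = csv, keepLeadingZeros = FALSE)  # id read as integer
as_char = fread(text = csv, keepLeadingZeros = TRUE)   # id kept as character

# := no longer recycles a length>1 RHS, so this attempt is expected to error.
DT = data.table(a = 1:4)
recycled = tryCatch({ DT[, b := 1:2]; TRUE }, error = function(e) FALSE)

# Spell the intent out with rep() instead.
DT[, b := rep(1:2, length.out = .N)]
```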
-
rbindlist()
of a malformed factor which is missing a levels attribute is now a helpful error rather than a cryptic error aboutSTRING_ELT
, #3315. Thanks to Michael Chirico for reporting. -
Forgetting
type=
inshift(val, "lead")
would segfault, #3354. A helpful error is now produced to indicate"lead"
is being passed ton=
rather than the intendedtype=
argument. Thanks to @SymbolixAU for reporting. -
The default print output (top 5 and bottom 5 rows) when ncol>255 could display the columns in the wrong order, #3306. Thanks to Kun Ren for reporting.
-
Grouping by unusual column names such as
by='string_with_\\'
andkeyby="x y"
could fail, #3319 #3378. Thanks to @HughParsonage for reporting and @MichaelChirico for the fixes. -
foverlaps()
could return incorrect results forPOSIXct <= 1970-01-01
, #3349. Thanks to @lux5 for reporting. -
dcast.data.table
now handles functions passed tofun.aggregate=
via a variable; e.g., funs <- list(sum, mean); dcast(..., fun.aggregate=funs)
, #1974 #1369 #2064 #2949. Thanks to @sunbee, @Ping2016, @smidelius and @d0rg0ld for reporting. -
Some non-equijoin cases could segfault, #3401. Thanks to @Gayyam for reporting.
-
dcast.data.table
could sort rows containingNA
incorrectly, #2202. Thanks to @Galileo-Galilei for the report. -
Sorting, grouping and finding unique values of a numeric column containing at most one finite value (such as
c(Inf,0,-Inf)
) could return incorrect results, #3372 #3381; e.g.,data.table(A=c(Inf,0,-Inf), V=1:3)[,sum(V),by=A]
would treat the 3 rows as one group. This was a regression in 1.12.0. Thanks to Nicolas Ampuero for reporting. -
:=
with quoted expression and dot alias now works as expected, #3425. Thanks to @franknarf1 for raising and @jangorecki for the PR. -
A join's result could be incorrectly keyed when a single nomatch occurred at the very beginning while all other values matched, #3441. The incorrect key would cause incorrect results in subsequent queries. Thanks to @symbalex for reporting and @franknarf1 for pinpointing the root cause.
-
rbind
andrbindlist(..., use.names=TRUE)
with over 255 columns could return the columns in a random order, #3373. The contents and name of each column were correct but the order that the columns appeared in the result might not have matched the original input. -
rbind
andrbindlist
now combineinteger64
columns together with non-integer64
columns correctly #1349, and supportraw
columns #2819. -
NULL
columns are caught and error appropriately rather than segfault in some cases, #2303 #2305. Thanks to Hugh Parsonage and @franknarf1 for reporting. -
melt
would error with 'factor malformed' or segfault in the presence of duplicate column names, #1754. Many thanks to @franknarf1, William Marble, wligtenberg and Toby Dylan Hocking for reproducible examples. All examples have been added to the test suite. -
Removing a column from a null (0-column) data.table is now a (standard and simpler) warning rather than error, #2335. It is no longer an error to add a column to a null (0-column) data.table.
-
Non-UTF8 strings were not always sorted correctly on Windows (a regression in v1.12.0), #3397 #3451. Many thanks to @shrektan for reporting and fixing.
-
cbind
with a null (0-column)data.table
now works as expected, #3445. Thanks to @mb706 for reporting. -
Subsetting does a better job of catching a malformed
data.table
with error rather than segfault. A column may not be NULL, nor may a column be an object which has columns (such as adata.frame
ormatrix
). Thanks to a comment and reproducible example in #3369 from Drew Abbot which demonstrated the issue which arose from parsing JSON. The next release will enableas.data.table
to unpack columns which aredata.frame
to support this use case.
-
When upgrading to 1.12.0 some Windows users might have seen
CdllVersion not found
in some circumstances. We found a way to catch that so the helpful message now occurs for those upgrading from versions prior to 1.12.0 too, as well as those upgrading from 1.12.0 to a later version. See item 1 in notes section of 1.12.0 below for more background. -
v1.12.0 checked itself on loading using
tools::checkMD5sums("data.table")
but this check failed under thepackrat
package manager on Windows becausepackrat
appears to modify the DESCRIPTION file of packages it has snapshot, #3329. This check is now removed. TheCdllVersion
check was introduced after thecheckMD5sums()
attempt and is better; e.g., reliable on all platforms. -
As promised in new feature 6 of v1.11.6 Sep 2018 (see below in this news file), the
datatable.CJ.names
option's default is nowTRUE
. In v1.13.0 it will be removed. -
Travis CI gains OSX using homebrew llvm for OpenMP support, #3326. Thanks @marcusklik for the PR.
-
Calling
data.table:::print.data.table()
directly (i.e. bypassing method dispatch by using 3 colons) and passing it a 0-columndata.frame
(notdata.table
) now works, #3363. Thanks @heavywatal for the PR. -
v1.12.0 did not compile on Solaris 10 using Oracle Developer Studio 12.6, #3285. Many thanks to Prof Ripley for providing and testing a patch. For future reference and other package developers, a
const
variable should not be passed to OpenMP'snum_threads()
directive otherwiseleft operand must be modifiable lvalue
occurs. This appears to be a compiler bug which is why the specific versions are mentioned in this note. -
foverlaps
provides clearer error messages w.r.t. factor and POSIXct interval columns, #2645 #3007 #1143. Thanks to @sritchie73, @msummersgill and @DavidArenburg for the reports. -
unique(DT)
checks up-front the types of all the columns and will fail if any column is typelist
even though thoselist
columns may not be needed to establish uniqueness. Useunique(DT, by=...)
to specify columns that are not typelist
. v1.11.8 and before would also correctly fail with the same error, but not when uniqueness had been established in prior columns: it would stop early, not look at thelist
column and return the correct result. Checking up-front was necessary for some internal optimizations and it's probably best to be explicit anyway. Thanks to James Lamb for reporting, #3332. The error message has been embellished:

    Column 2 of by= (2) is type 'list', not yet supported. Please use the by= argument to specify
    columns with types that are supported.
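
A minimal sketch of the suggested workaround, assuming a made-up table with an `id` column and a list column named `payload`:

```r
library(data.table)

DT = data.table(
  id      = c(1L, 1L, 2L),
  payload = list(1:2, 1:2, "x")     # a list column
)

# Establish uniqueness on the non-list column only.
u = unique(DT, by = "id")
nrow(u)
```

The list column is carried along in the result; it is simply not used to decide which rows are duplicates.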
-
Reminder that note 11 in v1.11.0 (May 2018) warned that
set2key()
andkey2()
will be removed in May 2019. They have been warning since v1.9.8 (Nov 2016) and their warnings were upgraded to errors in v1.11.0 (May 2018). When they were introduced in version 1.9.4 (Oct 2014) they were marked as 'experimental'. -
The
key(DT)<-
form ofsetkey()
has been warning since at least 2012 to usesetkey()
. The warning is now stronger:key(x)<-value is deprecated and not supported. Please change to use setkey().
. This warning will be upgraded to error in one year.
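
For code that still uses the `key(DT)<-` form, the replacement is a one-line change. A small sketch with an invented table:

```r
library(data.table)

DT = data.table(id = c(3L, 1L, 2L), v = c("c", "a", "b"))

# Instead of the deprecated key(DT) <- "id", set the key by reference.
setkey(DT, id)

key(DT)     # the key column
DT$id       # rows are now sorted by id
```

`setkeyv(DT, "id")` is the equivalent form when the column names are held in a character vector.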
-
setDTthreads()
gainsrestore_after_fork=
, #2885. The defaultNULL
leaves the internal option unchanged which by default isTRUE
.data.table
has always switched to single-threaded mode on fork. It used to restore multithreading after a fork too but problems were reported on Mac and Intel OpenMP library (see 1.10.4 notes below). We are now trying again thanks to suggestions and success reported by Kun Ren and Mark Klik in packagefst
. If you experience problems with multithreading after a fork, please restart R and callsetDTthreads(restore_after_fork=FALSE)
. -
Subsetting, ordering and grouping now use more parallelism. See benchmarks here and Matt Dowle's presentation in October 2018 on YouTube here. These internal changes gave rise to 4 regressions which were found before release thanks to Kun Ren, #3211. He kindly volunteers to 'go-first' and runs data.table through his production systems before release. We are looking for a 'go-second' volunteer please. A request to test before release was tweeted on 17 Dec here. As usual, all CRAN and Bioconductor packages using data.table (currently 750) have been tested against this release, #3233. There are now 8,000 tests in 13,000 lines of test code; more lines of test code than there is code. Overall coverage has increased to 94% thanks to Michael Chirico.
-
New
frollmean
has been added by Jan Gorecki to calculate rolling mean, see?froll
for documentation. Function name and arguments are experimental. Related to #2778 (and #624, #626, #1855). Other rolling statistics will follow. -
fread()
can now read a remote compressed file in one step;fread("https://domain.org/file.csv.bz2")
. Thefile=
argument now supports.gz
and.bz2
too; i.e.fread(file="file.csv.gz")
works now where onlyfread("file.csv.gz")
worked in 1.11.8. -
nomatch=NULL
now does the same asnomatch=0L
in bothDT[...]
andfoverlaps()
; i.e. discards missing values silently (inner join). The default is stillnomatch=NA
(outer join) for statistical safety so that missing values are retained by default. After several years have elapsed, we will start to deprecate0L
; please start usingNULL
. In futurenomatch=.(0)
(note that.()
creates alist
type and is different tonomatch=0
) will fill with0
to save replacingNA
with0
afterwards, #857. -
setnames()
gainsskip_absent
to skip names inold
that aren't present, #3030. By defaultFALSE
so that it is still an error, as before, to attempt to change a column name that is not present. Thanks to @MusTheDataGuy for the suggestion and the PR. -
NA
inbetween()
and%between%
'slower
andupper
are now taken as missing bounds and returnTRUE
rather thanNA
. This is now documented. -
shift()
now interprets negative values ofn
to mean the oppositetype=
, #1708. Whengive.names=TRUE
the result is named using a positiven
with the appropriatetype=
. Alternatively, a newtype="shift"
names the result using a signedn
and constant type.

    shift(x, n=-5:5, give.names=TRUE)                =>  "_lead_5" ... "_lag_5"
    shift(x, n=-5:5, type="shift", give.names=TRUE)  =>  "_shift_-5" ... "_shift_5"
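
The equivalence between a negative `n` and the opposite `type=` can be checked directly. This is a small sketch on a made-up vector:

```r
library(data.table)

x = 1:5

lead1 = shift(x, n = 1L, type = "lead")   # explicit lead by one
neg1  = shift(x, n = -1L)                 # default type="lag" with negative n
sh1   = shift(x, n = -1L, type = "shift") # signed-n form

identical(lead1, neg1)
identical(lead1, sh1)
```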
-
fwrite()
now acceptsmatrix
, #2613. Thanks to Michael Chirico for the suggestion and Felipe Parages for implementing. For now matrix input is converted to data.table (which can be costly) before writing. -
fread()
andfwrite()
can now handle file names in native and UTF-8 encoding, #3078. Thanks to Daniel Possenriede (@dpprdan) for reporting and fixing. -
DT[i]
andDT[i,cols]
now call internal parallel subsetting code, #2951. Subsetting is significantly faster (as are many other operations) with factor columns rather than character.

    N = 2e8                           # 4GB data on 4-core CPU with 16GB RAM
    DT = data.table(ID = sample(LETTERS,N,TRUE),
                    V1 = sample(5,N,TRUE),
                    V2 = runif(N))
    w = which(DT$V1 > 3)              #  select 40% of rows
                                      #  v1.12.0   v1.11.8
    system.time(DT[w])                #     0.8s      2.6s
    DT[, ID := as.factor(ID)]
    system.time(DT[w])                #     0.4s      2.3s
    system.time(DT[w, c("ID","V2")])  #     0.3s      1.9s
-
DT[..., .SDcols=]
now acceptspatterns()
; e.g.DT[..., .SDcols=patterns("^V")]
, for filtering columns according to a pattern (as inmelt.data.table
), #1878. Thanks to many people for pushing for this and @MichaelChirico for ultimately filing the PR. See?data.table
for full details and examples. -
split
data.table method will now preserve attributes, closes #2047. Thanks to @caneff for reporting. -
DT[i,j]
now retains user-defined and inherited attributes, #995; e.g.

    attr(datasets::BOD,"reference")                     # "A1.4, p. 270"
    attr(as.data.table(datasets::BOD)[2],"reference")   # was NULL now "A1.4, p. 270"

If a superclass defines attributes that may not be valid after a
[
subset then the superclass should implement its own[
method to manage those after callingNextMethod()
.
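
Three of the new features listed above (`nomatch=NULL`, `.SDcols=patterns()` and `setnames(skip_absent=)`) can be sketched together. All table and column names here are invented for the example.

```r
library(data.table)

# nomatch=NULL: inner join, unmatched rows of i are dropped.
X = data.table(id = 1:3, v = c("a", "b", "c"), key = "id")
inner = X[.(c(2L, 4L)), nomatch = NULL]
outer = X[.(c(2L, 4L))]                  # default nomatch=NA keeps id 4

# .SDcols=patterns(): select columns by regular expression.
DT = data.table(V1 = 1:2, V2 = 3:4, other = 5:6)
sums = DT[, lapply(.SD, sum), .SDcols = patterns("^V")]

# skip_absent=TRUE: names in 'old' that are not present are skipped.
setnames(DT, old = c("V1", "missing_col"), new = c("first", "unused"),
         skip_absent = TRUE)
names(DT)
```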
-
Providing an
i
subset expression when attempting to delete a column correctly failed with helpful error, but when the column was missing too created a new column full ofNULL
values, #3089. Thanks to Michael Chirico for reporting. -
Column names that look like expressions (e.g.
"a<=colB"
) caused an error when used inon=
even when wrapped with backticks, #3092. Additionally,on=
now supports white spaces around operators; e.g.on = "colA == colB"
. Thanks to @mt1022 for reporting and to @MarkusBonsch for fixing. -
Unmatched
patterns
inmeasure.vars
fail early and with feedback, #3106. -
fread(..., skip=)
now skips non-standard\r
and\n\r
line endings properly again, #3006. Standard line endings (\n
Linux/Mac and\r\n
Windows) were skipped ok. Thanks to @brattono and @tbrycekelly for providing reproducible examples, and @st-pasha for fixing. -
fread(..., colClasses=)
could return a corrupted result when a lower type was requested for one or more columns (e.g. reading "3.14" as integer), #2922 #2863 #3143. It now ignores the request as documented and the helpful message in verbose mode is upgraded to warning. In future, coercing to a lower type might be supported (with warning if any accuracy is lost)."NULL"
is recognized again in both vector and list mode; e.g.colClasses=c("integer","NULL","integer")
andcolClasses=list(NULL=2, integer=10:40)
. Thanks to Arun Srinivasan, Kun Ren, Henri Ståhl and @kszela24 for reporting. -
cube()
will now produce expected order of results, #3179. Thanks to @Henrik-P for reporting. -
groupingsets()
groups by empty column set and constant value inj
, #3173. -
split.data.table()
failed ifDT
had a factor column named"x"
, #3151. Thanks to @tdeenes for reporting and fixing. -
fsetequal
now properly handles datasets whose last column is character, closes #2318. Thanks to @pschil and @franknarf1 for reporting. -
DT[..., .SDcols=integer(0L)]
could fail, #3185. An emptydata.table
is now returned correctly. -
as.data.table.default
method will now always copy its input, closes #3230. Thanks to @NikdAK for reporting. -
DT[..., .SDcols=integer()]
failed with.SDcols is numeric but has both +ve and -ve indices
, #1789 and #3185. It now functions as.SDcols=character()
has done and creates an empty.SD
. Thanks to Gabor Grothendieck and Hugh Parsonage for reporting. A related issue with empty.SDcols
was fixed in development before release thanks to Kun Ren's testing, #3211. -
Multithreaded stability should be much improved with R 3.5+. Many thanks to Luke Tierney for pinpointing a memory issue with package
constellation
caused bydata.table
and his advice, #3165. Luke also added an extra check to R-devel when compiled with--enable-strict-barrier
. The test suite is run through latest daily R-devel after every commit as usual, but now with--enable-strict-barrier
on too via GitLab CI ("Extra" badge on thedata.table
homepage) thanks to Jan Gorecki. -
Fixed an edge-case bug of platform-dependent output of
strtoi("", base = 2L)
on whichgroupingsets
had relied, #3267.
-
When data.table loads it now checks its DLL version against the version of its R level code. This is to detect installation issues on Windows when i) the DLL is in use by another R session and ii) the CRAN source version > CRAN binary version, which happens just after a new release (R prompts users to install from source until the CRAN binary is available). This situation can lead to a state where the package's new R code calls old C code in the old DLL; R#17478, #3056. This broken state can persist until, hopefully, you experience a strange error caused by the mismatch. Otherwise, wrong results may occur silently. This situation applies to any R package with compiled code not just data.table, is Windows-only, and is long-standing. It has only recently been understood as it typically only occurs during the few days after each new release until binaries are available on CRAN.
-
When
on=
is provided but noti=
, a helpful error is now produced rather than silently ignoringon=
. Thanks to Dirk Eddelbuettel for the idea. -
.SDcols=
is more helpful when passed non-existent columns, #3116 and #3118. Thanks to Michael Chirico for the investigation and PR. -
update.dev.pkg()
gainstype=
to specify if update should be made from binaries, sources or both. #3148. Thanks to Reino Bruner for the detailed suggestions. -
setDT()
improves feedback when passed a ragged list (i.e. where all columns in the list are not the same length), #3121. Thanks @chuk-yong for highlighting. -
The one and only usage of
UNPROTECT_PTR()
has been removed, #3232. Thanks to Tomas Kalibera's investigation and advice here: https://developer.r-project.org/Blog/public/2018/12/10/unprotecting-by-value/index.html
fread()
can now read.gz
and.bz2
files directly:fread("file.csv.gz")
, #717 #3058. It usesR.utils::decompressFile
to decompress to atempfile()
which is then read byfread()
in the usual way. For greater speed on large-RAM servers, it is recommended to use ramdisk for temporary files by settingTMPDIR
to/dev/shm
before starting R; see?tempdir
. The decompressed temporary file is removed as soon asfread
completes even if there is an error reading the file. Reading a remote compressed file in one step will be supported in the next version; e.g.fread("https://domain.org/file.csv.bz2")
.
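A minimal sketch of reading a compressed file directly, assuming the `R.utils` package is installed (as described above, `fread()` relies on it for decompression). The file name and contents are made up for illustration.

```r
library(data.table)

# Write a small csv and compress it using base R's gzfile() connection.
gz = file.path(tempdir(), "example.csv.gz")   # hypothetical file name
con = gzfile(gz, "w")
writeLines(c("A,B", "1,x", "2,y"), con)
close(con)

# fread() decompresses to a temporary file and then reads it in the usual way.
# The guard skips the read when R.utils is not available.
if (requireNamespace("R.utils", quietly=TRUE)) {
  DT = fread(gz)
  print(DT)
}
```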
-
Joining two keyed tables using
on=
to columns not forming a leading subset ofkey(i)
could result in an invalidly keyed result, #3061. Subsequent queries on the result could then return incorrect results. A warninglonger object length is not a multiple of shorter object length
could also occur. Thanks to @renkun-ken for reporting and the PR. -
keyby=
on columns for which an index exists now uses the index (new feature 7 in v1.11.6 below) but if ani
subset is present in the same query then it could segfault, #3062. Again thanks to @renkun-ken for reporting. -
Assigning an out-of-range integer to an item in a factor column (a rare operation) correctly created an
NA
in that spot with warning, but now no longer also corrupts the variable being assigned, #2984. Thanks to @radfordneal for reporting and @MarkusBonsch for fixing. Assigning a string which is missing from the factor levels continues to automatically append the string to the factor levels. -
Assigning a sequence to a column using base R methods (e.g.
DT[["foo"]] = 1:10
) could cause subsetting to fail withInternal error in subset.c: column <n> is an ALTREP vector
, #3051. Thanks to Michel Lang for reporting. -
as.data.table
matrix
method now properly handles rownames for 0 column data.table output. Thanks @mllg for reporting. Closes #3149.
-
The test suite now turns on R's new R_CHECK_LENGTH_1_LOGIC2 to catch when internal use of
&&
or||
encounters arguments of length more than one. Thanks to Hugh Parsonage for implementing and fixing the problems caught by this. -
Some namespace changes have been made with respect to melt, dcast and xts. No change is expected but if you do have any trouble, please file an issue.
-
split.data.table
was exported in v1.11.6 in addition to being registered usingS3method(split, data.table)
. The export has been removed again. It had been added because a user said they found it difficult to find, #2920. But S3 methods are not normally exported explicitly by packages. The proper way to access thesplit.data.table
method is to callsplit(DT)
whereDT
is adata.table
. The generic (base::split
in this case) then dispatches to thesplit.data.table
method. v1.11.6 was not on CRAN very long (1 week) so we think it's better to revert this change quickly. To know what methods exist, R provides themethods()
function.

    methods(split)               # all the methods for the split generic
    methods(class="data.table")  # all the generics that data.table has a method for (47 currently)

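As a small illustration of calling the generic rather than the method, the sketch below splits a toy table by one column. The table and column names are invented for the example; note that each piece is a copy in new memory.

```r
library(data.table)

DT = data.table(g=c("a","b","a"), v=1:3)   # toy data for illustration

# Call the generic; base::split dispatches to the data.table method.
parts = split(DT, by="g")

names(parts)          # one list element per value of g
sapply(parts, nrow)   # number of rows in each piece
```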
-
For convenience when some of the files in
fnams
are empty inrbindlist(lapply(fnams,fread))
,fread
now reads empty input as a null-data.table with warning rather than error, #2898. For consistency,fwrite(data.table(NULL))
now creates an empty file and warns instead of error, too. -
setcolorder(DT)
without further arguments now defaults to moving the key columns to be first, #2895. Thanks to @jsams for the PR. -
Attempting to subset on
col
when the column is actually calledCol
will still error, but the error message will helpfully suggest similarly-spelled columns, #2887. This is experimental, applies just toi
currently, and we look forward to feedback. Thanks to Michael Chirico for the suggestion and PR. -
fread()
has always accepted literal data; e.g.fread("A,B\n1,2\n3,4")
. It now gains explicittext=
; e.g.fread(text="A,B\n1,2\n3,4")
. Unlike the first general purposeinput=
argument, thetext=
argument accepts multi-line input; e.g.fread(text=c("A,B","1,2","3,4"))
, #1423. Thanks to Douglas Clark for the request and Hugh Parsonage for the PR. -
fread()
has always accepted system commands; e.g.fread("grep blah file.txt")
. It now gains explicitcmd=
; e.g.fread(cmd="grep blah file.txt")
. Further, if and only ifinput=
is a system command and a variable was used to hold that command (fread(someCommand)
notfread("grep blah file.txt")
) or a variable is used to construct it (fread(paste("grep",variable,"file.txt"))
), a message is now printed suggestingcmd=
. This is to inform all users that there is a potential security concern if you are i) creating apps, and ii) your app takes input from a public user who could be malicious, and iii) input from the malicious user (such as a filename) is passed by your app tofread()
, and iv) your app is not running in a protected environment. If all 4 conditions hold then the malicious user could provide a system command instead of a filename which fread()
would run, and that would be a problem too. If the app is not running in a protected environment (e.g. app is running as root) then this could do damage or obtain data you did not intend. Public facing apps should be running with limited operating system permission so that any breach from any source is contained. We agree with Linus Torvalds's advice on this which boils down to: "when addressing security concerns the first step is do no harm, just inform". If you aren't creating apps or APIs that could have a malicious user then there is no risk but we can't distinguish you so we have to inform everyone. Please change to fread(cmd=...)
at your leisure. The new message can be suppressed withoptions(datatable.fread.input.cmd.message=FALSE)
. Passing system commands tofread()
continues to be recommended and encouraged and is widely used; e.g. via the techniques gathered together in the book Data Science at the Command Line. Awarning()
is too strong because best-practice for production systems is to setoptions(warn=2)
to tolerate no warnings. Such production systems have no user input and so there is no security risk; we don't want to do harm by breaking production systems via awarning()
which gets turned into an error byoptions(warn=2)
. Now that we have informed all users, we request feedback. There are 3 options for future releases: i) remove the message, ii) leave the message in place, iii) upgrade the message to warning and then eventually error. The default choice is the middle one: leave the message in place. -
New
options(datatable.CJ.names=TRUE)
changesCJ()
to auto-name its inputs exactly asdata.table()
does, #1596. Thanks @franknarf1 for the suggestion. Current default isFALSE
; i.e. no change. The option's default will be changed toTRUE
in v1.12.0 and then eventually the option will be removed. Any code that depends onCJ(x,y)$V1
will need to be changed toCJ(x,y)$x
and is more akin to a bug fix due to the inconsistency withdata.table()
. -
If an appropriate index exists,
keyby=
will now use it. For example, givensetindex(DT,colA,colB)
, bothDT[,j,keyby=colA]
(a leading subset of the index columns) andDT[,j,keyby=.(colA,colB)]
will use the index, but notDT[,j,keyby=.(colB,colA)]
. The optionoptions(datatable.use.index=FALSE)
will turn this feature off. Please always usekeyby=
unless you wish to retain the order of groups by first-appearance order (in which case useby=
). Also, bothkeyby=
andby=
already used the key where possible but are now faster when using just the first column of the key. As usual, settingverbose=TRUE
either per-query or globally usingoptions(datatable.verbose=TRUE)
will report what's being done internally.
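The index-aware `keyby=` described above can be tried with a sketch like the following. The data is made up, and it is built with the explicit `text=` argument mentioned earlier in this list; `verbose=TRUE` is passed so that data.table reports what it does internally (the exact wording of that report is not shown here since it varies by version).

```r
library(data.table)

# Multi-line literal input via the explicit text= argument.
DT = fread(text=c("colA,colB,v", "2,b,1", "1,a,2", "2,a,3", "1,b,4"))

setindex(DT, colA, colB)   # secondary index on (colA, colB)
indices(DT)

# keyby= on a leading subset of the index columns; the result is keyed by colA.
res = DT[, .(total=sum(v)), keyby=colA, verbose=TRUE]
res
key(res)
```

Using `keyby=.(colB, colA)` instead would still work, but per the note above it would not be able to use this index because the column order differs.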
-
fread
now respects the order of columns passed toselect=
when column numbers are used, #2986. It already respected the order when column names are used. Thanks @privefl for raising the issue. -
gmin
andgmax
no longer fail on ordered factors, #1947. Thanks to @mcieslik-mctp for identifying and @mbacou for the nudge. -
as.ITime.character
now properly handles NA when attempting to detect the format of non-NA values in vector. Thanks @polyjian for reporting, closes #2940. -
as.matrix(DT, rownames="id")
now works whenDT
has a single row, #2930. Thanks to @malcook for reporting and @sritchie73 for fixing. The root cause was the dual meaning of therownames=
argument: i) a single column name/number (most common), or ii) rowname values length 1 for the single row. For clarity and safety,rownames.value=
has been added. Old usage (i.e.length(rownames)>1
) continues to work for now but will issue a warning in a future release, and then error in a release after that. -
Fixed regression in v1.11.0 (May 2018) caused by PR #2389 which introduced partial key retainment on
:=
assigns. This broke the joining logic that assumed implicitly that assigning always drops keys completely. Consequently, join and subset results could be wrong when matching character to factor columns with existing keys, #2881. Thanks to @ddong63 for reporting and to @MarkusBonsch for fixing. Missing test added to ensure this doesn't arise again. -
as.IDate.numeric
no longer ignores "origin", #2880. Thanks to David Arenburg for reporting and fixing. -
as.ITime.times
was rounding fractional seconds while other methods were truncating, #2870. Theas.ITime
method gainsms=
taking"truncate"
(default),"nearest"
and"ceil"
. Thanks to @rossholmberg for reporting and Michael Chirico for fixing. -
fwrite()
now writes POSIXct dates after 2038 correctly, #2995. Thanks to Manfred Zorn for reporting and Philippe Chataignon for the PR fixing it. -
fsetequal
gains theall
argument to make it consistent with the other set operator functionsfunion
,fsetdiff
andfintersect
#2968. Whenall = FALSE
fsetequal
will treat rows as elements in a set when checking whether twodata.tables
are equal (i.e. duplicate rows will be ignored). For now the default value isall = TRUE
for backwards compatibility, but this will be changed toall = FALSE
in a future release to make it consistent with the other set operation functions. Thanks to @franknarf1 for reporting and @sritchie73 for fixing. -
fintersect
failed on tables with a column calledy
, #3034. Thanks to Maxim Nazarov for reporting. -
Compilation failed on AIX because its definitions of the NAN and INFINITY macros are not constant literals, #3043. Thanks to Ayappan for reporting and fixing.
-
The introduction of altrep in R 3.5.0 caused some performance regressions of about 20% in some cases, #2962. Investigating this led to some improvements to grouping which are faster than before R 3.5.0 in some cases. Thanks to Nikolay S. for reporting. The work to accommodate altrep is not complete but it is better and it is highly recommended to upgrade to this update.
-
Fixed 7 memory faults thanks to CRAN's
rchk
tool by Tomas Kalibera, #3033.
-
The type coercion warning message has been improved, #2989. Thanks to @sarahbeeysian on Twitter for highlighting. For example, given the following statements:

    DT = data.table(id=1:3)
    DT[2, id:="foo"]

the warning message has changed from :
Coerced character RHS to integer to match the column's type. Either change the target column ['id'] to character first (by creating a new character vector length 3 (nrows of entire table) and assign that; i.e. 'replace' column), or coerce RHS to integer (e.g. 1L, NA_[real|integer]_, as.*, etc) to make your intent clear and for speed. Or, set the column type correctly up front when you create the table and stick to it, please.
to :
Coerced character RHS to integer to match the type of the target column (column 1 named 'id'). If the target column's type integer is correct, it's best for efficiency to avoid the coercion and create the RHS as type integer. To achieve that consider the L postfix: typeof(0L) vs typeof(0), and typeof(NA) vs typeof(NA_integer_) vs typeof(NA_real_). Wrapping the RHS with as.integer() will avoid this warning but still perform the coercion. If the target column's type is not correct, it is best to revisit where the DT was created and fix the column type there; e.g., by using colClasses= in fread(). Otherwise, you can change the column type now by plonking a new column (of the desired type) over the top of it; e.g. DT[, `id`:=as.character(`id`)]. If the RHS of := has nrow(DT) elements then the assignment is called a column plonk and is the way to change a column's type. Column types can be observed with sapply(DT,typeof).
Further, if a coercion from double to integer is performed, fractional data such as 3.14 is now detected and the truncation to 3 is warned about if and only if truncation has occurred.

    DT = data.table(v=1:3)
    DT[2, v:=3.14]
    Warning message:
      Coerced double RHS to integer to match the type of the target column (column 1 named 'v'). One or more RHS values contain fractions which have been lost; e.g. item 1 with value 3.140000 has been truncated to 3.

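The two remedies that the new message suggests can be sketched as follows: create the RHS with the target column's type, or change the column's type first by plonking a new column over it. The values are arbitrary and only for illustration.

```r
library(data.table)

DT = data.table(id=1:3)

# Remedy 1: the column type is correct, so supply an integer RHS (L postfix).
DT[2, id := 10L]

# Remedy 2: the column type is not what is wanted, so plonk a character
# column over the top (RHS has nrow(DT) elements), then assign the string.
DT[, id := as.character(id)]
DT[2, id := "foo"]

sapply(DT, typeof)   # observe the column types
```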
-
split.data.table
method is now properly exported, #2920. But we don't recommend it becausesplit
copies all the pieces into new memory. -
Setting indices on columns which are part of the key will now create those indices.
-
hour
,minute
, andsecond
utility functions use integer arithmetic when the input is already (explicitly) UTC-basedPOSIXct
for 4-10x speedup vs. usingas.POSIXlt
. -
Error added for incorrect usage of
%between%
, with some helpful diagnostic hints, #3014. Thanks @peterlittlejohn for offering his user experience and providing the impetus.
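For reference, a brief sketch of the usage that `%between%` expects: the right-hand side holds a lower bound followed by an upper bound. The vectors below are invented, and the error text itself is not reproduced here since it may differ between versions.

```r
library(data.table)

x = c(1, 5, 10)

# Correct usage: RHS of length 2, lower bound then upper bound (inclusive).
inside = x %between% c(2, 10)
inside

# Incorrect usage: an RHS of length 3 is rejected with an error.
bounds = c(2, 5, 10)
bad = tryCatch(x %between% bounds, error=function(e) e)
inherits(bad, "error")
```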
-
Empty RHS of
:=
is no longer an error when thei
clause returns no rows to assign to anyway, #2829. Thanks to @cguill95 for reporting and to @MarkusBonsch for fixing. -
Fixed runaway memory usage with R-devel (R > 3.5.0), #2882. Thanks to many people but in particular to Trang Nguyen for making the breakthrough reproducible example, Paul Bailey for liaising, and Luke Tierney for then pinpointing the issue. It was caused by an interaction of two or more data.table threads operating on new compact vectors in the ALTREP framework, such as the sequence
1:n
. This interaction could result in R's garbage collector turning off, and hence the memory explosion. Problems may occur in R 3.5.0 too but we were only able to reproduce in R > 3.5.0. The R code in data.table's implementation benefits from ALTREP (for
loops in R no longer allocate their range vector input, for example) but compact vectors are not so appropriate as data.table columns. Sequences such as 1:n
are common in test data but not very common in real-world datasets. Therefore, there is no need for data.table to support columns which are ALTREP compact sequences. Thedata.table()
function already expanded compact vectors (by happy accident) butsetDT()
did not (it now does). If, somehow, a compact vector still reaches the internal parallel regions, a helpful error will now be generated. If this happens, please report it as a bug. -
Tests 1590.3 & 1590.4 now pass when users run
test.data.table()
on Windows, #2856. Thanks to Avraham Adler for reporting. Those tests were passing on AppVeyor, win-builder and CRAN's Windows becauseR CMD check
setsLC_COLLATE=C
as documented in R-exts$1.3.1, whereas by default on WindowsLC_COLLATE
is usually a regional Windows-1252 dialect such asEnglish_United States.1252
. -
Around 1 billion very small groups (of size 1 or 2 rows) could result in
"Failed to realloc working memory"
even when plenty of memory is available, #2777. Thanks once again to @jsams for the detailed report as a follow up to bug fix 40 in v1.11.0.
-
test.data.table()
created/overwrote variablex
in.GlobalEnv
, #2828; i.e. a modification of user's workspace which is not allowed. Thanks to @etienne-s for reporting. -
as.chron
methods forIDate
andITime
have been removed, #2825.as.chron
still works sinceIDate
inherits fromDate
. We are not sure why we had specific methods in the first place. It may have been from a time whenIDate
did not inherit fromDate
, perhaps. Note that we don't usechron
ourselves in our own work. -
Fixed
SETLENGTH() cannot be applied to an ALTVEC object
starting in R-devel (R 3.6.0) on 1 May 2018, a few hours after 1.11.0 was accepted on CRAN, #2820. Many thanks to Luke Tierney for pinpointing the problem. -
Fixed some rare memory faults in
fread()
andrbindlist()
found withgctorture2()
andrchk
, #2841.
-
fread()
'sna.strings=
argument :

    "NA"                                     # old default
    getOption("datatable.na.strings", "NA")  # this release; i.e. the same; no change yet
    getOption("datatable.na.strings", "")    # future release

This option controls how
,,
is read in character columns. It does not affect numeric columns which read,,
asNA
regardless. We would like,,
=>NA
for consistency with numeric types, and,"",
=>empty string to be the standard default forfwrite/fread
character columns so thatfread(fwrite(DT))==DT
without needing any change to any parameters.fwrite
has never writtenNA
as"NA"
in case"NA"
is a valid string in the data; e.g., 2 character id columns sometimes do. Instead,fwrite
has always written,,
by default for an<NA>
in a character column. The use of R's getOption()
allows users to move forward now, using options(datatable.na.strings="")
, or restore old behaviour when the default's default is changed in future, using options(datatable.na.strings="NA")
. -
fread()
andfwrite()
'slogical01=
argument :

    logical01 = FALSE                        # old default
    getOption("datatable.logical01", FALSE)  # this release; i.e. the same; no change yet
    getOption("datatable.logical01", TRUE)   # future release

This option controls whether a column of all 0's and 1's is read as
integer
, orlogical
directly to avoid needing to change the type afterwards tological
or usecolClasses
.0/1
is smaller and faster than"TRUE"/"FALSE"
, which can make a significant difference to space and time the morelogical
columns there are. When the default's default changes toTRUE
forfread
we do not expect much impact since all arithmetic operators that are currently receiving 0's and 1's as typeinteger
(thinksum()
) but instead could receivelogical
, would return exactly the same result on the 0's and 1's aslogical
type. However, code that is manipulating column types usingis.integer
oris.logical
onfread
's result, could require change. It could be painful ifDT[(logical_column)]
(i.e.DT[logical_column==TRUE]
) changed behaviour due tological_column
no longer being typelogical
butinteger
. But that is not the change proposed. The change is the other way around; i.e., a previouslyinteger
column holding only 0's and 1's would now be typelogical
. Since it's that way around, we believe the scope for breakage is limited. We think a lot of code is converting 0/1 integer columns to logical anyway, either usingcolClasses=
or afterwards with an assign. Forfwrite
, the level of breakage depends on the consumer of the output file. We believe0/1
is a better more standard default choice to move to. See notes below about improvements tofread
's sampling for type guessing, and automatic rereading in the rare cases of out-of-sample type surprises.
These options are meant for temporary use to aid your migration, #2652. You are not meant to set them to the old default and then not migrate your code that is dependent on the default. Either set the argument explicitly so your code is not dependent on the default, or change the code to cope with the new default. Over the next few years we will slowly start to remove these options, warning you if you are using them, and return to a simple default. See the history of NEWS and NEWS.0 for past migrations that have, generally speaking, been successfully managed in this way. For example, at the end of NOTES for this version (below in this file) is a note about the usage of datatable.old.unique.by.key
now warning, as you were warned it would do over a year ago. When that change was introduced, the default was changed and that option provided an option to restore the old behaviour. These fread
/fwrite
changes are even more cautious and do not even change the default's default yet. This notice gives you extra warning to move forward, and a chance to object.
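One way to make code independent of these defaults, as recommended above, is to pass the arguments explicitly. The sketch below uses invented input; only the effect of `logical01=` on the column type is checked, since how empty fields are read in character columns depends on the `na.strings=` setting and on the data.table version.

```r
library(data.table)

txt = "id,flag,name\n1,0,amy\n2,1,\n3,1,bob\n"   # made-up input

# Explicit arguments, so the result does not depend on the option defaults.
a = fread(text=txt, logical01=TRUE,  na.strings="")
b = fread(text=txt, logical01=FALSE, na.strings="NA")

sapply(a, typeof)   # flag is read as logical
sapply(b, typeof)   # flag is read as integer
```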
-
fread()
:- Efficiency savings at C level including parallelization announced here; e.g. a 9GB 2 column integer csv input is 50s down to 12s to cold load on a 4 core laptop with 16GB RAM and SSD. Run
echo 3 >/proc/sys/vm/drop_caches
first to measure cold load time. Subsequent load time (after file has been cached by OS on the first run) 40s down to 6s. - The fread for small data page has been revised.
- Memory maps lazily; e.g. reading just the first 10 rows with
nrow=10
is 12s down to 0.01s from cold for the 9GB file. Large files close to your RAM limit may work more reliably too. The progress meter will commence sooner and more consistently. fread
has always jumped to the middle and to the end of the file for a much improved column type guess. The sample size is increased from 100 rows at 10 jump points (1,000 rows) to 100 rows at 100 jump points (10,000 row sample). In the rare case of there still being out-of-sample type exceptions, those columns are now automatically reread so you don't have to use colClasses
yourself.- Large number of columns support; e.g. 12,000 columns tested.
- Quoting rules are more robust and flexible. See point 10 on the wiki page here.
- Numeric data that has been quoted is now detected and read as numeric.
- The ability to position
autostart
anywhere inside one of multiple tables in a single file is removed with warning. It used to search upwards from that line to find the start of the table based on a consistent number of columns. People appear to be usingskip="string"
orskip=nrow
to find the header row exactly, which is retained and simpler. It was too difficult to retain search-upwards-autostart together with skipping/filling blank lines, filling incomplete rows and parallelization too. If there is any header info above the column names, it is still auto detected and auto skipped (particularly useful when loading a set of files where the column names start on different lines due to a varying height messy header). dec=','
is now implemented directly so there is no dependency on locale. The optionsdatatable.fread.dec.experiment
anddatatable.fread.dec.locale
have been removed. \r\r\n
line endings are now handled such as produced by base::download.file()
when it doubles up \r
. Other rare line endings (\r
and \n\r
) are now more robust.- Mixed line endings are now handled; e.g. a file formed by concatenating a Unix file and a Windows file so that some lines end with
\n
while others end with \r\n
. - Improved automatic detection of whether the first row is column names by comparing the types of the fields on the first row against the column types ascertained by the 10,000 rows sample (or
colClasses
if provided). If a numeric column has a string value at the top, then column names are deemed present. - Detects GB-18030 and UTF-16 encodings and in verbose mode prints a message about BOM detection.
- Detects and ignores trailing ^Z end-of-file control character sometimes created on MS DOS/Windows, #1612. Thanks to Gergely Daróczi for reporting and providing a file.
- Added ability to recognize and parse hexadecimal floating point numbers, as used for example in Java. Thanks to @scottstanfield for the report, #2316.
- Now handles floating-point NaN values in a wide variety of formats, including
NaN
,sNaN
,1.#QNAN
,NaN1234
,#NUM!
and others, #1800. Thanks to Jori Liesenborgs for highlighting and the PR. - If negative numbers are passed to
select=
the out-of-range error now suggestsdrop=
instead, #2423. Thanks to Michael Chirico for the suggestion. sep=NULL
orsep=""
(i.e., no column separator) can now be used to specify single column input reliably likebase::readLines
, #1616. sep='\n'
still works (even on Windows where line ending is actually \r\n
) butNULL
or""
are now documented and recommended. Thanks to Dmitriy Selivanov for the pull request and many others for comments. As before,sep=NA
is not valid; use the default"auto"
for automatic separator detection. sep='\n'
is now deprecated and in future will start to warn when used.- Single-column input with blank lines is now valid and the blank lines are significant (representing
NA
). The blank lines are significant even at the very end, which may be surprising on first glance. The change is so thatfread(fwrite(DT))==DT
for single-column inputs containingNA
which are written as blank. There is no change whenncol>1
; i.e., input stops with detailed warning at the first blank line, because a blank line whenncol>1
is invalid input due to no separators being present. Thanks to @skanskan, Michael Chirico, @franknarf1 and Pasha for the testing and discussions, #2106. - Too few column names are now auto filled with default column names, with warning, #1625. If there is just one missing column name it is guessed to be for the first column (row names or an index), otherwise the column names are filled at the end. Similarly, too many column names now automatically sets
fill=TRUE
, with warning. skip=
andnrow=
are more reliable and are no longer affected by invalid lines outside the range specified. Thanks to Ziyad Saeed and Kyle Chung for reporting, #1267.- Ram disk (
/dev/shm
) is no longer used for the output of system command input. Although faster when it worked, it was causing too many device full errors; e.g., #1139 and zUMIs/19. Thanks to Kyle Chung for reporting. Standardtempdir()
is now used. If you wish to use ram disk, set TMPDIR to /dev/shm
; see?tempdir
. - Detecting whether a very long input string is a file name or data is now much faster, #2531. Many thanks to @javrucebo for the detailed report, benchmarks and suggestions.
- A column of
TRUE/FALSE
s is ok, as well asTrue/False
s andtrue/false
s, but mixing styles (e.g.TRUE/false
) is not and will be read as typecharacter
. - New argument
index
to complement the existing key
argument for applying secondary orderings out of the box for convenience, #2633. - A warning is now issued whenever incorrectly quoted fields have been detected and fixed using a non-standard quote rule.
fread
has always used these advanced rules but now it warns that it is using them. Most file writers correctly quote fields if the field contains the field separator, but a common error is not to also quote fields that contain a quote and then escape those quotes, particularly if that quote occurs at the start of the field. The ability to detect and fix such files is referred to as self-healing. Ambiguities are resolved using the knowledge that the number of columns is constant, and therefore this ability is not available whenfill=TRUE
. This feature can be improved in future by using column type consistency as well as the number of fields. For example:

    txt = 'A,B\n1,hello\n2,"howdy" said Joe\n3,bonjour\n'
    cat(txt)
    # A,B
    # 1,hello
    # 2,"howdy" said Joe
    # 3,bonjour
    fread(txt)
           A                B
       <int>           <char>
    1:     1            hello
    2:     2 "howdy" said Joe
    3:     3          bonjour
    Warning message:
    In fread(txt) : Found and resolved improper quoting

- Many thanks to @yaakovfeldman, Guillermo Ponce, Arun Srinivasan, Hugh Parsonage, Mark Klik, Pasha Stetsenko, Mahyar K, Tom Crockett, @cnoelke, @qinjs, @etienne-s, Mark Danese, Avraham Adler, @franknarf1, @MichaelChirico, @tdhock, Luke Tierney, Ananda Mahto, @memoryfull, @brandenkmurray for testing dev and reporting these regressions before release to CRAN: #1464, #1671, #1888, #1895, #2070, #2073, #2087, #2091, #2092, #2107, #2118, #2123, #2167, #2194, #2196, #2201, #2222, #2228, #2238, #2246, #2251, #2265, #2267, #2285, #2287, #2299, #2322, #2347, #2352, #2370, #2371, #2395, #2404, #2446, #2453, #2457, #2464, #2481, #2499, #2512, #2515, #2516, #2518, #2520, #2523, #2526, #2535, #2542, #2548, #2561, #2600, #2625, #2666, #2697, #2735, #2744.
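Two of the `fread()` items above, the locale-independent `dec=','` and the new `index=` argument, can be sketched together. The input is invented for illustration and `sep=` is given explicitly to keep the example unambiguous.

```r
library(data.table)

# Semicolon-separated input using a comma as the decimal mark.
DT = fread(text=c("id;price", "1;3,14", "2;2,50"),
           sep=";", dec=",", index="id")

DT$price      # read as numeric without depending on the locale
indices(DT)   # the secondary index requested via index=
```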
-
fwrite()
:- empty strings are now always quoted (
,"",
) to distinguish them fromNA
which by default is still empty (,,
) but can be changed usingna=
as before. Ifna=
is provided andquote=
is the default'auto'
thenquote=
is set toTRUE
so that if thena=
value occurs in the data, it can be distinguished fromNA
. Thanks to Ethan Welty for the request #2214 and Pasha for the code change and tests, #2215. logical01
has been added and the old namelogicalAsInt
retained. Please move to the new name when convenient for you. The old argument name (logicalAsInt
) will slowly be deprecated over the next few years. The default is unchanged:FALSE
, sological
is still written as"TRUE"
/"FALSE"
in full by default. We intend to change the default's default in future toTRUE
; see the notice at the top of these release notes.
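The quoting of empty strings and the `logical01=` argument can be seen in a small sketch such as the one below. The table is made up, and `logical01=TRUE` is passed explicitly so the output does not depend on the default.

```r
library(data.table)

DT = data.table(id=1:3, s=c("", NA, "a"), ok=c(TRUE, FALSE, NA))   # toy data

f = tempfile(fileext=".csv")
fwrite(DT, f, logical01=TRUE)

# Empty strings are written quoted, NA is written as an empty field,
# and logical values are written as 1/0.
out = readLines(f)
out
```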
-
Added helpful message when subsetting by a logical column without wrapping it in parentheses, #1844. Thanks @dracodoc for the suggestion and @MichaelChirico for the PR.
- `tables` gains `index` argument for supplementary metadata about `data.table`s in memory (or any optionally specified environment), part of #1648. Thanks due variously to @jangorecki, @rsaporta, @MichaelChirico for ideas and work towards PR.
- Improved auto-detection of `character` inputs' formats to `as.ITime` to mirror the logic in `as.POSIXlt.character`, #1383. Thanks @franknarf1 for identifying a discrepancy and @MichaelChirico for investigating.
- `setcolorder()` now accepts fewer than `ncol(DT)` columns to be moved to the front, #592. Thanks @MichaelChirico for the PR. This also incidentally fixed #2007 whereby explicitly setting `select = NULL` in `fread` errored; thanks to @rcapell for reporting that and @dselivanov and @MichaelChirico for investigating and providing a new test.
- Three new Grouping Sets functions: `rollup`, `cube` and `groupingsets`, #1377. They allow aggregation on various grouping levels at once, producing sub-totals and a grand total.
- `as.data.table()` gains new method for `array`s to return a useful data.table, #1418.
- `print.data.table()` (all via master issue #1523):
  - gains `print.keys` argument, `FALSE` by default, which displays the keys and/or indices (secondary keys) of a `data.table`. Thanks @MichaelChirico for the PR, Yike Lu for the suggestion and Arun for honing that idea to its present form.
  - gains `col.names` argument, `"auto"` by default, which toggles which registers of column names to include in printed output. `"top"` forces `data.frame`-like behavior where column names are only ever included at the top of the output, as opposed to the default behavior which appends the column names below the output as well for longer (>20 rows) tables. `"none"` shuts down column name printing altogether. Thanks @MichaelChirico for the PR, Oleg Bondar for the suggestion, and Arun for guiding commentary.
  - list columns would print the first 6 items in each cell followed by a comma if there are more than 6 in that cell. Now it ends ",..." to make it clearer, part of #1523. Thanks to @franknarf1 for drawing attention to an issue raised on Stack Overflow by @TMOTTM here.
- `setkeyv` accelerated if key already exists, #2331. Thanks to @MarkusBonsch for the PR.
- Keys and indexes are now partially retained up to the key column assigned to with `:=`, #2372. They used to be dropped completely if any one of the columns was affected by `:=`. Thanks to @MarkusBonsch for the PR.
- Faster `as.IDate` and `as.ITime` methods for `POSIXct` and `numeric`, #1392. Thanks to Jan Gorecki for the PR.
- `unique(DT)` now returns `DT` early when there are no duplicates to save RAM, #2013. Thanks to Michael Chirico for the PR, and thanks to @mgahan for pointing out a reversion in `na.omit.data.table` before release, #2660.
- `uniqueN()` is now faster on logical vectors. Thanks to Hugh Parsonage for PR#2648.

  ```R
  N = 1e9
                                     #  was      now
  x = c(TRUE,FALSE,NA,rep(TRUE,N))   #
  uniqueN(x) == 3                    #  5.4s     0.00s
  x = c(TRUE,rep(FALSE,N), NA)       #
  uniqueN(x,na.rm=TRUE) == 2         #  5.4s     0.00s
  x = c(rep(TRUE,N),FALSE,NA)        #
  uniqueN(x) == 3                    #  6.7s     0.38s
  ```
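The timings above need a very large vector. The same calls can be tried at a small, laptop-friendly size to see what `uniqueN()` returns for logical input; this sketch only checks results and is not a benchmark (the size `N` is chosen arbitrarily):

```r
library(data.table)

N = 1e5  # arbitrary small size, for illustration only
x = c(TRUE, FALSE, NA, rep(TRUE, N))

# NA counts as a distinct value unless na.rm=TRUE
n_all   = uniqueN(x)
n_no_na = uniqueN(x, na.rm = TRUE)

print(c(n_all, n_no_na))
```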
- Subsetting optimization with keys and indices is now possible for compound queries like `DT[a==1 & b==2]`, #2472. Thanks to @MichaelChirico for reporting and to @MarkusBonsch for the implementation.
- `melt.data.table` now offers friendlier functionality for providing `value.name` for `list` input to `measure.vars`, #1547. Thanks @MichaelChirico and @franknarf1 for the suggestion and use cases, @jangorecki and @mrdwab for implementation feedback, and @MichaelChirico for ultimate implementation.
- `update.dev.pkg` is a new function to update a package from a development repository; it downloads the package sources only when a newer commit is available in the repository. `data.table::update.dev.pkg()` updates `data.table` by default, but any package can be used.
- Item 1 in NEWS for v1.10.2 on CRAN in Jan 2017 included:

  > When j is a symbol prefixed with `..` it will be looked up in calling scope and its value taken to be column names or numbers. When you see the `..` prefix think one-level-up, like the directory `..` in all operating systems means the parent directory. In future the `..` prefix could be made to work on all symbols appearing anywhere inside `DT[...]`.

  The response has been positive (this tweet and FR#2655) and so this prefix is now expanded to all symbols appearing in `j=` as a first step; e.g.

  ```R
  cols = "colB"
  DT[, c(..cols, "colC")]   # same as DT[, .(colB,colC)]
  DT[, -..cols]             # all columns other than colB
  ```

  Thus, `with=` should no longer be needed in any cases. Please change to using the `..` prefix and over the next few years we will start to formally deprecate and remove the `with=` parameter. If this is well received, the `..` prefix could be expanded to symbols appearing in `i=` and `by=`, too. Note that column names should not now start with `..`. If a symbol `..var` is used in `j=` but `..var` exists as a column name, the column still takes precedence, for backwards compatibility. Over the next few years, data.table will start issuing warnings/errors when it sees column names starting with `..`. This affects one CRAN package out of 475 using data.table, so we do not believe this restriction to be unreasonable. Our main focus here which we believe `..` achieves is to resolve the more common ambiguity when `var` is in calling scope and `var` is a column name too. Further, we have not forgotten that in the past we recommended prefixing the variable in calling scope with `..` yourself. If you did that and `..var` exists in calling scope, that still works, provided neither `var` exists in calling scope nor `..var` exists as a column name. Please now remove the `..` prefix on `..var` in calling scope to tidy this up. In future data.table will start to warn/error on such usage.
- `setindexv` can now assign multiple (separate) indices by accepting a `list` in the `cols` argument.
- `as.matrix.data.table` method now has an additional `rownames` argument allowing for a single column to be used as the `rownames` after conversion to a `matrix`. Thanks to @sritchie73 for the suggestion, use cases, #2692 and implementation PR#2702 and @MichaelChirico for additional use cases.
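A small sketch of the `rownames` argument described in the last item, with a made-up table whose `id` column is used as the row names of the resulting matrix:

```r
library(data.table)

# Hypothetical table: one identifier column and two numeric columns
DT = data.table(id = c("a", "b", "c"), x = 1:3, y = 4:6)

# Use the `id` column as row names; it is dropped from the matrix body
m = as.matrix(DT, rownames = "id")

print(m)
```

Without `rownames=`, the `id` column would be kept as a matrix column and the whole matrix would be coerced to character.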
- The new quote rules handle this single field `"Our Stock Screen Delivers an Israeli Software Company (MNDO, CTCH)<\/a> SmallCapInvestor.com - Thu, May 19, 2011 10:02 AM EDT<\/cite><\/div>Yesterday in \""Google, But for Finding Great Stocks\"", I discussed the value of stock screeners as a powerful tool"`, #2051. Thanks to @scarrascoso for reporting. Example file added to test suite.
- `fwrite()` creates a file with permissions that now play correctly with `Sys.umask()`, #2049. Thanks to @gnguy for reporting.
- `fread()` no longer holds an open lock on the file when a line outside the large sample has too many fields and generates an error, #2044. Thanks to Hugh Parsonage for reporting.
- Setting `j = {}` no longer results in an error, #2142. Thanks Michael Chirico for the pull request.
- Segfault in `rbindlist()` when one or more items are empty, #2019. Thanks Michael Lang for the pull request. Another segfault if the result would be more than 2bn rows, thanks to @jsams's comment in #2340.
- Error printing 0-length `ITime` and `NA` objects, #2032 and #2171. Thanks Michael Chirico for the pull requests and @franknarf1 for pointing out a shortcoming of the initial fix.
- `as.IDate.POSIXct` error with `NULL` timezone, #1973. Thanks @lbilli for reporting and Michael Chirico for the pull request.
- Printing a null `data.table` with `print` no longer visibly outputs `NULL`, #1852. Thanks @aaronmcdaid for spotting and @MichaelChirico for the PR.
- `data.table` now works with Shiny Reactivity / Flexdashboard. The error was typically something like `col not found` in `DT[col==val]`. Thanks to Dirk Eddelbuettel for leading Matt through reproducible steps and @sergeganakou and Richard White for reporting. Closes #2001 and shiny/#1696.
- The `as.IDate.POSIXct` method passed `tzone` along but was not exported. So `tzone` is now taken into account by `as.IDate` too as well as `IDateTime`, #977 and #1498. Tests added.
- A named logical vector now selects rows as expected from a single-row data.table. Thanks to @skranz for reporting. Closes #2152.
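The last fix can be illustrated with a one-row table and a logical vector that happens to carry a name. This is only a sketch of the expected behaviour; the table and the name `sel` are made up:

```r
library(data.table)

# Hypothetical single-row table
DT = data.table(a = 1L, b = "x")

# The name on the logical vector should not affect row selection
keep_row = DT[c(sel = TRUE)]
drop_row = DT[c(sel = FALSE)]

print(nrow(keep_row))
print(nrow(drop_row))
```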
- `fread()`'s rare `Internal error: Sampling jump point 10 is before the last jump ended` has been fixed, #2157. Thanks to Frank Erickson and Artem Klevtsov for reporting with example files which are now added to the test suite.
- `CJ()` no longer loses attribute information, #2029. Thanks to @MarkusBonsch and @royalts for the pull request.
- `split.data.table` respects `factor` ordering in `by` argument, #2082. Thanks to @MichaelChirico for identifying and fixing the issue.
- `.SD` would incorrectly include symbol on lhs of `:=` when `.SDcols` is specified and `get()` appears in `j`. Thanks @renkun-ken for reporting and the PR, and @ProfFancyPants for reporting a regression introduced in the PR. Closes #2326 and #2338.
- Integer values that are too large to fit in `int64` will now be read as strings, #2250.
- Internal-only `.shallow` now retains keys correctly, #2336. Thanks to @MarkusBonsch for reporting, fixing (PR #2337) and adding 37 tests. This much advances the journey towards exporting `shallow()`, #2323.
- `isoweek` calculation is correct regardless of local timezone setting (`Sys.timezone()`), #2407. Thanks to @MoebiusAV and @SimonCoulombe for reporting and @MichaelChirico for fixing.
- Fixed `as.xts.data.table` to support all xts supported time based index classes, #2408. Thanks to @ebs238 for reporting and for the PR.
- A memory leak when a very small number such as `0.58E-2141` is bumped to type `character` is resolved, #918.
- The edge case `setnames(data.table(), character(0))` now works rather than error, #2452.
- The order of rows returned in non-equi joins was incorrect in certain scenarios as reported under #1991. This is now fixed. Thanks to @Henrik-P for reporting.
- Non-equi joins work as expected when `x` in `x[i, on=...]` is a 0-row data.table. Closes #1986.
- Non-equi joins along with `by=.EACHI` returned incorrect result in some rare cases as reported under #2360. This is fixed now. This fix also takes care of #2275. Thanks to @ebs238 for the nice minimal reproducible report, @Mihael for asking on SO and to @Frank for following up on SO and filing an issue.
- `by=.EACHI` works now when `list` columns are being returned and some join values are missing, #2300. Thanks to @jangorecki and @franknarf1 for the reproducible examples which have been added to the test suite.
- Indices are now retrieved by exact name, #2465. This prevents usage of wrong indices as well as unexpected row reordering in join results. Thanks to @pannnda for reporting and providing a reproducible example and to @MarkusBonsch for fixing.
- `setnames` of whole table when original table had `NA` names skipped replacing those, #2475. Thanks to @franknarf1 and BenoitLondon on StackOverflow for the report and @MichaelChirico for fixing.
- `CJ()` works with multiple empty vectors now, #2511. Thanks to @MarkusBonsch for fixing.
- `:=` assignment of one vector to two or more columns, e.g. `DT[, c("x", "y") := 1:10]`, failed to copy the `1:10` data causing errors later if and when those columns were updated by reference, #2540. This is an old issue (#185) that had been fixed but reappeared when code was refactored. Thanks to @patrickhowerter for the detailed report with reproducible example and to @MarkusBonsch for fixing and strengthening tests so it doesn't reappear.
- "Negative length vectors not allowed" error when grouping `median` and `var` fixed, #2046 and #2111. Thanks to @caneff and @osofr for reporting and to @kmillar for debugging and explaining the cause.
- Fixed a bug on Windows where `data.table`s containing non-UTF8 strings in `key`s were not properly sorted, #2462, #1826 and StackOverflow. Thanks to @shrektan for reporting and fixing.
- `x.` prefixes during joins sometimes resulted in a "column not found" error. This is now fixed. Closes #2313. Thanks to @franknarf1 for the MRE.
- `setattr()` no longer segfaults when setting 'class' to empty character vector, #2386. Thanks to @hatal175 for reporting and to @MarkusBonsch for fixing.
- Fixed cases where the result of `merge.data.table()` would contain duplicate column names if `by.x` was also in `names(y)`. `merge.data.table()` gains the `no.dups` argument (default TRUE) to match the corresponding patched behaviour in `base:::merge.data.frame()`. Now, when `by.x` is also in `names(y)` the column name from `y` has the corresponding `suffixes` added to it. `by.x` remains unchanged for backwards compatibility reasons. In addition, where duplicate column names arise anyway (i.e. `suffixes = c("", "")`) `merge.data.table()` will now throw a warning to match the behaviour of `base:::merge.data.frame()`. Thanks to @sritchie73 for reporting and fixing PR#2631 and PR#2653.
- `CJ()` now fails with proper error message when results would exceed max integer, #2636.
- `NA` in character columns now display as `<NA>` just like base R to distinguish from `""` and `"NA"`.
- `getDTthreads()` could return INT_MAX (2 billion) after an explicit call to `setDTthreads(0)`, PR#2708.
- Fixed a bug on Windows where `data.table` could break if garbage collection was triggered when sorting a large number of non-ASCII characters. Thanks to @shrektan for reporting and fixing PR#2678, #2674.
- Internal aliasing of `.` to `list` was over-aggressive in applying `list` even when `.` was intended within `bquote`, #1912. Thanks @MichaelChirico for reporting/filing and @ecoRoland for suggesting and testing a fix.
- Attempt to allocate a wildly large amount of RAM (16EB) when grouping by key and there are close to 2 billion 1-row groups, #2777. Thanks to @jsams for the detailed report.
- Fixed a bug where `print(dt, class=TRUE)` showed only `topn - 1` rows. Thanks to @heavywatal for reporting #2803 and filing PR#2804.
-
The license has been changed from GPL to MPL (Mozilla Public License). All contributors were consulted and approved. PR#2456 details the reasons for the change.
- `?data.table` makes explicit the option of using a `logical` vector in `j` to select columns, #1978. Thanks @Henrik-P for the note and @MichaelChirico for filing.
- Test 1675.1 updated to cope with a change in R-devel in June 2017 related to `factor()` and `NA` levels.
- Package `ezknitr` has been added to the whitelist of packages that run user code and should be considered data.table-aware, #2266. Thanks to Matt Mills for testing and reporting.
- Printing with `quote = TRUE` now quotes column names as well, #1319. Thanks @jan-glx for the suggestion and @MichaelChirico for the PR.
- Added a blurb to `?melt.data.table` explicating the subtle difference in behavior of the `id.vars` argument vis-a-vis its analog in `reshape2::melt`, #1699. Thanks @MichaelChirico for uncovering and filing.
- Added some clarification about the usage of `on` to `?data.table`, #2383. Thanks to @peterlittlejohn for volunteering his confusion and @MichaelChirico for brushing things up.
- Clarified that "data.table always sorts in `C-locale`" means that upper-case letters are sorted before lower-case letters by ordering in data.table (e.g. `setorder`, `setkey`, `DT[order(...)]`). Thanks to @hughparsonage for the pull request editing the documentation. Note this makes no difference in most cases of data; e.g. ids where only uppercase or lowercase letters are used (`"AB123"<"AC234"` is always true, regardless), or country names and words which are consistently capitalized. For example, `"America" < "Brazil"` is not affected (it's always true), and neither is `"america" < "brazil"` (always true too), since the first letter is consistently capitalized. But, whether `"america" < "Brazil"` (the words are not consistently capitalized) is true or false in base R depends on the locale of your R session. In America it is true by default and false if i) you type `Sys.setlocale(locale="C")`, or ii) the R session has been started in a C locale for you, which can happen on servers/services (the locale comes from the environment the R session is started in). However, `"america" < "Brazil"` is always, consistently false in data.table which can be a surprise because it differs from base R by default in most regions. It is false because `"B"<"a"` is true because all upper-case letters come first, followed by all lower-case letters (the ascii number of each letter determines the order, which is what is meant by `C-locale`).
- `data.table`'s dependency has been moved forward from R 3.0.0 (Apr 2013) to R 3.1.0 (Apr 2014; i.e. 3.5 years old). We keep this dependency as old as possible for as long as possible as requested by users in managed environments. Thanks to Jan Gorecki, the test suite from latest dev now runs on R 3.1.0 continuously, as well as R-release (currently 3.4.2) and latest R-devel snapshot. The primary motivation for the bump to R 3.1.0 was allowing one new test which relies on better non-copying behaviour in that version, #2484. It also allows further internal simplifications. Thanks to @MichaelChirico for fixing another test that failed on R 3.1.0 due to slightly different behaviour of `base::read.csv` in R 3.1.0-only which the test was comparing to, #2489.
- New vignette added: Importing data.table - focused on using data.table as a dependency in R packages. It answers the most commonly asked questions and promotes good practices.
- As warned in v1.9.8 release notes below in this file (25 Nov 2016) it has been 1 year since then and so use of `options(datatable.old.unique.by.key=TRUE)` to restore the old default is now deprecated with warning. The new warning states that this option still works and repeats the request to pass `by=key(DT)` explicitly to `unique()`, `duplicated()`, `uniqueN()` and `anyDuplicated()` and to stop using this option. In another year, this warning will become error. Another year after that the option will be removed.
- As `set2key()` and `key2()` have been warning since v1.9.8 (Nov 2016), their warnings have now been upgraded to errors. Note that when they were introduced in version 1.9.4 (Oct 2014) they were marked as 'experimental' in NEWS item 4. They will be removed in one year.

  ```
  Was warning: set2key() will be deprecated in the next relase. Please use setindex() instead.
  Now error: set2key() is now deprecated. Please use setindex() instead.
  ```
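The two migrations requested above can be sketched as follows. The table is made up for illustration; the point is to pass `by=key(DT)` explicitly where the old key-based default was relied upon, and to use `setindex()` where `set2key()` was used:

```r
library(data.table)

# Hypothetical keyed table with a repeated key value
DT = data.table(g = c(1L, 1L, 2L), v = c(10, 20, 30), key = "g")

# Current default: all columns are considered, so all 3 rows are unique
u_all = unique(DT)

# Old behaviour, now requested explicitly: uniqueness by key columns only
u_key = unique(DT, by = key(DT))

# Replacement for set2key(): create a secondary index on `v`
setindex(DT, v)
idx = indices(DT)

print(nrow(u_all))
print(nrow(u_key))
print(idx)
```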
- The option `datatable.showProgress` is no longer set to a default value when the package is loaded. Instead, the `default=` argument of `getOption` is used by both `fwrite` and `fread`. The default is the result of `interactive()` at the time of the call. Using `getOption` in this way is intended to be more helpful to users looking at `args(fread)` and `?fread`.
- `print.data.table()` invisibly returns its first argument instead of `NULL`. This behavior is compatible with the standard `print.data.frame()` and tibble's `print.tbl_df()`. Thanks to @heavywatal for PR#2807.
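One way to observe the invisible return value is base R's `withVisible()`, which reports both the value and whether it would have been auto-printed. A minimal sketch with a made-up table:

```r
library(data.table)

DT = data.table(a = 1:2, b = c("x", "y"))

# print() still prints; withVisible() captures what it returns
res = withVisible(print(DT))

print(res$visible)
print(class(res$value))
```

Returning the input invisibly makes `print()` usable in the middle of a pipeline or chained call without the table being printed twice.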
- Fixed crash/hang on MacOS when `parallel::mclapply` is used and data.table is merely loaded, #2418. Oddly, all tests including test 1705 (which tests `mclapply` with data.table) passed fine on CRAN. It appears to be some versions of MacOS or some versions of libraries on MacOS, perhaps. Many thanks to Martin Morgan for reporting and confirming this fix works. Thanks also to @asenabouth, Joe Thorley and Danton Noriega for testing, debugging and confirming that automatic parallelism inside data.table (such as `fwrite`) works well even on these MacOS installations. See also news items below for 1.10.4-1 and 1.10.4-2.
-
OpenMP on MacOS is now supported by CRAN and included in CRAN's package binaries for Mac. But installing v1.10.4-1 from source on MacOS failed when OpenMP was not enabled at compile time, #2409. Thanks to Liz Macfie and @fupangpangpang for reporting. The startup message when OpenMP is not enabled has been updated.
-
Two rare potential memory faults fixed, thanks to CRAN's automated use of latest compiler tools; e.g. clang-5 and gcc-7
- The `nanotime` v0.2.0 update (June 2017) changed from `integer64` to `S4` and broke `fwrite` of `nanotime` columns. Fixed to work with `nanotime` both before and after v0.2.0.
- Pass R-devel changes related to `deparse(,backtick=)` and `factor()`.
- Internal `NAMED()==2` now `MAYBE_SHARED()`, #2330. Back-ported to pass under the stated dependency, R 3.0.0.
- Attempted improvement on Mac-only when the `parallel` package is used too (which forks), #2137. Intel's OpenMP implementation appears to leave threads running after the OpenMP parallel region (inside data.table) has finished unlike GNU libgomp. So, if and when `parallel`'s `fork` is invoked by the user after data.table has run in parallel already, instability occurs. The problem only occurs with Mac package binaries from CRAN because they are built by CRAN with Intel's OpenMP library. No known problems on Windows or Linux and no known problems on any platform when `parallel` is not used. If this Mac-only fix still doesn't work, call `setDTthreads(1)` immediately after `library(data.table)` which has been reported to fix the problem by putting `data.table` into single threaded mode earlier.
- When `fread()` and `print()` see `integer64` columns are present but package `bit64` is not installed, the warning is now displayed as intended. Thanks to a question by Santosh on r-help and forwarded by Bill Dunlap.
- The new specialized `nanotime` writer in `fwrite()` type punned using `*(long long *)&REAL(column)[i]` which, strictly, is undefined behaviour under C standards. It passed a plethora of tests on linux (gcc 5.4 and clang 3.8), win-builder and 6 out of 10 CRAN flavours using gcc. But it failed (wrong data written) with the newest version of clang (3.9.1) as used by CRAN on the failing flavours, and on solaris-sparc. Replaced with the union method and added a grep to CRAN_Release.cmd.
- When `j` is a symbol prefixed with `..` it will be looked up in calling scope and its value taken to be column names or numbers.

  ```R
  myCols = c("colA","colB")
  DT[, myCols, with=FALSE]
  DT[, ..myCols]              # same
  ```

  When you see the `..` prefix think one-level-up like the directory `..` in all operating systems meaning the parent directory. In future the `..` prefix could be made to work on all symbols appearing anywhere inside `DT[...]`. It is intended to be a convenient way to protect your code from accidentally picking up a column name. Similar to how `x.` and `i.` prefixes (analogous to SQL table aliases) can already be used to disambiguate the same column name present in both `x` and `i`. A symbol prefix rather than a `..()` function will be easier for us to optimize internally and more convenient if you have many variables in calling scope that you wish to use in your expressions safely. This feature was first raised in 2012 and long wished for, #633. It is experimental.
- When `fread()` or `print()` see `integer64` columns are present, `bit64`'s namespace is now automatically loaded for convenience.
- `fwrite()` now supports the new `nanotime` type by Dirk Eddelbuettel, #1982. Aside: `data.table` already automatically supported `nanotime` in grouping and joining operations via longstanding support of its underlying `integer64` type.
- `indices()` gains a new argument `vectors`, default `FALSE`. This strsplits the index names by `__` for you, #1589.

  ```R
  DT = data.table(A=1:3, B=6:4)
  setindex(DT, B)
  setindex(DT, B, A)
  indices(DT)
  [1] "B"    "B__A"
  indices(DT, vectors=TRUE)
  [[1]]
  [1] "B"
  [[2]]
  [1] "B" "A"
  ```
- Some long-standing potential instability has been discovered and resolved many thanks to a detailed report from Bill Dunlap and Michael Sannella. At C level any call of the form `setAttrib(x, install(), allocVector())` can be unstable in any R package. Despite `setAttrib()` PROTECTing its inputs, the 3rd argument (`allocVector`) can be executed first only for its result to be released by `install()`'s potential GC before reaching `setAttrib`'s PROTECTion of its inputs. Fixed by either PROTECTing or pre-`install()`ing. Added to CRAN_Release.cmd procedures: i) `grep`s to prevent usage of this idiom in future and ii) running data.table's test suite with `gctorture(TRUE)`.
- A new potential instability introduced in the last release (v1.10.0) in GForce optimized grouping has been fixed by reverting one change from malloc to R_alloc. Thanks again to Michael Sannella for the detailed report.
- `fwrite()` could write floating point values incorrectly, #1968. A thread-local variable was incorrectly thread-global. This variable's usage lifetime is only a few clock cycles so it needed large data and many threads for several threads to overlap their usage of it and cause the problem. Many thanks to @mgahan and @jmosser for finding and reporting.
- `fwrite()`'s `..turbo` option has been removed as the warning message warned. If you've found a problem, please report it.
- No known issues have arisen due to `DT[,1]` and `DT[,c("colA","colB")]` now returning columns as introduced in v1.9.8. However, as we've moved forward by setting `options('datatable.WhenJisSymbolThenCallingScope'=TRUE)` introduced then too, it has become clear a better solution is needed. All 340 CRAN and Bioconductor packages that use data.table have been checked with this option on. 331 lines would need to be changed in 59 packages. Their usage is elegant, correct and recommended, though. Examples are `DT[1, encoding]` in quanteda and `DT[winner=="first", freq]` in xgboost. These are looking up the columns `encoding` and `freq` respectively and returning them as vectors. But if, for some reason, those columns are removed from `DT` and `encoding` or `freq` are still variables in calling scope, their values in calling scope would be returned. That cannot be what was intended and could lead to silent bugs. That was the risk we were trying to avoid.

  `options('datatable.WhenJisSymbolThenCallingScope')` is now removed. A migration timeline is no longer needed. The new strategy needs no code changes and has no breakage. It was proposed and discussed in point 2 here, as follows.

  When `j` is a symbol (as in the quanteda and xgboost examples above) it will continue to be looked up as a column name and returned as a vector, as has always been the case. If it's not a column name however, it is now a helpful error explaining that data.table is different to data.frame and what to do instead (use `..` prefix or `with=FALSE`). The old behaviour of returning the symbol's value in calling scope can never have been useful to anybody and therefore not depended on. Just as the `DT[,1]` change could be made in v1.9.8, this change can be made now. This change increases robustness with no downside. Rerunning all 340 CRAN and Bioconductor package checks reveals 2 packages throwing the new error: partools and simcausal. Their maintainers have been informed that there is a likely bug on those lines due to data.table's (now remedied) weakness. This is exactly what we wanted to reveal and improve.
- As before, and as we can see is in common use in CRAN and Bioconductor packages using data.table, `DT[,myCols,with=FALSE]` continues to look up `myCols` in calling scope and take its value as column names or numbers. You can move to the new experimental convenience feature `DT[, ..myCols]` if you wish at leisure.
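The behaviour described in the last two items can be sketched with a made-up table. The variable `myCols` lives in calling scope and is not a column name, so it has to be referenced through `with=FALSE` or the `..` prefix; a bare `myCols` in `j` is treated as a column lookup and raises an error:

```r
library(data.table)

# Hypothetical table and column selection held in calling scope
DT = data.table(colA = 1:3, colB = 4:6, colC = 7:9)
myCols = c("colA", "colB")

# Both forms look up `myCols` in calling scope
a = DT[, myCols, with = FALSE]
b = DT[, ..myCols]

# A bare symbol that is not a column name is an error, not a silent lookup
err = tryCatch(DT[, myCols], error = function(e) e)

# A bare symbol that is a column name returns that column as a vector
v = DT[, colA]

print(names(b))
print(conditionMessage(err))
```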
- `fwrite(..., quote='auto')` already quoted a field if it contained a `sep` or `\n`, or `sep2[2]` when `list` columns are present. Now it also quotes a field if it contains a double quote (`"`) as documented, #1925. Thanks to Aki Matsuo for reporting. Tests added. The `qmethod` tests did test escaping embedded double quotes, but only when `sep` or `\n` was present in the field as well to trigger the quoting of the field.
- Fixed 3 test failures on Solaris only, #1934. Two were on both sparc and x86 and related to a `tzone` attribute difference between `as.POSIXct` and `as.POSIXlt` even when passed the default `tz=""`. The third was on sparc only: a minor rounding issue in `fwrite()` of 1e-305.
- Regression crash fixed when 0's occur at the end of a non-empty subset of an empty table, #1937. Thanks Arun for tracking down. Tests added. For example, subsetting the empty `DT=data.table(a=character())` with `DT[c(1,0)]` should return a 1 row result with one `NA` since 1 is past the end of `nrow(DT)==0`, the same result as `DT[1]`.
- Fixed newly reported crash that also occurred in old v1.9.6 when `by=.EACHI`, `nomatch=0`, the first item in `i` has no match AND `j` has a function call that is passed a key column, #1933. Many thanks to Reino Bruner for finding and reporting with a reproducible example. Tests added.
- Fixed `fread()` error occurring for a subset of Windows users: `showProgress is not type integer but type 'logical'.`, #1944 and #1111. Our tests cover this usage (it is just default usage), pass on AppVeyor (Windows), win-builder (Windows) and CRAN's Windows so perhaps it only occurs on a specific and different version of Windows to all those. Thanks to @demydd for reporting. Fixed by using strictly `logical` type at R level and `Rboolean` at C level, consistently throughout.
- Combining `on=` (new in v1.9.6) with `by=` or `keyby=` gave incorrect results, #1943. Many thanks to Henrik-P for the detailed and reproducible report. Tests added.
- New function `rleidv` was ignoring its `cols` argument, #1942. Thanks Josh O'Brien for reporting. Tests added.
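To illustrate what the `cols` argument of `rleidv` is meant to do once it is respected, here is a small sketch on a made-up table. Run-length ids computed over all columns differ from those computed over column `a` only:

```r
library(data.table)

# Hypothetical table: `a` has runs 1,1 / 2,2 / 1 while `b` changes after row 1
DT = data.table(a = c(1, 1, 2, 2, 1), b = c("x", "y", "y", "y", "y"))

# Runs over all columns: a new id whenever any column changes
ids_all = rleidv(DT)

# Runs over column `a` only
ids_a = rleidv(DT, cols = "a")

print(ids_all)
print(ids_a)
```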
- It seems OpenMP is not available on CRAN's Mac platform; NOTEs appeared in CRAN checks for v1.9.8. Moved `Rprintf` from `init.c` to `packageStartupMessage` to avoid the NOTE as requested urgently by Professor Ripley. Also fixed the bad grammar of the message: 'single threaded' now 'single-threaded'. If you have a Mac and run macOS or OS X on it (I run Ubuntu on mine) please contact CRAN maintainers and/or Apple if you'd like CRAN's Mac binary to support OpenMP. Otherwise, please follow these instructions for OpenMP on Mac which people have reported success with.
- Just to state explicitly: data.table does not now depend on or require OpenMP. If you don't have it (as on CRAN's Mac it appears but not in general on Mac) then data.table should build, run and pass all tests just fine.
- There are now 5,910 raw tests as reported by `test.data.table()`. Tests cover 91% of the 4k lines of R and 89% of the 7k lines of C. These stats are now known thanks to Jim Hester's Covr package and Codecov.io. If anyone is looking for something to help with, creating tests to hit the missed lines shown by clicking the `R` and `src` folders at the bottom here would be very much appreciated.
- The FAQ vignette has been revised given the changes in v1.9.8. In particular, the very first FAQ.
- With hindsight, the last release v1.9.8 should have been named v1.10.0 to convey it wasn't just a patch release from .6 to .8 owing to the 'potentially breaking changes' items. Thanks to @neomantic for correctly pointing this out. The best we can do now is bump to 1.10.0.