Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fast data.table to matrix conversion in C [Depends: #4196] #4144

Draft
wants to merge 207 commits into
base: master
Choose a base branch
from
Draft
Show file tree
Hide file tree
Changes from 109 commits
Commits
Show all changes
207 commits
Select commit Hold shift + click to select a range
a0c09dc
Began implementing a fast as.matrix in C
sritchie73 Dec 28, 2019
a614743
Made loop code more efficient
sritchie73 Dec 28, 2019
a7a22a0
Casmatrix now works on simple types
sritchie73 Dec 29, 2019
27fa71c
Implemented Casmatrix for character vectors
sritchie73 Dec 29, 2019
0d6742a
Implemented Casmatrix for list type columns
sritchie73 Dec 29, 2019
89e4239
Bugfix: no longer crashes on 1-column matrices
sritchie73 Dec 29, 2019
e101b0e
removed spurious line of code
sritchie73 Dec 29, 2019
dff21b5
Style fix: changed <- to =
sritchie73 Dec 29, 2019
a7118b7
as.matrix now handles type conversion
sritchie73 Dec 29, 2019
4456f26
Special case for data.tables with multi-columns
sritchie73 Dec 29, 2019
7b4891a
Bugfix
sritchie73 Dec 29, 2019
d226099
Improved multi-column handling in as.matrix
sritchie73 Dec 30, 2019
cc08c5d
Style fix: Numeric -> Explicit integer
sritchie73 Dec 30, 2019
92c1d54
Removed expensive copy call when given rownames
sritchie73 Dec 30, 2019
002fd76
Refactored Casmatrix
sritchie73 Dec 30, 2019
ee4e6af
Added long vector support
sritchie73 Dec 30, 2019
02310aa
refactored wide-column detection
sritchie73 Dec 30, 2019
a9c9926
Replaced inner loop with memcpy
sritchie73 Dec 30, 2019
5c9f526
Added OpenMP support to Casmatrix
sritchie73 Dec 31, 2019
7cbf38e
Removed extraneous xlength call
sritchie73 Dec 31, 2019
224a8ab
Removed calls to R API from OpenMP threads
sritchie73 Jan 1, 2020
ab3ed57
R_alloc used to allocate pointer array memory
sritchie73 Jan 1, 2020
946f38e
Imposed 32 bit vector length on matrix dim
sritchie73 Jan 3, 2020
297b607
Bugfix type casting
sritchie73 Jan 3, 2020
2dd90d9
Single-line ifs now multi-line for codecov
sritchie73 Jan 3, 2020
504ee7c
Renamed asmatrix.c to matrix.c
sritchie73 Jan 3, 2020
d4c2465
Explicit integer L not needed when using `:`
sritchie73 Jan 3, 2020
4d3669e
Added check for multi-column and list rownames
sritchie73 Jan 3, 2020
b367f23
matrix column and row number check now INT_MAX
sritchie73 Jan 3, 2020
b711fb0
rownames column may now be an ff object
sritchie73 Jan 3, 2020
d41cb0e
Refactored column checks
sritchie73 Jan 3, 2020
1f566f4
homogenous non-atomic types converted correctly
sritchie73 Jan 3, 2020
f58029d
Added bit64 support and fallback coercion
sritchie73 Jan 3, 2020
893cbba
Bugfix rownames check
sritchie73 Jan 3, 2020
e63df58
Simplified rownames handling
sritchie73 Jan 4, 2020
1f7f94e
Replaced int with Rboolean in asmatrix_logical
sritchie73 Jan 4, 2020
1f50c13
Split out handling of class and types
sritchie73 Jan 4, 2020
caeb706
Better handling of additional and multi-classes
sritchie73 Jan 4, 2020
f5fdfb3
Added tests for a debugged new rownames checks
sritchie73 Jan 4, 2020
a0b760e
added tests to cover all C matrix types
sritchie73 Jan 5, 2020
49fc1b3
Added tests for ff coverage
sritchie73 Jan 5, 2020
cc954bc
Added tests for bit64 coverage
sritchie73 Jan 5, 2020
a75705a
bugfix bad test specification
sritchie73 Jan 5, 2020
a1ffe65
Added tests to cover remaining type conversions
sritchie73 Jan 5, 2020
24af9fd
Test 2134.4 now passes
sritchie73 Jan 5, 2020
9ceea1b
Moved dimension check into R code
sritchie73 Jan 5, 2020
64334f8
Catch case when bit64 package not installed
sritchie73 Jan 7, 2020
475c24d
bugfix interger.max checks
sritchie73 Jan 7, 2020
d08e67d
Removed ff tests and added nocov
sritchie73 Jan 8, 2020
7b60f79
Nocov for dimension checks
sritchie73 Jan 8, 2020
815c4b5
Bugfix; error is stop in R
sritchie73 Jan 8, 2020
d2ec0f0
Merge in master to get updated tests.Rraw file
sritchie73 Jan 8, 2020
5a81e37
Updated new test numbers so CI doesnt fail
sritchie73 Jan 8, 2020
9a68b7c
"numeric" is not a "type", see ?typeof
sritchie73 Jan 8, 2020
447d3e7
Unecessary to track dm with n and p
sritchie73 Jan 8, 2020
b4bdf9f
Fixed compiled warnings for asmatrix_logical
sritchie73 Jan 8, 2020
6294407
Added timing function to C code
sritchie73 Jan 8, 2020
589b7c0
nocov for rownames ff check
sritchie73 Jan 8, 2020
8aa261e
More robust handling of malformed data.tables
sritchie73 Jan 9, 2020
f0aafae
spelling error in warning message
sritchie73 Jan 9, 2020
ee625f3
removed rownames.index, no need to track
sritchie73 Jan 10, 2020
e90db9c
Removed unecessary {} for single-line if
sritchie73 Jan 10, 2020
8aa779f
Reverted error message
sritchie73 Jan 10, 2020
685385e
Fixed tests
sritchie73 Jan 10, 2020
dbd9331
Simplified some if statements
sritchie73 Jan 10, 2020
0bad2ca
Added helper function for column properties
sritchie73 Jan 10, 2020
da74463
sapply can return matrix causing errors
sritchie73 Jan 10, 2020
2f853b9
removed brackets from single line if
sritchie73 Jan 12, 2020
3481a93
Bugfix
sritchie73 Jan 12, 2020
5038ebc
Added support for raw matrices
sritchie73 Jan 12, 2020
bdcea33
error when coercion of integer64 but no bit64
sritchie73 Jan 12, 2020
32ba105
column_properties() only required if coercion
sritchie73 Jan 12, 2020
2f12067
integer64 now works with raw
sritchie73 Jan 12, 2020
746d287
Fixed raw to logical and added test
sritchie73 Jan 12, 2020
131e2d2
Nocov for unsupported matrix types
sritchie73 Jan 12, 2020
87ac4c9
asmatrix_logical redundant with integer
sritchie73 Jan 12, 2020
9768918
missing indentation
sritchie73 Jan 12, 2020
4d87e9f
Do not enter typeof coercion if recursive cols
sritchie73 Jan 12, 2020
c8e6aed
Added fallback for recursive non-list columns
sritchie73 Jan 12, 2020
75f6b57
Reduced number of vapply across columns
sritchie73 Jan 12, 2020
defc74e
turns out we do need to separate int and log
sritchie73 Jan 12, 2020
9f64ddb
streamlined comments
sritchie73 Jan 12, 2020
5dbc33c
No need for column_properties after type coerce
sritchie73 Jan 12, 2020
4067cd4
Added warning for type fallback
sritchie73 Jan 12, 2020
20ce5fd
Added non.atomic check to class addition
sritchie73 Jan 12, 2020
a963766
Added helper functions
sritchie73 Jan 12, 2020
b842b65
Bugfix extra class checks
sritchie73 Jan 12, 2020
80199fd
Bugfix test
sritchie73 Jan 12, 2020
8c2cbd3
More helper functions
sritchie73 Jan 12, 2020
f4ecb46
Added test to cover VECSXP
sritchie73 Jan 12, 2020
a948c7f
Added nocovs
sritchie73 Jan 12, 2020
ec698bb
Add to, don't overwrite class(X)
sritchie73 Jan 13, 2020
b7cbaab
allocMatrix instead of allocVector
sritchie73 Jan 13, 2020
47e79be
Simple OMP for loops can be used
sritchie73 Jan 13, 2020
2be89d6
Fixed test 2135.1
sritchie73 Jan 13, 2020
0f14c06
No need to assign dim after Casmatrix
sritchie73 Jan 13, 2020
f1ce5cd
Fixed nocov tags in matrix.c
sritchie73 Jan 13, 2020
ab0453c
recursive non-list types found from typelist
sritchie73 Jan 13, 2020
5d516d7
bugfix
sritchie73 Jan 13, 2020
af02985
Deduplicated helper functions
sritchie73 Jan 13, 2020
d9a5fc8
removed trailing whitespace
sritchie73 Jan 13, 2020
590363d
No need to vectorify when coercing to list
sritchie73 Jan 13, 2020
5381580
Maybe.recursive handling now clearer
sritchie73 Jan 13, 2020
f713224
Fixed type checking for unlist
sritchie73 Jan 13, 2020
e92c6d0
Fixed typo
sritchie73 Jan 13, 2020
9047f64
Improved helper function comments
sritchie73 Jan 14, 2020
c4405dd
Improved memory when list + factor or date
sritchie73 Jan 14, 2020
6cf1553
Fixed comment
sritchie73 Jan 14, 2020
bef9fa1
Merge branch 'master' into asmatrix-c
mattdowle Jan 15, 2020
ab048ad
as.matrix now uses Crindlist
sritchie73 Jan 25, 2020
18c45c4
Bugfix rownames
sritchie73 Jan 25, 2020
1a1e40a
Basic Casmatrix implementation following Crbindlist
sritchie73 Jan 26, 2020
eee7197
Condensed loop, removed debug prints
sritchie73 Jan 26, 2020
37f8c42
ncol = 0 should preserve rownames
sritchie73 Jan 26, 2020
d040581
Reverting to commit bef9fa17
sritchie73 Jan 26, 2020
c351389
Crbindlist used for type coercion
sritchie73 Jan 26, 2020
69cb3b0
Now handles integer64 coercion and factors
sritchie73 Jan 26, 2020
f618c64
Type now preserved when matrix has dim 0
sritchie73 Jan 26, 2020
2da2d76
class mismatch check should not trigger if asmatrix
sritchie73 Jan 26, 2020
109bcf8
Test ncol=0 nrow>0
sritchie73 Jan 26, 2020
44232b1
check int64 and list coercion rules
sritchie73 Jan 26, 2020
114c710
Restoring matrix.c for conversion
sritchie73 Jan 26, 2020
bdf66fb
Removed class logic
sritchie73 Jan 26, 2020
f5dd790
len check in R
sritchie73 Jan 26, 2020
1ef7be9
Comments
sritchie73 Jan 26, 2020
3de02cc
rm empty line
sritchie73 Jan 26, 2020
c43daec
Casmatrix now integer64 aware
sritchie73 Jan 27, 2020
0a136fc
Coercion of integer64 and complex to character
sritchie73 Jan 27, 2020
9f3b8eb
Class not added to matrix
sritchie73 Jan 27, 2020
8164cc1
Fixed tests
sritchie73 Jan 27, 2020
5d59d23
Rownames coerced to charcter if needed
sritchie73 Jan 27, 2020
99dead0
Will use coerceAsList pending #4196
sritchie73 Jan 27, 2020
856a8c1
No need to handle CPLX separately, #4203
sritchie73 Jan 27, 2020
9397963
Missing nprotect counter increment
sritchie73 Jan 27, 2020
c3ba221
asCharacterInteger64 generalised to 64bit vectors
sritchie73 Jan 28, 2020
6a1343d
missing EOF newlines
sritchie73 Jan 28, 2020
70cdd14
Formatting + 64 int
2005m Jan 28, 2020
bd5ca7a
integer64 should be long long int, not long int
sritchie73 Jan 28, 2020
3a384c0
Using PRId64 instead of lld as per #4602
sritchie73 Jan 28, 2020
5d82c5f
memrecycle now theoretically long vector compatible
sritchie73 Jan 28, 2020
07ae873
Added 64bit compatible memcpy
sritchie73 Jan 28, 2020
cc0d412
Bugfix memcpy64
sritchie73 Jan 28, 2020
8198f46
ansloc also needs to be int64_t
sritchie73 Jan 29, 2020
2ab6122
No need for memcpy64, bug was elsewhere
sritchie73 Jan 29, 2020
c3bd7ac
%ld -> %"PRId64"
sritchie73 Jan 29, 2020
a839c40
nrow 0 check must come before i64 class
sritchie73 Jan 29, 2020
06597bf
Optimise for data.tables with all same col type
sritchie73 Jan 29, 2020
56c1ec0
sizeof * nrow for number of bytes in memcpy
sritchie73 Jan 29, 2020
e23ddbe
VECSXP and STRSXP wrong way round
sritchie73 Jan 29, 2020
186d687
coercion rules
sritchie73 Jan 29, 2020
60614d5
bugfix coerce
sritchie73 Jan 29, 2020
a3ba3c6
Added OMP support
sritchie73 Jan 29, 2020
7e361d0
Need to allocate memory for array of ptrs
sritchie73 Jan 29, 2020
e3a9e41
Simplified code to macros
sritchie73 Jan 29, 2020
30017a8
Coerce before recycling other columns
sritchie73 Jan 29, 2020
93144c1
i64 coercion works again
sritchie73 Jan 29, 2020
f0e99c3
Macro comments
sritchie73 Jan 30, 2020
e73f143
Factors can be easily handled in C.
sritchie73 Jan 30, 2020
47a1331
Date-likes should now be formatted in C
sritchie73 Jan 30, 2020
07e6f00
moved as.character.ITime into C function
sritchie73 Jan 31, 2020
51ecbc0
formatting
sritchie73 Jan 31, 2020
a00d0a7
Second attempt at calling format from C
sritchie73 Jan 31, 2020
10f6d03
S3 dispatch doesnt work from C
sritchie73 Jan 31, 2020
f880b27
char_Date more general than char_IDate
sritchie73 Jan 31, 2020
0452272
FF support moved into as.matrix
sritchie73 Jan 31, 2020
4498083
Rownames checks and coercions now in C
sritchie73 Jan 31, 2020
c81b367
bugfix
sritchie73 Jan 31, 2020
2d57084
Need to extract rownames.value
sritchie73 Jan 31, 2020
a39d771
Moved maximum dimension check to C
sritchie73 Jan 31, 2020
005c228
Code to unpack nested data.tables
sritchie73 Jan 31, 2020
9dfd669
missing ;
sritchie73 Jan 31, 2020
ffb7310
fixed names() getters
sritchie73 Jan 31, 2020
4dfe9a5
multidim columns unpacked
sritchie73 Jan 31, 2020
f44ecdb
bugfix callRfun1
sritchie73 Jan 31, 2020
4fa0c53
fixed comment
sritchie73 Jan 31, 2020
7c7758a
Column recycling
sritchie73 Jan 31, 2020
6cffa32
Modify by reference semantics fixed
sritchie73 Jan 31, 2020
ce9e211
bugfixes and test updates
sritchie73 Jan 31, 2020
e98cf28
Names in nested dts no longer modified by ref.
sritchie73 Jan 31, 2020
82de007
Bugfix integer64
sritchie73 Jan 31, 2020
d1fec57
Silence compiler warnings for snprintf
sritchie73 Jan 31, 2020
57ee386
Attempt to silence PRId64 compiler warning
sritchie73 Jan 31, 2020
94d8bf5
linux still reports "%lld" must be used
sritchie73 Jan 31, 2020
fca5ab9
Micromanage PROTECT stack
sritchie73 Jan 31, 2020
eec32e1
rownames check fix
sritchie73 Jan 31, 2020
c501db2
Entry function from .Call controls PROTECT stack
sritchie73 Jan 31, 2020
3e88266
asCharacterFactor is internal R func
sritchie73 Jan 31, 2020
a7fac5a
rownames vs. rncontainer fixed
sritchie73 Jan 31, 2020
37cbea9
Fixed regression in as.character.ITime
sritchie73 Feb 1, 2020
c65e113
Fixed test
sritchie73 Feb 1, 2020
9a9b0df
Unintentional pointer arithmatic
sritchie73 Feb 1, 2020
80f4bd4
missed an int -> int64_t in memrecycle
sritchie73 Feb 1, 2020
cd5bb04
R_xlen_t -> int64_t
sritchie73 Feb 18, 2020
79f8899
Tests to check no modification by reference
sritchie73 Feb 18, 2020
7b20cea
Don't use copy for data.table made from structure
sritchie73 Feb 18, 2020
a284589
Only need 1 test for dropping NULL columns
sritchie73 Feb 18, 2020
93563a3
Fixed detection of coercion required
sritchie73 Feb 19, 2020
7236060
base type is logical not raw
sritchie73 Feb 19, 2020
f5c5b46
Fixed raw type coercion rules
sritchie73 Feb 19, 2020
dd81334
raw now works with list columns
sritchie73 Feb 19, 2020
2d30f8a
Fixed ncol incrementer
sritchie73 Feb 19, 2020
61fcc0a
bugfix raw type detection and coercion
sritchie73 Feb 19, 2020
6a5cb04
Refactored preprocess in asmatrix
sritchie73 Feb 19, 2020
7668701
Updated integer64 tests for raw rules
sritchie73 Feb 19, 2020
8519f31
Fixed initialisation rules
sritchie73 Feb 20, 2020
fdea714
*wd is a pointer to an INTEGER SEXP array not int64_t
sritchie73 Feb 20, 2020
53846b8
missed an int64_t case
sritchie73 Feb 20, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
345 changes: 282 additions & 63 deletions R/data.table.R
Original file line number Diff line number Diff line change
Expand Up @@ -1870,10 +1870,9 @@ as.matrix.data.table = function(x, rownames=NULL, rownames.value=NULL, ...) {
stop("length(rownames)==0 but should be a single column name or number, or NULL")
} else {
if (isTRUE(rownames)) {
if (length(key(x))>1L) {
if (length(key(x))>1L)
warning("rownames is TRUE but key has multiple columns ",
brackify(key(x)), "; taking first column x[,1] as rownames")
}
rownames = if (length(key(x))==1L) chmatch(key(x),names(x)) else 1L
}
else if (is.logical(rownames) || is.na(rownames)) {
Expand All @@ -1883,7 +1882,6 @@ as.matrix.data.table = function(x, rownames=NULL, rownames.value=NULL, ...) {
else if (is.character(rownames)) {
w = chmatch(rownames, names(x))
if (is.na(w)) stop("'", rownames, "' is not a column of x")
rownames = w
}
else { # rownames is a column number already
rownames = as.integer(rownames)
Expand All @@ -1892,72 +1890,293 @@ as.matrix.data.table = function(x, rownames=NULL, rownames.value=NULL, ...) {
" which is outside the column number range [1,ncol=", ncol(x), "].")
}
}
if (!is.null(rownames)) {
# Extract the rownames column
rownames.value = x[[rownames]]
# bring it into memory if ff object
if (is.ff(rownames.value)) rownames.value = rownames.value[] # nocov

# Check rownames column is appropriate dimension
if (length(dim(rownames.value)) > 1L)
stop("x[,", rownames, "] has multi-column type (such as a matrix column)",
" and cannot be used as rownames: dim(x[,", rownames, "])==",
brackify(dim(rownames.value)))

# Warn if list column:
if (is.list(rownames.value))
warning("x[,", rownames, "] is a list column, which will be coerced to a",
" character vector when used as rownames")
}
} else if (!is.null(rownames.value)) {
if (length(rownames.value)!=nrow(x))
stop("length(rownames.value)==", length(rownames.value),
" but should be nrow(x)==", nrow(x))
}
if (!is.null(rownames)) {
# extract that column and drop it.
rownames.value = x[[rownames]]
dm = dim(x) - 0:1
cn = names(x)[-rownames]
X = x[, .SD, .SDcols = cn]
} else {
dm = dim(x)
cn = names(x)
X = x
}
if (any(dm == 0L))
return(array(NA, dim = dm, dimnames = list(rownames.value, cn)))
p = dm[2L]
n = dm[1L]
collabs = as.list(cn)
sritchie73 marked this conversation as resolved.
Show resolved Hide resolved

# Create a new list X containing pointers to columns in x, so that any
# modifications (i.e. type coercion, dropping of rownames column) do
# not change the input data.table. New copies in memory are only
# created for columns that are modified (those coerced to a different
# type or class)
X = x
class(X) = NULL
sritchie73 marked this conversation as resolved.
Show resolved Hide resolved
non.numeric = non.atomic = FALSE
all.logical = TRUE
for (j in seq_len(p)) {
if (is.ff(X[[j]])) X[[j]] = X[[j]][] # nocov to bring the ff into memory, since we need to create a matrix in memory
xj = X[[j]]
if (length(dj <- dim(xj)) == 2L && dj[2L] > 1L) {
if (inherits(xj, "data.table"))
xj = X[[j]] = as.matrix(X[[j]])
dnj = dimnames(xj)[[2L]]
collabs[[j]] = paste(collabs[[j]], if (length(dnj) >
0L)
dnj
else seq_len(dj[2L]), sep = ".")
}
if (!is.logical(xj))
all.logical = FALSE
if (length(levels(xj)) > 0L || !(is.numeric(xj) || is.complex(xj) || is.logical(xj)) ||
(!is.null(cl <- attr(xj, "class", exact=TRUE)) && any(cl %chin%
c("Date", "POSIXct", "POSIXlt"))))
non.numeric = TRUE
if (!is.atomic(xj))
non.atomic = TRUE
}
if (non.atomic) {
for (j in seq_len(p)) {
xj = X[[j]]
if (is.recursive(xj)) { }
else X[[j]] = as.list(as.vector(xj))
}
}
else if (all.logical) { }
else if (non.numeric) {
for (j in seq_len(p)) {
if (is.character(X[[j]])) next
xj = X[[j]]
miss = is.na(xj)
xj = if (length(levels(xj))) as.vector(xj) else format(xj)
is.na(xj) = miss
X[[j]] = xj
}
}
X = unlist(X, recursive = FALSE, use.names = FALSE)
dim(X) <- c(n, length(X)/n)
dimnames(X) <- list(rownames.value, unlist(collabs, use.names = FALSE))

# Drop the rownames column, if used
if (!is.null(rownames))
X[[rownames]] = NULL

# We now need to make sure all the columns are the same type, handling special
# cases where some columns are not stored in memory (ff objects), lists, or
# otherwise non-atomic types (e.g. S4). Further, we want to minimize
# the number of checks done, e.g. if all columns are already the same type.

# Get information about the classes and types of each column, returned in an environment.
# additional arguments allow for efficient updating of information
column_properties = function(X, X.info, cols.to.update, cols.to.drop) {
if (missing(X.info))
X.info = new.env()

# Get class and type of relevant columns
if (missing(cols.to.update) && missing(cols.to.drop)) {
X.info$classes = lapply(X, class) # lapply here because a column may have > 1 class.
X.info$types = vapply_1c(X, typeof)
} else if (!missing(cols.to.drop)) {
X.info$classes = X.info$classes[-cols.to.drop]
X.info$types = X.info$types[-cols.to.drop]
} else if (!missing(cols.to.update)) {
X.info$classes[cols.to.update] = lapply(X[cols.to.update], class)
X.info$types[cols.to.update] = vapply_1c(X[cols.to.update], typeof)
} else {
stop("Internal error: cols.to.update and cols.to.drop cannot both be supplied to column_properties()") # nocov
}

# Get unique classes and types
X.info$uniq.class.list = unique(X.info$classes) # list, i.e. handles multi-class columns
X.info$uniq.classes = sort(unique(unlist(X.info$uniq.class.list))) # flat vector of unique classes
X.info$uniq.types = sort(unique(X.info$types))

X.info
}
X.info = column_properties(X)

# Check whether any column has type or class among the 'target' vector,
# by matching to the precomputed vector of unique class or type information
# stored in X.info. col.uniqs should either be X.info$uniq.classes or
# X.info$uniq.types. Returns a single TRUE or FALSE.
any_col_is = function(col.uniqs, target) {
any(target %chin% col.uniqs)
}

# Check whether all columns have type or class among the 'target' vector,
# by matching to the precomputed vector of unique class or type information
# stored in X.info. col.uniqs should either be X.info$uniq.classes or
# X.info$uniq.types. Returns a single TRUE or FALSE.
all_col_is = function(col.uniqs, target) {
length(setdiff(col.uniqs, target)) == 0L
}

# Check whether each column has type or class among the 'target' vector,
# by matching to the precomputed vector of class or type information
# stored in X.info. col.info should either be X.info$types or X.info$classes.
# Returns TRUE or FALSE for each column, allowing for which() or which(!()) to
# be subsequently applied to get column indices for subsetting. This function
# should generally be applied only after an any_col_is() or all_col_is() check
# to avoid expensive checks across all columns where they are not needed (i.e.
# if all columns are of a single type already we want to minimise the column
# check computation)
col_is = function(col.info, target) {
if (length(target) > 1)
vapply_1b(col.info, function(xj) { any(target %chin% xj) } )
else
vapply_1b(col.info, `==`, target)
}

# Check and fix the data.table if it is malformed - i.e. contains NULL
# columns, wide columns, or columns whose length != .N. For NULL columns
# we need to drop manually, otherwise we can use as.data.table().
if (any_col_is(X.info$uniq.types, "NULL")) {
col.is.null = which(col_is(X.info$types, "NULL"))
X[col.is.null] = NULL
X.info = column_properties(X, X.info, cols.to.drop = col.is.null) # update column properties info
}

has.wide.columns = any(vapply_1b(X, function(xj) length(dim(xj)) > 0L))
n.column.lengths = length(unique(vapply_1i(X, length)))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the input data.table has many columns these two calls are quite expensive in the context of the pre column checks, as these attributes are not used for any other column checks.

It would be nice if we could somehow avoid these, but I can't see how, as they're absolutely necessary for avoiding problems with memcpy in the C code (as the memcpy call assumes the number of elements to copy is .N - if this is not true bad things can happen).

As far as I can tell, the only way these checks will fail is if a data.table has been created through structure(list(...), class=c("data.table", "data.frame")) as data.table(), as.data.table(), and setDT() will error if trying to create malformed columns like this.

Is there a way we can check for a valid externalptr in attributes(dt)[[".internal.selfref"]] to avoid these expensive checks on valid data.tables?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have found the internal selfrefok() function. Unfortunately selfrefok() does not guarantee the data.table is not malformed, as setDT() will allow these columns and then selfrefok() will be 1L. E.g. as in test 2074.07. This is particularly problematic if one of the columns has length < .N.

if (has.wide.columns || n.column.lengths > 1L) {
X = as.data.table(X)
class(X) = NULL
X.info = column_properties(X) # get column properties info
}

# Get matrix meta-data
cn = names(X)
p = length(cn)
n = if(p == 0L) 0L else length(X[[1L]])

# The maximum dimension size (row or column) for a matrix is 2^31-1 (or the Machine maximum integer)
# Check and error before doing computation/memory expensive column checks and coercion
# nocov start
if (p > .Machine$integer.max)
stop("Matrices with > ", .Machine$integer.max, " (.Machine$integer.max) columns are not supported")
if (n > .Machine$integer.max)
stop("Matrices with > ", .Machine$integer.max, " (.Machine$integer.max) rows are not supported")
# nocov end

# If no rows or columns can simply return empty array
if (p == 0L || n == 0L)
return(array(NA, dim = list(n, p), dimnames = list(rownames.value, cn)))

# Check for ff columns which need to be brought into memory. Running the any
# command on the unique column class list means for x with many columns we
# don't do expensive per-column checks unless necessary.
# nocov start
if (any_col_is(X.info$uniq.classes, "ff")) {
which.ff = which(col_is(X.info$classes, "ff"))
for (j in which.ff) {
X[[j]] = X[[j]][] # bring into memory
}
X.info = column_properties(X, X.info, cols.to.update = which.ff) # update column properties info
}
# nocov end

# Convert factors to character vectors
if (any_col_is(X.info$uniq.classes, "factor")) {
which.factors = which(col_is(X.info$classes, "factor"))
for (j in which.factors) {
miss = is.na(X[[j]])
X[[j]] = as.vector(X[[j]])
is.na(X[[j]]) = miss
}
X.info = column_properties(X, X.info, cols.to.update = which.factors)
}

# Classes to be converted to character vectors
charconvert.classes = c("Date", "POSIXct", "POSIXlt")
if (any_col_is(X.info$uniq.classes, charconvert.classes)) {
which.charconvert = which(col_is(X.info$classes, charconvert.classes))
for (j in which.charconvert) {
miss = is.na(X[[j]])
X[[j]] = format(X[[j]])
is.na(X[[j]]) = miss
}
X.info = column_properties(X, X.info, cols.to.update = which.charconvert) # update column properties info
}

# Next determine if any columns are list or non-atomic type. If so, convert all columns to lists
atomic.types <- c("logical", "integer", "double", "complex", "character", "raw")
any.non.atomic = !all_col_is(X.info$uniq.types, atomic.types)
if (any.non.atomic) {
# We only want to coerce columns to list if they are not already recursive
# see R source code src/main/coerce.c and src/main/utils.c
recursive.types = c("list", "pairlist", "closure", "environment", "promise", "language",
"special", "builtin", "...", "any", "expression")
which.not.recursive = which(!col_is(X.info$types, recursive.types))

# Some types may be recursive if subsettable. Find out which of these are actually
# already recursive to skip
maybe.recursive = c("expternalptr", "bytecode", "weakref")
which.maybe.recursive = which(col_is(X.info$types, maybe.recursive)) # which columns maybe recursive
maybe.is.recursive = vapply_1b(X[which.maybe.recursive], is.recursive) # are any actually recursive?
which.maybe.is.recursive = which.maybe.recursive[!maybe.is.recursive] # indices of these columns that weren't recursive

which.not.recursive = sort(c(which.not.recursive, which.maybe.is.recursive))

# Coerce to list as necessary
for (j in which.not.recursive) {
X[[j]] = as.list(X[[j]])
}

X.info = column_properties(X, X.info, cols.to.update = which.not.recursive)
}

# Other classes from suggested packages that have special type conversion rules
if (!any.non.atomic && any_col_is(X.info$uniq.classes, "integer64") && length(X.info$uniq.classes) > 1L) {
# Determine target class we're converting to
target.class = NULL
if (!requireNamespace("bit64", quietly=TRUE))
stop("coercion to or from integer64 columns requires the bit64 package") # nocov
if (all_col_is(X.info$uniq.classes, c("integer64", "logical", "integer", "raw")))
target.class = "integer64"
else if (all_col_is(X.info$uniq.classes, c("integer64", "numeric", "complex", "character")))
target.class = "character" # integer64 cannot reliably be converted to numeric or complex

if (!is.null(target.class)) { # if NULL fallback on type coercion
# no method as.integer64.raw so we need to do two-step conversion
if (target.class == "integer64" && any_col_is(X.info$uniq.classes, "raw")) {
which.convert = which(col_is(X.info$classes, "raw"))
for (j in which.convert) {
X[[j]] = as.integer(X[[j]])
}
X.info = column_properties(X, X.info, cols.to.update = which.convert)
}

# Which columns need to be coerced?
which.convert = which(!col_is(X.info$classes, target.class))

# Coerce the columns
for (j in which.convert) {
switch(target.class,
"integer64"={ X[[j]] = bit64::as.integer64(X[[j]]) },
sritchie73 marked this conversation as resolved.
Show resolved Hide resolved
"character"={ X[[j]] = as.character(X[[j]]) })
}

X.info = column_properties(X, X.info, cols.to.update = which.convert) # update column properties info
}
}

# If columns still do not all have the same type we need to coerce
if (!any.non.atomic && length(X.info$uniq.types) > 1L) {
type.order = c("raw"=1L, "logical"=2L, "integer"=3L, "double"=4L,
"complex"=5L, "character"=6L, "list"=7L)
target.type = names(which.max(type.order[X.info$uniq.types]))
which.convert = which(!col_is(X.info$types, target.type))

# Coerce the columns
for (j in which.convert) {
switch(target.type,
# raw impossible to reach - no coercion necessary if all columns are raw
"logical"={ X[[j]] = as.logical(X[[j]]) },
"integer"={ X[[j]] = as.integer(X[[j]]) },
"double"={ X[[j]] = as.double(X[[j]]) },
"complex"={ X[[j]] = as.complex(X[[j]]) },
"character"={ X[[j]] = as.character(X[[j]]) }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems quite wasteful to allocate memory for a new column whenever a column has to be coerced to a different type/class, only to then have to copy that memory again into the new matrix in the C code. For the atomic types at least, we may just be able to set a target type, pass that to the C code, then do the coercion there at copy-time. This would slow down the copy loop in the C code as you'd have to check and coerce each column - one way around this would be to pass the vectors of unique column types and individual column types to write a nested for loop iterating along columns of each type separately to avoid slowdowns due to branch prediction. Perhaps some memory profiling would be useful here to see if this is worth doing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assign.c:memrecycle does that; new feature 29 in 1.12.4 refers to zero-copy type coercion.

# list impossible to reach - list and other recursive columns handled separately
)
}
}

# Warn if fell back on type coercion
atomic.classes = c("logical", "raw", "integer", "numeric",
"complex", "character", "list")
non.atomics = setdiff(X.info$uniq.classes, atomic.classes)
# nocov start
if (!any.non.atomic && length(non.atomics) > 0L && length(X.info$uniq.class.list) > 1L) {
warning("Could not coerce columns to a common class, class information",
"has been stripped and columns coerced to type ", typeof(X[[1L]]))
}
# nocov end

# If mix of recursive column types, fall back on unlist method
if (any.non.atomic && !(length(X.info$uniq.types) == 1L && X.info$uniq.types == "list")) {
X = unlist(X, recursive = FALSE, use.names = FALSE)
dim(X) = c(n, length(X)/n)
}
# Otherwise use fast C method to copy values into matrix
else {
X = .Call(Casmatrix, X)
}

# Add row and column names
dimnames(X) = list(rownames.value, cn)

# Determine any additional classes to give to the matrix - this applies
# where all columns have the same set of classes, and those classes
# include those defined in packages outside of base R - for example
# if all columns have class integer64 we can add this class to the
# matrix.
if (!any.non.atomic && length(non.atomics) > 0L && length(X.info$uniq.class.list) == 1L)
class(X) = c(non.atomics, class(X)) # class(X) should be "matrix" or c("matrix", "array") in R >= 4.0.0

X
}

Expand Down
Loading