Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fast data.table to matrix conversion in C [Depends: #4196] #4144

Draft
wants to merge 207 commits into
base: master
Choose a base branch
from

Conversation

sritchie73
Copy link
Contributor

@sritchie73 sritchie73 commented Dec 28, 2019

Closes #4209

Following on from PR #4134 I've started working on C code for faster conversion of data.tables to matrices.

The current code so far works for numeric matrices only, assuming all columns in the data.table are already numeric.

TODO:

  • Modify C code so that it will work on all atomic types, not just numeric data.
  • Handle type conversion for data.tables with multiple types
  • Parallelise C code across input data.table columns
  • Revisit R code to reduce unnecessary column checks
  • bit64 support
  • fallback unlist method for types not implemented in C
  • Add tests
  • Check compatibilty with changes to matrix in upcoming R 4.0.0

Basic code implemented for numeric matrices for data.tables whose columns are all numeric
Helper functions (e.g. REAL) are now only called once for each column/object and objects are accessed by pointer in the for loops.
@sritchie73 sritchie73 added the WIP label Dec 28, 2019
@sritchie73 sritchie73 self-assigned this Dec 28, 2019
@sritchie73
Copy link
Contributor Author

At this point it would be useful to have someone's eyes on the code I've written to sanity check I haven't done something drastically wrong. Its my first time writing C extensions for R (I've worked quite a bit with Rcpp and have learnt C in the past) and also working with C code in an R package setting.

@codecov
Copy link

codecov bot commented Dec 28, 2019

Codecov Report

Merging #4144 (bef9fa1) into master (4bda6da) will increase coverage by 0.01%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4144      +/-   ##
==========================================
+ Coverage   99.60%   99.61%   +0.01%     
==========================================
  Files          72       73       +1     
  Lines       13918    14043     +125     
==========================================
+ Hits        13863    13989     +126     
+ Misses         55       54       -1     
Impacted Files Coverage Δ
src/init.c 100.00% <ø> (ø)
R/data.table.R 100.00% <100.00%> (ø)
src/matrix.c 100.00% <100.00%> (ø)
src/utils.c 98.09% <0.00%> (-0.04%) ⬇️
src/fread.c 99.52% <0.00%> (-0.01%) ⬇️
R/xts.R 100.00% <0.00%> (ø)
R/fcast.R 100.00% <0.00%> (ø)
R/frank.R 100.00% <0.00%> (ø)
R/utils.R 100.00% <0.00%> (ø)
src/frank.c 100.00% <0.00%> (ø)
... and 11 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4bda6da...53846b8. Read the comment docs.

src/asmatrix.c Outdated Show resolved Hide resolved
C implementation of as.matrix now works on simple atomic types: logical, integer, numeric, and complex, provided all columns in the data.table are of the same type.
Bug was caused by dispatch function attempting to access non-existant second column, due to forgetting that array indexing in C starts at 0 not 1.
@sritchie73
Copy link
Contributor Author

C implementation of as.matrix now works on multiple atomic types, provided the input data.table's columns are all the same type (data.tables with mixed column types are next on the TODO list). The types I've tested, and covered by test.data.table() so far are: logical, integer, numeric, complex, character, raw, and list.

To achieve this, I've made the C asmatrix function a dispatch function that simply detects the atomic type from the first column of the data.table, then calls the appropriate asmatrix_ function. This unfortunately means theres quite a bit of code duplication. I don't know if there is a better way of doing this, especially given that character vectors and lists have to be handled quite differently in C to the other atomic types.

For data.tables with multiple column types, as.matrix.data.table performs type conversion across columns before handing the data.table to the C implementation of as.matrix
Covers test 2074.07: data.tables may be constructed such that an individual column may multi-column, e.g. a data.table or matrix. Since the test notes that these can only occur when data.tables are constructed incorrectly, I've just reverted to the old non-C code to handle this special case.
@sritchie73
Copy link
Contributor Author

For data.tables with mixed column types, I've explicitly handle the type conversion in R prior to handing off to C. I'm not sure if this is an improvement over the old method of simply calling unlist(), since unlist() handles type conversion in its internal C code.

R/data.table.R Outdated Show resolved Hide resolved
R/data.table.R Outdated Show resolved Hide resolved
R/data.table.R Outdated Show resolved Hide resolved
@MichaelChirico
Copy link
Member

I like what you've done to split out the some-columns-has-dimensions edge case. Given that AFAIK we don't support columns with dimensions (?) maybe worth to just error for that case & not bother with the added complexity.

Once you've coerced your columns-as-list-elements to the "highest" type, how does do.call(cbind) perform?

Explicit integer use + correct handling of multi-type columns
R/data.table.R Outdated Show resolved Hide resolved
@sritchie73
Copy link
Contributor Author

I like what you've done to split out the some-columns-has-dimensions edge case. Given that AFAIK we don't support columns with dimensions (?) maybe worth to just error for that case & not bother with the added complexity.

Once you've coerced your columns-as-list-elements to the "highest" type, how does do.call(cbind) perform?

I've only implemented this to cover test 2074.07:

## as.matrix.data.table when a column has columns (only possible when constructed incorrectly)
DT = structure(list(a=1:5, d=data.table(b=6:10, c=11:15), m=matrix(16:25, ncol=2L)), class = c('data.table', 'data.frame'))
test(2074.07, as.matrix(DT), matrix(1:25, ncol=5L, dimnames=list(NULL, c('a', 'd.b', 'd.c', 'm.1', 'm.2'))))

I'd otherwise prefer to throw an error, but perhaps better here to maintain backwards compatibility. I'm not particularly concerned with increasing efficiency for this edge case which can apparently only arise erroneously.

@MichaelChirico
Copy link
Member

I know for sure 2074.07 was only added for codecov purposes (I wrote it 😂), I wouldn't have any issue removing that test. setDT warns for such columns and as.data.table "unnests" them, the only way to get such an object now is through the convoluted approach I made for the test. Don't think we should be catering too much for that.

@sritchie73
Copy link
Contributor Author

sritchie73 commented Dec 30, 2019

In that case, I'm now wondering about lines 1909-1917, which look like they are also specific to this erroneous edge case:

    if (length(dj <- dim(xj)) == 2L && dj[2L] > 1L) {
      if (inherits(xj, "data.table"))
        xj = X[[j]] = as.matrix(X[[j]])
      dnj = dimnames(xj)[[2L]]
      collabs[[j]] = paste(collabs[[j]], if (length(dnj) >
        0L)
        dnj
      else seq_len(dj[2L]), sep = ".")
    }

I think we can remove the body of this if statement, and instead replace it with a call to setDT on X after the for loop run across each column, if any columns with dimensions are detected. Alternatively, we could eliminate this check entirely, and just run setDT regardless.

Elimanted costly check of each column for dimensions, replacing it with a single (very fast) call to setDT to check for columns with multiple columns (e.g. a column that contains a matrix).

In the rare edge case where this is detected (see test 2074.07) we now use as.data.table to unpack these columns.

This elimanted the need for the later check for matrix and data.table columns and subsequent use of the old unlist rather than the new C method.

Test 2074.07 required minor modification in the expected column names of the output matrix.
@sritchie73
Copy link
Contributor Author

I think I've come up with a much nicer solution to that problem which eliminates the need for checking each column's dimensions.

Previously, when the user supplied as.matrix with a column to use as the rownames, there was an expensive copy call `x = X[,.SD,.SDcols=cn[-rownames]]` which would create a copy of the input data.table excluding the rownames column.

Now instead the as.matrix function simply keeps track of the rownames column to exclude, and skips over that column throughout the rest of the column checks and while creating the matrix.
@sritchie73
Copy link
Contributor Author

RE: discussion of raw types in #4172 , I changed the as.matrix implementation here to mirror the behaviour of as.matrix.data.frame: the resulting matrix is raw type only if all columns are raw type, otherwise the matrix is coerced to character

Previous version had lots of interacting parts making it (1) buggy, (2) difficult to test, and (3) impossible to maintain.

This refactor makes the logic clearer and more modular so that the logic is easier to follow and requires fewer tests to cover.
@sritchie73
Copy link
Contributor Author

One thing that would be nice to do in this PR as well is to group all the as.matrix tests in the test file. Currently these are spread out across tests.Rraw. Is there a way of doing this without having to renumber all the tests?

@MichaelChirico
Copy link
Member

it should be fine to leave the old test numbers.

otoh, it's just a for loop through the new file to renumber 😁

@ethanbsmith
Copy link
Contributor

wondering if you would consider optimizing xts / zoo <-> data.table as part of this. Since xts and zoo are really just a matrix with an index attribute, apart from the index column, the rest would really just be a single type matrix conversion.

data.table in place updates, grouping, indexing and handling of mixed types are so powerful, that many of my analytics function convert my xts structures to a data.table internally for processing, but then need to reconvert to xts on the way out for compatibility. so, this is a very common use case (for me)

@jangorecki
Copy link
Member

good idea, but better to keep it as a separate PR probably

@mattdowle mattdowle modified the milestones: 1.13.1, 1.13.3 Oct 17, 2020
@jangorecki jangorecki modified the milestones: 1.14.3, 1.14.5 Jul 19, 2022
@jangorecki jangorecki modified the milestones: 1.14.11, 1.15.1 Oct 29, 2023
@MichaelChirico
Copy link
Member

Setting as draft instead of label:WIP.

@MichaelChirico MichaelChirico marked this pull request as draft February 19, 2024 04:29
@MichaelChirico MichaelChirico modified the milestones: 1.16.0, 1.17.0 Jul 10, 2024
@MichaelChirico MichaelChirico modified the milestones: 1.17.0, 1.18.0 Dec 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

String coercion inserts spaces, as.matrix.data.table
6 participants