-
Notifications
You must be signed in to change notification settings - Fork 992
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fast data.table to matrix conversion in C [Depends: #4196] #4144
base: master
Are you sure you want to change the base?
Conversation
Basic code implemented for numeric matrices for data.tables whose columns are all numeric
Helper functions (e.g. REAL) are now only called once for each column/object and objects are accessed by pointer in the for loops.
At this point it would be useful to have someone's eyes on the code I've written to sanity check I haven't done something drastically wrong. Its my first time writing C extensions for R (I've worked quite a bit with Rcpp and have learnt C in the past) and also working with C code in an R package setting. |
Codecov Report
@@ Coverage Diff @@
## master #4144 +/- ##
==========================================
+ Coverage 99.60% 99.61% +0.01%
==========================================
Files 72 73 +1
Lines 13918 14043 +125
==========================================
+ Hits 13863 13989 +126
+ Misses 55 54 -1
Continue to review full report at Codecov.
|
C implementation of as.matrix now works on simple atomic types: logical, integer, numeric, and complex, provided all columns in the data.table are of the same type.
Bug was caused by dispatch function attempting to access non-existant second column, due to forgetting that array indexing in C starts at 0 not 1.
C implementation of as.matrix now works on multiple atomic types, provided the input data.table's columns are all the same type (data.tables with mixed column types are next on the TODO list). The types I've tested, and covered by test.data.table() so far are: logical, integer, numeric, complex, character, raw, and list. To achieve this, I've made the C asmatrix function a dispatch function that simply detects the atomic type from the first column of the data.table, then calls the appropriate asmatrix_ function. This unfortunately means theres quite a bit of code duplication. I don't know if there is a better way of doing this, especially given that character vectors and lists have to be handled quite differently in C to the other atomic types. |
For data.tables with multiple column types, as.matrix.data.table performs type conversion across columns before handing the data.table to the C implementation of as.matrix
Covers test 2074.07: data.tables may be constructed such that an individual column may multi-column, e.g. a data.table or matrix. Since the test notes that these can only occur when data.tables are constructed incorrectly, I've just reverted to the old non-C code to handle this special case.
For data.tables with mixed column types, I've explicitly handle the type conversion in R prior to handing off to C. I'm not sure if this is an improvement over the old method of simply calling unlist(), since unlist() handles type conversion in its internal C code. |
I like what you've done to split out the some-columns-has-dimensions edge case. Given that AFAIK we don't support columns with dimensions (?) maybe worth to just error for that case & not bother with the added complexity. Once you've coerced your columns-as-list-elements to the "highest" type, how does |
I've only implemented this to cover test 2074.07:
I'd otherwise prefer to throw an error, but perhaps better here to maintain backwards compatibility. I'm not particularly concerned with increasing efficiency for this edge case which can apparently only arise erroneously. |
I know for sure 2074.07 was only added for codecov purposes (I wrote it 😂), I wouldn't have any issue removing that test. |
In that case, I'm now wondering about lines 1909-1917, which look like they are also specific to this erroneous edge case:
I think we can remove the body of this if statement, and instead replace it with a call to setDT on X after the for loop run across each column, if any columns with dimensions are detected. Alternatively, we could eliminate this check entirely, and just run setDT regardless. |
Elimanted costly check of each column for dimensions, replacing it with a single (very fast) call to setDT to check for columns with multiple columns (e.g. a column that contains a matrix). In the rare edge case where this is detected (see test 2074.07) we now use as.data.table to unpack these columns. This elimanted the need for the later check for matrix and data.table columns and subsequent use of the old unlist rather than the new C method. Test 2074.07 required minor modification in the expected column names of the output matrix.
I think I've come up with a much nicer solution to that problem which eliminates the need for checking each column's dimensions. |
Previously, when the user supplied as.matrix with a column to use as the rownames, there was an expensive copy call `x = X[,.SD,.SDcols=cn[-rownames]]` which would create a copy of the input data.table excluding the rownames column. Now instead the as.matrix function simply keeps track of the rownames column to exclude, and skips over that column throughout the rest of the column checks and while creating the matrix.
RE: discussion of raw types in #4172 , I changed the as.matrix implementation here to mirror the behaviour of as.matrix.data.frame: the resulting matrix is raw type only if all columns are raw type, otherwise the matrix is coerced to character |
Previous version had lots of interacting parts making it (1) buggy, (2) difficult to test, and (3) impossible to maintain. This refactor makes the logic clearer and more modular so that the logic is easier to follow and requires fewer tests to cover.
One thing that would be nice to do in this PR as well is to group all the as.matrix tests in the test file. Currently these are spread out across tests.Rraw. Is there a way of doing this without having to renumber all the tests? |
it should be fine to leave the old test numbers. otoh, it's just a for loop through the new file to renumber 😁 |
wondering if you would consider optimizing
|
good idea, but better to keep it as a separate PR probably |
Setting as draft instead of |
Closes #4209
Following on from PR #4134 I've started working on C code for faster conversion of data.tables to matrices.
The current code so far works for numeric matrices only, assuming all columns in the data.table are already numeric.
TODO: