Add an opt-in parser tailored to rectangular data to improve the performance of large queries #148
Conversation
Codecov Report
```
@@           Coverage Diff            @@
##             master     #148   +/-  ##
=========================================
  Coverage          ?    36.5%
=========================================
  Files             ?      130
  Lines             ?    22295
  Branches          ?        0
=========================================
  Hits              ?     8139
  Misses            ?    14156
  Partials          ?        0
=========================================
```
Continue to review full report at Codecov.
This is a new approach at the C level. Instead of returning nested lists which must later be transposed, this implementation returns a list of vectors (essentially a data frame). For most tested cases, this approach is considerably faster than the nested lists. Signed-off-by: Aaron Jacobs <[email protected]>
When 'flat' is TRUE, mongo_stream_in() will use the new rectangular cursor paging function. Signed-off-by: Aaron Jacobs <[email protected]>
Signed-off-by: Aaron Jacobs <[email protected]>
@jeroen is the main holdup for this PR the dplyr / data.table dependencies? If so, I think we could just write a basic concatenation function in C that would give you comparable performance to the dplyr and data.table implementations and avoid the extra dependencies. All the column types in the pages should be consistent, right? So it would just be a matter of allocating each of the output columns with the total number of rows and copying the data.
@jimhester I think that would be an easy solution to that issue. But I suspect the main roadblock for this is the size of the new code. From the issue thread:
The other outstanding issue (in my view) is whether this should just become the default parser or if it needs to be hidden behind a parameter (called `flat`).
My main worry is consistency. Collections in mongolite can contain arbitrarily deeply nested JSON objects. Some applications use very complex, variable compound structures for each record. The `jsonlite::simplify` function is slow because it does a lot of complicated, messy reshaping on the structures to convert lists into vectors and data frames where possible. This process has many edge cases, for which jsonlite might have different behavior than other packages. So I don't think this is just a drop-in replacement; it will change the results. And I also don't feel comfortable taking on maintenance of 700 lines of C code. I would rather design an extensible solution that allows users or packages to provide their own simplifier function, rather than adopting all the dependencies and maintenance work in mongolite.
I think this parser falls back to using the existing code whenever there are nested objects, so it should handle them the same way.
I respect that. But I'm not sure how it would be possible to make this extensible at the right layer -- the performance improvements here come from handling the C-level mongo cursor directly.
This is an attempt to address #37. As I mentioned in that issue, I decided to add a new C-level parser that produces data frames instead of nested lists. This means that the results can simply be `rbind()`-ed together or, for improved performance, combined using `dplyr::bind_rows()` or `data.table::rbindlist()`. The approach has one obvious downside: it introduces a large amount of new code.

In its current incarnation, the new parser is opt-in -- I haven't changed any of the existing C code -- requiring you to pass a `flat = TRUE` parameter to the `find()` method. This also allowed me to test that it produces the same output as the current parser, and to run the test suite with few modifications.

The new approach appears to be much faster for larger data sets. As a quick example:
In that benchmark, it's about 5x faster.
I have a few outstanding design questions, namely: does the opt-in parameter make sense, and does attaching it to the `find()` method make sense?

I am happy to refactor things or redesign the API at your suggestion. I recognize that this is an enormous PR to review, and am willing to help you out as much as you'd like on that front. The code also contains some detailed comments explaining the approach.