
Add an opt-in parser tailored to rectangular data to improve the performance of large queries #148

Open · wants to merge 3 commits into master
Conversation

@atheriel (Contributor) commented Aug 9, 2018

This is an attempt to address #37. As I mentioned in that issue, I decided to add a new C-level parser that produces data frames instead of nested lists. This means that the results can simply be rbind()-ed together or, for better performance, combined with dplyr::bind_rows() or data.table::rbindlist().

The approach has one obvious downside: it introduces a large amount of new code.

In its current incarnation, the new parser is opt-in -- none of the existing C code changes -- and requires passing a flat = TRUE parameter to the find() method. This also allowed me to verify that it produces the same output as the current parser and to run the test suite with few modifications.

The new approach appears to be much faster for larger data sets. As a quick example:

m <- mongo("test_flights", verbose = FALSE)
if(m$count() == 0) {
  m$insert(flights)
  m$insert(flights)
  m$insert(flights)
}

m$count()  # About 1 million.

microbenchmark::microbenchmark(
  before = m$find("{}", flat = FALSE),
  after = m$find("{}", flat = TRUE),
  times = 5
)
Unit: seconds
   expr       min        lq      mean    median        uq       max neval
 before 18.964424 21.575127 23.314231 23.164931 23.584352 29.282319     5
  after  3.394779  3.531145  4.390083  4.251457  4.310999  6.462035     5

As you can see, it's about 5x faster here.

I have a few outstanding design questions, namely:

  1. Does an opt-in approach with a parameter to find() make sense?
  2. Does the C code look OK?

I am happy to refactor things or redesign the API at your suggestion. I recognize that this is an enormous PR to review, and am willing to help you out as much as you'd like on that front. The code also contains some detailed comments explaining the approach.

@codecov-io commented Aug 9, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@32147b3).
The diff coverage is 4.53%.


@@           Coverage Diff            @@
##             master    #148   +/-   ##
========================================
  Coverage          ?   36.5%           
========================================
  Files             ?     130           
  Lines             ?   22295           
  Branches          ?       0           
========================================
  Hits              ?    8139           
  Misses            ?   14156           
  Partials          ?       0
Impacted Files Coverage Δ
src/cursor.c 4% <0%> (ø)
R/client.R 37.37% <0%> (ø)
R/mongo.R 49.7% <100%> (ø)
R/stream.R 38.93% <57.14%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

atheriel added 3 commits May 15, 2019 17:36
This is a new approach at the C level. Instead of returning nested
lists which must later be transposed, this implementation returns a list
of vectors (essentially a data frame).

For most tested cases, this approach is considerably faster than the
nested lists.

Signed-off-by: Aaron Jacobs <[email protected]>

When 'flat' is TRUE, mongo_stream_in() will use the new rectangular
cursor paging function.

Signed-off-by: Aaron Jacobs <[email protected]>
@jimhester

@jeroen, is the main holdup for this PR the dplyr / data.table dependencies?

If so I think we could just write a basic concatenation function in C that will give you comparable performance to the dplyr and data.table implementations and avoid the extra dependencies.

All the column types in the pages should be consistent, right? So it would just be a matter of allocating each output column at the total number of rows and copying the data.

@atheriel (Contributor, Author)

@jimhester I think that would be an easy solution to that issue. But I suspect the main roadblock for this is the size of the new code. From the issue thread:

> I would be interested if we can make it in a way that doesn't introduce a hard dependency on data.table and also doesn't introduce too much new complexity in the code.

The other outstanding issue (in my view) is whether this should just become the default parser or if it needs to be hidden behind a parameter (called flat at present) at all.

@jeroen (Owner) commented Jan 22, 2021

My main worry is consistency. Collections in mongolite can contain arbitrarily deeply nested JSON objects. Some applications use very complex, variable compound structures for each record.

The jsonlite::simplify function is slow because it does a lot of complicated, messy reshaping on the structures to convert lists into vectors and data frames where possible. This process has many edge cases, for which jsonlite may behave differently than other packages. So I don't think this is just a drop-in replacement: it will change the results.

I also don't feel comfortable taking on maintenance of 700 lines of C code. I would rather design an extensible solution that allows users or packages to provide their own simplifier function than adopt all the dependencies and maintenance work in mongolite.

@atheriel (Contributor, Author)

> My main worry is consistency. Collections in mongolite can contain arbitrarily deeply nested JSON objects. Some applications use very complex, variable compound structures for each record.

I think this parser falls back to using the existing code whenever there are nested objects, so it should handle them the same way.

> I don't feel comfortable taking on maintenance of 700 lines of C code. I would rather design an extensible solution that allows users or packages to provide their own simplifier function, rather than adopting all the dependencies and maintenance work in mongolite.

I respect that. But I'm not sure how it would be possible to make this extensible at the right layer -- the performance improvements here come from handling the C-level mongo cursor directly.
