Add an opt-in parser tailored to rectangular data to improve the performance of large queries #148
Conversation
Codecov Report
```
@@           Coverage Diff            @@
##             master     #148   +/-  ##
=========================================
  Coverage          ?    36.5%
=========================================
  Files             ?      130
  Lines             ?    22295
  Branches          ?        0
=========================================
  Hits              ?     8139
  Misses            ?    14156
  Partials          ?        0
=========================================
```
Continue to review full report at Codecov.
This is a new approach at the C level. Instead of returning nested lists which must later be transposed, this implementation returns a list of vectors (essentially a data frame). For most tested cases, this approach is considerably faster than the nested lists. Signed-off-by: Aaron Jacobs <[email protected]>
When 'flat' is TRUE, mongo_stream_in() will use the new rectangular cursor paging function. Signed-off-by: Aaron Jacobs <[email protected]>
Signed-off-by: Aaron Jacobs <[email protected]>
@jeroen is the main holdup for this PR the dplyr / data.table dependencies? If so, I think we could just write a basic concatenation function in C that would give you comparable performance to the dplyr and data.table implementations and avoid the extra dependencies. All the column types in the pages should be consistent, right? So it would just be a matter of allocating each of the output columns with the total number of rows and copying the data.
@jimhester I think that would be an easy solution to that issue. But I suspect the main roadblock for this is the size of the new code. From the issue thread:
The other outstanding issue (in my view) is whether this should just become the default parser or if it needs to be hidden behind a parameter (called `flat`).
My main worry is consistency. Collections in mongolite can contain arbitrarily deeply nested JSON objects. Some applications use very complex, variable compound structures for each record. The `jsonlite::simplify` function is slow because it does a lot of complicated, messy reshaping on the structures to convert lists into vectors and data frames where possible. This process has many edge cases, for which jsonlite might have different behavior than other packages. So I don't think this is just a drop-in replacement; it will change the results. And I also don't feel comfortable taking on maintenance of 700 lines of C code. I would rather design an extensible solution that allows users or packages to provide their own simplifier function, rather than adopting all the dependencies and maintenance work in mongolite.
I think this parser falls back to using the existing code whenever there are nested objects, so it should handle them the same way.
I respect that. But I'm not sure how it would be possible to make this extensible at the right layer -- the performance improvements here come from handling the C-level mongo cursor directly.
This is an attempt to address #37. As I mentioned in that issue, I decided to add a new C-level parser that produces data frames instead of nested lists. This means that the results can simply be `rbind()`-ed together or, for improved performance, combined using `dplyr::bind_rows()` or `data.table::rbindlist()`. The approach has one obvious downside: it introduces a large amount of new code.

In its current incarnation, the new parser is opt-in -- I haven't changed any of the existing C code -- requiring you to pass a `flat = TRUE` parameter to the `find()` method. This also allowed me to test that it produces the same output as the current parser, and to run the test suite with few modifications.

The new approach appears to be much faster for larger data sets. As a quick example:
In that benchmark, it's about 5x faster.
I have a few outstanding design questions, namely: does the opt-in parameter make sense, and does attaching it to the `find()` method make sense?

I am happy to refactor things or redesign the API at your suggestion. I recognize that this is an enormous PR to review, and am willing to help you out as much as you'd like on that front. The code also contains some detailed comments explaining the approach.