
Excessive RAM usage for DBI::dbWriteTable() and dplyr::collect() #97

Open · krlmlr opened this issue Mar 3, 2024 · 4 comments

krlmlr (Collaborator) commented Mar 3, 2024

RAM use reaches 32GB when writing a 16GB file, and also momentarily reaches 32GB when reading the same 16GB file back. From #72 (comment), by @SimonCoulombe.

library(DBI)
library(dplyr)
library(dbplyr)
library(duckdb)

duckdb_path <- "/devroot/sandbox/tmp/duckdb.duckdb"

con <- dbConnect(duckdb::duckdb(dbdir = duckdb_path))
dbExecute(con, "PRAGMA threads=1; PRAGMA memory_limit='1GB';")

# Run this once to create the duckdb file, then restart the session:
if (FALSE) {
  bigdata <- data.table::rbindlist(rlang::rep_along(1:3e6, list(iris)))
  dim(bigdata) # 450M rows, 5 columns
  lobstr::obj_size(bigdata) # 16.20 GB in RAM

  dbWriteTable(con, "straight_from_memory", bigdata)
}


bigdata <- tbl(con, "straight_from_memory") %>% collect()
krlmlr added this to the 0.10.0 milestone on Mar 3, 2024

krlmlr (Collaborator) commented Mar 3, 2024

I have started to investigate memory consumption with dbWriteTable() in https://github.com/krlmlr/duckdb-mem. The script https://github.com/krlmlr/duckdb-mem/blob/main/setup-manual-limited.R shows that the memory limit is not enforced for a CREATE TABLE ... FROM ... statement. The only difference from the setup-register.R script in the same repository is that CREATE TABLE statement. @Tmonster: can you please take a look?

I'll also review memory usage for reading the table and might reorganize that repository a bit.

krlmlr changed the title from "Excessive RAM usage for dplyr::collect()" to "Excessive RAM usage for DBI::dbWriteTable() and dplyr::collect()" on Mar 3, 2024

krlmlr (Collaborator) commented Mar 3, 2024

Added an analysis of the reading behavior to https://github.com/krlmlr/duckdb-mem. Indeed, it seems that reading also consumes at least twice the size of the data read, which is surprising. We'll need to trace memory allocations to understand what's going on here. This is likely an issue in the R glue code, whereas the dbWriteTable() problems likely originate in the C++ core.

SimonCoulombe commented Mar 3, 2024

Thanks for looking into this. I looked at the https://github.com/krlmlr/duckdb-mem repo and I didn't see a reading scenario that includes the to_arrow() function like in that comment.

I was under the impression that reading the 16GB file this way only requires 16GB instead of 32GB. TIL from your repo that you can use /usr/bin/time to find out how much memory a process used at its peak, so I'll go check whether my to_arrow() solution actually works.

bigdata <- tbl(con, "straight_from_memory") %>% collect() # 32 GB peak
bigdata <- tbl(con, "straight_from_memory") %>% to_arrow() %>% collect() # 16 GB peak

krlmlr (Collaborator) commented Mar 3, 2024

Thanks, added Arrow. The usage is still 2x the data size, unfortunately.

krlmlr modified the milestones: 0.10.0 → 0.10.1 on Mar 11, 2024