
Excessive RAM usage for DBI::dbWriteTable() and dplyr::collect() #97

Open · krlmlr opened this issue Mar 3, 2024 · 4 comments

krlmlr (Collaborator) commented Mar 3, 2024

RAM use reaches 32GB when writing a 16GB file, and also momentarily reaches 32GB when reading the same 16GB file back. From #72 (comment), by @SimonCoulombe.

library(DBI)
library(dplyr)
library(dbplyr)
library(duckdb)

duckdb_path <- "/devroot/sandbox/tmp/duckdb.duckdb"

con <- dbConnect(duckdb::duckdb(dbdir = duckdb_path))
dbExecute(con, "PRAGMA threads=1; PRAGMA memory_limit='1GB';")

# Run this once to create the duckdb file, then restart the session:
if (FALSE) {
  bigdata <- data.table::rbindlist(rlang::rep_along(1:3e6, list(iris)))
  dim(bigdata) # 450M rows, 5 columns
  lobstr::obj_size(bigdata) # 16.20 GB in RAM

  dbWriteTable(con, "straight_from_memory", bigdata)
}


bigdata <- tbl(con, "straight_from_memory") %>% collect()
krlmlr added this to the 0.10.0 milestone on Mar 3, 2024

krlmlr (Collaborator) commented Mar 3, 2024

I have started to investigate memory consumption with dbWriteTable() in https://github.com/krlmlr/duckdb-mem. The script https://github.com/krlmlr/duckdb-mem/blob/main/setup-manual-limited.R shows that the memory limit is not enforced for a CREATE TABLE ... FROM ... statement. The only difference from the setup-register.R script in the same repository is that CREATE TABLE statement. @Tmonster: can you please take a look?

I'll also review memory usage for reading the table and might reorganize that repository a bit.

krlmlr changed the title from "Excessive RAM usage for dplyr::collect()" to "Excessive RAM usage for DBI::dbWriteTable() and dplyr::collect()" on Mar 3, 2024

krlmlr (Collaborator) commented Mar 3, 2024

Added an analysis of the reading behavior to https://github.com/krlmlr/duckdb-mem. Indeed, it seems that reading also consumes at least twice the size of the data read, which is surprising. We'll need to trace memory allocations to understand what's going on here. This is likely an issue in the R glue code, whereas the dbWriteTable() problems likely originate in the C++ core.

SimonCoulombe commented Mar 3, 2024

Thanks for looking into this. I looked at the https://github.com/krlmlr/duckdb-mem repo and I didn't see a reading scenario that includes the to_arrow() function like in that comment.

I was under the impression that reading the 16GB file this way only requires 16GB instead of 32GB. TIL from your repo that you can use /usr/bin/time to find out how much memory a process used at its peak, so I'll go check whether my to_arrow() solution actually works.

bigdata <- tbl(con, "straight_from_memory") %>% collect() # 32 GB peak
bigdata <- tbl(con, "straight_from_memory") %>% to_arrow() %>% collect() # 16 GB peak

krlmlr (Collaborator) commented Mar 3, 2024

Thanks, added Arrow. The usage is still 2x the data size, unfortunately.

krlmlr modified the milestones: 0.10.0 → 0.10.1 on Mar 11, 2024