quadratic time complexity of read.csv #8

tdhock · 2023-03-30T04:43:49Z

Note: this is not an issue with the atime package, but instead an issue with base R which was revealed by using atime.
A number of people have observed anecdotally that read.csv is slow for large number of columns, for example: https://stackoverflow.com/questions/7327851/read-csv-is-extremely-slow-in-reading-csv-files-with-large-numbers-of-columns
I did a systematic comparison of read.csv with similar functions, see below.

That is a log-log plot of memory (top panel Y axis) and time(bottom panel Y axis) as a function of number of columns (X axis), so the larger slope of read.csv indicates a larger asymptotic complexity class. When I draw references lines on top, I observe that read.csv is quadratic time (N^2), whereas the others are linear (N), see below.

Can read.csv be improved to use a linear time algorithm, so it can handle CSV files with larger numbers of columns?

source code: https://github.com/tdhock/atime/blob/ec5295859f4a74bc7b137a9eec7c5a29d91c1ded/vignettes/compare-data.table-tidyverse.Rmd
rendered: https://rcdata.nau.edu/genomic-ml/atime/vignettes/compare-data.table-tidyverse.html

sessionInfo()
#> R version 4.2.3 (2023-03-15)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Red Hat Enterprise Linux 8.7 (Ootpa)
#> 
#> Matrix products: default
#> BLAS:   /projects/genomic-ml/lib64/R/lib/libRblas.so
#> LAPACK: /projects/genomic-ml/lib64/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  utils     datasets  grDevices methods   base     
#> 
#> other attached packages:
#> [1] directlabels_2021.2.24 ggplot2_3.4.1          data.table_1.14.8     
#> 
#> loaded via a namespace (and not attached):
#>  [1] highr_0.10       pillar_1.9.0     bslib_0.4.2      compiler_4.2.3  
#>  [5] jquerylib_0.1.4  tools_4.2.3      digest_0.6.31    bit_4.0.5       
#>  [9] gtable_0.3.3     jsonlite_1.8.4   evaluate_0.20    lifecycle_1.0.3 
#> [13] tibble_3.2.1     lattice_0.20-45  pkgconfig_2.0.3  rlang_1.1.0     
#> [17] bench_1.1.2      cli_3.6.1        xfun_0.38        fastmap_1.1.1   
#> [21] withr_2.5.0      dplyr_1.1.1      knitr_1.42       generics_0.1.3  
#> [25] vctrs_0.6.1      sass_0.4.5       hms_1.1.3        bit64_4.0.5     
#> [29] grid_4.2.3       tidyselect_1.2.0 glue_1.6.2       atime_2022.12.14
#> [33] R6_2.5.1         fansi_1.0.4      profmem_0.6.0    vroom_1.6.1     
#> [37] rmarkdown_2.21   purrr_1.0.1      tidyr_1.3.0      farver_2.1.1    
#> [41] readr_2.1.4      tzdb_0.3.0       magrittr_2.0.3   nc_2020.8.6     
#> [45] scales_1.2.1     htmltools_0.5.5  colorspace_2.1-1 quadprog_1.5-8  
#> [49] utf8_1.2.3       munsell_0.5.0    cachem_1.0.7     crayon_1.5.2

tdhock · 2023-03-30T19:00:42Z

On R-devel Sebastian Meyer suggested adding the following, to determine if there is any issue with scan (which is used by read.csv):

utils::read.csv(f.csv, colClasses = "numeric")

list2DF(scan(f.csv, what=`names<-`(rep(list(numeric()), N),
paste0("V",seq_len(N))), sep=",", skip=1, multi.line=FALSE))

which I translated to:

$`read.csv(colClasses)`
utils::read.csv(f.csv, colClasses = "numeric")

$`list2DF(scan)`
{
    what <- `names<-`(rep(list(numeric()), N), paste0("V", seq_len(N)))
    list2DF(scan(f.csv, what = what, sep = ",", skip = 1, multi.line = FALSE))
}

I re-ran the benchmark using these two, and I observed that colClasses does not improve asymptotic complexity.
Also list2DF(scan) seems to be log-linear, see below.

This suggests that fixing the quadratic time issues requires a change to read.csv (and not scan, although maybe there is a way to speed it up to linear from log-linear?)

tdhock · 2023-04-26T23:02:42Z

this still seems to be an issue using R-4.3.0

hturner mentioned this issue Aug 22, 2023

improve efficiency of read.csv and write.csv r-devel/r-project-sprint-2023#41

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

quadratic time complexity of read.csv #8

quadratic time complexity of read.csv #8

tdhock commented Mar 30, 2023 •

edited

Loading

tdhock commented Mar 30, 2023 •

edited

Loading

tdhock commented Apr 26, 2023

quadratic time complexity of read.csv #8

quadratic time complexity of read.csv #8

Comments

tdhock commented Mar 30, 2023 • edited Loading

tdhock commented Mar 30, 2023 • edited Loading

tdhock commented Apr 26, 2023

tdhock commented Mar 30, 2023 •

edited

Loading

tdhock commented Mar 30, 2023 •

edited

Loading