Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

quadratic time complexity of read.csv #8

Open
tdhock opened this issue Mar 30, 2023 · 2 comments
Open

quadratic time complexity of read.csv #8

tdhock opened this issue Mar 30, 2023 · 2 comments

Comments

@tdhock
Copy link
Owner

tdhock commented Mar 30, 2023

Note: this is not an issue with the atime package, but instead an issue with base R which was revealed by using atime.
A number of people have observed anecdotally that read.csv is slow for large number of columns, for example: https://stackoverflow.com/questions/7327851/read-csv-is-extremely-slow-in-reading-csv-files-with-large-numbers-of-columns
I did a systematic comparison of read.csv with similar functions, see below.
figure-read-real-vary-cols
That is a log-log plot of memory (top panel Y axis) and time(bottom panel Y axis) as a function of number of columns (X axis), so the larger slope of read.csv indicates a larger asymptotic complexity class. When I draw references lines on top, I observe that read.csv is quadratic time (N^2), whereas the others are linear (N), see below.

figure-read-real-vary-cols-complexity

Can read.csv be improved to use a linear time algorithm, so it can handle CSV files with larger numbers of columns?

source code: https://github.com/tdhock/atime/blob/ec5295859f4a74bc7b137a9eec7c5a29d91c1ded/vignettes/compare-data.table-tidyverse.Rmd
rendered: https://rcdata.nau.edu/genomic-ml/atime/vignettes/compare-data.table-tidyverse.html

sessionInfo()
#> R version 4.2.3 (2023-03-15)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Red Hat Enterprise Linux 8.7 (Ootpa)
#> 
#> Matrix products: default
#> BLAS:   /projects/genomic-ml/lib64/R/lib/libRblas.so
#> LAPACK: /projects/genomic-ml/lib64/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  utils     datasets  grDevices methods   base     
#> 
#> other attached packages:
#> [1] directlabels_2021.2.24 ggplot2_3.4.1          data.table_1.14.8     
#> 
#> loaded via a namespace (and not attached):
#>  [1] highr_0.10       pillar_1.9.0     bslib_0.4.2      compiler_4.2.3  
#>  [5] jquerylib_0.1.4  tools_4.2.3      digest_0.6.31    bit_4.0.5       
#>  [9] gtable_0.3.3     jsonlite_1.8.4   evaluate_0.20    lifecycle_1.0.3 
#> [13] tibble_3.2.1     lattice_0.20-45  pkgconfig_2.0.3  rlang_1.1.0     
#> [17] bench_1.1.2      cli_3.6.1        xfun_0.38        fastmap_1.1.1   
#> [21] withr_2.5.0      dplyr_1.1.1      knitr_1.42       generics_0.1.3  
#> [25] vctrs_0.6.1      sass_0.4.5       hms_1.1.3        bit64_4.0.5     
#> [29] grid_4.2.3       tidyselect_1.2.0 glue_1.6.2       atime_2022.12.14
#> [33] R6_2.5.1         fansi_1.0.4      profmem_0.6.0    vroom_1.6.1     
#> [37] rmarkdown_2.21   purrr_1.0.1      tidyr_1.3.0      farver_2.1.1    
#> [41] readr_2.1.4      tzdb_0.3.0       magrittr_2.0.3   nc_2020.8.6     
#> [45] scales_1.2.1     htmltools_0.5.5  colorspace_2.1-1 quadprog_1.5-8  
#> [49] utf8_1.2.3       munsell_0.5.0    cachem_1.0.7     crayon_1.5.2
@tdhock
Copy link
Owner Author

tdhock commented Mar 30, 2023

On R-devel Sebastian Meyer suggested adding the following, to determine if there is any issue with scan (which is used by read.csv):

utils::read.csv(f.csv, colClasses = "numeric")

list2DF(scan(f.csv, what=`names<-`(rep(list(numeric()), N),
paste0("V",seq_len(N))), sep=",", skip=1, multi.line=FALSE))

which I translated to:

$`read.csv(colClasses)`
utils::read.csv(f.csv, colClasses = "numeric")

$`list2DF(scan)`
{
    what <- `names<-`(rep(list(numeric()), N), paste0("V", seq_len(N)))
    list2DF(scan(f.csv, what = what, sep = ",", skip = 1, multi.line = FALSE))
}

I re-ran the benchmark using these two, and I observed that colClasses does not improve asymptotic complexity.
Also list2DF(scan) seems to be log-linear, see below.
image
This suggests that fixing the quadratic time issues requires a change to read.csv (and not scan, although maybe there is a way to speed it up to linear from log-linear?)

@tdhock
Copy link
Owner Author

tdhock commented Apr 26, 2023

this still seems to be an issue using R-4.3.0
image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant