Note: this is not an issue with the atime package, but rather an issue with base R that was revealed by using atime.
A number of people have observed anecdotally that read.csv is slow for files with a large number of columns, for example: https://stackoverflow.com/questions/7327851/read-csv-is-extremely-slow-in-reading-csv-files-with-large-numbers-of-columns
I did a systematic comparison of read.csv with similar functions, see below.
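For context, here is a minimal sketch of the kind of atime benchmark behind the figure below (the file name, N grid, and seconds.limit are illustrative; the exact code is in the vignette linked at the bottom):

```r
## Minimal sketch, not the exact benchmark from the vignette:
## for each N, write a one-row CSV with N numeric columns,
## then measure time/memory of several readers as N grows.
atime.result <- atime::atime(
  N = 2^seq(1, 12),
  setup = {
    f.csv <- tempfile(fileext = ".csv")
    one.row <- as.data.frame(matrix(1, nrow = 1, ncol = N))
    write.csv(one.row, f.csv, row.names = FALSE)
  },
  "read.csv" = read.csv(f.csv),
  "readr::read_csv" = readr::read_csv(f.csv, show_col_types = FALSE),
  "data.table::fread" = data.table::fread(f.csv),
  seconds.limit = 1)
plot(atime.result)
```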
That is a log-log plot of memory (top panel Y axis) and time (bottom panel Y axis) as a function of the number of columns (X axis), so the larger slope of read.csv indicates a larger asymptotic complexity class. When I draw reference lines on top, I observe that read.csv is quadratic time, O(N^2), whereas the others are linear, O(N); see below.
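One way to draw such reference lines with atime is references_best, which fits candidate asymptotic functions (N, N log N, N^2, ...) to the measurements; a sketch, continuing from the call above:

```r
## Fit reference complexity curves to the timings above and plot
## the closest ones: read.csv tracks N^2, the others track N.
best.refs <- atime::references_best(atime.result)
plot(best.refs)
```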
Can read.csv be improved to use a linear time algorithm, so it can handle CSV files with larger numbers of columns?
I re-ran the benchmark using these two suggestions, specifying colClasses and using scan directly (both sketched below), and I observed that colClasses does not improve the asymptotic complexity.
Also, list2DF(scan) seems to be log-linear; see below.
This suggests that fixing the quadratic time issue requires a change to read.csv itself (and not to scan, although maybe there is a way to speed scan up from log-linear to linear?).
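For concreteness, a sketch of the two variants (again with illustrative file names and sizes; see the vignette for the exact expressions used in the benchmark):

```r
## Sketch of the two variants, assuming the same one-row CSV
## with N numeric columns as above.
N <- 1000L
f.csv <- tempfile(fileext = ".csv")
write.csv(as.data.frame(matrix(1, nrow = 1, ncol = N)),
          f.csv, row.names = FALSE)
## Variant 1: colClasses skips per-column type detection,
## but does not change the asymptotic complexity.
df.colClasses <- read.csv(f.csv, colClasses = "numeric")
## Variant 2: scan with a list of column types returns one
## vector per column, which list2DF turns into a data frame
## (column names are not preserved in this sketch).
df.scan <- list2DF(scan(f.csv, sep = ",", skip = 1, quiet = TRUE,
                        what = rep(list(numeric()), N)))
identical(dim(df.colClasses), dim(df.scan))
```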
![figure-read-real-vary-cols](https://user-images.githubusercontent.com/932850/228731372-dca54fbf-20cc-4a55-a6a3-89606f62eaab.png)
source code: https://github.com/tdhock/atime/blob/ec5295859f4a74bc7b137a9eec7c5a29d91c1ded/vignettes/compare-data.table-tidyverse.Rmd
rendered: https://rcdata.nau.edu/genomic-ml/atime/vignettes/compare-data.table-tidyverse.html