Start replacing TRUELENGTH markers with a hash #6694
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #6694 +/- ##
==========================================
- Coverage 98.61% 98.60% -0.02%
==========================================
Files 79 80 +1
Lines 14558 14595 +37
==========================================
+ Hits 14357 14392 +35
- Misses 201 203 +2
Generated via commit d7a9a17.
Also avoid crashing when creating a 0-size hash.
This will likely require a dynamically growing hash of TRUELENGTHs instead of the current pre-allocation approach with a very conservative over-estimate.
The hash needs O(n) memory (actually 2*n/load_factor entries) which isn't great.
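To make the memory estimate above concrete, here is a minimal C sketch of sizing an open-addressing table from an expected element count. The helper names (`table_slots`, `table_bytes`), the rounding up to a power of two, and the slot layout (one pointer key plus one `ptrdiff_t` value) are assumptions for illustration, not code taken from this PR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: smallest power of two that provides at least
   2*n/load_factor slots. Never returns 0, so a request for a 0-size
   hash still yields a valid (one-slot) table. */
static size_t table_slots(size_t n, double load_factor) {
  assert(load_factor > 0.0 && load_factor <= 1.0);
  double want = 2.0 * (double)n / load_factor;
  assert(want < (double)(SIZE_MAX / 2));  /* keep the doubling loop from overflowing */
  size_t slots = 1;
  while ((double)slots < want) slots *= 2;
  return slots;
}

/* Bytes needed for the table, assuming each slot stores a pointer key
   and a ptrdiff_t value (an assumed layout). */
static size_t table_bytes(size_t n, double load_factor) {
  return table_slots(n, load_factor) * (sizeof(void *) + sizeof(ptrdiff_t));
}
```

Under these assumptions, `table_slots(100, 0.5)` rounds 400 requested slots up to 512, which shows how a conservative over-estimate of `n` multiplies directly into allocated memory.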
hi, thanks for this. Can you please propose one or two performance test cases that you think may be adversely affected by these changes? Is it when we create a table with one column, and then use |
The The I'll try giving many |
Perhaps you can also try a fast third-party hash map: https://martin.ankerl.com/2019/04/01/hashmap-benchmarks-01-overview/ In particular, Google's Abseil hash is pretty fast: https://abseil.io/docs/cpp/guides/container https://abseil.io/docs/cpp/guides/hash
It's pretty bad. For typical cases, the current hash table eats >1 order of magnitude more R* memory, and it's similarly slower in The hash table is only on par by time in the worst case for
# may need 16G of RAM to run comfortably due to pathological memory allocation patterns
library(atime)
pkg.path <- '.'
limit <- 1
# taken from .ci/atime/tests.R
pkg.edit.fun <- function(old.Package, new.Package, sha, new.pkg.path) {
pkg_find_replace <- function(glob, FIND, REPLACE) {
atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
}
Package_regex <- gsub(".", "_?", old.Package, fixed = TRUE)
Package_ <- gsub(".", "_", old.Package, fixed = TRUE)
new.Package_ <- paste0(Package_, "_", sha)
pkg_find_replace(
"DESCRIPTION",
paste0("Package:\\s+", old.Package),
paste("Package:", new.Package))
pkg_find_replace(
file.path("src", "Makevars.*in"),
Package_regex,
new.Package_)
pkg_find_replace(
file.path("R", "onLoad.R"),
Package_regex,
new.Package_)
pkg_find_replace(
file.path("R", "onLoad.R"),
sprintf('packageVersion\\("%s"\\)', old.Package),
sprintf('packageVersion\\("%s"\\)', new.Package))
pkg_find_replace(
file.path("src", "init.c"),
paste0("R_init_", Package_regex),
paste0("R_init_", gsub("[.]", "_", new.Package_)))
pkg_find_replace(
"NAMESPACE",
sprintf('useDynLib\\("?%s"?', Package_regex),
paste0('useDynLib(', new.Package_))
}
versions <- c(
master = '70c64ac08c6becae5847cd59ab1efcb4c46437ac',
truehash = '24e81785669e70caac31501bf4424ba14dbc90f9'
)
N <- 10^seq(2, 8.5, .25)
# expected case: a few distinct strings
forderv1_work <- lapply(setNames(nm = N), \(N)
sample(letters, N, TRUE)
)
forderv1 <- atime_versions(
pkg.path, N,
expr = data.table:::forderv(forderv1_work[[as.character(N)]]),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(forderv1_work); gc(full = TRUE)
# worst case: all strings different
# (a challenge for the allocator too due to many small immovable objects)
N <- 10^seq(2, 7.5, .25)
forderv2_work <- lapply(setNames(nm = N), \(N)
format(runif(N), digits = 16)
)
forderv2 <- atime_versions(
pkg.path, N,
expr = data.table:::forderv(forderv2_work[[as.character(N)]]),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(forderv2_work); gc(full = TRUE)
# expected case: all columns named the same
N <- 10^seq(1, 6.5, .25) # number of data.tables in the list
k <- 10 # number of columns per data.table
rbindlist1_work <- lapply(setNames(nm = N), \(N)
rep(list(setNames(as.list(1:k), letters[1:k])), N)
)
rbindlist1 <- atime_versions(
pkg.path, N,
expr = data.table::rbindlist(rbindlist1_work[[as.character(N)]]),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(rbindlist1_work); gc(full = TRUE)
# worst case: all columns different
N <- 10^seq(1, 5.5, .25) # number of data.tables in the list
k <- 10 # number of columns per data.table
rbindlist2_work <- lapply(setNames(nm = N), \(N)
replicate(N, setNames(as.list(1:k), format(runif(k), digits = 16)), FALSE)
)
rbindlist2 <- atime_versions(
pkg.path, N,
expr = data.table::rbindlist(rbindlist2_work[[as.character(N)]], fill = TRUE),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(rbindlist2_work); gc(full = TRUE)
save(forderv1, forderv2, rbindlist1, rbindlist2, file = 'times.rda')
* Edit: Some of the memory use in I'll try profiling the code. Thanks @SebKrantz for the link, a newer benchmark by the same author is also very instructive.
thanks for proposing the performance test cases and sharing the atime benchmark graphs. I agree that we should try to avoid an order of magnitude constant factor increase in time/memory usage. |
In forder() and rbindlist(), there is no good upper bound on the number of elements in the hash known ahead of time. Grow the hash table dynamically. Since the R/W locks are far too slow and OpenMP atomics are too limited, rely on strategically placed flushes, which isn't really a solution.
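The idea of growing the table on demand can be sketched as follows. This is a minimal single-threaded illustration, assuming linear probing, a load factor capped at 0.5 and doubling on growth; the names `dynhash`, `dynhash_init`, `dynhash_set` and `dynhash_lookup` are hypothetical and it does not attempt the multi-threaded handling discussed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical types; a NULL key marks an empty slot. */
typedef struct { const void *key; ptrdiff_t value; } slot_t;
typedef struct { slot_t *slots; size_t size, used; } dynhash; /* size is a power of two */

static size_t slot_index(const void *key, size_t size) {
  /* multiplicative hash of the pointer bits (supports up to 2^32 slots) */
  uint64_t h = (uint64_t)(uintptr_t)key * UINT64_C(0x9E3779B97F4A7C15);
  return (size_t)(h >> 32) & (size - 1);
}

static int dynhash_init(dynhash *h, size_t size) {
  assert(size > 0 && (size & (size - 1)) == 0);
  h->slots = calloc(size, sizeof *h->slots);
  h->size = size;
  h->used = 0;
  return h->slots ? 0 : -1;
}

/* Insert or update without checking the load factor. */
static void put_nogrow(dynhash *h, const void *key, ptrdiff_t value) {
  size_t i = slot_index(key, h->size);
  while (h->slots[i].key && h->slots[i].key != key) i = (i + 1) & (h->size - 1);
  if (!h->slots[i].key) { h->slots[i].key = key; h->used++; }
  h->slots[i].value = value;
}

/* Returns 0 on success, -1 if the larger table could not be allocated. */
static int dynhash_set(dynhash *h, const void *key, ptrdiff_t value) {
  assert(key != NULL);
  if (2 * (h->used + 1) > h->size) {  /* would exceed load factor 0.5: double and rehash */
    dynhash bigger = { calloc(2 * h->size, sizeof(slot_t)), 2 * h->size, 0 };
    if (!bigger.slots) return -1;
    for (size_t i = 0; i < h->size; i++)
      if (h->slots[i].key) put_nogrow(&bigger, h->slots[i].key, h->slots[i].value);
    free(h->slots);
    *h = bigger;
  }
  put_nogrow(h, key, value);
  return 0;
}

static ptrdiff_t dynhash_lookup(const dynhash *h, const void *key, ptrdiff_t ifnotfound) {
  size_t i = slot_index(key, h->size);
  while (h->slots[i].key) {           /* terminates: at least half the slots are empty */
    if (h->slots[i].key == key) return h->slots[i].value;
    i = (i + 1) & (h->size - 1);
  }
  return ifnotfound;
}
```

Starting from a one-slot table and inserting keys one by one, the table doubles whenever the next insert would push the load factor above 0.5, so memory tracks the number of distinct keys rather than a worst-case estimate. The cost is the rehash on each doubling, and the old slot array is freed immediately, which is exactly what becomes unsafe once other threads may still be reading it.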
Since profiling has shown that a noticeable amount of time is wasted initialising the giant pre-allocated hash tables, I was able to make the slowdown factor closer to 2 by dynamically re-allocating the hash table: The memory use is significantly reduced (except for the worst cases), but cannot be measured with
library(atime)
pkg.path <- '.'
limit <- 1
# taken from .ci/atime/tests.R
pkg.edit.fun <- function(old.Package, new.Package, sha, new.pkg.path) {
pkg_find_replace <- function(glob, FIND, REPLACE) {
atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
}
Package_regex <- gsub(".", "_?", old.Package, fixed = TRUE)
Package_ <- gsub(".", "_", old.Package, fixed = TRUE)
new.Package_ <- paste0(Package_, "_", sha)
pkg_find_replace(
"DESCRIPTION",
paste0("Package:\\s+", old.Package),
paste("Package:", new.Package))
pkg_find_replace(
file.path("src", "Makevars.*in"),
Package_regex,
new.Package_)
pkg_find_replace(
file.path("R", "onLoad.R"),
Package_regex,
new.Package_)
pkg_find_replace(
file.path("R", "onLoad.R"),
sprintf('packageVersion\\("%s"\\)', old.Package),
sprintf('packageVersion\\("%s"\\)', new.Package))
pkg_find_replace(
file.path("src", "init.c"),
paste0("R_init_", Package_regex),
paste0("R_init_", gsub("[.]", "_", new.Package_)))
pkg_find_replace(
"NAMESPACE",
sprintf('useDynLib\\("?%s"?', Package_regex),
paste0('useDynLib(', new.Package_))
}
versions <- c(
master = '70c64ac08c6becae5847cd59ab1efcb4c46437ac',
static_hash = '24e81785669e70caac31501bf4424ba14dbc90f9',
dynamic_hash = 'd7a9a1707ec94ec4f2bd86a5dfb5609207029ba4'
)
N <- 10^seq(2, 7.5, .25)
# expected case: a few distinct strings
forderv1_work <- lapply(setNames(nm = N), \(N)
sample(letters, N, TRUE)
)
forderv1 <- atime_versions(
pkg.path, N,
expr = data.table:::forderv(forderv1_work[[as.character(N)]]),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(forderv1_work); gc(full = TRUE)
# worst case: all strings different
# (a challenge for the allocator too due to many small immovable objects)
N <- 10^seq(2, 7.5, .25)
forderv2_work <- lapply(setNames(nm = N), \(N)
format(runif(N), digits = 16)
)
forderv2 <- atime_versions(
pkg.path, N,
expr = data.table:::forderv(forderv2_work[[as.character(N)]]),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(forderv2_work); gc(full = TRUE)
# expected case: all columns named the same
N <- 10^seq(1, 5.5, .25) # number of data.tables in the list
k <- 10 # number of columns per data.table
rbindlist1_work <- lapply(setNames(nm = N), \(N)
rep(list(setNames(as.list(1:k), letters[1:k])), N)
)
rbindlist1 <- atime_versions(
pkg.path, N,
expr = data.table::rbindlist(rbindlist1_work[[as.character(N)]]),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(rbindlist1_work); gc(full = TRUE)
# worst case: all columns different
N <- 10^seq(1, 4.5, .25) # number of data.tables in the list
k <- 10 # number of columns per data.table
rbindlist2_work <- lapply(setNames(nm = N), \(N)
replicate(N, setNames(as.list(1:k), format(runif(k), digits = 16)), FALSE)
)
rbindlist2 <- atime_versions(
pkg.path, N,
expr = data.table::rbindlist(rbindlist2_work[[as.character(N)]], fill = TRUE),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(rbindlist2_work); gc(full = TRUE)
#save(forderv1, forderv2, rbindlist1, rbindlist2, file = 'times.rda')
The main problem with the current approach is that since the parallel loop in The current code keeps one previous hash table until the next reallocation cycle and hopes that
In case it helps, {collapse}'s hash functions (https://github.com/SebKrantz/collapse/blob/master/src/kit_dup.c and https://github.com/SebKrantz/collapse/blob/master/src/match.c) are pretty fast as well. They are inspired by base R: a multiplication hash using an unsigned integer prime number. It's bloody fast but requires a large table. But Calloc() is quite efficient. Anyway, it would be great if you'd test the Google hash function; curious to see if it can do much better. PS: you can test
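For readers unfamiliar with the multiplication hash mentioned here, a minimal C sketch follows. The function name `mult_hash` is hypothetical, and the constant 2654435761 (0x9E3779B1, the multiplier usually attributed to Knuth) is an assumption chosen for illustration; it is not necessarily the constant used by {collapse}, base R, or this PR.

```c
#include <assert.h>
#include <stdint.h>

/* Multiplicative hash: multiply the key by a large odd constant modulo 2^32
   and keep the top K bits, giving an index into a table of 2^K slots.
   Requires 1 <= K <= 32. */
static uint32_t mult_hash(uint32_t key, unsigned K) {
  assert(K >= 1 && K <= 32);
  return (UINT32_C(2654435761) * key) >> (32 - K);
}
```

The hash itself is one multiplication and one shift, which is why it is fast. Its quality is modest, so it is typically paired with a generously sized table (large `K`) to keep collisions rare, matching the speed-for-memory tradeoff described in the comment.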
The abseil hash function is very slightly slower in my tests, although the difference may not be significant. Perhaps that's because my C port fails to inline some of the things that naturally inline in the original C++ with templates. I can try harder, but that's a lot of extra code to bring in properly.
library(atime)
pkg.path <- '.'
limit <- 1
# taken from .ci/atime/tests.R
pkg.edit.fun <- function(old.Package, new.Package, sha, new.pkg.path) {
pkg_find_replace <- function(glob, FIND, REPLACE) {
atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
}
Package_regex <- gsub(".", "_?", old.Package, fixed = TRUE)
Package_ <- gsub(".", "_", old.Package, fixed = TRUE)
new.Package_ <- paste0(Package_, "_", sha)
pkg_find_replace(
"DESCRIPTION",
paste0("Package:\\s+", old.Package),
paste("Package:", new.Package))
pkg_find_replace(
file.path("src", "Makevars.*in"),
Package_regex,
new.Package_)
pkg_find_replace(
file.path("R", "onLoad.R"),
Package_regex,
new.Package_)
pkg_find_replace(
file.path("R", "onLoad.R"),
sprintf('packageVersion\\("%s"\\)', old.Package),
sprintf('packageVersion\\("%s"\\)', new.Package))
pkg_find_replace(
file.path("src", "init.c"),
paste0("R_init_", Package_regex),
paste0("R_init_", gsub("[.]", "_", new.Package_)))
pkg_find_replace(
"NAMESPACE",
sprintf('useDynLib\\("?%s"?', Package_regex),
paste0('useDynLib(', new.Package_))
}
versions <- c(
master = '70c64ac08c6becae5847cd59ab1efcb4c46437ac',
'Knuth_hash' = 'd7a9a1707ec94ec4f2bd86a5dfb5609207029ba4',
'abseil_hash' = '159e1d48926b72af9f212b8c645a8bc8ab6b20be'
)
N <- 10^seq(2, 7.5, .25)
# expected case: a few distinct strings
forderv1_work <- lapply(setNames(nm = N), \(N)
sample(letters, N, TRUE)
)
forderv1 <- atime_versions(
pkg.path, N,
expr = data.table:::forderv(forderv1_work[[as.character(N)]]),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(forderv1_work); gc(full = TRUE)
# worst case: all strings different
# (a challenge for the allocator too due to many small immovable objects)
N <- 10^seq(2, 7.5, .25)
forderv2_work <- lapply(setNames(nm = N), \(N)
format(runif(N), digits = 16)
)
forderv2 <- atime_versions(
pkg.path, N,
expr = data.table:::forderv(forderv2_work[[as.character(N)]]),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(forderv2_work); gc(full = TRUE)
# expected case: all columns named the same
N <- 10^seq(1, 5.5, .25) # number of data.tables in the list
k <- 10 # number of columns per data.table
rbindlist1_work <- lapply(setNames(nm = N), \(N)
rep(list(setNames(as.list(1:k), letters[1:k])), N)
)
rbindlist1 <- atime_versions(
pkg.path, N,
expr = data.table::rbindlist(rbindlist1_work[[as.character(N)]]),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(rbindlist1_work); gc(full = TRUE)
# worst case: all columns different
N <- 10^seq(1, 4.5, .25) # number of data.tables in the list
k <- 10 # number of columns per data.table
rbindlist2_work <- lapply(setNames(nm = N), \(N)
replicate(N, setNames(as.list(1:k), format(runif(k), digits = 16)), FALSE)
)
rbindlist2 <- atime_versions(
pkg.path, N,
expr = data.table::rbindlist(rbindlist2_work[[as.character(N)]], fill = TRUE),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(rbindlist2_work); gc(full = TRUE)
save(forderv1, forderv2, rbindlist1, rbindlist2, file = 'times.rda')
library(atime)
library(data.table)
limit <- 1
# assumes that atime_versions() had pre-installed the packages
# master = '70c64ac08c6becae5847cd59ab1efcb4c46437ac',
# 'Knuth_hash' = 'd7a9a1707ec94ec4f2bd86a5dfb5609207029ba4',
N <- 10^seq(2, 7.5, .25)
# expected case: a few distinct strings
chmatch_work1 <- lapply(setNames(nm = N), \(N)
sample(letters, N, TRUE)
)
chmatch1 <- atime(
N,
seconds.limit = limit, verbose = TRUE,
master = data.table.70c64ac08c6becae5847cd59ab1efcb4c46437ac::chmatch(chmatch_work1[[as.character(N)]], letters),
Knuth_hash = data.table.d7a9a1707ec94ec4f2bd86a5dfb5609207029ba4::chmatch(chmatch_work1[[as.character(N)]], letters),
collapse = collapse::fmatch(chmatch_work1[[as.character(N)]], letters)
)
rm(chmatch_work1); gc(full = TRUE)
save(chmatch1, file = 'times_collapse.rda')
And the real memory cost isn't even that large:
library(atime)
library(data.table)
# assumes that atime_versions() had pre-installed the packages
# master = '70c64ac08c6becae5847cd59ab1efcb4c46437ac',
# 'Knuth_hash' = 'd7a9a1707ec94ec4f2bd86a5dfb5609207029ba4',
library(data.table.70c64ac08c6becae5847cd59ab1efcb4c46437ac)
library(data.table.d7a9a1707ec94ec4f2bd86a5dfb5609207029ba4)
library(parallel)
# only tested on a recent Linux system
# measures the _maximal_ amount of memory in kB used by the current process
writeLines('
#include <sys/resource.h>
void maxrss(double * kb) {
struct rusage ru;
int ret = getrusage(RUSAGE_SELF, &ru);
*kb = ret ? -1 : ru.ru_maxrss;
}
', 'maxrss.c')
tools::Rcmd('SHLIB maxrss.c')
dyn.load(paste0('maxrss', .Platform$dynlib.ext))
limit <- 1
N <- 10^seq(2, 7.5, .25)
# expected case: a few distinct strings
chmatch_work1 <- lapply(setNames(nm = N), \(N)
sample(letters, N, TRUE)
)
versions <- expression(
master = data.table.70c64ac08c6becae5847cd59ab1efcb4c46437ac::chmatch(chmatch_work1[[as.character(N)]], letters),
Knuth_hash = data.table.d7a9a1707ec94ec4f2bd86a5dfb5609207029ba4::chmatch(chmatch_work1[[as.character(N)]], letters),
collapse = collapse::fmatch(chmatch_work1[[as.character(N)]], letters)
)
plan <- expand.grid(N = N, version = names(versions))
chmatch1 <- lapply(seq_len(nrow(plan)), \(i) {
# use a disposable child process
mccollect(mcparallel({
eval(versions[[plan$version[[i]]]], list(N = plan$N[[i]]))
.C('maxrss', kb = double(1))$kb
}))
})
rm(chmatch_work1); gc(full = TRUE)
save(chmatch1, file = 'times_collapse.rda')
chmatch1p <- lattice::xyplot(
maxrss_kb ~ N, cbind(plan, maxrss_kb = unlist(chmatch1)), group = version,
auto.key = TRUE, scales = list(log = 10),
par.settings=list(superpose.symbol=list(pch=19))
)
Nice! Thanks. I always wondered about this tradeoff between the size of the table and the quality of the hash function. Looks like speed + large table still wins. Anyway, if you want to adopt, feel free to copy it under your MPL license. Just mention me in the top of the file and as a contributor. |
PS: I believe it also depends on the size of the |
excellent work thank you very much |
With apologies to Matt Dowle, who had poured a lot of effort into making data.table go fast.
Ongoing work towards #6180. Unfortunately, doesn't completely remove any uses of non-API entry points by itself. Detailed motivation here in a pending blog post. Can't start implementing stretchy ALTREP vectors until data.table stops using TRUELENGTH to mark them.
Currently implemented:
- TRUELENGTH to mark CHARSXPs or columns replaced with a hash
Needs more work:
- rbindlist() and forder() pre-allocate memory for the worst-case usage
- forder.c, the last remaining user of savetl
- SET_TRUELENGTH is atomic, hash_set is not, will need additional care in multi-threaded environment
- savetl machinery in assign.c
Let's just see how much worse the performance is going to get.