Gforce grouping var class #5568
Codecov Report
@@ Coverage Diff @@
## master #5568 +/- ##
=======================================
Coverage 97.49% 97.49%
=======================================
Files 80 80
Lines 14810 14810
=======================================
Hits 14439 14439
Misses 371 371
Benchmarks
On I'm curious why
cc()
N = 1e7
DT = data.table(x = sample(N), y = sample(1e2,N,TRUE))
# warm start
invisible(`[.data.table`(DT,j=.N,by=y))
invisible(data.table:::`[.data.table`(DT,j=.N,by=y))
microbenchmark::microbenchmark(
`[.data.table`(DT,j=.N,by=y), # proposed fix
data.table:::`[.data.table`(DT,j=.N,by=y), # master before
times = 100L, unit = "s"
)
# Unit: seconds
# expr min lq mean median uq max neval cld
# `[.data.table`(DT, j = .N, by = y) 0.08319297 0.09171692 0.1236303 0.1033564 0.154837 0.2407570 100 a
# data.table:::`[.data.table`(DT, j = .N, by = y) 0.08162183 0.09237532 0.1242161 0.1009008 0.161996 0.2534246 100 a
DT = data.table(x = sample(N), y = sample(1e3,N,TRUE))
microbenchmark::microbenchmark(
`[.data.table`(DT,j=.N,by=y), # proposed fix
data.table:::`[.data.table`(DT,j=.N,by=y), # master before
times = 100L, unit = "s"
)
# Unit: seconds
# expr min lq mean median uq max neval cld
# `[.data.table`(DT, j = .N, by = y) 0.08871780 0.1055516 0.1316296 0.1132985 0.1571205 0.2616192 100 a
# data.table:::`[.data.table`(DT, j = .N, by = y) 0.09055208 0.1053714 0.1388291 0.1164063 0.1821008 0.2650133 100 a
DT = data.table(x = sample(N), y = sample(1e4,N,TRUE))
microbenchmark::microbenchmark(
`[.data.table`(DT,j=.N,by=y), # proposed fix
data.table:::`[.data.table`(DT,j=.N,by=y), # master before
times = 100L, unit = "s"
)
# Unit: seconds
# expr min lq mean median uq max neval cld
# `[.data.table`(DT, j = .N, by = y) 0.10052261 0.1118025 0.1393742 0.1193117 0.1838666 0.2102043 100 a
# data.table:::`[.data.table`(DT, j = .N, by = y) 0.09610407 0.1078412 0.1309245 0.1165225 0.1371610 0.2084393 100 a
DT = data.table(x = sample(N), y = sample(1e5,N,TRUE))
microbenchmark::microbenchmark(
`[.data.table`(DT,j=.N,by=y), # proposed fix
data.table:::`[.data.table`(DT,j=.N,by=y), # master before
times = 100L, unit = "s"
)
# Unit: seconds
# expr min lq mean median uq max neval cld
# `[.data.table`(DT, j = .N, by = y) 0.1648205 0.1777414 0.2407806 0.1851763 0.2554005 0.8988475 100 a
# data.table:::`[.data.table`(DT, j = .N, by = y) 0.1646772 0.1780597 0.2441777 0.1871644 0.2588514 0.9152462 100 a
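Also adding a benchmark comparing `x[y]` with `.Call(data.table:::CsubsetVector, x, y)`:
library(data.table)
# For each length(x) = 10^i and each smaller length(y) = 10^j,
# time base R subsetting against data.table's internal CsubsetVector.
bensch <- function(n) {
  for (i in seq(2, n)) {
    N <- 10^i
    x <- sample(N)
    for (j in seq(1, i - 1)) {
      M <- 10^j
      y <- sample(M)
      cat(sprintf("length(x)=%d\t\tlength(y)=%d\n", N, M))
      # keep only the first seven columns of the bench::mark result
      print(bench::mark(x[y], .Call(data.table:::CsubsetVector, x, y))[, 1:7])
      cat("\n\n")
    }
  }
}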
bensch(8)
#> length(x)=100 length(y)=10
#> # A tibble: 2 × 7
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
#> 1 x[y] 172ns 179ns 4824681. 0B 0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y) 368ns 393ns 2021012. 0B 0 10000
#>
#>
#> length(x)=1000 length(y)=10
#> # A tibble: 2 × 7
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
#> 1 x[y] 172ns 180ns 4770361. 0B 0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y) 369ns 465ns 1502658. 0B 0 10000
#>
#>
#> length(x)=1000 length(y)=100
#> # A tibble: 2 × 7
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
#> 1 x[y] 352ns 388ns 2122521. 448B 0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y) 567ns 613ns 1240114. 448B 0 10000
#>
#>
#> length(x)=10000 length(y)=10
#> # A tibble: 2 × 7
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
#> 1 x[y] 176ns 185ns 3959722. 0B 0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y) 377ns 398ns 2123734. 0B 0 10000
#>
#>
#> length(x)=10000 length(y)=100
#> # A tibble: 2 × 7
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
#> 1 x[y] 356ns 448ns 1502203. 448B 0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y) 566ns 596ns 1493639. 448B 0 10000
#>
#>
#> length(x)=10000 length(y)=1000
#> # A tibble: 2 × 7
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
#> 1 x[y] 1.75µs 3.2µs 301182. 3.95KB 0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y) 2.17µs 2.23µs 390908. 3.95KB 39.1 9999
#>
#>
#> length(x)=100000 length(y)=10
#> # A tibble: 2 × 7
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
#> 1 x[y] 172ns 258ns 2616683. 0B 0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y) 364ns 417ns 2106109. 0B 0 10000
#>
#>
#> length(x)=100000 length(y)=100
#> # A tibble: 2 × 7
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
#> 1 x[y] 353ns 381ns 2338254. 448B 0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y) 576ns 621ns 1533174. 448B 0 10000
#>
#>
#> length(x)=100000 length(y)=1000
#> # A tibble: 2 × 7
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
#> 1 x[y] 1.78µs 1.9µs 474043. 3.95KB 47.4 9999
#> 2 .Call(data.table:::CsubsetVector, x, y) 2.09µs 2.24µs 332232. 3.95KB 33.2 9999
#>
#>
#> length(x)=100000 length(y)=10000
#> # A tibble: 2 × 7
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
#> 1 x[y] 16.4µs 16.9µs 49189. 39.1KB 34.5 9993
#> 2 .Call(data.table:::CsubsetVector, x, y) 25.8µs 27.8µs 34104. 39.1KB 23.9 9993
#>
#>
#> length(x)=1000000 length(y)=10
#> # A tibble: 2 × 7
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
#> 1 x[y] 180ns 191ns 4576763. 0B 0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y) 363ns 409ns 2286424. 0B 0 10000
#>
#>
#> length(x)=1000000 length(y)=100
#> # A tibble: 2 × 7
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
#> 1 x[y] 363ns 551ns 1466453. 448B 0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y) 587ns 635ns 1342725. 448B 0 10000
#>
#>
#> length(x)=1000000 length(y)=1000
#> # A tibble: 2 × 7
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
#> 1 x[y] 1.79µs 1.95µs 464367. 3.95KB 46.4 9999
#> 2 .Call(data.table:::CsubsetVector, x, y) 2.18µs 2.28µs 418613. 3.95KB 0 10000
#>
#>
#> length(x)=1000000 length(y)=10000
#> # A tibble: 2 × 7
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
#> 1 x[y] 16.9µs 17.7µs 51919. 39.1KB 41.6 9992
#> 2 .Call(data.table:::CsubsetVector, x, y) 16.1µs 17.9µs 46705. 39.1KB 37.4 9992
#>
#>
#> length(x)=1000000 length(y)=100000
#> # A tibble: 2 × 7
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
#> 1 x[y] 215µs 233µs 3766. 391KB 30.4 1737
#> 2 .Call(data.table:::CsubsetVector, x, y) 152µs 170µs 5357. 391KB 44.7 2395
#>
#>
#> length(x)=10000000 length(y)=10
#> # A tibble: 2 × 7
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
#> 1 x[y] 167ns 182ns 4813768. 0B 0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y) 375ns 399ns 2317283. 0B 0 10000
#>
#>
#> length(x)=10000000 length(y)=100
#> # A tibble: 2 × 7
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
#> 1 x[y] 355ns 387ns 2361763. 448B 0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y) 571ns 611ns 1468815. 448B 0 10000
#>
#>
#> length(x)=10000000 length(y)=1000
#> # A tibble: 2 × 7
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
#> 1 x[y] 1.74µs 2µs 408265. 3.95KB 40.8 9999
#> 2 .Call(data.table:::CsubsetVector, x, y) 2.17µs 2.3µs 383133. 3.95KB 0 10000
#>
#>
#> length(x)=10000000 length(y)=10000
#> # A tibble: 2 × 7
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
#> 1 x[y] 16.5µs 18.3µs 41355. 39.1KB 20.7 9995
#> 2 .Call(data.table:::CsubsetVector, x, y) 16.1µs 18.4µs 44637. 39.1KB 22.3 9995
#>
#>
#> length(x)=10000000 length(y)=100000
#> # A tibble: 2 × 7
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
#> 1 x[y] 212µs 245µs 3236. 391KB 16.9 1536
#> 2 .Call(data.table:::CsubsetVector, x, y) 154µs 178µs 4882. 391KB 23.9 2248
#>
#>
#> length(x)=10000000 length(y)=1000000
#> # A tibble: 2 × 7
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
#> 1 x[y] 3.06ms 4.64ms 224. 3.81MB 15.5 101
#> 2 .Call(data.table:::CsubsetVector, x, y) 1.81ms 2.19ms 423. 3.81MB 27.9 182
#>
#>
#> length(x)=100000000 length(y)=10
#> # A tibble: 2 × 7
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
#> 1 x[y] 176ns 185ns 4859594. 0B 0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y) 377ns 403ns 2309703. 0B 0 10000
#>
#>
#> length(x)=100000000 length(y)=100
#> # A tibble: 2 × 7
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
#> 1 x[y] 364ns 405ns 2074245. 448B 0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y) 565ns 610ns 1454738. 448B 0 10000
#>
#>
#> length(x)=100000000 length(y)=1000
#> # A tibble: 2 × 7
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
#> 1 x[y] 1.81µs 2µs 467334. 3.95KB 0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y) 2.16µs 3.63µs 265340. 3.95KB 0 10000
#>
#>
#> length(x)=100000000 length(y)=10000
#> # A tibble: 2 × 7
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
#> 1 x[y] 17.8µs 32.1µs 31618. 39.1KB 0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y) 17.5µs 21.9µs 42032. 39.1KB 4.20 9999
#>
#>
#> length(x)=100000000 length(y)=100000
#> # A tibble: 2 × 7
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
#> 1 x[y] 215µs 228µs 3573. 391KB 2.05 1746
#> 2 .Call(data.table:::CsubsetVector, x, y) 163µs 197µs 4772. 391KB 4.29 2225
#>
#>
#> length(x)=100000000 length(y)=1000000
#> # A tibble: 2 × 7
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
#> 1 x[y] 3.05ms 4.57ms 224. 3.81MB 0 113
#> 2 .Call(data.table:::CsubsetVector, x, y) 1.81ms 2.09ms 471. 3.81MB 4.36 216
#>
#>
#> length(x)=100000000 length(y)=10000000
#> # A tibble: 2 × 7
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
#> 1 x[y] 89.9ms 99.1ms 9.80 38.1MB 0 5
#> 2 .Call(data.table:::CsubsetVector, x, y) 43.5ms 44.3ms 21.5 38.1MB 2.15 10
@ben-schwen was there an expected improvement in performance? If so, I would suggest adding a test based on your benchmarks in #5568 (comment).
Generated via commit 0f6cf48
Download link for the artifact containing the test results: ↓ atime-results.zip
Time taken to finish the standard R installation steps: 12 minutes and 9 seconds
Time taken to run
No performance improvement was expected; I was just worried about introducing a regression back then.
….table into gforce_groupingVar_class
inst/tests/tests.Rraw
Outdated
test(2262.3, options=list(datatable.verbose=TRUE, datatable.optimize=0L), names(attributes(dt[, .N, b][,b])), c("class", "att"), output="GForce FALSE")
test(2262.4, options=list(datatable.verbose=TRUE, datatable.optimize=Inf), dt[, .N, b], data.table(b=dt$b, N=1L), output="GForce optimized j to")
test(2262.5, options=list(datatable.verbose=TRUE, datatable.optimize=Inf), dt[, .N, .(b,c)], data.table(b=dt$b, c=dt$c, N=1L), output="GForce optimized j to")
test(2262.6, options=list(datatable.verbose=TRUE, datatable.optimize=Inf), names(attributes(dt[, .N, b]$b)), c("class", "att"), output="GForce optimized j to") |
FYI slight tweak here: `[,b]` could also be testing `[.data.table` behavior; better to separate that into its own test if so desired. `$` keeps the tested behavior more strictly related to `by=` grouping.
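The distinction drawn in this review comment can be sketched as follows. The data and the names `dt`, `g`, `v`, and `res` are hypothetical and chosen only for illustration. For a plain column the two extraction forms return the same vector, so a difference that shows up only with `[, g]` would point at `[.data.table` rather than at `by=` grouping.

```r
library(data.table)

# Hypothetical example data (not taken from the PR's test file).
dt <- data.table(g = c("x", "y", "x"), v = 1:3)
res <- dt[, .N, by = g]

# Goes through `[.data.table` a second time, so this also exercises
# data.table's own column-selection code path.
via_bracket <- res[, g]

# Plain list-element extraction: no further `[.data.table` call involved.
via_dollar <- res$g

# For an ordinary character column both forms give the same vector.
print(identical(via_bracket, via_dollar))
```

Using `$` in a test therefore isolates what the grouped query itself produced.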
Closes #5567
The non-GForce equivalent issue was #442.
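As a rough illustration of what the tests in this PR check, the sketch below runs one grouped query with GForce disabled (`datatable.optimize=0L`) and enabled (`datatable.optimize=Inf`) and lists the attributes left on the grouping column. The data and the names `class_b`, `att`, and `attrs_by_level` are assumptions made for this example, not the contents of `tests.Rraw`. The tests additionally set `datatable.verbose=TRUE` to confirm from the printed messages which path was taken; that is left out here to keep the output short.

```r
library(data.table)

# Hypothetical data: the grouping column `b` carries a custom class and an
# extra attribute `att` (all names here are made up for illustration).
dt <- data.table(
  a = letters[1:3],
  b = structure(1:3, class = c("class_b", "integer"), att = "meta")
)

# Run the same grouped query with GForce disabled (0L) and enabled (Inf),
# and record which attribute names are present on the grouping column.
attrs_by_level <- lapply(list(0L, Inf), function(level) {
  old <- options(datatable.optimize = level)
  on.exit(options(old), add = TRUE)  # restore the previous setting
  names(attributes(dt[, .N, by = b]$b))
})
names(attrs_by_level) <- c("optimize=0", "optimize=Inf")
print(attrs_by_level)
```

With the fix from this PR, both entries are expected to list the same attribute names; on a data.table build without the fix, the `optimize=Inf` entry may differ.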