Gforce grouping var class #5568

ben-schwen · 2022-12-19T20:27:05Z

The equivalent issue for the non-GForce path was #442.
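
For context, here is a minimal sketch of how one might check whether a grouping column keeps its class and attributes with and without GForce. The class name `myclass`, the attribute value, and the helper `attr_names_by_optimize` are hypothetical; only the `datatable.optimize` option, `by=` grouping, and the `att` attribute name are taken from this PR's tests. The result depends on the installed data.table version, so no output is shown.

```r
library(data.table)

# Hypothetical example data: an integer grouping column with a custom class and attribute.
dt = data.table(b = structure(1:3, class = "myclass", att = "some attribute"))

# Hypothetical helper: group by `b` under a given optimization level and report
# which attributes survive on the grouping column of the result.
attr_names_by_optimize = function(level) {
  old = options(datatable.optimize = level)
  on.exit(options(old))
  names(attributes(dt[, .N, by = b]$b))
}

no_gforce = attr_names_by_optimize(0L)   # GForce disabled
gforce    = attr_names_by_optimize(Inf)  # GForce enabled
print(list(no_gforce = no_gforce, gforce = gforce))
```

If the two entries differ, the GForce path is treating the grouping column differently from the non-optimized path.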

codecov · 2022-12-19T21:11:10Z

Codecov Report

Merging #5568 (4243f62) into master (cb8aeff) will not change coverage.
The diff coverage is 100.00%.

❗ Current head 4243f62 differs from pull request most recent head 5aa5e64. Consider uploading reports for the commit 5aa5e64 to get more accurate results

@@           Coverage Diff           @@
##           master    #5568   +/-   ##
=======================================
  Coverage   97.49%   97.49%           
=======================================
  Files          80       80           
  Lines       14810    14810           
=======================================
  Hits        14439    14439           
  Misses        371      371

Impacted Files	Coverage Δ
R/data.table.R	`99.74% <100.00%> (ø)`

ben-schwen · 2022-12-20T17:08:38Z

Benchmarks

On [ there doesn't seem to be a noticeable difference. Since CsubsetVector is called only once under GForce, we also avoid the throttling problems.

I'm curious why CsubsetVector isn't used in dogroups anymore, @mattdowle?

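# cc() is data.table's developer helper (.dev/cc.R) that loads the working-tree code,
# so the unqualified `[.data.table` below is the proposed fix and
# data.table:::`[.data.table` is the installed master build.
cc()
N = 1e7
DT = data.table(x = sample(N), y = sample(1e2,N,TRUE))
# warm start
invisible(`[.data.table`(DT,j=.N,by=y))
invisible(data.table:::`[.data.table`(DT,j=.N,by=y))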

microbenchmark::microbenchmark(
  `[.data.table`(DT,j=.N,by=y),               # proposed fix
  data.table:::`[.data.table`(DT,j=.N,by=y),  # master before
  times = 100L, unit = "s"
)
# Unit: seconds
#                                             expr        min         lq      mean    median       uq       max neval cld
#               `[.data.table`(DT, j = .N, by = y) 0.08319297 0.09171692 0.1236303 0.1033564 0.154837 0.2407570   100   a
#  data.table:::`[.data.table`(DT, j = .N, by = y) 0.08162183 0.09237532 0.1242161 0.1009008 0.161996 0.2534246   100   a

DT = data.table(x = sample(N), y = sample(1e3,N,TRUE))
microbenchmark::microbenchmark(
  `[.data.table`(DT,j=.N,by=y),               # proposed fix
  data.table:::`[.data.table`(DT,j=.N,by=y),  # master before
  times = 100L, unit = "s"
)
# Unit: seconds
#                                             expr        min        lq      mean    median        uq       max neval cld
#               `[.data.table`(DT, j = .N, by = y) 0.08871780 0.1055516 0.1316296 0.1132985 0.1571205 0.2616192   100   a
#  data.table:::`[.data.table`(DT, j = .N, by = y) 0.09055208 0.1053714 0.1388291 0.1164063 0.1821008 0.2650133   100   a


DT = data.table(x = sample(N), y = sample(1e4,N,TRUE))
microbenchmark::microbenchmark(
  `[.data.table`(DT,j=.N,by=y),               # proposed fix
  data.table:::`[.data.table`(DT,j=.N,by=y),  # master before
  times = 100L, unit = "s"
)
# Unit: seconds
#                                             expr        min        lq      mean    median        uq       max neval cld
#               `[.data.table`(DT, j = .N, by = y) 0.10052261 0.1118025 0.1393742 0.1193117 0.1838666 0.2102043   100   a
#  data.table:::`[.data.table`(DT, j = .N, by = y) 0.09610407 0.1078412 0.1309245 0.1165225 0.1371610 0.2084393   100   a

DT = data.table(x = sample(N), y = sample(1e5,N,TRUE))
microbenchmark::microbenchmark(
  `[.data.table`(DT,j=.N,by=y),               # proposed fix
  data.table:::`[.data.table`(DT,j=.N,by=y),  # master before
  times = 100L, unit = "s"
)
# Unit: seconds
#                                             expr       min        lq      mean    median        uq       max neval cld
#               `[.data.table`(DT, j = .N, by = y) 0.1648205 0.1777414 0.2407806 0.1851763 0.2554005 0.8988475   100   a
#  data.table:::`[.data.table`(DT, j = .N, by = y) 0.1646772 0.1780597 0.2441777 0.1871644 0.2588514 0.9152462   100   a

ben-schwen · 2022-12-27T16:41:40Z

Also adding a benchmark comparing [ and CsubsetVector. CsubsetVector carries a small fixed overhead (roughly 200ns in the runs below), which is negligible in absolute terms. For larger index vectors (length(y) of about 1e5 and up), CsubsetVector is even faster than [.

library(data.table)
bensch <- function(n) {
  for (i in seq(2,n)) {
    N <- 10^i
    x <- sample(N)
    for (j in seq(1, i-1)) {
      M <- 10^j
      y <- sample(M)
      cat(sprintf("length(x)=%d\t\tlength(y)=%d\n", N, M))
      print(bench::mark(x[y], .Call(data.table:::CsubsetVector, x, y))[,1:7])
      cat("\n\n")
    }
  }
}
bensch(8)
#> length(x)=100        length(y)=10
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                       172ns    179ns  4824681.        0B        0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y)    368ns    393ns  2021012.        0B        0 10000
#> 
#> 
#> length(x)=1000       length(y)=10
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                       172ns    180ns  4770361.        0B        0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y)    369ns    465ns  1502658.        0B        0 10000
#> 
#> 
#> length(x)=1000       length(y)=100
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                       352ns    388ns  2122521.      448B        0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y)    567ns    613ns  1240114.      448B        0 10000
#> 
#> 
#> length(x)=10000      length(y)=10
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                       176ns    185ns  3959722.        0B        0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y)    377ns    398ns  2123734.        0B        0 10000
#> 
#> 
#> length(x)=10000      length(y)=100
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                       356ns    448ns  1502203.      448B        0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y)    566ns    596ns  1493639.      448B        0 10000
#> 
#> 
#> length(x)=10000      length(y)=1000
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                      1.75µs    3.2µs   301182.    3.95KB      0   10000
#> 2 .Call(data.table:::CsubsetVector, x, y)   2.17µs   2.23µs   390908.    3.95KB     39.1  9999
#> 
#> 
#> length(x)=100000     length(y)=10
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                       172ns    258ns  2616683.        0B        0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y)    364ns    417ns  2106109.        0B        0 10000
#> 
#> 
#> length(x)=100000     length(y)=100
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                       353ns    381ns  2338254.      448B        0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y)    576ns    621ns  1533174.      448B        0 10000
#> 
#> 
#> length(x)=100000     length(y)=1000
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                      1.78µs    1.9µs   474043.    3.95KB     47.4  9999
#> 2 .Call(data.table:::CsubsetVector, x, y)   2.09µs   2.24µs   332232.    3.95KB     33.2  9999
#> 
#> 
#> length(x)=100000     length(y)=10000
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                      16.4µs   16.9µs    49189.    39.1KB     34.5  9993
#> 2 .Call(data.table:::CsubsetVector, x, y)   25.8µs   27.8µs    34104.    39.1KB     23.9  9993
#> 
#> 
#> length(x)=1000000        length(y)=10
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                       180ns    191ns  4576763.        0B        0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y)    363ns    409ns  2286424.        0B        0 10000
#> 
#> 
#> length(x)=1000000        length(y)=100
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                       363ns    551ns  1466453.      448B        0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y)    587ns    635ns  1342725.      448B        0 10000
#> 
#> 
#> length(x)=1000000        length(y)=1000
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                      1.79µs   1.95µs   464367.    3.95KB     46.4  9999
#> 2 .Call(data.table:::CsubsetVector, x, y)   2.18µs   2.28µs   418613.    3.95KB      0   10000
#> 
#> 
#> length(x)=1000000        length(y)=10000
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                      16.9µs   17.7µs    51919.    39.1KB     41.6  9992
#> 2 .Call(data.table:::CsubsetVector, x, y)   16.1µs   17.9µs    46705.    39.1KB     37.4  9992
#> 
#> 
#> length(x)=1000000        length(y)=100000
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                       215µs    233µs     3766.     391KB     30.4  1737
#> 2 .Call(data.table:::CsubsetVector, x, y)    152µs    170µs     5357.     391KB     44.7  2395
#> 
#> 
#> length(x)=10000000       length(y)=10
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                       167ns    182ns  4813768.        0B        0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y)    375ns    399ns  2317283.        0B        0 10000
#> 
#> 
#> length(x)=10000000       length(y)=100
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                       355ns    387ns  2361763.      448B        0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y)    571ns    611ns  1468815.      448B        0 10000
#> 
#> 
#> length(x)=10000000       length(y)=1000
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                      1.74µs      2µs   408265.    3.95KB     40.8  9999
#> 2 .Call(data.table:::CsubsetVector, x, y)   2.17µs    2.3µs   383133.    3.95KB      0   10000
#> 
#> 
#> length(x)=10000000       length(y)=10000
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                      16.5µs   18.3µs    41355.    39.1KB     20.7  9995
#> 2 .Call(data.table:::CsubsetVector, x, y)   16.1µs   18.4µs    44637.    39.1KB     22.3  9995
#> 
#> 
#> length(x)=10000000       length(y)=100000
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                       212µs    245µs     3236.     391KB     16.9  1536
#> 2 .Call(data.table:::CsubsetVector, x, y)    154µs    178µs     4882.     391KB     23.9  2248
#> 
#> 
#> length(x)=10000000       length(y)=1000000
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                      3.06ms   4.64ms      224.    3.81MB     15.5   101
#> 2 .Call(data.table:::CsubsetVector, x, y)   1.81ms   2.19ms      423.    3.81MB     27.9   182
#> 
#> 
#> length(x)=100000000      length(y)=10
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                       176ns    185ns  4859594.        0B        0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y)    377ns    403ns  2309703.        0B        0 10000
#> 
#> 
#> length(x)=100000000      length(y)=100
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                       364ns    405ns  2074245.      448B        0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y)    565ns    610ns  1454738.      448B        0 10000
#> 
#> 
#> length(x)=100000000      length(y)=1000
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                      1.81µs      2µs   467334.    3.95KB        0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y)   2.16µs   3.63µs   265340.    3.95KB        0 10000
#> 
#> 
#> length(x)=100000000      length(y)=10000
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                      17.8µs   32.1µs    31618.    39.1KB     0    10000
#> 2 .Call(data.table:::CsubsetVector, x, y)   17.5µs   21.9µs    42032.    39.1KB     4.20  9999
#> 
#> 
#> length(x)=100000000      length(y)=100000
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                       215µs    228µs     3573.     391KB     2.05  1746
#> 2 .Call(data.table:::CsubsetVector, x, y)    163µs    197µs     4772.     391KB     4.29  2225
#> 
#> 
#> length(x)=100000000      length(y)=1000000
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                      3.05ms   4.57ms      224.    3.81MB     0      113
#> 2 .Call(data.table:::CsubsetVector, x, y)   1.81ms   2.09ms      471.    3.81MB     4.36   216
#> 
#> 
#> length(x)=100000000      length(y)=10000000
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                      89.9ms   99.1ms      9.80    38.1MB     0        5
#> 2 .Call(data.table:::CsubsetVector, x, y)   43.5ms   44.3ms     21.5     38.1MB     2.15    10
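
As a rough reading aid for the tables above, the following sketch computes the ratio of median timings for a few of the length(x)=1e8 rows. The medians are copied from the output above; the data frame and its column names are only illustrative.

```r
# Medians copied from the bench::mark output above, converted to seconds.
# All rows use length(x) = 1e8.
medians = data.frame(
  len_y         = c(10, 1e3, 1e5, 1e7),
  bracket       = c(185e-9, 2e-6, 228e-6, 99.1e-3),    # x[y]
  subset_vector = c(403e-9, 3.63e-6, 197e-6, 44.3e-3)  # .Call(CsubsetVector, x, y)
)

# Ratio of CsubsetVector time to `[` time for each row.
medians$ratio = medians$subset_vector / medians$bracket
print(medians)
```

A ratio above 1 means [ was faster in that row; a ratio below 1 means CsubsetVector was faster.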

tdhock · 2024-05-17T21:44:38Z

> On [ there doesn't seem to be a noticeable difference.

@ben-schwen was there an expected improvement in performance? If so, I would suggest adding a test based on your benchmarks #5568 (comment)

github-actions · 2024-05-17T21:44:50Z

Generated via commit 0f6cf48

Download link for the artifact containing the test results: ↓ atime-results.zip

Time taken to finish the standard R installation steps: 12 minutes and 9 seconds

Time taken to run atime::atime_pkg on the tests: 3 minutes and 30 seconds

ben-schwen · 2024-05-17T21:51:08Z

> > On [ there doesn't seem to be a noticeable difference.
>
> @ben-schwen was there an expected improvement in performance? If so, I would suggest adding a test based on your benchmarks #5568 (comment)

No improvement in performance was expected; I was just worried about introducing a regression back then.

MichaelChirico · 2024-05-20T17:21:24Z

inst/tests/tests.Rraw

+test(2262.3, options=list(datatable.verbose=TRUE, datatable.optimize=0L), names(attributes(dt[, .N, b][,b])), c("class", "att"), output="GForce FALSE")
+test(2262.4, options=list(datatable.verbose=TRUE, datatable.optimize=Inf), dt[, .N, b], data.table(b=dt$b, N=1L), output="GForce optimized j to")
+test(2262.5, options=list(datatable.verbose=TRUE, datatable.optimize=Inf), dt[, .N, .(b,c)], data.table(b=dt$b, c=dt$c, N=1L), output="GForce optimized j to")
+test(2262.6, options=list(datatable.verbose=TRUE, datatable.optimize=Inf), names(attributes(dt[, .N, b]$b)), c("class", "att"), output="GForce optimized j to")


FYI, a slight tweak here: [,b] could also be testing [.data.table behavior; better to separate that into its own test if so desired. Using $ keeps the tested behavior more strictly tied to by= grouping.
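
To illustrate the distinction, here is a small sketch with made-up data (the class name `myclass` and the attribute value are hypothetical). It assumes that `res[, b]` dispatches to `[.data.table` a second time, whereas `res$b` is plain column extraction from the underlying list, so only the latter isolates what by= produced.

```r
library(data.table)

# Made-up data: grouping column with a custom class and attribute.
dt  = data.table(b = structure(1:3, class = "myclass", att = "x"))
res = dt[, .N, by = b]

via_bracket = res[, b]  # evaluated by `[.data.table`, i.e. a second data.table call
via_dollar  = res$b     # direct column extraction from the underlying list

# If these differ, the difference comes from `[.data.table`, not from by= grouping.
print(identical(attributes(via_bracket), attributes(via_dollar)))
```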

ben-schwen added 5 commits December 19, 2022 14:54

copy classes to grouping vars

ae1b780

add tests

39ef816

add different optimization levels to test

86962f5

add news

be5dabd

add output

5aa5e64

jangorecki reviewed Dec 19, 2022

R/data.table.R (outdated, resolved)

Merge branch 'master' into gforce_groupingVar_class

472b58c

ben-schwen requested a review from MichaelChirico as a code owner May 17, 2024 21:28

fix news

6faf43d

tdhock reviewed May 17, 2024

NEWS.md (outdated, resolved)

fix typo

fe83896

MichaelChirico reviewed May 19, 2024

NEWS.md (outdated, resolved)

MichaelChirico reviewed May 19, 2024

NEWS.md (outdated, resolved)

MichaelChirico reviewed May 19, 2024

inst/tests/tests.Rraw (outdated, resolved)

MichaelChirico and others added 7 commits May 18, 2024 21:58

Merge branch 'master' into gforce_groupingVar_class

bb7737a

add NEWS info and tests about attributes

9752f93

Merge branch 'gforce_groupingVar_class' of github.com:Rdatatable/data.table into gforce_groupingVar_class

3537f3f

hone NEWS

d051b5e

hone comment

95c17fb

Reframe test annotation

8b2861d

tweak test

1556797

MichaelChirico reviewed May 20, 2024

MichaelChirico added 2 commits May 20, 2024 10:21

Second call site

918d962

Merge branch 'master' into gforce_groupingVar_class

0f6cf48

MichaelChirico approved these changes May 20, 2024

MichaelChirico merged commit 4c5f1e7 into master May 20, 2024
4 checks passed

MichaelChirico deleted the gforce_groupingVar_class branch May 20, 2024 17:39
