Gforce grouping var class #5568

Merged
MichaelChirico merged 17 commits into master from gforce_groupingVar_class
May 20, 2024

Conversation

ben-schwen
Member

@ben-schwen ben-schwen commented Dec 19, 2022

Closes #5567

The equivalent non-GForce issue was #442.
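For anyone who wants to see the behavior in question, here is a minimal sketch, not taken from the PR diff. The toy table and the helper name `by_attr_names` are assumptions made up for illustration; only `datatable.optimize` and the `dt[, .N, b]` pattern come from the tests in this PR. It compares the attribute names of the grouping column with GForce disabled and enabled.

```r
library(data.table)

# Hypothetical toy table: column b carries a custom class and an extra attribute.
dt <- data.table(
  a = letters[1:4],
  b = structure(1:4, class = c("class_b", "integer"), att = 1)
)

# Illustrative helper: attribute names of the grouping column in the result,
# evaluated under a given datatable.optimize level (restored on exit).
by_attr_names <- function(optimize) {
  old <- options(datatable.optimize = optimize)
  on.exit(options(old))
  names(attributes(dt[, .N, by = b]$b))
}

no_gforce <- by_attr_names(0L)   # GForce disabled
gforce    <- by_attr_names(Inf)  # GForce enabled, the path this PR touches
print(list(no_gforce = no_gforce, gforce = gforce))
```

The tests added in this PR expect `c("class", "att")` on both paths; on a data.table build that predates the fix, the GForce result may differ.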

@codecov

codecov bot commented Dec 19, 2022

Codecov Report

Merging #5568 (4243f62) into master (cb8aeff) will not change coverage.
The diff coverage is 100.00%.

❗ Current head 4243f62 differs from pull request most recent head 5aa5e64. Consider uploading reports for the commit 5aa5e64 to get more accurate results

@@           Coverage Diff           @@
##           master    #5568   +/-   ##
=======================================
  Coverage   97.49%   97.49%           
=======================================
  Files          80       80           
  Lines       14810    14810           
=======================================
  Hits        14439    14439           
  Misses        371      371           
Impacted Files Coverage Δ
R/data.table.R 99.74% <100.00%> (ø)


@ben-schwen
Member Author

Benchmarks

On [ there doesn't seem to be a noticeable difference. Since the CsubsetVector call happens only once with GForce, we also avoid the throttling problems.

I'm curious why CsubsetVector isn't used in dogroups anymore, @mattdowle?

cc()
N = 1e7
DT = data.table(x = sample(N), y = sample(1e2,N,TRUE))
# warm start
invisible(`[.data.table`(DT,j=.N,by=y))
invisible(data.table:::`[.data.table`(DT,j=.N,by=y))

microbenchmark::microbenchmark(
  `[.data.table`(DT,j=.N,by=y),               # proposed fix
  data.table:::`[.data.table`(DT,j=.N,by=y),  # master before
  times = 100L, unit = "s"
)
# Unit: seconds
#                                             expr        min         lq      mean    median       uq       max neval cld
#               `[.data.table`(DT, j = .N, by = y) 0.08319297 0.09171692 0.1236303 0.1033564 0.154837 0.2407570   100   a
#  data.table:::`[.data.table`(DT, j = .N, by = y) 0.08162183 0.09237532 0.1242161 0.1009008 0.161996 0.2534246   100   a

DT = data.table(x = sample(N), y = sample(1e3,N,TRUE))
microbenchmark::microbenchmark(
  `[.data.table`(DT,j=.N,by=y),               # proposed fix
  data.table:::`[.data.table`(DT,j=.N,by=y),  # master before
  times = 100L, unit = "s"
)
# Unit: seconds
#                                             expr        min        lq      mean    median        uq       max neval cld
#               `[.data.table`(DT, j = .N, by = y) 0.08871780 0.1055516 0.1316296 0.1132985 0.1571205 0.2616192   100   a
#  data.table:::`[.data.table`(DT, j = .N, by = y) 0.09055208 0.1053714 0.1388291 0.1164063 0.1821008 0.2650133   100   a


DT = data.table(x = sample(N), y = sample(1e4,N,TRUE))
microbenchmark::microbenchmark(
  `[.data.table`(DT,j=.N,by=y),               # proposed fix
  data.table:::`[.data.table`(DT,j=.N,by=y),  # master before
  times = 100L, unit = "s"
)
# Unit: seconds
#                                             expr        min        lq      mean    median        uq       max neval cld
#               `[.data.table`(DT, j = .N, by = y) 0.10052261 0.1118025 0.1393742 0.1193117 0.1838666 0.2102043   100   a
#  data.table:::`[.data.table`(DT, j = .N, by = y) 0.09610407 0.1078412 0.1309245 0.1165225 0.1371610 0.2084393   100   a

DT = data.table(x = sample(N), y = sample(1e5,N,TRUE))
microbenchmark::microbenchmark(
  `[.data.table`(DT,j=.N,by=y),               # proposed fix
  data.table:::`[.data.table`(DT,j=.N,by=y),  # master before
  times = 100L, unit = "s"
)
# Unit: seconds
#                                             expr       min        lq      mean    median        uq       max neval cld
#               `[.data.table`(DT, j = .N, by = y) 0.1648205 0.1777414 0.2407806 0.1851763 0.2554005 0.8988475   100   a
#  data.table:::`[.data.table`(DT, j = .N, by = y) 0.1646772 0.1780597 0.2441777 0.1871644 0.2588514 0.9152462   100   a
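To put a number on "no noticeable difference", the relative gap between the median timings can be worked out directly. This is a small sketch in base R that uses only the medians printed above; the data frame and column names are illustrative.

```r
# Median timings in seconds, copied from the four benchmark runs above
# (number of distinct values in the by column: 1e2, 1e3, 1e4, 1e5).
medians <- data.frame(
  groups   = c(1e2, 1e3, 1e4, 1e5),
  proposed = c(0.1033564, 0.1132985, 0.1193117, 0.1851763),
  master   = c(0.1009008, 0.1164063, 0.1165225, 0.1871644)
)

# Relative difference of the proposed fix versus master, in percent.
# Positive means the proposed fix was slower in that run.
medians$rel_diff_pct <- 100 * (medians$proposed - medians$master) / medians$master
print(medians)
```

The differences stay within about 3% in either direction and change sign between runs, which is consistent with noise rather than a systematic slowdown.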

@ben-schwen
Member Author

ben-schwen commented Dec 27, 2022

Also adding a benchmark comparing [ and CsubsetVector. Using CsubsetVector instead of [ seems to add only negligible overhead (roughly 200ns for short index vectors). For larger index vectors (length(y) of 1e5 and up in the runs below), CsubsetVector even looks faster than [.

library(data.table)
bensch <- function(n) {
  for (i in seq(2,n)) {
    N <- 10^i
    x <- sample(N)
    for (j in seq(1, i-1)) {
      M <- 10^j
      y <- sample(M)
      cat(sprintf("length(x)=%d\t\tlength(y)=%d\n", N, M))
      print(bench::mark(x[y], .Call(data.table:::CsubsetVector, x, y))[,1:7])
      cat("\n\n")
    }
  }
}
bensch(8)
#> length(x)=100        length(y)=10
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                       172ns    179ns  4824681.        0B        0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y)    368ns    393ns  2021012.        0B        0 10000
#> 
#> 
#> length(x)=1000       length(y)=10
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                       172ns    180ns  4770361.        0B        0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y)    369ns    465ns  1502658.        0B        0 10000
#> 
#> 
#> length(x)=1000       length(y)=100
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                       352ns    388ns  2122521.      448B        0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y)    567ns    613ns  1240114.      448B        0 10000
#> 
#> 
#> length(x)=10000      length(y)=10
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                       176ns    185ns  3959722.        0B        0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y)    377ns    398ns  2123734.        0B        0 10000
#> 
#> 
#> length(x)=10000      length(y)=100
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                       356ns    448ns  1502203.      448B        0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y)    566ns    596ns  1493639.      448B        0 10000
#> 
#> 
#> length(x)=10000      length(y)=1000
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                      1.75µs    3.2µs   301182.    3.95KB      0   10000
#> 2 .Call(data.table:::CsubsetVector, x, y)   2.17µs   2.23µs   390908.    3.95KB     39.1  9999
#> 
#> 
#> length(x)=100000     length(y)=10
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                       172ns    258ns  2616683.        0B        0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y)    364ns    417ns  2106109.        0B        0 10000
#> 
#> 
#> length(x)=100000     length(y)=100
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                       353ns    381ns  2338254.      448B        0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y)    576ns    621ns  1533174.      448B        0 10000
#> 
#> 
#> length(x)=100000     length(y)=1000
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                      1.78µs    1.9µs   474043.    3.95KB     47.4  9999
#> 2 .Call(data.table:::CsubsetVector, x, y)   2.09µs   2.24µs   332232.    3.95KB     33.2  9999
#> 
#> 
#> length(x)=100000     length(y)=10000
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                      16.4µs   16.9µs    49189.    39.1KB     34.5  9993
#> 2 .Call(data.table:::CsubsetVector, x, y)   25.8µs   27.8µs    34104.    39.1KB     23.9  9993
#> 
#> 
#> length(x)=1000000        length(y)=10
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                       180ns    191ns  4576763.        0B        0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y)    363ns    409ns  2286424.        0B        0 10000
#> 
#> 
#> length(x)=1000000        length(y)=100
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                       363ns    551ns  1466453.      448B        0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y)    587ns    635ns  1342725.      448B        0 10000
#> 
#> 
#> length(x)=1000000        length(y)=1000
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                      1.79µs   1.95µs   464367.    3.95KB     46.4  9999
#> 2 .Call(data.table:::CsubsetVector, x, y)   2.18µs   2.28µs   418613.    3.95KB      0   10000
#> 
#> 
#> length(x)=1000000        length(y)=10000
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                      16.9µs   17.7µs    51919.    39.1KB     41.6  9992
#> 2 .Call(data.table:::CsubsetVector, x, y)   16.1µs   17.9µs    46705.    39.1KB     37.4  9992
#> 
#> 
#> length(x)=1000000        length(y)=100000
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                       215µs    233µs     3766.     391KB     30.4  1737
#> 2 .Call(data.table:::CsubsetVector, x, y)    152µs    170µs     5357.     391KB     44.7  2395
#> 
#> 
#> length(x)=10000000       length(y)=10
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                       167ns    182ns  4813768.        0B        0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y)    375ns    399ns  2317283.        0B        0 10000
#> 
#> 
#> length(x)=10000000       length(y)=100
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                       355ns    387ns  2361763.      448B        0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y)    571ns    611ns  1468815.      448B        0 10000
#> 
#> 
#> length(x)=10000000       length(y)=1000
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                      1.74µs      2µs   408265.    3.95KB     40.8  9999
#> 2 .Call(data.table:::CsubsetVector, x, y)   2.17µs    2.3µs   383133.    3.95KB      0   10000
#> 
#> 
#> length(x)=10000000       length(y)=10000
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                      16.5µs   18.3µs    41355.    39.1KB     20.7  9995
#> 2 .Call(data.table:::CsubsetVector, x, y)   16.1µs   18.4µs    44637.    39.1KB     22.3  9995
#> 
#> 
#> length(x)=10000000       length(y)=100000
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                       212µs    245µs     3236.     391KB     16.9  1536
#> 2 .Call(data.table:::CsubsetVector, x, y)    154µs    178µs     4882.     391KB     23.9  2248
#> 
#> 
#> length(x)=10000000       length(y)=1000000
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                      3.06ms   4.64ms      224.    3.81MB     15.5   101
#> 2 .Call(data.table:::CsubsetVector, x, y)   1.81ms   2.19ms      423.    3.81MB     27.9   182
#> 
#> 
#> length(x)=100000000      length(y)=10
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                       176ns    185ns  4859594.        0B        0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y)    377ns    403ns  2309703.        0B        0 10000
#> 
#> 
#> length(x)=100000000      length(y)=100
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                       364ns    405ns  2074245.      448B        0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y)    565ns    610ns  1454738.      448B        0 10000
#> 
#> 
#> length(x)=100000000      length(y)=1000
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                      1.81µs      2µs   467334.    3.95KB        0 10000
#> 2 .Call(data.table:::CsubsetVector, x, y)   2.16µs   3.63µs   265340.    3.95KB        0 10000
#> 
#> 
#> length(x)=100000000      length(y)=10000
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                      17.8µs   32.1µs    31618.    39.1KB     0    10000
#> 2 .Call(data.table:::CsubsetVector, x, y)   17.5µs   21.9µs    42032.    39.1KB     4.20  9999
#> 
#> 
#> length(x)=100000000      length(y)=100000
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                       215µs    228µs     3573.     391KB     2.05  1746
#> 2 .Call(data.table:::CsubsetVector, x, y)    163µs    197µs     4772.     391KB     4.29  2225
#> 
#> 
#> length(x)=100000000      length(y)=1000000
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                      3.05ms   4.57ms      224.    3.81MB     0      113
#> 2 .Call(data.table:::CsubsetVector, x, y)   1.81ms   2.09ms      471.    3.81MB     4.36   216
#> 
#> 
#> length(x)=100000000      length(y)=10000000
#> # A tibble: 2 × 7
#>   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
#> 1 x[y]                                      89.9ms   99.1ms      9.80    38.1MB     0        5
#> 2 .Call(data.table:::CsubsetVector, x, y)   43.5ms   44.3ms     21.5     38.1MB     2.15    10
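As a rough summary of the largest case, the ratio of the two medians can be computed per index length. This sketch uses only the medians printed above for length(x)=1e8; the variable names are illustrative, and a single run like this is noisy, so the ratios are indicative at best.

```r
# Median timings copied from the length(x)=1e8 rows above, converted to seconds.
len_y      <- 10^(1:7)
bracket    <- c(185e-9, 405e-9, 2e-6, 32.1e-6, 228e-6, 4.57e-3, 99.1e-3)   # x[y]
subset_vec <- c(403e-9, 610e-9, 3.63e-6, 21.9e-6, 197e-6, 2.09e-3, 44.3e-3) # CsubsetVector

# Ratio > 1 means CsubsetVector had the lower median for that index length.
speedup <- bracket / subset_vec
print(data.frame(len_y, speedup = round(speedup, 2)))
```

In this run the ratio is below 1 for length(y) up to 1000 and above 1 from 1e4 upward, reaching a bit more than 2 for the two longest index vectors.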

@tdhock
Member

tdhock commented May 17, 2024

> On [ there doesn't seem to be a noticeable difference.

@ben-schwen was there an expected improvement in performance? If so, I would suggest adding a test based on your benchmarks #5568 (comment)


github-actions bot commented May 17, 2024

Comparison Plot

Generated via commit 0f6cf48

Download link for the artifact containing the test results: ↓ atime-results.zip

Time taken to finish the standard R installation steps: 12 minutes and 9 seconds

Time taken to run atime::atime_pkg on the tests: 3 minutes and 30 seconds

@ben-schwen
Member Author

ben-schwen commented May 17, 2024

> On [ there doesn't seem to be a noticeable difference.
>
> @ben-schwen was there an expected improvement in performance? If so, I would suggest adding a test based on your benchmarks #5568 (comment)

No performance improvement was expected; I was just afraid of introducing a regression back then.

test(2262.3, options=list(datatable.verbose=TRUE, datatable.optimize=0L), names(attributes(dt[, .N, b][,b])), c("class", "att"), output="GForce FALSE")
test(2262.4, options=list(datatable.verbose=TRUE, datatable.optimize=Inf), dt[, .N, b], data.table(b=dt$b, N=1L), output="GForce optimized j to")
test(2262.5, options=list(datatable.verbose=TRUE, datatable.optimize=Inf), dt[, .N, .(b,c)], data.table(b=dt$b, c=dt$c, N=1L), output="GForce optimized j to")
test(2262.6, options=list(datatable.verbose=TRUE, datatable.optimize=Inf), names(attributes(dt[, .N, b]$b)), c("class", "att"), output="GForce optimized j to")
Member

FYI, slight tweak here: [,b] could also be testing [.data.table behavior; better to separate that into its own test if so desired. Using $ keeps the tested behavior more strictly related to by= grouping.
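A small sketch of the distinction, using a made-up table (the column contents and variable names are assumptions, not the test fixture): both forms return the grouping column, but only one of them goes through a second `[.data.table` call.

```r
library(data.table)

# Hypothetical grouping column with a custom class and an extra attribute.
dt  <- data.table(b = structure(1:3, class = c("class_b", "integer"), att = 1))
res <- dt[, .N, by = b]

via_dollar  <- res$b     # plain list-style extraction of the column
via_bracket <- res[, b]  # a second `[.data.table` call with its own j evaluation

# Attribute names seen through each extraction path.
print(names(attributes(via_dollar)))
print(names(attributes(via_bracket)))
```

If the two ever disagreed, a test written with `[,b]` could not tell whether grouping or the extraction step was responsible, which is why `$` isolates the by= behavior better.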

@MichaelChirico MichaelChirico merged commit 4c5f1e7 into master May 20, 2024
4 checks passed
@MichaelChirico MichaelChirico deleted the gforce_groupingVar_class branch May 20, 2024 17:39
Successfully merging this pull request may close these issues.

Losing classes when grouping
4 participants