fix: don't declare implicitly exported functions public (#1147)
* don't export deprecated functions

`@deprecate` exports the passed functions by default, which I assume was not intended here. This actually causes precompilation errors on Julia 1.12, since these functions are also declared public.

* remove public declaration instead

* Update src/helpers/recursive_ops.jl
simeonschaub authored Dec 28, 2024
1 parent 90997a0 commit ac2879b
Showing 1 changed file with 1 addition and 3 deletions.
4 changes: 1 addition & 3 deletions src/helpers/recursive_ops.jl
@@ -111,6 +111,4 @@ For the following types it directly defines recursion rules:
 """
 function recursive_map end
 
-@compat(public,
-(recursive_add!!, recursive_copyto!, recursive_eltype,
-recursive_make_zero, recursive_map, recursive_make_zero!!))
+@compat(public, (recursive_eltype,))
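
For context, the first approach listed in the commit message (not exporting the deprecated functions) can be sketched with the optional third argument of `@deprecate`, which controls whether the old name is exported. This is only an illustration: the module `Demo` and the functions `old_sum` and `new_sum` are hypothetical and do not appear in the repository, and the merged change removed the `public` declaration instead.

```julia
module Demo

# Hypothetical replacement function.
new_sum(x) = sum(x)

# `@deprecate old new [export_old=true]`: passing `false` as the third
# argument defines the deprecated method without exporting `old_sum`.
@deprecate old_sum(x) new_sum(x) false

end # module Demo

# The deprecated name is still callable when qualified, but is not exported.
Demo.old_sum([1, 2, 3])
Base.isexported(Demo, :old_sum)
```

With the default `export_old=true`, `old_sum` would be exported, which is the situation the commit message describes as conflicting with a `public` declaration of the same name.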

1 comment on commit ac2879b

@github-actions

Lux Benchmarks

Benchmark suite Current: ac2879b Previous: d962073 Ratio
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3625 ns 3833 ns 0.95
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4541 ns 4250 ns 1.07
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 5125 ns 4666 ns 1.10
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3791 ns 4041.5 ns 0.94
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 61743 ns
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 10125 ns 10459 ns 0.97
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 10875 ns 10417 ns 1.04
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 10334 ns 10083 ns 1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 10417 ns 10625 ns 0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 430910 ns
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1209 ns 1125 ns 1.07
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 1209 ns 1375 ns 0.88
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 1500 ns 1375 ns 1.09
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 1042 ns 1208 ns 0.86
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 18223.5 ns
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 4000 ns 3958 ns 1.01
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 4042 ns 4125 ns 0.98
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4334 ns 4208 ns 1.03
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 3875 ns 3958 ns 0.98
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 110886 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 56709 ns 57917 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 38334 ns 46459 ns 0.83
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46917 ns 46750 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 81750 ns 82708 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 37932 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2043708.5 ns 2047958 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2096520.5 ns 2090000 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2096437.5 ns 2093917 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1991167 ns 1976812.5 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 197294.5 ns
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 144625 ns 146708 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 145667 ns 182667 ns 0.80
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 144916 ns 145833 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 144854.5 ns 143583 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 166157.5 ns
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1116791 ns 1151625.5 ns 0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1150459 ns 1117646 ns 1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1128083 ns 1124084 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1121458 ns 1165146 ns 0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 535998 ns
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3417 ns 3500 ns 0.98
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4042 ns 4083 ns 0.99
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4459 ns 4042 ns 1.10
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3187.5 ns 3916 ns 0.81
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 72464.5 ns
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9417 ns 9083 ns 1.04
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9458 ns 9166 ns 1.03
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9750 ns 9125 ns 1.07
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8708 ns 8854.5 ns 0.98
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 469472 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 14375 ns 17334 ns 0.83
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 16208 ns 18542 ns 0.87
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 18750 ns 17834 ns 1.05
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 16875 ns 16333 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 54038 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 213375 ns 214916.5 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 220000 ns 214541 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 217250 ns 213500 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 213916 ns 220667 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 270771 ns
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 541 ns 542 ns 1.00
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 542 ns 625 ns 0.87
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 708 ns 583 ns 1.21
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 667 ns 625 ns 1.07
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 17308 ns
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1417 ns 1458 ns 0.97
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1375 ns 1750 ns 0.79
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1541 ns 1458 ns 1.06
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1417 ns 1625 ns 0.87
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 101606.5 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7083 ns 6208 ns 1.14
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5250 ns 5958 ns 0.88
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5958 ns 6000 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10084 ns 10208 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 23383 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 221709 ns 221042 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 229750 ns 228959 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 229125 ns 229375 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 214125 ns 223854.5 ns 0.96
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 167775.5 ns
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 3917 ns 3875 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4000 ns 3958 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 3917 ns 3917 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 3917 ns 3917 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 23070 ns
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 17083 ns 16708 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16625 ns 17083 ns 0.97
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 17083 ns 16875 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16833 ns 16584 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 162035 ns
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 575083 ns 570250 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 571792 ns 577041 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 570750 ns 576958 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 577208 ns 573916 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 113295 ns
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1418250 ns 1424354 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1422875 ns 1421125 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1422500 ns 1417666 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 1425750 ns 1422417 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 211866.5 ns
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) 1081041.5 ns 1082874.5 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) 946916.5 ns 969583.5 ns 0.98
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) 1353229.5 ns 1345833 ns 1.01
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) 1292458 ns 1275270.5 ns 1.01
lenet(28, 28, 1, 64)/forward/GPU/CUDA 269913.5 ns
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) 6001958 ns 5772500 ns 1.04
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) 4632042 ns 4552375 ns 1.02
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) 4929041.5 ns 4981312.5 ns 0.99
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) 5549750.5 ns 5767584 ns 0.96
lenet(28, 28, 1, 64)/zygote/GPU/CUDA 1070564 ns
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 542 ns 541 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 542 ns 542 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 542 ns 542 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 542 ns 541 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 23780 ns
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2209 ns 2125 ns 1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2209 ns 2208 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2208 ns 2250 ns 0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2084 ns 2125 ns 0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 170642 ns
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 3667 ns 4167 ns 0.88
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 4750 ns 4375 ns 1.09
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5208 ns 4875 ns 1.07
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 4041 ns 4500 ns 0.90
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 65525 ns
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11084 ns 11291 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12083 ns 11292 ns 1.07
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12208 ns 12000 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10834 ns 11375 ns 0.95
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 445478.5 ns
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5917 ns 6458 ns 0.92
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6666 ns 6833 ns 0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8167 ns 8000 ns 1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6166 ns 6875 ns 0.90
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 52877 ns
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 18250 ns 17083 ns 1.07
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 18458 ns 19250 ns 0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 18542 ns 17791.5 ns 1.04
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 17520.5 ns 17875 ns 0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 296963 ns
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 583 ns 625 ns 0.93
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 625 ns 625 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 667 ns 625 ns 1.07
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 542 ns 542 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 32928.5 ns
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9271 ns 8792 ns 1.05
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9208 ns 8875 ns 1.04
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9354.5 ns 8916.5 ns 1.05
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 8375 ns 8209 ns 1.02
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 157633 ns
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 64458 ns 64500 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 64917 ns 64583 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 64583 ns 64250 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 64375 ns 64750 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 111288 ns
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 278375 ns 285625 ns 0.97
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 292291 ns 283375 ns 1.03
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 278833 ns 276208.5 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 279500 ns 297500 ns 0.94
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 186917 ns
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) 3287958 ns 3402333 ns 0.97
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) 2909792 ns 3060583 ns 0.95
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) 3017771 ns 3019687.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) 3935292 ns 4056229 ns 0.97
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA 579655 ns
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) 7602875 ns 7721750 ns 0.98
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) 7372333 ns 7459709 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) 7461313 ns 7439375.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) 8220167 ns 8277625 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA 1357048 ns
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) 17533125 ns 17593999.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) 17557125 ns 17466354 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) 17531667 ns 17549604.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) 9214250 ns 9302166.5 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23446917 ns 23554916.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 43586125 ns 33592458 ns 1.30
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37247062.5 ns 37227500 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 35028291.5 ns 35248104 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1855921.5 ns
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 189114500 ns 188482416 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 178190333 ns 164033541 ns 1.09
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 153393396 ns 153090042 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 434855500 ns 443063541 ns 0.98
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 13947546 ns
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 290046875 ns 290580729 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 271392771 ns 257093729.5 ns 1.06
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 284812041.5 ns 296199833.5 ns 0.96
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 473569708.5 ns 482390645.5 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 23021 ns 22750 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 22458 ns 24645.5 ns 0.91
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 23625 ns 23792 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 22708 ns 21958 ns 1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 96516 ns
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 115458.5 ns 103459 ns 1.12
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 103250 ns 104709 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 104375 ns 103916.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 105042 ns 103729.5 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 508001.5 ns
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5750 ns 5834 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6500 ns 6083 ns 1.07
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 6708 ns 6625 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6125 ns 6209 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 68991.5 ns
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14042 ns 14667 ns 0.96
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15500 ns 15020.5 ns 1.03
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15687.5 ns 16020.5 ns 0.98
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14500 ns 15250 ns 0.95
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 478721 ns
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 2979083.5 ns 3027500 ns 0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2084000 ns 2071021 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2281500 ns 2285333.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4814250 ns 4820958 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA 585630.5 ns
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 23560375 ns 23646313 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18266583.5 ns 18048395.5 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 16959209 ns 16906125 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 34863041.5 ns 35430208 ns 0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2766675 ns
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 33305667 ns 33437292 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 27994104 ns 27650521 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 27448959 ns 27492875 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 40756916 ns 42564979.5 ns 0.96
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 74000 ns 72854.5 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 73333 ns 73458 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 74917 ns 74021 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 74500 ns 75000 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 104050 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 218083 ns 303958 ns 0.72
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 210625 ns 219312.5 ns 0.96
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 296708.5 ns 219042 ns 1.35
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 217792 ns 319666.5 ns 0.68
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 558286.5 ns
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 11750 ns 11500 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 12417 ns 11959 ns 1.04
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 12458.5 ns 12416 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 11834 ns 12208 ns 0.97
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 72847.5 ns
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 26125 ns 26083.5 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 27167 ns 26104.5 ns 1.04
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 27375 ns 27209 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26458 ns 26750 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 484580 ns
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 11583 ns 12166.5 ns 0.95
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 12167 ns 12645.5 ns 0.96
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 14000 ns 13500 ns 1.04
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 11792 ns 12875 ns 0.92
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 55176 ns
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 25542 ns 25750 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 26417 ns 26459 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 28709 ns 26375 ns 1.09
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26042 ns 25833 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 307604.5 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 179208 ns 182125 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 181042 ns 180500 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 184333.5 ns 183000 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 179416 ns 180375 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 57654 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 590646 ns 581750 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 591479 ns 590708.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 593500 ns 609500 ns 0.97
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 582749.5 ns 594250 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 291261 ns
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6083.5 ns 5625 ns 1.08
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6375 ns 5958 ns 1.07
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 6708 ns 6500 ns 1.03
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6292 ns 6250 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 71643 ns
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14250 ns 13917 ns 1.02
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15167 ns 13916 ns 1.09
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15292 ns 14583 ns 1.05
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14042 ns 14291 ns 0.98
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 470922.5 ns
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 1203770.5 ns 1196250 ns 1.01
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 1236645.5 ns 1251708 ns 0.99
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 1343083 ns 1274542 ns 1.05
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 1024395.5 ns 1013000 ns 1.01
batchedmm(512, Bsize=4)/forward/GPU/CUDA 300123 ns
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 4091000 ns 4142875 ns 0.99
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 4576917 ns 4864958 ns 0.94
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 4574875.5 ns 4545520.5 ns 1.01
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 3718250 ns 3911541.5 ns 0.95
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1038641 ns
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1875 ns 1792 ns 1.05
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1875 ns 1834 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1875 ns 1875 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1875 ns 1875 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 23874.5 ns
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 5083 ns 4834 ns 1.05
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 5000 ns 4917 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 4959 ns 5000 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4875 ns 4916 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 193867 ns
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5500 ns 5250 ns 1.05
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 5709 ns 5917 ns 0.96
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6875 ns 6333 ns 1.09
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5416 ns 6042 ns 0.90
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 57200 ns
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 11042 ns 11000 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 11584 ns 11458 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 11500 ns 11292 ns 1.02
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10625 ns 11000 ns 0.97
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 332575 ns
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 375 ns 292 ns 1.28
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 333 ns 333 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 334 ns 334 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 334 ns 292 ns 1.14
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 22978 ns
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2834 ns 2708 ns 1.05
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2792 ns 3041 ns 0.92
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 3000 ns 2792 ns 1.07
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2833 ns 2709 ns 1.05
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 163496 ns
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 11625 ns 11167 ns 1.04
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 11292 ns 11667 ns 0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 12875 ns 12375 ns 1.04
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 11209 ns 12083 ns 0.93
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 58225 ns
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 24958 ns 25083 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 25208 ns 25416 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25375 ns 25167 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 25042 ns 24583 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 299318 ns
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4250 ns 4208 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4250 ns 4250 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4250 ns 4250 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4250 ns 4250 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 25190 ns
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16209 ns 16375 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16083 ns 16417 ns 0.98
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16625 ns 16250 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16500 ns 16042 ns 1.03
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 202972 ns
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5833 ns 5750 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5792 ns 5791 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5959 ns 5875 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5792 ns 5833 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 34611 ns
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 20625 ns 20375 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 21042 ns 20479.5 ns 1.03
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 21083 ns 21208 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 20125 ns 20854.5 ns 0.97
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 178483.5 ns
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 414125 ns 427021 ns 0.97
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 367771 ns 388041 ns 0.95
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 480813 ns 475333 ns 1.01
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 104146 ns 107750 ns 0.97
batchedmm(16, Bsize=512)/forward/GPU/CUDA 67750.5 ns
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 927125 ns 885834 ns 1.05
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 964354 ns 960667 ns 1.00
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 1186833 ns 1182208 ns 1.00
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 376584 ns 375875 ns 1.00
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 192974.5 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 77583 ns 80125 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 79125 ns 80750 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 83542 ns 82167 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 79958 ns 80791 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 193934 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1917959 ns 1942937 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1933541 ns 1918166.5 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1931521.5 ns 1916333 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1860375 ns 1923604 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 392771 ns
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 333 ns 0.88
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 333 ns 0.88
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 333 ns 333 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 22416 ns
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1792 ns 1833 ns 0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1875 ns 1875 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1875 ns 1875 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1875 ns 1833 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 174762 ns
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6562.5 ns 6167 ns 1.06
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 6417 ns 6792 ns 0.94
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 8166 ns 7333 ns 1.11
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6208 ns 6667 ns 0.93
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 59227 ns
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9292 ns 8791.5 ns 1.06
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9250 ns 9416 ns 0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9375 ns 9292 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9083 ns 9167 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 304901.5 ns
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 120543687.5 ns 119015458 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 181954416.5 ns 173560375 ns 1.05
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 148126750 ns 148104416 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 106134709 ns 104510604 ns 1.02
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5492614.5 ns
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 609833750 ns 611899646 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 578593208 ns 555362500 ns 1.04
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 451045708.5 ns 453017291 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 627478333.5 ns 632276917 ns 0.99
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 35107131 ns
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 652518625 ns 666765667 ns 0.98
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 683671437.5 ns 666371104 ns 1.03
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 587115583.5 ns 582119812.5 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 852245209 ns 866159459 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58000 ns 57541 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 39209 ns 47708 ns 0.82
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 48208 ns 46875 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 85167 ns 84375 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 38635 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1920104 ns 1944250 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1988000 ns 1980416 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1980667 ns 1976042 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1907896 ns 1906083 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 176329 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 267041 ns 267917 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 270500 ns 268292 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 268750 ns 267937.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 265291 ns 267625 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 123893.5 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 596166 ns 703792 ns 0.85
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 698625 ns 681124.5 ns 1.03
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 702916.5 ns 595667 ns 1.18
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 589292 ns 697208 ns 0.85
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 677537.5 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2180187.5 ns 2209437.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2215229 ns 2173708 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2212000 ns 2200062 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2207792 ns 2113875 ns 1.04
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 133207 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5497667 ns 5503083 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5581500 ns 5488667 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5516125 ns 5509792 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5545124.5 ns 5568042 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 717120 ns
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 656041 ns 638000 ns 1.03
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 642917 ns 645667 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 637375 ns 647187.5 ns 0.98
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 644167 ns 644709 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 46463 ns
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1822875 ns 1827583 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1668958.5 ns 1720833 ns 0.97
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1723334 ns 1720291 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 2101084 ns 2097125 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 222123 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57667 ns 59166 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 38708 ns 47625 ns 0.81
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46916 ns 45833 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 85084 ns 84209 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 28664 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2028604.5 ns 2051584 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2097916.5 ns 2075395.5 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2087625 ns 2040667 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2005812 ns 2021583 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 188609 ns
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 13343604 ns 13373292 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 12536250 ns 12436750 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 12547834 ns 12559270.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 15250271 ns 14986208.5 ns 1.02
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 510611.5 ns
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 47204500 ns 47390625 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 41927292 ns 41705020.5 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 40799666 ns 40992438 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 58864104 ns 58725208 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2889030 ns
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 73523334 ns 73938270.5 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 91557750 ns 90830563 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 90571250.5 ns 90514083 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 75976041 ns 76122334 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58083 ns 59916 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 38875 ns 47541 ns 0.82
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47709 ns 47458 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82042 ns 83500 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 48950 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1916542 ns 1948584 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1982083 ns 1954250 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1947333 ns 1965437.5 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1876854 ns 1888625 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 195268 ns
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 333 ns 333 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 292 ns 292 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 32997 ns
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 5834 ns 5979.5 ns 0.98
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6500 ns 6584 ns 0.99
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6458.5 ns 6500 ns 0.99
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 5958 ns 6187.5 ns 0.96
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 171034 ns
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 250 ns 1.17
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 250 ns 292 ns 0.86
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 250 ns 292 ns 0.86
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 32918 ns
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 2750 ns 2666 ns 1.03
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 2750 ns 2875 ns 0.96
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 2917 ns 2792 ns 1.04
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 2625 ns 2666 ns 0.98
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 161268 ns
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 286917729.5 ns 286733687.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 347948583.5 ns 339568833 ns 1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 314136145.5 ns 314522187.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 267700542 ns 270045166 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 7080984 ns
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 1009676125 ns 1015582292 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 974877416 ns 953582875 ns 1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 854637270.5 ns 840575375 ns 1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1260982959 ns 1282644084 ns 0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 34048271 ns
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1387098104 ns 1419694479.5 ns 0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1694333625 ns 1672572375 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1631003167 ns 1620047667 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1358038896 ns 1358918958.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1411604.5 ns 1454458 ns 0.97
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1409250 ns 1408583 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1407354.5 ns 1410041.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1405916 ns 1442292 ns 0.97
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 128067 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5023999.5 ns 5055625 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5051396 ns 5019625 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5029104.5 ns 5009458 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5040479 ns 5053667 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 514176 ns
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) 170919250 ns 171675979 ns 1.00
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) 183735542 ns 126429812.5 ns 1.45
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) 115460229.5 ns 106760875 ns 1.08
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) 168486416 ns 165741833.5 ns 1.02
vgg16(32, 32, 3, 32)/forward/GPU/CUDA 4853309 ns
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) 627387000 ns 622640208 ns 1.01
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) 561666625 ns 492172500 ns 1.14
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) 453969542 ns 462809167 ns 0.98
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) 654142166 ns 660164833 ns 0.99
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA 17017885 ns
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 8912729 ns 8982250 ns 0.99
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 9063708 ns 8969792 ns 1.01
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 7941979 ns 7891125 ns 1.01
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 9820979.5 ns 9977959 ns 0.98
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1590505 ns
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 36015084 ns 36106959 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 38799959 ns 37109917 ns 1.05
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 33679959 ns 33736459 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 37936417 ns 39159896 ns 0.97
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6472671 ns
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 47459 ns 47375 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 47708 ns 47500 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 47625 ns 47645.5 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 47209 ns 47500 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 17832 ns
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 50416 ns 50417 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 50292 ns 50875 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 50458 ns 51729 ns 0.98
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 50291 ns 50333 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 162828 ns
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6208 ns 6583 ns 0.94
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 7083 ns 7208 ns 0.98
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7562.5 ns 7646 ns 0.99
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 6292 ns 7333 ns 0.86
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 74130 ns
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9375 ns 9292 ns 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10250 ns 10209 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10375 ns 10333 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9917 ns 10167 ns 0.98
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 422862.5 ns
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5666 ns 5854.5 ns 0.97
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6500 ns 6292 ns 1.03
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 6916 ns 6834 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5375 ns 6166 ns 0.87
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 78877.5 ns
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12875 ns 12667 ns 1.02
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 13583 ns 13208.5 ns 1.03
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13583 ns 13459 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 13208 ns 12958 ns 1.02
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 370972.5 ns
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 1083 ns 1000 ns 1.08
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 1083 ns 1042 ns 1.04
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 1083 ns 1084 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 1042 ns 1042 ns 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 33127 ns
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7792 ns 7770.5 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8167 ns 8125 ns 1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8083 ns 7834 ns 1.03
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7792 ns 8250 ns 0.94
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 187081.5 ns
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 23333 ns 23417 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 23417 ns 23375 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 23583 ns 23500 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 23084 ns 23458 ns 0.98
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 18527 ns
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 52042 ns 52292 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 52750 ns 52667 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 52875 ns 52667 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 52542 ns 52417 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 204233 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1398875 ns 1448145.5 ns 0.97
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1455625 ns 1457021 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1404042 ns 1402542 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1406584 ns 1403042 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 196492.5 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4999875 ns 5036750 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5037708 ns 5020979 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5003083 ns 5021708 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5024916 ns 5042708.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 495167 ns
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3047396 ns 3054459 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2106521 ns 2092750 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2296895.5 ns 2302708.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4962229.5 ns 4935833 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 583841 ns
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 24384458 ns 24359708.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 19075709 ns 18879875 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 17765562.5 ns 17805083 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 35955916.5 ns 36477083 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2836787 ns
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 33991937.5 ns 34112104.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 28748917 ns 28352833 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 28081042 ns 27995625 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 41668854.5 ns 42341709 ns 0.98
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 142678458 ns 143179166 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 147270333 ns 147785458 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 126985770.5 ns 126873458.5 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 174826021 ns 172641167 ns 1.01
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22556485 ns
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 1026522125 ns 1416291312.5 ns 0.72
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 866022875.5 ns 1304509479 ns 0.66
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 743843334 ns 1238526750 ns 0.60
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 682878792 ns 685736000 ns 1.00
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 116543149 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 76083 ns 76042 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 76250 ns 79459 ns 0.96
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 77625 ns 76687 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 75833.5 ns 75124.5 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 163749.5 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 275437.5 ns 189375 ns 1.45
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 283542 ns 278000 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 275959 ns 289166.5 ns 0.95
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 282375 ns 193709 ns 1.46
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 882740 ns
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 35483000 ns 35548875.5 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 36565000 ns 36247291.5 ns 1.01
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 32543896 ns 32430687.5 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 40679500 ns 40776042 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5828412 ns
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 147536708 ns 148827666 ns 0.99
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 157209875 ns 152471625 ns 1.03
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 136063312.5 ns 135828541 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 286255000 ns 224259958 ns 1.28
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 34875549.5 ns
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 122158104.5 ns 120283062 ns 1.02
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 181447688 ns 173757375 ns 1.04
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 147872917 ns 148381833 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 104774833.5 ns 100995854 ns 1.04
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5433572 ns
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 468969166 ns 468476625 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 487732687.5 ns 466581667 ns 1.05
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 437061208 ns 438033125 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 745602708 ns 758068771 ns 0.98
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 31632434 ns
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 708533125.5 ns 656498666 ns 1.08
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 662068729.5 ns 639464917 ns 1.04
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 625681375 ns 572772729.5 ns 1.09
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 856533500 ns 867522166 ns 0.99
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) 1243917 ns 1241166.5 ns 1.00
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) 778625 ns 960584 ns 0.81
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) 961709 ns 985604 ns 0.98
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) 2098041.5 ns 2040750 ns 1.03
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA 581626.5 ns
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) 2966062.5 ns 3033584 ns 0.98
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) 2513979 ns 2618542 ns 0.96
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) 2620167 ns 2633875 ns 0.99
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) 3551916 ns 3767750 ns 0.94
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA 1532656 ns
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) 5803146 ns 5830292 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) 5896375 ns 5796375 ns 1.02
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) 5798708 ns 5804458 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) 2924083 ns 2978917 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7083 ns 7500 ns 0.94
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5291 ns 6042 ns 0.88
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6208 ns 6209 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10166 ns 10333 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 25159 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 212500 ns 212708 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 220625 ns 220542 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 220709 ns 223542 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 213625 ns 208708 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 199491.5 ns
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) 297113041 ns 297468334 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) 291058458 ns 215016959 ns 1.35
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) 193310291.5 ns 193569000 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) 304396812.5 ns 311798792 ns 0.98
vgg16(32, 32, 3, 64)/forward/GPU/CUDA 7678125.5 ns
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) 1231332166.5 ns 1238998917 ns 0.99
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) 973933875 ns 901957166.5 ns 1.08
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) 836913500 ns 825878542 ns 1.01
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) 1148765416.5 ns 1319998292 ns 0.87
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA 26856489.5 ns
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 4792 ns 5542 ns 0.86
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5875 ns 5834 ns 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6354 ns 6708 ns 0.95
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4667 ns 5375 ns 0.87
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 93183 ns
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7000 ns 7083 ns 0.99
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7625 ns 7333 ns 1.04
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7458 ns 7875 ns 0.95
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7395.5 ns 7042 ns 1.05
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 440751 ns
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 500 ns 583 ns 0.86
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 667 ns 625 ns 1.07
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 584 ns 625 ns 0.93
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 500 ns 541 ns 0.92
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 24653 ns
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 8625 ns 9083 ns 0.95
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9500 ns 8666 ns 1.10
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9917 ns 9292 ns 1.07
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 8792 ns 8583 ns 1.02
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 176547.5 ns
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 353584 ns 351792 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 353833 ns 351708 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 352208 ns 352375 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 351500 ns 354000 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 21275 ns
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 807916.5 ns 827667 ns 0.98
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 789854 ns 779562.5 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 776042 ns 778208 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 778833 ns 824354.5 ns 0.94
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 215262.5 ns
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 339229 ns 337833 ns 1.00
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 321000 ns 342521 ns 0.94
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 454187 ns 452875 ns 1.00
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 10916 ns 11687.5 ns 0.93
batchedmm(16, Bsize=32)/forward/GPU/CUDA 18631 ns
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 714125 ns 713208.5 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 731625 ns 736500 ns 0.99
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 1006333 ns 1010250 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 26667 ns 27208.5 ns 0.98
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 196596.5 ns
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 381833.5 ns 381792 ns 1.00
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 330959 ns 354187 ns 0.93
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 444916.5 ns 441708 ns 1.01
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 31417 ns 31083 ns 1.01
batchedmm(16, Bsize=128)/forward/GPU/CUDA 23162 ns
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 727875 ns 731646 ns 0.99
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 783542 ns 785667 ns 1.00
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 1030146 ns 1027917 ns 1.00
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 90750 ns 91083 ns 1.00
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 193002.5 ns
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 3583 ns 3542 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 3709 ns 3458 ns 1.07
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 3625 ns 3583 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 3375 ns 3542 ns 0.95
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 17634 ns
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 4291 ns 4167 ns 1.03
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 4208 ns 4250 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 4333 ns 4500 ns 0.96
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 4125 ns 4208 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 200435.5 ns
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3500 ns 3375 ns 1.04
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4167 ns 3917 ns 1.06
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4375 ns 4084 ns 1.07
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3583 ns 3917 ns 0.91
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 151437.5 ns
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8458 ns 8375 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8583 ns 8167 ns 1.05
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8333 ns 8584 ns 0.97
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8458 ns 8500 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 927946.5 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 204583 ns 204791 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 209000 ns 210875 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 210500 ns 211541 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 199084 ns 202083 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 35183 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 602833.5 ns 600417 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 629209 ns 627875 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 625584 ns 630312 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 582250 ns 583542 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 266930.5 ns
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 990542 ns 1010270.5 ns 0.98
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 1053625 ns 1015521 ns 1.04
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 954292 ns 949979.5 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 901104 ns 909416 ns 0.99
batchedmm(128, Bsize=128)/forward/GPU/CUDA 206789.5 ns
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 4511208 ns 4557687.5 ns 0.99
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 4854542 ns 4722959 ns 1.03
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 4490209 ns 4470333.5 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 4299083.5 ns 4443646.5 ns 0.97
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 930739 ns
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3084 ns 3334 ns 0.93
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3500 ns 3500 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4083.5 ns 4125 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3000 ns 3625 ns 0.83
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 144120 ns
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7250 ns 7292 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7333 ns 7167 ns 1.02
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7500 ns 7167 ns 1.05
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7041 ns 7458.5 ns 0.94
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 806482 ns
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1636250 ns 1562000 ns 1.05
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1158208.5 ns 1179000 ns 0.98
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1368083 ns 1346417 ns 1.02
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2308063 ns 2481104 ns 0.93
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA 214505 ns
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12270583 ns 12361833 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9567750 ns 9575979 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9243645.5 ns 9245041 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18134146 ns 18149645.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1954133 ns
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17281250 ns 17389625 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14453375 ns 14446583 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14325333 ns 14298208.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21045500 ns 21068500 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 85708 ns 88500 ns 0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 91520.5 ns 99167 ns 0.92
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 93250 ns 91917 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 87833.5 ns 90708.5 ns 0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 126207 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2017958 ns 2074916 ns 0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2050542 ns 2029541 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2029834 ns 1761250 ns 1.15
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2026959 ns 2035041.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 841405 ns
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 1375 ns 2084 ns 0.66
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 1917 ns 2666 ns 0.72
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 3583.5 ns 3583.5 ns 1.00
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 2375 ns 1916 ns 1.24
batchedmm(2, Bsize=4)/forward/GPU/CUDA 16017 ns
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 2875 ns 2625 ns 1.10
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 2833 ns 2875 ns 0.99
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 2750 ns 2917 ns 0.94
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 2792 ns 2834 ns 0.99
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 165765.5 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7208 ns 7375 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5333 ns 6042 ns 0.88
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5958 ns 6083 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10084 ns 10083 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 34231 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 214458 ns 212333.5 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 220042 ns 220563 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 221416 ns 223084 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 235834 ns 208417 ns 1.13
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 263066.5 ns
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3708 ns 3750 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3750 ns 3750 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3750 ns 3709 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3708 ns 3750 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 22879.5 ns
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14459 ns 14709 ns 0.98
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 14375 ns 14625 ns 0.98
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14541 ns 14541 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14500 ns 14292 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 399546.5 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 94312.5 ns 94500 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 95875 ns 93916.5 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 97583 ns 96125 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 94354.5 ns 95625 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 125486.5 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1919437.5 ns 1950959 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1938250 ns 1918895.5 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1927084 ns 1651334 ns 1.17
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1803750 ns 1942375 ns 0.93
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 794850 ns
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) 875354.5 ns 881833 ns 0.99
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) 802104.5 ns 830792 ns 0.97
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) 1225042 ns 1225417 ns 1.00
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) 970374.5 ns 944312.5 ns 1.03
lenet(28, 28, 1, 32)/forward/GPU/CUDA 273954 ns
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) 2714354 ns 2742708 ns 0.99
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) 2504167 ns 2522750 ns 0.99
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) 3360375 ns 3329959 ns 1.01
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) 3360334 ns 3361458 ns 1.00
lenet(28, 28, 1, 32)/zygote/GPU/CUDA 1467965 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17542 ns 15166.5 ns 1.16
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 16937.5 ns 17000 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 18708 ns 16583 ns 1.13
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 14584 ns 15667 ns 0.93
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 129735 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 214709 ns 214666 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 215958.5 ns 224541.5 ns 0.96
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 215562.5 ns 216208 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 217958 ns 217645.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 539139.5 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 223375 ns 219500 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 220958 ns 220000 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 222645.5 ns 221167 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 219625 ns 220834 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 217203.5 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 495895.5 ns 495958 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 506625 ns 507958 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 510958 ns 498625 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 561375 ns 506541 ns 1.11
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1153506.5 ns
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 3917 ns 4166.5 ns 0.94
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 4667 ns 4312.5 ns 1.08
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 4834 ns 4583 ns 1.05
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 4833 ns 4625 ns 1.04
batchedmm(16, Bsize=4)/forward/GPU/CUDA 17326 ns
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 7520.5 ns 7187.5 ns 1.05
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 7625 ns 7292 ns 1.05
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 7458 ns 7229.5 ns 1.03
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 7417 ns 7625 ns 0.97
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 176736 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 16646 ns 17417 ns 0.96
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 18500 ns 19292 ns 0.96
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 19625 ns 18625 ns 1.05
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 18042 ns 18500 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 133143.5 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 213000 ns 219083.5 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 212916 ns 211959 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 213667 ns 213521 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 224895.5 ns 213208 ns 1.05
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 820129 ns
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 4354.5 ns 4250 ns 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 4625 ns 4334 ns 1.07
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 4917 ns 4750 ns 1.04
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 3875 ns 4375 ns 0.89
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 175343 ns
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 10208 ns 10417 ns 0.98
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10333 ns 10750 ns 0.96
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10834 ns 10500 ns 1.03
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10208 ns 10500 ns 0.97
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 980341 ns
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3250 ns 2958 ns 1.10
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3687.5 ns 3417 ns 1.08
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4292 ns 3959 ns 1.08
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 2917 ns 3542 ns 0.82
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 215866 ns
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7166 ns 7291 ns 0.98
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7625 ns 7458 ns 1.02
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7792 ns 7583 ns 1.03
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7375 ns 7625 ns 0.97
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 1015020 ns
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23687417 ns 23616833 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 42666354 ns 34076542 ns 1.25
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37344478.5 ns 37648750 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34948333.5 ns 35355896 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1824017 ns
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 183871416 ns 185118750 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 182812313 ns 161569416 ns 1.13
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 145975437.5 ns 146021041.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 274277542 ns 274915208 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 16507012 ns
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 273782791 ns 273527291 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 257949042 ns 244066854 ns 1.06
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 231995083.5 ns 231262500 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 323882958.5 ns 325681645.5 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 183541 ns 183916.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 184000 ns 184479.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 185292 ns 183709 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 182542 ns 185125 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 191911.5 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 629458.5 ns 635250 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 587334 ns 590375 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 587125.5 ns 586375 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 649291 ns 586875.5 ns 1.11
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 963628 ns
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 3851750 ns 3912854 ns 0.98
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 3983792 ns 3922688 ns 1.02
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 3579833 ns 3534875 ns 1.01
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 4612292 ns 4683208 ns 0.98
batchedmm(128, Bsize=512)/forward/GPU/CUDA 531156 ns
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 17385812.5 ns 17461333 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 18439958.5 ns 17877604 ns 1.03
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 16577084 ns 16535333 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 20232667 ns 20876542 ns 0.97
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2638769 ns
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 625 ns 500 ns 1.25
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 625 ns 625 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 625 ns 583 ns 1.07
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 542 ns 541 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 32361 ns
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9312.5 ns 8875 ns 1.05
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9604.5 ns 9458 ns 1.02
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9541 ns 9167 ns 1.04
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 8750 ns 9084 ns 0.96
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 248738 ns
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) 650277229.5 ns 653952292 ns 0.99
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) 513797917 ns 393857103.5 ns 1.30
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) 364513416 ns 328714250 ns 1.11
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) 753229708 ns 759532875 ns 0.99
vgg16(32, 32, 3, 128)/forward/GPU/CUDA 11759811 ns
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) 1878034500 ns 1886540417 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) 1671899375 ns 1638767625 ns 1.02
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) 1507608416.5 ns 1505416479 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) 2202946667 ns 2232982666.5 ns 0.99
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA 49516620 ns
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1535958.5 ns 1645500 ns 0.93
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1179292 ns 1196083 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1380729.5 ns 1372166 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2368083 ns 2490500 ns 0.95
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 215337 ns
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12730083 ns 12742021.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9937625 ns 9937333.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9659583.5 ns 9670291 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18459917 ns 18551458 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2010689 ns
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17677292 ns 17729729 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14810083 ns 14747250 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14573229.5 ns 14539958 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21483000 ns 21491875 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 26292 ns 26250 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 26250 ns 26292 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 26250 ns 26250 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 26208 ns 26250 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 23665 ns
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 67166 ns 67416 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 66875 ns 67167 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 67250 ns 68042 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66958 ns 66917 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 367986.5 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 204583 ns 203916 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 209292 ns 209500 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 210500 ns 208375 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 199625 ns 199583 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 26073 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 613125 ns 615979 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 625459 ns 622458.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 633583 ns 625042 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 632083 ns 628771 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 320857.5 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 592750 ns 654750 ns 0.91
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 647000 ns 648792 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 648834 ns 639250 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 671792 ns 553000 ns 1.21
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 131354 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2247291 ns 2255292 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2303208 ns 2216833.5 ns 1.04
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2243604 ns 2230625 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2314875.5 ns 2261625 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1083962 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 16687.5 ns 17479.5 ns 0.95
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 18458 ns 18166 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 19770.5 ns 18334 ns 1.08
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 18146 ns 18542 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 132087.5 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 229375 ns 230250 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 262896 ns 218666.5 ns 1.20
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 231208 ns 220145.5 ns 1.05
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 258624.5 ns 225083.5 ns 1.15
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 885149.5 ns
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 625 ns 625 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 625 ns 625 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 667 ns 542 ns 1.23
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 542 ns 541 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 23686 ns
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 8708 ns 9625 ns 0.90
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 10000 ns 9500 ns 1.05
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 10000 ns 9583 ns 1.04
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9250 ns 9583 ns 0.97
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 241904 ns
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5417 ns 5166 ns 1.05
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5583 ns 5667 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6417 ns 6291.5 ns 1.02
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4770.5 ns 5625 ns 0.85
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 194851.5 ns
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7667 ns 6959 ns 1.10
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7417 ns 7709 ns 0.96
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7792 ns 7125 ns 1.09
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7250 ns 7250 ns 1.00
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 705733 ns
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 2167 ns 2292 ns 0.95
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 2208 ns 2125 ns 1.04
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 2542 ns 2333 ns 1.09
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 2208 ns 2167 ns 1.02
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 17804 ns
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 6541 ns 6354.5 ns 1.03
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 6500 ns 6500 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 6875 ns 6583.5 ns 1.04
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 6417 ns 6459 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 294742 ns
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 746916 ns 748750 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 761333 ns 746708 ns 1.02
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 750541 ns 749375 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 749459 ns 749125 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 20924 ns
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 790875 ns 794125 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 777375 ns 775500 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 792500 ns 775812.5 ns 1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 778250 ns 794500.5 ns 0.98
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 268681.5 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7375 ns 7458 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5250 ns 6084 ns 0.86
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5875 ns 5583 ns 1.05
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10292 ns 10541 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 32725 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 219208 ns 231542 ns 0.95
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 230937.5 ns 231875 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 236625 ns 229604 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 214312.5 ns 215187.5 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 332717.5 ns
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 10291 ns 10166.5 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 10937.5 ns 10416 ns 1.05
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 10625 ns 10479 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 9916 ns 10417 ns 0.95
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 219475.5 ns
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 24416 ns 25083.5 ns 0.97
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 25417 ns 23916 ns 1.06
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 24875 ns 24625 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 24354.5 ns 25000 ns 0.97
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 1060762 ns
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 106190416 ns 106424375 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 126215417 ns 117279208.5 ns 1.08
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 120200125 ns 120424354 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 117655917 ns 117916208 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 2587994 ns
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 395454916.5 ns 397131541.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 372350083.5 ns 366183958 ns 1.02
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 355285895.5 ns 355277020.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 542892500 ns 545563875.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 15209611 ns
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 607219000 ns 609770291 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 775694542 ns 756955334 ns 1.02
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 743546708 ns 745569813 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 606917208 ns 607706416.5 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6729.5 ns 6875 ns 0.98
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7458 ns 9229 ns 0.81
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8791 ns 8833 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6084 ns 7500 ns 0.81
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 214170 ns
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14645.5 ns 14375 ns 1.02
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14167 ns 13750 ns 1.03
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14334 ns 14667 ns 0.98
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 13417 ns 13542 ns 0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 1010027 ns
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6042 ns 5959 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6708.5 ns 6354.5 ns 1.06
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 6958 ns 7083 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5166.5 ns 6042 ns 0.86
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 211003 ns
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12916 ns 12666 ns 1.02
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12979.5 ns 12917 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13041 ns 12916 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12375 ns 12292 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 725511 ns
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 5792 ns 5875 ns 0.99
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 6084 ns 5937.5 ns 1.02
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 7166 ns 5812.5 ns 1.23
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 5979.5 ns 6000 ns 1.00
batchedmm(2, Bsize=128)/forward/GPU/CUDA 16985 ns
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 16375 ns 15375 ns 1.07
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 15917 ns 18229.5 ns 0.87
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 15750 ns 15625 ns 1.01
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 15750 ns 15834 ns 0.99
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 184955.5 ns
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 333 ns 334 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 417 ns 417 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 417 ns 416 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 292 ns 292 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 23469 ns
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6375 ns 6291 ns 1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6292 ns 6541 ns 0.96
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6458 ns 6375 ns 1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6020.5 ns 6042 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 226513 ns
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5917 ns 5958 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6000 ns 5917 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 6083 ns 6083 ns 1
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5833 ns 5833 ns 1
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 24637 ns
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 21375 ns 20895.5 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 21083 ns 21084 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 21167 ns 21334 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 20875 ns 20875 ns 1
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 248819 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 144938 ns 145167 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 147666 ns 145333 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 147500 ns 147791 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 144208 ns 146250.5 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 166863.5 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1328917 ns 1351583 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1366916.5 ns 1324833.5 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1323667 ns 1269708 ns 1.04
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1330125 ns 1342020.5 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1231201 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 21917 ns 24854 ns 0.88
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 23250 ns 24750 ns 0.94
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 25417 ns 24083.5 ns 1.06
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 24583 ns 23041.5 ns 1.07
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 261684.5 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 126249.5 ns 130333 ns 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 132125 ns 131875 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 180458 ns 120583 ns 1.50
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 182166 ns 127250 ns 1.43
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1329052 ns
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 334 ns 333 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 333 ns 1.13
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 292 ns 292 ns 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 23064 ns
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6417 ns 6375 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6500 ns 6750 ns 0.96
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6583 ns 6167 ns 1.07
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6083 ns 6166 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 241726 ns
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 4583 ns 4250 ns 1.08
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4875 ns 4583 ns 1.06
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 5062.5 ns 5000 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4375 ns 4666 ns 0.94
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 230879.5 ns
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9792 ns 9917 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10375 ns 10000 ns 1.04
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10333 ns 10458 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10125 ns 10250 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 1281938 ns
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1584 ns 1584 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1625 ns 1625 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1625 ns 1625 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1583 ns 1584 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 23016.5 ns
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 5709 ns 5625 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 5750 ns 6000 ns 0.96
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 6042 ns 5792 ns 1.04
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 5625 ns 5666 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 260870.5 ns
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 6736854 ns 6809750 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 6358292 ns 6375834 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 6526333 ns 6505250 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 7511917 ns 7653125.5 ns 0.98
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 214549 ns
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 24072542 ns 24098271 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 21309271.5 ns 21313750 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 21010584 ns 21034292 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 29840125 ns 29936333.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2110310.5 ns
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 37228250 ns 37354916.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 45827250 ns 45524125 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 45480416 ns 45728625 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 38465479 ns 38256604.5 ns 1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5708 ns 5708 ns 1
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 5708 ns 5916 ns 0.96
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6729.5 ns 6542 ns 1.03
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5208.5 ns 5958 ns 0.87
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 215925.5 ns
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8833 ns 8792 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8417 ns 8375 ns 1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8625 ns 8792 ns 0.98
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8145.5 ns 8042 ns 1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 1004537.5 ns
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) 1503813 ns 1544521 ns 0.97
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) 1243541.5 ns 1274291.5 ns 0.98
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) 1631312.5 ns 1619792 ns 1.01
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) 2004542 ns 2113874.5 ns 0.95
lenet(28, 28, 1, 128)/forward/GPU/CUDA 280207 ns
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) 7912062.5 ns 7917042 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) 6650042 ns 6631541 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) 7185875 ns 7090646 ns 1.01
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) 10076645.5 ns 10525708 ns 0.96
lenet(28, 28, 1, 128)/zygote/GPU/CUDA 1812720 ns
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 371770.5 ns 363667 ns 1.02
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 359708 ns 373917 ns 0.96
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 457000 ns 456000 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 27125 ns 24312 ns 1.12
batchedmm(128, Bsize=4)/forward/GPU/CUDA 47414 ns
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 728042 ns 737791.5 ns 0.99
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 792916 ns 796895.5 ns 1.00
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 1060625 ns 1063396 ns 1.00
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 122625 ns 91145.5 ns 1.35
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 280856 ns
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 397666 ns 397459 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 213417 ns 287666 ns 0.74
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 288291 ns 287958 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 754041 ns 751208 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 44363 ns
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 669875 ns 667375 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 474875 ns 532500 ns 0.89
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 529792 ns 533459 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 975625 ns 974250 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 194646.5 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 678312.5 ns 677250 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 642583 ns 646333 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 646625 ns 555812.5 ns 1.16
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 638374.5 ns 589334 ns 1.08
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 132515 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2433792 ns 2506042 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2525125 ns 2452187.5 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2458416 ns 2421083 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2464167 ns 2509083.5 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1286025 ns
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 4270.5 ns 3042 ns 1.40
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 2791 ns 3500 ns 0.80
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 4334 ns 3709 ns 1.17
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 3021 ns 2834 ns 1.07
batchedmm(2, Bsize=32)/forward/GPU/CUDA 17018 ns
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 5583 ns 5458 ns 1.02
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 5542 ns 5625 ns 0.99
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 5500 ns 5625 ns 0.98
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 5584 ns 5583 ns 1.00
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 187936.5 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1463042 ns 1459917 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1495875 ns 1499291 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1503458 ns 1501417 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1446334 ns 1439583 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 41308.5 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5127000 ns 5106812.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5300416.5 ns 5286437.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5293458 ns 5284041.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4725667 ns 4996333.5 ns 0.95
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 195229 ns
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3709 ns 3709 ns 1
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3709 ns 3708 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3709 ns 3708 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3708 ns 3750 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 33264.5 ns
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 15250 ns 15250 ns 1
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15083 ns 15417 ns 0.98
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15417 ns 15416 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15125 ns 15000 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 350238 ns
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 71333 ns 71500 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 71417 ns 71333 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 71208 ns 70542 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 71500 ns 71250 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 112408 ns
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 318125 ns 319958 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 327584 ns 318333 ns 1.03
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 319500 ns 318208 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 320333 ns 321834 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 194166 ns
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 1000 ns 1000 ns 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 1084 ns 1083 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 1125 ns 1084 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 1000 ns 959 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 23803 ns
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8000 ns 7916 ns 1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8417 ns 8208 ns 1.03
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8417 ns 8125 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 7708 ns 7667 ns 1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 246141 ns
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 501979.5 ns 514834 ns 0.98
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 480104 ns 490208 ns 0.98
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 566979 ns 567542 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 220416 ns 218520.5 ns 1.01
batchedmm(128, Bsize=32)/forward/GPU/CUDA 128980 ns
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 1391667 ns 1371833 ns 1.01
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 1479770.5 ns 1457062.5 ns 1.02
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 1756604 ns 1755667 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 864792 ns 909250 ns 0.95
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 275170 ns
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 417 ns 416 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 333 ns 1.13
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 292 ns 292 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 31717 ns
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6625 ns 6166 ns 1.07
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6542 ns 6708 ns 0.98
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6500 ns 6125 ns 1.06
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 5958 ns 6083 ns 0.98
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 248251 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1776021 ns 1721334 ns 1.03
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1733687.5 ns 1725146 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1727458 ns 1724500 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1726125 ns 1728229.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 167904 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4363208 ns 4358375 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4382750 ns 4376792 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4374000 ns 4335333 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4367334 ns 4390375 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1079923 ns
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 6875 ns 6750 ns 1.02
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 6708 ns 6625 ns 1.01
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 6792 ns 6875 ns 0.99
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 6666 ns 6542 ns 1.02
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 19517 ns
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 59895.5 ns 32500 ns 1.84
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 49208 ns 50895.5 ns 0.97
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 52583 ns 32875 ns 1.60
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 32417 ns 49729 ns 0.65
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 267079.5 ns
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 18084 ns 17937.5 ns 1.01
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 18292 ns 18042 ns 1.01
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 19709 ns 18125 ns 1.09
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 18292 ns 18458 ns 0.99
batchedmm(2, Bsize=512)/forward/GPU/CUDA 18390 ns
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 53833 ns 53208 ns 1.01
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 53375 ns 53250 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 53375 ns 53250 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 53625 ns 53562.5 ns 1.00
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 319120 ns
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 75333 ns 75709 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 75583 ns 75291 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 75250 ns 75208 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 75500 ns 75250 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 46304 ns
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 324291 ns 330270.5 ns 0.98
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 336479.5 ns 328625 ns 1.02
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 324708 ns 325083 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 327458 ns 329042 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 209708.5 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1487583 ns 1486375 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1522083 ns 1526375 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1529334 ns 1527375 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1471333 ns 1464666 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 52335 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5126125 ns 5175375 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5305125 ns 5310021 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5295000 ns 4950479 ns 1.07
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4684000 ns 5010146 ns 0.93
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 202194.5 ns
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 28333 ns 28208 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 28333 ns 28375 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 28292 ns 28292 ns 1
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 28209 ns 28250 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 24238 ns
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 66500 ns 66292 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 66250 ns 66375 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 66416 ns 66459 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66625 ns 66459 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 495044 ns
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 1478812 ns 1396208.5 ns 1.06
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 933416.5 ns 1137042 ns 0.82
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 1129625 ns 1061959 ns 1.06
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) 2267917 ns 2245417 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA 577563.5 ns
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) 3095187.5 ns 2966209 ns 1.04
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 2641125 ns 2741250 ns 0.96
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) 2747417 ns 2597667 ns 1.06
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) 3815833.5 ns 3844125 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA 1965829 ns
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) 7798041 ns 7918709 ns 0.98
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) 8017625 ns 7905417 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) 7904083.5 ns 7547354 ns 1.05
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) 4861812 ns 4916042 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 119833.5 ns 80583 ns 1.49
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 81604 ns 81458 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 82000 ns 81541 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 80604 ns 80709 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 193857.5 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2020000 ns 2026042 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2021083 ns 2026125.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2024292 ns 1719750 ns 1.18
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1749917 ns 2018208 ns 0.87
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 744082.5 ns

This comment was automatically generated by workflow using github-action-benchmark.
