8b87c2b
Lux Benchmarks
| Benchmark suite | Current (ns) | Previous (ns) | Ratio |
|-|-|-|-|
| layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4709 | 4625 | 1.02 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 4792 | 4084 | 1.17 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 5166 | 5791 | 0.89 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 4416 | 4292 | 1.03 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 60862 | 60959 | 1.00 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 10416 | 10125 | 1.03 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9875 | 9959 | 0.99 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 11417 | 10375 | 1.10 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 10542 | 10666 | 0.99 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 426730.5 | 427044 | 1.00 |
| bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 1000 | 1167 | 0.86 |
| bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 1333 | 1250 | 1.07 |
| bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 1291 | 1458 | 0.89 |
| bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 1395.5 | 3542 | 0.39 |
| bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA | 18565 | 18260 | 1.02 |
| bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 4209 | 4125 | 1.02 |
| bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 4042 | 3833 | 1.05 |
| bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 4167 | 4125 | 1.01 |
| bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 4000 | 4000 | 1 |
| bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA | 111556 | 111381 | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 56375 | 57709 | 0.98 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46916 | 47250 | 0.99 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46167 | 38250 | 1.21 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 80959 | 80333 | 1.01 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 37697 | 37655 | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2046500 | 2026167 | 1.01 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2089354 | 2092708.5 | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2048708.5 | 2059625.5 | 0.99 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1993834 | 1993416 | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 199690 | 197377 | 1.01 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 147104.5 | 152958 | 0.96 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 144104.5 | 148250 | 0.97 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 148584 | 146417 | 1.01 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 144583.5 | 150375 | 0.96 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 165605 | 167595 | 0.99 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1131291 | 1098542 | 1.03 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1119584 | 1124250 | 1.00 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1111791.5 | 1116146 | 1.00 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1118209 | 1107229.5 | 1.01 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 531488 | 523151 | 1.02 |
| layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 3500 | 3584 | 0.98 |
| layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 3709 | 3625 | 1.02 |
| layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 5520.5 | 5708.5 | 0.97 |
| layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 3375 | 3417 | 0.99 |
| layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 71213 | 70157 | 1.02 |
| layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9084 | 8834 | 1.03 |
| layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9625 | 8667 | 1.11 |
| layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10167 | 9291 | 1.09 |
| layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8584 | 9042 | 0.95 |
| layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 497375 | 492826.5 | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 15458.5 | 17000 | 0.91 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 15250 | 16375 | 0.93 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 19146 | 18667 | 1.03 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 14604 | 17083 | 0.85 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 55040 | 54850 | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 213833 | 213146 | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 213292 | 216104 | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 215292 | 214167 | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 217500 | 225333 | 0.97 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 277020.5 | 272672.5 | 1.02 |
| bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 542 | 459 | 1.18 |
| bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 583 | 542 | 1.08 |
| bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 709 | 709 | 1 |
| bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 625 | 583 | 1.07 |
| bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA | 17919 | 17542 | 1.02 |
| bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 1542 | 1708 | 0.90 |
| bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 1458 | 1458 | 1 |
| bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 1916 | 1625 | 1.18 |
| bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 1375 | 1750 | 0.79 |
| bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA | 104816 | 104205 | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7250 | 7250 | 1 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5875 | 5833 | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5916 | 5209 | 1.14 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 9875 | 4000 | 2.47 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 24078 | 23961 | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 229750 | 228750.5 | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 228583 | 228333 | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 230292 | 228500 | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 213917 | 226334 | 0.95 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 172648 | 170956 | 1.01 |
| dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 3875 | 3875 | 1 |
| dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 3833 | 3875 | 0.99 |
| dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 3875 | 3916 | 0.99 |
| dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 3875 | 3834 | 1.01 |
| dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA | 23922 | 23832 | 1.00 |
| dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16458 | 16833 | 0.98 |
| dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16583 | 16708 | 0.99 |
| dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16958 | 16708 | 1.01 |
| dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16750 | 16958 | 0.99 |
| dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA | 166168.5 | 165501.5 | 1.00 |
| dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 579542 | 579042 | 1.00 |
| dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 576458 | 574375 | 1.00 |
| dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 578750 | 575083 | 1.01 |
| dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 574667 | 576292 | 1.00 |
| dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA | 113828 | 113664 | 1.00 |
| dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1424688 | 1417708 | 1.00 |
| dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1421083 | 1429333 | 0.99 |
| dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1423208.5 | 1425729.5 | 1.00 |
| dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 1419500 | 1422208 | 1.00 |
| dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA | 215564 | 214791 | 1.00 |
| lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) | 1071229.5 | 1082104 | 0.99 |
| lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) | 961417 | 959958.5 | 1.00 |
| lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) | 1343000 | 1341792 | 1.00 |
| lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) | 1300000.5 | 1294792 | 1.00 |
| lenet(28, 28, 1, 64)/forward/GPU/CUDA | 277770.5 | 281583.5 | 0.99 |
| lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) | 5955916 | 5777875 | 1.03 |
| lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) | 4519500 | 4456083 | 1.01 |
| lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) | 4916354.5 | 4934792 | 1.00 |
| lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) | 5726333 | 5627500 | 1.02 |
| lenet(28, 28, 1, 64)/zygote/GPU/CUDA | 1105672 | 1106964 | 1.00 |
| dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 500 | 500 | 1 |
| dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 500 | 500 | 1 |
| dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 583 | 542 | 1.08 |
| dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 500 | 542 | 0.92 |
| dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA | 24042 | 23988 | 1.00 |
| dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2084 | 2084 | 1 |
| dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2084 | 2083 | 1.00 |
| dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 2208 | 2125 | 1.04 |
| dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2125 | 2125 | 1 |
| dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA | 173326.5 | 179026 | 0.97 |
| layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 4000 | 6084 | 0.66 |
| layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 4584 | 6167 | 0.74 |
| layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 7083 | 7041 | 1.01 |
| layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 4125 | 6375 | 0.65 |
| layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 65959 | 66163.5 | 1.00 |
| layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 11084 | 11291 | 0.98 |
| layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 11000 | 10791 | 1.02 |
| layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 12292 | 12125 | 1.01 |
| layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10791 | 11354.5 | 0.95 |
| layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 456125.5 | 456626.5 | 1.00 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7000 | 7000 | 1 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 6458 | 7042 | 0.92 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8500 | 8375 | 1.01 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 6292 | 7042 | 0.89 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 54186 | 52652 | 1.03 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 16708 | 17375 | 0.96 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 17875 | 17167 | 1.04 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 18750 | 17770.5 | 1.06 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 16875 | 18708 | 0.90 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 308312 | 306093.5 | 1.01 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 500 | 459 | 1.09 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 583 | 459 | 1.27 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 583 | 583 | 1 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 500 | 542 | 0.92 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 33294 | 33004 | 1.01 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 8708 | 8583 | 1.01 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9208 | 8208 | 1.12 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9458 | 9583 | 0.99 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 8292 | 9042 | 0.92 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 162415.5 | 162492.5 | 1.00 |
| dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 64625 | 64542 | 1.00 |
| dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 64667 | 64417 | 1.00 |
| dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 64666 | 64625 | 1.00 |
| dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 64625 | 64750 | 1.00 |
| dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA | 112234 | 112347.5 | 1.00 |
| dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 284395.5 | 277542 | 1.02 |
| dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 286937.5 | 281625 | 1.02 |
| dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 285291 | 288750 | 0.99 |
| dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 277917 | 275500 | 1.01 |
| dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA | 188885.5 | 189809 | 1.00 |
| mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) | 3237000 | 3285583 | 0.99 |
| mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) | 3046417 | 3022333.5 | 1.01 |
| mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) | 3014917 | 2780375 | 1.08 |
| mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) | 3953541.5 | 4038625 | 0.98 |
| mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA | 577323 | 573967 | 1.01 |
| mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) | 7569937.5 | 7586208.5 | 1.00 |
| mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) | 7460791.5 | 7415437 | 1.01 |
| mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) | 7457666.5 | 7333375 | 1.02 |
| mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) | 8209666 | 8220958 | 1.00 |
| mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA | 1380365.5 | 1351752.5 | 1.02 |
| mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) | 18994750 | 18835167 | 1.01 |
| mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) | 19146458 | 19044834 | 1.01 |
| mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) | 19185583 | 19135125 | 1.00 |
| mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) | 15773833 | 15633417 | 1.01 |
| Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 24040875 | 23661916.5 | 1.02 |
| Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 33769833 | 33965500 | 0.99 |
| Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 37025062.5 | 41107417 | 0.90 |
| Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 34849833 | 34858709 | 1.00 |
| Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1855448 | 1862815 | 1.00 |
| Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 192176500 | 189289541 | 1.02 |
| Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 165400792 | 164224708 | 1.01 |
| Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 153088459 | 157847979 | 0.97 |
| Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 439540208 | 438904833 | 1.00 |
| Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 13926820 | 13913764 | 1.00 |
| Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 292222499.5 | 289733584 | 1.01 |
| Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 338088333 | 338173667 | 1.00 |
| Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 298393250 | 307489541.5 | 0.97 |
| Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 394164437.5 | 393585937.5 | 1.00 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 23395.5 | 21708.5 | 1.08 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 23000 | 24458 | 0.94 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 26479.5 | 25937 | 1.02 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 22271 | 24229 | 0.92 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 96215.5 | 96907 | 0.99 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 103541.5 | 103750 | 1.00 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 104375 | 105292 | 0.99 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 105000 | 104208 | 1.01 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 106291 | 151250 | 0.70 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 499410 | 504189 | 0.99 |
| layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7125 | 6583 | 1.08 |
| layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 6542 | 7292 | 0.90 |
| layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 7916 | 7959 | 0.99 |
| layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 5875 | 6958 | 0.84 |
| layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 67753 | 68581 | 0.99 |
| layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 15250 | 14916.5 | 1.02 |
| layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 15500 | 14709 | 1.05 |
| layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 16666 | 16666 | 1 |
| layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14667 | 14292 | 1.03 |
| layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 471687 | 483895 | 0.97 |
| Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 3030208.5 | 3017937 | 1.00 |
| Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2057020.5 | 2022458 | 1.02 |
| Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2271375 | 2307959 | 0.98 |
| Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4518521 | 4846645.5 | 0.93 |
| Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 585712 | 585796 | 1.00 |
| Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 23780833 | 23617917 | 1.01 |
| Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 17907042 | 17975417 | 1.00 |
| Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 16907896 | 18323812.5 | 0.92 |
| Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 34889792 | 35597209 | 0.98 |
| Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3222471 | 3109235 | 1.04 |
| Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 33703875 | 33405687.5 | 1.01 |
| Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 27577959 | 27693604 | 1.00 |
| Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 27463958 | 27860958 | 0.99 |
| Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 41773187 | 42002937.5 | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 73687.5 | 72375 | 1.02 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 73292 | 84624.5 | 0.87 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 83417 | 83250 | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 74667 | 73750 | 1.01 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 101830 | 102852 | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 318542 | 218167 | 1.46 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 216770.5 | 309979 | 0.70 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 219750 | 317479 | 0.69 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 297396 | 288875 | 1.03 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 550055 | 550996 | 1.00 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 11937.5 | 12041 | 0.99 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 11958 | 12729.5 | 0.94 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 13395.5 | 13833 | 0.97 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 11584 | 11666.5 | 0.99 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 71500 | 71604 | 1.00 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 26666 | 26625 | 1.00 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 26875 | 26959 | 1.00 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 27792 | 28292 | 0.98 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 26500 | 26458 | 1.00 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 478647.5 | 484486.5 | 0.99 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 12458 | 12417 | 1.00 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 12750 | 12542 | 1.02 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 14042 | 14584 | 0.96 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 12042 | 13041.5 | 0.92 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 54279 | 53694 | 1.01 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 25792 | 26312.5 | 0.98 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 25791 | 26270.5 | 0.98 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 26584 | 26667 | 1.00 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 25833.5 | 26333 | 0.98 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 307846.5 | 309291.5 | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 180187.5 | 178770.5 | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 179750 | 182334 | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 183375 | 184895.5 | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 179041 | 179750 | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 57080 | 57908 | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 584708.5 | 587125 | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 587833 | 596500 | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 595750 | 593770.5 | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 587000 | 583166 | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 286439 | 290369.5 | 0.99 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 6541.5 | 7354.5 | 0.89 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 6708 | 7167 | 0.94 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 7500 | 7875 | 0.95 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 5750 | 6833 | 0.84 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 70275 | 70829 | 0.99 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 13937.5 | 14375 | 0.97 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14708 | 14708 | 1 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 15583 | 15625 | 1.00 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 13500 | 14083 | 0.96 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 465284 | 471312.5 | 0.99 |
| batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) | 1198000 | 1235042 | 0.97 |
| batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) | 1218958 | 1283583 | 0.95 |
| batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) | 1268562.5 | 1282875 | 0.99 |
| batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) | 1315416 | 1325208 | 0.99 |
| batchedmm(512, Bsize=4)/forward/GPU/CUDA | 302635 | 301270 | 1.00 |
| batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) | 4311792 | 4111125 | 1.05 |
| batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) | 4360354 | 4361625 | 1.00 |
| batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) | 4524583 | 4786395.5 | 0.95 |
| batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) | 4481833 | 4453229.5 | 1.01 |
| batchedmm(512, Bsize=4)/zygote/GPU/CUDA | 1039337 | 1047552 | 0.99 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1833 | 1792 | 1.02 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1834 | 1750 | 1.05 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1834 | 1834 | 1 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1833 | 1834 | 1.00 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA | 23819 | 23328 | 1.02 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 4834 | 4833 | 1.00 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 4875 | 4792 | 1.02 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 5000 | 4917 | 1.02 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 4875 | 4917 | 0.99 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA | 189325 | 186698 | 1.01 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6250 | 7208.5 | 0.87 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6084 | 5584 | 1.09 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 8291 | 8667 | 0.96 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 5750 | 7312.5 | 0.79 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 56699 | 54539 | 1.04 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 11125 | 10833 | 1.03 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 12083 | 10834 | 1.12 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 11875 | 12375 | 0.96 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 11125 | 11916 | 0.93 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 333470 | 329099 | 1.01 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 333 | 292 | 1.14 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 333 | 292 | 1.14 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 333 | 334 | 1.00 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 292 | 333 | 0.88 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA | 23140 | 22753 | 1.02 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2834 | 2708 | 1.05 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2709 | 2667 | 1.02 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 3042 | 2959 | 1.03 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2750 | 3000 | 0.92 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA | 160474 | 157496 | 1.02 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 11833 | 13167 | 0.90 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 12500 | 13166 | 0.95 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 15000 | 15000 | 1 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 11667 | 13792 | 0.85 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 57479 | 55218 | 1.04 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 24667 | 24833 | 0.99 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 25000 | 24542 | 1.02 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 25583 | 25375 | 1.01 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 25125 | 24709 | 1.02 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 294701.5 | 289966 | 1.02 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4167 | 4083 | 1.02 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4208 | 4166 | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4208 | 4167 | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4125 | 4125 | 1 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA | 25243 | 24660 | 1.02 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 15959 | 15958 | 1.00 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16167 | 16417 | 0.98 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16500 | 16042 | 1.03 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16125 | 16125 | 1 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA | 196657.5 | 194045.5 | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5667 | 5667 | 1 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 5708 | 5625 | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5709 | 5750 | 0.99 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5708 | 5791 | 0.99 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 34103 | 32989 | 1.03 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 20375 | 21125 | 0.96 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 21166 | 20459 | 1.03 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 21500 | 21542 | 1.00 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 21083 | 20875 | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 178406.5 | 174273 | 1.02 |
| batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 380541 | 403209 | 0.94 |
| batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 375333 | 371125 | 1.01 |
| batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 487875 | 474292 | 1.03 |
| batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 532687 | 539604.5 | 0.99 |
| batchedmm(16, Bsize=512)/forward/GPU/CUDA | 67192 | 66734 | 1.01 |
| batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 993167 | 1011917 | 0.98 |
| batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 884334 | 884896 | 1.00 |
| batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 1238562.5 | 1220125 | 1.02 |
| batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 1412624.5 | 1400208 | 1.01 |
| batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 189581 | 190566.5 | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 86875 | 82917 | 1.05 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 80583 | 82791 | 0.97 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 85875 | 88958.5 | 0.97 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 80791.5 | 83187.5 | 0.97 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 192886.5 | 192556.5 | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1924208 | 1921500 | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1916917 | 1696166 | 1.13 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1920541 | 1938083 | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1907750 | 1915875 | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 398152 | 393732 | 1.01 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 | 291 | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 | 291 | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 | 292 | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 292 | 292 | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 22307 | 21580 | 1.03 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1791 | 1792 | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1792 | 1792 | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1833 | 1833 | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1792 | 1833 | 0.98 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 170162 | 165924 | 1.03 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 6792 | 6708 | 1.01 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 7458.5 | 6250 | 1.19 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
9604.5
ns9750
ns0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
6458.5
ns8125
ns0.79
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA
60140
ns56950.5
ns1.06
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
8875
ns8916.5
ns1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
9208
ns8958
ns1.03
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
9250
ns9625
ns0.96
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
9208
ns9542
ns0.96
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA
308605.5
ns299584.5
ns1.03
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s)
156095333.5
ns120035854.5
ns1.30
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s)
174294250
ns174382959
ns1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s)
147908167
ns154831333
ns0.96
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s)
105395375
ns103109500
ns1.02
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA
5479498
ns5474606
ns1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s)
674867041
ns617124000
ns1.09
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s)
555334333
ns555612167
ns1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s)
454020333.5
ns468382792
ns0.97
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s)
758003104
ns756087750
ns1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA
34951781
ns38213656
ns0.91
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s)
701059834
ns651747459
ns1.08
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s)
666716125.5
ns666674583.5
ns1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s)
580121499.5
ns602170708.5
ns0.96
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s)
741952792
ns734251875
ns1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
57708
ns57208
ns1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
47333
ns48167
ns0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
47250
ns39167
ns1.21
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
83959
ns83958
ns1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA
37806
ns37250
ns1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1934958.5
ns1929792
ns1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1972000
ns1973292
ns1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1976374.5
ns1984249.5
ns1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1886667
ns1881417
ns1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
174540
ns171491
ns1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
274833.5
ns273354
ns1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
267625
ns267959
ns1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
288750
ns270687.5
ns1.07
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
275791.5
ns268834
ns1.03
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
127747
ns124192.5
ns1.03
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
588791.5
ns658333
ns0.89
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
676334
ns674854.5
ns1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
669375.5
ns665333
ns1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
637708
ns670500
ns0.95
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
705367
ns664813
ns1.06
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
2201812.5
ns2190167
ns1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
2173417
ns2214354.5
ns0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
2204166
ns2216958.5
ns0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
2175854
ns2099979
ns1.04
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA
133869
ns133238
ns1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
5561000
ns5505354.5
ns1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
5485083
ns5504750
ns1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
5500791
ns5565292
ns0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
5486667
ns5499708
ns1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
758600
ns740235
ns1.02
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s)
650375
ns650417
ns1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s)
639375
ns649020.5
ns0.99
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s)
639250
ns640625
ns1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s)
645541
ns648292
ns1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA
46906
ns47265
ns0.99
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s)
1797375
ns1821708
ns0.99
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s)
1723000
ns1720959
ns1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s)
1729417
ns1675729.5
ns1.03
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s)
2102375
ns2108500
ns1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA
224012.5
ns224014
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
57125
ns58583
ns0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
46792
ns46645.5
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
46792
ns38750
ns1.21
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
83625
ns83834
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA
28934
ns28947
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2042125
ns2024916
ns1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2085750
ns2086188
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2086104
ns2100521
ns0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1992187.5
ns1993416.5
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
192769
ns191815.5
ns1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s)
13486000
ns13473875
ns1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s)
12454854
ns12547041.5
ns0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s)
12584062
ns12559604
ns1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s)
15166646
ns15213416.5
ns1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA
516981.5
ns517805
ns1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s)
47757417
ns47353458
ns1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s)
41920875
ns41833334
ns1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s)
41057895.5
ns41118750
ns1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s)
58660917
ns58300041
ns1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA
3200471
ns3203904
ns1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s)
74173979
ns74077042
ns1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s)
68296125
ns68022250
ns1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s)
90853250
ns90906749.5
ns1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s)
76369500
ns99115937.5
ns0.77
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
57542
ns58958
ns0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
47333
ns47375
ns1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
47208
ns38729.5
ns1.22
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
83542
ns83500
ns1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
47283
ns47777
ns0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1917416.5
ns1923375
ns1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1969750
ns1961541
ns1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1977666
ns1980229
ns1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1891062.5
ns1890354
ns1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
191945
ns194350.5
ns0.99
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s)
250
ns291
ns0.86
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s)
417
ns291
ns1.43
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s)
416
ns375
ns1.11
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s)
333
ns375
ns0.89
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA
32084
ns32617.5
ns0.98
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
6166
ns6208.5
ns0.99
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
6417
ns5958
ns1.08
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
6959
ns6708
ns1.04
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
6334
ns6437.5
ns0.98
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA
173427.5
ns173722.5
ns1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s)
291
ns250
ns1.16
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s)
292
ns250
ns1.17
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s)
292
ns292
ns1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s)
250
ns292
ns0.86
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA
31620
ns32110
ns0.98
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s)
2667
ns2583
ns1.03
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s)
2792
ns2542
ns1.10
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s)
2959
ns2833
ns1.04
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s)
2625
ns2833
ns0.93
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA
161588.5
ns161891
ns1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s)
322222750
ns286335145.5
ns1.13
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s)
341161875
ns339870250
ns1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s)
313409520.5
ns320445937.5
ns0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s)
272857666
ns272825875
ns1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA
7106282
ns7113314
ns1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s)
1057275812.5
ns990386709
ns1.07
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s)
937359791
ns938484666
ns1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s)
852420750
ns868613416.5
ns0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s)
1161160000
ns1158749666
ns1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA
34076180
ns33903874
ns1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s)
1357441042
ns1310266104.5
ns1.04
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s)
1321006541.5
ns1325766333.5
ns1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s)
1604272875
ns1623996500
ns0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s)
1302899708.5
ns1663239334
ns0.78
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
1417312.5
ns1461479
ns0.97
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
1438625
ns1415750
ns1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
1422375
ns1429167
ns1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
1404187.5
ns1414437.5
ns0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA
127360
ns128213
ns0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
5059667
ns5019792
ns1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
5032458
ns5022458
ns1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
5024750
ns5050000
ns0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
5017709
ns5006541.5
ns1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
498493.5
ns557532
ns0.89
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s)
172134417
ns175263520.5
ns0.98
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s)
132190854
ns129816208.5
ns1.02
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s)
125671875
ns145953208.5
ns0.86
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s)
162159562.5
ns164619104.5
ns0.99
vgg16(32, 32, 3, 32)/forward/GPU/CUDA
4881912.5
ns4883992
ns1.00
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s)
676531000
ns831528333
ns0.81
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s)
642244500
ns497840084
ns1.29
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s)
502997666
ns556789916
ns0.90
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s)
678617458
ns679969833
ns1.00
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA
17408311
ns16195623
ns1.07
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s)
9098854
ns8914083
ns1.02
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s)
8775166.5
ns8769917
ns1.00
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s)
7856833.5
ns8216313
ns0.96
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s)
10166000
ns10158000
ns1.00
batchedmm(512, Bsize=32)/forward/GPU/CUDA
1591045
ns1595526
ns1.00
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s)
37558563
ns35894250
ns1.05
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s)
37073459
ns36843625
ns1.01
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s)
33526542
ns34476562
ns0.97
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s)
38790125
ns38802729
ns1.00
batchedmm(512, Bsize=32)/zygote/GPU/CUDA
6476971
ns6454567.5
ns1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s)
47333
ns47396
ns1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s)
47333
ns49334
ns0.96
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s)
47625
ns47542
ns1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s)
47125
ns47417
ns0.99
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA
19085
ns19457
ns0.98
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s)
50333
ns50292
ns1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s)
52875
ns50520.5
ns1.05
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s)
53083
ns50584
ns1.05
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s)
50250
ns50250
ns1
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA
184149.5
ns189575
ns0.97
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
7458
ns8104
ns0.92
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
7333
ns6791
ns1.08
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
8667
ns9125
ns0.95
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
6708
ns7333
ns0.91
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA
84192.5
ns86829.5
ns0.97
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
9917
ns9875
ns1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
9917
ns9583
ns1.03
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
11041
ns10375
ns1.06
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
9917
ns10208
ns0.97
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA
493810
ns537525
ns0.92
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
7542
ns8208
ns0.92
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
7667
ns8250
ns0.93
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
9667
ns9812.5
ns0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
5417
ns6375
ns0.85
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA
91440.5
ns113788.5
ns0.80
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
12625
ns13333.5
ns0.95
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
13833
ns12625
ns1.10
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
14000
ns13584
ns1.03
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
12666
ns13208
ns0.96
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA
454481
ns479705.5
ns0.95
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
1000
ns958
ns1.04
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
1083
ns958
ns1.13
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
1083
ns1042
ns1.04
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
1000
ns1083
ns0.92
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA
32617
ns32580
ns1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
7708
ns7750
ns0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
8145.5
ns7625
ns1.07
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
8500
ns8542
ns1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
8125
ns8208
ns0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA
196206.5
ns201701.5
ns0.97
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s)
23083
ns23250
ns0.99
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s)
23375
ns23042
ns1.01
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s)
23583
ns23500
ns1.00
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s)
23750
ns23167
ns1.03
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA
18627
ns18765.5
ns0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s)
52500
ns52875
ns0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s)
52875
ns52292
ns1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s)
53417
ns52792
ns1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s)
52166
ns52459
ns0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA
249106
ns260844.5
ns0.95
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
1448167
ns1400229
ns1.03
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
1405000
ns1398666.5
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
1405874.5
ns1400708
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
1403917
ns1398917
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
195637
ns196521.5
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
5038167
ns5018604
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
5020646
ns5004729.5
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
5017458
ns5044229.5
ns0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
5008375
ns5001271
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
558064
ns595122
ns0.94
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s)
3065354.5
ns3043083
ns1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s)
2082084
ns2094042
ns0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s)
2285291
ns2287146
ns1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s)
4897375
ns4530875
ns1.08
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA
583035
ns582703
ns1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s)
24715854
ns24366625
ns1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s)
18870292
ns18829583
ns1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s)
18758208
ns19120291
ns0.98
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s)
36783917
ns36653000
ns1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA
3184571
ns3189516.5
ns1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s)
34426125
ns33943229
ns1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s)
28319896
ns28373417
ns1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s)
28022958.5
ns28357208
ns0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s)
41761166.5
ns41659750
ns1.00
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s)
144957333
ns144299750
ns1.00
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s)
142855500
ns142248375
ns1.00
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s)
124763354
ns126632146
ns0.99
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s)
173311167
ns173840291.5
ns1.00
batchedmm(512, Bsize=512)/forward/GPU/CUDA
22559600
ns22781482
ns0.99
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s)
956543708
ns1307941437.5
ns0.73
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s)
1622781604
ns1133574500.5
ns1.43
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s)
1236835833
ns711240125
ns1.74
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s)
673901750
ns670828250
ns1.00
batchedmm(512, Bsize=512)/zygote/GPU/CUDA
118606884
ns118499942
ns1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
74208
ns74542
ns1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
74834
ns73917
ns1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
86875
ns83125
ns1.05
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
73041.5
ns72916.5
ns1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
204598.5
ns225032.5
ns0.91
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
278208.5
ns202979.5
ns1.37
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
202666.5
ns282792
ns0.72
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
288416
ns253479.5
ns1.14
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
287917
ns244146
ns1.18
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
1117217.5
ns1201754
ns0.93
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s)
36148959
ns35408938
ns1.02
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s)
35295854
ns35449645.5
ns1.00
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s)
32189834
ns32512083
ns0.99
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s)
40944021
ns41003541.5
ns1.00
batchedmm(512, Bsize=128)/forward/GPU/CUDA
5845476
ns5848198
ns1.00
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s)
151293125
ns146608875
ns1.03
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s)
152622708.5
ns151542938
ns1.01
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s)
134152417
ns138849083
ns0.97
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s)
287902584
ns287439584
ns1.00
batchedmm(512, Bsize=128)/zygote/GPU/CUDA
34882228
ns34913824
ns1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s)
155688000
ns121086291.5
ns1.29
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s)
174601250
ns174190000
ns1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s)
147696687.5
ns155717667
ns0.95
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s)
106151041.5
ns106488666.5
ns1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA
5471843
ns5478422
ns1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s)
518343938
ns611208666
ns0.85
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s)
467330167
ns466441167
ns1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s)
438511083.5
ns453562937.5
ns0.97
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s)
738327500
ns741621625
ns1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA
32271735
ns35157227
ns0.92
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s)
689829417
ns648662584
ns1.06
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s)
655962042
ns657411208
ns1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s)
572893458
ns585962375
ns0.98
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s)
850499333
ns845072208
ns1.01
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s)
1204208
ns1304708
ns0.92
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s)
909228.5
ns965666
ns0.94
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s)
975604.5
ns744354
ns1.31
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s)
2068166
ns1944604
ns1.06
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA
573967.5
ns572387
ns1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s)
2921979
ns2974271
ns0.98
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s)
2595937
ns2531646
ns1.03
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s)
2601958
ns2512854
ns1.04
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s)
3701291
ns3691334
ns1.00
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA
1629819
ns1817474
ns0.90
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s)
6735042
ns6642416
ns1.01
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s)
6496187.5
ns6630792
ns0.98
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s)
6432833.5
ns6466375
ns0.99
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s)
4458667
ns4443145.5
ns1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
7208
ns7334
ns0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
6084
ns6208
ns0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
6125
ns5458
ns1.12
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
10000
ns10167
ns0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
25112
ns25916
ns0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
214479.5
ns212104
ns1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
219625
ns219562.5
ns1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
221583
ns220667
ns1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
206125
ns206291
ns1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
247799
ns257490
ns0.96
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s)
312548750
ns301772791.5
ns1.04
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s)
223228250
ns222879750
ns1.00
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s)
196993083
ns222700312.5
ns0.88
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s)
310829208
ns311773125
ns1.00
vgg16(32, 32, 3, 64)/forward/GPU/CUDA
7675013
ns7676597.5
ns1.00
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s)
1097849625.5
ns1082870459
ns1.01
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s)
906889750
ns892532250
ns1.02
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s)
868243875
ns883941208.5
ns0.98
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s)
1161595250
ns1154293562
ns1.01
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA
26504585
ns26959026
ns0.98
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
5250
ns6459
ns0.81
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
6520.5
ns5209
ns1.25
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
7375
ns10000
ns0.74
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
5125
ns5708.5
ns0.90
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA
155225.5
ns168546.5
ns0.92
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
6917
ns7458
ns0.93
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
7541
ns6792
ns1.11
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
7584
ns7542
ns1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
7250
ns7792
ns0.93
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA
614403
ns639812.5
ns0.96
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 458 ns | 1.09
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 542 ns | 458 ns | 1.18
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 584 ns | 542 ns | 1.08
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 458 ns | 542 ns | 0.85
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 24324 ns | 24361 ns | 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9209 ns | 9000 ns | 1.02
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9333 ns | 9000 ns | 1.04
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9709 ns | 9583 ns | 1.01
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9083 ns | 9708 ns | 0.94
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 214987 ns | 234125.5 ns | 0.92
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 352000 ns | 351500 ns | 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 351167 ns | 351500 ns | 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 352000 ns | 351916 ns | 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 351667 ns | 356625 ns | 0.99
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA | 21526 ns | 21502 ns | 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 822667 ns | 811270.5 ns | 1.01
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 803791 ns | 774958.5 ns | 1.04
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 774000 ns | 776584 ns | 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 819209 ns | 821875 ns | 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA | 271931 ns | 315795.5 ns | 0.86
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 315625 ns | 335896 ns | 0.94
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 334062.5 ns | 338208.5 ns | 0.99
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 448958 ns | 441167 ns | 1.02
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 335542 ns | 331375 ns | 1.01
batchedmm(16, Bsize=32)/forward/GPU/CUDA | 18135.5 ns | 18761.5 ns | 0.97
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 693229 ns | 695166 ns | 1.00
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 737125 ns | 738208 ns | 1.00
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 1034583 ns | 1036458 ns | 1.00
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 697563 ns | 692396 ns | 1.01
batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 240714.5 ns | 292461.5 ns | 0.82
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 329166 ns | 354166.5 ns | 0.93
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 345354 ns | 346771 ns | 1.00
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 424875 ns | 433791 ns | 0.98
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 374166 ns | 370250 ns | 1.01
batchedmm(16, Bsize=128)/forward/GPU/CUDA | 22796 ns | 23121 ns | 0.99
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 753187.5 ns | 757417 ns | 0.99
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 751083 ns | 749625 ns | 1.00
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 1069042 ns | 1070562.5 ns | 1.00
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 824250 ns | 828458 ns | 0.99
batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 214489 ns | 257074.5 ns | 0.83
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 3458 ns | 3292 ns | 1.05
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 3500 ns | 3458 ns | 1.01
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 3875 ns | 3750 ns | 1.03
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 3292 ns | 3417 ns | 0.96
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 18145 ns | 18586 ns | 0.98
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 4417 ns | 4167 ns | 1.06
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 4208 ns | 4375 ns | 0.96
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 4333 ns | 4417 ns | 0.98
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 4209 ns | 4250 ns | 0.99
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 237972.5 ns | 296700.5 ns | 0.80
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6417 ns | 3625 ns | 1.77
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4042 ns | 3750 ns | 1.08
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6542 ns | 6541 ns | 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 3375 ns | 6354.5 ns | 0.53
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 174590 ns | 232189.5 ns | 0.75
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8209 ns | 8187.5 ns | 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8250 ns | 8000 ns | 1.03
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8708 ns | 8458 ns | 1.03
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8709 ns | 8500 ns | 1.02
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 1063088 ns | 1227082 ns | 0.87
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 203375 ns | 203417 ns | 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 209625 ns | 209541.5 ns | 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 210958 ns | 208250 ns | 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 200833 ns | 198709 ns | 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 34926 ns | 35300 ns | 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 601916 ns | 612417 ns | 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 633750 ns | 623292 ns | 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 622208.5 ns | 623250 ns | 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 586000 ns | 630166 ns | 0.93
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 307649.5 ns | 347973 ns | 0.88
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 966417 ns | 977646 ns | 0.99
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 932833 ns | 935437.5 ns | 1.00
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 945958.5 ns | 970083 ns | 0.98
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 1291166 ns | 1286374.5 ns | 1.00
batchedmm(128, Bsize=128)/forward/GPU/CUDA | 208387 ns | 209031 ns | 1.00
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 4606250 ns | 4514333 ns | 1.02
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 4489917 ns | 4466146 ns | 1.01
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 4299708 ns | 4452875 ns | 0.97
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 6229250 ns | 6260416.5 ns | 1.00
batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 933347.5 ns | 947144.5 ns | 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3875 ns | 3542 ns | 1.09
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3833 ns | 3417 ns | 1.12
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6167 ns | 5896 ns | 1.05
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 2917 ns | 6667 ns | 0.44
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 191984.5 ns | 219336.5 ns | 0.88
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7666 ns | 6917 ns | 1.11
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7125 ns | 6958 ns | 1.02
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7667 ns | 7708 ns | 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7208 ns | 7291 ns | 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 941897 ns | 1020167.5 ns | 0.92
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1602667 ns | 1635042 ns | 0.98
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1171416 ns | 1200395.5 ns | 0.98
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1364375 ns | 1363584 ns | 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2512583 ns | 2345187.5 ns | 1.07
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 215456.5 ns | 215784.5 ns | 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12345833 ns | 12316854.5 ns | 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9563708.5 ns | 9564000 ns | 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9248333 ns | 9378437.5 ns | 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18039541.5 ns | 17989542 ns | 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 1941766 ns | 1948181 ns | 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17410875 ns | 17368125 ns | 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14343875 ns | 14382958 ns | 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14290187.5 ns | 14502250 ns | 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21033375 ns | 21085917 ns | 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 93146 ns | 90917 ns | 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 89750 ns | 89500 ns | 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 92375 ns | 91833 ns | 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 104667 ns | 113437.5 ns | 0.92
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 126306.5 ns | 126891 ns | 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2057146 ns | 2009625 ns | 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2030833 ns | 2030000 ns | 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2027062.5 ns | 2039270.5 ns | 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2024458 ns | 1871125 ns | 1.08
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 951168 ns | 1032563 ns | 0.92
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 327771 ns | 342166.5 ns | 0.96
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 344667 ns | 343375 ns | 1.00
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 393729 ns | 406458 ns | 0.97
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 312667 ns | 311729 ns | 1.00
batchedmm(2, Bsize=4)/forward/GPU/CUDA | 16220 ns | 16465.5 ns | 0.99
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 703375.5 ns | 706208 ns | 1.00
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 721271 ns | 728542 ns | 0.99
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 1023666.5 ns | 1018584 ns | 1.00
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 653917 ns | 650375 ns | 1.01
batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 187186 ns | 195366.5 ns | 0.96
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7083 ns | 7375 ns | 0.96
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6125 ns | 5875 ns | 1.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5833 ns | 5416 ns | 1.08
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 9916 ns | 10000 ns | 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 34409 ns | 34591 ns | 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 214083 ns | 243791 ns | 0.88
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 222333.5 ns | 220125 ns | 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 221187.5 ns | 221083 ns | 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 206125 ns | 239167 ns | 0.86
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 301322.5 ns | 327793 ns | 0.92
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3708 ns | 3667 ns | 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3708 ns | 3667 ns | 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3667 ns | 3709 ns | 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3625 ns | 3708 ns | 0.98
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 23004 ns | 22616 ns | 1.02
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14250 ns | 14292 ns | 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14333 ns | 14416 ns | 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14500 ns | 14208 ns | 1.02
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14416 ns | 14417 ns | 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 460312.5 ns | 480334.5 ns | 0.96
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 92937.5 ns | 94458 ns | 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 133375 ns | 92625 ns | 1.44
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 96583.5 ns | 96875 ns | 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 136958 ns | 96229.5 ns | 1.42
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 125681 ns | 126007 ns | 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1754208.5 ns | 1714792 ns | 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1922334 ns | 1926792 ns | 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1933417 ns | 1913291.5 ns | 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1927416.5 ns | 1711417 ns | 1.13
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 955943 ns | 1034230 ns | 0.92
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) | 857708 ns | 876916.5 ns | 0.98
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) | 817583 ns | 817791 ns | 1.00
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) | 1222291.5 ns | 1169438 ns | 1.05
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) | 963416 ns | 966187.5 ns | 1.00
lenet(28, 28, 1, 32)/forward/GPU/CUDA | 275885 ns | 275657.5 ns | 1.00
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) | 2826354 ns | 2828583 ns | 1.00
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) | 2472708.5 ns | 2474833 ns | 1.00
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) | 3311750 ns | 3335750 ns | 0.99
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) | 3417042 ns | 3304292 ns | 1.03
lenet(28, 28, 1, 32)/zygote/GPU/CUDA | 1599363 ns | 1618381.5 ns | 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 15667 ns | 16709 ns | 0.94
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 15541 ns | 15625 ns | 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 18791 ns | 18667 ns | 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 15042 ns | 15583 ns | 0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 143363 ns | 142594 ns | 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 221562 ns | 228750 ns | 0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 257625 ns | 215750 ns | 1.19
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 216167 ns | 217625 ns | 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 253521 ns | 255500 ns | 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 648580 ns | 641543.5 ns | 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 221958 ns | 222458 ns | 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 222584 ns | 221500 ns | 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 222875 ns | 223458.5 ns | 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 219542 ns | 222604.5 ns | 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 287448 ns | 269850.5 ns | 1.07
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 560521 ns | 537583 ns | 1.04
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 506729 ns | 497334 ns | 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 497875 ns | 499583 ns | 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 524917 ns | 526833 ns | 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1378195 ns | 1430878.5 ns | 0.96
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 312208.5 ns | 330125 ns | 0.95
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 334917 ns | 332834 ns | 1.01
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 355354.5 ns | 435458.5 ns | 0.82
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 323229.5 ns | 315917 ns | 1.02
batchedmm(16, Bsize=4)/forward/GPU/CUDA | 16853 ns | 16581 ns | 1.02
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 710916 ns | 717084 ns | 0.99
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 725333.5 ns | 728166.5 ns | 1.00
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 1020291 ns | 1021104 ns | 1.00
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 666458 ns | 662729.5 ns | 1.01
batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 196645 ns | 195479.5 ns | 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 18292 ns | 17875 ns | 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17250 ns | 17167 ns | 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 20250 ns | 20250 ns | 1
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 16687 ns | 17208 ns | 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 147801.5 ns | 145639 ns | 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 219292 ns | 223750 ns | 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 219437.5 ns | 212417 ns | 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 213646 ns | 214041 ns | 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 222104.5 ns | 221917 ns | 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1001312.5 ns | 1035551.5 ns | 0.97
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6458 ns | 6708 ns | 0.96
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 4792 ns | 6333 ns | 0.76
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 7250 ns | 7208 ns | 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 4458 ns | 6625 ns | 0.67
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 238642 ns | 240542 ns | 0.99
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10792 ns | 10584 ns | 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 10375 ns | 9917 ns | 1.05
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 11375 ns | 11166.5 ns | 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10333 ns | 10917 ns | 0.95
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 1064757 ns | 1097401.5 ns | 0.97
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 6042 ns | 3500 ns | 1.73
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3792 ns | 3208 ns | 1.18
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 4750 ns | 6333.5 ns | 0.75
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3209 ns | 6750 ns | 0.48
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 236410 ns | 250006 ns | 0.95
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7250 ns | 7625 ns | 0.95
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7333 ns | 7084 ns | 1.04
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8042 ns | 8125 ns | 0.99
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7584 ns | 7500 ns | 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 1074231 ns | 1102649 ns | 0.97
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 24130479 ns | 23315625 ns | 1.03
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 38799500 ns | 34529125 ns | 1.12
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 37733750 ns | 41513333.5 ns | 0.91
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 34918167 ns | 34929834 ns | 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1843476 ns | 1838602 ns | 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 186803646 ns | 184421875 ns | 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 159613166 ns | 159459792 ns | 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 146295625 ns | 151225083 ns | 0.97
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 412659125 ns | 413223958 ns | 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 16523543 ns | 16387494 ns | 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 436777542 ns | 428743125 ns | 1.02
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 253178667 ns | 252439020.5 ns | 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 232826083.5 ns | 233017396 ns | 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 484428667 ns | 484197291 ns | 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 183792 ns | 183584 ns | 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 182000 ns | 182750 ns | 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 185584 ns | 186625 ns | 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 182354.5 ns | 183146 ns | 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 220958.5 ns | 228677.5 ns | 0.97
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 593000 ns | 596083 ns | 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 587187 ns | 586292 ns | 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 588166 ns | 589770.5 ns | 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 632000 ns | 631958 ns | 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1068694.5 ns | 1119701 ns | 0.95
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 3862583.5 ns | 3838833 ns | 1.01
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 3623187 ns | 3643375.5 ns | 0.99
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 3513333 ns | 3563521 ns | 0.99
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 5351459 ns | 5359750 ns | 1.00
batchedmm(128, Bsize=512)/forward/GPU/CUDA | 534395 ns | 537722 ns | 0.99
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 17921270.5 ns | 17412417 ns | 1.03
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 17168125 ns | 17190667 ns | 1.00
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 16586271 ns | 17100375 ns | 0.97
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 22125084 ns | 22144083 ns | 1.00
batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2619299 ns | 2612799 ns | 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 542 ns | 542 ns | 1
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 500 ns | 458 ns | 1.09
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 542 ns | 1.15
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 459 ns | 583 ns | 0.79
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 32390 ns | 32035 ns | 1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9417 ns | 9208 ns | 1.02
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 8875 ns | 8542 ns | 1.04
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10125 ns | 10208 ns | 0.99
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9125 ns | 9459 ns | 0.96
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 265134.5 ns | 264327.5 ns | 1.00
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) | 505787208 ns | 504274209 ns | 1.00
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) | 430827229 ns | 430218396 ns | 1.00
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) | 432173291.5 ns | 471374500 ns | 0.92
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) | 584857000 ns | 672994208.5 ns | 0.87
vgg16(32, 32, 3, 128)/forward/GPU/CUDA | 12384263 ns | 12486595 ns | 0.99
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) | 2073799791.5 ns | 2049529562.5 ns | 1.01
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) | 1628408167 ns | 1632649709 ns | 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) | 1495535812 ns | 1536417708 ns | 0.97
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) | 2213815333 ns | 2205666041.5 ns | 1.00
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA | 49261027.5 ns | 49389302 ns | 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1644542 ns | 1657645.5 ns | 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1184062.5 ns | 1189208.5 ns | 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1367187.5 ns | 1382000 ns | 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2468292 ns | 2334125 ns | 1.06
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 217369 ns | 214982 ns | 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12780979.5 ns | 12688500 ns | 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9943666 ns | 9942000 ns | 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9649896 ns | 9748312.5 ns | 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18379437 ns | 18407312 ns | 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2035807.5 ns | 2050613 ns | 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17754833 ns | 17691583.5 ns | 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14655042 ns | 14746041.5 ns | 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14543333 ns | 14804417 ns | 0.98
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21358459 ns | 21386084 ns | 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 26250 ns | 26167 ns | 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 26250 ns | 26292 ns | 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 26583 ns | 26291 ns | 1.01
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 26208 ns | 26291 ns | 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 23360 ns | 24125 ns | 0.97
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 66834 ns | 66875 ns | 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 67542 ns | 66917 ns | 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 67083 ns | 67083 ns | 1
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66875 ns | 67209 ns | 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 392635.5 ns | 398847.5 ns | 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 203542 ns | 202667 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 209584 ns | 209000 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 209708 ns | 209167 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 199875 ns | 199583 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 25945.5 ns | 26392 ns | 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 608625 ns | 612416.5 ns | 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 632958.5 ns | 627416.5 ns | 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 622333 ns | 667979 ns | 0.93
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 584541.5 ns | 631250 ns | 0.93
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 349189 ns | 353043.5 ns | 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 653500 ns | 645542 ns | 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 670875 ns | 643375 ns | 1.04
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 547042 ns | 664187.5 ns | 0.82
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 675666.5 ns | 540834 ns | 1.25
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 131441 ns | 132126 ns | 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2289416 ns | 2247375 ns | 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2233958 ns | 2239958 ns | 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2245708 ns | 2302917 ns | 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2234188 ns | 2219000 ns | 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1153968 ns | 1328726 ns | 0.87
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17583 ns | 17667 ns | 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17000 ns | 16979.5 ns | 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 21083.5 ns | 20792 ns | 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 17479 ns | 18500 ns | 0.94
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 142918 ns | 146392.5 ns | 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 226645.5 ns | 229708 ns | 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 230417 ns | 225333 ns | 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 220688 ns | 229292 ns | 0.96
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 218917 ns | 259083 ns | 0.84
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 981199 ns | 1081671 ns | 0.91
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 541 ns | 500 ns | 1.08
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 583 ns | 459 ns | 1.27
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 583 ns | 542 ns | 1.08
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 458 ns | 542 ns | 0.85
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 23217 ns | 23645 ns | 0.98
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 10041.5 ns | 9833.5 ns | 1.02
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 10125 ns | 9542 ns | 1.06
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 10417 ns | 10708 ns | 0.97
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9250 ns | 9916 ns | 0.93
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 255034 ns | 262941 ns | 0.97
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5916 ns | 7291 ns | 0.81
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 6229.5 ns | 5833 ns | 1.07
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 8563 ns | 9625 ns | 0.89
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5500 ns | 7250 ns | 0.76
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 222902 ns | 234003 ns | 0.95
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7250 ns | 7333 ns | 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7709 ns | 7000 ns | 1.10
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7750 ns | 7833 ns | 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6958.5 ns | 7250 ns | 0.96
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 767625.5 ns | 810029.5 ns | 0.95
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 2291 ns | 2042 ns | 1.12
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 2250 ns | 2000 ns | 1.13
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 2333 ns | 2375 ns | 0.98
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 2333 ns | 2208 ns | 1.06
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 17725 ns | 18218 ns | 0.97
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 6542 ns | 6542 ns | 1
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 6667 ns | 6500 ns | 1.03
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 6958 ns | 6708 ns | 1.04
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 6583 ns | 6750 ns | 0.98
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 317996.5 ns | 335368 ns | 0.95
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 748750 ns | 750166 ns | 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 747083 ns | 746604.5 ns | 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 747042 ns | 751041 ns | 0.99
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 749125 ns | 761417 ns | 0.98
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 21402 ns | 21856 ns | 0.98
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 790729 ns | 775334 ns | 1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 790333.5 ns | 775042 ns | 1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 773125 ns | 804792 ns | 0.96
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 775458.5 ns | 791625 ns | 0.98
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 291072 ns | 299022 ns | 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7209 ns | 7375 ns | 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6042 ns | 5875 ns | 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6083 ns | 5208 ns | 1.17
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10125 ns | 10125 ns | 1
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 32814 ns | 32492 ns | 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 220166 ns | 233188 ns | 0.94
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 240583 ns | 227750 ns | 1.06
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 228583 ns | 254458 ns | 0.90
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 255708 ns | 255583 ns | 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 355564.5 ns | 359227 ns | 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 12541 ns | 11042 ns | 1.14
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 10500 ns | 12458 ns | 0.84
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 13167 ns | 12959 ns | 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 10125 ns | 12000 ns | 0.84
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 239405.5 ns | 245075.5 ns | 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 24791.5 ns | 24875 ns | 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 24375 ns | 24458 ns | 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 25541 ns | 25458 ns | 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 24812.5 ns | 24583.5 ns | 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 1085912 ns | 1120608 ns | 0.97
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 108107292 ns | 106980458 ns | 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 117455666.5 ns | 118006979.5 ns | 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 120529584 ns | 123940208 ns | 0.97
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 117307042 ns | 118407959 ns | 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 2652543 ns | 2661574 ns | 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 395929750 ns | 394378313 ns | 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 367066041 ns | 368164500 ns | 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 354756333 ns | 358657167 ns | 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 484413208 ns | 482282708 ns | 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 15198392 ns | 15138278 ns | 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 767591687.5 ns | 759267583 ns | 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 579795958 ns | 577881125 ns | 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 743372729 ns | 749378833 ns | 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 765609167 ns | 945671312.5 ns | 0.81
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7458.5 ns | 7458 ns | 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7479.5 ns | 7958 ns | 0.94
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8916 ns | 8750 ns | 1.02
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 6708 ns | 7333 ns | 0.91
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 232243 ns | 235620 ns | 0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 13917 ns | 14500 ns | 0.96
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14125 ns | 13333 ns | 1.06
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 15166 ns | 15041 ns | 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14458 ns | 14292 ns | 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 1035695.5 ns | 1078273.5 ns | 0.96
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 9042 ns | 8542 ns | 1.06
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6833 ns | 7792 ns | 0.88
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 9750 ns | 9187.5 ns | 1.06
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5500 ns | 7833.5 ns | 0.70
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 227355.5 ns | 235827.5 ns | 0.96
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12625 ns | 13167 ns | 0.96
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12959 ns | 12084 ns | 1.07
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 12917 ns | 13084 ns | 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12292 ns | 12833 ns | 0.96
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 753887 ns | 787391.5 ns | 0.96
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 327750 ns | 347250 ns | 0.94
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 342666.5 ns | 344875 ns | 0.99
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 398083 ns | 409896 ns | 0.97
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 317437.5 ns | 310562 ns | 1.02
batchedmm(2, Bsize=128)/forward/GPU/CUDA | 16593 ns | 16566 ns | 1.00
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 702854.5 ns | 713833.5 ns | 0.98
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 720833 ns | 727291 ns | 0.99
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 1025771 ns | 1023416 ns | 1.00
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 661750 ns | 654959 ns | 1.01
batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 196204.5 ns | 197250.5 ns | 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 291 ns | 1.29
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 291 ns | 375 ns | 0.78
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 23062 ns | 23066 ns | 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6333 ns | 6250 ns | 1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6584 ns | 6334 ns | 1.04
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6792 ns | 6750 ns | 1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6250 ns | 6791 ns | 0.92
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 236488 ns | 238420 ns | 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 5833 ns | 5750 ns | 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 5792 ns | 5750 ns | 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 5875 ns | 5875 ns | 1
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 5667 ns | 5834 ns | 0.97
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 24282 ns | 23863 ns | 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 21687 ns | 21750 ns | 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 21584 ns | 21000 ns | 1.03
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 21750 ns | 21958 ns | 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 21063 ns | 21708 ns | 0.97
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 260349.5 ns | 261085 ns | 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 172458 ns | 152146 ns | 1.13
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 185292 ns | 145250 ns | 1.28
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 148917 ns | 149541 ns | 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 186625 ns | 145937 ns | 1.28
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 166632 ns | 166536.5 ns | 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1351354.5 ns | 1328792 ns | 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1310042 ns | 1319083.5 ns | 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1312208 ns | 1350812.5 ns | 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1317292 ns | 1317084 ns | 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1279433 ns | 1336276 ns | 0.96
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 24291 ns | 24917 ns | 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 22125 ns | 24208 ns | 0.91
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 25958 ns | 25708 ns | 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 21916 ns | 24208.5 ns | 0.91
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 277859 ns | 351114.5 ns | 0.79
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 127896 ns | 131125 ns | 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 174583 ns | 117791 ns | 1.48
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 118667 ns | 172917 ns | 0.69
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 135125 ns | 177334 ns | 0.76
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1390180 ns | 1465398.5 ns | 0.95
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 333 ns | 333 ns | 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 292 ns | 1.28
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 291 ns | 375 ns | 0.78
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 22950 ns | 22926 ns | 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6416.5 ns | 6417 ns | 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6625 ns | 6458 ns | 1.03
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6834 ns | 6917 ns | 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6250 ns | 6542 ns | 0.96
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 253555 ns | 254551 ns | 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6000 ns | 7625 ns | 0.79
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4167 ns | 4167 ns | 1
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7375 ns | 7708.5 ns | 0.96
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4666 ns | 7375 ns | 0.63
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 241371.5 ns | 250274.5 ns | 0.96
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10166 ns | 10042 ns | 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10042 ns | 9708 ns | 1.03
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10625 ns | 10333 ns | 1.03
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10333 ns | 10250 ns | 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 1304285.5 ns | 1345295 ns | 0.97
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1625 ns | 1584 ns | 1.03
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1584 ns | 1625 ns | 0.97
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1625 ns | 1625 ns | 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1583 ns | 1625 ns | 0.97
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 22830 ns | 22897 ns | 1.00
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 5708 ns | 5625 ns | 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 5667 ns | 5584 ns | 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 6042 ns | 5959 ns | 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 5583 ns | 5958 ns | 0.94
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 270940 ns | 271438.5 ns | 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 6820479 ns | 6886125 ns | 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 6334041.5 ns | 6378229 ns | 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 6486416.5 ns | 6526875 ns | 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 7665459 ns | 7602250 ns | 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 213607.5 ns | 213111 ns | 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 24142500 ns | 24073062 ns | 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 21253833 ns | 21283625 ns | 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 20999479 ns | 21045584 ns | 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 29726209 ns | 29677875 ns | 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2083084.5 ns | 2108165 ns | 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 37375166.5 ns | 37353145.5 ns | 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 33959583 ns | 34386667 ns | 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 45667583 ns | 45930020.5 ns | 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 37873562.5 ns | 49322334 ns | 0.77
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6979.5 ns | 7708.5 ns | 0.91
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6667 ns | 5875 ns | 1.13
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 8104.5 ns | 8333 ns | 0.97
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 5479.5 ns | 7062.5 ns | 0.78
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 228629.5 ns | 238522.5 ns | 0.96
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8375 ns | 8458 ns | 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8375 ns | 8042 ns | 1.04
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8584 ns | 8583 ns | 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8125 ns | 8292 ns | 0.98
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 1060872.5 ns | 1070850 ns | 0.99
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) | 1527229 ns | 1544374.5 ns | 0.99
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) | 1259812.5 ns | 1259666.5 ns | 1.00
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) | 1616208 ns | 1632771 ns | 0.99
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) | 2147979 ns | 2150667 ns | 1.00
lenet(28, 28, 1, 128)/forward/GPU/CUDA | 271439 ns | 278945 ns | 0.97
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) | 7973083.5 ns | 7908937.5 ns | 1.01
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) | 6586020.5 ns | 6609937 ns | 1.00
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) | 7034625 ns | 7237750.5 ns | 0.97
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) | 10461334 ns | 10434334 ns | 1.00
lenet(28, 28, 1, 128)/zygote/GPU/CUDA | 1861989 ns | 1889956 ns | 0.99
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 318167 ns | 340979 ns | 0.93
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 341959 ns | 345792 ns | 0.99
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 408000 ns | 417125 ns | 0.98
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 345291 ns | 345833 ns | 1.00
batchedmm(128, Bsize=4)/forward/GPU/CUDA | 46596 ns | 42448 ns | 1.10
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 734812.5 ns | 746500.5 ns | 0.98
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 781000 ns | 784542 ns | 1.00
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 1068667 ns | 1073250 ns | 1.00
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 746084 ns | 761062.5 ns | 0.98
batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 299516.5 ns | 303720.5 ns | 0.99
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 397708 ns | 397500 ns | 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 288000 ns | 288250 ns | 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 288125 ns | 212666 ns | 1.35
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 752083 ns | 756084 ns | 0.99
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 44143 ns | 43887 ns | 1.01
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 633750 ns | 671083 ns | 0.94
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 531000 ns | 530083 ns | 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 530834 ns | 470667 ns | 1.13
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 973250 ns | 974750 ns | 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 188258.5 ns | 188388.5 ns | 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 667374.5 ns | 679250 ns | 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 643458.5 ns | 645333.5 ns | 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 545833 ns | 642458 ns | 0.85
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 678833.5 ns | 638562.5 ns | 1.06
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 131695 ns | 131530 ns | 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2403188 ns | 2409292 ns | 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2439250 ns | 2456416.5 ns | 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2454541 ns | 2514583 ns | 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2454542 ns | 2456292 ns | 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1200754 ns | 1277300 ns | 0.94
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 325000 ns | 345146 ns | 0.94
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 340500 ns | 343583 ns | 0.99
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 394250 ns | 403708.5 ns | 0.98
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 314000 ns | 312208 ns | 1.01
batchedmm(2, Bsize=32)/forward/GPU/CUDA | 15982 ns | 16009 ns | 1.00
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 702813 ns | 709667 ns | 0.99
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 719125 ns | 724500 ns | 0.99
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 1024146 ns | 1022687.5 ns | 1.00
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 651667 ns | 650417 ns | 1.00
batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 196545 ns | 195917 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1458417 ns | 1460417 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1503167 ns | 1500812.5 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1499542 ns | 1496375 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1439209 ns | 1438708 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 40255 ns | 40600 ns | 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5142459 ns | 5128791 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5295000.5 ns | 5302375 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5017687.5 ns | 5313000 ns | 0.94
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4991625 ns | 4970208.5 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 197920.5 ns | 196206.5 ns | 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3708 ns | 3667 ns | 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3708 ns | 3667 ns | 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3667 ns | 3709 ns | 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3667 ns | 3708 ns | 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA | 33701 ns | 32895 ns | 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14917 ns | 15167 ns | 0.98
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 15333 ns | 15083 ns | 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 15375 ns | 15083 ns | 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 15125 ns | 15375 ns | 0.98
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA | 380032 ns | 376729 ns | 1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 71375 ns | 71459 ns | 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 71292 ns | 71250 ns | 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 71250 ns | 71375 ns | 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 71250 ns | 70708 ns | 1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA | 113118 ns | 113177.5 ns | 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 322292 ns | 317917 ns | 1.01
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 321459 ns | 320417 ns | 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 327292 ns | 325333 ns | 1.01
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 318334 ns | 320916 ns | 0.99
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA | 196182.5 ns | 193043 ns | 1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 1000 ns | 958 ns | 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 1083 ns | 958 ns | 1.13
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 1083 ns | 1083 ns | 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 958 ns | 1042 ns | 0.92
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 23902 ns | 23363 ns | 1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 7959 ns | 8083 ns | 0.98
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8083 ns | 7792 ns | 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8541 ns | 8750 ns | 0.98
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8125 ns | 8750 ns | 0.93
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 263222.5 ns | 260535.5 ns | 1.01
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) | 451021 ns | 475499.5 ns | 0.95
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) | 470667 ns | 470520.5 ns | 1.00
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) | 556978.5 ns | 557125 ns | 1.00
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) | 567333 ns | 557959 ns | 1.02
batchedmm(128, Bsize=32)/forward/GPU/CUDA | 129930 ns | 129404 ns | 1.00
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) | 1413124.5 ns | 1399270.5 ns | 1.01
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) | 1374375 ns | 1382375 ns | 0.99
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) | 1599125 ns | 1611125 ns | 0.99
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) | 1589500 ns | 1582104.5 ns | 1.00
batchedmm(128, Bsize=32)/zygote/GPU/CUDA | 275820 ns | 274924 ns | 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 416 ns | 375 ns | 1.11
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 250 ns | 1.50
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 291 ns | 375 ns | 0.78
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 31985 ns | 31647 ns | 1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6375 ns | 6375 ns | 1
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6375 ns | 6042 ns | 1.06
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6833 ns | 6666 ns | 1.03
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6291 ns | 6625 ns | 0.95
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 265480 ns | 262541.5 ns | 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1723041.5 ns | 1761833 ns | 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1770375 ns | 1723396 ns | 1.03
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1726791 ns | 1733812.5 ns | 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1769792 ns | 1730625 ns | 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 169107.5 ns | 169477.5 ns | 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4370833 ns | 4358625 ns | 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4358458 ns | 4358708 ns | 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4355958 ns | 4403062.5 ns | 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4350000 ns | 4373875 ns | 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1170977 ns | 1208123 ns | 0.97
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 6625 ns | 7167 ns | 0.92
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 6750 ns | 6875 ns | 0.98
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 7041 ns | 6916 ns | 1.02
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 9000 ns | 6750 ns | 1.33
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA | 21354 ns | 20662 ns | 1.03
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 33104.5 ns | 51625 ns | 0.64
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 51458 ns | 32917 ns | 1.56
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 33083 ns | 48208.5 ns | 0.69
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 51042 ns | 51417 ns | 0.99
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA | 211403.5 ns | 292106.5 ns | 0.72
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) | 332479 ns | 354562.5 ns | 0.94
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) | 345500 ns | 348666.5 ns | 0.99
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) | 420625 ns | 433333 ns | 0.97
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) | 326208 ns | 322041.5 ns | 1.01
batchedmm(2, Bsize=512)/forward/GPU/CUDA | 18610.5 ns | 18353 ns | 1.01
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) | 719166 ns | 724625 ns | 0.99
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) | 732604 ns | 730583 ns | 1.00
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) | 1029625 ns | 1038687.5 ns | 0.99
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) | 679354 ns | 675333 ns | 1.01
batchedmm(2, Bsize=512)/zygote/GPU/CUDA | 345590 ns | 335730.5 ns | 1.03
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 75167 ns | 75458 ns | 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 75125 ns | 75333 ns | 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 75292 ns | 75375 ns | 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 74875 ns | 74584 ns | 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA | 47792 ns | 46864.5 ns | 1.02
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 334542 ns | 325166 ns | 1.03
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 340667 ns | 324250 ns | 1.05
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 326000 ns | 336875 ns | 0.97
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 326708 ns | 325125 ns | 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA | 213631.5 ns | 209059.5 ns | 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1484750 ns | 1485709 ns | 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1530208 ns | 1526833 ns | 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1526875 ns | 1522792 ns | 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1463833 ns | 1462625 ns | 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 52711 ns | 51397 ns | 1.03
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5145375.5 ns | 5113395.5 ns | 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5286834 ns | 5295292 ns | 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4997792 ns | 5300812.5 ns | 0.94
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4998437.5 ns | 5001042 ns | 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 207150 ns | 202971.5 ns | 1.02
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 28209 ns | 28250 ns | 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 28250 ns | 28208 ns | 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 28292 ns | 28208 ns | 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 28209 ns | 28209 ns | 1
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA | 24880 ns | 24514.5 ns | 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 66375 ns | 66417 ns | 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 66584 ns | 66458 ns | 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 66542 ns | 66500 ns | 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66541 ns | 66500 ns | 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA | 537867.5 ns | 505942 ns | 1.06
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) | 1339125 ns | 1502084 ns | 0.89
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) | 1143854 ns | 1124250 ns | 1.02
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) | 1056979.5 ns | 944270.5 ns | 1.12
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) | 2227833 ns | 2255250 ns | 0.99
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA | 577124.5 ns | 566674 ns | 1.02
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) | 3019562 ns | 3090791 ns | 0.98
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) | 2730250 ns | 2751542 ns | 0.99
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) | 2578250 ns | 2628896 ns | 0.98
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) | 3815792 ns | 3819709 ns | 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA | 2002712 ns | 1979936 ns | 1.01
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) | 8920709 ns | 8847333 ns | 1.01
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) | 8781875 ns | 8768375 ns | 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) | 8792854 ns | 8750250 ns | 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) | 6367541.5 ns | 6340375 ns | 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 84000 ns | 85125 ns | 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 82083 ns | 83021 ns | 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 84583 ns | 85708.5 ns | 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 80791.5 ns | 83562.5 ns | 0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 192031 ns | 192703 ns | 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2015625 ns | 2012875 ns | 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2019458.5 ns | 2024062.5 ns | 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1745917 ns | 2038542 ns | 0.86
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2013895.5 ns | 2008812 ns | 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 797860.5 ns | 791664.5 ns | 1.01
This comment was automatically generated by workflow using github-action-benchmark.
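
Each benchmark entry above carries two timings in nanoseconds followed by a ratio. The figures are consistent with the ratio being the first timing divided by the second (for example, 10125 ns against 9542 ns gives 1.06), so a value above 1 would mean the first measurement is slower. A minimal Python sketch of that arithmetic follows, assuming the first timing is the current run and the second the previous one; the `Row` type, the `regressions` helper, and the 1.25 threshold are hypothetical names and values chosen for illustration and are not part of github-action-benchmark.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    """One benchmark entry: name plus two timings in nanoseconds."""
    name: str
    current_ns: float   # assumed: first timing listed is the current run
    previous_ns: float  # assumed: second timing listed is the previous run

    @property
    def ratio(self) -> float:
        # Assumed definition: current / previous, so > 1 means slower.
        return self.current_ns / self.previous_ns


def regressions(rows: List[Row], threshold: float = 1.25) -> List[Row]:
    """Return rows whose ratio exceeds the (hypothetical) alert threshold."""
    return [r for r in rows if r.ratio > threshold]


# Three entries copied from the results above.
rows = [
    Row("batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)",
        10125, 9542),
    Row("layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)",
        174583, 117791),
    Row("groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s)",
        5500, 7833.5),
]

for r in rows:
    print(f"{r.name}: {r.ratio:.2f}")

flagged = regressions(rows)
```

Many of the sub-microsecond CPU entries swing by 20 to 50 percent between runs in both directions, so a single ratio above any threshold is better read as a prompt to re-run the benchmark than as proof of a regression.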