refactor: cleanup some old pre-1.0 hacks (#1102)
Showing 3 changed files with 11 additions and 13 deletions.
cd96335
Lux Benchmarks
Benchmark | Current | Previous | Ratio
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4375 ns | 4083 ns | 1.07
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 4208 ns | 4458 ns | 0.94
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 5042 ns | 4583 ns | 1.10
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3833 ns | 4458 ns | 0.86
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 59750 ns | 61537 ns | 0.97
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 10229.5 ns | 9958 ns | 1.03
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 10458 ns | 11083 ns | 0.94
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 11208 ns | 10125 ns | 1.11
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 10083.5 ns | 10292 ns | 0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 421969 ns | 428120 ns | 0.99
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 1042 ns | 1208 ns | 0.86
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 1333 ns | 1333 ns | 1
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 1334 ns | 1333 ns | 1.00
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 1167 ns | 1042 ns | 1.12
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA | 18218 ns | 17813 ns | 1.02
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 3791 ns | 3959 ns | 0.96
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 4125 ns | 4042 ns | 1.02
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 4375 ns | 4375 ns | 1
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 4083 ns | 4000 ns | 1.02
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA | 110020 ns | 110308 ns | 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 55625 ns | 57500 ns | 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46833 ns | 38333 ns | 1.22
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46208 ns | 46625 ns | 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 81750 ns | 82166 ns | 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 36958.5 ns | 36705 ns | 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2050166 ns | 2027541 ns | 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2100334 ns | 2090041.5 ns | 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2073937.5 ns | 2097083 ns | 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1993041 ns | 1999875 ns | 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 195385 ns | 195283 ns | 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 143208 ns | 143625 ns | 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 143958.5 ns | 143417 ns | 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 146000 ns | 145584 ns | 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 182375 ns | 147187.5 ns | 1.24
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 165528 ns | 166525 ns | 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1157292 ns | 1109542 ns | 1.04
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1158062.5 ns | 1126812.5 ns | 1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1107125 ns | 1122083 ns | 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1113937.5 ns | 1020645.5 ns | 1.09
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 525805 ns | 533338 ns | 0.99
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 3542 ns | 3375 ns | 1.05
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 3959 ns | 3416 ns | 1.16
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 4458 ns | 4541 ns | 0.98
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 3458 ns | 3604.5 ns | 0.96
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 70267.5 ns | 68868.5 ns | 1.02
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8792 ns | 9292 ns | 0.95
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8667 ns | 9542 ns | 0.91
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9500 ns | 9792 ns | 0.97
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9333 ns | 8833 ns | 1.06
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 486148 ns | 494765.5 ns | 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 15916 ns | 15583 ns | 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 15208 ns | 16458 ns | 0.92
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 17458 ns | 16500 ns | 1.06
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 15750 ns | 15083 ns | 1.04
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 55035.5 ns | 54721 ns | 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 214687.5 ns | 212833 ns | 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 213875 ns | 215167 ns | 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 214499.5 ns | 214416 ns | 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 214020.5 ns | 212417 ns | 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 271923 ns | 274119.5 ns | 0.99
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 584 ns | 583 ns | 1.00
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 542 ns | 792 ns | 0.68
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 750 ns | 750 ns | 1
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 542 ns | 583 ns | 0.93
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA | 17550 ns | 17270 ns | 1.02
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 1667 ns | 1667 ns | 1
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 1625 ns | 1667 ns | 0.97
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 1875 ns | 1458 ns | 1.29
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 1583 ns | 1708 ns | 0.93
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA | 102829 ns | 103124 ns | 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7083 ns | 7291 ns | 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6041 ns | 5292 ns | 1.14
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5917 ns | 5958 ns | 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10000 ns | 9917 ns | 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 23605 ns | 23563 ns | 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 221000 ns | 220708 ns | 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 229416.5 ns | 236874.5 ns | 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 230875 ns | 228875 ns | 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 252791.5 ns | 220166 ns | 1.15
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 168416.5 ns | 169828.5 ns | 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 3958 ns | 3875 ns | 1.02
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 3875 ns | 3917 ns | 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 3917 ns | 3916 ns | 1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 3916 ns | 3875 ns | 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA | 23282 ns | 23299 ns | 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16625 ns | 16708 ns | 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16875 ns | 16833 ns | 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16958 ns | 16834 ns | 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16667 ns | 16667 ns | 1
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA | 160471 ns | 162920 ns | 0.98
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 574250 ns | 574791 ns | 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 576167 ns | 578334 ns | 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 579895.5 ns | 574000 ns | 1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 573917 ns | 574333 ns | 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA | 113142 ns | 113504 ns | 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1424041.5 ns | 1420083 ns | 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1417292 ns | 1415750 ns | 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1420500 ns | 1420208 ns | 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 1425500 ns | 1425187.5 ns | 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA | 209769 ns | 212199 ns | 0.99
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) | 1054333.5 ns | 1067895.5 ns | 0.99
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) | 959917 ns | 940416 ns | 1.02
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) | 1343583.5 ns | 1346520.5 ns | 1.00
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) | 1300896 ns | 1295333 ns | 1.00
lenet(28, 28, 1, 64)/forward/GPU/CUDA | 279273.5 ns | 276087 ns | 1.01
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) | 5749437.5 ns | 6005792 ns | 0.96
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) | 4599687.5 ns | 4619125 ns | 1.00
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) | 4952395.5 ns | 4921458.5 ns | 1.01
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) | 5610084 ns | 5705500 ns | 0.98
lenet(28, 28, 1, 64)/zygote/GPU/CUDA | 1087158.5 ns | 1093586 ns | 0.99
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 542 ns | 542 ns | 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 541 ns | 542 ns | 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 542 ns | 542 ns | 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 500 ns | 542 ns | 0.92
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA | 23646 ns | 23336 ns | 1.01
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2125 ns | 2125 ns | 1
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2084 ns | 2167 ns | 0.96
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 2208 ns | 2166 ns | 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2083 ns | 2209 ns | 0.94
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA | 173162 ns | 170662.5 ns | 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 4542 ns | 4083 ns | 1.11
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 4208 ns | 4250 ns | 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5000 ns | 5250 ns | 0.95
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 4000 ns | 4250 ns | 0.94
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 64791.5 ns | 66890.5 ns | 0.97
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 11208 ns | 11166 ns | 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 11291.5 ns | 11750 ns | 0.96
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 12209 ns | 11792 ns | 1.04
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 11292 ns | 11145.5 ns | 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 449166 ns | 455730.5 ns | 0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 6792 ns | 6708 ns | 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 6833 ns | 6917 ns | 0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8312.5 ns | 8000 ns | 1.04
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 6250 ns | 6833 ns | 0.91
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 51887 ns | 53251 ns | 0.97
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 16792 ns | 17646 ns | 0.95
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 16667 ns | 17687.5 ns | 0.94
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 17500 ns | 17583 ns | 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 17875 ns | 18520.5 ns | 0.97
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 301591 ns | 303857.5 ns | 0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 583 ns | 583 ns | 1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 583 ns | 625 ns | 0.93
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 667 ns | 0.94
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 583 ns | 625 ns | 0.93
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 32520 ns | 32349 ns | 1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 8625 ns | 8500 ns | 1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 8416.5 ns | 9458 ns | 0.89
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9417 ns | 9291 ns | 1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 8250 ns | 9375 ns | 0.88
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 159487 ns | 158134 ns | 1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 64959 ns | 64375 ns | 1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 64667 ns | 64500 ns | 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 64416 ns | 64542 ns | 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 64541 ns | 64417 ns | 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA | 110435.5 ns | 111051 ns | 0.99
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 291250 ns | 280917 ns | 1.04
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 285125 ns | 285417 ns | 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 274833.5 ns | 280750 ns | 0.98
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 279770.5 ns | 279291.5 ns | 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA | 183913 ns | 185526.5 ns | 0.99
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) | 3222417 ns | 3281750 ns | 0.98
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) | 3060583 ns | 2797500 ns | 1.09
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) | 3017291.5 ns | 3018917 ns | 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) | 4070708 ns | 4088625 ns | 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA | 571448 ns | 571296 ns | 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) | 7560916.5 ns | 7642500 ns | 0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) | 7434917 ns | 7291354 ns | 1.02
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) | 7464958 ns | 7449292 ns | 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) | 8157583.5 ns | 8096333 ns | 1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA | 1323265 ns | 1326986 ns | 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) | 17698375 ns | 17512333 ns | 1.01
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) | 17382541 ns | 17557479.5 ns | 0.99
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) | 17917041 ns | 17568792 ns | 1.02
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) | 14113978.5 ns | 14165000 ns | 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 24259667 ns | 23618750 ns | 1.03
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 33537791.5 ns | 43411666 ns | 0.77
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 37485625 ns | 37050562 ns | 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 34876854.5 ns | 34914229.5 ns | 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1864963 ns | 1853387 ns | 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 191699417 ns | 187623875 ns | 1.02
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 233048750 ns | 247457083 ns | 0.94
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 194089542 ns | 194208333 ns | 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 434858250 ns | 434785500 ns | 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 13855629 ns | 13912861.5 ns | 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 292275916 ns | 289468416 ns | 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 336958667 ns | 350360437.5 ns | 0.96
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 297206917 ns | 297011958 ns | 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 408837354 ns | 409128187.5 ns | 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 22333 ns | 24042 ns | 0.93
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 24521 ns | 23958 ns | 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 23666 ns | 23916 ns | 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 22417 ns | 22020.5 ns | 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 95962.5 ns | 96407 ns | 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 104625 ns | 103208.5 ns | 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 103334 ns | 104791 ns | 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 104875 ns | 104667 ns | 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 103250 ns | 103417 ns | 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 503280 ns | 511501 ns | 0.98
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 6417 ns | 6145.5 ns | 1.04
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 6250 ns | 5625 ns | 1.11
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 7250 ns | 6979.5 ns | 1.04
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 6041 ns | 5709 ns | 1.06
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 67524 ns | 69596.5 ns | 0.97
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 15250 ns | 14520.5 ns | 1.05
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 15500 ns | 15542 ns | 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 16125 ns | 16125 ns | 1
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 12875 ns | 14625 ns | 0.88
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 474310.5 ns | 479202.5 ns | 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 2994417 ns | 3041750 ns | 0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2072458 ns | 2066041.5 ns | 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2264416 ns | 2266312 ns | 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4512000 ns | 4490041.5 ns | 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 589406.5 ns | 590463 ns | 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 23917916 ns | 23486917 ns | 1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 18038749.5 ns | 18259854 ns | 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 17983750 ns | 17822021 ns | 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 35261125 ns | 35704478.5 ns | 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 2768485.5 ns | 2768088 ns | 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 33831646.5 ns | 33321020.5 ns | 1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 27630729 ns | 28000312.5 ns | 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 28545541 ns | 28560333.5 ns | 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 41340292 ns | 41618958 ns | 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 72833 ns | 72209 ns | 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 73521 ns | 81645.5 ns | 0.90
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 74958 ns | 74917 ns | 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 83500 ns | 72396 ns | 1.15
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 102113 ns | 105122.5 ns | 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 208042 ns | 278083 ns | 0.75
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 291208 ns | 314375 ns | 0.93
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 219875 ns | 208562.5 ns | 1.05
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 217417 ns | 241750.5 ns | 0.90
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 550239 ns | 565906 ns | 0.97
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 11916 ns | 11417 ns | 1.04
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 12042 ns | 11833.5 ns | 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 13000 ns | 12250 ns | 1.06
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 11583 ns | 12166 ns | 0.95
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 70941.5 ns | 73969 ns | 0.96
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 26541 ns | 26125 ns | 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 26875 ns | 27542 ns | 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 27833 ns | 26708 ns | 1.04
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 26708 ns | 26708 ns | 1
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 472589 ns | 488459.5 ns | 0.97
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 12083 ns | 12208 ns | 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 12917 ns | 13084 ns | 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 13771 ns | 13833 ns | 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 12334 ns | 12687.5 ns | 0.97
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 52605 ns | 55593 ns | 0.95
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 25917 ns | 25333 ns | 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 25625 ns | 26458 ns | 0.97
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 25958 ns | 26250 ns | 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 26542 ns | 26458 ns | 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 304518.5 ns | 314229 ns | 0.97
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 179750 ns | 181625 ns | 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 181375 ns | 180250 ns | 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 182875 ns | 183667 ns | 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 182083 ns | 179417 ns | 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 57612 ns | 58869 ns | 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 585375 ns | 587667 ns | 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 582375 ns | 585625 ns | 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 584291.5 ns | 583584 ns | 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 585895.5 ns | 584708 ns | 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 287910.5 ns | 294563.5 ns | 0.98
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 6396 ns | 5395.5 ns | 1.19
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 6084 ns | 6042 ns | 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 7667 ns | 8416.5 ns | 0.91
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 5542 ns | 8791 ns | 0.63
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 70404 ns | 73281.5 ns | 0.96
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14458 ns | 13834 ns | 1.05
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14917 ns | 15125 ns | 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 16000 ns | 14333 ns | 1.12
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14416 ns | 14250 ns | 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 461584 ns | 478456 ns | 0.96
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) | 1193604.5 ns | 1191541 ns | 1.00
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) | 1246000 ns | 1236750 ns | 1.01
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) | 1273583.5 ns | 1285583.5 ns | 0.99
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) | 1016875 ns | 1003417 ns | 1.01
batchedmm(512, Bsize=4)/forward/GPU/CUDA | 301246 ns | 302585 ns | 1.00
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) | 4298583 ns | 4114354 ns | 1.04
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) | 4454937.5 ns | 4527875 ns | 0.98
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) | 4559833 ns | 4560333.5 ns | 1.00
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) | 3718125 ns | 3695000 ns | 1.01
batchedmm(512, Bsize=4)/zygote/GPU/CUDA | 1052722 ns | 1056192.5 ns | 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1875 ns | 1833 ns | 1.02
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1834 ns | 1875 ns | 0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1833 ns | 1833 ns | 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1834 ns | 1875 ns | 0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA | 24315 ns | 23824 ns | 1.02
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 5000 ns | 4875 ns | 1.03
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 4916 ns | 4917 ns | 1.00
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 4917 ns | 4917 ns | 1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 4916 ns | 4959 ns | 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA | 193381 ns | 193428 ns | 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6270.5 ns | 6084 ns | 1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6292 ns | 6166 ns | 1.02
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7042 ns | 6917 ns | 1.02
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 5666 ns | 6209 ns | 0.91
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 56858.5 ns | 57953 ns | 0.98
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 11000 ns | 10333 ns | 1.06
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10584 ns | 11709 ns | 0.90
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 11292 ns | 11333 ns | 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10625 ns | 11583 ns | 0.92
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 341133.5 ns | 343622 ns | 0.99
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 416 ns | 334 ns | 1.25
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 334 ns | 333 ns | 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 333 ns | 0.88
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA | 23459 ns | 23294 ns | 1.01
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2792 ns | 2792 ns | 1
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2708 ns | 3042 ns | 0.89
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 3000 ns | 3000 ns | 1
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2750 ns | 2750 ns | 1
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA | 163121 ns | 163978 ns | 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 12209 ns | 11375 ns | 1.07
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 12250 ns | 11666 ns | 1.05
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 15083 ns | 12500 ns | 1.21
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 11375 ns | 11542 ns | 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 59412 ns | 59566.5 ns | 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 24792 ns | 24459 ns | 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 24833 ns | 25042 ns | 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 25042 ns | 25083.5 ns | 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 25167 ns | 25083 ns | 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 302787.5 ns | 305262.5 ns | 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4250 ns | 4208 ns | 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4167 ns | 4167 ns | 1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4208 ns | 4209 ns | 1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4167 ns | 4208 ns | 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA | 25427.5 ns | 25152 ns | 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16041 ns | 16083 ns | 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16250 ns | 16042 ns | 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16125 ns | 16375 ns | 0.98
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16084 ns | 16167 ns | 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA | 203537 ns | 203575 ns | 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5875 ns | 5750 ns | 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 5834 ns | 5833 ns | 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5875 ns | 5875 ns | 1
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5875 ns | 5875 ns | 1
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 34639 ns | 34167 ns | 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 21375 ns | 20291.5 ns | 1.05
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 21125 ns | 21041 ns | 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 22125 ns | 21500 ns | 1.03
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 23000 ns | 21167 ns | 1.09
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 181357 ns | 180386.5 ns | 1.01
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 404042 ns | 420667 ns | 0.96
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 390084 ns | 363520.5 ns | 1.07
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 483167 ns | 482000 ns | 1.00
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 103834 ns | 125291.5 ns | 0.83
batchedmm(16, Bsize=512)/forward/GPU/CUDA | 67491 ns | 67480 ns | 1.00
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 913854 ns | 897041 ns | 1.02
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 961459 ns | 967000.5 ns | 0.99
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 1201334 ns | 1167958 ns | 1.03
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 448417 ns | 396500 ns | 1.13
batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 192152 ns | 197078.5 ns | 0.98
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 80542 ns | 80125 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 81500 ns | 81020.5 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 79854.5 ns | 82625 ns | 0.97 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 78813 ns | 83458 ns | 0.94 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 193447.5 ns | 194831 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1946833 ns | 1694000 ns | 1.15 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1932479 ns | 1917291.5 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1920708 ns | 1931459 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1904937.5 ns | 1896062.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 402534 ns | 416256.5 ns | 0.97 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 333 ns | 292 ns | 1.14 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 333 ns | 292 ns | 1.14 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 22000 ns | 22312 ns | 0.99 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1833 ns | 1792 ns | 1.02 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1792 ns | 1833 ns | 0.98 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1834 ns | 1834 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1833 ns | 1875 ns | 0.98 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 169877.5 ns | 176862.5 ns | 0.96 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 7521 ns | 6208 ns | 1.21 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 7167 ns | 6875 ns | 1.04 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7792 ns | 7750 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6500 ns | 7000 ns | 0.93 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 61779 ns | 62506 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9250 ns | 8833 ns | 1.05 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9500 ns | 9250 ns | 1.03 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9042 ns | 9417 ns | 0.96 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9292 ns | 9333 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 314965 ns | 325531 ns | 0.97 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 158324292 ns | 121103854.5 ns | 1.31 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 174385041 ns | 181392229 ns | 0.96 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 148149145.5 ns | 147959958.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 104978917 ns | 103681750 ns | 1.01 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5475583 ns | 5500074 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 673914521 ns | 613086875 ns | 1.10 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 556536500 ns | 578493750 ns | 0.96 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 454282229 ns | 454857041.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 754352104 ns | 752941812.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 35161544.5 ns | 35077599 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 703002500 ns | 649102417 ns | 1.08 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 668300021 ns | 685608520.5 ns | 0.97 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 587968625 ns | 589011249.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 742489083 ns | 739858625 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57833 ns | 59500 ns | 0.97 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 48000 ns | 38708 ns | 1.24 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47959 ns | 48000 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82333 ns | 82708 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 38135 ns | 38528 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1945042 ns | 1741292 ns | 1.12 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1994937.5 ns | 1966416 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1978208 ns | 1984416 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1862834 ns | 1859270.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 174772.5 ns | 177396 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 267333 ns | 271125 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 267521 ns | 274250 ns | 0.98 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 268709 ns | 268416 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 266959 ns | 267791.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 138445.5 ns | 137600.5 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 605250 ns | 587833 ns | 1.03 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 597333.5 ns | 666917 ns | 0.90 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 696500 ns | 587208 ns | 1.19 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 676042 ns | 665917 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 740206.5 ns | 757074 ns | 0.98 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2204042 ns | 2224291.5 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2205084 ns | 2235083 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2220750 ns | 2099770.5 ns | 1.06 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2219958 ns | 2218208 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 135150.5 ns | 135238 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5598583 ns | 5494167 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5526083 ns | 5547875 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5502958 ns | 5497792 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5487708.5 ns | 5395666.5 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 792599 ns | 797087 ns | 0.99 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 660166 ns | 643250 ns | 1.03 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 643583 ns | 646958 ns | 0.99 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 659417 ns | 642375 ns | 1.03 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 644542 ns | 640208 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 47532 ns | 47636 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1795875 ns | 1820958 ns | 0.99 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1722291 ns | 1668166 ns | 1.03 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1727709 ns | 1721291 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 2095458 ns | 2100708 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 227325 ns | 227359.5 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 56375 ns | 58583 ns | 0.96 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46291 ns | 38208.5 ns | 1.21 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46625 ns | 47292 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82500 ns | 82750 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 29417 ns | 29299.5 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2030542 ns | 2023770.5 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2111062.5 ns | 2018000 ns | 1.05 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2091895.5 ns | 2096292 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1996833 ns | 1983895.5 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 193004 ns | 191243 ns | 1.01 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 13382833 ns | 13392479 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 12443000 ns | 12447084 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 12480979 ns | 12573562.5 ns | 0.99 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 15173917 ns | 15225667 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 517073 ns | 515936 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 47607083 ns | 47214583.5 ns | 1.01 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 41883313 ns | 42007792 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 40854417 ns | 40831167 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 58509979 ns | 58287250 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 2896765.5 ns | 2893597 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 97269708 ns | 73879562 ns | 1.32 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 68581771 ns | 91062583 ns | 0.75 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 90434166 ns | 90595250 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 98826583 ns | 98708500 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 56833 ns | 59041 ns | 0.96 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 47417 ns | 38458 ns | 1.23 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47291 ns | 47500 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 80833 ns | 83041 ns | 0.97 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 46888 ns | 46889 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1939104 ns | 1914042 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2010459 ns | 1980250 ns | 1.02 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1977312.5 ns | 1983041.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1892292 ns | 1895208.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 192004 ns | 191685.5 ns | 1.00 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 333 ns | 0.88 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 334 ns | 375 ns | 0.89 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 417 ns | 0.90 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 333 ns | 416 ns | 0.80 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 31834 ns | 31909.5 ns | 1.00 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6084 ns | 5958 ns | 1.02 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6083 ns | 6666 ns | 0.91 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6709 ns | 6500 ns | 1.03 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6167 ns | 6667 ns | 0.93 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 176223.5 ns | 174339.5 ns | 1.01 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 333 ns | 0.88 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 31304 ns | 31434 ns | 1.00 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 2625 ns | 2625 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 2584 ns | 2959 ns | 0.87 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 2917 ns | 2792 ns | 1.04 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 2667 ns | 2833 ns | 0.94 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 164663.5 ns | 161698.5 ns | 1.02 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 324499000.5 ns | 284655874.5 ns | 1.14 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 340579375 ns | 346665396 ns | 0.98 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 313389416.5 ns | 314185249.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 273909208 ns | 271410834 ns | 1.01 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 7105361 ns | 7071052.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 1052816166 ns | 986652459 ns | 1.07 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 943649000 ns | 960769500 ns | 0.98 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 840615666.5 ns | 837320313 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 1152028667 ns | 1160509417 ns | 0.99 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 34095663 ns | 34004605 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 1721214458 ns | 1311324917 ns | 1.31 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 1359927020.5 ns | 1697266750 ns | 0.80 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 1606248000 ns | 1638971166 ns | 0.98 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 1668736833 ns | 1734387958.5 ns | 0.96 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1425375 ns | 1414375 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1415542 ns | 1459333 ns | 0.97 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1416520.5 ns | 1417583 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1410375 ns | 1464750 ns | 0.96 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 127634 ns | 127631 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5060999.5 ns | 4707666.5 ns | 1.08 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5059104 ns | 5056666.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5025375 ns | 5045625 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5018125 ns | 5028167 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 596333 ns | 589690 ns | 1.01 |
| vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) | 163798854 ns | 174231250 ns | 0.94 |
| vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) | 128369875 ns | 167491167 ns | 0.77 |
| vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) | 130888792 ns | 128702541 ns | 1.02 |
| vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) | 168698771 ns | 154878708 ns | 1.09 |
| vgg16(32, 32, 3, 32)/forward/GPU/CUDA | 5432122 ns | 4890073 ns | 1.11 |
| vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) | 630866750 ns | 622332667 ns | 1.01 |
| vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) | 635134916 ns | 581984000 ns | 1.09 |
| vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) | 554211625 ns | 496978166 ns | 1.12 |
| vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) | 648292583 ns | 643892875 ns | 1.01 |
| vgg16(32, 32, 3, 32)/zygote/GPU/CUDA | 16519965 ns | 16065970 ns | 1.03 |
| batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 9165854 ns | 8934042 ns | 1.03 |
| batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 8986459 ns | 9020375 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 7922833 ns | 7917083 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 9756167 ns | 9692542 ns | 1.01 |
| batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1610067 ns | 1603050 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 37032625 ns | 36495271 ns | 1.01 |
| batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 37212042 ns | 38137292 ns | 0.98 |
| batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 33438583 ns | 33438520.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 37841958 ns | 37760500 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6473180 ns | 6473707 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 47479.5 ns | 47375 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 47437.5 ns | 47417 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 47667 ns | 47500 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 47500 ns | 47542 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 18175 ns | 18555 ns | 0.98 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 50250 ns | 50250 ns | 1 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 50625 ns | 50375 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 50542 ns | 50667 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 50416.5 ns | 50375 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 243634.5 ns | 207795 ns | 1.17 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 7229.5 ns | 6375 ns | 1.13 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6917 ns | 7041 ns | 0.98 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7834 ns | 7958 ns | 0.98 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 7292 ns | 7208.5 ns | 1.01 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 134228.5 ns | 101178.5 ns | 1.33 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10375 ns | 10125 ns | 1.02 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9458 ns | 10625 ns | 0.89 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10334 ns | 10625 ns | 0.97 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10250 ns | 10417 ns | 0.98 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 725024.5 ns | 593102.5 ns | 1.22 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6917 ns | 5875 ns | 1.18 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6292 ns | 6208.5 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 7417 ns | 6750 ns | 1.10 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5937.5 ns | 6084 ns | 0.98 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 158899 ns | 121281 ns | 1.31 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 13334 ns | 12708 ns | 1.05 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 13083 ns | 13541 ns | 0.97 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13958 ns | 13250 ns | 1.05 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12875 ns | 13208 ns | 0.97 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 654550.5 ns | 511694 ns | 1.28 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 1083 ns | 1000 ns | 1.08 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 1083 ns | 1042 ns | 1.04 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 1042 ns | 1083 ns | 0.96 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 1083 ns | 1083 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 32302 ns | 32282 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8000 ns | 7750 ns | 1.03 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7958.5 ns | 8042 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8000 ns | 8000 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8250 ns | 8041 ns | 1.03 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 248668.5 ns | 210142.5 ns | 1.18 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 23334 ns | 23166 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 23625 ns | 23209 ns | 1.02 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 23604.5 ns | 23250 ns | 1.02 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 23334 ns | 23104.5 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 18197 ns | 18312 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 52375 ns | 52416 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 52583 ns | 52542 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 52709 ns | 52709 ns | 1 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 52291 ns | 52625 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 365195 ns | 291833.5 ns | 1.25 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1409312.5 ns | 1400833 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1395312.5 ns | 1445959 ns | 0.96 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1395667 ns | 1396833 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1399187.5 ns | 1398917 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 196466 ns | 197117.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5048625 ns | 5008208 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5082916.5 ns | 5030250 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5010208 ns | 5026354 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5015083 ns | 4996437.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 697077 ns | 600264 ns | 1.16 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 3082583 ns | 3038708 ns | 1.01 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2075667 ns | 2105979 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2279000 ns | 2274062.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4910958 ns | 4858083 ns | 1.01 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 586799 ns | 586328 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 24742792 ns | 24399625 ns | 1.01 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 18899334 ns | 19072583.5 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 18912125 ns | 18904750 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 36606271 ns | 36638687.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 2884394 ns | 2819518 ns | 1.02 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 34600271 ns | 33955417 ns | 1.02 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 28275125 ns | 28785062.5 ns | 0.98 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 27978625 ns | 28141333 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 41693583 ns | 41707708.5 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 146263625 ns | 142540583 ns | 1.03 |
| batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 148262792 ns | 146733875 ns | 1.01 |
| batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 125521666 ns | 125527687.5 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 173208104.5 ns | 174248667 ns | 0.99 |
| batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22564372 ns | 22566115 ns | 1.00 |
| batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 948935833 ns | 968276062.5 ns | 0.98 |
| batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 1199705645.5 ns | 860326354.5 ns | 1.39 |
| batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 727524542 ns | 858659167 ns | 0.85 |
| batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 936153853.5 ns | 683117959 ns | 1.37 |
| batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 115985315 ns | 118099274 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 74250 ns | 72375 ns | 1.03 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 76209 ns | 74000 ns | 1.03 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 76042 ns | 76250 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 72167 ns | 73208 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 331111.5 ns | 235570 ns | 1.41 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 282500 ns | 203292 ns | 1.39 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 191083.5 ns | 282896 ns | 0.68 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 280584 ns | 203583 ns | 1.38 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 291917 ns | 207583 ns | 1.41 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1500994.5 ns | 1260670 ns | 1.19 |
| batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 36314916.5 ns | 35143208 ns | 1.03 |
| batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 36531396 ns | 36705709 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 32439729.5 ns | 32591958.5 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 40435354 ns | 40607646 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5837859 ns | 5841170.5 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 151857209 ns | 148155791.5 ns | 1.02 |
| batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 153888604 ns | 158417083.5 ns | 0.97 |
| batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 135530208.5 ns | 137765333 ns | 0.98 |
| batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 283241209 ns | 283770667 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 34859945 ns | 34905958 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 159567375 ns | 120795375 ns | 1.32 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 174506458 ns | 181579562.5 ns | 0.96 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 147925667 ns | 148004834 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 104572437 ns | 108061458.5 ns | 0.97 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5480695 ns | 5466179.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 524085270.5 ns | 468909791.5 ns | 1.12 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 467380250 ns | 485490958.5 ns | 0.96 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 437823166 ns | 438520417 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 737646542 ns | 742778708 ns | 0.99 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 32284174.5 ns | 32266057 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 696105375 ns | 707166333 ns | 0.98 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 658106854.5 ns | 671742104.5 ns | 0.98 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 575346979 ns | 577648896 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 729353375 ns | 734518917 ns | 0.99 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) | 1155874.5 ns | 1349520.5 ns | 0.86 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) | 998792 ns | 780417 ns | 1.28 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) | 991542 ns | 909417 ns | 1.09 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) | 2092625 ns | 2087500 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA | 579446 ns | 566986 ns | 1.02 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) | 2931916.5 ns | 2979167 ns | 0.98 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) | 2619083.5 ns | 2496208 ns | 1.05 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) | 2626604.5 ns | 2619166 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) | 3482417 ns | 3728333 ns | 0.93 |
| mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA | 1969877.5 ns | 1738136 ns | 1.13 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) | 5947625 ns | 5799875 ns | 1.03 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) | 5782625 ns | 5883292 ns | 0.98 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) | 5801958.5 ns | 5800167 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) | 2880584 ns | 2892541.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7208 ns | 7375 ns | 0.98 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6083 ns | 5292 ns | 1.15 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5959 ns | 6208 ns | 0.96 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 9959 ns | 10042 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 26024 ns | 25118 ns | 1.04 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 212562.5 ns | 212333 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 221083.5 ns | 221583 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 221333 ns | 220562.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 209292 ns | 215896 ns | 0.97 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 302079.5 ns | 262400.5 ns | 1.15 |
| vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) | 311414437.5 ns | 307233708 ns | 1.01 |
| vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) | 232931208 ns | 279732584 ns | 0.83 |
| vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) | 202032375 ns | 198830375 ns | 1.02 |
| vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) | 308462875 ns | 309726917 ns | 1.00 |
| vgg16(32, 32, 3, 64)/forward/GPU/CUDA | 7680461 ns | 7656813 ns | 1.00 |
| vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) | 1101691479.5 ns | 1090685500 ns | 1.01 |
| vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) | 909424125 ns | 1068219000 ns | 0.85 |
| vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) | 804661000 ns | 818375167 ns | 0.98 |
| vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) | 1153673416.5 ns | 1160424021 ns | 0.99 |
| vgg16(32, 32, 3, 64)/zygote/GPU/CUDA | 26512167 ns | 26548125.5 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5833.5 ns | 5812.5 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5833 ns | 5708 ns | 1.02 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6270.5 ns | 6959 ns | 0.90 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5146 ns | 5458 ns | 0.94 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 196235 ns | 154820 ns | 1.27 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7542 ns | 7125 ns | 1.06 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7125 ns | 7708 ns | 0.92 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7417 ns | 7625 ns | 0.97 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7125 ns | 7542 ns | 0.94 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 702510 ns | 618164 ns | 1.14 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 625 ns | 625 ns | 1 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 584 ns | 625 ns | 0.93 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 625 ns | 1 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 583 ns | 584 ns | 1.00 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 24654 ns | 23615 ns | 1.04 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9333 ns | 9250 ns | 1.01 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 8750 ns | 9458 ns | 0.93 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9875 ns | 9625 ns | 1.03 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9292 ns | 9750 ns | 0.95 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 239874 ns | 207782.5 ns | 1.15 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 352958 ns | 356333 ns | 0.99 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 351875 ns | 352417 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 351583.5 ns | 356083 ns | 0.99 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 352208 ns | 357500.5 ns | 0.99 |
| bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA | 21408 ns | 21053.5 ns | 1.02 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 779709 ns | 780146 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 775541 ns | 776312.5 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 776062.5 ns | 809375 ns | 0.96 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 817375 ns | 826750 ns | 0.99 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA | 316328 ns | 303323.5 ns | 1.04 |
| batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 317708 ns | 338396 ns | 0.94 |
| batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 341667 ns | 325208 ns | 1.05 |
| batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 453354 ns | 453375 ns | 1.00 |
| batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 10875 ns | 10542 ns | 1.03 |
| batchedmm(16, Bsize=32)/forward/GPU/CUDA | 18691 ns | 17732 ns | 1.05 |
| batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 712145.5 ns | 718917 ns | 0.99 |
| batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 734917 ns | 732645.5 ns | 1.00 |
| batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 1006834 ns | 1009833 ns | 1.00 |
| batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 27250 ns | 26583 ns | 1.03 |
| batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 293795 ns | 257155 ns | 1.14 |
| batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 359083.5 ns | 374000 ns | 0.96 |
| batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 350250 ns | 331500 ns | 1.06 |
| batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 442875 ns | 441875 ns | 1.00 |
| batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 30583 ns | 30917 ns | 0.99 |
| batchedmm(16, Bsize=128)/forward/GPU/CUDA | 22877.5 ns | 22404 ns | 1.02 |
| batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 736584 ns | 739437.5 ns | 1.00 |
| batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 783750 ns | 779666.5 ns | 1.01 |
| batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 1041500 ns | 1041375.5 ns | 1.00 |
| batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 105875 ns | 104312.5 ns | 1.01 |
| batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 265090.5 ns | 235395 ns | 1.13 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 3666 ns | 3625 ns | 1.01 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 3667 ns | 3625 ns | 1.01 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 3667 ns | 3625 ns | 1.01 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 3750 ns | 3459 ns | 1.08 |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 17832 ns | 17702 ns | 1.01 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 4208 ns | 4250 ns | 0.99 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 4292 ns | 4250 ns | 1.01 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 4333 ns | 4334 ns | 1.00 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 4333 ns | 4375 ns | 0.99 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 285935 ns | 245299 ns | 1.17 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 4209 ns | 3479.5 ns | 1.21 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 3875 ns | 3792 ns | 1.02 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 4291 ns | 4334 ns | 0.99 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 3250 ns | 3709 ns | 0.88 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 226147.5 ns | 185222 ns | 1.22 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8541 ns | 8125 ns | 1.05 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8208.5 ns | 8687.5 ns | 0.94 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8833 ns | 8666 ns | 1.02 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8792 ns | 8375 ns | 1.05 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 1241590 ns | 1127148 ns | 1.10 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 203041 ns | 206541 ns | 0.98 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 210833 ns | 212000 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 213042 ns | 211000 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 200208 ns | 202291 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 35096 ns | 34888 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 600458 ns | 648750 ns | 0.93 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 664687.5 ns | 634312.5 ns | 1.05 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 621125 ns | 632771 ns | 0.98 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 587666 ns | 596417 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 364021 ns | 322649.5 ns | 1.13 |
| batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 1006145.5 ns | 998333 ns | 1.01 |
| batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 1034750 ns | 1039375 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 960375 ns | 952083 ns | 1.01 |
| batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 870666.5 ns | 904292 ns | 0.96 |
| batchedmm(128, Bsize=128)/forward/GPU/CUDA | 207603 ns | 208498.5 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 4675520.5 ns | 4540000 ns | 1.03 |
| batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 4661500 ns | 4817791.5 ns | 0.97 |
| batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 4484166.5 ns | 4468750 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 5182375 ns | 5130375 ns | 1.01 |
| batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 945582 ns | 959939 ns | 0.99 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4167 ns | 3875 ns | 1.08 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3458 ns | 3334 ns | 1.04 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 4416.5 ns | 4125 ns | 1.07 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3250 ns | 3750 ns | 0.87 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 242881.5 ns | 197248.5 ns | 1.23 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7625 ns | 7645.5 ns | 1.00 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7125 ns | 7333 ns | 0.97 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7791 ns | 7292 ns | 1.07 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7166 ns | 7458 ns | 0.96 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 1049374 ns | 1027567 ns | 1.02 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1641104.5 ns | 1650375 ns | 0.99 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1162041.5 ns | 1182479.5 ns | 0.98 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1361146 ns | 1370292 ns | 0.99 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2337792 ns | 2441916.5 ns | 0.96 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 215237 ns | 215671.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12428417 ns | 12370500 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9554417 ns | 9601667 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9282166 ns | 9328687.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18043958 ns | 18097145.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 1957521 ns | 1953457 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17446729 ns | 17380125 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14307562.5 ns | 14471146 ns | 0.99 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14338292 ns | 14397875 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21055500 ns | 21055583 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 90250 ns | 91125 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 89750 ns | 90875 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 92271 ns | 94958 ns | 0.97 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 92625 ns | 88000 ns | 1.05 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 126161 ns | 126032 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2045083 ns | 2023583.5 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2029000 ns | 2028542 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2032875 ns | 2033312 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2022667 ns | 2043416.5 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1071170.5 ns | 1084734 ns | 0.99 |
| batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 1645.5 ns | 3458.5 ns | 0.48 |
| batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 2375 ns | 1625 ns | 1.46 |
| batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 3708 ns | 3500 ns | 1.06 |
| batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 2875 ns | 1750 ns | 1.64 |
| batchedmm(2, Bsize=4)/forward/GPU/CUDA | 16073 ns | 15936 ns | 1.01 |
| batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 2792 ns | 2584 ns | 1.08 |
| batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 2708 ns | 2791 ns | 0.97 |
| batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 2916 ns | 2917 ns | 1.00 |
| batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 2792 ns | 2833 ns | 0.99 |
| batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 196950.5 ns | 195099.5 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7166 ns | 7250 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6083 ns | 5292 ns | 1.15 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5958 ns | 6083 ns | 0.98 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10042 ns | 10125 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 33844 ns | 33830 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 215354.5 ns | 224916 ns | 0.96 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 220833.5 ns | 234875 ns | 0.94 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 220875 ns | 231083 ns | 0.96 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 209625.5 ns | 218917 ns | 0.96 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 351596 ns | 348229.5 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3750 ns | 3708 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3708 ns | 3708 ns | 1 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3708 ns | 3709 ns | 1.00 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3709 ns | 3708 ns | 1.00 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 22384 ns | 21982 ns | 1.02 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14250 ns | 14459 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14459 ns | 14208 ns | 1.02 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14334 ns | 14417 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14375 ns | 14584 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 522120.5 ns | 489892.5 ns | 1.07 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 140291 ns | 94917 ns | 1.48 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 91729.5 ns | 93416.5 ns | 0.98 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 96250 ns | 99875 ns | 0.96 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 94458 ns | 92625 ns | 1.02 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 125465 ns | 125549 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1947916 ns | 1921625 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1932104.5 ns | 1933333.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1925000 ns | 1928500 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1650916 ns | 1950604.5 ns | 0.85 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1016603 ns | 964756 ns | 1.05 |
| lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) | 859167 ns | 873521 ns | 0.98 |
| lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) | 818395.5 ns | 804167 ns | 1.02 |
| lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) | 1219500 ns | 1218520.5 ns | 1.00 |
| lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) | 962834 ns | 954959 ns | 1.01 |
| lenet(28, 28, 1, 32)/forward/GPU/CUDA | 269546 ns | 285492.5 ns | 0.94 |
| lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) | 2844645.5 ns | 2830854 ns | 1.00 |
| lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) | 2436375 ns | 2531000 ns | 0.96 |
| lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) | 3336375 ns | 3356083 ns | 0.99 |
| lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) | 3413042 ns | 3412042 ns | 1.00 |
| lenet(28, 28, 1, 32)/zygote/GPU/CUDA | 1630539.5 ns | 1671062 ns | 0.98 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 15708.5 ns | 16271 ns | 0.97 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 16000 ns | 16500 ns | 0.97 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 17166.5 ns | 18666.5 ns | 0.92 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 14958.5 ns | 18916 ns | 0.79 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 143350.5 ns | 144500.5 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 217083.5 ns | 260708 ns | 0.83 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 215417 ns | 254749.5 ns | 0.85 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 216916 ns | 227979 ns | 0.95 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 227125 ns | 226584 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 653459 ns | 650846.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 221959 ns | 222167 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 221645.5 ns | 222041.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 221395.5 ns | 222166 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 221083 ns | 220000 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 274733.5 ns | 277439 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 511042 ns | 561333.5 ns | 0.91 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 495750 ns | 549000 ns | 0.90 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 497042 ns | 558813 ns | 0.89 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 508875 ns | 557729.5 ns | 0.91 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1471458 ns | 1450310.5 ns | 1.01 |
| batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 4666.5 ns | 4000 ns | 1.17 |
| batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 4083 ns | 4166 ns | 0.98 |
| batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 5708 ns | 5750 ns | 0.99 |
| batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 3917 ns | 4042 ns | 0.97 |
| batchedmm(16, Bsize=4)/forward/GPU/CUDA | 16967 ns | 17089 ns | 0.99 |
| batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 7125 ns | 7000 ns | 1.02 |
| batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 7417 ns | 7208 ns | 1.03 |
| batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 7520.5 ns | 7166 ns | 1.05 |
| batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 7250 ns | 7542 ns | 0.96 |
| batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 198610.5 ns | 196929 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 18084 ns | 18083 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 18166.5 ns | 18959 ns | 0.96 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 18666.5 ns | 19250 ns | 0.97 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 18395.5 ns | 18124.5 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 148303 ns | 165663 ns | 0.90 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 214125 ns | 222875 ns | 0.96 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 212791.5 ns | 213896 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 213042 ns | 225792 ns | 0.94 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 219417 ns | 222042 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1024505 ns | 1029397 ns | 1.00 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 4625 ns | 4500 ns | 1.03 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 4333 ns | 3958 ns | 1.09 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5708 ns | 5125 ns | 1.11 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 3708.5 ns | 4333 ns | 0.86 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 244514 ns | 204180 ns | 1.20 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10875 ns | 10917 ns | 1.00 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 10062.5 ns | 10583 ns | 0.95 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 11375 ns | 10500 ns | 1.08 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10583 ns | 10750 ns | 0.98 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 1099794 ns | 1058573 ns | 1.04 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4000 ns | 3291.5 ns | 1.22 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3792 ns | 3542 ns | 1.07 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 4375 ns | 4417 ns | 0.99 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 2750 ns | 3458 ns | 0.80 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 250198 ns | 245634 ns | 1.02 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7270.5 ns | 7458 ns | 0.97 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7459 ns | 7583 ns | 0.98 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7916 ns | 7625 ns | 1.04 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7500 ns | 7541 ns | 0.99 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 1106505 ns | 1074772.5 ns | 1.03 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 24086812.5 ns | 23471041.5 ns | 1.03 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 34704750 ns | 43849166 ns | 0.79 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 37376896 ns | 37957792 ns | 0.98 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 34935000 ns | 34964125 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1853477 ns | 1792082 ns | 1.03 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 186942250 ns | 184426958 ns | 1.01 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 159685500 ns | 173017604 ns | 0.92 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 146457125 ns | 147161645.5 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 411532208 ns | 411405916 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 16500596 ns | 16521696 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 434054666 ns | 426004833.5 ns | 1.02 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 253740479 ns | 259123250 ns | 0.98 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 299567770.5 ns | 296958750 ns | 1.01 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 479705417 ns | 480245750 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 184375 ns | 183042 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 183584 ns | 185188 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 184333 ns | 186041.5 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 185292 ns | 184333.5 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 229399 ns | 226412 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 594750 ns | 597750 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 586209 ns | 598229 ns | 0.98 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 586729.5 ns | 632895.5 ns | 0.93 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 599250.5 ns | 586958 ns | 1.02 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1121066.5 ns | 1097502 ns | 1.02 |
| batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 3936375 ns | 3838542 ns | 1.03 |
| batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 4081937 ns | 4115979 ns | 0.99 |
| batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 3587479 ns | 3571292 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 4565729.5 ns | 4600166.5 ns | 0.99 |
| batchedmm(128, Bsize=512)/forward/GPU/CUDA | 538820 ns | 534974 ns | 1.01 |
| batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 18136458.5 ns | 17343875 ns | 1.05 |
| batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 17936750 ns | 18514250 ns | 0.97 |
| batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 16532771 ns | 16537292 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 20226167 ns | 20367667 ns | 0.99 |
| batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2633099 ns | 2795688 ns | 0.94 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 583 ns | 583 ns | 1 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 625 ns | 625 ns | 1 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 625 ns | 1 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 583 ns | 666 ns | 0.88 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 31971 ns | 32682 ns | 0.98 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9375 ns | 9042 ns | 1.04 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9000 ns | 9709 ns | 0.93 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9500 ns | 9500 ns | 1 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9333 ns | 9666 ns | 0.97 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 265140 ns | 266437.5 ns | 1.00 |
| vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) | 503989417 ns | 499772583 ns | 1.01 |
| vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) | 431858541.5 ns | 504959958 ns | 0.86 |
| vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) | 427434834 ns | 422832542 ns | 1.01 |
| vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) | 592092708 ns | 673427063 ns | 0.88 |
| vgg16(32, 32, 3, 128)/forward/GPU/CUDA | 11928812 ns | 11842270.5 ns | 1.01 |
| vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) | 1891189687.5 ns | 1875482271 ns | 1.01 |
| vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) | 1632073542 ns | 1653498000 ns | 0.99 |
| vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) | 1496948750 ns | 1486024395.5 ns | 1.01 |
| vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) | 2217192312.5 ns | 2210913770.5 ns | 1.00 |
| vgg16(32, 32, 3, 128)/zygote/GPU/CUDA | 49332313 ns | 49084588.5 ns | 1.01 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1638750 ns | 1649062.5 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1179458 ns | 1182584 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1387875 ns | 1392250 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2352479.5 ns | 2377145.5 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 214938 ns | 218920 ns | 0.98 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12852583.5 ns | 12688458.5 ns | 1.01 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9964500 ns | 10001583.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9669416.5 ns | 9698792 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18345667 ns | 18502292 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2032751.5 ns | 2042988 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17791875 ns | 17689291 ns | 1.01 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14679354.5 ns | 14793041.5 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14576209 ns | 14622084 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21490021 ns | 21477583.5 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 26333 ns | 26292 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 26250 ns | 26250 ns | 1 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 26250 ns | 26250 ns | 1 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 26250 ns | 26291 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 23891 ns | 24105 ns | 0.99 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 66750 ns | 67000 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 67791 ns | 67042 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 67042 ns | 67875 ns | 0.99 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 67042 ns | 67208 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 403092 ns | 396461.5 ns | 1.02 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 203542 ns | 204959 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 209834 ns | 209958 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 210500 ns | 209875 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 199500 ns | 199833 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 26253 ns | 26682 ns | 0.98 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 602666.5 ns | 646208 ns | 0.93 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 670479 ns | 670000 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 621791 ns | 644166 ns | 0.97 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 633916 ns | 630416 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 351051 ns | 354787 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 678375 ns | 598417 ns | 1.13 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 654937.5 ns | 657292 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 646500 ns | 664187.5 ns | 0.97 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 669916 ns | 659708 ns | 1.02 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 131843 ns | 132717 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2326042 ns | 2235958 ns | 1.04 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2262000 ns | 2279125 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2145978.5 ns | 2249833 ns | 0.95 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2234542 ns | 2316042 ns | 0.96 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1242552 ns | 1193695.5 ns | 1.04 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17645.5 ns | 18500 ns | 0.95 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17062.5 ns | 19250 ns | 0.89 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 19125 ns | 19292 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 17875 ns | 17500 ns | 1.02 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 146421.5 ns | 146082 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 220959 ns | 259917 ns | 0.85 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 219500 ns | 259625 ns | 0.85 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 220291 ns | 230208.5 ns | 0.96 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 235729 ns | 256708 ns | 0.92 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1091456.5 ns | 1005431.5 ns | 1.09 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 625 ns | 584 ns | 1.07 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 666 ns | 625 ns | 1.07 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 708 ns | 625 ns | 1.13 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 583 ns | 625 ns | 0.93 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 23721 ns | 23900 ns | 0.99 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 10084 ns | 9750 ns | 1.03 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9791.5 ns | 10333 ns | 0.95 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 10041 ns | 10000 ns | 1.00 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9875 ns | 10000 ns | 0.99 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 261550.5 ns | 259163 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 6042 ns | 5833 ns | 1.04 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5625 ns | 5916 ns | 0.95 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6584 ns | 6459 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5125 ns | 5833 ns | 0.88 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 234716 ns | 228223.5 ns | 1.03 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7417 ns | 7500 ns | 0.99 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7208 ns | 7666 ns | 0.94 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8041 ns | 7666 ns | 1.05 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7334 ns | 7333 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 806215 ns | 770644 ns | 1.05 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 2229.5 ns | 2333 ns | 0.96 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 2458 ns | 2187.5 ns | 1.12 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 2375 ns | 2292 ns | 1.04 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 2292 ns | 2250 ns | 1.02 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 17855 ns | 17986 ns | 0.99 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 6542 ns | 6500 ns | 1.01 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 6667 ns | 6666 ns | 1.00 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 6834 ns | 6666 ns | 1.03 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 6500 ns | 6625 ns | 0.98 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 333301.5 ns | 321059 ns | 1.04 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 755083 ns | 749208.5 ns | 1.01
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 746333 ns | 748958 ns | 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 749250 ns | 750125 ns | 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 750187.5 ns | 748834 ns | 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 21362 ns | 21410 ns | 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 788958.5 ns | 798125 ns | 0.99
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 772209 ns | 791208 ns | 0.98
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 787687.5 ns | 837729.5 ns | 0.94
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 791333 ns | 775270.5 ns | 1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 298265 ns | 301663.5 ns | 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7125 ns | 7417 ns | 0.96
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6041 ns | 5291 ns | 1.14
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5959 ns | 6042 ns | 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10209 ns | 10292 ns | 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 33317.5 ns | 33301 ns | 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 226500 ns | 232896 ns | 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 236063 ns | 268833.5 ns | 0.88
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 228041 ns | 267354.5 ns | 0.85
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 255812.5 ns | 215500 ns | 1.19
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 363202 ns | 361937 ns | 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 10542 ns | 10000 ns | 1.05
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 10334 ns | 9833.5 ns | 1.05
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 11208 ns | 11042 ns | 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 9833 ns | 10333 ns | 0.95
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 246668.5 ns | 250034.5 ns | 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 24729.5 ns | 24334 ns | 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 24666 ns | 25250 ns | 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 25542 ns | 24542 ns | 1.04
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 24625 ns | 24334 ns | 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 1134784.5 ns | 1111417.5 ns | 1.02
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 106546667 ns | 106812374.5 ns | 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 118425312.5 ns | 126726167 ns | 0.93
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 120189792 ns | 121727417 ns | 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 117420708 ns | 118228479 ns | 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 2655736 ns | 2616848 ns | 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 394570417 ns | 391804291 ns | 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 368931959 ns | 379056792 ns | 0.97
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 424438979 ns | 355535666 ns | 1.19
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 482063875 ns | 486452916 ns | 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 15246102 ns | 15186296 ns | 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 945190750 ns | 756685666.5 ns | 1.25
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 580209500 ns | 774854291 ns | 0.75
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 744122999.5 ns | 746786813 ns | 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 945148083 ns | 947077458 ns | 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7708 ns | 8416 ns | 0.92
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7250 ns | 7125 ns | 1.02
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8750 ns | 8125 ns | 1.08
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 6958.5 ns | 9604 ns | 0.72
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 238753.5 ns | 240976 ns | 0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14500 ns | 14250 ns | 1.02
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 13875 ns | 14291 ns | 0.97
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14000 ns | 14167 ns | 0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14333 ns | 14166 ns | 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 1093778.5 ns | 1095523 ns | 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6708 ns | 5917 ns | 1.13
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6125 ns | 6125 ns | 1
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 8208 ns | 6687.5 ns | 1.23
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5417 ns | 6292 ns | 0.86
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 238599 ns | 239291 ns | 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12833 ns | 12583 ns | 1.02
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12750 ns | 13125 ns | 0.97
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13166 ns | 13291.5 ns | 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12666 ns | 12417 ns | 1.02
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 799288.5 ns | 797358.5 ns | 1.00
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 5667 ns | 5459 ns | 1.04
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 6250 ns | 5833 ns | 1.07
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 6459 ns | 7000 ns | 0.92
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 5500 ns | 5542 ns | 0.99
batchedmm(2, Bsize=128)/forward/GPU/CUDA | 17328 ns | 16938 ns | 1.02
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 15583 ns | 15500 ns | 1.01
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 15417 ns | 15458 ns | 1.00
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 15625 ns | 15666 ns | 1.00
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 15791 ns | 15875 ns | 0.99
batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 202450 ns | 200590 ns | 1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 417 ns | 0.90
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 375 ns | 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 416 ns | 0.90
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 333 ns | 416 ns | 0.80
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 23671 ns | 23824 ns | 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6541 ns | 6583 ns | 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6459 ns | 6500 ns | 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6667 ns | 6666.5 ns | 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6312.5 ns | 6583 ns | 0.96
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 241480.5 ns | 239979.5 ns | 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 5875 ns | 5875 ns | 1
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 5834 ns | 6000 ns | 0.97
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 5917 ns | 5917 ns | 1
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 5917 ns | 5958 ns | 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 25115 ns | 24627 ns | 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 21604.5 ns | 20916.5 ns | 1.03
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 21166 ns | 21209 ns | 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 21750 ns | 21833 ns | 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 21708.5 ns | 21292 ns | 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 267689.5 ns | 265615.5 ns | 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 186417 ns | 192687.5 ns | 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 144250 ns | 146521 ns | 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 148916.5 ns | 149374.5 ns | 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 187729 ns | 142250 ns | 1.32
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 167935.5 ns | 168462.5 ns | 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1375083.5 ns | 1318667 ns | 1.04
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1321917 ns | 1326875 ns | 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1326146 ns | 1328208 ns | 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1322375 ns | 1311167 ns | 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1358092 ns | 1370856 ns | 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 23000 ns | 22125 ns | 1.04
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 25770.5 ns | 22083 ns | 1.17
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 24167 ns | 24209 ns | 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 23834 ns | 24417 ns | 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 354989 ns | 357178 ns | 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 130916 ns | 130958 ns | 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 188500 ns | 180395.5 ns | 1.04
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 127375 ns | 130875 ns | 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 176959 ns | 178917 ns | 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1479622.5 ns | 1498842 ns | 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 334 ns | 1.12
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 375 ns | 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 375 ns | 0.78
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 23532 ns | 23528 ns | 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6458 ns | 6417 ns | 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6333 ns | 6791 ns | 0.93
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6833 ns | 6834 ns | 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6458 ns | 6792 ns | 0.95
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 258733.5 ns | 258073.5 ns | 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 4833 ns | 4500 ns | 1.07
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4917 ns | 5250 ns | 0.94
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 5709 ns | 5125 ns | 1.11
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4167 ns | 4667 ns | 0.89
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 256891 ns | 256140 ns | 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9916 ns | 10000 ns | 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9958 ns | 10416 ns | 0.96
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10584 ns | 10292 ns | 1.03
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10167 ns | 10208 ns | 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 1360812 ns | 1357774 ns | 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1667 ns | 1625 ns | 1.03
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1625 ns | 1625 ns | 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1584 ns | 1625 ns | 0.97
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1583 ns | 1625 ns | 0.97
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 23180 ns | 23069 ns | 1.00
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 5708 ns | 5750 ns | 0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 5667 ns | 6084 ns | 0.93
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 6000 ns | 5917 ns | 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 5708 ns | 5667 ns | 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 276437 ns | 275859 ns | 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 6818791 ns | 6814167 ns | 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 6367083 ns | 6368854.5 ns | 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 6546291.5 ns | 6497917 ns | 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 7662166 ns | 7560667 ns | 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 215904 ns | 215030 ns | 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 24172500 ns | 24038396 ns | 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 21282334 ns | 21318250 ns | 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 21008479 ns | 21055625 ns | 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 29757292 ns | 29800458 ns | 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2111780 ns | 2117334 ns | 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 48853770.5 ns | 37406895.5 ns | 1.31
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 34383187.5 ns | 45481041 ns | 0.76
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 45683833.5 ns | 45606750 ns | 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 49363417 ns | 49407375 ns | 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6479.5 ns | 6375 ns | 1.02
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6125 ns | 6208 ns | 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6833 ns | 7292 ns | 0.94
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 5750 ns | 5916 ns | 0.97
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 238562.5 ns | 237163.5 ns | 1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8625 ns | 8375 ns | 1.03
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8084 ns | 8666 ns | 0.93
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8208 ns | 8416 ns | 0.98
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 7917 ns | 8375 ns | 0.95
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 1069949 ns | 1062411 ns | 1.01
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) | 1541500 ns | 1544167 ns | 1.00
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) | 1273500 ns | 1249833.5 ns | 1.02
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) | 1639187 ns | 1625709 ns | 1.01
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) | 2161000 ns | 2004375 ns | 1.08
lenet(28, 28, 1, 128)/forward/GPU/CUDA | 276949 ns | 275720 ns | 1.00
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) | 7986167 ns | 7903083 ns | 1.01
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) | 6543375.5 ns | 6659625 ns | 0.98
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) | 7167709 ns | 7184500 ns | 1.00
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) | 10462145.5 ns | 10128083 ns | 1.03
lenet(28, 28, 1, 128)/zygote/GPU/CUDA | 1888924 ns | 1884846.5 ns | 1.00
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 343084 ns | 369396 ns | 0.93
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 369208 ns | 353625.5 ns | 1.04
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 456437.5 ns | 456542 ns | 1.00
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 26417 ns | 24041.5 ns | 1.10
batchedmm(128, Bsize=4)/forward/GPU/CUDA | 42517 ns | 46544 ns | 0.91
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 749479 ns | 743500 ns | 1.01
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 814979 ns | 796417 ns | 1.02
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 1061458 ns | 1071583 ns | 0.99
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 119729.5 ns | 125958 ns | 0.95
batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 307361.5 ns | 312111.5 ns | 0.98
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 395625 ns | 397375 ns | 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 288375 ns | 212250 ns | 1.36
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 288167 ns | 288125 ns | 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 749875 ns | 753500 ns | 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 44492 ns | 44394 ns | 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 646208 ns | 673292 ns | 0.96
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 533666 ns | 472125 ns | 1.13
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 529000 ns | 531791 ns | 0.99
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 974208 ns | 974625 ns | 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 191704 ns | 191967.5 ns | 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 670000 ns | 657167 ns | 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 636958 ns | 669958.5 ns | 0.95
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 641042 ns | 661104 ns | 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 677625 ns | 662708 ns | 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 132879 ns | 132971.5 ns | 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2560042 ns | 2458250 ns | 1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2486124.5 ns | 2498250 ns | 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2459583 ns | 2467687 ns | 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2464667 ns | 2501875 ns | 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1294427.5 ns | 1568577 ns | 0.83
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 2459 ns | 4333 ns | 0.57
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 3208 ns | 2583 ns | 1.24
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 4500 ns | 4417 ns | 1.02
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 3354 ns | 2750 ns | 1.22
batchedmm(2, Bsize=32)/forward/GPU/CUDA | 16581 ns | 16411 ns | 1.01
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 5541 ns | 5375 ns | 1.03
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 5500 ns | 5458 ns | 1.01
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 5583 ns | 5625 ns | 0.99
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 5541 ns | 5625 ns | 0.99
batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 200795 ns | 199892.5 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1458667 ns | 1463541 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1501750 ns | 1497208 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1499417 ns | 1503375 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1438916 ns | 1442834 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 40877 ns | 41596 ns | 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5154042 ns | 5109479 ns | 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5302542 ns | 5289042 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5280125 ns | 5301333.5 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4986917 ns | 4680604 ns | 1.07
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 198039.5 ns | 198982.5 ns | 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3750 ns | 3709 ns | 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3667 ns | 3667 ns | 1
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3708 ns | 3750 ns | 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3667 ns | 3709 ns | 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA | 33533 ns | 33311 ns | 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14875 ns | 15125 ns | 0.98
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 15125 ns | 15084 ns | 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 15417 ns | 15250 ns | 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 15125 ns | 15250 ns | 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA | 380809 ns | 376159 ns | 1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 71583 ns | 71208 ns | 1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 71458 ns | 71209 ns | 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 71166 ns | 71250 ns | 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 70000 ns | 71500 ns | 0.98
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA | 112938 ns | 112893 ns | 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 327333 ns | 317750 ns | 1.03
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 333917 ns | 323708 ns | 1.03
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 320375 ns | 334166 ns | 0.96
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 318167 ns | 320500 ns | 0.99
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA | 194303.5 ns | 195635 ns | 0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 1000 ns | 1041 ns | 0.96
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 1000 ns | 1083 ns | 0.92
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 1042 ns | 1125 ns | 0.93
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 959 ns | 1083 ns | 0.89
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 24404 ns | 23896 ns | 1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8166 ns | 8000 ns | 1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8083 ns | 8625 ns | 0.94
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8250 ns | 8333 ns | 0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8000 ns | 8167 ns | 0.98
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 265429.5 ns | 263562 ns | 1.01
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) | 503042 ns | 509521 ns | 0.99
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) | 488125 ns | 479125 ns | 1.02
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) | 565250 ns | 564625 ns | 1.00
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) | 215521 ns | 232458.5 ns | 0.93
batchedmm(128, Bsize=32)/forward/GPU/CUDA | 129735 ns | 129625 ns | 1.00
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) | 1418125 ns | 1393208 ns | 1.02
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) | 1470041 ns | 1479000 ns | 0.99
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) | 1769041.5 ns | 1765792 ns | 1.00
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) | 863062.5 ns | 868125 ns | 0.99
batchedmm(128, Bsize=32)/zygote/GPU/CUDA | 275150 ns | 276144 ns | 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 375 ns | 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 375 ns | 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 416 ns | 375 ns | 1.11
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 375 ns | 0.78
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 32275 ns | 31637 ns | 1.02
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6375 ns | 6334 ns | 1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6375 ns | 6625 ns | 0.96
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6645.5 ns | 6625 ns | 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6291.5 ns | 6667 ns | 0.94
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 266163 ns | 263537 ns | 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1767250 ns | 1722958.5 ns | 1.03
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1723000 ns | 1735250 ns | 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1726625 ns | 1733292 ns | 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1769375 ns | 1763312 ns | 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 169706 ns | 169598.5 ns | 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4423667 ns | 4353521 ns | 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4340375 ns | 4379875 ns | 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4364395.5 ns | 4349063 ns | 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4356604.5 ns | 4390959 ns | 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1259542.5 ns | 1422688.5 ns | 0.89
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 6875 ns | 6938 ns | 0.99
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 6708 ns | 6875 ns | 0.98
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 9167 ns | 7166 ns | 1.28
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 9667 ns | 6583 ns | 1.47
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA | 21299 ns | 20547 ns | 1.04
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 52542 ns | 50541 ns | 1.04
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 48458 ns | 50312.5 ns | 0.96
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 32834 ns | 51250 ns | 0.64
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 51708.5 ns | 58249.5 ns | 0.89
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA | 295364.5 ns | 308428 ns | 0.96
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) | 17959 ns | 17750 ns | 1.01
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) | 18333 ns | 17875 ns | 1.03
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) | 18667 ns | 19125 ns | 0.98
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) | 17833.5 ns | 17500 ns | 1.02
batchedmm(2, Bsize=512)/forward/GPU/CUDA | 18767 ns | 18339 ns | 1.02
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) | 53250 ns | 53375 ns | 1.00
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) | 53583 ns | 53166 ns | 1.01
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) | 53292 ns | 53250 ns | 1.00
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) | 53500 ns | 53458 ns | 1.00
batchedmm(2, Bsize=512)/zygote/GPU/CUDA | 337341 ns | 344770 ns | 0.98
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 75666 ns | 75459 ns | 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 75250 ns | 75375 ns | 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 75208 ns | 75395.5 ns | 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 74750 ns | 75458 ns | 0.99
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA | 46984 ns | 47276 ns | 0.99
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 342791 ns | 336417 ns | 1.02
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 339042 ns | 341125 ns | 0.99
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 324833 ns | 339250 ns | 0.96
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 325083 ns | 336541 ns | 0.97
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA | 212927.5 ns | 213552 ns | 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1484000 ns | 1489000 ns | 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1528916 ns | 1522292 ns | 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1527041 ns | 1529458 ns | 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1464042 ns | 1468458 ns | 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 52506 ns | 52575 ns | 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5172417 ns | 5115542 ns | 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5313667 ns | 5292541 ns | 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5251417 ns | 5289458.5 ns | 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4985750 ns | 4978625 ns | 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 206884 ns | 206120 ns | 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 28375 ns | 28125 ns | 1.01
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 28250 ns | 28250 ns | 1
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 28209 ns | 28250 ns | 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 28208 ns | 28208 ns | 1
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA | 24921 ns | 24358 ns | 1.02
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 66458 ns | 66334 ns | 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 66625 ns | 66167 ns | 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 66291 ns | 66209 ns | 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66250 ns | 66750 ns | 0.99
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA | 525792 ns | 526089 ns | 1.00
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) | 1326354 ns | 1498042 ns | 0.89
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) | 1132104 ns | 911000 ns | 1.24
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) | 1139166 ns | 1149625 ns | 0.99
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) | 2248604 ns | 2098500 ns | 1.07
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA | 583822.5 ns | 582137 ns | 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) | 3055395.5 ns | 3080771 ns | 0.99
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) | 2729979 ns | 2593125 ns | 1.05
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) | 2738333 ns | 2751125 ns | 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) | 3816042 ns | 3818125 ns | 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA | 2120607 ns | 2100592 ns | 1.01
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) | 8049792 ns | 7913063 ns | 1.02
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) | 8097167 ns | 8011208 ns | 1.01
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) | 7911292 ns | 7901167 ns | 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) | 4824937 ns | 4863125 ns | 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 82042 ns | 81500 ns | 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 81875 ns | 82000 ns | 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 82625 ns | 84125 ns | 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82125 ns | 83083 ns | 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 194553 ns | 194175 ns | 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2055125 ns | 2020500 ns | 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2001916.5 ns | 2036292 ns | 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2021458 ns | 2018708 ns | 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2014750 ns | 2021916 ns | 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 811167.5 ns | 810603 ns | 1.00
This comment was automatically generated by a workflow using github-action-benchmark.
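Each benchmark entry above lists two timings in nanoseconds followed by a ratio. Judging from the numbers themselves, the ratio appears to be the first timing divided by the second, rounded to two decimals (for example, 625 / 584 ≈ 1.07). The sketch below recomputes it for a few entries copied from the table. The names `ratio`, `mismatches`, and `SAMPLES` are hypothetical helpers for illustration and are not part of github-action-benchmark.

```python
def ratio(first_ns: float, second_ns: float) -> float:
    """Return first_ns / second_ns rounded to two decimals (assumed convention)."""
    if second_ns <= 0:
        raise ValueError("second timing must be positive")
    return round(first_ns / second_ns, 2)


# (name, first timing in ns, second timing in ns, reported ratio),
# copied from entries in the table above.
SAMPLES = [
    ("batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)", 625, 584, 1.07),
    ("batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)", 583, 625, 0.93),
    ("bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s)", 9667, 6583, 1.47),
]


def mismatches(samples):
    """Return the names of entries whose recomputed ratio differs from the reported one."""
    return [
        name
        for name, first_ns, second_ns, reported in samples
        if abs(ratio(first_ns, second_ns) - reported) > 1e-9
    ]


if __name__ == "__main__":
    # An empty list means every sampled entry is consistent with the assumed formula.
    print(mismatches(SAMPLES))
```

Under this reading, a ratio above 1 means the first timing is the slower of the two, so a value such as 1.47 is worth a second look, although sub-microsecond CPU timings like these are often noisy between runs.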