docs: trigger build for docs (#1087)
Showing 3 changed files with 15 additions and 10 deletions.
3986545
Lux Benchmarks
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
3792
ns3917
ns0.97
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
4084
ns4125
ns0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
4834
ns4916
ns0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
3959
ns4083
ns0.97
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA
61509.5
ns61146.5
ns1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
10500
ns10917
ns0.96
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
10541
ns10834
ns0.97
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
10250
ns11250
ns0.91
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
10250
ns10833.5
ns0.95
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA
431498.5
ns428022
ns1.01
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s)
1062.5
ns1125
ns0.94
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s)
1167
ns1375
ns0.85
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s)
1417
ns1333
ns1.06
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s)
1208
ns1208
ns1
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA
18573
ns18431
ns1.01
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
4000
ns3500
ns1.14
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
4000
ns4208
ns0.95
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
4209
ns4250
ns0.99
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
3750
ns4000
ns0.94
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA
111184
ns110418
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
57750
ns56958
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
38542
ns46584
ns0.83
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
46583
ns38458
ns1.21
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
82208
ns82541
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
37503.5
ns37005
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2037645.5
ns2031792
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2095625
ns2084916.5
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1844375
ns2098292
ns0.88
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
2001375
ns1994604.5
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
196039
ns194818
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
145583
ns144208
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
143584
ns146833
ns0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
146458
ns145854.5
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
145000
ns155416.5
ns0.93
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
168190
ns165909
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1114291
ns1062250
ns1.05
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1150292
ns1115937.5
ns1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
805500
ns1107500
ns0.73
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1122750
ns1116875
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
526921
ns521759.5
ns1.01
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
3292
ns3416
ns0.96
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
3666
ns3459
ns1.06
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
4167
ns4625
ns0.90
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
3500
ns3375
ns1.04
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
72235.5
ns71735
ns1.01
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
10125
ns9333
ns1.08
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
8375
ns9458
ns0.89
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
8792
ns10209
ns0.86
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
8833
ns9042
ns0.98
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
480020
ns496830.5
ns0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
14875
ns14708
ns1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
15000
ns15750
ns0.95
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
17520.5
ns17292
ns1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
14583
ns17791.5
ns0.82
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
53914
ns53700
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
214792
ns225708.5
ns0.95
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
214875
ns225416
ns0.95
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
214750
ns215417
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
226813
ns212208
ns1.07
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
272785
ns271526
ns1.00
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s)
625
ns417
ns1.50
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s)
625
ns750
ns0.83
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s)
917
ns792
ns1.16
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s)
459
ns708
ns0.65
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA
17774
ns17628
ns1.01
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
1792
ns1417
ns1.26
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
1417
ns1791
ns0.79
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
1709
ns1875
ns0.91
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
1417
ns1625
ns0.87
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA
102929.5
ns103125
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
7167
ns7125
ns1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
5250
ns6000
ns0.88
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
6000
ns5334
ns1.12
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
10000
ns9917
ns1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
23666
ns23465
ns1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
225187.5
ns223145.5
ns1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
237479.5
ns241208
ns0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
229334
ns230125
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
226709
ns214458
ns1.06
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
168739
ns168094
ns1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s)
3875
ns3875
ns1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s)
3959
ns3959
ns1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s)
3875
ns3917
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s)
3917
ns3917
ns1
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA
23839
ns23549
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
16792
ns16500
ns1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
16833
ns17083
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
16958
ns16792
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
16750
ns16917
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA
161365
ns161967.5
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s)
571458
ns575041
ns0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s)
576000
ns576292
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s)
574041
ns576625
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s)
571458
ns571458
ns1
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA
113559.5
ns112966.5
ns1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s)
1425375
ns1420500
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s)
1418875
ns1421667
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s)
1418958
ns1425458
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s)
1422750
ns1418167
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA
210833
ns211527
ns1.00
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s)
1076645.5
ns1068625
ns1.01
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s)
934291
ns968021
ns0.97
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s)
1340187.5
ns1326812.5
ns1.01
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s)
1294270.5
ns1291125
ns1.00
lenet(28, 28, 1, 64)/forward/GPU/CUDA
271656
ns271093.5
ns1.00
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s)
5796417
ns5787333
ns1.00
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s)
4651792
ns4571958
ns1.02
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s)
4918209
ns4958979
ns0.99
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s)
5515938
ns5712792
ns0.97
lenet(28, 28, 1, 64)/zygote/GPU/CUDA
1071316.5
ns1067736
ns1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s)
542
ns500
ns1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s)
583
ns542
ns1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s)
500
ns542
ns0.92
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s)
500
ns542
ns0.92
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA
23948.5
ns23893
ns1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
2167
ns2084
ns1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
2209
ns2250
ns0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
2125
ns2167
ns0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
2125
ns2125
ns1
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA
169153
ns171705
ns0.99
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
3625
ns3500
ns1.04
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
4084
ns4042
ns1.01
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
4687.5
ns4792
ns0.98
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
3709
ns4084
ns0.91
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA
66303.5
ns66188
ns1.00
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
11270.5
ns11667
ns0.97
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
11417
ns11791
ns0.97
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
11625
ns12000
ns0.97
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
10667
ns11375
ns0.94
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA
456550
ns454536
ns1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
6312.5
ns6187.5
ns1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
6770.5
ns6542
ns1.03
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
7792
ns8166.5
ns0.95
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
7083
ns6291.5
ns1.13
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
52528
ns52258
ns1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
18375
ns16667
ns1.10
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
17833
ns18729.5
ns0.95
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
17791
ns18375
ns0.97
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
16833
ns16959
ns0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
301396
ns309572
ns0.97
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s)
625
ns542
ns1.15
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s)
625
ns625
ns1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s)
542
ns625
ns0.87
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s)
584
ns584
ns1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA
32972
ns32881.5
ns1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
9020.5
ns9375
ns0.96
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
8459
ns9250
ns0.91
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
9041
ns9167
ns0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
8708
ns8875
ns0.98
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA
159042.5
ns160215.5
ns0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s)
64542
ns64500
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s)
64895.5
ns64459
ns1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s)
64292
ns64792
ns0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s)
64542
ns64500
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA
110877
ns111097
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s)
284875
ns283000
ns1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s)
297937.5
ns279520.5
ns1.07
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s)
282333
ns293000
ns0.96
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s)
274104.5
ns284584
ns0.96
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA
184904.5
ns185358
ns1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s)
3295541
ns3360124.5
ns0.98
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s)
2811062.5
ns3074542
ns0.91
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s)
3016125
ns2838792
ns1.06
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s)
3935209
ns4085167
ns0.96
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA
572132
ns581270.5
ns0.98
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s)
7478250
ns7606291.5
ns0.98
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s)
7348937.5
ns7454771
ns0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s)
7339479.5
ns7331375
ns1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s)
8212959
ns7941542
ns1.03
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA
1367334
ns1351704.5
ns1.01
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s)
18775625
ns18789417
ns1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s)
19121334
ns19123583
ns1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s)
19108667
ns20307208
ns0.94
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s)
15653542
ns15680375
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s)
23560250
ns23678834
ns0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s)
42472875
ns33872500
ns1.25
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s)
37127771
ns40958583
ns0.91
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s)
34865500
ns34902646
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA
1862818
ns1864422.5
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s)
188025167
ns189702708
ns0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s)
176960479.5
ns164774667
ns1.07
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s)
152823708
ns158085687.5
ns0.97
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s)
441336000
ns441097750
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA
13912250
ns13899004
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s)
290589750
ns290371250
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s)
276449542
ns338928292
ns0.82
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s)
296753875
ns306426854
ns0.97
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s)
333259041
ns333159833
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
22875
ns22167
ns1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
23333
ns22854.5
ns1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
24125
ns23958.5
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
23542
ns23084
ns1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
98041.5
ns96966
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
103625
ns103916
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
135834
ns104666.5
ns1.30
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
105084
ns105458
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
103250
ns102667
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
518052
ns508850
ns1.02
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
6209
ns5458
ns1.14
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
6500
ns6042
ns1.08
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
7041.5
ns7042
ns1.00
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
5959
ns6291
ns0.95
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
70884
ns69786
ns1.02
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
15084
ns15083
ns1.00
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
15708
ns16125
ns0.97
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
16250
ns16875
ns0.96
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
14770.5
ns15208
ns0.97
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
492747
ns487835
ns1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s)
3001020.5
ns3022209
ns0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s)
2085333
ns2045708
ns1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s)
2274000
ns2300875
ns0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s)
4550083
ns4790083
ns0.95
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA
589071
ns589360
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s)
23511750
ns23426083
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s)
18279542
ns18045062.5
ns1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s)
16979209
ns18263042
ns0.93
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s)
35598583
ns35659959
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA
3111231
ns3113003
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s)
33266500
ns33282791.5
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s)
28064750
ns27619271.5
ns1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s)
27365500
ns27738250
ns0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s)
41824541.5
ns41798521
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
71750
ns74062.5
ns0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
74021
ns72958.5
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
74875
ns74416
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
73458
ns72333
ns1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
104698
ns103887
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
314125.5
ns221583
ns1.42
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
212229
ns207708
ns1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
323000
ns317187
ns1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
218042
ns307146
ns0.71
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
559024
ns555798
ns1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
11625
ns11417
ns1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
12292
ns12834
ns0.96
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
12500
ns13104
ns0.95
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
11875
ns15167
ns0.78
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
73943
ns72891.5
ns1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
26583
ns26416
ns1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
26667
ns28000
ns0.95
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
27708
ns28208
ns0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
26666
ns26625
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
493150
ns487294
ns1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
12208
ns11417
ns1.07
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
12896
ns13520.5
ns0.95
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
13916
ns13875
ns1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
12500
ns12542
ns1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
54608
ns54288
ns1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
26125
ns25667
ns1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
26000
ns25959
ns1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
25916.5
ns26333
ns0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
26000
ns26833
ns0.97
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
315887.5
ns314956
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
179208
ns178708
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
183145.5
ns181500
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
183166
ns182750
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
180125
ns181062.5
ns0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
58575
ns57250
ns1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
582958.5
ns581875
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
596541.5
ns583250
ns1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
583833
ns591542
ns0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
582834
ns590812.5
ns0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
294599.5
ns293401.5
ns1.00
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s)
6292
ns5416
ns1.16
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s)
6459
ns6166.5
ns1.05
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s)
6750
ns7437.5
ns0.91
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s)
6041
ns9167
ns0.66
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA
72806
ns72329.5
ns1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
14542
ns13875
ns1.05
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
13333
ns15375
ns0.87
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
15667
ns15792
ns0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
14333
ns13958.5
ns1.03
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA
482192.5
ns474586.5
ns1.02
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s)
1177728.5
ns1175416.5
ns1.00
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s)
1356208.5
ns1643125
ns0.83
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s)
1250750
ns1273167
ns0.98
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s)
1317541
ns1317228.5
ns1.00
batchedmm(512, Bsize=4)/forward/GPU/CUDA
301448
ns302286
ns1.00
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s)
4117688
ns4103708
ns1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s)
4491417
ns4373292
ns1.03
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s)
4696854.5
ns4773000
ns0.98
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s)
4452542
ns4454896
ns1.00
batchedmm(512, Bsize=4)/zygote/GPU/CUDA
1051206.5
ns1054654.5
ns1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s)
1875
ns1791
ns1.05
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s)
1875
ns1834
ns1.02
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s)
1833
ns1833
ns1
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s)
1875
ns1875
ns1
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA
24165
ns24200
ns1.00
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s)
5000
ns4792
ns1.04
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s)
4958
ns4875
ns1.02
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s)
4917
ns4875
ns1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s)
4875
ns4917
ns0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA
194564.5
ns192415
ns1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
6041
ns5542
ns1.09
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
6000
ns5709
ns1.05
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
6145.5
ns7104
ns0.87
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
5958
ns5833
ns1.02
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
57313.5
ns56178.5
ns1.02
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
11979.5
ns10917
ns1.10
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
11854.5
ns11709
ns1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
11042
ns12041
ns0.92
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
11292
ns10875
ns1.04
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
342366
ns346003.5
ns0.99
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s)
333
ns292
ns1.14
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s)
333
ns334
ns1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s)
333
ns292
ns1.14
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s)
333
ns334
ns1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA
23004
ns23150
ns0.99
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
3000
ns2708
ns1.11
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
2750
ns3042
ns0.90
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
3000
ns3041
ns0.99
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
2750
ns2792
ns0.98
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA
159207
ns162167.5
ns0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
11583
ns10917
ns1.06
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
11292
ns11417
ns0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
13437.5
ns13167
ns1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
11708.5
ns12291
ns0.95
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA
57286.5
ns57232
ns1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
25312.5
ns24458
ns1.03
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
25083
ns24959
ns1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
25334
ns25083
ns1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
25167
ns25458
ns0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA
296722
ns300859.5
ns0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s)
4208
ns4125
ns1.02
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s)
4208
ns4208
ns1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s)
4167
ns4250
ns0.98
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s)
4167
ns4209
ns0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA
25099
ns25510
ns0.98
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
16125
ns16042
ns1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
16041
ns16166
ns0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
16166
ns16250
ns0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
16042
ns16250
ns0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA
199370.5
ns199412
ns1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
5833
ns5667
ns1.03
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
5833
ns5792
ns1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
5792
ns5750
ns1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
5833
ns5792
ns1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA
33986
ns34134.5
ns1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
21083
ns20958
ns1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
21125
ns20937.5
ns1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
21208
ns21375
ns0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
20667
ns21167
ns0.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA
176941.5
ns179792
ns0.98
| batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 396792 ns | 394084 ns | 1.01 |
| batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 354313 ns | 373978.5 ns | 0.95 |
| batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 489167 ns | 468708 ns | 1.04 |
| batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 521584 ns | 517958.5 ns | 1.01 |
| batchedmm(16, Bsize=512)/forward/GPU/CUDA | 66831 ns | 67463 ns | 0.99 |
| batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 1005417 ns | 995417 ns | 1.01 |
| batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 876583 ns | 859333 ns | 1.02 |
| batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 1235667 ns | 1222292 ns | 1.01 |
| batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 1420854 ns | 1318979.5 ns | 1.08 |
| batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 191762.5 ns | 196257.5 ns | 0.98 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 80250 ns | 79625 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 80209 ns | 81667 ns | 0.98 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 84167 ns | 83104 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 81125 ns | 80667 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 193433 ns | 193536.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1916083 ns | 1916416 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1933854 ns | 1914125 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1917917 ns | 1929917 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1923708.5 ns | 1920166 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 409629 ns | 396634 ns | 1.03 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 333 ns | 292 ns | 1.14 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 333 ns | 0.88 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 333 ns | 333 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 22197 ns | 22405 ns | 0.99 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1834 ns | 1792 ns | 1.02 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1875 ns | 1834 ns | 1.02 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1834 ns | 1833 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1833 ns | 1833 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 170854.5 ns | 171684.5 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 6791 ns | 6000 ns | 1.13 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 6417 ns | 6666 ns | 0.96 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7375 ns | 7771 ns | 0.95 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6959 ns | 7083 ns | 0.98 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 61202 ns | 58222 ns | 1.05 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9291.5 ns | 9000 ns | 1.03 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9166.5 ns | 9166 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9375 ns | 9417 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9334 ns | 9791 ns | 0.95 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 313492.5 ns | 309426.5 ns | 1.01 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 120748834 ns | 120543916.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 181703729 ns | 174574000 ns | 1.04 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 148437750 ns | 155303334 ns | 0.96 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 104851584 ns | 102497708 ns | 1.02 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5474996 ns | 5490560 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 616853125 ns | 619560667 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 579539270.5 ns | 556574708 ns | 1.04 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 451846854.5 ns | 466726875 ns | 0.97 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 757165312.5 ns | 754133187 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 34944567 ns | 38221000 ns | 0.91 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 649889209 ns | 650261583 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 688661771 ns | 664971666.5 ns | 1.04 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 592710229 ns | 598925604 ns | 0.99 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 741917708 ns | 742000833 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 59750 ns | 58833 ns | 1.02 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 38959 ns | 47833 ns | 0.81 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 48000 ns | 38750 ns | 1.24 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83416 ns | 83541.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 37459 ns | 38157 ns | 0.98 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1922792 ns | 1923645.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1985083 ns | 1974708 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1978104 ns | 1987125 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1893917 ns | 1902791 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 174160 ns | 176957.5 ns | 0.98 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 290625 ns | 264584 ns | 1.10 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 266708 ns | 287375 ns | 0.93 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 271521 ns | 281959 ns | 0.96 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 268167 ns | 265292 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 132776.5 ns | 127356.5 ns | 1.04 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 657229.5 ns | 636958 ns | 1.03 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 681187.5 ns | 669000 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 691583 ns | 710937 ns | 0.97 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 597417 ns | 618041 ns | 0.97 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 713916 ns | 717016.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2243937 ns | 2207208 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2191895.5 ns | 2220833 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2213542 ns | 2257083 ns | 0.98 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2180437.5 ns | 2184083.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 133381 ns | 133922 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5496875 ns | 5489833 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5583292 ns | 5482375 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5498250 ns | 5611458 ns | 0.98 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5492750.5 ns | 5518187.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 753967 ns | 752554 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 636833 ns | 643333 ns | 0.99 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 644417 ns | 648250 ns | 0.99 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 645333 ns | 641875 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 637292 ns | 629833 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 46993.5 ns | 47345 ns | 0.99 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1826042 ns | 1823375 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1667083 ns | 1724583 ns | 0.97 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1726542 ns | 1668166 ns | 1.03 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 2105854.5 ns | 2109958.5 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 222295 ns | 225195 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58500 ns | 57875 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 38708 ns | 45041 ns | 0.86 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47250 ns | 37792 ns | 1.25 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 84292 ns | 84250 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 28598 ns | 28912 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2031041 ns | 2037042 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2099020.5 ns | 1774875 ns | 1.18 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2091916.5 ns | 2106104 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1856417 ns | 2004709 ns | 0.93 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 190652 ns | 192033 ns | 0.99 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 13391395.5 ns | 13402709 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 12453250 ns | 12434083.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 12557375.5 ns | 12568959 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 15140541 ns | 15252437.5 ns | 0.99 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 514312 ns | 515691 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 47481750 ns | 47190042 ns | 1.01 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 41986250 ns | 41818959 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 40944792 ns | 41112729 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 57945917 ns | 58040458 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3259544 ns | 3265135 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 96867229.5 ns | 74602708 ns | 1.30 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 91436187.5 ns | 90585333 ns | 1.01 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 90591917 ns | 90752104 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 76381625 ns | 99199917 ns | 0.77 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 59083.5 ns | 58666 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 38750 ns | 47667 ns | 0.81 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47417 ns | 38916 ns | 1.22 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 84000 ns | 84583 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 46955 ns | 47499 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1925125 ns | 1924458 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1979250 ns | 1963958 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1970729.5 ns | 1984729 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1897750 ns | 1890125 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 191790.5 ns | 192543.5 ns | 1.00 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 291 ns | 1.29 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 416 ns | 0.90 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 333 ns | 375 ns | 0.89 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 334 ns | 0.87 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 32566 ns | 32327.5 ns | 1.01 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6417 ns | 6208 ns | 1.03 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6458 ns | 6542 ns | 0.99 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6459 ns | 6583 ns | 0.98 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6083 ns | 6333 ns | 0.96 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 174123.5 ns | 175547 ns | 0.99 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 250 ns | 1.17 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 250 ns | 292 ns | 0.86 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 31409 ns | 31532 ns | 1.00 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 2833 ns | 2667 ns | 1.06 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 2791 ns | 2917 ns | 0.96 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 2834 ns | 2875 ns | 0.99 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 2583 ns | 2625 ns | 0.98 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 161269 ns | 161599.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 286258979.5 ns | 285406812.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 346927270.5 ns | 342256021 ns | 1.01 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 313997291.5 ns | 320515458 ns | 0.98 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 270108416 ns | 269110542 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 7104986 ns | 7087814 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 998016667 ns | 998781958 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 959348209 ns | 940297375 ns | 1.02 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 851652541.5 ns | 865123208 ns | 0.98 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 1162498166 ns | 1166879459 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 33999768 ns | 34086790.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 1672427541 ns | 1303773521 ns | 1.28 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 1705785000 ns | 1682982750 ns | 1.01 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 1631619209 ns | 1621725000 ns | 1.01 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 1314128542 ns | 1679721250 ns | 0.78 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1406813 ns | 1409125 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1416875 ns | 1410750 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1459625 ns | 1411520.5 ns | 1.03 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1407750 ns | 1409625 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 127789 ns | 127505 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5022896 ns | 5022459 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5051333 ns | 5010625 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5029542 ns | 5048916.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5031875 ns | 5026917 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 559312.5 ns | 574233 ns | 0.97 |
| vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) | 169600250 ns | 175430604 ns | 0.97 |
| vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) | 180340396 ns | 129749208 ns | 1.39 |
| vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) | 130036124.5 ns | 147644584 ns | 0.88 |
| vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) | 169790708.5 ns | 156276000 ns | 1.09 |
| vgg16(32, 32, 3, 32)/forward/GPU/CUDA | 5056885.5 ns | 4878356.5 ns | 1.04 |
| vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) | 669854958 ns | 835576208 ns | 0.80 |
| vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) | 604244667 ns | 648955208 ns | 0.93 |
| vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) | 501867209 ns | 552474042 ns | 0.91 |
| vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) | 684062709 ns | 684916667 ns | 1.00 |
| vgg16(32, 32, 3, 32)/zygote/GPU/CUDA | 16520518 ns | 18031032 ns | 0.92 |
| batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 8950666 ns | 8921583.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 8876958.5 ns | 8765125 ns | 1.01 |
| batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 7849458.5 ns | 8191792 ns | 0.96 |
| batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 10185417 ns | 10144916 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1594436 ns | 1593700.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 36026541.5 ns | 36087854.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 38047792 ns | 36659145.5 ns | 1.04 |
| batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 33343417 ns | 34354041 ns | 0.97 |
| batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 38792000 ns | 38831791 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6457988 ns | 6456160 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 47417 ns | 47375 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 47375 ns | 47542 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 47584 ns | 47542 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 47333 ns | 47417 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 18535 ns | 18714 ns | 0.99 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 50291 ns | 50270.5 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 50375 ns | 50792 ns | 0.99 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 50417 ns | 50666 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 50083 ns | 50500 ns | 0.99 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 191873 ns | 208956.5 ns | 0.92 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6458 ns | 6209 ns | 1.04 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6917 ns | 6916 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7750 ns | 7479 ns | 1.04 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6958 ns | 7375 ns | 0.94 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 91345 ns | 103120 ns | 0.89 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10458 ns | 9750 ns | 1.07 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9916 ns | 10208 ns | 0.97 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10084 ns | 10916 ns | 0.92 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10208 ns | 10708 ns | 0.95 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 527140.5 ns | 648155 ns | 0.81 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5625 ns | 5833 ns | 0.96 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 5917 ns | 6041 ns | 0.98 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 6958 ns | 7416.5 ns | 0.94 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5750 ns | 6250 ns | 0.92 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 120543 ns | 142455.5 ns | 0.85 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 13583 ns | 13291 ns | 1.02 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 13354.5 ns | 13250 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13458 ns | 13583 ns | 0.99 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 13000 ns | 13812.5 ns | 0.94 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 537999 ns | 557240 ns | 0.97 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 1083 ns | 1000 ns | 1.08 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 1083 ns | 1083 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 1083 ns | 1042 ns | 1.04 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 1042 ns | 1084 ns | 0.96 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 32473 ns | 32427 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7917 ns | 7709 ns | 1.03 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7917 ns | 8125 ns | 0.97 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7959 ns | 8250 ns | 0.96 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8167 ns | 8292 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 206314.5 ns | 212730.5 ns | 0.97 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 23437.5 ns | 23125 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 23167 ns | 23229.5 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 23584 ns | 23208.5 ns | 1.02 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 23542 ns | 23291.5 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 18671 ns | 18651 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 52458 ns | 52791 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 52541 ns | 52750 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 53458 ns | 52916 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 52062.5 ns | 52917 ns | 0.98 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 291832.5 ns | 297972.5 ns | 0.98 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1458937 ns | 1456542 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1401583 ns | 1402875 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1403833.5 ns | 1406125 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1459708.5 ns | 1406020.5 ns | 1.04 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 195968 ns | 195429 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5008771 ns | 5011375 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5044104 ns | 4999896 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5017250 ns | 5004104 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5011916 ns | 5012229.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 599687 ns | 629207 ns | 0.95 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 3061000 ns | 3028791 ns | 1.01 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2086750 ns | 2080292 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2304917 ns | 2306959 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4539041 ns | 4528333.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 581670 ns | 580927 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 24376958 ns | 24334708 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 19122667 ns | 18811750 ns | 1.02 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 19181062.5 ns | 19285875 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 36163041 ns | 36624250 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3185287.5 ns | 3188653 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 34039875 ns | 34044375 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 28717291.5 ns | 28337458.5 ns | 1.01 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 28156000 ns | 28359625 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 41614584 ns | 41526375 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 144831583 ns | 144751292 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 143542708 ns | 142343042 ns | 1.01 |
| batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 124983229.5 ns | 126295625.5 ns | 0.99 |
| batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 173618479 ns | 174518292 ns | 0.99 |
| batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22558463 ns | 22564495 ns | 1.00 |
| batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 1247182979 ns | 926464812.5 ns | 1.35 |
| batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 836595146 ns | 1101725458.5 ns | 0.76 |
| batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 738893583 ns | 713128541 ns | 1.04 |
| batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 672803125 ns | 670038542 ns | 1.00 |
| batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 118329511 ns | 118583467 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 84666 ns | 72500 ns | 1.17 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 73666 ns | 73291 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 76146 ns | 75666.5 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 75688 ns | 73875 ns | 1.02 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 240753.5 ns | 246837 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 287042 ns | 234791.5 ns | 1.22 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 212354 ns | 260042 ns | 0.82 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 296854 ns | 201917 ns | 1.47 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 284250 ns | 278875 ns | 1.02 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1238105 ns | 1315953 ns | 0.94 |
| batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 35497979 ns | 35443854.5 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 35870917 ns | 35414792 ns | 1.01 |
| batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 32110833 ns | 32315667 ns | 0.99 |
| batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 40961896 ns | 40952208.5 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5843453.5 ns | 5844700.5 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 149169500 ns | 147883042 ns | 1.01 |
| batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 155980437.5 ns | 151270666.5 ns | 1.03 |
| batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 134845625 ns | 140446166.5 ns | 0.96 |
| batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 287434667 ns | 287799542 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 34879809 ns | 34900582 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 121767709 ns | 120464917 ns | 1.01 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 181613625 ns | 174558291 ns | 1.04 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 148039291 ns | 155482959 ns | 0.95 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 104612333.5 ns | 105495999.5 ns | 0.99 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5485164 ns | 5466861 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 472118833 ns | 469121958 ns | 1.01 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 486130458.5 ns | 467127125 ns | 1.04 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 440650208 ns | 455180958.5 ns | 0.97 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 746192375 ns | 741251875 ns | 1.01 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 32245076 ns | 35153962 ns | 0.92 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 643396416 ns | 641357958 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 675303249.5 ns | 655567666 ns | 1.03 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 575492166 ns | 585153583 ns | 0.98 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 856961334 ns | 844860208 ns | 1.01 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) | 1312541 ns | 1247062.5 ns | 1.05 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) | 677667 ns | 995875 ns | 0.68 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) | 963459 ns | 746854.5 ns | 1.29 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) | 2093375 ns | 2056875 ns | 1.02 |
| mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA | 580070.5 ns | 576802 ns | 1.01 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) | 2966541.5 ns | 2967458 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) | 2496854 ns | 2623375 ns | 0.95 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) | 2623959 ns | 2522834 ns | 1.04 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) | 3704083 ns | 3709541.5 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA | 1730505 ns | 1870250 ns | 0.93 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) | 6656375 ns | 6657292 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) | 6477624.5 ns | 6484416.5 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) | 6431167 ns | 6480437.5 ns | 0.99 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) | 4450479.5 ns | 4452792 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7375 ns | 7334 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5417 ns | 6167 ns | 0.88 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6084 ns | 5416 ns | 1.12 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 9917 ns | 10083 ns | 0.98 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 25252 ns | 25185 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 212583 ns | 212729.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 229770.5 ns | 220541 ns | 1.04 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 220500 ns | 228291 ns | 0.97 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 206083 ns | 207000 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 251646.5 ns | 290563 ns | 0.87 |
| vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) | 301644020.5 ns | 301766770.5 ns | 1.00 |
| vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) | 280942354.5 ns | 222332187.5 ns | 1.26 |
| vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) | 189363792 ns | 224980750 ns | 0.84 |
| vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) | 305392479 ns | 312140792 ns | 0.98 |
| vgg16(32, 32, 3, 64)/forward/GPU/CUDA | 7676597 ns | 7675117 ns | 1.00 |
| vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) | 1087372208.5 ns | 1084663708.5 ns | 1.00 |
| vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) | 980974208 ns | 904835749.5 ns | 1.08 |
| vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) | 865965209 ns | 854386750 ns | 1.01 |
| vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) | 1158600916.5 ns | 1158149229 ns | 1.00 |
| vgg16(32, 32, 3, 64)/zygote/GPU/CUDA | 26533591 ns | 26306824 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5354.5 ns | 5416 ns | 0.99 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5375 ns | 5375 ns | 1 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6917 ns | 6375 ns | 1.09 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 4958 ns | 6083 ns | 0.82 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 146657 ns | 188494 ns | 0.78 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7395.5 ns | 7709 ns | 0.96 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7375 ns | 7375 ns | 1 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7250 ns | 7542 ns | 0.96 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7250 ns | 7750 ns | 0.94 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 596011.5 ns | 704994 ns | 0.85 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 584 ns | 541 ns | 1.08 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 625 ns | 667 ns | 0.94 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 583 ns | 1.07 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 542 ns | 584 ns | 0.93 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 24031 ns | 24467 ns | 0.98 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 8917 ns | 9000 ns | 0.99 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9708 ns | 9417 ns | 1.03 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9583 ns | 9750 ns | 0.98 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 8833 ns | 9375 ns | 0.94 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 216620.5 ns | 240486.5 ns | 0.90 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 353333 ns | 353833.5 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 352041 ns | 353708 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 352666.5 ns | 352854.5 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 352417 ns | 353687.5 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA | 21463 ns | 21482 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 820625 ns | 780084 ns | 1.05 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 828917 ns | 822375.5 ns | 1.01 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 774875 ns | 795166.5 ns | 0.97 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 778729 ns | 807896 ns | 0.96 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA | 269469 ns | 308900.5 ns | 0.87 |
| batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 337187.5 ns | 335708 ns | 1.00 |
| batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 313687.5 ns | 336354 ns | 0.93 |
| batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 444709 ns | 443833.5 ns | 1.00 |
| batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 334500 ns | 324875 ns | 1.03 |
| batchedmm(16, Bsize=32)/forward/GPU/CUDA | 17922 ns | 18282 ns | 0.98 |
| batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 689958 ns | 684187.5 ns | 1.01 |
| batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 746333 ns | 744666 ns | 1.00 |
| batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 1025042 ns | 1037333 ns | 0.99 |
| batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 694854.5 ns | 690584 ns | 1.01 |
| batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 242950 ns | 282432 ns | 0.86 |
| batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 351417 ns | 351833 ns | 1.00 |
| batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 327270.5 ns | 346708.5 ns | 0.94 |
| batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 414729.5 ns | 433333 ns | 0.96 |
| batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 371750 ns | 367625 ns | 1.01 |
| batchedmm(16, Bsize=128)/forward/GPU/CUDA | 22559 ns | 23030 ns | 0.98 |
| batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 747208 ns | 746875 ns | 1.00 |
| batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 749416 ns | 752833 ns | 1.00 |
| batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 1069374.5 ns | 1077583 ns | 0.99 |
| batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 815937.5 ns | 820333 ns | 0.99 |
| batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 224503 ns | 243734 ns | 0.92 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 3708 ns | 3500 ns | 1.06 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 3625 ns | 3666 ns | 0.99 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 3750 ns | 3645.5 ns | 1.03 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 3291 ns | 3708 ns | 0.89 |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 17855 ns | 17855 ns | 1 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 4208 ns | 4125 ns | 1.02 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 4208 ns | 4459 ns | 0.94 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 4333 ns | 4458 ns | 0.97 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 4208 ns | 4584 ns | 0.92 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 248489.5 ns | 265709 ns | 0.94 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 3708 ns | 3041 ns | 1.22 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4167 ns | 3625 ns | 1.15 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 4791 ns | 4375 ns | 1.10 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 3792 ns | 3834 ns | 0.99 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 203806 ns | 202022.5 ns | 1.01 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8667 ns | 8167 ns | 1.06 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8250 ns | 8791 ns | 0.94 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8458 ns | 8791 ns | 0.96 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8667 ns | 8792 ns | 0.99 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 1166315.5 ns | 1169080.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 204875 ns | 205917 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 209750 ns | 210542 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 209834 ns | 210750 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 200000 ns | 200459 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 34893 ns | 35422 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 602917 ns | 604333 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 628833 ns | 629416 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 621584 ns | 628417 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 592041 ns | 593229.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 321942.5 ns | 325735.5 ns | 0.99 |
| batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 978791 ns | 977521 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 937250.5 ns | 938375 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 960250 ns | 967604.5 ns | 0.99 |
| batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 1307271 ns | 1300334 ns | 1.01 |
| batchedmm(128, Bsize=128)/forward/GPU/CUDA | 207418 ns | 207218 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 4504084 ns | 4502021 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 4619604.5 ns | 4489021 ns | 1.03 |
| batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 4294917 ns | 4453583 ns | 0.96 |
| batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 6229292 ns | 6274333.5 ns | 0.99 |
| batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 936037 ns | 925859 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3354 ns | 2958 ns | 1.13 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3583 ns | 3167 ns | 1.13 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 4417 ns | 4208 ns | 1.05 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3333 ns | 3916 ns | 0.85 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 196464 ns | 208529.5 ns | 0.94 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7334 ns | 7417 ns | 0.99 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7417 ns | 7333 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7291 ns | 7875 ns | 0.93 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6917 ns | 7291 ns | 0.95 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 985634 ns | 986067.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1640792 ns | 1630104 ns | 1.01 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1171541.5 ns | 1186917 ns | 0.99 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1327125 ns | 1369208 ns | 0.97 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2384666 ns | 2425375 ns | 0.98 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 216205.5 ns | 214079.5 ns | 1.01 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12345499.5 ns | 12320416.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9603042 ns | 9599459 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9259895.5 ns | 9406208 ns | 0.98 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18032958.5 ns | 17994354.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 1950941 ns | 1943728.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17348083 ns | 17326895.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14444583.5 ns | 14332166.5 ns | 1.01 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14302167 ns | 14502583 ns | 0.99 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21057645.5 ns | 21072542 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 87666.5 ns | 88166 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 89562 ns | 92124.5 ns | 0.97 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 90292 ns | 94000 ns | 0.96 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 88875 ns | 134709 ns | 0.66 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 126565 ns | 126226 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2024000 ns | 2026417 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2030958.5 ns | 2016541 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1707583 ns | 2054833 ns | 0.83 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2030042 ns | 2026813 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 999913 ns | 1038168 ns | 0.96 |
| batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 343750 ns | 343333.5 ns | 1.00 |
| batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 326145.5 ns | 341833 ns | 0.95 |
| batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 396833 ns | 417604.5 ns | 0.95 |
| batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 309896 ns | 302437.5 ns | 1.02 |
| batchedmm(2, Bsize=4)/forward/GPU/CUDA | 16654 ns | 15633 ns | 1.07 |
| batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 702666 ns | 699708 ns | 1.00 |
| batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 733666 ns | 732792 ns | 1.00 |
| batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 1020166 ns | 1028083 ns | 0.99 |
| batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 652500 ns | 646021 ns | 1.01 |
| batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 190386.5 ns | 194912.5 ns | 0.98 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7416 ns | 7208 ns | 1.03 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5291 ns | 6083 ns | 0.87 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6000 ns | 5292 ns | 1.13 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10041 ns | 10125 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 34743 ns | 33846 ns | 1.03 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 224334 ns | 219687.5 ns | 1.02 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 229333 ns | 231542 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 220959 ns | 226000 ns | 0.98 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 206292 ns | 217542 ns | 0.95 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 296926 ns | 312247.5 ns | 0.95 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3750 ns | 3667 ns | 1.02 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3792 ns | 3709 ns | 1.02 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3667 ns | 3750 ns | 0.98 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3708 ns | 3708 ns | 1 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 23083 ns | 22524 ns | 1.02 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14416 ns | 14375 ns | 1.00 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14209 ns | 14500 ns | 0.98 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14292 ns | 14208 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14458 ns | 14417 ns | 1.00 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 448235 ns | 464888.5 ns | 0.96 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 92854 ns | 91584 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 99583 ns | 92500 ns | 1.08 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 94542 ns | 97583 ns | 0.97 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 96042 ns | 138791 ns | 0.69 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 125978 ns | 125619 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1920562.5 ns | 1712500 ns | 1.12 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1914937.5 ns | 1913083 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1653792 ns | 1947334 ns | 0.85 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1928541 ns | 1925375 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 893203 ns | 931491 ns | 0.96 |
| lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) | 878750 ns | 866125 ns | 1.01 |
| lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) | 800021 ns | 822833 ns | 0.97 |
| lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) | 1221729 ns | 1155708.5 ns | 1.06 |
| lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) | 963792 ns | 955167 ns | 1.01 |
| lenet(28, 28, 1, 32)/forward/GPU/CUDA | 277692.5 ns | 274168 ns | 1.01 |
| lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) | 2824834 ns | 2710209 ns | 1.04 |
| lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) | 2464958 ns | 2521125 ns | 0.98 |
| lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) | 3323271 ns | 3343333 ns | 0.99 |
| lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) | 3398958 ns | 3411041.5 ns | 1.00 |
| lenet(28, 28, 1, 32)/zygote/GPU/CUDA | 1565101.5 ns | 1667424.5 ns | 0.94 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17667 ns | 17334 ns | 1.02 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 15458.5 ns | 15875 ns | 0.97 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 17250.5 ns | 16625 ns | 1.04 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 14645.5 ns | 15166 ns | 0.97 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 142432.5 ns | 146084.5 ns | 0.98 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 218209 ns | 216333 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 222958.5 ns | 228896 ns | 0.97 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 216334 ns | 218959 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 215062.5 ns | 228333 ns | 0.94 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 637432 ns | 703455.5 ns | 0.91 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 221145.5 ns | 222021 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 222375 ns | 219625 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 220917 ns | 222270.5 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 220333 ns | 219792 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 280530 ns | 349468 ns | 0.80 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 510354 ns | 555354 ns | 0.92 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 499375 ns | 541500 ns | 0.92 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 500021 ns | 507791 ns | 0.98 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 507041 ns | 509583 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1281236 ns | 1442859 ns | 0.89 |
| batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 332250 ns | 328208 ns | 1.01 |
| batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 316000 ns | 335375 ns | 0.94 |
| batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 364333 ns | 440041.5 ns | 0.83 |
| batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 323834 ns | 315958.5 ns | 1.02 |
| batchedmm(16, Bsize=4)/forward/GPU/CUDA | 17441 ns | 17029 ns | 1.02 |
| batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 715833.5 ns | 713479.5 ns | 1.00 |
| batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 735083 ns | 737250.5 ns | 1.00 |
| batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 1022959 ns | 1022771 ns | 1.00 |
| batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 667041 ns | 658584 ns | 1.01 |
| batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 193588.5 ns | 197188.5 ns | 0.98 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 18666 ns | 17667 ns | 1.06 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17375 ns | 18459 ns | 0.94 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 19167 ns | 19667 ns | 0.97 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 17083.5 ns | 19083 ns | 0.90 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 147781 ns | 167142 ns | 0.88 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 212542 ns | 212084 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 214146 ns | 214250 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 213834 ns | 219917 ns | 0.97 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 211354.5 ns | 222583.5 ns | 0.95 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 877964 ns | 1047376 ns | 0.84 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 4083 ns | 4104.5 ns | 0.99 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 4291.5 ns | 3979.5 ns | 1.08 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5375 ns | 5375 ns | 1 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 3958 ns | 4666 ns | 0.85 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 169898 ns | 213956.5 ns | 0.79 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10834 ns | 10875 ns | 1.00 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 10542 ns | 10562.5 ns | 1.00 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10583 ns | 11042 ns | 0.96 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10459 ns | 10917 ns | 0.96 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 993411.5 ns | 1051553 ns | 0.94 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3417 ns | 3104.5 ns | 1.10 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3167 ns | 3270.5 ns | 0.97 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 4375 ns | 4375 ns | 1 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3062.5 ns | 3583 ns | 0.85 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 203556.5 ns | 243243.5 ns | 0.84 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7791 ns | 7875 ns | 0.99 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7458 ns | 7729.5 ns | 0.96 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7250 ns | 7916 ns | 0.92 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7541 ns | 7542 ns | 1.00 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 1041955 ns | 1070496 ns | 0.97 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 23557729 ns | 23697312 ns | 0.99 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 43140979 ns | 33840729 ns | 1.27 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 37880833 ns | 40993667 ns | 0.92 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 34954917 ns | 34934625 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1859678 ns | 1799949 ns | 1.03 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 184630708 ns | 186397917 ns | 0.99 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 172192624.5 ns | 158804792 ns | 1.08 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 146314396 ns | 151420541 ns | 0.97 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 415449708 ns | 414128875 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 16494786 ns | 16543234 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 428781042 ns | 431001417 ns | 0.99 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 259710791 ns | 253386479.5 ns | 1.02 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 231751208 ns | 233549833 ns | 0.99 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 484878833 ns | 484447625 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 183625 ns | 184291 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 183375 ns | 183000 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 184417 ns | 185125 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 182667 ns | 184854 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 177771.5 ns | 228024.5 ns | 0.78 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 590604 ns | 592083 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 588083 ns | 598459 ns | 0.98 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 586792 ns | 630729 ns | 0.93 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 586958 ns | 597020.5 ns | 0.98 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1015783.5 ns | 1101301 ns | 0.92 |
| batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 3860917 ns | 3831520.5 ns | 1.01 |
| batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 3732375 ns | 3861437.5 ns | 0.97 |
| batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 3478062.5 ns | 3512750 ns | 0.99 |
| batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 5358854.5 ns | 5353125 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/GPU/CUDA | 533317.5 ns | 533681 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 17452375 ns | 17425875.5 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 17779209 ns | 17302042 ns | 1.03 |
| batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 16551750 ns | 17078292 ns | 0.97 |
| batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 22184000 ns | 22192062 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2614491.5 ns | 2765136 ns | 0.95 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 625 ns | 542 ns | 1.15 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 584 ns | 625 ns | 0.93 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 625 ns | 1 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 583 ns | 542 ns | 1.08 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 32765 ns | 32753 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9625 ns | 9875 ns | 0.97 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9542 ns | 9125 ns | 1.05 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9625 ns | 9958 ns | 0.97 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 8917 ns | 9458 ns | 0.94 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 263711.5 ns | 264732 ns | 1.00 |
| vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) | 501494042 ns | 502597125 ns | 1.00 |
| vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) | 411555459 ns | 431706229.5 ns | 0.95 |
| vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) | 374781084 ns | 473571542 ns | 0.79 |
| vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) | 672198042 ns | 673775020.5 ns | 1.00 |
| vgg16(32, 32, 3, 128)/forward/GPU/CUDA | 12477100 ns | 12478261 ns | 1.00 |
| vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) | 2044775145.5 ns | 2057099563 ns | 0.99 |
| vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) | 1660536667 ns | 1632471000 ns | 1.02 |
| vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) | 1495631604 ns | 1543342583 ns | 0.97 |
| vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) | 2221523375 ns | 2210188062.5 ns | 1.01 |
| vgg16(32, 32, 3, 128)/zygote/GPU/CUDA | 49258137.5 ns | 49309371 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1643291 ns | 1549958 ns | 1.06 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1172917 ns | 1179083 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1391041.5 ns | 1373000 ns | 1.01 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2338333 ns | 2487854 ns | 0.94 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 215612.5 ns | 216771.5 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12698542 ns | 12739625.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9998999.5 ns | 9965583 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9717041 ns | 9786125 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18433792 ns | 18379000 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2039696 ns | 2050930 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17679687.5 ns | 17630417 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14770854.5 ns | 14653084 ns | 1.01 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14602583.5 ns | 14783084 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21327625 ns | 21379021 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 26292 ns | 26167 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 26291 ns | 26250 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 26250 ns | 26250 ns | 1 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 26208 ns | 26250 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 24225 ns | 23941 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 67250 ns | 66666 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 66834 ns | 67458 ns | 0.99 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 68166 ns | 67292 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66792 ns | 67208 ns | 0.99 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 378162.5 ns | 391319.5 ns | 0.97 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 203125 ns | 202708 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 208500 ns | 209583 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 208666 ns | 209041 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 200125 ns | 199791 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 26005 ns | 26680 ns | 0.97 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 646625 ns | 626000 ns | 1.03 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 628813 ns | 633708 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 669895.5 ns | 626500 ns | 1.07 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 580791.5 ns | 630375 ns | 0.92 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 311381 ns | 351467 ns | 0.89 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 651667 ns | 642167 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 638666 ns | 572542 ns | 1.12 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 647417 ns | 640500 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 653083.5 ns | 643500 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 131397 ns | 131644.5 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2243375 ns | 2256125 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2314937.5 ns | 2244834 ns | 1.03 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2249625 ns | 2286125 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2235375 ns | 2239937 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1114755 ns | 1255781 ns | 0.89 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 18291 ns | 18354.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17500 ns | 18208 ns | 0.96 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 20917 ns | 19708 ns | 1.06 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 18292 ns | 19166 ns | 0.95 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 143094 ns | 144781.5 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 223500 ns | 229291 ns | 0.97 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 226042 ns | 258750 ns | 0.87 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 262917 ns | 224104 ns | 1.17 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 230125 ns | 230333 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 943015 ns | 1072717 ns | 0.88 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 625 ns | 584 ns | 1.07 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 625 ns | 667 ns | 0.94 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 666 ns | 625 ns | 1.07 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 583 ns | 583 ns | 1 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 23380 ns | 23495 ns | 1.00 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 10104.5 ns | 10166 ns | 0.99 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 10166 ns | 10042 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 10000 ns | 10584 ns | 0.94 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9583 ns | 9916.5 ns | 0.97 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 254915.5 ns | 257743.5 ns | 0.99 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5084 ns | 4875 ns | 1.04 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5375 ns | 5541.5 ns | 0.97 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6791 ns | 6916 ns | 0.98 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5250 ns | 6458 ns | 0.81 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 190346.5 ns | 224997.5 ns | 0.85 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7250 ns | 6834 ns | 1.06 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7125 ns | 7709 ns | 0.92 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7250 ns | 7791 ns | 0.93 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7083 ns | 7875 ns | 0.90 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 735734 ns | 771041 ns | 0.95 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 2167 ns | 1875 ns | 1.16 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 2208 ns | 2208 ns | 1 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 2209 ns | 2291 ns | 0.96 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 2417 ns | 2125 ns | 1.14 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 18111 ns | 18191 ns | 1.00 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 6750 ns | 6875 ns | 0.98 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 6375 ns | 6792 ns | 0.94 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 6625 ns | 6834 ns | 0.97 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 6625 ns | 6834 ns | 0.97 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 306022.5 ns | 320849 ns | 0.95 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 751583.5 ns | 751000.5 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 748875 ns | 746834 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 746812.5 ns | 750750 ns | 0.99 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 748500 ns | 749250 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 21064 ns | 21394 ns | 0.98 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 791834 ns | 775375 ns | 1.02 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 788667 ns | 797791 ns | 0.99 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 786646.5 ns | 789542 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 792479 ns | 792229.5 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 294710 ns | 298871 ns | 0.99 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7417 ns | 7375 ns | 1.01 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5208 ns | 6083 ns | 0.86 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6000 ns | 5375 ns | 1.12 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10084 ns | 10125 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 33108.5 ns | 32874 ns | 1.01 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 228645.5 ns | 260333 ns | 0.88 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 231416 ns | 266395.5 ns | 0.87 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 271625 ns | 232084 ns | 1.17 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 225958 ns | 254562.5 ns | 0.89 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 351410 ns | 358414 ns | 0.98 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 10292 ns | 10042 ns | 1.02 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 10084 ns | 10000 ns | 1.01 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 11166 ns | 11333.5 ns | 0.99 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 10000 ns | 10583 ns | 0.94 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 209596.5 ns | 246624.5 ns | 0.85 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 24709 ns | 25125 ns | 0.98 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 24333 ns | 24875 ns | 0.98 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 24291 ns | 26500 ns | 0.92 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 24437.5 ns | 24584 ns | 0.99 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 1037550 ns | 1075334 ns | 0.96 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 107199542 ns | 106802625 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 126347334 ns | 117761313 ns | 1.07 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 120468625 ns | 123597167 ns | 0.97 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 117762042 ns | 118005709 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 2637816 ns | 2586758.5 ns | 1.02 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 393813416 ns | 396471833 ns | 0.99 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 380007916 ns | 366941541 ns | 1.04 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 355873375 ns | 358340709 ns | 0.99 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 484550250 ns | 482420291 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 15152772.5 ns | 15213427.5 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 939763875 ns | 762317541.5 ns | 1.23 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 777743792 ns | 763508042 ns | 1.02 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 745742833 ns | 750628250 ns | 0.99 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 767071771.5 ns | 949414250.5 ns | 0.81 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7167 ns | 9292 ns | 0.77 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 6833 ns | 6833 ns | 1 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8458 ns | 8937 ns | 0.95 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7562.5 ns | 7542 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 228024 ns | 233625 ns | 0.98 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14250 ns | 14125 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14042 ns | 13375 ns | 1.05 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 13875 ns | 14166 ns | 0.98 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 13333 ns | 14292 ns | 0.93 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 1000779 ns | 1040314 ns | 0.96 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6167 ns | 5292 ns | 1.17 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6125 ns | 6125 ns | 1 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 8250 ns | 8500 ns | 0.97 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5604.5 ns | 6687.5 ns | 0.84 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 214266.5 ns | 227795.5 ns | 0.94 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12417 ns | 12375 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12542 ns | 12292 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 12875 ns | 12834 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12541 ns | 13208 ns | 0.95 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 724930 ns | 752144.5 ns | 0.96 |
| batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 349208 ns | 343584 ns | 1.02 |
| batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 326145.5 ns | 342542 ns | 0.95 |
| batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 393333 ns | 422396 ns | 0.93 |
| batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 314271 ns | 307708 ns | 1.02 |
| batchedmm(2, Bsize=128)/forward/GPU/CUDA | 17228 ns | 16984 ns | 1.01 |
| batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 706500 ns | 703478.5 ns | 1.00 |
| batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 739437.5 ns | 732500 ns | 1.01 |
| batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 1020354 ns | 1028375 ns | 0.99 |
| batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 658541 ns | 649750 ns | 1.01 |
| batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 198297 ns | 200115 ns | 0.99 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 292 ns | 1.28 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 375 ns | 0.78 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 23935.5 ns | 23291 ns | 1.03 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6500 ns | 6250 ns | 1.04 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6584 ns | 6458 ns | 1.02 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6584 ns | 6791 ns | 0.97 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6250 ns | 6542 ns | 0.96 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 240134 ns | 237895.5 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 5875 ns | 5792 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 5917 ns | 5875 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 5917 ns | 5875 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 5834 ns | 5875 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 24721 ns | 24490 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 21500 ns | 21417 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 21333 ns | 20958 ns | 1.02 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 21292 ns | 21625 ns | 0.98 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 21208 ns | 21542 ns | 0.98 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 262379.5 ns | 261762 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 144229.5 ns | 142958 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 144042 ns | 144250 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 147292 ns | 147875 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 145833 ns | 184292 ns | 0.79 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 167351 ns | 166717.5 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1320395.5 ns | 1330292 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1358771 ns | 1314521 ns | 1.03 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1324084 ns | 1355792 ns | 0.98 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1329333.5 ns | 1327770.5 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1268788 ns | 1305974 ns | 0.97 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 24083 ns | 24584 ns | 0.98 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 22375 ns | 21792 ns | 1.03 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 25104.5 ns | 24250 ns | 1.04 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 21917 ns | 22250 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 280502 ns | 280907 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 131646 ns | 183708 ns | 0.72 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 121334 ns | 129375 ns | 0.94 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 177687.5 ns | 177104.5 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 130209 ns | 130708.5 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1380349 ns | 1407612.5 ns | 0.98 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 416 ns | 375 ns | 1.11 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 417 ns | 0.90 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 416 ns | 375 ns | 1.11 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 333 ns | 0.88 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 23199 ns | 22943.5 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6708 ns | 6458 ns | 1.04 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 7083 ns | 6709 ns | 1.06 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6708 ns | 6792 ns | 0.99 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6083 ns | 6667 ns | 0.91 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 258254.5 ns | 254857.5 ns | 1.01 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 5042 ns | 4459 ns | 1.13 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4500 ns | 4500 ns | 1 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 4917 ns | 5583 ns | 0.88 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4917 ns | 4833 ns | 1.02 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 243109 ns | 243558 ns | 1.00 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10375 ns | 10000 ns | 1.04 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10042 ns | 10542 ns | 0.95 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10125 ns | 10375 ns | 0.98 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10167 ns | 10625 ns | 0.96 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 1338362 ns | 1305595.5 ns | 1.03 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1667 ns | 1625 ns | 1.03 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1625 ns | 1625 ns | 1 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1625 ns | 1625 ns | 1 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1542 ns | 1625 ns | 0.95 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 23629 ns | 22968 ns | 1.03 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 5875 ns | 5584 ns | 1.05 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 5666 ns | 5959 ns | 0.95 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 5958 ns | 5958 ns | 1 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 5625 ns | 5667 ns | 0.99 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 278503 ns | 272222.5 ns | 1.02 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 6825854.5 ns | 6735562.5 ns | 1.01 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 6429125 ns | 6387645.5 ns | 1.01 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 6541187.5 ns | 6536625 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 7656375 ns | 7531791 ns | 1.02 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 215102 ns | 213356 ns | 1.01 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 24080834 ns | 24025604 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 21338208 ns | 21251771 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 21079333 ns | 21005062.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 29660375 ns | 29807125 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2111008 ns | 2110248 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 48564000 ns | 37237458 ns | 1.30 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 45595770.5 ns | 45593041.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 45721854 ns | 45798709 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 38038271 ns | 49459000 ns | 0.77 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 5687.5 ns | 5334 ns | 1.07 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6041 ns | 6000 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6917 ns | 7395.5 ns | 0.94 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 5375 ns | 6834 ns | 0.79 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 239823 ns | 227910 ns | 1.05 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8291 ns | 8041 ns | 1.03 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8500 ns | 8958 ns | 0.95 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8750 ns | 8833 ns | 0.99 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8750 ns | 8667 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 1069933 ns | 1024262 ns | 1.04 |
| lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) | 1555021 ns | 1489978.5 ns | 1.04 |
| lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) | 1235375.5 ns | 1272375 ns | 0.97 |
| lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) | 1618375 ns | 1615958.5 ns | 1.00 |
| lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) | 2095209 ns | 2147750 ns | 0.98 |
| lenet(28, 28, 1, 128)/forward/GPU/CUDA | 285020 ns | 272490.5 ns | 1.05 |
| lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) | 7898542 ns | 7874499.5 ns | 1.00 |
| lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) | 6630645.5 ns | 6577854 ns | 1.01 |
| lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) | 7200958 ns | 7193458 ns | 1.00 |
| lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) | 10372854.5 ns | 10471167 ns | 0.99 |
| lenet(28, 28, 1, 128)/zygote/GPU/CUDA | 1904820 ns | 1816656.5 ns | 1.05 |
| batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 342000 ns | 340250 ns | 1.01 |
| batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 323833 ns | 347208 ns | 0.93 |
| batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 382208 ns | 416958 ns | 0.92 |
| batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 342042 ns | 334709 ns | 1.02 |
| batchedmm(128, Bsize=4)/forward/GPU/CUDA | 43080 ns | 46984 ns | 0.92 |
| batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 725958 ns | 730604 ns | 0.99 |
| batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 782938 ns | 790083.5 ns | 0.99 |
| batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 1067750 ns | 1071292 ns | 1.00 |
| batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 737041.5 ns | 738229.5 ns | 1.00 |
| batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 314201.5 ns | 302775.5 ns | 1.04 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 397583 ns | 397292 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 211916 ns | 288167 ns | 0.74 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 288208 ns | 212041 ns | 1.36 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 750834 ns | 756500 ns | 0.99 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 44587.5 ns | 43949 ns | 1.01 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 670500 ns | 673500 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 470708 ns | 531875 ns | 0.88 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 531792 ns | 473459 ns | 1.12 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 974083 ns | 974125 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 192970 ns | 189220 ns | 1.02 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 651646 ns | 599645.5 ns | 1.09 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 644458.5 ns | 593000 ns | 1.09 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 659271 ns | 645479 ns | 1.02 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 645333 ns | 599833 ns | 1.08 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 132814 ns | 131878.5 ns | 1.01 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2440750 ns | 2459000 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2525916.5 ns | 2452125 ns | 1.03 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2439124.5 ns | 2536291 ns | 0.96 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2464750 ns | 2463125 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1349058.5 ns | 1516114.5 ns | 0.89 |
| batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 344292 ns | 340708 ns | 1.01 |
| batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 326104 ns | 346896 ns | 0.94 |
| batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 393875 ns | 408875 ns | 0.96 |
| batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 312896 ns | 307667 ns | 1.02 |
| batchedmm(2, Bsize=32)/forward/GPU/CUDA | 16925 ns | 16566.5 ns | 1.02 |
| batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 709938 ns | 700646 ns | 1.01 |
| batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 739917 ns | 735458 ns | 1.01 |
| batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 1021708 ns | 1026375 ns | 1.00 |
| batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 650083.5 ns | 646000 ns | 1.01 |
| batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 202873.5 ns | 198090 ns | 1.02 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1458625 ns | 1462084 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1490666 ns | 1503958 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1498417 ns | 1487166 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1436416 ns | 1441875 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 41016 ns | 41094.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5105458 ns | 5136542 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5294583 ns | 5289167 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5292167 ns | 5302000 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5007208 ns | 4991770.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 201135.5 ns | 199139 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3708 ns | 3708 ns | 1 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3750 ns | 3709 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3708 ns | 3709 ns | 1.00 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3667 ns | 3708 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA | 33479.5 ns | 33022 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 15292 ns | 15084 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 15125 ns | 15500 ns | 0.98 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 15291 ns | 15167 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 15042 ns | 15291 ns | 0.98 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA | 381756.5 ns | 363808 ns | 1.05 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 71209 ns | 71334 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 71250 ns | 71333 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 71125 ns | 71375 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 70062.5 ns | 71333 ns | 0.98 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA | 114111 ns | 113636 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 318250 ns | 323083 ns | 0.99 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 329625 ns | 320292 ns | 1.03 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 318708 ns | 333084 ns | 0.96 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 317958 ns | 317958 ns | 1 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA | 197229.5 ns | 193954.5 ns | 1.02 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 1083 ns | 1000 ns | 1.08 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 1083 ns | 1083 ns | 1 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 1083 ns | 1083 ns | 1 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 1000 ns | 1041 ns | 0.96 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 24163 ns | 23722 ns | 1.02 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8167 ns | 8250 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8041 ns | 8208 ns | 0.98 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8667 ns | 8500 ns | 1.02 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 7625 ns | 8208 ns | 0.93 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 264271.5 ns | 259533.5 ns | 1.02 |
| batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) | 464166.5 ns | 464583 ns | 1.00 |
| batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) | 448167 ns | 461104.5 ns | 0.97 |
| batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) | 553459 ns | 552750 ns | 1.00 |
| batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) | 548917 ns | 554624.5 ns | 0.99 |
| batchedmm(128, Bsize=32)/forward/GPU/CUDA | 129241.5 ns | 129785 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) | 1380229 ns | 1390375 ns | 0.99 |
| batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) | 1393229 ns | 1380375 ns | 1.01 |
| batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) | 1619541 ns | 1608937.5 ns | 1.01 |
| batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) | 1590270.5 ns | 1604062 ns | 0.99 |
| batchedmm(128, Bsize=32)/zygote/GPU/CUDA | 277974 ns | 276124 ns | 1.01 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 416 ns | 375 ns | 1.11 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 333 ns | 333 ns | 1 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 32417 ns | 33098 ns | 0.98 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6375 ns | 6583 ns | 0.97 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6500 ns | 6459 ns | 1.01 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6542 ns | 6791 ns | 0.96 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 5958 ns | 6291 ns | 0.95 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 267135 ns | 267970 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1723834 ns | 1723583 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1731042 ns | 1720292 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1722458 ns | 1738625 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1727375 ns | 1724667 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 168945.5 ns | 169379 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4366646 ns | 4361375 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4396958.5 ns | 4350333.5 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4374416.5 ns | 4436167 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4349500 ns | 4374667 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1192401 ns | 1273197 ns | 0.94 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 6750 ns | 6541 ns | 1.03 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 6541 ns | 6834 ns | 0.96 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 7292 ns | 7375 ns | 0.99 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 6542 ns | 6917 ns | 0.95 |
| bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA | 20406 ns | 21180 ns | 0.96 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 81771 ns | 35667 ns | 2.29 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 49083 ns | 71541.5 ns | 0.69 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 72271 ns | 72292 ns | 1.00 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 51334 ns | 53000 ns | 0.97 |
| bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA | 213340.5 ns | 251861 ns | 0.85 |
| batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) | 354167 ns | 350750 ns | 1.01 |
| batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) | 329541.5 ns | 343312.5 ns | 0.96 |
| batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) | 401083 ns | 436042 ns | 0.92 |
| batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) | 321771 ns | 315333 ns | 1.02 |
| batchedmm(2, Bsize=512)/forward/GPU/CUDA | 18865 ns | 18874 ns | 1.00 |
| batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) | 722646.5 ns | 718208.5 ns | 1.01 |
| batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) | 740500 ns | 741959 ns | 1.00 |
| batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) | 1030625 ns | 1045208.5 ns | 0.99 |
| batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) | 673875 ns | 671250 ns | 1.00 |
| batchedmm(2, Bsize=512)/zygote/GPU/CUDA | 350549.5 ns | 338267.5 ns | 1.04 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 75250 ns | 75250 ns | 1 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 75250 ns | 75458 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 75458 ns | 75209 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 75042 ns | 75375 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA | 47823 ns | 47612 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 324625 ns | 334958.5 ns | 0.97 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 341667 ns | 325541 ns | 1.05 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 324250 ns | 340125 ns | 0.95 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 330833 ns | 324750 ns | 1.02 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA | 216202 ns | 215177 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1485500 ns | 1487250 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1517334 ns | 1529250 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1526000 ns | 1513834 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1463167 ns | 1466437.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 53576 ns | 54137 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5124354.5 ns | 5122625 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5278542 ns | 5274395.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5287917 ns | 5290959 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4986958 ns | 4995000 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 209445 ns | 208782 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 28250 ns | 28208 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 28250 ns | 28250 ns | 1 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 28208 ns | 28167 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 28291 ns | 28250 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA | 25452 ns | 25447 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 66333 ns | 66208 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 66250 ns | 66375 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 66250 ns | 66375 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66333 ns | 66750 ns | 0.99 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA | 539628 ns | 505462.5 ns | 1.07 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) | 1483687.5 ns | 1492333 ns | 0.99 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) | 859791.5 ns | 1145208 ns | 0.75 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) | 1143208 ns | 895291.5 ns | 1.28 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) | 2247229.5 ns | 2232499.5 ns | 1.01 |
| mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA | 585407 ns | 590189 ns | 0.99 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) | 3085000 ns | 3067958 ns | 1.01 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) | 2591208 ns | 2729834 ns | 0.95 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) | 2737895.5 ns | 2642083 ns | 1.04 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) | 3816250 ns | 3820583.5 ns | 1.00 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA | 2035890 ns | 2059273 ns | 0.99 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) | 8818187.5 ns | 9001312.5 ns | 0.98 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) | 8953500 ns | 8781958.5 ns | 1.02 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) | 8776854 ns | 8758666.5 ns | 1.00 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) | 6365041 ns | 6346312 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 80791 ns | 80333 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 79875 ns | 81417 ns | 0.98 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 82792 ns | 82500 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 80708 ns | 78270.5 ns | 1.03 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 194256.5 ns | 193540.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2013375 ns | 2014625 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1748958 ns | 2014312.5 ns | 0.87 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2018500 ns | 2021437.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2022750 ns | 2020042 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 809328 ns | 784625 ns | 1.03 |
This comment was automatically generated by workflow using github-action-benchmark.
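
The last value in each row appears to be the first timing divided by the second, rounded to two decimals (for example, 943015 / 1072717 ≈ 0.88). A minimal Python sketch of that arithmetic follows, assuming the first timing is the current run and the second is the previous run. The helper names `ratio` and `flag_slowdowns` and the 1.5 cutoff are hypothetical illustrations, not part of github-action-benchmark.

```python
def ratio(first_ns: float, second_ns: float) -> float:
    """Return first / second rounded to two decimals, as the table seems to do."""
    return round(first_ns / second_ns, 2)


def flag_slowdowns(rows, threshold=1.5):
    """Return the names of rows whose first/second ratio exceeds `threshold`.

    `rows` is an iterable of (name, first_ns, second_ns) tuples.
    The threshold is a hypothetical cutoff chosen for illustration only.
    """
    return [name for name, first_ns, second_ns in rows
            if first_ns / second_ns > threshold]


# Two rows copied from the table above.
rows = [
    ("groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA",
     943015, 1072717),
    ("bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s)",
     81771, 35667),
]

for name, first_ns, second_ns in rows:
    print(f"{name}: {ratio(first_ns, second_ns)}")

print(flag_slowdowns(rows))
```

Under these assumptions, a ratio above 1 means the first timing is slower than the second. Small CPU timings of a few hundred nanoseconds fluctuate noticeably between runs, so a single outlier row is weak evidence of a real change.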