Commit
non_differentiable gpu_device and cpu_device (#1089)
* non_differentiable gpu_device and cpu_device
* Update lib/MLDataDevices/ext/MLDataDevicesChainRulesCoreExt.jl
* fix: missing imports
* chore: bump version for release

---------

Co-authored-by: Avik Pal <[email protected]>
1 parent f1e0ad8 commit 38f1a73
Showing 2 changed files with 8 additions and 4 deletions.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | @@ -1,7 +1,7 @@ |
| 1 | 1 | name = "MLDataDevices" |
| 2 | 2 | uuid = "7e8f7934-dd98-4c1a-8fe8-92b47a384d40" |
| 3 | 3 | authors = ["Avik Pal <[email protected]> and contributors"] |
| 4 | | -version = "1.6.1" |
| | 4 | +version = "1.6.2" |
| 5 | 5 | |
| 6 | 6 | [deps] |
| 7 | 7 | Adapt = "79e6a3ab-5dfb-504d-930d-738a2a938a0e" |
@JuliaRegistrator register subdir=lib/MLDataDevices
Registration pull request created: JuliaRegistries/General/119646
Tip: Release Notes
Did you know you can add release notes too? Just add Markdown-formatted text underneath the comment after the text "Release notes:" and it will be added to the registry PR, and if TagBot is installed it will also be added to the release that TagBot creates. i.e.
To add them here just re-invoke and the PR will be updated.
Tagging
After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.
This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the GitHub interface, or via:
Lux Benchmarks
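The ratio reported for each benchmark below is consistent with the first timing divided by the second, rounded to two decimals (for example, 4208 ns / 4125 ns ≈ 1.02), so values above 1 mean the first timing is the slower one. A minimal Python sketch, assuming the first timing is the current run and the second is the previous run, recomputes the ratio for a few entries and flags those above a cutoff. The `ALERT_RATIO` value and the helper names are hypothetical and are not taken from the Lux CI configuration.

```python
# Sketch: recompute the ratio figure from a few (first, second) timing pairs.
# Timings in nanoseconds are copied from the benchmark listing.

ALERT_RATIO = 1.25  # hypothetical cutoff, not from the Lux CI configuration

samples = [
    ("layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)",
     4208, 4125),
    ("layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)",
     1108375, 815917),
    ("Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s)",
     42588125, 33648791),
    ("Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s)",
     332483208, 400176437.5),
]


def ratio(current_ns, previous_ns):
    """Return current / previous, rounded to two decimals like the listing."""
    return round(current_ns / previous_ns, 2)


# Map each benchmark name to its recomputed ratio.
ratios = {name: ratio(cur, prev) for name, cur, prev in samples}

# Keep only the benchmarks whose ratio exceeds the illustrative cutoff.
flagged = [name for name, r in ratios.items() if r > ALERT_RATIO]

for name, r in ratios.items():
    marker = "  <-- above ALERT_RATIO" if name in flagged else ""
    print(f"{r:.2f}  {name}{marker}")
```

With this illustrative cutoff, only the entries with ratios 1.36 and 1.27 are flagged. Several of the larger ratios in the listing belong to sub-microsecond CPU timings, where a few tens of nanoseconds can move the ratio noticeably, so a single run is weak evidence of a regression.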
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
4208
ns4125
ns1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
3895.5
ns4292
ns0.91
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
4750
ns4875
ns0.97
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
3875
ns4188
ns0.93
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA
62917.5
ns61773
ns1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
11250
ns10375
ns1.08
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
10708
ns10250
ns1.04
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
11208
ns10709
ns1.05
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
10333
ns10584
ns0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA
438545
ns433806
ns1.01
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s)
1125
ns1209
ns0.93
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s)
1167
ns1208
ns0.97
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s)
1541
ns1334
ns1.16
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s)
1250
ns1333
ns0.94
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA
18677
ns18632
ns1.00
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
4166
ns3958
ns1.05
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
4041
ns3770.5
ns1.07
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
4125
ns4250
ns0.97
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
4042
ns3750
ns1.08
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA
112338
ns111653
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
57750
ns57167
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
38375
ns46708
ns0.82
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
46291
ns47042
ns0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
82125
ns85000
ns0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
37665
ns37778
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2025875
ns2021166.5
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2090208
ns2091833
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2087125
ns2090417
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
2003042
ns2037250
ns0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
199276
ns197839
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
145125
ns144125
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
143166
ns143687.5
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
145875
ns145875
ns1
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
184521
ns144542
ns1.28
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
166939
ns166264.5
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1108375
ns815917
ns1.36
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1112042
ns1110583
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1118979
ns1128458
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1120229
ns1161791.5
ns0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
539594
ns531966.5
ns1.01
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
3958
ns3834
ns1.03
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
3333
ns3667
ns0.91
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
4334
ns4208
ns1.03
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
3375
ns3875
ns0.87
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
72226
ns72027
ns1.00
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
9666
ns9666
ns1
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
9875
ns9208
ns1.07
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
9959
ns9667
ns1.03
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
9708.5
ns8791
ns1.10
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
496688
ns495388.5
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
18334
ns17250
ns1.06
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
15041
ns15292
ns0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
17583
ns17750
ns0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
14666
ns14875
ns0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
55623
ns54800
ns1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
215167
ns213334
ns1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
214250
ns213667
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
214750
ns215625
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
214687.5
ns213125
ns1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
280563
ns273384.5
ns1.03
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s)
708
ns625
ns1.13
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s)
584
ns500
ns1.17
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s)
667
ns834
ns0.80
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s)
583
ns542
ns1.08
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA
18098
ns17538
ns1.03
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
1708
ns1459
ns1.17
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
1709
ns1625
ns1.05
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
1958
ns1541
ns1.27
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
1375
ns1584
ns0.87
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA
104646
ns101749
ns1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
7250
ns6625
ns1.09
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
5167
ns5833
ns0.89
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
5875
ns6000
ns0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
9875
ns10541
ns0.94
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
24104
ns23308
ns1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
221209
ns230042
ns0.96
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
231959
ns228000
ns1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
229958
ns229917
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
226541
ns215459
ns1.05
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
172625
ns167869.5
ns1.03
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s)
3917
ns3875
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s)
3958
ns3917
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s)
3958
ns3917
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s)
3875
ns3958
ns0.98
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA
24227
ns23769
ns1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
17166
ns16625
ns1.03
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
16583
ns16645.5
ns1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
16875
ns16916
ns1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
16750
ns16542
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA
165474.5
ns160993.5
ns1.03
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s)
599125
ns583542
ns1.03
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s)
582292
ns582166
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s)
578209
ns573083
ns1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s)
571084
ns578334
ns0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA
114507
ns112908
ns1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s)
1442792
ns1416417
ns1.02
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s)
1428646
ns1413563
ns1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s)
1428895.5
ns1420000
ns1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s)
1420104.5
ns1427041.5
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA
215306
ns209512.5
ns1.03
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s)
1075333.5
ns1074937.5
ns1.00
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s)
935375
ns961625
ns0.97
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s)
1340937.5
ns1349604
ns0.99
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s)
1309500
ns1275750
ns1.03
lenet(28, 28, 1, 64)/forward/GPU/CUDA
279907.5
ns272786
ns1.03
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s)
5997834
ns5988250
ns1.00
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s)
4511396
ns4453229
ns1.01
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s)
4924000
ns4954875
ns0.99
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s)
5545938
ns5751250
ns0.96
lenet(28, 28, 1, 64)/zygote/GPU/CUDA
1109413
ns1067705
ns1.04
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s)
542
ns500
ns1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s)
500
ns541
ns0.92
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s)
542
ns583
ns0.93
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s)
500
ns583
ns0.86
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA
24308
ns23552
ns1.03
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
2250
ns2125
ns1.06
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
2167
ns2125
ns1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
2167
ns2167
ns1
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
2125
ns2125
ns1
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA
175730.5
ns171901
ns1.02
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
4375
ns4208.5
ns1.04
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
3625
ns4417
ns0.82
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
5250
ns5042
ns1.04
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
3917
ns4166
ns0.94
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA
68155
ns65093
ns1.05
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
11708
ns11292
ns1.04
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
11709
ns11292
ns1.04
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
12042
ns11875
ns1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
10917
ns11417
ns0.96
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA
463688.5
ns448429
ns1.03
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
7145.5
ns7020.5
ns1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
8500
ns7041
ns1.21
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
8041
ns7625
ns1.05
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
6167
ns6500
ns0.95
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
53097.5
ns52253
ns1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
17459
ns16979.5
ns1.03
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
16666
ns17833
ns0.93
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
18458
ns18875
ns0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
16917
ns16875
ns1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
305688
ns301549.5
ns1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s)
667
ns584
ns1.14
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s)
542
ns583
ns0.93
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s)
625
ns625
ns1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s)
541
ns542
ns1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA
33408
ns32680
ns1.02
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
9375
ns8750
ns1.07
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
8375
ns8834
ns0.95
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
9084
ns9625
ns0.94
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
8792
ns8667
ns1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA
162736
ns156693
ns1.04
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s)
64584
ns64125
ns1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s)
64542
ns64291
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s)
64500
ns64458
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s)
64583
ns64584
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA
112246
ns111163
ns1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s)
273833
ns280625
ns0.98
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s)
284750
ns274250
ns1.04
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s)
273833
ns278083
ns0.98
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s)
280125
ns289292
ns0.97
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA
188461.5
ns184761.5
ns1.02
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s)
3352917
ns3374250
ns0.99
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s)
2859834
ns3022020.5
ns0.95
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s)
3018375
ns3033167
ns1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s)
4081521
ns4059271.5
ns1.01
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA
581514
ns577014
ns1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s)
7365834
ns7622583.5
ns0.97
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s)
7362708
ns7400875
ns0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s)
7444334
ns7463083
ns1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s)
8211042
ns8222208
ns1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA
1352556
ns1350413
ns1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s)
18787750
ns18744750
ns1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s)
19099041
ns19149375
ns1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s)
19169000
ns19037709
ns1.01
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s)
15691292
ns15854917
ns0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s)
23530604
ns23424208
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s)
42588125
ns33648791
ns1.27
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s)
37157833
ns37255625
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s)
34914792
ns35462146
ns0.98
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA
1854127
ns1854361
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s)
187564000
ns189507459
ns0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s)
177265645.5
ns163150563
ns1.09
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s)
152184437.5
ns151759708
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s)
437472417
ns449307375
ns0.97
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA
13907657
ns13915090
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s)
289863667
ns290474792
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s)
355940583
ns338390437.5
ns1.05
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s)
300149020.5
ns298728666
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s)
332483208
ns400176437.5
ns0.83
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
24354.5
ns24666
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
22625
ns23062.5
ns0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
25333
ns25125
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
22042
ns21833
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
100207
ns95619.5
ns1.05
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
104583
ns103041
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
104625
ns103750
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
105145.5
ns104584
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
103000
ns104146
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
519245.5
ns500114.5
ns1.04
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
6167
ns6042
ns1.02
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
6417
ns6500
ns0.99
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
7792
ns6667
ns1.17
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
5833
ns5958
ns0.98
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
70633
ns68217
ns1.04
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
15250
ns14833
ns1.03
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
16250
ns16208
ns1.00
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
15958
ns16542
ns0.96
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
14792
ns15541.5
ns0.95
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
491137.5
ns474515
ns1.04
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s)
2890458
ns3028583
ns0.95
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s)
2080375
ns2072250
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s)
2260667
ns2258958
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s)
4898959
ns4727250
ns1.04
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA
582542
ns581996.5
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s)
23630250
ns23485750
ns1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s)
18270209
ns18074583
ns1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s)
16933041.5
ns17953667
ns0.94
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s)
35714083
ns36188354.5
ns0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA
2894491
ns3102669
ns0.93
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s)
33294083
ns33313750
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s)
27973458
ns27588229.5
ns1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s)
27440417
ns27385167
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s)
41585958
ns42266896
ns0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
74708.5
ns72125
ns1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
75729.5
ns75625
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
76750
ns75209
ns1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
72041
ns72313
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
103323
ns102770.5
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
289146
ns217709
ns1.33
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
313625
ns264292
ns1.19
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
212416
ns208812
ns1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
217500
ns216750
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
555124
ns548643
ns1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
12250
ns11834
ns1.04
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
12042
ns13750
ns0.88
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
13125
ns12208
ns1.08
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
11416.5
ns11791.5
ns0.97
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
72087
ns71431.5
ns1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
27208
ns26500
ns1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
27750
ns27375
ns1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
27854.5
ns28000
ns0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
26375
ns27167
ns0.97
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
476976
ns474755
ns1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
13000
ns12292
ns1.06
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
12750
ns13250
ns0.96
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
14667
ns13625
ns1.08
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
13000
ns12625
ns1.03
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
52635
ns53420
ns0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
26208
ns25708
ns1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
26125
ns26084
ns1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
26208
ns26375
ns0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
26208
ns26209
ns1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
305320.5
ns305780
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
180500
ns181833
ns0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
180729.5
ns182750
ns0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
181833
ns182000
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
179750
ns179750
ns1
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
56100.5
ns56584
ns0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
583667
ns582667
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
586125
ns589020.5
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
583750
ns585562.5
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
583792
ns582875
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
289385.5
ns286509.5
ns1.01
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s)
6395.5
ns5958
ns1.07
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s)
5916
ns7000
ns0.85
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s)
8542
ns6917
ns1.23
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s)
5459
ns6167
ns0.89
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA
71686
ns71314
ns1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
14750
ns14041.5
ns1.05
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
15208
ns15042
ns1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
15208
ns15334
ns0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
14416
ns15042
ns0.96
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA
468551
ns465404.5
ns1.01
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s)
1187959
ns1163666
ns1.02
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s)
1888958
ns1608417
ns1.17
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s)
1305792
ns1245958
ns1.05
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s)
1334770.5
ns1315062.5
ns1.01
batchedmm(512, Bsize=4)/forward/GPU/CUDA
302497.5
ns301860.5
ns1.00
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s)
4137167
ns4119833.5
ns1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s)
4514604.5
ns4367812.5
ns1.03
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s)
4613834
ns4633625
ns1.00
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s)
4486208
ns4681521
ns0.96
batchedmm(512, Bsize=4)/zygote/GPU/CUDA
1040017
ns1040008
ns1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s)
1834
ns1875
ns0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s)
1834
ns1875
ns0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s)
1834
ns1875
ns0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s)
1834
ns1916
ns0.96
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA
23961
ns23628.5
ns1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s)
4875
ns4875
ns1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s)
4916
ns4875
ns1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s)
5000
ns4917
ns1.02
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s)
4875
ns4875
ns1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA
189260
ns188198
ns1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
6125
ns5959
ns1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
5417
ns6333
ns0.86
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
6750
ns6584
ns1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
6000
ns5625
ns1.07
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
56193.5
ns55698
ns1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
11667
ns10958
ns1.06
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
11083
ns11875
ns0.93
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
12041
ns11667
ns1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
10333
ns11041.5
ns0.94
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
335012.5
ns330993.5
ns1.01
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s)
292
ns292
ns1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s)
333
ns375
ns0.89
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s)
334
ns334
ns1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s)
333
ns375
ns0.89
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA
23036
ns23016
ns1.00
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
3042
ns2791
ns1.09
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
2750
ns2750
ns1
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
3041
ns3083
ns0.99
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
2791
ns2792
ns1.00
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA
159234.5
ns158081
ns1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
11667
ns12000
ns0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
11291
ns12292
ns0.92
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
13333
ns12979
ns1.03
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
11541
ns11500
ns1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA
58294.5
ns56764.5
ns1.03
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
25500
ns25250
ns1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
24541
ns25292
ns0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
25250
ns25542
ns0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
24500
ns25125
ns0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA
299590
ns293131
ns1.02
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4167 ns | 4167 ns | 1 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4167 ns | 4209 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4167 ns | 4250 ns | 0.98 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4208 ns | 4250 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA | 24930 ns | 24851 ns | 1.00 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16208 ns | 16084 ns | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 15958 ns | 16084 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16291 ns | 16250 ns | 1.00 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16291 ns | 16125 ns | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA | 200508.5 ns | 193865.5 ns | 1.03 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5875 ns | 5875 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 5875 ns | 5833 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5792 ns | 5833 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5791 ns | 5833 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 33184 ns | 33648.5 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 20895.5 ns | 20937.5 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 20583 ns | 20875 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 20875 ns | 21375 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 21208 ns | 20833 ns | 1.02 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 177158 ns | 175295.5 ns | 1.01 |
| batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 398333.5 ns | 405354.5 ns | 0.98 |
| batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 350916 ns | 383146 ns | 0.92 |
| batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 490333 ns | 487375 ns | 1.01 |
| batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 531125 ns | 505333 ns | 1.05 |
| batchedmm(16, Bsize=512)/forward/GPU/CUDA | 66865 ns | 67095 ns | 1.00 |
| batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 932125 ns | 921500 ns | 1.01 |
| batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 886416 ns | 879833.5 ns | 1.01 |
| batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 1236834 ns | 1239500 ns | 1.00 |
| batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 1389000 ns | 1413875 ns | 0.98 |
| batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 189939 ns | 190914 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 82541.5 ns | 80792 ns | 1.02 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 80354 ns | 80625 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 82437.5 ns | 82416.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 97083 ns | 82208.5 ns | 1.18 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 192428 ns | 193084 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1914917 ns | 1921166 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1939042 ns | 1923375 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1920479.5 ns | 1702792 ns | 1.13 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1905042 ns | 1942625 ns | 0.98 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 410076 ns | 397267 ns | 1.03 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 333 ns | 292 ns | 1.14 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 333 ns | 292 ns | 1.14 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 333 ns | 292 ns | 1.14 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 22365 ns | 22298 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1875 ns | 1792 ns | 1.05 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1834 ns | 1792 ns | 1.02 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1875 ns | 1875 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1833 ns | 1792 ns | 1.02 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 175349 ns | 171128.5 ns | 1.02 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 7375 ns | 6750 ns | 1.09 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 6458 ns | 7125 ns | 0.91 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7333 ns | 7750 ns | 0.95 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6584 ns | 6583 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 60938.5 ns | 60207.5 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9584 ns | 9334 ns | 1.03 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9208 ns | 9458 ns | 0.97 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9167 ns | 9458 ns | 0.97 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9416 ns | 9500 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 319338.5 ns | 309332.5 ns | 1.03 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 120592520.5 ns | 118908083 ns | 1.01 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 181751312 ns | 173905459 ns | 1.05 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 148080208 ns | 148147000 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 102236292 ns | 104063562 ns | 0.98 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5477430 ns | 5483006 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 616714854.5 ns | 615077271 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 577694542 ns | 556251208 ns | 1.04 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 455123729 ns | 456191166.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 752842396 ns | 775264354 ns | 0.97 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 38217675 ns | 38217009 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 652219667 ns | 651954834 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 684789500 ns | 668816521 ns | 1.02 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 585586750 ns | 584471208 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 741987542 ns | 743364500 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 59375 ns | 59041 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 38959 ns | 47167 ns | 0.83 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46791 ns | 48042 ns | 0.97 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83959 ns | 85604.5 ns | 0.98 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 37739.5 ns | 38577 ns | 0.98 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1916959 ns | 1921792 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1979666 ns | 1983375 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1976083 ns | 1974021 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1894000 ns | 1888041.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 175218.5 ns | 177270 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 269208 ns | 267667 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 275479 ns | 269500 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 271041 ns | 269000 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 266333.5 ns | 265375 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 136845.5 ns | 129439 ns | 1.06 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 680750 ns | 602875 ns | 1.13 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 692354 ns | 667625 ns | 1.04 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 599084 ns | 589104 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 595292 ns | 696166.5 ns | 0.86 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 721806 ns | 698695 ns | 1.03 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2222875 ns | 2214416 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2206562.5 ns | 2132916.5 ns | 1.03 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2190792 ns | 2099687.5 ns | 1.04 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2219812.5 ns | 2218542 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 132948 ns | 135139.5 ns | 0.98 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5499000 ns | 5496500 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5581125 ns | 5493084 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5523834 ns | 5512750 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5495417 ns | 5608375 ns | 0.98 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 731329.5 ns | 786813 ns | 0.93 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 638708 ns | 645084 ns | 0.99 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 653417 ns | 646042 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 641250 ns | 643042 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 636417 ns | 645042 ns | 0.99 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 47525 ns | 47537 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1847375 ns | 1818666 ns | 1.02 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1675167 ns | 1720625 ns | 0.97 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1727334 ns | 1727375 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 2104041 ns | 2097625 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 219920 ns | 225809.5 ns | 0.97 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58417 ns | 58458 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 38834 ns | 46958 ns | 0.83 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46958 ns | 47500 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 84208 ns | 85709 ns | 0.98 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 28615 ns | 29149.5 ns | 0.98 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2026062.5 ns | 2024312 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2097250 ns | 2089792 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2092063 ns | 2079417 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1994667 ns | 2030812.5 ns | 0.98 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 189265 ns | 192873 ns | 0.98 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 13388125 ns | 13367875 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 12478208.5 ns | 12448375 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 12574250 ns | 12498688 ns | 1.01 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 15224000 ns | 15196500 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 513523 ns | 515450 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 47281583 ns | 47301125 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 42012708 ns | 41737208 ns | 1.01 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 41057937.5 ns | 41031917 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 58763708 ns | 59054000 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3027655.5 ns | 3246636.5 ns | 0.93 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 96334145.5 ns | 73864187.5 ns | 1.30 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 91884667 ns | 90734875 ns | 1.01 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 91286333 ns | 90710083 ns | 1.01 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 76278542 ns | 99247604 ns | 0.77 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58875 ns | 58667 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 38916.5 ns | 47292 ns | 0.82 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47125 ns | 47625 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82417 ns | 85416.5 ns | 0.96 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 47552.5 ns | 47961 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1924541.5 ns | 1915542 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1974562.5 ns | 1967250 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1774812.5 ns | 1778666.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1885958 ns | 1904791 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 196220 ns | 195659 ns | 1.00 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 417 ns | 375 ns | 1.11 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 333 ns | 333 ns | 1 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 333 ns | 333 ns | 1 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 32997 ns | 32740 ns | 1.01 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6812.5 ns | 6167 ns | 1.10 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6062.5 ns | 6000 ns | 1.01 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6667 ns | 6625 ns | 1.01 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6000 ns | 6042 ns | 0.99 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 174932.5 ns | 176130 ns | 0.99 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 291 ns | 1.00 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 333 ns | 0.88 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 32308 ns | 31946 ns | 1.01 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 2917 ns | 2625 ns | 1.11 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 2709 ns | 2792 ns | 0.97 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 2834 ns | 2916 ns | 0.97 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 2583 ns | 2625 ns | 0.98 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 161677 ns | 164970 ns | 0.98 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 284931833.5 ns | 286577604 ns | 0.99 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 346561021 ns | 339468333 ns | 1.02 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 314560020.5 ns | 314095271 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 270608125 ns | 270924375 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 7114118 ns | 7117527 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 1002689042 ns | 1001221667 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 958558000 ns | 939877583 ns | 1.02 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 855141500 ns | 851361917 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 1156706167 ns | 1176703208 ns | 0.98 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 34166632.5 ns | 33887966 ns | 1.01 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 1685313333 ns | 1311845770.5 ns | 1.28 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 1711077458 ns | 1679371125 ns | 1.02 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 1615744166 ns | 1604290334 ns | 1.01 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 1303020395.5 ns | 1668435000 ns | 0.78 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1421208 ns | 1415333.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1428958 ns | 1417520.5 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1408625 ns | 1416104 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1404708 ns | 1420146 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 128545 ns | 128175 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5018624.5 ns | 5010542 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5051333.5 ns | 5020291.5 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4733250 ns | 5037500 ns | 0.94 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5021583.5 ns | 5047042 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 496665 ns | 595594 ns | 0.83 |
| vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) | 175572229 ns | 175229188 ns | 1.00 |
| vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) | 180410646 ns | 123461167 ns | 1.46 |
| vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) | 128544479 ns | 127594250 ns | 1.01 |
| vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) | 157146354.5 ns | 154552916.5 ns | 1.02 |
| vgg16(32, 32, 3, 32)/forward/GPU/CUDA | 4884679 ns | 4884050 ns | 1.00 |
| vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) | 670240791 ns | 667971584 ns | 1.00 |
| vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) | 607052917 ns | 641402625 ns | 0.95 |
| vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) | 535894542 ns | 501342541 ns | 1.07 |
| vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) | 644547708 ns | 657859875 ns | 0.98 |
| vgg16(32, 32, 3, 32)/zygote/GPU/CUDA | 17659435 ns | 15872908 ns | 1.11 |
| batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 8871750 ns | 8987479.5 ns | 0.99 |
| batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 8823292 ns | 8781270.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 7877542 ns | 7857729 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 10108021 ns | 10412374.5 ns | 0.97 |
| batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1604141 ns | 1592095 ns | 1.01 |
| batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 36437625 ns | 36150584 ns | 1.01 |
| batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 37737666 ns | 36797500 ns | 1.03 |
| batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 33406667 ns | 33192666.5 ns | 1.01 |
| batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 38634291.5 ns | 40244625 ns | 0.96 |
| batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6473557 ns | 6455577 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 47500 ns | 47417 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 47416 ns | 47584 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 47834 ns | 47583 ns | 1.01 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 47541 ns | 47333 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 18281 ns | 18534 ns | 0.99 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 50395.5 ns | 52833.5 ns | 0.95 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 50583 ns | 50375 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 50583 ns | 50666 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 50458.5 ns | 50250 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 170078 ns | 202850 ns | 0.84 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 7500 ns | 7459 ns | 1.01 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6917 ns | 7417 ns | 0.93 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7521 ns | 7312.5 ns | 1.03 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6625 ns | 7458.5 ns | 0.89 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 80332.5 ns | 98661 ns | 0.81 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10458 ns | 9792 ns | 1.07 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9750 ns | 10125 ns | 0.96 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10292 ns | 10542 ns | 0.98 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10000 ns | 10250 ns | 0.98 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 461160.5 ns | 555252.5 ns | 0.83 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6167 ns | 6750 ns | 0.91 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 5875 ns | 6042 ns | 0.97 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 7333 ns | 7208.5 ns | 1.02 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5333 ns | 6542 ns | 0.82 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 87519.5 ns | 104446.5 ns | 0.84 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 13479.5 ns | 13125 ns | 1.03 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 13291 ns | 12917 ns | 1.03 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13250 ns | 13292 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12667 ns | 13083 ns | 0.97 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 422223 ns | 478181 ns | 0.88 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 1083 ns | 1125 ns | 0.96 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 1084 ns | 1042 ns | 1.04 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 1083 ns | 1083 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 1042 ns | 1083 ns | 0.96 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 32913 ns | 32701 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8166 ns | 8375 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8042 ns | 8125 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7958 ns | 8125 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8041 ns | 8083 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 195210 ns | 206369.5 ns | 0.95 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 23417 ns | 23417 ns | 1 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 23167 ns | 23500 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 23750 ns | 23416 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 23459 ns | 23333 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 19164 ns | 18592 ns | 1.03 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 52458 ns | 52750 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 52625 ns | 54709 ns | 0.96 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 52625 ns | 52917 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 52500 ns | 52917 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 226904.5 ns | 283991 ns | 0.80 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1404979 ns | 1399417 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1404271 ns | 1396395.5 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1401667 ns | 1396833 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1400583 ns | 1449874.5 ns | 0.97 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 196430 ns | 196187 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5006854 ns | 5003208 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5043375.5 ns | 5005375 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5019271 ns | 5023834 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5013583.5 ns | 5050167 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 541904 ns | 585941 ns | 0.92 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 3050625 ns | 3039563 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2111958 ns | 2072875 ns | 1.02 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2267479 ns | 2275208 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4552979 ns | 4856479 ns | 0.94 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 583345 ns | 583070 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 24377459 ns | 24354562.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 19098958 ns | 18867354 ns | 1.01 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 18712353.5 ns | 18817521 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 36880667 ns | 37413770.5 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 2992730 ns | 3176919 ns | 0.94 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 34016250 ns | 33990500 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 28716709 ns | 28382208.5 ns | 1.01 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 27913333 ns | 28070021 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 41657708 ns | 42353875 ns | 0.98 |
| batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 142625542 ns | 144782125 ns | 0.99 |
| batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 142415500 ns | 142800542 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 124505750 ns | 123809687.5 ns | 1.01 |
| batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 174391229.5 ns | 168891563 ns | 1.03 |
| batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22776842 ns | 22773536 ns | 1.00 |
| batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 920278833.5 ns | 1277305063 ns | 0.72 |
| batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 871702708.5 ns | 1180173271 ns | 0.74 |
| batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 713641791.5 ns | 757990666 ns | 0.94 |
| batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 671743250 ns | 688381500 ns | 0.98 |
| batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 116134147 ns | 118470004 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 75208.5 ns | 75042 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 74417 ns | 73625 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 75917 ns | 77166 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 72229 ns | 74708 ns | 0.97 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 198593 ns | 220284.5 ns | 0.90 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 287583 ns | 285750 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 205125 ns | 191208 ns | 1.07 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 268979 ns | 192209 ns | 1.40 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 255333 ns | 286417 ns | 0.89 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1021640 ns | 1195118 ns | 0.85 |
| batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 35433625 ns | 35568917 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 36009208 ns | 35278833 ns | 1.02 |
| batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 32265896 ns | 32149729 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 40598479 ns | 41733750 ns | 0.97 |
| batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5840629 ns | 5841675.5 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 148147125 ns | 148531084 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 155796771 ns | 153045542 ns | 1.02 |
| batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 134729167 ns | 136231750 ns | 0.99 |
| batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 286194250 ns | 228329854.5 ns | 1.25 |
| batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 34880901 ns | 34864707.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 120489250 ns | 119094187.5 ns | 1.01 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 181727625 ns | 174236667 ns | 1.04 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 148246958.5 ns | 147985917 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 107474479 ns | 107449375 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5474097 ns | 5482351 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 469354333 ns | 467600417 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 484572458 ns | 465577292 ns | 1.04 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 439936978.5 ns | 438034750 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 739328625 ns | 759816229.5 ns | 0.97 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 35154753 ns | 35154520.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 707340521 ns | 709358854.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 673922146 ns | 655624271 ns | 1.03 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 572041396 ns | 571617791 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 849558125 ns | 869387791 ns | 0.98 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) | 1298708.5 ns | 1327250.5 ns | 0.98 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) | 721687 ns | 905875 ns | 0.80 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) | 919229 ns | 907750 ns | 1.01 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) | 2090500 ns | 2079042 ns | 1.01 |
| mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA | 581149 ns | 578714.5 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) | 2964833 ns | 2967333.5 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) | 2526166.5 ns | 2631479.5 ns | 0.96 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) | 2616228.5 ns | 2620896 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) | 3691270.5 ns | 3771729 ns | 0.98 |
| mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA | 1611589.5 ns | 1755565 ns | 0.92 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) | 6637958 ns | 6610917 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) | 6471417 ns | 6496875 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) | 6516083 ns | 6497437.5 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) | 4027166.5 ns | 4521833 ns | 0.89 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7417 ns | 7208 ns | 1.03 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5375 ns | 6125 ns | 0.88 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6084 ns | 6084 ns | 1 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10000 ns | 10542 ns | 0.95 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 25734 ns | 25575 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 213292 ns | 212875 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 223084 ns | 229500 ns | 0.97 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 221166 ns | 221187.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 206500 ns | 246625 ns | 0.84 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 222772.5 ns | 261769.5 ns | 0.85 |
| vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) | 300948375 ns | 313730896 ns | 0.96 |
| vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) | 281067792 ns | 222537125 ns | 1.26 |
| vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) | 190339437.5 ns | 194707917 ns | 0.98 |
| vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) | 311293458 ns | 313279354 ns | 0.99 |
| vgg16(32, 32, 3, 64)/forward/GPU/CUDA | 7676637.5 ns | 7673155 ns | 1.00 |
| vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) | 1078603979 ns | 1080950395.5 ns | 1.00 |
| vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) | 988164875 ns | 899873458 ns | 1.10 |
| vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) | 871100958 ns | 834690333 ns | 1.04 |
| vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) | 1152323979.5 ns | 1180116917 ns | 0.98 |
| vgg16(32, 32, 3, 64)/zygote/GPU/CUDA | 26676107 ns | 26459206.5 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 6125 ns | 5875 ns | 1.04 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5542 ns | 5417 ns | 1.02 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6625 ns | 6250 ns | 1.06 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5375 ns | 6084 ns | 0.88 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 111098 ns | 162725 ns | 0.68 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7708 ns | 7375 ns | 1.05 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7333 ns | 7084 ns | 1.04 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7583 ns | 7750 ns | 0.98 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6958 ns | 7625 ns | 0.91 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 498819 ns | 624677.5 ns | 0.80 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 625 ns | 625 ns | 1 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 666 ns | 666 ns | 1 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 667 ns | 0.94 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 541 ns | 583 ns | 0.93 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 24257 ns | 23758 ns | 1.02 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9375 ns | 9542 ns | 0.98 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 8667 ns | 9291 ns | 0.93 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9375 ns | 9584 ns | 0.98 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9208 ns | 9209 ns | 1.00 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 184515.5 ns | 225738 ns | 0.82 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 354291.5 ns | 352000 ns | 1.01 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 352292 ns | 352042 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 352229.5 ns | 354604.5 ns | 0.99 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 355750 ns | 353833 ns | 1.01 |
| bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA | 21862 ns | 21344 ns | 1.02 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 824562.5 ns | 822291 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 775208 ns | 812479 ns | 0.95 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 774792 ns | 824250 ns | 0.94 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 821812 ns | 831958 ns | 0.99 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA | 232768 ns | 304872 ns | 0.76 |
| batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 337458 ns | 337167 ns | 1.00 |
| batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 315020.5 ns | 343334 ns | 0.92 |
| batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 448104 ns | 446875 ns | 1.00 |
| batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 338562.5 ns | 316354.5 ns | 1.07 |
| batchedmm(16, Bsize=32)/forward/GPU/CUDA | 17974 ns | 18389 ns | 0.98 |
| batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 688937.5 ns | 695521 ns | 0.99 |
| batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 736417 ns | 750792 ns | 0.98 |
| batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 1019625 ns | 1026833 ns | 0.99 |
| batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 700125 ns | 688999.5 ns | 1.02 |
| batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 224901 ns | 282579.5 ns | 0.80 |
| batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 353708 ns | 356667 ns | 0.99 |
| batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 327187.5 ns | 354500 ns | 0.92 |
| batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 421250 ns | 421500 ns | 1.00 |
| batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 381333 ns | 347042 ns | 1.10 |
| batchedmm(16, Bsize=128)/forward/GPU/CUDA | 22774 ns | 22715 ns | 1.00 |
| batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 751625 ns | 754229 ns | 1.00 |
| batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 747792 ns | 753792 ns | 0.99 |
| batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 1067270.5 ns | 1072417 ns | 1.00 |
| batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 832521 ns | 823125 ns | 1.01 |
| batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 209326.5 ns | 256204.5 ns | 0.82 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 3542 ns | 3583 ns | 0.99 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 3542 ns | 3500 ns | 1.01 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 3959 ns | 3708 ns | 1.07 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 3583 ns | 3667 ns | 0.98 |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 18314 ns | 17612 ns | 1.04 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 4208 ns | 4208 ns | 1 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 4208 ns | 4500 ns | 0.94 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 4375 ns | 4292 ns | 1.02 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 4291 ns | 4417 ns | 0.97 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 228582.5 ns | 280326.5 ns | 0.82 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 4333 ns | 4209 ns | 1.03 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4042 ns | 4333 ns | 0.93 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 4708 ns | 4291 ns | 1.10 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4041 ns | 4125 ns | 0.98 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 170151.5 ns | 232867.5 ns | 0.73 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8750 ns | 8291 ns | 1.06 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8542 ns | 8500 ns | 1.00 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8541 ns | 8562.5 ns | 1.00 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8333 ns | 8667 ns | 0.96 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 1071112 ns | 1214158.5 ns | 0.88 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 204666 ns | 203792 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 208416 ns | 211375 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 210500 ns | 209083 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 199458 ns | 202541 ns | 0.98 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 34973 ns | 34629 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 644875 ns | 645167 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 623042 ns | 623895.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 622916 ns | 630084 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 629125 ns | 633833 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 316013 ns | 349768 ns | 0.90 |
| batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 965625.5 ns | 972062.5 ns | 0.99 |
| batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 931292 ns | 937916.5 ns | 0.99 |
| batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 949791 ns | 960125 ns | 0.99 |
| batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 1287375 ns | 1319708 ns | 0.98 |
| batchedmm(128, Bsize=128)/forward/GPU/CUDA | 207697.5 ns | 208475 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 4502250 ns | 4500166 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 4587333.5 ns | 4475687.5 ns | 1.02 |
| batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 4293249.5 ns | 4308250 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 6257042 ns | 6508250 ns | 0.96 |
| batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 935844 ns | 944786.5 ns | 0.99 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4042 ns | 4084 ns | 0.99 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3458 ns | 3750 ns | 0.92 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 4333 ns | 4083 ns | 1.06 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3417 ns | 3542 ns | 0.96 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 197475.5 ns | 226002.5 ns | 0.87 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7875 ns | 7542 ns | 1.04 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7667 ns | 7625 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7958 ns | 7625 ns | 1.04 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7083 ns | 7334 ns | 0.97 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 992361 ns | 1008436 ns | 0.98 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1639334 ns | 1647479.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1153625 ns | 1203104.5 ns | 0.96 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1347709 ns | 1378125 ns | 0.98 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2351833.5 ns | 2472896 ns | 0.95 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 214480 ns | 213582 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12352166.5 ns | 12309291 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9610125 ns | 9565666 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9257687.5 ns | 9280334 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 17946167 ns | 18216500 ns | 0.99 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 1947706 ns | 1940596 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17372666 ns | 17356917 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14382041.5 ns | 14358625 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14310583 ns | 14329312.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21072020.5 ns | 21175541 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 90333 ns | 133834 ns | 0.67 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 90833 ns | 90000 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 91625 ns | 93687 ns | 0.98 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 134166 ns | 90750 ns | 1.48 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 126661 ns | 125997 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2020563 ns | 2019458 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2038375 ns | 2029375 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1989208 ns | 2029667 ns | 0.98 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2026937.5 ns | 2049458 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1036489 ns | 1042357 ns | 0.99 |
| batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 344459 ns | 347333 ns | 0.99 |
| batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 323354 ns | 349250 ns | 0.93 |
| batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 395291.5 ns | 394583 ns | 1.00 |
| batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 315666.5 ns | 293978.5 ns | 1.07 |
| batchedmm(2, Bsize=4)/forward/GPU/CUDA | 15917 ns | 16455.5 ns | 0.97 |
| batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 703000 ns | 709041 ns | 0.99 |
| batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 719667 ns | 741583.5 ns | 0.97 |
| batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 1017062.5 ns | 1022875 ns | 0.99 |
| batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 658709 ns | 644791 ns | 1.02 |
| batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 197602.5 ns | 197069.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7333 ns | 7250 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5042 ns | 5875 ns | 0.86 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6041 ns | 6041 ns | 1 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 9958 ns | 10583 ns | 0.94 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 34203 ns | 34401 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 222334 ns | 224416.5 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 223750 ns | 220375 ns | 1.02 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 221667 ns | 231250 ns | 0.96 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 211938 ns | 236834 ns | 0.89 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 315474 ns | 318034 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3750 ns | 3708 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3750 ns | 3708 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3708 ns | 3709 ns | 1.00 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3667 ns | 3709 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 23085 ns | 23219 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14416 ns | 14375 ns | 1.00 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14167 ns | 14375 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14416 ns | 14417 ns | 1.00 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14334 ns | 14167 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 475619.5 ns | 484400.5 ns | 0.98 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 102583 ns | 97417 ns | 1.05 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 94625 ns | 94042 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 95792 ns | 97959 ns | 0.98 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 92375 ns | 95500 ns | 0.97 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 126060 ns | 125837 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1914833 ns | 1920250 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1935917 ns | 1649417 ns | 1.17 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1723083 ns | 1923437 ns | 0.90 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1923271 ns | 1953916 ns | 0.98 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 937375.5 ns | 974936 ns | 0.96 |
| lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) | 875000 ns | 879729.5 ns | 0.99 |
| lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) | 798167 ns | 832708 ns | 0.96 |
| lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) | 1216563 ns | 1229562.5 ns | 0.99 |
| lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) | 975458 ns | 939583 ns | 1.04 |
| lenet(28, 28, 1, 32)/forward/GPU/CUDA | 282281 ns | 281248 ns | 1.00 |
| lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) | 2835584 ns | 2831145.5 ns | 1.00 |
| lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) | 2458187.5 ns | 2527396 ns | 0.97 |
| lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) | 3311542 ns | 3353354.5 ns | 0.99 |
| lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) | 3418209 ns | 3411104.5 ns | 1.00 |
| lenet(28, 28, 1, 32)/zygote/GPU/CUDA | 1620696.5 ns | 1661947.5 ns | 0.98 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 18729 ns | 14854.5 ns | 1.26 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 15625 ns | 15583 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 17167 ns | 18792 ns | 0.91 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 15563 ns | 16000 ns | 0.97 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 144619.5 ns | 144462 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 255875 ns | 255958 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 216500 ns | 215583.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 215625 ns | 257583 ns | 0.84 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 254834 ns | 262500 ns | 0.97 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 653404.5 ns | 650445 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 222291.5 ns | 221375 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 220167 ns | 220792 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 222937.5 ns | 223083 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 223417 ns | 220646 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 276300.5 ns | 273454.5 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 510541.5 ns | 559542 ns | 0.91 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 501250 ns | 510542 ns | 0.98 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 498084 ns | 507813 ns | 0.98 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 556000 ns | 535208.5 ns | 1.04 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1413449 ns | 1396532 ns | 1.01 |
| batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 336938 ns | 328770.5 ns | 1.02 |
| batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 312000 ns | 336937 ns | 0.93 |
| batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 378334 ns | 370500 ns | 1.02 |
| batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 330583.5 ns | 299625 ns | 1.10 |
| batchedmm(16, Bsize=4)/forward/GPU/CUDA | 17443.5 ns | 17616 ns | 0.99 |
| batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 713625 ns | 711834 ns | 1.00 |
| batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 728875 ns | 732166.5 ns | 1.00 |
| batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 1013479.5 ns | 1024479.5 ns | 0.99 |
| batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 669791 ns | 657917 ns | 1.02 |
| batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 198819.5 ns | 200486.5 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 20187 ns | 18520.5 ns | 1.09 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 19166 ns | 19083 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 18750 ns | 19625 ns | 0.96 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 18167 ns | 18396 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 149298 ns | 147224 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 224104 ns | 213625 ns | 1.05 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 215021 ns | 221875 ns | 0.97 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 213250 ns | 221812.5 ns | 0.96 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 221417 ns | 237333 ns | 0.93 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1054396 ns | 951211 ns | 1.11 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 4500 ns | 4583 ns | 0.98 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 4604.5 ns | 4417 ns | 1.04 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5291 ns | 4917 ns | 1.08 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 4250 ns | 4437.5 ns | 0.96 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 250417 ns | 239868.5 ns | 1.04 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10666 ns | 10625 ns | 1.00 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 11167 ns | 10500 ns | 1.06 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10916 ns | 10958 ns | 1.00 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10667 ns | 10625 ns | 1.00 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 1105680.5 ns | 1112681.5 ns | 0.99 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3709 ns | 3791 ns | 0.98 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3667 ns | 3541 ns | 1.04 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 4542 ns | 4229.5 ns | 1.07 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3250 ns | 3833 ns | 0.85 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 255235.5 ns | 252769 ns | 1.01 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8000 ns | 7334 ns | 1.09 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7583 ns | 7792 ns | 0.97 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8291 ns | 7916 ns | 1.05 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7208 ns | 7437.5 ns | 0.97 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 1113468.5 ns | 1116124.5 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 23595542 ns | 23341875 ns | 1.01 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 43052124.5 ns | 34053354.5 ns | 1.26 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 38080375.5 ns | 37482854.5 ns | 1.02 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 34894124.5 ns | 35456625 ns | 0.98 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1838652.5 ns | 1845777.5 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 183963083 ns | 184378291 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 183192958 ns | 158584667 ns | 1.16 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 146668854 ns | 146193479 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 412535000 ns | 422496166.5 ns | 0.98 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 16493440 ns | 16510255 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 428118750 ns | 426674167 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 258337959 ns | 253893875 ns | 1.02 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 232950667 ns | 232875895.5 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 483252042 ns | 494805750 ns | 0.98 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 183708 ns | 184500 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 184708.5 ns | 183458 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 185084 ns | 185583 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 184708.5 ns | 183416.5 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 229170.5 ns | 231684 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 635625 ns | 599042 ns | 1.06 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 608958 ns | 586312.5 ns | 1.04 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 586750 ns | 636833 ns | 0.92 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 598000 ns | 641125 ns | 0.93 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1136450 ns | 1087543.5 ns | 1.04 |
| batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 3837375 ns | 3842645.5 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 3726667 ns | 3643229 ns | 1.02 |
| batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 3480459 ns | 3509333 ns | 0.99 |
| batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 5388895.5 ns | 5524187.5 ns | 0.98 |
| batchedmm(128, Bsize=512)/forward/GPU/CUDA | 537067 ns | 534809 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 17354583 ns | 17462833 ns | 0.99 |
| batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 17694167 ns | 17328500.5 ns | 1.02 |
| batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 16528146 ns | 16632083 ns | 0.99 |
| batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 22060667 ns | 23474479.5 ns | 0.94 |
| batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2634503 ns | 2613903 ns | 1.01 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 584 ns | 500 ns | 1.17 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 584 ns | 584 ns | 1 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 666 ns | 0.94 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 542 ns | 542 ns | 1 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 31760.5 ns | 32551 ns | 0.98 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9417 ns | 9375 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9375 ns | 8625 ns | 1.09 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9875 ns | 9792 ns | 1.01 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9041 ns | 9354.5 ns | 0.97 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 264747 ns | 264963 ns | 1.00 |
| vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) | 499432375 ns | 500529917 ns | 1.00 |
| vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) | 408185125 ns | 429131021 ns | 0.95 |
| vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) | 433810812.5 ns | 390085458 ns | 1.11 |
| vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) | 592107916 ns | 680776812.5 ns | 0.87 |
| vgg16(32, 32, 3, 128)/forward/GPU/CUDA | 12429503 ns | 12474289.5 ns | 1.00 |
| vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) | 2040048312.5 ns | 2050021916.5 ns | 1.00 |
| vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) | 1663687458 ns | 1635602292 ns | 1.02 |
| vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) | 1492613875 ns | 1501725478.5 ns | 0.99 |
| vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) | 2211016250 ns | 2237822875 ns | 0.99 |
| vgg16(32, 32, 3, 128)/zygote/GPU/CUDA | 49189263 ns | 49165291 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1645792 ns | 1648791.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1175479 ns | 1195792 ns | 0.98 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1372437.5 ns | 1379625 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2492875.5 ns | 2436187.5 ns | 1.02 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 216494.5 ns | 215012 ns | 1.01 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12712041.5 ns | 12725833.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9982875 ns | 9944041.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9669792 ns | 9667395.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18338875 ns | 18594104.5 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2038701 ns | 2038696 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17668854 ns | 17722000 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14759833 ns | 14694125 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14522791.5 ns | 14557833 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21750333 ns | 21533833 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 26417 ns | 26250 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 26250 ns | 26250 ns | 1 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 26291 ns | 26750 ns | 0.98 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 26250 ns | 26292 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 24090 ns | 23955 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 67416 ns | 66833 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 66750 ns | 66625 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 68000 ns | 66958 ns | 1.02 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 67125 ns | 66833 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 407713 ns | 403690.5 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 203750 ns | 202791 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 208250 ns | 209375 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 209750 ns | 209667 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 199542 ns | 200708 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 26545 ns | 26177 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 614458.5 ns | 612146 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 626479.5 ns | 622334 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 622000 ns | 680520.5 ns | 0.91 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 628041 ns | 634750 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 354394 ns | 350618 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 634000 ns | 650500 ns | 0.97 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 662125 ns | 542145.5 ns | 1.22 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 546833 ns | 634666 ns | 0.86 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 678084 ns | 679459 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 131699 ns | 131917 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2219958 ns | 2229542 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2290584 ns | 2231250 ns | 1.03 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2209458 ns | 2251687.5 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2242166.5 ns | 2330333 ns | 0.96 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1187469.5 ns | 1238942 ns | 0.96 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 19375 ns | 16854 ns | 1.15 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17917 ns | 19500 ns | 0.92 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 18770.5 ns | 19791.5 ns | 0.95 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 18396 ns | 17750 ns | 1.04 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 144324 ns | 144506 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 232750 ns | 230625 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 221167 ns | 260583 ns | 0.85 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 219917 ns | 261125 ns | 0.84 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 258396 ns | 265583.5 ns | 0.97 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1055691 ns | 1064679 ns | 0.99 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 667 ns | 625 ns | 1.07 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 625 ns | 625 ns | 1 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 667 ns | 625 ns | 1.07 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 542 ns | 542 ns | 1 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 23295 ns | 23448 ns | 0.99 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 10209 ns | 10125 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 10000 ns | 9792 ns | 1.02 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 10250 ns | 10000 ns | 1.02 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9666 ns | 9979 ns | 0.97 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 258972 ns | 257505.5 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 6000 ns | 6125 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5708 ns | 5625 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7083 ns | 6666 ns | 1.06 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5375 ns | 6084 ns | 0.88 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 231630 ns | 233944.5 ns | 0.99 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7625 ns | 7416 ns | 1.03 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7292 ns | 7334 ns | 0.99 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7334 ns | 7834 ns | 0.94 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6792 ns | 7417 ns | 0.92 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 802749 ns | 800597 ns | 1.00 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 2208 ns | 2209 ns | 1.00 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 2208 ns | 2292 ns | 0.96 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 2500 ns | 2208 ns | 1.13 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 2208 ns | 2250 ns | 0.98 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 18522 ns | 17989 ns | 1.03 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 6792 ns | 6541.5 ns | 1.04 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 6687.5 ns | 6542 ns | 1.02 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 6625 ns | 7125 ns | 0.93 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 6458 ns | 6750 ns | 0.96 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 332179 ns | 330052 ns | 1.01 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 755667 ns | 751958.5 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 749145.5 ns | 746604.5 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 749542 ns | 749167 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 746916 ns | 748959 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 21431 ns | 21090 ns | 1.02 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 790645.5 ns | 791292 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 775270.5 ns | 792333 ns | 0.98 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 775020.5 ns | 773291 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 810729 ns | 792291.5 ns | 1.02 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 298142.5 ns | 299003.5 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7416 ns | 7291 ns | 1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5208 ns | 5917 ns | 0.88 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5917 ns | 6083 ns | 0.97 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10166 ns | 10791 ns | 0.94 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 32730.5 ns | 33088.5 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 232208 ns | 233333 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 229250 ns | 229479 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 229083 ns | 269542 ns | 0.85 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 255375 ns | 220958 ns | 1.16 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 363419.5 ns | 359587 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 10333 ns | 10625 ns | 0.97 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 10166 ns | 10375 ns | 0.98 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 11250 ns | 10958 ns | 1.03 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 9917 ns | 10959 ns | 0.90 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 255351 ns | 249563.5 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 25167 ns | 25042 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 25125 ns | 24625 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 25667 ns | 25375 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 24792 ns | 25250 ns | 0.98 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 1125690 ns | 1114585 ns | 1.01 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 106221875 ns | 106488708 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 125394375 ns | 117008645.5 ns | 1.07 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 121390333 ns | 120350584 ns | 1.01 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 117359542 ns | 118085396 ns | 0.99 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 2631223 ns | 2661446 ns | 0.99 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 392505292 ns | 393399750 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 378545917 ns | 368428125 ns | 1.03 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 355444792 ns | 359138458 ns | 0.99 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 480457375 ns | 486814000 ns | 0.99 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 15266181 ns | 15211152 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 937472500 ns | 759103375 ns | 1.23 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 779964959 ns | 755373708 ns | 1.03 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 749323541.5 ns | 744752604 ns | 1.01 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 763325375 ns | 959286729.5 ns | 0.80 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7500 ns | 6896 ns | 1.09 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 8541.5 ns | 7791 ns | 1.10 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 9125 ns | 8250 ns | 1.11 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7042 ns | 7583 ns | 0.93 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 238515 ns | 240721 ns | 0.99 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14208 ns | 14458 ns | 0.98 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14834 ns | 14208.5 ns | 1.04 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14708 ns | 14750 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14166 ns | 14312.5 ns | 0.99 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 1101022.5 ns | 1072384 ns | 1.03 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6125 ns | 6292 ns | 0.97 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6042 ns | 6125 ns | 0.99 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 7062.5 ns | 7291 ns | 0.97 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5542 ns | 6458 ns | 0.86 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 237732.5 ns | 234548 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12625 ns | 12584 ns | 1.00 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12709 ns | 12625 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 12667 ns | 12959 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12125 ns | 12583 ns | 0.96 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 796510.5 ns | 784420 ns | 1.02 |
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 346208 ns | 347708 ns | 1.00 |
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 320958.5 ns | 386916.5 ns | 0.83 |
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 397021 ns | 398834 ns | 1.00 |
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 317417 ns | 292375 ns | 1.09 |
batchedmm(2, Bsize=128)/forward/GPU/CUDA | 17118 ns | 16947 ns | 1.01 |
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 709854 ns | 708249.5 ns | 1.00 |
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 726459 ns | 746000 ns | 0.97 |
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 1020999.5 ns | 1025229 ns | 1.00 |
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 663583.5 ns | 652416.5 ns | 1.02 |
batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 200899.5 ns | 199954 ns | 1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 416 ns | 375 ns | 1.11 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 416 ns | 416 ns | 1 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 333 ns | 0.88 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 23684.5 ns | 23200 ns | 1.02 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6625 ns | 6458 ns | 1.03 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6333 ns | 6417 ns | 0.99 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6541 ns | 6750 ns | 0.97 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6000 ns | 6542 ns | 0.92 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 241466.5 ns | 238715 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 5917 ns | 5958 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 5958 ns | 5917 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 5917 ns | 6000 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 5750 ns | 5917 ns | 0.97 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 24709 ns | 24219 ns | 1.02 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 21666 ns | 21250 ns | 1.02 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 21500 ns | 20875 ns | 1.03 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 21958.5 ns | 21417 ns | 1.03 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 20750 ns | 21896 ns | 0.95 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 265505.5 ns | 261648 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 144292 ns | 145458 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 144583.5 ns | 147521 ns | 0.98 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 147479 ns | 147770.5 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 190521 ns | 147458.5 ns | 1.29 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 168147.5 ns | 167051 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1320958.5 ns | 1322500 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1335437 ns | 1320041 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1300333 ns | 1325833 ns | 0.98 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1320104.5 ns | 1391458 ns | 0.95 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1368672 ns | 1346456 ns | 1.02 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 22625 ns | 22250 ns | 1.02 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 22125 ns | 24750 ns | 0.89 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 24208 ns | 23416 ns | 1.03 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 23084 ns | 22396 ns | 1.03 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 356959.5 ns | 353387 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 179167 ns | 178709 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 118875 ns | 118687.5 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 119167 ns | 127459 ns | 0.93 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 129709 ns | 134041.5 ns | 0.97 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1505917 ns | 1464281 ns | 1.03 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 417 ns | 375 ns | 1.11 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 333 ns | 0.88 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 23148 ns | 22942 ns | 1.01 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6625 ns | 6459 ns | 1.03 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6333 ns | 6500 ns | 0.97 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6458 ns | 6666 ns | 0.97 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6229.5 ns | 6750 ns | 0.92 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 259448 ns | 255510 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 5375 ns | 5000 ns | 1.07 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4625 ns | 4729.5 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 5334 ns | 5292 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4333 ns | 4709 ns | 0.92 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 259168.5 ns | 256450 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10541.5 ns | 10042 ns | 1.05 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10125 ns | 10167 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10500 ns | 10417 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10125 ns | 10125 ns | 1 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 1348181 ns | 1348843.5 ns | 1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1625 ns | 1584 ns | 1.03 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1625 ns | 1625 ns | 1 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1625 ns | 1667 ns | 0.97 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1583 ns | 1667 ns | 0.95 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 23350 ns | 22876 ns | 1.02 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 6000 ns | 5625 ns | 1.07 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 5625 ns | 5625 ns | 1 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 6041 ns | 6041 ns | 1 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 5625 ns | 5708 ns | 0.99 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 276521 ns | 272214.5 ns | 1.02 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 6816750 ns | 6888875 ns | 0.99 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 6363479 ns | 6384792 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 6523125 ns | 6514708.5 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 7623625 ns | 7555583 ns | 1.01 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 214095 ns | 214320 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 24024229 ns | 24087271 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 21333333 ns | 21278062.5 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 21044500 ns | 21040583 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 29721312.5 ns | 29921333 ns | 0.99 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2157342.5 ns | 2106395 ns | 1.02 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 48604000 ns | 37396292 ns | 1.30 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 45855104 ns | 45619104.5 ns | 1.01 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 45946750 ns | 45717854 ns | 1.01 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 38018271 ns | 49514208 ns | 0.77 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6208 ns | 6208 ns | 1 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 5750 ns | 6333 ns | 0.91 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6584 ns | 6459 ns | 1.02 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 5750 ns | 6083 ns | 0.95 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 238393 ns | 236136 ns | 1.01 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9041 ns | 9125 ns | 0.99 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8000 ns | 8666 ns | 0.92 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8291 ns | 8416 ns | 0.99 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8416 ns | 8583 ns | 0.98 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 1067250.5 ns | 1059780 ns | 1.01 |
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) | 1551208 ns | 1497208 ns | 1.04 |
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) | 1242333 ns | 1271146 ns | 0.98 |
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) | 1627229 ns | 1623333 ns | 1.00 |
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) | 2145291.5 ns | 2143312.5 ns | 1.00 |
lenet(28, 28, 1, 128)/forward/GPU/CUDA | 279384.5 ns | 273613.5 ns | 1.02 |
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) | 7908792 ns | 7900125 ns | 1.00 |
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) | 6541625 ns | 6605479 ns | 0.99 |
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) | 7076917 ns | 7156416.5 ns | 0.99 |
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) | 10471270.5 ns | 10528062.5 ns | 0.99 |
lenet(28, 28, 1, 128)/zygote/GPU/CUDA | 1889848 ns | 1850752 ns | 1.02 |
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 342084 ns | 343000 ns | 1.00 |
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 325667 ns | 349166.5 ns | 0.93 |
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 364520.5 ns | 383250 ns | 0.95 |
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 346875 ns | 325438 ns | 1.07 |
batchedmm(128, Bsize=4)/forward/GPU/CUDA | 43276.5 ns | 46572 ns | 0.93 |
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 751333.5 ns | 746124.5 ns | 1.01 |
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 784042 ns | 795499.5 ns | 0.99 |
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 1057167 ns | 1076208.5 ns | 0.98 |
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 760375 ns | 753291.5 ns | 1.01 |
batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 312654 ns | 309766 ns | 1.01 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 397687.5 ns | 397375 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 213000 ns | 287916 ns | 0.74 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 287750 ns | 288000 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 750292 ns | 749125 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 44710 ns | 44192 ns | 1.01 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 673645.5 ns | 666145.5 ns | 1.01 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 472000 ns | 531062.5 ns | 0.89 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 531792 ns | 529625 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 974417 ns | 975062.5 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 191361 ns | 188202 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 646583 ns | 646708 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 543750 ns | 543166.5 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 597208 ns | 654229 ns | 0.91 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 677459 ns | 659479 ns | 1.03 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 131924.5 ns | 132313.5 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2453750 ns | 2450208 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2497021 ns | 2447833 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2150458 ns | 2404020.5 ns | 0.89 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2458500 ns | 2562667 ns | 0.96 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1369206 ns | 1598744 ns | 0.86 |
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 342625 ns | 347208 ns | 0.99 |
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 320374.5 ns | 347542 ns | 0.92 |
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 395375 ns | 400125 ns | 0.99 |
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 319042 ns | 291604 ns | 1.09 |
batchedmm(2, Bsize=32)/forward/GPU/CUDA | 16400 ns | 16522 ns | 0.99 |
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 707145.5 ns | 706875 ns | 1.00 |
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 724417 ns | 734333 ns | 0.99 |
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 1012958 ns | 1028542 ns | 0.98 |
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 657083.5 ns | 647750 ns | 1.01 |
batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 199317.5 ns | 199294.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1458375 ns | 1458584 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1493042 ns | 1498042 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1499834 ns | 1499666 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1437375 ns | 1444167 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 40669 ns | 40454 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5122624.5 ns | 5120438 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5305041 ns | 5292292 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5297979 ns | 5286000 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4988562.5 ns | 5017937.5 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 199526 ns | 195965.5 ns | 1.02 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3708 ns | 3667 ns | 1.01 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3709 ns | 3667 ns | 1.01 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3708 ns | 3709 ns | 1.00 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3667 ns | 3709 ns | 0.99 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA | 33617 ns | 32802 ns | 1.02 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 15375 ns | 15125 ns | 1.02 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14917 ns | 15292 ns | 0.98 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 15416 ns | 15417 ns | 1.00 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 15083 ns | 14875 ns | 1.01 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA | 378853 ns | 372915.5 ns | 1.02 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 71125 ns | 70917 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 71083 ns | 71250 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 71167 ns | 70916 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 69833 ns | 71375 ns | 0.98 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA | 113699 ns | 112608 ns | 1.01 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 318167 ns | 317750 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 324084 ns | 318417 ns | 1.02 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 320084 ns | 318375 ns | 1.01 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 318708 ns | 327667 ns | 0.97 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA | 196046 ns | 192232 ns | 1.02 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 1125 ns | 1000 ns | 1.13 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 1084 ns | 1083 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 1083 ns | 1083 ns | 1 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 1000 ns | 1083 ns | 0.92 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 24007 ns | 23208 ns | 1.03 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8250 ns | 8000 ns | 1.03 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 7958 ns | 8042 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8667 ns | 8250 ns | 1.05 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8042 ns | 8250 ns | 0.97 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 262710.5 ns | 259321 ns | 1.01 |
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) | 468625 ns | 468417 ns | 1.00 |
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) | 454583 ns | 479458 ns | 0.95 |
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) | 548646 ns | 555416 ns | 0.99 |
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) | 553541.5 ns | 544792 ns | 1.02 |
batchedmm(128, Bsize=32)/forward/GPU/CUDA | 129403 ns | 128776.5 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) | 1392000 ns | 1386166.5 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) | 1394229 ns | 1391187.5 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) | 1613875 ns | 1623687.5 ns | 0.99 |
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) | 1592062.5 ns | 1644333.5 ns | 0.97 |
batchedmm(128, Bsize=32)/zygote/GPU/CUDA | 274426 ns | 275740 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 416 ns | 333 ns | 1.25 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 416 ns | 375 ns | 1.11 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 416 ns | 0.90 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 333 ns | 0.88 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 31844 ns | 31924 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6666 ns | 5958 ns | 1.12 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6500 ns | 6167 ns | 1.05 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6542 ns | 6459 ns | 1.01 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6083 ns | 6166 ns | 0.99 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 265642 ns | 262594 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1721792 ns | 1733625 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1722771 ns | 1722729.5 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1726625 ns | 1729958 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1721375 ns | 1727000 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 169529.5 ns | 168805 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4366500 ns | 4353667 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4385854.5 ns | 4366916.5 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4393625.5 ns | 4362042 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4358458 ns | 4429395.5 ns | 0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1321267 ns | 1264129.5 ns | 1.05 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 6708 ns | 6959 ns | 0.96 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 7042 ns | 6708 ns | 1.05 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 6958 ns | 7000 ns | 0.99 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 6667 ns | 6833 ns | 0.98 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA | 19868.5 ns | 20795 ns | 0.96 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 51416 ns | 51500 ns | 1.00 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 32792 ns | 38042 ns | 0.86 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 32666 ns | 47209 ns | 0.69 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 70521 ns | 48666.5 ns | 1.45 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA | 297123.5 ns | 295172.5 ns | 1.01 |
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) | 353542 ns | 355084 ns | 1.00 |
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) | 325167 ns | 350583 ns | 0.93 |
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) | 408125 ns | 423208.5 ns | 0.96 |
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) | 325916 ns | 295000 ns | 1.10 |
batchedmm(2, Bsize=512)/forward/GPU/CUDA | 18546 ns | 18329 ns | 1.01 |
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) | 718854 ns | 718562.5 ns | 1.00 |
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) | 731000 ns | 744125 ns | 0.98 |
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) | 1025333.5 ns | 1031500 ns | 0.99 |
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) | 687084 ns | 672625 ns | 1.02 |
batchedmm(2, Bsize=512)/zygote/GPU/CUDA | 335478 ns | 347666.5 ns | 0.96 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 75292 ns | 75042 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 75083 ns | 75250 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 75250 ns | 75333 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 74604.5 ns | 75584 ns | 0.99 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA | 47512 ns | 46603 ns | 1.02 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 324750 ns | 324708 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 334708 ns | 327334 ns | 1.02 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 326791.5 ns | 324375 ns | 1.01 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 324667 ns | 334062.5 ns | 0.97 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA | 212265 ns | 207370 ns | 1.02 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1484917 ns | 1485208 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1518458 ns | 1526250 ns | 0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1526750 ns | 1526625 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1463500 ns | 1467250 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 52587 ns | 51906 ns | 1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5132416.5 ns | 5116396.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5271708 ns | 5284312.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5296709 ns | 5277167 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4984250 ns | 5025562.5 ns | 0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 207065 ns | 203896.5 ns | 1.02 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 28250 ns | 28333 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 28250 ns | 28333 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 28292 ns | 28625 ns | 0.99 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 28250 ns | 28334 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA | 25113 ns | 24422 ns | 1.03 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 66542 ns | 66250 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 66125 ns | 66250 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 66750 ns | 66250 ns | 1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66292 ns | 66375 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA | 537900 ns | 519781.5 ns | 1.03 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s)
1468167
ns1501250
ns0.98
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s)
828250
ns1125791
ns0.74
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s)
1059083.5
ns1125104.5
ns0.94
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s)
2230354.5
ns2259459
ns0.99
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA
586046.5
ns571991
ns1.02
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s)
3081687.5
ns3070000
ns1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s)
2620708.5
ns2775000
ns0.94
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s)
2737021
ns2736500
ns1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s)
3814083
ns3899292
ns0.98
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA
2011434
ns2055229
ns0.98
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s)
8838209
ns8838896
ns1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s)
8744875
ns8809083.5
ns0.99
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s)
8797042
ns8782709
ns1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s)
6371334
ns6483958.5
ns0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
83791
ns80583
ns1.04
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
79333
ns81334
ns0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
81854.5
ns83645.5
ns0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
129083
ns136500
ns0.95
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
192348
ns192157
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2017875
ns2012958
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2022541
ns2009583
ns1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2018416.5
ns2015916.5
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
2016353.5
ns2051000
ns0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
798353
ns803108
ns0.99
This comment was automatically generated by workflow using github-action-benchmark.
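
For readers checking the figures by hand, the ratio reported for each benchmark appears to be the first timing divided by the second, rounded to two decimals, so values below 1.00 mean the first measurement was faster. The sketch below reproduces three of the ratios listed above under that assumption. The helper names `ratio` and `flag_changes` and the 5% tolerance are hypothetical, chosen only for illustration; they are not part of github-action-benchmark.

```python
# Minimal sketch: recompute the reported ratio from two timings (in ns).
# Assumption: ratio = first timing / second timing, rounded to 2 decimals.

def ratio(first_ns: float, second_ns: float) -> float:
    """Return the first timing divided by the second, rounded to 2 decimals."""
    return round(first_ns / second_ns, 2)


def flag_changes(rows: dict, tolerance: float = 0.05) -> dict:
    """Return the rows whose ratio differs from 1.0 by more than `tolerance`.

    `tolerance` is a hypothetical threshold used only for this illustration.
    """
    flagged = {}
    for name, (first_ns, second_ns) in rows.items():
        r = ratio(first_ns, second_ns)
        if abs(r - 1.0) > tolerance:
            flagged[name] = r
    return flagged


# Timings copied from three of the benchmark entries above.
rows = {
    "dense(512)/zygote/CPU/1 thread(s)": (324667, 334062.5),
    "mlp7layer_bn(tanh)/forward/CPU/4 thread(s)": (828250, 1125791),
    "groupnorm/forward/CPU/2 thread(s)": (83791, 80583),
}

for name, (first_ns, second_ns) in rows.items():
    print(f"{name}: {ratio(first_ns, second_ns):.2f}")

print(flag_changes(rows))
```

With a 5% tolerance only the `mlp7layer_bn(tanh)` forward pass on 4 CPU threads is flagged (ratio 0.74). The remaining entries stay within a few percent of 1.00, which is typical run-to-run noise for timings of this size.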