chore: bump crate-ci/typos from 1.27.0 to 1.27.3 (#1065)
Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.27.0 to 1.27.3.
- [Release notes](https://github.com/crate-ci/typos/releases)
- [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)
- [Commits](crate-ci/typos@v1.27.0...v1.27.3)

---
updated-dependencies:
- dependency-name: crate-ci/typos
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
0be7504
Lux Benchmarks
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
4292
ns4584
ns0.94
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
4166
ns4917
ns0.85
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
5000
ns5666
ns0.88
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
4187.5
ns4042
ns1.04
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA
60972
ns60487
ns1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
10459
ns10167
ns1.03
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
9666
ns11000
ns0.88
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
11583
ns10542
ns1.10
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
10375
ns10542
ns0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA
426712
ns424703
ns1.00
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s)
1125
ns1125
ns1
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s)
3000
ns1166
ns2.57
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s)
3041
ns1292
ns2.35
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s)
1083
ns1125
ns0.96
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA
18505
ns18464
ns1.00
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
4042
ns4000
ns1.01
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
4042
ns4000
ns1.01
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
4333
ns4208
ns1.03
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
4125
ns4083
ns1.01
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA
112061
ns109915.5
ns1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
55667
ns57375
ns0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
46208
ns38250
ns1.21
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
46792
ns46375
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
79959
ns81584
ns0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
37292.5
ns37506
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2048125
ns2012792
ns1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2094395.5
ns2093417
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2098292
ns2086646
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1986396
ns2000208
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
196342
ns197705
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
144771
ns147000
ns0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
146979.5
ns143145.5
ns1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
147708
ns149666
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
147167
ns147229.5
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
166937.5
ns168379
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1137083
ns1012208
ns1.12
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1121625
ns1152209
ns0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1134166
ns1110709
ns1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1113312.5
ns1119500
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
528613
ns522581.5
ns1.01
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
3375
ns4834
ns0.70
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
3541
ns3792
ns0.93
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
4459
ns4667
ns0.96
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
3583
ns3958
ns0.91
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
70628
ns65957
ns1.07
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
8791
ns9000
ns0.98
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
8416
ns9292
ns0.91
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
9792
ns9459
ns1.04
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
8916
ns8500
ns1.05
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
479018
ns469308.5
ns1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
15145.5
ns18167
ns0.83
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
14770.5
ns15625
ns0.95
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
18375
ns18917
ns0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
15500
ns16583
ns0.93
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
55206
ns52878
ns1.04
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
215083.5
ns252312.5
ns0.85
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
214000
ns215959
ns0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
214959
ns214625
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
212416
ns214583
ns0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
274790
ns267130
ns1.03
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s)
625
ns584
ns1.07
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s)
541
ns583
ns0.93
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s)
667
ns708
ns0.94
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s)
583
ns500
ns1.17
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA
17833
ns17462
ns1.02
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
1625
ns1500
ns1.08
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
1417
ns1459
ns0.97
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
1875
ns1750
ns1.07
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
1417
ns1417
ns1
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA
103732
ns100800
ns1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
6917
ns7167
ns0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
5875
ns5125
ns1.15
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
5834
ns5875
ns0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
9834
ns9833
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
23612
ns23225
ns1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
230000
ns259792
ns0.89
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
233000
ns232458.5
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
229104.5
ns229520.5
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
212834
ns221875
ns0.96
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
168410
ns166055.5
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s)
3917
ns3875
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s)
3875
ns3875
ns1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s)
3875
ns3833
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s)
3833
ns3875
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA
23718
ns23597
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
16834
ns17125
ns0.98
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
16750
ns16541
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
17250
ns16833
ns1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
16667
ns16667
ns1
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA
164253.5
ns160583
ns1.02
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s)
603959
ns576583
ns1.05
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s)
576333
ns581541
ns0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s)
573083
ns573687.5
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s)
577708
ns575750
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA
113138.5
ns113170
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s)
1445791.5
ns1423250
ns1.02
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s)
1433375
ns1430791.5
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s)
1418104
ns1431000
ns0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s)
1417292
ns1421792
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA
213357
ns207811
ns1.03
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s)
1038750
ns1075667
ns0.97
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s)
969875
ns948313
ns1.02
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s)
1352625.5
ns1346646
ns1.00
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s)
1297479.5
ns1310750
ns0.99
lenet(28, 28, 1, 64)/forward/GPU/CUDA
277707.5
ns270367.5
ns1.03
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s)
5979604.5
ns5995500.5
ns1.00
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s)
4659833
ns4593750
ns1.01
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s)
4957333.5
ns4976208.5
ns1.00
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s)
5723916.5
ns5505395.5
ns1.04
lenet(28, 28, 1, 64)/zygote/GPU/CUDA
1103929
ns1090295.5
ns1.01
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s)
542
ns542
ns1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s)
500
ns542
ns0.92
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s)
542
ns542
ns1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s)
500
ns500
ns1
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA
23796.5
ns23458
ns1.01
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
2125
ns2167
ns0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
2167
ns2167
ns1
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
2208
ns2208
ns1
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
2083
ns2083
ns1
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA
176570.5
ns172926
ns1.02
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
5062.5
ns6458
ns0.78
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
6333
ns5125
ns1.24
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
7083
ns7208
ns0.98
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
4979
ns4333
ns1.15
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA
66510
ns64432
ns1.03
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
11375
ns11458
ns0.99
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
11334
ns11583
ns0.98
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
12167
ns11958
ns1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
11375
ns10833
ns1.05
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA
456940
ns442914.5
ns1.03
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
7167
ns7792
ns0.92
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
7000
ns7041
ns0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
7666
ns8458
ns0.91
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
7187.5
ns6375
ns1.13
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
53184
ns51253.5
ns1.04
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
17625
ns17833.5
ns0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
17334
ns18000
ns0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
18291
ns18542
ns0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
17083
ns16875
ns1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
308823
ns298470
ns1.03
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s)
583
ns542
ns1.08
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s)
458
ns583
ns0.79
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s)
583
ns583
ns1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s)
500
ns500
ns1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA
33350
ns32349
ns1.03
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
9083
ns8875
ns1.02
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
9000
ns9333
ns0.96
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
9166
ns9291
ns0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
9083
ns8875
ns1.02
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA
162050.5
ns157321.5
ns1.03
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s)
64916
ns64792
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s)
64625
ns64667
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s)
64625
ns64750
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s)
64458
ns64375
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA
112372
ns111151
ns1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s)
289833
ns275916
ns1.05
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s)
279500
ns293917
ns0.95
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s)
282167
ns291666
ns0.97
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s)
278917
ns274417
ns1.02
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA
191380
ns183162.5
ns1.04
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s)
3267875
ns3323375
ns0.98
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s)
3100000
ns2861812
ns1.08
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s)
3087125
ns3049625
ns1.01
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s)
3985042
ns3939000
ns1.01
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA
582393
ns580012.5
ns1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s)
7531208
ns7623333
ns0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s)
7465041
ns7263625
ns1.03
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s)
7467854
ns7327354
ns1.02
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s)
8032146
ns8196041
ns0.98
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA
1388075.5
ns1311084.5
ns1.06
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s)
19401417
ns18847291
ns1.03
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s)
19152875
ns19137541
ns1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s)
19084458
ns19205875
ns0.99
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s)
15677125
ns15425792
ns1.02
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s)
24347125
ns23654958
ns1.03
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s)
34024291.5
ns43401291.5
ns0.78
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s)
37229500
ns37089791.5
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s)
34880042
ns34880750
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA
1857760.5
ns1841996
ns1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s)
193645562
ns188777125
ns1.03
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s)
164151291.5
ns178489062.5
ns0.92
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s)
151654958
ns152827958
ns0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s)
439821375
ns438354958
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA
13858604
ns13884864
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s)
293543645.5
ns289730542
ns1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s)
336893125.5
ns273653750
ns1.23
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s)
299206416.5
ns300146084
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s)
333915208
ns363130458
ns0.92
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
23584
ns24959
ns0.94
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
25271
ns23166
ns1.09
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
25750
ns26250
ns0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
23834
ns21541
ns1.11
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
97106.5
ns93319
ns1.04
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
103958.5
ns104333
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
104229.5
ns104208
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
113750
ns104041
ns1.09
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
103166
ns103292
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
509956.5
ns494914.5
ns1.03
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
6187.5
ns7375
ns0.84
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
7084
ns7062.5
ns1.00
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
7833
ns8083
ns0.97
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
7000
ns6959
ns1.01
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
68645
ns66496.5
ns1.03
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
14687.5
ns15333
ns0.96
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
15500
ns16334
ns0.95
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
16750
ns15958
ns1.05
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
14833
ns14750
ns1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
482599
ns467266
ns1.03
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s)
2799667
ns3009270.5
ns0.93
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s)
2078479.5
ns2083125
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s)
2280312.5
ns2291250
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s)
4472000
ns4920209
ns0.91
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA
587881
ns585803
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s)
24038167
ns23529584
ns1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s)
17995729
ns18299083
ns0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s)
17057395.5
ns17952042
ns0.95
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s)
35636667
ns35984709
ns0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA
3108956.5
ns3109259
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s)
33833209
ns33275020.5
ns1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s)
27496229.5
ns28041667
ns0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s)
27409833
ns27515834
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s)
41716709
ns41779084
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
72750
ns75459
ns0.96
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
75542
ns81146
ns0.93
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
76875
ns76416.5
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
74541.5
ns72291
ns1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
101955.5
ns100380
ns1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
296812.5
ns285041.5
ns1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
302854.5
ns311542
ns0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
315187.5
ns292833
ns1.08
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
204333.5
ns315375
ns0.65
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
551842.5
ns544347
ns1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
12042
ns12667
ns0.95
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
13500
ns12771
ns1.06
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
13250
ns13750
ns0.96
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
12958
ns12083
ns1.07
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
71982
ns70337.5
ns1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
26833
ns27042
ns0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
27333
ns27625
ns0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
27979.5
ns27708
ns1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
27062.5
ns26875
ns1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
484665
ns473629
ns1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
12625
ns13083
ns0.96
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
12792
ns13250
ns0.97
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
13625
ns14833
ns0.92
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
13500
ns13125
ns1.03
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
54014
ns52795
ns1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
26209
ns26250
ns1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
26000
ns26750
ns0.97
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
26250
ns28792
ns0.91
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
26542
ns26167
ns1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
309883
ns304928.5
ns1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
180709
ns181791
ns0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
181792
ns181750
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
185459
ns184875
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
181375
ns181833
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
58480.5
ns56540.5
ns1.03
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
585521
ns615187.5
ns0.95
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
595417
ns620771.5
ns0.96
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
592000
ns583541
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
582583
ns595499.5
ns0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
290043
ns285956
ns1.01
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s)
6292
ns6958
ns0.90
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s)
7333.5
ns7083
ns1.04
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s)
7458
ns8041
ns0.93
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s)
7291
ns6375
ns1.14
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA
70388.5
ns70068.5
ns1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
14458
ns14375
ns1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
14729.5
ns15333
ns0.96
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
15187.5
ns15333
ns0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
14417
ns14500
ns0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA
470302.5
ns463652.5
ns1.01
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s)
1190583
ns1234312.5
ns0.96
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s)
1259542
ns1279667
ns0.98
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s)
1286958
ns1269833.5
ns1.01
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s)
1309792
ns1312458
ns1.00
batchedmm(512, Bsize=4)/forward/GPU/CUDA
301566
ns301465
ns1.00
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s)
4329292
ns4127187.5
ns1.05
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s)
4329916
ns4510874.5
ns0.96
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s)
4565562
ns4533354
ns1.01
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s)
4476104.5
ns4443687.5
ns1.01
batchedmm(512, Bsize=4)/zygote/GPU/CUDA
1040707.5
ns1047444
ns0.99
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s)
1833
ns1834
ns1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s)
1834
ns1833
ns1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s)
1875
ns1833
ns1.02
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s)
1833
ns1833
ns1
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA
24056
ns23871
ns1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s)
4875
ns4917
ns0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s)
4959
ns4917
ns1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s)
4959
ns4959
ns1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s)
4875
ns4875
ns1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA
191633.5
ns190792.5
ns1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
6041.5
ns7041.5
ns0.86
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
7062.5
ns6292
ns1.12
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
8042
ns9208
ns0.87
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
7083
ns7166
ns0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
55968.5
ns56472
ns0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
11750
ns11750
ns1
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
10354.5
ns11584
ns0.89
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
12125
ns11812.5
ns1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
10917
ns10792
ns1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
334934
ns335267
ns1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s)
333
ns333
ns1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s)
333
ns292
ns1.14
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s)
333
ns292
ns1.14
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s)
292
ns292
ns1
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA
23116
ns23092
ns1.00
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
2917
ns2958
ns0.99
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
2750
ns2667
ns1.03
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
3084
ns2667
ns1.16
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
2709
ns2708
ns1.00
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA
162654.5
ns161307
ns1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
12770.5
ns14395.5
ns0.89
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
13709
ns12333
ns1.11
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
15250
ns14917
ns1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
13875
ns13145.5
ns1.06
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA
57764
ns56807.5
ns1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
25417
ns25375
ns1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
24583
ns25333
ns0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
25167
ns24958
ns1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
25041.5
ns25333
ns0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA
295171
ns292514
ns1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s)
4167
ns4125
ns1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s)
4167
ns4167
ns1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s)
4208
ns4167
ns1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s)
4167
ns4125
ns1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA
25293
ns25065
ns1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
16125
ns16333
ns0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
16250
ns16000
ns1.02
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
16917
ns16250
ns1.04
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
16084
ns16167
ns0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA
200954
ns198557.5
ns1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
5791
ns5791
ns1
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
5792
ns5833
ns0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
5750
ns5792
ns0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
5791
ns5750
ns1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA
33912
ns33912.5
ns1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
20959
ns21041
ns1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
21041
ns21250
ns0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
21583
ns21395.5
ns1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
20792
ns21042
ns0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA
176959
ns176321
ns1.00
| batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 386291 ns | 408208 ns | 0.95 |
| batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 372708 ns | 363583.5 ns | 1.03 |
| batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 486562.5 ns | 492667 ns | 0.99 |
| batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 519792 ns | 523542 ns | 0.99 |
| batchedmm(16, Bsize=512)/forward/GPU/CUDA | 67330 ns | 67347 ns | 1.00 |
| batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 1011291 ns | 978667 ns | 1.03 |
| batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 873062.5 ns | 891000.5 ns | 0.98 |
| batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 1237521 ns | 1242958 ns | 1.00 |
| batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 1397209 ns | 1420417 ns | 0.98 |
| batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 191794.5 ns | 190609 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 80250 ns | 82666 ns | 0.97 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 81792 ns | 82709 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 86000 ns | 85834 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83375 ns | 133542 ns | 0.62 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 194543 ns | 193457 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1932750 ns | 1923750 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1920541.5 ns | 1936250.5 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1918916 ns | 1914520.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1923021 ns | 1920083 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 395736 ns | 399634.5 ns | 0.99 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 333 ns | 0.88 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 22304 ns | 22639 ns | 0.99 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1833 ns | 1834 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1875 ns | 1834 ns | 1.02 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1875 ns | 1875 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1792 ns | 1792 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 170577 ns | 174147.5 ns | 0.98 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 6562.5 ns | 8542 ns | 0.77 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 6584 ns | 7292 ns | 0.90 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 8687.5 ns | 9083 ns | 0.96 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6917 ns | 6541 ns | 1.06 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 58494 ns | 60578 ns | 0.97 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9500 ns | 9542 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9083 ns | 9479.5 ns | 0.96 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9500 ns | 9542 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9000 ns | 9541 ns | 0.94 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 299864 ns | 313158.5 ns | 0.96 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 158867583 ns | 120031270.5 ns | 1.32 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 173898792 ns | 181860604 ns | 0.96 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 148221333 ns | 147859583 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 104030875 ns | 107036271 ns | 0.97 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5486248 ns | 5506155 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 681447729.5 ns | 615708666.5 ns | 1.11 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 554483292 ns | 581207833 ns | 0.95 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 449606833 ns | 450770312.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 759359917 ns | 758274833.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 38204857 ns | 34927722 ns | 1.09 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 704631875 ns | 650246750 ns | 1.08 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 667416541.5 ns | 685688396 ns | 0.97 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 592203208.5 ns | 577502729 ns | 1.03 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 746616584 ns | 743657333 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57333 ns | 59167 ns | 0.97 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46750 ns | 39333 ns | 1.19 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47167 ns | 47625 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83083 ns | 83542 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 38237 ns | 38483 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1937667 ns | 1924917 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1966479.5 ns | 1972334 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1981312 ns | 1976458 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1666938 ns | 1895208 ns | 0.88 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 175800.5 ns | 176241.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 268167 ns | 270958 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 268062.5 ns | 269042 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 270375 ns | 270875 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 267833.5 ns | 267958 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 121785.5 ns | 128472 ns | 0.95 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 676083.5 ns | 682312.5 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 661375 ns | 684021 ns | 0.97 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 682625 ns | 678333 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 594229.5 ns | 683083 ns | 0.87 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 667322.5 ns | 712823 ns | 0.94 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2162666.5 ns | 2110062.5 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2158625 ns | 2217708.5 ns | 0.97 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2235541.5 ns | 2221875 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2184396 ns | 2230541 ns | 0.98 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 134022 ns | 134372 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5614125 ns | 5507000 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5487000 ns | 5539625 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5503834 ns | 5512958 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5397958 ns | 5509604 ns | 0.98 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 730476 ns | 755964 ns | 0.97 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 649042 ns | 638125 ns | 1.02 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 648000 ns | 651667 ns | 0.99 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 645042 ns | 638459 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 640167 ns | 647208 ns | 0.99 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 47671 ns | 47881 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1819333 ns | 1826416 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1719500 ns | 1675750 ns | 1.03 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1722333 ns | 1720875 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 2100708 ns | 2104000 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 226229 ns | 224321 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 56375 ns | 58208 ns | 0.97 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46042 ns | 38792 ns | 1.19 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 45375 ns | 46584 ns | 0.97 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 81917 ns | 83542 ns | 0.98 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 28944 ns | 29060 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2043125 ns | 2031958 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2091959 ns | 2100291.5 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2087459 ns | 2085291 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1994792 ns | 2007250 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 191139 ns | 191693.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 13390792 ns | 13371646.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 12557146 ns | 12465792 ns | 1.01 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 12640375 ns | 12501042 ns | 1.01 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 15104000 ns | 15188916 ns | 0.99 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 518367 ns | 510743.5 ns | 1.01 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 47595458 ns | 47270208 ns | 1.01 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 41795625 ns | 42049416.5 ns | 0.99 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 41070000.5 ns | 41051834 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 58470417 ns | 58110084 ns | 1.01 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3196312 ns | 3204565.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 97085500 ns | 96634583 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 68418416 ns | 91624583 ns | 0.75 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 90577750 ns | 90630541 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 76363042 ns | 98906458.5 ns | 0.77 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 56625 ns | 58500 ns | 0.97 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 47250 ns | 38709 ns | 1.22 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47459 ns | 47125 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 79458 ns | 83541 ns | 0.95 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 46662 ns | 47960 ns | 0.97 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1927541.5 ns | 1920000 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1977249.5 ns | 1969792 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1980458.5 ns | 1972500 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1885937.5 ns | 1889834 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 191777.5 ns | 192720 ns | 1.00 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 292 ns | 416 ns | 0.70 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 417 ns | 0.90 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 333 ns | 292 ns | 1.14 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 32632 ns | 31940 ns | 1.02 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6833 ns | 6750 ns | 1.01 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6375 ns | 6625 ns | 0.96 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6562.5 ns | 6583 ns | 1.00 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6542 ns | 6250 ns | 1.05 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 170723.5 ns | 171690.5 ns | 0.99 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 291 ns | 292 ns | 1.00 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 250 ns | 292 ns | 0.86 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 31798 ns | 31426 ns | 1.01 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 2792 ns | 2833 ns | 0.99 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 2875 ns | 2792 ns | 1.03 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 2917 ns | 2834 ns | 1.03 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 2584 ns | 2625 ns | 0.98 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 160722.5 ns | 160271 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 324739354.5 ns | 287478708.5 ns | 1.13 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 339525125 ns | 347117687.5 ns | 0.98 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 313761354 ns | 313742875 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 270586625 ns | 271337417 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 7046631.5 ns | 7120485.5 ns | 0.99 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 1056579291.5 ns | 999672583 ns | 1.06 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 938968625 ns | 962585125 ns | 0.98 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 854294979.5 ns | 847863396 ns | 1.01 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 1160220458 ns | 1159606875 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 33965020 ns | 34018012.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 1708151417 ns | 1668327625 ns | 1.02 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 1335733187.5 ns | 1694566583 ns | 0.79 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 1644800875 ns | 1646047208 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 1296542916.5 ns | 1665789292 ns | 0.78 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1415104.5 ns | 1415313 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1411979.5 ns | 1417167 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1416812.5 ns | 1417459 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1409584 ns | 1412583 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 128051 ns | 128511 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5069625 ns | 5021792 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5045062 ns | 5044792 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5032541 ns | 5021250 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5026125 ns | 5024292 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 538087.5 ns | 495850 ns | 1.09 |
| vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) | 169153250 ns | 169190166 ns | 1.00 |
| vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) | 121641542 ns | 179239187.5 ns | 0.68 |
| vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) | 130222229.5 ns | 128995104.5 ns | 1.01 |
| vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) | 165507937.5 ns | 162929271 ns | 1.02 |
| vgg16(32, 32, 3, 32)/forward/GPU/CUDA | 4915832 ns | 4883493 ns | 1.01 |
| vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) | 850025583 ns | 671536958 ns | 1.27 |
| vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) | 641237459 ns | 604481292 ns | 1.06 |
| vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) | 532476541 ns | 531751292 ns | 1.00 |
| vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) | 683223459 ns | 681136250 ns | 1.00 |
| vgg16(32, 32, 3, 32)/zygote/GPU/CUDA | 18105472 ns | 16104554 ns | 1.12 |
| batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 9040458 ns | 8980854 ns | 1.01 |
| batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 8724291 ns | 8853334 ns | 0.99 |
| batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 7847084 ns | 7886771 ns | 0.99 |
| batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 10130834 ns | 10140625 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1594949 ns | 1602269.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 37056958.5 ns | 36048625 ns | 1.03 |
| batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 36670666.5 ns | 37859417 ns | 0.97 |
| batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 33132917 ns | 33187042 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 40008520.5 ns | 39063937.5 ns | 1.02 |
| batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6471272 ns | 8827671 ns | 0.73 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 47395.5 ns | 47666 ns | 0.99 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 47417 ns | 47667 ns | 0.99 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 47666 ns | 47625 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 47416 ns | 47542 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 18474 ns | 18332 ns | 1.01 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 50292 ns | 50416 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 50208 ns | 50500 ns | 0.99 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 50875 ns | 50541 ns | 1.01 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 50125 ns | 53000 ns | 0.95 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 171116.5 ns | 183394 ns | 0.93 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 7708 ns | 7833 ns | 0.98 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 8250 ns | 7500 ns | 1.10 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 9917 ns | 9375 ns | 1.06 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 7500 ns | 6979.5 ns | 1.07 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 81170 ns | 85722.5 ns | 0.95 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10416.5 ns | 10500 ns | 0.99 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10083 ns | 10500 ns | 0.96 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10166 ns | 10625 ns | 0.96 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10375 ns | 10167 ns | 1.02 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 468151 ns | 484512.5 ns | 0.97 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 7937.5 ns | 9250 ns | 0.86 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 8417 ns | 6750 ns | 1.25 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 9334 ns | 9417 ns | 0.99 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6500 ns | 7792 ns | 0.83 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 96340.5 ns | 105586.5 ns | 0.91 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 13750 ns | 13083 ns | 1.05 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 13083 ns | 13250 ns | 0.99 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13708 ns | 13458.5 ns | 1.02 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 13542 ns | 13417 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 430131.5 ns | 467808.5 ns | 0.92 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 1042 ns | 1083 ns | 0.96 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 1083 ns | 1083 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 1084 ns | 1083 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 1083 ns | 1042 ns | 1.04 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 32020 ns | 31641 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8208 ns | 8167 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8083 ns | 8209 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8291 ns | 8291 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8541 ns | 8125 ns | 1.05 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 194284 ns | 195119.5 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 23250 ns | 25167 ns | 0.92 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 23166 ns | 23250 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 23479.5 ns | 23270.5 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 23333 ns | 23125 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 18346 ns | 18534 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 52583 ns | 53062 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 52458.5 ns | 52375 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 52875 ns | 52500 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 52583 ns | 52708 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 246069.5 ns | 252220 ns | 0.98 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1414770.5 ns | 1400708 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1448500 ns | 1409834 ns | 1.03 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1408812.5 ns | 1399229.5 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1400666 ns | 1398458 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 195901.5 ns | 194493.5 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5041521 ns | 5016000 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5009562.5 ns | 5040334 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5020750 ns | 4993708.5 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5011875 ns | 4643770.5 ns | 1.08 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 577943 ns | 597903.5 ns | 0.97 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 3051687.5 ns | 3046458 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2098000 ns | 2118792 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2285625 ns | 2287146 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4540833.5 ns | 4859250 ns | 0.93 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 580172 ns | 581676 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 24668666.5 ns | 24338167 ns | 1.01 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 18916375 ns | 19105334 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 18998375 ns | 18916917 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 36603708 ns | 36315667 ns | 1.01 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3187885.5 ns | 3195442 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 34405834 ns | 33985645.5 ns | 1.01 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 28320666.5 ns | 28693250 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 28063958.5 ns | 27979104.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 41741750 ns | 41435375 ns | 1.01 |
| batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 141994958 ns | 144577667 ns | 0.98 |
| batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 140315020.5 ns | 142667333 ns | 0.98 |
| batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 124968083.5 ns | 124796041.5 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 173994708.5 ns | 174395646 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22772769 ns | 22784954 ns | 1.00 |
| batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 945646312.5 ns | 908417417 ns | 1.04 |
| batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 821550291 ns | 866595875 ns | 0.95 |
| batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 1293202000 ns | 690147541 ns | 1.87 |
| batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 687734375 ns | 679371625 ns | 1.01 |
| batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 118417230 ns | 118837225 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 82875 ns | 76312 ns | 1.09 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 87604 ns | 76708.5 ns | 1.14 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 78708 ns | 78062.5 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 73708.5 ns | 74292 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 218863.5 ns | 239745 ns | 0.91 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 201083.5 ns | 279187.5 ns | 0.72 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 286750 ns | 297958 ns | 0.96 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 288834 ns | 283125 ns | 1.02 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 250916.5 ns | 265791.5 ns | 0.94 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1188088 ns | 1232585 ns | 0.96 |
| batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 36439417 ns | 35449875 ns | 1.03 |
| batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 35404270.5 ns | 35824917 ns | 0.99 |
| batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 32153583.5 ns | 32070395.5 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 40973604 ns | 40877625 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5842252 ns | 5847896 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 151588250 ns | 147901291 ns | 1.02 |
| batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 153155979.5 ns | 155872291 ns | 0.98 |
| batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 137699666.5 ns | 133368083 ns | 1.03 |
| batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 287172166 ns | 286886250 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 34872027 ns | 34880972 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 158128125 ns | 121105063 ns | 1.31 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 173816375 ns | 181834292 ns | 0.96 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 147404958 ns | 147760625 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 107977875 ns | 101356500 ns | 1.07 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5466563 ns | 5478431 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 520328562 ns | 473677042 ns | 1.10 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 465862417 ns | 485888583.5 ns | 0.96 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 438730375 ns | 437646959 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 743725417 ns | 740881667 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 35165135 ns | 32245879 ns | 1.09 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 693072125 ns | 707376812.5 ns | 0.98 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 654827271 ns | 667253771 ns | 0.98 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 573433167 ns | 576063750 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 851476833 ns | 852206792 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) | 1174312.5 ns | 1266333 ns | 0.93 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) | 994625 ns | 788917 ns | 1.26 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) | 961895.5 ns | 969500 ns | 0.99 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) | 2072333.5 ns | 2069208.5 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA | 579719 ns | 586368.5 ns | 0.99 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) | 2931333.5 ns | 2969541 ns | 0.99 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) | 2614999.5 ns | 2523083 ns | 1.04 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) | 2630875 ns | 2620708 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) | 3710749.5 ns | 3700583 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA | 1749092 ns | 1794949 ns | 0.97 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) | 6772375 ns | 6640437.5 ns | 1.02 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) | 6502416 ns | 6484958 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) | 6513500 ns | 6451083 ns | 1.01 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) | 4449063 ns | 4447979 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7167 ns | 7500 ns | 0.96 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6125 ns | 5334 ns | 1.15 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6250 ns | 6125 ns | 1.02 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 9958 ns | 9917 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 25514 ns | 25270 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 213125 ns | 212167 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 220000 ns | 221000 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 220167 ns | 221125 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 206583 ns | 207458 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 255014.5 ns | 252957 ns | 1.01 |
| vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) | 312678917 ns | 313894603.5 ns | 1.00 |
| vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) | 213764812.5 ns | 280731020.5 ns | 0.76 |
| vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) | 197457396 ns | 185850791.5 ns | 1.06 |
| vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) | 311892292 ns | 312245084 ns | 1.00 |
| vgg16(32, 32, 3, 64)/forward/GPU/CUDA | 7676616 ns | 7682659 ns | 1.00 |
| vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) | 1092379645.5 ns | 1079816500.5 ns | 1.01 |
| vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) | 904888770.5 ns | 989067125 ns | 0.91 |
| vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) | 812108375 ns | 810903834 ns | 1.00 |
| vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) | 1158438958 ns | 1155211625 ns | 1.00 |
| vgg16(32, 32, 3, 64)/zygote/GPU/CUDA | 26462420 ns | 26590890 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5563 ns | 7416.5 ns | 0.75 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5666 ns | 6209 ns | 0.91 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 8000 ns | 6917 ns | 1.16 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6125 ns | 5729.5 ns | 1.07 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 162748.5 ns | 151351 ns | 1.08 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7583 ns | 7604.5 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7333 ns | 7542 ns | 0.97 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7583 ns | 7541 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7916 ns | 7542 ns | 1.05 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 614344.5 ns | 598449 ns | 1.03 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 541 ns | 541 ns | 1 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 583 ns | 541 ns | 1.08 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 542 ns | 625 ns | 0.87 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 458 ns | 459 ns | 1.00 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 24176 ns | 24254 ns | 1.00 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9709 ns | 9416 ns | 1.03 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9333 ns | 9292 ns | 1.00 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9791 ns | 9458 ns | 1.04 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9291 ns | 9250 ns | 1.00 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 206786 ns | 214013.5 ns | 0.97 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 353875 ns | 352917 ns | 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 351166 ns | 355083.5 ns | 0.99
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 351250 ns | 350833 ns | 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 351229.5 ns | 353583 ns | 0.99
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA | 21232 ns | 21515 ns | 0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 821062.5 ns | 828979 ns | 0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 775209 ns | 787167 ns | 0.98
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 774125 ns | 774312.5 ns | 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 821375 ns | 823875 ns | 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA | 296042.5 ns | 271369.5 ns | 1.09
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 315437.5 ns | 338958.5 ns | 0.93
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 340458 ns | 320167 ns | 1.06
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 450500 ns | 453291 ns | 0.99
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 330333 ns | 331895.5 ns | 1.00
batchedmm(16, Bsize=32)/forward/GPU/CUDA | 18319 ns | 18690 ns | 0.98
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 696292 ns | 696291 ns | 1.00
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 743062.5 ns | 744854.5 ns | 1.00
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 1033542 ns | 1036229 ns | 1.00
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 698416 ns | 686042 ns | 1.02
batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 250758.5 ns | 234671 ns | 1.07
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 331042 ns | 361375 ns | 0.92
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 348542 ns | 336417 ns | 1.04
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 417875 ns | 425792 ns | 0.98
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 368167 ns | 377584 ns | 0.98
batchedmm(16, Bsize=128)/forward/GPU/CUDA | 22961 ns | 22985 ns | 1.00
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 756354 ns | 760187 ns | 0.99
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 753124.5 ns | 753000 ns | 1.00
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 1077937.5 ns | 1084125 ns | 0.99
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 824208 ns | 812791.5 ns | 1.01
batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 223744 ns | 215024 ns | 1.04
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 3542 ns | 3625 ns | 0.98
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 3500 ns | 3708 ns | 0.94
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 3625 ns | 3625 ns | 1
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 3458 ns | 3541 ns | 0.98
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 17947 ns | 18002 ns | 1.00
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 4500 ns | 4291 ns | 1.05
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 4291 ns | 4583 ns | 0.94
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 4334 ns | 4500 ns | 0.96
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 4292 ns | 4541 ns | 0.95
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 250374.5 ns | 239767 ns | 1.04
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 4020.5 ns | 5687.5 ns | 0.71
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4167 ns | 4125 ns | 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6229 ns | 4959 ns | 1.26
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4188 ns | 3875 ns | 1.08
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 187203.5 ns | 180564 ns | 1.04
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8333 ns | 8708 ns | 0.96
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8292 ns | 8500 ns | 0.98
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8959 ns | 8667 ns | 1.03
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8542 ns | 8541 ns | 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 1134224 ns | 1101874 ns | 1.03
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 204792 ns | 208292 ns | 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 209750 ns | 209250 ns | 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 210583 ns | 209166.5 ns | 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 198541 ns | 200375 ns | 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 35314 ns | 34680 ns | 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 628542 ns | 649916 ns | 0.97
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 621333 ns | 632250 ns | 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 621916.5 ns | 621979 ns | 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 628417 ns | 632208 ns | 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 311564.5 ns | 306075 ns | 1.02
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 981958.5 ns | 975416.5 ns | 1.01
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 939958.5 ns | 936645.5 ns | 1.00
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 952792 ns | 954895.5 ns | 1.00
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 1288333 ns | 1290104.5 ns | 1.00
batchedmm(128, Bsize=128)/forward/GPU/CUDA | 207725 ns | 206706.5 ns | 1.00
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 4717104 ns | 4495416.5 ns | 1.05
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 4470167 ns | 4624208 ns | 0.97
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 4297542 ns | 4293833.5 ns | 1.00
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 6248125 ns | 6306792 ns | 0.99
batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 935192 ns | 924556 ns | 1.01
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3625 ns | 4667 ns | 0.78
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3958 ns | 4333 ns | 0.91
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 4916.5 ns | 5208 ns | 0.94
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3667 ns | 3125 ns | 1.17
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 194832 ns | 201570 ns | 0.97
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7625 ns | 7625 ns | 1
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7333 ns | 7375 ns | 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7667 ns | 7333 ns | 1.05
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7375 ns | 7125 ns | 1.04
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 992089 ns | 964645.5 ns | 1.03
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1641437.5 ns | 1660208.5 ns | 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1185979 ns | 1158208 ns | 1.02
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1378709 ns | 1364146 ns | 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2428250 ns | 2354187 ns | 1.03
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 214706 ns | 213379 ns | 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12422083 ns | 12376417 ns | 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9583083 ns | 9587708.5 ns | 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9294291 ns | 9262687 ns | 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18012291 ns | 17957375 ns | 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 1954819 ns | 1953093.5 ns | 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17453916 ns | 17363667 ns | 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14374292 ns | 14466208 ns | 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14354958 ns | 14361333 ns | 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21099271 ns | 21148875 ns | 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 93875 ns | 136479 ns | 0.69
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 88667 ns | 90541.5 ns | 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 92166 ns | 91959 ns | 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 90437.5 ns | 88917 ns | 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 126600 ns | 126286 ns | 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2066875 ns | 2029396 ns | 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1965333.5 ns | 2020021 ns | 0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2034937.5 ns | 2021541.5 ns | 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2029166.5 ns | 2009791 ns | 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1084421 ns | 970059 ns | 1.12
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 328375 ns | 348458 ns | 0.94
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 349208 ns | 336521 ns | 1.04
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 397000 ns | 399187.5 ns | 0.99
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 309875 ns | 313500 ns | 0.99
batchedmm(2, Bsize=4)/forward/GPU/CUDA | 15807 ns | 15421 ns | 1.03
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 707500 ns | 709188 ns | 1.00
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 737354.5 ns | 737750 ns | 1.00
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 1022500 ns | 1023375 ns | 1.00
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 654396 ns | 643583 ns | 1.02
batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 195018.5 ns | 185776.5 ns | 1.05
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7250 ns | 7334 ns | 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5917 ns | 5292 ns | 1.12
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6083 ns | 6042 ns | 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 9958 ns | 9959 ns | 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 33500 ns | 33229 ns | 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 214937.5 ns | 223750 ns | 0.96
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 230375 ns | 228166.5 ns | 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 220500 ns | 220459 ns | 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 210791 ns | 215250 ns | 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 348178.5 ns | 289979.5 ns | 1.20
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3709 ns | 3750 ns | 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3667 ns | 3709 ns | 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3708 ns | 3667 ns | 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3667 ns | 3667 ns | 1
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 22629 ns | 22473 ns | 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14375 ns | 14375 ns | 1
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14334 ns | 14250 ns | 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14375 ns | 14417 ns | 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14334 ns | 14500 ns | 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 496039.5 ns | 454491 ns | 1.09
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 97166 ns | 93937.5 ns | 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 94771 ns | 96000 ns | 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 98292 ns | 95750 ns | 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 94833 ns | 94229 ns | 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 125320 ns | 125724.5 ns | 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1954375 ns | 1921562.5 ns | 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1925166 ns | 1938875 ns | 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1932375 ns | 1920667 ns | 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1915062.5 ns | 1918854.5 ns | 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1014156 ns | 949972 ns | 1.07
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) | 852417 ns | 886792 ns | 0.96
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) | 820062.5 ns | 812958 ns | 1.01
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) | 1215333 ns | 1228020.5 ns | 0.99
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) | 959271 ns | 961021 ns | 1.00
lenet(28, 28, 1, 32)/forward/GPU/CUDA | 271531 ns | 266393 ns | 1.02
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) | 2836167 ns | 2837791.5 ns | 1.00
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) | 2455958 ns | 2523625 ns | 0.97
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) | 3343125 ns | 3323459 ns | 1.01
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) | 3387042 ns | 3391708 ns | 1.00
lenet(28, 28, 1, 32)/zygote/GPU/CUDA | 1631802 ns | 1589685.5 ns | 1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 15333 ns | 17625 ns | 0.87
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 15000 ns | 15833 ns | 0.95
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 19271 ns | 18750 ns | 1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 16708 ns | 15833 ns | 1.06
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 142548.5 ns | 140920 ns | 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 228292 ns | 216604.5 ns | 1.05
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 215437.5 ns | 223875 ns | 0.96
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 216000 ns | 216062.5 ns | 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 263417 ns | 257042 ns | 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 647857.5 ns | 635870.5 ns | 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 222708.5 ns | 227209 ns | 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 221916.5 ns | 220833 ns | 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 223791 ns | 223271 ns | 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 221292 ns | 219541 ns | 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 351782.5 ns | 267876.5 ns | 1.31
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 509750 ns | 523334 ns | 0.97
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 494979.5 ns | 557334 ns | 0.89
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 510333 ns | 498187.5 ns | 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 542437.5 ns | 540416 ns | 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1486215.5 ns | 1349491 ns | 1.10
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 316417 ns | 334459 ns | 0.95
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 336437.5 ns | 317417 ns | 1.06
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 354416 ns | 364250 ns | 0.97
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 320125 ns | 320791 ns | 1.00
batchedmm(16, Bsize=4)/forward/GPU/CUDA | 16603 ns | 16596.5 ns | 1.00
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 714958 ns | 715750.5 ns | 1.00
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 732917 ns | 735750 ns | 1.00
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 1022875 ns | 1025729.5 ns | 1.00
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 665729 ns | 657937.5 ns | 1.01
batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 197199 ns | 193892 ns | 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17875 ns | 17667 ns | 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17645.5 ns | 17417 ns | 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 19416 ns | 20583.5 ns | 0.94
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 17208 ns | 16833 ns | 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 228185 ns | 144720.5 ns | 1.58
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 213625 ns | 212749.5 ns | 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 223604.5 ns | 212500 ns | 1.05
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 213000 ns | 213583 ns | 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 240834 ns | 223229 ns | 1.08
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1059556 ns | 930226.5 ns | 1.14
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5979.5 ns | 7458 ns | 0.80
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 7083 ns | 5000 ns | 1.42
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 7083 ns | 7458 ns | 0.95
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6375 ns | 6000 ns | 1.06
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 238595 ns | 229973.5 ns | 1.04
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10583.5 ns | 10750 ns | 0.98
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 10292 ns | 10604 ns | 0.97
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 11416 ns | 11000 ns | 1.04
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10708 ns | 10458 ns | 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 1093136 ns | 1052874 ns | 1.04
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3770.5 ns | 4167 ns | 0.90
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3375 ns | 4145.5 ns | 0.81
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 4229 ns | 5250 ns | 0.81
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3292 ns | 2834 ns | 1.16
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 251757 ns | 236091.5 ns | 1.07
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7458 ns | 7625 ns | 0.98
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7167 ns | 7875 ns | 0.91
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8000 ns | 7750 ns | 1.03
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7333 ns | 7209 ns | 1.02
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 1103726 ns | 1061672 ns | 1.04
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 24419792 ns | 23478292 ns | 1.04
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 34715312.5 ns | 43131583 ns | 0.80
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 37629125 ns | 37763437.5 ns | 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 34906500 ns | 34891125.5 ns | 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1844116 ns | 1856489 ns | 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 188432604.5 ns | 184985667 ns | 1.02
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 159504917 ns | 171828500 ns | 0.93
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 146180437.5 ns | 146459896 ns | 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 413707208 ns | 412533125 ns | 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 16521419 ns | 16498145 ns | 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 438323667 ns | 426401458 ns | 1.03
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 253688896 ns | 257893209 ns | 0.98
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 231435417 ns | 231907209 ns | 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 485078167 ns | 482223334 ns | 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 185291.5 ns | 183271 ns | 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 183250 ns | 183354.5 ns | 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 184687.5 ns | 186750 ns | 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 183709 ns | 182250 ns | 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 229419 ns | 202451.5 ns | 1.13
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 631500 ns | 589375 ns | 1.07
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 585166.5 ns | 596958.5 ns | 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 597708 ns | 589000 ns | 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 630583.5 ns | 632167 ns | 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1108299 ns | 1041439 ns | 1.06
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 3978979.5 ns | 3849562 ns | 1.03
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 3639000 ns | 3881896 ns | 0.94
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 3486041 ns | 3464521 ns | 1.01
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 5345292 ns | 5356333 ns | 1.00
batchedmm(128, Bsize=512)/forward/GPU/CUDA | 534520 ns | 536569.5 ns | 1.00
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 18266999.5 ns | 17412625 ns | 1.05
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 17298416.5 ns | 17756875 ns | 0.97
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 16542625 ns | 16608479 ns | 1.00
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 22103834 ns | 22042750 ns | 1.00
batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2634003.5 ns | 2637828 ns | 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 583 ns | 0.86
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 583 ns | 542 ns | 1.08
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 542 ns | 584 ns | 0.93
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 459 ns | 500 ns | 0.92
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 32947 ns | 32430 ns | 1.02
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9666 ns | 9875 ns | 0.98
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9459 ns | 9500 ns | 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9833 ns | 9750 ns | 1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9875 ns | 9145.5 ns | 1.08
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 267979.5 ns | 267467.5 ns | 1.00
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) | 581657146 ns | 504434042 ns | 1.15
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) | 431111584 ns | 458633542 ns | 0.94
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) | 431743375 ns | 381209021 ns | 1.13
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) | 596638417 ns | 671200875.5 ns | 0.89
vgg16(32, 32, 3, 128)/forward/GPU/CUDA | 12473530 ns | 12484248 ns | 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) | 2064050353.5 ns | 2048273395.5 ns | 1.01
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) | 1627636167 ns | 1661422833 ns | 0.98
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) | 1488459749.5 ns | 1499198563 ns | 0.99
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) | 2201925958.5 ns | 2207989770.5 ns | 1.00
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA | 49182728 ns | 49043755 ns | 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1645500 ns | 1648062.5 ns | 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1168250 ns | 1192292 ns | 0.98
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1385416 ns | 1392792 ns | 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2450375 ns | 2475542 ns | 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 217064 ns | 218335.5 ns | 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12851792 ns | 12753208 ns | 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9920604 ns | 9970145.5 ns | 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9698792 ns | 9709187 ns | 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18357187.5 ns | 18405562.5 ns | 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2017297 ns | 2007331 ns | 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17787021 ns | 17672750 ns | 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14725958 ns | 14774167 ns | 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14633271 ns | 14626875 ns | 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21461500 ns | 21434167 ns | 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 26458 ns | 26208 ns | 1.01
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 26209 ns | 26250 ns | 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 26291 ns | 26209 ns | 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 26250 ns | 26208 ns | 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 23903 ns | 24803 ns | 0.96
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 67417 ns | 66917 ns | 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 66666 ns | 66833 ns | 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 67292 ns | 67875 ns | 0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66792 ns | 66750 ns | 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 406936 ns | 397350.5 ns | 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 203875 ns | 204750 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 209625 ns | 209167 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 210333 ns | 209917 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 199291 ns | 200042 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 26061.5 ns | 26341 ns | 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 611041 ns | 612792 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 631708 ns | 669042 ns | 0.94
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 669229 ns | 665479.5 ns | 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 594291.5 ns | 633646 ns | 0.94
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 352245 ns | 340366 ns | 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 671500 ns | 656958 ns | 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 642583.5 ns | 628166 ns | 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 651416.5 ns | 637292 ns | 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 635333 ns | 658854 ns | 0.96
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 131732.5 ns | 131658 ns | 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2329958 ns | 2236438 ns | 1.04
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2241167 ns | 2302291.5 ns | 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2235209 ns | 2233208.5 ns | 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2243562.5 ns | 2244083.5 ns | 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1189796.5 ns | 1141510 ns | 1.04
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 18584 ns | 17708.5 ns | 1.05
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17479.5 ns | 17875 ns | 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 22542 ns | 22791.5 ns | 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 21041 ns | 17812.5 ns | 1.18
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 145492.5 ns | 143266.5 ns | 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 228333 ns | 231271 ns | 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 218792 ns | 262583 ns | 0.83
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 262916 ns | 262520.5 ns | 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 258104 ns | 262167 ns | 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1060721 ns | 974956 ns | 1.09
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 542 ns | 583 ns | 0.93
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 584 ns | 542 ns | 1.08
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 583 ns | 583 ns | 1
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 500 ns | 1
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 23751 ns | 23116 ns | 1.03
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 10042 ns | 10167 ns | 0.99
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9917 ns | 9666 ns | 1.03
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 10250 ns | 10125 ns | 1.01
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 10041 ns | 10084 ns | 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 261761.5 ns | 255373.5 ns | 1.03
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 6167 ns | 7125 ns | 0.87
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5875 ns | 6209 ns | 0.95
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7458 ns | 7354.5 ns | 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6625 ns | 5792 ns | 1.14
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 236484.5 ns | 224318.5 ns | 1.05
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7500 ns | 7708 ns | 0.97
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7084 ns | 7375 ns | 0.96
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7667 ns | 7541 ns | 1.02
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7625 ns | 7209 ns | 1.06
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 805005.5 ns | 798172.5 ns | 1.01
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 2208 ns | 2208.5 ns | 1.00
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 2250 ns | 2291 ns | 0.98
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 2208 ns | 2209 ns | 1.00
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 2125 ns | 2167 ns | 0.98
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 18044 ns | 17921 ns | 1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 6625 ns | 6875 ns | 0.96
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 6500 ns | 6500 ns | 1
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 6667 ns | 6750 ns | 0.99
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 6750 ns | 6708 ns | 1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 332799.5 ns | 329206 ns | 1.01
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 754583.5 ns | 749437.5 ns | 1.01 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 746792 ns | 748917 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 746709 ns | 749541 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 748833 ns | 751833.5 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 21377 ns | 21135 ns | 1.01 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 793208 ns | 795541 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 774958 ns | 788459 ns | 0.98 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 791541.5 ns | 792916 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 810083 ns | 791791.5 ns | 1.02 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 300021.5 ns | 292229.5 ns | 1.03 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7209 ns | 7209 ns | 1 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6000 ns | 5333 ns | 1.13 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6125 ns | 5958 ns | 1.03 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10125 ns | 10166 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 33125 ns | 32459 ns | 1.02 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 228292 ns | 229666.5 ns | 0.99 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 226459 ns | 239729.5 ns | 0.94 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 269250 ns | 264354.5 ns | 1.02 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 212958 ns | 255083.5 ns | 0.83 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 361902.5 ns | 359407.5 ns | 1.01 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 11334 ns | 12770.5 ns | 0.89 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 12979.5 ns | 11125 ns | 1.17 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 13083 ns | 12792 ns | 1.02 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 10958 ns | 10541 ns | 1.04 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 253895 ns | 243081 ns | 1.04 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 25042 ns | 25208 ns | 0.99 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 23875 ns | 24916 ns | 0.96 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 25792 ns | 25208 ns | 1.02 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 25084 ns | 24625 ns | 1.02 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 1125992 ns | 1117079 ns | 1.01 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 106901208 ns | 106480583 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 117686604 ns | 125655584 ns | 0.94 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 121224709 ns | 120834166 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 118251959 ns | 117491666 ns | 1.01 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 2654315 ns | 2637704 ns | 1.01 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 396867792 ns | 393188541 ns | 1.01 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 366900875 ns | 380341000 ns | 0.96 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 353248208 ns | 357677834 ns | 0.99 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 486147750 ns | 481091583 ns | 1.01 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 15270647 ns | 15233085 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 948542479 ns | 937085875 ns | 1.01 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 579410750 ns | 774220083 ns | 0.75 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 743810562.5 ns | 745186000 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 764589145.5 ns | 945237625.5 ns | 0.81 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7709 ns | 8625 ns | 0.89 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7583 ns | 7500 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8042 ns | 8875 ns | 0.91 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7875 ns | 7833 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 243350 ns | 237576 ns | 1.02 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14334 ns | 14250 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 13875 ns | 14375 ns | 0.97 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14459 ns | 13916 ns | 1.04 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14458 ns | 14083 ns | 1.03 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 1071498 ns | 1078858 ns | 0.99 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 7917 ns | 9125 ns | 0.87 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 8083 ns | 7041 ns | 1.15 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 8916 ns | 9083 ns | 0.98 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 8583 ns | 7042 ns | 1.22 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 237135.5 ns | 235440 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12667 ns | 12916.5 ns | 0.98 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12083 ns | 13208 ns | 0.91 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13000 ns | 12792 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 13083 ns | 12708 ns | 1.03 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 787328.5 ns | 787408.5 ns | 1.00 |
| batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 335541 ns | 353104 ns | 0.95 |
| batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 343750 ns | 328604 ns | 1.05 |
| batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 397542 ns | 398083 ns | 1.00 |
| batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 308542 ns | 314250 ns | 0.98 |
| batchedmm(2, Bsize=128)/forward/GPU/CUDA | 17004 ns | 16719 ns | 1.02 |
| batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 710187.5 ns | 711500 ns | 1.00 |
| batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 735542 ns | 737000 ns | 1.00 |
| batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 1025667 ns | 1029562.5 ns | 1.00 |
| batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 660021 ns | 649000 ns | 1.02 |
| batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 200985 ns | 196298 ns | 1.02 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 333 ns | 0.88 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 23598 ns | 23316 ns | 1.01 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6542 ns | 6584 ns | 0.99 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6584 ns | 6750 ns | 0.98 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6792 ns | 6625 ns | 1.03 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6542 ns | 6375 ns | 1.03 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 241099.5 ns | 238133 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 5875 ns | 5875 ns | 1 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 5875 ns | 5916 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 5959 ns | 5875 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 5792 ns | 5792 ns | 1 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 24612 ns | 23849 ns | 1.03 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 21542 ns | 21583 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 21250 ns | 21542 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 21750 ns | 22395.5 ns | 0.97 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 21625 ns | 21250 ns | 1.02 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 264784.5 ns | 259774.5 ns | 1.02 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 178145.5 ns | 148459 ns | 1.20 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 148750 ns | 146500 ns | 1.02 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 152312.5 ns | 151167 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 146979 ns | 149209 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 167163.5 ns | 168521.5 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1390521 ns | 1306312.5 ns | 1.06 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1324584 ns | 1335292 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1339167 ns | 1326333 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1324166 ns | 1329459 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1355480 ns | 1332341.5 ns | 1.02 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 23500 ns | 25520.5 ns | 0.92 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 24167 ns | 22687.5 ns | 1.07 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 25625 ns | 25917 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 24083 ns | 23417 ns | 1.03 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 285493.5 ns | 283013 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 183458 ns | 176479.5 ns | 1.04 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 118833 ns | 119334 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 180603.5 ns | 131395.5 ns | 1.37 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 131313 ns | 178542 ns | 0.74 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1475275 ns | 1446515 ns | 1.02 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 416 ns | 0.90 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 334 ns | 1.12 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 23098 ns | 22447 ns | 1.03 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6792 ns | 7000 ns | 0.97 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6666 ns | 6792 ns | 0.98 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6750 ns | 6833 ns | 0.99 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6584 ns | 6604.5 ns | 1.00 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 259072 ns | 254907.5 ns | 1.02 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 4625 ns | 5791.5 ns | 0.80 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4708 ns | 5041.5 ns | 0.93 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7562 ns | 7375 ns | 1.03 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 7167 ns | 5583.5 ns | 1.28 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 256394 ns | 252117.5 ns | 1.02 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10250 ns | 10250 ns | 1 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10167 ns | 10292 ns | 0.99 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10583 ns | 10208 ns | 1.04 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10250 ns | 10292 ns | 1.00 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 1355462 ns | 1346292 ns | 1.01 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1625 ns | 1625 ns | 1 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1625 ns | 1584 ns | 1.03 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1625 ns | 1584 ns | 1.03 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1625 ns | 1583 ns | 1.03 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 22982 ns | 23009 ns | 1.00 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 5917 ns | 5958 ns | 0.99 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 5667 ns | 5625 ns | 1.01 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 6000 ns | 5667 ns | 1.06 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 5666 ns | 5791 ns | 0.98 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 276854.5 ns | 270989.5 ns | 1.02 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 6836667 ns | 6824625 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 6379291.5 ns | 6348145.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 6516666.5 ns | 6519020.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 7591021.5 ns | 7697209 ns | 0.99 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 214663 ns | 213576.5 ns | 1.01 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 24219625 ns | 24071458 ns | 1.01 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 21283583 ns | 21312916.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 21080667 ns | 21105208.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 29731124.5 ns | 29655708 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2118978 ns | 2112366 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 48989083.5 ns | 48607583 ns | 1.01 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 34134771 ns | 45891875 ns | 0.74 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 45780708.5 ns | 45733979.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 38096479 ns | 49303792 ns | 0.77 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6042 ns | 7292 ns | 0.83 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 7291 ns | 6916 ns | 1.05 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7875 ns | 7667 ns | 1.03 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6583 ns | 6812.5 ns | 0.97 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 237605.5 ns | 236251.5 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8750 ns | 8833 ns | 0.99 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8042 ns | 9084 ns | 0.89 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9667 ns | 9125 ns | 1.06 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9083 ns | 8375 ns | 1.08 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 1065271 ns | 1057827.5 ns | 1.01 |
| lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) | 1524750 ns | 1557041.5 ns | 0.98 |
| lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) | 1278354 ns | 1245708 ns | 1.03 |
| lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) | 1620875.5 ns | 1634792 ns | 0.99 |
| lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) | 2109874.5 ns | 2151354 ns | 0.98 |
| lenet(28, 28, 1, 128)/forward/GPU/CUDA | 280243.5 ns | 269564 ns | 1.04 |
| lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) | 7991625 ns | 7905354 ns | 1.01 |
| lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) | 6617270.5 ns | 6660125 ns | 0.99 |
| lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) | 7189875 ns | 7215708 ns | 1.00 |
| lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) | 10461771 ns | 10061000 ns | 1.04 |
| lenet(28, 28, 1, 128)/zygote/GPU/CUDA | 1897156.5 ns | 1851007 ns | 1.02 |
| batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 321937.5 ns | 347583.5 ns | 0.93 |
| batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 345500 ns | 330250 ns | 1.05 |
| batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 409750 ns | 398666.5 ns | 1.03 |
| batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 339562.5 ns | 347854.5 ns | 0.98 |
| batchedmm(128, Bsize=4)/forward/GPU/CUDA | 42269 ns | 46483.5 ns | 0.91 |
| batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 746125 ns | 750667 ns | 0.99 |
| batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 790750 ns | 791375 ns | 1.00 |
| batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 1075375 ns | 1087833 ns | 0.99 |
| batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 734625 ns | 760750 ns | 0.97 |
| batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 239383 ns | 231907 ns | 1.03 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 395917 ns | 397542 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 288291 ns | 213292 ns | 1.35 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 288209 ns | 288208 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 751084 ns | 750375 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 44476.5 ns | 43637 ns | 1.02 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 646125 ns | 666667 ns | 0.97 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 530875 ns | 472875 ns | 1.12 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 530542 ns | 532542 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 972958 ns | 973709 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 191911 ns | 187534.5 ns | 1.02 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 682833 ns | 596583 ns | 1.14 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 670833 ns | 643625 ns | 1.04 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 659500 ns | 658187.5 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 638708 ns | 659375 ns | 0.97 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 131884.5 ns | 131892 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2563479.5 ns | 2455000 ns | 1.04 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2432854.5 ns | 2514542 ns | 0.97 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2458542 ns | 2453792 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2383625 ns | 2461334 ns | 0.97 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1195975.5 ns | 1187757 ns | 1.01 |
| batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 331542 ns | 352771 ns | 0.94 |
| batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 342250 ns | 330916.5 ns | 1.03 |
| batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 400229 ns | 399291 ns | 1.00 |
| batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 312958 ns | 312854.5 ns | 1.00 |
| batchedmm(2, Bsize=32)/forward/GPU/CUDA | 16588 ns | 15466 ns | 1.07 |
| batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 703000 ns | 710875 ns | 0.99 |
| batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 729354 ns | 734791 ns | 0.99 |
| batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 1024041 ns | 1025771 ns | 1.00 |
| batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 655979.5 ns | 642208 ns | 1.02 |
| batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 199893 ns | 195407.5 ns | 1.02 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1462917 ns | 1465375 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1500917 ns | 1498459 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1500875 ns | 1502875 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1438833 ns | 1442833 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 40439.5 ns | 40141 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5170625 ns | 5101625 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5293687.5 ns | 5303750 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5300979 ns | 5295812.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4994229.5 ns | 4993584 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 195795.5 ns | 196609 ns | 1.00 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3708 ns | 3667 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3667 ns | 3708 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3708 ns | 3667 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3667 ns | 3666 ns | 1.00 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA | 33586 ns | 33049 ns | 1.02 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 15125 ns | 15292 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 15250 ns | 15167 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 15458 ns | 15291 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 15166 ns | 15167 ns | 1.00 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA | 379759 ns | 375124.5 ns | 1.01 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 71875 ns | 71334 ns | 1.01 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 71334 ns | 71333 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 71334 ns | 71250 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 71167 ns | 71041 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA | 113830 ns | 113867.5 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 330000 ns | 318833 ns | 1.04 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 326500 ns | 321959 ns | 1.01 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 318792 ns | 317750 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 317917 ns | 317541 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA | 196755 ns | 192238.5 ns | 1.02 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 1083 ns | 1084 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 1083 ns | 1083 ns | 1 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 1084 ns | 1083 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 1000 ns | 1000 ns | 1 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 23617 ns | 23138 ns | 1.02 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8416 ns | 8583 ns | 0.98 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8250 ns | 8417 ns | 0.98 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8584 ns | 8459 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8291 ns | 7958 ns | 1.04 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 261794 ns | 258287 ns | 1.01 |
| batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) | 465562.5 ns | 475687.5 ns | 0.98 |
| batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) | 478125.5 ns | 463395.5 ns | 1.03 |
| batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) | 555166 ns | 562708 ns | 0.99 |
| batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) | 539000 ns | 552729.5 ns | 0.98 |
| batchedmm(128, Bsize=32)/forward/GPU/CUDA | 129234 ns | 130132 ns | 0.99 |
| batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) | 1418354 ns | 1400250 ns | 1.01 |
| batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) | 1388291.5 ns | 1394771 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) | 1637459 ns | 1643270.5 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) | 1604854 ns | 1597458 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/GPU/CUDA | 273532 ns | 277863 ns | 0.98 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 334 ns | 375 ns | 0.89 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 416 ns | 375 ns | 1.11 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 417 ns | 0.90 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 31612 ns | 31425 ns | 1.01 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6458 ns | 6750 ns | 0.96 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6333 ns | 6833 ns | 0.93 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6791 ns | 6834 ns | 0.99 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6209 ns | 6208 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 264028.5 ns | 261831.5 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1758708 ns | 1726416.5 ns | 1.02 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1729729 ns | 1745625 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1739750.5 ns | 1724625 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1726646 ns | 1725854 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 168260 ns | 169678 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4420959 ns | 4357021 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4347666 ns | 3978291.5 ns | 1.09 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4379083 ns | 4384375 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4357583 ns | 4359458.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1226950 ns | 1215814 ns | 1.01 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 6875 ns | 6750 ns | 1.02 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 6542 ns | 6875 ns | 0.95 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 7083.5 ns | 7312.5 ns | 0.97 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 6833 ns | 6792 ns | 1.01 |
| bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA | 20701 ns | 20951 ns | 0.99 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 51979 ns | 48417 ns | 1.07 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 32583 ns | 33583 ns | 0.97 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 52750 ns | 73208.5 ns | 0.72 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 50874.5 ns | 70500 ns | 0.72 |
| bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA | 210813.5 ns | 288573 ns | 0.73 |
| batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) | 327625.5 ns | 360125 ns | 0.91 |
| batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) | 347916.5 ns | 330312.5 ns | 1.05 |
| batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) | 408917 ns | 410854.5 ns | 1.00 |
| batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) | 320146 ns | 324312.5 ns | 0.99 |
| batchedmm(2, Bsize=512)/forward/GPU/CUDA | 18775 ns | 18716 ns | 1.00 |
| batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) | 718958 ns | 717250 ns | 1.00 |
| batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) | 737917 ns | 741709 ns | 0.99 |
| batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) | 1036166 ns | 1036125.5 ns | 1.00 |
| batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) | 680500 ns | 667292 ns | 1.02 |
| batchedmm(2, Bsize=512)/zygote/GPU/CUDA | 345085 ns | 340218.5 ns | 1.01 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 75791 ns | 75417 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 75584 ns | 75208 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 75375 ns | 75292 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 75375 ns | 75333 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA | 47372 ns | 46771 ns | 1.01 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 334958 ns | 325792 ns | 1.03 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 328625 ns | 333167 ns | 0.99 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 327541 ns | 325417 ns | 1.01 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 326083 ns | 324333 ns | 1.01 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA | 213955 ns | 208628 ns | 1.03 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1483750 ns | 1487791 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1525917 ns | 1523333 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1527208 ns | 1526708 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1462750 ns | 1466375 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 52155 ns | 51173 ns | 1.02 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5146333 ns | 5109167 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5307500 ns | 5274250 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5288187.5 ns | 5289270.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4992312.5 ns | 4981458.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 202940.5 ns | 201765 ns | 1.01 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 28209 ns | 28250 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 28208 ns | 28167 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 28334 ns | 28250 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 28250 ns | 28208 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA | 25072 ns | 24387 ns | 1.03 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 66750 ns | 66625 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 66500 ns | 66250 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 66667 ns | 66375 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66542 ns | 66792 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA | 534798 ns | 518482.5 ns | 1.03 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) | 1392708 ns | 1471916.5 ns | 0.95 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) | 1134042 ns | 936458 ns | 1.21 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) | 1145604 ns | 1142000 ns | 1.00 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) | 2197625 ns | 2245542 ns | 0.98 |
| mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA | 584489.5 ns | 593805 ns | 0.98 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) | 3059625 ns | 3051000 ns | 1.00 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) | 2735708.5 ns | 2625979.5 ns | 1.04 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) | 2741625 ns | 2744916 ns | 1.00 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) | 3826667 ns | 3827125 ns | 1.00 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA | 2073138 ns | 2034429 ns | 1.02 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) | 8924354 ns | 8759417 ns | 1.02 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) | 8783666.5 ns | 8720687.5 ns | 1.01 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) | 8795334 ns | 8789874.5 ns | 1.00 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) | 6371000 ns | 6417375 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 84500 ns | 83687.5 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 84709 ns | 82438 ns | 1.03 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 85125 ns | 83416.5 ns | 1.02 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 84208 ns | 82771 ns | 1.02 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 194258 ns | 194015.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2051771 ns | 2015417 ns | 1.02 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2021959 ns | 2036291 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2024188 ns | 2016500 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2022500 ns | 2009667 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 803766 ns | 802404 ns | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.