Commit
chore: bump crate-ci/typos from 1.28.1 to 1.28.2 (#1130)
Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.28.1 to 1.28.2.
- [Release notes](https://github.com/crate-ci/typos/releases)
- [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)
- [Commits](crate-ci/typos@v1.28.1...v1.28.2)

---
updated-dependencies:
- dependency-name: crate-ci/typos
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
59c0c69
Lux Benchmarks
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
4041
ns3958
ns1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
5209
ns4791
ns1.09
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
5333
ns4792
ns1.11
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
3937.5
ns3958
ns0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA
59316
ns59494
ns1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
10250
ns10750
ns0.95
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
11083
ns10959
ns1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
11375
ns10125
ns1.12
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
10542
ns10562.5
ns1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA
416506
ns417797.5
ns1.00
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s)
1167
ns1125
ns1.04
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s)
1292
ns1292
ns1
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s)
1416
ns1208
ns1.17
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s)
1167
ns1083
ns1.08
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA
18132
ns18173
ns1.00
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
4020.5
ns4083
ns0.98
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
4250
ns3417
ns1.24
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
4000
ns4250
ns0.94
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
4166
ns3709
ns1.12
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA
107833.5
ns107683
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
70208
ns70750
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
58667
ns64000
ns0.92
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
64125
ns64250
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
79750
ns83042
ns0.96
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
36663
ns36561
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2033104
ns2030500
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2103708
ns2082541.5
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2094916
ns2089104
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
2002834
ns2008667
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
192404.5
ns193196.5
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
184125
ns140083
ns1.31
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
189792
ns181291
ns1.05
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
186063
ns181167
ns1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
185125
ns185250
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
166512
ns166362
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1118896
ns1120708
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1163979
ns1119000
ns1.04
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1120500
ns1120041.5
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1129854
ns1124104
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
516905
ns525948
ns0.98
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
3375
ns3334
ns1.01
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
3917
ns4125
ns0.95
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
5041
ns3729.5
ns1.35
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
3333.5
ns3542
ns0.94
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
70027
ns70915
ns0.99
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
9166
ns9125
ns1.00
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
9125
ns9542
ns0.96
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
9042
ns8708
ns1.04
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
8625
ns8875
ns0.97
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
443773
ns475931.5
ns0.93
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
19084
ns15062.5
ns1.27
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
15375
ns15250
ns1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
18375
ns17437.5
ns1.05
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
14625
ns15375
ns0.95
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
54384.5
ns53231
ns1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
225917
ns216458.5
ns1.04
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
214542
ns225042
ns0.95
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
215125
ns213541.5
ns1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
213000
ns222375
ns0.96
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
268804
ns270372.5
ns0.99
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s)
500
ns500
ns1
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s)
791
ns750
ns1.05
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s)
750
ns666
ns1.13
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s)
541
ns583
ns0.93
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA
17259
ns17324
ns1.00
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
1417
ns1500
ns0.94
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
1542
ns1520.5
ns1.01
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
1833
ns1750
ns1.05
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
1667
ns1500
ns1.11
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA
100025
ns100368.5
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
8917
ns8125
ns1.10
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
6417
ns8125
ns0.79
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
8042
ns7041
ns1.14
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
10334
ns10667
ns0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
23360
ns22992
ns1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
233625
ns234000
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
230375
ns239937.5
ns0.96
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
230166
ns228833.5
ns1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
225083
ns222271
ns1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
165842
ns167254
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s)
3875
ns3834
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s)
4000
ns3958
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s)
3958
ns3917
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s)
3875
ns3916
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA
23503
ns23377
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
17333
ns16833
ns1.03
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
17125
ns16667
ns1.03
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
18416
ns18375
ns1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
16542
ns16583
ns1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA
160007
ns160878
ns0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s)
602459
ns610542
ns0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s)
612791
ns613209
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s)
611250
ns634042
ns0.96
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s)
609583
ns609000
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA
113120
ns113540.5
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s)
1422458
ns1430375
ns0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s)
1432875
ns1420292
ns1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s)
1432708.5
ns1446167
ns0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s)
1421250
ns1425542
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA
210148
ns210405
ns1.00
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s)
1073292
ns1076083
ns1.00
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s)
969125
ns968959
ns1.00
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s)
1355229
ns1348187.5
ns1.01
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s)
1303542
ns1290083
ns1.01
lenet(28, 28, 1, 64)/forward/GPU/CUDA
271326.5
ns272167
ns1.00
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s)
5773875
ns5791000
ns1.00
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s)
4524834
ns4597104
ns0.98
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s)
4956520.5
ns4948917
ns1.00
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s)
5616459
ns5522395.5
ns1.02
lenet(28, 28, 1, 64)/zygote/GPU/CUDA
1077047.5
ns1076534
ns1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s)
542
ns542
ns1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s)
542
ns542
ns1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s)
542
ns542
ns1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s)
500
ns542
ns0.92
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA
23520
ns23590
ns1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
2208
ns2166
ns1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
2167
ns2125
ns1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
2167
ns2208
ns0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
2084
ns2125
ns0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA
169620
ns173376
ns0.98
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
4250
ns3917
ns1.09
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
4125
ns4208
ns0.98
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
4708
ns5125
ns0.92
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
3833
ns4083.5
ns0.94
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA
64369
ns65133.5
ns0.99
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
11375
ns10979.5
ns1.04
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
11750
ns11375
ns1.03
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
11875
ns11667
ns1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
11292
ns11125
ns1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA
441357.5
ns444460.5
ns0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
6292
ns5959
ns1.06
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
6750
ns6416
ns1.05
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
7166
ns7209
ns0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
6333
ns6333
ns1
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
51996.5
ns51265
ns1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
18312.5
ns17125
ns1.07
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
18083
ns17208
ns1.05
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
19833
ns17709
ns1.12
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
18125
ns17500
ns1.04
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
297374
ns297640
ns1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s)
584
ns541
ns1.08
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s)
625
ns542
ns1.15
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s)
625
ns625
ns1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s)
541
ns625
ns0.87
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA
32515
ns32574
ns1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
8959
ns8312.5
ns1.08
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
8917
ns8395.5
ns1.06
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
9083
ns8834
ns1.03
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
8542
ns9125
ns0.94
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA
156580.5
ns156527
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s)
96500
ns96458
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s)
96458
ns96250
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s)
96666.5
ns95958
ns1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s)
96458
ns97333
ns0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA
111507.5
ns111569
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s)
282542
ns279917
ns1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s)
294792
ns272666
ns1.08
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s)
278250
ns276958
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s)
274042
ns291791
ns0.94
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA
185832.5
ns184593
ns1.01
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s)
3410792
ns3390792
ns1.01
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s)
2893584
ns3045416
ns0.95
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s)
3043771
ns3031500
ns1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s)
3950938
ns3960417
ns1.00
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA
573403
ns572942
ns1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s)
7640458
ns7593625
ns1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s)
7363916.5
ns7437042
ns0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s)
7444583
ns7444584
ns1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s)
8213291
ns8265979
ns0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA
1371137.5
ns1334670
ns1.03
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s)
17504417
ns12605208
ns1.39
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s)
17685667
ns17554084
ns1.01
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s)
17570042
ns17556062
ns1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s)
14113396
ns14272042
ns0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s)
23914500
ns24062729
ns0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s)
43551541
ns34415959
ns1.27
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s)
37461209
ns37185584
ns1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s)
34611021
ns34968250
ns0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA
1841718.5
ns1858779
ns0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s)
313175916
ns317027145.5
ns0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s)
178521083
ns233784625
ns0.76
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s)
195096687.5
ns195359167
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s)
279780167
ns280568396
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA
13887340
ns13916432
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s)
273572625
ns273605875
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s)
278931729
ns269293459
ns1.04
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s)
256343958
ns251015375
ns1.02
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s)
474930271
ns332609042
ns1.43
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
21875
ns21834
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
22459
ns21750
ns1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
23250
ns25500
ns0.91
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
21334
ns22916
ns0.93
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
95397
ns95464
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
111375
ns118125
ns0.94
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
104624.5
ns103417
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
104666
ns104833.5
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
103604.5
ns104125
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
506209
ns509331.5
ns0.99
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
5833.5
ns5417
ns1.08
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
6041
ns6500
ns0.93
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
6667
ns6500
ns1.03
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
5875
ns5583.5
ns1.05
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
68686
ns67886
ns1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
14834
ns14625
ns1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
15792
ns15292
ns1.03
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
16375
ns15542
ns1.05
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
14834
ns14917
ns0.99
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
476339.5
ns472243.5
ns1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s)
3078645.5
ns3101833
ns0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s)
2149083
ns2134333
ns1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s)
2304458.5
ns2303021
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s)
4677166
ns5007292
ns0.93
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA
589471.5
ns586798
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s)
23611208
ns23546583
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s)
18335958
ns18840521
ns0.97
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s)
17863458.5
ns18012083
ns0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s)
35453375
ns36120167
ns0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA
2770157.5
ns2918041
ns0.95
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s)
33321333
ns33910770.5
ns0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s)
27967958
ns27527417
ns1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s)
27533500
ns28620667
ns0.96
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s)
41461333
ns41842979
ns0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
72791
ns72812
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
73083
ns74542
ns0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
81187.5
ns74187.5
ns1.09
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
73875
ns72666
ns1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
100825.5
ns101631
ns0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
316333.5
ns292354.5
ns1.08
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
318437.5
ns217084
ns1.47
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
323125
ns315166
ns1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
308937.5
ns292458
ns1.06
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
537489.5
ns549955
ns0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
11625
ns11541.5
ns1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
12083
ns11791
ns1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
12125
ns12250
ns0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
11959
ns11417
ns1.05
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
69997
ns70877.5
ns0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
26834
ns26084
ns1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
26959
ns26583
ns1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
27958
ns28645.5
ns0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
26791.5
ns27000
ns0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
466299.5
ns471342.5
ns0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
12625
ns12083.5
ns1.04
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
12604.5
ns12250
ns1.03
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
13958
ns13292
ns1.05
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
12208
ns12417
ns0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
52579.5
ns52255
ns1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
25958
ns25416
ns1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
26333
ns25750
ns1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
26583
ns25791
ns1.03
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
26541
ns26459
ns1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
298632
ns302749
ns0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
179625
ns178333
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
179458
ns179875
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
183083
ns180750
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
188958
ns180334
ns1.05
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
56675.5
ns56120
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
595770.5
ns581770.5
ns1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
595666
ns583250
ns1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
584792
ns583208.5
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
582042
ns589771
ns0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
280995
ns283667.5
ns0.99
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s)
5959
ns6187.5
ns0.96
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s)
6375
ns6333
ns1.01
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s)
7125
ns6354.5
ns1.12
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s)
6042
ns6250
ns0.97
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA
69870
ns70397
ns0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
14166
ns13583
ns1.04
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
14917
ns14000
ns1.07
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
15625
ns14792
ns1.06
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
14458
ns14709
ns0.98
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA
454188
ns461030.5
ns0.99
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s)
1239500
ns1242791
ns1.00
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s)
1321583
ns1300208
ns1.02
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s)
1360666.5
ns1359354
ns1.00
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s)
1089687
ns1186229.5
ns0.92
batchedmm(512, Bsize=4)/forward/GPU/CUDA
302025.5
ns301478
ns1.00
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s)
4119041
ns4116667
ns1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s)
4588250
ns4395875
ns1.04
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s)
4571375
ns4529125
ns1.01
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s)
3710875
ns3917271.5
ns0.95
batchedmm(512, Bsize=4)/zygote/GPU/CUDA
1037839.5
ns1038425.5
ns1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s)
1833
ns1792
ns1.02
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s)
1875
ns1833
ns1.02
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s)
1833
ns1834
ns1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s)
1834
ns1875
ns0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA
23678
ns23500
ns1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s)
4959
ns4834
ns1.03
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s)
4916
ns4834
ns1.02
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s)
4875
ns4917
ns0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s)
4917
ns4958
ns0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA
190693.5
ns188737.5
ns1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
5792
ns5584
ns1.04
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
6167
ns6084
ns1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
7042
ns7333
ns0.96
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
5875
ns5959
ns0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
55855.5
ns55083.5
ns1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
11792
ns10750
ns1.10
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
11125
ns11209
ns0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
11250
ns11542
ns0.97
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
10584
ns11292
ns0.94
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
329926.5
ns335254
ns0.98
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s)
333
ns291
ns1.14
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s)
334
ns334
ns1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s)
292
ns292
ns1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s)
333
ns375
ns0.89
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA
23017
ns22752
ns1.01
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
3000
ns2750
ns1.09
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
3083
ns2792
ns1.10
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
3041
ns2709
ns1.12
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
2625
ns2750
ns0.95
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA
161342
ns159355.5
ns1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
11875
ns11084
ns1.07
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
11833
ns11458
ns1.03
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
13042
ns12854.5
ns1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
11292
ns12083
ns0.93
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA
57364
ns57729
ns0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
24959
ns24167
ns1.03
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
24979.5
ns24541
ns1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
27250
ns24916
ns1.09
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
24458
ns25167
ns0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA
291490.5
ns299680
ns0.97
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s)
4250
ns4125
ns1.03
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s)
4291
ns4125
ns1.04
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s)
4250
ns4208
ns1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s)
4166
ns4209
ns0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA
24932
ns24651
ns1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
16500
ns16166
ns1.02
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
16166
ns16083
ns1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
16541
ns16292
ns1.02
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
16250
ns16042
ns1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA
196710
ns199395
ns0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
5791
ns5667
ns1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
5750
ns5709
ns1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
5834
ns5791
ns1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
5791
ns5791
ns1
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA
33248
ns33617
ns0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
20792
ns20020.5
ns1.04
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
20959
ns20583
ns1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
21459
ns21083
ns1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
20542
ns21042
ns0.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA
175956
ns175086.5
ns1.00
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 412875 ns | 407729 ns | 1.01
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 375208 ns | 380271 ns | 0.99
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 487209 ns | 483500 ns | 1.01
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 146584 ns | 105458.5 ns | 1.39
batchedmm(16, Bsize=512)/forward/GPU/CUDA | 67216 ns | 67085 ns | 1.00
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 916708.5 ns | 926875 ns | 0.99
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 989792 ns | 968750 ns | 1.02
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 1196125 ns | 1173375 ns | 1.02
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 476875 ns | 378000 ns | 1.26
batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 190238 ns | 188736 ns | 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 135084 ns | 132583 ns | 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 81542 ns | 130188 ns | 0.63
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 141833 ns | 129458 ns | 1.10
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 135750 ns | 137584 ns | 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 193827 ns | 192853 ns | 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1911291.5 ns | 1920250.5 ns | 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1946333 ns | 1918583 ns | 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1928333 ns | 1924438 ns | 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1910834 ns | 1920500 ns | 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 394453 ns | 409280 ns | 0.96
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 333 ns | 292 ns | 1.14
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 375 ns | 0.78
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 22330 ns | 21945 ns | 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1875 ns | 1833 ns | 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1833 ns | 1875 ns | 0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1875 ns | 1875 ns | 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1792 ns | 1834 ns | 0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 168958 ns | 171197.5 ns | 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 6625 ns | 6042 ns | 1.10
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 6792 ns | 6625 ns | 1.03
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7792 ns | 7916.5 ns | 0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6666 ns | 7042 ns | 0.95
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 58760 ns | 58992.5 ns | 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9667 ns | 8791 ns | 1.10
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9291 ns | 8792 ns | 1.06
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9333 ns | 9291 ns | 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9417 ns | 9292 ns | 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 298788.5 ns | 311073 ns | 0.96
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 111820937.5 ns | 110075500 ns | 1.02
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 181915979 ns | 174018250 ns | 1.05
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 143480208 ns | 143516291 ns | 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 92143250 ns | 116009417 ns | 0.79
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5482249 ns | 5438117 ns | 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 614702333 ns | 617670521 ns | 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 582318312.5 ns | 555321542 ns | 1.05
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 456793479.5 ns | 453019437.5 ns | 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 623509562.5 ns | 637539146 ns | 0.98
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 38239772 ns | 34975009 ns | 1.09
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 796858958 ns | 654977875 ns | 1.22
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 687543333 ns | 666181396 ns | 1.03
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 619636833 ns | 629801020.5 ns | 0.98
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 745741417 ns | 742545875 ns | 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 62834 ns | 61500 ns | 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 47791 ns | 52500 ns | 0.91
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 53250 ns | 53125 ns | 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83083 ns | 85458 ns | 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 37226 ns | 37175.5 ns | 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1923354 ns | 1912375 ns | 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1992584 ns | 1971000 ns | 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1986708.5 ns | 1984958.5 ns | 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1895062.5 ns | 1907791.5 ns | 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 173492.5 ns | 173650 ns | 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 266916.5 ns | 285104 ns | 0.94
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 267354.5 ns | 265292 ns | 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 268666 ns | 267750 ns | 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 264979 ns | 266625 ns | 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 127720 ns | 130504 ns | 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 664125 ns | 686125 ns | 0.97
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 694604.5 ns | 704333 ns | 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 650292 ns | 683541.5 ns | 0.95
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 699958 ns | 663104 ns | 1.06
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 703429.5 ns | 717967 ns | 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2256583 ns | 2234292 ns | 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2246021 ns | 2244771 ns | 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2238750 ns | 2244750 ns | 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2261771 ns | 2241333.5 ns | 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 133355.5 ns | 133396.5 ns | 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5510583 ns | 5451812.5 ns | 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5590125 ns | 5487812.5 ns | 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5513333 ns | 5498042 ns | 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5481479.5 ns | 5562521 ns | 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 809354 ns | 754203 ns | 1.07
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 669750 ns | 685959 ns | 0.98
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 680333 ns | 670541 ns | 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 678166 ns | 666167 ns | 1.02
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 674417 ns | 680000 ns | 0.99
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 47532 ns | 46765 ns | 1.02
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1816770.5 ns | 1817416 ns | 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1665417 ns | 1716895.5 ns | 0.97
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1717645.5 ns | 1744292 ns | 0.98
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 2082542 ns | 2082750 ns | 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 226328.5 ns | 220971 ns | 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 70125 ns | 70125 ns | 1
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 59875 ns | 53125 ns | 1.13
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 52958 ns | 52708 ns | 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82666 ns | 84625 ns | 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 28600 ns | 28234 ns | 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2037917 ns | 2030854.5 ns | 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2108146 ns | 2081770.5 ns | 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2092292 ns | 2100958 ns | 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2001334 ns | 2007416 ns | 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 190920 ns | 188927 ns | 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 13460541.5 ns | 13472458 ns | 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 12543854 ns | 12508625 ns | 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 12654167 ns | 12582124.5 ns | 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 15261812.5 ns | 15073041.5 ns | 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 515830 ns | 512756.5 ns | 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 47280959 ns | 47011770.5 ns | 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 42008521 ns | 41636000 ns | 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 40839333.5 ns | 40969375 ns | 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 58419750 ns | 59058645.5 ns | 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 2897965 ns | 3033111.5 ns | 0.96
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 97048750 ns | 73891958 ns | 1.31
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 91157167 ns | 67845145.5 ns | 1.34
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 90856333.5 ns | 92214500 ns | 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 76444354 ns | 99774291.5 ns | 0.77
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 72334 ns | 71166.5 ns | 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 47292 ns | 64583 ns | 0.73
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 65375 ns | 65791 ns | 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82584 ns | 84792 ns | 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 46194 ns | 47424 ns | 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1929937 ns | 1905937.5 ns | 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1984583.5 ns | 1967666.5 ns | 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1983584 ns | 1977375 ns | 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1888750 ns | 1898333.5 ns | 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 189040 ns | 192864 ns | 0.98
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 417 ns | 292 ns | 1.43
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 292 ns | 1.28
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 416 ns | 0.90
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 416 ns | 0.70
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 32091 ns | 32583 ns | 0.98
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6541 ns | 6041 ns | 1.08
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6458 ns | 6125 ns | 1.05
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6583 ns | 6459 ns | 1.02
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 5958 ns | 6542 ns | 0.91
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 172109 ns | 172656.5 ns | 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 250 ns | 1.17
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 291 ns | 333 ns | 0.87
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 32297 ns | 32498 ns | 0.99
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 2917 ns | 2708 ns | 1.08
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 2834 ns | 2709 ns | 1.05
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 2875 ns | 2875 ns | 1
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 2625 ns | 2875 ns | 0.91
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 162091 ns | 162027.5 ns | 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 279890375 ns | 278479062 ns | 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 347812250 ns | 339860437.5 ns | 1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 310658166.5 ns | 309104833 ns | 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 261239625 ns | 282371084 ns | 0.93
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 7100472 ns | 7112114 ns | 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 994066791 ns | 997282375 ns | 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 960267958 ns | 939909542 ns | 1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 837209229.5 ns | 834322792 ns | 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 1129871667 ns | 1020744375 ns | 1.11
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 34010568 ns | 34065304 ns | 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 1752205958 ns | 1416221791.5 ns | 1.24
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 1693119292 ns | 1324822042 ns | 1.28
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 1650193041 ns | 1631228625 ns | 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 1306363020.5 ns | 1675762813 ns | 0.78
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1458375 ns | 1450812.5 ns | 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1463959 ns | 1456521 ns | 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1465625 ns | 1455333 ns | 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1459625 ns | 1460167 ns | 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 127763 ns | 127677 ns | 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5012416 ns | 5023459 ns | 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5066791 ns | 5018833 ns | 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5033750 ns | 5024791.5 ns | 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5030375 ns | 5045271 ns | 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 508105 ns | 588360 ns | 0.86
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) | 158175666 ns | 157992750 ns | 1.00
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) | 166759458.5 ns | 148446708 ns | 1.12
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) | 90721479 ns | 164732625 ns | 0.55
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) | 151859250 ns | 153538583.5 ns | 0.99
vgg16(32, 32, 3, 32)/forward/GPU/CUDA | 4851019 ns | 4886668 ns | 0.99
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) | 669929250 ns | 637312250 ns | 1.05
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) | 560789291 ns | 611560250 ns | 0.92
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) | 487588708 ns | 470585834 ns | 1.04
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) | 651112083 ns | 662978834 ns | 0.98
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA | 16518582 ns | 16094164 ns | 1.03
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 8927708.5 ns | 8954458 ns | 1.00
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 9111000 ns | 9014875 ns | 1.01
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 7978437.5 ns | 7941438 ns | 1.00
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 10091416 ns | 10320875 ns | 0.98
batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1611554 ns | 1593595 ns | 1.01
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 36693146 ns | 37088334 ns | 0.99
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 39523229 ns | 37925916.5 ns | 1.04
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 34135874.5 ns | 34179167 ns | 1.00
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 59280958 ns | 39118729 ns | 1.52
batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6506722 ns | 6471873.5 ns | 1.01
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 47437.5 ns | 47416 ns | 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 47500 ns | 47292 ns | 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 47542 ns | 47459 ns | 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 47292 ns | 47458 ns | 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 18176 ns | 18458 ns | 0.98
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 50417 ns | 50250 ns | 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 50458 ns | 50291 ns | 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 50708 ns | 50834 ns | 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 50333 ns | 50458 ns | 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 194955.5 ns | 188984 ns | 1.03
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 7145.5 ns | 6125 ns | 1.17
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 7292 ns | 6708 ns | 1.09
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 8084 ns | 7875 ns | 1.03
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6667 ns | 7042 ns | 0.95
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 93154.5 ns | 89761 ns | 1.04
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10417 ns | 9750 ns | 1.07
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10166 ns | 10125 ns | 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10250 ns | 10250 ns | 1
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9833 ns | 10541 ns | 0.93
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 540639.5 ns | 516571.5 ns | 1.05
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6333 ns | 5750 ns | 1.10
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6375 ns | 5958.5 ns | 1.07
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 7479.5 ns | 7417 ns | 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5166 ns | 6417 ns | 0.81
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 107337.5 ns | 106479.5 ns | 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 13709 ns | 12750 ns | 1.08
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 13000 ns | 13042 ns | 1.00
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13542 ns | 13291 ns | 1.02
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 13416.5 ns | 13270.5 ns | 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 470745 ns | 479931 ns | 0.98
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 1125 ns | 958 ns | 1.17
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 1041 ns | 959 ns | 1.09
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 1083 ns | 1125 ns | 0.96
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 1083 ns | 1083 ns | 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 32066 ns | 32924 ns | 0.97
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8333 ns | 7542 ns | 1.10
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8084 ns | 8000 ns | 1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8042 ns | 7958 ns | 1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7875 ns | 8250 ns | 0.95
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 196744 ns | 200265 ns | 0.98
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 23791.5 ns | 22875 ns | 1.04
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 23209 ns | 23041 ns | 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 23291 ns | 23917 ns | 0.97
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 23041 ns | 23208 ns | 0.99
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 18317 ns | 18525 ns | 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 52583 ns | 52208 ns | 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 52625 ns | 52583 ns | 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 52833 ns | 52625 ns | 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 52417 ns | 52542 ns | 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 274763.5 ns | 267460 ns | 1.03
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1458625 ns | 1451417 ns | 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1464021 ns | 1459084 ns | 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1466000 ns | 1459500 ns | 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1454708 ns | 1465416.5 ns | 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 195560 ns | 196174 ns | 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5020749.5 ns | 5014166.5 ns | 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5048500 ns | 5005062.5 ns | 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5032583 ns | 5014250 ns | 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5015271 ns | 5037250 ns | 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 577898 ns | 579761 ns | 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 3133854.5 ns | 3149500 ns | 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2152167 ns | 1975646 ns | 1.09
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2319584 ns | 2323562.5 ns | 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4994354 ns | 4912270.5 ns | 1.02
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 580258 ns | 583087.5 ns | 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 24444667 ns | 24421562.5 ns | 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 19072896 ns | 19801250.5 ns | 0.96
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 19040875 ns | 18967959 ns | 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 36840083 ns | 37230000 ns | 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 2865056 ns | 2963899 ns | 0.97
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 34088208 ns | 34154937.5 ns | 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 28581417 ns | 28340541 ns | 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 28009625 ns | 28271812.5 ns | 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 41680458.5 ns | 43122000 ns | 0.97
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 141268000 ns | 140810292 ns | 1.00
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 143350625 ns | 143457875 ns | 1.00
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 120743271 ns | 120969000 ns | 1.00
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 188129709 ns | 190332292 ns | 0.99
batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22550783 ns | 22567410 ns | 1.00
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 2324854792 ns | 1439193417 ns | 1.62
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 841095084 ns | 1035778354.5 ns | 0.81
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 1147318167 ns | 1029350563 ns | 1.11
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 833862645.5 ns | 847160583 ns | 0.98
batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 117903310 ns | 118590973 ns | 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 84125 ns | 72979 ns | 1.15
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 78250 ns | 72229.5 ns | 1.08
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 76312 ns | 75417 ns | 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 71667 ns | 73416.5 ns | 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 218484.5 ns | 210693.5 ns | 1.04
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 290458 ns | 296396 ns | 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 292000 ns | 283542 ns | 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 305208 ns | 309000 ns | 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 288208.5 ns | 282667 ns | 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1198254 ns | 1113011 ns | 1.08
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 35368791 ns | 35428583 ns | 1.00
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 36524083.5 ns | 35740146 ns | 1.02
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 31361542 ns | 31356458 ns | 1.00
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 38859354 ns | 39882791 ns | 0.97
batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5837777.5 ns | 5846172 ns | 1.00
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 148171584 ns | 148563000 ns | 1.00
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 157709333 ns | 152825542 ns | 1.03
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 137631188 ns | 135772750.5 ns | 1.01
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 150161812.5 ns | 153516333 ns | 0.98
batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 34862440.5 ns | 34902152 ns | 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 111509000 ns | 112450083 ns | 0.99
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 181918104.5 ns | 173734500 ns | 1.05
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 143432542 ns | 143024292 ns | 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 94189375.5 ns | 97164708 ns | 0.97
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5460850 ns | 5471199 ns | 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 497837834 ns | 468949292 ns | 1.06
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 512628166 ns | 523211021 ns | 0.98
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 440382167 ns | 440488146 ns | 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 678623500 ns | 623433833.5 ns | 1.09
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 35174507 ns | 32285967 ns | 1.09
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 644936208 ns | 800549541 ns | 0.81
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 676380021 ns | 656663541.5 ns | 1.03
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 603539166.5 ns | 567293062.5 ns | 1.06
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 727707084 ns | 735113417 ns | 0.99
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) | 1357667 ns | 1357292 ns | 1.00
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) | 795375 ns | 1006709 ns | 0.79
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) | 995750 ns | 993792 ns | 1.00
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) | 2104875 ns | 2076875 ns | 1.01
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA | 578115.5 ns | 574648.5 ns | 1.01
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) | 2829624.5 ns | 2981104 ns | 0.95
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) | 2513417 ns | 2614562.5 ns | 0.96
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) | 2616854 ns | 2632479 ns | 0.99
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) | 3785792 ns | 3749687.5 ns | 1.01
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA | 1750025 ns | 1705197 ns | 1.03
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) | 5815812.5 ns | 5826896 ns | 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) | 5906250 ns | 5792500 ns | 1.02
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) | 5802125 ns | 5792645.5 ns | 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) | 2884250 ns | 2968021 ns | 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 8084 ns | 8042 ns | 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6333 ns | 7000 ns | 0.90
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 7042 ns | 7042 ns | 1
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10541 ns | 10875 ns | 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 24821 ns | 24779 ns | 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 213645.5 ns | 212208 ns | 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 255312.5 ns | 233625 ns | 1.09
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 220667 ns | 220750 ns | 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 205667 ns | 209750 ns | 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 248152.5 ns | 246929 ns | 1.00
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) | 293659209 ns | 452114625 ns | 0.65
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) | 259757583 ns | 205741771 ns | 1.26
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) | 158085937.5 ns | 181027291.5 ns | 0.87
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) | 293331625 ns | 462543917 ns | 0.63
vgg16(32, 32, 3, 64)/forward/GPU/CUDA | 7676538.5 ns | 7673150.5 ns | 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) | 1087845916.5 ns | 1095771812.5 ns | 0.99
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) | 950749875 ns | 925308125 ns | 1.03
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) | 812442750 ns | 875879750 ns | 0.93
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) | 1143172250 ns | 1183196167 ns | 0.97
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA | 26848630 ns | 26783812 ns | 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5542 ns | 5125 ns | 1.08
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 6084 ns | 5312.5 ns | 1.15
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7020.5 ns | 6375 ns | 1.10
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 4958 ns | 6083 ns | 0.82
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 143609 ns | 143484 ns | 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7292 ns | 6875 ns | 1.06
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7542 ns | 7500 ns | 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7541 ns | 7583.5 ns | 0.99
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7042 ns | 7708 ns | 0.91
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 603161 ns | 569216 ns | 1.06
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 583 ns | 500 ns | 1.17
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 625 ns | 542 ns | 1.15
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 500 ns | 584 ns | 0.86
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 584 ns | 0.86
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 23552 ns | 23876 ns | 0.99
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9167 ns | 8584 ns | 1.07
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9250 ns | 8917 ns | 1.04
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9416 ns | 9500 ns | 0.99
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9084 ns | 9292 ns | 0.98
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 220124 ns | 202303 ns | 1.09
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 380958 ns | 352875 ns | 1.08
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 352792 ns | 382959 ns | 0.92
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 352625 ns | 352625 ns | 1
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 350834 ns | 351625 ns | 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA | 21325 ns | 21342 ns | 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 832062.5 ns | 776270.5 ns | 1.07
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 827458 ns | 810812.5 ns | 1.02
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 775062.5 ns | 775187.5 ns | 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 823188 ns | 827583.5 ns | 0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA | 278436.5 ns | 240060.5 ns | 1.16
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 335250 ns | 332770.5 ns | 1.01
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 327208 ns | 332583 ns | 0.98
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 451729 ns | 451459 ns | 1.00
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 12458 ns | 9959 ns | 1.25
batchedmm(16, Bsize=32)/forward/GPU/CUDA | 18024 ns | 18163 ns | 0.99
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 711291 ns | 714000 ns | 1.00
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 735541 ns | 727125 ns | 1.01
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 1004041 ns | 999833 ns | 1.00
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 26666 ns | 26625 ns | 1.00
batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 252208 ns | 238711 ns | 1.06
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 375354.5 ns | 374437 ns | 1.00
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 336667 ns | 347917 ns | 0.97
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 439084 ns | 440937.5 ns | 1.00
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 28875 ns | 28792 ns | 1.00
batchedmm(16, Bsize=128)/forward/GPU/CUDA | 22779 ns | 22488 ns | 1.01
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 720625 ns | 733000 ns | 0.98
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 804333.5 ns | 778479 ns | 1.03
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 1027667 ns | 1023541.5 ns | 1.00
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 104125 ns | 89875 ns | 1.16
batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 204288.5 ns | 205326 ns | 0.99
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 3500 ns | 3354.5 ns | 1.04
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 3833 ns | 3417 ns | 1.12
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 3750 ns | 3625 ns | 1.03
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 3334 ns | 3750 ns | 0.89
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 17589 ns | 17749 ns | 0.99
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 4208 ns | 4125 ns | 1.02
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 4375 ns | 4292 ns | 1.02
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 4583 ns | 4250 ns | 1.08
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 4375 ns | 4375 ns | 1
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 247283.5 ns | 235900.5 ns | 1.05
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 3625 ns | 3417 ns | 1.06
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 3917 ns | 4000 ns | 0.98
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 4667 ns | 4041 ns | 1.15
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 3729 ns | 4125 ns | 0.90
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 172481 ns | 174157.5 ns | 0.99
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8479.5 ns | 8042 ns | 1.05
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8750 ns | 8500 ns | 1.03
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8584 ns | 8125 ns | 1.06
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8500 ns | 8625 ns | 0.99
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 1132928 ns | 1076434 ns | 1.05
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 206959 ns | 207542 ns | 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 212916 ns | 213916 ns | 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 214834 ns | 212833 ns | 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 200708 ns | 202625 ns | 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 34121 ns | 34097 ns | 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 649583.5 ns | 601333 ns | 1.08
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 623333 ns | 633916.5 ns | 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 622250 ns | 621208 ns | 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 613479 ns | 582666.5 ns | 1.05
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 316951.5 ns | 291620 ns | 1.09
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 1236916 ns | 1245375 ns | 0.99
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 1300167 ns | 1251750 ns | 1.04
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 1184250 ns | 1177937.5 ns | 1.01
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 1155667 ns | 1207083 ns | 0.96
batchedmm(128, Bsize=128)/forward/GPU/CUDA | 206523 ns | 207232 ns | 1.00
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 4569500 ns | 4566750 ns | 1.00
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 4789500 ns | 4712249.5 ns | 1.02
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 4471334 ns | 4457500 ns | 1.00
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 4277000 ns | 4779979 ns | 0.89
batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 925498 ns | 927700.5 ns | 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3500 ns | 2958 ns | 1.18
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3708 ns | 3917 ns | 0.95
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 4500 ns | 3896 ns | 1.16
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3084 ns | 3833 ns | 0.80
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 207464.5 ns | 167597.5 ns | 1.24
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7542 ns | 7167 ns | 1.05
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7500 ns | 7708 ns | 0.97
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7458 ns | 7208 ns | 1.03
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6750 ns | 7459 ns | 0.90
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 942535 ns | 944745 ns | 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1661083 ns | 1646750 ns | 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1212459 ns | 1186708 ns | 1.02
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1388375 ns | 1375541.5 ns | 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2367291.5 ns | 2434792 ns | 0.97
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 214118 ns | 214131 ns | 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12379333 ns | 12360250 ns | 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9634187.5 ns | 9584833 ns | 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9303250.5 ns | 9257792 ns | 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 17994791.5 ns | 18118625 ns | 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 1954978 ns | 1941495.5 ns | 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17400125 ns | 17409917 ns | 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14391542 ns | 14369603.5 ns | 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14366500 ns | 14347521 ns | 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 20976166.5 ns | 21171916 ns | 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 134083 ns | 85209 ns | 1.57
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 134145.5 ns | 138875 ns | 0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 140125 ns | 134958 ns | 1.04
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 133834 ns | 132917 ns | 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 126509 ns | 125576 ns | 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2067833 ns | 2040229.5 ns | 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2021792 ns | 2026646 ns | 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2040375 ns | 2030000 ns | 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2038229.5 ns | 2046729 ns | 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 937192.5 ns | 954388.5 ns | 0.98
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 1250 ns | 1000 ns | 1.25
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 1542 ns | 1292 ns | 1.19
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 3500 ns | 1791 ns | 1.95
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 1041.5 ns | 1416 ns | 0.74
batchedmm(2, Bsize=4)/forward/GPU/CUDA | 16603 ns | 16301 ns | 1.02
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 2792 ns | 2458 ns | 1.14
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 2791 ns | 2583 ns | 1.08
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 2834 ns | 2792 ns | 1.02
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 2687.5 ns | 2875 ns | 0.93
batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 180632.5 ns | 180190.5 ns | 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 8084 ns | 8041 ns | 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6416 ns | 6959 ns | 0.92
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6916 ns | 7125 ns | 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10583 ns | 10833 ns | 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 33846 ns | 33324 ns | 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 224958 ns | 217125 ns | 1.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 230000 ns | 220125 ns | 1.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 220875 ns | 220542 ns | 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 206709 ns | 207145.5 ns | 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 304013 ns | 294304 ns | 1.03
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3708 ns | 3667 ns | 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3750 ns | 3666 ns | 1.02
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3667 ns | 3667 ns | 1
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3667 ns | 3750 ns | 0.98
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 22588 ns | 22268 ns | 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14625 ns | 14542 ns | 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14250 ns | 14458 ns | 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14625 ns | 14500 ns | 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14458 ns | 14250 ns | 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 437508.5 ns | 451646.5 ns | 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 145791 ns | 135084 ns | 1.08
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 141583 ns | 135167 ns | 1.05
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 142459 ns | 145833 ns | 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 141375.5 ns | 135771 ns | 1.04
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 125856 ns | 124920.5 ns | 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1928792 ns | 1931125 ns | 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1919959 ns | 1923875 ns | 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1933062.5 ns | 1933583.5 ns | 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1928146 ns | 1941584 ns | 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 879589 ns | 895888.5 ns | 0.98
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) | 870875 ns | 869083.5 ns | 1.00
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) | 819625 ns | 814146 ns | 1.01
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) | 1235083 ns | 1222709 ns | 1.01
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) | 966104.5 ns | 942729 ns | 1.02
lenet(28, 28, 1, 32)/forward/GPU/CUDA | 274565 ns | 269464 ns | 1.02
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) | 2825084 ns | 2833167 ns | 1.00
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) | 2525875 ns | 2528333.5 ns | 1.00
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) | 3358499.5 ns | 3338750 ns | 1.01
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) | 3396917 ns | 3399146 ns | 1.00
lenet(28, 28, 1, 32)/zygote/GPU/CUDA | 1523569.5 ns | 1538408 ns | 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 14750 ns | 20750 ns | 0.71
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 15375 ns | 15041.5 ns | 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 16875 ns | 16229.5 ns | 1.04
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 14958 ns | 14959 ns | 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 130066 ns | 129111.5 ns | 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 261458 ns | 215916 ns | 1.21
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 259875 ns | 229604.5 ns | 1.13
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 216042 ns | 215709 ns | 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 220500 ns | 224833 ns | 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 583170.5 ns | 586555.5 ns | 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 219729.5 ns | 219250 ns | 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 220479 ns | 220020.5 ns | 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 223250 ns | 222125 ns | 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 221354.5 ns | 219916 ns | 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 243425.5 ns | 244257 ns | 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 510541.5 ns | 529291.5 ns | 0.96
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 506917 ns | 509000 ns | 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 498667 ns | 509666 ns | 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 512312.5 ns | 509542 ns | 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1217935 ns | 1272897.5 ns | 0.96
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 3667 ns | 3125 ns | 1.17
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 4833 ns | 4500 ns | 1.07
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 5167 ns | 4542 ns | 1.14
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 3979.5 ns | 3959 ns | 1.01
batchedmm(16, Bsize=4)/forward/GPU/CUDA | 16933 ns | 16759 ns | 1.01
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 7625 ns | 7208 ns | 1.06
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 7375 ns | 7208 ns | 1.02
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 7333 ns | 7250 ns | 1.01
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 7250 ns | 7334 ns | 0.99
batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 178880.5 ns | 181468 ns | 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 18792 ns | 16792 ns | 1.12
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17542 ns | 17062.5 ns | 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 19791 ns | 17812.5 ns | 1.11
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 18125 ns | 17250 ns | 1.05
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 137690.5 ns | 134619.5 ns | 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 252708 ns | 211583 ns | 1.19
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 213500 ns | 213625 ns | 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 214541 ns | 212812.5 ns | 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 214000 ns | 213083 ns | 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 849873 ns | 895952 ns | 0.95
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 4229.5 ns | 3917 ns | 1.08
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 4916 ns | 4833 ns | 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5417 ns | 4625 ns | 1.17
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 4291.5 ns | 4625 ns | 0.93
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 168191 ns | 212453 ns | 0.79
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10750 ns | 10166 ns | 1.06
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 11042 ns | 10417 ns | 1.06
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10875 ns | 10875 ns | 1
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10042 ns | 10583 ns | 0.95
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 949351.5 ns | 992994 ns | 0.96
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3292 ns | 3125 ns | 1.05
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3625 ns | 3709 ns | 0.98
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 4167 ns | 4750 ns | 0.88
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3459 ns | 3916 ns | 0.88
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 206120 ns | 212054.5 ns | 0.97
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7750 ns | 7083 ns | 1.09
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7791 ns | 7125 ns | 1.09
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7666 ns | 7583 ns | 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7292 ns | 7500 ns | 0.97
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 985137 ns | 1004688 ns | 0.98
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 23600875 ns | 23464771 ns | 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 43903313 ns | 35060375 ns | 1.25
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 37710791.5 ns | 37779167 ns | 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 34490521 ns | 34969333 ns | 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1837956 ns | 1848833 ns | 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 191551625 ns | 184464833.5 ns | 1.04
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 186643917 ns | 160073583.5 ns | 1.17
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 145792667 ns | 145086500 ns | 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 271888584 ns | 445100854 ns | 0.61
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 16496336 ns | 16527443 ns | 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 292672562 ns | 271288729 ns | 1.08
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 266647854 ns | 263438959 ns | 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 299377291.5 ns | 302324416 ns | 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 325821396 ns | 496832583.5 ns | 0.66
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 184041 ns | 181417 ns | 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 182292 ns | 185458 ns | 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 184917 ns | 185750 ns | 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 183667 ns | 181708 ns | 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 199516.5 ns | 193313 ns | 1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 632125 ns | 589438 ns | 1.07
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 596250 ns | 631229 ns | 0.94
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 589146 ns | 598125 ns | 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 634646 ns | 590687.5 ns | 1.07
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 950274 ns | 966959 ns | 0.98
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 3923584 ns | 3877125 ns | 1.01
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 4065250 ns | 3946625 ns | 1.03
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 3605250 ns | 3651083.5 ns | 0.99
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 4910271 ns | 5012833.5 ns | 0.98
batchedmm(128, Bsize=512)/forward/GPU/CUDA | 551654.5 ns | 530368 ns | 1.04
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 16427166.5 ns | 17988625 ns | 0.91
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 17546270.5 ns | 18469458 ns | 0.95
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 15424750 ns | 17328979.5 ns | 0.89
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 41363334 ns | 20374792 ns | 2.03
batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2634322 ns | 2619767.5 ns | 1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 583 ns | 500 ns | 1.17
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 583 ns | 500 ns | 1.17
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 542 ns | 583 ns | 0.93
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 625 ns | 584 ns | 1.07
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 32132 ns | 32351 ns | 0.99
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9500 ns | 9041 ns | 1.05
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9500 ns | 9541.5 ns | 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9792 ns | 9833 ns | 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9541 ns | 9500 ns | 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 247547.5 ns | 247867.5 ns | 1.00
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) | 513820542 ns | 498558729 ns | 1.03
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) | 535432083 ns | 468495750 ns | 1.14
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) | 355647999.5 ns | 362160229 ns | 0.98
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) | 672007125 ns | 607173041 ns | 1.11
vgg16(32, 32, 3, 128)/forward/GPU/CUDA | 12475563.5 ns | 12482436 ns | 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) | 1968156417 ns | 1885912604.5 ns | 1.04
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) | 1778975000 ns | 1633604541 ns | 1.09
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) | 1508167229 ns | 1504714375 ns | 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) | 2144133562.5 ns | 2155903916.5 ns | 0.99
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA | 49346209.5 ns | 49283559 ns | 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1659562.5 ns | 1664666.5 ns | 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1222625 ns | 1200396 ns | 1.02
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1402292 ns | 1387542 ns | 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2420750 ns | 2441166 ns | 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 214908 ns | 216027 ns | 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12714958 ns | 12783813 ns | 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 10033625 ns | 9969333 ns | 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9669250 ns | 9630041 ns | 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18444395.5 ns | 18564625 ns | 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2014155.5 ns | 2024417 ns | 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17720021 ns | 17729000 ns | 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14836625 ns | 14689833 ns | 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14593959 ns | 14572562.5 ns | 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21470916.5 ns | 21460792 ns | 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 26208 ns | 26167 ns | 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 26291 ns | 26167 ns | 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 26208 ns | 26334 ns | 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 26250 ns | 26250 ns | 1
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 24129 ns | 24291 ns | 0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 67292 ns | 67375 ns | 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 67166 ns | 66792 ns | 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 67437.5 ns | 67250 ns | 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 67125 ns | 66916 ns | 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 370580.5 ns | 376851.5 ns | 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 206334 ns | 206292 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 212084 ns | 213042 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 211708 ns | 212292 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 200042 ns | 200542 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 25933.5 ns | 25875 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 652229 ns | 608438 ns | 1.07
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 673167 ns | 631687.5 ns | 1.07
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 623750.5 ns | 622729.5 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 594625 ns | 592459 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 321589 ns | 328754.5 ns | 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 689375 ns | 702583 ns | 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 686646 ns | 644542 ns | 1.07
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 603125.5 ns | 631083 ns | 0.96
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 595854 ns | 682250 ns | 0.87
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 131463 ns | 131950 ns | 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2275292 ns | 2262083 ns | 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2318250 ns | 2242917 ns | 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2234167 ns | 2231125 ns | 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2258041 ns | 2307979 ns | 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1083174.5 ns | 1167364 ns | 0.93
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17208 ns | 17125 ns | 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 16708.5 ns | 20083 ns | 0.83
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 18542 ns | 18791 ns | 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 26209 ns | 18041.5 ns | 1.45
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 133722.5 ns | 132602 ns | 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 233041 ns | 229500 ns | 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 238708 ns | 218833 ns | 1.09
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 220895.5 ns | 219792 ns | 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 247479 ns | 230333.5 ns | 1.07
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 889518 ns | 967555 ns | 0.92
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 625 ns | 500 ns | 1.25
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 625 ns | 542 ns | 1.15
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 584 ns | 1.07
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 583 ns | 625 ns | 0.93
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 23455 ns | 23714 ns | 0.99
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9917 ns | 9417 ns | 1.05
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9916.5 ns | 9833 ns | 1.01
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 10166 ns | 9875 ns | 1.03
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9709 ns | 9833 ns | 0.99
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 243499.5 ns | 247044.5 ns | 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5375 ns | 5209 ns | 1.03
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5667 ns | 5812.5 ns | 0.97
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7208 ns | 6812.5 ns | 1.06
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5166 ns | 5916.5 ns | 0.87
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 195501 ns | 211718.5 ns | 0.92
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7750 ns | 7084 ns | 1.09
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7417 ns | 7459 ns | 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7750 ns | 7667 ns | 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7208 ns | 7500 ns | 0.96
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 713265 ns | 739090.5 ns | 0.97
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 2459 ns | 1917 ns | 1.28
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 2375 ns | 2208 ns | 1.08
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 2250 ns | 2250 ns | 1
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 2042 ns | 2250 ns | 0.91
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 17937 ns | 18219 ns | 0.98
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 6667 ns | 6292 ns | 1.06
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 6625 ns | 6417 ns | 1.03
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 6667 ns | 6729.5 ns | 0.99
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 6500 ns | 6584 ns | 0.99
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 295613.5 ns | 307391 ns | 0.96
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s)
781188
ns749208
ns1.04
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s)
762250
ns748625
ns1.02
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s)
746542
ns746500
ns1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s)
746084
ns748625
ns1.00
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA
21171
ns21224.5
ns1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s)
815833
ns803167
ns1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s)
816958.5
ns792833
ns1.03
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s)
775937.5
ns792834
ns0.98
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s)
810604.5
ns813166
ns1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA
269191.5
ns271736
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
8042
ns8125
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
6417
ns7583
ns0.85
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
6958
ns6959
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
10625
ns10917
ns0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
33169.5
ns32567.5
ns1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
265000
ns232666
ns1.14
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
268728.5
ns240625
ns1.12
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
229125
ns227604
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
217229
ns258125
ns0.84
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
330500.5
ns333854
ns0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
10375
ns9959
ns1.04
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
10646
ns10709
ns0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
11292
ns10833
ns1.04
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
10125
ns10271
ns0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA
217068
ns226295
ns0.96
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
24542
ns24167
ns1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
25354.5
ns24729.5
ns1.03
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
25500
ns24417
ns1.04
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
24333
ns25354.5
ns0.96
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA
1027279
ns1051998
ns0.98
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s)
106479791.5
ns106630458.5
ns1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s)
126041750
ns117910875
ns1.07
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s)
120943833
ns120489750
ns1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s)
117512916.5
ns117867166.5
ns1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA
2638425
ns2630839
ns1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s)
384219250
ns375572750
ns1.02
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s)
372791166.5
ns347200750
ns1.07
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s)
338002625
ns370237167
ns0.91
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s)
471273750
ns484151625
ns0.97
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA
15225519
ns15207487.5
ns1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s)
803612958.5
ns607408041
ns1.32
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s)
771462084
ns591624416
ns1.30
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s)
812264500
ns811424250
ns1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s)
607987313
ns961849167
ns0.63
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s)
7042
ns6834
ns1.03
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s)
7208
ns6708
ns1.07
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s)
8166.5
ns8041
ns1.02
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s)
6750
ns7354
ns0.92
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA
210285.5
ns213896
ns0.98
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
14333
ns14000
ns1.02
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
14750
ns15125
ns0.98
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
14000
ns14458
ns0.97
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
13792
ns13666
ns1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA
971185
ns993505
ns0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s)
6209
ns5958
ns1.04
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s)
6417
ns6145.5
ns1.04
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s)
7500
ns7458
ns1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s)
5792
ns6312.5
ns0.92
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA
204318
ns209272
ns0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
12875
ns12500
ns1.03
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
12583
ns12625
ns1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
13125
ns13250
ns0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
12333
ns12250
ns1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA
695362.5
ns719970
ns0.97
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s)
5042
ns5000
ns1.01
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s)
5625
ns5667
ns0.99
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s)
6250
ns5500
ns1.14
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s)
5958
ns5458
ns1.09
batchedmm(2, Bsize=128)/forward/GPU/CUDA
17261
ns17137
ns1.01
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s)
15666
ns15083
ns1.04
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s)
15709
ns15459
ns1.02
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s)
15583
ns15458
ns1.01
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s)
15458
ns15583
ns0.99
batchedmm(2, Bsize=128)/zygote/GPU/CUDA
184710.5
ns185445
ns1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s)
417
ns292
ns1.43
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s)
375
ns292
ns1.28
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s)
333
ns375
ns0.89
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s)
292
ns375
ns0.78
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA
23615
ns23381
ns1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
6542
ns6291
ns1.04
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
6542
ns6334
ns1.03
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
6542
ns6520.5
ns1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
6208
ns6541
ns0.95
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA
226664.5
ns227150.5
ns1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
5875
ns5750
ns1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
5916
ns5792
ns1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
5833
ns5834
ns1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
5834
ns5875
ns0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
24395
ns24282
ns1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
22729
ns23416.5
ns0.97
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
21625
ns20542
ns1.05
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
21667
ns21292
ns1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
20854.5
ns21416
ns0.97
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
248686
ns249310.5
ns1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
192437
ns192603.5
ns1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
194875
ns190208
ns1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
190958
ns187125
ns1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
198042
ns189437.5
ns1.05
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
166753.5
ns167056.5
ns1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1364250
ns1339333.5
ns1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1373333.5
ns1319750.5
ns1.04
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1330458
ns1298333
ns1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1326229.5
ns1349625
ns0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
1188373.5
ns1248940
ns0.95
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
23125
ns22188
ns1.04
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
23000
ns22167
ns1.04
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
24041
ns23250
ns1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
21667
ns30833
ns0.70
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
256526
ns318042
ns0.81
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
131208
ns175104
ns0.75
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
183125.5
ns129354
ns1.42
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
118667
ns147250
ns0.81
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
180917
ns180250
ns1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
1281592
ns1355497.5
ns0.95
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
375
ns292
ns1.28
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
375
ns292
ns1.28
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
375
ns375
ns1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
333
ns375
ns0.89
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
23301
ns23100
ns1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
6833
ns6167
ns1.11
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
6667
ns6416
ns1.04
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
6833
ns6583
ns1.04
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
6417
ns6583
ns0.97
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
241136.5
ns245385
ns0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
4542
ns4208
ns1.08
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
5229.5
ns4625
ns1.13
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
5125
ns4833
ns1.06
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
4666
ns4708
ns0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA
219000.5
ns232572
ns0.94
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
10334
ns9583
ns1.08
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
10625
ns10020.5
ns1.06
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
10375
ns9791
ns1.06
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
10375
ns10291.5
ns1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA
1232626.5
ns1286978.5
ns0.96
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s)
1625
ns1583
ns1.03
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s)
1625
ns1625
ns1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s)
1625
ns1584
ns1.03
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s)
1625
ns1667
ns0.97
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA
23375
ns23645
ns0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s)
6042
ns5708
ns1.06
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s)
6000
ns5750
ns1.04
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s)
5959
ns5667
ns1.05
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s)
5625
ns5667
ns0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA
260915.5
ns263109.5
ns0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s)
6837750
ns6835750
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s)
6418708
ns6400459
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s)
6547416.5
ns6536604
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s)
7628667
ns7672542
ns0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA
214982
ns215618
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s)
24126020.5
ns24116958
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s)
21396208
ns21263041
ns1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s)
20992000
ns20976375
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s)
29707541
ns29871542
ns0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA
2104096.5
ns2094351.5
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s)
48614958
ns37551959
ns1.29
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s)
45739708
ns34396208.5
ns1.33
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s)
45440458
ns45713375
ns0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s)
38260167
ns49651167
ns0.77
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s)
5917
ns5583
ns1.06
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s)
6083
ns6250
ns0.97
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s)
7041
ns6625
ns1.06
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s)
5708
ns6625
ns0.86
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA
205307.5
ns210693
ns0.97
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
8583
ns8166
ns1.05
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
8959
ns9000
ns1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
8417
ns8625
ns0.98
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
8125
ns8500
ns0.96
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA
959724.5
ns993726
ns0.97
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s)
1564625
ns1570250
ns1.00
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s)
1276958
ns1273479
ns1.00
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s)
1632792
ns1626896
ns1.00
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s)
2147187.5
ns2142333
ns1.00
lenet(28, 28, 1, 128)/forward/GPU/CUDA
276902.5
ns271789
ns1.02
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s)
7938667
ns7954709
ns1.00
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s)
6675417
ns6282562.5
ns1.06
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s)
7179229.5
ns7141958
ns1.01
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s)
10466792
ns10525875
ns0.99
lenet(28, 28, 1, 128)/zygote/GPU/CUDA
1755348
ns1760839.5
ns1.00
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s)
375979.5
ns377437.5
ns1.00
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s)
356791.5
ns378125
ns0.94
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s)
453958
ns450292
ns1.01
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s)
31791.5
ns30500
ns1.04
batchedmm(128, Bsize=4)/forward/GPU/CUDA
47221
ns42718
ns1.11
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s)
724250
ns743209
ns0.97
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s)
820708
ns790458
ns1.04
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s)
1064167
ns1051750
ns1.01
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s)
93125
ns123333
ns0.76
batchedmm(128, Bsize=4)/zygote/GPU/CUDA
222768.5
ns280362
ns0.79
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s)
413500
ns415750
ns0.99
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s)
220417
ns305875
ns0.72
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s)
305958
ns306125
ns1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s)
758417
ns757167
ns1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA
44850
ns44026.5
ns1.02
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s)
664291
ns662333
ns1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s)
464750
ns523625
ns0.89
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s)
524625
ns524208
ns1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s)
971875
ns973917
ns1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA
190748
ns188149
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
660125
ns698417
ns0.95
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
688833
ns669875
ns1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
599208.5
ns674375
ns0.89
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
676041
ns683041.5
ns0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA
131585
ns131691
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2465396
ns2527000
ns0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2549750
ns2445791.5
ns1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2454750
ns2456458.5
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
2436396
ns2515459
ns0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
1099429.5
ns1199048
ns0.92
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s)
2084
ns1917
ns1.09
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s)
2500
ns2041.5
ns1.22
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s)
4584
ns2459
ns1.86
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s)
2000
ns2437.5
ns0.82
batchedmm(2, Bsize=32)/forward/GPU/CUDA
16017
ns16312
ns0.98
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s)
5541
ns5208
ns1.06
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s)
5625
ns5500
ns1.02
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s)
5541
ns5625
ns0.99
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s)
5459
ns5479.5
ns1.00
batchedmm(2, Bsize=32)/zygote/GPU/CUDA
183422
ns184945
ns0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
1479917
ns1481291
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
1515750
ns1524125
ns0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
1523083
ns1521750
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
1448834
ns1447604.5
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA
39978
ns39655
ns1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
5170937.5
ns5139771
ns1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
5319792
ns5014250
ns1.06
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
5296208
ns5294625
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
4989229.5
ns5015729.5
ns0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
195522
ns194949
ns1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s)
3667
ns3667
ns1
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s)
3750
ns3625
ns1.03
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s)
3666
ns3625
ns1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s)
3666
ns3750
ns0.98
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA
34698
ns33334
ns1.04
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s)
15458
ns15291
ns1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s)
15292
ns15083
ns1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s)
15500
ns15292
ns1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s)
15250
ns15167
ns1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA
348167
ns349359.5
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s)
96375
ns94542
ns1.02
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s)
104834
ns103166
ns1.02
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s)
94000
ns103209
ns0.91
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s)
92875
ns95625
ns0.97
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA
113764.5
ns113041.5
ns1.01
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s)
319291
ns318084
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s)
326792
ns316917
ns1.03
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s)
317083
ns316666
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s)
317375
ns321750
ns0.99
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA
196865
ns192326
ns1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
1083
ns958
ns1.13
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
1042
ns959
ns1.09
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
1042
ns1083
ns0.96
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
1000
ns1083
ns0.92
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA
23731
ns23389
ns1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
8458
ns7708
ns1.10
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
8167
ns7916
ns1.03
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
8500
ns7959
ns1.07
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
8000
ns8270.5
ns0.97
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA
250818
ns246988.5
ns1.02
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s)
536458.5
ns534875
ns1.00
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s)
514770.5
ns514875
ns1.00
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s)
583167
ns572375
ns1.02
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s)
177291.5
ns256145.5
ns0.69
batchedmm(128, Bsize=32)/forward/GPU/CUDA
128802.5
ns129558.5
ns0.99
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s)
1430708
ns1420041.5
ns1.01
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s)
1491625
ns1466708.5
ns1.02
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s)
1790583
ns1756250
ns1.02
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s)
862187.5
ns902625
ns0.96
batchedmm(128, Bsize=32)/zygote/GPU/CUDA
274040.5
ns276092.5
ns0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
375
ns333
ns1.13
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
375
ns292
ns1.28
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
375
ns334
ns1.12
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
375
ns375
ns1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA
32562
ns31832
ns1.02
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
6750
ns6084
ns1.11
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
6458
ns6542
ns0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
6666
ns6292
ns1.06
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
6584
ns6292
ns1.05
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA
252956.5
ns248681.5
ns1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
1721104
ns1729313
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
1775187.5
ns1725667
ns1.03
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
1796833.5
ns1769167
ns1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
1760583
ns1772187.5
ns0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
168117
ns168168
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
4395271
ns4416792
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
4422959
ns4351145.5
ns1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
4375792
ns4368958
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
4339937.5
ns4403479.5
ns0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
1159629.5
ns1091804.5
ns1.06
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s)
16708.5
ns7041.5
ns2.37
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s)
7042
ns7333
ns0.96
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s)
8000
ns7375
ns1.08
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s)
7125
ns7375
ns0.97
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA
19970
ns20581
ns0.97
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s)
52520.5
ns32334
ns1.62
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s)
74791
ns62021
ns1.21
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s)
33083
ns33333
ns0.99
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s)
43000
ns71833
ns0.60
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA
199671
ns196104.5
ns1.02
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s)
17333
ns17208
ns1.01
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s)
17875
ns17520.5
ns1.02
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s)
18229.5
ns17875
ns1.02
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s)
17708
ns17459
ns1.01
batchedmm(2, Bsize=512)/forward/GPU/CUDA
18853
ns18509
ns1.02
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s)
53541.5
ns52875
ns1.01
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s)
53500
ns53625
ns1.00
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s)
53500
ns53541
ns1.00
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s)
53542
ns53084
ns1.01
batchedmm(2, Bsize=512)/zygote/GPU/CUDA
320380
ns318108.5
ns1.01
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s)
102541.5
ns104959
ns0.98
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s)
109541
ns107334
ns1.02
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s)
99500
ns107250
ns0.93
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s)
97875
ns101250
ns0.97
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA
47141
ns46996
ns1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s)
328250
ns324500
ns1.01
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s)
333084
ns325958
ns1.02
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s)
324125
ns323083
ns1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s)
324041
ns327500
ns0.99
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA
213969
ns208617.5
ns1.03
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
1504750
ns1506583
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
1541208
ns1549708
ns0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
1549666
ns1549292
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
1472416.5
ns1480958
ns0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
52318
ns51270
ns1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
5156854.5
ns5143666.5
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
5311833
ns5297771
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
5311062.5
ns5293084
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
4595917
ns5004625.5
ns0.92
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
202301
ns201935.5
ns1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s)
28250
ns28125
ns1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s)
28250
ns28167
ns1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s)
28125
ns28187.5
ns1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s)
28208
ns28208
ns1
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA
24301
ns24383
ns1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s)
66917
ns66666.5
ns1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s)
66542
ns66333
ns1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s)
67750
ns66459
ns1.02
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s)
66500
ns66292
ns1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA
492894
ns489192
ns1.01
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s)
1505459
ns1485833
ns1.01
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s)
959542
ns1144729
ns0.84
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s)
1085458.5
ns1129875
ns0.96
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s)
2196437.5
ns2267333
ns0.97
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA
576585.5
ns580996.5
ns0.99
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s)
3106250
ns3110979
ns1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s)
2641667
ns2747916.5
ns0.96
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s)
2753084
ns2752750
ns1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s)
3807583
ns3882333
ns0.98
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA
1958924
ns1989937
ns0.98
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s)
7926875
ns7919834
ns1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s)
8046333.5
ns7899375
ns1.02
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s)
7926812.5
ns7923709
ns1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s)
4419125
ns4904167
ns0.90
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
134333
ns77917
ns1.72
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
140333
ns139667
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
135750
ns140875
ns0.96
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
136000
ns133958
ns1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
193966.5
ns193313
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2042250
ns2016625
ns1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2053604
ns2021791
ns1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2031125
ns2024750
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
2012625
ns2026750
ns0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
738348.5
ns747334.5
ns0.99
This comment was automatically generated by workflow using github-action-benchmark.