test: try re-enabling enzyme testing on 0.13.16 #1042
Benchmark Results (ASV)
Benchmark Plots
A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
Need to also re-enable some of the tests manually in LuxLib.
Lux Benchmarks
Benchmark suite | Current: a08903d | Previous: cb0900f | Ratio |
---|---|---|---|
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4000 ns | 3875 ns | 1.03 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 4416 ns | 4375 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 5208 ns | 5083 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 4250 ns | 4208 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 60702.5 ns | 60144 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 10334 ns | 10625 ns | 0.97 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 10417 ns | 10666 ns | 0.98 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 11583 ns | 11375 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 10333 ns | 10334 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 429548.5 ns | 421452 ns | 1.02 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 1208 ns | 1250 ns | 0.97 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 1209 ns | 1292 ns | 0.94 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 1334 ns | 1250 ns | 1.07 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 1125 ns | 1167 ns | 0.96 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA | 18619 ns | 18149 ns | 1.03 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 4041 ns | 4167 ns | 0.97 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 4208 ns | 4042 ns | 1.04 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 4250 ns | 4292 ns | 0.99 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 4083 ns | 3625 ns | 1.13 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA | 110443.5 ns | 109548 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57250 ns | 56166 ns | 1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 38333 ns | 46709 ns | 0.82 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47167 ns | 46334 ns | 1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82042 ns | 82291 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 37058 ns | 37127 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2029124.5 ns | 2031334 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2097750 ns | 2096166.5 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2114167 ns | 2086458 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1991603.5 ns | 1997167 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 195005 ns | 197158.5 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 143875 ns | 143042 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 153625 ns | 145583.5 ns | 1.06 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 146250 ns | 146709 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 144042 ns | 149500 ns | 0.96 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 167064 ns | 166231 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1115083.5 ns | 1138708.5 ns | 0.98 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1145062.5 ns | 1128583 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1143000 ns | 1062083.5 ns | 1.08 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1111770.5 ns | 1115041.5 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 519176.5 ns | 530934 ns | 0.98 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 3875 ns | 3125 ns | 1.24 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 3792 ns | 3458 ns | 1.10 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 4396 ns | 4292 ns | 1.02 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 3500 ns | 3375 ns | 1.04 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 70922 ns | 70464 ns | 1.01 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8542 ns | 9208 ns | 0.93 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9375 ns | 8917 ns | 1.05 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9833 ns | 9125 ns | 1.08 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9041 ns | 9166 ns | 0.99 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 493056 ns | 483194.5 ns | 1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 16666 ns | 15333 ns | 1.09 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 15625 ns | 15458 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 18708 ns | 17333 ns | 1.08 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 14958 ns | 17062.5 ns | 0.88 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 53797 ns | 53962 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 212125 ns | 214583.5 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 219770.5 ns | 212667 ns | 1.03 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 215792 ns | 214625 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 212625 ns | 225250 ns | 0.94 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 271057 ns | 273370 ns | 0.99 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 625 ns | 458 ns | 1.36 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 583 ns | 666 ns | 0.88 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 708 ns | 750 ns | 0.94 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 541 ns | 500 ns | 1.08 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA | 17646 ns | 17502.5 ns | 1.01 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 1375 ns | 1542 ns | 0.89 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 1625 ns | 1667 ns | 0.97 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 1792 ns | 1834 ns | 0.98 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 1625 ns | 1375 ns | 1.18 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA | 103036.5 ns | 101667.5 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7083 ns | 7125 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5167 ns | 5917 ns | 0.87 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6041 ns | 5792 ns | 1.04 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 9958 ns | 9917 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 23852 ns | 23886 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 220542 ns | 221417 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 231917 ns | 228125 ns | 1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 229333 ns | 228666 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 214208 ns | 220500 ns | 0.97 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 169922.5 ns | 169891 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 3917 ns | 3958 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 3917 ns | 3958 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 3917 ns | 3875 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 3875 ns | 3875 ns | 1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA | 23947.5 ns | 23537 ns | 1.02 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16792 ns | 16750 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16500 ns | 17042 ns | 0.97 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 17000 ns | 16875 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16708 ns | 16750 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA | 163864 ns | 159725 ns | 1.03 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 570750 ns | 570333 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 578959 ns | 574000 ns | 1.01 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 571125 ns | 579125 ns | 0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 578875 ns | 571125 ns | 1.01 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA | 113126.5 ns | 113492 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1421042 ns | 1428041 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1426625 ns | 1422333 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1423792 ns | 1423708 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 1424687.5 ns | 1423458 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA | 212529 ns | 208571.5 ns | 1.02 |
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) | 1084083 ns | 1051187.5 ns | 1.03 |
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) | 945500 ns | 971896 ns | 0.97 |
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) | 1343979.5 ns | 1346062.5 ns | 1.00 |
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) | 1294145.5 ns | 1306416 ns | 0.99 |
lenet(28, 28, 1, 64)/forward/GPU/CUDA | 269871.5 ns | 272301 ns | 0.99 |
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) | 5730416.5 ns | 5990916 ns | 0.96 |
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) | 4638750 ns | 4519875 ns | 1.03 |
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) | 4949333 ns | 4948416.5 ns | 1.00 |
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) | 5515291 ns | 5523125 ns | 1.00 |
lenet(28, 28, 1, 64)/zygote/GPU/CUDA | 1069949 ns | 1070952 ns | 1.00 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 542 ns | 500 ns | 1.08 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 500 ns | 542 ns | 0.92 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 542 ns | 583 ns | 0.93 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 500 ns | 542 ns | 0.92 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA | 23814 ns | 23553 ns | 1.01 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2125 ns | 2084 ns | 1.02 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2084 ns | 2167 ns | 0.96 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 2208 ns | 2167 ns | 1.02 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2083 ns | 2125 ns | 0.98 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA | 171935.5 ns | 168963.5 ns | 1.02 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 3833 ns | 3875 ns | 0.99 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 4458 ns | 4167 ns | 1.07 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5292 ns | 5250 ns | 1.01 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 3792 ns | 3666 ns | 1.03 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 66284.5 ns | 65091 ns | 1.02 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10958 ns | 11416 ns | 0.96 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 11750 ns | 11292 ns | 1.04 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 12208 ns | 12333.5 ns | 0.99 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 11000 ns | 11209 ns | 0.98 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 453228 ns | 446962.5 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 6792 ns | 6458.5 ns | 1.05 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 6916.5 ns | 6792 ns | 1.02 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8584 ns | 7833.5 ns | 1.10 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 6334 ns | 6250 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 51674 ns | 52555 ns | 0.98 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 16709 ns | 16584 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 16645.5 ns | 17791 ns | 0.94 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 18125 ns | 17375 ns | 1.04 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 16875 ns | 17125 ns | 0.99 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 300225 ns | 308634 ns | 0.97 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 583 ns | 625 ns | 0.93 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 542 ns | 666 ns | 0.81 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 708 ns | 583 ns | 1.21 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 583 ns | 625 ns | 0.93 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 32510 ns | 32320 ns | 1.01 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 8458 ns | 8541 ns | 0.99 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 8542 ns | 9167 ns | 0.93 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9250 ns | 9500 ns | 0.97 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 8292 ns | 9479.5 ns | 0.87 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 157936.5 ns | 159616 ns | 0.99 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 64542 ns | 64750 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 64833 ns | 64625 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 64709 ns | 64292 ns | 1.01 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 64542 ns | 64542 ns | 1 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA | 112820 ns | 111041.5 ns | 1.02 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 279459 ns | 292000 ns | 0.96 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 296687.5 ns | 292084 ns | 1.02 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 274459 ns | 275666 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 278291.5 ns | 275708 ns | 1.01 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA | 186750 ns | 183441 ns | 1.02 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) | 3282834 ns | 3191791 ns | 1.03 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) | 2900062.5 ns | 3043437.5 ns | 0.95 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) | 3017792 ns | 3020437.5 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) | 3941042 ns | 4089708 ns | 0.96 |
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA | 576338 ns | 601857 ns | 0.96 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) | 7618104 ns | 7582625 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) | 7355500 ns | 7473208.5 ns | 0.98 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) | 7445792 ns | 7437833 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) | 8201396.5 ns | 8187292 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA | 1331422 ns | 1317154 ns | 1.01 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) | 17547125 ns | 18957000 ns | 0.93 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) | 17608979 ns | 19047250 ns | 0.92 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) | 17586250 ns | 19104542 ns | 0.92 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) | 14133271.5 ns | 15686625 ns | 0.90 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 23678375 ns | 23902625 ns | 0.99 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 42806708 ns | 34420458 ns | 1.24 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 36969687.5 ns | 37002333 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 35186500 ns | 34848770.5 ns | 1.01 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1841334 ns | 1857006 ns | 0.99 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 188420708 ns | 191696375.5 ns | 0.98 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 249825542 ns | 164341792 ns | 1.52 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 193973479.5 ns | 152698167 ns | 1.27 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 433364292 ns | 439655916 ns | 0.99 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 13933807 ns | 13895377 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 288423500 ns | 292126520.5 ns | 0.99 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 353166937.5 ns | 340023312 ns | 1.04 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 295955209 ns | 298857875 ns | 0.99 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 394124521 ns | 335240875 ns | 1.18 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 21583.5 ns | 22250 ns | 0.97 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 22437.5 ns | 23083 ns | 0.97 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 23542 ns | 23959 ns | 0.98 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 22083 ns | 23417 ns | 0.94 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 95875 ns | 96101 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 102916 ns | 103542 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 104125 ns | 103541 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 104875 ns | 104791 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 104084 ns | 113250 ns | 0.92 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 499046 ns | 512131 ns | 0.97 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 5958 ns | 5834 ns | 1.02 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 6500 ns | 6375 ns | 1.02 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 7000 ns | 7000 ns | 1 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 6083.5 ns | 6125 ns | 0.99 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 68801.5 ns | 68297.5 ns | 1.01 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14916 ns | 15208 ns | 0.98 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 16333 ns | 15750 ns | 1.04 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 16042 ns | 16583 ns | 0.97 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14916 ns | 15062.5 ns | 0.99 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 478630.5 ns | 474148.5 ns | 1.01 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 3017666.5 ns | 3053958 ns | 0.99 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2067937 ns | 2089500 ns | 0.99 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2280833 ns | 2270042 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4862417 ns | 4804875 ns | 1.01 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 582819 ns | 582756 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 23500708 ns | 23872458.5 ns | 0.98 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 18333729.5 ns | 18056937.5 ns | 1.02 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 18038062.5 ns | 17766021 ns | 1.02 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 35525604 ns | 35515208 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3102887 ns | 3103295.5 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 33341708 ns | 33801000 ns | 0.99 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 28040562.5 ns | 27630916.5 ns | 1.01 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 28530791.5 ns | 27435750 ns | 1.04 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 41211375 ns | 41597458 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 71250 ns | 74917 ns | 0.95 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 81791 ns | 72541 ns | 1.13 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 88458.5 ns | 76416 ns | 1.16 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 74750 ns | 74375 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 101733.5 ns | 103583 ns | 0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 206250 ns | 221146 ns | 0.93 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 232250 ns | 219166 ns | 1.06 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 220167 ns | 208875 ns | 1.05 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 204708 ns | 206542 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 545309 ns | 560403 ns | 0.97 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 11917 ns | 12166 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 12333 ns | 12208.5 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 12917 ns | 13167 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 11792 ns | 12042 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 73274 ns | 71403 ns | 1.03 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 26500 ns | 26979.5 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 26375 ns | 27167 ns | 0.97 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 27625 ns | 27958.5 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 26667 ns | 26459 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 485939.5 ns | 472464 ns | 1.03 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 12458 ns | 12437.5 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 12083 ns | 12979 ns | 0.93 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 14167 ns | 14167 ns | 1 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 12042 ns | 12125 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 53767.5 ns | 53400 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 28417 ns | 25625 ns | 1.11 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 25500 ns | 26292 ns | 0.97 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 26000 ns | 26416 ns | 0.98 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 26208 ns | 26167 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 308847.5 ns | 306626.5 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 179313 ns | 180729 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 181771 ns | 182709 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 184083 ns | 183875 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 181625 ns | 180833 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 57081 ns | 56252.5 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 590625 ns | 593541.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 585833 ns | 593916 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 595375 ns | 584021 ns | 1.02 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 583958 ns | 582917 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 289906 ns | 289288.5 ns | 1.00 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 5750 ns | 6500 ns | 0.88 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 6250 ns | 6125 ns | 1.02 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 7292 ns | 7792 ns | 0.94 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7354.5 ns | 6145.5 ns | 1.20 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 72512 ns | 70132.5 ns | 1.03 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14333 ns | 14271 ns | 1.00 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14750 ns | 14916 ns | 0.99 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 15292 ns | 15500 ns | 0.99 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14292 ns | 14000 ns | 1.02 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 475946 ns | 460852.5 ns | 1.03 |
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) | 1194000 ns | 1175354 ns | 1.02 |
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) | 1255041.5 ns | 1353000 ns | 0.93 |
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) | 1282167 ns | 1269979 ns | 1.01 |
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) | 1009000 ns | 1317500 ns | 0.77 |
batchedmm(512, Bsize=4)/forward/GPU/CUDA | 301898 ns | 302455 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) | 4099458 ns | 4288500 ns | 0.96 |
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) | 4622875 ns | 4366958 ns | 1.06 |
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) | 4583479 ns | 4543917 ns | 1.01 |
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) | 3719875 ns | 4469000 ns | 0.83 |
batchedmm(512, Bsize=4)/zygote/GPU/CUDA | 1037320.5 ns | 1030148 ns | 1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1875 ns | 1792 ns | 1.05 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1792 ns | 1875 ns | 0.96 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1875 ns | 1833 ns | 1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1833 ns | 1875 ns | 0.98 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA | 24423 ns | 23497 ns | 1.04 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 4875 ns | 4834 ns | 1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 4833 ns | 5041 ns | 0.96 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 5000 ns | 4875 ns | 1.03 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 4917 ns | 4875 ns | 1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA | 195083.5 ns | 185923.5 ns | 1.05 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6292 ns | 5500 ns | 1.14 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6167 ns | 6167 ns | 1 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7291.5 ns | 6459 ns | 1.13 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 5708 ns | 5583 ns | 1.02 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 56884 ns | 55454.5 ns | 1.03 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10542 ns | 10667 ns | 0.99 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 11291 ns | 11750 ns | 0.96 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 11666 ns | 11458 ns | 1.02 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10667 ns | 10667 ns | 1 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 338689 ns | 337381 ns | 1.00 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 375 ns | 0.78 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 333 ns | 333 ns | 1 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA | 23835 ns | 22737 ns | 1.05 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2750 ns | 2708 ns | 1.02 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2750 ns | 3000 ns | 0.92 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 3000 ns | 3000 ns | 1 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2709 ns | 2750 ns | 0.99 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA | 165106 ns | 157057 ns | 1.05 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 11708 ns | 11625 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 12042 ns | 12250 ns | 0.98 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 12916 ns | 12708 ns | 1.02 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 11208 ns | 11417 ns | 0.98 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 58245 ns | 56422 ns | 1.03 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 25166 ns | 24250 ns | 1.04 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 24375 ns | 25208 ns | 0.97 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 25000 ns | 25000 ns | 1 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 24833 ns | 25437.5 ns | 0.98 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 299833.5 ns | 294376.5 ns | 1.02 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4167 ns | 4167 ns | 1 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4167 ns | 4208 ns | 0.99 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4250 ns | 4167 ns | 1.02 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4208 ns | 4208 ns | 1 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA | 25332.5 ns | 24716 ns | 1.02 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16208 ns | 16042 ns | 1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 15958 ns | 16417 ns | 0.97 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16334 ns | 16250 ns | 1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16167 ns | 16167 ns | 1 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA | 203337.5 ns | 193381 ns | 1.05 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5875 ns | 5750 ns | 1.02 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 5750 ns | 6083 ns | 0.95 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5917 ns | 5750 ns | 1.03 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5792 ns | 5833 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 34137 ns | 33569 ns | 1.02 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 20875 ns | 20479.5 ns | 1.02 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 20542 ns | 21000 ns | 0.98 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 21208 ns | 21208 ns | 1 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 21041 ns | 21104.5 ns | 1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 181462.5 ns | 174365.5 ns | 1.04 |
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 423396 ns | 375416.5 ns | 1.13 |
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 368374.5 ns | 374666.5 ns | 0.98 |
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 485375.5 ns | 488312.5 ns | 0.99 |
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 102875 ns | 524187.5 ns | 0.20 |
batchedmm(16, Bsize=512)/forward/GPU/CUDA | 67695.5 ns | 66372.5 ns | 1.02 |
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 873958 ns | 931978.5 ns | 0.94 |
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 975084 ns | 880291.5 ns | 1.11 |
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 1174250 ns | 1223791.5 ns | 0.96 |
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 327583.5 ns | 1351833.5 ns | 0.24 |
batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 192402.5 ns | 192149.5 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 80270.5 ns | 81312.5 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 82875 ns | 80750 ns | 1.03 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 83292 ns | 80792 ns | 1.03 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 80375 ns | 80937 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 194536.5 ns | 192807 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1909166.5 ns | 1932917 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1937625 ns | 1916542 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1926500 ns | 1926479 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1919542 ns | 1921042 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 401599 ns | 394461 ns | 1.02 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 333 ns | 291 ns | 1.14 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 333 ns | 0.88 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 22634 ns | 22118 ns | 1.02 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1792 ns | 1750 ns | 1.02 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1792 ns | 1834 ns | 0.98 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1916 ns | 1834 ns | 1.04 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1792 ns | 1792 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 174445 ns | 166019.5 ns | 1.05 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 6333 ns | 6250 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 6666 ns | 7208 ns | 0.92 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7584 ns | 8166 ns | 0.93 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6375 ns | 6312.5 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 59506.5 ns | 57360.5 ns | 1.04 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9250 ns | 8917 ns | 1.04 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8792 ns | 9167 ns | 0.96 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9334 ns | 9208 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9209 ns | 9250 ns | 1.00 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 313371.5 ns | 301535 ns | 1.04 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 120626875 ns | 156508063 ns | 0.77 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 181866646 ns | 173937500 ns | 1.05 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 147965312.5 ns | 148141208 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 109172333 ns | 106478500 ns | 1.03 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5482154.5 ns | 5474150 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 613775374.5 ns | 673237875 ns | 0.91 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 579490625 ns | 556883000 ns | 1.04 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 454979000 ns | 453960458.5 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 757824166.5 ns | 759297583 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 34957490 ns | 38204722 ns | 0.92 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 651362542 ns | 701496583 ns | 0.93 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 688274854 ns | 667076166 ns | 1.03 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 584097270.5 ns | 586800771 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 744092250 ns | 744632000 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 59042 ns | 56833 ns | 1.04 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 39166 ns | 48042 ns | 0.82 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 48458 ns | 47125 ns | 1.03 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83583 ns | 84541 ns | 0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 37588 ns | 37576 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1895062.5 ns | 1935541 ns | 0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1981167 ns | 1985208 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1996959 ns | 1979834 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1888458 ns | 1893771 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 174180 ns | 174934 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 264875 ns | 267875 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 273521 ns | 288042 ns | 0.95 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 275542 ns | 270229.5 ns | 1.02 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 265312 ns | 267250 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 130044 ns | 128767 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 696000 ns | 665041 ns | 1.05 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 694979 ns | 668958 ns | 1.04 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 587145.5 ns | 589167 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 584562.5 ns | 596209 ns | 0.98 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 714810.5 ns | 703647.5 ns | 1.02 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2224458 ns | 2205417 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2224833 ns | 2188541 ns | 1.02 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2203791.5 ns | 2100166.5 ns | 1.05 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2155375 ns | 2225499.5 ns | 0.97 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 133466 ns | 133307.5 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5495291 ns | 5538625 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5583291.5 ns | 5527958 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5523041.5 ns | 5503250 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5498583 ns | 5491271 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 772462 ns | 759584.5 ns | 1.02 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 644875 ns | 638667 ns | 1.01 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 646292 ns | 640458 ns | 1.01 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 638000 ns | 648875 ns | 0.98 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 638875 ns | 636167 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 46942.5 ns | 47137 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1827375 ns | 1796937.5 ns | 1.02 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1670062.5 ns | 1724292 ns | 0.97 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1724167 ns | 1720542 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 2105958 ns | 2104520.5 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 221811 ns | 218174.5 ns | 1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58291 ns | 57000 ns | 1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 38667 ns | 46833 ns | 0.83 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47959 ns | 47083 ns | 1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 84292 ns | 84542 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 28649.5 ns | 28335 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2032709 ns | 2047750 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2095416.5 ns | 2077083 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2115084 ns | 2092083 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1998167 ns | 1939979 ns | 1.03 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 190814 ns | 191381.5 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 13379583 ns | 13410020.5 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 12465083 ns | 12472750 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 12510000 ns | 12570979 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 15365208 ns | 15234500 ns | 1.01 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 512956 ns | 512740.5 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 47297125 ns | 47584458 ns | 0.99 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 42036749.5 ns | 41911083 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 40906541 ns | 41152979.5 ns | 0.99 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 58402937.5 ns | 58152541 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3259192 ns | 3249099 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 74125833 ns | 74313208.5 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 91091084 ns | 91931958.5 ns | 0.99 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 90945458 ns | 91156000 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 98786375 ns | 76595709 ns | 1.29 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58791 ns | 57334 ns | 1.03 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 38875 ns | 47417 ns | 0.82 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47916 ns | 47250 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 81750 ns | 84375 ns | 0.97 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 47173 ns | 48075 ns | 0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1916875.5 ns | 1930959 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1975416.5 ns | 1977562.5 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1998896 ns | 1977250 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1886479.5 ns | 1816292 ns | 1.04 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 194727.5 ns | 196217.5 ns | 0.99 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 333 ns | 334 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 333 ns | 417 ns | 0.80 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 417 ns | 334 ns | 1.25 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 333 ns | 333 ns | 1 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 32121 ns | 32756 ns | 0.98 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 5979.5 ns | 6125 ns | 0.98 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6500 ns | 6583 ns | 0.99 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6542 ns | 6542 ns | 1 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 5959 ns | 6208 ns | 0.96 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 177657.5 ns | 178147.5 ns | 1.00 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 250 ns | 291 ns | 0.86 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 291 ns | 250 ns | 1.16 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 31406 ns | 31948 ns | 0.98 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 2667 ns | 2625 ns | 1.02 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 2792 ns | 2875 ns | 0.97 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 2834 ns | 2834 ns | 1 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 2625 ns | 2625 ns | 1 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 168384.5 ns | 164100 ns | 1.03 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 284625167 ns | 323244146 ns | 0.88 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 346874125 ns | 340740458 ns | 1.02 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 314223937 ns | 314512041.5 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 271286833 ns | 271130916 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 7105292.5 ns | 7115553 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 997536042 ns | 1053603541.5 ns | 0.95 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 962898625 ns | 941056333 ns | 1.02 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 836523750 ns | 854610104 ns | 0.98 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 1157418250 ns | 1162236250 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 33940243.5 ns | 33945165 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 1317043896 ns | 1364084083.5 ns | 0.97 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 1704351583 ns | 1705661833 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 1639291667 ns | 1621953875 ns | 1.01 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 1664114458 ns | 1313183229.5 ns | 1.27 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1463208 ns | 1410000 ns | 1.04 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1420208 ns | 1408291.5 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1416146 ns | 1453645.5 ns | 0.97 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1413542 ns | 1407209 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 127746 ns | 127861 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5020063 ns | 5051959 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5058875 ns | 5013583.5 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5062958.5 ns | 5028416.5 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5016917 ns | 5027271 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 621557.5 ns | 604299 ns | 1.03 |
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) | 172836708.5 ns | 161226250 ns | 1.07 |
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) | 167838917 ns | 131446875 ns | 1.28 |
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) | 129038271 ns | 127042083 ns | 1.02 |
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) | 167391875 ns | 155626750.5 ns | 1.08 |
vgg16(32, 32, 3, 32)/forward/GPU/CUDA | 4853519.5 ns | 4974919.5 ns | 0.98 |
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) | 626867250 ns | 850481958 ns | 0.74 |
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) | 577625083 ns | 644255791 ns | 0.90 |
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) | 432711750 ns | 496077667 ns | 0.87 |
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) | 647126958 ns | 685984875 ns | 0.94 |
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA | 15994577 ns | 15948822 ns | 1.00 |
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 8946000 ns | 9064833.5 ns | 0.99 |
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 9055083 ns | 8770396 ns | 1.03 |
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 7872791 ns | 7878104.5 ns | 1.00 |
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 9737396 ns | 10163000 ns | 0.96 |
batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1615300 ns | 1608837.5 ns | 1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 36179042 ns | 37348729 ns | 0.97 |
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 38214417 ns | 36970124.5 ns | 1.03 |
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 33412417 ns | 33623167 ns | 0.99 |
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 37675812 ns | 38875729.5 ns | 0.97 |
batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6517064 ns | 6455570 ns | 1.01 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 47459 ns | 47375 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 47375 ns | 47750 ns | 0.99 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 47625 ns | 47583 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 47500 ns | 47625 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 19122 ns | 18855 ns | 1.01 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 50875 ns | 50250 ns | 1.01 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 50416.5 ns | 50750 ns | 0.99 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 50541 ns | 50416 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 50333 ns | 50292 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 228481 ns | 202264 ns | 1.13 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6209 ns | 6375 ns | 0.97 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6542 ns | 7187.5 ns | 0.91 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7666.5 ns | 8417 ns | 0.91 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 7208 ns | 6708 ns | 1.07 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 111051 ns | 108599.5 ns | 1.02 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10208 ns | 9604.5 ns | 1.06 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9625 ns | 10209 ns | 0.94 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10375 ns | 10292 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10125 ns | 10583 ns | 0.96 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 703962.5 ns | 610519 ns | 1.15 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5709 ns | 5958 ns | 0.96 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 5958 ns | 6375 ns | 0.93 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 7125 ns | 7583 ns | 0.94 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6334 ns | 5542 ns | 1.14 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 160862.5 ns | 131186.5 ns | 1.23 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 13125 ns | 12875 ns | 1.02 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12792 ns | 13208 ns | 0.97 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13250 ns | 13583 ns | 0.98 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 13209 ns | 12875 ns | 1.03 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 610923 ns | 530393 ns | 1.15 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 1083 ns | 1000 ns | 1.08 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 1000 ns | 1167 ns | 0.86 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 1083 ns | 1042 ns | 1.04 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 1042 ns | 1042 ns | 1 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 31878 ns | 32479.5 ns | 0.98 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8125 ns | 7833.5 ns | 1.04 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7916 ns | 8042 ns | 0.98 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8208 ns | 8083 ns | 1.02 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7958 ns | 7916 ns | 1.01 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 234181 ns | 216406.5 ns | 1.08 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 23000 ns | 23042 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 23500 ns | 23542 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 23375 ns | 23333 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 23291.5 ns | 23375 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 18958 ns | 19066 ns | 0.99 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 52333 ns | 52291.5 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 52542 ns | 52500 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 52833 ns | 53166.5 ns | 0.99 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 52750 ns | 52125 ns | 1.01 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 347156 ns | 309714.5 ns | 1.12 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1398291 ns | 1413917 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1400542 ns | 1401104 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1451646 ns | 1457583.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1403458 ns | 1402271 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 196101 ns | 196285 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5010104 ns | 5045083 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5042416.5 ns | 4724458 ns | 1.07 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5035667 ns | 5023021 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5006875.5 ns | 4706104.5 ns | 1.06 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 698384 ns | 644560.5 ns | 1.08 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 3037083 ns | 3086125.5 ns | 0.98 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2097125 ns | 2087104.5 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2310166 ns | 2281125 ns | 1.01 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4587708 ns | 4848375 ns | 0.95 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 582576 ns | 580262 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 24451708.5 ns | 24765000.5 ns | 0.99 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 19116646 ns | 18889791.5 ns | 1.01 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 18907687.5 ns | 19005084 ns | 0.99 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 36689646.5 ns | 36681292 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3202509 ns | 3253871.5 ns | 0.98 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 34075708 ns | 34537875 ns | 0.99 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 28707020.5 ns | 28314500 ns | 1.01 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 28058375 ns | 27967000 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 41736416.5 ns | 41702500 ns | 1.00 |
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 143730708 ns | 144041208 ns | 1.00 |
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 147488750 ns | 143168583 ns | 1.03 |
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 126224750 ns | 124247521 ns | 1.02 |
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 173454520.5 ns | 173506729 ns | 1.00 |
batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22575677 ns | 22768605 ns | 0.99 |
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 1620032250 ns | 957619479 ns | 1.69 |
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 863135021.5 ns | 1175957479.5 ns | 0.73 |
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 1509034062.5 ns | 739734292 ns | 2.04 |
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 665993458 ns | 672317125 ns | 0.99 |
batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 117915974 ns | 118020449 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 72166 ns | 73979 ns | 0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 73250 ns | 75750 ns | 0.97 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 76770.5 ns | 75416 ns | 1.02 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 73833.5 ns | 72854.5 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 284295.5 ns | 300521.5 ns | 0.95 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 287666 ns | 287875 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 192041.5 ns | 285333 ns | 0.67 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 200187.5 ns | 204208 ns | 0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 282000 ns | 287375 ns | 0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1416407.5 ns | 1342742 ns | 1.05 |
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) |
35556583 ns |
36185500 ns |
0.98 |
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) |
36666375 ns |
35466000.5 ns |
1.03 |
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) |
32500083 ns |
32336688 ns |
1.01 |
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) |
40359917 ns |
40972250 ns |
0.99 |
batchedmm(512, Bsize=128)/forward/GPU/CUDA |
5850508 ns |
5837876 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) |
146447437.5 ns |
151179834 ns |
0.97 |
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) |
159338208.5 ns |
151456979 ns |
1.05 |
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) |
138154562.5 ns |
136606104 ns |
1.01 |
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) |
283678542 ns |
287372208 ns |
0.99 |
batchedmm(512, Bsize=128)/zygote/GPU/CUDA |
34916771 ns |
34877857 ns |
1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) |
121153146.5 ns |
155986916 ns |
0.78 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) |
181784520.5 ns |
174507459 ns |
1.04 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) |
148303833 ns |
148111416.5 ns |
1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) |
106006229 ns |
102908562.5 ns |
1.03 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA |
5475461.5 ns |
5463707 ns |
1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) |
469832812.5 ns |
520380250 ns |
0.90 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) |
483820458.5 ns |
465489750 ns |
1.04 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) |
439800958 ns |
439138000 ns |
1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) |
741809084 ns |
742252417 ns |
1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA |
32272334 ns |
35175845 ns |
0.92 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) |
707200771 ns |
698201250 ns |
1.01 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) |
674004500.5 ns |
654820792 ns |
1.03 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) |
574385562.5 ns |
571273229.5 ns |
1.01 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) |
731320834 ns |
850215250 ns |
0.86 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) |
1295833 ns |
1101520.5 ns |
1.18 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) |
666625 ns |
970208.5 ns |
0.69 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) |
976125 ns |
920500 ns |
1.06 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) |
1942583 ns |
1945375.5 ns |
1.00 |
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA |
578916 ns |
580245.5 ns |
1.00 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) |
2968500 ns |
2907896 ns |
1.02 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) |
2508187.5 ns |
2595708 ns |
0.97 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) |
2653229 ns |
2606333 ns |
1.02 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) |
3698583 ns |
3655000 ns |
1.01 |
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA |
1888981 ns |
1734207 ns |
1.09 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) |
5791354.5 ns |
6744875 ns |
0.86 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) |
5897834 ns |
6498208 ns |
0.91 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) |
5827584 ns |
6503854.5 ns |
0.90 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) |
2886541 ns |
4423604.5 ns |
0.65 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7416 ns |
7208 ns |
1.03 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
5292 ns |
6083 ns |
0.87 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
6209 ns |
5958.5 ns |
1.04 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
10042 ns |
9959 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
25562 ns |
25201 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
212562.5 ns |
212291 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
221625 ns |
220750 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
221083 ns |
220125 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
207125 ns |
206792 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
299268 ns |
262467.5 ns |
1.14 |
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) |
307907958 ns |
316552750 ns |
0.97 |
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) |
280027438 ns |
221682708 ns |
1.26 |
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) |
198301979.5 ns |
187257688 ns |
1.06 |
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) |
308612916 ns |
311596375 ns |
0.99 |
vgg16(32, 32, 3, 64)/forward/GPU/CUDA |
7676186 ns |
7676203 ns |
1.00 |
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) |
1090327479 ns |
1093022833.5 ns |
1.00 |
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) |
1069654042 ns |
911616145.5 ns |
1.17 |
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) |
805438000 ns |
815656375 ns |
0.99 |
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) |
1153864583 ns |
1161401125 ns |
0.99 |
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA |
26354888 ns |
26547253 ns |
0.99 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
5312.5 ns |
5292 ns |
1.00 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
5584 ns |
5667 ns |
0.99 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
7041 ns |
6625 ns |
1.06 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
5500 ns |
5125 ns |
1.07 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
186016 ns |
167889.5 ns |
1.11 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7416 ns |
7083 ns |
1.05 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7125 ns |
7375 ns |
0.97 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7834 ns |
7459 ns |
1.05 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7291.5 ns |
7437.5 ns |
0.98 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
709229 ns |
650263 ns |
1.09 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
625 ns |
542 ns |
1.15 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
584 ns |
709 ns |
0.82 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
667 ns |
667 ns |
1.00 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
583 ns |
542 ns |
1.08 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA |
23945 ns |
23809 ns |
1.01 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
9375 ns |
9041.5 ns |
1.04 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
9209 ns |
9791 ns |
0.94 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
9500 ns |
9208.5 ns |
1.03 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
8833 ns |
9042 ns |
0.98 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA |
234895 ns |
233459 ns |
1.01 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) |
353895.5 ns |
351417 ns |
1.01 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) |
352500 ns |
352250 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) |
352333.5 ns |
353063 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) |
352375 ns |
353333 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA |
21675 ns |
21613 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) |
826000 ns |
791250 ns |
1.04 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) |
835458.5 ns |
808979 ns |
1.03 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) |
776312.5 ns |
773625 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) |
824520.5 ns |
824084 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA |
308824 ns |
305844 ns |
1.01 |
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) |
338833 ns |
314958 ns |
1.08 |
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) |
323020.5 ns |
333625 ns |
0.97 |
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) |
453208 ns |
448667 ns |
1.01 |
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) |
10770.5 ns |
331833 ns |
0.032457591619881085 |
batchedmm(16, Bsize=32)/forward/GPU/CUDA |
17775 ns |
17811 ns |
1.00 |
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) |
712979 ns |
682125 ns |
1.05 |
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) |
727625 ns |
746791.5 ns |
0.97 |
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) |
1002292 ns |
1029167 ns |
0.97 |
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) |
27291.5 ns |
700937.5 ns |
0.038935711101203745 |
batchedmm(16, Bsize=32)/zygote/GPU/CUDA |
295524 ns |
273907.5 ns |
1.08 |
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) |
379625 ns |
328083 ns |
1.16 |
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) |
329041 ns |
348979 ns |
0.94 |
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) |
440521.5 ns |
424375 ns |
1.04 |
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) |
30208.5 ns |
370666 ns |
0.08149789837751507 |
batchedmm(16, Bsize=128)/forward/GPU/CUDA |
22403 ns |
22237 ns |
1.01 |
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) |
734125 ns |
743604 ns |
0.99 |
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) |
779520.5 ns |
750229 ns |
1.04 |
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) |
1028562.5 ns |
1076375 ns |
0.96 |
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) |
105500 ns |
822541 ns |
0.13 |
batchedmm(16, Bsize=128)/zygote/GPU/CUDA |
258280.5 ns |
220485.5 ns |
1.17 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) |
3459 ns |
3334 ns |
1.04 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) |
3791 ns |
3792 ns |
1.00 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) |
3792 ns |
3625 ns |
1.05 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) |
3625 ns |
3583 ns |
1.01 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA |
17955 ns |
18068 ns |
0.99 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) |
4209 ns |
4166 ns |
1.01 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) |
4333 ns |
4542 ns |
0.95 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) |
4292 ns |
4250 ns |
1.01 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) |
4250 ns |
4334 ns |
0.98 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA |
290816 ns |
278097 ns |
1.05 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
3625 ns |
3292 ns |
1.10 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
3958 ns |
3645.5 ns |
1.09 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
4500 ns |
4708 ns |
0.96 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
3500 ns |
4042 ns |
0.87 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA |
238246.5 ns |
212235.5 ns |
1.12 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
8417 ns |
8042 ns |
1.05 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
8083 ns |
8417 ns |
0.96 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
8625 ns |
8792 ns |
0.98 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
8500 ns |
8167 ns |
1.04 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA |
1225625 ns |
1255478 ns |
0.98 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
203166 ns |
204000 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
211625 ns |
211375 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
211292 ns |
211042 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
199292 ns |
200541 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
34921 ns |
34367 ns |
1.02 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
648729.5 ns |
605708.5 ns |
1.07 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
675041 ns |
625021 ns |
1.08 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
622125 ns |
620792 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
627292 ns |
582583 ns |
1.08 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
346070.5 ns |
361289.5 ns |
0.96 |
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) |
992125 ns |
973333 ns |
1.02 |
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) |
1031125.5 ns |
950209 ns |
1.09 |
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) |
953834 ns |
955541 ns |
1.00 |
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) |
861124.5 ns |
1286000.5 ns |
0.67 |
batchedmm(128, Bsize=128)/forward/GPU/CUDA |
206641 ns |
207830 ns |
0.99 |
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) |
4533291 ns |
4594084 ns |
0.99 |
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) |
4839770.5 ns |
4500750.5 ns |
1.08 |
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) |
4423375 ns |
4304583 ns |
1.03 |
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) |
5168938 ns |
6304625 ns |
0.82 |
batchedmm(128, Bsize=128)/zygote/GPU/CUDA |
926931 ns |
925479 ns |
1.00 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
3209 ns |
3333 ns |
0.96 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
3854.5 ns |
3583 ns |
1.08 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
4458 ns |
4250 ns |
1.05 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
2834 ns |
3541 ns |
0.80 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA |
229027 ns |
240989.5 ns |
0.95 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7375 ns |
6875 ns |
1.07 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7291 ns |
7542 ns |
0.97 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7250 ns |
7375 ns |
0.98 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
6875 ns |
7042 ns |
0.98 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA |
1010836.5 ns |
1039649.5 ns |
0.97 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) |
1640708.5 ns |
1636792 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) |
1182958 ns |
1175749.5 ns |
1.01 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) |
1363875 ns |
1347167 ns |
1.01 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) |
2437104 ns |
2463271 ns |
0.99 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA |
215070 ns |
213096 ns |
1.01 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) |
12367625 ns |
12388416 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) |
9635542 ns |
9551437.5 ns |
1.01 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) |
9254375 ns |
9305937.5 ns |
0.99 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) |
18041020.5 ns |
18088000 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA |
1951821.5 ns |
1951605 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) |
17387209 ns |
17398084 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) |
14500084 ns |
14348854.5 ns |
1.01 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) |
14366416.5 ns |
14347271 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) |
21053249.5 ns |
21112104 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
124458.5 ns |
94729.5 ns |
1.31 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
93333 ns |
90667 ns |
1.03 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
92667 ns |
92375 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
87625 ns |
114395.5 ns |
0.77 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
126571 ns |
125574 ns |
1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2029833 ns |
2039792 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2043291 ns |
1808208.5 ns |
1.13 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2043542 ns |
2033666.5 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2024104 ns |
2022500 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
1034383 ns |
1052869 ns |
0.98 |
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) |
2917 ns |
326041.5 ns |
0.008946713838575765 |
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) |
2041 ns |
344833 ns |
0.005918807074728927 |
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) |
2833 ns |
396416 ns |
0.007146532935098483 |
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) |
3292 ns |
314708 ns |
0.010460490359317209 |
batchedmm(2, Bsize=4)/forward/GPU/CUDA |
15778 ns |
15677 ns |
1.01 |
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) |
2750 ns |
701042 ns |
0.0039227321615538015 |
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) |
2459 ns |
733209 ns |
0.003353750431323129 |
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) |
2833 ns |
1020500 ns |
0.00277609015188633 |
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) |
2792 ns |
656250 ns |
0.0042544761904761905 |
batchedmm(2, Bsize=4)/zygote/GPU/CUDA |
192972 ns |
196145.5 ns |
0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7375 ns |
7084 ns |
1.04 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
5416 ns |
5541 ns |
0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
6084 ns |
6084 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
10042 ns |
10000 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
34424 ns |
34060 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
221625 ns |
221166.5 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
221250 ns |
220916.5 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
220834 ns |
220167 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
219333.5 ns |
217124.5 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
316465.5 ns |
344547 ns |
0.92 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) |
3708 ns |
3750 ns |
0.99 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) |
3750 ns |
3709 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) |
3750 ns |
3708 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) |
3708 ns |
3667 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA |
22906 ns |
22568 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) |
14458 ns |
14167 ns |
1.02 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) |
14250 ns |
14375 ns |
0.99 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) |
14417 ns |
14458 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) |
14333 ns |
14416 ns |
0.99 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA |
476234 ns |
487124.5 ns |
0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
117770.5 ns |
97500 ns |
1.21 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
98104 ns |
93417 ns |
1.05 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
96500 ns |
96687.5 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
91584 ns |
91875 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
126019 ns |
124929 ns |
1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1930000 ns |
1940875 ns |
0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1937167 ns |
1919916.5 ns |
1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1921021 ns |
1931229.5 ns |
0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1923125 ns |
1917271.5 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
939452.5 ns |
955641 ns |
0.98 |
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) |
867604 ns |
854084 ns |
1.02 |
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) |
807896 ns |
826333 ns |
0.98 |
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) |
1207166.5 ns |
1211000 ns |
1.00 |
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) |
951167 ns |
955354.5 ns |
1.00 |
lenet(28, 28, 1, 32)/forward/GPU/CUDA |
276975 ns |
272141 ns |
1.02 |
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) |
2825833 ns |
2801124.5 ns |
1.01 |
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) |
2537062.5 ns |
2515333 ns |
1.01 |
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) |
3318041 ns |
3309625 ns |
1.00 |
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) |
3363250 ns |
3416625 ns |
0.98 |
lenet(28, 28, 1, 32)/zygote/GPU/CUDA |
1571603 ns |
1612126.5 ns |
0.97 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
16041.5 ns |
17062.5 ns |
0.94 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
16083 ns |
16708.5 ns |
0.96 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
17625 ns |
18937 ns |
0.93 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
14770.5 ns |
15167 ns |
0.97 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
142076.5 ns |
142123.5 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
227020.5 ns |
223437.5 ns |
1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
262125 ns |
215958 ns |
1.21 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
216500 ns |
216125 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
228417 ns |
255708.5 ns |
0.89 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
633913.5 ns |
644779 ns |
0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
220958 ns |
222292 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
223166 ns |
221750 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
221583.5 ns |
222542 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
219000 ns |
220917 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
268138.5 ns |
271274.5 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
555875 ns |
509083 ns |
1.09 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
563417 ns |
501292 ns |
1.12 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
499167 ns |
496750 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
504708 ns |
550583 ns |
0.92 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1310913 ns |
1401190 ns |
0.94 |
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) |
3667 ns |
304437.5 ns |
0.0120451652638062 |
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) |
4541 ns |
331687.5 ns |
0.013690597324288675 |
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) |
4875 ns |
376292 ns |
0.01295536445101145 |
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) |
3625 ns |
321812.5 ns |
0.011264323169547485 |
batchedmm(16, Bsize=4)/forward/GPU/CUDA |
16636 ns |
16554 ns |
1.00 |
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) |
7125 ns |
708875 ns |
0.010051137365543996 |
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) |
7083 ns |
736875 ns |
0.009612213740458016 |
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) |
7375 ns |
1020209 ns |
0.007228910938837042 |
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) |
7333 ns |
668458 ns |
0.010970023546729936 |
batchedmm(16, Bsize=4)/zygote/GPU/CUDA |
193281.5 ns |
196065 ns |
0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
18708 ns |
17854 ns |
1.05 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
19625 ns |
18520.5 ns |
1.06 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
19542 ns |
19667 ns |
0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
16520.5 ns |
16209 ns |
1.02 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
144540.5 ns |
146750.5 ns |
0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
223396 ns |
247604 ns |
0.90 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
221292 ns |
212500 ns |
1.04 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
213375 ns |
212917 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
222666 ns |
211750.5 ns |
1.05 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
896239 ns |
1011803 ns |
0.89 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
4000 ns |
4125 ns |
0.97 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
4458 ns |
4125 ns |
1.08 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
5395.5 ns |
5187.5 ns |
1.04 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
4250 ns |
4084 ns |
1.04 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA |
180460 ns |
201325 ns |
0.90 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
10959 ns |
10667 ns |
1.03 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
10416 ns |
10875 ns |
0.96 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
10500 ns |
10500 ns |
1.00 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
10417 ns |
10375 ns |
1.00 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA |
1021118.5 ns |
1050725 ns |
0.97 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
3042 ns |
3375 ns |
0.90 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
3750 ns |
3625 ns |
1.03 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
4541 ns |
4167 ns |
1.09 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
3292 ns |
3291 ns |
1.00 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
219358 ns |
242454 ns |
0.90 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7375 ns |
7542 ns |
0.98 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7500 ns |
7666 ns |
0.98 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7667 ns |
7750 ns |
0.99 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7334 ns |
7333 ns |
1.00 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
1035983 ns |
1067571 ns |
0.97 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) |
23583541 ns |
24057353.5 ns |
0.98 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) |
43838187.5 ns |
34753459 ns |
1.26 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) |
37729834 ns |
37792125 ns |
1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) |
35203208 ns |
34828583.5 ns |
1.01 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA |
1832361.5 ns |
1854184 ns |
0.99 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) |
184224084 ns |
187222542 ns |
0.98 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) |
173933854 ns |
160010375 ns |
1.09 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) |
146952229.5 ns |
146721854.5 ns |
1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) |
410469417 ns |
412776417 ns |
0.99 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA |
16498571 ns |
16508303 ns |
1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) |
424640333 ns |
437495583 ns |
0.97 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) |
261807375 ns |
253838438 ns |
1.03 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) |
295219708.5 ns |
232343979.5 ns |
1.27 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) |
479699708 ns |
483540875 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
182437.5 ns |
183854 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
184959 ns |
183625 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
185625 ns |
185334 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
182958 ns |
184167 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
176456.5 ns |
220968 ns |
0.80 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
629208 ns |
594000 ns |
1.06 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
611645.5 ns |
632437.5 ns |
0.97 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
588334 ns |
586084 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
596292 ns |
628500 ns |
0.95 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1007095.5 ns |
1061303.5 ns |
0.95 |
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) |
3908958 ns |
3892042 ns |
1.00 |
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) |
4164979 ns |
3642708 ns |
1.14 |
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) |
3539625 ns |
3572042 ns |
0.99 |
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) |
4551958.5 ns |
5353250 ns |
0.85 |
batchedmm(128, Bsize=512)/forward/GPU/CUDA |
532362 ns |
549368 ns |
0.97 |
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) |
17307166 ns |
17901624.5 ns |
0.97 |
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) |
18330542 ns |
17281292 ns |
1.06 |
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) |
16469187.5 ns |
16574875 ns |
0.99 |
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) |
20270062.5 ns |
22050250 ns |
0.92 |
batchedmm(128, Bsize=512)/zygote/GPU/CUDA |
2616105 ns |
2630980 ns |
0.99 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
584 ns |
541 ns |
1.08 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
583 ns |
625 ns |
0.93 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
708 ns |
584 ns |
1.21 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
583 ns |
584 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA |
31875 ns |
31762 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
9437.5 ns |
9145.5 ns |
1.03 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
9459 ns |
9208 ns |
1.03 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
9291.5 ns |
9417 ns |
0.99 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
9125 ns |
9208 ns |
0.99 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA |
261192 ns |
262912.5 ns |
0.99 |
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) |
498957667 ns |
505346750 ns |
0.99 |
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) |
506947291 ns |
429818666.5 ns |
1.18 |
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) |
420975459 ns |
433256333.5 ns |
0.97 |
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) |
674716520.5 ns |
677373875 ns |
1.00 |
vgg16(32, 32, 3, 128)/forward/GPU/CUDA |
12478081 ns |
12487373 ns |
1.00 |
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) |
1869574750 ns |
2066713500 ns |
0.90 |
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) |
1653016000 ns |
1635890000 ns |
1.01 |
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) |
1493871875 ns |
1494391792 ns |
1.00 |
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) |
2197444312 ns |
2208031208.5 ns |
1.00 |
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA |
49046932 ns |
49163495.5 ns |
1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) |
1642250 ns |
1632500.5 ns |
1.01 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) |
1190416 ns |
1173583 ns |
1.01 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) |
1384604 ns |
1383958 ns |
1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) |
2488917 ns |
2483292 ns |
1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA |
216008 ns |
214736 ns |
1.01 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) |
12763416 ns |
12776042 ns |
1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) |
9992083 ns |
9939062.5 ns |
1.01 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) |
9689979.5 ns |
9686917 ns |
1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) |
18446958 ns |
18349375 ns |
1.01 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA |
2006858 ns |
2056758 ns |
0.98 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) |
17699416 ns |
17758729.5 ns |
1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) |
14794062.5 ns |
14689958 ns |
1.01 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) |
14579833 ns |
14551125 ns |
1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) |
21407792 ns |
21399666 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
26291 ns |
26250 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
26292 ns |
26292 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
26667 ns |
26333 ns |
1.01 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
26250 ns |
26250 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA |
23694 ns |
24146 ns |
0.98 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
66834 ns |
66791 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
66917 ns |
67292 ns |
0.99 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
67250 ns |
68417 ns |
0.98 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
66917 ns |
66709 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA |
381806 ns |
391053.5 ns |
0.98 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
202917 ns |
204333 ns |
0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
209208 ns |
210125 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
210166 ns |
209458 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
199750 ns |
198792 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
26205 ns |
26289 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
611521 ns |
642083 ns |
0.95 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
672625 ns |
624354.5 ns |
1.08 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
625292 ns |
621729.5 ns |
1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
634083 ns |
627000.5 ns |
1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
311868 ns |
357106 ns |
0.87 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
604020.5 ns |
645625 ns |
0.94 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
659396 ns |
636292 ns |
1.04 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
550042 ns |
602667 ns |
0.91 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
646458 ns |
672375 ns |
0.96 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
131537 ns |
132245.5 ns |
0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2232396 ns |
2294979 ns |
0.97 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2281167 ns |
2157208 ns |
1.06 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2260354 ns |
2246208 ns |
1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2241125 ns |
2249458 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
1133773 ns |
1236985 ns |
0.92 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
19291.5 ns |
17937.5 ns |
1.08 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
20063 ns |
18416.5 ns |
1.09 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
19875 ns |
20083 ns |
0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
17166 ns |
18895.5 ns |
0.91 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
143835 ns |
145580 ns |
0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
218416 ns |
259583 ns |
0.84 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
267625 ns |
261791 ns |
1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
220729.5 ns |
219084 ns |
1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
258250 ns |
257520.5 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
938699.5 ns |
1034996 ns |
0.91 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
666 ns |
542 ns |
1.23 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
584 ns |
667 ns |
0.88 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
708 ns |
625 ns |
1.13 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
583 ns |
583 ns |
1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA |
22958 ns |
23604 ns |
0.97 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
9708 ns |
9750 ns |
1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
10084 ns |
10292 ns |
0.98 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
10041 ns |
10250 ns |
0.98 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
9291 ns |
9333 ns |
1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA |
253769.5 ns |
260113.5 ns |
0.98 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
5687.5 ns |
5083.5 ns |
1.12 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
5729.5 ns |
5792 ns |
0.99 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
6458 ns |
6833 ns |
0.95 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
5291.5 ns |
5375 ns |
0.98 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA |
185951.5 ns |
229273.5 ns |
0.81 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7417 ns |
6709 ns |
1.11 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7458 ns |
7667 ns |
0.97 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7541.5 ns |
7583 ns |
0.99 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7500 ns |
6937.5 ns |
1.08 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA |
731372.5 ns |
777061.5 ns |
0.94 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) |
2125 ns |
1917 ns |
1.11 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) |
2333 ns |
2500 ns |
0.93 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) |
2333 ns |
2208 ns |
1.06 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) |
2167 ns |
2250 ns |
0.96 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA |
18317 ns |
18340 ns |
1.00 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) |
6542 ns |
6542 ns |
1.00 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) |
6542 ns |
6667 ns |
0.98 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) |
6959 ns |
6666 ns |
1.04 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) |
6667 ns |
6584 ns |
1.01 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA |
307304.5 ns |
320616.5 ns |
0.96 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) |
749042 ns |
750542 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) |
758479 ns |
746792 ns |
1.02 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) |
749583 ns |
746916 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) |
752500.5 ns |
750584 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA |
21294 ns |
21795 ns |
0.98 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) |
775167 ns |
805145.5 ns |
0.96 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) |
792834 ns |
791604 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) |
775375 ns |
772584 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) |
792083.5 ns |
810645.5 ns |
0.98 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA |
295315 ns |
302046.5 ns |
0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7333 ns |
6959 ns |
1.05 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
5292 ns |
5917 ns |
0.89 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
6083 ns |
6000 ns |
1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
10167 ns |
10167 ns |
1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
32899.5 ns |
32896 ns |
1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
219167 ns |
228770.5 ns |
0.96 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
270000 ns |
227709 ns |
1.19 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
228875 ns |
228084 ns |
1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
225958 ns |
225625.5 ns |
1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
320960.5 ns |
359979 ns |
0.89 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
9833 ns |
10250 ns |
0.96 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
10959 ns |
10208 ns |
1.07 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
11209 ns |
11042 ns |
1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
10458 ns |
9958 ns |
1.05 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA |
211522.5 ns |
245976 ns |
0.86 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
23833.5 ns |
24896 ns |
0.96 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
24875 ns |
24000 ns |
1.04 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
25458 ns |
25416.5 ns |
1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
24458.5 ns |
24625 ns |
0.99 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA |
1049023 ns |
1114734 ns |
0.94 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) |
106078646 ns |
106794687 ns |
0.99 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) |
126381834 ns |
118367979 ns |
1.07 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) |
121177729 ns |
120992291 ns |
1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) |
117537563 ns |
118045833 ns |
1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA |
2652799 ns |
2655666 ns |
1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) |
391974416 ns |
397097667 ns |
0.99 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) |
380681375 ns |
368138875 ns |
1.03 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) |
356763209 ns |
357737125 ns |
1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) |
480738708 ns |
483722209 ns |
0.99 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA |
15212996 ns |
15195689 ns |
1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) |
753414729 ns |
769405854 ns |
0.98 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) |
774115708 ns |
762934333 ns |
1.01 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) |
748053687.5 ns |
748099729.5 ns |
1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) |
944047458.5 ns |
772112770.5 ns |
1.22 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
6916.5 ns |
6417 ns |
1.08 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
7333 ns |
7375 ns |
0.99 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
8542 ns |
8187 ns |
1.04 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
7104.5 ns |
8708.5 ns |
0.82 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA |
230215 ns |
243458.5 ns |
0.95 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
14458 ns |
13625 ns |
1.06 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
14417 ns |
14834 ns |
0.97 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
14292 ns |
14834 ns |
0.96 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
14083 ns |
14000 ns |
1.01 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA |
1044880 ns |
1081512.5 ns |
0.97 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
6041 ns |
5500 ns |
1.10 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
6542 ns |
6083.5 ns |
1.08 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
7042 ns |
7500 ns |
0.94 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
6021 ns |
5625 ns |
1.07 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA |
227729 ns |
236881 ns |
0.96 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
12750 ns |
12583 ns |
1.01 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
12625 ns |
12750 ns |
0.99 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
12750 ns |
13000 ns |
0.98 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
12750 ns |
12542 ns |
1.02 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA |
749467.5 ns |
792100 ns |
0.95 |
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) |
5292 ns |
328937.5 ns |
0.016088162644879347 |
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) |
6041 ns |
345250 ns |
0.017497465604634322 |
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) |
6458 ns |
398625 ns |
0.01620068987143305 |
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) |
5542 ns |
315687.5 ns |
0.017555335577113442 |
batchedmm(2, Bsize=128)/forward/GPU/CUDA |
16776 ns |
17026 ns |
0.99 |
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) |
15791 ns |
701750 ns |
0.022502315639472747 |
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) |
15375 ns |
734417 ns |
0.02093497291048546 |
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) |
15583 ns |
1025666 ns |
0.015193055049109554 |
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) |
15666 ns |
663750 ns |
0.02360225988700565 |
batchedmm(2, Bsize=128)/zygote/GPU/CUDA |
196800 ns |
202330 ns |
0.97 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
375 ns |
292 ns |
1.28 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
375 ns |
417 ns |
0.90 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
375 ns |
416 ns |
0.90 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
334 ns |
292 ns |
1.14 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA |
23311 ns |
23795 ns |
0.98 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
6458 ns |
6250 ns |
1.03 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
6458 ns |
6750 ns |
0.96 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
6625 ns |
6500 ns |
1.02 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
6458 ns |
6104.5 ns |
1.06 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA |
237753.5 ns |
242897.5 ns |
0.98 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
5875 ns |
5875 ns |
1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
5916 ns |
6042 ns |
0.98 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
6000 ns |
5917 ns |
1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
5875 ns |
5875 ns |
1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA |
24206 ns |
24778 ns |
0.98 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
21292 ns |
21834 ns |
0.98 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
20834 ns |
21542 ns |
0.97 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
21500 ns |
21750 ns |
0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
20708.5 ns |
21417 ns |
0.97 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA |
259866 ns |
265364.5 ns |
0.98 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
143791 ns |
184375 ns |
0.78 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
158270.5 ns |
185000 ns |
0.86 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
147479 ns |
149541 ns |
0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
149250 ns |
190750 ns |
0.78 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
166782 ns |
168165 ns |
0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1333750 ns |
1361667 ns |
0.98 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1370333.5 ns |
1306875.5 ns |
1.05 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1335416 ns |
1318541.5 ns |
1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1320333.5 ns |
1332084 ns |
0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
1283264 ns |
1372553 ns |
0.93 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
23708.5 ns |
24458 ns |
0.97 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
23792 ns |
22729 ns |
1.05 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
24458 ns |
25000 ns |
0.98 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
24354.5 ns |
22374.5 ns |
1.09 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
285298 ns |
355948 ns |
0.80 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
126459 ns |
176958 ns |
0.71 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
148916 ns |
131167 ns |
1.14 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
119292 ns |
126166.5 ns |
0.95 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
173833 ns |
177542 ns |
0.98 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1391431.5 ns |
1491511 ns |
0.93 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
375 ns |
292 ns |
1.28 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
375 ns |
417 ns |
0.90 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
375 ns |
375 ns |
1.00 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
292 ns |
333 ns |
0.88 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA |
22764 ns |
23138 ns |
0.98 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
6520.5 ns |
6125 ns |
1.06 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
6500 ns |
6917 ns |
0.94 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
6770.5 ns |
6667 ns |
1.02 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
6417 ns |
6250 ns |
1.03 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA |
253763.5 ns |
259300 ns |
0.98 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
4583.5 ns |
4458 ns |
1.03 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
4750 ns |
4875 ns |
0.97 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
5583 ns |
5708.5 ns |
0.98 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
4584 ns |
4833 ns |
0.95 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA |
243103.5 ns |
258768.5 ns |
0.94 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
9812.5 ns |
9709 ns |
1.01 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
9750 ns |
10083 ns |
0.97 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
10250 ns |
10417 ns |
0.98 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
10250 ns |
10041.5 ns |
1.02 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA |
1309296.5 ns |
1358754 ns |
0.96 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) |
1584 ns |
1625 ns |
0.97 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) |
1625 ns |
1666 ns |
0.98 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) |
1625 ns |
1667 ns |
0.97 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) |
1584 ns |
1583 ns |
1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA |
22984 ns |
23306 ns |
0.99 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) |
5667 ns |
5625 ns |
1.01 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) |
5709 ns |
6125 ns |
0.93 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) |
5958 ns |
6041 ns |
0.99 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) |
5625 ns |
5625 ns |
1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA |
272375 ns |
275587 ns |
0.99 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) |
6807375 ns |
6813916.5 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) |
6369645.5 ns |
6428416 ns |
0.99 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) |
6576041.5 ns |
6554167 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) |
7693187.5 ns |
7571104.5 ns |
1.02 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA |
214270 ns |
213811 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) |
24061771 ns |
24163500 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) |
21342083.5 ns |
21359167 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) |
21083084 ns |
21066083 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) |
29748249.5 ns |
29670209 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA |
2098936 ns |
2101483 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) |
37348771 ns |
37462416 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) |
45817791 ns |
45862833.5 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) |
46004333 ns |
45876667 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) |
49463688 ns |
38235959 ns |
1.29 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
5750 ns |
5459 ns |
1.05 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
6334 ns |
6250 ns |
1.01 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
6958 ns |
6958 ns |
1.00 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
5895.5 ns |
5292 ns |
1.11 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA |
229337.5 ns |
238588.5 ns |
0.96 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
8458 ns |
7959 ns |
1.06 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
8959 ns |
8334 ns |
1.07 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
8417 ns |
8250 ns |
1.02 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
7959 ns |
8250 ns |
0.96 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA |
1021096 ns |
1068264.5 ns |
0.96 |
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) |
1557541 ns |
1529292 ns |
1.02 |
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) |
1250583 ns |
1266666.5 ns |
0.99 |
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) |
1626375 ns |
1623709 ns |
1.00 |
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) |
2091875 ns |
2163750 ns |
0.97 |
lenet(28, 28, 1, 128)/forward/GPU/CUDA |
272400.5 ns |
279544 ns |
0.97 |
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) |
7906292 ns |
7968292 ns |
0.99 |
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) |
6655979 ns |
6533250 ns |
1.02 |
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) |
7142375 ns |
7125792 ns |
1.00 |
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) |
10434875 ns |
10479375 ns |
1.00 |
lenet(28, 28, 1, 128)/zygote/GPU/CUDA |
1820689 ns |
1874497 ns |
0.97 |
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) |
368291 ns |
320667 ns |
1.15 |
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) |
352000 ns |
346291 ns |
1.02 |
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) |
460542 ns |
428584 ns |
1.07 |
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) |
24791 ns |
345375 ns |
0.07177994933043794 |
batchedmm(128, Bsize=4)/forward/GPU/CUDA |
45978 ns |
46619.5 ns |
0.99 |
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) |
733708.5 ns |
745958.5 ns |
0.98 |
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) |
814750 ns |
791666.5 ns |
1.03 |
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) |
1065167 ns |
1073208.5 ns |
0.99 |
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) |
94833 ns |
776479 ns |
0.12 |
batchedmm(128, Bsize=4)/zygote/GPU/CUDA |
277791 ns |
311670 ns |
0.89 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) |
397750 ns |
396708.5 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) |
213375 ns |
287917 ns |
0.74 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) |
287917 ns |
288250 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) |
755083 ns |
753417 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA |
44009 ns |
44556 ns |
0.99 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) |
673708 ns |
645167 ns |
1.04 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) |
474958 ns |
527667 ns |
0.90 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) |
533292 ns |
532000 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) |
972958 ns |
974292 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA |
189948 ns |
190424 ns |
1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
651833 ns |
668958 ns |
0.97 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
663375.5 ns |
629749.5 ns |
1.05 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
610708 ns |
544375 ns |
1.12 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
646208 ns |
643396 ns |
1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
131710 ns |
132592.5 ns |
0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2462750 ns |
2485646 ns |
0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2486896 ns |
2448562.5 ns |
1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2493833 ns |
2450292 ns |
1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2462666 ns |
2461146 ns |
1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
1171820 ns |
1408688 ns |
0.83 |
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) |
3770.5 ns |
324000.5 ns |
0.011637327720173271 |
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) |
2250 ns |
344459 ns |
0.006531982035597851 |
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) |
4125 ns |
396583 ns |
0.010401353562810307 |
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) |
3354 ns |
314083.5 ns |
0.010678688947365907 |
batchedmm(2, Bsize=32)/forward/GPU/CUDA |
16239 ns |
16193 ns |
1.00 |
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) |
5583 ns |
700875 ns |
0.007965757089352595 |
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) |
5292 ns |
734292 ns |
0.007206942197381968 |
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) |
5542 ns |
1020625 ns |
0.005430006123698714 |
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) |
5500 ns |
656584 ns |
0.00837668904511837 |
batchedmm(2, Bsize=32)/zygote/GPU/CUDA |
194908.5 ns |
201017 ns |
0.97 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
1457334 ns |
1461042 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
1493416 ns |
1503750 ns |
0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
1499125 ns |
1504625 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
1437458 ns |
1442917 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
40269 ns |
40991 ns |
0.98 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
5127875.5 ns |
5155750 ns |
0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
5301459 ns |
5279833.5 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
5323375 ns |
5308333.5 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
4984791.5 ns |
4987604 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
196759 ns |
200839 ns |
0.98 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) |
3667 ns |
3750 ns |
0.98 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) |
3708 ns |
3709 ns |
1.00 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) |
3708 ns |
3667 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) |
3708 ns |
3708 ns |
1 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA |
33034 ns |
33187 ns |
1.00 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) |
15084 ns |
14958 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) |
15208 ns |
15395.5 ns |
0.99 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) |
15250 ns |
15375 ns |
0.99 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) |
15208 ns |
15083 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA |
366758 ns |
379072.5 ns |
0.97 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
71250 ns |
71541 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
71375 ns |
71542 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
71375 ns |
71270.5 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
71291 ns |
71083 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA |
113846 ns |
112914 ns |
1.01 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
317292 ns |
325333 ns |
0.98 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
326625 ns |
320729.5 ns |
1.02 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
318459 ns |
318792 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
318042 ns |
317333 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA |
194171.5 ns |
193733 ns |
1.00 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
1041 ns |
1000 ns |
1.04 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
1083 ns |
1125 ns |
0.96 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
1083 ns |
1083 ns |
1 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
1000 ns |
1000 ns |
1 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA |
23364.5 ns |
23845 ns |
0.98 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
8209 ns |
7750 ns |
1.06 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
8042 ns |
8583 ns |
0.94 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
8125 ns |
8500 ns |
0.96 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
8208 ns |
7750 ns |
1.06 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA |
257659 ns |
262768.5 ns |
0.98 |
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) |
508125 ns |
456417 ns |
1.11 |
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) |
478917 ns |
472584 ns |
1.01 |
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) |
566520.5 ns |
554479 ns |
1.02 |
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) |
219792 ns |
550167 ns |
0.40 |
batchedmm(128, Bsize=32)/forward/GPU/CUDA |
129544.5 ns |
128330 ns |
1.01 |
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) |
1385208 ns |
1408750 ns |
0.98 |
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) |
1467875 ns |
1380958 ns |
1.06 |
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) |
1768791.5 ns |
1632666.5 ns |
1.08 |
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) |
871958 ns |
1597604 ns |
0.55 |
batchedmm(128, Bsize=32)/zygote/GPU/CUDA |
273249 ns |
274089 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
375 ns |
334 ns |
1.12 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
375 ns |
417 ns |
0.90 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
417 ns |
375 ns |
1.11 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
333 ns |
333 ns |
1 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
31676 ns |
31588 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
6334 ns |
6083 ns |
1.04 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
6333 ns |
6750 ns |
0.94 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
6416 ns |
6458 ns |
0.99 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
6500 ns |
6125 ns |
1.06 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
261677.5 ns |
263587.5 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
1721292 ns |
1767792 ns |
0.97 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
1726209 ns |
1726375 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
1723750 ns |
1725708 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
1729333 ns |
1773250 ns |
0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
168454 ns |
168887 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
4357145.5 ns |
4406958 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
4398937.5 ns |
4358916 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
4385500 ns |
4369792 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
4371333 ns |
4367125 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
1121300 ns |
1241756.5 ns |
0.90 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
6666 ns |
6750 ns |
0.99 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
6917 ns |
7000 ns |
0.99 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
6709 ns |
6792 ns |
0.99 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
6917 ns |
6750 ns |
1.02 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA |
20771 ns |
19512 ns |
1.06 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
51750 ns |
51584 ns |
1.00 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
52271 ns |
48771 ns |
1.07 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
33125 ns |
33250 ns |
1.00 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
51166 ns |
52958 ns |
0.97 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA |
211174 ns |
210086 ns |
1.01 |
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) |
17209 ns |
328750 ns |
0.052346768060836504 |
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) |
18167 ns |
344958 ns |
0.05266438233060257 |
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) |
19667 ns |
408250 ns |
0.04817391304347826 |
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) |
17875 ns |
323500 ns |
0.05525502318392581 |
batchedmm(2, Bsize=512)/forward/GPU/CUDA |
18120 ns |
18058 ns |
1.00 |
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) |
53583 ns |
719583.5 ns |
0.07446390863603737 |
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) |
53083 ns |
735666.5 ns |
0.07215633714461649 |
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) |
53208 ns |
1034250 ns |
0.0514459753444525 |
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) |
53333 ns |
684646 ns |
0.07789865127379697 |
batchedmm(2, Bsize=512)/zygote/GPU/CUDA |
332596 ns |
345041 ns |
0.96 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
75292 ns |
75459 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
75542 ns |
75292 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
75625 ns |
75167 ns |
1.01 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
75292 ns |
75333 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA |
46820.5 ns |
46969 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
327334 ns |
332833 ns |
0.98 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
334083 ns |
325833 ns |
1.03 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
325708 ns |
324583 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
324458 ns |
323834 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA |
210121 ns |
207979 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
1484208 ns |
1487708 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
1520208 ns |
1530375 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
1525917 ns |
1530750 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
1462875 ns |
1466417 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
51807 ns |
51505.5 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
5121958 ns |
5146312.5 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
5309458 ns |
5151604.5 ns |
1.03 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
5315875 ns |
5003270.5 ns |
1.06 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
4991875 ns |
4984709 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
202097 ns |
205494.5 ns |
0.98 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
28250 ns |
28250 ns |
1 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
28250 ns |
28334 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
28250 ns |
28333 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
28209 ns |
28167 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA |
24786 ns |
24407 ns |
1.02 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
66250 ns |
66500 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
66417 ns |
66375 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
66709 ns |
67458 ns |
0.99 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
66375 ns |
66417 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA |
508762 ns |
525547 ns |
0.97 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) |
1466375 ns |
1383749.5 ns |
1.06 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) |
930645.5 ns |
1059771 ns |
0.88 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) |
1076708 ns |
1061458 ns |
1.01 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) |
2109333.5 ns |
2248687.5 ns |
0.94 |
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA |
584770.5 ns |
581876.5 ns |
1.00 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) |
3071000 ns |
3035479 ns |
1.01 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) |
2631084 ns |
2745250 ns |
0.96 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) |
2776875 ns |
2740958 ns |
1.01 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) |
3811833.5 ns |
3811500 ns |
1.00 |
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA |
2004790.5 ns |
2064611 ns |
0.97 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) |
7904896 ns |
8921042 ns |
0.89 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) |
8008541 ns |
8776625 ns |
0.91 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) |
7950354.5 ns |
8768729.5 ns |
0.91 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) |
4825709 ns |
6359583 ns |
0.76 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
119062.5 ns |
82083.5 ns |
1.45 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
137041 ns |
81562.5 ns |
1.68 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
82375 ns |
83125 ns |
0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
80979 ns |
80583 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
194253.5 ns |
192403.5 ns |
1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1969750.5 ns |
2040625 ns |
0.97 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2032792 ns |
1935354.5 ns |
1.05 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1755791.5 ns |
2023083 ns |
0.87 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2016250.5 ns |
2003562.5 ns |
1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
780655 ns |
805958 ns |
0.97 |
This comment was automatically generated by workflow using github-action-benchmark.
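For readers skimming the table, the Ratio column appears to be the current timing divided by the previous one, so values below 1 mean the current commit is faster. A minimal sketch of that arithmetic on a few rows copied from the table follows; the helper names `ratio` and `flag` and the 10% tolerance are illustrative assumptions, not settings taken from the benchmark workflow.

```python
# Sketch only: `ratio`, `flag`, and the 10% tolerance are hypothetical helpers
# for reading the table, not part of github-action-benchmark.

def ratio(current_ns: float, previous_ns: float) -> float:
    """Current time divided by previous time; below 1 means faster."""
    return current_ns / previous_ns


def flag(current_ns: float, previous_ns: float, tolerance: float = 0.10) -> str:
    """Classify a row as slower, faster, or unchanged within the tolerance."""
    r = ratio(current_ns, previous_ns)
    if r > 1 + tolerance:
        return "slower"
    if r < 1 - tolerance:
        return "faster"
    return "unchanged"


# (current ns, previous ns) pairs copied from rows of the table above.
rows = {
    "batchedmm(2, Bsize=32)/forward/CPU/2 thread(s)": (3770.5, 324000.5),
    "dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s)": (213375, 287917),
    "groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)": (137041, 81562.5),
    "dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s)": (3708, 3708),
}

for name, (cur, prev) in rows.items():
    print(f"{flag(cur, prev):9s} {ratio(cur, prev):.2f}  {name}")
```

Read this way, the small `batchedmm(2, ...)` CPU rows stand out as the largest change in this part of the table, while most other rows sit within a few percent of 1.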
@wsmoses any idea what the following means? (This is preceded by a 19K LLVM dump.)
initfn=define private void @jlplt_ijl_set_task_tid_448994({} addrspace(10)* %0, i32 %1) #9 {
top:
%2 = load atomic void ()*, void ()** null unordered, align 8
%3 = icmp ne void ()* %2, null
br i1 %3, label %ccall, label %dlsym
dlsym: ; preds = %top
%4 = call void ()* @ijl_load_and_lookup(i8* inttoptr (i64 3 to i8*), i8* getelementptr inbounds ([17 x i8], [17 x i8]* @_j_str_ijl_set_task_tid_21, i32 0, i32 0), i8** @jl_libjulia_internal_handle)
store atomic void ()* %4, void ()** null release, align 8
br label %ccall
ccall: ; preds = %dlsym, %top
%5 = phi void ()* [ %2, %top ], [ %4, %dlsym ]
%6 = bitcast void ()* %5 to void ({} addrspace(10)*, i32)*
%7 = bitcast void ({} addrspace(10)*, i32)* %6 to void ()*
store atomic void ()* %7, void ()** @jlplt_ijl_set_task_tid_448994_got release, align 8
musttail call void %6({} addrspace(10)* %0, i32 %1)
ret void
}
loadfn= %2 = load atomic void ()*, void ()** null unordered, align 8
opv=void ()** null
It seems to show up once I trigger parallel tests, which is very strange. Before I run parallel tests, the same code just works...
f047259 to 8f246a3
@avik-pal can you open an issue with the full dump?
And can you see if using https://github.com/EnzymeAD/Enzyme.jl/pull/2068/files fixes it?
92f1f7a to d586e10
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
32e9e57 to af3cfb9
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Tests are now (mostly) happy. I still need to look into downgrade eventually, but I will call this a victory and merge.