fix: make enzyme testing opt-in for now #1041
443e158 to 6537129
Lux Benchmarks
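The Ratio column in the table that follows appears to be the Current timing divided by the Previous timing, rounded to two decimals, so values above 1 mean the current commit was slower on that run. A minimal Python sketch of that arithmetic is given here; the helper names `ratio` and `classify` and the 10% noise threshold are illustrative assumptions, not part of the benchmark tooling.

```python
# Hypothetical helpers for reading the table; not part of the benchmark bot.

def ratio(current_ns: float, previous_ns: float) -> float:
    """Current / Previous, rounded to two decimals like the Ratio column."""
    return round(current_ns / previous_ns, 2)


def classify(current_ns: float, previous_ns: float, threshold: float = 1.10) -> str:
    """Label a row using an assumed (arbitrary) 10% noise threshold."""
    r = current_ns / previous_ns
    if r > threshold:
        return "slower"
    if r < 1 / threshold:
        return "faster"
    return "unchanged"


# Timings copied from the bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) row.
print(ratio(3000, 1417), classify(3000, 1417))
```

Sub-microsecond CPU rows can swing by more than 10% between runs, so a single ratio far from 1 on a small kernel is weak evidence on its own.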
Benchmark suite | Current: ba4dc25 | Previous: 900c21c | Ratio |
---|---|---|---|
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4292 ns | 4270.5 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 4250 ns | 4000 ns | 1.06 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 5875 ns | 5875 ns | 1 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 4500 ns | 4895.5 ns | 0.92 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 59809 ns | 59833 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 10375 ns | 10375 ns | 1 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 10250 ns | 9958 ns | 1.03 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 10833 ns | 10792 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 10208 ns | 10125 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 422458 ns | 422438 ns | 1.00 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 1042 ns | 1083 ns | 0.96 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 1083 ns | 1000 ns | 1.08 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 3000 ns | 1417 ns | 2.12 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 1145.5 ns | 1125 ns | 1.02 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA | 18107 ns | 18109 ns | 1.00 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 4125 ns | 4166 ns | 0.99 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 4084 ns | 4125 ns | 0.99 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 4292 ns | 4187.5 ns | 1.02 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 4083 ns | 4042 ns | 1.01 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA | 109577.5 ns | 109209 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 56083 ns | 57645.5 ns | 0.97 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46583 ns | 47000 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46333 ns | 38125 ns | 1.22 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 80250 ns | 82084 ns | 0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 37793 ns | 37455 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2041292 ns | 1973687 ns | 1.03 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2085750 ns | 2089416 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2063854 ns | 2085625 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1996187.5 ns | 1985813 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 199313 ns | 195917 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 148334 ns | 146416.5 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 146833 ns | 147020.5 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 147750 ns | 145667 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 144437.5 ns | 145604.5 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 166164 ns | 166391 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1151500 ns | 1129209 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1119479.5 ns | 1126375 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1113896.5 ns | 1147667 ns | 0.97 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1116708 ns | 1104209 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 534203.5 ns | 521058.5 ns | 1.03 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 3979 ns | 3416.5 ns | 1.16 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 3375 ns | 3333 ns | 1.01 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 5250 ns | 6333 ns | 0.83 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 3625 ns | 3250 ns | 1.12 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 68253.5 ns | 66594 ns | 1.02 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9792 ns | 8792 ns | 1.11 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8916 ns | 9291 ns | 0.96 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9417 ns | 9250 ns | 1.02 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8958 ns | 9292 ns | 0.96 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 503513.5 ns | 493812 ns | 1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 15125 ns | 14750 ns | 1.03 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 15084 ns | 15458 ns | 0.98 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 18187 ns | 19167 ns | 0.95 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 15125 ns | 16437.5 ns | 0.92 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 55485 ns | 53833 ns | 1.03 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 221124.5 ns | 215416.5 ns | 1.03 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 213541 ns | 213208.5 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 215104 ns | 214271 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 212854.5 ns | 227104 ns | 0.94 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 277108.5 ns | 271460 ns | 1.02 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 625 ns | 542 ns | 1.15 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 583 ns | 625 ns | 0.93 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 792 ns | 792 ns | 1 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 583 ns | 583 ns | 1 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA | 17961 ns | 17470 ns | 1.03 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 1583 ns | 1750 ns | 0.90 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 1666 ns | 1417 ns | 1.18 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 1875 ns | 1709 ns | 1.10 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 1542 ns | 1645.5 ns | 0.94 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA | 104946 ns | 101826.5 ns | 1.03 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 6792 ns | 7250 ns | 0.94 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5917 ns | 5916 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6000 ns | 5292 ns | 1.13 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 9792 ns | 10000 ns | 0.98 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 23774 ns | 23857.5 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 228625 ns | 226895.5 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 229041.5 ns | 230375 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 230250 ns | 231584 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 255250 ns | 258625 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 172526.5 ns | 167659 ns | 1.03 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 3875 ns | 3875 ns | 1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 3875 ns | 3875 ns | 1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 3875 ns | 3916 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 3834 ns | 3833 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA | 23855 ns | 23468 ns | 1.02 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16625 ns | 16750 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16666 ns | 17042 ns | 0.98 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16916 ns | 17000 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16666 ns | 16625 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA | 165483.5 ns | 160597 ns | 1.03 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 571416 ns | 572166 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 575792 ns | 575000 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 603292 ns | 587458 ns | 1.03 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 574167 ns | 578334 ns | 0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA | 113501.5 ns | 113397 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1437833 ns | 1421708 ns | 1.01 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1421291 ns | 1420125 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1456333 ns | 1430083 ns | 1.02 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 1422000 ns | 1413292 ns | 1.01 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA | 215195 ns | 209669.5 ns | 1.03 |
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) | 1058416 ns | 1074458 ns | 0.99 |
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) | 960521 ns | 958250.5 ns | 1.00 |
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) | 1353667 ns | 1334396 ns | 1.01 |
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) | 1293917 ns | 1310875 ns | 0.99 |
lenet(28, 28, 1, 64)/forward/GPU/CUDA | 278078 ns | 269120.5 ns | 1.03 |
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) | 5810729.5 ns | 5769437 ns | 1.01 |
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) | 4595146 ns | 4470625 ns | 1.03 |
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) | 4941375 ns | 4941021 ns | 1.00 |
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) | 5508438 ns | 5552042 ns | 0.99 |
lenet(28, 28, 1, 64)/zygote/GPU/CUDA | 1097372.5 ns | 1066489 ns | 1.03 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 500 ns | 500 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 542 ns | 542 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 583 ns | 542 ns | 1.08 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 542 ns | 500 ns | 1.08 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA | 24077 ns | 23585 ns | 1.02 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2084 ns | 2083 ns | 1.00 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2208 ns | 2167 ns | 1.02 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 2167 ns | 2250 ns | 0.96 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2125 ns | 2125 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA | 174630 ns | 169900 ns | 1.03 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 4334 ns | 4084 ns | 1.06 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6500 ns | 6250 ns | 1.04 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 6833 ns | 7209 ns | 0.95 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 4167 ns | 6125 ns | 0.68 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 66416 ns | 64199 ns | 1.03 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 11459 ns | 11083 ns | 1.03 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 11542 ns | 11625 ns | 0.99 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 12000 ns | 12000 ns | 1 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 11250 ns | 10917 ns | 1.03 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 455229 ns | 446167.5 ns | 1.02 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7208 ns | 6042 ns | 1.19 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7375 ns | 7042 ns | 1.05 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 7500 ns | 8833 ns | 0.85 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 5875 ns | 7250 ns | 0.81 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 53888 ns | 51074.5 ns | 1.06 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 17250 ns | 17292 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 17542 ns | 18334 ns | 0.96 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 18042 ns | 18083 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 16417 ns | 17229.5 ns | 0.95 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 308501.5 ns | 299895.5 ns | 1.03 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 459 ns | 1.09 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 542 ns | 542 ns | 1 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 584 ns | 542 ns | 1.08 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 500 ns | 1 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 33132 ns | 32630 ns | 1.02 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 8833 ns | 8458 ns | 1.04 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 8875 ns | 9041 ns | 0.98 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9187.5 ns | 9166 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9083 ns | 8459 ns | 1.07 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 162463.5 ns | 158907 ns | 1.02 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 64875 ns | 64625 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 64667 ns | 64250 ns | 1.01 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 64083 ns | 65000 ns | 0.99 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 64583 ns | 64667 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA | 112830.5 ns | 111460 ns | 1.01 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 287583.5 ns | 289667 ns | 0.99 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 281083 ns | 279750 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 286167 ns | 289625 ns | 0.99 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 278000 ns | 281250 ns | 0.99 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA | 188746 ns | 184453.5 ns | 1.02 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) | 3222417 ns | 3347125 ns | 0.96 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) | 3053833 ns | 3015520.5 ns | 1.01 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) | 3024458 ns | 2792979 ns | 1.08 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) | 4052750 ns | 4064520.5 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA | 595462.5 ns | 588037 ns | 1.01 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) | 7598375 ns | 7500166 ns | 1.01 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) | 7440500 ns | 7470229.5 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) | 7201375 ns | 7393937.5 ns | 0.97 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) | 8195250 ns | 8209000 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA | 1370693 ns | 1331630 ns | 1.03 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) | 19217250 ns | 19529541 ns | 0.98 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) | 19124458 ns | 19142959 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) | 19130959 ns | 19022708 ns | 1.01 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) | 15708959 ns | 15703750 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 23901937.5 ns | 23617083 ns | 1.01 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 33837333 ns | 33598208 ns | 1.01 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 37088812.5 ns | 41100666 ns | 0.90 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 34986334 ns | 35022333 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1855403 ns | 1855178.5 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 191718583.5 ns | 189352250 ns | 1.01 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 164425125 ns | 163568208 ns | 1.01 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 152408416 ns | 158452896 ns | 0.96 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 441427250 ns | 438607167 ns | 1.01 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 13919509 ns | 13925600.5 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 292378833.5 ns | 287704167 ns | 1.02 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 338895729.5 ns | 337952937.5 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 298620583 ns | 291466708 ns | 1.02 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 395610896 ns | 395696000 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 23042 ns | 21334 ns | 1.08 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 24625 ns | 24375 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 25375 ns | 25771 ns | 0.98 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 24333 ns | 23584 ns | 1.03 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 97657 ns | 95861 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 104125 ns | 103625 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 115458 ns | 103708 ns | 1.11 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 104958 ns | 104625 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 102916 ns | 103479.5 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 512696.5 ns | 510517.5 ns | 1.00 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 6750 ns | 5750 ns | 1.17 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7500 ns | 7208 ns | 1.04 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 7667 ns | 7666.5 ns | 1.00 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 5750 ns | 7166 ns | 0.80 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 69092 ns | 68604 ns | 1.01 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 15541 ns | 14708 ns | 1.06 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 15792 ns | 15916 ns | 0.99 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 16541 ns | 16666 ns | 0.99 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14834 ns | 14667 ns | 1.01 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 482365 ns | 483804.5 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 2981291.5 ns | 2876500 ns | 1.04 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2052208 ns | 2063833 ns | 0.99 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2263979 ns | 2288208 ns | 0.99 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4755625 ns | 4870416 ns | 0.98 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 583663.5 ns | 587700 ns | 0.99 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 23786792 ns | 23421375 ns | 1.02 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 18053854.5 ns | 17990750 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 17489458.5 ns | 18312792 ns | 0.96 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 34946750 ns | 35646292 ns | 0.98 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3162391.5 ns | 3104605 ns | 1.02 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 33775833 ns | 33240625 ns | 1.02 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 27501917 ns | 27662417 ns | 0.99 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 27490000 ns | 27837459 ns | 0.99 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 41729750 ns | 41788833 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 74437.5 ns | 72083 ns | 1.03 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 76083 ns | 78729 ns | 0.97 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 76541 ns | 75729.5 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 74292 ns | 72459 ns | 1.03 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 100952 ns | 100762.5 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 207625 ns | 204458 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 207167 ns | 219041 ns | 0.95 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 293750 ns | 320458 ns | 0.92 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 225125 ns | 205312.5 ns | 1.10 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 544412.5 ns | 541454.5 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 12125 ns | 11333 ns | 1.07 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 12875 ns | 12416 ns | 1.04 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 13625 ns | 13834 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 11750 ns | 13125 ns | 0.90 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 69795 ns | 69856.5 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 26667 ns | 26520.5 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 27333 ns | 27458 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 28208 ns | 28291 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 26625 ns | 26500 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 470671.5 ns | 473341 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 12708 ns | 11833 ns | 1.07 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 13375 ns | 12750 ns | 1.05 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 13375 ns | 14333 ns | 0.93 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 12584 ns | 13375 ns | 0.94 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 53177 ns | 51587 ns | 1.03 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 26167 ns | 26375 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 26792 ns | 26583 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 26000 ns | 26666 ns | 0.98 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 27000 ns | 26417 ns | 1.02 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 301560.5 ns | 302777.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 180479 ns | 178666.5 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 182041 ns | 180292 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 182125 ns | 184416.5 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 179687.5 ns | 179709 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 56093 ns | 55677 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 586125 ns | 591146.5 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 587625 ns | 588583 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 594291 ns | 593062 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 582063 ns | 582708.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 282917.5 ns | 285027 ns | 0.99 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 6417 ns | 5667 ns | 1.13 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7084 ns | 7167 ns | 0.99 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 7750 ns | 7895.5 ns | 0.98 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 6125 ns | 7291 ns | 0.84 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 69623 ns | 69657.5 ns | 1.00 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14458 ns | 14167 ns | 1.02 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14958 ns | 14958 ns | 1 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 15792 ns | 15854.5 ns | 1.00 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14625 ns | 14583 ns | 1.00 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 458284 ns | 460443 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) | 1163875 ns | 1194208.5 ns | 0.97 |
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) | 1217875 ns | 1216792 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) | 1274792 ns | 1262604 ns | 1.01 |
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) | 1326000 ns | 1318166.5 ns | 1.01 |
batchedmm(512, Bsize=4)/forward/GPU/CUDA | 302247 ns | 301559 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) | 4269354 ns | 4098416 ns | 1.04 |
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) | 4347708 ns | 4352937.5 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) | 4646208 ns | 4631875 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) | 4438396 ns | 4436562.5 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/GPU/CUDA | 1034297 ns | 1042661.5 ns | 0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1792 ns | 1750 ns | 1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1834 ns | 1833 ns | 1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1833 ns | 1834 ns | 1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1875 ns | 1875 ns | 1 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA | 22933 ns | 23523 ns | 0.97 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 4792 ns | 4792 ns | 1 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 4958 ns | 4875 ns | 1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 4916 ns | 4916 ns | 1 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 4875 ns | 4875 ns | 1 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA | 186651 ns | 187370 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6208 ns | 5500 ns | 1.13 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6583 ns | 6334 ns | 1.04 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6958 ns | 8604 ns | 0.81 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6291.5 ns | 7292 ns | 0.86 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 55052.5 ns | 54466 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10917 ns | 10958 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 11833 ns | 11792 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 11792 ns | 11708.5 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 11167 ns | 11166 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 327658.5 ns | 330839 ns | 0.99 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 250 ns | 292 ns | 0.86 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 334 ns | 333 ns | 1.00 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 375 ns | 333 ns | 1.13 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA | 22442 ns | 22873.5 ns | 0.98 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2708 ns | 2708 ns | 1 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2708 ns | 2959 ns | 0.92 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 3042 ns | 3042 ns | 1 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2791 ns | 2750 ns | 1.01 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA | 156880 ns | 157537.5 ns | 1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 13291.5 ns | 10750 ns | 1.24 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 13667 ns | 13708 ns | 1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 13937.5 ns | 14958 ns | 0.93 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 11625 ns | 14583 ns | 0.80 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 55944 ns | 55574.5 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 25250 ns | 25209 ns | 1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 25042 ns | 25250 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 25417 ns | 25375 ns | 1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 25250 ns | 24979.5 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 287796.5 ns | 292656 ns | 0.98 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4125 ns | 4208 ns | 0.98 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4167 ns | 4125 ns | 1.01 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4167 ns | 4167 ns | 1 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4167 ns | 4167 ns | 1 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA | 24570 ns | 24774 ns | 0.99 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 15917 ns | 16333 ns | 0.97 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16208 ns | 16125 ns | 1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16375 ns | 16125 ns | 1.02 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16167 ns | 16084 ns | 1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA | 191969 ns | 195031.5 ns | 0.98 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5792 ns | 5708 ns | 1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 5750 ns | 5750 ns | 1 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5709 ns | 5750 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5792 ns | 5709 ns | 1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 33007 ns | 33326 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 20666 ns | 21125 ns | 0.98 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 20916 ns | 20875 ns | 1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 21625 ns | 21583 ns | 1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 21125 ns | 21500 ns | 0.98 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 174227.5 ns | 175195.5 ns | 0.99 |
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 379416 ns | 415708 ns | 0.91 |
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 377958 ns | 376667 ns | 1.00 |
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 488875 ns | 471499.5 ns | 1.04 |
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 520084 ns | 523500 ns | 0.99 |
batchedmm(16, Bsize=512)/forward/GPU/CUDA | 66431 ns | 66680.5 ns | 1.00 |
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 918792 ns | 924750.5 ns | 0.99 |
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 848084 ns | 849291 ns | 1.00 |
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 1232167 ns | 1217521 ns | 1.01 |
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 1323208 ns | 1302292 ns | 1.02 |
batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 189370.5 ns | 189339 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 81291.5 ns | 79792 ns | 1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 82625 ns | 82667 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 81292 ns | 84208 ns | 0.97 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83000 ns | 82833 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 192966.5 ns | 193132 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1932167 ns | 1917625.5 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1699958.5 ns | 1915292 ns | 0.89 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1916125 ns | 1940917 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1917083 ns | 1896541 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 391191 ns | 395963 ns | 0.99 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 333 ns | 292 ns | 1.14 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 333 ns | 0.88 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 21486 ns | 21798 ns | 0.99 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1792 ns | 1792 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1875 ns | 1875 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1834 ns | 1834 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1792 ns | 1792 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 164447 ns | 167505 ns | 0.98 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 6750 ns | 5834 ns | 1.16 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 6875 ns | 7500 ns | 0.92 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7875 ns | 9958 ns | 0.79 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6667 ns | 6875 ns | 0.97 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 55875 ns | 58244.5 ns | 0.96 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9166 ns | 9375 ns | 0.98 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9292 ns | 9333 ns | 1.00 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9500 ns | 9354.5 ns | 1.02 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9541 ns | 9625 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 292456.5 ns | 302935 ns | 0.97 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 156951208.5 ns | 119443416.5 ns | 1.31 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 174291584 ns | 173896250 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 148195729.5 ns | 155811625 ns | 0.95 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 104188250 ns | 108054541 ns | 0.96 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5474066 ns | 5469386 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 674441875 ns | 616746166.5 ns | 1.09 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 556229250 ns | 555745625 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 454132562.5 ns | 468855125 ns | 0.97 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 761515396 ns | 760571396 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 35100772 ns | 34956216 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 700454083 ns | 648663875 ns | 1.08 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 666098604.5 ns | 664591146 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 581200271 ns | 601178041.5 ns | 0.97 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 737997250 ns | 746069334 ns | 0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 56334 ns | 59458 ns | 0.95 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 47459 ns | 47083 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47792 ns | 39166 ns | 1.22 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83542 ns | 83208 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 36978 ns | 37582 ns | 0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1938916.5 ns | 1926708 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1976604 ns | 1983042 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1978541.5 ns | 1986937.5 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1893667 ns | 1850250 ns | 1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 173150 ns | 173017.5 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 268750.5 ns | 265187.5 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 274583 ns | 267959 ns | 1.02 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 292333.5 ns | 276771 ns | 1.06 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 265458 ns | 266917 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 117998.5 ns | 128834.5 ns | 0.92 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 694354 ns | 604083 ns | 1.15 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 683917 ns | 692833.5 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 693250 ns | 705709 ns | 0.98 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 656583.5 ns | 590291.5 ns | 1.11 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 652056.5 ns | 683429 ns | 0.95 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2193104.5 ns | 2195333 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2181208 ns | 2225625 ns | 0.98 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2216521 ns | 2230583 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2184458 ns | 2183333 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 132517 ns | 133325.5 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5563750 ns | 5480833 ns | 1.02 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5510667 ns | 5508958 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5513000 ns | 5585895.5 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5481750.5 ns | 5490125 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 695298 ns | 766206 ns | 0.91 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 638500 ns | 646750 ns | 0.99 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 639292 ns | 660250 ns | 0.97 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 634541 ns | 642917 ns | 0.99 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 646979.5 ns | 647375 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 46553 ns | 47306 ns | 0.98 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1793958 ns | 1828875 ns | 0.98 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1725834 ns | 1721042 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1749625 ns | 1665209 ns | 1.05 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 2104250 ns | 2097000 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 220666.5 ns | 223896.5 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 56916 ns | 58667 ns | 0.97 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 47125 ns | 47750 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46125 ns | 38958 ns | 1.18 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83625 ns | 82750 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 28292 ns | 29191 ns | 0.97 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2045604.5 ns | 2029083.5 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2090729.5 ns | 2091166 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2082062.5 ns | 2107249.5 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1996375 ns | 1994854.5 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 188293 ns | 190986 ns | 0.99 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 13451791 ns | 13371291 ns | 1.01 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 12408750 ns | 12436583.5 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 12579833.5 ns | 12675625 ns | 0.99 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 15153250 ns | 15146959 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 515861.5 ns | 517535.5 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 47609667 ns | 47259416 ns | 1.01 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 41845375 ns | 41746209 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 41044270.5 ns | 41384750 ns | 0.99 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 58540625 ns | 58440500 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3200949 ns | 3203835 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 74073166.5 ns | 73984667 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 90759458 ns | 91223791.5 ns | 0.99 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 90813833 ns | 90609938 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 75984750 ns | 77234000 ns | 0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57000 ns | 59000 ns | 0.97 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 47292 ns | 47417 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47541 ns | 38917 ns | 1.22 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 79042 ns | 81125 ns | 0.97 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 47481 ns | 47741 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1933833.5 ns | 1911646 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1966833 ns | 1970541 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1973625 ns | 1976417 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1890208 ns | 1882083 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 194959.5 ns | 195868.5 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 333 ns | 292 ns | 1.14 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 416 ns | 375 ns | 1.11 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 334 ns | 333 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 32490 ns | 32615 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6458.5 ns | 6500 ns | 0.99 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6459 ns | 6375 ns | 1.01 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6750 ns | 6750 ns | 1 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6500 ns | 6375 ns | 1.02 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 170804 ns | 176818 ns | 0.97 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 250 ns | 292 ns | 0.86 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 291 ns | 292 ns | 1.00 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 291 ns | 292 ns | 1.00 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 291 ns | 250 ns | 1.16 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 32160 ns | 32102 ns | 1.00 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 2625 ns | 2625 ns | 1 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 2792 ns | 2875 ns | 0.97 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 2958 ns | 2916 ns | 1.01 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 2708 ns | 2625 ns | 1.03 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 158087.5 ns | 164236.5 ns | 0.96 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 321982541.5 ns | 286096229 ns | 1.13 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 339702500 ns | 339570541 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 314391875 ns | 321242167 ns | 0.98 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 275821875 ns | 271493208 ns | 1.02 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 7045338.5 ns | 7111512 ns | 0.99 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 1051877520.5 ns | 987492667 ns | 1.07 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 940277375 ns | 939040416 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 851686187.5 ns | 868433209 ns | 0.98 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 1169867167 ns | 1162204042 ns | 1.01 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 34049288.5 ns | 34040446 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 1358421562.5 ns | 1310851000.5 ns | 1.04 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 1687563625 ns | 1685402625 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 1641183208 ns | 1648347125 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 1312669833.5 ns | 1310788750 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1410229.5 ns | 1412625 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1412333.5 ns | 1412041.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1411500 ns | 1424625 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1413395.5 ns | 1408334 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 128095 ns | 128501 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5056000 ns | 5028875 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5011833.5 ns | 5030104 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5027604 ns | 5062042 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5018979.5 ns | 5014021 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 518546 ns | 597004.5 ns | 0.87 |
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) | 169177292 ns | 168008834 ns | 1.01 |
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) | 131651145.5 ns | 130299417 ns | 1.01 |
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) | 129556583 ns | 148283479 ns | 0.87 |
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) | 164279437.5 ns | 161948354 ns | 1.01 |
vgg16(32, 32, 3, 32)/forward/GPU/CUDA | 4876595.5 ns | 5052268 ns | 0.97 |
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) | 683143667 ns | 662817209 ns | 1.03 |
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) | 646211666 ns | 492884417 ns | 1.31 |
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) | 511752458 ns | 507367709 ns | 1.01 |
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) | 843952333 ns | 678320708 ns | 1.24 |
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA | 16179217 ns | 17294527 ns | 0.94 |
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 9035354 ns | 8884604 ns | 1.02 |
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 8674584 ns | 8801959 ns | 0.99 |
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 7869708.5 ns | 8221541.5 ns | 0.96 |
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 10183812.5 ns | 10127167 ns | 1.01 |
batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1610987 ns | 1611762 ns | 1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 36781834 ns | 36027125 ns | 1.02 |
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 36671271 ns | 36933063 ns | 0.99 |
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 33328542 ns | 34547750 ns | 0.96 |
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 38854917 ns | 38824854 ns | 1.00 |
batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6453307 ns | 6452267 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 47375 ns | 47375 ns | 1 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 47292 ns | 47250 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 47833 ns | 47542 ns | 1.01 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 47417 ns | 47333 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 19398 ns | 19020 ns | 1.02 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 50250 ns | 50312.5 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 50437.5 ns | 50500 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 50916 ns | 50958.5 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 50291 ns | 50333 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 162418.5 ns | 226580 ns | 0.72 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 7083 ns | 6542 ns | 1.08 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 7500 ns | 7187.5 ns | 1.04 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 8187 ns | 9083 ns | 0.90 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 7083.5 ns | 8625 ns | 0.82 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 74691 ns | 117383.5 ns | 0.64 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10083 ns | 9625 ns | 1.05 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10375 ns | 10208 ns | 1.02 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10375 ns | 10333.5 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10041 ns | 10209 ns | 0.98 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 439883 ns | 723908.5 ns | 0.61 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6459 ns | 6083 ns | 1.06 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 8333 ns | 8250 ns | 1.01 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 8417 ns | 9417 ns | 0.89 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5792 ns | 8375 ns | 0.69 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 81299 ns | 157024.5 ns | 0.52 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12958 ns | 13292 ns | 0.97 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 13417 ns | 13792 ns | 0.97 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13542 ns | 13708 ns | 0.99 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12958 ns | 12834 ns | 1.01 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 396122 ns | 618769 ns | 0.64 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 959 ns | 1042 ns | 0.92 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 1042 ns | 1042 ns | 1 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 1083 ns | 1042 ns | 1.04 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 1083 ns | 1083 ns | 1 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 32528 ns | 32863 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7834 ns | 7875 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8166 ns | 8000 ns | 1.02 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8375 ns | 8208 ns | 1.02 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8250 ns | 8250 ns | 1 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 193157 ns | 246953.5 ns | 0.78 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 23041 ns | 25062.5 ns | 0.92 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 23250 ns | 23291.5 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 23500 ns | 23542 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 23459 ns | 23250 ns | 1.01 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 18862 ns | 18661 ns | 1.01 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 52583 ns | 52625 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 52666 ns | 52833 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 52959 ns | 52875 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 52125 ns | 52333 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 224379 ns | 364018 ns | 0.62 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1397834 ns | 1403750 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1406042 ns | 1451354 ns | 0.97 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1408354.5 ns | 1407542 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1398208.5 ns | 1406458 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 196210 ns | 196760 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5042250 ns | 5023250 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5015250 ns | 5018687.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5019958 ns | 5042125 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5014375 ns | 5001750 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 538689.5 ns | 766930 ns | 0.70 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 3017625 ns | 3048708 ns | 0.99 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2071416 ns | 2082646 ns | 0.99 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2284167 ns | 2300125 ns | 0.99 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4852104.5 ns | 4855000 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 579661 ns | 583278 ns | 0.99 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 24727104 ns | 24263250 ns | 1.02 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 18832875.5 ns | 18905459 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 18936771 ns | 19193375 ns | 0.99 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 36502000 ns | 36575416 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3184962 ns | 3216229 ns | 0.99 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 34414084 ns | 34013563 ns | 1.01 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 28403834 ns | 28342229 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 28024292 ns | 28436750 ns | 0.99 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 41705396 ns | 43339875 ns | 0.96 |
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 143769208 ns | 144288959 ns | 1.00 |
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 141570125 ns | 142279583 ns | 1.00 |
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 124428458.5 ns | 126469000.5 ns | 0.98 |
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 174488250 ns | 168866000 ns | 1.03 |
batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22552202 ns | 22582893 ns | 1.00 |
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 955894625 ns | 1275599313 ns | 0.75 |
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 1172862062.5 ns | 1058487228.5 ns | 1.11 |
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 1204750000 ns | 712851209 ns | 1.69 |
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 668847750 ns | 668538250 ns | 1.00 |
batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 116933733 ns | 119108875 ns | 0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 75000 ns | 83125 ns | 0.90 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 87584 ns | 76208 ns | 1.15 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 77875 ns | 78125 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 72875 ns | 72729 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 192074 ns | 365097 ns | 0.53 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 274854.5 ns | 189959 ns | 1.45 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 255958 ns | 287792 ns | 0.89 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 286208 ns | 268875 ns | 1.06 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 251645.5 ns | 189583.5 ns | 1.33 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1094455.5 ns | 1559670.5 ns | 0.70 |
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 36252875 ns | 35476167 ns | 1.02 |
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 35435458 ns | 35447729.5 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 32241479 ns | 32304459 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 40996833 ns | 40935146 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5845359.5 ns | 5843273 ns | 1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 150751458 ns | 147875542 ns | 1.02 |
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 152701437.5 ns | 152751312.5 ns | 1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 135398042 ns | 139824437 ns | 0.97 |
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 287922042 ns | 287719375 ns | 1.00 |
batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 34877403 ns | 34882914 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 156698834 ns | 120880395.5 ns | 1.30 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 173722000 ns | 174358791 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 148128521 ns | 155429791 ns | 0.95 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 106406291 ns | 106966959 ns | 0.99 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5462755 ns | 5456342 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 521618645.5 ns | 470623375 ns | 1.11 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 465924792 ns | 466918000 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 436476979.5 ns | 456589562.5 ns | 0.96 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 749280333 ns | 742113834 ns | 1.01 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 32257322.5 ns | 32255425 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 691115542 ns | 706243291.5 ns | 0.98 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 655769083 ns | 652697541.5 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 570838750 ns | 591007625 ns | 0.97 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 850474417 ns | 851805375 ns | 1.00 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) | 1161250 ns | 1320583.5 ns | 0.88 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) | 968958 ns | 965875 ns | 1.00 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) | 978792 ns | 736687.5 ns | 1.33 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) | 2056959 ns | 1944666.5 ns | 1.06 |
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA | 569103 ns | 564187.5 ns | 1.01 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) | 2792459 ns | 2971708.5 ns | 0.94 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) | 2614042 ns | 2620334 ns | 1.00 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) | 2624792 ns | 2535604 ns | 1.04 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) | 3703458.5 ns | 3604083.5 ns | 1.03 |
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA | 1689865 ns | 1878347.5 ns | 0.90 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) | 6754020.5 ns | 6649958 ns | 1.02 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) | 6516208.5 ns | 6493042 ns | 1.00 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) | 6508229.5 ns | 6437479.5 ns | 1.01 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) | 4440583 ns | 4435750 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7250 ns | 7375 ns | 0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6042 ns | 6208 ns | 0.97 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6166 ns | 5375 ns | 1.15 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 9958 ns | 9916 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 25827 ns | 25400 ns | 1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 213020.5 ns | 213645.5 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 220875 ns | 221833 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 221500 ns | 221250 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 213792 ns | 205875 ns | 1.04 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 258574 ns | 293719.5 ns | 0.88 |
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) | 313286250 ns | 301604437.5 ns | 1.04 |
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) | 221788541 ns | 221356625 ns | 1.00 |
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) | 189977792 ns | 223278083.5 ns | 0.85 |
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) | 312995792 ns | 312163250 ns | 1.00 |
vgg16(32, 32, 3, 64)/forward/GPU/CUDA | 7678890.5 ns | 7672763 ns | 1.00 |
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) | 1089809375 ns | 1078062604.5 ns | 1.01 |
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) | 908046396 ns | 896268771 ns | 1.01 |
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) | 816079833 ns | 880668729 ns | 0.93 |
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) | 1175001292 ns | 1161143188 ns | 1.01 |
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA | 26534604 ns | 26517571 ns | 1.00 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5834 ns | 5500 ns | 1.06 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 7542 ns | 5750 ns | 1.31 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7291.5 ns | 9437.5 ns | 0.77 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5583 ns | 5875 ns | 0.95 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 152068 ns | 201555 ns | 0.75 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6958 ns | 7500 ns | 0.93 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7458 ns | 7458 ns | 1 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7667 ns | 7750 ns | 0.99 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6875 ns | 7041.5 ns | 0.98 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 628715 ns | 699933.5 ns | 0.90 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 458 ns | 500 ns | 0.92 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 542 ns | 500 ns | 1.08 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 542 ns | 583 ns | 0.93 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 500 ns | 1 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 24235 ns | 23724.5 ns | 1.02 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9417 ns | 9208 ns | 1.02 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9542 ns | 9625 ns | 0.99 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 11458 ns | 9604.5 ns | 1.19 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9167 ns | 9042 ns | 1.01 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 223814.5 ns | 234828.5 ns | 0.95 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 351875 ns | 351500 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 351250 ns | 350896 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 351770.5 ns | 354624.5 ns | 0.99 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 353687.5 ns | 351708 ns | 1.01 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA | 21385 ns | 20984 ns | 1.02 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 804000.5 ns | 775417 ns | 1.04 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 814312.5 ns | 824916 ns | 0.99 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 807812.5 ns | 830958 ns | 0.97 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 817708 ns | 823958 ns | 0.99 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA | 280926 ns | 306663 ns | 0.92 |
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 312542 ns | 338083 ns | 0.92 |
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 336646 ns | 341500 ns | 0.99 |
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 449478.5 ns | 443667 ns | 1.01 |
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 330645.5 ns | 325667 ns | 1.02 |
batchedmm(16, Bsize=32)/forward/GPU/CUDA | 18418 ns | 17821 ns | 1.03 |
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 689771 ns | 696042 ns | 0.99 |
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 738208 ns | 739416.5 ns | 1.00 |
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 1025667 ns | 1042874.5 ns | 0.98 |
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 695584 ns | 692645.5 ns | 1.00 |
batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 265152 ns | 273141.5 ns | 0.97 |
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 328792 ns | 358458.5 ns | 0.92 |
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 347104 ns | 349125 ns | 0.99 |
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 425250 ns | 431291.5 ns | 0.99 |
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 369625 ns | 370875 ns | 1.00 |
batchedmm(16, Bsize=128)/forward/GPU/CUDA | 22943 ns | 22357.5 ns | 1.03 |
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 747208.5 ns | 756625 ns | 0.99 |
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 744458 ns | 744208.5 ns | 1.00 |
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 1064646 ns | 1073250 ns | 0.99 |
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 816917 ns | 818125.5 ns | 1.00 |
batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 224782.5 ns | 221398.5 ns | 1.02 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 3542 ns | 3459 ns | 1.02 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 3458 ns | 3541 ns | 0.98 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 3792 ns | 3792 ns | 1 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 3500 ns | 3291 ns | 1.06 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 18400 ns | 17956 ns | 1.02 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 4167 ns | 4208 ns | 0.99 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 4250 ns | 4208 ns | 1.01 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 4416 ns | 4416 ns | 1 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 4167 ns | 4125 ns | 1.01 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 281571 ns | 275839.5 ns | 1.02 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 3583 ns | 3792 ns | 0.94 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4125 ns | 3375 ns | 1.22 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 5292 ns | 6750 ns | 0.78 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 3875 ns | 6625 ns | 0.58 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 209071.5 ns | 205448.5 ns | 1.02 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8500 ns | 8334 ns | 1.02 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8417 ns | 8459 ns | 1.00 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8750 ns | 8500 ns | 1.03 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8500 ns | 8541 ns | 1.00 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 1186240 ns | 1183984 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 204500 ns | 202625 ns | 1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 211125 ns | 210416 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 210208 ns | 209292 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 200375 ns | 200000 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 35365 ns | 34588 ns | 1.02 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 599417 ns | 603792 ns | 0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 671292 ns | 670625 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 623500 ns | 630958 ns | 0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 630291.5 ns | 631187.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 324289.5 ns | 352652 ns | 0.92 |
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 977208 ns | 967521 ns | 1.01 |
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 939375.5 ns | 927063 ns | 1.01 |
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 956812 ns | 964437.5 ns | 0.99 |
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 1308000 ns | 1281853.5 ns | 1.02 |
batchedmm(128, Bsize=128)/forward/GPU/CUDA | 208543 ns | 207244 ns | 1.01 |
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 4683396 ns | 4451771 ns | 1.05 |
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 4481249.5 ns | 4482750 ns | 1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 4296708 ns | 4474208 ns | 0.96 |
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 6279833 ns | 6201166 ns | 1.01 |
batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 939154.5 ns | 945549 ns | 0.99 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3375 ns | 3604.5 ns | 0.94 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3792 ns | 3167 ns | 1.20 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 4854 ns | 6792 ns | 0.71 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3500 ns | 3167 ns | 1.11 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 203673 ns | 233201 ns | 0.87 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7125 ns | 7500 ns | 0.95 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7375 ns | 7375 ns | 1 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7750 ns | 7291 ns | 1.06 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7291 ns | 7083 ns | 1.03 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 979866.5 ns | 1014881 ns | 0.97 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1607125 ns | 1602833.5 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1183708 ns | 1187916 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1369958 ns | 1364062 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2332917 ns | 2343729.5 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 216036 ns | 212955.5 ns | 1.01 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12340417 ns | 12334792 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9538979 ns | 9602042 ns | 0.99 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9272374.5 ns | 9404958 ns | 0.99 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 17927917 ns | 17966833 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 1958008 ns | 1949853 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17412145.5 ns | 17347084 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14312666.5 ns | 14365000 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14342249.5 ns | 14512666 ns | 0.99 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21050666.5 ns | 21005479.5 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 89417 ns | 89791 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 91250 ns | 91729.5 ns | 0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 94062 ns | 94291 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 90292 ns | 117416.5 ns | 0.77 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 125974 ns | 126285 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2048916.5 ns | 2023917 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1760437.5 ns | 2013416.5 ns | 0.87 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2028562.5 ns | 2058875 ns | 0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2017708 ns | 2027875 ns | 0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 976278.5 ns | 1031286 ns | 0.95 |
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 329084 ns | 346791.5 ns | 0.95 |
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 338542 ns | 343583.5 ns | 0.99 |
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 395833.5 ns | 412250 ns | 0.96 |
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 312417 ns | 306166 ns | 1.02 |
batchedmm(2, Bsize=4)/forward/GPU/CUDA | 15658 ns | 16010 ns | 0.98 |
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 700875 ns | 702291 ns | 1.00 |
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 723333 ns | 728979.5 ns | 0.99 |
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 1020750 ns | 1025458 ns | 1.00 |
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 647250 ns | 639875 ns | 1.01 |
batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 186416 ns | 193209 ns | 0.96 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7125 ns | 7292 ns | 0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 3833 ns | 6083 ns | 0.63 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6042 ns | 5334 ns | 1.13 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10000 ns | 10000 ns | 1 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 33540 ns | 33620 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 249125 ns | 220479.5 ns | 1.13 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 221583 ns | 231958 ns | 0.96 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 221958 ns | 232041 ns | 0.96 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 205958 ns | 220500 ns | 0.93 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 289999 ns | 311751 ns | 0.93 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3708 ns | 3708 ns | 1 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3708 ns | 3708 ns | 1 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3667 ns | 3709 ns | 0.99 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3667 ns | 3667 ns | 1 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 22116 ns | 22440 ns | 0.99 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14083 ns | 14500 ns | 0.97 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14458 ns | 14417 ns | 1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14459 ns | 14167 ns | 1.02 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14458 ns | 14291 ns | 1.01 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 451385.5 ns | 468658 ns | 0.96 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 94583.5 ns | 95166 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 96500 ns | 138021 ns | 0.70 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 100166 ns | 99167 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 99000 ns | 122458 ns | 0.81 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 125281.5 ns | 125691 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1944062.5 ns | 1931875 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1652500 ns | 1954979 ns | 0.85 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1923000 ns | 1946854 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1910333 ns | 1923729.5 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 915747 ns | 940251.5 ns | 0.97 |
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) | 858770.5 ns | 880500 ns | 0.98 |
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) | 809334 ns | 815125 ns | 0.99 |
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) | 1217666 ns | 1172292 ns | 1.04 |
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) | 959250 ns | 960167 ns | 1.00 |
lenet(28, 28, 1, 32)/forward/GPU/CUDA | 270078.5 ns | 270704 ns | 1.00 |
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) | 2748875 ns | 2803000 ns | 0.98 |
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) | 2464583 ns | 2526833 ns | 0.98 |
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) | 3353417 ns | 3361333 ns | 1.00 |
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) | 3392521 ns | 3405875 ns | 1.00 |
lenet(28, 28, 1, 32)/zygote/GPU/CUDA | 1543382 ns | 1569154 ns | 0.98 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 14917 ns | 15146 ns | 0.98 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 16021 ns | 18000 ns | 0.89 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 18625 ns | 21666 ns | 0.86 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 14875 ns | 18125 ns | 0.82 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 140956 ns | 141811.5 ns | 0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 254667 ns | 217083 ns | 1.17 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 258562.5 ns | 229375 ns | 1.13 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 219645.5 ns | 257396 ns | 0.85 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 255000 ns | 215833 ns | 1.18 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 640174 ns | 635765.5 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 221208 ns | 219750 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 222958 ns | 221500 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 223000 ns | 226021 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 219750 ns | 223937.5 ns | 0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 270240.5 ns | 270450 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 508375.5 ns | 509917 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 560375 ns | 557729 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 546708.5 ns | 549792 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 500104.5 ns | 555791 ns | 0.90 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1421050.5 ns | 1308245 ns | 1.09 |
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 310458 ns | 333479 ns | 0.93 |
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 335750 ns | 335541.5 ns | 1.00 |
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 417104 ns | 437333 ns | 0.95 |
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 319541 ns | 319417 ns | 1.00 |
batchedmm(16, Bsize=4)/forward/GPU/CUDA | 16359 ns | 16583 ns | 0.99 |
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 709833.5 ns | 715333 ns | 0.99 |
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 723000 ns | 730292 ns | 0.99 |
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 1018208 ns | 1025458.5 ns | 0.99 |
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 662917 ns | 655792 ns | 1.01 |
batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 196421.5 ns | 193313 ns | 1.02 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17167 ns | 17625 ns | 0.97 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 18375 ns | 17625 ns | 1.04 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 20250 ns | 20437.5 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 17417 ns | 18000 ns | 0.97 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 144230 ns | 144711.5 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 219312.5 ns | 216667 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 216750 ns | 224083 ns | 0.97 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 213584 ns | 226625 ns | 0.94 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 212250 ns | 223417 ns | 0.95 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 950742 ns | 903796 ns | 1.05 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 4541 ns | 4625 ns | 0.98 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6770.5 ns | 6750 ns | 1.00 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 6542 ns | 7438 ns | 0.88 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 4770.5 ns | 6625 ns | 0.72 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 248657 ns | 174159.5 ns | 1.43 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10500 ns | 10437.5 ns | 1.01 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9895.5 ns | 10750 ns | 0.92 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10979.5 ns | 10770.5 ns | 1.02 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10583 ns | 10833 ns | 0.98 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 1098275 ns | 1024421 ns | 1.07 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3583 ns | 3646 ns | 0.98 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3625 ns | 3334 ns | 1.09 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 5458.5 ns | 5625 ns | 0.97 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3666.5 ns | 3500 ns | 1.05 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 246374.5 ns | 231660 ns | 1.06 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7667 ns | 7708 ns | 0.99 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7917 ns | 7792 ns | 1.02 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7833 ns | 7625 ns | 1.03 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7458 ns | 7167 ns | 1.04 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 1103324 ns | 1037611.5 ns | 1.06 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 24115395.5 ns | 23838833 ns | 1.01 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 34596167 ns | 33990646 ns | 1.02 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 37656499.5 ns | 41585708 ns | 0.91 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 34879000 ns | 34896229 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1854064 ns | 1839186 ns | 1.01 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 187364104.5 ns | 184662833 ns | 1.01 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 159394792 ns | 159634000 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 147080250 ns | 151746084 ns | 0.97 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 416232000 ns | 415075875 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 16513835 ns | 16506413 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 436918167 ns | 427351833 ns | 1.02 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 253570271 ns | 251624521 ns | 1.01 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 232441063 ns | 233926312.5 ns | 0.99 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 487065917 ns | 484091542 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 184125 ns | 181666 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 185125 ns | 183416.5 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 185333 ns | 186125 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 183667 ns | 183834 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 228604.5 ns | 173529.5 ns | 1.32 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 609187.5 ns | 587541 ns | 1.04 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 589521 ns | 600458 ns | 0.98 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 615687.5 ns | 632375 ns | 0.97 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 588604 ns | 631354 ns | 0.93 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1082743 ns | 1005977 ns | 1.08 |
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 3843541 ns | 3816041.5 ns | 1.01 |
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 3628458.5 ns | 3637833 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 3481208.5 ns | 3539646 ns | 0.98 |
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 5352750 ns | 5351396 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/GPU/CUDA | 550108 ns | 554127 ns | 0.99 |
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 17957875 ns | 17372333 ns | 1.03 |
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 17252250 ns | 17218458.5 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 16584291 ns | 16979478.5 ns | 0.98 |
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 22130646 ns | 22177625 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2634462 ns | 2616933 ns | 1.01 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 458 ns | 583 ns | 0.79 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 542 ns | 542 ns | 1 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 542 ns | 542 ns | 1 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 459 ns | 459 ns | 1 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 32228 ns | 32036 ns | 1.01 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9479.5 ns | 9667 ns | 0.98 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9709 ns | 9750 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9750 ns | 10125 ns | 0.96 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9333 ns | 9291 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 263745 ns | 260858 ns | 1.01 |
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) | 582782250 ns | 506491042 ns | 1.15 |
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) | 428813333 ns | 428949104 ns | 1.00 |
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) | 434641416 ns | 474815000 ns | 0.92 |
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) | 674397896 ns | 671461979 ns | 1.00 |
vgg16(32, 32, 3, 128)/forward/GPU/CUDA | 12481497 ns | 12484614.5 ns | 1.00 |
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) | 2082933771 ns | 2043435104.5 ns | 1.02 |
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) | 1628241833 ns | 1631358667 ns | 1.00 |
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) | 1498637604 ns | 1546812271 ns | 0.97 |
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) | 2215333416.5 ns | 2216473375.5 ns | 1.00 |
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA | 49018510 ns | 49204869.5 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1609020.5 ns | 1642542 ns | 0.98 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1170854.5 ns | 1194625 ns | 0.98 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1390229 ns | 1380791 ns | 1.01 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2481499.5 ns | 2487084 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 215279 ns | 215546 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12779125 ns | 12711687.5 ns | 1.01 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9941250 ns | 9927625 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9668020.5 ns | 9788604.5 ns | 0.99 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18419750.5 ns | 18464437.5 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2038920 ns | 1995889.5 ns | 1.02 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17725542 ns | 17669166.5 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14672812.5 ns | 14709437.5 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14586209 ns | 14807645.5 ns | 0.99 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21417895.5 ns | 21465708 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 26209 ns | 26250 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 26291 ns | 26250 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 26834 ns | 26291 ns | 1.02 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 26250 ns | 26167 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 24018 ns | 23873 ns | 1.01 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 66750 ns | 66917 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 66958 ns | 67333 ns | 0.99 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 68250 ns | 67083 ns | 1.02 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 67084 ns | 66833 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 410142.5 ns | 382426 ns | 1.07 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 203375 ns | 203834 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 210291 ns | 209542 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 211042 ns | 209584 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 199875 ns | 199584 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 26776 ns | 26132 ns | 1.02 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 613708 ns | 613833.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 669042 ns | 636667 ns | 1.05 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 636312.5 ns | 671166.5 ns | 0.95 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 585917 ns | 628229.5 ns | 0.93 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 354860.5 ns | 308600 ns | 1.15 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 653500.5 ns | 671687.5 ns | 0.97 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 635459 ns | 645937.5 ns | 0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 647417 ns | 644791.5 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 593312.5 ns | 676334 ns | 0.88 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 131876.5 ns | 131667 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2293458 ns | 2241875 ns | 1.02 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1909916.5 ns | 2192250 ns | 0.87 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2243166.5 ns | 2297042 ns | 0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2246479.5 ns | 2246249.5 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1186439 ns | 1114838 ns | 1.06 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17000 ns | 16791 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17834 ns | 17500 ns | 1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 21666.5 ns | 20958 ns | 1.03 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 17750 ns | 16770.5 ns | 1.06 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 146117 ns | 143001 ns | 1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 227250 ns | 230375 ns | 0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 230916 ns | 231791.5 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 259958 ns | 266208 ns | 0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 219458 ns | 260728.5 ns | 0.84 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1054898 ns | 959584 ns | 1.10 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 459 ns | 500 ns | 0.92 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 542 ns | 542 ns | 1 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 583 ns | 542 ns | 1.08 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 500 ns | 1 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 23885.5 ns | 23163 ns | 1.03 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9791 ns | 9604.5 ns | 1.02 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 10583 ns | 10292 ns | 1.03 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 10000 ns | 10625 ns | 0.94 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9979.5 ns | 9584 ns | 1.04 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 262185 ns | 255611 ns | 1.03 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5625 ns | 5416.5 ns | 1.04 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 7812.5 ns | 5750 ns | 1.36 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7229.5 ns | 9458 ns | 0.76 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5458.5 ns | 5708 ns | 0.96 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 233872.5 ns | 219432 ns | 1.07 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7209 ns | 7833 ns | 0.92 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7667 ns | 7750 ns | 0.99 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7791 ns | 7709 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6917 ns | 7000 ns | 0.99 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 810262 ns | 764584 ns | 1.06 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 2209 ns | 1959 ns | 1.13 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 2208 ns | 2083 ns | 1.06 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 2292 ns | 2417 ns | 0.95 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 2292 ns | 2208 ns | 1.04 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 18153 ns | 17893 ns | 1.01 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 6375 ns | 6875 ns | 0.93 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 6833 ns | 6542 ns | 1.04 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 6792 ns | 6583 ns | 1.03 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 6333.5 ns | 6291 ns | 1.01 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 336821.5 ns | 320459 ns | 1.05 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 749354.5 ns | 747709 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 748792 ns | 749833 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 749500 ns | 754999.5 ns | 0.99 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 749209 ns | 749375 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 21981 ns | 21357 ns | 1.03 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 802958 ns | 774854 ns | 1.04 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 796458 ns | 792687.5 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 791875 ns | 817042 ns | 0.97 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 792875 ns | 811166 ns | 0.98 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 299725 ns | 295013.5 ns | 1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7208 ns | 7334 ns | 0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6042 ns | 6000 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6042 ns | 5208.5 ns | 1.16 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10042 ns | 10166 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 33633 ns | 33519 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 228896 ns | 219666 ns | 1.04 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 239875 ns | 268125 ns | 0.89 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 267396 ns | 252000.5 ns | 1.06 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 229000 ns | 213562 ns | 1.07 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 364682.5 ns | 354278 ns | 1.03 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 10938 ns | 10875 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 13000 ns | 11833 ns | 1.10 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 12521 ns | 12770.5 ns | 0.98 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 10666.5 ns | 12000 ns | 0.89 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 250655.5 ns | 238132.5 ns | 1.05 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 25167 ns | 24708 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 25000 ns | 24584 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 25041.5 ns | 25292 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 24541 ns | 24500 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 1138466 ns | 1094067.5 ns | 1.04 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 107427417 ns | 106709834 ns | 1.01 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 116984750 ns | 116906583.5 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 121269458 ns | 127036729 ns | 0.95 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 117446750 ns | 117807000 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 2641904 ns | 2657653 ns | 0.99 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 395550709 ns | 392558792 ns | 1.01 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 362937167 ns | 365774917 ns | 0.99 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 425153020.5 ns | 431860937.5 ns | 0.98 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 489784167 ns | 483379250 ns | 1.01 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 15232680 ns | 15196086 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 768172250.5 ns | 758564875.5 ns | 1.01 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 753476708 ns | 761412666 ns | 0.99 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 745095458.5 ns | 748747542 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 764672166.5 ns | 765232583 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7375 ns | 6625 ns | 1.11 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 8125 ns | 7334 ns | 1.11 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8000 ns | 9041.5 ns | 0.88 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7208 ns | 8250 ns | 0.87 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 240084.5 ns | 231038.5 ns | 1.04 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14292 ns | 14625 ns | 0.98 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14833 ns | 14750 ns | 1.01 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14958 ns | 14292 ns | 1.05 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14583 ns | 14542 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 1094068 ns | 1043294.5 ns | 1.05 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 7833 ns | 5875 ns | 1.33 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 8208 ns | 7959 ns | 1.03 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 8375 ns | 9167 ns | 0.91 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6208 ns | 6333 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 238555 ns | 228571 ns | 1.04 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12562.5 ns | 12791 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12833 ns | 13167 ns | 0.97 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13125 ns | 13375 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12833 ns | 12333 ns | 1.04 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 800161.5 ns | 779066.5 ns | 1.03 |
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 328895.5 ns | 347625 ns | 0.95 |
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 340188 ns | 342625 ns | 0.99 |
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 399291.5 ns | 416812 ns | 0.96 |
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 311167 ns | 307083 ns | 1.01 |
batchedmm(2, Bsize=128)/forward/GPU/CUDA | 17007 ns | 17023 ns | 1.00 |
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 704292 ns | 710208.5 ns | 0.99 |
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 729000 ns | 732125 ns | 1.00 |
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 1021958 ns | 1032542 ns | 0.99 |
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 661417 ns | 653979.5 ns | 1.01 |
batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 202290 ns | 200196.5 ns | 1.01 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 334 ns | 0.87 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 333 ns | 333 ns | 1 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 23607 ns | 23569 ns | 1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6333 ns | 6375 ns | 0.99 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6625 ns | 6584 ns | 1.01 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6667 ns | 6834 ns | 0.98 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6708 ns | 6042 ns | 1.11 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 244508 ns | 241926 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 5791 ns | 5708 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 5833 ns | 5834 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 5792 ns | 5875 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 5792 ns | 5708 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 24898 ns | 24556.5 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 21875 ns | 21562.5 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 21917 ns | 22000 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 21541.5 ns | 21709 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 21500 ns | 21167 ns | 1.02 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 268220 ns | 265433.5 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 144708 ns | 144917 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 146625 ns | 191292 ns | 0.77 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 151812.5 ns | 149333 ns | 1.02 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 146000 ns | 149250 ns | 0.98 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 167349 ns | 167659 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1353208 ns | 1319292 ns | 1.03 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1315875 ns | 1331416 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1324083 ns | 1362958 ns | 0.97 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1317708 ns | 1326125 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1352621 ns | 1343729.5 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 22938 ns | 22250 ns | 1.03 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 24916 ns | 23791 ns | 1.05 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 27167 ns | 25875 ns | 1.05 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 22167 ns | 23666.5 ns | 0.94 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 290263.5 ns | 286115 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 125979 ns | 146125 ns | 0.86 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 121875 ns | 118500 ns | 1.03 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 136291 ns | 129833 ns | 1.05 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 163666.5 ns | 175792 ns | 0.93 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1479598 ns | 1461317 ns | 1.01 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 291 ns | 292 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 333 ns | 292 ns | 1.14 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 23481 ns | 23352 ns | 1.01 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6416 ns | 6334 ns | 1.01 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6709 ns | 6459 ns | 1.04 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6792 ns | 6709 ns | 1.01 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6417 ns | 6125 ns | 1.05 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 260281 ns | 258095.5 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 5354.5 ns | 4625 ns | 1.16 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 5104.5 ns | 4125 ns | 1.24 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6833 ns | 7625 ns | 0.90 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4791 ns | 4895.5 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 257811.5 ns | 256357.5 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10104.5 ns | 9959 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10291 ns | 10125 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10500 ns | 10333 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10291 ns | 10333 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 1365445.5 ns | 1358318.5 ns | 1.01 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1542 ns | 1625 ns | 0.95 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1584 ns | 1584 ns | 1 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1584 ns | 1625 ns | 0.97 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1625 ns | 1583 ns | 1.03 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 23204 ns | 23389 ns | 0.99 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 5666 ns | 5667 ns | 1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 5625 ns | 5875 ns | 0.96 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 5917 ns | 6000 ns | 0.99 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 5666 ns | 5625 ns | 1.01 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 278434.5 ns | 275350.5 ns | 1.01 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 6794083.5 ns | 6780125 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 6437416.5 ns | 6371125 ns | 1.01 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 6537625 ns | 6531396 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 7672770.5 ns | 7625875 ns | 1.01 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 215934 ns | 214804 ns | 1.01 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 24128750 ns | 24015354 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 21236687 ns | 21285667 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 21047792 ns | 21085125 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 29739708 ns | 29769250 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2108232.5 ns | 2112477.5 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 37600792 ns | 37264541.5 ns | 1.01 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 45471354 ns | 45538167 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 45790709 ns | 45665125 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 37882499.5 ns | 38235958 ns | 0.99 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6312.5 ns | 6208 ns | 1.02 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 7250 ns | 5958.5 ns | 1.22 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7208 ns | 8750 ns | 0.82 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6334 ns | 7500 ns | 0.84 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 239307 ns | 236550 ns | 1.01 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8562.5 ns | 8750 ns | 0.98 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8292 ns | 8375 ns | 0.99 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8500 ns | 8500 ns | 1 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8250 ns | 8958 ns | 0.92 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 1072829 ns | 1063848.5 ns | 1.01 |
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) | 1518604.5 ns | 1554084 ns | 0.98 |
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) | 1254999.5 ns | 1262375 ns | 0.99 |
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) | 1638249.5 ns | 1631958.5 ns | 1.00 |
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) | 2161791.5 ns | 2152375 ns | 1.00 |
lenet(28, 28, 1, 128)/forward/GPU/CUDA | 282408 ns | 277465 ns | 1.02 |
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) | 7954709 ns | 7881667 ns | 1.01 |
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) | 6616166.5 ns | 6612667 ns | 1.00 |
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) | 7188250 ns | 7276167 ns | 0.99 |
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) | 10050292 ns | 10468062.5 ns | 0.96 |
lenet(28, 28, 1, 128)/zygote/GPU/CUDA | 1885536.5 ns | 1876576 ns | 1.00 |
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 322292 ns | 346375 ns | 0.93 |
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 349229 ns | 348937.5 ns | 1.00 |
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 424833 ns | 423416.5 ns | 1.00 |
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 342791 ns | 336687 ns | 1.02 |
batchedmm(128, Bsize=4)/forward/GPU/CUDA | 42363 ns | 46390 ns | 0.91 |
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 738167 ns | 735208 ns | 1.00 |
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 775562.5 ns | 782458 ns | 0.99 |
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 1073833 ns | 1081666.5 ns | 0.99 |
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 771542 ns | 758458.5 ns | 1.02 |
batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 308119.5 ns | 311011.5 ns | 0.99 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 396833 ns | 397375 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 288375 ns | 288250 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 288042 ns | 212583 ns | 1.35 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 750895.5 ns | 754104.5 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 44623 ns | 44494 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 644479.5 ns | 675959 ns | 0.95 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 531167 ns | 532333 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 529666 ns | 474000 ns | 1.12 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 972833 ns | 973417 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 191874.5 ns | 189847 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 642958.5 ns | 599375 ns | 1.07 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 644291.5 ns | 650333 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 648063 ns | 660375 ns | 0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 659792 ns | 655833.5 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 132083 ns | 132321 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2529834 ns | 2469395.5 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2408458 ns | 2363959 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2454875 ns | 2519875.5 ns | 0.97 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2469208 ns | 2465916 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1298049.5 ns | 1345989 ns | 0.96 |
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 332375 ns | 345583 ns | 0.96 |
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 341187.5 ns | 342834 ns | 1.00 |
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 399834 ns | 416375 ns | 0.96 |
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 309083 ns | 306979.5 ns | 1.01 |
batchedmm(2, Bsize=32)/forward/GPU/CUDA | 15953 ns | 16330 ns | 0.98 |
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 702167 ns | 703104 ns | 1.00 |
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 725709 ns | 729708 ns | 0.99 |
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 1019542 ns | 1026937.5 ns | 0.99 |
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 651145.5 ns | 645959 ns | 1.01 |
batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 200850 ns | 199885.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
1461167 ns |
1460542 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
1498584 ns |
1500583 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
1503292 ns |
1491791 ns |
1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
1438792 ns |
1441917 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
41257 ns |
41671 ns |
0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
5149000 ns |
5133500 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
4947833 ns |
5293250 ns |
0.93 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
5288437.5 ns |
5309521 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
4981458.5 ns |
4977042 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
200973.5 ns |
197710 ns |
1.02 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) |
3708 ns |
3708 ns |
1 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) |
3708 ns |
3708 ns |
1 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) |
3708 ns |
3709 ns |
1.00 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) |
3667 ns |
3666 ns |
1.00 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA |
33162 ns |
33362 ns |
0.99 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) |
14958 ns |
15125 ns |
0.99 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) |
15250 ns |
15500 ns |
0.98 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) |
15417 ns |
15125 ns |
1.02 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) |
15167 ns |
15083 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA |
382563 ns |
381216.5 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
71500 ns |
71375 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
71250 ns |
71208 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
70958 ns |
71583 ns |
0.99 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
71209 ns |
71208 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA |
113882 ns |
113946.5 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
329166 ns |
319833 ns |
1.03 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
318166 ns |
319208 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
318541 ns |
327125 ns |
0.97 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
318917 ns |
318375 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA |
197593 ns |
195156 ns |
1.01 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
958 ns |
959 ns |
1.00 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
1042 ns |
1042 ns |
1 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
1083 ns |
1083 ns |
1 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
1042 ns |
1000 ns |
1.04 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA |
24063 ns |
23764 ns |
1.01 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
8270.5 ns |
8084 ns |
1.02 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
8542 ns |
8542 ns |
1 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
8459 ns |
8416 ns |
1.01 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
8166 ns |
7833.5 ns |
1.04 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA |
265383 ns |
263039 ns |
1.01 |
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) |
454542 ns |
472416 ns |
0.96 |
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) |
477021 ns |
468125 ns |
1.02 |
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) |
549979 ns |
549250 ns |
1.00 |
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) |
490708 ns |
550333 ns |
0.89 |
batchedmm(128, Bsize=32)/forward/GPU/CUDA |
129528 ns |
128804.5 ns |
1.01 |
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) |
1405458 ns |
1375292 ns |
1.02 |
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) |
1371583 ns |
1372208 ns |
1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) |
1604208 ns |
1633459 ns |
0.98 |
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) |
1364083.5 ns |
1580500 ns |
0.86 |
batchedmm(128, Bsize=32)/zygote/GPU/CUDA |
276117.5 ns |
274739 ns |
1.01 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
333 ns |
416 ns |
0.80 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
417 ns |
416 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
375 ns |
375 ns |
1 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
292 ns |
292 ns |
1 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
32407 ns |
31574 ns |
1.03 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
6458 ns |
6458 ns |
1 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
6542 ns |
6875 ns |
0.95 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
6500 ns |
6708 ns |
0.97 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
6250 ns |
6000 ns |
1.04 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
267504 ns |
261869 ns |
1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
1723229 ns |
1727625 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
1725500.5 ns |
1783958 ns |
0.97 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
1723708 ns |
1730916 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
1733208 ns |
1729333 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
169027 ns |
168455 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
4392958 ns |
4352625 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
4360166 ns |
4372937.5 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
4374125 ns |
4412458 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
4352958.5 ns |
4358042 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
1185931 ns |
1234725 ns |
0.96 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
6833 ns |
6709 ns |
1.02 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
6500 ns |
6584 ns |
0.99 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
7167 ns |
7417 ns |
0.97 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
8874.5 ns |
6542 ns |
1.36 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA |
21039.5 ns |
19619.5 ns |
1.07 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
32958 ns |
51083 ns |
0.65 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
33166 ns |
35625 ns |
0.93 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
72708 ns |
49875 ns |
1.46 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
50854 ns |
70208 ns |
0.72 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA |
212280.5 ns |
211156 ns |
1.01 |
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) |
337125 ns |
354291 ns |
0.95 |
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) |
348917 ns |
347584 ns |
1.00 |
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) |
426625 ns |
432708 ns |
0.99 |
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) |
321166.5 ns |
319521.5 ns |
1.01 |
batchedmm(2, Bsize=512)/forward/GPU/CUDA |
18442 ns |
18053 ns |
1.02 |
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) |
715708 ns |
719104 ns |
1.00 |
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) |
725000 ns |
735979 ns |
0.99 |
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) |
1032812.5 ns |
1039063 ns |
0.99 |
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) |
675667 ns |
672750 ns |
1.00 |
batchedmm(2, Bsize=512)/zygote/GPU/CUDA |
346766 ns |
343671.5 ns |
1.01 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
75416 ns |
75417 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
73958 ns |
75333 ns |
0.98 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
75083 ns |
75708 ns |
0.99 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
75375 ns |
74709 ns |
1.01 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA |
47399 ns |
46983 ns |
1.01 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
335875 ns |
324417 ns |
1.04 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
324875 ns |
327000 ns |
0.99 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
325250 ns |
334917 ns |
0.97 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
325417 ns |
324083 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA |
213441 ns |
207721.5 ns |
1.03 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
1478500 ns |
1486334 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
1526541 ns |
1527500 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
1529042 ns |
1519000 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
1464375 ns |
1466541 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
52565 ns |
51914 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
5122208.5 ns |
5119333.5 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
5254437 ns |
5300396 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
5278208 ns |
5303708 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
4984104 ns |
4989375 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
205098 ns |
201413 ns |
1.02 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
28125 ns |
28167 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
28209 ns |
28166 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
28250 ns |
28333 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
28208 ns |
28208 ns |
1 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA |
23917.5 ns |
24393 ns |
0.98 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
66333 ns |
66542 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
66667 ns |
66292 ns |
1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
66500 ns |
66542 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
66541 ns |
66584 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA |
523038 ns |
530998 ns |
0.99 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) |
1379812.5 ns |
1493250 ns |
0.92 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) |
1132708 ns |
1120167 ns |
1.01 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) |
1142000 ns |
947625 ns |
1.21 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) |
2125625 ns |
2256500 ns |
0.94 |
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA |
571358 ns |
570331 ns |
1.00 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) |
2998291 ns |
3075542 ns |
0.97 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) |
2137354 ns |
2732479 ns |
0.78 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) |
2744167 ns |
2643125 ns |
1.04 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) |
3802166 ns |
3814770.5 ns |
1.00 |
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA |
2054250 ns |
2010818 ns |
1.02 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) |
8949667 ns |
8738917 ns |
1.02 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) |
8798333 ns |
8777854.5 ns |
1.00 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) |
8779583 ns |
8781417 ns |
1.00 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) |
6359625 ns |
6360687.5 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
80521 ns |
81146 ns |
0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
81645.5 ns |
81708.5 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
85979 ns |
83708 ns |
1.03 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
82458 ns |
87687.5 ns |
0.94 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
191955.5 ns |
192383.5 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2047562.5 ns |
2016791.5 ns |
1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1749333 ns |
2012708 ns |
0.87 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2018416.5 ns |
2041312 ns |
0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2004021 ns |
2015208 ns |
0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
799143 ns |
798885.5 ns |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
Benchmark Results (ASV)
Benchmark Plots
A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
6537129 to 24dc9ec
So as of last night, 1.11 support is essentially in. Can we test and see if it resolves this?
Ah nice, I will trigger it.