fix: LV/Octavian moved to optional deps #986
Benchmark Results (ASV)
Benchmark Plots: A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
Lux Benchmarks
Benchmark suite | Current: 4a70e43 | Previous: 3d1ff6c | Ratio |
---|---|---|---|
Dense(512 => 512, identity)(512 x 128)/forward/CPU/2 thread(s) | 71583.5 ns | 412333 ns | 0.17 |
Dense(512 => 512, identity)(512 x 128)/forward/CPU/4 thread(s) | 72166.5 ns | 322708 ns | 0.22 |
Dense(512 => 512, identity)(512 x 128)/forward/CPU/8 thread(s) | 395708.5 ns | 322354.5 ns | 1.23 |
Dense(512 => 512, identity)(512 x 128)/forward/CPU/1 thread(s) | 71833 ns | 739667 ns | 0.09711532351720437 |
Dense(512 => 512, identity)(512 x 128)/forward/GPU/CUDA | 43652 ns | 43934 ns | 0.99 |
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/2 thread(s) | 323625 ns | 605084 ns | 0.53 |
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/4 thread(s) | 314208 ns | 511813 ns | 0.61 |
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/8 thread(s) | 1211937.5 ns | 476187.5 ns | 2.55 |
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/1 thread(s) | 283208 ns | 2280042 ns | 0.12 |
Dense(512 => 512, identity)(512 x 128)/zygote/GPU/CUDA | 190656 ns | 191965 ns | 0.99 |
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/2 thread(s) | 423458 ns | 720583.5 ns | 0.59 |
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/4 thread(s) | 364666.5 ns | 629375 ns | 0.58 |
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/8 thread(s) | 1383729.5 ns | 593479 ns | 2.33 |
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/1 thread(s) | 332792 ns | 2247208 ns | 0.15 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1537208 ns | 1518562 ns | 1.01 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1190312.5 ns | 1187166.5 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1525458.5 ns | 1387229 ns | 1.10 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2464874.5 ns | 2947959 ns | 0.84 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 210700.5 ns | 211504 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12304521 ns | 12301292 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9594500 ns | 9560875.5 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9248624.5 ns | 9311271.5 ns | 0.99 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18023645.5 ns | 18616125 ns | 0.97 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 1925425 ns | 1926828 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17287250 ns | 17354500 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14383374.5 ns | 14318812.5 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14546125 ns | 14334708 ns | 1.01 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21091917 ns | 21859292 ns | 0.96 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 120906083.5 ns | 121057729 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 174242292 ns | 174314729 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 116885562.5 ns | 147379166 ns | 0.79 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 106585896 ns | 447559833 ns | 0.24 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5501639 ns | 5496733 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 582240708 ns | 595612958.5 ns | 0.98 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 535892917 ns | 542499292 ns | 0.99 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 827437521 ns | 446168125 ns | 1.85 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 627842708 ns | 1630779417 ns | 0.38 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 35142999.5 ns | 35003993.5 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 698660021 ns | 655003333.5 ns | 1.07 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 670647875 ns | 677053333 ns | 0.99 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 1332769646 ns | 584185208 ns | 2.28 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 750190854 ns | 1732551062.5 ns | 0.43 |
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) | 883125 ns | 880584 ns | 1.00 |
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) | 821770.5 ns | 822625 ns | 1.00 |
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) | 3800041 ns | 1226625 ns | 3.10 |
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) | 959750 ns | 782750 ns | 1.23 |
lenet(28, 28, 1, 32)/forward/GPU/CUDA | 278308.5 ns | 270475 ns | 1.03 |
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) | 2682687.5 ns | 2740959 ns | 0.98 |
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) | 2405250 ns | 2494167 ns | 0.96 |
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) | 12591458 ns | 3327979 ns | 3.78 |
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) | 3249958 ns | 3134292 ns | 1.04 |
lenet(28, 28, 1, 32)/zygote/GPU/CUDA | 1085615.5 ns | 1067170 ns | 1.02 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 6696708 ns | 2264271 ns | 2.96 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 6369250 ns | 1552417 ns | 4.10 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 6525708 ns | 1753479 ns | 3.72 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 7582250 ns | 4348083 ns | 1.74 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 212828 ns | 214769 ns | 0.99 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 24130125 ns | 20483167 ns | 1.18 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 21843875 ns | 17691916 ns | 1.23 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 21547229.5 ns | 17963833 ns | 1.20 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 29745541 ns | 26775375 ns | 1.11 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 1991939 ns | 1991226 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 37796250 ns | 45016687.5 ns | 0.84 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 45818104.5 ns | 42002229.5 ns | 1.09 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 45523667 ns | 41336854.5 ns | 1.10 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 49327792 ns | 47744959 ns | 1.03 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 13398896 ns | 4319020.5 ns | 3.10 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 12466375 ns | 2876959 ns | 4.33 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 12614041 ns | 3010167 ns | 4.19 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 15229500 ns | 8658375 ns | 1.76 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 512375.5 ns | 514332 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 47580146 ns | 40234750 ns | 1.18 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 41771000 ns | 34767583 ns | 1.20 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 41202146 ns | 33924250 ns | 1.21 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 58477750.5 ns | 53719958 ns | 1.09 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3012482 ns | 2979961.5 ns | 1.01 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 75212708 ns | 89992458 ns | 0.84 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 91894021 ns | 84426916.5 ns | 1.09 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 264774875 ns | 82809646 ns | 3.20 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 99021084 ns | 96502584 ns | 1.03 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 287455917 ns | 142457125 ns | 2.02 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 340245250 ns | 186377999.5 ns | 1.83 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 285087458 ns | 160522958 ns | 1.78 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 273053500 ns | 489638250 ns | 0.56 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 7064855 ns | 7101620 ns | 0.99 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 976563875 ns | 877579500 ns | 1.11 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 877101209 ns | 810323667 ns | 1.08 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 1209457042 ns | 714880166.5 ns | 1.69 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 1118578125 ns | 2042862020.5 ns | 0.55 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 34000634 ns | 34011046 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 1377756645.5 ns | 1671563875 ns | 0.82 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 1698830042 ns | 1561654708 ns | 1.09 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 2330234937.5 ns | 1478668938 ns | 1.58 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 1670119292 ns | 2558813125 ns | 0.65 |
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) | 1521208.5 ns | 1545417 ns | 0.98 |
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) | 1266270.5 ns | 1269125 ns | 1.00 |
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) | 5996750 ns | 1641375 ns | 3.65 |
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) | 2141167 ns | 2465854.5 ns | 0.87 |
lenet(28, 28, 1, 128)/forward/GPU/CUDA | 268811.5 ns | 268247 ns | 1.00 |
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) | 7507875 ns | 7879417 ns | 0.95 |
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) | 6589083 ns | 6568854.5 ns | 1.00 |
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) | 25009416 ns | 7162916 ns | 3.49 |
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) | 10432021 ns | 11708979 ns | 0.89 |
lenet(28, 28, 1, 128)/zygote/GPU/CUDA | 1101260 ns | 1072114.5 ns | 1.03 |
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) | 191129208 ns | 186008021 ns | 1.03 |
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) | 144736396 ns | 145478792 ns | 0.99 |
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) | 273492000 ns | 128424688 ns | 2.13 |
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) | 179461208 ns | 452715625 ns | 0.40 |
vgg16(32, 32, 3, 32)/forward/GPU/CUDA | 4846488 ns | 4848333.5 ns | 1.00 |
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) | 629305584 ns | 641184291 ns | 0.98 |
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) | 509975583 ns | 524088958 ns | 0.97 |
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) | 1027968333 ns | 537727000 ns | 1.91 |
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) | 806872541 ns | 1403114875 ns | 0.58 |
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA | 17033903 ns | 18681328 ns | 0.91 |
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) | 1088375 ns | 1095875 ns | 0.99 |
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) | 971208 ns | 967604 ns | 1.00 |
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) | 4519584 ns | 1353791 ns | 3.34 |
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) | 1304542 ns | 1322541 ns | 0.99 |
lenet(28, 28, 1, 64)/forward/GPU/CUDA | 269583 ns | 270349 ns | 1.00 |
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) | 4486458 ns | 6033416.5 ns | 0.74 |
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) | 3825437.5 ns | 4668271 ns | 0.82 |
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) | 17153291 ns | 4931041 ns | 3.48 |
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) | 5684812.5 ns | 6027000 ns | 0.94 |
lenet(28, 28, 1, 64)/zygote/GPU/CUDA | 1116515 ns | 1111457 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 23823916.5 ns | 23769000 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 33872292 ns | 34212937.5 ns | 0.99 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 39748625 ns | 37101833 ns | 1.07 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 35386792 ns | 132610375 ns | 0.27 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1834153 ns | 1831215 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 185233521 ns | 184677875 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 158529229 ns | 159062583 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 294665437.5 ns | 144384604 ns | 2.04 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 385181292 ns | 534807250 ns | 0.72 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 16520075 ns | 16477829 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 294759792 ns | 297162354.5 ns | 0.99 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 243985334 ns | 243897583 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 685013666 ns | 298830750.5 ns | 2.29 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 435921041 ns | 713110625 ns | 0.61 |
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) | 763398791 ns | 658030208 ns | 1.16 |
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) | 493032167 ns | 432501687.5 ns | 1.14 |
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) | 626385583 ns | 400225625 ns | 1.57 |
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) | 868315166 ns | 1771604646 ns | 0.49 |
vgg16(32, 32, 3, 128)/forward/GPU/CUDA | 12483852 ns | 12483387 ns | 1.00 |
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) | 1867913687.5 ns | 1887633520.5 ns | 0.99 |
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) | 1570444917 ns | 1637268417 ns | 0.96 |
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) | 2756005292 ns | 1504383479 ns | 1.83 |
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) | 2065837270.5 ns | 5051328625 ns | 0.41 |
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA | 49723405.5 ns | 49779380 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 3072750 ns | 3063562.5 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2100791.5 ns | 2098125.5 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2525375 ns | 2292583 ns | 1.10 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4785375 ns | 6036250 ns | 0.79 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 581725.5 ns | 581533.5 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 25396541.5 ns | 25421500 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 19829062.5 ns | 20387000 ns | 0.97 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 18858041 ns | 19188625 ns | 0.98 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 36960792 ns | 39410312.5 ns | 0.94 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 2983801 ns | 2998929 ns | 0.99 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 35295146 ns | 35068167 ns | 1.01 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 30170458 ns | 28412292 ns | 1.06 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 171213125 ns | 30184750 ns | 5.67 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 42761479 ns | 45702312 ns | 0.94 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1657000 ns | 1657041.5 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1203792 ns | 1194958 ns | 1.01 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1571083 ns | 1386854 ns | 1.13 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2524687.5 ns | 3047792 ns | 0.83 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 217203 ns | 216764 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12721792 ns | 12728917 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9957625 ns | 9975625 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9616250 ns | 9685688 ns | 0.99 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18462187 ns | 19011166.5 ns | 0.97 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 1943561 ns | 1948689 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17680916.5 ns | 17689854 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14740396 ns | 14742625 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14541458 ns | 14638458.5 ns | 0.99 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21459125 ns | 22191709 ns | 0.97 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 23932750 ns | 23557167 ns | 1.02 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 34347416.5 ns | 34461708 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 39627542 ns | 37530375 ns | 1.06 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 35624917 ns | 132666521 ns | 0.27 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1852557 ns | 1832030 ns | 1.01 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 302609458 ns | 189557666.5 ns | 1.60 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 228874000 ns | 237037020.5 ns | 0.97 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 301221333 ns | 197049458.5 ns | 1.53 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 392048250 ns | 727023166 ns | 0.54 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 13878343.5 ns | 13923369 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 299398562 ns | 302073604 ns | 0.99 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 251611583 ns | 250383541.5 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 712654166.5 ns | 306286250 ns | 2.33 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 440939000 ns | 717294500 ns | 0.61 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) | 2420583 ns | 1917792 ns | 1.26 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) | 2423229 ns | 1580771 ns | 1.53 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) | 6731646 ns | 1574458 ns | 4.28 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) | 2424167 ns | 2652249.5 ns | 0.91 |
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA | 571316.5 ns | 575627 ns | 0.99 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) | 6543187.5 ns | 6156375 ns | 1.06 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) | 6525875 ns | 5936542 ns | 1.10 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) | 17755209 ns | 5926084 ns | 3.00 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) | 6538145.5 ns | 9429521 ns | 0.69 |
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA | 1380926 ns | 1376631 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) | 17530271 ns | 18780958 ns | 0.93 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) | 17532125 ns | 19122375 ns | 0.92 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) | 36016542 ns | 19117250 ns | 1.88 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) | 14110479.5 ns | 18883917 ns | 0.75 |
Dense(512 => 512, relu)(512 x 128)/forward/CPU/2 thread(s) | 71625 ns | 70021 ns | 1.02 |
Dense(512 => 512, relu)(512 x 128)/forward/CPU/4 thread(s) | 69625 ns | 68417 ns | 1.02 |
Dense(512 => 512, relu)(512 x 128)/forward/CPU/8 thread(s) | 1066333 ns | 70500 ns | 15.13 |
Dense(512 => 512, relu)(512 x 128)/forward/CPU/1 thread(s) | 69646.5 ns | 727312.5 ns | 0.09575870069605569 |
Dense(512 => 512, relu)(512 x 128)/forward/GPU/CUDA | 47610 ns | 47914 ns | 0.99 |
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/2 thread(s) | 322416.5 ns | 355562.5 ns | 0.91 |
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/4 thread(s) | 296458 ns | 325250 ns | 0.91 |
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/8 thread(s) | 1857542 ns | 326687.5 ns | 5.69 |
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/1 thread(s) | 281208.5 ns | 2205479 ns | 0.13 |
Dense(512 => 512, relu)(512 x 128)/zygote/GPU/CUDA | 211302 ns | 213681 ns | 0.99 |
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/2 thread(s) | 443542 ns | 393084 ns | 1.13 |
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/4 thread(s) | 448479.5 ns | 450084 ns | 1.00 |
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/8 thread(s) | 1860041 ns | 444728.5 ns | 4.18 |
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/1 thread(s) | 377895.5 ns | 2229333 ns | 0.17 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 3054625.5 ns | 3034917 ns | 1.01 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2082708 ns | 2091583.5 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2524770.5 ns | 2284000 ns | 1.11 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4829771.5 ns | 6013125 ns | 0.80 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 578168 ns | 576987 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 23553833.5 ns | 23582313 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 18052083 ns | 18082917 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 18386562.5 ns | 16983417 ns | 1.08 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 36118708.5 ns | 37559208 ns | 0.96 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 2892912.5 ns | 3055304 ns | 0.95 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 34047500 ns | 33242479.5 ns | 1.02 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 27628208 ns | 27632833.5 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 170842542 ns | 27457834 ns | 6.22 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 42140958 ns | 44681646 ns | 0.94 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 117615167 ns | 120219875 ns | 0.98 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 174096667 ns | 174640416.5 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 116126813 ns | 147464708 ns | 0.79 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 105766417 ns | 447824083.5 ns | 0.24 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5459554 ns | 5463169 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 471545875 ns | 471068209 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 535019375 ns | 466756437 ns | 1.15 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 823344375 ns | 436859104.5 ns | 1.88 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 728723333 ns | 1751773916 ns | 0.42 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 32279472.5 ns | 32302579.5 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 640547750 ns | 637667500 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 660946479 ns | 664219750.5 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 1328513500 ns | 586420938 ns | 2.27 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 728557417 ns | 1733809521 ns | 0.42 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) | 335208 ns | 1232125 ns | 0.27 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) | 444875 ns | 975229 ns | 0.46 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) | 3143521 ns | 903250 ns | 3.48 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) | 400958 ns | 1950333 ns | 0.21 |
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA | 568639 ns | 568402.5 ns | 1.00 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) | 2026916.5 ns | 2957958 ns | 0.69 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) | 2017208 ns | 2629625 ns | 0.77 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) | 9021604 ns | 2593375 ns | 3.48 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) | 2028083 ns | 7086750 ns | 0.29 |
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA | 1315517 ns | 1321605.5 ns | 1.00 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) | 5792750.5 ns | 6642209 ns | 0.87 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) | 5781104 ns | 6552584 ns | 0.88 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) | 15508125 ns | 6490000 ns | 2.39 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) | 2891709 ns | 7617500 ns | 0.38 |
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/2 thread(s) | 102645.5 ns | 39583 ns | 2.59 |
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/4 thread(s) | 104250 ns | 31291 ns | 3.33 |
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/8 thread(s) | 1193563 ns | 35041 ns | 34.06 |
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/1 thread(s) | 104917 ns | 91458 ns | 1.15 |
Dense(128 => 128, gelu)(128 x 128)/forward/GPU/CUDA | 27774 ns | 27908 ns | 1.00 |
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/2 thread(s) | 209250 ns | 175479.5 ns | 1.19 |
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/4 thread(s) | 209125 ns | 175625 ns | 1.19 |
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/8 thread(s) | 604750 ns | 175667 ns | 3.44 |
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/1 thread(s) | 209333 ns | 273166 ns | 0.77 |
Dense(128 => 128, gelu)(128 x 128)/zygote/GPU/CUDA | 217576 ns | 218444 ns | 1.00 |
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/2 thread(s) | 706750 ns | 442021 ns | 1.60 |
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/4 thread(s) | 711250 ns | 442417 ns | 1.61 |
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/8 thread(s) | 1430333 ns | 442062.5 ns | 3.24 |
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/1 thread(s) | 686875 ns | 510625 ns | 1.35 |
Dense(128 => 128, relu)(128 x 128)/forward/CPU/2 thread(s) | 13500 ns | 13375 ns | 1.01 |
Dense(128 => 128, relu)(128 x 128)/forward/CPU/4 thread(s) | 13125 ns | 12833 ns | 1.02 |
Dense(128 => 128, relu)(128 x 128)/forward/CPU/8 thread(s) | 773270.5 ns | 14187.5 ns | 54.50 |
Dense(128 => 128, relu)(128 x 128)/forward/CPU/1 thread(s) | 13104 ns | 54458 ns | 0.24 |
Dense(128 => 128, relu)(128 x 128)/forward/GPU/CUDA | 27910 ns | 27839 ns | 1.00 |
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/2 thread(s) | 25708 ns | 25708 ns | 1 |
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/4 thread(s) | 25708 ns | 25708 ns | 1 |
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/8 thread(s) | 756084 ns | 25792 ns | 29.31 |
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/1 thread(s) | 25875 ns | 151770.5 ns | 0.17 |
Dense(128 => 128, relu)(128 x 128)/zygote/GPU/CUDA | 207052 ns | 208205 ns | 0.99 |
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/2 thread(s) | 52250 ns | 46083 ns | 1.13 |
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/4 thread(s) | 47000 ns | 45458 ns | 1.03 |
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/8 thread(s) | 708750 ns | 45875 ns | 15.45 |
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/1 thread(s) | 28167 ns | 151145.5 ns | 0.19 |
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) | 319453167 ns | 318485791 ns | 1.00 |
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) | 239155042 ns | 236155250 ns | 1.01 |
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) | 407766458 ns | 204947249.5 ns | 1.99 |
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) | 321870270.5 ns | 870093062.5 ns | 0.37 |
vgg16(32, 32, 3, 64)/forward/GPU/CUDA | 7780650 ns | 7672854 ns | 1.01 |
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) | 1236823417 ns | 1103355583.5 ns | 1.12 |
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) | 1007853145.5 ns | 951012041.5 ns | 1.06 |
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) | 1508969979.5 ns | 915597916 ns | 1.65 |
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) | 1586783833 ns | 2647669125 ns | 0.60 |
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA | 27019293 ns | 27249547 ns | 0.99 |
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/2 thread(s) | 416000 ns | 193938 ns | 2.15 |
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/4 thread(s) | 415250 ns | 167312.5 ns | 2.48 |
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/8 thread(s) | 1687042 ns | 167834 ns | 10.05 |
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/1 thread(s) | 413459 ns | 873291.5 ns | 0.47 |
Dense(512 => 512, gelu)(512 x 128)/forward/GPU/CUDA | 47441 ns | 47232 ns | 1.00 |
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1082395.5 ns | 1215770.5 ns | 0.89 |
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1079854 ns | 1097562.5 ns | 0.98 |
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/8 thread(s) | 2216437.5 ns | 1097292 ns | 2.02 |
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/1 thread(s) | 1092687.5 ns | 2767479.5 ns | 0.39 |
Dense(512 => 512, gelu)(512 x 128)/zygote/GPU/CUDA | 223288 ns | 222773 ns | 1.00 |
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/2 thread(s) | 3093042 ns | 2290771 ns | 1.35 |
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/4 thread(s) | 3109500 ns | 2230896 ns | 1.39 |
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/8 thread(s) | 5883791.5 ns | 2223666 ns | 2.65 |
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/1 thread(s) | 3023625 ns | 3710292 ns | 0.81 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) | 578667 ns | 1586500 ns | 0.36 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) | 572562.5 ns | 1236041.5 ns | 0.46 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) | 2245229 ns | 1234250 ns | 1.82 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) | 527584 ns | 2225875 ns | 0.24 |
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA | 576205 ns | 574042.5 ns | 1.00 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) | 2140625.5 ns | 3206334 ns | 0.67 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) | 2131541.5 ns | 2859000 ns | 0.75 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) | 8372625 ns | 2838875 ns | 2.95 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) | 2138166 ns | 7347042 ns | 0.29 |
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA | 1373440 ns | 1353212 ns | 1.01 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) | 7934875 ns | 8838958 ns | 0.90 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) | 7934500 ns | 8778959 ns | 0.90 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) | 19530500 ns | 8989687 ns | 2.17 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) | 4862417 ns | 9543937 ns | 0.51 |
Dense(16 => 16, relu)(16 x 128)/forward/CPU/2 thread(s) | 6896 ns | 2500 ns | 2.76 |
Dense(16 => 16, relu)(16 x 128)/forward/CPU/4 thread(s) | 6541.5 ns | 2229 ns | 2.93 |
Dense(16 => 16, relu)(16 x 128)/forward/CPU/8 thread(s) | 6583 ns | 2500 ns | 2.63 |
Dense(16 => 16, relu)(16 x 128)/forward/CPU/1 thread(s) | 6833 ns | 2709 ns | 2.52 |
Dense(16 => 16, relu)(16 x 128)/forward/GPU/CUDA | 25070 ns | 25189 ns | 1.00 |
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/2 thread(s) | 7625 ns | 7417 ns | 1.03 |
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/4 thread(s) | 7084 ns | 7083 ns | 1.00 |
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/8 thread(s) | 9625 ns | 7209 ns | 1.34 |
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/1 thread(s) | 7625 ns | 7250 ns | 1.05 |
Dense(16 => 16, relu)(16 x 128)/zygote/GPU/CUDA | 192229.5 ns | 191810 ns | 1.00 |
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/2 thread(s) | 9042 ns | 8667 ns | 1.04 |
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/4 thread(s) | 8917 ns | 8666.5 ns | 1.03 |
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/8 thread(s) | 9750 ns | 8500 ns | 1.15 |
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/1 thread(s) | 5750 ns | 5916 ns | 0.97 |
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/2 thread(s) | 20229.5 ns | 10395.5 ns | 1.95 |
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/4 thread(s) | 19542 ns | 17416.5 ns | 1.12 |
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/8 thread(s) |
17791.5 ns |
10625 ns |
1.67 |
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/1 thread(s) |
19708 ns |
7458 ns |
2.64 |
Dense(16 => 16, gelu)(16 x 128)/forward/GPU/CUDA |
25377 ns |
25409 ns |
1.00 |
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/2 thread(s) |
33708.5 ns |
21666 ns |
1.56 |
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/4 thread(s) |
33000 ns |
21500 ns |
1.53 |
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/8 thread(s) |
30792 ns |
21875 ns |
1.41 |
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/1 thread(s) |
33500 ns |
21792 ns |
1.54 |
Dense(16 => 16, gelu)(16 x 128)/zygote/GPU/CUDA |
201508 ns |
200873 ns |
1.00 |
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/2 thread(s) |
95042 ns |
56750 ns |
1.67 |
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/4 thread(s) |
93917 ns |
56750 ns |
1.65 |
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/8 thread(s) |
91250 ns |
56750 ns |
1.61 |
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/1 thread(s) |
92625 ns |
51333 ns |
1.80 |
Dense(128 => 128, identity)(128 x 128)/forward/CPU/2 thread(s) |
12312 ns |
28375 ns |
0.43 |
Dense(128 => 128, identity)(128 x 128)/forward/CPU/4 thread(s) |
11687.5 ns |
29125 ns |
0.40 |
Dense(128 => 128, identity)(128 x 128)/forward/CPU/8 thread(s) |
620375 ns |
28916 ns |
21.45 |
Dense(128 => 128, identity)(128 x 128)/forward/CPU/1 thread(s) |
13750 ns |
45875 ns |
0.30 |
Dense(128 => 128, identity)(128 x 128)/forward/GPU/CUDA |
26329 ns |
26566 ns |
0.99 |
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/2 thread(s) |
23834 ns |
44041 ns |
0.54 |
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/4 thread(s) |
23625 ns |
44375 ns |
0.53 |
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/8 thread(s) |
728479.5 ns |
44125 ns |
16.51 |
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/1 thread(s) |
23334 ns |
145041 ns |
0.16 |
Dense(128 => 128, identity)(128 x 128)/zygote/GPU/CUDA |
171675.5 ns |
172032.5 ns |
1.00 |
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/2 thread(s) |
57083 ns |
68500 ns |
0.83 |
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/4 thread(s) |
57000 ns |
68687.5 ns |
0.83 |
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/8 thread(s) |
793667 ns |
68333 ns |
11.61 |
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/1 thread(s) |
34542 ns |
145708 ns |
0.24 |
Dense(16 => 16, identity)(16 x 128)/forward/CPU/2 thread(s) |
7499.5 ns |
2000 ns |
3.75 |
Dense(16 => 16, identity)(16 x 128)/forward/CPU/4 thread(s) |
5896 ns |
1916.5 ns |
3.08 |
Dense(16 => 16, identity)(16 x 128)/forward/CPU/8 thread(s) |
5500 ns |
2000 ns |
2.75 |
Dense(16 => 16, identity)(16 x 128)/forward/CPU/1 thread(s) |
6458 ns |
1916 ns |
3.37 |
Dense(16 => 16, identity)(16 x 128)/forward/GPU/CUDA |
23518 ns |
23659 ns |
0.99 |
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/2 thread(s) |
5395.5 ns |
5500 ns |
0.98 |
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/4 thread(s) |
5250 ns |
5125 ns |
1.02 |
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/8 thread(s) |
7166 ns |
5375 ns |
1.33 |
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/1 thread(s) |
5417 ns |
5375 ns |
1.01 |
Dense(16 => 16, identity)(16 x 128)/zygote/GPU/CUDA |
176441.5 ns |
175247.5 ns |
1.01 |
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/2 thread(s) |
9041 ns |
8208 ns |
1.10 |
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/4 thread(s) |
8917 ns |
8375 ns |
1.06 |
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/8 thread(s) |
8833 ns |
8292 ns |
1.07 |
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/1 thread(s) |
6041 ns |
5291 ns |
1.14 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) |
107063875 ns |
34072083 ns |
3.14 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) |
116730666 ns |
40136625 ns |
2.91 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) |
123561417 ns |
43496250 ns |
2.84 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) |
118714416.5 ns |
153686583 ns |
0.77 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA |
2641140 ns |
2640798 ns |
1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) |
400534771 ns |
511580042 ns |
0.78 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) |
372080833.5 ns |
316303416.5 ns |
1.18 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) |
484853104 ns |
305395375.5 ns |
1.59 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) |
641061166 ns |
699851166 ns |
0.92 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA |
15244009 ns |
15146264 ns |
1.01 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) |
630176000 ns |
745204500 ns |
0.85 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) |
762625959 ns |
693254333.5 ns |
1.10 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) |
1248272125 ns |
743284104.5 ns |
1.68 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) |
909667083 ns |
1174961458.5 ns |
0.77 |
This comment was automatically generated by workflow using github-action-benchmark.
docs fixed by LuxDL/MLDataDevices.jl#87
qa tests should be fixed once LuxDL/LuxLib.jl#175 is merged
xref LuxDL/LuxLib.jl#175