fix: LV/Octavian moved to optional deps #986

Merged
merged 6 commits into from
Oct 18, 2024


avik-pal
Member

Contributor

github-actions bot commented Oct 18, 2024

Benchmark Results (ASV)

| Benchmark | main | 4a70e43... | main/4a70e438bc7aae... |
|:--|:--|:--|:--|
| basics/overhead | 0.123 ± 0.0013 μs | 0.138 ± 0.0015 μs | 0.892 |
| time_to_load | 1.13 ± 0.0077 s | 1.02 ± 0.0067 s | 1.11 |

Benchmark Plots

A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
Go to "Actions"->"Benchmark a pull request"->[the most recent run]->"Artifacts" (at the bottom).
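
For readers checking the ASV numbers by hand, here is a minimal sketch of how the last column appears to be derived: the time on `main` divided by the time on the PR commit. The dictionary and the `ratio` helper are hypothetical names used only for illustration, and the inputs are the rounded values printed in the table, so the result for `basics/overhead` can differ in the third decimal from the reported 0.892, which was presumably computed from unrounded timings.

```python
# Values copied from the ASV table above: (main, PR commit 4a70e43).
# Each row uses a single unit, so the ratio is dimensionless.
rows = {
    "basics/overhead": (0.123, 0.138),  # microseconds
    "time_to_load": (1.13, 1.02),       # seconds
}


def ratio(main: float, pr: float) -> float:
    """Return main / pr; a value above 1 means the PR commit took less time."""
    return main / pr


ratios = {name: ratio(main, pr) for name, (main, pr) in rows.items()}

for name, r in ratios.items():
    verdict = "PR faster" if r > 1 else "PR slower"
    print(f"{name}: {r:.2f} ({verdict})")
```

Read this way, the PR commit adds a little per-call overhead but loads faster than `main`. The Lux Benchmarks table further down uses the opposite orientation (Current divided by Previous), so there a ratio above 1 indicates a slowdown.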

Contributor

@github-actions github-actions bot left a comment


Lux Benchmarks

Benchmark suite Current: 4a70e43 Previous: 3d1ff6c Ratio
Dense(512 => 512, identity)(512 x 128)/forward/CPU/2 thread(s) 71583.5 ns 412333 ns 0.17
Dense(512 => 512, identity)(512 x 128)/forward/CPU/4 thread(s) 72166.5 ns 322708 ns 0.22
Dense(512 => 512, identity)(512 x 128)/forward/CPU/8 thread(s) 395708.5 ns 322354.5 ns 1.23
Dense(512 => 512, identity)(512 x 128)/forward/CPU/1 thread(s) 71833 ns 739667 ns 0.10
Dense(512 => 512, identity)(512 x 128)/forward/GPU/CUDA 43652 ns 43934 ns 0.99
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/2 thread(s) 323625 ns 605084 ns 0.53
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/4 thread(s) 314208 ns 511813 ns 0.61
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/8 thread(s) 1211937.5 ns 476187.5 ns 2.55
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/1 thread(s) 283208 ns 2280042 ns 0.12
Dense(512 => 512, identity)(512 x 128)/zygote/GPU/CUDA 190656 ns 191965 ns 0.99
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/2 thread(s) 423458 ns 720583.5 ns 0.59
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/4 thread(s) 364666.5 ns 629375 ns 0.58
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/8 thread(s) 1383729.5 ns 593479 ns 2.33
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/1 thread(s) 332792 ns 2247208 ns 0.15
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1537208 ns 1518562 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1190312.5 ns 1187166.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1525458.5 ns 1387229 ns 1.10
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2464874.5 ns 2947959 ns 0.84
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA 210700.5 ns 211504 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12304521 ns 12301292 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9594500 ns 9560875.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9248624.5 ns 9311271.5 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18023645.5 ns 18616125 ns 0.97
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1925425 ns 1926828 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17287250 ns 17354500 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14383374.5 ns 14318812.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14546125 ns 14334708 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21091917 ns 21859292 ns 0.96
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 120906083.5 ns 121057729 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 174242292 ns 174314729 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 116885562.5 ns 147379166 ns 0.79
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 106585896 ns 447559833 ns 0.24
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5501639 ns 5496733 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 582240708 ns 595612958.5 ns 0.98
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 535892917 ns 542499292 ns 0.99
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 827437521 ns 446168125 ns 1.85
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 627842708 ns 1630779417 ns 0.38
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 35142999.5 ns 35003993.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 698660021 ns 655003333.5 ns 1.07
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 670647875 ns 677053333 ns 0.99
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1332769646 ns 584185208 ns 2.28
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 750190854 ns 1732551062.5 ns 0.43
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) 883125 ns 880584 ns 1.00
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) 821770.5 ns 822625 ns 1.00
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) 3800041 ns 1226625 ns 3.10
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) 959750 ns 782750 ns 1.23
lenet(28, 28, 1, 32)/forward/GPU/CUDA 278308.5 ns 270475 ns 1.03
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) 2682687.5 ns 2740959 ns 0.98
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) 2405250 ns 2494167 ns 0.96
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) 12591458 ns 3327979 ns 3.78
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) 3249958 ns 3134292 ns 1.04
lenet(28, 28, 1, 32)/zygote/GPU/CUDA 1085615.5 ns 1067170 ns 1.02
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 6696708 ns 2264271 ns 2.96
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 6369250 ns 1552417 ns 4.10
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 6525708 ns 1753479 ns 3.72
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 7582250 ns 4348083 ns 1.74
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 212828 ns 214769 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 24130125 ns 20483167 ns 1.18
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 21843875 ns 17691916 ns 1.23
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 21547229.5 ns 17963833 ns 1.20
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 29745541 ns 26775375 ns 1.11
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1991939 ns 1991226 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 37796250 ns 45016687.5 ns 0.84
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 45818104.5 ns 42002229.5 ns 1.09
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 45523667 ns 41336854.5 ns 1.10
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 49327792 ns 47744959 ns 1.03
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 13398896 ns 4319020.5 ns 3.10
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 12466375 ns 2876959 ns 4.33
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 12614041 ns 3010167 ns 4.19
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 15229500 ns 8658375 ns 1.76
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 512375.5 ns 514332 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 47580146 ns 40234750 ns 1.18
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 41771000 ns 34767583 ns 1.20
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 41202146 ns 33924250 ns 1.21
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 58477750.5 ns 53719958 ns 1.09
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3012482 ns 2979961.5 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 75212708 ns 89992458 ns 0.84
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 91894021 ns 84426916.5 ns 1.09
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 264774875 ns 82809646 ns 3.20
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 99021084 ns 96502584 ns 1.03
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 287455917 ns 142457125 ns 2.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 340245250 ns 186377999.5 ns 1.83
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 285087458 ns 160522958 ns 1.78
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 273053500 ns 489638250 ns 0.56
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 7064855 ns 7101620 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 976563875 ns 877579500 ns 1.11
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 877101209 ns 810323667 ns 1.08
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 1209457042 ns 714880166.5 ns 1.69
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1118578125 ns 2042862020.5 ns 0.55
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 34000634 ns 34011046 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1377756645.5 ns 1671563875 ns 0.82
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1698830042 ns 1561654708 ns 1.09
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 2330234937.5 ns 1478668938 ns 1.58
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1670119292 ns 2558813125 ns 0.65
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) 1521208.5 ns 1545417 ns 0.98
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) 1266270.5 ns 1269125 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) 5996750 ns 1641375 ns 3.65
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) 2141167 ns 2465854.5 ns 0.87
lenet(28, 28, 1, 128)/forward/GPU/CUDA 268811.5 ns 268247 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) 7507875 ns 7879417 ns 0.95
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) 6589083 ns 6568854.5 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) 25009416 ns 7162916 ns 3.49
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) 10432021 ns 11708979 ns 0.89
lenet(28, 28, 1, 128)/zygote/GPU/CUDA 1101260 ns 1072114.5 ns 1.03
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) 191129208 ns 186008021 ns 1.03
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) 144736396 ns 145478792 ns 0.99
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) 273492000 ns 128424688 ns 2.13
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) 179461208 ns 452715625 ns 0.40
vgg16(32, 32, 3, 32)/forward/GPU/CUDA 4846488 ns 4848333.5 ns 1.00
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) 629305584 ns 641184291 ns 0.98
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) 509975583 ns 524088958 ns 0.97
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) 1027968333 ns 537727000 ns 1.91
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) 806872541 ns 1403114875 ns 0.58
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA 17033903 ns 18681328 ns 0.91
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) 1088375 ns 1095875 ns 0.99
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) 971208 ns 967604 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) 4519584 ns 1353791 ns 3.34
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) 1304542 ns 1322541 ns 0.99
lenet(28, 28, 1, 64)/forward/GPU/CUDA 269583 ns 270349 ns 1.00
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) 4486458 ns 6033416.5 ns 0.74
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) 3825437.5 ns 4668271 ns 0.82
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) 17153291 ns 4931041 ns 3.48
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) 5684812.5 ns 6027000 ns 0.94
lenet(28, 28, 1, 64)/zygote/GPU/CUDA 1116515 ns 1111457 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23823916.5 ns 23769000 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 33872292 ns 34212937.5 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 39748625 ns 37101833 ns 1.07
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 35386792 ns 132610375 ns 0.27
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1834153 ns 1831215 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 185233521 ns 184677875 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 158529229 ns 159062583 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 294665437.5 ns 144384604 ns 2.04
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 385181292 ns 534807250 ns 0.72
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 16520075 ns 16477829 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 294759792 ns 297162354.5 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 243985334 ns 243897583 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 685013666 ns 298830750.5 ns 2.29
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 435921041 ns 713110625 ns 0.61
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) 763398791 ns 658030208 ns 1.16
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) 493032167 ns 432501687.5 ns 1.14
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) 626385583 ns 400225625 ns 1.57
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) 868315166 ns 1771604646 ns 0.49
vgg16(32, 32, 3, 128)/forward/GPU/CUDA 12483852 ns 12483387 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) 1867913687.5 ns 1887633520.5 ns 0.99
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) 1570444917 ns 1637268417 ns 0.96
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) 2756005292 ns 1504383479 ns 1.83
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) 2065837270.5 ns 5051328625 ns 0.41
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA 49723405.5 ns 49779380 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3072750 ns 3063562.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2100791.5 ns 2098125.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2525375 ns 2292583 ns 1.10
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4785375 ns 6036250 ns 0.79
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 581725.5 ns 581533.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 25396541.5 ns 25421500 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 19829062.5 ns 20387000 ns 0.97
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 18858041 ns 19188625 ns 0.98
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 36960792 ns 39410312.5 ns 0.94
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2983801 ns 2998929 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 35295146 ns 35068167 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 30170458 ns 28412292 ns 1.06
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 171213125 ns 30184750 ns 5.67
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 42761479 ns 45702312 ns 0.94
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1657000 ns 1657041.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1203792 ns 1194958 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1571083 ns 1386854 ns 1.13
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2524687.5 ns 3047792 ns 0.83
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 217203 ns 216764 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12721792 ns 12728917 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9957625 ns 9975625 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9616250 ns 9685688 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18462187 ns 19011166.5 ns 0.97
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1943561 ns 1948689 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17680916.5 ns 17689854 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14740396 ns 14742625 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14541458 ns 14638458.5 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21459125 ns 22191709 ns 0.97
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23932750 ns 23557167 ns 1.02
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 34347416.5 ns 34461708 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 39627542 ns 37530375 ns 1.06
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 35624917 ns 132666521 ns 0.27
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1852557 ns 1832030 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 302609458 ns 189557666.5 ns 1.60
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 228874000 ns 237037020.5 ns 0.97
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 301221333 ns 197049458.5 ns 1.53
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 392048250 ns 727023166 ns 0.54
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 13878343.5 ns 13923369 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 299398562 ns 302073604 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 251611583 ns 250383541.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 712654166.5 ns 306286250 ns 2.33
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 440939000 ns 717294500 ns 0.61
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) 2420583 ns 1917792 ns 1.26
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) 2423229 ns 1580771 ns 1.53
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) 6731646 ns 1574458 ns 4.28
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) 2424167 ns 2652249.5 ns 0.91
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA 571316.5 ns 575627 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) 6543187.5 ns 6156375 ns 1.06
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) 6525875 ns 5936542 ns 1.10
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) 17755209 ns 5926084 ns 3.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) 6538145.5 ns 9429521 ns 0.69
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA 1380926 ns 1376631 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) 17530271 ns 18780958 ns 0.93
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) 17532125 ns 19122375 ns 0.92
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) 36016542 ns 19117250 ns 1.88
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) 14110479.5 ns 18883917 ns 0.75
Dense(512 => 512, relu)(512 x 128)/forward/CPU/2 thread(s) 71625 ns 70021 ns 1.02
Dense(512 => 512, relu)(512 x 128)/forward/CPU/4 thread(s) 69625 ns 68417 ns 1.02
Dense(512 => 512, relu)(512 x 128)/forward/CPU/8 thread(s) 1066333 ns 70500 ns 15.13
Dense(512 => 512, relu)(512 x 128)/forward/CPU/1 thread(s) 69646.5 ns 727312.5 ns 0.10
Dense(512 => 512, relu)(512 x 128)/forward/GPU/CUDA 47610 ns 47914 ns 0.99
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/2 thread(s) 322416.5 ns 355562.5 ns 0.91
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/4 thread(s) 296458 ns 325250 ns 0.91
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/8 thread(s) 1857542 ns 326687.5 ns 5.69
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/1 thread(s) 281208.5 ns 2205479 ns 0.13
Dense(512 => 512, relu)(512 x 128)/zygote/GPU/CUDA 211302 ns 213681 ns 0.99
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/2 thread(s) 443542 ns 393084 ns 1.13
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/4 thread(s) 448479.5 ns 450084 ns 1.00
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/8 thread(s) 1860041 ns 444728.5 ns 4.18
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/1 thread(s) 377895.5 ns 2229333 ns 0.17
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3054625.5 ns 3034917 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2082708 ns 2091583.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2524770.5 ns 2284000 ns 1.11
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4829771.5 ns 6013125 ns 0.80
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA 578168 ns 576987 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 23553833.5 ns 23582313 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18052083 ns 18082917 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 18386562.5 ns 16983417 ns 1.08
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 36118708.5 ns 37559208 ns 0.96
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2892912.5 ns 3055304 ns 0.95
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 34047500 ns 33242479.5 ns 1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 27628208 ns 27632833.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 170842542 ns 27457834 ns 6.22
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 42140958 ns 44681646 ns 0.94
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 117615167 ns 120219875 ns 0.98
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 174096667 ns 174640416.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 116126813 ns 147464708 ns 0.79
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 105766417 ns 447824083.5 ns 0.24
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5459554 ns 5463169 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 471545875 ns 471068209 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 535019375 ns 466756437 ns 1.15
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 823344375 ns 436859104.5 ns 1.88
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 728723333 ns 1751773916 ns 0.42
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 32279472.5 ns 32302579.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 640547750 ns 637667500 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 660946479 ns 664219750.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1328513500 ns 586420938 ns 2.27
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 728557417 ns 1733809521 ns 0.42
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) 335208 ns 1232125 ns 0.27
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) 444875 ns 975229 ns 0.46
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) 3143521 ns 903250 ns 3.48
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) 400958 ns 1950333 ns 0.21
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA 568639 ns 568402.5 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) 2026916.5 ns 2957958 ns 0.69
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) 2017208 ns 2629625 ns 0.77
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) 9021604 ns 2593375 ns 3.48
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) 2028083 ns 7086750 ns 0.29
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA 1315517 ns 1321605.5 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) 5792750.5 ns 6642209 ns 0.87
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) 5781104 ns 6552584 ns 0.88
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) 15508125 ns 6490000 ns 2.39
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) 2891709 ns 7617500 ns 0.38
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/2 thread(s) 102645.5 ns 39583 ns 2.59
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/4 thread(s) 104250 ns 31291 ns 3.33
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/8 thread(s) 1193563 ns 35041 ns 34.06
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/1 thread(s) 104917 ns 91458 ns 1.15
Dense(128 => 128, gelu)(128 x 128)/forward/GPU/CUDA 27774 ns 27908 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/2 thread(s) 209250 ns 175479.5 ns 1.19
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/4 thread(s) 209125 ns 175625 ns 1.19
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/8 thread(s) 604750 ns 175667 ns 3.44
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/1 thread(s) 209333 ns 273166 ns 0.77
Dense(128 => 128, gelu)(128 x 128)/zygote/GPU/CUDA 217576 ns 218444 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/2 thread(s) 706750 ns 442021 ns 1.60
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/4 thread(s) 711250 ns 442417 ns 1.61
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/8 thread(s) 1430333 ns 442062.5 ns 3.24
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/1 thread(s) 686875 ns 510625 ns 1.35
Dense(128 => 128, relu)(128 x 128)/forward/CPU/2 thread(s) 13500 ns 13375 ns 1.01
Dense(128 => 128, relu)(128 x 128)/forward/CPU/4 thread(s) 13125 ns 12833 ns 1.02
Dense(128 => 128, relu)(128 x 128)/forward/CPU/8 thread(s) 773270.5 ns 14187.5 ns 54.50
Dense(128 => 128, relu)(128 x 128)/forward/CPU/1 thread(s) 13104 ns 54458 ns 0.24
Dense(128 => 128, relu)(128 x 128)/forward/GPU/CUDA 27910 ns 27839 ns 1.00
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/2 thread(s) 25708 ns 25708 ns 1.00
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/4 thread(s) 25708 ns 25708 ns 1.00
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/8 thread(s) 756084 ns 25792 ns 29.31
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/1 thread(s) 25875 ns 151770.5 ns 0.17
Dense(128 => 128, relu)(128 x 128)/zygote/GPU/CUDA 207052 ns 208205 ns 0.99
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/2 thread(s) 52250 ns 46083 ns 1.13
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/4 thread(s) 47000 ns 45458 ns 1.03
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/8 thread(s) 708750 ns 45875 ns 15.45
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/1 thread(s) 28167 ns 151145.5 ns 0.19
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) 319453167 ns 318485791 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) 239155042 ns 236155250 ns 1.01
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) 407766458 ns 204947249.5 ns 1.99
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) 321870270.5 ns 870093062.5 ns 0.37
vgg16(32, 32, 3, 64)/forward/GPU/CUDA 7780650 ns 7672854 ns 1.01
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) 1236823417 ns 1103355583.5 ns 1.12
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) 1007853145.5 ns 951012041.5 ns 1.06
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) 1508969979.5 ns 915597916 ns 1.65
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) 1586783833 ns 2647669125 ns 0.60
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA 27019293 ns 27249547 ns 0.99
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/2 thread(s) 416000 ns 193938 ns 2.15
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/4 thread(s) 415250 ns 167312.5 ns 2.48
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/8 thread(s) 1687042 ns 167834 ns 10.05
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/1 thread(s) 413459 ns 873291.5 ns 0.47
Dense(512 => 512, gelu)(512 x 128)/forward/GPU/CUDA 47441 ns 47232 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/2 thread(s) 1082395.5 ns 1215770.5 ns 0.89
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/4 thread(s) 1079854 ns 1097562.5 ns 0.98
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/8 thread(s) 2216437.5 ns 1097292 ns 2.02
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/1 thread(s) 1092687.5 ns 2767479.5 ns 0.39
Dense(512 => 512, gelu)(512 x 128)/zygote/GPU/CUDA 223288 ns 222773 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/2 thread(s) 3093042 ns 2290771 ns 1.35
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/4 thread(s) 3109500 ns 2230896 ns 1.39
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/8 thread(s) 5883791.5 ns 2223666 ns 2.65
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/1 thread(s) 3023625 ns 3710292 ns 0.81
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 578667 ns 1586500 ns 0.36
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 572562.5 ns 1236041.5 ns 0.46
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 2245229 ns 1234250 ns 1.82
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) 527584 ns 2225875 ns 0.24
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA 576205 ns 574042.5 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) 2140625.5 ns 3206334 ns 0.67
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 2131541.5 ns 2859000 ns 0.75
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) 8372625 ns 2838875 ns 2.95
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) 2138166 ns 7347042 ns 0.29
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA 1373440 ns 1353212 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) 7934875 ns 8838958 ns 0.90
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) 7934500 ns 8778959 ns 0.90
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) 19530500 ns 8989687 ns 2.17
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) 4862417 ns 9543937 ns 0.51
Dense(16 => 16, relu)(16 x 128)/forward/CPU/2 thread(s) 6896 ns 2500 ns 2.76
Dense(16 => 16, relu)(16 x 128)/forward/CPU/4 thread(s) 6541.5 ns 2229 ns 2.93
Dense(16 => 16, relu)(16 x 128)/forward/CPU/8 thread(s) 6583 ns 2500 ns 2.63
Dense(16 => 16, relu)(16 x 128)/forward/CPU/1 thread(s) 6833 ns 2709 ns 2.52
Dense(16 => 16, relu)(16 x 128)/forward/GPU/CUDA 25070 ns 25189 ns 1.00
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/2 thread(s) 7625 ns 7417 ns 1.03
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/4 thread(s) 7084 ns 7083 ns 1.00
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/8 thread(s) 9625 ns 7209 ns 1.34
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/1 thread(s) 7625 ns 7250 ns 1.05
Dense(16 => 16, relu)(16 x 128)/zygote/GPU/CUDA 192229.5 ns 191810 ns 1.00
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/2 thread(s) 9042 ns 8667 ns 1.04
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/4 thread(s) 8917 ns 8666.5 ns 1.03
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/8 thread(s) 9750 ns 8500 ns 1.15
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/1 thread(s) 5750 ns 5916 ns 0.97
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/2 thread(s) 20229.5 ns 10395.5 ns 1.95
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/4 thread(s) 19542 ns 17416.5 ns 1.12
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/8 thread(s) 17791.5 ns 10625 ns 1.67
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/1 thread(s) 19708 ns 7458 ns 2.64
Dense(16 => 16, gelu)(16 x 128)/forward/GPU/CUDA 25377 ns 25409 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/2 thread(s) 33708.5 ns 21666 ns 1.56
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/4 thread(s) 33000 ns 21500 ns 1.53
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/8 thread(s) 30792 ns 21875 ns 1.41
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/1 thread(s) 33500 ns 21792 ns 1.54
Dense(16 => 16, gelu)(16 x 128)/zygote/GPU/CUDA 201508 ns 200873 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/2 thread(s) 95042 ns 56750 ns 1.67
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/4 thread(s) 93917 ns 56750 ns 1.65
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/8 thread(s) 91250 ns 56750 ns 1.61
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/1 thread(s) 92625 ns 51333 ns 1.80
Dense(128 => 128, identity)(128 x 128)/forward/CPU/2 thread(s) 12312 ns 28375 ns 0.43
Dense(128 => 128, identity)(128 x 128)/forward/CPU/4 thread(s) 11687.5 ns 29125 ns 0.40
Dense(128 => 128, identity)(128 x 128)/forward/CPU/8 thread(s) 620375 ns 28916 ns 21.45
Dense(128 => 128, identity)(128 x 128)/forward/CPU/1 thread(s) 13750 ns 45875 ns 0.30
Dense(128 => 128, identity)(128 x 128)/forward/GPU/CUDA 26329 ns 26566 ns 0.99
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/2 thread(s) 23834 ns 44041 ns 0.54
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/4 thread(s) 23625 ns 44375 ns 0.53
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/8 thread(s) 728479.5 ns 44125 ns 16.51
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/1 thread(s) 23334 ns 145041 ns 0.16
Dense(128 => 128, identity)(128 x 128)/zygote/GPU/CUDA 171675.5 ns 172032.5 ns 1.00
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/2 thread(s) 57083 ns 68500 ns 0.83
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/4 thread(s) 57000 ns 68687.5 ns 0.83
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/8 thread(s) 793667 ns 68333 ns 11.61
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/1 thread(s) 34542 ns 145708 ns 0.24
Dense(16 => 16, identity)(16 x 128)/forward/CPU/2 thread(s) 7499.5 ns 2000 ns 3.75
Dense(16 => 16, identity)(16 x 128)/forward/CPU/4 thread(s) 5896 ns 1916.5 ns 3.08
Dense(16 => 16, identity)(16 x 128)/forward/CPU/8 thread(s) 5500 ns 2000 ns 2.75
Dense(16 => 16, identity)(16 x 128)/forward/CPU/1 thread(s) 6458 ns 1916 ns 3.37
Dense(16 => 16, identity)(16 x 128)/forward/GPU/CUDA 23518 ns 23659 ns 0.99
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/2 thread(s) 5395.5 ns 5500 ns 0.98
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/4 thread(s) 5250 ns 5125 ns 1.02
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/8 thread(s) 7166 ns 5375 ns 1.33
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/1 thread(s) 5417 ns 5375 ns 1.01
Dense(16 => 16, identity)(16 x 128)/zygote/GPU/CUDA 176441.5 ns 175247.5 ns 1.01
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/2 thread(s) 9041 ns 8208 ns 1.10
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/4 thread(s) 8917 ns 8375 ns 1.06
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/8 thread(s) 8833 ns 8292 ns 1.07
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/1 thread(s) 6041 ns 5291 ns 1.14
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 107063875 ns 34072083 ns 3.14
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 116730666 ns 40136625 ns 2.91
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 123561417 ns 43496250 ns 2.84
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 118714416.5 ns 153686583 ns 0.77
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 2641140 ns 2640798 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 400534771 ns 511580042 ns 0.78
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 372080833.5 ns 316303416.5 ns 1.18
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 484853104 ns 305395375.5 ns 1.59
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 641061166 ns 699851166 ns 0.92
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 15244009 ns 15146264 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 630176000 ns 745204500 ns 0.85
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 762625959 ns 693254333.5 ns 1.10
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 1248272125 ns 743284104.5 ns 1.68
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 909667083 ns 1174961458.5 ns 0.77

This comment was automatically generated by workflow using github-action-benchmark.
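
The Ratio column in the rows above appears to be the current timing divided by the previous one, so values above 1 mean the current commit is slower. For example, 620375 ns / 28916 ns ≈ 21.45 for the 8-thread `Dense(128 => 128, identity)` forward pass. A minimal sketch of that reading follows; the function names and the 5% tolerance band are illustrative assumptions, not part of github-action-benchmark.

```python
def ratio(current_ns: float, previous_ns: float) -> float:
    """Current timing divided by previous timing (assumed meaning of the Ratio column)."""
    return current_ns / previous_ns


def describe(current_ns: float, previous_ns: float, tolerance: float = 0.05) -> str:
    """Label a benchmark row as slower, faster, or unchanged.

    The tolerance band is a hypothetical choice for illustration only.
    """
    r = ratio(current_ns, previous_ns)
    if r > 1 + tolerance:
        verdict = "slower"
    elif r < 1 - tolerance:
        verdict = "faster"
    else:
        verdict = "unchanged"
    return f"{r:.2f} ({verdict})"


# Timings copied from three rows of the table above.
print(describe(620375, 28916))  # Dense(128 => 128, identity) forward, CPU, 8 threads
print(describe(13750, 45875))   # Dense(128 => 128, identity) forward, CPU, 1 thread
print(describe(26329, 26566))   # Dense(128 => 128, identity) forward, GPU/CUDA
```

Read this way, the rows show a mixed picture: most 1-thread CPU rows are faster, the GPU rows are essentially unchanged, and many of the 8-thread CPU rows are slower.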

@avik-pal

docs fixed by LuxDL/MLDataDevices.jl#87

@avik-pal

QA tests should be fixed once LuxDL/LuxLib.jl#175 is merged.

@avik-pal avik-pal merged commit 1a701d2 into main Oct 18, 2024
42 of 54 checks passed
@avik-pal avik-pal deleted the ap/lv_deps branch October 18, 2024 20:46