Commit
chore: bump compat for GPUArraysCore to 0.2 for package docs, (keep existing compat) (#985)

Co-authored-by: CompatHelper Julia <[email protected]>
github-actions[bot] and CompatHelper Julia authored Oct 18, 2024
1 parent b24fe07 commit 3d1ff6c
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion docs/Project.toml
@@ -36,7 +36,7 @@ Documenter = "1.4"
 DocumenterVitepress = "0.1.3"
 FiniteDiff = "2.23.1"
 ForwardDiff = "0.10.36"
-GPUArraysCore = "0.1"
+GPUArraysCore = "0.1, 0.2"
 KernelAbstractions = "0.9"
 LinearAlgebra = "1.10"
 Literate = "2.18.0"
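
The changed entry widens the allowed range rather than replacing it. In Julia's Pkg, a bare version such as "0.1" is read as a caret specifier, and a comma-separated compat entry is the union of its parts, so "0.1, 0.2" accepts both the 0.1.x and 0.2.x series. The following is a minimal Python sketch of that rule, not Pkg's implementation: it handles only bare caret-style specifiers (no tilde, inequality, or hyphen forms), and the function names `caret_range` and `allows` are hypothetical.

```python
def caret_range(spec):
    """Return (lower, upper) version tuples for a bare caret-style specifier.

    The upper bound is exclusive. Simplified sketch: only forms such as
    "1.4", "0.1", or "2.23.1" are handled.
    """
    parts = [int(p) for p in spec.strip().split(".")]
    lower = tuple(parts + [0] * (3 - len(parts)))
    # The left-most non-zero component is the one bumped for the upper bound.
    for i, p in enumerate(parts):
        if p != 0:
            upper = lower[:i] + (p + 1,) + (0,) * (2 - i)
            return lower, upper
    # All given components are zero: bump the last component that was written.
    i = len(parts) - 1
    upper = (0,) * i + (1,) + (0,) * (2 - i)
    return lower, upper


def allows(compat, version):
    """True if `version` (a 3-tuple) falls inside any comma-separated specifier."""
    return any(lo <= version < hi for lo, hi in map(caret_range, compat.split(",")))


print(allows("0.1", (0, 2, 0)))       # False: the old entry stops below 0.2.0
print(allows("0.1, 0.2", (0, 2, 0)))  # True: the new entry also admits 0.2.x
```

Under this reading, the old entry would have blocked GPUArraysCore 0.2.0 in the docs environment, while the new one keeps 0.1.x installable as well.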

1 comment on commit 3d1ff6c

@github-actions

Lux Benchmarks
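
Each row in the table below has the shape `<benchmark name> <current> ns <previous> ns <ratio>`, and the listed ratios agree with current time divided by previous time, so values above 1 mean the current commit was slower. The following is a minimal Python sketch for parsing such rows and rechecking the ratio. The regular expression, the function names, and the 1.5 cutoff are assumptions made for illustration; they are not taken from the benchmark tooling.

```python
import re

# One row: name, current time, previous time, ratio (times in nanoseconds).
ROW = re.compile(
    r"^(?P<name>.+?) (?P<current>[\d.]+) ns (?P<previous>[\d.]+) ns (?P<ratio>[\d.]+)$"
)

# Hypothetical cutoff for calling a row a slowdown; pick whatever suits the review.
SLOWDOWN_THRESHOLD = 1.5


def parse_row(line):
    """Parse one table row into a dict, or return None for non-row lines."""
    m = ROW.match(line.strip())
    if m is None:
        return None
    return {
        "name": m["name"],
        "current_ns": float(m["current"]),
        "previous_ns": float(m["previous"]),
        "ratio": float(m["ratio"]),
    }


def recomputed_ratio(row):
    """Current / previous, rounded to two decimals like the table's last column."""
    return round(row["current_ns"] / row["previous_ns"], 2)


sample = [
    "Dense(512 => 512, identity)(512 x 128)/forward/CPU/2 thread(s) 412333 ns 411833 ns 1.00",
    "Dense(512 => 512, identity)(512 x 128)/zygote/CPU/1 thread(s) 2280042 ns 916416 ns 2.49",
]
for line in sample:
    row = parse_row(line)
    if row is not None and recomputed_ratio(row) >= SLOWDOWN_THRESHOLD:
        print("slower:", row["name"], recomputed_ratio(row))
```

Filtering this way makes it easier to see that the large ratios in this run cluster in the `CPU/1 thread(s)` rows, while the multi-threaded and CUDA rows stay close to 1.00.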

Benchmark suite Current: 3d1ff6c Previous: 33e5432 Ratio (Current / Previous)
Dense(512 => 512, identity)(512 x 128)/forward/CPU/2 thread(s) 412333 ns 411833 ns 1.00
Dense(512 => 512, identity)(512 x 128)/forward/CPU/4 thread(s) 322708 ns 322270.5 ns 1.00
Dense(512 => 512, identity)(512 x 128)/forward/CPU/8 thread(s) 322354.5 ns 322687.5 ns 1.00
Dense(512 => 512, identity)(512 x 128)/forward/CPU/1 thread(s) 739667 ns 739792 ns 1.00
Dense(512 => 512, identity)(512 x 128)/forward/GPU/CUDA 43934 ns 43717 ns 1.00
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/2 thread(s) 605084 ns 592458 ns 1.02
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/4 thread(s) 511813 ns 485750 ns 1.05
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/8 thread(s) 476187.5 ns 472146 ns 1.01
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/1 thread(s) 2280042 ns 916416 ns 2.49
Dense(512 => 512, identity)(512 x 128)/zygote/GPU/CUDA 191965 ns 193389 ns 0.99
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/2 thread(s) 720583.5 ns 732083 ns 0.98
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/4 thread(s) 629375 ns 630020.5 ns 1.00
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/8 thread(s) 593479 ns 590250 ns 1.01
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/1 thread(s) 2247208 ns 1008000 ns 2.23
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1518562 ns 1531625.5 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1187166.5 ns 1199500 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1387229 ns 1370166 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2947959 ns 2432729.5 ns 1.21
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA 211504 ns 211497 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12301292 ns 12247917 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9560875.5 ns 9551854.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9311271.5 ns 9290625 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18616125 ns 17955583 ns 1.04
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1926828 ns 1916393.5 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17354500 ns 17351270.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14318812.5 ns 14353042 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14334708 ns 14309667 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21859292 ns 21080250 ns 1.04
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 121057729 ns 121821646 ns 0.99
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 174314729 ns 174069521 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 147379166 ns 148056167 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 447559833 ns 106139667 ns 4.22
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5496733 ns 5478633 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 595612958.5 ns 596837750 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 542499292 ns 543667792 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 446168125 ns 445085375 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1630779417 ns 626736625 ns 2.60
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 35003993.5 ns 38176542 ns 0.92
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 655003333.5 ns 652965479.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 677053333 ns 674093584 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 584185208 ns 632863021 ns 0.92
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1732551062.5 ns 743445292 ns 2.33
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) 880584 ns 849625 ns 1.04
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) 822625 ns 832854.5 ns 0.99
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) 1226625 ns 1217000 ns 1.01
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) 782750 ns 966042 ns 0.81
lenet(28, 28, 1, 32)/forward/GPU/CUDA 270475 ns 266296.5 ns 1.02
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) 2740959 ns 2721500 ns 1.01
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) 2494167 ns 2466917 ns 1.01
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) 3327979 ns 3314395.5 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) 3134292 ns 3364958.5 ns 0.93
lenet(28, 28, 1, 32)/zygote/GPU/CUDA 1067170 ns 1061958 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 2264271 ns 2259875 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1552417 ns 1580250 ns 0.98
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1753479 ns 1752416.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 4348083 ns 3779541 ns 1.15
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 214769 ns 212874 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 20483167 ns 20464770.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 17691916 ns 17681833 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 17963833 ns 17968916 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 26775375 ns 26220958.5 ns 1.02
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1991226 ns 1983562 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 45016687.5 ns 44361875 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 42002229.5 ns 42037625 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 41336854.5 ns 41240937.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 47744959 ns 47003375 ns 1.02
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 4319020.5 ns 4301083.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2876959 ns 2876167 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 3010167 ns 2986437.5 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 8658375 ns 7412625 ns 1.17
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 514332 ns 515223 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 40234750 ns 40138542 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 34767583 ns 34883937.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 33924250 ns 33862542 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 53719958 ns 51421084 ns 1.04
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2979961.5 ns 2979770 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 89992458 ns 88409354.5 ns 1.02
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 84426916.5 ns 84462416 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 82809646 ns 83166916.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 96502584 ns 93812228.5 ns 1.03
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 142457125 ns 143119041 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 186377999.5 ns 186909958.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 160522958 ns 160607000 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 489638250 ns 149056313 ns 3.28
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 7101620 ns 7091795 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 877579500 ns 876576041.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 810323667 ns 819011417 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 714880166.5 ns 713621416.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 2042862020.5 ns 1026954750.5 ns 1.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 34011046 ns 33962668 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1671563875 ns 1654338292 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1561654708 ns 1556399750 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1478668938 ns 1456365229 ns 1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 2558813125 ns 1581565875 ns 1.62
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) 1545417 ns 1500042 ns 1.03
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) 1269125 ns 1281708 ns 0.99
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) 1641375 ns 1629875 ns 1.01
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) 2465854.5 ns 2163395.5 ns 1.14
lenet(28, 28, 1, 128)/forward/GPU/CUDA 268247 ns 262650.5 ns 1.02
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) 7879417 ns 7601959 ns 1.04
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) 6568854.5 ns 6596916 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) 7162916 ns 7128375 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) 11708979 ns 10476396 ns 1.12
lenet(28, 28, 1, 128)/zygote/GPU/CUDA 1072114.5 ns 1087771 ns 0.99
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) 186008021 ns 185964437.5 ns 1.00
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) 145478792 ns 146352312.5 ns 0.99
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) 128424688 ns 130050146 ns 0.99
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) 452715625 ns 179543416.5 ns 2.52
vgg16(32, 32, 3, 32)/forward/GPU/CUDA 4848333.5 ns 4845696 ns 1.00
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) 641184291 ns 643688917 ns 1.00
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) 524088958 ns 604191917 ns 0.87
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) 537727000 ns 537019041 ns 1.00
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) 1403114875 ns 663244750 ns 2.12
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA 18681328 ns 16664478 ns 1.12
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) 1095875 ns 1073937.5 ns 1.02
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) 967604 ns 979688 ns 0.99
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) 1353791 ns 1338583 ns 1.01
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) 1322541 ns 1380812 ns 0.96
lenet(28, 28, 1, 64)/forward/GPU/CUDA 270349 ns 265966 ns 1.02
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) 6033416.5 ns 6009021 ns 1.00
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) 4668271 ns 4658625 ns 1.00
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) 4931041 ns 4922187.5 ns 1.00
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) 6027000 ns 5723978.5 ns 1.05
lenet(28, 28, 1, 64)/zygote/GPU/CUDA 1111457 ns 1137942.5 ns 0.98
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23769000 ns 23733624.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 34212937.5 ns 35284771.5 ns 0.97
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37101833 ns 37100750.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 132610375 ns 35260167 ns 3.76
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1831215 ns 1834016 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 184677875 ns 184898625 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 159062583 ns 160642834 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 144384604 ns 144248000 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 534807250 ns 271530583 ns 1.97
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 16477829 ns 16393096 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 297162354.5 ns 296257000 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 243897583 ns 245304833 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 298830750.5 ns 301408687 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 713110625 ns 446273791 ns 1.60
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) 658030208 ns 656873875 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) 432501687.5 ns 433591937.5 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) 400225625 ns 402349417 ns 0.99
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) 1771604646 ns 677798728.5 ns 2.61
vgg16(32, 32, 3, 128)/forward/GPU/CUDA 12483387 ns 12482697 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) 1887633520.5 ns 1891955437.5 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) 1637268417 ns 1637549708 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) 1504383479 ns 1514000729 ns 0.99
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) 5051328625 ns 2113439354.5 ns 2.39
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA 49779380 ns 49760182 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3063562.5 ns 3046500 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2098125.5 ns 2098166 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2292583 ns 2287292 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 6036250 ns 4866125 ns 1.24
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 581533.5 ns 582507.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 25421500 ns 25579833 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 20387000 ns 20277104 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 19188625 ns 19545458 ns 0.98
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 39410312.5 ns 36687292 ns 1.07
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2998929 ns 2979368 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 35068167 ns 35578625 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 28412292 ns 28390167 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 30184750 ns 30144895.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 45702312 ns 42776229 ns 1.07
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1657041.5 ns 1650667 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1194958 ns 1204458 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1386854 ns 1396750 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 3047792 ns 2509645.5 ns 1.21
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 216764 ns 218107 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12728917 ns 12697333 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9975625 ns 9973959 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9685688 ns 9758687 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 19011166.5 ns 18284458 ns 1.04
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1948689 ns 1944527.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17689854 ns 17688854 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14742625 ns 14754291 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14638458.5 ns 14674374.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 22191709 ns 21468083.5 ns 1.03
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23557167 ns 23681167 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 34461708 ns 34404604 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37530375 ns 37545958 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 132666521 ns 35268000 ns 3.76
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1832030 ns 1848561 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 189557666.5 ns 190505958.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 237037020.5 ns 237366917 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 197049458.5 ns 194090667 ns 1.02
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 727023166 ns 460122917 ns 1.58
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 13923369 ns 13928578 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 302073604 ns 301146020.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 250383541.5 ns 250240417 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 306286250 ns 308748000 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 717294500 ns 395462625 ns 1.81
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) 1917792 ns 1916083.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) 1580771 ns 1556917 ns 1.02
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) 1574458 ns 1579625 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) 2652249.5 ns 2659291.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA 575627 ns 570148 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) 6156375 ns 6146812.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) 5936542 ns 5943834 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) 5926084 ns 5926041 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) 9429521 ns 6788041.5 ns 1.39
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA 1376631 ns 1353691.5 ns 1.02
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) 18780958 ns 18785021 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) 19122375 ns 19131625 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) 19117250 ns 19125833 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) 18883917 ns 15678041 ns 1.20
Dense(512 => 512, relu)(512 x 128)/forward/CPU/2 thread(s) 70021 ns 68937 ns 1.02
Dense(512 => 512, relu)(512 x 128)/forward/CPU/4 thread(s) 68417 ns 68625 ns 1.00
Dense(512 => 512, relu)(512 x 128)/forward/CPU/8 thread(s) 70500 ns 70792 ns 1.00
Dense(512 => 512, relu)(512 x 128)/forward/CPU/1 thread(s) 727312.5 ns 69854 ns 10.41
Dense(512 => 512, relu)(512 x 128)/forward/GPU/CUDA 47914 ns 47405.5 ns 1.01
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/2 thread(s) 355562.5 ns 287792 ns 1.24
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/4 thread(s) 325250 ns 312812.5 ns 1.04
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/8 thread(s) 326687.5 ns 280416 ns 1.17
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/1 thread(s) 2205479 ns 281521 ns 7.83
Dense(512 => 512, relu)(512 x 128)/zygote/GPU/CUDA 213681 ns 211915 ns 1.01
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/2 thread(s) 393084 ns 444500 ns 0.88
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/4 thread(s) 450084 ns 448250 ns 1.00
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/8 thread(s) 444728.5 ns 391667 ns 1.14
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/1 thread(s) 2229333 ns 357041.5 ns 6.24
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3034917 ns 3044791 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2091583.5 ns 2094645.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2284000 ns 2278916.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 6013125 ns 4567208 ns 1.32
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA 576987 ns 585440 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 23582313 ns 23578062.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18082917 ns 18085666 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 16983417 ns 16978625 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 37559208 ns 34976833 ns 1.07
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3055304 ns 2912837 ns 1.05
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 33242479.5 ns 33419374.5 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 27632833.5 ns 27788708 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 27457834 ns 27373667 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 44681646 ns 42059688 ns 1.06
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 120219875 ns 118607334 ns 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 174640416.5 ns 173693458.5 ns 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 147464708 ns 147902833 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 447824083.5 ns 108303292 ns 4.13
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5463169 ns 5451158 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 471068209 ns 470478958 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 466756437 ns 467481645.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 436859104.5 ns 434223083.5 ns 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1751773916 ns 737222479.5 ns 2.38
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 32302579.5 ns 35181339 ns 0.92
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 637667500 ns 635200500 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 664219750.5 ns 665043396 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 586420938 ns 582947041.5 ns 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1733809521 ns 731724375 ns 2.37
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) 1232125 ns 1304833 ns 0.94
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) 975229 ns 937167 ns 1.04
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) 903250 ns 903709 ns 1.00
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) 1950333 ns 2036958 ns 0.96
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA 568402.5 ns 564089 ns 1.01
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) 2957958 ns 2960625 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) 2629625 ns 2635667 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) 2593375 ns 2619417 ns 0.99
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) 7086750 ns 3698292 ns 1.92
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA 1321605.5 ns 1319613 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) 6642209 ns 6561416 ns 1.01
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) 6552584 ns 6499959 ns 1.01
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) 6490000 ns 6497875 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) 7617500 ns 4438375 ns 1.72
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/2 thread(s) 39583 ns 39271 ns 1.01
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/4 thread(s) 31291 ns 32458.5 ns 0.96
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/8 thread(s) 35041 ns 32062.5 ns 1.09
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/1 thread(s) 91458 ns 54437.5 ns 1.68
Dense(128 => 128, gelu)(128 x 128)/forward/GPU/CUDA 27908 ns 27919 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/2 thread(s) 175479.5 ns 179042 ns 0.98
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/4 thread(s) 175625 ns 175541 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/8 thread(s) 175667 ns 175167 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/1 thread(s) 273166 ns 190708.5 ns 1.43
Dense(128 => 128, gelu)(128 x 128)/zygote/GPU/CUDA 218444 ns 219938 ns 0.99
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/2 thread(s) 442021 ns 442334 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/4 thread(s) 442417 ns 463458.5 ns 0.95
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/8 thread(s) 442062.5 ns 442417 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/1 thread(s) 510625 ns 429500 ns 1.19
Dense(128 => 128, relu)(128 x 128)/forward/CPU/2 thread(s) 13375 ns 13562.5 ns 0.99
Dense(128 => 128, relu)(128 x 128)/forward/CPU/4 thread(s) 12833 ns 13437.5 ns 0.96
Dense(128 => 128, relu)(128 x 128)/forward/CPU/8 thread(s) 14187.5 ns 14416 ns 0.98
Dense(128 => 128, relu)(128 x 128)/forward/CPU/1 thread(s) 54458 ns 14375 ns 3.79
Dense(128 => 128, relu)(128 x 128)/forward/GPU/CUDA 27839 ns 28121 ns 0.99
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/2 thread(s) 25708 ns 25917 ns 0.99
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/4 thread(s) 25708 ns 25667 ns 1.00
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/8 thread(s) 25792 ns 25625 ns 1.01
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/1 thread(s) 151770.5 ns 26250 ns 5.78
Dense(128 => 128, relu)(128 x 128)/zygote/GPU/CUDA 208205 ns 209865 ns 0.99
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/2 thread(s) 46083 ns 45437.5 ns 1.01
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/4 thread(s) 45458 ns 46479.5 ns 0.98
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/8 thread(s) 45875 ns 46041 ns 1.00
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/1 thread(s) 151145.5 ns 28209 ns 5.36
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) 318485791 ns 318266167 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) 236155250 ns 238108104 ns 0.99
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) 204947249.5 ns 203733333 ns 1.01
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) 870093062.5 ns 322939875 ns 2.69
vgg16(32, 32, 3, 64)/forward/GPU/CUDA 7672854 ns 7668589 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) 1103355583.5 ns 1098692854.5 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) 951012041.5 ns 952627249.5 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) 915597916 ns 856876291 ns 1.07
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) 2647669125 ns 1173710250 ns 2.26
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA 27249547 ns 27280510.5 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/2 thread(s) 193938 ns 193124.5 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/4 thread(s) 167312.5 ns 168542 ns 0.99
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/8 thread(s) 167834 ns 168187.5 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/1 thread(s) 873291.5 ns 218458.5 ns 4.00
Dense(512 => 512, gelu)(512 x 128)/forward/GPU/CUDA 47232 ns 47292 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/2 thread(s) 1215770.5 ns 1214729 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/4 thread(s) 1097562.5 ns 1095750 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/8 thread(s) 1097292 ns 1014896 ns 1.08
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/1 thread(s) 2767479.5 ns 1504666 ns 1.84
Dense(512 => 512, gelu)(512 x 128)/zygote/GPU/CUDA 222773 ns 222578.5 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/2 thread(s) 2290771 ns 2298292 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/4 thread(s) 2230896 ns 2283250 ns 0.98
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/8 thread(s) 2223666 ns 2158334 ns 1.03
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/1 thread(s) 3710292 ns 2476833 ns 1.50
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 1586500 ns 1582437.5 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 1236041.5 ns 1264833 ns 0.98
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 1234250 ns 1174562.5 ns 1.05
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) 2225875 ns 2357375 ns 0.94
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA 574042.5 ns 571094.5 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) 3206334 ns 3197541 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 2859000 ns 2843042 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) 2838875 ns 2853458 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) 7347042 ns 3931104 ns 1.87
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA 1353212 ns 1330355 ns 1.02
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) 8838958 ns 8842250 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) 8778959 ns 8776708 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) 8989687 ns 8804292 ns 1.02
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) 9543937 ns 6342000 ns 1.50
Dense(16 => 16, relu)(16 x 128)/forward/CPU/2 thread(s) 2500 ns 4625 ns 0.54
Dense(16 => 16, relu)(16 x 128)/forward/CPU/4 thread(s) 2229 ns 2458 ns 0.91
Dense(16 => 16, relu)(16 x 128)/forward/CPU/8 thread(s) 2500 ns 2542 ns 0.98
Dense(16 => 16, relu)(16 x 128)/forward/CPU/1 thread(s) 2709 ns 2416 ns 1.12
Dense(16 => 16, relu)(16 x 128)/forward/GPU/CUDA 25189 ns 24562 ns 1.03
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/2 thread(s) 7417 ns 7125 ns 1.04
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/4 thread(s) 7083 ns 7125 ns 0.99
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/8 thread(s) 7209 ns 7417 ns 0.97
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/1 thread(s) 7250 ns 7292 ns 0.99
Dense(16 => 16, relu)(16 x 128)/zygote/GPU/CUDA 191810 ns 186417 ns 1.03
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/2 thread(s) 8667 ns 8541 ns 1.01
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/4 thread(s) 8666.5 ns 8500 ns 1.02
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/8 thread(s) 8500 ns 8709 ns 0.98
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/1 thread(s) 5916 ns 6125 ns 0.97
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/2 thread(s) 10395.5 ns 10625 ns 0.98
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/4 thread(s) 17416.5 ns 14792 ns 1.18
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/8 thread(s) 10625 ns 12000 ns 0.89
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/1 thread(s) 7458 ns 7500 ns 0.99
Dense(16 => 16, gelu)(16 x 128)/forward/GPU/CUDA 25409 ns 24702.5 ns 1.03
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/2 thread(s) 21666 ns 21458 ns 1.01
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/4 thread(s) 21500 ns 21583 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/8 thread(s) 21875 ns 22042 ns 0.99
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/1 thread(s) 21792 ns 21792 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/zygote/GPU/CUDA 200873 ns 196629 ns 1.02
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/2 thread(s) 56750 ns 56833 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/4 thread(s) 56750 ns 59166 ns 0.96
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/8 thread(s) 56750 ns 57208 ns 0.99
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/1 thread(s) 51333 ns 54542 ns 0.94
Dense(128 => 128, identity)(128 x 128)/forward/CPU/2 thread(s) 28375 ns 28687.5 ns 0.99
Dense(128 => 128, identity)(128 x 128)/forward/CPU/4 thread(s) 29125 ns 28709 ns 1.01
Dense(128 => 128, identity)(128 x 128)/forward/CPU/8 thread(s) 28916 ns 28792 ns 1.00
Dense(128 => 128, identity)(128 x 128)/forward/CPU/1 thread(s) 45875 ns 46041 ns 1.00
Dense(128 => 128, identity)(128 x 128)/forward/GPU/CUDA 26566 ns 25795 ns 1.03
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/2 thread(s) 44041 ns 44250 ns 1.00
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/4 thread(s) 44375 ns 47667 ns 0.93
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/8 thread(s) 44125 ns 44000 ns 1.00
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/1 thread(s) 145041 ns 63916 ns 2.27
Dense(128 => 128, identity)(128 x 128)/zygote/GPU/CUDA 172032.5 ns 167633.5 ns 1.03
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/2 thread(s) 68500 ns 68417 ns 1.00
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/4 thread(s) 68687.5 ns 68292 ns 1.01
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/8 thread(s) 68333 ns 68083 ns 1.00
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/1 thread(s) 145708 ns 68125 ns 2.14
Dense(16 => 16, identity)(16 x 128)/forward/CPU/2 thread(s) 2000 ns 2500 ns 0.80
Dense(16 => 16, identity)(16 x 128)/forward/CPU/4 thread(s) 1916.5 ns 1750 ns 1.10
Dense(16 => 16, identity)(16 x 128)/forward/CPU/8 thread(s) 2000 ns 1792 ns 1.12
Dense(16 => 16, identity)(16 x 128)/forward/CPU/1 thread(s) 1916 ns 1708 ns 1.12
Dense(16 => 16, identity)(16 x 128)/forward/GPU/CUDA 23659 ns 23041 ns 1.03
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/2 thread(s) 5500 ns 5375 ns 1.02
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/4 thread(s) 5125 ns 5083 ns 1.01
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/8 thread(s) 5375 ns 5416 ns 0.99
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/1 thread(s) 5375 ns 5125 ns 1.05
Dense(16 => 16, identity)(16 x 128)/zygote/GPU/CUDA 175247.5 ns 171497 ns 1.02
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/2 thread(s) 8208 ns 8375 ns 0.98
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/4 thread(s) 8375 ns 8167 ns 1.03
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/8 thread(s) 8292 ns 8208 ns 1.01
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/1 thread(s) 5291 ns 5708 ns 0.93
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 34072083 ns 34068625 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 40136625 ns 40361624.5 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 43496250 ns 43432603.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 153686583 ns 56216958.5 ns 2.73
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 2640798 ns 2631639 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 511580042 ns 453239687.5 ns 1.13
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 316303416.5 ns 319327021 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 305395375.5 ns 307674396 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 699851166 ns 506119959 ns 1.38
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 15146264 ns 15174112 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 745204500 ns 735455458 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 693254333.5 ns 706582229 ns 0.98
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 743284104.5 ns 743368604 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 1174961458.5 ns 910398833 ns 1.29
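
Reading the table: the Ratio column appears to be Current divided by Previous, so values above 1 mean the current commit was slower (for example, 2500 ns / 4625 ns ≈ 0.54). The sketch below recomputes the ratio for a few rows copied from the table and lists those at or above a cutoff. The 1.5 cutoff and the names `ROWS`, `ratio`, and `slower_than` are illustrative assumptions, not part of the benchmark workflow.

```python
# Minimal sketch, assuming Ratio = Current / Previous as the table suggests.
# The rows are copied from the table; the 1.5 cutoff is an arbitrary example.
ROWS = [
    # (benchmark name, current ns, previous ns)
    ("mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s)", 9543937, 6342000),
    ("Dense(16 => 16, relu)(16 x 128)/forward/CPU/2 thread(s)", 2500, 4625),
    ("Dense(128 => 128, identity)(128 x 128)/zygote/CPU/1 thread(s)", 145041, 63916),
    ("Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s)",
     153686583, 56216958.5),
]


def ratio(current_ns, previous_ns):
    """Return current / previous; values above 1 mean the current run was slower."""
    return current_ns / previous_ns


def slower_than(rows, threshold=1.5):
    """Return (name, ratio) pairs whose ratio is at or above the threshold."""
    flagged = []
    for name, current_ns, previous_ns in rows:
        r = ratio(current_ns, previous_ns)
        if r >= threshold:
            flagged.append((name, r))
    return flagged


if __name__ == "__main__":
    for name, r in slower_than(ROWS):
        print(f"{r:.2f}x  {name}")
```

Most of the large ratios in this table are on the 1 thread(s) rows, so a single noisy run on that configuration is worth ruling out before treating them as regressions.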

This comment was automatically generated by a workflow using github-action-benchmark.
