Commit

chore: bump version for release
avik-pal authored Aug 27, 2024
1 parent b3d21e8 commit 6683721
Showing 1 changed file with 1 addition and 1 deletion.
Project.toml: 2 changes (1 addition & 1 deletion)
@@ -1,7 +1,7 @@
name = "Lux"
uuid = "b2108857-7c20-44ae-9111-449ecde12c47"
authors = ["Avik Pal <[email protected]> and contributors"]
-version = "0.5.66-DEV"
+version = "0.5.66"

[deps]
ADTypes = "47edcb42-4c32-4615-8424-f2b9edc5f35b"

3 comments on commit 6683721

@avik-pal
Member Author


@JuliaRegistrator


Registration pull request created: JuliaRegistries/General/113958

Tip: Release Notes

Did you know you can add release notes too? Just add Markdown-formatted text underneath the comment, after the text "Release notes:", and it will be added to the registry PR; if TagBot is installed, it will also be added to the release that TagBot creates. For example:

@JuliaRegistrator register

Release notes:

## Breaking changes

- blah

To add them here, just re-invoke the command and the PR will be updated.

Tagging

After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.

This will be done automatically if the Julia TagBot GitHub Action is installed; otherwise, it can be done manually through the GitHub interface or via:

git tag -a v0.5.66 -m "<description of version>" 66837215ab889346c5031a03ea657224ee9beefc
git push origin v0.5.66

@github-actions
Contributor


Lux Benchmarks
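In the table below, Ratio is Current divided by Previous (for example, 324166 ns / 243959 ns ≈ 1.33 for the 4-thread `Dense(512 => 512, identity)` forward pass), so values above 1.00 mean commit 6683721 ran slower than e6dea49, and values below 1.00 mean it ran faster. The following is a minimal sketch for recomputing that column from pasted rows and flagging slowdowns. It assumes each row ends in `<current> ns <previous> ns <ratio>`. The `bench_ratio` function name and the 1.20 cutoff are hypothetical and are not part of the benchmark bot.

```shell
#!/bin/sh
# Hypothetical helper: recompute Ratio = Current / Previous for rows of the
# "Lux Benchmarks" table read from stdin, and flag rows above a cutoff.
# Assumes each row ends with "<current> ns <previous> ns <ratio>"; the
# benchmark name itself may contain spaces, so fields are read from the end.
# THRESHOLD (default 1.20) is an illustrative cutoff, not the bot's setting.
bench_ratio() {
  awk -v threshold="${THRESHOLD:-1.20}" '
    BEGIN { cutoff = threshold + 0 }
    # Only handle lines shaped like "... <num> ns <num> ns <ratio>".
    NF >= 5 && $(NF-3) == "ns" && $(NF-1) == "ns" {
      current  = $(NF-4) + 0
      previous = $(NF-2) + 0
      if (previous == 0) next          # avoid division by zero
      ratio = current / previous
      flag = (ratio > cutoff) ? "SLOWER" : "ok"
      printf "%.2f %s\n", ratio, flag
    }'
}

# Example: the 4-thread Dense(512 => 512, identity) forward row.
echo 'Dense(512 => 512, identity)(512 x 128)/forward/CPU/4 thread(s) 324166 ns 243959 ns 1.33' | bench_ratio
```

The example row prints `1.33 SLOWER`. Lines that do not match the row shape, such as the header, produce no output. You can change the cutoff per run by setting `THRESHOLD` in the environment, for example `THRESHOLD=1.5 bench_ratio < rows.txt`, where `rows.txt` is a hypothetical file of pasted rows.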

Benchmark suite Current: 6683721 Previous: e6dea49 Ratio
Dense(512 => 512, identity)(512 x 128)/forward/CPU/2 thread(s) 412000 ns 411750 ns 1.00
Dense(512 => 512, identity)(512 x 128)/forward/CPU/4 thread(s) 324166 ns 243959 ns 1.33
Dense(512 => 512, identity)(512 x 128)/forward/CPU/8 thread(s) 322604.5 ns 323604 ns 1.00
Dense(512 => 512, identity)(512 x 128)/forward/CPU/1 thread(s) 740459 ns 741209 ns 1.00
Dense(512 => 512, identity)(512 x 128)/forward/GPU/CUDA 44259 ns 44008 ns 1.01
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/2 thread(s) 1321500 ns 1392458 ns 0.95
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/4 thread(s) 2457333 ns 1249333 ns 1.97
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/8 thread(s) 14240875 ns 14034875 ns 1.01
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/1 thread(s) 2200812.5 ns 2247000 ns 0.98
Dense(512 => 512, identity)(512 x 128)/zygote/GPU/CUDA 208145 ns 206485 ns 1.01
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/2 thread(s) 1417146 ns 1411375 ns 1.00
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/4 thread(s) 938625 ns 949209 ns 0.99
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/8 thread(s) 1523833 ns 1539667 ns 0.99
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/1 thread(s) 2268625 ns 2262146 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1759166.5 ns 1751333.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1089459 ns 1096875 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1525291.5 ns 1541583 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2994187.5 ns 3026749.5 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA 209673 ns 209127 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12094750 ns 12111771 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 8806875 ns 8833083 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9239291 ns 9198584 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18581458 ns 18601167 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1488235.5 ns 1480357.5 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17307520.5 ns 17231270.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 13967750 ns 13987541.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14528333 ns 14519729 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21842125 ns 21836292 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 250015541.5 ns 250395646 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 148333104 ns 148855375 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 116106000 ns 115834062 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 447467333 ns 446839208 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5463022 ns 5444163 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 1184048458 ns 1176608458 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 987050292 ns 976012000 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 853459771 ns 837397979.5 ns 1.02
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1759966833 ns 1759902458 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 31128789 ns 31490812.5 ns 0.99
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1134468583 ns 1129305209 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1007793771 ns 991324229.5 ns 1.02
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1296264208 ns 1295080375.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1736536375 ns 1730828646 ns 1.00
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) 1099667 ns 1075249.5 ns 1.02
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) 1640166 ns 1662353.5 ns 0.99
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) 3652875 ns 3521959 ns 1.04
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) 787583 ns 782750 ns 1.01
lenet(28, 28, 1, 32)/forward/GPU/CUDA 266751 ns 268581 ns 0.99
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) 3019833 ns 3020312.5 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) 4176833 ns 4174708 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) 10973896 ns 11483792 ns 0.96
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) 3301792 ns 3174584 ns 1.04
lenet(28, 28, 1, 32)/zygote/GPU/CUDA 1182347 ns 1187325 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 2286833.5 ns 2334458.5 ns 0.98
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1435167 ns 1326625 ns 1.08
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1676229 ns 1671667 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 4190875 ns 4228083 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 209336.5 ns 208877 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 19397292 ns 19371042 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 16083708.5 ns 16106687.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 17390250 ns 17334333 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 25860291 ns 25864812.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1591006 ns 1587675 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 34232625 ns 33974334 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 30842208 ns 30652312 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 31266750 ns 30965958 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 36658416 ns 36591917 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 4529667 ns 4502750 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2779874.5 ns 2520667 ns 1.10
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2911875 ns 2914750 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 8371875 ns 8397770.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 421528 ns 422071 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 38980750.5 ns 38880875 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 32266104 ns 32118083 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 32323083 ns 32210354 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 51956541 ns 51886833.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2622867 ns 2617174 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 89073083 ns 88740458.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 114658792 ns 114655499.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 228331667 ns 222624292 ns 1.03
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 74438750 ns 74153520.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 268170542 ns 267012709 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 159370333 ns 156293291 ns 1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 126596000 ns 126425563 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 489032083 ns 484968208 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 7010105 ns 7022844 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 1476972354.5 ns 1472853458 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 1173299958 ns 1171430875 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 1063552375.5 ns 1066813500 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 2013745249.5 ns 2007065229.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 34689426.5 ns 34464520 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1691606583 ns 1687201334 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1531432625 ns 1531380729 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1826034958 ns 1779981833 ns 1.03
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 2222547417 ns 2205561250 ns 1.01
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) 2015500 ns 2055417 ns 0.98
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) 3021583 ns 3039333 ns 0.99
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) 8014708.5 ns 6418334 ns 1.25
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) 2428062.5 ns 2491084 ns 0.97
lenet(28, 28, 1, 128)/forward/GPU/CUDA 270747.5 ns 270182.5 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) 9364083 ns 9710917 ns 0.96
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) 12044396.5 ns 12102375 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) 24967708 ns 24324021 ns 1.03
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) 11780916.5 ns 11813792 ns 1.00
lenet(28, 28, 1, 128)/zygote/GPU/CUDA 1257862 ns 1260525.5 ns 1.00
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) 383008958 ns 379862541 ns 1.01
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) 292145604.5 ns 310947959 ns 0.94
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) 236038125 ns 239644500 ns 0.98
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) 452557520.5 ns 453270542 ns 1.00
vgg16(32, 32, 3, 32)/forward/GPU/CUDA 4850720 ns 4854774.5 ns 1.00
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) 1285445416 ns 1326926500 ns 0.97
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) 1002736292 ns 962218875 ns 1.04
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) 859907625 ns 954450208 ns 0.90
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) 1579569083 ns 1593232541 ns 0.99
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA 18356160 ns 19082921 ns 0.96
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) 1411708 ns 1392292 ns 1.01
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) 2105292 ns 1700416 ns 1.24
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) 4691958 ns 5764584 ns 0.81
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) 1379542 ns 1353979 ns 1.02
lenet(28, 28, 1, 64)/forward/GPU/CUDA 271009 ns 270953.5 ns 1.00
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) 6609291.5 ns 6765209 ns 0.98
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) 12489687.5 ns 13257604.5 ns 0.94
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) 20500208 ns 19997334 ns 1.03
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) 6137500 ns 6085271 ns 1.01
lenet(28, 28, 1, 64)/zygote/GPU/CUDA 1323036 ns 1315018 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 70468999.5 ns 70450771.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 43540000 ns 43794458 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 39560750 ns 39565125 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 132524459 ns 132519812 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1883621 ns 1877581 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 385927667 ns 383421896 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 301384020.5 ns 297391833.5 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 280325709 ns 282075208 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 535114146 ns 534360167 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 12284164 ns 12294712.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 415345667 ns 407452917 ns 1.02
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 391116104 ns 368882167 ns 1.06
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 700495875.5 ns 664901875 ns 1.05
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 710476750 ns 711106916 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) 1197297083 ns 1188807000 ns 1.01
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) 699386042 ns 829881458 ns 0.84
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) 633884500 ns 629069625 ns 1.01
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) 1862351209 ns 1864484709 ns 1.00
vgg16(32, 32, 3, 128)/forward/GPU/CUDA 12303587 ns 12531429 ns 0.98
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) 3586148500 ns 3583219562.5 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) 2755457042 ns 2743701542 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) 2695342250 ns 2801027834 ns 0.96
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) 4936198167 ns 5095250291 ns 0.97
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA 49530988 ns 49598783 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3410084 ns 3396375 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2070063 ns 2056770.5 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2514250 ns 2516292 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 6005542 ns 6032625 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 290096.5 ns 288270 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 25455000 ns 25431417 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18660291 ns 18519687.5 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 18838271 ns 18816834 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 38883542 ns 38902042 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2476777 ns 2461496 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 54080542 ns 53959750 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 79047875 ns 80411625 ns 0.98
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 172454667 ns 170419166.5 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 45387917 ns 45563604 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1659041 ns 1774500 ns 0.93
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1103229.5 ns 1086541.5 ns 1.02
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1539041.5 ns 1585833 ns 0.97
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 3029625 ns 3036958.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 212328 ns 210199.5 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12461896 ns 12515229.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9201125 ns 9203541 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9593125 ns 9648916 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18956624.5 ns 18975125 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1542747 ns 1537145.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17493333 ns 17611666 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14286958 ns 14341209 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14556291 ns 14587625.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 22159583 ns 22164499.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 70478895.5 ns 70184479 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 43472958 ns 43685354 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 39536104 ns 39447354 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 132668666.5 ns 132435229.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1893825.5 ns 1876651 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 359326208 ns 363077416 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 304163583 ns 286830229.5 ns 1.06
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 286141562.5 ns 287419458 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 623006583 ns 619680250 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 13366172 ns 13399859 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 422001604 ns 417202479 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 419915667 ns 427236167 ns 0.98
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 690603396 ns 702264812 ns 0.98
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 715494834 ns 716642625 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) 1482979 ns 1597458 ns 0.93
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) 1220187.5 ns 1041875 ns 1.17
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) 1164583 ns 1238521 ns 0.94
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) 2192333 ns 2311000 ns 0.95
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA 584838 ns 591624 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) 8835750 ns 8828125 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) 12790625 ns 13456667 ns 0.95
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) 30251979 ns 30478124.5 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) 9820166 ns 9827250 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA 1434577.5 ns 1454764 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) 16799396 ns 17855792 ns 0.94
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) 17260041.5 ns 17325084 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) 29571958 ns 28978667 ns 1.02
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) 13433125.5 ns 14301167 ns 0.94
Dense(512 => 512, relu)(512 x 128)/forward/CPU/2 thread(s) 796958.5 ns 785166.5 ns 1.02
Dense(512 => 512, relu)(512 x 128)/forward/CPU/4 thread(s) 518145.5 ns 635083 ns 0.82
Dense(512 => 512, relu)(512 x 128)/forward/CPU/8 thread(s) 1037104 ns 1023416 ns 1.01
Dense(512 => 512, relu)(512 x 128)/forward/CPU/1 thread(s) 726083 ns 724437.5 ns 1.00
Dense(512 => 512, relu)(512 x 128)/forward/GPU/CUDA 47492.5 ns 48101 ns 0.99
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/2 thread(s) 1538395.5 ns 1546042 ns 1.00
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/4 thread(s) 1012813 ns 1039334 ns 0.97
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/8 thread(s) 1870375 ns 1418437.5 ns 1.32
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/1 thread(s) 2191167 ns 2186167 ns 1.00
Dense(512 => 512, relu)(512 x 128)/zygote/GPU/CUDA 232459.5 ns 237446 ns 0.98
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/2 thread(s) 1704104 ns 1701854 ns 1.00
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/4 thread(s) 1240521 ns 1239604.5 ns 1.00
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/8 thread(s) 2293542 ns 1785437.5 ns 1.28
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/1 thread(s) 2258583 ns 2312500 ns 0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3312229 ns 3387542 ns 0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2051417 ns 2038042 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2523041 ns 2513500 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 6001521 ns 6020041 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA 282441 ns 285597 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 23984333 ns 24084791 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 17181209 ns 17173834 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 17275083.5 ns 17124375 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 37526208 ns 37508333 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2394472 ns 2411179 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 52515208.5 ns 52430334 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 78819125 ns 80022625 ns 0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 170453417 ns 168792792 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 44401084 ns 44511146 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 249599771 ns 248838833 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 148233249.5 ns 148459125 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 115724417 ns 115539229 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 453949687 ns 447599646 ns 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5444539 ns 5455438 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 1133035459 ns 1123889625 ns 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 886883229 ns 882004229 ns 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 803351208 ns 805342291 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1759720167 ns 1746632750 ns 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 28907451 ns 29283460 ns 0.99
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1060040999.5 ns 1005172333.5 ns 1.05
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 980328375 ns 985652583 ns 0.99
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1300157375 ns 1246518292 ns 1.04
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1738857750 ns 1720077833.5 ns 1.01
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) 1307791 ns 1224042 ns 1.07
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) 904958 ns 780375 ns 1.16
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) 905041.5 ns 903792 ns 1.00
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) 1946625 ns 1941500 ns 1.00
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA 564216 ns 574544.5 ns 0.98
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) 5887083 ns 5625979 ns 1.05
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) 6679291.5 ns 8687417 ns 0.77
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) 24152125 ns 23834542 ns 1.01
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) 7076792 ns 7099270.5 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA 1369695 ns 1400097 ns 0.98
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) 11316000.5 ns 10299146 ns 1.10
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) 9070125 ns 10509708.5 ns 0.86
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) 17109084 ns 16674333 ns 1.03
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) 8181250 ns 8726562.5 ns 0.94
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/2 thread(s) 484209 ns 386792 ns 1.25
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/4 thread(s) 483542 ns 494500 ns 0.98
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/8 thread(s) 2022333.5 ns 2152000 ns 0.94
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/1 thread(s) 89541 ns 88104.5 ns 1.02
Dense(128 => 128, gelu)(128 x 128)/forward/GPU/CUDA 27791 ns 27980 ns 0.99
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/2 thread(s) 385084 ns 339145.5 ns 1.14
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/4 thread(s) 431542 ns 436062.5 ns 0.99
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/8 thread(s) 4556458.5 ns 4118937.5 ns 1.11
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/1 thread(s) 262041.5 ns 261500 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/zygote/GPU/CUDA 219615 ns 223966.5 ns 0.98
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/2 thread(s) 707167 ns 643312.5 ns 1.10
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/4 thread(s) 710063 ns 709125 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/8 thread(s) 1060083.5 ns 884729 ns 1.20
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/1 thread(s) 444625 ns 445958 ns 1.00
Dense(128 => 128, relu)(128 x 128)/forward/CPU/2 thread(s) 431166.5 ns 331291 ns 1.30
Dense(128 => 128, relu)(128 x 128)/forward/CPU/4 thread(s) 425375 ns 438416 ns 0.97
Dense(128 => 128, relu)(128 x 128)/forward/CPU/8 thread(s) 743313 ns 601249.5 ns 1.24
Dense(128 => 128, relu)(128 x 128)/forward/CPU/1 thread(s) 54583 ns 53958 ns 1.01
Dense(128 => 128, relu)(128 x 128)/forward/GPU/CUDA 27936 ns 28317 ns 0.99
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/2 thread(s) 339062.5 ns 277271 ns 1.22
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/4 thread(s) 321875 ns 319417 ns 1.01
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/8 thread(s) 869459 ns 679875.5 ns 1.28
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/1 thread(s) 154000 ns 153292 ns 1.00
Dense(128 => 128, relu)(128 x 128)/zygote/GPU/CUDA 204132 ns 208854.5 ns 0.98
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/2 thread(s) 403687.5 ns 344375 ns 1.17
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/4 thread(s) 385604.5 ns 389750 ns 0.99
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/8 thread(s) 1000750 ns 870042 ns 1.15
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/1 thread(s) 174292 ns 174063 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) 605442833 ns 602013959 ns 1.01
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) 433469166.5 ns 430551375 ns 1.01
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) 371264875 ns 375744687.5 ns 0.99
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) 873652000 ns 873334145.5 ns 1.00
vgg16(32, 32, 3, 64)/forward/GPU/CUDA 7020347 ns 7025763 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) 2058909438 ns 2078756625 ns 0.99
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) 1575303375 ns 1607808875 ns 0.98
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) 1603582812.5 ns 1638666770.5 ns 0.98
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) 2764206375 ns 2782335083 ns 0.99
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA 25821907 ns 25908154 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/2 thread(s) 521958 ns 518875 ns 1.01
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/4 thread(s) 433458 ns 395396 ns 1.10
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/8 thread(s) 1910708.5 ns 1924520.5 ns 0.99
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/1 thread(s) 868375.5 ns 865667 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/forward/GPU/CUDA 46967 ns 46907 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/2 thread(s) 1805874.5 ns 1851083 ns 0.98
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/4 thread(s) 2512750.5 ns 1779896 ns 1.41
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/8 thread(s) 14720729.5 ns 14384125 ns 1.02
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/1 thread(s) 2772083 ns 2660187.5 ns 1.04
Dense(512 => 512, gelu)(512 x 128)/zygote/GPU/CUDA 246589 ns 247071 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/2 thread(s) 2754708 ns 2699458.5 ns 1.02
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/4 thread(s) 2247625 ns 2245375 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/8 thread(s) 3885729 ns 3691833 ns 1.05
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/1 thread(s) 3383500 ns 3398958 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 1492500 ns 1486833.5 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 1177000 ns 933395.5 ns 1.26
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 1171916 ns 1185562.5 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) 2316833 ns 2210083 ns 1.05
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA 585700 ns 550416 ns 1.06
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) 5795062 ns 5783000 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 4639479 ns 7999604 ns 0.58
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) 24514833 ns 23905084 ns 1.03
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) 7281062.5 ns 7315812.5 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA 1345001 ns 1359665.5 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) 12960708 ns 12501603.5 ns 1.04
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) 10744333 ns 12176000 ns 0.88
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) 20566896 ns 20858687.5 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) 10662416.5 ns 10743417 ns 0.99
Dense(16 => 16, relu)(16 x 128)/forward/CPU/2 thread(s) 2937.5 ns 3916.5 ns 0.75
Dense(16 => 16, relu)(16 x 128)/forward/CPU/4 thread(s) 3083 ns 2875 ns 1.07
Dense(16 => 16, relu)(16 x 128)/forward/CPU/8 thread(s) 3792 ns 5104.5 ns 0.74
Dense(16 => 16, relu)(16 x 128)/forward/CPU/1 thread(s) 2666 ns 2645.5 ns 1.01
Dense(16 => 16, relu)(16 x 128)/forward/GPU/CUDA 24904 ns 22876 ns 1.09
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/2 thread(s) 8583 ns 8541 ns 1.00
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/4 thread(s) 9354 ns 8500 ns 1.10
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/8 thread(s) 9000 ns 8583.5 ns 1.05
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/1 thread(s) 8708 ns 8625 ns 1.01
Dense(16 => 16, relu)(16 x 128)/zygote/GPU/CUDA 208868.5 ns 209808.5 ns 1.00
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/2 thread(s) 16770.5 ns 16667 ns 1.01
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/4 thread(s) 16667 ns 16875 ns 0.99
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/8 thread(s) 17000 ns 16708 ns 1.02
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/1 thread(s) 10958 ns 10792 ns 1.02
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/2 thread(s) 10270.5 ns 10166.5 ns 1.01
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/4 thread(s) 13333 ns 15625 ns 0.85
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/8 thread(s) 10917 ns 11375 ns 0.96
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/1 thread(s) 7458 ns 7625 ns 0.98
Dense(16 => 16, gelu)(16 x 128)/forward/GPU/CUDA 24820 ns 24722 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/2 thread(s) 22417 ns 22270.5 ns 1.01
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/4 thread(s) 22333 ns 22333 ns 1
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/8 thread(s) 22667 ns 22667 ns 1
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/1 thread(s) 22500 ns 22500 ns 1
Dense(16 => 16, gelu)(16 x 128)/zygote/GPU/CUDA 229875 ns 230109 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/2 thread(s) 52375 ns 52292 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/4 thread(s) 52334 ns 52458 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/8 thread(s) 52812.5 ns 52250 ns 1.01
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/1 thread(s) 44146 ns 43916 ns 1.01
Dense(128 => 128, identity)(128 x 128)/forward/CPU/2 thread(s) 29354.5 ns 28708 ns 1.02
Dense(128 => 128, identity)(128 x 128)/forward/CPU/4 thread(s) 29292 ns 29083 ns 1.01
Dense(128 => 128, identity)(128 x 128)/forward/CPU/8 thread(s) 29562.5 ns 29708 ns 1.00
Dense(128 => 128, identity)(128 x 128)/forward/CPU/1 thread(s) 46270.5 ns 46250 ns 1.00
Dense(128 => 128, identity)(128 x 128)/forward/GPU/CUDA 25924 ns 25756.5 ns 1.01
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/2 thread(s) 210354 ns 207687.5 ns 1.01
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/4 thread(s) 259833.5 ns 259000 ns 1.00
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/8 thread(s) 4170916 ns 4070042 ns 1.02
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/1 thread(s) 147875 ns 147583 ns 1.00
Dense(128 => 128, identity)(128 x 128)/zygote/GPU/CUDA 216626 ns 223667.5 ns 0.97
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/2 thread(s) 309041 ns 309125 ns 1.00
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/4 thread(s) 284542 ns 289833 ns 0.98
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/8 thread(s) 751146 ns 766104 ns 0.98
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/1 thread(s) 161083 ns 161834 ns 1.00
Dense(16 => 16, identity)(16 x 128)/forward/CPU/2 thread(s) 2125 ns 2000 ns 1.06
Dense(16 => 16, identity)(16 x 128)/forward/CPU/4 thread(s) 4125 ns 2292 ns 1.80
Dense(16 => 16, identity)(16 x 128)/forward/CPU/8 thread(s) 2583 ns 4416 ns 0.58
Dense(16 => 16, identity)(16 x 128)/forward/CPU/1 thread(s) 1791.5 ns 2083 ns 0.86
Dense(16 => 16, identity)(16 x 128)/forward/GPU/CUDA 23179 ns 22925 ns 1.01
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/2 thread(s) 7375 ns 7604.5 ns 0.97
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/4 thread(s) 7792 ns 7208 ns 1.08
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/8 thread(s) 8000 ns 7542 ns 1.06
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/1 thread(s) 7417 ns 7333 ns 1.01
Dense(16 => 16, identity)(16 x 128)/zygote/GPU/CUDA 268810.5 ns 270317 ns 0.99
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/2 thread(s) 11500 ns 11541 ns 1.00
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/4 thread(s) 11375 ns 11542 ns 0.99
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/8 thread(s) 11834 ns 11542 ns 1.03
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/1 thread(s) 7125 ns 7125 ns 1
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 79982812 ns 79878271 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 49064104 ns 47895875 ns 1.02
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 44980062 ns 44952396.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 151479750 ns 151396791 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 2652652.5 ns 2712095.5 ns 0.98
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 605980917 ns 498390209 ns 1.22
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 414050750 ns 410182750 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 397397813 ns 398143833 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 685097750 ns 683908500 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 14638421 ns 14599220 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 686689124.5 ns 686490667 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 677754209 ns 660533166 ns 1.03
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 1006185875 ns 1012950542 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 999323084 ns 997113125 ns 1.00
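
The ratio column appears to be the first timing divided by the second. For example, 1177000 / 933395.5 ≈ 1.26, and 4639479 / 7999604 ≈ 0.58. This section does not include the table header. The sketch below therefore assumes that the first column is the current commit and the second is the previous one, following the usual github-action-benchmark layout. Under that reading, a ratio above 1 means the current run is slower.

The Python helper parses rows in the flattened format shown above and recomputes the ratio. It then flags rows whose slowdown exceeds a threshold. The `flag_slowdowns` function and its 1.2 default threshold are hypothetical choices for illustration. They are not part of github-action-benchmark's configuration.

```python
import re

# Assumed row layout (header not shown in this section):
#   "<benchmark name> <current> ns <previous> ns <ratio>"
ROW = re.compile(
    r"^(?P<name>.+?) (?P<cur>\d+(?:\.\d+)?) ns "
    r"(?P<prev>\d+(?:\.\d+)?) ns (?P<ratio>\d+(?:\.\d+)?)$"
)


def parse_row(line):
    """Split one flattened benchmark row into (name, current, previous, ratio)."""
    m = ROW.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized row: {line!r}")
    return m["name"], float(m["cur"]), float(m["prev"]), float(m["ratio"])


def flag_slowdowns(lines, threshold=1.2):
    """Return (name, recomputed ratio) for rows where current/previous > threshold.

    Times are in ns, so smaller is better and a ratio above 1 is a slowdown.
    The threshold is a hypothetical value, not one taken from the CI config.
    """
    flagged = []
    for line in lines:
        name, cur, prev, _reported = parse_row(line)
        ratio = cur / prev
        if ratio > threshold:
            flagged.append((name, round(ratio, 2)))
    return flagged


rows = [
    "mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 1177000 ns 933395.5 ns 1.26",
    "mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 4639479 ns 7999604 ns 0.58",
    "Dense(16 => 16, identity)(16 x 128)/forward/CPU/4 thread(s) 4125 ns 2292 ns 1.80",
]
print(flag_slowdowns(rows))
```

Sub-microsecond CPU timings such as the `Dense(16 => 16, identity)` rows tend to be noisy. A single flagged ratio there is weaker evidence of a regression than the same ratio on a multi-millisecond benchmark.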

This comment was automatically generated by workflow using github-action-benchmark.