fix: make enzyme testing opt-in for now #1041

Merged 2 commits on Nov 5, 2024

avik-pal (Member) commented on Nov 5, 2024

No description provided.

github-actions bot (Contributor) left a comment

Lux Benchmarks
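
Each row of the table that follows lists a benchmark name, the current and previous timings in nanoseconds, and a ratio that appears to be Current divided by Previous, rounded to two decimals (values above 1 mean the current commit was slower). The sketch below checks that reading against three rows copied from the table. The helper names (`parse_row`, `recomputed_ratio`, `flag_slowdowns`) and the 1.2 threshold are assumptions made for illustration and are not part of the benchmark tooling.

```python
import re

# Row layout assumed from the table: "<name> <current> ns <previous> ns <ratio>"
ROW = re.compile(
    r"^(?P<name>.+) (?P<cur>[\d.]+) ns (?P<prev>[\d.]+) ns (?P<ratio>[\d.]+)$"
)

# Three rows copied verbatim from the benchmark table.
SAMPLE_ROWS = [
    "layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 4292 ns 4270.5 ns 1.01",
    "bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 3000 ns 1417 ns 2.12",
    "layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 4167 ns 6125 ns 0.68",
]


def parse_row(line):
    """Split one table row into (name, current_ns, previous_ns, reported_ratio)."""
    m = ROW.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized row: {line!r}")
    return m["name"], float(m["cur"]), float(m["prev"]), float(m["ratio"])


def recomputed_ratio(current_ns, previous_ns):
    """Current / Previous, rounded to two decimals like the Ratio column."""
    return round(current_ns / previous_ns, 2)


def flag_slowdowns(lines, threshold=1.2):
    """Return names of rows whose reported ratio exceeds a hypothetical threshold."""
    flagged = []
    for line in lines:
        name, _cur, _prev, ratio = parse_row(line)
        if ratio > threshold:
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    for line in SAMPLE_ROWS:
        name, cur, prev, ratio = parse_row(line)
        print(name, ratio, recomputed_ratio(cur, prev))
    print(flag_slowdowns(SAMPLE_ROWS))
```

Many of the timings are in the sub-microsecond to microsecond range, so a single large ratio may reflect measurement noise rather than a real regression; any threshold chosen here is a judgment call.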

Benchmark suite Current: ba4dc25 Previous: 900c21c Ratio (Current / Previous)
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 4292 ns 4270.5 ns 1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4250 ns 4000 ns 1.06
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 5875 ns 5875 ns 1
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4500 ns 4895.5 ns 0.92
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 59809 ns 59833 ns 1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 10375 ns 10375 ns 1
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 10250 ns 9958 ns 1.03
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 10833 ns 10792 ns 1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 10208 ns 10125 ns 1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 422458 ns 422438 ns 1.00
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1042 ns 1083 ns 0.96
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 1083 ns 1000 ns 1.08
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 3000 ns 1417 ns 2.12
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 1145.5 ns 1125 ns 1.02
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 18107 ns 18109 ns 1.00
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 4125 ns 4166 ns 0.99
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 4084 ns 4125 ns 0.99
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4292 ns 4187.5 ns 1.02
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 4083 ns 4042 ns 1.01
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 109577.5 ns 109209 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 56083 ns 57645.5 ns 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46583 ns 47000 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46333 ns 38125 ns 1.22
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 80250 ns 82084 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 37793 ns 37455 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2041292 ns 1973687 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2085750 ns 2089416 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2063854 ns 2085625 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1996187.5 ns 1985813 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 199313 ns 195917 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 148334 ns 146416.5 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 146833 ns 147020.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 147750 ns 145667 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 144437.5 ns 145604.5 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 166164 ns 166391 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1151500 ns 1129209 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1119479.5 ns 1126375 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1113896.5 ns 1147667 ns 0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1116708 ns 1104209 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 534203.5 ns 521058.5 ns 1.03
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3979 ns 3416.5 ns 1.16
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 3375 ns 3333 ns 1.01
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 5250 ns 6333 ns 0.83
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3625 ns 3250 ns 1.12
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 68253.5 ns 66594 ns 1.02
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9792 ns 8792 ns 1.11
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8916 ns 9291 ns 0.96
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9417 ns 9250 ns 1.02
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8958 ns 9292 ns 0.96
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 503513.5 ns 493812 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 15125 ns 14750 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 15084 ns 15458 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 18187 ns 19167 ns 0.95
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 15125 ns 16437.5 ns 0.92
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 55485 ns 53833 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 221124.5 ns 215416.5 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 213541 ns 213208.5 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 215104 ns 214271 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 212854.5 ns 227104 ns 0.94
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 277108.5 ns 271460 ns 1.02
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 625 ns 542 ns 1.15
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 583 ns 625 ns 0.93
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 792 ns 792 ns 1
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 583 ns 583 ns 1
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 17961 ns 17470 ns 1.03
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1583 ns 1750 ns 0.90
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1666 ns 1417 ns 1.18
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1875 ns 1709 ns 1.10
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1542 ns 1645.5 ns 0.94
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 104946 ns 101826.5 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 6792 ns 7250 ns 0.94
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5917 ns 5916 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6000 ns 5292 ns 1.13
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9792 ns 10000 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 23774 ns 23857.5 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 228625 ns 226895.5 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 229041.5 ns 230375 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 230250 ns 231584 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 255250 ns 258625 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 172526.5 ns 167659 ns 1.03
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 3875 ns 3875 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 3875 ns 3875 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 3875 ns 3916 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 3834 ns 3833 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 23855 ns 23468 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16625 ns 16750 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16666 ns 17042 ns 0.98
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16916 ns 17000 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16666 ns 16625 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 165483.5 ns 160597 ns 1.03
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 571416 ns 572166 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 575792 ns 575000 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 603292 ns 587458 ns 1.03
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 574167 ns 578334 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 113501.5 ns 113397 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1437833 ns 1421708 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1421291 ns 1420125 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1456333 ns 1430083 ns 1.02
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 1422000 ns 1413292 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 215195 ns 209669.5 ns 1.03
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) 1058416 ns 1074458 ns 0.99
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) 960521 ns 958250.5 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) 1353667 ns 1334396 ns 1.01
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) 1293917 ns 1310875 ns 0.99
lenet(28, 28, 1, 64)/forward/GPU/CUDA 278078 ns 269120.5 ns 1.03
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) 5810729.5 ns 5769437 ns 1.01
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) 4595146 ns 4470625 ns 1.03
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) 4941375 ns 4941021 ns 1.00
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) 5508438 ns 5552042 ns 0.99
lenet(28, 28, 1, 64)/zygote/GPU/CUDA 1097372.5 ns 1066489 ns 1.03
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 500 ns 500 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 542 ns 542 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 583 ns 542 ns 1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 542 ns 500 ns 1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 24077 ns 23585 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2084 ns 2083 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2208 ns 2167 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2167 ns 2250 ns 0.96
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2125 ns 2125 ns 1
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 174630 ns 169900 ns 1.03
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 4334 ns 4084 ns 1.06
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6500 ns 6250 ns 1.04
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 6833 ns 7209 ns 0.95
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 4167 ns 6125 ns 0.68
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 66416 ns 64199 ns 1.03
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11459 ns 11083 ns 1.03
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11542 ns 11625 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12000 ns 12000 ns 1
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 11250 ns 10917 ns 1.03
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 455229 ns 446167.5 ns 1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7208 ns 6042 ns 1.19
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7375 ns 7042 ns 1.05
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7500 ns 8833 ns 0.85
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5875 ns 7250 ns 0.81
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 53888 ns 51074.5 ns 1.06
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 17250 ns 17292 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 17542 ns 18334 ns 0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 18042 ns 18083 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 16417 ns 17229.5 ns 0.95
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 308501.5 ns 299895.5 ns 1.03
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 500 ns 459 ns 1.09
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 542 ns 542 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 584 ns 542 ns 1.08
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 500 ns 500 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 33132 ns 32630 ns 1.02
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 8833 ns 8458 ns 1.04
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 8875 ns 9041 ns 0.98
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9187.5 ns 9166 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9083 ns 8459 ns 1.07
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 162463.5 ns 158907 ns 1.02
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 64875 ns 64625 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 64667 ns 64250 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 64083 ns 65000 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 64583 ns 64667 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 112830.5 ns 111460 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 287583.5 ns 289667 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 281083 ns 279750 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 286167 ns 289625 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 278000 ns 281250 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 188746 ns 184453.5 ns 1.02
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) 3222417 ns 3347125 ns 0.96
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) 3053833 ns 3015520.5 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) 3024458 ns 2792979 ns 1.08
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) 4052750 ns 4064520.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA 595462.5 ns 588037 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) 7598375 ns 7500166 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) 7440500 ns 7470229.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) 7201375 ns 7393937.5 ns 0.97
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) 8195250 ns 8209000 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA 1370693 ns 1331630 ns 1.03
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) 19217250 ns 19529541 ns 0.98
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) 19124458 ns 19142959 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) 19130959 ns 19022708 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) 15708959 ns 15703750 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23901937.5 ns 23617083 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 33837333 ns 33598208 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37088812.5 ns 41100666 ns 0.90
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34986334 ns 35022333 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1855403 ns 1855178.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 191718583.5 ns 189352250 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 164425125 ns 163568208 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 152408416 ns 158452896 ns 0.96
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 441427250 ns 438607167 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 13919509 ns 13925600.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 292378833.5 ns 287704167 ns 1.02
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 338895729.5 ns 337952937.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 298620583 ns 291466708 ns 1.02
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 395610896 ns 395696000 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 23042 ns 21334 ns 1.08
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 24625 ns 24375 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 25375 ns 25771 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 24333 ns 23584 ns 1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 97657 ns 95861 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 104125 ns 103625 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 115458 ns 103708 ns 1.11
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 104958 ns 104625 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 102916 ns 103479.5 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 512696.5 ns 510517.5 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6750 ns 5750 ns 1.17
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7500 ns 7208 ns 1.04
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7667 ns 7666.5 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5750 ns 7166 ns 0.80
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 69092 ns 68604 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 15541 ns 14708 ns 1.06
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15792 ns 15916 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 16541 ns 16666 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14834 ns 14667 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 482365 ns 483804.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 2981291.5 ns 2876500 ns 1.04
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2052208 ns 2063833 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2263979 ns 2288208 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4755625 ns 4870416 ns 0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA 583663.5 ns 587700 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 23786792 ns 23421375 ns 1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18053854.5 ns 17990750 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 17489458.5 ns 18312792 ns 0.96
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 34946750 ns 35646292 ns 0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3162391.5 ns 3104605 ns 1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 33775833 ns 33240625 ns 1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 27501917 ns 27662417 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 27490000 ns 27837459 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 41729750 ns 41788833 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 74437.5 ns 72083 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 76083 ns 78729 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 76541 ns 75729.5 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 74292 ns 72459 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 100952 ns 100762.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 207625 ns 204458 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 207167 ns 219041 ns 0.95
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 293750 ns 320458 ns 0.92
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 225125 ns 205312.5 ns 1.10
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 544412.5 ns 541454.5 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 12125 ns 11333 ns 1.07
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 12875 ns 12416 ns 1.04
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 13625 ns 13834 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 11750 ns 13125 ns 0.90
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 69795 ns 69856.5 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 26667 ns 26520.5 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 27333 ns 27458 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 28208 ns 28291 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26625 ns 26500 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 470671.5 ns 473341 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 12708 ns 11833 ns 1.07
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 13375 ns 12750 ns 1.05
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 13375 ns 14333 ns 0.93
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 12584 ns 13375 ns 0.94
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 53177 ns 51587 ns 1.03
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 26167 ns 26375 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 26792 ns 26583 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 26000 ns 26666 ns 0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 27000 ns 26417 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 301560.5 ns 302777.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 180479 ns 178666.5 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 182041 ns 180292 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 182125 ns 184416.5 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 179687.5 ns 179709 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 56093 ns 55677 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 586125 ns 591146.5 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 587625 ns 588583 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 594291 ns 593062 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 582063 ns 582708.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 282917.5 ns 285027 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6417 ns 5667 ns 1.13
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7084 ns 7167 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7750 ns 7895.5 ns 0.98
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6125 ns 7291 ns 0.84
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 69623 ns 69657.5 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14458 ns 14167 ns 1.02
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14958 ns 14958 ns 1
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15792 ns 15854.5 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14625 ns 14583 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 458284 ns 460443 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 1163875 ns 1194208.5 ns 0.97
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 1217875 ns 1216792 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 1274792 ns 1262604 ns 1.01
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 1326000 ns 1318166.5 ns 1.01
batchedmm(512, Bsize=4)/forward/GPU/CUDA 302247 ns 301559 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 4269354 ns 4098416 ns 1.04
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 4347708 ns 4352937.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 4646208 ns 4631875 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 4438396 ns 4436562.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1034297 ns 1042661.5 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1792 ns 1750 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1834 ns 1833 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1833 ns 1834 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1875 ns 1875 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 22933 ns 23523 ns 0.97
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4792 ns 4792 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4958 ns 4875 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 4916 ns 4916 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4875 ns 4875 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 186651 ns 187370 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6208 ns 5500 ns 1.13
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6583 ns 6334 ns 1.04
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6958 ns 8604 ns 0.81
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 6291.5 ns 7292 ns 0.86
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 55052.5 ns 54466 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10917 ns 10958 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 11833 ns 11792 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 11792 ns 11708.5 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 11167 ns 11166 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 327658.5 ns 330839 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 250 ns 292 ns 0.86
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 334 ns 333 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 375 ns 333 ns 1.13
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 22442 ns 22873.5 ns 0.98
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2708 ns 2708 ns 1
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2708 ns 2959 ns 0.92
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 3042 ns 3042 ns 1
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2791 ns 2750 ns 1.01
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 156880 ns 157537.5 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 13291.5 ns 10750 ns 1.24
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 13667 ns 13708 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 13937.5 ns 14958 ns 0.93
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 11625 ns 14583 ns 0.80
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 55944 ns 55574.5 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 25250 ns 25209 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 25042 ns 25250 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25417 ns 25375 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 25250 ns 24979.5 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 287796.5 ns 292656 ns 0.98
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4125 ns 4208 ns 0.98
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4167 ns 4125 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4167 ns 4167 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4167 ns 4167 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 24570 ns 24774 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 15917 ns 16333 ns 0.97
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16208 ns 16125 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16375 ns 16125 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16167 ns 16084 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 191969 ns 195031.5 ns 0.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5792 ns 5708 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5750 ns 5750 ns 1
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5709 ns 5750 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5792 ns 5709 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 33007 ns 33326 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 20666 ns 21125 ns 0.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 20916 ns 20875 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 21625 ns 21583 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 21125 ns 21500 ns 0.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 174227.5 ns 175195.5 ns 0.99
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 379416 ns 415708 ns 0.91
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 377958 ns 376667 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 488875 ns 471499.5 ns 1.04
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 520084 ns 523500 ns 0.99
batchedmm(16, Bsize=512)/forward/GPU/CUDA 66431 ns 66680.5 ns 1.00
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 918792 ns 924750.5 ns 0.99
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 848084 ns 849291 ns 1.00
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 1232167 ns 1217521 ns 1.01
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 1323208 ns 1302292 ns 1.02
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 189370.5 ns 189339 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 81291.5 ns 79792 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 82625 ns 82667 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 81292 ns 84208 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83000 ns 82833 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 192966.5 ns 193132 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1932167 ns 1917625.5 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1699958.5 ns 1915292 ns 0.89
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1916125 ns 1940917 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1917083 ns 1896541 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 391191 ns 395963 ns 0.99
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 333 ns 292 ns 1.14
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 333 ns 0.88
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 21486 ns 21798 ns 0.99
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1792 ns 1792 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1875 ns 1875 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1834 ns 1834 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1792 ns 1792 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 164447 ns 167505 ns 0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6750 ns 5834 ns 1.16
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 6875 ns 7500 ns 0.92
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7875 ns 9958 ns 0.79
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6667 ns 6875 ns 0.97
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 55875 ns 58244.5 ns 0.96
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9166 ns 9375 ns 0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9292 ns 9333 ns 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9500 ns 9354.5 ns 1.02
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9541 ns 9625 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 292456.5 ns 302935 ns 0.97
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 156951208.5 ns 119443416.5 ns 1.31
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 174291584 ns 173896250 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 148195729.5 ns 155811625 ns 0.95
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 104188250 ns 108054541 ns 0.96
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5474066 ns 5469386 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 674441875 ns 616746166.5 ns 1.09
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 556229250 ns 555745625 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 454132562.5 ns 468855125 ns 0.97
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 761515396 ns 760571396 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 35100772 ns 34956216 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 700454083 ns 648663875 ns 1.08
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 666098604.5 ns 664591146 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 581200271 ns 601178041.5 ns 0.97
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 737997250 ns 746069334 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 56334 ns 59458 ns 0.95
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 47459 ns 47083 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47792 ns 39166 ns 1.22
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83542 ns 83208 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 36978 ns 37582 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1938916.5 ns 1926708 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1976604 ns 1983042 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1978541.5 ns 1986937.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1893667 ns 1850250 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 173150 ns 173017.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 268750.5 ns 265187.5 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 274583 ns 267959 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 292333.5 ns 276771 ns 1.06
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 265458 ns 266917 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 117998.5 ns 128834.5 ns 0.92
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 694354 ns 604083 ns 1.15
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 683917 ns 692833.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 693250 ns 705709 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 656583.5 ns 590291.5 ns 1.11
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 652056.5 ns 683429 ns 0.95
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2193104.5 ns 2195333 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2181208 ns 2225625 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2216521 ns 2230583 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2184458 ns 2183333 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 132517 ns 133325.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5563750 ns 5480833 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5510667 ns 5508958 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5513000 ns 5585895.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5481750.5 ns 5490125 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 695298 ns 766206 ns 0.91
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 638500 ns 646750 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 639292 ns 660250 ns 0.97
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 634541 ns 642917 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 646979.5 ns 647375 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 46553 ns 47306 ns 0.98
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1793958 ns 1828875 ns 0.98
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1725834 ns 1721042 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1749625 ns 1665209 ns 1.05
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 2104250 ns 2097000 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 220666.5 ns 223896.5 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 56916 ns 58667 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 47125 ns 47750 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46125 ns 38958 ns 1.18
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83625 ns 82750 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 28292 ns 29191 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2045604.5 ns 2029083.5 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2090729.5 ns 2091166 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2082062.5 ns 2107249.5 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1996375 ns 1994854.5 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 188293 ns 190986 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 13451791 ns 13371291 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 12408750 ns 12436583.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 12579833.5 ns 12675625 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 15153250 ns 15146959 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 515861.5 ns 517535.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 47609667 ns 47259416 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 41845375 ns 41746209 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 41044270.5 ns 41384750 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 58540625 ns 58440500 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3200949 ns 3203835 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 74073166.5 ns 73984667 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 90759458 ns 91223791.5 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 90813833 ns 90609938 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 75984750 ns 77234000 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57000 ns 59000 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 47292 ns 47417 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47541 ns 38917 ns 1.22
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 79042 ns 81125 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 47481 ns 47741 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1933833.5 ns 1911646 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1966833 ns 1970541 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1973625 ns 1976417 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1890208 ns 1882083 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 194959.5 ns 195868.5 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 333 ns 292 ns 1.14
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 416 ns 375 ns 1.11
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 334 ns 333 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 32490 ns 32615 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6458.5 ns 6500 ns 0.99
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6459 ns 6375 ns 1.01
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6750 ns 6750 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6500 ns 6375 ns 1.02
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 170804 ns 176818 ns 0.97
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 250 ns 292 ns 0.86
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 291 ns 292 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 291 ns 292 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 291 ns 250 ns 1.16
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 32160 ns 32102 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 2625 ns 2625 ns 1
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 2792 ns 2875 ns 0.97
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 2958 ns 2916 ns 1.01
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 2708 ns 2625 ns 1.03
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 158087.5 ns 164236.5 ns 0.96
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 321982541.5 ns 286096229 ns 1.13
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 339702500 ns 339570541 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 314391875 ns 321242167 ns 0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 275821875 ns 271493208 ns 1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 7045338.5 ns 7111512 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 1051877520.5 ns 987492667 ns 1.07
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 940277375 ns 939040416 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 851686187.5 ns 868433209 ns 0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1169867167 ns 1162204042 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 34049288.5 ns 34040446 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1358421562.5 ns 1310851000.5 ns 1.04
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1687563625 ns 1685402625 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1641183208 ns 1648347125 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1312669833.5 ns 1310788750 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1410229.5 ns 1412625 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1412333.5 ns 1412041.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1411500 ns 1424625 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1413395.5 ns 1408334 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 128095 ns 128501 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5056000 ns 5028875 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5011833.5 ns 5030104 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5027604 ns 5062042 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5018979.5 ns 5014021 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 518546 ns 597004.5 ns 0.87
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) 169177292 ns 168008834 ns 1.01
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) 131651145.5 ns 130299417 ns 1.01
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) 129556583 ns 148283479 ns 0.87
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) 164279437.5 ns 161948354 ns 1.01
vgg16(32, 32, 3, 32)/forward/GPU/CUDA 4876595.5 ns 5052268 ns 0.97
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) 683143667 ns 662817209 ns 1.03
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) 646211666 ns 492884417 ns 1.31
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) 511752458 ns 507367709 ns 1.01
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) 843952333 ns 678320708 ns 1.24
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA 16179217 ns 17294527 ns 0.94
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 9035354 ns 8884604 ns 1.02
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 8674584 ns 8801959 ns 0.99
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 7869708.5 ns 8221541.5 ns 0.96
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 10183812.5 ns 10127167 ns 1.01
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1610987 ns 1611762 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 36781834 ns 36027125 ns 1.02
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 36671271 ns 36933063 ns 0.99
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 33328542 ns 34547750 ns 0.96
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 38854917 ns 38824854 ns 1.00
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6453307 ns 6452267 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 47375 ns 47375 ns 1
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 47292 ns 47250 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 47833 ns 47542 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 47417 ns 47333 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 19398 ns 19020 ns 1.02
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 50250 ns 50312.5 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 50437.5 ns 50500 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 50916 ns 50958.5 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 50291 ns 50333 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 162418.5 ns 226580 ns 0.72
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7083 ns 6542 ns 1.08
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 7500 ns 7187.5 ns 1.04
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 8187 ns 9083 ns 0.90
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7083.5 ns 8625 ns 0.82
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 74691 ns 117383.5 ns 0.64
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10083 ns 9625 ns 1.05
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10375 ns 10208 ns 1.02
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10375 ns 10333.5 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10041 ns 10209 ns 0.98
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 439883 ns 723908.5 ns 0.61
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6459 ns 6083 ns 1.06
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 8333 ns 8250 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 8417 ns 9417 ns 0.89
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5792 ns 8375 ns 0.69
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 81299 ns 157024.5 ns 0.52
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12958 ns 13292 ns 0.97
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 13417 ns 13792 ns 0.97
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13542 ns 13708 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12958 ns 12834 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 396122 ns 618769 ns 0.64
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 959 ns 1042 ns 0.92
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 1042 ns 1042 ns 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 1083 ns 1042 ns 1.04
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 1083 ns 1083 ns 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 32528 ns 32863 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7834 ns 7875 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8166 ns 8000 ns 1.02
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8375 ns 8208 ns 1.02
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8250 ns 8250 ns 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 193157 ns 246953.5 ns 0.78
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 23041 ns 25062.5 ns 0.92
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 23250 ns 23291.5 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 23500 ns 23542 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 23459 ns 23250 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 18862 ns 18661 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 52583 ns 52625 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 52666 ns 52833 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 52959 ns 52875 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 52125 ns 52333 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 224379 ns 364018 ns 0.62
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1397834 ns 1403750 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1406042 ns 1451354 ns 0.97
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1408354.5 ns 1407542 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1398208.5 ns 1406458 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 196210 ns 196760 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5042250 ns 5023250 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5015250 ns 5018687.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5019958 ns 5042125 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5014375 ns 5001750 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 538689.5 ns 766930 ns 0.70
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3017625 ns 3048708 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2071416 ns 2082646 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2284167 ns 2300125 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4852104.5 ns 4855000 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 579661 ns 583278 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 24727104 ns 24263250 ns 1.02
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18832875.5 ns 18905459 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 18936771 ns 19193375 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 36502000 ns 36575416 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3184962 ns 3216229 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 34414084 ns 34013563 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 28403834 ns 28342229 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 28024292 ns 28436750 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 41705396 ns 43339875 ns 0.96
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 143769208 ns 144288959 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 141570125 ns 142279583 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 124428458.5 ns 126469000.5 ns 0.98
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 174488250 ns 168866000 ns 1.03
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22552202 ns 22582893 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 955894625 ns 1275599313 ns 0.75
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 1172862062.5 ns 1058487228.5 ns 1.11
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 1204750000 ns 712851209 ns 1.69
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 668847750 ns 668538250 ns 1.00
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 116933733 ns 119108875 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 75000 ns 83125 ns 0.90
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 87584 ns 76208 ns 1.15
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 77875 ns 78125 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 72875 ns 72729 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 192074 ns 365097 ns 0.53
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 274854.5 ns 189959 ns 1.45
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 255958 ns 287792 ns 0.89
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 286208 ns 268875 ns 1.06
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 251645.5 ns 189583.5 ns 1.33
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1094455.5 ns 1559670.5 ns 0.70
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 36252875 ns 35476167 ns 1.02
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 35435458 ns 35447729.5 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 32241479 ns 32304459 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 40996833 ns 40935146 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5845359.5 ns 5843273 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 150751458 ns 147875542 ns 1.02
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 152701437.5 ns 152751312.5 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 135398042 ns 139824437 ns 0.97
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 287922042 ns 287719375 ns 1.00
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 34877403 ns 34882914 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 156698834 ns 120880395.5 ns 1.30
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 173722000 ns 174358791 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 148128521 ns 155429791 ns 0.95
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 106406291 ns 106966959 ns 0.99
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5462755 ns 5456342 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 521618645.5 ns 470623375 ns 1.11
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 465924792 ns 466918000 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 436476979.5 ns 456589562.5 ns 0.96
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 749280333 ns 742113834 ns 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 32257322.5 ns 32255425 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 691115542 ns 706243291.5 ns 0.98
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 655769083 ns 652697541.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 570838750 ns 591007625 ns 0.97
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 850474417 ns 851805375 ns 1.00
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) 1161250 ns 1320583.5 ns 0.88
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) 968958 ns 965875 ns 1.00
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) 978792 ns 736687.5 ns 1.33
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) 2056959 ns 1944666.5 ns 1.06
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA 569103 ns 564187.5 ns 1.01
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) 2792459 ns 2971708.5 ns 0.94
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) 2614042 ns 2620334 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) 2624792 ns 2535604 ns 1.04
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) 3703458.5 ns 3604083.5 ns 1.03
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA 1689865 ns 1878347.5 ns 0.90
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) 6754020.5 ns 6649958 ns 1.02
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) 6516208.5 ns 6493042 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) 6508229.5 ns 6437479.5 ns 1.01
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) 4440583 ns 4435750 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7250 ns 7375 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6042 ns 6208 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6166 ns 5375 ns 1.15
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9958 ns 9916 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 25827 ns 25400 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 213020.5 ns 213645.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 220875 ns 221833 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 221500 ns 221250 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 213792 ns 205875 ns 1.04
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 258574 ns 293719.5 ns 0.88
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) 313286250 ns 301604437.5 ns 1.04
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) 221788541 ns 221356625 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) 189977792 ns 223278083.5 ns 0.85
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) 312995792 ns 312163250 ns 1.00
vgg16(32, 32, 3, 64)/forward/GPU/CUDA 7678890.5 ns 7672763 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) 1089809375 ns 1078062604.5 ns 1.01
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) 908046396 ns 896268771 ns 1.01
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) 816079833 ns 880668729 ns 0.93
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) 1175001292 ns 1161143188 ns 1.01
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA 26534604 ns 26517571 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5834 ns 5500 ns 1.06
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 7542 ns 5750 ns 1.31
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7291.5 ns 9437.5 ns 0.77
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5583 ns 5875 ns 0.95
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 152068 ns 201555 ns 0.75
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6958 ns 7500 ns 0.93
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7458 ns 7458 ns 1
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7667 ns 7750 ns 0.99
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6875 ns 7041.5 ns 0.98
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 628715 ns 699933.5 ns 0.90
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 458 ns 500 ns 0.92
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 542 ns 500 ns 1.08
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 542 ns 583 ns 0.93
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 500 ns 500 ns 1
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 24235 ns 23724.5 ns 1.02
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9417 ns 9208 ns 1.02
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9542 ns 9625 ns 0.99
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 11458 ns 9604.5 ns 1.19
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9167 ns 9042 ns 1.01
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 223814.5 ns 234828.5 ns 0.95
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 351875 ns 351500 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 351250 ns 350896 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 351770.5 ns 354624.5 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 353687.5 ns 351708 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 21385 ns 20984 ns 1.02
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 804000.5 ns 775417 ns 1.04
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 814312.5 ns 824916 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 807812.5 ns 830958 ns 0.97
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 817708 ns 823958 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 280926 ns 306663 ns 0.92
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 312542 ns 338083 ns 0.92
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 336646 ns 341500 ns 0.99
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 449478.5 ns 443667 ns 1.01
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 330645.5 ns 325667 ns 1.02
batchedmm(16, Bsize=32)/forward/GPU/CUDA 18418 ns 17821 ns 1.03
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 689771 ns 696042 ns 0.99
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 738208 ns 739416.5 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 1025667 ns 1042874.5 ns 0.98
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 695584 ns 692645.5 ns 1.00
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 265152 ns 273141.5 ns 0.97
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 328792 ns 358458.5 ns 0.92
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 347104 ns 349125 ns 0.99
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 425250 ns 431291.5 ns 0.99
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 369625 ns 370875 ns 1.00
batchedmm(16, Bsize=128)/forward/GPU/CUDA 22943 ns 22357.5 ns 1.03
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 747208.5 ns 756625 ns 0.99
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 744458 ns 744208.5 ns 1.00
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 1064646 ns 1073250 ns 0.99
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 816917 ns 818125.5 ns 1.00
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 224782.5 ns 221398.5 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 3542 ns 3459 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 3458 ns 3541 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 3792 ns 3792 ns 1
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 3500 ns 3291 ns 1.06
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 18400 ns 17956 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 4167 ns 4208 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 4250 ns 4208 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 4416 ns 4416 ns 1
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 4167 ns 4125 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 281571 ns 275839.5 ns 1.02
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3583 ns 3792 ns 0.94
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4125 ns 3375 ns 1.22
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 5292 ns 6750 ns 0.78
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3875 ns 6625 ns 0.58
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 209071.5 ns 205448.5 ns 1.02
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8500 ns 8334 ns 1.02
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8417 ns 8459 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8750 ns 8500 ns 1.03
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8500 ns 8541 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 1186240 ns 1183984 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 204500 ns 202625 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 211125 ns 210416 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 210208 ns 209292 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 200375 ns 200000 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 35365 ns 34588 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 599417 ns 603792 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 671292 ns 670625 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 623500 ns 630958 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 630291.5 ns 631187.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 324289.5 ns 352652 ns 0.92
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 977208 ns 967521 ns 1.01
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 939375.5 ns 927063 ns 1.01
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 956812 ns 964437.5 ns 0.99
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 1308000 ns 1281853.5 ns 1.02
batchedmm(128, Bsize=128)/forward/GPU/CUDA 208543 ns 207244 ns 1.01
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 4683396 ns 4451771 ns 1.05
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 4481249.5 ns 4482750 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 4296708 ns 4474208 ns 0.96
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 6279833 ns 6201166 ns 1.01
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 939154.5 ns 945549 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3375 ns 3604.5 ns 0.94
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3792 ns 3167 ns 1.20
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4854 ns 6792 ns 0.71
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3500 ns 3167 ns 1.11
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 203673 ns 233201 ns 0.87
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7125 ns 7500 ns 0.95
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7375 ns 7375 ns 1
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7750 ns 7291 ns 1.06
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7291 ns 7083 ns 1.03
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 979866.5 ns 1014881 ns 0.97
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1607125 ns 1602833.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1183708 ns 1187916 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1369958 ns 1364062 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2332917 ns 2343729.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA 216036 ns 212955.5 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12340417 ns 12334792 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9538979 ns 9602042 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9272374.5 ns 9404958 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 17927917 ns 17966833 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1958008 ns 1949853 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17412145.5 ns 17347084 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14312666.5 ns 14365000 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14342249.5 ns 14512666 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21050666.5 ns 21005479.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 89417 ns 89791 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 91250 ns 91729.5 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 94062 ns 94291 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 90292 ns 117416.5 ns 0.77
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 125974 ns 126285 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2048916.5 ns 2023917 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1760437.5 ns 2013416.5 ns 0.87
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2028562.5 ns 2058875 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2017708 ns 2027875 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 976278.5 ns 1031286 ns 0.95
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 329084 ns 346791.5 ns 0.95
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 338542 ns 343583.5 ns 0.99
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 395833.5 ns 412250 ns 0.96
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 312417 ns 306166 ns 1.02
batchedmm(2, Bsize=4)/forward/GPU/CUDA 15658 ns 16010 ns 0.98
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 700875 ns 702291 ns 1.00
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 723333 ns 728979.5 ns 0.99
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 1020750 ns 1025458 ns 1.00
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 647250 ns 639875 ns 1.01
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 186416 ns 193209 ns 0.96
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7125 ns 7292 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 3833 ns 6083 ns 0.63
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6042 ns 5334 ns 1.13
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10000 ns 10000 ns 1
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 33540 ns 33620 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 249125 ns 220479.5 ns 1.13
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 221583 ns 231958 ns 0.96
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 221958 ns 232041 ns 0.96
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 205958 ns 220500 ns 0.93
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 289999 ns 311751 ns 0.93
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3708 ns 3708 ns 1
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3708 ns 3708 ns 1
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3667 ns 3709 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3667 ns 3667 ns 1
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 22116 ns 22440 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14083 ns 14500 ns 0.97
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 14458 ns 14417 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14459 ns 14167 ns 1.02
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14458 ns 14291 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 451385.5 ns 468658 ns 0.96
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 94583.5 ns 95166 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 96500 ns 138021 ns 0.70
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 100166 ns 99167 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 99000 ns 122458 ns 0.81
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 125281.5 ns 125691 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1944062.5 ns 1931875 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1652500 ns 1954979 ns 0.85
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1923000 ns 1946854 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1910333 ns 1923729.5 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 915747 ns 940251.5 ns 0.97
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) 858770.5 ns 880500 ns 0.98
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) 809334 ns 815125 ns 0.99
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) 1217666 ns 1172292 ns 1.04
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) 959250 ns 960167 ns 1.00
lenet(28, 28, 1, 32)/forward/GPU/CUDA 270078.5 ns 270704 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) 2748875 ns 2803000 ns 0.98
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) 2464583 ns 2526833 ns 0.98
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) 3353417 ns 3361333 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) 3392521 ns 3405875 ns 1.00
lenet(28, 28, 1, 32)/zygote/GPU/CUDA 1543382 ns 1569154 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 14917 ns 15146 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 16021 ns 18000 ns 0.89
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 18625 ns 21666 ns 0.86
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 14875 ns 18125 ns 0.82
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 140956 ns 141811.5 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 254667 ns 217083 ns 1.17
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 258562.5 ns 229375 ns 1.13
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 219645.5 ns 257396 ns 0.85
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 255000 ns 215833 ns 1.18
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 640174 ns 635765.5 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 221208 ns 219750 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 222958 ns 221500 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 223000 ns 226021 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 219750 ns 223937.5 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 270240.5 ns 270450 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 508375.5 ns 509917 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 560375 ns 557729 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 546708.5 ns 549792 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 500104.5 ns 555791 ns 0.90
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1421050.5 ns 1308245 ns 1.09
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 310458 ns 333479 ns 0.93
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 335750 ns 335541.5 ns 1.00
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 417104 ns 437333 ns 0.95
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 319541 ns 319417 ns 1.00
batchedmm(16, Bsize=4)/forward/GPU/CUDA 16359 ns 16583 ns 0.99
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 709833.5 ns 715333 ns 0.99
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 723000 ns 730292 ns 0.99
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 1018208 ns 1025458.5 ns 0.99
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 662917 ns 655792 ns 1.01
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 196421.5 ns 193313 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17167 ns 17625 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 18375 ns 17625 ns 1.04
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 20250 ns 20437.5 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 17417 ns 18000 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 144230 ns 144711.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 219312.5 ns 216667 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 216750 ns 224083 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 213584 ns 226625 ns 0.94
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 212250 ns 223417 ns 0.95
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 950742 ns 903796 ns 1.05
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 4541 ns 4625 ns 0.98
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6770.5 ns 6750 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 6542 ns 7438 ns 0.88
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 4770.5 ns 6625 ns 0.72
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 248657 ns 174159.5 ns 1.43
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 10500 ns 10437.5 ns 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9895.5 ns 10750 ns 0.92
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10979.5 ns 10770.5 ns 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10583 ns 10833 ns 0.98
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 1098275 ns 1024421 ns 1.07
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3583 ns 3646 ns 0.98
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3625 ns 3334 ns 1.09
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 5458.5 ns 5625 ns 0.97
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3666.5 ns 3500 ns 1.05
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 246374.5 ns 231660 ns 1.06
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7667 ns 7708 ns 0.99
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7917 ns 7792 ns 1.02
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7833 ns 7625 ns 1.03
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7458 ns 7167 ns 1.04
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 1103324 ns 1037611.5 ns 1.06
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 24115395.5 ns 23838833 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 34596167 ns 33990646 ns 1.02
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37656499.5 ns 41585708 ns 0.91
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34879000 ns 34896229 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1854064 ns 1839186 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 187364104.5 ns 184662833 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 159394792 ns 159634000 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 147080250 ns 151746084 ns 0.97
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 416232000 ns 415075875 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 16513835 ns 16506413 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 436918167 ns 427351833 ns 1.02
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 253570271 ns 251624521 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 232441063 ns 233926312.5 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 487065917 ns 484091542 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 184125 ns 181666 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 185125 ns 183416.5 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 185333 ns 186125 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 183667 ns 183834 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 228604.5 ns 173529.5 ns 1.32
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 609187.5 ns 587541 ns 1.04
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 589521 ns 600458 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 615687.5 ns 632375 ns 0.97
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 588604 ns 631354 ns 0.93
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1082743 ns 1005977 ns 1.08
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 3843541 ns 3816041.5 ns 1.01
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 3628458.5 ns 3637833 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 3481208.5 ns 3539646 ns 0.98
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 5352750 ns 5351396 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/CUDA 550108 ns 554127 ns 0.99
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 17957875 ns 17372333 ns 1.03
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 17252250 ns 17218458.5 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 16584291 ns 16979478.5 ns 0.98
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 22130646 ns 22177625 ns 1.00
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2634462 ns 2616933 ns 1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 458 ns 583 ns 0.79
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 542 ns 542 ns 1
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 542 ns 542 ns 1
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 459 ns 459 ns 1
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 32228 ns 32036 ns 1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9479.5 ns 9667 ns 0.98
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9709 ns 9750 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9750 ns 10125 ns 0.96
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9333 ns 9291 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 263745 ns 260858 ns 1.01
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) 582782250 ns 506491042 ns 1.15
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) 428813333 ns 428949104 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) 434641416 ns 474815000 ns 0.92
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) 674397896 ns 671461979 ns 1.00
vgg16(32, 32, 3, 128)/forward/GPU/CUDA 12481497 ns 12484614.5 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) 2082933771 ns 2043435104.5 ns 1.02
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) 1628241833 ns 1631358667 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) 1498637604 ns 1546812271 ns 0.97
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) 2215333416.5 ns 2216473375.5 ns 1.00
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA 49018510 ns 49204869.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1609020.5 ns 1642542 ns 0.98
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1170854.5 ns 1194625 ns 0.98
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1390229 ns 1380791 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2481499.5 ns 2487084 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 215279 ns 215546 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12779125 ns 12711687.5 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9941250 ns 9927625 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9668020.5 ns 9788604.5 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18419750.5 ns 18464437.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2038920 ns 1995889.5 ns 1.02
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17725542 ns 17669166.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14672812.5 ns 14709437.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14586209 ns 14807645.5 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21417895.5 ns 21465708 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 26209 ns 26250 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 26291 ns 26250 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 26834 ns 26291 ns 1.02
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 26250 ns 26167 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 24018 ns 23873 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 66750 ns 66917 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 66958 ns 67333 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 68250 ns 67083 ns 1.02
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 67084 ns 66833 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 410142.5 ns 382426 ns 1.07
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 203375 ns 203834 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 210291 ns 209542 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 211042 ns 209584 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 199875 ns 199584 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 26776 ns 26132 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 613708 ns 613833.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 669042 ns 636667 ns 1.05
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 636312.5 ns 671166.5 ns 0.95
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 585917 ns 628229.5 ns 0.93
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 354860.5 ns 308600 ns 1.15
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 653500.5 ns 671687.5 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 635459 ns 645937.5 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 647417 ns 644791.5 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 593312.5 ns 676334 ns 0.88
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 131876.5 ns 131667 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2293458 ns 2241875 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1909916.5 ns 2192250 ns 0.87
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2243166.5 ns 2297042 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2246479.5 ns 2246249.5 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1186439 ns 1114838 ns 1.06
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17000 ns 16791 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 17834 ns 17500 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 21666.5 ns 20958 ns 1.03
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 17750 ns 16770.5 ns 1.06
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 146117 ns 143001 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 227250 ns 230375 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 230916 ns 231791.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 259958 ns 266208 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 219458 ns 260728.5 ns 0.84
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1054898 ns 959584 ns 1.10
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 459 ns 500 ns 0.92
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 542 ns 542 ns 1
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 583 ns 542 ns 1.08
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 500 ns 500 ns 1
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 23885.5 ns 23163 ns 1.03
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9791 ns 9604.5 ns 1.02
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 10583 ns 10292 ns 1.03
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 10000 ns 10625 ns 0.94
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9979.5 ns 9584 ns 1.04
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 262185 ns 255611 ns 1.03
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5625 ns 5416.5 ns 1.04
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 7812.5 ns 5750 ns 1.36
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7229.5 ns 9458 ns 0.76
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5458.5 ns 5708 ns 0.96
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 233872.5 ns 219432 ns 1.07
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7209 ns 7833 ns 0.92
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7667 ns 7750 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7791 ns 7709 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6917 ns 7000 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 810262 ns 764584 ns 1.06
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 2209 ns 1959 ns 1.13
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 2208 ns 2083 ns 1.06
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 2292 ns 2417 ns 0.95
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 2292 ns 2208 ns 1.04
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 18153 ns 17893 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 6375 ns 6875 ns 0.93
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 6833 ns 6542 ns 1.04
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 6792 ns 6583 ns 1.03
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 6333.5 ns 6291 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 336821.5 ns 320459 ns 1.05
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 749354.5 ns 747709 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 748792 ns 749833 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 749500 ns 754999.5 ns 0.99
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 749209 ns 749375 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 21981 ns 21357 ns 1.03
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 802958 ns 774854 ns 1.04
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 796458 ns 792687.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 791875 ns 817042 ns 0.97
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 792875 ns 811166 ns 0.98
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 299725 ns 295013.5 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7208 ns 7334 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6042 ns 6000 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6042 ns 5208.5 ns 1.16
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10042 ns 10166 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 33633 ns 33519 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 228896 ns 219666 ns 1.04
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 239875 ns 268125 ns 0.89
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 267396 ns 252000.5 ns 1.06
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 229000 ns 213562 ns 1.07
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 364682.5 ns 354278 ns 1.03
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 10938 ns 10875 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 13000 ns 11833 ns 1.10
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 12521 ns 12770.5 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 10666.5 ns 12000 ns 0.89
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 250655.5 ns 238132.5 ns 1.05
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 25167 ns 24708 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 25000 ns 24584 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25041.5 ns 25292 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 24541 ns 24500 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 1138466 ns 1094067.5 ns 1.04
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 107427417 ns 106709834 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 116984750 ns 116906583.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 121269458 ns 127036729 ns 0.95
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 117446750 ns 117807000 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 2641904 ns 2657653 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 395550709 ns 392558792 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 362937167 ns 365774917 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 425153020.5 ns 431860937.5 ns 0.98
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 489784167 ns 483379250 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 15232680 ns 15196086 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 768172250.5 ns 758564875.5 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 753476708 ns 761412666 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 745095458.5 ns 748747542 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 764672166.5 ns 765232583 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7375 ns 6625 ns 1.11
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 8125 ns 7334 ns 1.11
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8000 ns 9041.5 ns 0.88
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7208 ns 8250 ns 0.87
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 240084.5 ns 231038.5 ns 1.04
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14292 ns 14625 ns 0.98
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14833 ns 14750 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14958 ns 14292 ns 1.05
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14583 ns 14542 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 1094068 ns 1043294.5 ns 1.05
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 7833 ns 5875 ns 1.33
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 8208 ns 7959 ns 1.03
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 8375 ns 9167 ns 0.91
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6208 ns 6333 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 238555 ns 228571 ns 1.04
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12562.5 ns 12791 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12833 ns 13167 ns 0.97
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13125 ns 13375 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12833 ns 12333 ns 1.04
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 800161.5 ns 779066.5 ns 1.03
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 328895.5 ns 347625 ns 0.95
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 340188 ns 342625 ns 0.99
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 399291.5 ns 416812 ns 0.96
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 311167 ns 307083 ns 1.01
batchedmm(2, Bsize=128)/forward/GPU/CUDA 17007 ns 17023 ns 1.00
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 704292 ns 710208.5 ns 0.99
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 729000 ns 732125 ns 1.00
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 1021958 ns 1032542 ns 0.99
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 661417 ns 653979.5 ns 1.01
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 202290 ns 200196.5 ns 1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 292 ns 334 ns 0.87
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 333 ns 333 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 23607 ns 23569 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6333 ns 6375 ns 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6625 ns 6584 ns 1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6667 ns 6834 ns 0.98
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6708 ns 6042 ns 1.11
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 244508 ns 241926 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5791 ns 5708 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 5833 ns 5834 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 5792 ns 5875 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5792 ns 5708 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 24898 ns 24556.5 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 21875 ns 21562.5 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 21917 ns 22000 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 21541.5 ns 21709 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 21500 ns 21167 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 268220 ns 265433.5 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 144708 ns 144917 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 146625 ns 191292 ns 0.77
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 151812.5 ns 149333 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 146000 ns 149250 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 167349 ns 167659 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1353208 ns 1319292 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1315875 ns 1331416 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1324083 ns 1362958 ns 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1317708 ns 1326125 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1352621 ns 1343729.5 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 22938 ns 22250 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 24916 ns 23791 ns 1.05
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 27167 ns 25875 ns 1.05
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 22167 ns 23666.5 ns 0.94
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 290263.5 ns 286115 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 125979 ns 146125 ns 0.86
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 121875 ns 118500 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 136291 ns 129833 ns 1.05
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 163666.5 ns 175792 ns 0.93
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1479598 ns 1461317 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 291 ns 292 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 333 ns 292 ns 1.14
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 23481 ns 23352 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6416 ns 6334 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6709 ns 6459 ns 1.04
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6792 ns 6709 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6417 ns 6125 ns 1.05
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 260281 ns 258095.5 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5354.5 ns 4625 ns 1.16
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 5104.5 ns 4125 ns 1.24
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6833 ns 7625 ns 0.90
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4791 ns 4895.5 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 257811.5 ns 256357.5 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10104.5 ns 9959 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10291 ns 10125 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10500 ns 10333 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10291 ns 10333 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 1365445.5 ns 1358318.5 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1542 ns 1625 ns 0.95
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1584 ns 1584 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1584 ns 1625 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1625 ns 1583 ns 1.03
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 23204 ns 23389 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 5666 ns 5667 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 5625 ns 5875 ns 0.96
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 5917 ns 6000 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 5666 ns 5625 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 278434.5 ns 275350.5 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 6794083.5 ns 6780125 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 6437416.5 ns 6371125 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 6537625 ns 6531396 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 7672770.5 ns 7625875 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 215934 ns 214804 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 24128750 ns 24015354 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 21236687 ns 21285667 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 21047792 ns 21085125 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 29739708 ns 29769250 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2108232.5 ns 2112477.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 37600792 ns 37264541.5 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 45471354 ns 45538167 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 45790709 ns 45665125 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 37882499.5 ns 38235958 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6312.5 ns 6208 ns 1.02
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 7250 ns 5958.5 ns 1.22
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7208 ns 8750 ns 0.82
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 6334 ns 7500 ns 0.84
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 239307 ns 236550 ns 1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8562.5 ns 8750 ns 0.98
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8292 ns 8375 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8500 ns 8500 ns 1
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8250 ns 8958 ns 0.92
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 1072829 ns 1063848.5 ns 1.01
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) 1518604.5 ns 1554084 ns 0.98
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) 1254999.5 ns 1262375 ns 0.99
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) 1638249.5 ns 1631958.5 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) 2161791.5 ns 2152375 ns 1.00
lenet(28, 28, 1, 128)/forward/GPU/CUDA 282408 ns 277465 ns 1.02
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) 7954709 ns 7881667 ns 1.01
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) 6616166.5 ns 6612667 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) 7188250 ns 7276167 ns 0.99
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) 10050292 ns 10468062.5 ns 0.96
lenet(28, 28, 1, 128)/zygote/GPU/CUDA 1885536.5 ns 1876576 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 322292 ns 346375 ns 0.93
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 349229 ns 348937.5 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 424833 ns 423416.5 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 342791 ns 336687 ns 1.02
batchedmm(128, Bsize=4)/forward/GPU/CUDA 42363 ns 46390 ns 0.91
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 738167 ns 735208 ns 1.00
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 775562.5 ns 782458 ns 0.99
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 1073833 ns 1081666.5 ns 0.99
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 771542 ns 758458.5 ns 1.02
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 308119.5 ns 311011.5 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 396833 ns 397375 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 288375 ns 288250 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 288042 ns 212583 ns 1.35
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 750895.5 ns 754104.5 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 44623 ns 44494 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 644479.5 ns 675959 ns 0.95
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 531167 ns 532333 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 529666 ns 474000 ns 1.12
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 972833 ns 973417 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 191874.5 ns 189847 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 642958.5 ns 599375 ns 1.07
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 644291.5 ns 650333 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 648063 ns 660375 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 659792 ns 655833.5 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 132083 ns 132321 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2529834 ns 2469395.5 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2408458 ns 2363959 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2454875 ns 2519875.5 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2469208 ns 2465916 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1298049.5 ns 1345989 ns 0.96
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 332375 ns 345583 ns 0.96
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 341187.5 ns 342834 ns 1.00
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 399834 ns 416375 ns 0.96
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 309083 ns 306979.5 ns 1.01
batchedmm(2, Bsize=32)/forward/GPU/CUDA 15953 ns 16330 ns 0.98
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 702167 ns 703104 ns 1.00
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 725709 ns 729708 ns 0.99
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 1019542 ns 1026937.5 ns 0.99
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 651145.5 ns 645959 ns 1.01
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 200850 ns 199885.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1461167 ns 1460542 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1498584 ns 1500583 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1503292 ns 1491791 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1438792 ns 1441917 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 41257 ns 41671 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5149000 ns 5133500 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4947833 ns 5293250 ns 0.93
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5288437.5 ns 5309521 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4981458.5 ns 4977042 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 200973.5 ns 197710 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3708 ns 3708 ns 1
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3708 ns 3708 ns 1
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3708 ns 3709 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3667 ns 3666 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 33162 ns 33362 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14958 ns 15125 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15250 ns 15500 ns 0.98
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15417 ns 15125 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15167 ns 15083 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 382563 ns 381216.5 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 71500 ns 71375 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 71250 ns 71208 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 70958 ns 71583 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 71209 ns 71208 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 113882 ns 113946.5 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 329166 ns 319833 ns 1.03
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 318166 ns 319208 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 318541 ns 327125 ns 0.97
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 318917 ns 318375 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 197593 ns 195156 ns 1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 958 ns 959 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 1042 ns 1042 ns 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 1083 ns 1083 ns 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 1042 ns 1000 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 24063 ns 23764 ns 1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8270.5 ns 8084 ns 1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8542 ns 8542 ns 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8459 ns 8416 ns 1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8166 ns 7833.5 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 265383 ns 263039 ns 1.01
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 454542 ns 472416 ns 0.96
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 477021 ns 468125 ns 1.02
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 549979 ns 549250 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 490708 ns 550333 ns 0.89
batchedmm(128, Bsize=32)/forward/GPU/CUDA 129528 ns 128804.5 ns 1.01
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 1405458 ns 1375292 ns 1.02
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 1371583 ns 1372208 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 1604208 ns 1633459 ns 0.98
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 1364083.5 ns 1580500 ns 0.86
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 276117.5 ns 274739 ns 1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 333 ns 416 ns 0.80
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 417 ns 416 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 292 ns 292 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 32407 ns 31574 ns 1.03
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6458 ns 6458 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6542 ns 6875 ns 0.95
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6500 ns 6708 ns 0.97
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6250 ns 6000 ns 1.04
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 267504 ns 261869 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1723229 ns 1727625 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1725500.5 ns 1783958 ns 0.97
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1723708 ns 1730916 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1733208 ns 1729333 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 169027 ns 168455 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4392958 ns 4352625 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4360166 ns 4372937.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4374125 ns 4412458 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4352958.5 ns 4358042 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1185931 ns 1234725 ns 0.96
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 6833 ns 6709 ns 1.02
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 6500 ns 6584 ns 0.99
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 7167 ns 7417 ns 0.97
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 8874.5 ns 6542 ns 1.36
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 21039.5 ns 19619.5 ns 1.07
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 32958 ns 51083 ns 0.65
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 33166 ns 35625 ns 0.93
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 72708 ns 49875 ns 1.46
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 50854 ns 70208 ns 0.72
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 212280.5 ns 211156 ns 1.01
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 337125 ns 354291 ns 0.95
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 348917 ns 347584 ns 1.00
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 426625 ns 432708 ns 0.99
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 321166.5 ns 319521.5 ns 1.01
batchedmm(2, Bsize=512)/forward/GPU/CUDA 18442 ns 18053 ns 1.02
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 715708 ns 719104 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 725000 ns 735979 ns 0.99
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 1032812.5 ns 1039063 ns 0.99
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 675667 ns 672750 ns 1.00
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 346766 ns 343671.5 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 75416 ns 75417 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 73958 ns 75333 ns 0.98
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 75083 ns 75708 ns 0.99
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 75375 ns 74709 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 47399 ns 46983 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 335875 ns 324417 ns 1.04
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 324875 ns 327000 ns 0.99
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 325250 ns 334917 ns 0.97
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 325417 ns 324083 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 213441 ns 207721.5 ns 1.03
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1478500 ns 1486334 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1526541 ns 1527500 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1529042 ns 1519000 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1464375 ns 1466541 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 52565 ns 51914 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5122208.5 ns 5119333.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5254437 ns 5300396 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5278208 ns 5303708 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4984104 ns 4989375 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 205098 ns 201413 ns 1.02
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 28125 ns 28167 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 28209 ns 28166 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 28250 ns 28333 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 28208 ns 28208 ns 1
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 23917.5 ns 24393 ns 0.98
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 66333 ns 66542 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 66667 ns 66292 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 66500 ns 66542 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66541 ns 66584 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 523038 ns 530998 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 1379812.5 ns 1493250 ns 0.92
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 1132708 ns 1120167 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 1142000 ns 947625 ns 1.21
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) 2125625 ns 2256500 ns 0.94
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA 571358 ns 570331 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) 2998291 ns 3075542 ns 0.97
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 2137354 ns 2732479 ns 0.78
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) 2744167 ns 2643125 ns 1.04
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) 3802166 ns 3814770.5 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA 2054250 ns 2010818 ns 1.02
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) 8949667 ns 8738917 ns 1.02
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) 8798333 ns 8777854.5 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) 8779583 ns 8781417 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) 6359625 ns 6360687.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 80521 ns 81146 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 81645.5 ns 81708.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 85979 ns 83708 ns 1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82458 ns 87687.5 ns 0.94
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 191955.5 ns 192383.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2047562.5 ns 2016791.5 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1749333 ns 2012708 ns 0.87
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2018416.5 ns 2041312 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2004021 ns 2015208 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 799143 ns 798885.5 ns 1.00

This comment was automatically generated by workflow using github-action-benchmark.

Contributor

github-actions bot commented Nov 5, 2024

Benchmark Results (ASV)

Benchmark main 24dc9ec... Ratio (main/24dc9ec419939c...)
basics/overhead 0.121 ± 0.001 μs 0.124 ± 0.0012 μs 0.974
time_to_load 1.19 ± 0.0073 s 1.2 ± 0.0069 s 0.992

Benchmark Plots

A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
Go to "Actions"->"Benchmark a pull request"->[the most recent run]->"Artifacts" (at the bottom).

avik-pal merged commit 4079372 into main Nov 5, 2024
4 of 5 checks passed
avik-pal deleted the ap/enz_opt_in branch November 5, 2024 20:16
Contributor

wsmoses commented Nov 6, 2024

So as of last night, 1.11 support is essentially in. Can we test and see if it resolves?

Member Author

avik-pal commented Nov 6, 2024

Ah nice, I will trigger it
