test: try re-enabling enzyme testing on 0.13.16 #1042

Merged on Nov 21, 2024 (21 commits)

avik-pal (Member) commented Nov 6, 2024

No description provided.

github-actions bot (Contributor) commented Nov 6, 2024

Benchmark Results (ASV)

| | main | a08903d... | main/a08903d84907c1... |
|---|---|---|---|
| basics/overhead | 0.121 ± 0.0011 μs | 0.124 ± 0.0011 μs | 0.977 |
| time_to_load | 1.21 ± 0.014 s | 1.2 ± 0.0051 s | 1.01 |
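
For readers checking the ASV table by hand, the last column appears to be the `main` timing divided by the PR timing, as its header suggests. The sketch below is only an illustration of that reading: the helper names `parse_measurement` and `ratio` are hypothetical and are not part of ASV or the benchmark workflow. Recomputing from the rounded cells gives roughly 0.976 for `basics/overhead`, close to the reported 0.977, presumably because the bot divides unrounded values.

```python
import re


def parse_measurement(cell: str) -> tuple[float, float, str]:
    """Split a cell such as '0.121 ± 0.0011 μs' into (mean, spread, unit)."""
    match = re.fullmatch(r"\s*([\d.]+)\s*±\s*([\d.]+)\s*(\S+)\s*", cell)
    if match is None:
        raise ValueError(f"unrecognised cell: {cell!r}")
    return float(match.group(1)), float(match.group(2)), match.group(3)


def ratio(main_cell: str, pr_cell: str) -> float:
    """Return main / PR for two cells; assumes both use the same unit."""
    main_mean, _, main_unit = parse_measurement(main_cell)
    pr_mean, _, pr_unit = parse_measurement(pr_cell)
    if main_unit != pr_unit:
        raise ValueError("cells use different units")
    return main_mean / pr_mean


# time_to_load row from the table above
print(round(ratio("1.21 ± 0.014 s", "1.2 ± 0.0051 s"), 2))  # prints 1.01
```

Under this reading, a value below 1 means the PR commit was slower than `main` on that benchmark, and a value above 1 means it was faster.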

Benchmark Plots

A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
Go to "Actions" -> "Benchmark a pull request" -> [the most recent run] -> "Artifacts" (at the bottom).

avik-pal (Member, Author) commented Nov 6, 2024

We also need to manually re-enable some of the tests in LuxLib.

github-actions bot (Contributor) left a comment

Lux Benchmarks

Benchmark suite Current: a08903d Previous: cb0900f Ratio (current / previous)
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 4000 ns 3875 ns 1.03
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4416 ns 4375 ns 1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 5208 ns 5083 ns 1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4250 ns 4208 ns 1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 60702.5 ns 60144 ns 1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 10334 ns 10625 ns 0.97
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 10417 ns 10666 ns 0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 11583 ns 11375 ns 1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 10333 ns 10334 ns 1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 429548.5 ns 421452 ns 1.02
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1208 ns 1250 ns 0.97
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 1209 ns 1292 ns 0.94
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 1334 ns 1250 ns 1.07
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 1125 ns 1167 ns 0.96
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 18619 ns 18149 ns 1.03
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 4041 ns 4167 ns 0.97
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 4208 ns 4042 ns 1.04
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4250 ns 4292 ns 0.99
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 4083 ns 3625 ns 1.13
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 110443.5 ns 109548 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57250 ns 56166 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 38333 ns 46709 ns 0.82
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47167 ns 46334 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82042 ns 82291 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 37058 ns 37127 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2029124.5 ns 2031334 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2097750 ns 2096166.5 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2114167 ns 2086458 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1991603.5 ns 1997167 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 195005 ns 197158.5 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 143875 ns 143042 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 153625 ns 145583.5 ns 1.06
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 146250 ns 146709 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 144042 ns 149500 ns 0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 167064 ns 166231 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1115083.5 ns 1138708.5 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1145062.5 ns 1128583 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1143000 ns 1062083.5 ns 1.08
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1111770.5 ns 1115041.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 519176.5 ns 530934 ns 0.98
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3875 ns 3125 ns 1.24
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 3792 ns 3458 ns 1.10
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4396 ns 4292 ns 1.02
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3500 ns 3375 ns 1.04
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 70922 ns 70464 ns 1.01
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8542 ns 9208 ns 0.93
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9375 ns 8917 ns 1.05
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9833 ns 9125 ns 1.08
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9041 ns 9166 ns 0.99
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 493056 ns 483194.5 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 16666 ns 15333 ns 1.09
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 15625 ns 15458 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 18708 ns 17333 ns 1.08
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 14958 ns 17062.5 ns 0.88
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 53797 ns 53962 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 212125 ns 214583.5 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 219770.5 ns 212667 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 215792 ns 214625 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 212625 ns 225250 ns 0.94
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 271057 ns 273370 ns 0.99
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 625 ns 458 ns 1.36
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 583 ns 666 ns 0.88
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 708 ns 750 ns 0.94
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 541 ns 500 ns 1.08
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 17646 ns 17502.5 ns 1.01
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1375 ns 1542 ns 0.89
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1625 ns 1667 ns 0.97
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1792 ns 1834 ns 0.98
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1625 ns 1375 ns 1.18
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 103036.5 ns 101667.5 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7083 ns 7125 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5167 ns 5917 ns 0.87
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6041 ns 5792 ns 1.04
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9958 ns 9917 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 23852 ns 23886 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 220542 ns 221417 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 231917 ns 228125 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 229333 ns 228666 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 214208 ns 220500 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 169922.5 ns 169891 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 3917 ns 3958 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 3917 ns 3958 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 3917 ns 3875 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 3875 ns 3875 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 23947.5 ns 23537 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16792 ns 16750 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16500 ns 17042 ns 0.97
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 17000 ns 16875 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16708 ns 16750 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 163864 ns 159725 ns 1.03
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 570750 ns 570333 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 578959 ns 574000 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 571125 ns 579125 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 578875 ns 571125 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 113126.5 ns 113492 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1421042 ns 1428041 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1426625 ns 1422333 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1423792 ns 1423708 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 1424687.5 ns 1423458 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 212529 ns 208571.5 ns 1.02
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) 1084083 ns 1051187.5 ns 1.03
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) 945500 ns 971896 ns 0.97
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) 1343979.5 ns 1346062.5 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) 1294145.5 ns 1306416 ns 0.99
lenet(28, 28, 1, 64)/forward/GPU/CUDA 269871.5 ns 272301 ns 0.99
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) 5730416.5 ns 5990916 ns 0.96
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) 4638750 ns 4519875 ns 1.03
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) 4949333 ns 4948416.5 ns 1.00
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) 5515291 ns 5523125 ns 1.00
lenet(28, 28, 1, 64)/zygote/GPU/CUDA 1069949 ns 1070952 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 542 ns 500 ns 1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 500 ns 542 ns 0.92
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 542 ns 583 ns 0.93
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 500 ns 542 ns 0.92
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 23814 ns 23553 ns 1.01
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2125 ns 2084 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2084 ns 2167 ns 0.96
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2208 ns 2167 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2083 ns 2125 ns 0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 171935.5 ns 168963.5 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 3833 ns 3875 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 4458 ns 4167 ns 1.07
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5292 ns 5250 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 3792 ns 3666 ns 1.03
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 66284.5 ns 65091 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 10958 ns 11416 ns 0.96
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11750 ns 11292 ns 1.04
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12208 ns 12333.5 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 11000 ns 11209 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 453228 ns 446962.5 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6792 ns 6458.5 ns 1.05
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6916.5 ns 6792 ns 1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8584 ns 7833.5 ns 1.10
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6334 ns 6250 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 51674 ns 52555 ns 0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 16709 ns 16584 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 16645.5 ns 17791 ns 0.94
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 18125 ns 17375 ns 1.04
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 16875 ns 17125 ns 0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 300225 ns 308634 ns 0.97
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 583 ns 625 ns 0.93
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 542 ns 666 ns 0.81
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 708 ns 583 ns 1.21
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 583 ns 625 ns 0.93
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 32510 ns 32320 ns 1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 8458 ns 8541 ns 0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 8542 ns 9167 ns 0.93
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9250 ns 9500 ns 0.97
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 8292 ns 9479.5 ns 0.87
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 157936.5 ns 159616 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 64542 ns 64750 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 64833 ns 64625 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 64709 ns 64292 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 64542 ns 64542 ns 1
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 112820 ns 111041.5 ns 1.02
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 279459 ns 292000 ns 0.96
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 296687.5 ns 292084 ns 1.02
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 274459 ns 275666 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 278291.5 ns 275708 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 186750 ns 183441 ns 1.02
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) 3282834 ns 3191791 ns 1.03
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) 2900062.5 ns 3043437.5 ns 0.95
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) 3017792 ns 3020437.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) 3941042 ns 4089708 ns 0.96
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA 576338 ns 601857 ns 0.96
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) 7618104 ns 7582625 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) 7355500 ns 7473208.5 ns 0.98
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) 7445792 ns 7437833 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) 8201396.5 ns 8187292 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA 1331422 ns 1317154 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) 17547125 ns 18957000 ns 0.93
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) 17608979 ns 19047250 ns 0.92
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) 17586250 ns 19104542 ns 0.92
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) 14133271.5 ns 15686625 ns 0.90
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23678375 ns 23902625 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 42806708 ns 34420458 ns 1.24
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 36969687.5 ns 37002333 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 35186500 ns 34848770.5 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1841334 ns 1857006 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 188420708 ns 191696375.5 ns 0.98
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 249825542 ns 164341792 ns 1.52
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 193973479.5 ns 152698167 ns 1.27
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 433364292 ns 439655916 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 13933807 ns 13895377 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 288423500 ns 292126520.5 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 353166937.5 ns 340023312 ns 1.04
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 295955209 ns 298857875 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 394124521 ns 335240875 ns 1.18
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 21583.5 ns 22250 ns 0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 22437.5 ns 23083 ns 0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 23542 ns 23959 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 22083 ns 23417 ns 0.94
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 95875 ns 96101 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 102916 ns 103542 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 104125 ns 103541 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 104875 ns 104791 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 104084 ns 113250 ns 0.92
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 499046 ns 512131 ns 0.97
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5958 ns 5834 ns 1.02
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6500 ns 6375 ns 1.02
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7000 ns 7000 ns 1
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6083.5 ns 6125 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 68801.5 ns 68297.5 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14916 ns 15208 ns 0.98
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 16333 ns 15750 ns 1.04
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 16042 ns 16583 ns 0.97
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14916 ns 15062.5 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 478630.5 ns 474148.5 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3017666.5 ns 3053958 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2067937 ns 2089500 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2280833 ns 2270042 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4862417 ns 4804875 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA 582819 ns 582756 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 23500708 ns 23872458.5 ns 0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18333729.5 ns 18056937.5 ns 1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 18038062.5 ns 17766021 ns 1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 35525604 ns 35515208 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3102887 ns 3103295.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 33341708 ns 33801000 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 28040562.5 ns 27630916.5 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 28530791.5 ns 27435750 ns 1.04
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 41211375 ns 41597458 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 71250 ns 74917 ns 0.95
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 81791 ns 72541 ns 1.13
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 88458.5 ns 76416 ns 1.16
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 74750 ns 74375 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 101733.5 ns 103583 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 206250 ns 221146 ns 0.93
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 232250 ns 219166 ns 1.06
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 220167 ns 208875 ns 1.05
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 204708 ns 206542 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 545309 ns 560403 ns 0.97
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 11917 ns 12166 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 12333 ns 12208.5 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 12917 ns 13167 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 11792 ns 12042 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 73274 ns 71403 ns 1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 26500 ns 26979.5 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 26375 ns 27167 ns 0.97
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 27625 ns 27958.5 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26667 ns 26459 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 485939.5 ns 472464 ns 1.03
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 12458 ns 12437.5 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 12083 ns 12979 ns 0.93
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 14167 ns 14167 ns 1
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 12042 ns 12125 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 53767.5 ns 53400 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 28417 ns 25625 ns 1.11
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 25500 ns 26292 ns 0.97
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 26000 ns 26416 ns 0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26208 ns 26167 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 308847.5 ns 306626.5 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 179313 ns 180729 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 181771 ns 182709 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 184083 ns 183875 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 181625 ns 180833 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 57081 ns 56252.5 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 590625 ns 593541.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 585833 ns 593916 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 595375 ns 584021 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 583958 ns 582917 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 289906 ns 289288.5 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5750 ns 6500 ns 0.88
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6250 ns 6125 ns 1.02
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7292 ns 7792 ns 0.94
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7354.5 ns 6145.5 ns 1.20
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 72512 ns 70132.5 ns 1.03
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14333 ns 14271 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14750 ns 14916 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15292 ns 15500 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14292 ns 14000 ns 1.02
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 475946 ns 460852.5 ns 1.03
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 1194000 ns 1175354 ns 1.02
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 1255041.5 ns 1353000 ns 0.93
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 1282167 ns 1269979 ns 1.01
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 1009000 ns 1317500 ns 0.77
batchedmm(512, Bsize=4)/forward/GPU/CUDA 301898 ns 302455 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 4099458 ns 4288500 ns 0.96
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 4622875 ns 4366958 ns 1.06
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 4583479 ns 4543917 ns 1.01
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 3719875 ns 4469000 ns 0.83
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1037320.5 ns 1030148 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1875 ns 1792 ns 1.05
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1792 ns 1875 ns 0.96
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1875 ns 1833 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1833 ns 1875 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 24423 ns 23497 ns 1.04
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4875 ns 4834 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4833 ns 5041 ns 0.96
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 5000 ns 4875 ns 1.03
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4917 ns 4875 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 195083.5 ns 185923.5 ns 1.05
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6292 ns 5500 ns 1.14
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6167 ns 6167 ns 1
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7291.5 ns 6459 ns 1.13
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5708 ns 5583 ns 1.02
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 56884 ns 55454.5 ns 1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10542 ns 10667 ns 0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 11291 ns 11750 ns 0.96
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 11666 ns 11458 ns 1.02
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10667 ns 10667 ns 1
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 338689 ns 337381 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 292 ns 375 ns 0.78
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 333 ns 333 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 23835 ns 22737 ns 1.05
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2750 ns 2708 ns 1.02
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2750 ns 3000 ns 0.92
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 3000 ns 3000 ns 1
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2709 ns 2750 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 165106 ns 157057 ns 1.05
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 11708 ns 11625 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 12042 ns 12250 ns 0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 12916 ns 12708 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 11208 ns 11417 ns 0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 58245 ns 56422 ns 1.03
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 25166 ns 24250 ns 1.04
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24375 ns 25208 ns 0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25000 ns 25000 ns 1
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 24833 ns 25437.5 ns 0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 299833.5 ns 294376.5 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4167 ns 4167 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4167 ns 4208 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4250 ns 4167 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4208 ns 4208 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 25332.5 ns 24716 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16208 ns 16042 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 15958 ns 16417 ns 0.97
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16334 ns 16250 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16167 ns 16167 ns 1
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 203337.5 ns 193381 ns 1.05
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5875 ns 5750 ns 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5750 ns 6083 ns 0.95
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5917 ns 5750 ns 1.03
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5792 ns 5833 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 34137 ns 33569 ns 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 20875 ns 20479.5 ns 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 20542 ns 21000 ns 0.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 21208 ns 21208 ns 1
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 21041 ns 21104.5 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 181462.5 ns 174365.5 ns 1.04
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 423396 ns 375416.5 ns 1.13
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 368374.5 ns 374666.5 ns 0.98
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 485375.5 ns 488312.5 ns 0.99
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 102875 ns 524187.5 ns 0.20
batchedmm(16, Bsize=512)/forward/GPU/CUDA 67695.5 ns 66372.5 ns 1.02
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 873958 ns 931978.5 ns 0.94
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 975084 ns 880291.5 ns 1.11
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 1174250 ns 1223791.5 ns 0.96
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 327583.5 ns 1351833.5 ns 0.24
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 192402.5 ns 192149.5 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 80270.5 ns 81312.5 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 82875 ns 80750 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 83292 ns 80792 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 80375 ns 80937 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 194536.5 ns 192807 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1909166.5 ns 1932917 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1937625 ns 1916542 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1926500 ns 1926479 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1919542 ns 1921042 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 401599 ns 394461 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 333 ns 291 ns 1.14
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 333 ns 0.88
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 22634 ns 22118 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1792 ns 1750 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1792 ns 1834 ns 0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1916 ns 1834 ns 1.04
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1792 ns 1792 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 174445 ns 166019.5 ns 1.05
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6333 ns 6250 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 6666 ns 7208 ns 0.92
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7584 ns 8166 ns 0.93
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6375 ns 6312.5 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 59506.5 ns 57360.5 ns 1.04
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9250 ns 8917 ns 1.04
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8792 ns 9167 ns 0.96
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9334 ns 9208 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9209 ns 9250 ns 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 313371.5 ns 301535 ns 1.04
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 120626875 ns 156508063 ns 0.77
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 181866646 ns 173937500 ns 1.05
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 147965312.5 ns 148141208 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 109172333 ns 106478500 ns 1.03
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5482154.5 ns 5474150 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 613775374.5 ns 673237875 ns 0.91
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 579490625 ns 556883000 ns 1.04
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 454979000 ns 453960458.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 757824166.5 ns 759297583 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 34957490 ns 38204722 ns 0.92
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 651362542 ns 701496583 ns 0.93
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 688274854 ns 667076166 ns 1.03
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 584097270.5 ns 586800771 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 744092250 ns 744632000 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 59042 ns 56833 ns 1.04
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 39166 ns 48042 ns 0.82
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 48458 ns 47125 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83583 ns 84541 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 37588 ns 37576 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1895062.5 ns 1935541 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1981167 ns 1985208 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1996959 ns 1979834 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1888458 ns 1893771 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 174180 ns 174934 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 264875 ns 267875 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 273521 ns 288042 ns 0.95
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 275542 ns 270229.5 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 265312 ns 267250 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 130044 ns 128767 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 696000 ns 665041 ns 1.05
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 694979 ns 668958 ns 1.04
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 587145.5 ns 589167 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 584562.5 ns 596209 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 714810.5 ns 703647.5 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2224458 ns 2205417 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2224833 ns 2188541 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2203791.5 ns 2100166.5 ns 1.05
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2155375 ns 2225499.5 ns 0.97
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 133466 ns 133307.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5495291 ns 5538625 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5583291.5 ns 5527958 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5523041.5 ns 5503250 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5498583 ns 5491271 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 772462 ns 759584.5 ns 1.02
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 644875 ns 638667 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 646292 ns 640458 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 638000 ns 648875 ns 0.98
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 638875 ns 636167 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 46942.5 ns 47137 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1827375 ns 1796937.5 ns 1.02
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1670062.5 ns 1724292 ns 0.97
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1724167 ns 1720542 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 2105958 ns 2104520.5 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 221811 ns 218174.5 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58291 ns 57000 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 38667 ns 46833 ns 0.83
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47959 ns 47083 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 84292 ns 84542 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 28649.5 ns 28335 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2032709 ns 2047750 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2095416.5 ns 2077083 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2115084 ns 2092083 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1998167 ns 1939979 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 190814 ns 191381.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 13379583 ns 13410020.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 12465083 ns 12472750 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 12510000 ns 12570979 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 15365208 ns 15234500 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 512956 ns 512740.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 47297125 ns 47584458 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 42036749.5 ns 41911083 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 40906541 ns 41152979.5 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 58402937.5 ns 58152541 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3259192 ns 3249099 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 74125833 ns 74313208.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 91091084 ns 91931958.5 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 90945458 ns 91156000 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 98786375 ns 76595709 ns 1.29
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58791 ns 57334 ns 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 38875 ns 47417 ns 0.82
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47916 ns 47250 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 81750 ns 84375 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 47173 ns 48075 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1916875.5 ns 1930959 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1975416.5 ns 1977562.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1998896 ns 1977250 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1886479.5 ns 1816292 ns 1.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 194727.5 ns 196217.5 ns 0.99
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 333 ns 334 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 333 ns 417 ns 0.80
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 417 ns 334 ns 1.25
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 333 ns 333 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 32121 ns 32756 ns 0.98
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 5979.5 ns 6125 ns 0.98
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6500 ns 6583 ns 0.99
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6542 ns 6542 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 5959 ns 6208 ns 0.96
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 177657.5 ns 178147.5 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 250 ns 291 ns 0.86
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 291 ns 250 ns 1.16
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 31406 ns 31948 ns 0.98
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 2667 ns 2625 ns 1.02
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 2792 ns 2875 ns 0.97
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 2834 ns 2834 ns 1
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 2625 ns 2625 ns 1
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 168384.5 ns 164100 ns 1.03
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 284625167 ns 323244146 ns 0.88
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 346874125 ns 340740458 ns 1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 314223937 ns 314512041.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 271286833 ns 271130916 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 7105292.5 ns 7115553 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 997536042 ns 1053603541.5 ns 0.95
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 962898625 ns 941056333 ns 1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 836523750 ns 854610104 ns 0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1157418250 ns 1162236250 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 33940243.5 ns 33945165 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1317043896 ns 1364084083.5 ns 0.97
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1704351583 ns 1705661833 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1639291667 ns 1621953875 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1664114458 ns 1313183229.5 ns 1.27
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1463208 ns 1410000 ns 1.04
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1420208 ns 1408291.5 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1416146 ns 1453645.5 ns 0.97
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1413542 ns 1407209 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 127746 ns 127861 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5020063 ns 5051959 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5058875 ns 5013583.5 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5062958.5 ns 5028416.5 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5016917 ns 5027271 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 621557.5 ns 604299 ns 1.03
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) 172836708.5 ns 161226250 ns 1.07
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) 167838917 ns 131446875 ns 1.28
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) 129038271 ns 127042083 ns 1.02
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) 167391875 ns 155626750.5 ns 1.08
vgg16(32, 32, 3, 32)/forward/GPU/CUDA 4853519.5 ns 4974919.5 ns 0.98
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) 626867250 ns 850481958 ns 0.74
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) 577625083 ns 644255791 ns 0.90
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) 432711750 ns 496077667 ns 0.87
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) 647126958 ns 685984875 ns 0.94
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA 15994577 ns 15948822 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 8946000 ns 9064833.5 ns 0.99
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 9055083 ns 8770396 ns 1.03
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 7872791 ns 7878104.5 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 9737396 ns 10163000 ns 0.96
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1615300 ns 1608837.5 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 36179042 ns 37348729 ns 0.97
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 38214417 ns 36970124.5 ns 1.03
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 33412417 ns 33623167 ns 0.99
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 37675812 ns 38875729.5 ns 0.97
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6517064 ns 6455570 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 47459 ns 47375 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 47375 ns 47750 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 47625 ns 47583 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 47500 ns 47625 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 19122 ns 18855 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 50875 ns 50250 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 50416.5 ns 50750 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 50541 ns 50416 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 50333 ns 50292 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 228481 ns 202264 ns 1.13
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6209 ns 6375 ns 0.97
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6542 ns 7187.5 ns 0.91
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7666.5 ns 8417 ns 0.91
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7208 ns 6708 ns 1.07
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 111051 ns 108599.5 ns 1.02
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10208 ns 9604.5 ns 1.06
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9625 ns 10209 ns 0.94
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10375 ns 10292 ns 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10125 ns 10583 ns 0.96
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 703962.5 ns 610519 ns 1.15
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5709 ns 5958 ns 0.96
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5958 ns 6375 ns 0.93
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7125 ns 7583 ns 0.94
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6334 ns 5542 ns 1.14
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 160862.5 ns 131186.5 ns 1.23
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 13125 ns 12875 ns 1.02
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12792 ns 13208 ns 0.97
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13250 ns 13583 ns 0.98
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 13209 ns 12875 ns 1.03
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 610923 ns 530393 ns 1.15
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 1083 ns 1000 ns 1.08
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 1000 ns 1167 ns 0.86
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 1083 ns 1042 ns 1.04
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 1042 ns 1042 ns 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 31878 ns 32479.5 ns 0.98
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8125 ns 7833.5 ns 1.04
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7916 ns 8042 ns 0.98
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8208 ns 8083 ns 1.02
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7958 ns 7916 ns 1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 234181 ns 216406.5 ns 1.08
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 23000 ns 23042 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 23500 ns 23542 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 23375 ns 23333 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 23291.5 ns 23375 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 18958 ns 19066 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 52333 ns 52291.5 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 52542 ns 52500 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 52833 ns 53166.5 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 52750 ns 52125 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 347156 ns 309714.5 ns 1.12
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1398291 ns 1413917 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1400542 ns 1401104 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1451646 ns 1457583.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1403458 ns 1402271 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 196101 ns 196285 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5010104 ns 5045083 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5042416.5 ns 4724458 ns 1.07
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5035667 ns 5023021 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5006875.5 ns 4706104.5 ns 1.06
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 698384 ns 644560.5 ns 1.08
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3037083 ns 3086125.5 ns 0.98
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2097125 ns 2087104.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2310166 ns 2281125 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4587708 ns 4848375 ns 0.95
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 582576 ns 580262 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 24451708.5 ns 24765000.5 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 19116646 ns 18889791.5 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 18907687.5 ns 19005084 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 36689646.5 ns 36681292 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3202509 ns 3253871.5 ns 0.98
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 34075708 ns 34537875 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 28707020.5 ns 28314500 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 28058375 ns 27967000 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 41736416.5 ns 41702500 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 143730708 ns 144041208 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 147488750 ns 143168583 ns 1.03
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 126224750 ns 124247521 ns 1.02
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 173454520.5 ns 173506729 ns 1.00
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22575677 ns 22768605 ns 0.99
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 1620032250 ns 957619479 ns 1.69
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 863135021.5 ns 1175957479.5 ns 0.73
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 1509034062.5 ns 739734292 ns 2.04
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 665993458 ns 672317125 ns 0.99
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 117915974 ns 118020449 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 72166 ns 73979 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 73250 ns 75750 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 76770.5 ns 75416 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 73833.5 ns 72854.5 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 284295.5 ns 300521.5 ns 0.95
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 287666 ns 287875 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 192041.5 ns 285333 ns 0.67
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 200187.5 ns 204208 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 282000 ns 287375 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1416407.5 ns 1342742 ns 1.05
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 35556583 ns 36185500 ns 0.98
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 36666375 ns 35466000.5 ns 1.03
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 32500083 ns 32336688 ns 1.01
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 40359917 ns 40972250 ns 0.99
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5850508 ns 5837876 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 146447437.5 ns 151179834 ns 0.97
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 159338208.5 ns 151456979 ns 1.05
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 138154562.5 ns 136606104 ns 1.01
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 283678542 ns 287372208 ns 0.99
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 34916771 ns 34877857 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 121153146.5 ns 155986916 ns 0.78
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 181784520.5 ns 174507459 ns 1.04
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 148303833 ns 148111416.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 106006229 ns 102908562.5 ns 1.03
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5475461.5 ns 5463707 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 469832812.5 ns 520380250 ns 0.90
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 483820458.5 ns 465489750 ns 1.04
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 439800958 ns 439138000 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 741809084 ns 742252417 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 32272334 ns 35175845 ns 0.92
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 707200771 ns 698201250 ns 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 674004500.5 ns 654820792 ns 1.03
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 574385562.5 ns 571273229.5 ns 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 731320834 ns 850215250 ns 0.86
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) 1295833 ns 1101520.5 ns 1.18
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) 666625 ns 970208.5 ns 0.69
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) 976125 ns 920500 ns 1.06
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) 1942583 ns 1945375.5 ns 1.00
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA 578916 ns 580245.5 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) 2968500 ns 2907896 ns 1.02
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) 2508187.5 ns 2595708 ns 0.97
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) 2653229 ns 2606333 ns 1.02
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) 3698583 ns 3655000 ns 1.01
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA 1888981 ns 1734207 ns 1.09
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) 5791354.5 ns 6744875 ns 0.86
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) 5897834 ns 6498208 ns 0.91
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) 5827584 ns 6503854.5 ns 0.90
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) 2886541 ns 4423604.5 ns 0.65
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7416 ns 7208 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5292 ns 6083 ns 0.87
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6209 ns 5958.5 ns 1.04
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10042 ns 9959 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 25562 ns 25201 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 212562.5 ns 212291 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 221625 ns 220750 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 221083 ns 220125 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 207125 ns 206792 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 299268 ns 262467.5 ns 1.14
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) 307907958 ns 316552750 ns 0.97
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) 280027438 ns 221682708 ns 1.26
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) 198301979.5 ns 187257688 ns 1.06
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) 308612916 ns 311596375 ns 0.99
vgg16(32, 32, 3, 64)/forward/GPU/CUDA 7676186 ns 7676203 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) 1090327479 ns 1093022833.5 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) 1069654042 ns 911616145.5 ns 1.17
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) 805438000 ns 815656375 ns 0.99
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) 1153864583 ns 1161401125 ns 0.99
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA 26354888 ns 26547253 ns 0.99
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5312.5 ns 5292 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5584 ns 5667 ns 0.99
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7041 ns 6625 ns 1.06
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5500 ns 5125 ns 1.07
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 186016 ns 167889.5 ns 1.11
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7416 ns 7083 ns 1.05
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7125 ns 7375 ns 0.97
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7834 ns 7459 ns 1.05
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7291.5 ns 7437.5 ns 0.98
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 709229 ns 650263 ns 1.09
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 625 ns 542 ns 1.15
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 584 ns 709 ns 0.82
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 667 ns 667 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 583 ns 542 ns 1.08
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 23945 ns 23809 ns 1.01
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9375 ns 9041.5 ns 1.04
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9209 ns 9791 ns 0.94
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9500 ns 9208.5 ns 1.03
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 8833 ns 9042 ns 0.98
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 234895 ns 233459 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 353895.5 ns 351417 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 352500 ns 352250 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 352333.5 ns 353063 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 352375 ns 353333 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 21675 ns 21613 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 826000 ns 791250 ns 1.04
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 835458.5 ns 808979 ns 1.03
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 776312.5 ns 773625 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 824520.5 ns 824084 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 308824 ns 305844 ns 1.01
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 338833 ns 314958 ns 1.08
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 323020.5 ns 333625 ns 0.97
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 453208 ns 448667 ns 1.01
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 10770.5 ns 331833 ns 0.032
batchedmm(16, Bsize=32)/forward/GPU/CUDA 17775 ns 17811 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 712979 ns 682125 ns 1.05
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 727625 ns 746791.5 ns 0.97
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 1002292 ns 1029167 ns 0.97
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 27291.5 ns 700937.5 ns 0.039
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 295524 ns 273907.5 ns 1.08
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 379625 ns 328083 ns 1.16
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 329041 ns 348979 ns 0.94
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 440521.5 ns 424375 ns 1.04
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 30208.5 ns 370666 ns 0.081
batchedmm(16, Bsize=128)/forward/GPU/CUDA 22403 ns 22237 ns 1.01
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 734125 ns 743604 ns 0.99
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 779520.5 ns 750229 ns 1.04
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 1028562.5 ns 1076375 ns 0.96
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 105500 ns 822541 ns 0.13
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 258280.5 ns 220485.5 ns 1.17
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 3459 ns 3334 ns 1.04
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 3791 ns 3792 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 3792 ns 3625 ns 1.05
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 3625 ns 3583 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 17955 ns 18068 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 4209 ns 4166 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 4333 ns 4542 ns 0.95
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 4292 ns 4250 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 4250 ns 4334 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 290816 ns 278097 ns 1.05
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3625 ns 3292 ns 1.10
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 3958 ns 3645.5 ns 1.09
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4500 ns 4708 ns 0.96
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3500 ns 4042 ns 0.87
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 238246.5 ns 212235.5 ns 1.12
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8417 ns 8042 ns 1.05
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8083 ns 8417 ns 0.96
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8625 ns 8792 ns 0.98
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8500 ns 8167 ns 1.04
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 1225625 ns 1255478 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 203166 ns 204000 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 211625 ns 211375 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 211292 ns 211042 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 199292 ns 200541 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 34921 ns 34367 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 648729.5 ns 605708.5 ns 1.07
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 675041 ns 625021 ns 1.08
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 622125 ns 620792 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 627292 ns 582583 ns 1.08
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 346070.5 ns 361289.5 ns 0.96
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 992125 ns 973333 ns 1.02
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 1031125.5 ns 950209 ns 1.09
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 953834 ns 955541 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 861124.5 ns 1286000.5 ns 0.67
batchedmm(128, Bsize=128)/forward/GPU/CUDA 206641 ns 207830 ns 0.99
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 4533291 ns 4594084 ns 0.99
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 4839770.5 ns 4500750.5 ns 1.08
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 4423375 ns 4304583 ns 1.03
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 5168938 ns 6304625 ns 0.82
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 926931 ns 925479 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3209 ns 3333 ns 0.96
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3854.5 ns 3583 ns 1.08
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4458 ns 4250 ns 1.05
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 2834 ns 3541 ns 0.80
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 229027 ns 240989.5 ns 0.95
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7375 ns 6875 ns 1.07
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7291 ns 7542 ns 0.97
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7250 ns 7375 ns 0.98
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6875 ns 7042 ns 0.98
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 1010836.5 ns 1039649.5 ns 0.97
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1640708.5 ns 1636792 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1182958 ns 1175749.5 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1363875 ns 1347167 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2437104 ns 2463271 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA 215070 ns 213096 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12367625 ns 12388416 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9635542 ns 9551437.5 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9254375 ns 9305937.5 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18041020.5 ns 18088000 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1951821.5 ns 1951605 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17387209 ns 17398084 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14500084 ns 14348854.5 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14366416.5 ns 14347271 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21053249.5 ns 21112104 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 124458.5 ns 94729.5 ns 1.31
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 93333 ns 90667 ns 1.03
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 92667 ns 92375 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 87625 ns 114395.5 ns 0.77
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 126571 ns 125574 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2029833 ns 2039792 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2043291 ns 1808208.5 ns 1.13
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2043542 ns 2033666.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2024104 ns 2022500 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1034383 ns 1052869 ns 0.98
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 2917 ns 326041.5 ns 0.0089
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 2041 ns 344833 ns 0.0059
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 2833 ns 396416 ns 0.0071
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 3292 ns 314708 ns 0.010
batchedmm(2, Bsize=4)/forward/GPU/CUDA 15778 ns 15677 ns 1.01
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 2750 ns 701042 ns 0.0039
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 2459 ns 733209 ns 0.0034
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 2833 ns 1020500 ns 0.0028
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 2792 ns 656250 ns 0.0043
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 192972 ns 196145.5 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7375 ns 7084 ns 1.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5416 ns 5541 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6084 ns 6084 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10042 ns 10000 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 34424 ns 34060 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 221625 ns 221166.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 221250 ns 220916.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 220834 ns 220167 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 219333.5 ns 217124.5 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 316465.5 ns 344547 ns 0.92
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3708 ns 3750 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3750 ns 3709 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3750 ns 3708 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3708 ns 3667 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 22906 ns 22568 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14458 ns 14167 ns 1.02
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 14250 ns 14375 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14417 ns 14458 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14333 ns 14416 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 476234 ns 487124.5 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 117770.5 ns 97500 ns 1.21
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 98104 ns 93417 ns 1.05
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 96500 ns 96687.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 91584 ns 91875 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 126019 ns 124929 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1930000 ns 1940875 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1937167 ns 1919916.5 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1921021 ns 1931229.5 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1923125 ns 1917271.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 939452.5 ns 955641 ns 0.98
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) 867604 ns 854084 ns 1.02
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) 807896 ns 826333 ns 0.98
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) 1207166.5 ns 1211000 ns 1.00
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) 951167 ns 955354.5 ns 1.00
lenet(28, 28, 1, 32)/forward/GPU/CUDA 276975 ns 272141 ns 1.02
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) 2825833 ns 2801124.5 ns 1.01
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) 2537062.5 ns 2515333 ns 1.01
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) 3318041 ns 3309625 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) 3363250 ns 3416625 ns 0.98
lenet(28, 28, 1, 32)/zygote/GPU/CUDA 1571603 ns 1612126.5 ns 0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 16041.5 ns 17062.5 ns 0.94
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 16083 ns 16708.5 ns 0.96
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 17625 ns 18937 ns 0.93
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 14770.5 ns 15167 ns 0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 142076.5 ns 142123.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 227020.5 ns 223437.5 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 262125 ns 215958 ns 1.21
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 216500 ns 216125 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 228417 ns 255708.5 ns 0.89
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 633913.5 ns 644779 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 220958 ns 222292 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 223166 ns 221750 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 221583.5 ns 222542 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 219000 ns 220917 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 268138.5 ns 271274.5 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 555875 ns 509083 ns 1.09
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 563417 ns 501292 ns 1.12
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 499167 ns 496750 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 504708 ns 550583 ns 0.92
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1310913 ns 1401190 ns 0.94
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 3667 ns 304437.5 ns 0.012
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 4541 ns 331687.5 ns 0.014
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 4875 ns 376292 ns 0.013
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 3625 ns 321812.5 ns 0.011
batchedmm(16, Bsize=4)/forward/GPU/CUDA 16636 ns 16554 ns 1.00
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 7125 ns 708875 ns 0.010
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 7083 ns 736875 ns 0.0096
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 7375 ns 1020209 ns 0.0072
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 7333 ns 668458 ns 0.011
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 193281.5 ns 196065 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 18708 ns 17854 ns 1.05
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 19625 ns 18520.5 ns 1.06
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 19542 ns 19667 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 16520.5 ns 16209 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 144540.5 ns 146750.5 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 223396 ns 247604 ns 0.90
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 221292 ns 212500 ns 1.04
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 213375 ns 212917 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 222666 ns 211750.5 ns 1.05
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 896239 ns 1011803 ns 0.89
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 4000 ns 4125 ns 0.97
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 4458 ns 4125 ns 1.08
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5395.5 ns 5187.5 ns 1.04
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 4250 ns 4084 ns 1.04
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 180460 ns 201325 ns 0.90
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 10959 ns 10667 ns 1.03
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10416 ns 10875 ns 0.96
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10500 ns 10500 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10417 ns 10375 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 1021118.5 ns 1050725 ns 0.97
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3042 ns 3375 ns 0.90
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3750 ns 3625 ns 1.03
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4541 ns 4167 ns 1.09
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3292 ns 3291 ns 1.00
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 219358 ns 242454 ns 0.90
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7375 ns 7542 ns 0.98
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7500 ns 7666 ns 0.98
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7667 ns 7750 ns 0.99
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7334 ns 7333 ns 1.00
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 1035983 ns 1067571 ns 0.97
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23583541 ns 24057353.5 ns 0.98
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 43838187.5 ns 34753459 ns 1.26
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37729834 ns 37792125 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 35203208 ns 34828583.5 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1832361.5 ns 1854184 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 184224084 ns 187222542 ns 0.98
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 173933854 ns 160010375 ns 1.09
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 146952229.5 ns 146721854.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 410469417 ns 412776417 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 16498571 ns 16508303 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 424640333 ns 437495583 ns 0.97
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 261807375 ns 253838438 ns 1.03
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 295219708.5 ns 232343979.5 ns 1.27
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 479699708 ns 483540875 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 182437.5 ns 183854 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 184959 ns 183625 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 185625 ns 185334 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 182958 ns 184167 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 176456.5 ns 220968 ns 0.80
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 629208 ns 594000 ns 1.06
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 611645.5 ns 632437.5 ns 0.97
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 588334 ns 586084 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 596292 ns 628500 ns 0.95
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1007095.5 ns 1061303.5 ns 0.95
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 3908958 ns 3892042 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 4164979 ns 3642708 ns 1.14
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 3539625 ns 3572042 ns 0.99
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 4551958.5 ns 5353250 ns 0.85
batchedmm(128, Bsize=512)/forward/GPU/CUDA 532362 ns 549368 ns 0.97
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 17307166 ns 17901624.5 ns 0.97
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 18330542 ns 17281292 ns 1.06
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 16469187.5 ns 16574875 ns 0.99
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 20270062.5 ns 22050250 ns 0.92
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2616105 ns 2630980 ns 0.99
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 584 ns 541 ns 1.08
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 583 ns 625 ns 0.93
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 708 ns 584 ns 1.21
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 583 ns 584 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 31875 ns 31762 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9437.5 ns 9145.5 ns 1.03
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9459 ns 9208 ns 1.03
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9291.5 ns 9417 ns 0.99
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9125 ns 9208 ns 0.99
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 261192 ns 262912.5 ns 0.99
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) 498957667 ns 505346750 ns 0.99
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) 506947291 ns 429818666.5 ns 1.18
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) 420975459 ns 433256333.5 ns 0.97
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) 674716520.5 ns 677373875 ns 1.00
vgg16(32, 32, 3, 128)/forward/GPU/CUDA 12478081 ns 12487373 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) 1869574750 ns 2066713500 ns 0.90
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) 1653016000 ns 1635890000 ns 1.01
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) 1493871875 ns 1494391792 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) 2197444312 ns 2208031208.5 ns 1.00
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA 49046932 ns 49163495.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1642250 ns 1632500.5 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1190416 ns 1173583 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1384604 ns 1383958 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2488917 ns 2483292 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 216008 ns 214736 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12763416 ns 12776042 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9992083 ns 9939062.5 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9689979.5 ns 9686917 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18446958 ns 18349375 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2006858 ns 2056758 ns 0.98
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17699416 ns 17758729.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14794062.5 ns 14689958 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14579833 ns 14551125 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21407792 ns 21399666 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 26291 ns 26250 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 26292 ns 26292 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 26667 ns 26333 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 26250 ns 26250 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 23694 ns 24146 ns 0.98
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 66834 ns 66791 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 66917 ns 67292 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 67250 ns 68417 ns 0.98
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66917 ns 66709 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 381806 ns 391053.5 ns 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 202917 ns 204333 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 209208 ns 210125 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 210166 ns 209458 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 199750 ns 198792 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 26205 ns 26289 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 611521 ns 642083 ns 0.95
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 672625 ns 624354.5 ns 1.08
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 625292 ns 621729.5 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 634083 ns 627000.5 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 311868 ns 357106 ns 0.87
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 604020.5 ns 645625 ns 0.94
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 659396 ns 636292 ns 1.04
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 550042 ns 602667 ns 0.91
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 646458 ns 672375 ns 0.96
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 131537 ns 132245.5 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2232396 ns 2294979 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2281167 ns 2157208 ns 1.06
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2260354 ns 2246208 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2241125 ns 2249458 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1133773 ns 1236985 ns 0.92
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 19291.5 ns 17937.5 ns 1.08
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 20063 ns 18416.5 ns 1.09
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 19875 ns 20083 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 17166 ns 18895.5 ns 0.91
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 143835 ns 145580 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 218416 ns 259583 ns 0.84
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 267625 ns 261791 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 220729.5 ns 219084 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 258250 ns 257520.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 938699.5 ns 1034996 ns 0.91
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 666 ns 542 ns 1.23
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 584 ns 667 ns 0.88
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 708 ns 625 ns 1.13
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 583 ns 583 ns 1
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 22958 ns 23604 ns 0.97
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9708 ns 9750 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 10084 ns 10292 ns 0.98
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 10041 ns 10250 ns 0.98
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9291 ns 9333 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 253769.5 ns 260113.5 ns 0.98
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5687.5 ns 5083.5 ns 1.12
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5729.5 ns 5792 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6458 ns 6833 ns 0.95
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5291.5 ns 5375 ns 0.98
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 185951.5 ns 229273.5 ns 0.81
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7417 ns 6709 ns 1.11
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7458 ns 7667 ns 0.97
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7541.5 ns 7583 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7500 ns 6937.5 ns 1.08
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 731372.5 ns 777061.5 ns 0.94
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 2125 ns 1917 ns 1.11
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 2333 ns 2500 ns 0.93
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 2333 ns 2208 ns 1.06
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 2167 ns 2250 ns 0.96
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 18317 ns 18340 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 6542 ns 6542 ns 1
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 6542 ns 6667 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 6959 ns 6666 ns 1.04
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 6667 ns 6584 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 307304.5 ns 320616.5 ns 0.96
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 749042 ns 750542 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 758479 ns 746792 ns 1.02
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 749583 ns 746916 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 752500.5 ns 750584 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 21294 ns 21795 ns 0.98
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 775167 ns 805145.5 ns 0.96
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 792834 ns 791604 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 775375 ns 772584 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 792083.5 ns 810645.5 ns 0.98
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 295315 ns 302046.5 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7333 ns 6959 ns 1.05
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5292 ns 5917 ns 0.89
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6083 ns 6000 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10167 ns 10167 ns 1
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 32899.5 ns 32896 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 219167 ns 228770.5 ns 0.96
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 270000 ns 227709 ns 1.19
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 228875 ns 228084 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 225958 ns 225625.5 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 320960.5 ns 359979 ns 0.89
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 9833 ns 10250 ns 0.96
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 10959 ns 10208 ns 1.07
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 11209 ns 11042 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 10458 ns 9958 ns 1.05
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 211522.5 ns 245976 ns 0.86
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 23833.5 ns 24896 ns 0.96
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24875 ns 24000 ns 1.04
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25458 ns 25416.5 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 24458.5 ns 24625 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 1049023 ns 1114734 ns 0.94
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 106078646 ns 106794687 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 126381834 ns 118367979 ns 1.07
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 121177729 ns 120992291 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 117537563 ns 118045833 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 2652799 ns 2655666 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 391974416 ns 397097667 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 380681375 ns 368138875 ns 1.03
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 356763209 ns 357737125 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 480738708 ns 483722209 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 15212996 ns 15195689 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 753414729 ns 769405854 ns 0.98
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 774115708 ns 762934333 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 748053687.5 ns 748099729.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 944047458.5 ns 772112770.5 ns 1.22
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6916.5 ns 6417 ns 1.08
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7333 ns 7375 ns 0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8542 ns 8187 ns 1.04
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7104.5 ns 8708.5 ns 0.82
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 230215 ns 243458.5 ns 0.95
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14458 ns 13625 ns 1.06
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14417 ns 14834 ns 0.97
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14292 ns 14834 ns 0.96
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14083 ns 14000 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 1044880 ns 1081512.5 ns 0.97
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6041 ns 5500 ns 1.10
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6542 ns 6083.5 ns 1.08
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7042 ns 7500 ns 0.94
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6021 ns 5625 ns 1.07
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 227729 ns 236881 ns 0.96
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12750 ns 12583 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12625 ns 12750 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12750 ns 13000 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12750 ns 12542 ns 1.02
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 749467.5 ns 792100 ns 0.95
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 5292 ns 328937.5 ns 0.016
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 6041 ns 345250 ns 0.017
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 6458 ns 398625 ns 0.016
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 5542 ns 315687.5 ns 0.018
batchedmm(2, Bsize=128)/forward/GPU/CUDA 16776 ns 17026 ns 0.99
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 15791 ns 701750 ns 0.023
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 15375 ns 734417 ns 0.021
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 15583 ns 1025666 ns 0.015
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 15666 ns 663750 ns 0.024
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 196800 ns 202330 ns 0.97
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 417 ns 0.90
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 416 ns 0.90
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 334 ns 292 ns 1.14
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 23311 ns 23795 ns 0.98
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6458 ns 6250 ns 1.03
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6458 ns 6750 ns 0.96
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6625 ns 6500 ns 1.02
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6458 ns 6104.5 ns 1.06
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 237753.5 ns 242897.5 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5875 ns 5875 ns 1
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 5916 ns 6042 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 6000 ns 5917 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5875 ns 5875 ns 1
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 24206 ns 24778 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 21292 ns 21834 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 20834 ns 21542 ns 0.97
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 21500 ns 21750 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 20708.5 ns 21417 ns 0.97
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 259866 ns 265364.5 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 143791 ns 184375 ns 0.78
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 158270.5 ns 185000 ns 0.86
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 147479 ns 149541 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 149250 ns 190750 ns 0.78
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 166782 ns 168165 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1333750 ns 1361667 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1370333.5 ns 1306875.5 ns 1.05
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1335416 ns 1318541.5 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1320333.5 ns 1332084 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1283264 ns 1372553 ns 0.93
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 23708.5 ns 24458 ns 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 23792 ns 22729 ns 1.05
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 24458 ns 25000 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 24354.5 ns 22374.5 ns 1.09
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 285298 ns 355948 ns 0.80
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 126459 ns 176958 ns 0.71
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 148916 ns 131167 ns 1.14
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 119292 ns 126166.5 ns 0.95
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 173833 ns 177542 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1391431.5 ns 1491511 ns 0.93
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 417 ns 0.90
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 292 ns 333 ns 0.88
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 22764 ns 23138 ns 0.98
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6520.5 ns 6125 ns 1.06
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6500 ns 6917 ns 0.94
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6770.5 ns 6667 ns 1.02
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6417 ns 6250 ns 1.03
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 253763.5 ns 259300 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 4583.5 ns 4458 ns 1.03
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4750 ns 4875 ns 0.97
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 5583 ns 5708.5 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4584 ns 4833 ns 0.95
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 243103.5 ns 258768.5 ns 0.94
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9812.5 ns 9709 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9750 ns 10083 ns 0.97
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10250 ns 10417 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10250 ns 10041.5 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 1309296.5 ns 1358754 ns 0.96
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1584 ns 1625 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1625 ns 1666 ns 0.98
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1625 ns 1667 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1584 ns 1583 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 22984 ns 23306 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 5667 ns 5625 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 5709 ns 6125 ns 0.93
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 5958 ns 6041 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 5625 ns 5625 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 272375 ns 275587 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 6807375 ns 6813916.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 6369645.5 ns 6428416 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 6576041.5 ns 6554167 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 7693187.5 ns 7571104.5 ns 1.02
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 214270 ns 213811 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 24061771 ns 24163500 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 21342083.5 ns 21359167 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 21083084 ns 21066083 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 29748249.5 ns 29670209 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2098936 ns 2101483 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 37348771 ns 37462416 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 45817791 ns 45862833.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 46004333 ns 45876667 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 49463688 ns 38235959 ns 1.29
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5750 ns 5459 ns 1.05
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6334 ns 6250 ns 1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6958 ns 6958 ns 1
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5895.5 ns 5292 ns 1.11
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 229337.5 ns 238588.5 ns 0.96
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8458 ns 7959 ns 1.06
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8959 ns 8334 ns 1.07
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8417 ns 8250 ns 1.02
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 7959 ns 8250 ns 0.96
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 1021096 ns 1068264.5 ns 0.96
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) 1557541 ns 1529292 ns 1.02
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) 1250583 ns 1266666.5 ns 0.99
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) 1626375 ns 1623709 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) 2091875 ns 2163750 ns 0.97
lenet(28, 28, 1, 128)/forward/GPU/CUDA 272400.5 ns 279544 ns 0.97
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) 7906292 ns 7968292 ns 0.99
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) 6655979 ns 6533250 ns 1.02
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) 7142375 ns 7125792 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) 10434875 ns 10479375 ns 1.00
lenet(28, 28, 1, 128)/zygote/GPU/CUDA 1820689 ns 1874497 ns 0.97
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 368291 ns 320667 ns 1.15
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 352000 ns 346291 ns 1.02
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 460542 ns 428584 ns 1.07
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 24791 ns 345375 ns 0.072
batchedmm(128, Bsize=4)/forward/GPU/CUDA 45978 ns 46619.5 ns 0.99
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 733708.5 ns 745958.5 ns 0.98
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 814750 ns 791666.5 ns 1.03
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 1065167 ns 1073208.5 ns 0.99
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 94833 ns 776479 ns 0.12
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 277791 ns 311670 ns 0.89
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 397750 ns 396708.5 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 213375 ns 287917 ns 0.74
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 287917 ns 288250 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 755083 ns 753417 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 44009 ns 44556 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 673708 ns 645167 ns 1.04
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 474958 ns 527667 ns 0.90
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 533292 ns 532000 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 972958 ns 974292 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 189948 ns 190424 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 651833 ns 668958 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 663375.5 ns 629749.5 ns 1.05
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 610708 ns 544375 ns 1.12
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 646208 ns 643396 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 131710 ns 132592.5 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2462750 ns 2485646 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2486896 ns 2448562.5 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2493833 ns 2450292 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2462666 ns 2461146 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1171820 ns 1408688 ns 0.83
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 3770.5 ns 324000.5 ns 0.012
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 2250 ns 344459 ns 0.007
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 4125 ns 396583 ns 0.010
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 3354 ns 314083.5 ns 0.011
batchedmm(2, Bsize=32)/forward/GPU/CUDA 16239 ns 16193 ns 1.00
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 5583 ns 700875 ns 0.008
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 5292 ns 734292 ns 0.007
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 5542 ns 1020625 ns 0.005
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 5500 ns 656584 ns 0.008
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 194908.5 ns 201017 ns 0.97
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1457334 ns 1461042 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1493416 ns 1503750 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1499125 ns 1504625 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1437458 ns 1442917 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 40269 ns 40991 ns 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5127875.5 ns 5155750 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5301459 ns 5279833.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5323375 ns 5308333.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4984791.5 ns 4987604 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 196759 ns 200839 ns 0.98
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3667 ns 3750 ns 0.98
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3708 ns 3709 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3708 ns 3667 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3708 ns 3708 ns 1
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 33034 ns 33187 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 15084 ns 14958 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15208 ns 15395.5 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15250 ns 15375 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15208 ns 15083 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 366758 ns 379072.5 ns 0.97
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 71250 ns 71541 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 71375 ns 71542 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 71375 ns 71270.5 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 71291 ns 71083 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 113846 ns 112914 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 317292 ns 325333 ns 0.98
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 326625 ns 320729.5 ns 1.02
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 318459 ns 318792 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 318042 ns 317333 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 194171.5 ns 193733 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 1041 ns 1000 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 1083 ns 1125 ns 0.96
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 1083 ns 1083 ns 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 1000 ns 1000 ns 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 23364.5 ns 23845 ns 0.98
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8209 ns 7750 ns 1.06
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8042 ns 8583 ns 0.94
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8125 ns 8500 ns 0.96
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8208 ns 7750 ns 1.06
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 257659 ns 262768.5 ns 0.98
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 508125 ns 456417 ns 1.11
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 478917 ns 472584 ns 1.01
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 566520.5 ns 554479 ns 1.02
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 219792 ns 550167 ns 0.40
batchedmm(128, Bsize=32)/forward/GPU/CUDA 129544.5 ns 128330 ns 1.01
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 1385208 ns 1408750 ns 0.98
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 1467875 ns 1380958 ns 1.06
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 1768791.5 ns 1632666.5 ns 1.08
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 871958 ns 1597604 ns 0.55
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 273249 ns 274089 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 375 ns 334 ns 1.12
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 375 ns 417 ns 0.90
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 417 ns 375 ns 1.11
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 333 ns 333 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 31676 ns 31588 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6334 ns 6083 ns 1.04
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6333 ns 6750 ns 0.94
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6416 ns 6458 ns 0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6500 ns 6125 ns 1.06
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 261677.5 ns 263587.5 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1721292 ns 1767792 ns 0.97
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1726209 ns 1726375 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1723750 ns 1725708 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1729333 ns 1773250 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 168454 ns 168887 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4357145.5 ns 4406958 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4398937.5 ns 4358916 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4385500 ns 4369792 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4371333 ns 4367125 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1121300 ns 1241756.5 ns 0.90
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 6666 ns 6750 ns 0.99
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 6917 ns 7000 ns 0.99
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 6709 ns 6792 ns 0.99
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 6917 ns 6750 ns 1.02
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 20771 ns 19512 ns 1.06
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 51750 ns 51584 ns 1.00
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 52271 ns 48771 ns 1.07
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 33125 ns 33250 ns 1.00
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 51166 ns 52958 ns 0.97
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 211174 ns 210086 ns 1.01
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 17209 ns 328750 ns 0.052
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 18167 ns 344958 ns 0.053
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 19667 ns 408250 ns 0.048
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 17875 ns 323500 ns 0.055
batchedmm(2, Bsize=512)/forward/GPU/CUDA 18120 ns 18058 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 53583 ns 719583.5 ns 0.074
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 53083 ns 735666.5 ns 0.072
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 53208 ns 1034250 ns 0.051
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 53333 ns 684646 ns 0.078
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 332596 ns 345041 ns 0.96
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 75292 ns 75459 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 75542 ns 75292 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 75625 ns 75167 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 75292 ns 75333 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 46820.5 ns 46969 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 327334 ns 332833 ns 0.98
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 334083 ns 325833 ns 1.03
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 325708 ns 324583 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 324458 ns 323834 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 210121 ns 207979 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1484208 ns 1487708 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1520208 ns 1530375 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1525917 ns 1530750 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1462875 ns 1466417 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 51807 ns 51505.5 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5121958 ns 5146312.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5309458 ns 5151604.5 ns 1.03
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5315875 ns 5003270.5 ns 1.06
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4991875 ns 4984709 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 202097 ns 205494.5 ns 0.98
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 28250 ns 28250 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 28250 ns 28334 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 28250 ns 28333 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 28209 ns 28167 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 24786 ns 24407 ns 1.02
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 66250 ns 66500 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 66417 ns 66375 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 66709 ns 67458 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66375 ns 66417 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 508762 ns 525547 ns 0.97
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 1466375 ns 1383749.5 ns 1.06
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 930645.5 ns 1059771 ns 0.88
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 1076708 ns 1061458 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) 2109333.5 ns 2248687.5 ns 0.94
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA 584770.5 ns 581876.5 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) 3071000 ns 3035479 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 2631084 ns 2745250 ns 0.96
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) 2776875 ns 2740958 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) 3811833.5 ns 3811500 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA 2004790.5 ns 2064611 ns 0.97
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) 7904896 ns 8921042 ns 0.89
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) 8008541 ns 8776625 ns 0.91
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) 7950354.5 ns 8768729.5 ns 0.91
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) 4825709 ns 6359583 ns 0.76
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 119062.5 ns 82083.5 ns 1.45
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 137041 ns 81562.5 ns 1.68
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 82375 ns 83125 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 80979 ns 80583 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 194253.5 ns 192403.5 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1969750.5 ns 2040625 ns 0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2032792 ns 1935354.5 ns 1.05
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1755791.5 ns 2023083 ns 0.87
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2016250.5 ns 2003562.5 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 780655 ns 805958 ns 0.97

This comment was automatically generated by workflow using github-action-benchmark.
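
For anyone checking the table by hand: the ratio column appears to be the current timing divided by the previous one, so values below 1 mean the current commit is faster. The sketch below assumes that definition and reproduces three rows copied from the table. The `ratio` and `slower_than` helpers and the 1.2 threshold are hypothetical names and values chosen for illustration, not taken from this workflow's configuration.

```python
# Rows copied from the benchmark table: (name, current_ns, previous_ns).
ROWS = [
    ("batchedmm(2, Bsize=512)/forward/CPU/2 thread(s)", 17209, 328750),
    ("mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s)", 4825709, 6359583),
    (
        "groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)"
        "/forward/CPU/4 thread(s)",
        137041,
        81562.5,
    ),
]

# Hypothetical cutoff for "noticeably slower"; not the workflow's real setting.
ALERT_THRESHOLD = 1.2


def ratio(current_ns, previous_ns):
    """Assumed definition of the ratio column: current / previous."""
    return current_ns / previous_ns


def slower_than(rows, threshold=ALERT_THRESHOLD):
    """Return the names of rows whose ratio exceeds the threshold."""
    return [name for name, cur, prev in rows if ratio(cur, prev) > threshold]


if __name__ == "__main__":
    for name, cur, prev in ROWS:
        print(f"{name}: {ratio(cur, prev):.2f}")
    print("slower than threshold:", slower_than(ROWS))
```

Under this reading, the CPU `batchedmm` rows are roughly 13x to 20x faster than the previous commit, while the 2- and 4-thread `groupnorm` forward rows are the only large slowdowns in this part of the table.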

avik-pal (Member, Author) commented Nov 18, 2024

@wsmoses any idea what the following means? (This is preceded by a 19K LLVM dump.)

initfn=define private void @jlplt_ijl_set_task_tid_448994({} addrspace(10)* %0, i32 %1) #9 {
top:
  %2 = load atomic void ()*, void ()** null unordered, align 8
  %3 = icmp ne void ()* %2, null
  br i1 %3, label %ccall, label %dlsym

dlsym:                                            ; preds = %top
  %4 = call void ()* @ijl_load_and_lookup(i8* inttoptr (i64 3 to i8*), i8* getelementptr inbounds ([17 x i8], [17 x i8]* @_j_str_ijl_set_task_tid_21, i32 0, i32 0), i8** @jl_libjulia_internal_handle)
  store atomic void ()* %4, void ()** null release, align 8
  br label %ccall

ccall:                                            ; preds = %dlsym, %top
  %5 = phi void ()* [ %2, %top ], [ %4, %dlsym ]
  %6 = bitcast void ()* %5 to void ({} addrspace(10)*, i32)*
  %7 = bitcast void ({} addrspace(10)*, i32)* %6 to void ()*
  store atomic void ()* %7, void ()** @jlplt_ijl_set_task_tid_448994_got release, align 8
  musttail call void %6({} addrspace(10)* %0, i32 %1)
  ret void
}

loadfn=  %2 = load atomic void ()*, void ()** null unordered, align 8
opv=void ()** null

It seems to show up once I trigger parallel tests, which is very strange. Before I run parallel tests, the same code just works...

@avik-pal avik-pal force-pushed the ap/enz branch 13 times, most recently from f047259 to 8f246a3 Compare November 18, 2024 17:29
wsmoses (Contributor) commented Nov 18, 2024

@avik-pal can you open an issue with the full dump?

wsmoses (Contributor) commented Nov 18, 2024

and can you see if using https://github.com/EnzymeAD/Enzyme.jl/pull/2068/files fixes it?

@avik-pal avik-pal force-pushed the ap/enz branch 6 times, most recently from 92f1f7a to d586e10 Compare November 19, 2024 20:42
@avik-pal avik-pal force-pushed the ap/enz branch 2 times, most recently from 32e9e57 to af3cfb9 Compare November 21, 2024 15:54
@avik-pal avik-pal changed the title test: try re-enabling enzyme testing on 0.13.14 test: try re-enabling enzyme testing on 0.13.16 Nov 21, 2024
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
avik-pal (Member, Author) commented

Tests are now (mostly) happy. Need to look into downgrade eventually, but I will call this a victory and merge.

@avik-pal avik-pal merged commit 132619c into main Nov 21, 2024
52 of 59 checks passed
@avik-pal avik-pal deleted the ap/enz branch November 21, 2024 21:47