feat: support passing in device and client to XLA #1020

Merged: 4 commits, Nov 5, 2024

Conversation

avik-pal (Member) commented Nov 3, 2024

No description provided.

avik-pal added the xla label Nov 3, 2024
avik-pal force-pushed the ap/xla_args branch 3 times, most recently from 57de396 to 2e5ff52, November 3, 2024 21:22
avik-pal (Member, Author) commented Nov 3, 2024

I want to wait for EnzymeAD/Reactant.jl#222, which exposes device, so that we don't have to make direct XLA calls.

avik-pal marked this pull request as draft November 4, 2024 01:36

github-actions bot (Contributor) commented Nov 4, 2024

Benchmark Results (ASV)

| Benchmark | main | 2e5ff52... | main/2e5ff52866c81e... |
|---|---|---|---|
| basics/overhead | 0.121 ± 0.00089 μs | 0.125 ± 0.0013 μs | 0.969 |
| time_to_load | 0.967 ± 0.0074 s | 0.971 ± 0.022 s | 0.996 |

Benchmark Plots

A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
Go to "Actions"->"Benchmark a pull request"->[the most recent run]->"Artifacts" (at the bottom).

github-actions bot (Contributor) left a comment


Lux Benchmarks

Benchmark suite Current: 1514e7d Previous: 8bfa628 Ratio
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 4375 ns 4625 ns 0.95
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4375 ns 4084 ns 1.07
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 5938 ns 5791 ns 1.03
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4125 ns 4292 ns 0.96
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 60068 ns 60959 ns 0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 10042 ns 10125 ns 0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 10291 ns 9959 ns 1.03
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 10042 ns 10375 ns 0.97
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 10458 ns 10666 ns 0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 424087.5 ns 427044 ns 0.99
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 959 ns 1167 ns 0.82
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 1250 ns 1250 ns 1
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 1542 ns 1458 ns 1.06
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 3042 ns 3542 ns 0.86
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 18235 ns 18260 ns 1.00
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 4084 ns 4125 ns 0.99
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 4208 ns 3833 ns 1.10
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4209 ns 4125 ns 1.02
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 4041 ns 4000 ns 1.01
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 111236.5 ns 111381 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57000 ns 57709 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46417 ns 47250 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 38292 ns 38250 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 81667 ns 80333 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 37054 ns 37655 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2042458 ns 2026167 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2082375 ns 2092708.5 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2085396 ns 2059625.5 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1988541 ns 1993416 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 196865 ns 197377 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 144042 ns 152958 ns 0.94
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 146124.5 ns 148250 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 145770.5 ns 146417 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 148000 ns 150375 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 165635 ns 167595 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1112500 ns 1098542 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1113875 ns 1124250 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1147125 ns 1116146 ns 1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1116187.5 ns 1107229.5 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 526497 ns 523151 ns 1.01
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 4250 ns 3584 ns 1.19
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 3542 ns 3625 ns 0.98
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4916 ns 5708.5 ns 0.86
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3333 ns 3417 ns 0.98
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 71425.5 ns 70157 ns 1.02
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8834 ns 8834 ns 1
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9500 ns 8667 ns 1.10
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8667 ns 9291 ns 0.93
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9083 ns 9042 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 480134 ns 492826.5 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 16875 ns 17000 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 17083 ns 16375 ns 1.04
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 17667 ns 18667 ns 0.95
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 16958 ns 17083 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 55287.5 ns 54850 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 225708 ns 213146 ns 1.06
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 221375 ns 216104 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 221125 ns 214167 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 213667 ns 225333 ns 0.95
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 276702.5 ns 272672.5 ns 1.01
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 500 ns 459 ns 1.09
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 542 ns 542 ns 1
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 667 ns 709 ns 0.94
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 459 ns 583 ns 0.79
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 17631 ns 17542 ns 1.01
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1459 ns 1708 ns 0.85
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1541 ns 1458 ns 1.06
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1583.5 ns 1625 ns 0.97
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1708 ns 1750 ns 0.98
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 104160 ns 104205 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7167 ns 7250 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5875 ns 5833 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5208 ns 5209 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9917 ns 4000 ns 2.48
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 24094 ns 23961 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 274729 ns 228750.5 ns 1.20
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 271146 ns 228333 ns 1.19
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 238083.5 ns 228500 ns 1.04
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 227646.5 ns 226334 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 170934.5 ns 170956 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 3834 ns 3875 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 3875 ns 3875 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 3916 ns 3916 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 3834 ns 3834 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 23993 ns 23832 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16875 ns 16833 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16750 ns 16708 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 19500 ns 16708 ns 1.17
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16750 ns 16958 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 164614 ns 165501.5 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 575542 ns 579042 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 584000 ns 574375 ns 1.02
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 605667 ns 575083 ns 1.05
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 572083 ns 576292 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 113682 ns 113664 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1428709 ns 1417708 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1417020.5 ns 1429333 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1455458 ns 1425729.5 ns 1.02
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 1422708 ns 1422208 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 214581 ns 214791 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) 1083708 ns 1082104 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) 966124.5 ns 959958.5 ns 1.01
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) 1334708 ns 1341792 ns 0.99
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) 1295562 ns 1294792 ns 1.00
lenet(28, 28, 1, 64)/forward/GPU/CUDA 279300 ns 281583.5 ns 0.99
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) 5908000 ns 5777875 ns 1.02
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) 4591709 ns 4456083 ns 1.03
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) 4976209 ns 4934792 ns 1.01
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) 5721000 ns 5627500 ns 1.02
lenet(28, 28, 1, 64)/zygote/GPU/CUDA 1097188 ns 1106964 ns 0.99
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 500 ns 500 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 500 ns 500 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 541 ns 542 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 500 ns 542 ns 0.92
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 24143 ns 23988 ns 1.01
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2083 ns 2084 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2083 ns 2083 ns 1
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2167 ns 2125 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2084 ns 2125 ns 0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 178433 ns 179026 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6375 ns 6084 ns 1.05
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6167 ns 6167 ns 1
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 6791 ns 7041 ns 0.96
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5750 ns 6375 ns 0.90
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 65777 ns 66163.5 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11125 ns 11291 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11042 ns 10791 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12167 ns 12125 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10875 ns 11354.5 ns 0.96
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 452424 ns 456626.5 ns 0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7500 ns 7000 ns 1.07
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7020.5 ns 7042 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8167 ns 8375 ns 0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6417 ns 7042 ns 0.91
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 52916.5 ns 52652 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 17542 ns 17375 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 17084 ns 17167 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 18125 ns 17770.5 ns 1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 16625 ns 18708 ns 0.89
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 306123.5 ns 306093.5 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 500 ns 459 ns 1.09
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 500 ns 459 ns 1.09
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 583 ns 583 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 541 ns 542 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 32503 ns 33004 ns 0.98
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 8625 ns 8583 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 8666 ns 8208 ns 1.06
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9333 ns 9583 ns 0.97
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 8875 ns 9042 ns 0.98
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 163266.5 ns 162492.5 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 64583 ns 64542 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 64459 ns 64417 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 64542 ns 64625 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 64708 ns 64750 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 112855 ns 112347.5 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 286958 ns 277542 ns 1.03
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 284333 ns 281625 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 283417 ns 288750 ns 0.98
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 282208 ns 275500 ns 1.02
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 189716 ns 189809 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) 3364542 ns 3285583 ns 1.02
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) 3028687.5 ns 3022333.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) 2790833 ns 2780375 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) 4100917 ns 4038625 ns 1.02
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA 572405 ns 573967 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) 7623458 ns 7586208.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) 7440791 ns 7415437 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) 7366542 ns 7333375 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) 8197688 ns 8220958 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA 1349771 ns 1351752.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) 18788459 ns 18835167 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) 19166542 ns 19044834 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) 19272542 ns 19135125 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) 15690750 ns 15633417 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23515500 ns 23661916.5 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 33757666.5 ns 33965500 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 40869250 ns 41107417 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 35082500 ns 34858709 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1844978 ns 1862815 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 189627750 ns 189289541 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 164852917 ns 164224708 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 157414459 ns 157847979 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 439775958 ns 438904833 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 13911288.5 ns 13913764 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 289819458.5 ns 289733584 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 338120021.5 ns 338173667 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 307472667 ns 307489541.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 394527375 ns 393585937.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 23771 ns 21708.5 ns 1.10
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 23812.5 ns 24458 ns 0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 25667 ns 25937 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 22208.5 ns 24229 ns 0.92
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 95804 ns 96907 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 103916.5 ns 103750 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 103500 ns 105292 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 104042 ns 104208 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 103500 ns 151250 ns 0.68
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 506012.5 ns 504189 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6958 ns 6583 ns 1.06
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7166 ns 7292 ns 0.98
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7917 ns 7959 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5583 ns 6958 ns 0.80
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 68880 ns 68581 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14792 ns 14916.5 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15062.5 ns 14709 ns 1.02
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14834 ns 16666 ns 0.89
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 16020.5 ns 14292 ns 1.12
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 479169 ns 483895 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 2993625 ns 3017937 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2067437.5 ns 2022458 ns 1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2318395.5 ns 2307959 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4596333.5 ns 4846645.5 ns 0.95
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA 586471 ns 585796 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 23496917 ns 23617917 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18039875 ns 17975417 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 18250417 ns 18323812.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 34954792 ns 35597209 ns 0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3109927.5 ns 3109235 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 33379041 ns 33405687.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 27647875 ns 27693604 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 27702417 ns 27860958 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 41697625 ns 42002937.5 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 73583.5 ns 72375 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 75396 ns 84624.5 ns 0.89
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 84103.5 ns 83250 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 74875 ns 73750 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 100532.5 ns 102852 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 323042 ns 218167 ns 1.48
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 219458.5 ns 309979 ns 0.71
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 320541.5 ns 317479 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 218083.5 ns 288875 ns 0.75
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 540970.5 ns 550996 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 12667 ns 12041 ns 1.05
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 12625 ns 12729.5 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 13333 ns 13833 ns 0.96
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 12792 ns 11666.5 ns 1.10
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 70619 ns 71604 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 26833 ns 26625 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 26667 ns 26959 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 27792 ns 28292 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26959 ns 26458 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 470712 ns 484486.5 ns 0.97
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 13417 ns 12417 ns 1.08
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 12708.5 ns 12542 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 14500 ns 14584 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 12625 ns 13041.5 ns 0.97
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 52550 ns 53694 ns 0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 26500 ns 26312.5 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 25750 ns 26270.5 ns 0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 26937.5 ns 26667 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26083 ns 26333 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 305657 ns 309291.5 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 181666.5 ns 178770.5 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 182979 ns 182334 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 183666.5 ns 184895.5 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 180500 ns 179750 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 56554 ns 57908 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 587292 ns 587125 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 583125 ns 596500 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 590791 ns 593770.5 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 583167 ns 583166 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 287118 ns 290369.5 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7084 ns 7354.5 ns 0.96
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6958 ns 7167 ns 0.97
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7333 ns 7875 ns 0.93
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6750 ns 6833 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 70418 ns 70829 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14291 ns 14375 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 13917 ns 14708 ns 0.95
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15375 ns 15625 ns 0.98
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14542 ns 14083 ns 1.03
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 458663 ns 471312.5 ns 0.97
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 1170729 ns 1235042 ns 0.95
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 1237583 ns 1283583 ns 0.96
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 1266125 ns 1282875 ns 0.99
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 1309250 ns 1325208 ns 0.99
batchedmm(512, Bsize=4)/forward/GPU/CUDA 348527 ns 301270 ns 1.16
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 4115834 ns 4111125 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 4378458 ns 4361625 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 4798917 ns 4786395.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 4449458 ns 4453229.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1045715 ns 1047552 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1833 ns 1792 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1791 ns 1750 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1834 ns 1834 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1833 ns 1834 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 23953.5 ns 23328 ns 1.03
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4916 ns 4833 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4792 ns 4792 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 4917 ns 4917 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4916 ns 4917 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 193278 ns 186698 ns 1.04
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7250 ns 7208.5 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6354 ns 5584 ns 1.14
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 8687.5 ns 8667 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7375 ns 7312.5 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 57033 ns 54539 ns 1.05
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 11542 ns 10833 ns 1.07
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10875 ns 10834 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 11708 ns 12375 ns 0.95
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 11292 ns 11916 ns 0.95
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 339762.5 ns 329099 ns 1.03
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 333 ns 334 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 292 ns 333 ns 0.88
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 23315 ns 22753 ns 1.02
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2708 ns 2708 ns 1
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2792 ns 2667 ns 1.05
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 3000 ns 2959 ns 1.01
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2708 ns 3000 ns 0.90
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 163052.5 ns 157496 ns 1.04
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 13667 ns 13167 ns 1.04
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 13375 ns 13166 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 15083.5 ns 15000 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 13520.5 ns 13792 ns 0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 57725.5 ns 55218 ns 1.05
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 25458 ns 24833 ns 1.03
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24541 ns 24542 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25333 ns 25375 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 24917 ns 24709 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 300059 ns 289966 ns 1.03
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4166 ns 4083 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4125 ns 4166 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4167 ns 4167 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4167 ns 4125 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 25178 ns 24660 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16250 ns 15958 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16084 ns 16417 ns 0.98
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16042 ns 16042 ns 1
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16083 ns 16125 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 199817 ns 194045.5 ns 1.03
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5667 ns 5667 ns 1
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5584 ns 5625 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5708 ns 5750 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5708 ns 5791 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 33846.5 ns 32989 ns 1.03
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 21604.5 ns 21125 ns 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 20667 ns 20459 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 21416 ns 21542 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 20958 ns 20875 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 178601.5 ns 174273 ns 1.02
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 404395.5 ns 403209 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 373854 ns 371125 ns 1.01
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 469333 ns 474292 ns 0.99
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 527166 ns 539604.5 ns 0.98
batchedmm(16, Bsize=512)/forward/GPU/CUDA 67224 ns 66734 ns 1.01
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 994917 ns 1011917 ns 0.98
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 871792 ns 884896 ns 0.99
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 1218458 ns 1220125 ns 1.00
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 1362646 ns 1400208 ns 0.97
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 192087.5 ns 190566.5 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 80437.5 ns 82917 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 82959 ns 82791 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 83958 ns 88958.5 ns 0.94
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82916.5 ns 83187.5 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 194575 ns 192556.5 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1691687.5 ns 1921500 ns 0.88
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1913270.5 ns 1696166 ns 1.13
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1935000 ns 1938083 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1693333 ns 1915875 ns 0.88
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 398092 ns 393732 ns 1.01
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 291 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 291 ns 291 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 22211 ns 21580 ns 1.03
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1833 ns 1792 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1792 ns 1792 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1833 ns 1833 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1833 ns 1833 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 173510.5 ns 165924 ns 1.05
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 8167 ns 6708 ns 1.22
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 6167 ns 6250 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 10042 ns 9750 ns 1.03
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 7438 ns 8125 ns 0.92
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 60835.5 ns 56950.5 ns 1.07
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9208 ns 8916.5 ns 1.03
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8958 ns 8958 ns 1
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9000 ns 9625 ns 0.94
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9208 ns 9542 ns 0.96
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 314546 ns 299584.5 ns 1.05
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 121181896 ns 120035854.5 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 174356250 ns 174382959 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 155331667 ns 154831333 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 105591312 ns 103109500 ns 1.02
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5490856 ns 5474606 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 616313021 ns 617124000 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 555471625 ns 555612167 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 466804583 ns 468382792 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 757846375 ns 756087750 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 34940752 ns 38213656 ns 0.91
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 649235042 ns 651747459 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 665954333.5 ns 666674583.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 596507375 ns 602170708.5 ns 0.99
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 741194667 ns 734251875 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57709 ns 57208 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 47417 ns 48167 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 39167 ns 39167 ns 1
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83833 ns 83958 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 38045 ns 37250 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1920958 ns 1929792 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1970917 ns 1973292 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1819395.5 ns 1984249.5 ns 0.92
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1889084 ns 1881417 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 176927 ns 171491 ns 1.03
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 281874.5 ns 273354 ns 1.03
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 268333.5 ns 267959 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 269333 ns 270687.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 266916 ns 268834 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 128624.5 ns 124192.5 ns 1.04
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 674792 ns 658333 ns 1.03
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 677166 ns 674854.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 621541 ns 665333 ns 0.93
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 681103.5 ns 670500 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 753295 ns 664813 ns 1.13
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2194792 ns 2190167 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2200750 ns 2214354.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2188604 ns 2216958.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2196250 ns 2099979 ns 1.05
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 134058.5 ns 133238 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5495625 ns 5505354.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5489000 ns 5504750 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5583959 ns 5565292 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5501708 ns 5499708 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 794649 ns 740235 ns 1.07
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 650458 ns 650417 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 648250 ns 649020.5 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 647729.5 ns 640625 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 637958 ns 648292 ns 0.98
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 47562 ns 47265 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1820292 ns 1821708 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1718291 ns 1720959 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1692833 ns 1675729.5 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 2104334 ns 2108500 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 226480 ns 224014 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58000 ns 58583 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46667 ns 46645.5 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 38917 ns 38750 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83958 ns 83834 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 28972 ns 28947 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2032895.5 ns 2024916 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2080792 ns 2086188 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1870500 ns 2100521 ns 0.89
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1987874.5 ns 1993416.5 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 192457.5 ns 191815.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 13326479 ns 13473875 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 12445229 ns 12547041.5 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 12578437.5 ns 12559604 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 14851000 ns 15213416.5 ns 0.98
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 517471.5 ns 517805 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 47244833 ns 47353458 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 41818500 ns 41833334 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 41261562 ns 41118750 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 58290958 ns 58300041 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3201520 ns 3203904 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 74154459 ns 74077042 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 68317750 ns 68022250 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 90991833 ns 90906749.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 76224833 ns 99115937.5 ns 0.77
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57750 ns 58958 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 47209 ns 47375 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 38750 ns 38729.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83937.5 ns 83500 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 46606 ns 47777 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1920104.5 ns 1923375 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1958583 ns 1961541 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1752354 ns 1980229 ns 0.88
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1897541 ns 1890354 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 190968 ns 194350.5 ns 0.98
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 292 ns 291 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 291 ns 291 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 292 ns 375 ns 0.78
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 31329 ns 32617.5 ns 0.96
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6083 ns 6208.5 ns 0.98
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6083 ns 5958 ns 1.02
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6583 ns 6708 ns 0.98
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6000 ns 6437.5 ns 0.93
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 175665 ns 173722.5 ns 1.01
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 250 ns 250 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 250 ns 250 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 250 ns 292 ns 0.86
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 31064 ns 32110 ns 0.97
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 2584 ns 2583 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 2666 ns 2542 ns 1.05
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 2833 ns 2833 ns 1
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 2584 ns 2833 ns 0.91
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 162308 ns 161891 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 287512021 ns 286335145.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 341664333 ns 339870250 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 320946166.5 ns 320445937.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 272895792 ns 272825875 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 7028736 ns 7113314 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 998982709 ns 990386709 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 938834833 ns 938484666 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 867538167 ns 868613416.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1101126958.5 ns 1158749666 ns 0.95
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 33951770 ns 33903874 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1316927292 ns 1310266104.5 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1337023792 ns 1325766333.5 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1620632917 ns 1623996500 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1310255479 ns 1663239334 ns 0.79
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1407417 ns 1461479 ns 0.96
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1417625 ns 1415750 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1413625 ns 1429167 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1406687.5 ns 1414437.5 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 127437 ns 128213 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5023604 ns 5019792 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5018750 ns 5022458 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4760396 ns 5050000 ns 0.94
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5017562.5 ns 5006541.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 594578 ns 557532 ns 1.07
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) 174929541 ns 175263520.5 ns 1.00
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) 128751271 ns 129816208.5 ns 0.99
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) 146934041.5 ns 145953208.5 ns 1.01
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) 163487375.5 ns 164619104.5 ns 0.99
vgg16(32, 32, 3, 32)/forward/GPU/CUDA 4859464 ns 4883992 ns 0.99
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) 664127542 ns 831528333 ns 0.80
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) 641259166 ns 497840084 ns 1.29
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) 567946500 ns 556789916 ns 1.02
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) 850158541 ns 679969833 ns 1.25
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA 16210159 ns 16195623 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 8956208 ns 8914083 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 8755875 ns 8769917 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 8188500 ns 8216313 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 10176729.5 ns 10158000 ns 1.00
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1606153 ns 1595526 ns 1.01
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 36297042 ns 35894250 ns 1.01
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 36814229 ns 36843625 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 34349792 ns 34476562 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 38895333 ns 38802729 ns 1.00
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 8910559.5 ns 6454567.5 ns 1.38
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 47209 ns 47396 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 49333 ns 49334 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 47542 ns 47542 ns 1
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 49166 ns 47417 ns 1.04
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 18896 ns 19457 ns 0.97
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 50083 ns 50292 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 50583 ns 50520.5 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 51042 ns 50584 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 50250 ns 50250 ns 1
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 197236.5 ns 189575 ns 1.04
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7584 ns 8104 ns 0.94
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 8083 ns 6791 ns 1.19
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 9208 ns 9125 ns 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 8333 ns 7333 ns 1.14
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 95601 ns 86829.5 ns 1.10
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9958 ns 9875 ns 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9458 ns 9583 ns 0.99
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10167 ns 10375 ns 0.98
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10250 ns 10208 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 565793 ns 537525 ns 1.05
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 7875 ns 8208 ns 0.96
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 7917 ns 8250 ns 0.96
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 9500 ns 9812.5 ns 0.97
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 8708 ns 6375 ns 1.37
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 129044 ns 113788.5 ns 1.13
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12375 ns 13333.5 ns 0.93
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12542 ns 12625 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13584 ns 13584 ns 1
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12583 ns 13208 ns 0.95
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 512785 ns 479705.5 ns 1.07
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 1042 ns 958 ns 1.09
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 958 ns 958 ns 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 1042 ns 1042 ns 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 1042 ns 1083 ns 0.96
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 31575 ns 32580 ns 0.97
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7958 ns 7750 ns 1.03
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7625 ns 7625 ns 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8250 ns 8542 ns 0.97
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8000 ns 8208 ns 0.97
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 203613.5 ns 201701.5 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 23167 ns 23250 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 23209 ns 23042 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 23666 ns 23500 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 23209 ns 23167 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 18630 ns 18765.5 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 51875 ns 52875 ns 0.98
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 52292 ns 52292 ns 1
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 53083 ns 52792 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 52042 ns 52459 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 280711 ns 260844.5 ns 1.08
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1404833 ns 1400229 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1406625 ns 1398666.5 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1405229.5 ns 1400708 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1454625 ns 1398917 ns 1.04
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 195023.5 ns 196521.5 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5017834 ns 5018604 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5014083.5 ns 5004729.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5053208 ns 5044229.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5002958 ns 5001271 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 619245 ns 595122 ns 1.04
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3030666.5 ns 3043083 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2072750 ns 2094042 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2270479.5 ns 2287146 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4951458.5 ns 4530875 ns 1.09
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 582238 ns 582703 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 24376833.5 ns 24366625 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18837750 ns 18829583 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 19058604.5 ns 19120291 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 36607062.5 ns 36653000 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3184045.5 ns 3189516.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 34131312.5 ns 33943229 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 28332646 ns 28373417 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 28385728.5 ns 28357208 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 41613645.5 ns 41659750 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 144867000 ns 144299750 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 143371708 ns 142248375 ns 1.01
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 125843979.5 ns 126632146 ns 0.99
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 174338583.5 ns 173840291.5 ns 1.00
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22787685 ns 22781482 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 960337666.5 ns 1307941437.5 ns 0.73
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 884236209 ns 1133574500.5 ns 0.78
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 718061541 ns 711240125 ns 1.01
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 671977458 ns 670828250 ns 1.00
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 119167280 ns 118499942 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 74375 ns 74542 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 73667 ns 73917 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 78583 ns 83125 ns 0.95
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 74271 ns 72916.5 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 236816 ns 225032.5 ns 1.05
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 190333 ns 202979.5 ns 0.94
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 203833 ns 282792 ns 0.72
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 193125 ns 253479.5 ns 0.76
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 275667 ns 244146 ns 1.13
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1230276 ns 1201754 ns 1.02
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 35529791.5 ns 35408938 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 35399125 ns 35449645.5 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 32566334 ns 32512083 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 41050959 ns 41003541.5 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5849679 ns 5848198 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 145971667 ns 146608875 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 151770062.5 ns 151542938 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 140117333 ns 138849083 ns 1.01
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 287497375 ns 287439584 ns 1.00
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 34904065 ns 34913824 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 121596229 ns 121086291.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 174621209 ns 174190000 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 155248875 ns 155717667 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 105916125 ns 106488666.5 ns 0.99
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5467486 ns 5478422 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 477537959 ns 611208666 ns 0.78
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 465834541 ns 466441167 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 454243958 ns 453562937.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 742668146 ns 741621625 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 32286932 ns 35157227 ns 0.92
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 641994375 ns 648662584 ns 0.99
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 654482979 ns 657411208 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 588547854.5 ns 585962375 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 850465708 ns 845072208 ns 1.01
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) 1310375 ns 1304708 ns 1.00
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) 956750 ns 965666 ns 0.99
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) 662041.5 ns 744354 ns 0.89
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) 2098541.5 ns 1944604 ns 1.08
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA 559887.5 ns 572387 ns 0.98
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) 2974896 ns 2974271 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) 2550167 ns 2531646 ns 1.01
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) 2481312.5 ns 2512854 ns 0.99
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) 3687958 ns 3691334 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA 1744044.5 ns 1817474 ns 0.96
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) 6644917 ns 6642416 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) 6496417 ns 6630792 ns 0.98
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) 6453208 ns 6466375 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) 4452625 ns 4443145.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7375 ns 7334 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6083 ns 6208 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5333 ns 5458 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9917 ns 10167 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 25006 ns 25916 ns 0.96
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 212125 ns 212104 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 222333 ns 219562.5 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 221209 ns 220667 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 205875 ns 206291 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 256601.5 ns 257490 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) 300709625 ns 301772791.5 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) 221256000 ns 222879750 ns 0.99
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) 218293312.5 ns 222700312.5 ns 0.98
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) 306523896 ns 311773125 ns 0.98
vgg16(32, 32, 3, 64)/forward/GPU/CUDA 7678868 ns 7676597.5 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) 1086946812.5 ns 1082870459 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) 893668062.5 ns 892532250 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) 821712646 ns 883941208.5 ns 0.93
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) 1164252583.5 ns 1154293562 ns 1.01
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA 26517827 ns 26959026 ns 0.98
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5833 ns 6459 ns 0.90
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5166 ns 5209 ns 0.99
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 8917 ns 10000 ns 0.89
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5625 ns 5708.5 ns 0.99
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 164437.5 ns 168546.5 ns 0.98
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7292 ns 7458 ns 0.98
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6834 ns 6792 ns 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7459 ns 7542 ns 0.99
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6917 ns 7792 ns 0.89
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 668536 ns 639812.5 ns 1.04
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 500 ns 458 ns 1.09
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 500 ns 458 ns 1.09
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 583 ns 542 ns 1.08
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 458 ns 542 ns 0.85
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 23464 ns 24361 ns 0.96
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9083 ns 9000 ns 1.01
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 8917 ns 9000 ns 0.99
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9750 ns 9583 ns 1.02
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 8958 ns 9708 ns 0.92
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 228625 ns 234125.5 ns 0.98
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 351875 ns 351500 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 351375 ns 351500 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 352083.5 ns 351916 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 355166.5 ns 356625 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 21438 ns 21502 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 779459 ns 811270.5 ns 0.96
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 775625 ns 774958.5 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 776333 ns 776584 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 827729 ns 821875 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 306099 ns 315795.5 ns 0.97
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 333375 ns 335896 ns 0.99
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 338187.5 ns 338208.5 ns 1.00
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 447083 ns 441167 ns 1.01
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 327541 ns 331375 ns 0.99
batchedmm(16, Bsize=32)/forward/GPU/CUDA 17916 ns 18761.5 ns 0.95
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 688187.5 ns 695166 ns 0.99
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 740833 ns 738208 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 1043750.5 ns 1036458 ns 1.01
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 687437.5 ns 692396 ns 0.99
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 259575.5 ns 292461.5 ns 0.89
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 352250 ns 354166.5 ns 0.99
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 345500 ns 346771 ns 1.00
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 430417 ns 433791 ns 0.99
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 366833 ns 370250 ns 0.99
batchedmm(16, Bsize=128)/forward/GPU/CUDA 22452 ns 23121 ns 0.97
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 749437.5 ns 757417 ns 0.99
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 749667 ns 749625 ns 1.00
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 1068125 ns 1070562.5 ns 1.00
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 819375 ns 828458 ns 0.99
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 228679 ns 257074.5 ns 0.89
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 3417 ns 3292 ns 1.04
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 3375 ns 3458 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 3709 ns 3750 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 3375 ns 3417 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 17943 ns 18586 ns 0.97
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 4209 ns 4167 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 4208 ns 4375 ns 0.96
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 4333 ns 4417 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 4333 ns 4250 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 247167 ns 296700.5 ns 0.83
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6708 ns 3625 ns 1.85
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 3541 ns 3750 ns 0.94
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6541 ns 6541 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3917 ns 6354.5 ns 0.62
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 187075 ns 232189.5 ns 0.81
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8333 ns 8187.5 ns 1.02
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8167 ns 8000 ns 1.02
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8416 ns 8458 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8584 ns 8500 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 1170319.5 ns 1227082 ns 0.95
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 204792 ns 203417 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 210709 ns 209541.5 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 210500 ns 208250 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 205709 ns 198709 ns 1.04
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 34545 ns 35300 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 607041 ns 612417 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 620875 ns 623292 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 623083 ns 623250 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 630937.5 ns 630166 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 345777 ns 347973 ns 0.99
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 972333 ns 977646 ns 0.99
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 936145.5 ns 935437.5 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 971895.5 ns 970083 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 1287875 ns 1286374.5 ns 1.00
batchedmm(128, Bsize=128)/forward/GPU/CUDA 207543.5 ns 209031 ns 0.99
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 4486834 ns 4514333 ns 0.99
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 4467750 ns 4466146 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 4468375 ns 4452875 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 6363084 ns 6260416.5 ns 1.02
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 989981 ns 947144.5 ns 1.05
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 4000 ns 3542 ns 1.13
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3667 ns 3417 ns 1.07
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6354.5 ns 5896 ns 1.08
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3834 ns 6667 ns 0.58
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 232582.5 ns 219336.5 ns 1.06
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7459 ns 6917 ns 1.08
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7459 ns 6958 ns 1.07
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7270.5 ns 7708 ns 0.94
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7458 ns 7291 ns 1.02
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 1031292.5 ns 1020167.5 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1623666.5 ns 1635042 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1158083 ns 1200395.5 ns 0.96
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1365542 ns 1363584 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2483375 ns 2345187.5 ns 1.06
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA 212766.5 ns 215784.5 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12292250 ns 12316854.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9562291 ns 9564000 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9349083.5 ns 9378437.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18086958 ns 17989542 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1951499.5 ns 1948181 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17292271 ns 17368125 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14359292 ns 14382958 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14481708 ns 14502250 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21083791.5 ns 21085917 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 91250 ns 90917 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 90417 ns 89500 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 93375 ns 91833 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 92271 ns 113437.5 ns 0.81
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 126008 ns 126891 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2027791 ns 2009625 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1698917 ns 2030000 ns 0.84
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1986791 ns 2039270.5 ns 0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2022292 ns 1871125 ns 1.08
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1059363.5 ns 1032563 ns 1.03
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 343041.5 ns 342166.5 ns 1.00
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 343334 ns 343375 ns 1.00
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 408459 ns 406458 ns 1.00
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 307729.5 ns 311729 ns 0.99
batchedmm(2, Bsize=4)/forward/GPU/CUDA 15788 ns 16465.5 ns 0.96
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 698500 ns 706208 ns 0.99
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 728604 ns 728542 ns 1.00
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 1025520.5 ns 1018584 ns 1.01
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 643645.5 ns 650375 ns 0.99
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 194667.5 ns 195366.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7250 ns 7375 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5917 ns 5875 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5333 ns 5416 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9917 ns 10000 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 33151 ns 34591 ns 0.96
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 223333.5 ns 243791 ns 0.92
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 220209 ns 220125 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 221084 ns 221083 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 221375 ns 239167 ns 0.93
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 347875 ns 327793 ns 1.06
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3667 ns 3667 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3667 ns 3667 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3667 ns 3709 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3708 ns 3708 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 22573 ns 22616 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14334 ns 14292 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 14333 ns 14416 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14042 ns 14208 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14292 ns 14417 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 494233.5 ns 480334.5 ns 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 97708 ns 94458 ns 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 96979 ns 92625 ns 1.05
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 97042 ns 96875 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 101458 ns 96229.5 ns 1.05
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 125532 ns 126007 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1923854 ns 1714792 ns 1.12
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1897479.5 ns 1926792 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1925958 ns 1913291.5 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1920000 ns 1711417 ns 1.12
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1048446.5 ns 1034230 ns 1.01
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) 863709 ns 876916.5 ns 0.98
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) 822479.5 ns 817791 ns 1.01
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) 1185917 ns 1169438 ns 1.01
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) 954084 ns 966187.5 ns 0.99
lenet(28, 28, 1, 32)/forward/GPU/CUDA 268705.5 ns 275657.5 ns 0.97
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) 2763958.5 ns 2828583 ns 0.98
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) 2467666.5 ns 2474833 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) 3312375 ns 3335750 ns 0.99
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) 3401645.5 ns 3304292 ns 1.03
lenet(28, 28, 1, 32)/zygote/GPU/CUDA 1642790 ns 1618381.5 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 16416 ns 16709 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 17375 ns 15625 ns 1.11
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 19020.5 ns 18667 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 15958 ns 15583 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 143820 ns 142594 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 241750 ns 228750 ns 1.06
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 215125 ns 215750 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 216500 ns 217625 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 258750 ns 255500 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 647705 ns 641543.5 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 222166 ns 222458 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 222937.5 ns 221500 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 223500 ns 223458.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 221625 ns 222604.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 270664.5 ns 269850.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 530500 ns 537583 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 504708 ns 497334 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 498354 ns 499583 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 509042 ns 526833 ns 0.97
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1440100 ns 1430878.5 ns 1.01
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 332166.5 ns 330125 ns 1.01
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 334583 ns 332834 ns 1.01
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 436584 ns 435458.5 ns 1.00
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 316291.5 ns 315917 ns 1.00
batchedmm(16, Bsize=4)/forward/GPU/CUDA 16549 ns 16581 ns 1.00
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 714916 ns 717084 ns 1.00
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 731749.5 ns 728166.5 ns 1.00
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 1018041 ns 1021104 ns 1.00
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 659916 ns 662729.5 ns 1.00
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 197924.5 ns 195479.5 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17500 ns 17875 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 17208 ns 17167 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 19500 ns 20250 ns 0.96
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 17625 ns 17208 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 146247.5 ns 145639 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 213167 ns 223750 ns 0.95
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 212354 ns 212417 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 214229.5 ns 214041 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 224292 ns 221917 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 963761 ns 1035551.5 ns 0.93
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 7187.5 ns 6708 ns 1.07
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6458 ns 6333 ns 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7542 ns 7208 ns 1.05
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6250 ns 6625 ns 0.94
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 247788.5 ns 240542 ns 1.03
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 10666 ns 10584 ns 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10292 ns 9917 ns 1.04
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10500 ns 11166.5 ns 0.94
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10625 ns 10917 ns 0.97
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 1101991 ns 1097401.5 ns 1.00
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 7209 ns 3500 ns 2.06
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3167 ns 3208 ns 0.99
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 5334 ns 6333.5 ns 0.84
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3125 ns 6750 ns 0.46
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 249742.5 ns 250006 ns 1.00
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7333 ns 7625 ns 0.96
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7125 ns 7084 ns 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7625 ns 8125 ns 0.94
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7333.5 ns 7500 ns 0.98
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 1104576 ns 1102649 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23676542 ns 23315625 ns 1.02
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 35315625 ns 34529125 ns 1.02
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 41021500 ns 41513333.5 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34975146 ns 34929834 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1841641 ns 1838602 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 183575291 ns 184421875 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 159863125 ns 159459792 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 150390416 ns 151225083 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 414546000 ns 413223958 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 16498361 ns 16387494 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 426892000 ns 428743125 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 255220666.5 ns 252439020.5 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 233789999.5 ns 233017396 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 488318583 ns 484197291 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 182625 ns 183584 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 183416 ns 182750 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 186375 ns 186625 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 183562.5 ns 183146 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 230078 ns 228677.5 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 591125 ns 596083 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 586583.5 ns 586292 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 588875 ns 589770.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 633896 ns 631958 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1092628.5 ns 1119701 ns 0.98
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 3870667 ns 3838833 ns 1.01
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 3639041 ns 3643375.5 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 3556854.5 ns 3563521 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 5348541.5 ns 5359750 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/CUDA 539345 ns 537722 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 17399709 ns 17412417 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 17238083 ns 17190667 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 17087396 ns 17100375 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 22258375 ns 22144083 ns 1.01
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2638865 ns 2612799 ns 1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 542 ns 542 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 500 ns 458 ns 1.09
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 542 ns 542 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 500 ns 583 ns 0.86
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 32200 ns 32035 ns 1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9916 ns 9208 ns 1.08
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 8666 ns 8542 ns 1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9917 ns 10208 ns 0.97
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9167 ns 9459 ns 0.97
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 269524 ns 264327.5 ns 1.02
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) 498856625 ns 504274209 ns 0.99
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) 429009812.5 ns 430218396 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) 471079125 ns 471374500 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) 677305167 ns 672994208.5 ns 1.01
vgg16(32, 32, 3, 128)/forward/GPU/CUDA 12482570.5 ns 12486595 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) 2039988125 ns 2049529562.5 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) 1629390541 ns 1632649709 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) 1544585104 ns 1536417708 ns 1.01
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) 2216826791.5 ns 2205666041.5 ns 1.01
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA 49313694 ns 49389302 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1648875 ns 1657645.5 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1178188 ns 1189208.5 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1335999.5 ns 1382000 ns 0.97
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2451250 ns 2334125 ns 1.05
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 216670 ns 214982 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12769708 ns 12688500 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9922083 ns 9942000 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9700833 ns 9748312.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18419000.5 ns 18407312 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2018117 ns 2050613 ns 0.98
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17711521 ns 17691583.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14689667 ns 14746041.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14712292 ns 14804417 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21443959 ns 21386084 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 26208 ns 26167 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 26209 ns 26292 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 26375 ns 26291 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 26167 ns 26291 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 23929 ns 24125 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 67042 ns 66875 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 66709 ns 66917 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 66958 ns 67083 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66667 ns 67209 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 401514 ns 398847.5 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 203834 ns 202667 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 209500 ns 209000 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 209833 ns 209167 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 201916 ns 199583 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 26172 ns 26392 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 650145.5 ns 612416.5 ns 1.06
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 666666.5 ns 627416.5 ns 1.06
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 623083.5 ns 667979 ns 0.93
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 632333 ns 631250 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 349666.5 ns 353043.5 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 655125 ns 645542 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 647625 ns 643375 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 546958 ns 664187.5 ns 0.82
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 681500 ns 540834 ns 1.26
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 131296 ns 132126 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2258229 ns 2247375 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2255500 ns 2239958 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2297083.5 ns 2302917 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2241417 ns 2219000 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1166127 ns 1328726 ns 0.88
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 18666.5 ns 17667 ns 1.06
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 17729 ns 16979.5 ns 1.04
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 21000 ns 20792 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 17958 ns 18500 ns 0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 144841.5 ns 146392.5 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 262417 ns 229708 ns 1.14
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 219042 ns 225333 ns 0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 221084 ns 229292 ns 0.96
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 257833 ns 259083 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1023581 ns 1081671 ns 0.95
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 500 ns 500 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 500 ns 459 ns 1.09
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 542 ns 542 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 459 ns 542 ns 0.85
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 22851 ns 23645 ns 0.97
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 10083 ns 9833.5 ns 1.03
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9750 ns 9542 ns 1.02
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 10458 ns 10708 ns 0.98
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9500 ns 9916 ns 0.96
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 257737.5 ns 262941 ns 0.98
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 7145.5 ns 7291 ns 0.98
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5270.5 ns 5833 ns 0.90
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 8084 ns 9625 ns 0.84
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 7354.5 ns 7250 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 234765.5 ns 234003 ns 1.00
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7583 ns 7333 ns 1.03
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7041 ns 7000 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7625 ns 7833 ns 0.97
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6917 ns 7250 ns 0.95
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 799451.5 ns 810029.5 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 2125 ns 2042 ns 1.04
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1917 ns 2000 ns 0.96
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 2583 ns 2375 ns 1.09
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 2250 ns 2208 ns 1.02
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 17933 ns 18218 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 6583 ns 6542 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 6458 ns 6500 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 6750 ns 6708 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 6500 ns 6750 ns 0.96
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 329954.5 ns 335368 ns 0.98
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 762666 ns 750166 ns 1.02
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 748292 ns 746604.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 748937.5 ns 751041 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 752125 ns 761417 ns 0.99
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 21677 ns 21856 ns 0.99
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 810604 ns 775334 ns 1.05
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 772458 ns 775042 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 775020.5 ns 804792 ns 0.96
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 811250.5 ns 791625 ns 1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 297528.5 ns 299022 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7333.5 ns 7375 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6000 ns 5875 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5042 ns 5208 ns 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10083 ns 10125 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 32915 ns 32492 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 261083 ns 233188 ns 1.12
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 227854.5 ns 227750 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 228167 ns 254458 ns 0.90
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 255667 ns 255583 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 362215.5 ns 359227 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 12729.5 ns 11042 ns 1.15
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 12542 ns 12458 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 13208 ns 12959 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 10125 ns 12000 ns 0.84
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 247873.5 ns 245075.5 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 24541 ns 24875 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24625 ns 24458 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25042 ns 25458 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 24916.5 ns 24583.5 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 1130164.5 ns 1120608 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 106546042 ns 106980458 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 117887062.5 ns 118006979.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 123861917 ns 123940208 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 117586083 ns 118407959 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 2640571 ns 2661574 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 392989041 ns 394378313 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 367697333 ns 368164500 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 360473458 ns 358657167 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 483729375 ns 482282708 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 15223969 ns 15138278 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 755583520.5 ns 759267583 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 579266333 ns 577881125 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 748066646 ns 749378833 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 768472896 ns 945671312.5 ns 0.81
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7750 ns 7458 ns 1.04
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7042 ns 7958 ns 0.88
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8834 ns 8750 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 8250 ns 7333 ns 1.13
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 240391.5 ns 235620 ns 1.02
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14875 ns 14500 ns 1.03
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 13708 ns 13333 ns 1.03
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14292 ns 15041 ns 0.95
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 13167 ns 14292 ns 0.92
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 1084587 ns 1078273.5 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 8770.5 ns 8542 ns 1.03
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 7958 ns 7792 ns 1.02
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 9375 ns 9187.5 ns 1.02
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6458 ns 7833.5 ns 0.82
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 235585.5 ns 235827.5 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12916.5 ns 13167 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12375 ns 12084 ns 1.02
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12666 ns 13084 ns 0.97
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12333 ns 12833 ns 0.96
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 794027 ns 787391.5 ns 1.01
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 345125 ns 347250 ns 0.99
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 345458 ns 344875 ns 1.00
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 418042 ns 409896 ns 1.02
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 310875 ns 310562 ns 1.00
batchedmm(2, Bsize=128)/forward/GPU/CUDA 16738 ns 16566 ns 1.01
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 706459 ns 713833.5 ns 0.99
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 731167 ns 727291 ns 1.01
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 1022500 ns 1023416 ns 1.00
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 650625 ns 654959 ns 0.99
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 200821 ns 197250.5 ns 1.02
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 334 ns 292 ns 1.14
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 292 ns 291 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 291 ns 375 ns 0.78
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 23278 ns 23066 ns 1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6333 ns 6250 ns 1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6167 ns 6334 ns 0.97
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6708 ns 6750 ns 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6250 ns 6791 ns 0.92
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 241472.5 ns 238420 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5708 ns 5750 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 5667 ns 5750 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 5792 ns 5875 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5667 ns 5834 ns 0.97
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 24186 ns 23863 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 21187.5 ns 21750 ns 0.97
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 21333 ns 21000 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 21833 ns 21958 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 21084 ns 21708 ns 0.97
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 264949.5 ns 261085 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 145958 ns 152146 ns 0.96
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 145667 ns 145250 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 150937.5 ns 149541 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 183459 ns 145937 ns 1.26
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 167512 ns 166536.5 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1325645.5 ns 1328792 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1283209 ns 1319083.5 ns 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1321500 ns 1350812.5 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1321750 ns 1317084 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1344605.5 ns 1336276 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 24708 ns 24917 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 23750 ns 24208 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 25791.5 ns 25708 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 23875 ns 24208.5 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 286945.5 ns 351114.5 ns 0.82
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 178875 ns 131125 ns 1.36
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 121125 ns 117791 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 118416 ns 172917 ns 0.68
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 177750 ns 177334 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1461558 ns 1465398.5 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 333 ns 333 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 291 ns 292 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 250 ns 375 ns 0.67
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 22927.5 ns 22926 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6417 ns 6417 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6125 ns 6458 ns 0.95
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6708 ns 6917 ns 0.97
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6584 ns 6542 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 258576.5 ns 254551 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6625 ns 7625 ns 0.87
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6812 ns 4167 ns 1.63
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6834 ns 7708.5 ns 0.89
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4042 ns 7375 ns 0.55
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 256701 ns 250274.5 ns 1.03
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10583 ns 10042 ns 1.05
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10125 ns 9708 ns 1.04
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10041.5 ns 10333 ns 0.97
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10208 ns 10250 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 1359307.5 ns 1345295 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1625 ns 1584 ns 1.03
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1583 ns 1625 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1625 ns 1625 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1583 ns 1625 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 23350 ns 22897 ns 1.02
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 5625 ns 5625 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 5709 ns 5584 ns 1.02
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 5916 ns 5959 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 5584 ns 5958 ns 0.94
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 275462 ns 271438.5 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 6828396 ns 6886125 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 6363250 ns 6378229 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 6516562.5 ns 6526875 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 7621750 ns 7602250 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 215412 ns 213111 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 24080041.5 ns 24073062 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 21280041 ns 21283625 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 20997625 ns 21045584 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 29759000.5 ns 29677875 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2123616 ns 2108165 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 37323458 ns 37353145.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 34474916.5 ns 34386667 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 45793312.5 ns 45930020.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 37921042 ns 49322334 ns 0.77
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7333.5 ns 7708.5 ns 0.95
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 5187.5 ns 5875 ns 0.88
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7937.5 ns 8333 ns 0.95
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 6916.5 ns 7062.5 ns 0.98
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 239076 ns 238522.5 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8125 ns 8458 ns 0.96
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 7708 ns 8042 ns 0.96
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8459 ns 8583 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8208 ns 8292 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 1072155.5 ns 1070850 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) 1562125 ns 1544374.5 ns 1.01
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) 1257937.5 ns 1259666.5 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) 1631625 ns 1632771 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) 2159229 ns 2150667 ns 1.00
lenet(28, 28, 1, 128)/forward/GPU/CUDA 280483 ns 278945 ns 1.01
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) 7910479.5 ns 7908937.5 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) 6565334 ns 6609937 ns 0.99
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) 7148750 ns 7237750.5 ns 0.99
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) 10451187.5 ns 10434334 ns 1.00
lenet(28, 28, 1, 128)/zygote/GPU/CUDA 1898276.5 ns 1889956 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 336958 ns 340979 ns 0.99
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 340916.5 ns 345792 ns 0.99
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 417542 ns 417125 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 342750 ns 345833 ns 0.99
batchedmm(128, Bsize=4)/forward/GPU/CUDA 46061 ns 42448 ns 1.09
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 748770.5 ns 746500.5 ns 1.00
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 793209 ns 784542 ns 1.01
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 1057812.5 ns 1073250 ns 0.99
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 768708 ns 761062.5 ns 1.01
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 312846.5 ns 303720.5 ns 1.03
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 397333 ns 397500 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 288166 ns 288250 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 212083 ns 212666 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 755667 ns 756084 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 44394 ns 43887 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 666250 ns 671083 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 532250 ns 530083 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 471125 ns 470667 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 973625 ns 974750 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 191807 ns 188388.5 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 642749.5 ns 679250 ns 0.95
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 657917 ns 645333.5 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 600542 ns 642458 ns 0.93
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 682375 ns 638562.5 ns 1.07
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 132357 ns 131530 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2459583 ns 2409292 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2421437.5 ns 2456416.5 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2501750 ns 2514583 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2470417 ns 2456292 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1375665.5 ns 1277300 ns 1.08
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 341833 ns 345146 ns 0.99
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 342917 ns 343583 ns 1.00
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 406292 ns 403708.5 ns 1.01
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 308208 ns 312208 ns 0.99
batchedmm(2, Bsize=32)/forward/GPU/CUDA 16247 ns 16009 ns 1.01
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 701208 ns 709667 ns 0.99
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 725603.5 ns 724500 ns 1.00
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 1027750 ns 1022687.5 ns 1.00
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 644500 ns 650417 ns 0.99
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 201443.5 ns 195917 ns 1.03
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1461354.5 ns 1460417 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1500917 ns 1500812.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1493542 ns 1496375 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1440084 ns 1438708 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 41707 ns 40600 ns 1.03
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5118646 ns 5128791 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5284042 ns 5302375 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5161771 ns 5313000 ns 0.97
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4984145.5 ns 4970208.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 198485.5 ns 196206.5 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3667 ns 3667 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3667 ns 3667 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3708 ns 3709 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3667 ns 3708 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 33901 ns 32895 ns 1.03
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 15041 ns 15167 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15084 ns 15083 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15083 ns 15083 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15083 ns 15375 ns 0.98
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 382601.5 ns 376729 ns 1.02
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 70917 ns 71459 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 71375 ns 71250 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 70958 ns 71375 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 71375 ns 70708 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 114182 ns 113177.5 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 317708 ns 317917 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 320416 ns 320417 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 331250 ns 325333 ns 1.02
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 318541 ns 320916 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 198022 ns 193043 ns 1.03
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 1000 ns 958 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 959 ns 958 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 1083 ns 1083 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 959 ns 1042 ns 0.92
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 23593 ns 23363 ns 1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8042 ns 8083 ns 0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8041.5 ns 7792 ns 1.03
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8583 ns 8750 ns 0.98
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 7916.5 ns 8750 ns 0.90
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 264877.5 ns 260535.5 ns 1.02
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 473667 ns 475499.5 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 478708 ns 470520.5 ns 1.02
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 549417 ns 557125 ns 0.99
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 545834 ns 557959 ns 0.98
batchedmm(128, Bsize=32)/forward/GPU/CUDA 129526 ns 129404 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 1394625 ns 1399270.5 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 1382500 ns 1382375 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 1612937.5 ns 1611125 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 1580937.5 ns 1582104.5 ns 1.00
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 279438 ns 274924 ns 1.02
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 375 ns 375 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 291 ns 250 ns 1.16
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 334 ns 375 ns 0.89
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 292 ns 375 ns 0.78
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 31859 ns 31647 ns 1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6625 ns 6375 ns 1.04
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6083 ns 6042 ns 1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6729.5 ns 6666 ns 1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6020.5 ns 6625 ns 0.91
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 267633.5 ns 262541.5 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1724250 ns 1761833 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1721479 ns 1723396 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1727042 ns 1733812.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1730208 ns 1730625 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 168767.5 ns 169477.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4344292 ns 4358625 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4323437.5 ns 4358708 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4399666 ns 4403062.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4355416 ns 4373875 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1262148 ns 1208123 ns 1.04
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 6791 ns 7167 ns 0.95
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 6417 ns 6875 ns 0.93
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 6625 ns 6916 ns 0.96
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 6709 ns 6750 ns 0.99
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 21098 ns 20662 ns 1.02
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 48084 ns 51625 ns 0.93
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 32833 ns 32917 ns 1.00
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 33000 ns 48208.5 ns 0.68
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 51416 ns 51417 ns 1.00
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 212419 ns 292106.5 ns 0.73
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 353583 ns 354562.5 ns 1.00
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 345083 ns 348666.5 ns 0.99
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 435187.5 ns 433333 ns 1.00
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 317875 ns 322041.5 ns 0.99
batchedmm(2, Bsize=512)/forward/GPU/CUDA 18605 ns 18353 ns 1.01
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 723437.5 ns 724625 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 740646 ns 730583 ns 1.01
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 1044625 ns 1038687.5 ns 1.01
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 669791.5 ns 675333 ns 0.99
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 346908.5 ns 335730.5 ns 1.03
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 75167 ns 75458 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 75208 ns 75333 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 75292 ns 75375 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 74625 ns 74584 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 47711 ns 46864.5 ns 1.02
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 324709 ns 325166 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 338396 ns 324250 ns 1.04
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 336208 ns 336875 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 325000 ns 325125 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 213486 ns 209059.5 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1486042 ns 1485709 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1526750 ns 1526833 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1518500 ns 1522792 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1465542 ns 1462625 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 52087.5 ns 51397 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5125208 ns 5113395.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5101375 ns 5295292 ns 0.96
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5010541.5 ns 5300812.5 ns 0.95
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4979833 ns 5001042 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 206862 ns 202971.5 ns 1.02
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 28291 ns 28250 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 28250 ns 28208 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 28208 ns 28208 ns 1
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 28250 ns 28209 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 25316 ns 24514.5 ns 1.03
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 66417 ns 66417 ns 1
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 66583 ns 66458 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 66542 ns 66500 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66417 ns 66500 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 540711 ns 505942 ns 1.07
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 1499041 ns 1502084 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 1143146 ns 1124250 ns 1.02
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 814500 ns 944270.5 ns 0.86
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) 2202375.5 ns 2255250 ns 0.98
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA 572810.5 ns 566674 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) 3097583 ns 3090791 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 2552125 ns 2751542 ns 0.93
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) 2429833.5 ns 2628896 ns 0.92
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) 3815666 ns 3819709 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA 2014809 ns 1979936 ns 1.02
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) 8858833 ns 8847333 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) 8751542 ns 8768375 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) 8741583 ns 8750250 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) 6425062.5 ns 6340375 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 123833.5 ns 85125 ns 1.45
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 82667 ns 83021 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 81896 ns 85708.5 ns 0.96
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 125792 ns 83562.5 ns 1.51
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 194861 ns 192703 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2004479 ns 2012875 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1739604.5 ns 2024062.5 ns 0.86
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1766291 ns 2038542 ns 0.87
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2013646 ns 2008812 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 741224 ns 791664.5 ns 0.94

This comment was automatically generated by workflow using github-action-benchmark.

@avik-pal avik-pal marked this pull request as ready for review November 4, 2024 23:32
@avik-pal avik-pal merged commit dc2885f into main Nov 5, 2024
51 of 113 checks passed
@avik-pal avik-pal deleted the ap/xla_args branch November 5, 2024 00:16
2 participants