fix(MLDataDevices): remove stale import
avik-pal authored Nov 5, 2024
1 parent bf1f12b commit 8b87c2b
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion lib/MLDataDevices/ext/MLDataDevicesReactantExt.jl
@@ -2,7 +2,7 @@ module MLDataDevicesReactantExt

using Adapt: Adapt
using MLDataDevices: MLDataDevices, Internal, ReactantDevice, CPUDevice, get_device_type
-using Reactant: Reactant, XLA, RArray, ConcreteRArray, ConcreteRNumber, TracedRArray,
+using Reactant: Reactant, XLA, ConcreteRArray, ConcreteRNumber, TracedRArray,
TracedRNumber

MLDataDevices.loaded(::Union{ReactantDevice, Type{<:ReactantDevice}}) = true

1 comment on commit 8b87c2b

@github-actions

Lux Benchmarks
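
The Ratio column in the table that follows appears to be the Current timing divided by the Previous timing, so a value above 1 means the current commit measured slower. A minimal sketch of that arithmetic, using three rows copied from the table; the `ratio` helper and the 1.2 cutoff are illustrative assumptions and not part of the benchmark tooling:

```julia
# Hypothetical helper: current / previous, rounded to two decimals like the table.
ratio(current_ns, previous_ns) = round(current_ns / previous_ns; digits=2)

# (benchmark name, current ns, previous ns), copied from the table.
rows = [
    ("layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)", 4709.0, 4625.0),
    ("bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s)", 1395.5, 3542.0),
    ("batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)", 9875.0, 4000.0),
]

# Assumed cutoff for calling a row "slower"; pick whatever suits your noise level.
threshold = 1.2

for (name, current_ns, previous_ns) in rows
    r = ratio(current_ns, previous_ns)
    label = r > threshold ? "slower" : "ok"
    println(rpad(label, 8), r, "  ", name)
end
```

Sub-microsecond and single-run CPU timings are noisy, so a single large ratio on a small kernel is not necessarily a real regression.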

Benchmark suite Current: 8b87c2b Previous: 8bfa628 Ratio
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 4709 ns 4625 ns 1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4792 ns 4084 ns 1.17
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 5166 ns 5791 ns 0.89
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4416 ns 4292 ns 1.03
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 60862 ns 60959 ns 1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 10416 ns 10125 ns 1.03
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9875 ns 9959 ns 0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 11417 ns 10375 ns 1.10
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 10542 ns 10666 ns 0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 426730.5 ns 427044 ns 1.00
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1000 ns 1167 ns 0.86
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 1333 ns 1250 ns 1.07
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 1291 ns 1458 ns 0.89
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 1395.5 ns 3542 ns 0.39
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 18565 ns 18260 ns 1.02
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 4209 ns 4125 ns 1.02
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 4042 ns 3833 ns 1.05
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4167 ns 4125 ns 1.01
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 4000 ns 4000 ns 1
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 111556 ns 111381 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 56375 ns 57709 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46916 ns 47250 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46167 ns 38250 ns 1.21
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 80959 ns 80333 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 37697 ns 37655 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2046500 ns 2026167 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2089354 ns 2092708.5 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2048708.5 ns 2059625.5 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1993834 ns 1993416 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 199690 ns 197377 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 147104.5 ns 152958 ns 0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 144104.5 ns 148250 ns 0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 148584 ns 146417 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 144583.5 ns 150375 ns 0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 165605 ns 167595 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1131291 ns 1098542 ns 1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1119584 ns 1124250 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1111791.5 ns 1116146 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1118209 ns 1107229.5 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 531488 ns 523151 ns 1.02
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3500 ns 3584 ns 0.98
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 3709 ns 3625 ns 1.02
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 5520.5 ns 5708.5 ns 0.97
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3375 ns 3417 ns 0.99
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 71213 ns 70157 ns 1.02
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9084 ns 8834 ns 1.03
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9625 ns 8667 ns 1.11
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10167 ns 9291 ns 1.09
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8584 ns 9042 ns 0.95
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 497375 ns 492826.5 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 15458.5 ns 17000 ns 0.91
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 15250 ns 16375 ns 0.93
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 19146 ns 18667 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 14604 ns 17083 ns 0.85
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 55040 ns 54850 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 213833 ns 213146 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 213292 ns 216104 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 215292 ns 214167 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 217500 ns 225333 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 277020.5 ns 272672.5 ns 1.02
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 542 ns 459 ns 1.18
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 583 ns 542 ns 1.08
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 709 ns 709 ns 1
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 625 ns 583 ns 1.07
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 17919 ns 17542 ns 1.02
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1542 ns 1708 ns 0.90
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1458 ns 1458 ns 1
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1916 ns 1625 ns 1.18
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1375 ns 1750 ns 0.79
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 104816 ns 104205 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7250 ns 7250 ns 1
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5875 ns 5833 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5916 ns 5209 ns 1.14
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9875 ns 4000 ns 2.47
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 24078 ns 23961 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 229750 ns 228750.5 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 228583 ns 228333 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 230292 ns 228500 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 213917 ns 226334 ns 0.95
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 172648 ns 170956 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 3875 ns 3875 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 3833 ns 3875 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 3875 ns 3916 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 3875 ns 3834 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 23922 ns 23832 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16458 ns 16833 ns 0.98
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16583 ns 16708 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16958 ns 16708 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16750 ns 16958 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 166168.5 ns 165501.5 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 579542 ns 579042 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 576458 ns 574375 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 578750 ns 575083 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 574667 ns 576292 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 113828 ns 113664 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1424688 ns 1417708 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1421083 ns 1429333 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1423208.5 ns 1425729.5 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 1419500 ns 1422208 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 215564 ns 214791 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) 1071229.5 ns 1082104 ns 0.99
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) 961417 ns 959958.5 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) 1343000 ns 1341792 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) 1300000.5 ns 1294792 ns 1.00
lenet(28, 28, 1, 64)/forward/GPU/CUDA 277770.5 ns 281583.5 ns 0.99
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) 5955916 ns 5777875 ns 1.03
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) 4519500 ns 4456083 ns 1.01
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) 4916354.5 ns 4934792 ns 1.00
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) 5726333 ns 5627500 ns 1.02
lenet(28, 28, 1, 64)/zygote/GPU/CUDA 1105672 ns 1106964 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 500 ns 500 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 500 ns 500 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 583 ns 542 ns 1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 500 ns 542 ns 0.92
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 24042 ns 23988 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2084 ns 2084 ns 1
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2084 ns 2083 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2208 ns 2125 ns 1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2125 ns 2125 ns 1
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 173326.5 ns 179026 ns 0.97
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 4000 ns 6084 ns 0.66
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 4584 ns 6167 ns 0.74
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7083 ns 7041 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 4125 ns 6375 ns 0.65
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 65959 ns 66163.5 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11084 ns 11291 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11000 ns 10791 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12292 ns 12125 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10791 ns 11354.5 ns 0.95
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 456125.5 ns 456626.5 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7000 ns 7000 ns 1
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6458 ns 7042 ns 0.92
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8500 ns 8375 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6292 ns 7042 ns 0.89
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 54186 ns 52652 ns 1.03
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 16708 ns 17375 ns 0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 17875 ns 17167 ns 1.04
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 18750 ns 17770.5 ns 1.06
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 16875 ns 18708 ns 0.90
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 308312 ns 306093.5 ns 1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 500 ns 459 ns 1.09
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 583 ns 459 ns 1.27
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 583 ns 583 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 500 ns 542 ns 0.92
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 33294 ns 33004 ns 1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 8708 ns 8583 ns 1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9208 ns 8208 ns 1.12
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9458 ns 9583 ns 0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 8292 ns 9042 ns 0.92
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 162415.5 ns 162492.5 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 64625 ns 64542 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 64667 ns 64417 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 64666 ns 64625 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 64625 ns 64750 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 112234 ns 112347.5 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 284395.5 ns 277542 ns 1.02
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 286937.5 ns 281625 ns 1.02
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 285291 ns 288750 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 277917 ns 275500 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 188885.5 ns 189809 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) 3237000 ns 3285583 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) 3046417 ns 3022333.5 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) 3014917 ns 2780375 ns 1.08
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) 3953541.5 ns 4038625 ns 0.98
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA 577323 ns 573967 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) 7569937.5 ns 7586208.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) 7460791.5 ns 7415437 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) 7457666.5 ns 7333375 ns 1.02
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) 8209666 ns 8220958 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA 1380365.5 ns 1351752.5 ns 1.02
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) 18994750 ns 18835167 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) 19146458 ns 19044834 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) 19185583 ns 19135125 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) 15773833 ns 15633417 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 24040875 ns 23661916.5 ns 1.02
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 33769833 ns 33965500 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37025062.5 ns 41107417 ns 0.90
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34849833 ns 34858709 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1855448 ns 1862815 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 192176500 ns 189289541 ns 1.02
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 165400792 ns 164224708 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 153088459 ns 157847979 ns 0.97
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 439540208 ns 438904833 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 13926820 ns 13913764 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 292222499.5 ns 289733584 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 338088333 ns 338173667 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 298393250 ns 307489541.5 ns 0.97
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 394164437.5 ns 393585937.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 23395.5 ns 21708.5 ns 1.08
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 23000 ns 24458 ns 0.94
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 26479.5 ns 25937 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 22271 ns 24229 ns 0.92
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 96215.5 ns 96907 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 103541.5 ns 103750 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 104375 ns 105292 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 105000 ns 104208 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 106291 ns 151250 ns 0.70
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 499410 ns 504189 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7125 ns 6583 ns 1.08
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6542 ns 7292 ns 0.90
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7916 ns 7959 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5875 ns 6958 ns 0.84
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 67753 ns 68581 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 15250 ns 14916.5 ns 1.02
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15500 ns 14709 ns 1.05
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 16666 ns 16666 ns 1
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14667 ns 14292 ns 1.03
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 471687 ns 483895 ns 0.97
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3030208.5 ns 3017937 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2057020.5 ns 2022458 ns 1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2271375 ns 2307959 ns 0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4518521 ns 4846645.5 ns 0.93
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA 585712 ns 585796 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 23780833 ns 23617917 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 17907042 ns 17975417 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 16907896 ns 18323812.5 ns 0.92
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 34889792 ns 35597209 ns 0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3222471 ns 3109235 ns 1.04
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 33703875 ns 33405687.5 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 27577959 ns 27693604 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 27463958 ns 27860958 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 41773187 ns 42002937.5 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 73687.5 ns 72375 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 73292 ns 84624.5 ns 0.87
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 83417 ns 83250 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 74667 ns 73750 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 101830 ns 102852 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 318542 ns 218167 ns 1.46
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 216770.5 ns 309979 ns 0.70
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 219750 ns 317479 ns 0.69
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 297396 ns 288875 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 550055 ns 550996 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 11937.5 ns 12041 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 11958 ns 12729.5 ns 0.94
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 13395.5 ns 13833 ns 0.97
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 11584 ns 11666.5 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 71500 ns 71604 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 26666 ns 26625 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 26875 ns 26959 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 27792 ns 28292 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26500 ns 26458 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 478647.5 ns 484486.5 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 12458 ns 12417 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 12750 ns 12542 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 14042 ns 14584 ns 0.96
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 12042 ns 13041.5 ns 0.92
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 54279 ns 53694 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 25792 ns 26312.5 ns 0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 25791 ns 26270.5 ns 0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 26584 ns 26667 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 25833.5 ns 26333 ns 0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 307846.5 ns 309291.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 180187.5 ns 178770.5 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 179750 ns 182334 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 183375 ns 184895.5 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 179041 ns 179750 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 57080 ns 57908 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 584708.5 ns 587125 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 587833 ns 596500 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 595750 ns 593770.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 587000 ns 583166 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 286439 ns 290369.5 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6541.5 ns 7354.5 ns 0.89
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6708 ns 7167 ns 0.94
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7500 ns 7875 ns 0.95
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5750 ns 6833 ns 0.84
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 70275 ns 70829 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 13937.5 ns 14375 ns 0.97
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14708 ns 14708 ns 1
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15583 ns 15625 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 13500 ns 14083 ns 0.96
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 465284 ns 471312.5 ns 0.99
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 1198000 ns 1235042 ns 0.97
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 1218958 ns 1283583 ns 0.95
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 1268562.5 ns 1282875 ns 0.99
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 1315416 ns 1325208 ns 0.99
batchedmm(512, Bsize=4)/forward/GPU/CUDA 302635 ns 301270 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 4311792 ns 4111125 ns 1.05
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 4360354 ns 4361625 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 4524583 ns 4786395.5 ns 0.95
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 4481833 ns 4453229.5 ns 1.01
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1039337 ns 1047552 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1833 ns 1792 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1834 ns 1750 ns 1.05
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1834 ns 1834 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1833 ns 1834 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 23819 ns 23328 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4834 ns 4833 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4875 ns 4792 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 5000 ns 4917 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4875 ns 4917 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 189325 ns 186698 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6250 ns 7208.5 ns 0.87
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6084 ns 5584 ns 1.09
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 8291 ns 8667 ns 0.96
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5750 ns 7312.5 ns 0.79
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 56699 ns 54539 ns 1.04
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 11125 ns 10833 ns 1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 12083 ns 10834 ns 1.12
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 11875 ns 12375 ns 0.96
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 11125 ns 11916 ns 0.93
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 333470 ns 329099 ns 1.01
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 333 ns 292 ns 1.14
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 333 ns 292 ns 1.14
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 333 ns 334 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 292 ns 333 ns 0.88
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 23140 ns 22753 ns 1.02
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2834 ns 2708 ns 1.05
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2709 ns 2667 ns 1.02
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 3042 ns 2959 ns 1.03
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2750 ns 3000 ns 0.92
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 160474 ns 157496 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 11833 ns 13167 ns 0.90
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 12500 ns 13166 ns 0.95
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 15000 ns 15000 ns 1
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 11667 ns 13792 ns 0.85
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 57479 ns 55218 ns 1.04
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 24667 ns 24833 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 25000 ns 24542 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25583 ns 25375 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 25125 ns 24709 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 294701.5 ns 289966 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4167 ns 4083 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4208 ns 4166 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4208 ns 4167 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4125 ns 4125 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 25243 ns 24660 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 15959 ns 15958 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16167 ns 16417 ns 0.98
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16500 ns 16042 ns 1.03
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16125 ns 16125 ns 1
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 196657.5 ns 194045.5 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5667 ns 5667 ns 1
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5708 ns 5625 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5709 ns 5750 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5708 ns 5791 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 34103 ns 32989 ns 1.03
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 20375 ns 21125 ns 0.96
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 21166 ns 20459 ns 1.03
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 21500 ns 21542 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 21083 ns 20875 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 178406.5 ns 174273 ns 1.02
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 380541 ns 403209 ns 0.94
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 375333 ns 371125 ns 1.01
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 487875 ns 474292 ns 1.03
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 532687 ns 539604.5 ns 0.99
batchedmm(16, Bsize=512)/forward/GPU/CUDA 67192 ns 66734 ns 1.01
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 993167 ns 1011917 ns 0.98
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 884334 ns 884896 ns 1.00
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 1238562.5 ns 1220125 ns 1.02
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 1412624.5 ns 1400208 ns 1.01
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 189581 ns 190566.5 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 86875 ns 82917 ns 1.05
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 80583 ns 82791 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 85875 ns 88958.5 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 80791.5 ns 83187.5 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 192886.5 ns 192556.5 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1924208 ns 1921500 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1916917 ns 1696166 ns 1.13
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1920541 ns 1938083 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1907750 ns 1915875 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 398152 ns 393732 ns 1.01
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 291 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 291 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 22307 ns 21580 ns 1.03
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1791 ns 1792 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1792 ns 1792 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1833 ns 1833 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1792 ns 1833 ns 0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 170162 ns 165924 ns 1.03
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6792 ns 6708 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 7458.5 ns 6250 ns 1.19
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 9604.5 ns 9750 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6458.5 ns 8125 ns 0.79
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 60140 ns 56950.5 ns 1.06
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8875 ns 8916.5 ns 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9208 ns 8958 ns 1.03
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9250 ns 9625 ns 0.96
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9208 ns 9542 ns 0.96
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 308605.5 ns 299584.5 ns 1.03
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 156095333.5 ns 120035854.5 ns 1.30
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 174294250 ns 174382959 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 147908167 ns 154831333 ns 0.96
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 105395375 ns 103109500 ns 1.02
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5479498 ns 5474606 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 674867041 ns 617124000 ns 1.09
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 555334333 ns 555612167 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 454020333.5 ns 468382792 ns 0.97
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 758003104 ns 756087750 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 34951781 ns 38213656 ns 0.91
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 701059834 ns 651747459 ns 1.08
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 666716125.5 ns 666674583.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 580121499.5 ns 602170708.5 ns 0.96
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 741952792 ns 734251875 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57708 ns 57208 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 47333 ns 48167 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47250 ns 39167 ns 1.21
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83959 ns 83958 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 37806 ns 37250 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1934958.5 ns 1929792 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1972000 ns 1973292 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1976374.5 ns 1984249.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1886667 ns 1881417 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 174540 ns 171491 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 274833.5 ns 273354 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 267625 ns 267959 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 288750 ns 270687.5 ns 1.07
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 275791.5 ns 268834 ns 1.03
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 127747 ns 124192.5 ns 1.03
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 588791.5 ns 658333 ns 0.89
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 676334 ns 674854.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 669375.5 ns 665333 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 637708 ns 670500 ns 0.95
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 705367 ns 664813 ns 1.06
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2201812.5 ns 2190167 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2173417 ns 2214354.5 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2204166 ns 2216958.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2175854 ns 2099979 ns 1.04
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 133869 ns 133238 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5561000 ns 5505354.5 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5485083 ns 5504750 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5500791 ns 5565292 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5486667 ns 5499708 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 758600 ns 740235 ns 1.02
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 650375 ns 650417 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 639375 ns 649020.5 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 639250 ns 640625 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 645541 ns 648292 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 46906 ns 47265 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1797375 ns 1821708 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1723000 ns 1720959 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1729417 ns 1675729.5 ns 1.03
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 2102375 ns 2108500 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 224012.5 ns 224014 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57125 ns 58583 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46792 ns 46645.5 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46792 ns 38750 ns 1.21
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83625 ns 83834 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 28934 ns 28947 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2042125 ns 2024916 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2085750 ns 2086188 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2086104 ns 2100521 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1992187.5 ns 1993416.5 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 192769 ns 191815.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 13486000 ns 13473875 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 12454854 ns 12547041.5 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 12584062 ns 12559604 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 15166646 ns 15213416.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 516981.5 ns 517805 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 47757417 ns 47353458 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 41920875 ns 41833334 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 41057895.5 ns 41118750 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 58660917 ns 58300041 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3200471 ns 3203904 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 74173979 ns 74077042 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 68296125 ns 68022250 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 90853250 ns 90906749.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 76369500 ns 99115937.5 ns 0.77
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57542 ns 58958 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 47333 ns 47375 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47208 ns 38729.5 ns 1.22
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83542 ns 83500 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 47283 ns 47777 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1917416.5 ns 1923375 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1969750 ns 1961541 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1977666 ns 1980229 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1891062.5 ns 1890354 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 191945 ns 194350.5 ns 0.99
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 250 ns 291 ns 0.86
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 417 ns 291 ns 1.43
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 416 ns 375 ns 1.11
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 333 ns 375 ns 0.89
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 32084 ns 32617.5 ns 0.98
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6166 ns 6208.5 ns 0.99
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6417 ns 5958 ns 1.08
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6959 ns 6708 ns 1.04
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6334 ns 6437.5 ns 0.98
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 173427.5 ns 173722.5 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 291 ns 250 ns 1.16
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 250 ns 1.17
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 250 ns 292 ns 0.86
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 31620 ns 32110 ns 0.98
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 2667 ns 2583 ns 1.03
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 2792 ns 2542 ns 1.10
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 2959 ns 2833 ns 1.04
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 2625 ns 2833 ns 0.93
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 161588.5 ns 161891 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 322222750 ns 286335145.5 ns 1.13
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 341161875 ns 339870250 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 313409520.5 ns 320445937.5 ns 0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 272857666 ns 272825875 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 7106282 ns 7113314 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 1057275812.5 ns 990386709 ns 1.07
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 937359791 ns 938484666 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 852420750 ns 868613416.5 ns 0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1161160000 ns 1158749666 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 34076180 ns 33903874 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1357441042 ns 1310266104.5 ns 1.04
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1321006541.5 ns 1325766333.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1604272875 ns 1623996500 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1302899708.5 ns 1663239334 ns 0.78
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1417312.5 ns 1461479 ns 0.97
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1438625 ns 1415750 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1422375 ns 1429167 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1404187.5 ns 1414437.5 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 127360 ns 128213 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5059667 ns 5019792 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5032458 ns 5022458 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5024750 ns 5050000 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5017709 ns 5006541.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 498493.5 ns 557532 ns 0.89
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) 172134417 ns 175263520.5 ns 0.98
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) 132190854 ns 129816208.5 ns 1.02
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) 125671875 ns 145953208.5 ns 0.86
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) 162159562.5 ns 164619104.5 ns 0.99
vgg16(32, 32, 3, 32)/forward/GPU/CUDA 4881912.5 ns 4883992 ns 1.00
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) 676531000 ns 831528333 ns 0.81
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) 642244500 ns 497840084 ns 1.29
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) 502997666 ns 556789916 ns 0.90
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) 678617458 ns 679969833 ns 1.00
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA 17408311 ns 16195623 ns 1.07
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 9098854 ns 8914083 ns 1.02
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 8775166.5 ns 8769917 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 7856833.5 ns 8216313 ns 0.96
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 10166000 ns 10158000 ns 1.00
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1591045 ns 1595526 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 37558563 ns 35894250 ns 1.05
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 37073459 ns 36843625 ns 1.01
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 33526542 ns 34476562 ns 0.97
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 38790125 ns 38802729 ns 1.00
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6476971 ns 6454567.5 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 47333 ns 47396 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 47333 ns 49334 ns 0.96
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 47625 ns 47542 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 47125 ns 47417 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 19085 ns 19457 ns 0.98
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 50333 ns 50292 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 52875 ns 50520.5 ns 1.05
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 53083 ns 50584 ns 1.05
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 50250 ns 50250 ns 1
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 184149.5 ns 189575 ns 0.97
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7458 ns 8104 ns 0.92
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 7333 ns 6791 ns 1.08
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 8667 ns 9125 ns 0.95
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 6708 ns 7333 ns 0.91
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 84192.5 ns 86829.5 ns 0.97
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9917 ns 9875 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9917 ns 9583 ns 1.03
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 11041 ns 10375 ns 1.06
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9917 ns 10208 ns 0.97
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 493810 ns 537525 ns 0.92
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 7542 ns 8208 ns 0.92
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 7667 ns 8250 ns 0.93
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 9667 ns 9812.5 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5417 ns 6375 ns 0.85
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 91440.5 ns 113788.5 ns 0.80
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12625 ns 13333.5 ns 0.95
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 13833 ns 12625 ns 1.10
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 14000 ns 13584 ns 1.03
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12666 ns 13208 ns 0.96
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 454481 ns 479705.5 ns 0.95
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 1000 ns 958 ns 1.04
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 1083 ns 958 ns 1.13
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 1083 ns 1042 ns 1.04
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 1000 ns 1083 ns 0.92
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 32617 ns 32580 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7708 ns 7750 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8145.5 ns 7625 ns 1.07
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8500 ns 8542 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8125 ns 8208 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 196206.5 ns 201701.5 ns 0.97
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 23083 ns 23250 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 23375 ns 23042 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 23583 ns 23500 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 23750 ns 23167 ns 1.03
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 18627 ns 18765.5 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 52500 ns 52875 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 52875 ns 52292 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 53417 ns 52792 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 52166 ns 52459 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 249106 ns 260844.5 ns 0.95
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1448167 ns 1400229 ns 1.03
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1405000 ns 1398666.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1405874.5 ns 1400708 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1403917 ns 1398917 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 195637 ns 196521.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5038167 ns 5018604 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5020646 ns 5004729.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5017458 ns 5044229.5 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5008375 ns 5001271 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 558064 ns 595122 ns 0.94
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3065354.5 ns 3043083 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2082084 ns 2094042 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2285291 ns 2287146 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4897375 ns 4530875 ns 1.08
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 583035 ns 582703 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 24715854 ns 24366625 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18870292 ns 18829583 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 18758208 ns 19120291 ns 0.98
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 36783917 ns 36653000 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3184571 ns 3189516.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 34426125 ns 33943229 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 28319896 ns 28373417 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 28022958.5 ns 28357208 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 41761166.5 ns 41659750 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 144957333 ns 144299750 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 142855500 ns 142248375 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 124763354 ns 126632146 ns 0.99
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 173311167 ns 173840291.5 ns 1.00
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22559600 ns 22781482 ns 0.99
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 956543708 ns 1307941437.5 ns 0.73
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 1622781604 ns 1133574500.5 ns 1.43
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 1236835833 ns 711240125 ns 1.74
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 673901750 ns 670828250 ns 1.00
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 118606884 ns 118499942 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 74208 ns 74542 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 74834 ns 73917 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 86875 ns 83125 ns 1.05
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 73041.5 ns 72916.5 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 204598.5 ns 225032.5 ns 0.91
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 278208.5 ns 202979.5 ns 1.37
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 202666.5 ns 282792 ns 0.72
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 288416 ns 253479.5 ns 1.14
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 287917 ns 244146 ns 1.18
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1117217.5 ns 1201754 ns 0.93
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 36148959 ns 35408938 ns 1.02
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 35295854 ns 35449645.5 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 32189834 ns 32512083 ns 0.99
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 40944021 ns 41003541.5 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5845476 ns 5848198 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 151293125 ns 146608875 ns 1.03
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 152622708.5 ns 151542938 ns 1.01
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 134152417 ns 138849083 ns 0.97
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 287902584 ns 287439584 ns 1.00
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 34882228 ns 34913824 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 155688000 ns 121086291.5 ns 1.29
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 174601250 ns 174190000 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 147696687.5 ns 155717667 ns 0.95
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 106151041.5 ns 106488666.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5471843 ns 5478422 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 518343938 ns 611208666 ns 0.85
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 467330167 ns 466441167 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 438511083.5 ns 453562937.5 ns 0.97
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 738327500 ns 741621625 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 32271735 ns 35157227 ns 0.92
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 689829417 ns 648662584 ns 1.06
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 655962042 ns 657411208 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 572893458 ns 585962375 ns 0.98
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 850499333 ns 845072208 ns 1.01
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) 1204208 ns 1304708 ns 0.92
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) 909228.5 ns 965666 ns 0.94
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) 975604.5 ns 744354 ns 1.31
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) 2068166 ns 1944604 ns 1.06
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA 573967.5 ns 572387 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) 2921979 ns 2974271 ns 0.98
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) 2595937 ns 2531646 ns 1.03
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) 2601958 ns 2512854 ns 1.04
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) 3701291 ns 3691334 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA 1629819 ns 1817474 ns 0.90
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) 6735042 ns 6642416 ns 1.01
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) 6496187.5 ns 6630792 ns 0.98
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) 6432833.5 ns 6466375 ns 0.99
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) 4458667 ns 4443145.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7208 ns 7334 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6084 ns 6208 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6125 ns 5458 ns 1.12
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10000 ns 10167 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 25112 ns 25916 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 214479.5 ns 212104 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 219625 ns 219562.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 221583 ns 220667 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 206125 ns 206291 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 247799 ns 257490 ns 0.96
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) 312548750 ns 301772791.5 ns 1.04
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) 223228250 ns 222879750 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) 196993083 ns 222700312.5 ns 0.88
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) 310829208 ns 311773125 ns 1.00
vgg16(32, 32, 3, 64)/forward/GPU/CUDA 7675013 ns 7676597.5 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) 1097849625.5 ns 1082870459 ns 1.01
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) 906889750 ns 892532250 ns 1.02
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) 868243875 ns 883941208.5 ns 0.98
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) 1161595250 ns 1154293562 ns 1.01
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA 26504585 ns 26959026 ns 0.98
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5250 ns 6459 ns 0.81
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 6520.5 ns 5209 ns 1.25
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7375 ns 10000 ns 0.74
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5125 ns 5708.5 ns 0.90
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 155225.5 ns 168546.5 ns 0.92
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6917 ns 7458 ns 0.93
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7541 ns 6792 ns 1.11
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7584 ns 7542 ns 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7250 ns 7792 ns 0.93
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 614403 ns 639812.5 ns 0.96
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 500 ns 458 ns 1.09
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 542 ns 458 ns 1.18
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 584 ns 542 ns 1.08
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 458 ns 542 ns 0.85
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 24324 ns 24361 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9209 ns 9000 ns 1.02
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9333 ns 9000 ns 1.04
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9709 ns 9583 ns 1.01
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9083 ns 9708 ns 0.94
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 214987 ns 234125.5 ns 0.92
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 352000 ns 351500 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 351167 ns 351500 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 352000 ns 351916 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 351667 ns 356625 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 21526 ns 21502 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 822667 ns 811270.5 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 803791 ns 774958.5 ns 1.04
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 774000 ns 776584 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 819209 ns 821875 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 271931 ns 315795.5 ns 0.86
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 315625 ns 335896 ns 0.94
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 334062.5 ns 338208.5 ns 0.99
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 448958 ns 441167 ns 1.02
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 335542 ns 331375 ns 1.01
batchedmm(16, Bsize=32)/forward/GPU/CUDA 18135.5 ns 18761.5 ns 0.97
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 693229 ns 695166 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 737125 ns 738208 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 1034583 ns 1036458 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 697563 ns 692396 ns 1.01
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 240714.5 ns 292461.5 ns 0.82
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 329166 ns 354166.5 ns 0.93
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 345354 ns 346771 ns 1.00
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 424875 ns 433791 ns 0.98
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 374166 ns 370250 ns 1.01
batchedmm(16, Bsize=128)/forward/GPU/CUDA 22796 ns 23121 ns 0.99
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 753187.5 ns 757417 ns 0.99
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 751083 ns 749625 ns 1.00
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 1069042 ns 1070562.5 ns 1.00
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 824250 ns 828458 ns 0.99
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 214489 ns 257074.5 ns 0.83
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 3458 ns 3292 ns 1.05
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 3500 ns 3458 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 3875 ns 3750 ns 1.03
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 3292 ns 3417 ns 0.96
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 18145 ns 18586 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 4417 ns 4167 ns 1.06
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 4208 ns 4375 ns 0.96
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 4333 ns 4417 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 4209 ns 4250 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 237972.5 ns 296700.5 ns 0.80
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6417 ns 3625 ns 1.77
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4042 ns 3750 ns 1.08
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6542 ns 6541 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3375 ns 6354.5 ns 0.53
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 174590 ns 232189.5 ns 0.75
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8209 ns 8187.5 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8250 ns 8000 ns 1.03
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8708 ns 8458 ns 1.03
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8709 ns 8500 ns 1.02
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 1063088 ns 1227082 ns 0.87
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 203375 ns 203417 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 209625 ns 209541.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 210958 ns 208250 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 200833 ns 198709 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 34926 ns 35300 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 601916 ns 612417 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 633750 ns 623292 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 622208.5 ns 623250 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 586000 ns 630166 ns 0.93
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 307649.5 ns 347973 ns 0.88
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 966417 ns 977646 ns 0.99
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 932833 ns 935437.5 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 945958.5 ns 970083 ns 0.98
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 1291166 ns 1286374.5 ns 1.00
batchedmm(128, Bsize=128)/forward/GPU/CUDA 208387 ns 209031 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 4606250 ns 4514333 ns 1.02
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 4489917 ns 4466146 ns 1.01
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 4299708 ns 4452875 ns 0.97
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 6229250 ns 6260416.5 ns 1.00
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 933347.5 ns 947144.5 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3875 ns 3542 ns 1.09
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3833 ns 3417 ns 1.12
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6167 ns 5896 ns 1.05
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 2917 ns 6667 ns 0.44
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 191984.5 ns 219336.5 ns 0.88
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7666 ns 6917 ns 1.11
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7125 ns 6958 ns 1.02
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7667 ns 7708 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7208 ns 7291 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 941897 ns 1020167.5 ns 0.92
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1602667 ns 1635042 ns 0.98
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1171416 ns 1200395.5 ns 0.98
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1364375 ns 1363584 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2512583 ns 2345187.5 ns 1.07
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA 215456.5 ns 215784.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12345833 ns 12316854.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9563708.5 ns 9564000 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9248333 ns 9378437.5 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18039541.5 ns 17989542 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1941766 ns 1948181 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17410875 ns 17368125 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14343875 ns 14382958 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14290187.5 ns 14502250 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21033375 ns 21085917 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 93146 ns 90917 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 89750 ns 89500 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 92375 ns 91833 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 104667 ns 113437.5 ns 0.92
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 126306.5 ns 126891 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2057146 ns 2009625 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2030833 ns 2030000 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2027062.5 ns 2039270.5 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2024458 ns 1871125 ns 1.08
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 951168 ns 1032563 ns 0.92
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 327771 ns 342166.5 ns 0.96
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 344667 ns 343375 ns 1.00
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 393729 ns 406458 ns 0.97
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 312667 ns 311729 ns 1.00
batchedmm(2, Bsize=4)/forward/GPU/CUDA 16220 ns 16465.5 ns 0.99
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 703375.5 ns 706208 ns 1.00
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 721271 ns 728542 ns 0.99
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 1023666.5 ns 1018584 ns 1.00
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 653917 ns 650375 ns 1.01
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 187186 ns 195366.5 ns 0.96
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7083 ns 7375 ns 0.96
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6125 ns 5875 ns 1.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5833 ns 5416 ns 1.08
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9916 ns 10000 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 34409 ns 34591 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 214083 ns 243791 ns 0.88
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 222333.5 ns 220125 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 221187.5 ns 221083 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 206125 ns 239167 ns 0.86
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 301322.5 ns 327793 ns 0.92
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3708 ns 3667 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3708 ns 3667 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3667 ns 3709 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3625 ns 3708 ns 0.98
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 23004 ns 22616 ns 1.02
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14250 ns 14292 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 14333 ns 14416 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14500 ns 14208 ns 1.02
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14416 ns 14417 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 460312.5 ns 480334.5 ns 0.96
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 92937.5 ns 94458 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 133375 ns 92625 ns 1.44
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 96583.5 ns 96875 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 136958 ns 96229.5 ns 1.42
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 125681 ns 126007 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1754208.5 ns 1714792 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1922334 ns 1926792 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1933417 ns 1913291.5 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1927416.5 ns 1711417 ns 1.13
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 955943 ns 1034230 ns 0.92
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) 857708 ns 876916.5 ns 0.98
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) 817583 ns 817791 ns 1.00
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) 1222291.5 ns 1169438 ns 1.05
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) 963416 ns 966187.5 ns 1.00
lenet(28, 28, 1, 32)/forward/GPU/CUDA 275885 ns 275657.5 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) 2826354 ns 2828583 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) 2472708.5 ns 2474833 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) 3311750 ns 3335750 ns 0.99
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) 3417042 ns 3304292 ns 1.03
lenet(28, 28, 1, 32)/zygote/GPU/CUDA 1599363 ns 1618381.5 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 15667 ns 16709 ns 0.94
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 15541 ns 15625 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 18791 ns 18667 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 15042 ns 15583 ns 0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 143363 ns 142594 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 221562 ns 228750 ns 0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 257625 ns 215750 ns 1.19
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 216167 ns 217625 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 253521 ns 255500 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 648580 ns 641543.5 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 221958 ns 222458 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 222584 ns 221500 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 222875 ns 223458.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 219542 ns 222604.5 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 287448 ns 269850.5 ns 1.07
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 560521 ns 537583 ns 1.04
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 506729 ns 497334 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 497875 ns 499583 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 524917 ns 526833 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1378195 ns 1430878.5 ns 0.96
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 312208.5 ns 330125 ns 0.95
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 334917 ns 332834 ns 1.01
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 355354.5 ns 435458.5 ns 0.82
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 323229.5 ns 315917 ns 1.02
batchedmm(16, Bsize=4)/forward/GPU/CUDA 16853 ns 16581 ns 1.02
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 710916 ns 717084 ns 0.99
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 725333.5 ns 728166.5 ns 1.00
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 1020291 ns 1021104 ns 1.00
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 666458 ns 662729.5 ns 1.01
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 196645 ns 195479.5 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 18292 ns 17875 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 17250 ns 17167 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 20250 ns 20250 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 16687 ns 17208 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 147801.5 ns 145639 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 219292 ns 223750 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 219437.5 ns 212417 ns 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 213646 ns 214041 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 222104.5 ns 221917 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1001312.5 ns 1035551.5 ns 0.97
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6458 ns 6708 ns 0.96
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 4792 ns 6333 ns 0.76
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7250 ns 7208 ns 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 4458 ns 6625 ns 0.67
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 238642 ns 240542 ns 0.99
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 10792 ns 10584 ns 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10375 ns 9917 ns 1.05
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 11375 ns 11166.5 ns 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10333 ns 10917 ns 0.95
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 1064757 ns 1097401.5 ns 0.97
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6042 ns 3500 ns 1.73
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3792 ns 3208 ns 1.18
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4750 ns 6333.5 ns 0.75
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3209 ns 6750 ns 0.48
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 236410 ns 250006 ns 0.95
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7250 ns 7625 ns 0.95
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7333 ns 7084 ns 1.04
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8042 ns 8125 ns 0.99
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7584 ns 7500 ns 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 1074231 ns 1102649 ns 0.97
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 24130479 ns 23315625 ns 1.03
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 38799500 ns 34529125 ns 1.12
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37733750 ns 41513333.5 ns 0.91
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34918167 ns 34929834 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1843476 ns 1838602 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 186803646 ns 184421875 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 159613166 ns 159459792 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 146295625 ns 151225083 ns 0.97
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 412659125 ns 413223958 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 16523543 ns 16387494 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 436777542 ns 428743125 ns 1.02
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 253178667 ns 252439020.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 232826083.5 ns 233017396 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 484428667 ns 484197291 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 183792 ns 183584 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 182000 ns 182750 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 185584 ns 186625 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 182354.5 ns 183146 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 220958.5 ns 228677.5 ns 0.97
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 593000 ns 596083 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 587187 ns 586292 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 588166 ns 589770.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 632000 ns 631958 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1068694.5 ns 1119701 ns 0.95
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 3862583.5 ns 3838833 ns 1.01
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 3623187 ns 3643375.5 ns 0.99
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 3513333 ns 3563521 ns 0.99
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 5351459 ns 5359750 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/CUDA 534395 ns 537722 ns 0.99
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 17921270.5 ns 17412417 ns 1.03
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 17168125 ns 17190667 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 16586271 ns 17100375 ns 0.97
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 22125084 ns 22144083 ns 1.00
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2619299 ns 2612799 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 542 ns 542 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 500 ns 458 ns 1.09
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 625 ns 542 ns 1.15
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 459 ns 583 ns 0.79
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 32390 ns 32035 ns 1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9417 ns 9208 ns 1.02
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 8875 ns 8542 ns 1.04
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10125 ns 10208 ns 0.99
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9125 ns 9459 ns 0.96
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 265134.5 ns 264327.5 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) 505787208 ns 504274209 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) 430827229 ns 430218396 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) 432173291.5 ns 471374500 ns 0.92
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) 584857000 ns 672994208.5 ns 0.87
vgg16(32, 32, 3, 128)/forward/GPU/CUDA 12384263 ns 12486595 ns 0.99
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) 2073799791.5 ns 2049529562.5 ns 1.01
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) 1628408167 ns 1632649709 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) 1495535812 ns 1536417708 ns 0.97
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) 2213815333 ns 2205666041.5 ns 1.00
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA 49261027.5 ns 49389302 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1644542 ns 1657645.5 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1184062.5 ns 1189208.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1367187.5 ns 1382000 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2468292 ns 2334125 ns 1.06
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 217369 ns 214982 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12780979.5 ns 12688500 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9943666 ns 9942000 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9649896 ns 9748312.5 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18379437 ns 18407312 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2035807.5 ns 2050613 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17754833 ns 17691583.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14655042 ns 14746041.5 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14543333 ns 14804417 ns 0.98
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21358459 ns 21386084 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 26250 ns 26167 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 26250 ns 26292 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 26583 ns 26291 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 26208 ns 26291 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 23360 ns 24125 ns 0.97
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 66834 ns 66875 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 67542 ns 66917 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 67083 ns 67083 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66875 ns 67209 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 392635.5 ns 398847.5 ns 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 203542 ns 202667 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 209584 ns 209000 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 209708 ns 209167 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 199875 ns 199583 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 25945.5 ns 26392 ns 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 608625 ns 612416.5 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 632958.5 ns 627416.5 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 622333 ns 667979 ns 0.93
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 584541.5 ns 631250 ns 0.93
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 349189 ns 353043.5 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 653500 ns 645542 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 670875 ns 643375 ns 1.04
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 547042 ns 664187.5 ns 0.82
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 675666.5 ns 540834 ns 1.25
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 131441 ns 132126 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2289416 ns 2247375 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2233958 ns 2239958 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2245708 ns 2302917 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2234188 ns 2219000 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1153968 ns 1328726 ns 0.87
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17583 ns 17667 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 17000 ns 16979.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 21083.5 ns 20792 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 17479 ns 18500 ns 0.94
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 142918 ns 146392.5 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 226645.5 ns 229708 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 230417 ns 225333 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 220688 ns 229292 ns 0.96
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 218917 ns 259083 ns 0.84
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 981199 ns 1081671 ns 0.91
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 541 ns 500 ns 1.08
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 583 ns 459 ns 1.27
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 583 ns 542 ns 1.08
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 458 ns 542 ns 0.85
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 23217 ns 23645 ns 0.98
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 10041.5 ns 9833.5 ns 1.02
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 10125 ns 9542 ns 1.06
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 10417 ns 10708 ns 0.97
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9250 ns 9916 ns 0.93
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 255034 ns 262941 ns 0.97
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5916 ns 7291 ns 0.81
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 6229.5 ns 5833 ns 1.07
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 8563 ns 9625 ns 0.89
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5500 ns 7250 ns 0.76
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 222902 ns 234003 ns 0.95
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7250 ns 7333 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7709 ns 7000 ns 1.10
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7750 ns 7833 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6958.5 ns 7250 ns 0.96
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 767625.5 ns 810029.5 ns 0.95
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 2291 ns 2042 ns 1.12
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 2250 ns 2000 ns 1.13
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 2333 ns 2375 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 2333 ns 2208 ns 1.06
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 17725 ns 18218 ns 0.97
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 6542 ns 6542 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 6667 ns 6500 ns 1.03
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 6958 ns 6708 ns 1.04
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 6583 ns 6750 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 317996.5 ns 335368 ns 0.95
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 748750 ns 750166 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 747083 ns 746604.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 747042 ns 751041 ns 0.99
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 749125 ns 761417 ns 0.98
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 21402 ns 21856 ns 0.98
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 790729 ns 775334 ns 1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 790333.5 ns 775042 ns 1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 773125 ns 804792 ns 0.96
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 775458.5 ns 791625 ns 0.98
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 291072 ns 299022 ns 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7209 ns 7375 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6042 ns 5875 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6083 ns 5208 ns 1.17
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10125 ns 10125 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 32814 ns 32492 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 220166 ns 233188 ns 0.94
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 240583 ns 227750 ns 1.06
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 228583 ns 254458 ns 0.90
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 255708 ns 255583 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 355564.5 ns 359227 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 12541 ns 11042 ns 1.14
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 10500 ns 12458 ns 0.84
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 13167 ns 12959 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 10125 ns 12000 ns 0.84
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 239405.5 ns 245075.5 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 24791.5 ns 24875 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24375 ns 24458 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25541 ns 25458 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 24812.5 ns 24583.5 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 1085912 ns 1120608 ns 0.97
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 108107292 ns 106980458 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 117455666.5 ns 118006979.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 120529584 ns 123940208 ns 0.97
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 117307042 ns 118407959 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 2652543 ns 2661574 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 395929750 ns 394378313 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 367066041 ns 368164500 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 354756333 ns 358657167 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 484413208 ns 482282708 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 15198392 ns 15138278 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 767591687.5 ns 759267583 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 579795958 ns 577881125 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 743372729 ns 749378833 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 765609167 ns 945671312.5 ns 0.81
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7458.5 ns 7458 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7479.5 ns 7958 ns 0.94
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8916 ns 8750 ns 1.02
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6708 ns 7333 ns 0.91
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 232243 ns 235620 ns 0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 13917 ns 14500 ns 0.96
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14125 ns 13333 ns 1.06
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15166 ns 15041 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14458 ns 14292 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 1035695.5 ns 1078273.5 ns 0.96
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 9042 ns 8542 ns 1.06
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6833 ns 7792 ns 0.88
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 9750 ns 9187.5 ns 1.06
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5500 ns 7833.5 ns 0.70
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 227355.5 ns 235827.5 ns 0.96
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12625 ns 13167 ns 0.96
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12959 ns 12084 ns 1.07
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12917 ns 13084 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12292 ns 12833 ns 0.96
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 753887 ns 787391.5 ns 0.96
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 327750 ns 347250 ns 0.94
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 342666.5 ns 344875 ns 0.99
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 398083 ns 409896 ns 0.97
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 317437.5 ns 310562 ns 1.02
batchedmm(2, Bsize=128)/forward/GPU/CUDA 16593 ns 16566 ns 1.00
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 702854.5 ns 713833.5 ns 0.98
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 720833 ns 727291 ns 0.99
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 1025771 ns 1023416 ns 1.00
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 661750 ns 654959 ns 1.01
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 196204.5 ns 197250.5 ns 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 292 ns 292 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 291 ns 1.29
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 291 ns 375 ns 0.78
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 23062 ns 23066 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6333 ns 6250 ns 1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6584 ns 6334 ns 1.04
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6792 ns 6750 ns 1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6250 ns 6791 ns 0.92
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 236488 ns 238420 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5833 ns 5750 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 5792 ns 5750 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 5875 ns 5875 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5667 ns 5834 ns 0.97
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 24282 ns 23863 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 21687 ns 21750 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 21584 ns 21000 ns 1.03
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 21750 ns 21958 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 21063 ns 21708 ns 0.97
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 260349.5 ns 261085 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 172458 ns 152146 ns 1.13
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 185292 ns 145250 ns 1.28
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 148917 ns 149541 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 186625 ns 145937 ns 1.28
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 166632 ns 166536.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1351354.5 ns 1328792 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1310042 ns 1319083.5 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1312208 ns 1350812.5 ns 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1317292 ns 1317084 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1279433 ns 1336276 ns 0.96
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 24291 ns 24917 ns 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 22125 ns 24208 ns 0.91
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 25958 ns 25708 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 21916 ns 24208.5 ns 0.91
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 277859 ns 351114.5 ns 0.79
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 127896 ns 131125 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 174583 ns 117791 ns 1.48
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 118667 ns 172917 ns 0.69
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 135125 ns 177334 ns 0.76
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1390180 ns 1465398.5 ns 0.95
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 333 ns 333 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 291 ns 375 ns 0.78
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 22950 ns 22926 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6416.5 ns 6417 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6625 ns 6458 ns 1.03
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6834 ns 6917 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6250 ns 6542 ns 0.96
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 253555 ns 254551 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6000 ns 7625 ns 0.79
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4167 ns 4167 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7375 ns 7708.5 ns 0.96
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4666 ns 7375 ns 0.63
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 241371.5 ns 250274.5 ns 0.96
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10166 ns 10042 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10042 ns 9708 ns 1.03
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10625 ns 10333 ns 1.03
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10333 ns 10250 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 1304285.5 ns 1345295 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1625 ns 1584 ns 1.03
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1584 ns 1625 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1625 ns 1625 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1583 ns 1625 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 22830 ns 22897 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 5708 ns 5625 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 5667 ns 5584 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 6042 ns 5959 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 5583 ns 5958 ns 0.94
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 270940 ns 271438.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 6820479 ns 6886125 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 6334041.5 ns 6378229 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 6486416.5 ns 6526875 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 7665459 ns 7602250 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 213607.5 ns 213111 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 24142500 ns 24073062 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 21253833 ns 21283625 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 20999479 ns 21045584 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 29726209 ns 29677875 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2083084.5 ns 2108165 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 37375166.5 ns 37353145.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 33959583 ns 34386667 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 45667583 ns 45930020.5 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 37873562.5 ns 49322334 ns 0.77
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6979.5 ns 7708.5 ns 0.91
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6667 ns 5875 ns 1.13
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 8104.5 ns 8333 ns 0.97
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5479.5 ns 7062.5 ns 0.78
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 228629.5 ns 238522.5 ns 0.96
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8375 ns 8458 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8375 ns 8042 ns 1.04
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8584 ns 8583 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8125 ns 8292 ns 0.98
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 1060872.5 ns 1070850 ns 0.99
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) 1527229 ns 1544374.5 ns 0.99
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) 1259812.5 ns 1259666.5 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) 1616208 ns 1632771 ns 0.99
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) 2147979 ns 2150667 ns 1.00
lenet(28, 28, 1, 128)/forward/GPU/CUDA 271439 ns 278945 ns 0.97
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) 7973083.5 ns 7908937.5 ns 1.01
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) 6586020.5 ns 6609937 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) 7034625 ns 7237750.5 ns 0.97
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) 10461334 ns 10434334 ns 1.00
lenet(28, 28, 1, 128)/zygote/GPU/CUDA 1861989 ns 1889956 ns 0.99
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 318167 ns 340979 ns 0.93
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 341959 ns 345792 ns 0.99
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 408000 ns 417125 ns 0.98
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 345291 ns 345833 ns 1.00
batchedmm(128, Bsize=4)/forward/GPU/CUDA 46596 ns 42448 ns 1.10
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 734812.5 ns 746500.5 ns 0.98
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 781000 ns 784542 ns 1.00
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 1068667 ns 1073250 ns 1.00
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 746084 ns 761062.5 ns 0.98
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 299516.5 ns 303720.5 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 397708 ns 397500 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 288000 ns 288250 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 288125 ns 212666 ns 1.35
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 752083 ns 756084 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 44143 ns 43887 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 633750 ns 671083 ns 0.94
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 531000 ns 530083 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 530834 ns 470667 ns 1.13
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 973250 ns 974750 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 188258.5 ns 188388.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 667374.5 ns 679250 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 643458.5 ns 645333.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 545833 ns 642458 ns 0.85
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 678833.5 ns 638562.5 ns 1.06
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 131695 ns 131530 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2403188 ns 2409292 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2439250 ns 2456416.5 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2454541 ns 2514583 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2454542 ns 2456292 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1200754 ns 1277300 ns 0.94
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 325000 ns 345146 ns 0.94
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 340500 ns 343583 ns 0.99
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 394250 ns 403708.5 ns 0.98
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 314000 ns 312208 ns 1.01
batchedmm(2, Bsize=32)/forward/GPU/CUDA 15982 ns 16009 ns 1.00
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 702813 ns 709667 ns 0.99
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 719125 ns 724500 ns 0.99
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 1024146 ns 1022687.5 ns 1.00
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 651667 ns 650417 ns 1.00
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 196545 ns 195917 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1458417 ns 1460417 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1503167 ns 1500812.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1499542 ns 1496375 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1439209 ns 1438708 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 40255 ns 40600 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5142459 ns 5128791 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5295000.5 ns 5302375 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5017687.5 ns 5313000 ns 0.94
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4991625 ns 4970208.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 197920.5 ns 196206.5 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3708 ns 3667 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3708 ns 3667 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3667 ns 3709 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3667 ns 3708 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 33701 ns 32895 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14917 ns 15167 ns 0.98
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15333 ns 15083 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15375 ns 15083 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15125 ns 15375 ns 0.98
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 380032 ns 376729 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 71375 ns 71459 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 71292 ns 71250 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 71250 ns 71375 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 71250 ns 70708 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 113118 ns 113177.5 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 322292 ns 317917 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 321459 ns 320417 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 327292 ns 325333 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 318334 ns 320916 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 196182.5 ns 193043 ns 1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 1000 ns 958 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 1083 ns 958 ns 1.13
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 1083 ns 1083 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 958 ns 1042 ns 0.92
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 23902 ns 23363 ns 1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 7959 ns 8083 ns 0.98
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8083 ns 7792 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8541 ns 8750 ns 0.98
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8125 ns 8750 ns 0.93
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 263222.5 ns 260535.5 ns 1.01
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 451021 ns 475499.5 ns 0.95
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 470667 ns 470520.5 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 556978.5 ns 557125 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 567333 ns 557959 ns 1.02
batchedmm(128, Bsize=32)/forward/GPU/CUDA 129930 ns 129404 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 1413124.5 ns 1399270.5 ns 1.01
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 1374375 ns 1382375 ns 0.99
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 1599125 ns 1611125 ns 0.99
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 1589500 ns 1582104.5 ns 1.00
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 275820 ns 274924 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 416 ns 375 ns 1.11
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 375 ns 250 ns 1.50
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 291 ns 375 ns 0.78
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 31985 ns 31647 ns 1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6375 ns 6375 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6375 ns 6042 ns 1.06
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6833 ns 6666 ns 1.03
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6291 ns 6625 ns 0.95
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 265480 ns 262541.5 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1723041.5 ns 1761833 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1770375 ns 1723396 ns 1.03
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1726791 ns 1733812.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1769792 ns 1730625 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 169107.5 ns 169477.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4370833 ns 4358625 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4358458 ns 4358708 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4355958 ns 4403062.5 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4350000 ns 4373875 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1170977 ns 1208123 ns 0.97
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 6625 ns 7167 ns 0.92
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 6750 ns 6875 ns 0.98
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 7041 ns 6916 ns 1.02
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 9000 ns 6750 ns 1.33
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 21354 ns 20662 ns 1.03
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 33104.5 ns 51625 ns 0.64
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 51458 ns 32917 ns 1.56
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 33083 ns 48208.5 ns 0.69
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 51042 ns 51417 ns 0.99
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 211403.5 ns 292106.5 ns 0.72
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 332479 ns 354562.5 ns 0.94
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 345500 ns 348666.5 ns 0.99
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 420625 ns 433333 ns 0.97
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 326208 ns 322041.5 ns 1.01
batchedmm(2, Bsize=512)/forward/GPU/CUDA 18610.5 ns 18353 ns 1.01
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 719166 ns 724625 ns 0.99
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 732604 ns 730583 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 1029625 ns 1038687.5 ns 0.99
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 679354 ns 675333 ns 1.01
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 345590 ns 335730.5 ns 1.03
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 75167 ns 75458 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 75125 ns 75333 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 75292 ns 75375 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 74875 ns 74584 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 47792 ns 46864.5 ns 1.02
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 334542 ns 325166 ns 1.03
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 340667 ns 324250 ns 1.05
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 326000 ns 336875 ns 0.97
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 326708 ns 325125 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 213631.5 ns 209059.5 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1484750 ns 1485709 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1530208 ns 1526833 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1526875 ns 1522792 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1463833 ns 1462625 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 52711 ns 51397 ns 1.03
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5145375.5 ns 5113395.5 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5286834 ns 5295292 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4997792 ns 5300812.5 ns 0.94
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4998437.5 ns 5001042 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 207150 ns 202971.5 ns 1.02
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 28209 ns 28250 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 28250 ns 28208 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 28292 ns 28208 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 28209 ns 28209 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 24880 ns 24514.5 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 66375 ns 66417 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 66584 ns 66458 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 66542 ns 66500 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66541 ns 66500 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 537867.5 ns 505942 ns 1.06
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 1339125 ns 1502084 ns 0.89
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 1143854 ns 1124250 ns 1.02
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 1056979.5 ns 944270.5 ns 1.12
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) 2227833 ns 2255250 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA 577124.5 ns 566674 ns 1.02
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) 3019562 ns 3090791 ns 0.98
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 2730250 ns 2751542 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) 2578250 ns 2628896 ns 0.98
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) 3815792 ns 3819709 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA 2002712 ns 1979936 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) 8920709 ns 8847333 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) 8781875 ns 8768375 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) 8792854 ns 8750250 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) 6367541.5 ns 6340375 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 84000 ns 85125 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 82083 ns 83021 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 84583 ns 85708.5 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 80791.5 ns 83562.5 ns 0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 192031 ns 192703 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2015625 ns 2012875 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2019458.5 ns 2024062.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1745917 ns 2038542 ns 0.86
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2013895.5 ns 2008812 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 797860.5 ns 791664.5 ns 1.01

This comment was automatically generated by workflow using github-action-benchmark.
