non_differentiable gpu_device and cpu_device (#1089)
* non_differentiable gpu_device and cpu_device

* Update lib/MLDataDevices/ext/MLDataDevicesChainRulesCoreExt.jl

* fix: missing imports

* chore: bump version for release

---------

Co-authored-by: Avik Pal <[email protected]>
CarloLucibello and avik-pal authored Nov 17, 2024
1 parent f1e0ad8 commit 38f1a73
Showing 2 changed files with 8 additions and 4 deletions.
2 changes: 1 addition & 1 deletion lib/MLDataDevices/Project.toml
@@ -1,7 +1,7 @@
name = "MLDataDevices"
uuid = "7e8f7934-dd98-4c1a-8fe8-92b47a384d40"
authors = ["Avik Pal <[email protected]> and contributors"]
-version = "1.6.1"
+version = "1.6.2"

[deps]
Adapt = "79e6a3ab-5dfb-504d-930d-738a2a938a0e"
10 changes: 7 additions & 3 deletions lib/MLDataDevices/ext/MLDataDevicesChainRulesCoreExt.jl
@@ -3,10 +3,14 @@ module MLDataDevicesChainRulesCoreExt
using Adapt: Adapt
using ChainRulesCore: ChainRulesCore, NoTangent, ProjectTo, @non_differentiable

-using MLDataDevices: AbstractDevice, UnknownDevice, get_device, get_device_type
+using MLDataDevices: AbstractDevice, UnknownDevice, get_device, get_device_type,
+                     reactant_device, cpu_device, gpu_device

-@non_differentiable get_device(::Any)
-@non_differentiable get_device_type(::Any)
+@non_differentiable get_device(::Any...)
+@non_differentiable get_device_type(::Any...)
+@non_differentiable gpu_device(::Any...)
+@non_differentiable cpu_device(::Any...)
+@non_differentiable reactant_device(::Any...)

function ChainRulesCore.rrule(::typeof(Adapt.adapt), to::AbstractDevice, x::AbstractArray)
    dev = get_device(x)

3 comments on commit 38f1a73

@avik-pal

@JuliaRegistrator register subdir=lib/MLDataDevices

@JuliaRegistrator

Registration pull request created: JuliaRegistries/General/119646

Tip: Release Notes

Did you know you can add release notes too? Just add markdown formatted text underneath the comment after the text
"Release notes:" and it will be added to the registry PR, and if TagBot is installed it will also be added to the
release that TagBot creates. i.e.

@JuliaRegistrator register

Release notes:

## Breaking changes

- blah

To add them here just re-invoke and the PR will be updated.

Tagging

After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.

This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the GitHub interface, or via:

git tag -a MLDataDevices-v1.6.2 -m "<description of version>" 38f1a738326298f6fdb5899be0c8c2f6c0075f48
git push origin MLDataDevices-v1.6.2

@github-actions

Lux Benchmarks

Benchmark suite Current: 38f1a73 Previous: 2331c99 Ratio
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 4208 ns 4125 ns 1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3895.5 ns 4292 ns 0.91
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4750 ns 4875 ns 0.97
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3875 ns 4188 ns 0.93
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 62917.5 ns 61773 ns 1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 11250 ns 10375 ns 1.08
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 10708 ns 10250 ns 1.04
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 11208 ns 10709 ns 1.05
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 10333 ns 10584 ns 0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 438545 ns 433806 ns 1.01
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1125 ns 1209 ns 0.93
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 1167 ns 1208 ns 0.97
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 1541 ns 1334 ns 1.16
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 1250 ns 1333 ns 0.94
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 18677 ns 18632 ns 1.00
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 4166 ns 3958 ns 1.05
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 4041 ns 3770.5 ns 1.07
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4125 ns 4250 ns 0.97
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 4042 ns 3750 ns 1.08
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 112338 ns 111653 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57750 ns 57167 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 38375 ns 46708 ns 0.82
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46291 ns 47042 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82125 ns 85000 ns 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 37665 ns 37778 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2025875 ns 2021166.5 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2090208 ns 2091833 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2087125 ns 2090417 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2003042 ns 2037250 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 199276 ns 197839 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 145125 ns 144125 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 143166 ns 143687.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 145875 ns 145875 ns 1
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 184521 ns 144542 ns 1.28
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 166939 ns 166264.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1108375 ns 815917 ns 1.36
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1112042 ns 1110583 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1118979 ns 1128458 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1120229 ns 1161791.5 ns 0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 539594 ns 531966.5 ns 1.01
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3958 ns 3834 ns 1.03
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 3333 ns 3667 ns 0.91
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4334 ns 4208 ns 1.03
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3375 ns 3875 ns 0.87
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 72226 ns 72027 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9666 ns 9666 ns 1
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9875 ns 9208 ns 1.07
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9959 ns 9667 ns 1.03
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9708.5 ns 8791 ns 1.10
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 496688 ns 495388.5 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 18334 ns 17250 ns 1.06
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 15041 ns 15292 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 17583 ns 17750 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 14666 ns 14875 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 55623 ns 54800 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 215167 ns 213334 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 214250 ns 213667 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 214750 ns 215625 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 214687.5 ns 213125 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 280563 ns 273384.5 ns 1.03
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 708 ns 625 ns 1.13
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 584 ns 500 ns 1.17
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 667 ns 834 ns 0.80
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 583 ns 542 ns 1.08
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 18098 ns 17538 ns 1.03
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1708 ns 1459 ns 1.17
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1709 ns 1625 ns 1.05
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1958 ns 1541 ns 1.27
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1375 ns 1584 ns 0.87
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 104646 ns 101749 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7250 ns 6625 ns 1.09
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5167 ns 5833 ns 0.89
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5875 ns 6000 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9875 ns 10541 ns 0.94
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 24104 ns 23308 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 221209 ns 230042 ns 0.96
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 231959 ns 228000 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 229958 ns 229917 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 226541 ns 215459 ns 1.05
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 172625 ns 167869.5 ns 1.03
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 3917 ns 3875 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 3958 ns 3917 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 3958 ns 3917 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 3875 ns 3958 ns 0.98
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 24227 ns 23769 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 17166 ns 16625 ns 1.03
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16583 ns 16645.5 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16875 ns 16916 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16750 ns 16542 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 165474.5 ns 160993.5 ns 1.03
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 599125 ns 583542 ns 1.03
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 582292 ns 582166 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 578209 ns 573083 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 571084 ns 578334 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 114507 ns 112908 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1442792 ns 1416417 ns 1.02
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1428646 ns 1413563 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1428895.5 ns 1420000 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 1420104.5 ns 1427041.5 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 215306 ns 209512.5 ns 1.03
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) 1075333.5 ns 1074937.5 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) 935375 ns 961625 ns 0.97
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) 1340937.5 ns 1349604 ns 0.99
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) 1309500 ns 1275750 ns 1.03
lenet(28, 28, 1, 64)/forward/GPU/CUDA 279907.5 ns 272786 ns 1.03
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) 5997834 ns 5988250 ns 1.00
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) 4511396 ns 4453229 ns 1.01
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) 4924000 ns 4954875 ns 0.99
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) 5545938 ns 5751250 ns 0.96
lenet(28, 28, 1, 64)/zygote/GPU/CUDA 1109413 ns 1067705 ns 1.04
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 542 ns 500 ns 1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 500 ns 541 ns 0.92
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 542 ns 583 ns 0.93
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 500 ns 583 ns 0.86
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 24308 ns 23552 ns 1.03
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2250 ns 2125 ns 1.06
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2167 ns 2125 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2167 ns 2167 ns 1
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2125 ns 2125 ns 1
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 175730.5 ns 171901 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 4375 ns 4208.5 ns 1.04
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 3625 ns 4417 ns 0.82
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5250 ns 5042 ns 1.04
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 3917 ns 4166 ns 0.94
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 68155 ns 65093 ns 1.05
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11708 ns 11292 ns 1.04
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11709 ns 11292 ns 1.04
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12042 ns 11875 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10917 ns 11417 ns 0.96
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 463688.5 ns 448429 ns 1.03
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7145.5 ns 7020.5 ns 1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 8500 ns 7041 ns 1.21
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8041 ns 7625 ns 1.05
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6167 ns 6500 ns 0.95
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 53097.5 ns 52253 ns 1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 17459 ns 16979.5 ns 1.03
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 16666 ns 17833 ns 0.93
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 18458 ns 18875 ns 0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 16917 ns 16875 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 305688 ns 301549.5 ns 1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 667 ns 584 ns 1.14
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 542 ns 583 ns 0.93
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 625 ns 625 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 541 ns 542 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 33408 ns 32680 ns 1.02
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9375 ns 8750 ns 1.07
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 8375 ns 8834 ns 0.95
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9084 ns 9625 ns 0.94
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 8792 ns 8667 ns 1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 162736 ns 156693 ns 1.04
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 64584 ns 64125 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 64542 ns 64291 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 64500 ns 64458 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 64583 ns 64584 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 112246 ns 111163 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 273833 ns 280625 ns 0.98
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 284750 ns 274250 ns 1.04
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 273833 ns 278083 ns 0.98
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 280125 ns 289292 ns 0.97
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 188461.5 ns 184761.5 ns 1.02
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) 3352917 ns 3374250 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) 2859834 ns 3022020.5 ns 0.95
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) 3018375 ns 3033167 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) 4081521 ns 4059271.5 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA 581514 ns 577014 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) 7365834 ns 7622583.5 ns 0.97
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) 7362708 ns 7400875 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) 7444334 ns 7463083 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) 8211042 ns 8222208 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA 1352556 ns 1350413 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) 18787750 ns 18744750 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) 19099041 ns 19149375 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) 19169000 ns 19037709 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) 15691292 ns 15854917 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23530604 ns 23424208 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 42588125 ns 33648791 ns 1.27
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37157833 ns 37255625 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34914792 ns 35462146 ns 0.98
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1854127 ns 1854361 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 187564000 ns 189507459 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 177265645.5 ns 163150563 ns 1.09
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 152184437.5 ns 151759708 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 437472417 ns 449307375 ns 0.97
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 13907657 ns 13915090 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 289863667 ns 290474792 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 355940583 ns 338390437.5 ns 1.05
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 300149020.5 ns 298728666 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 332483208 ns 400176437.5 ns 0.83
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 24354.5 ns 24666 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 22625 ns 23062.5 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 25333 ns 25125 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 22042 ns 21833 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 100207 ns 95619.5 ns 1.05
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 104583 ns 103041 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 104625 ns 103750 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 105145.5 ns 104584 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 103000 ns 104146 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 519245.5 ns 500114.5 ns 1.04
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6167 ns 6042 ns 1.02
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6417 ns 6500 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7792 ns 6667 ns 1.17
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5833 ns 5958 ns 0.98
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 70633 ns 68217 ns 1.04
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 15250 ns 14833 ns 1.03
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 16250 ns 16208 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15958 ns 16542 ns 0.96
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14792 ns 15541.5 ns 0.95
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 491137.5 ns 474515 ns 1.04
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 2890458 ns 3028583 ns 0.95
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2080375 ns 2072250 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2260667 ns 2258958 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4898959 ns 4727250 ns 1.04
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA 582542 ns 581996.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 23630250 ns 23485750 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18270209 ns 18074583 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 16933041.5 ns 17953667 ns 0.94
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 35714083 ns 36188354.5 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2894491 ns 3102669 ns 0.93
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 33294083 ns 33313750 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 27973458 ns 27588229.5 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 27440417 ns 27385167 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 41585958 ns 42266896 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 74708.5 ns 72125 ns 1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 75729.5 ns 75625 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 76750 ns 75209 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 72041 ns 72313 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 103323 ns 102770.5 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 289146 ns 217709 ns 1.33
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 313625 ns 264292 ns 1.19
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 212416 ns 208812 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 217500 ns 216750 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 555124 ns 548643 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 12250 ns 11834 ns 1.04
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 12042 ns 13750 ns 0.88
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 13125 ns 12208 ns 1.08
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 11416.5 ns 11791.5 ns 0.97
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 72087 ns 71431.5 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 27208 ns 26500 ns 1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 27750 ns 27375 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 27854.5 ns 28000 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26375 ns 27167 ns 0.97
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 476976 ns 474755 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 13000 ns 12292 ns 1.06
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 12750 ns 13250 ns 0.96
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 14667 ns 13625 ns 1.08
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 13000 ns 12625 ns 1.03
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 52635 ns 53420 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 26208 ns 25708 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 26125 ns 26084 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 26208 ns 26375 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26208 ns 26209 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 305320.5 ns 305780 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 180500 ns 181833 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 180729.5 ns 182750 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 181833 ns 182000 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 179750 ns 179750 ns 1
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 56100.5 ns 56584 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 583667 ns 582667 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 586125 ns 589020.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 583750 ns 585562.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 583792 ns 582875 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 289385.5 ns 286509.5 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6395.5 ns 5958 ns 1.07
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 5916 ns 7000 ns 0.85
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8542 ns 6917 ns 1.23
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5459 ns 6167 ns 0.89
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 71686 ns 71314 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14750 ns 14041.5 ns 1.05
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15208 ns 15042 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15208 ns 15334 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14416 ns 15042 ns 0.96
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 468551 ns 465404.5 ns 1.01
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 1187959 ns 1163666 ns 1.02
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 1888958 ns 1608417 ns 1.17
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 1305792 ns 1245958 ns 1.05
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 1334770.5 ns 1315062.5 ns 1.01
batchedmm(512, Bsize=4)/forward/GPU/CUDA 302497.5 ns 301860.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 4137167 ns 4119833.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 4514604.5 ns 4367812.5 ns 1.03
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 4613834 ns 4633625 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 4486208 ns 4681521 ns 0.96
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1040017 ns 1040008 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1834 ns 1875 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1834 ns 1875 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1834 ns 1875 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1834 ns 1916 ns 0.96
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 23961 ns 23628.5 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4875 ns 4875 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4916 ns 4875 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 5000 ns 4917 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4875 ns 4875 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 189260 ns 188198 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6125 ns 5959 ns 1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 5417 ns 6333 ns 0.86
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6750 ns 6584 ns 1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 6000 ns 5625 ns 1.07
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 56193.5 ns 55698 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 11667 ns 10958 ns 1.06
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 11083 ns 11875 ns 0.93
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 12041 ns 11667 ns 1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10333 ns 11041.5 ns 0.94
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 335012.5 ns 330993.5 ns 1.01
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 333 ns 375 ns 0.89
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 334 ns 334 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 333 ns 375 ns 0.89
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 23036 ns 23016 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 3042 ns 2791 ns 1.09
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2750 ns 2750 ns 1
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 3041 ns 3083 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2791 ns 2792 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 159234.5 ns 158081 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 11667 ns 12000 ns 0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 11291 ns 12292 ns 0.92
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 13333 ns 12979 ns 1.03
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 11541 ns 11500 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 58294.5 ns 56764.5 ns 1.03
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 25500 ns 25250 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24541 ns 25292 ns 0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25250 ns 25542 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 24500 ns 25125 ns 0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 299590 ns 293131 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4167 ns 4167 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4167 ns 4209 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4167 ns 4250 ns 0.98
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4208 ns 4250 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 24930 ns 24851 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16208 ns 16084 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 15958 ns 16084 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16291 ns 16250 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16291 ns 16125 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 200508.5 ns 193865.5 ns 1.03
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5875 ns 5875 ns 1
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5875 ns 5833 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5792 ns 5833 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5791 ns 5833 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 33184 ns 33648.5 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 20895.5 ns 20937.5 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 20583 ns 20875 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 20875 ns 21375 ns 0.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 21208 ns 20833 ns 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 177158 ns 175295.5 ns 1.01
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 398333.5 ns 405354.5 ns 0.98
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 350916 ns 383146 ns 0.92
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 490333 ns 487375 ns 1.01
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 531125 ns 505333 ns 1.05
batchedmm(16, Bsize=512)/forward/GPU/CUDA 66865 ns 67095 ns 1.00
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 932125 ns 921500 ns 1.01
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 886416 ns 879833.5 ns 1.01
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 1236834 ns 1239500 ns 1.00
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 1389000 ns 1413875 ns 0.98
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 189939 ns 190914 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 82541.5 ns 80792 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 80354 ns 80625 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 82437.5 ns 82416.5 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 97083 ns 82208.5 ns 1.18
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 192428 ns 193084 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1914917 ns 1921166 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1939042 ns 1923375 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1920479.5 ns 1702792 ns 1.13
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1905042 ns 1942625 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 410076 ns 397267 ns 1.03
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 333 ns 292 ns 1.14
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 333 ns 292 ns 1.14
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 333 ns 292 ns 1.14
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 22365 ns 22298 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1875 ns 1792 ns 1.05
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1834 ns 1792 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1875 ns 1875 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1833 ns 1792 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 175349 ns 171128.5 ns 1.02
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 7375 ns 6750 ns 1.09
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 6458 ns 7125 ns 0.91
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7333 ns 7750 ns 0.95
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6584 ns 6583 ns 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 60938.5 ns 60207.5 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9584 ns 9334 ns 1.03
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9208 ns 9458 ns 0.97
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9167 ns 9458 ns 0.97
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9416 ns 9500 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 319338.5 ns 309332.5 ns 1.03
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 120592520.5 ns 118908083 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 181751312 ns 173905459 ns 1.05
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 148080208 ns 148147000 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 102236292 ns 104063562 ns 0.98
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5477430 ns 5483006 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 616714854.5 ns 615077271 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 577694542 ns 556251208 ns 1.04
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 455123729 ns 456191166.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 752842396 ns 775264354 ns 0.97
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 38217675 ns 38217009 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 652219667 ns 651954834 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 684789500 ns 668816521 ns 1.02
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 585586750 ns 584471208 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 741987542 ns 743364500 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 59375 ns 59041 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 38959 ns 47167 ns 0.83
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46791 ns 48042 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83959 ns 85604.5 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 37739.5 ns 38577 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1916959 ns 1921792 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1979666 ns 1983375 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1976083 ns 1974021 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1894000 ns 1888041.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 175218.5 ns 177270 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 269208 ns 267667 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 275479 ns 269500 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 271041 ns 269000 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 266333.5 ns 265375 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 136845.5 ns 129439 ns 1.06
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 680750 ns 602875 ns 1.13
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 692354 ns 667625 ns 1.04
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 599084 ns 589104 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 595292 ns 696166.5 ns 0.86
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 721806 ns 698695 ns 1.03
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2222875 ns 2214416 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2206562.5 ns 2132916.5 ns 1.03
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2190792 ns 2099687.5 ns 1.04
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2219812.5 ns 2218542 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 132948 ns 135139.5 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5499000 ns 5496500 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5581125 ns 5493084 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5523834 ns 5512750 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5495417 ns 5608375 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 731329.5 ns 786813 ns 0.93
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 638708 ns 645084 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 653417 ns 646042 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 641250 ns 643042 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 636417 ns 645042 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 47525 ns 47537 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1847375 ns 1818666 ns 1.02
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1675167 ns 1720625 ns 0.97
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1727334 ns 1727375 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 2104041 ns 2097625 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 219920 ns 225809.5 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58417 ns 58458 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 38834 ns 46958 ns 0.83
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46958 ns 47500 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 84208 ns 85709 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 28615 ns 29149.5 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2026062.5 ns 2024312 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2097250 ns 2089792 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2092063 ns 2079417 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1994667 ns 2030812.5 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 189265 ns 192873 ns 0.98
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 13388125 ns 13367875 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 12478208.5 ns 12448375 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 12574250 ns 12498688 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 15224000 ns 15196500 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 513523 ns 515450 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 47281583 ns 47301125 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 42012708 ns 41737208 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 41057937.5 ns 41031917 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 58763708 ns 59054000 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3027655.5 ns 3246636.5 ns 0.93
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 96334145.5 ns 73864187.5 ns 1.30
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 91884667 ns 90734875 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 91286333 ns 90710083 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 76278542 ns 99247604 ns 0.77
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58875 ns 58667 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 38916.5 ns 47292 ns 0.82
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47125 ns 47625 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82417 ns 85416.5 ns 0.96
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 47552.5 ns 47961 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1924541.5 ns 1915542 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1974562.5 ns 1967250 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1774812.5 ns 1778666.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1885958 ns 1904791 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 196220 ns 195659 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 417 ns 375 ns 1.11
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 333 ns 333 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 333 ns 333 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 32997 ns 32740 ns 1.01
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6812.5 ns 6167 ns 1.10
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6062.5 ns 6000 ns 1.01
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6667 ns 6625 ns 1.01
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6000 ns 6042 ns 0.99
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 174932.5 ns 176130 ns 0.99
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 291 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 333 ns 0.88
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 32308 ns 31946 ns 1.01
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 2917 ns 2625 ns 1.11
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 2709 ns 2792 ns 0.97
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 2834 ns 2916 ns 0.97
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 2583 ns 2625 ns 0.98
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 161677 ns 164970 ns 0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 284931833.5 ns 286577604 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 346561021 ns 339468333 ns 1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 314560020.5 ns 314095271 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 270608125 ns 270924375 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 7114118 ns 7117527 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 1002689042 ns 1001221667 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 958558000 ns 939877583 ns 1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 855141500 ns 851361917 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1156706167 ns 1176703208 ns 0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 34166632.5 ns 33887966 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1685313333 ns 1311845770.5 ns 1.28
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1711077458 ns 1679371125 ns 1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1615744166 ns 1604290334 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1303020395.5 ns 1668435000 ns 0.78
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1421208 ns 1415333.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1428958 ns 1417520.5 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1408625 ns 1416104 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1404708 ns 1420146 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 128545 ns 128175 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5018624.5 ns 5010542 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5051333.5 ns 5020291.5 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4733250 ns 5037500 ns 0.94
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5021583.5 ns 5047042 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 496665 ns 595594 ns 0.83
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) 175572229 ns 175229188 ns 1.00
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) 180410646 ns 123461167 ns 1.46
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) 128544479 ns 127594250 ns 1.01
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) 157146354.5 ns 154552916.5 ns 1.02
vgg16(32, 32, 3, 32)/forward/GPU/CUDA 4884679 ns 4884050 ns 1.00
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) 670240791 ns 667971584 ns 1.00
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) 607052917 ns 641402625 ns 0.95
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) 535894542 ns 501342541 ns 1.07
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) 644547708 ns 657859875 ns 0.98
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA 17659435 ns 15872908 ns 1.11
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 8871750 ns 8987479.5 ns 0.99
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 8823292 ns 8781270.5 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 7877542 ns 7857729 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 10108021 ns 10412374.5 ns 0.97
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1604141 ns 1592095 ns 1.01
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 36437625 ns 36150584 ns 1.01
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 37737666 ns 36797500 ns 1.03
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 33406667 ns 33192666.5 ns 1.01
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 38634291.5 ns 40244625 ns 0.96
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6473557 ns 6455577 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 47500 ns 47417 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 47416 ns 47584 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 47834 ns 47583 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 47541 ns 47333 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 18281 ns 18534 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 50395.5 ns 52833.5 ns 0.95
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 50583 ns 50375 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 50583 ns 50666 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 50458.5 ns 50250 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 170078 ns 202850 ns 0.84
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7500 ns 7459 ns 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6917 ns 7417 ns 0.93
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7521 ns 7312.5 ns 1.03
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 6625 ns 7458.5 ns 0.89
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 80332.5 ns 98661 ns 0.81
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10458 ns 9792 ns 1.07
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9750 ns 10125 ns 0.96
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10292 ns 10542 ns 0.98
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10000 ns 10250 ns 0.98
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 461160.5 ns 555252.5 ns 0.83
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6167 ns 6750 ns 0.91
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5875 ns 6042 ns 0.97
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7333 ns 7208.5 ns 1.02
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5333 ns 6542 ns 0.82
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 87519.5 ns 104446.5 ns 0.84
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 13479.5 ns 13125 ns 1.03
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 13291 ns 12917 ns 1.03
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13250 ns 13292 ns 1.00
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12667 ns 13083 ns 0.97
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 422223 ns 478181 ns 0.88
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 1083 ns 1125 ns 0.96
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 1084 ns 1042 ns 1.04
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 1083 ns 1083 ns 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 1042 ns 1083 ns 0.96
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 32913 ns 32701 ns 1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8166 ns 8375 ns 0.98
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8042 ns 8125 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7958 ns 8125 ns 0.98
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8041 ns 8083 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 195210 ns 206369.5 ns 0.95
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 23417 ns 23417 ns 1
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 23167 ns 23500 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 23750 ns 23416 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 23459 ns 23333 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 19164 ns 18592 ns 1.03
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 52458 ns 52750 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 52625 ns 54709 ns 0.96
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 52625 ns 52917 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 52500 ns 52917 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 226904.5 ns 283991 ns 0.80
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1404979 ns 1399417 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1404271 ns 1396395.5 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1401667 ns 1396833 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1400583 ns 1449874.5 ns 0.97
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 196430 ns 196187 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5006854 ns 5003208 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5043375.5 ns 5005375 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5019271 ns 5023834 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5013583.5 ns 5050167 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 541904 ns 585941 ns 0.92
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3050625 ns 3039563 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2111958 ns 2072875 ns 1.02
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2267479 ns 2275208 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4552979 ns 4856479 ns 0.94
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 583345 ns 583070 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 24377459 ns 24354562.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 19098958 ns 18867354 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 18712353.5 ns 18817521 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 36880667 ns 37413770.5 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2992730 ns 3176919 ns 0.94
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 34016250 ns 33990500 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 28716709 ns 28382208.5 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 27913333 ns 28070021 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 41657708 ns 42353875 ns 0.98
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 142625542 ns 144782125 ns 0.99
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 142415500 ns 142800542 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 124505750 ns 123809687.5 ns 1.01
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 174391229.5 ns 168891563 ns 1.03
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22776842 ns 22773536 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 920278833.5 ns 1277305063 ns 0.72
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 871702708.5 ns 1180173271 ns 0.74
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 713641791.5 ns 757990666 ns 0.94
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 671743250 ns 688381500 ns 0.98
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 116134147 ns 118470004 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 75208.5 ns 75042 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 74417 ns 73625 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 75917 ns 77166 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 72229 ns 74708 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 198593 ns 220284.5 ns 0.90
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 287583 ns 285750 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 205125 ns 191208 ns 1.07
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 268979 ns 192209 ns 1.40
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 255333 ns 286417 ns 0.89
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1021640 ns 1195118 ns 0.85
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 35433625 ns 35568917 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 36009208 ns 35278833 ns 1.02
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 32265896 ns 32149729 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 40598479 ns 41733750 ns 0.97
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5840629 ns 5841675.5 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 148147125 ns 148531084 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 155796771 ns 153045542 ns 1.02
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 134729167 ns 136231750 ns 0.99
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 286194250 ns 228329854.5 ns 1.25
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 34880901 ns 34864707.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 120489250 ns 119094187.5 ns 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 181727625 ns 174236667 ns 1.04
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 148246958.5 ns 147985917 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 107474479 ns 107449375 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5474097 ns 5482351 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 469354333 ns 467600417 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 484572458 ns 465577292 ns 1.04
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 439936978.5 ns 438034750 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 739328625 ns 759816229.5 ns 0.97
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 35154753 ns 35154520.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 707340521 ns 709358854.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 673922146 ns 655624271 ns 1.03
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 572041396 ns 571617791 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 849558125 ns 869387791 ns 0.98
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) 1298708.5 ns 1327250.5 ns 0.98
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) 721687 ns 905875 ns 0.80
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) 919229 ns 907750 ns 1.01
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) 2090500 ns 2079042 ns 1.01
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA 581149 ns 578714.5 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) 2964833 ns 2967333.5 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) 2526166.5 ns 2631479.5 ns 0.96
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) 2616228.5 ns 2620896 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) 3691270.5 ns 3771729 ns 0.98
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA 1611589.5 ns 1755565 ns 0.92
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) 6637958 ns 6610917 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) 6471417 ns 6496875 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) 6516083 ns 6497437.5 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) 4027166.5 ns 4521833 ns 0.89
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7417 ns 7208 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5375 ns 6125 ns 0.88
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6084 ns 6084 ns 1
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10000 ns 10542 ns 0.95
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 25734 ns 25575 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 213292 ns 212875 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 223084 ns 229500 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 221166 ns 221187.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 206500 ns 246625 ns 0.84
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 222772.5 ns 261769.5 ns 0.85
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) 300948375 ns 313730896 ns 0.96
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) 281067792 ns 222537125 ns 1.26
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) 190339437.5 ns 194707917 ns 0.98
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) 311293458 ns 313279354 ns 0.99
vgg16(32, 32, 3, 64)/forward/GPU/CUDA 7676637.5 ns 7673155 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) 1078603979 ns 1080950395.5 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) 988164875 ns 899873458 ns 1.10
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) 871100958 ns 834690333 ns 1.04
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) 1152323979.5 ns 1180116917 ns 0.98
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA 26676107 ns 26459206.5 ns 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6125 ns 5875 ns 1.04
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5542 ns 5417 ns 1.02
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6625 ns 6250 ns 1.06
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5375 ns 6084 ns 0.88
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 111098 ns 162725 ns 0.68
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7708 ns 7375 ns 1.05
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7333 ns 7084 ns 1.04
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7583 ns 7750 ns 0.98
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6958 ns 7625 ns 0.91
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 498819 ns 624677.5 ns 0.80
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 625 ns 625 ns 1
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 666 ns 666 ns 1
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 625 ns 667 ns 0.94
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 541 ns 583 ns 0.93
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 24257 ns 23758 ns 1.02
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9375 ns 9542 ns 0.98
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 8667 ns 9291 ns 0.93
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9375 ns 9584 ns 0.98
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9208 ns 9209 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 184515.5 ns 225738 ns 0.82
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 354291.5 ns 352000 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 352292 ns 352042 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 352229.5 ns 354604.5 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 355750 ns 353833 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 21862 ns 21344 ns 1.02
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 824562.5 ns 822291 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 775208 ns 812479 ns 0.95
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 774792 ns 824250 ns 0.94
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 821812 ns 831958 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 232768 ns 304872 ns 0.76
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 337458 ns 337167 ns 1.00
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 315020.5 ns 343334 ns 0.92
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 448104 ns 446875 ns 1.00
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 338562.5 ns 316354.5 ns 1.07
batchedmm(16, Bsize=32)/forward/GPU/CUDA 17974 ns 18389 ns 0.98
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 688937.5 ns 695521 ns 0.99
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 736417 ns 750792 ns 0.98
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 1019625 ns 1026833 ns 0.99
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 700125 ns 688999.5 ns 1.02
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 224901 ns 282579.5 ns 0.80
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 353708 ns 356667 ns 0.99
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 327187.5 ns 354500 ns 0.92
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 421250 ns 421500 ns 1.00
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 381333 ns 347042 ns 1.10
batchedmm(16, Bsize=128)/forward/GPU/CUDA 22774 ns 22715 ns 1.00
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 751625 ns 754229 ns 1.00
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 747792 ns 753792 ns 0.99
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 1067270.5 ns 1072417 ns 1.00
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 832521 ns 823125 ns 1.01
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 209326.5 ns 256204.5 ns 0.82
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 3542 ns 3583 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 3542 ns 3500 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 3959 ns 3708 ns 1.07
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 3583 ns 3667 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 18314 ns 17612 ns 1.04
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 4208 ns 4208 ns 1
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 4208 ns 4500 ns 0.94
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 4375 ns 4292 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 4291 ns 4417 ns 0.97
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 228582.5 ns 280326.5 ns 0.82
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 4333 ns 4209 ns 1.03
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4042 ns 4333 ns 0.93
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4708 ns 4291 ns 1.10
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4041 ns 4125 ns 0.98
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 170151.5 ns 232867.5 ns 0.73
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8750 ns 8291 ns 1.06
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8542 ns 8500 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8541 ns 8562.5 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8333 ns 8667 ns 0.96
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 1071112 ns 1214158.5 ns 0.88
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 204666 ns 203792 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 208416 ns 211375 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 210500 ns 209083 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 199458 ns 202541 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 34973 ns 34629 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 644875 ns 645167 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 623042 ns 623895.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 622916 ns 630084 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 629125 ns 633833 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 316013 ns 349768 ns 0.90
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 965625.5 ns 972062.5 ns 0.99
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 931292 ns 937916.5 ns 0.99
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 949791 ns 960125 ns 0.99
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 1287375 ns 1319708 ns 0.98
batchedmm(128, Bsize=128)/forward/GPU/CUDA 207697.5 ns 208475 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 4502250 ns 4500166 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 4587333.5 ns 4475687.5 ns 1.02
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 4293249.5 ns 4308250 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 6257042 ns 6508250 ns 0.96
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 935844 ns 944786.5 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 4042 ns 4084 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3458 ns 3750 ns 0.92
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4333 ns 4083 ns 1.06
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3417 ns 3542 ns 0.96
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 197475.5 ns 226002.5 ns 0.87
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7875 ns 7542 ns 1.04
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7667 ns 7625 ns 1.01
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7958 ns 7625 ns 1.04
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7083 ns 7334 ns 0.97
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 992361 ns 1008436 ns 0.98
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1639334 ns 1647479.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1153625 ns 1203104.5 ns 0.96
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1347709 ns 1378125 ns 0.98
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2351833.5 ns 2472896 ns 0.95
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA 214480 ns 213582 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12352166.5 ns 12309291 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9610125 ns 9565666 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9257687.5 ns 9280334 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 17946167 ns 18216500 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1947706 ns 1940596 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17372666 ns 17356917 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14382041.5 ns 14358625 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14310583 ns 14329312.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21072020.5 ns 21175541 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 90333 ns 133834 ns 0.67
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 90833 ns 90000 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 91625 ns 93687 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 134166 ns 90750 ns 1.48
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 126661 ns 125997 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2020563 ns 2019458 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2038375 ns 2029375 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1989208 ns 2029667 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2026937.5 ns 2049458 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1036489 ns 1042357 ns 0.99
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 344459 ns 347333 ns 0.99
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 323354 ns 349250 ns 0.93
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 395291.5 ns 394583 ns 1.00
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 315666.5 ns 293978.5 ns 1.07
batchedmm(2, Bsize=4)/forward/GPU/CUDA 15917 ns 16455.5 ns 0.97
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 703000 ns 709041 ns 0.99
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 719667 ns 741583.5 ns 0.97
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 1017062.5 ns 1022875 ns 0.99
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 658709 ns 644791 ns 1.02
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 197602.5 ns 197069.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7333 ns 7250 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5042 ns 5875 ns 0.86
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6041 ns 6041 ns 1
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9958 ns 10583 ns 0.94
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 34203 ns 34401 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 222334 ns 224416.5 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 223750 ns 220375 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 221667 ns 231250 ns 0.96
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 211938 ns 236834 ns 0.89
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 315474 ns 318034 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3750 ns 3708 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3750 ns 3708 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3708 ns 3709 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3667 ns 3709 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 23085 ns 23219 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14416 ns 14375 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 14167 ns 14375 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14416 ns 14417 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14334 ns 14167 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 475619.5 ns 484400.5 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 102583 ns 97417 ns 1.05
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 94625 ns 94042 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 95792 ns 97959 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 92375 ns 95500 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 126060 ns 125837 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1914833 ns 1920250 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1935917 ns 1649417 ns 1.17
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1723083 ns 1923437 ns 0.90
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1923271 ns 1953916 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 937375.5 ns 974936 ns 0.96
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) 875000 ns 879729.5 ns 0.99
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) 798167 ns 832708 ns 0.96
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) 1216563 ns 1229562.5 ns 0.99
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) 975458 ns 939583 ns 1.04
lenet(28, 28, 1, 32)/forward/GPU/CUDA 282281 ns 281248 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) 2835584 ns 2831145.5 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) 2458187.5 ns 2527396 ns 0.97
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) 3311542 ns 3353354.5 ns 0.99
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) 3418209 ns 3411104.5 ns 1.00
lenet(28, 28, 1, 32)/zygote/GPU/CUDA 1620696.5 ns 1661947.5 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 18729 ns 14854.5 ns 1.26
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 15625 ns 15583 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 17167 ns 18792 ns 0.91
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 15563 ns 16000 ns 0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 144619.5 ns 144462 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 255875 ns 255958 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 216500 ns 215583.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 215625 ns 257583 ns 0.84
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 254834 ns 262500 ns 0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 653404.5 ns 650445 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 222291.5 ns 221375 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 220167 ns 220792 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 222937.5 ns 223083 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 223417 ns 220646 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 276300.5 ns 273454.5 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 510541.5 ns 559542 ns 0.91
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 501250 ns 510542 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 498084 ns 507813 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 556000 ns 535208.5 ns 1.04
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1413449 ns 1396532 ns 1.01
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 336938 ns 328770.5 ns 1.02
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 312000 ns 336937 ns 0.93
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 378334 ns 370500 ns 1.02
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 330583.5 ns 299625 ns 1.10
batchedmm(16, Bsize=4)/forward/GPU/CUDA 17443.5 ns 17616 ns 0.99
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 713625 ns 711834 ns 1.00
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 728875 ns 732166.5 ns 1.00
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 1013479.5 ns 1024479.5 ns 0.99
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 669791 ns 657917 ns 1.02
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 198819.5 ns 200486.5 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 20187 ns 18520.5 ns 1.09
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 19166 ns 19083 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 18750 ns 19625 ns 0.96
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 18167 ns 18396 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 149298 ns 147224 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 224104 ns 213625 ns 1.05
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 215021 ns 221875 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 213250 ns 221812.5 ns 0.96
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 221417 ns 237333 ns 0.93
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1054396 ns 951211 ns 1.11
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 4500 ns 4583 ns 0.98
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 4604.5 ns 4417 ns 1.04
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5291 ns 4917 ns 1.08
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 4250 ns 4437.5 ns 0.96
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 250417 ns 239868.5 ns 1.04
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 10666 ns 10625 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11167 ns 10500 ns 1.06
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10916 ns 10958 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10667 ns 10625 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 1105680.5 ns 1112681.5 ns 0.99
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3709 ns 3791 ns 0.98
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3667 ns 3541 ns 1.04
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4542 ns 4229.5 ns 1.07
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3250 ns 3833 ns 0.85
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 255235.5 ns 252769 ns 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8000 ns 7334 ns 1.09
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7583 ns 7792 ns 0.97
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8291 ns 7916 ns 1.05
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7208 ns 7437.5 ns 0.97
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 1113468.5 ns 1116124.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23595542 ns 23341875 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 43052124.5 ns 34053354.5 ns 1.26
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 38080375.5 ns 37482854.5 ns 1.02
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34894124.5 ns 35456625 ns 0.98
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1838652.5 ns 1845777.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 183963083 ns 184378291 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 183192958 ns 158584667 ns 1.16
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 146668854 ns 146193479 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 412535000 ns 422496166.5 ns 0.98
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 16493440 ns 16510255 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 428118750 ns 426674167 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 258337959 ns 253893875 ns 1.02
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 232950667 ns 232875895.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 483252042 ns 494805750 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 183708 ns 184500 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 184708.5 ns 183458 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 185084 ns 185583 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 184708.5 ns 183416.5 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 229170.5 ns 231684 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 635625 ns 599042 ns 1.06
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 608958 ns 586312.5 ns 1.04
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 586750 ns 636833 ns 0.92
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 598000 ns 641125 ns 0.93
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1136450 ns 1087543.5 ns 1.04
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 3837375 ns 3842645.5 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 3726667 ns 3643229 ns 1.02
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 3480459 ns 3509333 ns 0.99
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 5388895.5 ns 5524187.5 ns 0.98
batchedmm(128, Bsize=512)/forward/GPU/CUDA 537067 ns 534809 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 17354583 ns 17462833 ns 0.99
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 17694167 ns 17328500.5 ns 1.02
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 16528146 ns 16632083 ns 0.99
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 22060667 ns 23474479.5 ns 0.94
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2634503 ns 2613903 ns 1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 584 ns 500 ns 1.17
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 584 ns 584 ns 1
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 625 ns 666 ns 0.94
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 542 ns 542 ns 1
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 31760.5 ns 32551 ns 0.98
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9417 ns 9375 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9375 ns 8625 ns 1.09
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9875 ns 9792 ns 1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9041 ns 9354.5 ns 0.97
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 264747 ns 264963 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) 499432375 ns 500529917 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) 408185125 ns 429131021 ns 0.95
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) 433810812.5 ns 390085458 ns 1.11
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) 592107916 ns 680776812.5 ns 0.87
vgg16(32, 32, 3, 128)/forward/GPU/CUDA 12429503 ns 12474289.5 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) 2040048312.5 ns 2050021916.5 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) 1663687458 ns 1635602292 ns 1.02
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) 1492613875 ns 1501725478.5 ns 0.99
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) 2211016250 ns 2237822875 ns 0.99
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA 49189263 ns 49165291 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1645792 ns 1648791.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1175479 ns 1195792 ns 0.98
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1372437.5 ns 1379625 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2492875.5 ns 2436187.5 ns 1.02
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 216494.5 ns 215012 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12712041.5 ns 12725833.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9982875 ns 9944041.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9669792 ns 9667395.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18338875 ns 18594104.5 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2038701 ns 2038696 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17668854 ns 17722000 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14759833 ns 14694125 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14522791.5 ns 14557833 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21750333 ns 21533833 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 26417 ns 26250 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 26250 ns 26250 ns 1
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 26291 ns 26750 ns 0.98
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 26250 ns 26292 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 24090 ns 23955 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 67416 ns 66833 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 66750 ns 66625 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 68000 ns 66958 ns 1.02
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 67125 ns 66833 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 407713 ns 403690.5 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 203750 ns 202791 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 208250 ns 209375 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 209750 ns 209667 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 199542 ns 200708 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 26545 ns 26177 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 614458.5 ns 612146 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 626479.5 ns 622334 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 622000 ns 680520.5 ns 0.91
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 628041 ns 634750 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 354394 ns 350618 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 634000 ns 650500 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 662125 ns 542145.5 ns 1.22
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 546833 ns 634666 ns 0.86
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 678084 ns 679459 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 131699 ns 131917 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2219958 ns 2229542 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2290584 ns 2231250 ns 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2209458 ns 2251687.5 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2242166.5 ns 2330333 ns 0.96
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1187469.5 ns 1238942 ns 0.96
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 19375 ns 16854 ns 1.15
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 17917 ns 19500 ns 0.92
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 18770.5 ns 19791.5 ns 0.95
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 18396 ns 17750 ns 1.04
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 144324 ns 144506 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 232750 ns 230625 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 221167 ns 260583 ns 0.85
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 219917 ns 261125 ns 0.84
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 258396 ns 265583.5 ns 0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1055691 ns 1064679 ns 0.99
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 667 ns 625 ns 1.07
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 625 ns 625 ns 1
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 667 ns 625 ns 1.07
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 542 ns 542 ns 1
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 23295 ns 23448 ns 0.99
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 10209 ns 10125 ns 1.01
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 10000 ns 9792 ns 1.02
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 10250 ns 10000 ns 1.02
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9666 ns 9979 ns 0.97
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 258972 ns 257505.5 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6000 ns 6125 ns 0.98
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5708 ns 5625 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7083 ns 6666 ns 1.06
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5375 ns 6084 ns 0.88
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 231630 ns 233944.5 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7625 ns 7416 ns 1.03
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7292 ns 7334 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7334 ns 7834 ns 0.94
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6792 ns 7417 ns 0.92
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 802749 ns 800597 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 2208 ns 2209 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 2208 ns 2292 ns 0.96
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 2500 ns 2208 ns 1.13
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 2208 ns 2250 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 18522 ns 17989 ns 1.03
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 6792 ns 6541.5 ns 1.04
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 6687.5 ns 6542 ns 1.02
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 6625 ns 7125 ns 0.93
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 6458 ns 6750 ns 0.96
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 332179 ns 330052 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 755667 ns 751958.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 749145.5 ns 746604.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 749542 ns 749167 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 746916 ns 748959 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 21431 ns 21090 ns 1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 790645.5 ns 791292 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 775270.5 ns 792333 ns 0.98
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 775020.5 ns 773291 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 810729 ns 792291.5 ns 1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 298142.5 ns 299003.5 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7416 ns 7291 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5208 ns 5917 ns 0.88
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5917 ns 6083 ns 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10166 ns 10791 ns 0.94
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 32730.5 ns 33088.5 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 232208 ns 233333 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 229250 ns 229479 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 229083 ns 269542 ns 0.85
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 255375 ns 220958 ns 1.16
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 363419.5 ns 359587 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 10333 ns 10625 ns 0.97
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 10166 ns 10375 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 11250 ns 10958 ns 1.03
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 9917 ns 10959 ns 0.90
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 255351 ns 249563.5 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 25167 ns 25042 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 25125 ns 24625 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25667 ns 25375 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 24792 ns 25250 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 1125690 ns 1114585 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 106221875 ns 106488708 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 125394375 ns 117008645.5 ns 1.07
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 121390333 ns 120350584 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 117359542 ns 118085396 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 2631223 ns 2661446 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 392505292 ns 393399750 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 378545917 ns 368428125 ns 1.03
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 355444792 ns 359138458 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 480457375 ns 486814000 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 15266181 ns 15211152 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 937472500 ns 759103375 ns 1.23
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 779964959 ns 755373708 ns 1.03
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 749323541.5 ns 744752604 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 763325375 ns 959286729.5 ns 0.80
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7500 ns 6896 ns 1.09
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 8541.5 ns 7791 ns 1.10
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 9125 ns 8250 ns 1.11
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7042 ns 7583 ns 0.93
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 238515 ns 240721 ns 0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14208 ns 14458 ns 0.98
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14834 ns 14208.5 ns 1.04
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14708 ns 14750 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14166 ns 14312.5 ns 0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 1101022.5 ns 1072384 ns 1.03
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6125 ns 6292 ns 0.97
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6042 ns 6125 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7062.5 ns 7291 ns 0.97
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5542 ns 6458 ns 0.86
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 237732.5 ns 234548 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12625 ns 12584 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12709 ns 12625 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12667 ns 12959 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12125 ns 12583 ns 0.96
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 796510.5 ns 784420 ns 1.02
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 346208 ns 347708 ns 1.00
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 320958.5 ns 386916.5 ns 0.83
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 397021 ns 398834 ns 1.00
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 317417 ns 292375 ns 1.09
batchedmm(2, Bsize=128)/forward/GPU/CUDA 17118 ns 16947 ns 1.01
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 709854 ns 708249.5 ns 1.00
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 726459 ns 746000 ns 0.97
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 1020999.5 ns 1025229 ns 1.00
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 663583.5 ns 652416.5 ns 1.02
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 200899.5 ns 199954 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 416 ns 375 ns 1.11
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 416 ns 416 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 292 ns 333 ns 0.88
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 23684.5 ns 23200 ns 1.02
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6625 ns 6458 ns 1.03
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6333 ns 6417 ns 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6541 ns 6750 ns 0.97
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6000 ns 6542 ns 0.92
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 241466.5 ns 238715 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5917 ns 5958 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 5958 ns 5917 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 5917 ns 6000 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5750 ns 5917 ns 0.97
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 24709 ns 24219 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 21666 ns 21250 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 21500 ns 20875 ns 1.03
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 21958.5 ns 21417 ns 1.03
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 20750 ns 21896 ns 0.95
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 265505.5 ns 261648 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 144292 ns 145458 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 144583.5 ns 147521 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 147479 ns 147770.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 190521 ns 147458.5 ns 1.29
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 168147.5 ns 167051 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1320958.5 ns 1322500 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1335437 ns 1320041 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1300333 ns 1325833 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1320104.5 ns 1391458 ns 0.95
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1368672 ns 1346456 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 22625 ns 22250 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 22125 ns 24750 ns 0.89
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 24208 ns 23416 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 23084 ns 22396 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 356959.5 ns 353387 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 179167 ns 178709 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 118875 ns 118687.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 119167 ns 127459 ns 0.93
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 129709 ns 134041.5 ns 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1505917 ns 1464281 ns 1.03
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 417 ns 375 ns 1.11
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 292 ns 333 ns 0.88
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 23148 ns 22942 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6625 ns 6459 ns 1.03
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6333 ns 6500 ns 0.97
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6458 ns 6666 ns 0.97
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6229.5 ns 6750 ns 0.92
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 259448 ns 255510 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5375 ns 5000 ns 1.07
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4625 ns 4729.5 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 5334 ns 5292 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4333 ns 4709 ns 0.92
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 259168.5 ns 256450 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10541.5 ns 10042 ns 1.05
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10125 ns 10167 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10500 ns 10417 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10125 ns 10125 ns 1
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 1348181 ns 1348843.5 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1625 ns 1584 ns 1.03
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1625 ns 1625 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1625 ns 1667 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1583 ns 1667 ns 0.95
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 23350 ns 22876 ns 1.02
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 6000 ns 5625 ns 1.07
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 5625 ns 5625 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 6041 ns 6041 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 5625 ns 5708 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 276521 ns 272214.5 ns 1.02
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 6816750 ns 6888875 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 6363479 ns 6384792 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 6523125 ns 6514708.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 7623625 ns 7555583 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 214095 ns 214320 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 24024229 ns 24087271 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 21333333 ns 21278062.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 21044500 ns 21040583 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 29721312.5 ns 29921333 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2157342.5 ns 2106395 ns 1.02
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 48604000 ns 37396292 ns 1.30
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 45855104 ns 45619104.5 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 45946750 ns 45717854 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 38018271 ns 49514208 ns 0.77
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6208 ns 6208 ns 1
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 5750 ns 6333 ns 0.91
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6584 ns 6459 ns 1.02
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5750 ns 6083 ns 0.95
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 238393 ns 236136 ns 1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9041 ns 9125 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8000 ns 8666 ns 0.92
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8291 ns 8416 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8416 ns 8583 ns 0.98
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 1067250.5 ns 1059780 ns 1.01
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) 1551208 ns 1497208 ns 1.04
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) 1242333 ns 1271146 ns 0.98
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) 1627229 ns 1623333 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) 2145291.5 ns 2143312.5 ns 1.00
lenet(28, 28, 1, 128)/forward/GPU/CUDA 279384.5 ns 273613.5 ns 1.02
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) 7908792 ns 7900125 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) 6541625 ns 6605479 ns 0.99
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) 7076917 ns 7156416.5 ns 0.99
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) 10471270.5 ns 10528062.5 ns 0.99
lenet(28, 28, 1, 128)/zygote/GPU/CUDA 1889848 ns 1850752 ns 1.02
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 342084 ns 343000 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 325667 ns 349166.5 ns 0.93
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 364520.5 ns 383250 ns 0.95
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 346875 ns 325438 ns 1.07
batchedmm(128, Bsize=4)/forward/GPU/CUDA 43276.5 ns 46572 ns 0.93
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 751333.5 ns 746124.5 ns 1.01
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 784042 ns 795499.5 ns 0.99
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 1057167 ns 1076208.5 ns 0.98
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 760375 ns 753291.5 ns 1.01
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 312654 ns 309766 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 397687.5 ns 397375 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 213000 ns 287916 ns 0.74
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 287750 ns 288000 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 750292 ns 749125 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 44710 ns 44192 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 673645.5 ns 666145.5 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 472000 ns 531062.5 ns 0.89
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 531792 ns 529625 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 974417 ns 975062.5 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 191361 ns 188202 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 646583 ns 646708 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 543750 ns 543166.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 597208 ns 654229 ns 0.91
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 677459 ns 659479 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 131924.5 ns 132313.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2453750 ns 2450208 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2497021 ns 2447833 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2150458 ns 2404020.5 ns 0.89
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2458500 ns 2562667 ns 0.96
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1369206 ns 1598744 ns 0.86
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 342625 ns 347208 ns 0.99
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 320374.5 ns 347542 ns 0.92
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 395375 ns 400125 ns 0.99
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 319042 ns 291604 ns 1.09
batchedmm(2, Bsize=32)/forward/GPU/CUDA 16400 ns 16522 ns 0.99
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 707145.5 ns 706875 ns 1.00
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 724417 ns 734333 ns 0.99
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 1012958 ns 1028542 ns 0.98
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 657083.5 ns 647750 ns 1.01
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 199317.5 ns 199294.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1458375 ns 1458584 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1493042 ns 1498042 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1499834 ns 1499666 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1437375 ns 1444167 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 40669 ns 40454 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5122624.5 ns 5120438 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5305041 ns 5292292 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5297979 ns 5286000 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4988562.5 ns 5017937.5 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 199526 ns 195965.5 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3708 ns 3667 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3709 ns 3667 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3708 ns 3709 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3667 ns 3709 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 33617 ns 32802 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 15375 ns 15125 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 14917 ns 15292 ns 0.98
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15416 ns 15417 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15083 ns 14875 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 378853 ns 372915.5 ns 1.02
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 71125 ns 70917 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 71083 ns 71250 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 71167 ns 70916 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 69833 ns 71375 ns 0.98
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 113699 ns 112608 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 318167 ns 317750 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 324084 ns 318417 ns 1.02
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 320084 ns 318375 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 318708 ns 327667 ns 0.97
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 196046 ns 192232 ns 1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 1125 ns 1000 ns 1.13
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 1084 ns 1083 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 1083 ns 1083 ns 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 1000 ns 1083 ns 0.92
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 24007 ns 23208 ns 1.03
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8250 ns 8000 ns 1.03
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 7958 ns 8042 ns 0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8667 ns 8250 ns 1.05
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8042 ns 8250 ns 0.97
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 262710.5 ns 259321 ns 1.01
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 468625 ns 468417 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 454583 ns 479458 ns 0.95
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 548646 ns 555416 ns 0.99
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 553541.5 ns 544792 ns 1.02
batchedmm(128, Bsize=32)/forward/GPU/CUDA 129403 ns 128776.5 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 1392000 ns 1386166.5 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 1394229 ns 1391187.5 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 1613875 ns 1623687.5 ns 0.99
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 1592062.5 ns 1644333.5 ns 0.97
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 274426 ns 275740 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 416 ns 333 ns 1.25
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 416 ns 375 ns 1.11
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 416 ns 0.90
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 292 ns 333 ns 0.88
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 31844 ns 31924 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6666 ns 5958 ns 1.12
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6500 ns 6167 ns 1.05
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6542 ns 6459 ns 1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6083 ns 6166 ns 0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 265642 ns 262594 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1721792 ns 1733625 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1722771 ns 1722729.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1726625 ns 1729958 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1721375 ns 1727000 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 169529.5 ns 168805 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4366500 ns 4353667 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4385854.5 ns 4366916.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4393625.5 ns 4362042 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4358458 ns 4429395.5 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1321267 ns 1264129.5 ns 1.05
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 6708 ns 6959 ns 0.96
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 7042 ns 6708 ns 1.05
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 6958 ns 7000 ns 0.99
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 6667 ns 6833 ns 0.98
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 19868.5 ns 20795 ns 0.96
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 51416 ns 51500 ns 1.00
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 32792 ns 38042 ns 0.86
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 32666 ns 47209 ns 0.69
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 70521 ns 48666.5 ns 1.45
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 297123.5 ns 295172.5 ns 1.01
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 353542 ns 355084 ns 1.00
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 325167 ns 350583 ns 0.93
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 408125 ns 423208.5 ns 0.96
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 325916 ns 295000 ns 1.10
batchedmm(2, Bsize=512)/forward/GPU/CUDA 18546 ns 18329 ns 1.01
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 718854 ns 718562.5 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 731000 ns 744125 ns 0.98
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 1025333.5 ns 1031500 ns 0.99
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 687084 ns 672625 ns 1.02
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 335478 ns 347666.5 ns 0.96
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 75292 ns 75042 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 75083 ns 75250 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 75250 ns 75333 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 74604.5 ns 75584 ns 0.99
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 47512 ns 46603 ns 1.02
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 324750 ns 324708 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 334708 ns 327334 ns 1.02
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 326791.5 ns 324375 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 324667 ns 334062.5 ns 0.97
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 212265 ns 207370 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1484917 ns 1485208 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1518458 ns 1526250 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1526750 ns 1526625 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1463500 ns 1467250 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 52587 ns 51906 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5132416.5 ns 5116396.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5271708 ns 5284312.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5296709 ns 5277167 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4984250 ns 5025562.5 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 207065 ns 203896.5 ns 1.02
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 28250 ns 28333 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 28250 ns 28333 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 28292 ns 28625 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 28250 ns 28334 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 25113 ns 24422 ns 1.03
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 66542 ns 66250 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 66125 ns 66250 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 66750 ns 66250 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66292 ns 66375 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 537900 ns 519781.5 ns 1.03
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 1468167 ns 1501250 ns 0.98
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 828250 ns 1125791 ns 0.74
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 1059083.5 ns 1125104.5 ns 0.94
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) 2230354.5 ns 2259459 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA 586046.5 ns 571991 ns 1.02
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) 3081687.5 ns 3070000 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 2620708.5 ns 2775000 ns 0.94
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) 2737021 ns 2736500 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) 3814083 ns 3899292 ns 0.98
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA 2011434 ns 2055229 ns 0.98
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) 8838209 ns 8838896 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) 8744875 ns 8809083.5 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) 8797042 ns 8782709 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) 6371334 ns 6483958.5 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 83791 ns 80583 ns 1.04
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 79333 ns 81334 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 81854.5 ns 83645.5 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 129083 ns 136500 ns 0.95
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 192348 ns 192157 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2017875 ns 2012958 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2022541 ns 2009583 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2018416.5 ns 2015916.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2016353.5 ns 2051000 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 798353 ns 803108 ns 0.99

This comment was automatically generated by workflow using github-action-benchmark.
