chore: bump crate-ci/typos from 1.27.0 to 1.27.3 (#1065)
Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.27.0 to 1.27.3.
- [Release notes](https://github.com/crate-ci/typos/releases)
- [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)
- [Commits](crate-ci/typos@v1.27.0...v1.27.3)

---
updated-dependencies:
- dependency-name: crate-ci/typos
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
dependabot[bot] authored Nov 11, 2024
1 parent 22cb59e commit 0be7504
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion .github/workflows/QualityCheck.yml
@@ -16,4 +16,4 @@ jobs:
       - name: Checkout Actions Repository
         uses: actions/checkout@v4
       - name: Check spelling
-        uses: crate-ci/typos@v1.27.0
+        uses: crate-ci/typos@v1.27.3

1 comment on commit 0be7504

@github-actions

Lux Benchmarks
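
In the table that follows, each row gives a benchmark name, the current and previous timings in nanoseconds, and a Ratio that works out to Current divided by Previous, so values above 1 mean the current commit measured slower. The Python sketch below shows one way to parse such rows and list those whose ratio exceeds a cutoff. The names `parse_row` and `flag_regressions` and the 1.5 threshold are hypothetical choices for illustration, not settings taken from the benchmark job.

```python
import re

# One table row: "<name> <current> ns <previous> ns <ratio>".
# The name itself contains spaces and digits, so it is matched lazily
# and the numeric tail is anchored to the end of the line.
ROW = re.compile(
    r"^(?P<name>.+?) (?P<cur>[\d.]+) ns (?P<prev>[\d.]+) ns (?P<ratio>[\d.]+)$"
)


def parse_row(line):
    """Return (name, current_ns, previous_ns), or None for non-data lines."""
    m = ROW.match(line.strip())
    if m is None:
        return None
    return m["name"], float(m["cur"]), float(m["prev"])


def flag_regressions(lines, threshold=1.5):
    """List (name, ratio) for rows where current / previous > threshold.

    The threshold default is an assumed value for this example.
    """
    flagged = []
    for line in lines:
        row = parse_row(line)
        if row is None:
            continue  # header or other non-data line
        name, cur, prev = row
        ratio = cur / prev  # recomputed rather than trusting the rounded column
        if ratio > threshold:
            flagged.append((name, round(ratio, 2)))
    return flagged


rows = [
    "Benchmark suite Current: 0be7504 Previous: 22cb59e Ratio",
    "bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1125 ns 1125 ns 1",
    "bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 3000 ns 1166 ns 2.57",
    "bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 3041 ns 1292 ns 2.35",
]
for name, ratio in flag_regressions(rows):
    print(f"{ratio:.2f}  {name}")
```

With these sample rows, only the 4-thread and 8-thread `bias_activation(32, act=relu)` forward benchmarks are listed. Timings of a few microseconds may vary from run to run, so a flagged row is a prompt to re-run the benchmark rather than proof of a regression.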

Benchmark suite Current: 0be7504 Previous: 22cb59e Ratio
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 4292 ns 4584 ns 0.94
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4166 ns 4917 ns 0.85
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 5000 ns 5666 ns 0.88
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4187.5 ns 4042 ns 1.04
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 60972 ns 60487 ns 1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 10459 ns 10167 ns 1.03
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9666 ns 11000 ns 0.88
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 11583 ns 10542 ns 1.10
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 10375 ns 10542 ns 0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 426712 ns 424703 ns 1.00
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1125 ns 1125 ns 1
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 3000 ns 1166 ns 2.57
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 3041 ns 1292 ns 2.35
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 1083 ns 1125 ns 0.96
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 18505 ns 18464 ns 1.00
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 4042 ns 4000 ns 1.01
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 4042 ns 4000 ns 1.01
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4333 ns 4208 ns 1.03
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 4125 ns 4083 ns 1.01
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 112061 ns 109915.5 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 55667 ns 57375 ns 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46208 ns 38250 ns 1.21
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46792 ns 46375 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 79959 ns 81584 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 37292.5 ns 37506 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2048125 ns 2012792 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2094395.5 ns 2093417 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2098292 ns 2086646 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1986396 ns 2000208 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 196342 ns 197705 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 144771 ns 147000 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 146979.5 ns 143145.5 ns 1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 147708 ns 149666 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 147167 ns 147229.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 166937.5 ns 168379 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1137083 ns 1012208 ns 1.12
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1121625 ns 1152209 ns 0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1134166 ns 1110709 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1113312.5 ns 1119500 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 528613 ns 522581.5 ns 1.01
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3375 ns 4834 ns 0.70
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 3541 ns 3792 ns 0.93
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4459 ns 4667 ns 0.96
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3583 ns 3958 ns 0.91
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 70628 ns 65957 ns 1.07
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8791 ns 9000 ns 0.98
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8416 ns 9292 ns 0.91
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9792 ns 9459 ns 1.04
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8916 ns 8500 ns 1.05
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 479018 ns 469308.5 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 15145.5 ns 18167 ns 0.83
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 14770.5 ns 15625 ns 0.95
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 18375 ns 18917 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 15500 ns 16583 ns 0.93
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 55206 ns 52878 ns 1.04
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 215083.5 ns 252312.5 ns 0.85
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 214000 ns 215959 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 214959 ns 214625 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 212416 ns 214583 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 274790 ns 267130 ns 1.03
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 625 ns 584 ns 1.07
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 541 ns 583 ns 0.93
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 667 ns 708 ns 0.94
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 583 ns 500 ns 1.17
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 17833 ns 17462 ns 1.02
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1625 ns 1500 ns 1.08
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1417 ns 1459 ns 0.97
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1875 ns 1750 ns 1.07
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1417 ns 1417 ns 1
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 103732 ns 100800 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 6917 ns 7167 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5875 ns 5125 ns 1.15
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5834 ns 5875 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9834 ns 9833 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 23612 ns 23225 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 230000 ns 259792 ns 0.89
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 233000 ns 232458.5 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 229104.5 ns 229520.5 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 212834 ns 221875 ns 0.96
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 168410 ns 166055.5 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 3917 ns 3875 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 3875 ns 3875 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 3875 ns 3833 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 3833 ns 3875 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 23718 ns 23597 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16834 ns 17125 ns 0.98
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16750 ns 16541 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 17250 ns 16833 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16667 ns 16667 ns 1
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 164253.5 ns 160583 ns 1.02
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 603959 ns 576583 ns 1.05
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 576333 ns 581541 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 573083 ns 573687.5 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 577708 ns 575750 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 113138.5 ns 113170 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1445791.5 ns 1423250 ns 1.02
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1433375 ns 1430791.5 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1418104 ns 1431000 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 1417292 ns 1421792 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 213357 ns 207811 ns 1.03
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) 1038750 ns 1075667 ns 0.97
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) 969875 ns 948313 ns 1.02
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) 1352625.5 ns 1346646 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) 1297479.5 ns 1310750 ns 0.99
lenet(28, 28, 1, 64)/forward/GPU/CUDA 277707.5 ns 270367.5 ns 1.03
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) 5979604.5 ns 5995500.5 ns 1.00
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) 4659833 ns 4593750 ns 1.01
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) 4957333.5 ns 4976208.5 ns 1.00
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) 5723916.5 ns 5505395.5 ns 1.04
lenet(28, 28, 1, 64)/zygote/GPU/CUDA 1103929 ns 1090295.5 ns 1.01
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 542 ns 542 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 500 ns 542 ns 0.92
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 542 ns 542 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 500 ns 500 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 23796.5 ns 23458 ns 1.01
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2125 ns 2167 ns 0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2167 ns 2167 ns 1
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2208 ns 2208 ns 1
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2083 ns 2083 ns 1
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 176570.5 ns 172926 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5062.5 ns 6458 ns 0.78
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6333 ns 5125 ns 1.24
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7083 ns 7208 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 4979 ns 4333 ns 1.15
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 66510 ns 64432 ns 1.03
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11375 ns 11458 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11334 ns 11583 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12167 ns 11958 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 11375 ns 10833 ns 1.05
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 456940 ns 442914.5 ns 1.03
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7167 ns 7792 ns 0.92
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7000 ns 7041 ns 0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7666 ns 8458 ns 0.91
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7187.5 ns 6375 ns 1.13
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 53184 ns 51253.5 ns 1.04
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 17625 ns 17833.5 ns 0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 17334 ns 18000 ns 0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 18291 ns 18542 ns 0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 17083 ns 16875 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 308823 ns 298470 ns 1.03
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 583 ns 542 ns 1.08
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 458 ns 583 ns 0.79
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 583 ns 583 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 500 ns 500 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 33350 ns 32349 ns 1.03
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9083 ns 8875 ns 1.02
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9000 ns 9333 ns 0.96
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9166 ns 9291 ns 0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9083 ns 8875 ns 1.02
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 162050.5 ns 157321.5 ns 1.03
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 64916 ns 64792 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 64625 ns 64667 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 64625 ns 64750 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 64458 ns 64375 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 112372 ns 111151 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 289833 ns 275916 ns 1.05
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 279500 ns 293917 ns 0.95
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 282167 ns 291666 ns 0.97
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 278917 ns 274417 ns 1.02
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 191380 ns 183162.5 ns 1.04
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) 3267875 ns 3323375 ns 0.98
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) 3100000 ns 2861812 ns 1.08
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) 3087125 ns 3049625 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) 3985042 ns 3939000 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA 582393 ns 580012.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) 7531208 ns 7623333 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) 7465041 ns 7263625 ns 1.03
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) 7467854 ns 7327354 ns 1.02
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) 8032146 ns 8196041 ns 0.98
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA 1388075.5 ns 1311084.5 ns 1.06
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) 19401417 ns 18847291 ns 1.03
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) 19152875 ns 19137541 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) 19084458 ns 19205875 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) 15677125 ns 15425792 ns 1.02
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 24347125 ns 23654958 ns 1.03
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 34024291.5 ns 43401291.5 ns 0.78
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37229500 ns 37089791.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34880042 ns 34880750 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1857760.5 ns 1841996 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 193645562 ns 188777125 ns 1.03
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 164151291.5 ns 178489062.5 ns 0.92
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 151654958 ns 152827958 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 439821375 ns 438354958 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 13858604 ns 13884864 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 293543645.5 ns 289730542 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 336893125.5 ns 273653750 ns 1.23
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 299206416.5 ns 300146084 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 333915208 ns 363130458 ns 0.92
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 23584 ns 24959 ns 0.94
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 25271 ns 23166 ns 1.09
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 25750 ns 26250 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 23834 ns 21541 ns 1.11
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 97106.5 ns 93319 ns 1.04
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 103958.5 ns 104333 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 104229.5 ns 104208 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 113750 ns 104041 ns 1.09
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 103166 ns 103292 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 509956.5 ns 494914.5 ns 1.03
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6187.5 ns 7375 ns 0.84
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7084 ns 7062.5 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7833 ns 8083 ns 0.97
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7000 ns 6959 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 68645 ns 66496.5 ns 1.03
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14687.5 ns 15333 ns 0.96
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15500 ns 16334 ns 0.95
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 16750 ns 15958 ns 1.05
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14833 ns 14750 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 482599 ns 467266 ns 1.03
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 2799667 ns 3009270.5 ns 0.93
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2078479.5 ns 2083125 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2280312.5 ns 2291250 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4472000 ns 4920209 ns 0.91
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA 587881 ns 585803 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 24038167 ns 23529584 ns 1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 17995729 ns 18299083 ns 0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 17057395.5 ns 17952042 ns 0.95
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 35636667 ns 35984709 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3108956.5 ns 3109259 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 33833209 ns 33275020.5 ns 1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 27496229.5 ns 28041667 ns 0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 27409833 ns 27515834 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 41716709 ns 41779084 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 72750 ns 75459 ns 0.96
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 75542 ns 81146 ns 0.93
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 76875 ns 76416.5 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 74541.5 ns 72291 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 101955.5 ns 100380 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 296812.5 ns 285041.5 ns 1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 302854.5 ns 311542 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 315187.5 ns 292833 ns 1.08
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 204333.5 ns 315375 ns 0.65
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 551842.5 ns 544347 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 12042 ns 12667 ns 0.95
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 13500 ns 12771 ns 1.06
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 13250 ns 13750 ns 0.96
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 12958 ns 12083 ns 1.07
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 71982 ns 70337.5 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 26833 ns 27042 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 27333 ns 27625 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 27979.5 ns 27708 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 27062.5 ns 26875 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 484665 ns 473629 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 12625 ns 13083 ns 0.96
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 12792 ns 13250 ns 0.97
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 13625 ns 14833 ns 0.92
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 13500 ns 13125 ns 1.03
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 54014 ns 52795 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 26209 ns 26250 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 26000 ns 26750 ns 0.97
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 26250 ns 28792 ns 0.91
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26542 ns 26167 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 309883 ns 304928.5 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 180709 ns 181791 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 181792 ns 181750 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 185459 ns 184875 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 181375 ns 181833 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 58480.5 ns 56540.5 ns 1.03
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 585521 ns 615187.5 ns 0.95
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 595417 ns 620771.5 ns 0.96
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 592000 ns 583541 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 582583 ns 595499.5 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 290043 ns 285956 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6292 ns 6958 ns 0.90
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7333.5 ns 7083 ns 1.04
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7458 ns 8041 ns 0.93
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7291 ns 6375 ns 1.14
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 70388.5 ns 70068.5 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14458 ns 14375 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14729.5 ns 15333 ns 0.96
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15187.5 ns 15333 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14417 ns 14500 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 470302.5 ns 463652.5 ns 1.01
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 1190583 ns 1234312.5 ns 0.96
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 1259542 ns 1279667 ns 0.98
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 1286958 ns 1269833.5 ns 1.01
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 1309792 ns 1312458 ns 1.00
batchedmm(512, Bsize=4)/forward/GPU/CUDA 301566 ns 301465 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 4329292 ns 4127187.5 ns 1.05
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 4329916 ns 4510874.5 ns 0.96
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 4565562 ns 4533354 ns 1.01
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 4476104.5 ns 4443687.5 ns 1.01
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1040707.5 ns 1047444 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1833 ns 1834 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1834 ns 1833 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1875 ns 1833 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1833 ns 1833 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 24056 ns 23871 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4875 ns 4917 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4959 ns 4917 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 4959 ns 4959 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4875 ns 4875 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 191633.5 ns 190792.5 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6041.5 ns 7041.5 ns 0.86
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 7062.5 ns 6292 ns 1.12
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 8042 ns 9208 ns 0.87
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7083 ns 7166 ns 0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 55968.5 ns 56472 ns 0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 11750 ns 11750 ns 1
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10354.5 ns 11584 ns 0.89
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 12125 ns 11812.5 ns 1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10917 ns 10792 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 334934 ns 335267 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 333 ns 333 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 333 ns 292 ns 1.14
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 333 ns 292 ns 1.14
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 23116 ns 23092 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2917 ns 2958 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2750 ns 2667 ns 1.03
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 3084 ns 2667 ns 1.16
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2709 ns 2708 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 162654.5 ns 161307 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 12770.5 ns 14395.5 ns 0.89
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 13709 ns 12333 ns 1.11
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 15250 ns 14917 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 13875 ns 13145.5 ns 1.06
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 57764 ns 56807.5 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 25417 ns 25375 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24583 ns 25333 ns 0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25167 ns 24958 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 25041.5 ns 25333 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 295171 ns 292514 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4167 ns 4125 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4167 ns 4167 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4208 ns 4167 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4167 ns 4125 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 25293 ns 25065 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16125 ns 16333 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16250 ns 16000 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16917 ns 16250 ns 1.04
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16084 ns 16167 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 200954 ns 198557.5 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5791 ns 5791 ns 1
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5792 ns 5833 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5750 ns 5792 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5791 ns 5750 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 33912 ns 33912.5 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 20959 ns 21041 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 21041 ns 21250 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 21583 ns 21395.5 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 20792 ns 21042 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 176959 ns 176321 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 386291 ns 408208 ns 0.95
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 372708 ns 363583.5 ns 1.03
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 486562.5 ns 492667 ns 0.99
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 519792 ns 523542 ns 0.99
batchedmm(16, Bsize=512)/forward/GPU/CUDA 67330 ns 67347 ns 1.00
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 1011291 ns 978667 ns 1.03
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 873062.5 ns 891000.5 ns 0.98
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 1237521 ns 1242958 ns 1.00
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 1397209 ns 1420417 ns 0.98
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 191794.5 ns 190609 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 80250 ns 82666 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 81792 ns 82709 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 86000 ns 85834 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83375 ns 133542 ns 0.62
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 194543 ns 193457 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1932750 ns 1923750 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1920541.5 ns 1936250.5 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1918916 ns 1914520.5 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1923021 ns 1920083 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 395736 ns 399634.5 ns 0.99
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 333 ns 0.88
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 22304 ns 22639 ns 0.99
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1833 ns 1834 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1875 ns 1834 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1875 ns 1875 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1792 ns 1792 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 170577 ns 174147.5 ns 0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6562.5 ns 8542 ns 0.77
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 6584 ns 7292 ns 0.90
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 8687.5 ns 9083 ns 0.96
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6917 ns 6541 ns 1.06
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 58494 ns 60578 ns 0.97
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9500 ns 9542 ns 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9083 ns 9479.5 ns 0.96
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9500 ns 9542 ns 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9000 ns 9541 ns 0.94
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 299864 ns 313158.5 ns 0.96
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 158867583 ns 120031270.5 ns 1.32
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 173898792 ns 181860604 ns 0.96
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 148221333 ns 147859583 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 104030875 ns 107036271 ns 0.97
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5486248 ns 5506155 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 681447729.5 ns 615708666.5 ns 1.11
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 554483292 ns 581207833 ns 0.95
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 449606833 ns 450770312.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 759359917 ns 758274833.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 38204857 ns 34927722 ns 1.09
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 704631875 ns 650246750 ns 1.08
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 667416541.5 ns 685688396 ns 0.97
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 592203208.5 ns 577502729 ns 1.03
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 746616584 ns 743657333 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57333 ns 59167 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46750 ns 39333 ns 1.19
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47167 ns 47625 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83083 ns 83542 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 38237 ns 38483 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1937667 ns 1924917 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1966479.5 ns 1972334 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1981312 ns 1976458 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1666938 ns 1895208 ns 0.88
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 175800.5 ns 176241.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 268167 ns 270958 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 268062.5 ns 269042 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 270375 ns 270875 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 267833.5 ns 267958 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 121785.5 ns 128472 ns 0.95
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 676083.5 ns 682312.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 661375 ns 684021 ns 0.97
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 682625 ns 678333 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 594229.5 ns 683083 ns 0.87
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 667322.5 ns 712823 ns 0.94
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2162666.5 ns 2110062.5 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2158625 ns 2217708.5 ns 0.97
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2235541.5 ns 2221875 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2184396 ns 2230541 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 134022 ns 134372 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5614125 ns 5507000 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5487000 ns 5539625 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5503834 ns 5512958 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5397958 ns 5509604 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 730476 ns 755964 ns 0.97
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 649042 ns 638125 ns 1.02
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 648000 ns 651667 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 645042 ns 638459 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 640167 ns 647208 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 47671 ns 47881 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1819333 ns 1826416 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1719500 ns 1675750 ns 1.03
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1722333 ns 1720875 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 2100708 ns 2104000 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 226229 ns 224321 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 56375 ns 58208 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46042 ns 38792 ns 1.19
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 45375 ns 46584 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 81917 ns 83542 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 28944 ns 29060 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2043125 ns 2031958 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2091959 ns 2100291.5 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2087459 ns 2085291 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1994792 ns 2007250 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 191139 ns 191693.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 13390792 ns 13371646.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 12557146 ns 12465792 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 12640375 ns 12501042 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 15104000 ns 15188916 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 518367 ns 510743.5 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 47595458 ns 47270208 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 41795625 ns 42049416.5 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 41070000.5 ns 41051834 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 58470417 ns 58110084 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3196312 ns 3204565.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 97085500 ns 96634583 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 68418416 ns 91624583 ns 0.75
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 90577750 ns 90630541 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 76363042 ns 98906458.5 ns 0.77
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 56625 ns 58500 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 47250 ns 38709 ns 1.22
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47459 ns 47125 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 79458 ns 83541 ns 0.95
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 46662 ns 47960 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1927541.5 ns 1920000 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1977249.5 ns 1969792 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1980458.5 ns 1972500 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1885937.5 ns 1889834 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 191777.5 ns 192720 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 292 ns 416 ns 0.70
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 417 ns 0.90
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 333 ns 292 ns 1.14
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 32632 ns 31940 ns 1.02
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6833 ns 6750 ns 1.01
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6375 ns 6625 ns 0.96
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6562.5 ns 6583 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6542 ns 6250 ns 1.05
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 170723.5 ns 171690.5 ns 0.99
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 291 ns 292 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 250 ns 292 ns 0.86
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 31798 ns 31426 ns 1.01
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 2792 ns 2833 ns 0.99
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 2875 ns 2792 ns 1.03
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 2917 ns 2834 ns 1.03
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 2584 ns 2625 ns 0.98
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 160722.5 ns 160271 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 324739354.5 ns 287478708.5 ns 1.13
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 339525125 ns 347117687.5 ns 0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 313761354 ns 313742875 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 270586625 ns 271337417 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 7046631.5 ns 7120485.5 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 1056579291.5 ns 999672583 ns 1.06
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 938968625 ns 962585125 ns 0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 854294979.5 ns 847863396 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1160220458 ns 1159606875 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 33965020 ns 34018012.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1708151417 ns 1668327625 ns 1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1335733187.5 ns 1694566583 ns 0.79
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1644800875 ns 1646047208 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1296542916.5 ns 1665789292 ns 0.78
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1415104.5 ns 1415313 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1411979.5 ns 1417167 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1416812.5 ns 1417459 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1409584 ns 1412583 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 128051 ns 128511 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5069625 ns 5021792 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5045062 ns 5044792 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5032541 ns 5021250 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5026125 ns 5024292 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 538087.5 ns 495850 ns 1.09
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) 169153250 ns 169190166 ns 1.00
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) 121641542 ns 179239187.5 ns 0.68
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) 130222229.5 ns 128995104.5 ns 1.01
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) 165507937.5 ns 162929271 ns 1.02
vgg16(32, 32, 3, 32)/forward/GPU/CUDA 4915832 ns 4883493 ns 1.01
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) 850025583 ns 671536958 ns 1.27
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) 641237459 ns 604481292 ns 1.06
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) 532476541 ns 531751292 ns 1.00
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) 683223459 ns 681136250 ns 1.00
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA 18105472 ns 16104554 ns 1.12
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 9040458 ns 8980854 ns 1.01
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 8724291 ns 8853334 ns 0.99
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 7847084 ns 7886771 ns 0.99
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 10130834 ns 10140625 ns 1.00
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1594949 ns 1602269.5 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 37056958.5 ns 36048625 ns 1.03
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 36670666.5 ns 37859417 ns 0.97
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 33132917 ns 33187042 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 40008520.5 ns 39063937.5 ns 1.02
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6471272 ns 8827671 ns 0.73
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 47395.5 ns 47666 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 47417 ns 47667 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 47666 ns 47625 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 47416 ns 47542 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 18474 ns 18332 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 50292 ns 50416 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 50208 ns 50500 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 50875 ns 50541 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 50125 ns 53000 ns 0.95
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 171116.5 ns 183394 ns 0.93
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7708 ns 7833 ns 0.98
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 8250 ns 7500 ns 1.10
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 9917 ns 9375 ns 1.06
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7500 ns 6979.5 ns 1.07
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 81170 ns 85722.5 ns 0.95
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10416.5 ns 10500 ns 0.99
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10083 ns 10500 ns 0.96
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10166 ns 10625 ns 0.96
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10375 ns 10167 ns 1.02
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 468151 ns 484512.5 ns 0.97
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 7937.5 ns 9250 ns 0.86
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 8417 ns 6750 ns 1.25
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 9334 ns 9417 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6500 ns 7792 ns 0.83
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 96340.5 ns 105586.5 ns 0.91
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 13750 ns 13083 ns 1.05
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 13083 ns 13250 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13708 ns 13458.5 ns 1.02
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 13542 ns 13417 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 430131.5 ns 467808.5 ns 0.92
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 1042 ns 1083 ns 0.96
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 1083 ns 1083 ns 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 1084 ns 1083 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 1083 ns 1042 ns 1.04
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 32020 ns 31641 ns 1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8208 ns 8167 ns 1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8083 ns 8209 ns 0.98
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8291 ns 8291 ns 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8541 ns 8125 ns 1.05
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 194284 ns 195119.5 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 23250 ns 25167 ns 0.92
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 23166 ns 23250 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 23479.5 ns 23270.5 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 23333 ns 23125 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 18346 ns 18534 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 52583 ns 53062 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 52458.5 ns 52375 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 52875 ns 52500 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 52583 ns 52708 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 246069.5 ns 252220 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1414770.5 ns 1400708 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1448500 ns 1409834 ns 1.03
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1408812.5 ns 1399229.5 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1400666 ns 1398458 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 195901.5 ns 194493.5 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5041521 ns 5016000 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5009562.5 ns 5040334 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5020750 ns 4993708.5 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5011875 ns 4643770.5 ns 1.08
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 577943 ns 597903.5 ns 0.97
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3051687.5 ns 3046458 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2098000 ns 2118792 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2285625 ns 2287146 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4540833.5 ns 4859250 ns 0.93
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 580172 ns 581676 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 24668666.5 ns 24338167 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18916375 ns 19105334 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 18998375 ns 18916917 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 36603708 ns 36315667 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3187885.5 ns 3195442 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 34405834 ns 33985645.5 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 28320666.5 ns 28693250 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 28063958.5 ns 27979104.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 41741750 ns 41435375 ns 1.01
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 141994958 ns 144577667 ns 0.98
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 140315020.5 ns 142667333 ns 0.98
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 124968083.5 ns 124796041.5 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 173994708.5 ns 174395646 ns 1.00
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22772769 ns 22784954 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 945646312.5 ns 908417417 ns 1.04
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 821550291 ns 866595875 ns 0.95
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 1293202000 ns 690147541 ns 1.87
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 687734375 ns 679371625 ns 1.01
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 118417230 ns 118837225 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 82875 ns 76312 ns 1.09
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 87604 ns 76708.5 ns 1.14
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 78708 ns 78062.5 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 73708.5 ns 74292 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 218863.5 ns 239745 ns 0.91
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 201083.5 ns 279187.5 ns 0.72
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 286750 ns 297958 ns 0.96
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 288834 ns 283125 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 250916.5 ns 265791.5 ns 0.94
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1188088 ns 1232585 ns 0.96
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 36439417 ns 35449875 ns 1.03
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 35404270.5 ns 35824917 ns 0.99
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 32153583.5 ns 32070395.5 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 40973604 ns 40877625 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5842252 ns 5847896 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 151588250 ns 147901291 ns 1.02
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 153155979.5 ns 155872291 ns 0.98
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 137699666.5 ns 133368083 ns 1.03
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 287172166 ns 286886250 ns 1.00
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 34872027 ns 34880972 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 158128125 ns 121105063 ns 1.31
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 173816375 ns 181834292 ns 0.96
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 147404958 ns 147760625 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 107977875 ns 101356500 ns 1.07
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5466563 ns 5478431 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 520328562 ns 473677042 ns 1.10
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 465862417 ns 485888583.5 ns 0.96
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 438730375 ns 437646959 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 743725417 ns 740881667 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 35165135 ns 32245879 ns 1.09
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 693072125 ns 707376812.5 ns 0.98
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 654827271 ns 667253771 ns 0.98
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 573433167 ns 576063750 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 851476833 ns 852206792 ns 1.00
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) 1174312.5 ns 1266333 ns 0.93
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) 994625 ns 788917 ns 1.26
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) 961895.5 ns 969500 ns 0.99
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) 2072333.5 ns 2069208.5 ns 1.00
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA 579719 ns 586368.5 ns 0.99
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) 2931333.5 ns 2969541 ns 0.99
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) 2614999.5 ns 2523083 ns 1.04
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) 2630875 ns 2620708 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) 3710749.5 ns 3700583 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA 1749092 ns 1794949 ns 0.97
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) 6772375 ns 6640437.5 ns 1.02
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) 6502416 ns 6484958 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) 6513500 ns 6451083 ns 1.01
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) 4449063 ns 4447979 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7167 ns 7500 ns 0.96
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6125 ns 5334 ns 1.15
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6250 ns 6125 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9958 ns 9917 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 25514 ns 25270 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 213125 ns 212167 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 220000 ns 221000 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 220167 ns 221125 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 206583 ns 207458 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 255014.5 ns 252957 ns 1.01
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) 312678917 ns 313894603.5 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) 213764812.5 ns 280731020.5 ns 0.76
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) 197457396 ns 185850791.5 ns 1.06
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) 311892292 ns 312245084 ns 1.00
vgg16(32, 32, 3, 64)/forward/GPU/CUDA 7676616 ns 7682659 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) 1092379645.5 ns 1079816500.5 ns 1.01
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) 904888770.5 ns 989067125 ns 0.91
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) 812108375 ns 810903834 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) 1158438958 ns 1155211625 ns 1.00
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA 26462420 ns 26590890 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5563 ns 7416.5 ns 0.75
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5666 ns 6209 ns 0.91
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 8000 ns 6917 ns 1.16
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6125 ns 5729.5 ns 1.07
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 162748.5 ns 151351 ns 1.08
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7583 ns 7604.5 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7333 ns 7542 ns 0.97
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7583 ns 7541 ns 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7916 ns 7542 ns 1.05
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 614344.5 ns 598449 ns 1.03
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 541 ns 541 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 583 ns 541 ns 1.08
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 542 ns 625 ns 0.87
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 458 ns 459 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 24176 ns 24254 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9709 ns 9416 ns 1.03
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9333 ns 9292 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9791 ns 9458 ns 1.04
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9291 ns 9250 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 206786 ns 214013.5 ns 0.97
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 353875 ns 352917 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 351166 ns 355083.5 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 351250 ns 350833 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 351229.5 ns 353583 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 21232 ns 21515 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 821062.5 ns 828979 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 775209 ns 787167 ns 0.98
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 774125 ns 774312.5 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 821375 ns 823875 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 296042.5 ns 271369.5 ns 1.09
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 315437.5 ns 338958.5 ns 0.93
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 340458 ns 320167 ns 1.06
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 450500 ns 453291 ns 0.99
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 330333 ns 331895.5 ns 1.00
batchedmm(16, Bsize=32)/forward/GPU/CUDA 18319 ns 18690 ns 0.98
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 696292 ns 696291 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 743062.5 ns 744854.5 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 1033542 ns 1036229 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 698416 ns 686042 ns 1.02
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 250758.5 ns 234671 ns 1.07
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 331042 ns 361375 ns 0.92
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 348542 ns 336417 ns 1.04
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 417875 ns 425792 ns 0.98
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 368167 ns 377584 ns 0.98
batchedmm(16, Bsize=128)/forward/GPU/CUDA 22961 ns 22985 ns 1.00
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 756354 ns 760187 ns 0.99
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 753124.5 ns 753000 ns 1.00
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 1077937.5 ns 1084125 ns 0.99
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 824208 ns 812791.5 ns 1.01
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 223744 ns 215024 ns 1.04
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 3542 ns 3625 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 3500 ns 3708 ns 0.94
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 3625 ns 3625 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 3458 ns 3541 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 17947 ns 18002 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 4500 ns 4291 ns 1.05
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 4291 ns 4583 ns 0.94
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 4334 ns 4500 ns 0.96
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 4292 ns 4541 ns 0.95
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 250374.5 ns 239767 ns 1.04
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 4020.5 ns 5687.5 ns 0.71
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4167 ns 4125 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6229 ns 4959 ns 1.26
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4188 ns 3875 ns 1.08
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 187203.5 ns 180564 ns 1.04
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8333 ns 8708 ns 0.96
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8292 ns 8500 ns 0.98
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8959 ns 8667 ns 1.03
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8542 ns 8541 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 1134224 ns 1101874 ns 1.03
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 204792 ns 208292 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 209750 ns 209250 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 210583 ns 209166.5 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 198541 ns 200375 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 35314 ns 34680 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 628542 ns 649916 ns 0.97
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 621333 ns 632250 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 621916.5 ns 621979 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 628417 ns 632208 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 311564.5 ns 306075 ns 1.02
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 981958.5 ns 975416.5 ns 1.01
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 939958.5 ns 936645.5 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 952792 ns 954895.5 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 1288333 ns 1290104.5 ns 1.00
batchedmm(128, Bsize=128)/forward/GPU/CUDA 207725 ns 206706.5 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 4717104 ns 4495416.5 ns 1.05
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 4470167 ns 4624208 ns 0.97
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 4297542 ns 4293833.5 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 6248125 ns 6306792 ns 0.99
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 935192 ns 924556 ns 1.01
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3625 ns 4667 ns 0.78
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3958 ns 4333 ns 0.91
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4916.5 ns 5208 ns 0.94
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3667 ns 3125 ns 1.17
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 194832 ns 201570 ns 0.97
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7625 ns 7625 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7333 ns 7375 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7667 ns 7333 ns 1.05
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7375 ns 7125 ns 1.04
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 992089 ns 964645.5 ns 1.03
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1641437.5 ns 1660208.5 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1185979 ns 1158208 ns 1.02
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1378709 ns 1364146 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2428250 ns 2354187 ns 1.03
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA 214706 ns 213379 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12422083 ns 12376417 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9583083 ns 9587708.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9294291 ns 9262687 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18012291 ns 17957375 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1954819 ns 1953093.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17453916 ns 17363667 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14374292 ns 14466208 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14354958 ns 14361333 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21099271 ns 21148875 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 93875 ns 136479 ns 0.69
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 88667 ns 90541.5 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 92166 ns 91959 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 90437.5 ns 88917 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 126600 ns 126286 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2066875 ns 2029396 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1965333.5 ns 2020021 ns 0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2034937.5 ns 2021541.5 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2029166.5 ns 2009791 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1084421 ns 970059 ns 1.12
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 328375 ns 348458 ns 0.94
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 349208 ns 336521 ns 1.04
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 397000 ns 399187.5 ns 0.99
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 309875 ns 313500 ns 0.99
batchedmm(2, Bsize=4)/forward/GPU/CUDA 15807 ns 15421 ns 1.03
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 707500 ns 709188 ns 1.00
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 737354.5 ns 737750 ns 1.00
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 1022500 ns 1023375 ns 1.00
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 654396 ns 643583 ns 1.02
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 195018.5 ns 185776.5 ns 1.05
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7250 ns 7334 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5917 ns 5292 ns 1.12
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6083 ns 6042 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9958 ns 9959 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 33500 ns 33229 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 214937.5 ns 223750 ns 0.96
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 230375 ns 228166.5 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 220500 ns 220459 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 210791 ns 215250 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 348178.5 ns 289979.5 ns 1.20
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3709 ns 3750 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3667 ns 3709 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3708 ns 3667 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3667 ns 3667 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 22629 ns 22473 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14375 ns 14375 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 14334 ns 14250 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14375 ns 14417 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14334 ns 14500 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 496039.5 ns 454491 ns 1.09
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 97166 ns 93937.5 ns 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 94771 ns 96000 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 98292 ns 95750 ns 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 94833 ns 94229 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 125320 ns 125724.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1954375 ns 1921562.5 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1925166 ns 1938875 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1932375 ns 1920667 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1915062.5 ns 1918854.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1014156 ns 949972 ns 1.07
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) 852417 ns 886792 ns 0.96
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) 820062.5 ns 812958 ns 1.01
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) 1215333 ns 1228020.5 ns 0.99
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) 959271 ns 961021 ns 1.00
lenet(28, 28, 1, 32)/forward/GPU/CUDA 271531 ns 266393 ns 1.02
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) 2836167 ns 2837791.5 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) 2455958 ns 2523625 ns 0.97
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) 3343125 ns 3323459 ns 1.01
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) 3387042 ns 3391708 ns 1.00
lenet(28, 28, 1, 32)/zygote/GPU/CUDA 1631802 ns 1589685.5 ns 1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 15333 ns 17625 ns 0.87
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 15000 ns 15833 ns 0.95
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 19271 ns 18750 ns 1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 16708 ns 15833 ns 1.06
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 142548.5 ns 140920 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 228292 ns 216604.5 ns 1.05
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 215437.5 ns 223875 ns 0.96
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 216000 ns 216062.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 263417 ns 257042 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 647857.5 ns 635870.5 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 222708.5 ns 227209 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 221916.5 ns 220833 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 223791 ns 223271 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 221292 ns 219541 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 351782.5 ns 267876.5 ns 1.31
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 509750 ns 523334 ns 0.97
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 494979.5 ns 557334 ns 0.89
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 510333 ns 498187.5 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 542437.5 ns 540416 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1486215.5 ns 1349491 ns 1.10
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 316417 ns 334459 ns 0.95
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 336437.5 ns 317417 ns 1.06
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 354416 ns 364250 ns 0.97
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 320125 ns 320791 ns 1.00
batchedmm(16, Bsize=4)/forward/GPU/CUDA 16603 ns 16596.5 ns 1.00
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 714958 ns 715750.5 ns 1.00
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 732917 ns 735750 ns 1.00
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 1022875 ns 1025729.5 ns 1.00
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 665729 ns 657937.5 ns 1.01
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 197199 ns 193892 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17875 ns 17667 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 17645.5 ns 17417 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 19416 ns 20583.5 ns 0.94
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 17208 ns 16833 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 228185 ns 144720.5 ns 1.58
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 213625 ns 212749.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 223604.5 ns 212500 ns 1.05
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 213000 ns 213583 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 240834 ns 223229 ns 1.08
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1059556 ns 930226.5 ns 1.14
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5979.5 ns 7458 ns 0.80
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 7083 ns 5000 ns 1.42
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7083 ns 7458 ns 0.95
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6375 ns 6000 ns 1.06
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 238595 ns 229973.5 ns 1.04
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 10583.5 ns 10750 ns 0.98
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10292 ns 10604 ns 0.97
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 11416 ns 11000 ns 1.04
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10708 ns 10458 ns 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 1093136 ns 1052874 ns 1.04
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3770.5 ns 4167 ns 0.90
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3375 ns 4145.5 ns 0.81
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4229 ns 5250 ns 0.81
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3292 ns 2834 ns 1.16
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 251757 ns 236091.5 ns 1.07
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7458 ns 7625 ns 0.98
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7167 ns 7875 ns 0.91
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8000 ns 7750 ns 1.03
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7333 ns 7209 ns 1.02
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 1103726 ns 1061672 ns 1.04
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 24419792 ns 23478292 ns 1.04
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 34715312.5 ns 43131583 ns 0.80
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37629125 ns 37763437.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34906500 ns 34891125.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1844116 ns 1856489 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 188432604.5 ns 184985667 ns 1.02
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 159504917 ns 171828500 ns 0.93
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 146180437.5 ns 146459896 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 413707208 ns 412533125 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 16521419 ns 16498145 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 438323667 ns 426401458 ns 1.03
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 253688896 ns 257893209 ns 0.98
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 231435417 ns 231907209 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 485078167 ns 482223334 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 185291.5 ns 183271 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 183250 ns 183354.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 184687.5 ns 186750 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 183709 ns 182250 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 229419 ns 202451.5 ns 1.13
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 631500 ns 589375 ns 1.07
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 585166.5 ns 596958.5 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 597708 ns 589000 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 630583.5 ns 632167 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1108299 ns 1041439 ns 1.06
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 3978979.5 ns 3849562 ns 1.03
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 3639000 ns 3881896 ns 0.94
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 3486041 ns 3464521 ns 1.01
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 5345292 ns 5356333 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/CUDA 534520 ns 536569.5 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 18266999.5 ns 17412625 ns 1.05
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 17298416.5 ns 17756875 ns 0.97
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 16542625 ns 16608479 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 22103834 ns 22042750 ns 1.00
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2634003.5 ns 2637828 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 500 ns 583 ns 0.86
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 583 ns 542 ns 1.08
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 542 ns 584 ns 0.93
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 459 ns 500 ns 0.92
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 32947 ns 32430 ns 1.02
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9666 ns 9875 ns 0.98
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9459 ns 9500 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9833 ns 9750 ns 1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9875 ns 9145.5 ns 1.08
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 267979.5 ns 267467.5 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) 581657146 ns 504434042 ns 1.15
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) 431111584 ns 458633542 ns 0.94
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) 431743375 ns 381209021 ns 1.13
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) 596638417 ns 671200875.5 ns 0.89
vgg16(32, 32, 3, 128)/forward/GPU/CUDA 12473530 ns 12484248 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) 2064050353.5 ns 2048273395.5 ns 1.01
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) 1627636167 ns 1661422833 ns 0.98
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) 1488459749.5 ns 1499198563 ns 0.99
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) 2201925958.5 ns 2207989770.5 ns 1.00
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA 49182728 ns 49043755 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1645500 ns 1648062.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1168250 ns 1192292 ns 0.98
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1385416 ns 1392792 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2450375 ns 2475542 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 217064 ns 218335.5 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12851792 ns 12753208 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9920604 ns 9970145.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9698792 ns 9709187 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18357187.5 ns 18405562.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2017297 ns 2007331 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17787021 ns 17672750 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14725958 ns 14774167 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14633271 ns 14626875 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21461500 ns 21434167 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 26458 ns 26208 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 26209 ns 26250 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 26291 ns 26209 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 26250 ns 26208 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 23903 ns 24803 ns 0.96
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 67417 ns 66917 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 66666 ns 66833 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 67292 ns 67875 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66792 ns 66750 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 406936 ns 397350.5 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 203875 ns 204750 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 209625 ns 209167 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 210333 ns 209917 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 199291 ns 200042 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 26061.5 ns 26341 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 611041 ns 612792 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 631708 ns 669042 ns 0.94
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 669229 ns 665479.5 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 594291.5 ns 633646 ns 0.94
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 352245 ns 340366 ns 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 671500 ns 656958 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 642583.5 ns 628166 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 651416.5 ns 637292 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 635333 ns 658854 ns 0.96
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 131732.5 ns 131658 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2329958 ns 2236438 ns 1.04
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2241167 ns 2302291.5 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2235209 ns 2233208.5 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2243562.5 ns 2244083.5 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1189796.5 ns 1141510 ns 1.04
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 18584 ns 17708.5 ns 1.05
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 17479.5 ns 17875 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 22542 ns 22791.5 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 21041 ns 17812.5 ns 1.18
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 145492.5 ns 143266.5 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 228333 ns 231271 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 218792 ns 262583 ns 0.83
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 262916 ns 262520.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 258104 ns 262167 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1060721 ns 974956 ns 1.09
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 542 ns 583 ns 0.93
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 584 ns 542 ns 1.08
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 583 ns 583 ns 1
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 500 ns 500 ns 1
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 23751 ns 23116 ns 1.03
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 10042 ns 10167 ns 0.99
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9917 ns 9666 ns 1.03
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 10250 ns 10125 ns 1.01
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 10041 ns 10084 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 261761.5 ns 255373.5 ns 1.03
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6167 ns 7125 ns 0.87
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5875 ns 6209 ns 0.95
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7458 ns 7354.5 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6625 ns 5792 ns 1.14
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 236484.5 ns 224318.5 ns 1.05
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7500 ns 7708 ns 0.97
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7084 ns 7375 ns 0.96
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7667 ns 7541 ns 1.02
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7625 ns 7209 ns 1.06
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 805005.5 ns 798172.5 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 2208 ns 2208.5 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 2250 ns 2291 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 2208 ns 2209 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 2125 ns 2167 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 18044 ns 17921 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 6625 ns 6875 ns 0.96
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 6500 ns 6500 ns 1
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 6667 ns 6750 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 6750 ns 6708 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 332799.5 ns 329206 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 754583.5 ns 749437.5 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 746792 ns 748917 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 746709 ns 749541 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 748833 ns 751833.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 21377 ns 21135 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 793208 ns 795541 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 774958 ns 788459 ns 0.98
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 791541.5 ns 792916 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 810083 ns 791791.5 ns 1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 300021.5 ns 292229.5 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7209 ns 7209 ns 1
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6000 ns 5333 ns 1.13
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6125 ns 5958 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10125 ns 10166 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 33125 ns 32459 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 228292 ns 229666.5 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 226459 ns 239729.5 ns 0.94
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 269250 ns 264354.5 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 212958 ns 255083.5 ns 0.83
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 361902.5 ns 359407.5 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 11334 ns 12770.5 ns 0.89
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 12979.5 ns 11125 ns 1.17
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 13083 ns 12792 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 10958 ns 10541 ns 1.04
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 253895 ns 243081 ns 1.04
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 25042 ns 25208 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 23875 ns 24916 ns 0.96
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25792 ns 25208 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 25084 ns 24625 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 1125992 ns 1117079 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 106901208 ns 106480583 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 117686604 ns 125655584 ns 0.94
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 121224709 ns 120834166 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 118251959 ns 117491666 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 2654315 ns 2637704 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 396867792 ns 393188541 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 366900875 ns 380341000 ns 0.96
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 353248208 ns 357677834 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 486147750 ns 481091583 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 15270647 ns 15233085 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 948542479 ns 937085875 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 579410750 ns 774220083 ns 0.75
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 743810562.5 ns 745186000 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 764589145.5 ns 945237625.5 ns 0.81
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7709 ns 8625 ns 0.89
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7583 ns 7500 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8042 ns 8875 ns 0.91
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7875 ns 7833 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 243350 ns 237576 ns 1.02
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14334 ns 14250 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 13875 ns 14375 ns 0.97
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14459 ns 13916 ns 1.04
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14458 ns 14083 ns 1.03
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 1071498 ns 1078858 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 7917 ns 9125 ns 0.87
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 8083 ns 7041 ns 1.15
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 8916 ns 9083 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 8583 ns 7042 ns 1.22
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 237135.5 ns 235440 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12667 ns 12916.5 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12083 ns 13208 ns 0.91
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13000 ns 12792 ns 1.02
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 13083 ns 12708 ns 1.03
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 787328.5 ns 787408.5 ns 1.00
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 335541 ns 353104 ns 0.95
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 343750 ns 328604 ns 1.05
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 397542 ns 398083 ns 1.00
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 308542 ns 314250 ns 0.98
batchedmm(2, Bsize=128)/forward/GPU/CUDA 17004 ns 16719 ns 1.02
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 710187.5 ns 711500 ns 1.00
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 735542 ns 737000 ns 1.00
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 1025667 ns 1029562.5 ns 1.00
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 660021 ns 649000 ns 1.02
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 200985 ns 196298 ns 1.02
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 292 ns 333 ns 0.88
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 23598 ns 23316 ns 1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6542 ns 6584 ns 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6584 ns 6750 ns 0.98
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6792 ns 6625 ns 1.03
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6542 ns 6375 ns 1.03
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 241099.5 ns 238133 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5875 ns 5875 ns 1
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 5875 ns 5916 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 5959 ns 5875 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5792 ns 5792 ns 1
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 24612 ns 23849 ns 1.03
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 21542 ns 21583 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 21250 ns 21542 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 21750 ns 22395.5 ns 0.97
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 21625 ns 21250 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 264784.5 ns 259774.5 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 178145.5 ns 148459 ns 1.20
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 148750 ns 146500 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 152312.5 ns 151167 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 146979 ns 149209 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 167163.5 ns 168521.5 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1390521 ns 1306312.5 ns 1.06
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1324584 ns 1335292 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1339167 ns 1326333 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1324166 ns 1329459 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1355480 ns 1332341.5 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 23500 ns 25520.5 ns 0.92
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 24167 ns 22687.5 ns 1.07
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 25625 ns 25917 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 24083 ns 23417 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 285493.5 ns 283013 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 183458 ns 176479.5 ns 1.04
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 118833 ns 119334 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 180603.5 ns 131395.5 ns 1.37
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 131313 ns 178542 ns 0.74
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1475275 ns 1446515 ns 1.02
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 375 ns 416 ns 0.90
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 334 ns 1.12
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 292 ns 292 ns 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 23098 ns 22447 ns 1.03
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6792 ns 7000 ns 0.97
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6666 ns 6792 ns 0.98
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6750 ns 6833 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6584 ns 6604.5 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 259072 ns 254907.5 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 4625 ns 5791.5 ns 0.80
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4708 ns 5041.5 ns 0.93
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7562 ns 7375 ns 1.03
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7167 ns 5583.5 ns 1.28
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 256394 ns 252117.5 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10250 ns 10250 ns 1
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10167 ns 10292 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10583 ns 10208 ns 1.04
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10250 ns 10292 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 1355462 ns 1346292 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1625 ns 1625 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1625 ns 1584 ns 1.03
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1625 ns 1584 ns 1.03
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1625 ns 1583 ns 1.03
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 22982 ns 23009 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 5917 ns 5958 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 5667 ns 5625 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 6000 ns 5667 ns 1.06
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 5666 ns 5791 ns 0.98
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 276854.5 ns 270989.5 ns 1.02
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 6836667 ns 6824625 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 6379291.5 ns 6348145.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 6516666.5 ns 6519020.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 7591021.5 ns 7697209 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 214663 ns 213576.5 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 24219625 ns 24071458 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 21283583 ns 21312916.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 21080667 ns 21105208.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 29731124.5 ns 29655708 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2118978 ns 2112366 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 48989083.5 ns 48607583 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 34134771 ns 45891875 ns 0.74
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 45780708.5 ns 45733979.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 38096479 ns 49303792 ns 0.77
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6042 ns 7292 ns 0.83
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 7291 ns 6916 ns 1.05
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7875 ns 7667 ns 1.03
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 6583 ns 6812.5 ns 0.97
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 237605.5 ns 236251.5 ns 1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8750 ns 8833 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8042 ns 9084 ns 0.89
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9667 ns 9125 ns 1.06
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9083 ns 8375 ns 1.08
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 1065271 ns 1057827.5 ns 1.01
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) 1524750 ns 1557041.5 ns 0.98
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) 1278354 ns 1245708 ns 1.03
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) 1620875.5 ns 1634792 ns 0.99
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) 2109874.5 ns 2151354 ns 0.98
lenet(28, 28, 1, 128)/forward/GPU/CUDA 280243.5 ns 269564 ns 1.04
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) 7991625 ns 7905354 ns 1.01
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) 6617270.5 ns 6660125 ns 0.99
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) 7189875 ns 7215708 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) 10461771 ns 10061000 ns 1.04
lenet(28, 28, 1, 128)/zygote/GPU/CUDA 1897156.5 ns 1851007 ns 1.02
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 321937.5 ns 347583.5 ns 0.93
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 345500 ns 330250 ns 1.05
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 409750 ns 398666.5 ns 1.03
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 339562.5 ns 347854.5 ns 0.98
batchedmm(128, Bsize=4)/forward/GPU/CUDA 42269 ns 46483.5 ns 0.91
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 746125 ns 750667 ns 0.99
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 790750 ns 791375 ns 1.00
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 1075375 ns 1087833 ns 0.99
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 734625 ns 760750 ns 0.97
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 239383 ns 231907 ns 1.03
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 395917 ns 397542 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 288291 ns 213292 ns 1.35
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 288209 ns 288208 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 751084 ns 750375 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 44476.5 ns 43637 ns 1.02
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 646125 ns 666667 ns 0.97
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 530875 ns 472875 ns 1.12
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 530542 ns 532542 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 972958 ns 973709 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 191911 ns 187534.5 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 682833 ns 596583 ns 1.14
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 670833 ns 643625 ns 1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 659500 ns 658187.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 638708 ns 659375 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 131884.5 ns 131892 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2563479.5 ns 2455000 ns 1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2432854.5 ns 2514542 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2458542 ns 2453792 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2383625 ns 2461334 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1195975.5 ns 1187757 ns 1.01
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 331542 ns 352771 ns 0.94
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 342250 ns 330916.5 ns 1.03
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 400229 ns 399291 ns 1.00
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 312958 ns 312854.5 ns 1.00
batchedmm(2, Bsize=32)/forward/GPU/CUDA 16588 ns 15466 ns 1.07
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 703000 ns 710875 ns 0.99
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 729354 ns 734791 ns 0.99
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 1024041 ns 1025771 ns 1.00
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 655979.5 ns 642208 ns 1.02
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 199893 ns 195407.5 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1462917 ns 1465375 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1500917 ns 1498459 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1500875 ns 1502875 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1438833 ns 1442833 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 40439.5 ns 40141 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5170625 ns 5101625 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5293687.5 ns 5303750 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5300979 ns 5295812.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4994229.5 ns 4993584 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 195795.5 ns 196609 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3708 ns 3667 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3667 ns 3708 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3708 ns 3667 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3667 ns 3666 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 33586 ns 33049 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 15125 ns 15292 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15250 ns 15167 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15458 ns 15291 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15166 ns 15167 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 379759 ns 375124.5 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 71875 ns 71334 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 71334 ns 71333 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 71334 ns 71250 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 71167 ns 71041 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 113830 ns 113867.5 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 330000 ns 318833 ns 1.04
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 326500 ns 321959 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 318792 ns 317750 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 317917 ns 317541 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 196755 ns 192238.5 ns 1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 1083 ns 1084 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 1083 ns 1083 ns 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 1084 ns 1083 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 1000 ns 1000 ns 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 23617 ns 23138 ns 1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8416 ns 8583 ns 0.98
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8250 ns 8417 ns 0.98
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8584 ns 8459 ns 1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8291 ns 7958 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 261794 ns 258287 ns 1.01
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 465562.5 ns 475687.5 ns 0.98
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 478125.5 ns 463395.5 ns 1.03
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 555166 ns 562708 ns 0.99
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 539000 ns 552729.5 ns 0.98
batchedmm(128, Bsize=32)/forward/GPU/CUDA 129234 ns 130132 ns 0.99
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 1418354 ns 1400250 ns 1.01
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 1388291.5 ns 1394771 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 1637459 ns 1643270.5 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 1604854 ns 1597458 ns 1.00
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 273532 ns 277863 ns 0.98
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 334 ns 375 ns 0.89
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 416 ns 375 ns 1.11
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 417 ns 0.90
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 292 ns 292 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 31612 ns 31425 ns 1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6458 ns 6750 ns 0.96
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6333 ns 6833 ns 0.93
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6791 ns 6834 ns 0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6209 ns 6208 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 264028.5 ns 261831.5 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1758708 ns 1726416.5 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1729729 ns 1745625 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1739750.5 ns 1724625 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1726646 ns 1725854 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 168260 ns 169678 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4420959 ns 4357021 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4347666 ns 3978291.5 ns 1.09
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4379083 ns 4384375 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4357583 ns 4359458.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1226950 ns 1215814 ns 1.01
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 6875 ns 6750 ns 1.02
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 6542 ns 6875 ns 0.95
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 7083.5 ns 7312.5 ns 0.97
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 6833 ns 6792 ns 1.01
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 20701 ns 20951 ns 0.99
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 51979 ns 48417 ns 1.07
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 32583 ns 33583 ns 0.97
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 52750 ns 73208.5 ns 0.72
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 50874.5 ns 70500 ns 0.72
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 210813.5 ns 288573 ns 0.73
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 327625.5 ns 360125 ns 0.91
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 347916.5 ns 330312.5 ns 1.05
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 408917 ns 410854.5 ns 1.00
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 320146 ns 324312.5 ns 0.99
batchedmm(2, Bsize=512)/forward/GPU/CUDA 18775 ns 18716 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 718958 ns 717250 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 737917 ns 741709 ns 0.99
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 1036166 ns 1036125.5 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 680500 ns 667292 ns 1.02
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 345085 ns 340218.5 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 75791 ns 75417 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 75584 ns 75208 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 75375 ns 75292 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 75375 ns 75333 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 47372 ns 46771 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 334958 ns 325792 ns 1.03
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 328625 ns 333167 ns 0.99
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 327541 ns 325417 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 326083 ns 324333 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 213955 ns 208628 ns 1.03
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1483750 ns 1487791 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1525917 ns 1523333 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1527208 ns 1526708 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1462750 ns 1466375 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 52155 ns 51173 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5146333 ns 5109167 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5307500 ns 5274250 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5288187.5 ns 5289270.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4992312.5 ns 4981458.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 202940.5 ns 201765 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 28209 ns 28250 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 28208 ns 28167 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 28334 ns 28250 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 28250 ns 28208 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 25072 ns 24387 ns 1.03
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 66750 ns 66625 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 66500 ns 66250 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 66667 ns 66375 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66542 ns 66792 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 534798 ns 518482.5 ns 1.03
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 1392708 ns 1471916.5 ns 0.95
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 1134042 ns 936458 ns 1.21
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 1145604 ns 1142000 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) 2197625 ns 2245542 ns 0.98
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA 584489.5 ns 593805 ns 0.98
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) 3059625 ns 3051000 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 2735708.5 ns 2625979.5 ns 1.04
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) 2741625 ns 2744916 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) 3826667 ns 3827125 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA 2073138 ns 2034429 ns 1.02
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) 8924354 ns 8759417 ns 1.02
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) 8783666.5 ns 8720687.5 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) 8795334 ns 8789874.5 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) 6371000 ns 6417375 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 84500 ns 83687.5 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 84709 ns 82438 ns 1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 85125 ns 83416.5 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 84208 ns 82771 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 194258 ns 194015.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2051771 ns 2015417 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2021959 ns 2036291 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2024188 ns 2016500 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2022500 ns 2009667 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 803766 ns 802404 ns 1.00
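
Judging by the rows above, the Ratio column is the Current time divided by the Previous time, rounded to two decimals, so values above 1 mean the current commit was slower. A minimal sketch of that arithmetic follows, using three rows copied from the table. The helper names and the 1.2 cutoff are assumptions for illustration, not settings taken from the benchmark workflow.

```python
# Sketch: reproduce the Ratio column and flag rows that got noticeably slower.
# `ratio`, `slower_rows`, and the 1.2 threshold are hypothetical, chosen for illustration.

def ratio(current_ns: float, previous_ns: float) -> float:
    """Return current/previous; a value above 1 means the current commit is slower."""
    return current_ns / previous_ns


def slower_rows(rows, threshold=1.2):
    """Return the names of benchmarks whose ratio exceeds the (assumed) threshold."""
    return [name for name, cur, prev in rows if ratio(cur, prev) > threshold]


# (name, current ns, previous ns), copied from the table above.
rows = [
    ("batchedmm(2, Bsize=512)/forward/CPU/2 thread(s)", 327625.5, 360125.0),
    ("batchedmm(2, Bsize=512)/forward/CPU/4 thread(s)", 347916.5, 330312.5),
    ("mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s)", 1134042.0, 936458.0),
]

for name, cur, prev in rows:
    print(f"{name}: {ratio(cur, prev):.2f}")

print("flagged:", slower_rows(rows))
```

With this cutoff, only the mlp7layer_bn forward run on 4 CPU threads is flagged. Timings this small are noisy, so a single ratio is a prompt to rerun the benchmark and not proof of a regression.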

This comment was automatically generated by workflow using github-action-benchmark.
