chore: bump crate-ci/typos from 1.28.1 to 1.28.2 (#1130)
Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.28.1 to 1.28.2.
- [Release notes](https://github.com/crate-ci/typos/releases)
- [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)
- [Commits](crate-ci/typos@v1.28.1...v1.28.2)

---
updated-dependencies:
- dependency-name: crate-ci/typos
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
dependabot[bot] authored Dec 9, 2024
1 parent 546798a commit 59c0c69
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion .github/workflows/QualityCheck.yml
@@ -16,4 +16,4 @@ jobs:
 - name: Checkout Actions Repository
   uses: actions/checkout@v4
 - name: Check spelling
-  uses: crate-ci/typos@v1.28.1
+  uses: crate-ci/typos@v1.28.2

1 comment on commit 59c0c69

@github-actions

Lux Benchmarks
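
In the table that follows, the Ratio column appears to be the Current timing divided by the Previous timing, rounded to two decimals, so values above 1 indicate that the current commit measured slower. A minimal sketch of that arithmetic, using three rows copied from the table; the helper names and the 1.2 threshold are illustrative assumptions, not part of the benchmark tooling:

```python
def ratio(current_ns: float, previous_ns: float) -> float:
    """Current / Previous, rounded to two decimals as the table appears to do."""
    return round(current_ns / previous_ns, 2)


def flag_regressions(rows, threshold=1.2):
    """Return (name, ratio) for rows whose ratio exceeds the threshold.

    The threshold is a hypothetical cutoff chosen only for illustration.
    """
    flagged = []
    for name, current_ns, previous_ns in rows:
        r = ratio(current_ns, previous_ns)
        if r > threshold:
            flagged.append((name, r))
    return flagged


# (name, current ns, previous ns), copied from rows of the table
rows = [
    ("layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)", 4041, 3958),
    ("layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)", 184125, 140083),
    ("groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)", 81542, 130188),
]

for name, r in flag_regressions(rows):
    print(f"{r:.2f}  {name}")
```

Since this commit only changes the version of a spell-checking action, large ratios on individual rows are more plausibly run-to-run noise than real regressions; a single comparison like this cannot distinguish the two.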

Benchmark suite Current: 59c0c69 Previous: 1ea272a Ratio
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 4041 ns 3958 ns 1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5209 ns 4791 ns 1.09
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 5333 ns 4792 ns 1.11
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3937.5 ns 3958 ns 0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 59316 ns 59494 ns 1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 10250 ns 10750 ns 0.95
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 11083 ns 10959 ns 1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 11375 ns 10125 ns 1.12
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 10542 ns 10562.5 ns 1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 416506 ns 417797.5 ns 1.00
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1167 ns 1125 ns 1.04
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 1292 ns 1292 ns 1
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 1416 ns 1208 ns 1.17
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 1167 ns 1083 ns 1.08
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 18132 ns 18173 ns 1.00
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 4020.5 ns 4083 ns 0.98
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 4250 ns 3417 ns 1.24
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4000 ns 4250 ns 0.94
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 4166 ns 3709 ns 1.12
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 107833.5 ns 107683 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 70208 ns 70750 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 58667 ns 64000 ns 0.92
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 64125 ns 64250 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 79750 ns 83042 ns 0.96
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 36663 ns 36561 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2033104 ns 2030500 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2103708 ns 2082541.5 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2094916 ns 2089104 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2002834 ns 2008667 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 192404.5 ns 193196.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 184125 ns 140083 ns 1.31
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 189792 ns 181291 ns 1.05
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 186063 ns 181167 ns 1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 185125 ns 185250 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 166512 ns 166362 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1118896 ns 1120708 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1163979 ns 1119000 ns 1.04
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1120500 ns 1120041.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1129854 ns 1124104 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 516905 ns 525948 ns 0.98
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3375 ns 3334 ns 1.01
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 3917 ns 4125 ns 0.95
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 5041 ns 3729.5 ns 1.35
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3333.5 ns 3542 ns 0.94
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 70027 ns 70915 ns 0.99
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9166 ns 9125 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9125 ns 9542 ns 0.96
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9042 ns 8708 ns 1.04
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8625 ns 8875 ns 0.97
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 443773 ns 475931.5 ns 0.93
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 19084 ns 15062.5 ns 1.27
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 15375 ns 15250 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 18375 ns 17437.5 ns 1.05
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 14625 ns 15375 ns 0.95
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 54384.5 ns 53231 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 225917 ns 216458.5 ns 1.04
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 214542 ns 225042 ns 0.95
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 215125 ns 213541.5 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 213000 ns 222375 ns 0.96
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 268804 ns 270372.5 ns 0.99
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 500 ns 500 ns 1
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 791 ns 750 ns 1.05
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 750 ns 666 ns 1.13
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 541 ns 583 ns 0.93
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 17259 ns 17324 ns 1.00
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1417 ns 1500 ns 0.94
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1542 ns 1520.5 ns 1.01
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1833 ns 1750 ns 1.05
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1667 ns 1500 ns 1.11
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 100025 ns 100368.5 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 8917 ns 8125 ns 1.10
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6417 ns 8125 ns 0.79
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 8042 ns 7041 ns 1.14
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10334 ns 10667 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 23360 ns 22992 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 233625 ns 234000 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 230375 ns 239937.5 ns 0.96
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 230166 ns 228833.5 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 225083 ns 222271 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 165842 ns 167254 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 3875 ns 3834 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4000 ns 3958 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 3958 ns 3917 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 3875 ns 3916 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 23503 ns 23377 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 17333 ns 16833 ns 1.03
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 17125 ns 16667 ns 1.03
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 18416 ns 18375 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16542 ns 16583 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 160007 ns 160878 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 602459 ns 610542 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 612791 ns 613209 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 611250 ns 634042 ns 0.96
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 609583 ns 609000 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 113120 ns 113540.5 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1422458 ns 1430375 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1432875 ns 1420292 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1432708.5 ns 1446167 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 1421250 ns 1425542 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 210148 ns 210405 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) 1073292 ns 1076083 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) 969125 ns 968959 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) 1355229 ns 1348187.5 ns 1.01
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) 1303542 ns 1290083 ns 1.01
lenet(28, 28, 1, 64)/forward/GPU/CUDA 271326.5 ns 272167 ns 1.00
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) 5773875 ns 5791000 ns 1.00
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) 4524834 ns 4597104 ns 0.98
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) 4956520.5 ns 4948917 ns 1.00
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) 5616459 ns 5522395.5 ns 1.02
lenet(28, 28, 1, 64)/zygote/GPU/CUDA 1077047.5 ns 1076534 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 542 ns 542 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 542 ns 542 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 542 ns 542 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 500 ns 542 ns 0.92
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 23520 ns 23590 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2208 ns 2166 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2167 ns 2125 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2167 ns 2208 ns 0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2084 ns 2125 ns 0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 169620 ns 173376 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 4250 ns 3917 ns 1.09
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 4125 ns 4208 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 4708 ns 5125 ns 0.92
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 3833 ns 4083.5 ns 0.94
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 64369 ns 65133.5 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11375 ns 10979.5 ns 1.04
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11750 ns 11375 ns 1.03
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 11875 ns 11667 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 11292 ns 11125 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 441357.5 ns 444460.5 ns 0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6292 ns 5959 ns 1.06
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6750 ns 6416 ns 1.05
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7166 ns 7209 ns 0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6333 ns 6333 ns 1
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 51996.5 ns 51265 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 18312.5 ns 17125 ns 1.07
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 18083 ns 17208 ns 1.05
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 19833 ns 17709 ns 1.12
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 18125 ns 17500 ns 1.04
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 297374 ns 297640 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 584 ns 541 ns 1.08
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 625 ns 542 ns 1.15
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 625 ns 625 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 541 ns 625 ns 0.87
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 32515 ns 32574 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 8959 ns 8312.5 ns 1.08
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 8917 ns 8395.5 ns 1.06
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9083 ns 8834 ns 1.03
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 8542 ns 9125 ns 0.94
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 156580.5 ns 156527 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 96500 ns 96458 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 96458 ns 96250 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 96666.5 ns 95958 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 96458 ns 97333 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 111507.5 ns 111569 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 282542 ns 279917 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 294792 ns 272666 ns 1.08
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 278250 ns 276958 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 274042 ns 291791 ns 0.94
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 185832.5 ns 184593 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) 3410792 ns 3390792 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) 2893584 ns 3045416 ns 0.95
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) 3043771 ns 3031500 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) 3950938 ns 3960417 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA 573403 ns 572942 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) 7640458 ns 7593625 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) 7363916.5 ns 7437042 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) 7444583 ns 7444584 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) 8213291 ns 8265979 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA 1371137.5 ns 1334670 ns 1.03
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) 17504417 ns 12605208 ns 1.39
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) 17685667 ns 17554084 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) 17570042 ns 17556062 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) 14113396 ns 14272042 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23914500 ns 24062729 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 43551541 ns 34415959 ns 1.27
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37461209 ns 37185584 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34611021 ns 34968250 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1841718.5 ns 1858779 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 313175916 ns 317027145.5 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 178521083 ns 233784625 ns 0.76
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 195096687.5 ns 195359167 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 279780167 ns 280568396 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 13887340 ns 13916432 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 273572625 ns 273605875 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 278931729 ns 269293459 ns 1.04
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 256343958 ns 251015375 ns 1.02
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 474930271 ns 332609042 ns 1.43
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 21875 ns 21834 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 22459 ns 21750 ns 1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 23250 ns 25500 ns 0.91
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 21334 ns 22916 ns 0.93
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 95397 ns 95464 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 111375 ns 118125 ns 0.94
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 104624.5 ns 103417 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 104666 ns 104833.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 103604.5 ns 104125 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 506209 ns 509331.5 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5833.5 ns 5417 ns 1.08
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6041 ns 6500 ns 0.93
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 6667 ns 6500 ns 1.03
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5875 ns 5583.5 ns 1.05
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 68686 ns 67886 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14834 ns 14625 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15792 ns 15292 ns 1.03
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 16375 ns 15542 ns 1.05
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14834 ns 14917 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 476339.5 ns 472243.5 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3078645.5 ns 3101833 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2149083 ns 2134333 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2304458.5 ns 2303021 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4677166 ns 5007292 ns 0.93
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA 589471.5 ns 586798 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 23611208 ns 23546583 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18335958 ns 18840521 ns 0.97
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 17863458.5 ns 18012083 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 35453375 ns 36120167 ns 0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2770157.5 ns 2918041 ns 0.95
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 33321333 ns 33910770.5 ns 0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 27967958 ns 27527417 ns 1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 27533500 ns 28620667 ns 0.96
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 41461333 ns 41842979 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 72791 ns 72812 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 73083 ns 74542 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 81187.5 ns 74187.5 ns 1.09
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 73875 ns 72666 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 100825.5 ns 101631 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 316333.5 ns 292354.5 ns 1.08
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 318437.5 ns 217084 ns 1.47
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 323125 ns 315166 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 308937.5 ns 292458 ns 1.06
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 537489.5 ns 549955 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 11625 ns 11541.5 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 12083 ns 11791 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 12125 ns 12250 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 11959 ns 11417 ns 1.05
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 69997 ns 70877.5 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 26834 ns 26084 ns 1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 26959 ns 26583 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 27958 ns 28645.5 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26791.5 ns 27000 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 466299.5 ns 471342.5 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 12625 ns 12083.5 ns 1.04
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 12604.5 ns 12250 ns 1.03
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 13958 ns 13292 ns 1.05
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 12208 ns 12417 ns 0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 52579.5 ns 52255 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 25958 ns 25416 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 26333 ns 25750 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 26583 ns 25791 ns 1.03
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26541 ns 26459 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 298632 ns 302749 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 179625 ns 178333 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 179458 ns 179875 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 183083 ns 180750 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 188958 ns 180334 ns 1.05
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 56675.5 ns 56120 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 595770.5 ns 581770.5 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 595666 ns 583250 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 584792 ns 583208.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 582042 ns 589771 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 280995 ns 283667.5 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5959 ns 6187.5 ns 0.96
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6375 ns 6333 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7125 ns 6354.5 ns 1.12
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6042 ns 6250 ns 0.97
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 69870 ns 70397 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14166 ns 13583 ns 1.04
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14917 ns 14000 ns 1.07
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15625 ns 14792 ns 1.06
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14458 ns 14709 ns 0.98
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 454188 ns 461030.5 ns 0.99
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 1239500 ns 1242791 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 1321583 ns 1300208 ns 1.02
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 1360666.5 ns 1359354 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 1089687 ns 1186229.5 ns 0.92
batchedmm(512, Bsize=4)/forward/GPU/CUDA 302025.5 ns 301478 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 4119041 ns 4116667 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 4588250 ns 4395875 ns 1.04
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 4571375 ns 4529125 ns 1.01
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 3710875 ns 3917271.5 ns 0.95
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1037839.5 ns 1038425.5 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1833 ns 1792 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1875 ns 1833 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1833 ns 1834 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1834 ns 1875 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 23678 ns 23500 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4959 ns 4834 ns 1.03
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4916 ns 4834 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 4875 ns 4917 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4917 ns 4958 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 190693.5 ns 188737.5 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5792 ns 5584 ns 1.04
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6167 ns 6084 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7042 ns 7333 ns 0.96
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5875 ns 5959 ns 0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 55855.5 ns 55083.5 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 11792 ns 10750 ns 1.10
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 11125 ns 11209 ns 0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 11250 ns 11542 ns 0.97
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10584 ns 11292 ns 0.94
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 329926.5 ns 335254 ns 0.98
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 333 ns 291 ns 1.14
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 334 ns 334 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 333 ns 375 ns 0.89
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 23017 ns 22752 ns 1.01
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 3000 ns 2750 ns 1.09
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 3083 ns 2792 ns 1.10
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 3041 ns 2709 ns 1.12
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2625 ns 2750 ns 0.95
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 161342 ns 159355.5 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 11875 ns 11084 ns 1.07
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 11833 ns 11458 ns 1.03
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 13042 ns 12854.5 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 11292 ns 12083 ns 0.93
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 57364 ns 57729 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 24959 ns 24167 ns 1.03
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24979.5 ns 24541 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 27250 ns 24916 ns 1.09
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 24458 ns 25167 ns 0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 291490.5 ns 299680 ns 0.97
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4250 ns 4125 ns 1.03
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4291 ns 4125 ns 1.04
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4250 ns 4208 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4166 ns 4209 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 24932 ns 24651 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16500 ns 16166 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16166 ns 16083 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16541 ns 16292 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16250 ns 16042 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 196710 ns 199395 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5791 ns 5667 ns 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5750 ns 5709 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5834 ns 5791 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5791 ns 5791 ns 1
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 33248 ns 33617 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 20792 ns 20020.5 ns 1.04
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 20959 ns 20583 ns 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 21459 ns 21083 ns 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 20542 ns 21042 ns 0.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 175956 ns 175086.5 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 412875 ns 407729 ns 1.01
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 375208 ns 380271 ns 0.99
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 487209 ns 483500 ns 1.01
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 146584 ns 105458.5 ns 1.39
batchedmm(16, Bsize=512)/forward/GPU/CUDA 67216 ns 67085 ns 1.00
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 916708.5 ns 926875 ns 0.99
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 989792 ns 968750 ns 1.02
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 1196125 ns 1173375 ns 1.02
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 476875 ns 378000 ns 1.26
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 190238 ns 188736 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 135084 ns 132583 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 81542 ns 130188 ns 0.63
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 141833 ns 129458 ns 1.10
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 135750 ns 137584 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 193827 ns 192853 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1911291.5 ns 1920250.5 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1946333 ns 1918583 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1928333 ns 1924438 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1910834 ns 1920500 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 394453 ns 409280 ns 0.96
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 333 ns 292 ns 1.14
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 375 ns 0.78
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 22330 ns 21945 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1875 ns 1833 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1833 ns 1875 ns 0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1875 ns 1875 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1792 ns 1834 ns 0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 168958 ns 171197.5 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6625 ns 6042 ns 1.10
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 6792 ns 6625 ns 1.03
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7792 ns 7916.5 ns 0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6666 ns 7042 ns 0.95
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 58760 ns 58992.5 ns 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9667 ns 8791 ns 1.10
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9291 ns 8792 ns 1.06
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9333 ns 9291 ns 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9417 ns 9292 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 298788.5 ns 311073 ns 0.96
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 111820937.5 ns 110075500 ns 1.02
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 181915979 ns 174018250 ns 1.05
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 143480208 ns 143516291 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 92143250 ns 116009417 ns 0.79
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5482249 ns 5438117 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 614702333 ns 617670521 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 582318312.5 ns 555321542 ns 1.05
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 456793479.5 ns 453019437.5 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 623509562.5 ns 637539146 ns 0.98
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 38239772 ns 34975009 ns 1.09
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 796858958 ns 654977875 ns 1.22
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 687543333 ns 666181396 ns 1.03
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 619636833 ns 629801020.5 ns 0.98
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 745741417 ns 742545875 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 62834 ns 61500 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 47791 ns 52500 ns 0.91
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 53250 ns 53125 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83083 ns 85458 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 37226 ns 37175.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1923354 ns 1912375 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1992584 ns 1971000 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1986708.5 ns 1984958.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1895062.5 ns 1907791.5 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 173492.5 ns 173650 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 266916.5 ns 285104 ns 0.94
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 267354.5 ns 265292 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 268666 ns 267750 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 264979 ns 266625 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 127720 ns 130504 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 664125 ns 686125 ns 0.97
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 694604.5 ns 704333 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 650292 ns 683541.5 ns 0.95
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 699958 ns 663104 ns 1.06
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 703429.5 ns 717967 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2256583 ns 2234292 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2246021 ns 2244771 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2238750 ns 2244750 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2261771 ns 2241333.5 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 133355.5 ns 133396.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5510583 ns 5451812.5 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5590125 ns 5487812.5 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5513333 ns 5498042 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5481479.5 ns 5562521 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 809354 ns 754203 ns 1.07
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 669750 ns 685959 ns 0.98
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 680333 ns 670541 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 678166 ns 666167 ns 1.02
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 674417 ns 680000 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 47532 ns 46765 ns 1.02
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1816770.5 ns 1817416 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1665417 ns 1716895.5 ns 0.97
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1717645.5 ns 1744292 ns 0.98
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 2082542 ns 2082750 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 226328.5 ns 220971 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 70125 ns 70125 ns 1
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 59875 ns 53125 ns 1.13
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 52958 ns 52708 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82666 ns 84625 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 28600 ns 28234 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2037917 ns 2030854.5 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2108146 ns 2081770.5 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2092292 ns 2100958 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2001334 ns 2007416 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 190920 ns 188927 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 13460541.5 ns 13472458 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 12543854 ns 12508625 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 12654167 ns 12582124.5 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 15261812.5 ns 15073041.5 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 515830 ns 512756.5 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 47280959 ns 47011770.5 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 42008521 ns 41636000 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 40839333.5 ns 40969375 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 58419750 ns 59058645.5 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2897965 ns 3033111.5 ns 0.96
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 97048750 ns 73891958 ns 1.31
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 91157167 ns 67845145.5 ns 1.34
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 90856333.5 ns 92214500 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 76444354 ns 99774291.5 ns 0.77
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 72334 ns 71166.5 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 47292 ns 64583 ns 0.73
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 65375 ns 65791 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82584 ns 84792 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 46194 ns 47424 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1929937 ns 1905937.5 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1984583.5 ns 1967666.5 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1983584 ns 1977375 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1888750 ns 1898333.5 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 189040 ns 192864 ns 0.98
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 417 ns 292 ns 1.43
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 416 ns 0.90
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 292 ns 416 ns 0.70
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 32091 ns 32583 ns 0.98
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6541 ns 6041 ns 1.08
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6458 ns 6125 ns 1.05
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6583 ns 6459 ns 1.02
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 5958 ns 6542 ns 0.91
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 172109 ns 172656.5 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 250 ns 1.17
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 291 ns 333 ns 0.87
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 32297 ns 32498 ns 0.99
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 2917 ns 2708 ns 1.08
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 2834 ns 2709 ns 1.05
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 2875 ns 2875 ns 1
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 2625 ns 2875 ns 0.91
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 162091 ns 162027.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 279890375 ns 278479062 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 347812250 ns 339860437.5 ns 1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 310658166.5 ns 309104833 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 261239625 ns 282371084 ns 0.93
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 7100472 ns 7112114 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 994066791 ns 997282375 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 960267958 ns 939909542 ns 1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 837209229.5 ns 834322792 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1129871667 ns 1020744375 ns 1.11
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 34010568 ns 34065304 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1752205958 ns 1416221791.5 ns 1.24
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1693119292 ns 1324822042 ns 1.28
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1650193041 ns 1631228625 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1306363020.5 ns 1675762813 ns 0.78
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1458375 ns 1450812.5 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1463959 ns 1456521 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1465625 ns 1455333 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1459625 ns 1460167 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 127763 ns 127677 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5012416 ns 5023459 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5066791 ns 5018833 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5033750 ns 5024791.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5030375 ns 5045271 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 508105 ns 588360 ns 0.86
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) 158175666 ns 157992750 ns 1.00
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) 166759458.5 ns 148446708 ns 1.12
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) 90721479 ns 164732625 ns 0.55
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) 151859250 ns 153538583.5 ns 0.99
vgg16(32, 32, 3, 32)/forward/GPU/CUDA 4851019 ns 4886668 ns 0.99
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) 669929250 ns 637312250 ns 1.05
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) 560789291 ns 611560250 ns 0.92
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) 487588708 ns 470585834 ns 1.04
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) 651112083 ns 662978834 ns 0.98
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA 16518582 ns 16094164 ns 1.03
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 8927708.5 ns 8954458 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 9111000 ns 9014875 ns 1.01
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 7978437.5 ns 7941438 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 10091416 ns 10320875 ns 0.98
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1611554 ns 1593595 ns 1.01
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 36693146 ns 37088334 ns 0.99
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 39523229 ns 37925916.5 ns 1.04
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 34135874.5 ns 34179167 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 59280958 ns 39118729 ns 1.52
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6506722 ns 6471873.5 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 47437.5 ns 47416 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 47500 ns 47292 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 47542 ns 47459 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 47292 ns 47458 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 18176 ns 18458 ns 0.98
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 50417 ns 50250 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 50458 ns 50291 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 50708 ns 50834 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 50333 ns 50458 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 194955.5 ns 188984 ns 1.03
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7145.5 ns 6125 ns 1.17
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 7292 ns 6708 ns 1.09
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 8084 ns 7875 ns 1.03
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 6667 ns 7042 ns 0.95
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 93154.5 ns 89761 ns 1.04
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10417 ns 9750 ns 1.07
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10166 ns 10125 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10250 ns 10250 ns 1
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9833 ns 10541 ns 0.93
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 540639.5 ns 516571.5 ns 1.05
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6333 ns 5750 ns 1.10
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6375 ns 5958.5 ns 1.07
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7479.5 ns 7417 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5166 ns 6417 ns 0.81
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 107337.5 ns 106479.5 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 13709 ns 12750 ns 1.08
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 13000 ns 13042 ns 1.00
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13542 ns 13291 ns 1.02
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 13416.5 ns 13270.5 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 470745 ns 479931 ns 0.98
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 1125 ns 958 ns 1.17
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 1041 ns 959 ns 1.09
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 1083 ns 1125 ns 0.96
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 1083 ns 1083 ns 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 32066 ns 32924 ns 0.97
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8333 ns 7542 ns 1.10
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8084 ns 8000 ns 1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8042 ns 7958 ns 1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7875 ns 8250 ns 0.95
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 196744 ns 200265 ns 0.98
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 23791.5 ns 22875 ns 1.04
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 23209 ns 23041 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 23291 ns 23917 ns 0.97
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 23041 ns 23208 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 18317 ns 18525 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 52583 ns 52208 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 52625 ns 52583 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 52833 ns 52625 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 52417 ns 52542 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 274763.5 ns 267460 ns 1.03
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1458625 ns 1451417 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1464021 ns 1459084 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1466000 ns 1459500 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1454708 ns 1465416.5 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 195560 ns 196174 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5020749.5 ns 5014166.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5048500 ns 5005062.5 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5032583 ns 5014250 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5015271 ns 5037250 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 577898 ns 579761 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3133854.5 ns 3149500 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2152167 ns 1975646 ns 1.09
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2319584 ns 2323562.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4994354 ns 4912270.5 ns 1.02
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 580258 ns 583087.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 24444667 ns 24421562.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 19072896 ns 19801250.5 ns 0.96
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 19040875 ns 18967959 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 36840083 ns 37230000 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2865056 ns 2963899 ns 0.97
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 34088208 ns 34154937.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 28581417 ns 28340541 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 28009625 ns 28271812.5 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 41680458.5 ns 43122000 ns 0.97
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 141268000 ns 140810292 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 143350625 ns 143457875 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 120743271 ns 120969000 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 188129709 ns 190332292 ns 0.99
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22550783 ns 22567410 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 2324854792 ns 1439193417 ns 1.62
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 841095084 ns 1035778354.5 ns 0.81
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 1147318167 ns 1029350563 ns 1.11
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 833862645.5 ns 847160583 ns 0.98
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 117903310 ns 118590973 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 84125 ns 72979 ns 1.15
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 78250 ns 72229.5 ns 1.08
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 76312 ns 75417 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 71667 ns 73416.5 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 218484.5 ns 210693.5 ns 1.04
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 290458 ns 296396 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 292000 ns 283542 ns 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 305208 ns 309000 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 288208.5 ns 282667 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1198254 ns 1113011 ns 1.08
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 35368791 ns 35428583 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 36524083.5 ns 35740146 ns 1.02
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 31361542 ns 31356458 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 38859354 ns 39882791 ns 0.97
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5837777.5 ns 5846172 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 148171584 ns 148563000 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 157709333 ns 152825542 ns 1.03
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 137631188 ns 135772750.5 ns 1.01
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 150161812.5 ns 153516333 ns 0.98
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 34862440.5 ns 34902152 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 111509000 ns 112450083 ns 0.99
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 181918104.5 ns 173734500 ns 1.05
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 143432542 ns 143024292 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 94189375.5 ns 97164708 ns 0.97
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5460850 ns 5471199 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 497837834 ns 468949292 ns 1.06
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 512628166 ns 523211021 ns 0.98
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 440382167 ns 440488146 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 678623500 ns 623433833.5 ns 1.09
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 35174507 ns 32285967 ns 1.09
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 644936208 ns 800549541 ns 0.81
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 676380021 ns 656663541.5 ns 1.03
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 603539166.5 ns 567293062.5 ns 1.06
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 727707084 ns 735113417 ns 0.99
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) 1357667 ns 1357292 ns 1.00
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) 795375 ns 1006709 ns 0.79
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) 995750 ns 993792 ns 1.00
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) 2104875 ns 2076875 ns 1.01
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA 578115.5 ns 574648.5 ns 1.01
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) 2829624.5 ns 2981104 ns 0.95
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) 2513417 ns 2614562.5 ns 0.96
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) 2616854 ns 2632479 ns 0.99
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) 3785792 ns 3749687.5 ns 1.01
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA 1750025 ns 1705197 ns 1.03
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) 5815812.5 ns 5826896 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) 5906250 ns 5792500 ns 1.02
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) 5802125 ns 5792645.5 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) 2884250 ns 2968021 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 8084 ns 8042 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6333 ns 7000 ns 0.90
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 7042 ns 7042 ns 1
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10541 ns 10875 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 24821 ns 24779 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 213645.5 ns 212208 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 255312.5 ns 233625 ns 1.09
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 220667 ns 220750 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 205667 ns 209750 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 248152.5 ns 246929 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) 293659209 ns 452114625 ns 0.65
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) 259757583 ns 205741771 ns 1.26
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) 158085937.5 ns 181027291.5 ns 0.87
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) 293331625 ns 462543917 ns 0.63
vgg16(32, 32, 3, 64)/forward/GPU/CUDA 7676538.5 ns 7673150.5 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) 1087845916.5 ns 1095771812.5 ns 0.99
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) 950749875 ns 925308125 ns 1.03
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) 812442750 ns 875879750 ns 0.93
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) 1143172250 ns 1183196167 ns 0.97
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA 26848630 ns 26783812 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5542 ns 5125 ns 1.08
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 6084 ns 5312.5 ns 1.15
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7020.5 ns 6375 ns 1.10
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4958 ns 6083 ns 0.82
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 143609 ns 143484 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7292 ns 6875 ns 1.06
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7542 ns 7500 ns 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7541 ns 7583.5 ns 0.99
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7042 ns 7708 ns 0.91
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 603161 ns 569216 ns 1.06
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 583 ns 500 ns 1.17
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 625 ns 542 ns 1.15
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 500 ns 584 ns 0.86
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 500 ns 584 ns 0.86
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 23552 ns 23876 ns 0.99
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9167 ns 8584 ns 1.07
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9250 ns 8917 ns 1.04
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9416 ns 9500 ns 0.99
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9084 ns 9292 ns 0.98
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 220124 ns 202303 ns 1.09
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 380958 ns 352875 ns 1.08
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 352792 ns 382959 ns 0.92
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 352625 ns 352625 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 350834 ns 351625 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 21325 ns 21342 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 832062.5 ns 776270.5 ns 1.07
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 827458 ns 810812.5 ns 1.02
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 775062.5 ns 775187.5 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 823188 ns 827583.5 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 278436.5 ns 240060.5 ns 1.16
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 335250 ns 332770.5 ns 1.01
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 327208 ns 332583 ns 0.98
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 451729 ns 451459 ns 1.00
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 12458 ns 9959 ns 1.25
batchedmm(16, Bsize=32)/forward/GPU/CUDA 18024 ns 18163 ns 0.99
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 711291 ns 714000 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 735541 ns 727125 ns 1.01
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 1004041 ns 999833 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 26666 ns 26625 ns 1.00
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 252208 ns 238711 ns 1.06
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 375354.5 ns 374437 ns 1.00
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 336667 ns 347917 ns 0.97
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 439084 ns 440937.5 ns 1.00
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 28875 ns 28792 ns 1.00
batchedmm(16, Bsize=128)/forward/GPU/CUDA 22779 ns 22488 ns 1.01
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 720625 ns 733000 ns 0.98
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 804333.5 ns 778479 ns 1.03
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 1027667 ns 1023541.5 ns 1.00
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 104125 ns 89875 ns 1.16
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 204288.5 ns 205326 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 3500 ns 3354.5 ns 1.04
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 3833 ns 3417 ns 1.12
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 3750 ns 3625 ns 1.03
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 3334 ns 3750 ns 0.89
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 17589 ns 17749 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 4208 ns 4125 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 4375 ns 4292 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 4583 ns 4250 ns 1.08
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 4375 ns 4375 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 247283.5 ns 235900.5 ns 1.05
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3625 ns 3417 ns 1.06
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 3917 ns 4000 ns 0.98
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4667 ns 4041 ns 1.15
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3729 ns 4125 ns 0.90
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 172481 ns 174157.5 ns 0.99
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8479.5 ns 8042 ns 1.05
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8750 ns 8500 ns 1.03
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8584 ns 8125 ns 1.06
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8500 ns 8625 ns 0.99
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 1132928 ns 1076434 ns 1.05
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 206959 ns 207542 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 212916 ns 213916 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 214834 ns 212833 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 200708 ns 202625 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 34121 ns 34097 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 649583.5 ns 601333 ns 1.08
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 623333 ns 633916.5 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 622250 ns 621208 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 613479 ns 582666.5 ns 1.05
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 316951.5 ns 291620 ns 1.09
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 1236916 ns 1245375 ns 0.99
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 1300167 ns 1251750 ns 1.04
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 1184250 ns 1177937.5 ns 1.01
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 1155667 ns 1207083 ns 0.96
batchedmm(128, Bsize=128)/forward/GPU/CUDA 206523 ns 207232 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 4569500 ns 4566750 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 4789500 ns 4712249.5 ns 1.02
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 4471334 ns 4457500 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 4277000 ns 4779979 ns 0.89
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 925498 ns 927700.5 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3500 ns 2958 ns 1.18
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3708 ns 3917 ns 0.95
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4500 ns 3896 ns 1.16
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3084 ns 3833 ns 0.80
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 207464.5 ns 167597.5 ns 1.24
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7542 ns 7167 ns 1.05
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7500 ns 7708 ns 0.97
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7458 ns 7208 ns 1.03
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6750 ns 7459 ns 0.90
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 942535 ns 944745 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1661083 ns 1646750 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1212459 ns 1186708 ns 1.02
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1388375 ns 1375541.5 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2367291.5 ns 2434792 ns 0.97
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA 214118 ns 214131 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12379333 ns 12360250 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9634187.5 ns 9584833 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9303250.5 ns 9257792 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 17994791.5 ns 18118625 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1954978 ns 1941495.5 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17400125 ns 17409917 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14391542 ns 14369603.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14366500 ns 14347521 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 20976166.5 ns 21171916 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 134083 ns 85209 ns 1.57
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 134145.5 ns 138875 ns 0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 140125 ns 134958 ns 1.04
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 133834 ns 132917 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 126509 ns 125576 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2067833 ns 2040229.5 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2021792 ns 2026646 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2040375 ns 2030000 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2038229.5 ns 2046729 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 937192.5 ns 954388.5 ns 0.98
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 1250 ns 1000 ns 1.25
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 1542 ns 1292 ns 1.19
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 3500 ns 1791 ns 1.95
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 1041.5 ns 1416 ns 0.74
batchedmm(2, Bsize=4)/forward/GPU/CUDA 16603 ns 16301 ns 1.02
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 2792 ns 2458 ns 1.14
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 2791 ns 2583 ns 1.08
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 2834 ns 2792 ns 1.02
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 2687.5 ns 2875 ns 0.93
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 180632.5 ns 180190.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 8084 ns 8041 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6416 ns 6959 ns 0.92
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6916 ns 7125 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10583 ns 10833 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 33846 ns 33324 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 224958 ns 217125 ns 1.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 230000 ns 220125 ns 1.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 220875 ns 220542 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 206709 ns 207145.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 304013 ns 294304 ns 1.03
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3708 ns 3667 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3750 ns 3666 ns 1.02
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3667 ns 3667 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3667 ns 3750 ns 0.98
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 22588 ns 22268 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14625 ns 14542 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 14250 ns 14458 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14625 ns 14500 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14458 ns 14250 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 437508.5 ns 451646.5 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 145791 ns 135084 ns 1.08
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 141583 ns 135167 ns 1.05
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 142459 ns 145833 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 141375.5 ns 135771 ns 1.04
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 125856 ns 124920.5 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1928792 ns 1931125 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1919959 ns 1923875 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1933062.5 ns 1933583.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1928146 ns 1941584 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 879589 ns 895888.5 ns 0.98
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) 870875 ns 869083.5 ns 1.00
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) 819625 ns 814146 ns 1.01
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) 1235083 ns 1222709 ns 1.01
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) 966104.5 ns 942729 ns 1.02
lenet(28, 28, 1, 32)/forward/GPU/CUDA 274565 ns 269464 ns 1.02
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) 2825084 ns 2833167 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) 2525875 ns 2528333.5 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) 3358499.5 ns 3338750 ns 1.01
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) 3396917 ns 3399146 ns 1.00
lenet(28, 28, 1, 32)/zygote/GPU/CUDA 1523569.5 ns 1538408 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 14750 ns 20750 ns 0.71
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 15375 ns 15041.5 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 16875 ns 16229.5 ns 1.04
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 14958 ns 14959 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 130066 ns 129111.5 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 261458 ns 215916 ns 1.21
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 259875 ns 229604.5 ns 1.13
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 216042 ns 215709 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 220500 ns 224833 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 583170.5 ns 586555.5 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 219729.5 ns 219250 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 220479 ns 220020.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 223250 ns 222125 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 221354.5 ns 219916 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 243425.5 ns 244257 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 510541.5 ns 529291.5 ns 0.96
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 506917 ns 509000 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 498667 ns 509666 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 512312.5 ns 509542 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1217935 ns 1272897.5 ns 0.96
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 3667 ns 3125 ns 1.17
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 4833 ns 4500 ns 1.07
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 5167 ns 4542 ns 1.14
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 3979.5 ns 3959 ns 1.01
batchedmm(16, Bsize=4)/forward/GPU/CUDA 16933 ns 16759 ns 1.01
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 7625 ns 7208 ns 1.06
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 7375 ns 7208 ns 1.02
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 7333 ns 7250 ns 1.01
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 7250 ns 7334 ns 0.99
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 178880.5 ns 181468 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 18792 ns 16792 ns 1.12
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 17542 ns 17062.5 ns 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 19791 ns 17812.5 ns 1.11
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 18125 ns 17250 ns 1.05
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 137690.5 ns 134619.5 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 252708 ns 211583 ns 1.19
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 213500 ns 213625 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 214541 ns 212812.5 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 214000 ns 213083 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 849873 ns 895952 ns 0.95
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 4229.5 ns 3917 ns 1.08
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 4916 ns 4833 ns 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5417 ns 4625 ns 1.17
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 4291.5 ns 4625 ns 0.93
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 168191 ns 212453 ns 0.79
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 10750 ns 10166 ns 1.06
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11042 ns 10417 ns 1.06
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10875 ns 10875 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10042 ns 10583 ns 0.95
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 949351.5 ns 992994 ns 0.96
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3292 ns 3125 ns 1.05
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3625 ns 3709 ns 0.98
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4167 ns 4750 ns 0.88
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3459 ns 3916 ns 0.88
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 206120 ns 212054.5 ns 0.97
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7750 ns 7083 ns 1.09
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7791 ns 7125 ns 1.09
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7666 ns 7583 ns 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7292 ns 7500 ns 0.97
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 985137 ns 1004688 ns 0.98
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23600875 ns 23464771 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 43903313 ns 35060375 ns 1.25
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37710791.5 ns 37779167 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34490521 ns 34969333 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1837956 ns 1848833 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 191551625 ns 184464833.5 ns 1.04
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 186643917 ns 160073583.5 ns 1.17
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 145792667 ns 145086500 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 271888584 ns 445100854 ns 0.61
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 16496336 ns 16527443 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 292672562 ns 271288729 ns 1.08
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 266647854 ns 263438959 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 299377291.5 ns 302324416 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 325821396 ns 496832583.5 ns 0.66
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 184041 ns 181417 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 182292 ns 185458 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 184917 ns 185750 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 183667 ns 181708 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 199516.5 ns 193313 ns 1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 632125 ns 589438 ns 1.07
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 596250 ns 631229 ns 0.94
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 589146 ns 598125 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 634646 ns 590687.5 ns 1.07
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 950274 ns 966959 ns 0.98
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 3923584 ns 3877125 ns 1.01
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 4065250 ns 3946625 ns 1.03
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 3605250 ns 3651083.5 ns 0.99
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 4910271 ns 5012833.5 ns 0.98
batchedmm(128, Bsize=512)/forward/GPU/CUDA 551654.5 ns 530368 ns 1.04
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 16427166.5 ns 17988625 ns 0.91
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 17546270.5 ns 18469458 ns 0.95
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 15424750 ns 17328979.5 ns 0.89
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 41363334 ns 20374792 ns 2.03
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2634322 ns 2619767.5 ns 1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 583 ns 500 ns 1.17
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 583 ns 500 ns 1.17
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 542 ns 583 ns 0.93
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 625 ns 584 ns 1.07
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 32132 ns 32351 ns 0.99
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9500 ns 9041 ns 1.05
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9500 ns 9541.5 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9792 ns 9833 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9541 ns 9500 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 247547.5 ns 247867.5 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) 513820542 ns 498558729 ns 1.03
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) 535432083 ns 468495750 ns 1.14
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) 355647999.5 ns 362160229 ns 0.98
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) 672007125 ns 607173041 ns 1.11
vgg16(32, 32, 3, 128)/forward/GPU/CUDA 12475563.5 ns 12482436 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) 1968156417 ns 1885912604.5 ns 1.04
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) 1778975000 ns 1633604541 ns 1.09
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) 1508167229 ns 1504714375 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) 2144133562.5 ns 2155903916.5 ns 0.99
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA 49346209.5 ns 49283559 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1659562.5 ns 1664666.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1222625 ns 1200396 ns 1.02
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1402292 ns 1387542 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2420750 ns 2441166 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 214908 ns 216027 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12714958 ns 12783813 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 10033625 ns 9969333 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9669250 ns 9630041 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18444395.5 ns 18564625 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2014155.5 ns 2024417 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17720021 ns 17729000 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14836625 ns 14689833 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14593959 ns 14572562.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21470916.5 ns 21460792 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 26208 ns 26167 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 26291 ns 26167 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 26208 ns 26334 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 26250 ns 26250 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 24129 ns 24291 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 67292 ns 67375 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 67166 ns 66792 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 67437.5 ns 67250 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 67125 ns 66916 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 370580.5 ns 376851.5 ns 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 206334 ns 206292 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 212084 ns 213042 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 211708 ns 212292 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 200042 ns 200542 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 25933.5 ns 25875 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 652229 ns 608438 ns 1.07
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 673167 ns 631687.5 ns 1.07
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 623750.5 ns 622729.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 594625 ns 592459 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 321589 ns 328754.5 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 689375 ns 702583 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 686646 ns 644542 ns 1.07
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 603125.5 ns 631083 ns 0.96
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 595854 ns 682250 ns 0.87
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 131463 ns 131950 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2275292 ns 2262083 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2318250 ns 2242917 ns 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2234167 ns 2231125 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2258041 ns 2307979 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1083174.5 ns 1167364 ns 0.93
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17208 ns 17125 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 16708.5 ns 20083 ns 0.83
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 18542 ns 18791 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 26209 ns 18041.5 ns 1.45
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 133722.5 ns 132602 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 233041 ns 229500 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 238708 ns 218833 ns 1.09
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 220895.5 ns 219792 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 247479 ns 230333.5 ns 1.07
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 889518 ns 967555 ns 0.92
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 625 ns 500 ns 1.25
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 625 ns 542 ns 1.15
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 625 ns 584 ns 1.07
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 583 ns 625 ns 0.93
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 23455 ns 23714 ns 0.99
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9917 ns 9417 ns 1.05
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9916.5 ns 9833 ns 1.01
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 10166 ns 9875 ns 1.03
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9709 ns 9833 ns 0.99
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 243499.5 ns 247044.5 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5375 ns 5209 ns 1.03
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5667 ns 5812.5 ns 0.97
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7208 ns 6812.5 ns 1.06
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5166 ns 5916.5 ns 0.87
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 195501 ns 211718.5 ns 0.92
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7750 ns 7084 ns 1.09
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7417 ns 7459 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7750 ns 7667 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7208 ns 7500 ns 0.96
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 713265 ns 739090.5 ns 0.97
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 2459 ns 1917 ns 1.28
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 2375 ns 2208 ns 1.08
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 2250 ns 2250 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 2042 ns 2250 ns 0.91
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 17937 ns 18219 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 6667 ns 6292 ns 1.06
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 6625 ns 6417 ns 1.03
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 6667 ns 6729.5 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 6500 ns 6584 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 295613.5 ns 307391 ns 0.96
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 781188 ns 749208 ns 1.04
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 762250 ns 748625 ns 1.02
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 746542 ns 746500 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 746084 ns 748625 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 21171 ns 21224.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 815833 ns 803167 ns 1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 816958.5 ns 792833 ns 1.03
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 775937.5 ns 792834 ns 0.98
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 810604.5 ns 813166 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 269191.5 ns 271736 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 8042 ns 8125 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6417 ns 7583 ns 0.85
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6958 ns 6959 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10625 ns 10917 ns 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 33169.5 ns 32567.5 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 265000 ns 232666 ns 1.14
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 268728.5 ns 240625 ns 1.12
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 229125 ns 227604 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 217229 ns 258125 ns 0.84
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 330500.5 ns 333854 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 10375 ns 9959 ns 1.04
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 10646 ns 10709 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 11292 ns 10833 ns 1.04
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 10125 ns 10271 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 217068 ns 226295 ns 0.96
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 24542 ns 24167 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 25354.5 ns 24729.5 ns 1.03
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25500 ns 24417 ns 1.04
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 24333 ns 25354.5 ns 0.96
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 1027279 ns 1051998 ns 0.98
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 106479791.5 ns 106630458.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 126041750 ns 117910875 ns 1.07
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 120943833 ns 120489750 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 117512916.5 ns 117867166.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 2638425 ns 2630839 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 384219250 ns 375572750 ns 1.02
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 372791166.5 ns 347200750 ns 1.07
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 338002625 ns 370237167 ns 0.91
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 471273750 ns 484151625 ns 0.97
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 15225519 ns 15207487.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 803612958.5 ns 607408041 ns 1.32
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 771462084 ns 591624416 ns 1.30
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 812264500 ns 811424250 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 607987313 ns 961849167 ns 0.63
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7042 ns 6834 ns 1.03
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7208 ns 6708 ns 1.07
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8166.5 ns 8041 ns 1.02
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6750 ns 7354 ns 0.92
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 210285.5 ns 213896 ns 0.98
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14333 ns 14000 ns 1.02
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14750 ns 15125 ns 0.98
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14000 ns 14458 ns 0.97
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 13792 ns 13666 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 971185 ns 993505 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6209 ns 5958 ns 1.04
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6417 ns 6145.5 ns 1.04
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7500 ns 7458 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5792 ns 6312.5 ns 0.92
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 204318 ns 209272 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12875 ns 12500 ns 1.03
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12583 ns 12625 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13125 ns 13250 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12333 ns 12250 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 695362.5 ns 719970 ns 0.97
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 5042 ns 5000 ns 1.01
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 5625 ns 5667 ns 0.99
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 6250 ns 5500 ns 1.14
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 5958 ns 5458 ns 1.09
batchedmm(2, Bsize=128)/forward/GPU/CUDA 17261 ns 17137 ns 1.01
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 15666 ns 15083 ns 1.04
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 15709 ns 15459 ns 1.02
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 15583 ns 15458 ns 1.01
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 15458 ns 15583 ns 0.99
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 184710.5 ns 185445 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 417 ns 292 ns 1.43
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 333 ns 375 ns 0.89
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 292 ns 375 ns 0.78
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 23615 ns 23381 ns 1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6542 ns 6291 ns 1.04
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6542 ns 6334 ns 1.03
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6542 ns 6520.5 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6208 ns 6541 ns 0.95
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 226664.5 ns 227150.5 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5875 ns 5750 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 5916 ns 5792 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 5833 ns 5834 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5834 ns 5875 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 24395 ns 24282 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 22729 ns 23416.5 ns 0.97
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 21625 ns 20542 ns 1.05
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 21667 ns 21292 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 20854.5 ns 21416 ns 0.97
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 248686 ns 249310.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 192437 ns 192603.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 194875 ns 190208 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 190958 ns 187125 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 198042 ns 189437.5 ns 1.05
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 166753.5 ns 167056.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1364250 ns 1339333.5 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1373333.5 ns 1319750.5 ns 1.04
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1330458 ns 1298333 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1326229.5 ns 1349625 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1188373.5 ns 1248940 ns 0.95
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 23125 ns 22188 ns 1.04
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 23000 ns 22167 ns 1.04
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 24041 ns 23250 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 21667 ns 30833 ns 0.70
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 256526 ns 318042 ns 0.81
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 131208 ns 175104 ns 0.75
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 183125.5 ns 129354 ns 1.42
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 118667 ns 147250 ns 0.81
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 180917 ns 180250 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1281592 ns 1355497.5 ns 0.95
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 333 ns 375 ns 0.89
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 23301 ns 23100 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6833 ns 6167 ns 1.11
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6667 ns 6416 ns 1.04
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6833 ns 6583 ns 1.04
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6417 ns 6583 ns 0.97
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 241136.5 ns 245385 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 4542 ns 4208 ns 1.08
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 5229.5 ns 4625 ns 1.13
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 5125 ns 4833 ns 1.06
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4666 ns 4708 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 219000.5 ns 232572 ns 0.94
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10334 ns 9583 ns 1.08
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10625 ns 10020.5 ns 1.06
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10375 ns 9791 ns 1.06
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10375 ns 10291.5 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 1232626.5 ns 1286978.5 ns 0.96
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1625 ns 1583 ns 1.03
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1625 ns 1625 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1625 ns 1584 ns 1.03
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1625 ns 1667 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 23375 ns 23645 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 6042 ns 5708 ns 1.06
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 6000 ns 5750 ns 1.04
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 5959 ns 5667 ns 1.05
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 5625 ns 5667 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 260915.5 ns 263109.5 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 6837750 ns 6835750 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 6418708 ns 6400459 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 6547416.5 ns 6536604 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 7628667 ns 7672542 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 214982 ns 215618 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 24126020.5 ns 24116958 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 21396208 ns 21263041 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 20992000 ns 20976375 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 29707541 ns 29871542 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2104096.5 ns 2094351.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 48614958 ns 37551959 ns 1.29
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 45739708 ns 34396208.5 ns 1.33
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 45440458 ns 45713375 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 38260167 ns 49651167 ns 0.77
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5917 ns 5583 ns 1.06
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6083 ns 6250 ns 0.97
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7041 ns 6625 ns 1.06
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5708 ns 6625 ns 0.86
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 205307.5 ns 210693 ns 0.97
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8583 ns 8166 ns 1.05
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8959 ns 9000 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8417 ns 8625 ns 0.98
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8125 ns 8500 ns 0.96
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 959724.5 ns 993726 ns 0.97
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) 1564625 ns 1570250 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) 1276958 ns 1273479 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) 1632792 ns 1626896 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) 2147187.5 ns 2142333 ns 1.00
lenet(28, 28, 1, 128)/forward/GPU/CUDA 276902.5 ns 271789 ns 1.02
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) 7938667 ns 7954709 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) 6675417 ns 6282562.5 ns 1.06
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) 7179229.5 ns 7141958 ns 1.01
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) 10466792 ns 10525875 ns 0.99
lenet(28, 28, 1, 128)/zygote/GPU/CUDA 1755348 ns 1760839.5 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 375979.5 ns 377437.5 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 356791.5 ns 378125 ns 0.94
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 453958 ns 450292 ns 1.01
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 31791.5 ns 30500 ns 1.04
batchedmm(128, Bsize=4)/forward/GPU/CUDA 47221 ns 42718 ns 1.11
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 724250 ns 743209 ns 0.97
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 820708 ns 790458 ns 1.04
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 1064167 ns 1051750 ns 1.01
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 93125 ns 123333 ns 0.76
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 222768.5 ns 280362 ns 0.79
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 413500 ns 415750 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 220417 ns 305875 ns 0.72
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 305958 ns 306125 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 758417 ns 757167 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 44850 ns 44026.5 ns 1.02
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 664291 ns 662333 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 464750 ns 523625 ns 0.89
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 524625 ns 524208 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 971875 ns 973917 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 190748 ns 188149 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 660125 ns 698417 ns 0.95
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 688833 ns 669875 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 599208.5 ns 674375 ns 0.89
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 676041 ns 683041.5 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 131585 ns 131691 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2465396 ns 2527000 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2549750 ns 2445791.5 ns 1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2454750 ns 2456458.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2436396 ns 2515459 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1099429.5 ns 1199048 ns 0.92
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 2084 ns 1917 ns 1.09
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 2500 ns 2041.5 ns 1.22
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 4584 ns 2459 ns 1.86
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 2000 ns 2437.5 ns 0.82
batchedmm(2, Bsize=32)/forward/GPU/CUDA 16017 ns 16312 ns 0.98
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 5541 ns 5208 ns 1.06
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 5625 ns 5500 ns 1.02
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 5541 ns 5625 ns 0.99
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 5459 ns 5479.5 ns 1.00
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 183422 ns 184945 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1479917 ns 1481291 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1515750 ns 1524125 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1523083 ns 1521750 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1448834 ns 1447604.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 39978 ns 39655 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5170937.5 ns 5139771 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5319792 ns 5014250 ns 1.06
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5296208 ns 5294625 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4989229.5 ns 5015729.5 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 195522 ns 194949 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3667 ns 3667 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3750 ns 3625 ns 1.03
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3666 ns 3625 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3666 ns 3750 ns 0.98
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 34698 ns 33334 ns 1.04
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 15458 ns 15291 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15292 ns 15083 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15500 ns 15292 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15250 ns 15167 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 348167 ns 349359.5 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 96375 ns 94542 ns 1.02
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 104834 ns 103166 ns 1.02
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 94000 ns 103209 ns 0.91
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 92875 ns 95625 ns 0.97
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 113764.5 ns 113041.5 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 319291 ns 318084 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 326792 ns 316917 ns 1.03
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 317083 ns 316666 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 317375 ns 321750 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 196865 ns 192326 ns 1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 1083 ns 958 ns 1.13
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 1042 ns 959 ns 1.09
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 1042 ns 1083 ns 0.96
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 1000 ns 1083 ns 0.92
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 23731 ns 23389 ns 1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8458 ns 7708 ns 1.10
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8167 ns 7916 ns 1.03
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8500 ns 7959 ns 1.07
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8000 ns 8270.5 ns 0.97
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 250818 ns 246988.5 ns 1.02
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 536458.5 ns 534875 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 514770.5 ns 514875 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 583167 ns 572375 ns 1.02
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 177291.5 ns 256145.5 ns 0.69
batchedmm(128, Bsize=32)/forward/GPU/CUDA 128802.5 ns 129558.5 ns 0.99
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 1430708 ns 1420041.5 ns 1.01
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 1491625 ns 1466708.5 ns 1.02
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 1790583 ns 1756250 ns 1.02
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 862187.5 ns 902625 ns 0.96
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 274040.5 ns 276092.5 ns 0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 375 ns 333 ns 1.13
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 334 ns 1.12
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 375 ns 375 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 32562 ns 31832 ns 1.02
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6750 ns 6084 ns 1.11
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6458 ns 6542 ns 0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6666 ns 6292 ns 1.06
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6584 ns 6292 ns 1.05
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 252956.5 ns 248681.5 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1721104 ns 1729313 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1775187.5 ns 1725667 ns 1.03
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1796833.5 ns 1769167 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1760583 ns 1772187.5 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 168117 ns 168168 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4395271 ns 4416792 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4422959 ns 4351145.5 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4375792 ns 4368958 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4339937.5 ns 4403479.5 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1159629.5 ns 1091804.5 ns 1.06
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 16708.5 ns 7041.5 ns 2.37
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 7042 ns 7333 ns 0.96
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 8000 ns 7375 ns 1.08
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 7125 ns 7375 ns 0.97
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 19970 ns 20581 ns 0.97
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 52520.5 ns 32334 ns 1.62
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 74791 ns 62021 ns 1.21
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 33083 ns 33333 ns 0.99
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 43000 ns 71833 ns 0.60
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 199671 ns 196104.5 ns 1.02
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 17333 ns 17208 ns 1.01
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 17875 ns 17520.5 ns 1.02
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 18229.5 ns 17875 ns 1.02
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 17708 ns 17459 ns 1.01
batchedmm(2, Bsize=512)/forward/GPU/CUDA 18853 ns 18509 ns 1.02
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 53541.5 ns 52875 ns 1.01
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 53500 ns 53625 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 53500 ns 53541 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 53542 ns 53084 ns 1.01
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 320380 ns 318108.5 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 102541.5 ns 104959 ns 0.98
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 109541 ns 107334 ns 1.02
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 99500 ns 107250 ns 0.93
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 97875 ns 101250 ns 0.97
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 47141 ns 46996 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 328250 ns 324500 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 333084 ns 325958 ns 1.02
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 324125 ns 323083 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 324041 ns 327500 ns 0.99
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 213969 ns 208617.5 ns 1.03
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1504750 ns 1506583 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1541208 ns 1549708 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1549666 ns 1549292 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1472416.5 ns 1480958 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 52318 ns 51270 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5156854.5 ns 5143666.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5311833 ns 5297771 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5311062.5 ns 5293084 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4595917 ns 5004625.5 ns 0.92
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 202301 ns 201935.5 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 28250 ns 28125 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 28250 ns 28167 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 28125 ns 28187.5 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 28208 ns 28208 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 24301 ns 24383 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 66917 ns 66666.5 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 66542 ns 66333 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 67750 ns 66459 ns 1.02
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66500 ns 66292 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 492894 ns 489192 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 1505459 ns 1485833 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 959542 ns 1144729 ns 0.84
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 1085458.5 ns 1129875 ns 0.96
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) 2196437.5 ns 2267333 ns 0.97
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA 576585.5 ns 580996.5 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) 3106250 ns 3110979 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 2641667 ns 2747916.5 ns 0.96
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) 2753084 ns 2752750 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) 3807583 ns 3882333 ns 0.98
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA 1958924 ns 1989937 ns 0.98
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) 7926875 ns 7919834 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) 8046333.5 ns 7899375 ns 1.02
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) 7926812.5 ns 7923709 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) 4419125 ns 4904167 ns 0.90
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 134333 ns 77917 ns 1.72
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 140333 ns 139667 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 135750 ns 140875 ns 0.96
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 136000 ns 133958 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 193966.5 ns 193313 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2042250 ns 2016625 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2053604 ns 2021791 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2031125 ns 2024750 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2012625 ns 2026750 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 738348.5 ns 747334.5 ns 0.99
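
Judging from the rows above, the Ratio column looks like the Current time divided by the Previous time, rounded to two decimals, so values above 1.00 would mean the current commit ran slower. The sketch below recomputes that ratio for three rows copied from the table. The `ratio` helper and the `THRESHOLD` value are hypothetical names chosen for illustration and are not taken from the workflow's configuration.

```python
def ratio(current_ns: float, previous_ns: float) -> float:
    """Current time over previous time; above 1.0 means the current run was slower."""
    return current_ns / previous_ns


# (current ns, previous ns) pairs copied from the table above.
rows = {
    "groupnorm forward/CPU/2 thread(s)": (134333, 77917),
    "mlp7layer_bn forward/CPU/4 thread(s)": (959542, 1144729),
    "batchnorm zygote/CPU/1 thread(s)": (4595917, 5004625.5),
}

# Hypothetical cutoff for flagging a slowdown; the real workflow setting is not shown here.
THRESHOLD = 1.5

for name, (current_ns, previous_ns) in rows.items():
    r = ratio(current_ns, previous_ns)
    flag = "possible regression" if r > THRESHOLD else "ok"
    print(f"{name}: {r:.2f} ({flag})")
```

A single large ratio such as 1.72 on one thread count, with the neighbouring thread counts near 1.00, may simply be run-to-run noise on the CI machine, so it is worth re-running before treating it as a real regression.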

This comment was automatically generated by workflow using github-action-benchmark.
