feat: support passing in device and client to XLA #1020
57de396 to 2e5ff52
I want to wait for EnzymeAD/Reactant.jl#222 which exposes |
Benchmark Results (ASV)
Benchmark Plots
A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
Lux Benchmarks
Benchmark suite | Current: 1514e7d | Previous: 8bfa628 | Ratio |
---|---|---|---|
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4375 ns | 4625 ns | 0.95 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 4375 ns | 4084 ns | 1.07 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 5938 ns | 5791 ns | 1.03 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 4125 ns | 4292 ns | 0.96 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 60068 ns | 60959 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 10042 ns | 10125 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 10291 ns | 9959 ns | 1.03 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 10042 ns | 10375 ns | 0.97 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 10458 ns | 10666 ns | 0.98 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 424087.5 ns | 427044 ns | 0.99 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 959 ns | 1167 ns | 0.82 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 1250 ns | 1250 ns | 1 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 1542 ns | 1458 ns | 1.06 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 3042 ns | 3542 ns | 0.86 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA | 18235 ns | 18260 ns | 1.00 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 4084 ns | 4125 ns | 0.99 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 4208 ns | 3833 ns | 1.10 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 4209 ns | 4125 ns | 1.02 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 4041 ns | 4000 ns | 1.01 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA | 111236.5 ns | 111381 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57000 ns | 57709 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46417 ns | 47250 ns | 0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 38292 ns | 38250 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 81667 ns | 80333 ns | 1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 37054 ns | 37655 ns | 0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2042458 ns | 2026167 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2082375 ns | 2092708.5 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2085396 ns | 2059625.5 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1988541 ns | 1993416 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 196865 ns | 197377 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 144042 ns | 152958 ns | 0.94 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 146124.5 ns | 148250 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 145770.5 ns | 146417 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 148000 ns | 150375 ns | 0.98 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 165635 ns | 167595 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1112500 ns | 1098542 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1113875 ns | 1124250 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1147125 ns | 1116146 ns | 1.03 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1116187.5 ns | 1107229.5 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 526497 ns | 523151 ns | 1.01 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 4250 ns | 3584 ns | 1.19 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 3542 ns | 3625 ns | 0.98 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 4916 ns | 5708.5 ns | 0.86 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 3333 ns | 3417 ns | 0.98 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 71425.5 ns | 70157 ns | 1.02 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8834 ns | 8834 ns | 1 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9500 ns | 8667 ns | 1.10 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8667 ns | 9291 ns | 0.93 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9083 ns | 9042 ns | 1.00 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 480134 ns | 492826.5 ns | 0.97 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 16875 ns | 17000 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17083 ns | 16375 ns | 1.04 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 17667 ns | 18667 ns | 0.95 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 16958 ns | 17083 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 55287.5 ns | 54850 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 225708 ns | 213146 ns | 1.06 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 221375 ns | 216104 ns | 1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 221125 ns | 214167 ns | 1.03 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 213667 ns | 225333 ns | 0.95 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 276702.5 ns | 272672.5 ns | 1.01 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 500 ns | 459 ns | 1.09 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 542 ns | 542 ns | 1 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 667 ns | 709 ns | 0.94 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 459 ns | 583 ns | 0.79 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA | 17631 ns | 17542 ns | 1.01 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 1459 ns | 1708 ns | 0.85 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 1541 ns | 1458 ns | 1.06 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 1583.5 ns | 1625 ns | 0.97 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 1708 ns | 1750 ns | 0.98 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA | 104160 ns | 104205 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7167 ns | 7250 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5875 ns | 5833 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5208 ns | 5209 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 9917 ns | 4000 ns | 2.48 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 24094 ns | 23961 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 274729 ns | 228750.5 ns | 1.20 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 271146 ns | 228333 ns | 1.19 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 238083.5 ns | 228500 ns | 1.04 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 227646.5 ns | 226334 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 170934.5 ns | 170956 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 3834 ns | 3875 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 3875 ns | 3875 ns | 1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 3916 ns | 3916 ns | 1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 3834 ns | 3834 ns | 1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA | 23993 ns | 23832 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16875 ns | 16833 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16750 ns | 16708 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 19500 ns | 16708 ns | 1.17 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16750 ns | 16958 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA | 164614 ns | 165501.5 ns | 0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 575542 ns | 579042 ns | 0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 584000 ns | 574375 ns | 1.02 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 605667 ns | 575083 ns | 1.05 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 572083 ns | 576292 ns | 0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA | 113682 ns | 113664 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1428709 ns | 1417708 ns | 1.01 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1417020.5 ns | 1429333 ns | 0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1455458 ns | 1425729.5 ns | 1.02 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 1422708 ns | 1422208 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA | 214581 ns | 214791 ns | 1.00 |
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) | 1083708 ns | 1082104 ns | 1.00 |
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) | 966124.5 ns | 959958.5 ns | 1.01 |
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) | 1334708 ns | 1341792 ns | 0.99 |
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) | 1295562 ns | 1294792 ns | 1.00 |
lenet(28, 28, 1, 64)/forward/GPU/CUDA | 279300 ns | 281583.5 ns | 0.99 |
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) | 5908000 ns | 5777875 ns | 1.02 |
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) | 4591709 ns | 4456083 ns | 1.03 |
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) | 4976209 ns | 4934792 ns | 1.01 |
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) | 5721000 ns | 5627500 ns | 1.02 |
lenet(28, 28, 1, 64)/zygote/GPU/CUDA | 1097188 ns | 1106964 ns | 0.99 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 500 ns | 500 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 500 ns | 500 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 541 ns | 542 ns | 1.00 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 500 ns | 542 ns | 0.92 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA | 24143 ns | 23988 ns | 1.01 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2083 ns | 2084 ns | 1.00 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2083 ns | 2083 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 2167 ns | 2125 ns | 1.02 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2084 ns | 2125 ns | 0.98 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA | 178433 ns | 179026 ns | 1.00 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6375 ns | 6084 ns | 1.05 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6167 ns | 6167 ns | 1 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 6791 ns | 7041 ns | 0.96 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5750 ns | 6375 ns | 0.90 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 65777 ns | 66163.5 ns | 0.99 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 11125 ns | 11291 ns | 0.99 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 11042 ns | 10791 ns | 1.02 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 12167 ns | 12125 ns | 1.00 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10875 ns | 11354.5 ns | 0.96 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 452424 ns | 456626.5 ns | 0.99 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7500 ns | 7000 ns | 1.07 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7020.5 ns | 7042 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8167 ns | 8375 ns | 0.98 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 6417 ns | 7042 ns | 0.91 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 52916.5 ns | 52652 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 17542 ns | 17375 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 17084 ns | 17167 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 18125 ns | 17770.5 ns | 1.02 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 16625 ns | 18708 ns | 0.89 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 306123.5 ns | 306093.5 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 459 ns | 1.09 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 500 ns | 459 ns | 1.09 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 583 ns | 583 ns | 1 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 541 ns | 542 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 32503 ns | 33004 ns | 0.98 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 8625 ns | 8583 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 8666 ns | 8208 ns | 1.06 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9333 ns | 9583 ns | 0.97 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 8875 ns | 9042 ns | 0.98 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 163266.5 ns | 162492.5 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 64583 ns | 64542 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 64459 ns | 64417 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 64542 ns | 64625 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 64708 ns | 64750 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA | 112855 ns | 112347.5 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 286958 ns | 277542 ns | 1.03 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 284333 ns | 281625 ns | 1.01 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 283417 ns | 288750 ns | 0.98 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 282208 ns | 275500 ns | 1.02 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA | 189716 ns | 189809 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) | 3364542 ns | 3285583 ns | 1.02 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) | 3028687.5 ns | 3022333.5 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) | 2790833 ns | 2780375 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) | 4100917 ns | 4038625 ns | 1.02 |
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA | 572405 ns | 573967 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) | 7623458 ns | 7586208.5 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) | 7440791 ns | 7415437 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) | 7366542 ns | 7333375 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) | 8197688 ns | 8220958 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA | 1349771 ns | 1351752.5 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) | 18788459 ns | 18835167 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) | 19166542 ns | 19044834 ns | 1.01 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) | 19272542 ns | 19135125 ns | 1.01 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) | 15690750 ns | 15633417 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 23515500 ns | 23661916.5 ns | 0.99 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 33757666.5 ns | 33965500 ns | 0.99 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 40869250 ns | 41107417 ns | 0.99 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 35082500 ns | 34858709 ns | 1.01 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1844978 ns | 1862815 ns | 0.99 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 189627750 ns | 189289541 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 164852917 ns | 164224708 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 157414459 ns | 157847979 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 439775958 ns | 438904833 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 13911288.5 ns | 13913764 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 289819458.5 ns | 289733584 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 338120021.5 ns | 338173667 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 307472667 ns | 307489541.5 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 394527375 ns | 393585937.5 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 23771 ns | 21708.5 ns | 1.10 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 23812.5 ns | 24458 ns | 0.97 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 25667 ns | 25937 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 22208.5 ns | 24229 ns | 0.92 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 95804 ns | 96907 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 103916.5 ns | 103750 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 103500 ns | 105292 ns | 0.98 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 104042 ns | 104208 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 103500 ns | 151250 ns | 0.68 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 506012.5 ns | 504189 ns | 1.00 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 6958 ns | 6583 ns | 1.06 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7166 ns | 7292 ns | 0.98 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 7917 ns | 7959 ns | 0.99 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 5583 ns | 6958 ns | 0.80 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 68880 ns | 68581 ns | 1.00 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14792 ns | 14916.5 ns | 0.99 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 15062.5 ns | 14709 ns | 1.02 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14834 ns | 16666 ns | 0.89 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 16020.5 ns | 14292 ns | 1.12 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 479169 ns | 483895 ns | 0.99 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 2993625 ns | 3017937 ns | 0.99 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2067437.5 ns | 2022458 ns | 1.02 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2318395.5 ns | 2307959 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4596333.5 ns | 4846645.5 ns | 0.95 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 586471 ns | 585796 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 23496917 ns | 23617917 ns | 0.99 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 18039875 ns | 17975417 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 18250417 ns | 18323812.5 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 34954792 ns | 35597209 ns | 0.98 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3109927.5 ns | 3109235 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 33379041 ns | 33405687.5 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 27647875 ns | 27693604 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 27702417 ns | 27860958 ns | 0.99 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 41697625 ns | 42002937.5 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 73583.5 ns | 72375 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 75396 ns | 84624.5 ns | 0.89 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 84103.5 ns | 83250 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 74875 ns | 73750 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 100532.5 ns | 102852 ns | 0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 323042 ns | 218167 ns | 1.48 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 219458.5 ns | 309979 ns | 0.71 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 320541.5 ns | 317479 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 218083.5 ns | 288875 ns | 0.75 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 540970.5 ns | 550996 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 12667 ns | 12041 ns | 1.05 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 12625 ns | 12729.5 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 13333 ns | 13833 ns | 0.96 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 12792 ns | 11666.5 ns | 1.10 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 70619 ns | 71604 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 26833 ns | 26625 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 26667 ns | 26959 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 27792 ns | 28292 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 26959 ns | 26458 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 470712 ns | 484486.5 ns | 0.97 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 13417 ns | 12417 ns | 1.08 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 12708.5 ns | 12542 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 14500 ns | 14584 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 12625 ns | 13041.5 ns | 0.97 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 52550 ns | 53694 ns | 0.98 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 26500 ns | 26312.5 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 25750 ns | 26270.5 ns | 0.98 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 26937.5 ns | 26667 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 26083 ns | 26333 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 305657 ns | 309291.5 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 181666.5 ns | 178770.5 ns | 1.02 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 182979 ns | 182334 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 183666.5 ns | 184895.5 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 180500 ns | 179750 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 56554 ns | 57908 ns | 0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 587292 ns | 587125 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 583125 ns | 596500 ns | 0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 590791 ns | 593770.5 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 583167 ns | 583166 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 287118 ns | 290369.5 ns | 0.99 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7084 ns | 7354.5 ns | 0.96 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 6958 ns | 7167 ns | 0.97 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 7333 ns | 7875 ns | 0.93 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 6750 ns | 6833 ns | 0.99 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 70418 ns | 70829 ns | 0.99 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14291 ns | 14375 ns | 0.99 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 13917 ns | 14708 ns | 0.95 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 15375 ns | 15625 ns | 0.98 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14542 ns | 14083 ns | 1.03 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 458663 ns | 471312.5 ns | 0.97 |
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) | 1170729 ns | 1235042 ns | 0.95 |
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) | 1237583 ns | 1283583 ns | 0.96 |
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) | 1266125 ns | 1282875 ns | 0.99 |
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) | 1309250 ns | 1325208 ns | 0.99 |
batchedmm(512, Bsize=4)/forward/GPU/CUDA | 348527 ns | 301270 ns | 1.16 |
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) | 4115834 ns | 4111125 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) | 4378458 ns | 4361625 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) | 4798917 ns | 4786395.5 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) | 4449458 ns | 4453229.5 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/GPU/CUDA | 1045715 ns | 1047552 ns | 1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1833 ns | 1792 ns | 1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1791 ns | 1750 ns | 1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1834 ns | 1834 ns | 1 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1833 ns | 1834 ns | 1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA | 23953.5 ns | 23328 ns | 1.03 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 4916 ns | 4833 ns | 1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 4792 ns | 4792 ns | 1 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 4917 ns | 4917 ns | 1 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 4916 ns | 4917 ns | 1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA | 193278 ns | 186698 ns | 1.04 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 7250 ns | 7208.5 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6354 ns | 5584 ns | 1.14 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 8687.5 ns | 8667 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 7375 ns | 7312.5 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 57033 ns | 54539 ns | 1.05 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 11542 ns | 10833 ns | 1.07 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10875 ns | 10834 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 11708 ns | 12375 ns | 0.95 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 11292 ns | 11916 ns | 0.95 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 339762.5 ns | 329099 ns | 1.03 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 333 ns | 334 ns | 1.00 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 333 ns | 0.88 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA | 23315 ns | 22753 ns | 1.02 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2708 ns | 2708 ns | 1 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2792 ns | 2667 ns | 1.05 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 3000 ns | 2959 ns | 1.01 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2708 ns | 3000 ns | 0.90 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA | 163052.5 ns | 157496 ns | 1.04 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
13667 ns |
13167 ns |
1.04 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
13375 ns |
13166 ns |
1.02 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
15083.5 ns |
15000 ns |
1.01 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
13520.5 ns |
13792 ns |
0.98 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA |
57725.5 ns |
55218 ns |
1.05 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
25458 ns |
24833 ns |
1.03 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
24541 ns |
24542 ns |
1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
25333 ns |
25375 ns |
1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
24917 ns |
24709 ns |
1.01 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA |
300059 ns |
289966 ns |
1.03 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) |
4166 ns |
4083 ns |
1.02 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) |
4125 ns |
4166 ns |
0.99 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) |
4167 ns |
4167 ns |
1 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) |
4167 ns |
4125 ns |
1.01 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA |
25178 ns |
24660 ns |
1.02 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) |
16250 ns |
15958 ns |
1.02 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) |
16084 ns |
16417 ns |
0.98 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) |
16042 ns |
16042 ns |
1 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) |
16083 ns |
16125 ns |
1.00 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA |
199817 ns |
194045.5 ns |
1.03 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
5667 ns |
5667 ns |
1 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
5584 ns |
5625 ns |
0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
5708 ns |
5750 ns |
0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
5708 ns |
5791 ns |
0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA |
33846.5 ns |
32989 ns |
1.03 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
21604.5 ns |
21125 ns |
1.02 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
20667 ns |
20459 ns |
1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
21416 ns |
21542 ns |
0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
20958 ns |
20875 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA |
178601.5 ns |
174273 ns |
1.02 |
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) |
404395.5 ns |
403209 ns |
1.00 |
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) |
373854 ns |
371125 ns |
1.01 |
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) |
469333 ns |
474292 ns |
0.99 |
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) |
527166 ns |
539604.5 ns |
0.98 |
batchedmm(16, Bsize=512)/forward/GPU/CUDA |
67224 ns |
66734 ns |
1.01 |
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) |
994917 ns |
1011917 ns |
0.98 |
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) |
871792 ns |
884896 ns |
0.99 |
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) |
1218458 ns |
1220125 ns |
1.00 |
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) |
1362646 ns |
1400208 ns |
0.97 |
batchedmm(16, Bsize=512)/zygote/GPU/CUDA |
192087.5 ns |
190566.5 ns |
1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
80437.5 ns |
82917 ns |
0.97 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
82959 ns |
82791 ns |
1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
83958 ns |
88958.5 ns |
0.94 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
82916.5 ns |
83187.5 ns |
1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
194575 ns |
192556.5 ns |
1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1691687.5 ns |
1921500 ns |
0.88 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1913270.5 ns |
1696166 ns |
1.13 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1935000 ns |
1938083 ns |
1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1693333 ns |
1915875 ns |
0.88 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
398092 ns |
393732 ns |
1.01 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) |
292 ns |
291 ns |
1.00 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) |
291 ns |
291 ns |
1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) |
292 ns |
292 ns |
1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) |
292 ns |
292 ns |
1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA |
22211 ns |
21580 ns |
1.03 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) |
1833 ns |
1792 ns |
1.02 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) |
1792 ns |
1792 ns |
1 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) |
1833 ns |
1833 ns |
1 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) |
1833 ns |
1833 ns |
1 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA |
173510.5 ns |
165924 ns |
1.05 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
8167 ns |
6708 ns |
1.22 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
6167 ns |
6250 ns |
0.99 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
10042 ns |
9750 ns |
1.03 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
7438 ns |
8125 ns |
0.92 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA |
60835.5 ns |
56950.5 ns |
1.07 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
9208 ns |
8916.5 ns |
1.03 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
8958 ns |
8958 ns |
1 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
9000 ns |
9625 ns |
0.94 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
9208 ns |
9542 ns |
0.96 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA |
314546 ns |
299584.5 ns |
1.05 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) |
121181896 ns |
120035854.5 ns |
1.01 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) |
174356250 ns |
174382959 ns |
1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) |
155331667 ns |
154831333 ns |
1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) |
105591312 ns |
103109500 ns |
1.02 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA |
5490856 ns |
5474606 ns |
1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) |
616313021 ns |
617124000 ns |
1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) |
555471625 ns |
555612167 ns |
1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) |
466804583 ns |
468382792 ns |
1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) |
757846375 ns |
756087750 ns |
1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA |
34940752 ns |
38213656 ns |
0.91 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) |
649235042 ns |
651747459 ns |
1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) |
665954333.5 ns |
666674583.5 ns |
1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) |
596507375 ns |
602170708.5 ns |
0.99 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) |
741194667 ns |
734251875 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
57709 ns |
57208 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
47417 ns |
48167 ns |
0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
39167 ns |
39167 ns |
1 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
83833 ns |
83958 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
38045 ns |
37250 ns |
1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1920958 ns |
1929792 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1970917 ns |
1973292 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1819395.5 ns |
1984249.5 ns |
0.92 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1889084 ns |
1881417 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
176927 ns |
171491 ns |
1.03 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
281874.5 ns |
273354 ns |
1.03 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
268333.5 ns |
267959 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
269333 ns |
270687.5 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
266916 ns |
268834 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
128624.5 ns |
124192.5 ns |
1.04 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
674792 ns |
658333 ns |
1.03 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
677166 ns |
674854.5 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
621541 ns |
665333 ns |
0.93 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
681103.5 ns |
670500 ns |
1.02 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
753295 ns |
664813 ns |
1.13 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
2194792 ns |
2190167 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
2200750 ns |
2214354.5 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
2188604 ns |
2216958.5 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
2196250 ns |
2099979 ns |
1.05 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
134058.5 ns |
133238 ns |
1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
5495625 ns |
5505354.5 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
5489000 ns |
5504750 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
5583959 ns |
5565292 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
5501708 ns |
5499708 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
794649 ns |
740235 ns |
1.07 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) |
650458 ns |
650417 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) |
648250 ns |
649020.5 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) |
647729.5 ns |
640625 ns |
1.01 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) |
637958 ns |
648292 ns |
0.98 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA |
47562 ns |
47265 ns |
1.01 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) |
1820292 ns |
1821708 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) |
1718291 ns |
1720959 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) |
1692833 ns |
1675729.5 ns |
1.01 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) |
2104334 ns |
2108500 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA |
226480 ns |
224014 ns |
1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
58000 ns |
58583 ns |
0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
46667 ns |
46645.5 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
38917 ns |
38750 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
83958 ns |
83834 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
28972 ns |
28947 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2032895.5 ns |
2024916 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2080792 ns |
2086188 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1870500 ns |
2100521 ns |
0.89 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1987874.5 ns |
1993416.5 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
192457.5 ns |
191815.5 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) |
13326479 ns |
13473875 ns |
0.99 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) |
12445229 ns |
12547041.5 ns |
0.99 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) |
12578437.5 ns |
12559604 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) |
14851000 ns |
15213416.5 ns |
0.98 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA |
517471.5 ns |
517805 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) |
47244833 ns |
47353458 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) |
41818500 ns |
41833334 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) |
41261562 ns |
41118750 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) |
58290958 ns |
58300041 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA |
3201520 ns |
3203904 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) |
74154459 ns |
74077042 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) |
68317750 ns |
68022250 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) |
90991833 ns |
90906749.5 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) |
76224833 ns |
99115937.5 ns |
0.77 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
57750 ns |
58958 ns |
0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
47209 ns |
47375 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
38750 ns |
38729.5 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
83937.5 ns |
83500 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
46606 ns |
47777 ns |
0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1920104.5 ns |
1923375 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1958583 ns |
1961541 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1752354 ns |
1980229 ns |
0.88 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1897541 ns |
1890354 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
190968 ns |
194350.5 ns |
0.98 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
292 ns |
291 ns |
1.00 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
291 ns |
291 ns |
1 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
375 ns |
375 ns |
1 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
292 ns |
375 ns |
0.78 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA |
31329 ns |
32617.5 ns |
0.96 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
6083 ns |
6208.5 ns |
0.98 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
6083 ns |
5958 ns |
1.02 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
6583 ns |
6708 ns |
0.98 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
6000 ns |
6437.5 ns |
0.93 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA |
175665 ns |
173722.5 ns |
1.01 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) |
250 ns |
250 ns |
1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) |
250 ns |
250 ns |
1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) |
292 ns |
292 ns |
1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) |
250 ns |
292 ns |
0.86 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA |
31064 ns |
32110 ns |
0.97 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) |
2584 ns |
2583 ns |
1.00 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) |
2666 ns |
2542 ns |
1.05 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) |
2833 ns |
2833 ns |
1 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) |
2584 ns |
2833 ns |
0.91 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA |
162308 ns |
161891 ns |
1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) |
287512021 ns |
286335145.5 ns |
1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) |
341664333 ns |
339870250 ns |
1.01 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) |
320946166.5 ns |
320445937.5 ns |
1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) |
272895792 ns |
272825875 ns |
1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA |
7028736 ns |
7113314 ns |
0.99 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) |
998982709 ns |
990386709 ns |
1.01 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) |
938834833 ns |
938484666 ns |
1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) |
867538167 ns |
868613416.5 ns |
1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) |
1101126958.5 ns |
1158749666 ns |
0.95 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA |
33951770 ns |
33903874 ns |
1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) |
1316927292 ns |
1310266104.5 ns |
1.01 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) |
1337023792 ns |
1325766333.5 ns |
1.01 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) |
1620632917 ns |
1623996500 ns |
1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) |
1310255479 ns |
1663239334 ns |
0.79 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
1407417 ns |
1461479 ns |
0.96 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
1417625 ns |
1415750 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
1413625 ns |
1429167 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
1406687.5 ns |
1414437.5 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
127437 ns |
128213 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
5023604 ns |
5019792 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
5018750 ns |
5022458 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
4760396 ns |
5050000 ns |
0.94 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
5017562.5 ns |
5006541.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
594578 ns |
557532 ns |
1.07 |
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) |
174929541 ns |
175263520.5 ns |
1.00 |
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) |
128751271 ns |
129816208.5 ns |
0.99 |
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) |
146934041.5 ns |
145953208.5 ns |
1.01 |
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) |
163487375.5 ns |
164619104.5 ns |
0.99 |
vgg16(32, 32, 3, 32)/forward/GPU/CUDA |
4859464 ns |
4883992 ns |
0.99 |
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) |
664127542 ns |
831528333 ns |
0.80 |
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) |
641259166 ns |
497840084 ns |
1.29 |
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) |
567946500 ns |
556789916 ns |
1.02 |
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) |
850158541 ns |
679969833 ns |
1.25 |
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA |
16210159 ns |
16195623 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) |
8956208 ns |
8914083 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) |
8755875 ns |
8769917 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) |
8188500 ns |
8216313 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) |
10176729.5 ns |
10158000 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/GPU/CUDA |
1606153 ns |
1595526 ns |
1.01 |
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) |
36297042 ns |
35894250 ns |
1.01 |
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) |
36814229 ns |
36843625 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) |
34349792 ns |
34476562 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) |
38895333 ns |
38802729 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/GPU/CUDA |
8910559.5 ns |
6454567.5 ns |
1.38 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) |
47209 ns |
47396 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) |
49333 ns |
49334 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) |
47542 ns |
47542 ns |
1 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) |
49166 ns |
47417 ns |
1.04 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA |
18896 ns |
19457 ns |
0.97 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) |
50083 ns |
50292 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) |
50583 ns |
50520.5 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) |
51042 ns |
50584 ns |
1.01 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) |
50250 ns |
50250 ns |
1 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA |
197236.5 ns |
189575 ns |
1.04 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
7584 ns |
8104 ns |
0.94 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
8083 ns |
6791 ns |
1.19 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
9208 ns |
9125 ns |
1.01 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
8333 ns |
7333 ns |
1.14 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA |
95601 ns |
86829.5 ns |
1.10 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
9958 ns |
9875 ns |
1.01 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
9458 ns |
9583 ns |
0.99 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
10167 ns |
10375 ns |
0.98 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
10250 ns |
10208 ns |
1.00 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA |
565793 ns |
537525 ns |
1.05 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
7875 ns |
8208 ns |
0.96 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
7917 ns |
8250 ns |
0.96 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
9500 ns |
9812.5 ns |
0.97 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
8708 ns |
6375 ns |
1.37 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA |
129044 ns |
113788.5 ns |
1.13 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
12375 ns |
13333.5 ns |
0.93 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
12542 ns |
12625 ns |
0.99 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
13584 ns |
13584 ns |
1 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
12583 ns |
13208 ns |
0.95 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA |
512785 ns |
479705.5 ns |
1.07 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
1042 ns |
958 ns |
1.09 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
958 ns |
958 ns |
1 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
1042 ns |
1042 ns |
1 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
1042 ns |
1083 ns |
0.96 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA |
31575 ns |
32580 ns |
0.97 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7958 ns |
7750 ns |
1.03 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7625 ns |
7625 ns |
1 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
8250 ns |
8542 ns |
0.97 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
8000 ns |
8208 ns |
0.97 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA |
203613.5 ns |
201701.5 ns |
1.01 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
23167 ns |
23250 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
23209 ns |
23042 ns |
1.01 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
23666 ns |
23500 ns |
1.01 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
23209 ns |
23167 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA |
18630 ns |
18765.5 ns |
0.99 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
51875 ns |
52875 ns |
0.98 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
52292 ns |
52292 ns |
1 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
53083 ns |
52792 ns |
1.01 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
52042 ns |
52459 ns |
0.99 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA |
280711 ns |
260844.5 ns |
1.08 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
1404833 ns |
1400229 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
1406625 ns |
1398666.5 ns |
1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
1405229.5 ns |
1400708 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
1454625 ns |
1398917 ns |
1.04 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
195023.5 ns |
196521.5 ns |
0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
5017834 ns |
5018604 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
5014083.5 ns |
5004729.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
5053208 ns |
5044229.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
5002958 ns |
5001271 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
619245 ns |
595122 ns |
1.04 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) |
3030666.5 ns |
3043083 ns |
1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) |
2072750 ns |
2094042 ns |
0.99 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) |
2270479.5 ns |
2287146 ns |
0.99 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) |
4951458.5 ns |
4530875 ns |
1.09 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA |
582238 ns |
582703 ns |
1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) |
24376833.5 ns |
24366625 ns |
1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) |
18837750 ns |
18829583 ns |
1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) |
19058604.5 ns |
19120291 ns |
1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) |
36607062.5 ns |
36653000 ns |
1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA |
3184045.5 ns |
3189516.5 ns |
1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) |
34131312.5 ns |
33943229 ns |
1.01 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) |
28332646 ns |
28373417 ns |
1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) |
28385728.5 ns |
28357208 ns |
1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) |
41613645.5 ns |
41659750 ns |
1.00 |
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) |
144867000 ns |
144299750 ns |
1.00 |
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) |
143371708 ns |
142248375 ns |
1.01 |
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) |
125843979.5 ns |
126632146 ns |
0.99 |
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) |
174338583.5 ns |
173840291.5 ns |
1.00 |
batchedmm(512, Bsize=512)/forward/GPU/CUDA |
22787685 ns |
22781482 ns |
1.00 |
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) |
960337666.5 ns |
1307941437.5 ns |
0.73 |
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) |
884236209 ns |
1133574500.5 ns |
0.78 |
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) |
718061541 ns |
711240125 ns |
1.01 |
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) |
671977458 ns |
670828250 ns |
1.00 |
batchedmm(512, Bsize=512)/zygote/GPU/CUDA |
119167280 ns |
118499942 ns |
1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 74375 ns | 74542 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 73667 ns | 73917 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 78583 ns | 83125 ns | 0.95 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 74271 ns | 72916.5 ns | 1.02 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 236816 ns | 225032.5 ns | 1.05 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 190333 ns | 202979.5 ns | 0.94 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 203833 ns | 282792 ns | 0.72 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 193125 ns | 253479.5 ns | 0.76 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 275667 ns | 244146 ns | 1.13 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1230276 ns | 1201754 ns | 1.02 |
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 35529791.5 ns | 35408938 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 35399125 ns | 35449645.5 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 32566334 ns | 32512083 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 41050959 ns | 41003541.5 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5849679 ns | 5848198 ns | 1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 145971667 ns | 146608875 ns | 1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 151770062.5 ns | 151542938 ns | 1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 140117333 ns | 138849083 ns | 1.01 |
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 287497375 ns | 287439584 ns | 1.00 |
batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 34904065 ns | 34913824 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 121596229 ns | 121086291.5 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 174621209 ns | 174190000 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 155248875 ns | 155717667 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 105916125 ns | 106488666.5 ns | 0.99 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5467486 ns | 5478422 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 477537959 ns | 611208666 ns | 0.78 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 465834541 ns | 466441167 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 454243958 ns | 453562937.5 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 742668146 ns | 741621625 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 32286932 ns | 35157227 ns | 0.92 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 641994375 ns | 648662584 ns | 0.99 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 654482979 ns | 657411208 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 588547854.5 ns | 585962375 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 850465708 ns | 845072208 ns | 1.01 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) | 1310375 ns | 1304708 ns | 1.00 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) | 956750 ns | 965666 ns | 0.99 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) | 662041.5 ns | 744354 ns | 0.89 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) | 2098541.5 ns | 1944604 ns | 1.08 |
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA | 559887.5 ns | 572387 ns | 0.98 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) | 2974896 ns | 2974271 ns | 1.00 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) | 2550167 ns | 2531646 ns | 1.01 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) | 2481312.5 ns | 2512854 ns | 0.99 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) | 3687958 ns | 3691334 ns | 1.00 |
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA | 1744044.5 ns | 1817474 ns | 0.96 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) | 6644917 ns | 6642416 ns | 1.00 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) | 6496417 ns | 6630792 ns | 0.98 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) | 6453208 ns | 6466375 ns | 1.00 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) | 4452625 ns | 4443145.5 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7375 ns | 7334 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6083 ns | 6208 ns | 0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5333 ns | 5458 ns | 0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 9917 ns | 10167 ns | 0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 25006 ns | 25916 ns | 0.96 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 212125 ns | 212104 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 222333 ns | 219562.5 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 221209 ns | 220667 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 205875 ns | 206291 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 256601.5 ns | 257490 ns | 1.00 |
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) | 300709625 ns | 301772791.5 ns | 1.00 |
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) | 221256000 ns | 222879750 ns | 0.99 |
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) | 218293312.5 ns | 222700312.5 ns | 0.98 |
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) | 306523896 ns | 311773125 ns | 0.98 |
vgg16(32, 32, 3, 64)/forward/GPU/CUDA | 7678868 ns | 7676597.5 ns | 1.00 |
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) | 1086946812.5 ns | 1082870459 ns | 1.00 |
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) | 893668062.5 ns | 892532250 ns | 1.00 |
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) | 821712646 ns | 883941208.5 ns | 0.93 |
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) | 1164252583.5 ns | 1154293562 ns | 1.01 |
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA | 26517827 ns | 26959026 ns | 0.98 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5833 ns | 6459 ns | 0.90 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5166 ns | 5209 ns | 0.99 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 8917 ns | 10000 ns | 0.89 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5625 ns | 5708.5 ns | 0.99 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 164437.5 ns | 168546.5 ns | 0.98 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7292 ns | 7458 ns | 0.98 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6834 ns | 6792 ns | 1.01 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7459 ns | 7542 ns | 0.99 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6917 ns | 7792 ns | 0.89 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 668536 ns | 639812.5 ns | 1.04 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 458 ns | 1.09 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 500 ns | 458 ns | 1.09 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 583 ns | 542 ns | 1.08 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 458 ns | 542 ns | 0.85 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 23464 ns | 24361 ns | 0.96 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9083 ns | 9000 ns | 1.01 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 8917 ns | 9000 ns | 0.99 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9750 ns | 9583 ns | 1.02 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 8958 ns | 9708 ns | 0.92 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 228625 ns | 234125.5 ns | 0.98 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 351875 ns | 351500 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 351375 ns | 351500 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 352083.5 ns | 351916 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 355166.5 ns | 356625 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA | 21438 ns | 21502 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 779459 ns | 811270.5 ns | 0.96 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 775625 ns | 774958.5 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 776333 ns | 776584 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 827729 ns | 821875 ns | 1.01 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA | 306099 ns | 315795.5 ns | 0.97 |
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 333375 ns | 335896 ns | 0.99 |
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 338187.5 ns | 338208.5 ns | 1.00 |
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 447083 ns | 441167 ns | 1.01 |
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 327541 ns | 331375 ns | 0.99 |
batchedmm(16, Bsize=32)/forward/GPU/CUDA | 17916 ns | 18761.5 ns | 0.95 |
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 688187.5 ns | 695166 ns | 0.99 |
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 740833 ns | 738208 ns | 1.00 |
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 1043750.5 ns | 1036458 ns | 1.01 |
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 687437.5 ns | 692396 ns | 0.99 |
batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 259575.5 ns | 292461.5 ns | 0.89 |
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 352250 ns | 354166.5 ns | 0.99 |
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 345500 ns | 346771 ns | 1.00 |
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 430417 ns | 433791 ns | 0.99 |
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 366833 ns | 370250 ns | 0.99 |
batchedmm(16, Bsize=128)/forward/GPU/CUDA | 22452 ns | 23121 ns | 0.97 |
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 749437.5 ns | 757417 ns | 0.99 |
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 749667 ns | 749625 ns | 1.00 |
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 1068125 ns | 1070562.5 ns | 1.00 |
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 819375 ns | 828458 ns | 0.99 |
batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 228679 ns | 257074.5 ns | 0.89 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 3417 ns | 3292 ns | 1.04 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 3375 ns | 3458 ns | 0.98 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 3709 ns | 3750 ns | 0.99 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 3375 ns | 3417 ns | 0.99 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 17943 ns | 18586 ns | 0.97 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 4209 ns | 4167 ns | 1.01 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 4208 ns | 4375 ns | 0.96 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 4333 ns | 4417 ns | 0.98 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 4333 ns | 4250 ns | 1.02 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 247167 ns | 296700.5 ns | 0.83 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6708 ns | 3625 ns | 1.85 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 3541 ns | 3750 ns | 0.94 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6541 ns | 6541 ns | 1 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 3917 ns | 6354.5 ns | 0.62 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 187075 ns | 232189.5 ns | 0.81 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8333 ns | 8187.5 ns | 1.02 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8167 ns | 8000 ns | 1.02 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8416 ns | 8458 ns | 1.00 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8584 ns | 8500 ns | 1.01 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 1170319.5 ns | 1227082 ns | 0.95 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 204792 ns | 203417 ns | 1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 210709 ns | 209541.5 ns | 1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 210500 ns | 208250 ns | 1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 205709 ns | 198709 ns | 1.04 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 34545 ns | 35300 ns | 0.98 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 607041 ns | 612417 ns | 0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 620875 ns | 623292 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 623083 ns | 623250 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 630937.5 ns | 630166 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 345777 ns | 347973 ns | 0.99 |
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 972333 ns | 977646 ns | 0.99 |
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 936145.5 ns | 935437.5 ns | 1.00 |
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 971895.5 ns | 970083 ns | 1.00 |
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 1287875 ns | 1286374.5 ns | 1.00 |
batchedmm(128, Bsize=128)/forward/GPU/CUDA | 207543.5 ns | 209031 ns | 0.99 |
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 4486834 ns | 4514333 ns | 0.99 |
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 4467750 ns | 4466146 ns | 1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 4468375 ns | 4452875 ns | 1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 6363084 ns | 6260416.5 ns | 1.02 |
batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 989981 ns | 947144.5 ns | 1.05 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4000 ns | 3542 ns | 1.13 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3667 ns | 3417 ns | 1.07 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6354.5 ns | 5896 ns | 1.08 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3834 ns | 6667 ns | 0.58 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 232582.5 ns | 219336.5 ns | 1.06 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7459 ns | 6917 ns | 1.08 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7459 ns | 6958 ns | 1.07 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7270.5 ns | 7708 ns | 0.94 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7458 ns | 7291 ns | 1.02 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 1031292.5 ns | 1020167.5 ns | 1.01 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1623666.5 ns | 1635042 ns | 0.99 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1158083 ns | 1200395.5 ns | 0.96 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1365542 ns | 1363584 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2483375 ns | 2345187.5 ns | 1.06 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 212766.5 ns | 215784.5 ns | 0.99 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12292250 ns | 12316854.5 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9562291 ns | 9564000 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9349083.5 ns | 9378437.5 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18086958 ns | 17989542 ns | 1.01 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 1951499.5 ns | 1948181 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17292271 ns | 17368125 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14359292 ns | 14382958 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14481708 ns | 14502250 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21083791.5 ns | 21085917 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 91250 ns | 90917 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 90417 ns | 89500 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 93375 ns | 91833 ns | 1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 92271 ns | 113437.5 ns | 0.81 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 126008 ns | 126891 ns | 0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2027791 ns | 2009625 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1698917 ns | 2030000 ns | 0.84 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1986791 ns | 2039270.5 ns | 0.97 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2022292 ns | 1871125 ns | 1.08 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1059363.5 ns | 1032563 ns | 1.03 |
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 343041.5 ns | 342166.5 ns | 1.00 |
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 343334 ns | 343375 ns | 1.00 |
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 408459 ns | 406458 ns | 1.00 |
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 307729.5 ns | 311729 ns | 0.99 |
batchedmm(2, Bsize=4)/forward/GPU/CUDA | 15788 ns | 16465.5 ns | 0.96 |
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 698500 ns | 706208 ns | 0.99 |
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 728604 ns | 728542 ns | 1.00 |
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 1025520.5 ns | 1018584 ns | 1.01 |
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 643645.5 ns | 650375 ns | 0.99 |
batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 194667.5 ns | 195366.5 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7250 ns | 7375 ns | 0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5917 ns | 5875 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5333 ns | 5416 ns | 0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 9917 ns | 10000 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 33151 ns | 34591 ns | 0.96 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 223333.5 ns | 243791 ns | 0.92 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 220209 ns | 220125 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 221084 ns | 221083 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 221375 ns | 239167 ns | 0.93 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 347875 ns | 327793 ns | 1.06 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3667 ns | 3667 ns | 1 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3667 ns | 3667 ns | 1 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3667 ns | 3709 ns | 0.99 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3708 ns | 3708 ns | 1 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 22573 ns | 22616 ns | 1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14334 ns | 14292 ns | 1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14333 ns | 14416 ns | 0.99 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14042 ns | 14208 ns | 0.99 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14292 ns | 14417 ns | 0.99 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 494233.5 ns | 480334.5 ns | 1.03 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 97708 ns | 94458 ns | 1.03 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 96979 ns | 92625 ns | 1.05 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 97042 ns | 96875 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 101458 ns | 96229.5 ns | 1.05 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 125532 ns | 126007 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1923854 ns | 1714792 ns | 1.12 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1897479.5 ns | 1926792 ns | 0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1925958 ns | 1913291.5 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1920000 ns | 1711417 ns | 1.12 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1048446.5 ns | 1034230 ns | 1.01 |
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) | 863709 ns | 876916.5 ns | 0.98 |
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) | 822479.5 ns | 817791 ns | 1.01 |
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) | 1185917 ns | 1169438 ns | 1.01 |
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) | 954084 ns | 966187.5 ns | 0.99 |
lenet(28, 28, 1, 32)/forward/GPU/CUDA | 268705.5 ns | 275657.5 ns | 0.97 |
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) | 2763958.5 ns | 2828583 ns | 0.98 |
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) | 2467666.5 ns | 2474833 ns | 1.00 |
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) | 3312375 ns | 3335750 ns | 0.99 |
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) | 3401645.5 ns | 3304292 ns | 1.03 |
lenet(28, 28, 1, 32)/zygote/GPU/CUDA | 1642790 ns | 1618381.5 ns | 1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 16416 ns | 16709 ns | 0.98 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17375 ns | 15625 ns | 1.11 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 19020.5 ns | 18667 ns | 1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 15958 ns | 15583 ns | 1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 143820 ns | 142594 ns | 1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 241750 ns | 228750 ns | 1.06 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 215125 ns | 215750 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 216500 ns | 217625 ns | 0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 258750 ns | 255500 ns | 1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 647705 ns | 641543.5 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 222166 ns | 222458 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 222937.5 ns | 221500 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 223500 ns | 223458.5 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 221625 ns | 222604.5 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 270664.5 ns | 269850.5 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 530500 ns | 537583 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 504708 ns | 497334 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 498354 ns | 499583 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 509042 ns | 526833 ns | 0.97 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1440100 ns | 1430878.5 ns | 1.01 |
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 332166.5 ns | 330125 ns | 1.01 |
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 334583 ns | 332834 ns | 1.01 |
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 436584 ns | 435458.5 ns | 1.00 |
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 316291.5 ns | 315917 ns | 1.00 |
batchedmm(16, Bsize=4)/forward/GPU/CUDA | 16549 ns | 16581 ns | 1.00 |
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 714916 ns | 717084 ns | 1.00 |
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 731749.5 ns | 728166.5 ns | 1.00 |
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 1018041 ns | 1021104 ns | 1.00 |
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 659916 ns | 662729.5 ns | 1.00 |
batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 197924.5 ns | 195479.5 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17500 ns | 17875 ns | 0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17208 ns | 17167 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 19500 ns | 20250 ns | 0.96 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 17625 ns | 17208 ns | 1.02 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 146247.5 ns | 145639 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 213167 ns | 223750 ns | 0.95 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 212354 ns | 212417 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 214229.5 ns | 214041 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 224292 ns | 221917 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 963761 ns | 1035551.5 ns | 0.93 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 7187.5 ns | 6708 ns | 1.07 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6458 ns | 6333 ns | 1.02 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 7542 ns | 7208 ns | 1.05 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6250 ns | 6625 ns | 0.94 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 247788.5 ns | 240542 ns | 1.03 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10666 ns | 10584 ns | 1.01 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 10292 ns | 9917 ns | 1.04 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10500 ns | 11166.5 ns | 0.94 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10625 ns | 10917 ns | 0.97 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 1101991 ns | 1097401.5 ns | 1.00 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 7209 ns | 3500 ns | 2.06 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3167 ns | 3208 ns | 0.99 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 5334 ns | 6333.5 ns | 0.84 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3125 ns | 6750 ns | 0.46 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 249742.5 ns | 250006 ns | 1.00 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7333 ns | 7625 ns | 0.96 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7125 ns | 7084 ns | 1.01 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7625 ns | 8125 ns | 0.94 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7333.5 ns | 7500 ns | 0.98 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 1104576 ns | 1102649 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 23676542 ns | 23315625 ns | 1.02 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 35315625 ns | 34529125 ns | 1.02 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 41021500 ns | 41513333.5 ns | 0.99 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 34975146 ns | 34929834 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1841641 ns | 1838602 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 183575291 ns | 184421875 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 159863125 ns | 159459792 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 150390416 ns | 151225083 ns | 0.99 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 414546000 ns | 413223958 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 16498361 ns | 16387494 ns | 1.01 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 426892000 ns | 428743125 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 255220666.5 ns | 252439020.5 ns | 1.01 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 233789999.5 ns | 233017396 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 488318583 ns | 484197291 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 182625 ns | 183584 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 183416 ns | 182750 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 186375 ns | 186625 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 183562.5 ns | 183146 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 230078 ns | 228677.5 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 591125 ns | 596083 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 586583.5 ns | 586292 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 588875 ns | 589770.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 633896 ns | 631958 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1092628.5 ns | 1119701 ns | 0.98 |
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 3870667 ns | 3838833 ns | 1.01 |
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 3639041 ns | 3643375.5 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 3556854.5 ns | 3563521 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 5348541.5 ns | 5359750 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/GPU/CUDA | 539345 ns | 537722 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 17399709 ns | 17412417 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 17238083 ns | 17190667 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 17087396 ns | 17100375 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 22258375 ns | 22144083 ns | 1.01 |
batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2638865 ns | 2612799 ns | 1.01 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 542 ns | 542 ns | 1 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 500 ns | 458 ns | 1.09 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 542 ns | 542 ns | 1 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 583 ns | 0.86 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 32200 ns | 32035 ns | 1.01 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9916 ns | 9208 ns | 1.08 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 8666 ns | 8542 ns | 1.01 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9917 ns | 10208 ns | 0.97 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9167 ns | 9459 ns | 0.97 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 269524 ns | 264327.5 ns | 1.02 |
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) | 498856625 ns | 504274209 ns | 0.99 |
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) | 429009812.5 ns | 430218396 ns | 1.00 |
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) | 471079125 ns | 471374500 ns | 1.00 |
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) | 677305167 ns | 672994208.5 ns | 1.01 |
vgg16(32, 32, 3, 128)/forward/GPU/CUDA | 12482570.5 ns | 12486595 ns | 1.00 |
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) | 2039988125 ns | 2049529562.5 ns | 1.00 |
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) | 1629390541 ns | 1632649709 ns | 1.00 |
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) | 1544585104 ns | 1536417708 ns | 1.01 |
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) | 2216826791.5 ns | 2205666041.5 ns | 1.01 |
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA | 49313694 ns | 49389302 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1648875 ns | 1657645.5 ns | 0.99 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1178188 ns | 1189208.5 ns | 0.99 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1335999.5 ns | 1382000 ns | 0.97 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2451250 ns | 2334125 ns | 1.05 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 216670 ns | 214982 ns | 1.01 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12769708 ns | 12688500 ns | 1.01 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9922083 ns | 9942000 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9700833 ns | 9748312.5 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18419000.5 ns | 18407312 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2018117 ns | 2050613 ns | 0.98 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17711521 ns | 17691583.5 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14689667 ns | 14746041.5 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14712292 ns | 14804417 ns | 0.99 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21443959 ns | 21386084 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 26208 ns | 26167 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 26209 ns | 26292 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 26375 ns | 26291 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 26167 ns | 26291 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 23929 ns | 24125 ns | 0.99 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 67042 ns | 66875 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 66709 ns | 66917 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 66958 ns | 67083 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66667 ns | 67209 ns | 0.99 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 401514 ns | 398847.5 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 203834 ns | 202667 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 209500 ns | 209000 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 209833 ns | 209167 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 201916 ns | 199583 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 26172 ns | 26392 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 650145.5 ns | 612416.5 ns | 1.06 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 666666.5 ns | 627416.5 ns | 1.06 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 623083.5 ns | 667979 ns | 0.93 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 632333 ns | 631250 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 349666.5 ns | 353043.5 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 655125 ns | 645542 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 647625 ns | 643375 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 546958 ns | 664187.5 ns | 0.82 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 681500 ns | 540834 ns | 1.26 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 131296 ns | 132126 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2258229 ns | 2247375 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2255500 ns | 2239958 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2297083.5 ns | 2302917 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2241417 ns | 2219000 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1166127 ns | 1328726 ns | 0.88 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 18666.5 ns | 17667 ns | 1.06 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17729 ns | 16979.5 ns | 1.04 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 21000 ns | 20792 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 17958 ns | 18500 ns | 0.97 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 144841.5 ns | 146392.5 ns | 0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 262417 ns | 229708 ns | 1.14 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 219042 ns | 225333 ns | 0.97 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 221084 ns | 229292 ns | 0.96 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 257833 ns | 259083 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1023581 ns | 1081671 ns | 0.95 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 500 ns | 1 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 500 ns | 459 ns | 1.09 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 542 ns | 542 ns | 1 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 459 ns | 542 ns | 0.85 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 22851 ns | 23645 ns | 0.97 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 10083 ns | 9833.5 ns | 1.03 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9750 ns | 9542 ns | 1.02 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 10458 ns | 10708 ns | 0.98 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9500 ns | 9916 ns | 0.96 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 257737.5 ns | 262941 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 7145.5 ns | 7291 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5270.5 ns | 5833 ns | 0.90 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 8084 ns | 9625 ns | 0.84 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 7354.5 ns | 7250 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 234765.5 ns | 234003 ns | 1.00 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7583 ns | 7333 ns | 1.03 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7041 ns | 7000 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7625 ns | 7833 ns | 0.97 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6917 ns | 7250 ns | 0.95 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 799451.5 ns | 810029.5 ns | 0.99 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 2125 ns | 2042 ns | 1.04 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1917 ns | 2000 ns | 0.96 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 2583 ns | 2375 ns | 1.09 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 2250 ns | 2208 ns | 1.02 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 17933 ns | 18218 ns | 0.98 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 6583 ns | 6542 ns | 1.01 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 6458 ns | 6500 ns | 0.99 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 6750 ns | 6708 ns | 1.01 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 6500 ns | 6750 ns | 0.96 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 329954.5 ns | 335368 ns | 0.98 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 762666 ns | 750166 ns | 1.02 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 748292 ns | 746604.5 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 748937.5 ns | 751041 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 752125 ns | 761417 ns | 0.99 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 21677 ns | 21856 ns | 0.99 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 810604 ns | 775334 ns | 1.05 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 772458 ns | 775042 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 775020.5 ns | 804792 ns | 0.96 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 811250.5 ns | 791625 ns | 1.02 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 297528.5 ns | 299022 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7333.5 ns | 7375 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6000 ns | 5875 ns | 1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5042 ns | 5208 ns | 0.97 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10083 ns | 10125 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 32915 ns | 32492 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 261083 ns | 233188 ns | 1.12 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 227854.5 ns | 227750 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 228167 ns | 254458 ns | 0.90 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 255667 ns | 255583 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 362215.5 ns | 359227 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 12729.5 ns | 11042 ns | 1.15 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 12542 ns | 12458 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 13208 ns | 12959 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 10125 ns | 12000 ns | 0.84 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 247873.5 ns | 245075.5 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 24541 ns | 24875 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 24625 ns | 24458 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 25042 ns | 25458 ns | 0.98 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 24916.5 ns | 24583.5 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 1130164.5 ns | 1120608 ns | 1.01 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 106546042 ns | 106980458 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 117887062.5 ns | 118006979.5 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 123861917 ns | 123940208 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 117586083 ns | 118407959 ns | 0.99 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 2640571 ns | 2661574 ns | 0.99 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 392989041 ns | 394378313 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 367697333 ns | 368164500 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 360473458 ns | 358657167 ns | 1.01 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 483729375 ns | 482282708 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 15223969 ns | 15138278 ns | 1.01 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 755583520.5 ns | 759267583 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 579266333 ns | 577881125 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 748066646 ns | 749378833 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 768472896 ns | 945671312.5 ns | 0.81 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7750 ns | 7458 ns | 1.04 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7042 ns | 7958 ns | 0.88 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8834 ns | 8750 ns | 1.01 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 8250 ns | 7333 ns | 1.13 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 240391.5 ns | 235620 ns | 1.02 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14875 ns | 14500 ns | 1.03 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 13708 ns | 13333 ns | 1.03 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14292 ns | 15041 ns | 0.95 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 13167 ns | 14292 ns | 0.92 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 1084587 ns | 1078273.5 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 8770.5 ns | 8542 ns | 1.03 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 7958 ns | 7792 ns | 1.02 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 9375 ns | 9187.5 ns | 1.02 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6458 ns | 7833.5 ns | 0.82 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 235585.5 ns | 235827.5 ns | 1.00 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12916.5 ns | 13167 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12375 ns | 12084 ns | 1.02 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 12666 ns | 13084 ns | 0.97 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12333 ns | 12833 ns | 0.96 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 794027 ns | 787391.5 ns | 1.01 |
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 345125 ns | 347250 ns | 0.99 |
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 345458 ns | 344875 ns | 1.00 |
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 418042 ns | 409896 ns | 1.02 |
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 310875 ns | 310562 ns | 1.00 |
batchedmm(2, Bsize=128)/forward/GPU/CUDA | 16738 ns | 16566 ns | 1.01 |
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 706459 ns | 713833.5 ns | 0.99 |
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 731167 ns | 727291 ns | 1.01 |
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 1022500 ns | 1023416 ns | 1.00 |
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 650625 ns | 654959 ns | 0.99 |
batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 200821 ns | 197250.5 ns | 1.02 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 334 ns | 292 ns | 1.14 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 292 ns | 291 ns | 1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 291 ns | 375 ns | 0.78 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 23278 ns | 23066 ns | 1.01 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6333 ns | 6250 ns | 1.01 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6167 ns | 6334 ns | 0.97 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6708 ns | 6750 ns | 0.99 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6250 ns | 6791 ns | 0.92 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 241472.5 ns | 238420 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 5708 ns | 5750 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 5667 ns | 5750 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 5792 ns | 5875 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 5667 ns | 5834 ns | 0.97 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 24186 ns | 23863 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 21187.5 ns | 21750 ns | 0.97 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 21333 ns | 21000 ns | 1.02 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 21833 ns | 21958 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 21084 ns | 21708 ns | 0.97 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 264949.5 ns | 261085 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 145958 ns | 152146 ns | 0.96 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 145667 ns | 145250 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 150937.5 ns | 149541 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 183459 ns | 145937 ns | 1.26 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 167512 ns | 166536.5 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1325645.5 ns | 1328792 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1283209 ns | 1319083.5 ns | 0.97 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1321500 ns | 1350812.5 ns | 0.98 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1321750 ns | 1317084 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1344605.5 ns | 1336276 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 24708 ns | 24917 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 23750 ns | 24208 ns | 0.98 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 25791.5 ns | 25708 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 23875 ns | 24208.5 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 286945.5 ns | 351114.5 ns | 0.82 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 178875 ns | 131125 ns | 1.36 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 121125 ns | 117791 ns | 1.03 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 118416 ns | 172917 ns | 0.68 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 177750 ns | 177334 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1461558 ns | 1465398.5 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 333 ns | 333 ns | 1 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 291 ns | 292 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 250 ns | 375 ns | 0.67 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 22927.5 ns | 22926 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6417 ns | 6417 ns | 1 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6125 ns | 6458 ns | 0.95 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6708 ns | 6917 ns | 0.97 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6584 ns | 6542 ns | 1.01 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 258576.5 ns | 254551 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6625 ns | 7625 ns | 0.87 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6812 ns | 4167 ns | 1.63 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6834 ns | 7708.5 ns | 0.89 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4042 ns | 7375 ns | 0.55 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 256701 ns | 250274.5 ns | 1.03 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10583 ns | 10042 ns | 1.05 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10125 ns | 9708 ns | 1.04 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10041.5 ns | 10333 ns | 0.97 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10208 ns | 10250 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 1359307.5 ns | 1345295 ns | 1.01 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1625 ns | 1584 ns | 1.03 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1583 ns | 1625 ns | 0.97 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1625 ns | 1625 ns | 1 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1583 ns | 1625 ns | 0.97 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 23350 ns | 22897 ns | 1.02 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 5625 ns | 5625 ns | 1 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 5709 ns | 5584 ns | 1.02 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 5916 ns | 5959 ns | 0.99 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 5584 ns | 5958 ns | 0.94 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 275462 ns | 271438.5 ns | 1.01 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 6828396 ns | 6886125 ns | 0.99 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 6363250 ns | 6378229 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 6516562.5 ns | 6526875 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 7621750 ns | 7602250 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 215412 ns | 213111 ns | 1.01 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 24080041.5 ns | 24073062 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 21280041 ns | 21283625 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 20997625 ns | 21045584 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 29759000.5 ns | 29677875 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2123616 ns | 2108165 ns | 1.01 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 37323458 ns | 37353145.5 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 34474916.5 ns | 34386667 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 45793312.5 ns | 45930020.5 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 37921042 ns | 49322334 ns | 0.77 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 7333.5 ns | 7708.5 ns | 0.95 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 5187.5 ns | 5875 ns | 0.88 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7937.5 ns | 8333 ns | 0.95 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6916.5 ns | 7062.5 ns | 0.98 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 239076 ns | 238522.5 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8125 ns | 8458 ns | 0.96 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 7708 ns | 8042 ns | 0.96 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8459 ns | 8583 ns | 0.99 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8208 ns | 8292 ns | 0.99 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 1072155.5 ns | 1070850 ns | 1.00 |
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) | 1562125 ns | 1544374.5 ns | 1.01 |
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) | 1257937.5 ns | 1259666.5 ns | 1.00 |
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) | 1631625 ns | 1632771 ns | 1.00 |
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) | 2159229 ns | 2150667 ns | 1.00 |
lenet(28, 28, 1, 128)/forward/GPU/CUDA | 280483 ns | 278945 ns | 1.01 |
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) | 7910479.5 ns | 7908937.5 ns | 1.00 |
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) | 6565334 ns | 6609937 ns | 0.99 |
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) | 7148750 ns | 7237750.5 ns | 0.99 |
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) | 10451187.5 ns | 10434334 ns | 1.00 |
lenet(28, 28, 1, 128)/zygote/GPU/CUDA | 1898276.5 ns | 1889956 ns | 1.00 |
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 336958 ns | 340979 ns | 0.99 |
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 340916.5 ns | 345792 ns | 0.99 |
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 417542 ns | 417125 ns | 1.00 |
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 342750 ns | 345833 ns | 0.99 |
batchedmm(128, Bsize=4)/forward/GPU/CUDA | 46061 ns | 42448 ns | 1.09 |
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 748770.5 ns | 746500.5 ns | 1.00 |
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 793209 ns | 784542 ns | 1.01 |
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 1057812.5 ns | 1073250 ns | 0.99 |
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 768708 ns | 761062.5 ns | 1.01 |
batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 312846.5 ns | 303720.5 ns | 1.03 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 397333 ns | 397500 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 288166 ns | 288250 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 212083 ns | 212666 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 755667 ns | 756084 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 44394 ns | 43887 ns | 1.01 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 666250 ns | 671083 ns | 0.99 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 532250 ns | 530083 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 471125 ns | 470667 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 973625 ns | 974750 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 191807 ns | 188388.5 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 642749.5 ns | 679250 ns | 0.95 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 657917 ns | 645333.5 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 600542 ns | 642458 ns | 0.93 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 682375 ns | 638562.5 ns | 1.07 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 132357 ns | 131530 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2459583 ns | 2409292 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2421437.5 ns | 2456416.5 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2501750 ns | 2514583 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2470417 ns | 2456292 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1375665.5 ns | 1277300 ns | 1.08 |
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 341833 ns | 345146 ns | 0.99 |
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 342917 ns | 343583 ns | 1.00 |
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 406292 ns | 403708.5 ns | 1.01 |
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 308208 ns | 312208 ns | 0.99 |
batchedmm(2, Bsize=32)/forward/GPU/CUDA | 16247 ns | 16009 ns | 1.01 |
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 701208 ns | 709667 ns | 0.99 |
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 725603.5 ns | 724500 ns | 1.00 |
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 1027750 ns | 1022687.5 ns | 1.00 |
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 644500 ns | 650417 ns | 0.99 |
batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 201443.5 ns | 195917 ns | 1.03 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1461354.5 ns | 1460417 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1500917 ns | 1500812.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1493542 ns | 1496375 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1440084 ns | 1438708 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 41707 ns | 40600 ns | 1.03 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5118646 ns | 5128791 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5284042 ns | 5302375 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5161771 ns | 5313000 ns | 0.97 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4984145.5 ns | 4970208.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 198485.5 ns | 196206.5 ns | 1.01 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3667 ns | 3667 ns | 1 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3667 ns | 3667 ns | 1 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3708 ns | 3709 ns | 1.00 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3667 ns | 3708 ns | 0.99 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA | 33901 ns | 32895 ns | 1.03 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 15041 ns | 15167 ns | 0.99 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 15084 ns | 15083 ns | 1.00 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 15083 ns | 15083 ns | 1 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 15083 ns | 15375 ns | 0.98 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA | 382601.5 ns | 376729 ns | 1.02 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 70917 ns | 71459 ns | 0.99 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 71375 ns | 71250 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 70958 ns | 71375 ns | 0.99 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 71375 ns | 70708 ns | 1.01 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA | 114182 ns | 113177.5 ns | 1.01 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 317708 ns | 317917 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 320416 ns | 320417 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 331250 ns | 325333 ns | 1.02 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 318541 ns | 320916 ns | 0.99 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA | 198022 ns | 193043 ns | 1.03 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 1000 ns | 958 ns | 1.04 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 959 ns | 958 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 1083 ns | 1083 ns | 1 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 959 ns | 1042 ns | 0.92 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 23593 ns | 23363 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8042 ns | 8083 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8041.5 ns | 7792 ns | 1.03 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8583 ns | 8750 ns | 0.98 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 7916.5 ns | 8750 ns | 0.90 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 264877.5 ns | 260535.5 ns | 1.02 |
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) | 473667 ns | 475499.5 ns | 1.00 |
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) | 478708 ns | 470520.5 ns | 1.02 |
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) | 549417 ns | 557125 ns | 0.99 |
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) | 545834 ns | 557959 ns | 0.98 |
batchedmm(128, Bsize=32)/forward/GPU/CUDA | 129526 ns | 129404 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) | 1394625 ns | 1399270.5 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) | 1382500 ns | 1382375 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) | 1612937.5 ns | 1611125 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) | 1580937.5 ns | 1582104.5 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/GPU/CUDA | 279438 ns | 274924 ns | 1.02 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 291 ns | 250 ns | 1.16 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 334 ns | 375 ns | 0.89 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 375 ns | 0.78 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 31859 ns | 31647 ns | 1.01 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6625 ns | 6375 ns | 1.04 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6083 ns | 6042 ns | 1.01 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6729.5 ns | 6666 ns | 1.01 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6020.5 ns | 6625 ns | 0.91 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 267633.5 ns | 262541.5 ns | 1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1724250 ns | 1761833 ns | 0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1721479 ns | 1723396 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1727042 ns | 1733812.5 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1730208 ns | 1730625 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 168767.5 ns | 169477.5 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4344292 ns | 4358625 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4323437.5 ns | 4358708 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4399666 ns | 4403062.5 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4355416 ns | 4373875 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1262148 ns | 1208123 ns | 1.04 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 6791 ns | 7167 ns | 0.95 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 6417 ns | 6875 ns | 0.93 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 6625 ns | 6916 ns | 0.96 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 6709 ns | 6750 ns | 0.99 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA | 21098 ns | 20662 ns | 1.02 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 48084 ns | 51625 ns | 0.93 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 32833 ns | 32917 ns | 1.00 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 33000 ns | 48208.5 ns | 0.68 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 51416 ns | 51417 ns | 1.00 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA | 212419 ns | 292106.5 ns | 0.73 |
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) | 353583 ns | 354562.5 ns | 1.00 |
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) | 345083 ns | 348666.5 ns | 0.99 |
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) | 435187.5 ns | 433333 ns | 1.00 |
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) | 317875 ns | 322041.5 ns | 0.99 |
batchedmm(2, Bsize=512)/forward/GPU/CUDA | 18605 ns | 18353 ns | 1.01 |
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) | 723437.5 ns | 724625 ns | 1.00 |
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) | 740646 ns | 730583 ns | 1.01 |
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) | 1044625 ns | 1038687.5 ns | 1.01 |
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) | 669791.5 ns | 675333 ns | 0.99 |
batchedmm(2, Bsize=512)/zygote/GPU/CUDA | 346908.5 ns | 335730.5 ns | 1.03 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 75167 ns | 75458 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 75208 ns | 75333 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 75292 ns | 75375 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 74625 ns | 74584 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA | 47711 ns | 46864.5 ns | 1.02 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 324709 ns | 325166 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 338396 ns | 324250 ns | 1.04 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 336208 ns | 336875 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 325000 ns | 325125 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA | 213486 ns | 209059.5 ns | 1.02 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1486042 ns | 1485709 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1526750 ns | 1526833 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1518500 ns | 1522792 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1465542 ns | 1462625 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 52087.5 ns | 51397 ns | 1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5125208 ns | 5113395.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5101375 ns | 5295292 ns | 0.96 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5010541.5 ns | 5300812.5 ns | 0.95 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4979833 ns | 5001042 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 206862 ns | 202971.5 ns | 1.02 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 28291 ns | 28250 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 28250 ns | 28208 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 28208 ns | 28208 ns | 1 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 28250 ns | 28209 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA | 25316 ns | 24514.5 ns | 1.03 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 66417 ns | 66417 ns | 1 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 66583 ns | 66458 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 66542 ns | 66500 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66417 ns | 66500 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA | 540711 ns | 505942 ns | 1.07 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) | 1499041 ns | 1502084 ns | 1.00 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) | 1143146 ns | 1124250 ns | 1.02 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) | 814500 ns | 944270.5 ns | 0.86 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) | 2202375.5 ns | 2255250 ns | 0.98 |
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA | 572810.5 ns | 566674 ns | 1.01 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) | 3097583 ns | 3090791 ns | 1.00 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) | 2552125 ns | 2751542 ns | 0.93 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) | 2429833.5 ns | 2628896 ns | 0.92 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) | 3815666 ns | 3819709 ns | 1.00 |
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA | 2014809 ns | 1979936 ns | 1.02 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) | 8858833 ns | 8847333 ns | 1.00 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) | 8751542 ns | 8768375 ns | 1.00 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) | 8741583 ns | 8750250 ns | 1.00 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) | 6425062.5 ns | 6340375 ns | 1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 123833.5 ns | 85125 ns | 1.45 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 82667 ns | 83021 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 81896 ns | 85708.5 ns | 0.96 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 125792 ns | 83562.5 ns | 1.51 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 194861 ns | 192703 ns | 1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2004479 ns | 2012875 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1739604.5 ns | 2024062.5 ns | 0.86 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1766291 ns | 2038542 ns | 0.87 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2013646 ns | 2008812 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 741224 ns | 791664.5 ns | 0.94 |
This comment was automatically generated by workflow using github-action-benchmark.
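For readers scanning the table, the Ratio column appears to be the current timing divided by the previous timing, so values above 1 mean the current commit was slower on that benchmark. The sketch below reproduces that arithmetic for three rows copied from the table. The helper names and the 1.2 cutoff are hypothetical, chosen only for illustration, and are not taken from github-action-benchmark or from this PR.

```python
# Minimal sketch: recompute the Ratio column and list rows that slowed down.
# `ratio`, `slower_than`, and the 1.2 cutoff are illustrative assumptions,
# not part of github-action-benchmark.

def ratio(current_ns: float, previous_ns: float) -> float:
    """Current / previous, rounded to two decimals as shown in the table."""
    return round(current_ns / previous_ns, 2)


def slower_than(rows, threshold):
    """Return names of benchmarks whose current/previous ratio exceeds threshold."""
    return [name for name, cur, prev in rows if cur / prev > threshold]


# (name, current ns, previous ns) copied from the table above.
rows = [
    ("groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)",
     125792, 83562.5),
    ("groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)",
     123833.5, 85125),
    ("bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s)",
     33000, 48208.5),
]

if __name__ == "__main__":
    for name, cur, prev in rows:
        print(f"{ratio(cur, prev):>5}  {name}")
    print("slower than 1.2x:", slower_than(rows, 1.2))
```

Single-run timings on shared CI machines can be noisy, so a large ratio on one thread count (with neighbouring thread counts near 1.00) may not reflect a real regression.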