This repository has been archived by the owner on Nov 4, 2024. It is now read-only.
Commit
ci: run tests only on 1.10 for now (#172)
Showing 3 changed files with 29 additions and 34 deletions.
2d7533c
LuxLib Benchmarks
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
5209
ns5541
ns0.94
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
5208
ns5208.5
ns1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
7291
ns6834
ns1.07
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
6208
ns4917
ns1.26
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA
115729
ns102997
ns1.12
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI
2692776
nslayernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU
408504
ns422395
ns0.97
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
10083
ns10125
ns1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
10208
ns10167
ns1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
10375
ns9917
ns1.05
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
9833
ns10020.5
ns0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA
496762
ns530333
ns0.94
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI
17703724
nslayernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU
10961843
ns11174375
ns0.98
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s)
1312
ns2854
ns0.46
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s)
1500
ns1375
ns1.09
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s)
1875
ns3750
ns0.50
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s)
1479.5
ns2792
ns0.53
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA
20353.5
ns19948
ns1.02
bias_activation(32, act=relu)(32 x 128)/forward/GPU/oneAPI
1346068.5
nsbias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU
31961
ns33501
ns0.95
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
4000
ns3834
ns1.04
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
4416
ns4250
ns1.04
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
4500
ns4208
ns1.07
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
4333
ns4416
ns0.98
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA
133606
ns131207.5
ns1.02
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/oneAPI
9495102
nsbias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU
147546.5
ns146692
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
57500
ns58167
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
46333
ns39792
ns1.16
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
39750
ns38209
ns1.04
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
82562.5
ns83208
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
36967.5
ns36515
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI
548600
nsbatchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
80581
ns80481
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2024000
ns2038875
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2088104
ns2083750
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2081875
ns2035541
ns1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1983520.5
ns2003250
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
218972
ns217066
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI
7891968
nsbatchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
973560
ns1203774
ns0.81
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
145834
ns146333.5
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
172583
ns147458
ns1.17
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
151875.5
ns174542
ns0.87
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
176250
ns150167
ns1.17
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
167986
ns167907.5
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI
7801350.5
nslayernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
197777
ns171622
ns1.15
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1108729.5
ns1119853.5
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1105292
ns1129187.5
ns0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1119062.5
ns1072541
ns1.04
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1108749.5
ns1117229.5
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
642887
ns620063
ns1.04
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI
33405409
nslayernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1027070
ns1023002
ns1.00
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
6083
ns5021.5
ns1.21
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
4937.5
ns5083
ns0.97
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
5896
ns6417
ns0.92
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
5750
ns4584
ns1.25
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
83848
ns79500
ns1.05
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI
5356951.5
nslayernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU
69841
ns59431
ns1.18
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
9000
ns8833
ns1.02
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
9042
ns8458
ns1.07
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
9042
ns9083
ns1.00
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
8542
ns8958
ns0.95
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
556012
ns540188.5
ns1.03
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI
37949872
nslayernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU
395964
ns390145
ns1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
18791
ns17750
ns1.06
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
16875
ns17000
ns0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
20917
ns22125
ns0.95
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
22791.5
ns18146
ns1.26
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
61826
ns61981.5
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
3296125
nsgroupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
76391
ns78051
ns0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
211083
ns212750
ns0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
218583.5
ns257833
ns0.85
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
221999.5
ns221375
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
211500
ns221750
ns0.95
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
328054
ns323096
ns1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
14617604.5
nsgroupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
468680
ns463260
ns1.01
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s)
750
ns666
ns1.13
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s)
666.5
ns625
ns1.07
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s)
917
ns875
ns1.05
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s)
625
ns625
ns1
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA
19270
ns18860
ns1.02
bias_activation(2, act=relu)(2 x 128)/forward/GPU/oneAPI
1164614.5
nsbias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU
31200
ns30120
ns1.04
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
1459
ns1458
ns1.00
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
1417
ns1375
ns1.03
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
1500
ns1625
ns0.92
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
1459
ns1375
ns1.06
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA
115345.5
ns114822.5
ns1.00
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/oneAPI
8786881.5
nsbias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU
136362
ns123847
ns1.10
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
7333
ns7500
ns0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
5958
ns5333
ns1.12
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
5458
ns5333
ns1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
10167
ns10459
ns0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
23777
ns23715.5
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
1195053
nsbatchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
49421
ns46501
ns1.06
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
228791
ns227792
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
262833
ns241750
ns1.09
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
244208
ns241584
ns1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
227438
ns227125
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
188310
ns188481.5
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
30683195
nsbatchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
646667
ns591832
ns1.09
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s)
4125
ns4084
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s)
3916
ns4125
ns0.95
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s)
4125
ns3958
ns1.04
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s)
4083
ns4125
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA
23548.5
ns23784
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/oneAPI
2046712.5
nsdense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU
49050
ns45550
ns1.08
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
16750
ns16750
ns1
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
16833
ns16792
ns1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
16833
ns16791
ns1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
17000
ns16500
ns1.03
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA
184716.5
ns184666.5
ns1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/oneAPI
10810606
nsdense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU
178062
ns171442
ns1.04
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s)
491291
ns493292
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s)
385708
ns312833
ns1.23
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s)
313250
ns310584
ns1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s)
846667
ns847917
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA
113504.5
ns113490
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/oneAPI
400320
nsdense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU
243402
ns243193
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s)
2157041.5
ns2121291
ns1.02
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s)
1860000
ns1584833
ns1.17
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s)
1596917
ns1574875
ns1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s)
3118291.5
ns3034896
ns1.03
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA
228877.5
ns228348
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/oneAPI
9523997.5
nsdense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU
743298
ns739108
ns1.01
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
6541.5
ns7021
ns0.93
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
6167
ns6792
ns0.91
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
7145.5
ns7958
ns0.90
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
6416
ns6875
ns0.93
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA
82766.5
ns82934
ns1.00
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI
5786455
nslayernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU
67260
ns57300
ns1.17
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
11708.5
ns11520.5
ns1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
10333
ns11708
ns0.88
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
12417
ns12062.5
ns1.03
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
10375
ns10896
ns0.95
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA
599572.5
ns598177.5
ns1.00
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI
36065836.5
nslayernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU
415124
ns401725
ns1.03
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s)
500
ns541
ns0.92
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s)
541
ns542
ns1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s)
542
ns542
ns1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s)
500
ns500
ns1
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA
23681.5
ns23280.5
ns1.02
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/oneAPI
2157030
nsdense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU
49180
ns48351
ns1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
2125
ns2083
ns1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
2167
ns2166
ns1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
2208
ns2209
ns1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
2125
ns2084
ns1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA
230420
ns217524
ns1.06
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/oneAPI
10946869
nsdense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU
182202
ns178702
ns1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
9208
ns8542
ns1.08
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
8666.5
ns9229.5
ns0.94
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
9917
ns11042
ns0.90
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
8792
ns8042
ns1.09
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
100396.5
ns92171
ns1.09
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI
3318002.5
nsgroupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU
75271
ns76060.5
ns0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
17229.5
ns19125
ns0.90
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
18479.5
ns18895.5
ns0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
18625
ns19375
ns0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
18000
ns18458
ns0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
575393.5
ns534402.5
ns1.08
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI
16729549.5
nsgroupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU
385864
ns379154
ns1.02
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s)
500
ns500
ns1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s)
583
ns583
ns1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s)
583
ns542
ns1.08
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s)
583
ns500
ns1.17
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA
34044
ns33745.5
ns1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI
1236371
nsbatchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU
48691
ns45241
ns1.08
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
9625.5
ns9104
ns1.06
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
9541.5
ns9583
ns1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
9709
ns9187.5
ns1.06
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
8833.5
ns10042
ns0.88
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA
254859
ns242113
ns1.05
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI
19246352.5
nsbatchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU
375034
ns367124
ns1.02
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s)
397208
ns398958
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s)
287667
ns215291
ns1.34
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s)
215291
ns213750
ns1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s)
755625
ns756041
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA
112458
ns111898
ns1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/oneAPI
340204
nsdense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU
76851
ns77281
ns0.99
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s)
1468271
ns1396458
ns1.05
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s)
1130458
ns859875
ns1.31
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s)
858125
ns847958
ns1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s)
2440187.5
ns2356833.5
ns1.04
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA
199457
ns199002
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/oneAPI
9886202
nsdense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU
322043
ns322423
ns1.00
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
8021.5
ns7250
ns1.11
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
7875
ns7625.5
ns1.03
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
8750
ns9062.5
ns0.97
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
7125
ns7229
ns0.99
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
134916.5
ns126183.5
ns1.07
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI
5780710
nslayernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU
70255.5
ns57821
ns1.22
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
16917
ns16959
ns1.00
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
15042
ns14354.5
ns1.05
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
15979
ns14792
ns1.08
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
16000
ns15042
ns1.06
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
878404
ns851673
ns1.03
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI
41935612.5
nslayernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU
433994
ns420849.5
ns1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
28792
ns32959
ns0.87
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
25792
ns29083.5
ns0.89
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
28833.5
ns30875
ns0.93
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
30354.5
ns25770.5
ns1.18
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
183000.5
ns184566
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
7959277.5
nslayernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
115401
ns110921
ns1.04
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
112375
ns160875
ns0.70
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
144438
ns124458
ns1.16
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
105854.5
ns145396
ns0.73
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
150875
ns157729
ns0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
977911
ns1005586
ns0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
41813067
nslayernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
589736
ns576731
ns1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
74166
ns75875
ns0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
74604
ns75042
ns0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
77333
ns80959
ns0.96
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
76334
ns74437.5
ns1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
189045
ns190691
ns0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
7503392
nslayernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
128881
ns124242
ns1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
295667
ns300833
ns0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
307166
ns322542
ns0.95
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
300000
ns298292
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
276875.5
ns219396
ns1.26
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
986480
ns1023572
ns0.96
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
40933470
nslayernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
697017.5
ns692382
ns1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
13166.5
ns13000
ns1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
13229
ns13500
ns0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
14833.5
ns14833
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
13667
ns13208
ns1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
133538.5
ns136120
ns0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI
5773755.5
nslayernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU
236113
ns234302
ns1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
27000
ns27083.5
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
27500
ns26395.5
ns1.04
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
27187.5
ns27146
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
27438
ns27770.5
ns0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
917467.5
ns907766
ns1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI
39999839
nslayernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU
698258
ns693402
ns1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
11209
ns11500
ns0.97
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
11292
ns10875
ns1.04
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
13375
ns13249.5
ns1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
11083
ns11666
ns0.95
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
119722.5
ns119510.5
ns1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI
3349179
nsgroupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU
240142
ns240667.5
ns1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
23333
ns23021
ns1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
23084
ns23312.5
ns0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
24000
ns23917
ns1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
21958
ns22708
ns0.97
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
678230.5
ns664160.5
ns1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI
22343314.5
nsgroupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU
678857
ns675107
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
65021
ns66750
ns0.97
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
62875
ns63542
ns0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
68667
ns68709
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
66417
ns65000
ns1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
101393
ns101310
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
3400903
nsgroupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
236963
ns234673
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
477895.5
ns466062.5
ns1.03
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
476959
ns478625
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
468750
ns472875
ns0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
495833
ns518125
ns0.96
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
488817
ns484379
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
20464230
nsgroupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
715823
ns712597
ns1.00
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s)
7146
ns7479
ns0.96
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s)
8375
ns7687.5
ns1.09
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s)
8500
ns9958
ns0.85
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s)
7021
ns7667
ns0.92
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA
136539.5
ns134386
ns1.02
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI
5535345
nslayernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU
69291
ns57600
ns1.20
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
11458
ns15750
ns0.73
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
14500
ns16333
ns0.89
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
16125
ns15250
ns1.06
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
13416
ns15291
ns0.88
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA
886518
ns880162.5
ns1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI
37792827
nslayernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU
407319.5
ns398914
ns1.02
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s)
6154209
ns6151875
ns1.00
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s)
6370021
ns3226750
ns1.97
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s)
3225542
ns3223292
ns1.00
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s)
11912875
ns11913583
ns1.00
batchedmm(512, Bsize=4)/forward/GPU/CUDA
345647
ns350966
ns0.98
batchedmm(512, Bsize=4)/forward/GPU/oneAPI
49342806
nsbatchedmm(512, Bsize=4)/forward/GPU/AMDGPU
305758
ns302008
ns1.01
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s)
19108188
ns19126979
ns1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s)
19939624.5
ns11161229.5
ns1.79
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s)
11149250
ns11077916
ns1.01
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s)
36445875
ns36533646
ns1.00
batchedmm(512, Bsize=4)/zygote/GPU/CUDA
1059965
ns1006948.5
ns1.05
batchedmm(512, Bsize=4)/zygote/GPU/oneAPI
79558988
nsbatchedmm(512, Bsize=4)/zygote/GPU/AMDGPU
1166672
ns1127082
ns1.04
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s)
1000
ns1042
ns0.96
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s)
1000
ns1042
ns0.96
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s)
1083
ns1042
ns1.04
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s)
959
ns1000
ns0.96
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA
23689
ns23502
ns1.01
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/oneAPI
2151476.5
nsdense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU
209622
ns209393
ns1.00
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s)
3917
ns3958
ns0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s)
4041
ns4083
ns0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s)
4000
ns4041
ns0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s)
3916
ns3917
ns1.00
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA
274634
ns270232
ns1.02
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/oneAPI
10742838
nsdense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU
625596
ns623846
ns1.00
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 7292 ns | 7833 ns | 0.93 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 9000 ns | 8042 ns | 1.12 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 10250 ns | 9750 ns | 1.05 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 9062.5 ns | 7625 ns | 1.19 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 116615 ns | 116542 ns | 1.00 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI | 3546009 ns | | |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 69341 ns | 69700 ns | 0.99 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 12000 ns | 12375 ns | 0.97 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 12667 ns | 12458 ns | 1.02 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 12437.5 ns | 12917 ns | 0.96 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 12417 ns | 12292 ns | 1.01 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 605595 ns | 604932 ns | 1.00 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 22519876 ns | | |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 363803 ns | 357073.5 ns | 1.02 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 250 ns | 292 ns | 0.86 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 333 ns | 0.88 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 250 ns | 292 ns | 0.86 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA | 22597.5 ns | 22511.5 ns | 1.00 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/oneAPI | 2178291 ns | | |
| dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU | 48315.5 ns | 46531 ns | 1.04 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2834 ns | 3167 ns | 0.89 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2916 ns | 3166 ns | 0.92 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 3167 ns | 3333 ns | 0.95 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 3083 ns | 2875 ns | 1.07 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA | 194557 ns | 194011 ns | 1.00 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/oneAPI | 9614403 ns | | |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 170192 ns | 158126.5 ns | 1.08 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 11333 ns | 12125 ns | 0.93 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 11459 ns | 12333 ns | 0.93 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 13708 ns | 13708 ns | 1 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 12333 ns | 11937.5 ns | 1.03 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 115903.5 ns | 115429.5 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 3311083 ns | | |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 239372.5 ns | 237322 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 20792 ns | 22000 ns | 0.95 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 23500 ns | 24459 ns | 0.96 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 22395.5 ns | 23396 ns | 0.96 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 21458.5 ns | 21792 ns | 0.98 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 558538 ns | 554065.5 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 19541146 ns | | |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 657037 ns | 651546.5 ns | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4375 ns | 4375 ns | 1 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4167 ns | 4416 ns | 0.94 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4417 ns | 4291 ns | 1.03 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4375 ns | 4375 ns | 1 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA | 24750 ns | 24232 ns | 1.02 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/oneAPI | 2038545 ns | | |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU | 49870 ns | 48651 ns | 1.03 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16708 ns | 16208 ns | 1.03 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16167 ns | 16500 ns | 0.98 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16500 ns | 16042 ns | 1.03 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16667 ns | 16250 ns | 1.03 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA | 317514 ns | 316149 ns | 1.00 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/oneAPI | 12292699 ns | | |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 212047.5 ns | 208227 ns | 1.02 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 2083 ns | 2083 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 2125 ns | 2083 ns | 1.02 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 2125 ns | 2083 ns | 1.02 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 2083 ns | 2000 ns | 1.04 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 35083 ns | 34761 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 1184726 ns | | |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 206953 ns | 205252 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 17250 ns | 17937.5 ns | 0.96 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 18667 ns | 19271 ns | 0.97 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 19584 ns | 18584 ns | 1.05 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 20125 ns | 18375 ns | 1.10 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 284678 ns | 283100 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 20274746 ns | | |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 691617 ns | 682562.5 ns | 1.01 |
| batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 60292 ns | 59229.5 ns | 1.02 |
| batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 66792 ns | 60896 ns | 1.10 |
| batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 62000 ns | 60959 ns | 1.02 |
| batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 51125 ns | 53792 ns | 0.95 |
| batchedmm(16, Bsize=512)/forward/GPU/CUDA | 66448 ns | 66317 ns | 1.00 |
| batchedmm(16, Bsize=512)/forward/GPU/oneAPI | 87696389 ns | | |
| batchedmm(16, Bsize=512)/forward/GPU/AMDGPU | 117412 ns | 100931 ns | 1.16 |
| batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 198916 ns | 195625 ns | 1.02 |
| batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 167229 ns | 149417 ns | 1.12 |
| batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 141417 ns | 138292 ns | 1.02 |
| batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 300125 ns | 219291 ns | 1.37 |
| batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 209004 ns | 208292.5 ns | 1.00 |
| batchedmm(16, Bsize=512)/zygote/GPU/oneAPI | 147263909.5 ns | | |
| batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU | 620696.5 ns | 554746 ns | 1.12 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 82583 ns | 85062 ns | 0.97 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 140250 ns | 127458 ns | 1.10 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 86417 ns | 86104 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 116583 ns | 86812.5 ns | 1.34 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 191982.5 ns | 192707 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5863118 ns | | |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 203942 ns | 169152 ns | 1.21 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1921771 ns | 1926791.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1908917 ns | 1918312.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1919708 ns | 1895083 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1924521 ns | 1862750 ns | 1.03 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 504208.5 ns | 503729 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 26294676.5 ns | | |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1070976 ns | 915670 ns | 1.17 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 291 ns | 291 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 21855 ns | 21463.5 ns | 1.02 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/oneAPI | 2006228 ns | | |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU | 41700 ns | 41990 ns | 0.99 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1792 ns | 1792 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1875 ns | 1833 ns | 1.02 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1833 ns | 1833 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1792 ns | 1834 ns | 0.98 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 242053 ns | 244422 ns | 0.99 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/oneAPI | 10350039 ns | | |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU | 183192 ns | 183082 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 9833 ns | 11375 ns | 0.86 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 9833 ns | 10292 ns | 0.96 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 11709 ns | 12166 ns | 0.96 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 10583 ns | 9084 ns | 1.17 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 116639.5 ns | 113574.5 ns | 1.03 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI | 3403003.5 ns | | |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 238567.5 ns | 237182 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8875 ns | 9583 ns | 0.93 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 10875 ns | 12396 ns | 0.88 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 10125 ns | 10750 ns | 0.94 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9500 ns | 9458 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 488952 ns | 489512 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 20132943 ns | | |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 630866 ns | 632057 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57875 ns | 57959 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46958 ns | 39208 ns | 1.20 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 39625 ns | 38708 ns | 1.02 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82250 ns | 83375 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 38551 ns | 38522 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1316937 ns | | |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 79411 ns | 78311 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1922646 ns | 1724708.5 ns | 1.11 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1979292 ns | 1941208 ns | 1.02 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1942292 ns | 1947834 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1900917 ns | 1891208.5 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 210456 ns | 210148.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 33978774 ns | | |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1015680 ns | 998640 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 267333 ns | 269083 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 269625 ns | 268833 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 270729.5 ns | 275875 ns | 0.98 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 269645.5 ns | 269729.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 192987.5 ns | 193164 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7844239 ns | | |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 285143 ns | 282737.5 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 698604 ns | 587166.5 ns | 1.19 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 671916.5 ns | 614875 ns | 1.09 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 667416 ns | 651500 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 626771 ns | 652062 ns | 0.96 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 985897 ns | 993619.5 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 45574369 ns | | |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 913670 ns | 899480 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2218667 ns | 2202416 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2215687 ns | 2216125 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2220312.5 ns | 2192812.5 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2213250 ns | 2220500 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 157769 ns | 179761.5 ns | 0.88 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8237698 ns | | |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 425304 ns | 415294 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5486562 ns | 5520708 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5529917 ns | 5537000 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5524333.5 ns | 5449958.5 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5488625 ns | 5515167 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 927722 ns | 930917 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 53249072 ns | | |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1555466 ns | 1711728 ns | 0.91 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 478042 ns | 477542 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 346167 ns | 257375 ns | 1.34 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 257167 ns | 255375 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 909250 ns | 908666 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 46497 ns | 46830 ns | 0.99 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/oneAPI | 825183 ns | | |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 245473 ns | 245313 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 2167292 ns | 2116979 ns | 1.02 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1862208 ns | 1589770.5 ns | 1.17 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1591771 ns | 1579645.5 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 3122542 ns | 3037833.5 ns | 1.03 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 255431 ns | 274670.5 ns | 0.93 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/oneAPI | 12961347 ns | | |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 773598 ns | 769148 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57520.5 ns | 57875 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46708 ns | 39000 ns | 1.20 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 39292 ns | 38458 ns | 1.02 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82500 ns | 83333 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 28213 ns | 28067 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1370930 ns | | |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 76011 ns | 75041 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2032125 ns | 2047334 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2090250 ns | 2049854.5 ns | 1.02 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2068583 ns | 2059333 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1997000 ns | 1987666.5 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 223132 ns | 227893 ns | 0.98 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 35910018 ns | | |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1194083 ns | 1038901 ns | 1.15 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57812.5 ns | 58000 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46708 ns | 39333 ns | 1.19 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 39583 ns | 38333 ns | 1.03 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82375 ns | 83125 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 48361 ns | 48807.5 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 762273.5 ns | | |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 80795.5 ns | 67171 ns | 1.20 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1928084 ns | 1934875 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1964958 ns | 1962667 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1966541.5 ns | 1938167 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1886625 ns | 1827396 ns | 1.03 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 230366 ns | 233324 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 16959659 ns | | |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 920174 ns | 914834.5 ns | 1.01 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 333 ns | 292 ns | 1.14 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 333 ns | 0.88 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 33705 ns | 34314.5 ns | 0.98 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 1253501.5 ns | | |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 45940 ns | 45171 ns | 1.02 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6646 ns | 6542 ns | 1.02 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7395.5 ns | 7083 ns | 1.04 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7292 ns | 7000 ns | 1.04 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6417 ns | 6958 ns | 0.92 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 201838.5 ns | 202653 ns | 1.00 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 21257580 ns | | |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 371664 ns | 366114 ns | 1.02 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 250 ns | 250 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 250 ns | 292 ns | 0.86 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 32336 ns | 32763 ns | 0.99 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/oneAPI | 1213220 ns | | |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU | 37120 ns | 38131 ns | 0.97 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 3292 ns | 2792 ns | 1.18 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 3000 ns | 3000 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 3125 ns | 3459 ns | 0.90 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 2666 ns | 2875 ns | 0.93 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 182468 ns | 184852 ns | 0.99 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/oneAPI | 7479362 ns | | |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU | 151261 ns | 151962 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 502687.5 ns | 494188 ns | 1.02 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 491916.5 ns | 500333.5 ns | 0.98 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 465083.5 ns | 470041.5 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 498417 ns | 489437 ns | 1.02 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 134412 ns | 134801.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5713043 ns | | |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 367259 ns | 322243 ns | 1.14 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4072041 ns | 4053479 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4093021 ns | 4072375 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4069979 ns | 4033500 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4043667 ns | 4070625 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 669547 ns | 680027 ns | 0.98 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 34596141 ns | | |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1474565 ns | 1463545 ns | 1.01 |
| batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 49859062 ns | 49933854 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 35504667 ns | 26023000 ns | 1.36 |
| batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 26029000 ns | 25982541.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 96942959 ns | 97045646 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1621240 ns | 1626445 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/GPU/oneAPI | 55961032 ns | | |
| batchedmm(512, Bsize=32)/forward/GPU/AMDGPU | 1046111 ns | 1047410 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 154467896 ns | 155000104.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 112182625 ns | 89050542 ns | 1.26 |
| batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 89208292 ns | 88666916.5 ns | 1.01 |
| batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 294884062.5 ns | 295479666.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6486949 ns | 6477658 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/GPU/oneAPI | 128111295 ns | | |
| batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU | 5579662.5 ns | 5560101.5 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 19541 ns | 20062.5 ns | 0.97 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 18625 ns | 15500 ns | 1.20 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 13917 ns | 13833.5 ns | 1.01 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 15458.5 ns | 15708.5 ns | 0.98 |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 20271 ns | 20427 ns | 0.99 |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/oneAPI | 1104775.5 ns | | |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU | 26071 ns | 25781 ns | 1.01 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 10729.5 ns | 11063 ns | 0.97 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 9000 ns | 7895.5 ns | 1.14 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 8125 ns | 7937.5 ns | 1.02 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 17291 ns | 17375 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 244379 ns | 248558 ns | 0.98 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/oneAPI | 10081500 ns | | |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU | 148582 ns | 143922 ns | 1.03 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 8374.5 ns | 8417 ns | 0.99 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 8750 ns | 10229 ns | 0.86 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 10833 ns | 10375 ns | 1.04 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 9104.5 ns | 8646 ns | 1.05 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 120247 ns | 119635 ns | 1.01 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI | 3746738 ns | | |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 239122.5 ns | 239173 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9437.5 ns | 10041.5 ns | 0.94 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9708 ns | 10667 ns | 0.91 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 11792 ns | 10750 ns | 1.10 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9500 ns | 10145.5 ns | 0.94 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 585732.5 ns | 591757 ns | 0.99 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 22572008 ns | | |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 659212 ns | 654107 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 9083.5 ns | 10375 ns | 0.88 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 9833.5 ns | 9770.5 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 10375 ns | 11312.5 ns | 0.92 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 9438 ns | 9500 ns | 0.99 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 116564 ns | 117527.5 ns | 0.99 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI | 3425324 ns | | |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 75361 ns | 72401 ns | 1.04 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 13958 ns | 14292 ns | 0.98 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 13291.5 ns | 17708 ns | 0.75 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 16625 ns | 14834 ns | 1.12 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 13750 ns | 14750 ns | 0.93 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 556648.5 ns | 562161 ns | 0.99 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 19935565.5 ns | | |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 351184 ns | 345113 ns | 1.02 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 500 ns | 500 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 625 ns | 625 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 625 ns | 625 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 500 ns | 500 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 33504 ns | 34287 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI | 1200134 ns | | |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 207882 ns | 207072 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7542 ns | 8625 ns | 0.87 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7958 ns | 9667 ns | 0.82 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9542 ns | 8667 ns | 1.10 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7625 ns | 8687.5 ns | 0.88 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 223084.5 ns | 224465.5 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 21568038 ns | | |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 665587 ns | 658996 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 17958 ns | 17292 ns | 1.04 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 17584 ns | 13771 ns | 1.28 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 13334 ns | 12458.5 ns | 1.07 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 10833.5 ns | 10770.5 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 20393 ns | 20290 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/oneAPI | 1168335 ns | | |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 191442 ns | 186982 ns | 1.02 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 35542 ns | 35625 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 35583 ns | 35625 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 36208 ns | 35834 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 35500 ns | 35666 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 258577 ns | 261247.5 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/oneAPI | 11381817 ns | | |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 591656 ns | 589266 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 511813 ns | 450208 ns | 1.14 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 447292 ns | 494583.5 ns | 0.90 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 456792 ns | 456791.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 517125 ns | 461833 ns | 1.12 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 194619 ns | 194699 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5685561 ns | | |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 368453.5 ns | 360324 ns | 1.02 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4055479 ns | 4069833 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4065479.5 ns | 4063479 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4057292 ns | 4038041.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4051125 ns | 4038167 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 506270 ns | 514235 ns | 0.98 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 28041384.5 ns | | |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1368029 ns | 1354948.5 ns | 1.01 |
| batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 786875042 ns | 788948625 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 540385750 ns | 416422208.5 ns | 1.30 |
| batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 417627729 ns | 415183312.5 ns | 1.01 |
| batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 1558687604 ns | 1509932250 ns | 1.03 |
| batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22789985.5 ns | 22522291.5 ns | 1.01 |
| batchedmm(512, Bsize=512)/forward/GPU/oneAPI | 176484643 ns | | |
| batchedmm(512, Bsize=512)/forward/GPU/AMDGPU | 14667995.5 ns | 14572928 ns | 1.01 |
| batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 2512454792 ns | 2530024250 ns | 0.99 |
| batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 1772086292 ns | 1506878542 ns | 1.18 |
| batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 1545039084 ns | 1519381125 ns | 1.02 |
| batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 6322382417 ns | 4752439166 ns | 1.33 |
| batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 118300758 ns | 118941901 ns | 0.99 |
| batchedmm(512, Bsize=512)/zygote/GPU/oneAPI | 918719991.5 ns | | |
| batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU | 87803948.5 ns | 87857404.5 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 76458.5 ns | 77417 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 76958 ns | 77625 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 78437 ns | 79500 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 76937.5 ns | 76875 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 191503.5 ns | 194658.5 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 8039760 ns | | |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 106691 ns | 106561 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 279042 ns | 284458 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 208625 ns | 286188 ns | 0.73 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 282125 ns | 197750 ns | 1.43 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 196250 ns | 192708 ns | 1.02 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 989645.5 ns | 1005733 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 44408111.5 ns | | |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 636782 ns | 630306 ns | 1.01 |
| batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 199893333 ns | 199829146 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 139025625 ns | 104009479.5 ns | 1.34 |
| batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 104051042 ns | 103995667 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 388708625 ns | 389216083 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5839621 ns | 5833781 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/GPU/oneAPI | 79074303 ns | | |
| batchedmm(512, Bsize=128)/forward/GPU/AMDGPU | 3603877.5 ns | 3615787 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 619152625 ns | 620952291.5 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 439143666 ns | 354227354.5 ns | 1.24 |
| batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 353463000 ns | 354977104.5 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 1177182375 ns | 1182226250 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 26537180.5 ns | 26559529 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/GPU/oneAPI | 276530657.5 ns | | |
| batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU | 22057437 ns | 21846736 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7291 ns | 7167 ns | 1.02 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6167 ns | 5375 ns | 1.15 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5375 ns | 5250 ns | 1.02 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 9792 ns | 10292 ns | 0.95 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 26296 ns | 27179 ns | 0.97 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1196971 ns | | |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 46670 ns | 48210 ns | 0.97 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 212500 ns | 212666.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 219917 ns | 222542 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 223521 ns | 221917 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 208917 ns | 206167 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 213879 ns | 217340.5 ns | 0.98 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 20926055 ns | | |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 531735 ns | 523165 ns | 1.02 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 8104 ns | 8708 ns | 0.93 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 8709 ns | 8958 ns | 0.97 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 10791.5 ns | 10667 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 9229 ns | 8813 ns | 1.05 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 112861.5 ns | 115467 ns | 0.98 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI | 3389305 ns | | |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 73211 ns | 73431 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7542 ns | 7584 ns | 0.99 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7542 ns | 11521 ns | 0.65 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 10229.5 ns | 8542 ns | 1.20 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7834 ns | 8062.5 ns | 0.97 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 490362 ns | 494404 ns | 0.99 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 19246537 ns | | |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 323133 ns | 316873 ns | 1.02 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 458 ns | 500 ns | 0.92 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 500 ns | 708 ns | 0.71 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 708 ns | 708 ns | 1 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 459 ns | 583 ns | 0.79 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 24659 ns | 25358 ns | 0.97 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI | 1256249 ns | | |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 48770 ns | 47920 ns | 1.02 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9250 ns | 9250 ns | 1 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 8479.5 ns | 11396 ns | 0.74 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 12291 ns | 10875 ns | 1.13 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9083 ns | 9750 ns | 0.93 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 245415 ns | 246651 ns | 0.99 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI | 24116959 ns | | |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 395734 ns | 388584 ns | 1.02 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 112500.5 ns | 110834 ns | 1.02 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 103271 ns | 87791 ns | 1.18 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 88333 ns | 87792 ns | 1.01 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 154625 ns | 154959 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA | 23200 ns | 23405 ns | 0.99 |
| bias_activation(512, act=gelu)(512 x 128)/forward/GPU/oneAPI | 818562 ns | | |
| bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 193152 ns | 189432 ns | 1.02 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 578000 ns | 539625 ns | 1.07 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 534875 ns | 562458 ns | 0.95 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 548917 ns | 535812.5 ns | 1.02 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 535333 ns | 535000 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA | 215198 ns | 220513 ns | 0.98 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/oneAPI | 11436046 ns | | |
| bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 610641.5 ns | 604586.5 ns | 1.01 |
| batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 5000 ns | 5354 ns | 0.93 |
| batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 5416.5 ns | 7042 ns | 0.77 |
| batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 7604.5 ns | 8229.5 ns | 0.92 |
| batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 6625 ns | 6541 ns | 1.01 |
| batchedmm(16, Bsize=32)/forward/GPU/CUDA | 17413 ns | 17715 ns | 0.98 |
| batchedmm(16, Bsize=32)/forward/GPU/oneAPI | 72455521 ns | | |
| batchedmm(16, Bsize=32)/forward/GPU/AMDGPU | 80361 ns | 71815.5 ns | 1.12 |
| batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 11792 ns | 11750 ns | 1.00 |
| batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 10791.5 ns | 11459 ns | 0.94 |
| batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 11208 ns | 10792 ns | 1.04 |
| batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 17000 ns | 17125 ns | 0.99 |
| batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 203659.5 ns | 206057.5 ns | 0.99 |
| batchedmm(16, Bsize=32)/zygote/GPU/oneAPI | 98210292 ns | | |
| batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU | 381654 ns | 379023.5 ns | 1.01 |
| batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 39542 ns | 39250 ns | 1.01 |
| batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 51459 ns | 51250 ns | 1.00 |
| batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 51333 ns | 50583 ns | 1.01 |
| batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 13520.5 ns | 13750 ns | 0.98 |
| batchedmm(16, Bsize=128)/forward/GPU/CUDA | 19998 ns | 21128.5 ns | 0.95 |
| batchedmm(16, Bsize=128)/forward/GPU/oneAPI | 76386107.5 ns | | |
| batchedmm(16, Bsize=128)/forward/GPU/AMDGPU | 89551 ns | 84216 ns | 1.06 |
| batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 36229.5 ns | 36208 ns | 1.00 |
| batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 31458 ns | 30584 ns | 1.03 |
| batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 30250 ns | 29250 ns | 1.03 |
| batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 57167 ns | 57375 ns | 1.00 |
| batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 180703 ns | 184668 ns | 0.98 |
| batchedmm(16, Bsize=128)/zygote/GPU/oneAPI | 112491463 ns | | |
| batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU | 412909.5 ns | 414734 ns | 1.00 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 1791 ns | 1583.5 ns | 1.13 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 1875 ns | 2000 ns | 0.94 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 2125 ns | 2187 ns | 0.97 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 1813 ns | 1833.5 ns | 0.99 |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 19867 ns | 19835 ns | 1.00 |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/oneAPI | 1142759 ns | | |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU | 34540 ns | 25650 ns | 1.35 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 2042 ns | 2292 ns | 0.89 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 2167 ns | 2459 ns | 0.88 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 2500 ns | 2458 ns | 1.02 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 2062.5 ns | 2187.5 ns | 0.94 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 193884 ns | 197459.5 ns | 0.98 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/oneAPI | 9110958 ns | | |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU | 138796.5 ns | 134722 ns | 1.03 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 5791 ns | 5021 ns | 1.15 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4916 ns | 5167 ns | 0.95 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6312.5 ns | 5500 ns | 1.15 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4937.5 ns | 5959 ns | 0.83 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 140483 ns | 141255 ns | 0.99 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI | 5688843 ns | | |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 70765.5 ns | 59291 ns | 1.19 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8375 ns | 8396 ns | 1.00 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8292 ns | 9208 ns | 0.90 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9917 ns | 9791 ns | 1.01 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8291 ns | 8375 ns | 0.99 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 811929.5 ns | 823637 ns | 0.99 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI | 40105318 ns | | |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 393874 ns | 383144 ns | 1.03 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 55000 ns | 54917 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 55833 ns | 54291 ns | 1.03 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 54292 ns | 54250 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 56167 ns | 56541 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 36588.5 ns | 37246 ns | 0.98 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1189517 ns | | |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 206632.5 ns | 204842 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 486646 ns | 477000 ns | 1.02 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 497020.5 ns | 496604 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 505500 ns | 494271 ns | 1.02 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 504479.5 ns | 467792 ns | 1.08 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 256235 ns | 259843 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 27551860 ns | | |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 837064 ns | 794468 ns | 1.05 |
| batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 3311209 ns | 3306791 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 2324917 ns | 1761916 ns | 1.32 |
| batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 1764917 ns | 1756167 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 6305667 ns | 6310604.5 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/GPU/CUDA | 204534 ns | 205873.5 ns | 0.99 |
| batchedmm(128, Bsize=128)/forward/GPU/oneAPI | 77630538 ns | | |
| batchedmm(128, Bsize=128)/forward/GPU/AMDGPU | 220612.5 ns | 214142 ns | 1.03 |
| batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 11424750.5 ns | 11469395.5 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 8337875 ns | 6567229 ns | 1.27 |
| batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 6554562.5 ns | 6474021 ns | 1.01 |
| batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 21046187.5 ns | 21232020.5 ns | 0.99 |
| batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 736592 ns | 743103.5 ns | 0.99 |
| batchedmm(128, Bsize=128)/zygote/GPU/oneAPI | 121665223 ns | | |
| batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU | 1067736 ns | 1064100 ns | 1.00 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 6375 ns | 7125 ns | 0.89 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5146 ns | 4791 ns | 1.07 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7333 ns | 7042 ns | 1.04 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 4917 ns | 5333 ns | 0.92 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 130414 ns | 130642.5 ns | 1.00 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 5600903.5 ns | | |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 56000 ns | 55570 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7500 ns | 7333 ns | 1.02 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7104.5 ns | 8500 ns | 0.84 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7833 ns | 7500 ns | 1.04 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6917 ns | 7625 ns | 0.91 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 716948.5 ns | 721790 ns | 0.99 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 34048818 ns | | |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 377284 ns | 371284 ns | 1.02 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 100375 ns | 124000 ns | 0.81 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 98042 ns | 105458 ns | 0.93 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 101229 ns | 100416.5 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 121958 ns | 93688 ns | 1.30 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 148678 ns | 149649.5 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5976414.5 ns | | |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 203162 ns | 203312 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2025979.5 ns | 2020750 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2023750 ns | 2021041 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2027979 ns | 1993771 ns | 1.02 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2028208 ns | 2025000 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 667124 ns | 676279 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 32503605.5 ns | | |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1113981 ns | 1107011 ns | 1.01 |
| batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 34896 ns | 33958.5 ns | 1.03 |
| batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 36541.5 ns | 34334 ns | 1.06 |
| batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 33000 ns | 32584 ns | 1.01 |
| batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 667 ns | 708 ns | 0.94 |
| batchedmm(2, Bsize=4)/forward/GPU/CUDA | 15608 ns | 16105 ns | 0.97 |
| batchedmm(2, Bsize=4)/forward/GPU/oneAPI | 72119754.5 ns | | |
| batchedmm(2, Bsize=4)/forward/GPU/AMDGPU | 83761 ns | 78881 ns | 1.06 |
| batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 2542 ns | 2479.5 ns | 1.03 |
| batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 2875 ns | 4000 ns | 0.72 |
| batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 3042 ns | 3125 ns | 0.97 |
| batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 2125 ns | 2292 ns | 0.93 |
| batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 136848 ns | 139246 ns | 0.98 |
| batchedmm(2, Bsize=4)/zygote/GPU/oneAPI | 92906510 ns | | |
| batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU | 357139 ns | 352743.5 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7250 ns | 7209 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6000 ns | 5417 ns | 1.11 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5417 ns | 5291 ns | 1.02 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 9875 ns | 10083 ns | 0.98 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 35691 ns | 36300 ns | 0.98 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1119535 ns | | |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 49751 ns | 49595.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 239895.5 ns | 217854 ns | 1.10 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 219708 ns | 222916.5 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 222104 ns | 220604.5 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 206166 ns | 206125 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 239376 ns | 241210 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 27974510.5 ns | | |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 574776 ns | 515535 ns | 1.11 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3958 ns | 3917 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3750 ns | 3958 ns | 0.95 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3958 ns | 3917 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3958 ns | 3958 ns | 1 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 22068 ns | 22201 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/oneAPI | 2145282 ns | | |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU | 42250 ns | 41991 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14958 ns | 14708 ns | 1.02 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14541 ns | 14708 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14750 ns | 14750 ns | 1 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14875 ns | 14708 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 298530 ns | 301554 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/oneAPI | 11632418 ns | | |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU | 196947 ns | 195902 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 145083 ns | 116166.5 ns | 1.25 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 103646 ns | 130416 ns | 0.79 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 105729.5 ns | 104479 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 113042 ns | 105250 ns | 1.07 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 132784 ns | 135232 ns | 0.98 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 6087845 ns | | |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 204547 ns | 169232 ns | 1.21 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1918083 ns | 1928583 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1923042 ns | 1925875 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1921375 ns | 1895041.5 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1925292 ns | 1745875 ns | 1.10 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 658916 ns | 664669 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 30625432 ns | | |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1069806 ns | 1220022.5 ns | 0.88 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 20959 ns | 18583 ns | 1.13 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17979.5 ns | 18792 ns | 0.96 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 22125 ns | 22250 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 18125 ns | 18250 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 104444.5 ns | 107671 ns | 0.97 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3374722 ns | | |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 81701 ns | 77341 ns | 1.06 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 229875 ns | 216667 ns | 1.06 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 223646 ns | 216667 ns | 1.03 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 218125.5 ns | 217812.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 225125 ns | 227125 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 492479 ns | 497386 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 19457097 ns | | |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 483554.5 ns | 470184 ns | 1.03 |
| batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 27374.5 ns | 26145.5 ns | 1.05 |
| batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 31063 ns | 28562 ns | 1.09 |
| batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 26708 ns | 26792 ns | 1.00 |
| batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 1458 ns | 1458 ns | 1 |
| batchedmm(16, Bsize=4)/forward/GPU/CUDA | 15690 ns | 16337 ns | 0.96 |
| batchedmm(16, Bsize=4)/forward/GPU/oneAPI | 73206765 ns | | |
| batchedmm(16, Bsize=4)/forward/GPU/AMDGPU | 89171 ns | 86810 ns | 1.03 |
| batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 4875 ns | 4875 ns | 1 |
| batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 4896 ns | 5104 ns | 0.96 |
| batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 5250 ns | 5333 ns | 0.98 |
| batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 4542 ns | 4833 ns | 0.94 |
| batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 200612 ns | 203656 ns | 0.99 |
| batchedmm(16, Bsize=4)/zygote/GPU/oneAPI | 94501114 ns | | |
| batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU | 394774 ns | 391324 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 221875 ns | 222125 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 223209 ns | 222583 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 225917 ns | 226333 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 223750 ns | 223333 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 216221 ns | 222346 ns | 0.97 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7634874 ns | | |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 277862 ns | 273793 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 535958 ns | 500833 ns | 1.07 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 499104 ns | 504334 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 510167 ns | 498167 ns | 1.02 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 508166 ns | 497542 ns | 1.02 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1024022 ns | 1053089 ns | 0.97 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 45569833 ns | | |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 864044 ns | 851353.5 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 25166 ns | 20667 ns | 1.22 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 20166.5 ns | 20313 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 21750 ns | 23083 ns | 0.94 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 19167 ns | 20000 ns | 0.96 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 111455.5 ns | 113758.5 ns | 0.98 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3479193 ns | | |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 78821 ns | 79011 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 245354 ns | 213084 ns | 1.15 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 223375 ns | 213541 ns | 1.05 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 225417 ns | 214291 ns | 1.05 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 218541 ns | 215500 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 707911 ns | 724087 ns | 0.98 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 25617389 ns | | |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 538875 ns | 538870.5 ns | 1.00 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 7125 ns | 6666 ns | 1.07 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6250 ns | 6666.5 ns | 0.94 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 8666 ns | 9125 ns | 0.95 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6458 ns | 6584 ns | 0.98 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 132297.5 ns | 134050 ns | 0.99 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 5594794 ns | | |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 67671 ns | 67330 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10583 ns | 10875 ns | 0.97 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 10250 ns | 10603.5 ns | 0.97 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10958 ns | 10584 ns | 1.04 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10875 ns | 10750 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 778959.5 ns | 782883 ns | 0.99 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 37279902 ns | | |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 393784 ns | 386274 ns | 1.02 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5250 ns | 5000 ns | 1.05 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 6167 ns | 4625 ns | 1.33 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7583 ns | 6541 ns | 1.16 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5208 ns | 6375 ns | 0.82 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 134141.5 ns | 136660 ns | 0.98 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI | 5548829 ns | | |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 69361 ns | 58460 ns | 1.19 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7834 ns | 7667 ns | 1.02 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7667 ns | 7916.5 ns | 0.97 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8125 ns | 7750 ns | 1.05 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7458 ns | 7750 ns | 0.96 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 742994 ns | 747431 ns | 0.99 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 37148580 ns | | |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 400934 ns | 392653 ns | 1.02 |
| batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 14518042 ns | 14573000 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 10053875 ns | 7702333.5 ns | 1.31 |
| batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 7724104 ns | 7661229.5 ns | 1.01 |
| batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 27741083 ns | 27919750 ns | 0.99 |
| batchedmm(128, Bsize=512)/forward/GPU/CUDA | 554321.5 ns | 552572 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/GPU/oneAPI | 94275820 ns | | |
| batchedmm(128, Bsize=512)/forward/GPU/AMDGPU | 399814.5 ns | 402049 ns | 0.99 |
| batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 46185458.5 ns | 46551750 ns | 0.99 |
| batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 33419604 ns | 26549208 ns | 1.26 |
| batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 26602708.5 ns | 26263166.5 ns | 1.01 |
| batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 85208959 ns | 85671542 ns | 0.99 |
| batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2813842 ns | 3391019 ns | 0.83 |
| batchedmm(128, Bsize=512)/zygote/GPU/oneAPI | 194819687 ns | | |
| batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU | 3323814 ns | 3300103 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 69583 ns | 67042 ns | 1.04 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 66979 ns | 67375 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 70292 ns | 70583 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 67625 ns | 68291 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 102627 ns | 103426.5 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3515302.5 ns | | |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 232062 ns | 229352.5 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 520062.5 ns | 468625 ns | 1.11 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 473208 ns | 497666.5 ns | 0.95 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 482063 ns | 469292 ns | 1.03 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 474708 ns | 468500 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 703393 ns | 709808.5 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 26797269 ns | | |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 793873 ns | 786728 ns | 1.01 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 542 ns | 0.92 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 625 ns | 584 ns | 1.07 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 584 ns | 1.07 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 541 ns | 583 ns | 0.93 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 31962 ns | 32664 ns | 0.98 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI | 1180122 ns | | |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 47320 ns | 47181 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 8583 ns | 8833 ns | 0.97 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9583.5 ns | 9750 ns | 0.98 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9541 ns | 9708 ns | 0.98 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9667 ns | 9792 ns | 0.99 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 278738.5 ns | 281049 ns | 0.99 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 21728099.5 ns | | |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 381274 ns | 373464 ns | 1.02 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 9666 ns | 9666 ns | 1 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 9459 ns | 9708 ns | 0.97 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 9667 ns | 9625 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 9666 ns | 9666 ns | 1 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 23100 ns | 23531 ns | 0.98 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/oneAPI | 2057483 ns | | |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 212922 ns | 211602 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 50458 ns | 50250 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 50875 ns | 50250 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 50375 ns | 50125 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 50209 ns | 50167 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 273986 ns | 276186.5 ns | 0.99 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/oneAPI | 11648854 ns | | |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 610646 ns | 603776 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 54917 ns | 54916 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 55708 ns | 54333 ns | 1.03 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 54292 ns | 54292 ns | 1 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 55875 ns | 56125 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 27572 ns | 28315 ns | 0.97 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1222185 ns | | |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 206592 ns | 204202 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 522166 ns | 515312.5 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 504250 ns | 495208 ns | 1.02 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 503500 ns | 494875 ns | 1.02 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 472833.5 ns | 465271 ns | 1.02 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 236683 ns | 238356 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 32890414.5 ns | | |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 889849 ns | 843049 ns | 1.06 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 653833 ns | 657146 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 639812.5 ns | 678750 ns | 0.94 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 654166.5 ns | 625021 ns | 1.05 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 643729 ns | 649917 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 186765 ns | 189901 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8191594 ns | | |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 303073 ns | 230582 ns | 1.31 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2228375 ns | 2239292 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2240916.5 ns | 2249895.5 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2265312.5 ns | 2176354.5 ns | 1.04 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2228084 ns | 2265625 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 907493 ns | 926422 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 49570533.5 ns | | |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1227082.5 ns | 1211101.5 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 22083 ns | 21083 ns | 1.05 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 21333 ns | 22187.5 ns | 0.96 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 21416.5 ns | 23666 ns | 0.90 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 20208 ns | 19959 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 108981.5 ns | 112183.5 ns | 0.97 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3615898 ns | | |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 81661 ns | 81261 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 232104.5 ns | 254333 ns | 0.91 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 222250 ns | 220666 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 228583 ns | 220750 ns | 1.04 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 259708 ns | 226708 ns | 1.15 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 701359 ns | 705957 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 27641264 ns | | |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 557775.5 ns | 548680 ns | 1.02 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 500 ns | 1 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 625 ns | 583 ns | 1.07 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 625 ns | 1 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 541 ns | 583 ns | 0.93 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 22562 ns | 23346 ns | 0.97 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI | 1174965 ns | | |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 48641 ns | 47671 ns | 1.02 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9896 ns | 9500 ns | 1.04 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 10166 ns | 9917 ns | 1.03 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9979.5 ns | 9959 ns | 1.00 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 10646 ns | 10083 ns | 1.06 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 259541 ns | 260912 ns | 0.99 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 25096956 ns | | |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 406314 ns | 400874 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 10000 ns | 10500 ns | 0.95 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 8875 ns | 8895.5 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 10333 ns | 11625 ns | 0.89 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 9625 ns | 8750 ns | 1.10 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 114946 ns | 116855 ns | 0.98 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 3356422 ns | | |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 75001 ns | 67861 ns | 1.11 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7312.5 ns | 7687.5 ns | 0.95 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7833 ns | 8000 ns | 0.98 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7833 ns | 7875 ns | 0.99 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7645.5 ns | 7812.5 ns | 0.98 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 479855 ns | 481589 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 17554055 ns | | |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 327064 ns | 324483 ns | 1.01 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1375 ns | 1666 ns | 0.83 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1834 ns | 2042 ns | 0.90 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 2125 ns | 2104.5 ns | 1.01 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1708 ns | 1459 ns | 1.17 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 19733 ns | 19805 ns | 1.00 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/oneAPI | 1143637.5 ns | | |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 192542 ns | 190981 ns | 1.01 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 3542 ns | 3520.5 ns | 1.01 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 3584 ns | 3792 ns | 0.95 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 3875 ns | 3854.5 ns | 1.01 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 3500 ns | 3583 ns | 0.98 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 210034.5 ns | 211153.5 ns | 0.99 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 10599117 ns | | |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 584616 ns | 578046 ns | 1.01 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 148333.5 ns | 147645.5 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 129000 ns | 106542 ns | 1.21 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 107396 ns | 106708.5 ns | 1.01 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 233604.5 ns | 225875 ns | 1.03 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 23312 ns | 23334 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/oneAPI | 1181923 ns | | |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU | 41095.5 ns | 35995.5 ns | 1.14 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 161208.5 ns | 144708 ns | 1.11 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 140708 ns | 104000 ns | 1.35 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 104000 ns | 87625 ns | 1.19 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 259375 ns | 252562.5 ns | 1.03 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 208046 ns | 210178 ns | 0.99 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/oneAPI | 11091691.5 ns | | |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU | 267983 ns | 230212 ns | 1.16 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7270.5 ns | 7125 ns | 1.02 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5959 ns | 5375 ns | 1.11 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5333 ns | 5292 ns | 1.01 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 9959 ns | 10250 ns | 0.97 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 32872 ns | 33945.5 ns | 0.97 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1199319 ns | | |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 50331 ns | 49690 ns | 1.01 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 258729 ns | 219375 ns | 1.18 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 234500 ns | 260458 ns | 0.90 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 238125 ns | 228500.5 ns | 1.04 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 253021 ns | 222499.5 ns | 1.14 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 256256.5 ns | 257172 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 27890996 ns | | |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 595296 ns | 523825 ns | 1.14 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 13000 ns | 13625 ns | 0.95 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 12396 ns | 13479 ns | 0.92 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 14500 ns | 15125 ns | 0.96 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 12500 ns | 13333 ns | 0.94 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 131871 ns | 132277 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 5626771 ns | | |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 236102 ns | 234872 ns | 1.01 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 23854.5 ns | 24084 ns | 0.99 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 24500 ns | 23645.5 ns | 1.04 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 25187.5 ns | 24708.5 ns | 1.02 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 24750 ns | 24459 ns | 1.01 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 821231 ns | 830067.5 ns | 0.99 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 40073814 ns | | |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 689137 ns | 681347 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 9167 ns | 9792 ns | 0.94 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 9834 ns | 10063 ns | 0.98 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 11417 ns | 11375 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 8999.5 ns | 9291.5 ns | 0.97 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 119274.5 ns | 120374.5 ns | 0.99 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI | 3523753.5 ns | | |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 76811 ns | 73601 ns | 1.04 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14083 ns | 14541 ns | 0.97 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14166.5 ns | 14813 ns | 0.96 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 15104 ns | 14812.5 ns | 1.02 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14083 ns | 14875 ns | 0.95 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 630553.5 ns | 637361.5 ns | 0.99 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI | 21897908 ns | | |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 373463 ns | 368293 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 9021 ns | 10333 ns | 0.87 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 9875 ns | 9687.5 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 11250 ns | 12041.5 ns | 0.93 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 9750 ns | 10125.5 ns | 0.96 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 117966.5 ns | 119012 ns | 0.99 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 3400750 ns | | |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 77501 ns | 73051 ns | 1.06 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12854 ns | 12792 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12937 ns | 13395.5 ns | 0.97 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13187.5 ns | 13375 ns | 0.99 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 13166 ns | 13166 ns | 1 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 522874 ns | 525610 ns | 0.99 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 19612958 ns | | |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 349524 ns | 342408 ns | 1.02 |
| batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 30958.5 ns | 31416.5 ns | 0.99 |
| batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 34895.5 ns | 32520.5 ns | 1.07 |
| batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 30208 ns | 28917 ns | 1.04 |
| batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 2042 ns | 2167 ns | 0.94 |
| batchedmm(2, Bsize=128)/forward/GPU/CUDA | 16552 ns | 16642 ns | 0.99 |
| batchedmm(2, Bsize=128)/forward/GPU/oneAPI | 76609794 ns | | |
| batchedmm(2, Bsize=128)/forward/GPU/AMDGPU | 87451 ns | 78711 ns | 1.11 |
| batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 5375 ns | 5583.5 ns | 0.96 |
| batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 5229 ns | 4958 ns | 1.05 |
| batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 5395.5 ns | 5250 ns | 1.03 |
| batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 6417 ns | 6584 ns | 0.97 |
| batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 135958 ns | 137549 ns | 0.99 |
| batchedmm(2, Bsize=128)/zygote/GPU/oneAPI | 111332262.5 ns | | |
| batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU | 390584 ns | 383954 ns | 1.02 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 291 ns | 292 ns | 1.00 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 416 ns | 0.90 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 291 ns | 375 ns | 0.78 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 24266 ns | 24843 ns | 0.98 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI | 1220615 ns | | |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 49051 ns | 48221 ns | 1.02 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6458 ns | 6375 ns | 1.01 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6792 ns | 6708.5 ns | 1.01 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6875 ns | 6916.5 ns | 0.99 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6375 ns | 6875 ns | 0.93 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 181716 ns | 183051 ns | 0.99 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI | 22738910 ns | | |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 394694 ns | 391009 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 2000 ns | 1958 ns | 1.02 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 2125 ns | 2042 ns | 1.04 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 2125 ns | 2084 ns | 1.02 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 1959 ns | 2041 ns | 0.96 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 25193 ns | 25908 ns | 0.97 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 1233759.5 ns | | |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 207422 ns | 207502 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 16937.5 ns | 17333.5 ns | 0.98 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 17583 ns | 17333 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 17666 ns | 17625 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 17167 ns | 18000 ns | 0.95 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 266060 ns | 266084 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 25037224.5 ns | | |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 702687 ns | 691847 ns | 1.02 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 177959 ns | 153459 ns | 1.16 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 151000 ns | 175583.5 ns | 0.86 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 151250 ns | 150250 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 156666 ns | 150417 ns | 1.04 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 185813 ns | 192072 ns | 0.97 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8186035 ns | | |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 213762 ns | 176432 ns | 1.21 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1294417 ns | 1193541 ns | 1.08 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1322667 ns | 1327291.5 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1326979.5 ns | 1298166.5 ns | 1.02 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1325125 ns | 1330166.5 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 850017 ns | 864717 ns | 0.98 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 46207436 ns | | |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1106552 ns | 1114311 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 25687.5 ns | 25604.5 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 25000 ns | 25333 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 27125 ns | 28625 ns | 0.95 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 27375 ns | 25541 ns | 1.07 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 226385 ns | 232128 ns | 0.98 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7541451 ns | | |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 115741 ns | 115071 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 180771 ns | 118791.5 ns | 1.52 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 134583.5 ns | 126708 ns | 1.06 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 175167 ns | 118625 ns | 1.48 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 164479 ns | 117979 ns | 1.39 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 971603.5 ns | 994805 ns | 0.98 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 45326263 ns | | |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 614401.5 ns | 588415.5 ns | 1.04 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 250 ns | 1.17 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 334 ns | 1.12 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 416 ns | 0.90 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 334 ns | 0.87 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 22475 ns | 23227 ns | 0.97 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI | 1258351.5 ns | | |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 48960 ns | 46150 ns | 1.06 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6458.5 ns | 6417 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6875 ns | 6750 ns | 1.02 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6875 ns | 6958 ns | 0.99 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6458.5 ns | 6750 ns | 0.96 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 197699 ns | 199656 ns | 0.99 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 25220935 ns | | |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 395854 ns | 393763.5 ns | 1.01 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 5666 ns | 6250 ns | 0.91 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6542 ns | 6500 ns | 1.01 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6416 ns | 7291.5 ns | 0.88 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6167 ns | 5291 ns | 1.17 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 136571.5 ns | 137884.5 ns | 0.99 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI | 5759376 ns | | |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 236832 ns | 233922 ns | 1.01 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10167 ns | 10104.5 ns | 1.01 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10250 ns | 10125 ns | 1.01 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10708.5 ns | 10562.5 ns | 1.01 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10021 ns | 10250 ns | 0.98 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 843659.5 ns | 853228 ns | 0.99 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 42177959 ns | | |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 680842 ns | 672507 ns | 1.01 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 708 ns | 708 ns | 1 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 708 ns | 708 ns | 1 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 750 ns | 750 ns | 1 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 667 ns | 708 ns | 0.94 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 22622 ns | 22896 ns | 0.99 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/oneAPI | 2092408 ns | | |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 211377.5 ns | 209942 ns | 1.01 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 4958 ns | 4834 ns | 1.03 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 5167 ns | 5042 ns | 1.02 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 5125 ns | 5125 ns | 1 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 4834 ns | 4834 ns | 1 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 217676 ns | 220625.5 ns | 0.99 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 10379046 ns | | |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 586156 ns | 580650 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 7646 ns | 8750 ns | 0.87 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 8458 ns | 8708 ns | 0.97 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 10000.5 ns | 10395.5 ns | 0.96 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 8625 ns | 8167 ns | 1.06 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 117310.5 ns | 118921.5 ns | 0.99 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI | 3542404 ns | | |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 77011 ns | 71421 ns | 1.08 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8167 ns | 8292 ns | 0.98 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8792 ns | 8791 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9541 ns | 8958 ns | 1.07 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8500 ns | 8916 ns | 0.95 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 559897.5 ns | 567449 ns | 0.99 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI | 21100984 ns | | |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 351894 ns | 346934 ns | 1.01 |
| batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 129875 ns | 125791.5 ns | 1.03 |
| batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 131334 ns | 96000 ns | 1.37 |
| batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 98500 ns | 96187.5 ns | 1.02 |
| batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 183000 ns | 181542 ns | 1.01 |
| batchedmm(128, Bsize=4)/forward/GPU/CUDA | 45933 ns | 46439 ns | 0.99 |
| batchedmm(128, Bsize=4)/forward/GPU/oneAPI | 73470628 ns | | |
| batchedmm(128, Bsize=4)/forward/GPU/AMDGPU | 104986 ns | 93231 ns | 1.13 |
| batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 320833 ns | 302834 ns | 1.06 |
| batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 340500 ns | 166542 ns | 2.04 |
| batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 196229 ns | 166917 ns | 1.18 |
| batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 614646 ns | 567708 ns | 1.08 |
| batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 184661 ns | 186141 ns | 0.99 |
| batchedmm(128, Bsize=4)/zygote/GPU/oneAPI | 95503191 ns | | |
| batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU | 520426 ns | 466525 ns | 1.12 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 397833 ns | 398250 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 287792 ns | 215167 ns | 1.34 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 215167 ns | 214291 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 756459 ns | 756250 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 43884 ns | 43722 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/oneAPI | 1380208.5 ns | | |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU | 82001 ns | 80301 ns | 1.02 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 1449083 ns | 1402813 ns | 1.03 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 1131416 ns | 862208 ns | 1.31 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 862375 ns | 854333 ns | 1.01 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 2444146 ns | 2359583.5 ns | 1.04 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 248740 ns | 247149 ns | 1.01 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/oneAPI | 11082909 ns | | |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU | 350333 ns | 350254 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 652083 ns | 657333 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 652854 ns | 621958.5 ns | 1.05 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 654417 ns | 628854 ns | 1.04 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 661125 ns | 542146 ns | 1.22 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 184615 ns | 185394 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8038741 ns | | |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 311568 ns | 258293 ns | 1.21 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2443958.5 ns | 2469895.5 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2461416.5 ns | 2491916.5 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2443812.5 ns | 2389875 ns | 1.02 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2444771 ns | 2478250 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 932610 ns | 934339.5 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 51927904 ns | | |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1324133 ns | 1448647.5 ns | 0.91 |
| batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 34083.5 ns | 34271 ns | 0.99 |
| batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 36437.5 ns | 34250.5 ns | 1.06 |
| batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 33771 ns | 32312.5 ns | 1.05 |
| batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 834 ns | 916.5 ns | 0.91 |
| batchedmm(2, Bsize=32)/forward/GPU/CUDA | 15954 ns | 16189.5 ns | 0.99 |
| batchedmm(2, Bsize=32)/forward/GPU/oneAPI | 74465713 ns | | |
| batchedmm(2, Bsize=32)/forward/GPU/AMDGPU | 84121 ns | 71551 ns | 1.18 |
| batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 3042 ns | 3166.5 ns | 0.96 |
| batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 3208 ns | 3437.5 ns | 0.93 |
| batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 3416 ns | 3541 ns | 0.96 |
| batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 3084 ns | 3125 ns | 0.99 |
| batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 134871 ns | 134833 ns | 1.00 |
| batchedmm(2, Bsize=32)/zygote/GPU/oneAPI | 101832238 ns | | |
| batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU | 355194 ns | 339494 ns | 1.05 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 435000 ns | 437000 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 441208 ns | 432458 ns | 1.02 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 431291 ns | 432833 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 449458 ns | 449416 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 42183 ns | 42351 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1418032 ns | | |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 241737 ns | 238133 ns | 1.02 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4139000 ns | 4152625 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4281375 ns | 4271667 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4272125 ns | 4252417 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4043500 ns | 4062020.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 231383.5 ns | 231247 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 38875009 ns | | |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1238087.5 ns | 1229715 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3917 ns | 3875 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3750 ns | 3917 ns | 0.96 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3917 ns | 3875 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3916 ns | 3916 ns | 1 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA | 34290 ns | 34451.5 ns | 1.00 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/oneAPI | 1242809 ns | | |
| dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU | 40730 ns | 38680 ns | 1.05 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 15750 ns | 15458 ns | 1.02 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 15500 ns | 15708 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 15708 ns | 15625 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 15667 ns | 15459 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA | 253133 ns | 252640 ns | 1.00 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/oneAPI | 8969271 ns | | |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU | 178362 ns | 169682 ns | 1.05 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s)
404000
ns403417
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s)
295666
ns221209
ns1.34
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s)
221167
ns220042
ns1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s)
760500
ns760791
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA
113399
ns113133
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/oneAPI
1019290
nsdense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU
89320
ns87381
ns1.02
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s)
1474312.5
ns1431749.5
ns1.03
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s)
1157021
ns886583
ns1.31
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s)
884958
ns881812.5
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s)
2465875
ns2383750
ns1.03
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA
244167
ns229435.5
ns1.06
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/oneAPI
11671477
nsdense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU
354019
ns350874
ns1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
500
ns459
ns1.09
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
666
ns583
ns1.14
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
584
ns625
ns0.93
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
500
ns584
ns0.86
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA
24808
ns24713
ns1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI
1214092.5
nsbatchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU
210112
ns207622
ns1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
7916
ns7458.5
ns1.06
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
8167
ns8041.5
ns1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
8125
ns8292
ns0.98
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
7459
ns7792
ns0.96
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA
203590.5
ns202392.5
ns1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI
24613685
nsbatchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU
690937
ns689378
ns1.00
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s)
832166.5
ns833145.5
ns1.00
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s)
619583
ns466667
ns1.33
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s)
472250
ns467771
ns1.01
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s)
1542500
ns1542833
ns1.00
batchedmm(128, Bsize=32)/forward/GPU/CUDA
130624
ns130433
ns1.00
batchedmm(128, Bsize=32)/forward/GPU/oneAPI
75509279
nsbatchedmm(128, Bsize=32)/forward/GPU/AMDGPU
236082
ns166542
ns1.42
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s)
2694208.5
ns2696000
ns1.00
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s)
1991375
ns1539437.5
ns1.29
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s)
1537625
ns1533500
ns1.00
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s)
4930000
ns4930000
ns1
batchedmm(128, Bsize=32)/zygote/GPU/CUDA
233850
ns233723
ns1.00
batchedmm(128, Bsize=32)/zygote/GPU/oneAPI
102808354
nsbatchedmm(128, Bsize=32)/zygote/GPU/AMDGPU
768638
ns771469
ns1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
292
ns292
ns1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
375
ns375
ns1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
375
ns375
ns1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
250
ns375
ns0.67
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA
31761
ns31721
ns1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI
1224489
nsbatchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU
47050
ns48111
ns0.98
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
6417
ns6312.5
ns1.02
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
6792
ns6812.5
ns1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
6666
ns6875
ns0.97
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
6375
ns6500
ns0.98
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA
219075.5
ns217171.5
ns1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI
23474742
nsbatchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU
362424
ns362335
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
1776458
ns1777250
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
1755459
ns1758812.5
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
1754000
ns1730917
ns1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
1755666
ns1776250
ns0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
183229.5
ns184219
ns0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI
8315915
nslayernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
375104
ns354280
ns1.06
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
4353771
ns4352917
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
4398479
ns4382542
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
4376083
ns4351834
ns1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
4351333
ns4391416
ns0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
833369
ns837734
ns0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI
47106002
nslayernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1251643
ns1247440
ns1.00
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s)
7083.5
ns6771
ns1.05
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s)
7104
ns7937.5
ns0.89
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s)
7375
ns7333
ns1.01
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s)
6834
ns6687.5
ns1.02
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA
22695
ns22420
ns1.01
bias_activation(512, act=relu)(512 x 128)/forward/GPU/oneAPI
1216626
nsbias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU
37200
ns36840.5
ns1.01
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s)
48479.5
ns45312.5
ns1.07
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s)
50874.5
ns48146
ns1.06
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s)
47979
ns33917
ns1.41
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s)
47208
ns52729.5
ns0.90
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA
207872
ns206304
ns1.01
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/oneAPI
10801241
nsbias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU
234813
ns232673
ns1.01
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s)
22854
ns22146
ns1.03
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s)
26375
ns23896
ns1.10
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s)
23146
ns22417
ns1.03
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s)
5333
ns5334
ns1.00
batchedmm(2, Bsize=512)/forward/GPU/CUDA
17805
ns18024
ns0.99
batchedmm(2, Bsize=512)/forward/GPU/oneAPI
89168517
nsbatchedmm(2, Bsize=512)/forward/GPU/AMDGPU
90691
ns83860.5
ns1.08
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s)
12083
ns12000
ns1.01
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s)
10208.5
ns9437.5
ns1.08
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s)
9583
ns9583
ns1
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s)
18104.5
ns18250
ns0.99
batchedmm(2, Bsize=512)/zygote/GPU/CUDA
217973
ns218264
ns1.00
batchedmm(2, Bsize=512)/zygote/GPU/oneAPI
150119195
nsbatchedmm(2, Bsize=512)/zygote/GPU/AMDGPU
389829
ns367444
ns1.06
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s)
405958
ns406417
ns1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s)
297166.5
ns223333
ns1.33
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s)
223625
ns222292
ns1.01
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s)
762167
ns762750
ns1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA
46720
ns46291
ns1.01
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/oneAPI
1360027
nsdense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU
90521
ns88691
ns1.02
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s)
1491042
ns1428625
ns1.04
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s)
1165750
ns892375
ns1.31
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s)
892791.5
ns886833
ns1.01
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s)
2470333
ns2386333
ns1.04
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA
279542.5
ns279641
ns1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/oneAPI
11213824.5
nsdense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU
375414
ns379995
ns0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
436000
ns436833
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
440750
ns432708
ns1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
432000
ns429500
ns1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
449042
ns449500
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
54332
ns52933
ns1.03
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI
999725
nsbatchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
237743
ns235598
ns1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
4137041.5
ns4147167
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
4271042
ns4260354
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
4270646
ns4227333
ns1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
4030959
ns4030354.5
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
253348
ns252356.5
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI
32411933.5
nsbatchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1223273
ns1204784
ns1.02
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s)
9458
ns9583
ns0.99
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s)
8000
ns7292
ns1.10
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s)
7209
ns7250
ns0.99
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s)
13458
ns13500
ns1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA
24044
ns23984
ns1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/oneAPI
2135292
nsdense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU
214732
ns212683
ns1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s)
49833
ns49416
ns1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s)
49750
ns49459
ns1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s)
49458
ns49167
ns1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s)
49500
ns49625
ns1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA
335918.5
ns333606
ns1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/oneAPI
12693187
nsdense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU
656617
ns652008
ns1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
136583
ns106875
ns1.28
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
82145.5
ns113729
ns0.72
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
85583
ns88666
ns0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
83104
ns89666.5
ns0.93
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
191318.5
ns191172
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI
5843078
nsgroupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
205972
ns200642
ns1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2013959
ns2027750.5
ns0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2017792
ns2023896
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2022958
ns1986666
ns1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
2019333
ns2015667
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
508706
ns507573.5
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI
28081381
nsgroupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1089431
ns1086742.5
ns1.00
This comment was automatically generated by workflow using github-action-benchmark.
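
For readers checking the figures by hand: in the results above, the last number of each entry appears to be the first timing divided by the second (for example, 295666 / 221209 ≈ 1.34), and the oneAPI entries carry only one timing, so they have no ratio. The following is a minimal Python sketch of that arithmetic, assuming the first timing is the current run and the second is the previous run. The `ratio` and `slowdowns` helpers and the `1.3` threshold are hypothetical and are not part of github-action-benchmark.

```python
from typing import List, Optional, Tuple

# (benchmark name, current ns, previous ns or None) copied from the results above.
Row = Tuple[str, float, Optional[float]]

ROWS: List[Row] = [
    ("dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s)", 295666, 221209),
    ("batchedmm(128, Bsize=32)/forward/GPU/AMDGPU", 236082, 166542),
    ("groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)", 82145.5, 113729),
    ("batchedmm(128, Bsize=32)/forward/GPU/oneAPI", 75509279, None),
]


def ratio(current_ns: float, previous_ns: Optional[float]) -> Optional[float]:
    """Return current / previous, or None when there is no previous timing."""
    if previous_ns is None or previous_ns == 0:
        return None
    return current_ns / previous_ns


def slowdowns(rows: List[Row], threshold: float = 1.3) -> List[Tuple[str, float]]:
    """List benchmarks whose ratio exceeds the (hypothetical) threshold."""
    flagged = []
    for name, current_ns, previous_ns in rows:
        r = ratio(current_ns, previous_ns)
        if r is not None and r > threshold:
            flagged.append((name, round(r, 2)))
    return flagged


if __name__ == "__main__":
    for name, r in slowdowns(ROWS):
        print(f"{r:.2f}  {name}")
```

A ratio above 1 means the current run was slower than the previous one, and a ratio below 1 means it was faster. Timings this small are noisy, so a single ratio is only a hint, not evidence of a regression.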