This repository has been archived by the owner on Nov 4, 2024. It is now read-only.
fix: dropout tests are no longer broken
Showing 1 changed file with 2 additions and 6 deletions.
7162f43
@JuliaRegistrator register
7162f43
Registration pull request created: JuliaRegistries/General/115077
Tip: Release Notes
Did you know you can add release notes too? Just add markdown formatted text underneath the comment after the text
"Release notes:" and it will be added to the registry PR, and if TagBot is installed it will also be added to the
release that TagBot creates. i.e.
To add them here just re-invoke and the PR will be updated.
Tagging
After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.
This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the github interface, or via:
7162f43
LuxLib Benchmarks
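Each benchmark entry that follows reports two timings in nanoseconds and a ratio. As a minimal sketch of how to read those numbers, assuming the first timing belongs to the current commit and the second to the previous baseline, the ratio is the first timing divided by the second, so values below 1 indicate a speedup. The helper names `ratio` and `describe` and the 5% noise band are illustrative assumptions, not part of LuxLib or of the benchmark tooling.

```python
def ratio(current_ns: float, previous_ns: float) -> float:
    """Return current / previous; a value below 1 means the current run is faster."""
    if previous_ns <= 0:
        raise ValueError("previous timing must be positive")
    return current_ns / previous_ns


def describe(r: float, noise: float = 0.05) -> str:
    """Label a ratio, treating anything within +/- noise of 1.0 as unchanged.

    The 5% default is an arbitrary, hypothetical threshold chosen for illustration.
    """
    if r < 1.0 - noise:
        return "faster"
    if r > 1.0 + noise:
        return "slower"
    return "unchanged"


# Timings taken from two of the entries below (nanoseconds).
layernorm = ratio(5958.5, 5312.5)   # layernorm(2, act=gelu, affine=false), CPU, 2 threads
bias_act = ratio(583, 7584)         # bias_activation(2, act=relu), CPU, 2 threads

print(round(layernorm, 2), describe(layernorm))
print(round(bias_act, 2), describe(bias_act))
```

Read this way, the small `bias_activation` and `dense(2, bias=true, ...)` CPU entries stand out as large speedups, while most GPU entries sit close to 1.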
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
5958.5
ns5312.5
ns1.12
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
6875
ns7792
ns0.88
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
8292
ns8000
ns1.04
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
5583
ns6958.5
ns0.80
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA
119005
ns119033
ns1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal
912250
ns825375
ns1.11
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU
404275
ns401934
ns1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
9979
ns9583
ns1.04
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
9917
ns9875
ns1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
10458
ns9875
ns1.06
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
9833.5
ns9979
ns0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA
553048
ns554263
ns1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal
5278042
ns2713291
ns1.95
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU
666908
ns671997
ns0.99
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s)
1416
ns7645.5
ns0.19
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s)
3000
ns7500
ns0.40
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s)
2083
ns9750
ns0.21
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s)
2916
ns8521
ns0.34
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA
21729
ns23694
ns0.92
bias_activation(32, act=relu)(32 x 128)/forward/GPU/Metal
209229.5
ns222062.5
ns0.94
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU
28980
ns31840
ns0.91
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
4250
ns4770.5
ns0.89
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
4083.5
ns5041
ns0.81
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
4209
ns5583.5
ns0.75
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
3958
ns5062
ns0.78
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA
147955
ns145766
ns1.02
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/Metal
1637541
ns1568604.5
ns1.04
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU
144052
ns146901
ns0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
58125
ns56917
ns1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
46334
ns47083
ns0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
46875
ns47375
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
82000
ns82792
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
37545
ns39154
ns0.96
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal
1091270.5
ns1060708
ns1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
78176
ns81970
ns0.95
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2039500
ns2023687.5
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2086417
ns2084333.5
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2090625
ns2097166.5
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
2001458
ns1996667
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
236094
ns220055
ns1.07
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal
5524250
ns5389292
ns1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
981882
ns1353254
ns0.73
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
177916.5
ns147146
ns1.21
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
149750
ns149750
ns1
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
151999.5
ns146270.5
ns1.04
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
155625
ns150667
ns1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
166025
ns165828.5
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal
1652208
ns1542042
ns1.07
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
208252
ns204932
ns1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1118959
ns1114916
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1115500
ns1110250
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1118583
ns1120437.5
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1120729.5
ns1114709
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
702528
ns688383
ns1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal
6243791
ns6685792
ns0.93
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
921401
ns1030010.5
ns0.89
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
5375
ns4479
ns1.20
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
5333.5
ns5417
ns0.98
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
5979.5
ns5583
ns1.07
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
4166
ns4334
ns0.96
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
92951.5
ns91302
ns1.02
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal
464791
ns449229
ns1.03
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU
66471
ns69581
ns0.96
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
8666
ns8458
ns1.02
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
8709
ns8625
ns1.01
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
8917
ns9292
ns0.96
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
8667
ns8375
ns1.03
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
609555
ns588432.5
ns1.04
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal
6613792
ns6040187
ns1.09
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU
389844
ns387324
ns1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
17500
ns17146
ns1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
18312.5
ns18438
ns0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
22041
ns21500
ns1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
17999.5
ns17229.5
ns1.04
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
66046
ns66199
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal
1312542
ns1266312.5
ns1.04
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
78481
ns76211
ns1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
221708.5
ns215958
ns1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
213209
ns219125
ns0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
221208
ns215188
ns1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
220125
ns221708
ns0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
351410
ns351090
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal
5781271
ns5667541.5
ns1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
467066
ns469564
ns0.99
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s)
583
ns7584
ns0.07687236286919831
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s)
625
ns8166.5
ns0.07653217412600258
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s)
895.5
ns11750
ns0.0762127659574468
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s)
625
ns8562.5
ns0.072992700729927
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA
20127
ns22778
ns0.88
bias_activation(2, act=relu)(2 x 128)/forward/GPU/Metal
299208
ns301791
ns0.99
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU
32561
ns32530
ns1.00
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
1417
ns2209
ns0.64
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
1417
ns2417
ns0.59
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
1709
ns2916.5
ns0.59
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
1417
ns2375
ns0.60
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA
123446
ns126097.5
ns0.98
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/Metal
1707625
ns1533792
ns1.11
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU
134852
ns135982
ns0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
7458
ns14041
ns0.53
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
6125
ns14167
ns0.43
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
6167
ns14458
ns0.43
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
10250
ns16709
ns0.61
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
23451
ns32839
ns0.71
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal
641500
ns609208
ns1.05
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
47640
ns56260
ns0.85
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
234125
ns227375
ns1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
238167
ns275292
ns0.87
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
235875
ns275000
ns0.86
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
256229
ns261458
ns0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
190753.5
ns202099.5
ns0.94
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal
8815979
ns8740042
ns1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
641363
ns655201
ns0.98
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s)
4084
ns4125
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s)
4125
ns4125
ns1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s)
4125
ns4125
ns1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s)
4125
ns4083
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA
23283
ns22662
ns1.03
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/Metal
223916
ns219958
ns1.02
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU
46130
ns46610
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
16500
ns21041
ns0.78
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
17000
ns21791
ns0.78
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
17208
ns22250
ns0.77
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
16916
ns20917
ns0.81
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA
192389
ns205015
ns0.94
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/Metal
1461979.5
ns975584
ns1.50
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU
171782
ns182977
ns0.94
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s)
511041.5
ns509167
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s)
404542
ns404417
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s)
405083
ns405000
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s)
865583
ns864791
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA
113186.5
ns113334.5
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/Metal
432792
ns421604.5
ns1.03
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU
241413
ns240942
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s)
2283979.5
ns2318229.5
ns0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s)
2032375
ns2030833
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s)
2028375
ns2041375
ns0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s)
3275708
ns3280292
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA
238715
ns250973.5
ns0.95
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/Metal
2068312
ns1903125
ns1.09
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU
738234
ns725307
ns1.02
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
6875
ns5375
ns1.28
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
6562.5
ns7604
ns0.86
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
7791.5
ns8500
ns0.92
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
7021
ns6458.5
ns1.09
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA
89026
ns89376.5
ns1.00
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal
835166
ns762334
ns1.10
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU
64940
ns64761
ns1.00
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
11395.5
ns10583
ns1.08
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
10875
ns11958
ns0.91
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
11813
ns11958
ns0.99
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
12250
ns10792
ns1.14
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA
626379
ns632512
ns0.99
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal
5900375
ns5666041
ns1.04
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU
405390
ns401324
ns1.01
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s)
500
ns2625
ns0.19
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s)
500
ns2958
ns0.17
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s)
542
ns3250
ns0.17
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s)
500
ns2792
ns0.18
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA
22768
ns30482.5
ns0.75
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/Metal
329250
ns340083
ns0.97
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU
46571
ns54341
ns0.86
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
2084
ns10750
ns0.19
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
2125
ns11833
ns0.18
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
2167
ns13000
ns0.17
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
2125
ns10625
ns0.20
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA
222641.5
ns252151
ns0.88
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/Metal
2041875
ns1962708.5
ns1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU
177642
ns189561.5
ns0.94
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
8625
ns26500
ns0.33
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
8625
ns31771
ns0.27
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
10375
ns35000
ns0.30
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
8625
ns28479
ns0.30
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
103937.5
ns121854.5
ns0.85
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal
808791.5
ns730917
ns1.11
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU
72021
ns80315.5
ns0.90
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
18000
ns22791.5
ns0.79
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
18145.5
ns25542
ns0.71
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
18583
ns25334
ns0.73
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
18416.5
ns23000
ns0.80
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
571815.5
ns616060
ns0.93
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal
5404208
ns5306187.5
ns1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU
375389.5
ns388424
ns0.97
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s)
500
ns1667
ns0.30
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s)
500
ns2000
ns0.25
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s)
584
ns2167
ns0.27
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s)
583
ns1834
ns0.32
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA
34719
ns40493
ns0.86
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal
470542
ns296417
ns1.59
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU
45830
ns48340
ns0.95
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
9208
ns10000
ns0.92
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
9667
ns11187.5
ns0.86
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
9458
ns11958
ns0.79
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
9166.5
ns10583
ns0.87
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA
247304
ns266372
ns0.93
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal
5213666.5
ns4716875
ns1.11
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU
365724
ns379563.5
ns0.96
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s)
398667
ns396417
ns1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s)
288000
ns287875
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s)
287959
ns288125
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s)
755708
ns756000
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA
111672
ns111465
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/Metal
399458
ns367958
ns1.09
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU
74445.5
ns75531
ns0.99
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s)
1404125
ns1453958.5
ns0.97
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s)
1133833
ns1136125
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s)
1130625
ns1142437.5
ns0.99
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s)
2440291.5
ns2444854
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA
204531
ns219029
ns0.93
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/Metal
1656875
ns1657083
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU
322048.5
ns327328
ns0.98
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
7458
ns7042
ns1.06
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
7729.5
ns8250
ns0.94
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
8645.5
ns8833
ns0.98
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
7104
ns7209
ns0.99
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
140658.5
ns141318
ns1.00
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal
446584
ns440833
ns1.01
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU
65181
ns65171
ns1.00
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
14854
ns12000
ns1.24
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
14041.5
ns14812
ns0.95
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
14812.5
ns14750
ns1.00
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
14500
ns11917
ns1.22
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
931410
ns936057
ns1.00
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal
6503021
ns5924541.5
ns1.10
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU
419645
ns423354
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
25750
ns23584
ns1.09
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
30229
ns29312.5
ns1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
27250
ns31187
ns0.87
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
24749.5
ns24833
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
197427
ns197551
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal
1129250
ns605479
ns1.87
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
113061
ns114941
ns0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
114729
ns108084
ns1.06
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
151542
ns124167
ns1.22
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
149125
ns154542
ns0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
144292
ns151166.5
ns0.95
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
1069359
ns1062517
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal
6073166.5
ns6076604
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
585507
ns587946
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
76500
ns73750
ns1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
75375
ns82958
ns0.91
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
82062.5
ns78916
ns1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
76375
ns77042
ns0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
205087.5
ns204012
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal
534500
ns530625
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
126942
ns129202
ns0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
222708
ns209875
ns1.06
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
218938
ns218333
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
301625
ns286250
ns1.05
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
211000
ns224708
ns0.94
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
1112213
ns1104104
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal
6842291.5
ns6448250
ns1.06
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
687778
ns693286
ns0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
16625
ns15687.5
ns1.06
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
17084
ns17458
ns0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
18250
ns18250
ns1
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
16979.5
ns16812.5
ns1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
145569.5
ns144830
ns1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal
470166
ns448250
ns1.05
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU
231123
ns231562
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
27000.5
ns24667
ns1.09
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
26583.5
ns26229.5
ns1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
28167
ns27083
ns1.04
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
28104
ns24833
ns1.13
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
973282.5
ns963572.5
ns1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal
6243292
ns6046333
ns1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU
686787
ns687187
ns1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
10708
ns32208.5
ns0.33
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
11833
ns38208
ns0.31
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
13146
ns43375
ns0.30
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
11083
ns31459
ns0.35
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
122917
ns138477.5
ns0.89
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal
909979.5
ns880000
ns1.03
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU
234753
ns243662
ns0.96
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
22125
ns23270.5
ns0.95
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
22208
ns23917
ns0.93
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
22833
ns25145.5
ns0.91
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
21458
ns22645.5
ns0.95
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
698911
ns705404
ns0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal
5533791.5
ns5486750
ns1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU
667848
ns671427
ns0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
63250
ns63271
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
63291
ns64396
ns0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
68792
ns66666
ns1.03
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
66896
ns63375.5
ns1.06
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
104795
ns106695.5
ns0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal
1335750
ns1328458
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
233243
ns236317
ns0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
465750
ns437479.5
ns1.06
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
485500
ns464312.5
ns1.05
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
479729
ns451499.5
ns1.06
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
439584
ns485145.5
ns0.91
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
513938
ns511151
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal
6244250
ns6149750
ns1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
700442.5
ns716967
ns0.98
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s)
7896
ns7542
ns1.05
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s)
7625
ns7375
ns1.03
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s)
8437.5
ns8500
ns0.99
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s)
7791.5
ns7125
ns1.09
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA
143630.5
ns142876.5
ns1.01
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal
485125
ns463208.5
ns1.05
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU
64771
ns64690
ns1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
16271
ns12958
ns1.26
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
14042
ns13812
ns1.02
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
15395.5
ns14417
ns1.07
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
12291
ns15458
ns0.80
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA
937305
ns934056
ns1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal
5742562.5
ns5680771
ns1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU
395454
ns396764
ns1.00
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s)
6159333
ns6145625
ns1.00
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s)
6381500
ns6375834
ns1.00
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s)
6372812.5
ns6379875
ns1.00
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s)
11913291
ns11908958
ns1.00
batchedmm(512, Bsize=4)/forward/GPU/CUDA
301808
ns348241
ns0.87
batchedmm(512, Bsize=4)/forward/GPU/AMDGPU
300133
ns302192.5
ns0.99
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s)
19143020.5
ns19047770.5
ns1.01
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s)
20001479.5
ns19961208.5
ns1.00
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s)
19942104
ns19978625
ns1.00
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s)
36496041.5
ns36632228.5
ns1.00
batchedmm(512, Bsize=4)/zygote/GPU/CUDA
1192939
ns1017536
ns1.17
batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU
1151383
ns1157817
ns0.99
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 917 ns | 3208 ns | 0.29 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 958 ns | 3541 ns | 0.27 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 959 ns | 4084 ns | 0.23 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 958 ns | 3250 ns | 0.29 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA | 22767.5 ns | 30273 ns | 0.75 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/Metal | 329667 ns | 335958 ns | 0.98 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 206473 ns | 212322 ns | 0.97 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 3667 ns | 11417 ns | 0.32 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 3667 ns | 12291 ns | 0.30 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 3750 ns | 15000 ns | 0.25 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 3667 ns | 11459 ns | 0.32 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA | 278806 ns | 300887 ns | 0.93 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/Metal | 2160750 ns | 2150875 ns | 1.00 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 619497 ns | 613366 ns | 1.01 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 8770.5 ns | 32583 ns | 0.27 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 7542 ns | 39625 ns | 0.19 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 9396 ns | 42125 ns | 0.22 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 8416.5 ns | 31291 ns | 0.27 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 119876.5 ns | 134275.5 ns | 0.89 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal | 873333.5 ns | 782479 ns | 1.12 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 72170 ns | 81161 ns | 0.89 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 11417 ns | 17959 ns | 0.64 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 11875 ns | 19937.5 ns | 0.60 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 12750 ns | 20375 ns | 0.63 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 12354.5 ns | 18291 ns | 0.68 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 634380 ns | 653629 ns | 0.97 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal | 5505625 ns | 4601291 ns | 1.20 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 353023 ns | 371429 ns | 0.95 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 291 ns | 1.00 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 250 ns | 291 ns | 0.86 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA | 22115 ns | 22327 ns | 0.99 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/Metal | 323000 ns | 324479 ns | 1.00 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU | 45911 ns | 46841 ns | 0.98 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2875 ns | 6791 ns | 0.42 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 3083 ns | 7208 ns | 0.43 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 3375 ns | 9375 ns | 0.36 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2958 ns | 6667 ns | 0.44 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA | 201959 ns | 215371.5 ns | 0.94 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/Metal | 1698500 ns | 1703500 ns | 1.00 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 162226.5 ns | 166071 ns | 0.98 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 10625 ns | 10167 ns | 1.05 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 11791 ns | 12875.5 ns | 0.92 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 13292 ns | 13125 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 12792 ns | 11083 ns | 1.15 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 120724 ns | 120797.5 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 1021125 ns | 935500 ns | 1.09 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 237033 ns | 233122 ns | 1.02 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 20104.5 ns | 20709 ns | 0.97 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 20416.5 ns | 21875 ns | 0.93 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 21667 ns | 21750 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 22499.5 ns | 22625 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 591155 ns | 590585 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 4924396 ns | 4822000 ns | 1.02 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 647057 ns | 648361 ns | 1.00 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4375 ns | 6833 ns | 0.64 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4458 ns | 7041 ns | 0.63 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4417 ns | 7833 ns | 0.56 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4458 ns | 6875 ns | 0.65 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA | 23658 ns | 31284 ns | 0.76 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/Metal | 224687.5 ns | 229313 ns | 0.98 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU | 47091 ns | 52301 ns | 0.90 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16208 ns | 26125 ns | 0.62 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16583 ns | 27209 ns | 0.61 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16625 ns | 30000 ns | 0.55 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16541 ns | 25834 ns | 0.64 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA | 328446.5 ns | 347032.5 ns | 0.95 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/Metal | 1736125 ns | 1080292 ns | 1.61 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 206977.5 ns | 216482.5 ns | 0.96 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 2000 ns | 3334 ns | 0.60 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 2041 ns | 3458 ns | 0.59 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 2167 ns | 3875 ns | 0.56 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 2166 ns | 3417 ns | 0.63 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 35584 ns | 41491.5 ns | 0.86 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 496542 ns | 397958 ns | 1.25 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 202993 ns | 206202 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 16625 ns | 16917 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 17083 ns | 20833 ns | 0.82 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 19375 ns | 21208 ns | 0.91 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 17292 ns | 17687.5 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 291555 ns | 288016 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 5729000 ns | 5201083 ns | 1.10 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 685537 ns | 696531.5 ns | 0.98 |
| batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 58875 ns | 55459 ns | 1.06 |
| batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 64917 ns | 64896 ns | 1.00 |
| batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 65979.5 ns | 65583.5 ns | 1.01 |
| batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 51500 ns | 51541.5 ns | 1.00 |
| batchedmm(16, Bsize=512)/forward/GPU/CUDA | 66564 ns | 66456 ns | 1.00 |
| batchedmm(16, Bsize=512)/forward/GPU/AMDGPU | 114441 ns | 113921 ns | 1.00 |
| batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 135604 ns | 132500 ns | 1.02 |
| batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 160729.5 ns | 166374.5 ns | 0.97 |
| batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 161958.5 ns | 111500 ns | 1.45 |
| batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 220333 ns | 316833 ns | 0.70 |
| batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 213911 ns | 217912 ns | 0.98 |
| batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU | 608437 ns | 613066 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 109333.5 ns | 80625 ns | 1.36 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 125708 ns | 125645.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 107083 ns | 86146 ns | 1.24 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 84000 ns | 82959 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 193351 ns | 193130 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1827812 ns | 1989666.5 ns | 0.92 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 202822.5 ns | 216662.5 ns | 0.94 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1924187 ns | 1912792 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1898125 ns | 1921187.5 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1884187.5 ns | 1912375 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1880291.5 ns | 1908917 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 530663 ns | 526124 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9424833.5 ns | 8680374.5 ns | 1.09 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1067912 ns | 1069560 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 291 ns | 2375 ns | 0.12 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 291 ns | 2792 ns | 0.10 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 3459 ns | 0.08441746169413125 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 2375 ns | 0.12 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 21196 ns | 28282 ns | 0.75 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/Metal | 359916.5 ns | 355667 ns | 1.01 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU | 41330 ns | 46231 ns | 0.89 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1833 ns | 9625 ns | 0.19 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1792 ns | 13459 ns | 0.13 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1834 ns | 14166 ns | 0.13 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1792 ns | 9625 ns | 0.19 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 249929 ns | 270635.5 ns | 0.92 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/Metal | 1139125 ns | 1067437.5 ns | 1.07 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU | 181082 ns | 195496.5 ns | 0.93 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 9916 ns | 7958 ns | 1.25 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 9625 ns | 9854 ns | 0.98 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 11687 ns | 11000 ns | 1.06 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6958 ns | 7521 ns | 0.93 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 117950.5 ns | 116502.5 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal | 889354 ns | 901250 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 233722 ns | 233553 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8500 ns | 8500 ns | 1 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8917 ns | 10042 ns | 0.89 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9583 ns | 10208 ns | 0.94 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 10417 ns | 8708 ns | 1.20 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 519707.5 ns | 518097.5 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal | 4836042 ns | 4329250 ns | 1.12 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 623196 ns | 626366 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58000 ns | 63291 ns | 0.92 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46250 ns | 58084 ns | 0.80 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46125 ns | 57292 ns | 0.81 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 81708 ns | 89791 ns | 0.91 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 39849.5 ns | 50283 ns | 0.79 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1140667 ns | 1167250 ns | 0.98 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 73531 ns | 84641 ns | 0.87 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1931958.5 ns | 1912667 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1988875 ns | 1975417 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1971000 ns | 1966250 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1886041 ns | 1870792 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 219155.5 ns | 232632 ns | 0.94 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11236021 ns | 11151146 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1172433 ns | 1177471 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 420583 ns | 415042 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 421937.5 ns | 419333.5 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 421541.5 ns | 422000 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 418396 ns | 417187.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 208248.5 ns | 207638.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 544854.5 ns | 542958.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 281798 ns | 282777.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 791042 ns | 667541 ns | 1.19 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 776334 ns | 748979 ns | 1.04 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 760667 ns | 673708 ns | 1.13 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 745854.5 ns | 675250 ns | 1.10 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1047869 ns | 1040445 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6618917 ns | 6673666.5 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 906724.5 ns | 908713 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 3475167 ns | 3514375 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 3435791.5 ns | 3451021 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 3445666 ns | 3440750 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 3354000 ns | 3449792 ns | 0.97 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 170188 ns | 184364 ns | 0.92 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1444500.5 ns | 1385709 ns | 1.04 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 419994 ns | 425164 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 6226750 ns | 6177354 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 6217084 ns | 6248791 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 6216479 ns | 6199834 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 6171167 ns | 6163750 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 989017 ns | 983055 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 8277583.5 ns | 8007792 ns | 1.03 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1541036 ns | 1641310.5 ns | 0.94 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 473104.5 ns | 474292 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 341916.5 ns | 345209 ns | 0.99 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 342625 ns | 346500 ns | 0.99 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 902958 ns | 905500 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 45561 ns | 54357 ns | 0.84 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/Metal | 434709 ns | 404541 ns | 1.07 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 242062 ns | 247753 ns | 0.98 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 2288334 ns | 2334208 ns | 0.98 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 2037291 ns | 2038708.5 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 2030792 ns | 2043542 ns | 0.99 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 3276500 ns | 3293041.5 ns | 0.99 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 265294 ns | 278906.5 ns | 0.95 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/Metal | 2215417 ns | 2088084 ns | 1.06 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 765908 ns | 756552 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57833 ns | 62291 ns | 0.93 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 45833 ns | 58500 ns | 0.78 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46083 ns | 56958 ns | 0.81 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82333 ns | 89250 ns | 0.92 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 27901.5 ns | 38171 ns | 0.73 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1140354.5 ns | 1166417 ns | 0.98 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 77831 ns | 86475.5 ns | 0.90 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2037292 ns | 2035708 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2096667 ns | 2102479 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2095750 ns | 2080937.5 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1972770.5 ns | 2008875 ns | 0.98 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 232003 ns | 245090 ns | 0.95 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11459667 ns | 11967000 ns | 0.96 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1054811 ns | 1207986.5 ns | 0.87 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58333 ns | 57709 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46333 ns | 48084 ns | 0.96 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46583 ns | 49000 ns | 0.95 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82125 ns | 83667 ns | 0.98 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 48836.5 ns | 56367 ns | 0.87 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1107208 ns | 1087374.5 ns | 1.02 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 71481 ns | 80301 ns | 0.89 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1939584 ns | 1917209 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1977833.5 ns | 1944583 ns | 1.02 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1921500 ns | 1961125 ns | 0.98 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1901541 ns | 1894375 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 237483 ns | 246493.5 ns | 0.96 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9889292 ns | 9855333.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 914639.5 ns | 1034015 ns | 0.88 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 291 ns | 1375 ns | 0.21 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 291 ns | 1667 ns | 0.17 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 1959 ns | 0.19 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 1375 ns | 0.21 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 34044.5 ns | 39676 ns | 0.86 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 479792 ns | 286209 ns | 1.68 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 45711 ns | 48840 ns | 0.94 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6375 ns | 7375 ns | 0.86 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7166.5 ns | 9083 ns | 0.79 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7042 ns | 9479.5 ns | 0.74 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6875 ns | 7583 ns | 0.91 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 208323.5 ns | 213194.5 ns | 0.98 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 5275958 ns | 4692500 ns | 1.12 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 365689 ns | 380024 ns | 0.96 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 250 ns | 250 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 250 ns | 291 ns | 0.86 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 291 ns | 291 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 250 ns | 250 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 32143 ns | 32533 ns | 0.99 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/Metal | 257083 ns | 254979.5 ns | 1.01 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU | 38641 ns | 37420 ns | 1.03 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 3083 ns | 6042 ns | 0.51 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 2834 ns | 7083 ns | 0.40 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 3375 ns | 9333 ns | 0.36 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 3334 ns | 6083 ns | 0.55 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 186317 ns | 199543 ns | 0.93 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/Metal | 955000 ns | 950520.5 ns | 1.00 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU | 153876.5 ns | 164831 ns | 0.93 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 456103.5 ns | 437749.5 ns | 1.04 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 457959 ns | 487292 ns | 0.94 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 449833.5 ns | 466021 ns | 0.97 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 446792 ns | 442021 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 137834 ns | 143480 ns | 0.96 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2047375 ns | 2179687.5 ns | 0.94 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 367534 ns | 370168.5 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3827084 ns | 3794500 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 3779125 ns | 3803417 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 3815354.5 ns | 3791458 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3805250 ns | 3801709 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 704868 ns | 707529.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10646500 ns | 10857667 ns | 0.98 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1467966 ns | 1463934 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 49950250 ns | 49798563 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 35502667 ns | 35524209 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 35525542 ns | 35534958 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 96946146 ns | 97214791.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1591119 ns | 1600126 ns | 0.99 |
| batchedmm(512, Bsize=32)/forward/GPU/AMDGPU | 1048170.5 ns | 1047610 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 154686729 ns | 153739771 ns | 1.01 |
| batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 112459458.5 ns | 112306958.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 112502167 ns | 112388250 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 294796688 ns | 294975583 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6513322 ns | 6485489.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU | 5586938 ns | 5559847 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 18083.5 ns | 21209 ns | 0.85 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 18875 ns | 20792 ns | 0.91 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 17666 ns | 20667 ns | 0.85 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 15333 ns | 23334 ns | 0.66 |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 20058 ns | 23699 ns | 0.85 |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/Metal | 263167 ns | 222416.5 ns | 1.18 |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU | 27460 ns | 28521 ns | 0.96 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 11000 ns | 11500 ns | 0.96 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 9062.5 ns | 10000 ns | 0.91 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 9375 ns | 10375 ns | 0.90 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 17208 ns | 18416 ns | 0.93 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 258100 ns | 259109.5 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/Metal | 1636271 ns | 1578917 ns | 1.04 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU | 148221 ns | 147201.5 ns | 1.01 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 8375 ns | 24625 ns | 0.34 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 8500 ns | 27667 ns | 0.31 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 9709 ns | 30500 ns | 0.32 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 8667 ns | 23937.5 ns | 0.36 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 114814.5 ns | 137987.5 ns | 0.83 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal | 871583 ns | 670209 ns | 1.30 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 237832 ns | 243177.5 ns | 0.98 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10083 ns | 10917 ns | 0.92 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9458 ns | 11500 ns | 0.82 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10917 ns | 11750 ns | 0.93 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10000 ns | 10792 ns | 0.93 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 616125 ns | 621051 ns | 0.99 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal | 5530209 ns | 4704896 ns | 1.18 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 649716 ns | 650846 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 9104 ns | 8041 ns | 1.13 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 9729 ns | 10271 ns | 0.95 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 11875 ns | 11125 ns | 1.07 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 10146 ns | 9292 ns | 1.09 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 119182.5 ns | 119985.5 ns | 0.99 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal | 925000 ns | 895208 ns | 1.03 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 71721 ns | 71901 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 14000 ns | 13354 ns | 1.05 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 13291.5 ns | 13667 ns | 0.97 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 16125 ns | 13917 ns | 1.16 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 16375 ns | 13708 ns | 1.19 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 587210 ns | 585616 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal | 4884791.5 ns | 4221708 ns | 1.16 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 342378 ns | 351904 ns | 0.97 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 459 ns | 1542 ns | 0.30 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 500 ns | 1750 ns | 0.29 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 583 ns | 1792 ns | 0.33 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 541 ns | 1583 ns | 0.34 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 34108 ns | 40136 ns | 0.85 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal | 463166 ns | 273959 ns | 1.69 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 204103 ns | 207332 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8500 ns | 8750 ns | 0.97 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7208 ns | 9250 ns | 0.78 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 10250 ns | 9291 ns | 1.10 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8875 ns | 8875 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 231051 ns | 227150.5 ns | 1.02 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal | 5606395.5 ns | 4712916 ns | 1.19 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 656736.5 ns | 674086 ns | 0.97 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 16708 ns | 17875 ns | 0.93 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 17709 ns | 19167 ns | 0.92 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 15459 ns | 18896 ns | 0.82 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 11812.5 ns | 18125 ns | 0.65 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 21417 ns | 24199.5 ns | 0.89 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/Metal | 326333 ns | 208625.5 ns | 1.56 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 187682 ns | 187926.5 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 32083 ns | 32417 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 32083 ns | 32958 ns | 0.97 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 32583 ns | 33458 ns | 0.97 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 32125 ns | 32625 ns | 0.98 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 271002 ns | 275193 ns | 0.98 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/Metal | 1817666.5 ns | 1674271 ns | 1.09 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 588436 ns | 588556 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 442583 ns | 455833.5 ns | 0.97 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 484500 ns | 470416.5 ns | 1.03 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 466916.5 ns | 445500 ns | 1.05 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 474875 ns | 442125 ns | 1.07 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 195143 ns | 194972.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1992645.5 ns | 2002875 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 366114 ns | 368743 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3836708 ns | 3826416.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 3833541 ns | 3821625 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 3838416.5 ns | 3805291.5 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3820125 ns | 3828770.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 541312.5 ns | 539774 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 8971125 ns | 9665250 ns | 0.93 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1356384 ns | 1360323 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 831437333 ns | 787624562.5 ns | 1.06 |
| batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 543898584 ns | 541996916 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 543162208 ns | 539785459 ns | 1.01 |
| batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 1507470958 ns | 1557728417 ns | 0.97 |
| batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22556380.5 ns | 22543125 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/GPU/AMDGPU | 14756806 ns | 14726018 ns | 1.00 |
| batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 2992149250 ns | 2518400750 ns | 1.19 |
| batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 2941572500 ns | 1785169708 ns | 1.65 |
| batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 1796840333 ns | 1784676208 ns | 1.01 |
| batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 4805069125 ns | 5268664750 ns | 0.91 |
| batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 365112590 ns | 367578104 ns | 0.99 |
| batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU | 88955268 ns | 88737971 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 76312 ns | 75084 ns | 1.02 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 87000 ns | 76541.5 ns | 1.14 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 80792 ns | 78958 ns | 1.02 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 77791.5 ns | 75625 ns | 1.03 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 206518 ns | 206590 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 538646 ns | 947916 ns | 0.57 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 106801 ns | 120271 ns | 0.89 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 299625 ns | 193042 ns | 1.55 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 288125 ns | 278584 ns | 1.03 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 206708.5 ns | 194458 ns | 1.06 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 252125 ns | 249250 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1045048 ns | 1038440 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6344792 ns | 6277083 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 629272 ns | 658001 ns | 0.96 |
| batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 199941103.5 ns | 199276312.5 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 139210416.5 ns | 139271583 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 139343334 ns | 139246333 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 389333417 ns | 388477666 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5834950 ns | 5836579.5 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/GPU/AMDGPU | 3575806.5 ns | 3573103 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 620327354.5 ns | 619375645.5 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 439885000 ns | 439498458 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 440156062.5 ns | 439699604.5 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 1177316333 ns | 1187020083 ns | 0.99 |
| batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 26374289 ns | 26508453 ns | 0.99 |
| batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU | 22085394.5 ns | 22071416 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7208 ns | 13833 ns | 0.52 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6292 ns | 13292 ns | 0.47 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5709 ns | 13625 ns | 0.42 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10125 ns | 16334 ns | 0.62 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 27684 ns | 37105 ns | 0.75 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 612583 ns | 682166 ns | 0.90 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 48080 ns | 56160 ns | 0.86 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 215042 ns | 219750 ns | 0.98 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 220416.5 ns | 228708 ns | 0.96 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 221542 ns | 229666.5 ns | 0.96 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 207541.5 ns | 213125 ns | 0.97 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 221316 ns | 233596 ns | 0.95 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 9129500 ns | 9102583 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 527145.5 ns | 556036 ns | 0.95 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 9666.5 ns | 8625 ns | 1.12 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 8291 ns | 9083.5 ns | 0.91 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 9979 ns | 10416 ns | 0.96 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 8416.5 ns | 8125 ns | 1.04 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 115804.5 ns | 116194 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal | 910250 ns | 900041.5 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 69921 ns | 73561 ns | 0.95 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8271 ns | 7583 ns | 1.09 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7250 ns | 8084 ns | 0.90 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 10417 ns | 8208 ns | 1.27 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 10416.5 ns | 7708 ns | 1.35 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 516584 ns | 515823 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal | 4789687 ns | 4141083 ns | 1.16 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 315163 ns | 319483 ns | 0.99 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 6709 ns | 0.07452675510508272 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 416 ns | 7000 ns | 0.05942857142857143 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 667 ns | 7250 ns | 0.092 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 6833 ns | 0.07317430118542367 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 25989 ns | 35385 ns | 0.73 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal | 482354 ns | 317896 ns | 1.52 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 46350 ns | 58250 ns | 0.80 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 12250 ns | 15083 ns | 0.81 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 8708 ns | 16500 ns | 0.53 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 13250 ns | 17375 ns | 0.76 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14583 ns | 15583 ns | 0.94 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 251956 ns | 263078 ns | 0.96 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal | 5957666 ns | 5365146 ns | 1.11 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 390104 ns | 398084 ns | 0.98 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 107083 ns | 111604 ns | 0.96 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 99458.5 ns | 106999.5 ns | 0.93 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 100667 ns | 111125 ns | 0.91 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 146396 ns | 158562.5 ns | 0.92 |
| bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA | 24308 ns | 27181 ns | 0.89 |
| bias_activation(512, act=gelu)(512 x 128)/forward/GPU/Metal | 269875 ns | 268208 ns | 1.01 |
| bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 190462 ns | 193052 ns | 0.99 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 513042 ns | 479520.5 ns | 1.07 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 501687 ns | 510437.5 ns | 0.98 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 490687 ns | 480729 ns | 1.02 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 494500 ns | 479354.5 ns | 1.03 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA | 229321 ns | 233277 ns | 0.98 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/Metal | 2440291 ns | 2209500 ns | 1.10 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 605607 ns | 604431 ns | 1.00 |
| batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 5167 ns | 5021 ns | 1.03 |
| batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 5020.5 ns | 5708.5 ns | 0.88 |
| batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 6333 ns | 6333.5 ns | 1.00 |
| batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 5166.5 ns | 6625 ns | 0.78 |
| batchedmm(16, Bsize=32)/forward/GPU/CUDA | 15753 ns | 16031 ns | 0.98 |
| batchedmm(16, Bsize=32)/forward/GPU/AMDGPU | 83851 ns | 84920 ns | 0.99 |
| batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 12625 ns | 12729.5 ns | 0.99 |
| batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 10562.5 ns | 11646 ns | 0.91 |
| batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 11916 ns | 12146 ns | 0.98 |
| batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 17313 ns | 17375 ns | 1.00 |
| batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 211388 ns | 216325.5 ns | 0.98 |
| batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU | 376734 ns | 366004 ns | 1.03 |
| batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 38250 ns | 35312.5 ns | 1.08 |
| batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 51833 ns | 51479 ns | 1.01 |
| batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 52709 ns | 53042 ns | 0.99 |
| batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 13750 ns | 13667 ns | 1.01 |
| batchedmm(16, Bsize=128)/forward/GPU/CUDA | 21653 ns | 21712 ns | 1.00 |
| batchedmm(16, Bsize=128)/forward/GPU/AMDGPU | 87391 ns | 91931 ns | 0.95 |
| batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 37167 ns | 37354.5 ns | 0.99 |
| batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 30854 ns | 44104 ns | 0.70 |
| batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 31854 ns | 32958 ns | 0.97 |
| batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 58271 ns | 57917 ns | 1.01 |
| batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 189565 ns | 194626.5 ns | 0.97 |
| batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU | 418544 ns | 399414 ns | 1.05 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 1667 ns | 8542 ns | 0.20 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 1917 ns | 9791.5 ns | 0.20 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 2167 ns | 11625 ns | 0.19 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 1666.5 ns | 9750 ns | 0.17 |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 20444 ns | 23397 ns | 0.87 |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/Metal | 314396.5 ns | 305375 ns | 1.03 |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU | 34190 ns | 34271 ns | 1.00 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 2292 ns | 3041 ns | 0.75 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 2083 ns | 3271 ns | 0.64 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 2500 ns | 3792 ns | 0.66 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 2083 ns | 3208 ns | 0.65 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 202251.5 ns | 206318 ns | 0.98 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/Metal | 1617458 ns | 1504750.5 ns | 1.07 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU | 140196.5 ns | 141011.5 ns | 0.99 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 5395.5 ns | 4792 ns | 1.13 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4875 ns | 4708.5 ns | 1.04 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6167 ns | 6834 ns | 0.90 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4834 ns | 4667 ns | 1.04 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 143952.5 ns | 141147.5 ns | 1.02 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 450959 ns | 457167 ns | 0.99 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 69101 ns | 68731 ns | 1.01 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9083 ns | 8458.5 ns | 1.07 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8417 ns | 8459 ns | 1.00 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9417 ns | 8750 ns | 1.08 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8041.5 ns | 8333 ns | 0.97 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 869463 ns | 861183 ns | 1.01 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 5940500 ns | 5555937.5 ns | 1.07 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 397364 ns | 385044 ns | 1.03 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 56875 ns | 58083 ns | 0.98 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 57750 ns | 59084 ns | 0.98 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 57666 ns | 59416 ns | 0.97 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 58042 ns | 59416 ns | 0.98 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 37216 ns | 43710 ns | 0.85 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 666708 ns | 532666 ns | 1.25 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 210572.5 ns | 207252 ns | 1.02 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 459395.5 ns | 449104.5 ns | 1.02 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 469645.5 ns | 465666.5 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 472604 ns | 467437 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 434437.5 ns | 435520.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 265753 ns | 264164 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8430333 ns | 8246875 ns | 1.02 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 835999 ns | 831528 ns | 1.01 |
| batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 3320312.5 ns | 3290708 ns | 1.01 |
| batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 2332854.5 ns | 2334854.5 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 2338770.5 ns | 2339729 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 6300103.5 ns | 6308458 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/GPU/CUDA | 204444 ns | 204167 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/GPU/AMDGPU | 218932.5 ns | 218552 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 11527604 ns | 11346209 ns | 1.02 |
| batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 8330834 ns | 8328312.5 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 8347500 ns | 8321834 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 21053229.5 ns | 21080084 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 739381 ns | 728462 ns | 1.01 |
| batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU | 1071796 ns | 1058000 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4750 ns | 5083.5 ns | 0.93 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5145.5 ns | 6875 ns | 0.75 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7104 ns | 7083 ns | 1.00 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5500 ns | 6604 ns | 0.83 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 137284.5 ns | 136287.5 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 846417 ns | 783520.5 ns | 1.08 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 56935.5 ns | 55800 ns | 1.02 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 10167 ns | 7000 ns | 1.45 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7250 ns | 7417 ns | 0.98 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8875 ns | 7375 ns | 1.20 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7166 ns | 7291.5 ns | 0.98 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 754765 ns | 747674.5 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 5619792 ns | 5585312.5 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 363983.5 ns | 365323 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 120333 ns | 110750 ns | 1.09 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 123500 ns | 127458.5 ns | 0.97 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 104166 ns | 122542 ns | 0.85 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 121959 ns | 117167 ns | 1.04 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 151640 ns | 156753 ns | 0.97 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2035604 ns | 2136000 ns | 0.95 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 203932 ns | 226292.5 ns | 0.90 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2022667 ns | 2021500 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2031562.5 ns | 2022042 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2025666.5 ns | 2031021 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1707375 ns | 2023917 ns | 0.84 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 702589 ns | 706711 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11089875 ns | 10690542 ns | 1.04 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1248902 ns | 1254492 ns | 1.00 |
| batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 33458 ns | 28833.5 ns | 1.16 |
| batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 36625 ns | 36542 ns | 1.00 |
| batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 35291 ns | 34917 ns | 1.01 |
| batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 584 ns | 708.5 ns | 0.82 |
| batchedmm(2, Bsize=4)/forward/GPU/CUDA | 15174 ns | 15392 ns | 0.99 |
| batchedmm(2, Bsize=4)/forward/GPU/AMDGPU | 79811 ns | 79601 ns | 1.00 |
| batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 2604.5 ns | 3250 ns | 0.80 |
| batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 2708 ns | 3833 ns | 0.71 |
| batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 3667 ns | 3917 ns | 0.94 |
| batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 2167 ns | 2834 ns | 0.76 |
| batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 138127.5 ns | 139825 ns | 0.99 |
| batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU | 339374 ns | 340743 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7334 ns | 8416 ns | 0.87 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6083 ns | 7333 ns | 0.83 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6000 ns | 7541 ns | 0.80 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10167 ns | 11208 ns | 0.91 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 36513 ns | 42506 ns | 0.86 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 649854 ns | 420187.5 ns | 1.55 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 47760 ns | 50571 ns | 0.94 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 214542 ns | 213521 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 226417 ns | 229958 ns | 0.98 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 226979 ns | 222791.5 ns | 1.02 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 205334 ns | 215375 ns | 0.95 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 244233 ns | 251022 ns | 0.97 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8096000 ns | 7930584 ns | 1.02 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 576056 ns | 574850 ns | 1.00 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3917 ns | 6209 ns | 0.63 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3916 ns | 6375 ns | 0.61 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3917 ns | 6459 ns | 0.61 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3917 ns | 6125 ns | 0.64 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 21418 ns | 28584 ns | 0.75 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/Metal | 247729.5 ns | 251125 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU | 42160 ns | 47090 ns | 0.90 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14625 ns | 23167 ns | 0.63 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14917 ns | 24166 ns | 0.62 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14959 ns | 24375 ns | 0.61 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14875 ns | 23292 ns | 0.64 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 308560 ns | 333019.5 ns | 0.93 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/Metal | 1007917 ns | 1014854.5 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU | 197152 ns | 208872 ns | 0.94 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 127625 ns | 110000.5 ns | 1.16 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 102583.5 ns | 148604 ns | 0.69 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 108542 ns | 126750 ns | 0.86 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 144833 ns | 133125 ns | 1.09 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 140996 ns | 148515 ns | 0.95 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2049875 ns | 2080104 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 205082 ns | 217122 ns | 0.94 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1924833 ns | 1912521 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1933417 ns | 1906583 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1927375 ns | 1884312.5 ns | 1.02 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1690041 ns | 1920187.5 ns | 0.88 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 688959 ns | 696456 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10643666 ns | 10487959 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1213162 ns | 1218296 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 16916 ns | 17542 ns | 0.96 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 22833 ns | 22458 ns | 1.02 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 22083.5 ns | 20771 ns | 1.06 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 18791 ns | 18771 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 108917 ns | 112142.5 ns | 0.97 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1343250 ns | 1340625 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 80395.5 ns | 80871 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 217750 ns | 215875 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 216291.5 ns | 253583 ns | 0.85 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 219333 ns | 217667 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 217312.5 ns | 216417 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 519232 ns | 525953.5 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6202583.5 ns | 6121020.5 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 476885 ns | 476639.5 ns | 1.00 |
| batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 25542 ns | 23979.5 ns | 1.07 |
| batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 30750 ns | 32625 ns | 0.94 |
| batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 28270.5 ns | 28250 ns | 1.00 |
| batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 1291 ns | 1666.5 ns | 0.77 |
| batchedmm(16, Bsize=4)/forward/GPU/CUDA | 15852 ns | 16428 ns | 0.96 |
| batchedmm(16, Bsize=4)/forward/GPU/AMDGPU | 87461 ns | 81141 ns | 1.08 |
| batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 4646 ns | 5271 ns | 0.88 |
| batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 4667 ns | 5854.5 ns | 0.80 |
| batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 5437.5 ns | 6437.5 ns | 0.84 |
| batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 4791 ns | 5646 ns | 0.85 |
| batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 205245 ns | 215206.5 ns | 0.95 |
| batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU | 379294 ns | 379243.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 305167 ns | 303000 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 305500 ns | 305416.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 307958.5 ns | 308771 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 307625 ns | 305083 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 227913 ns | 231043 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1203958 ns | 1184000 ns | 1.02 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 273823 ns | 272543 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 577625 ns | 529833.5 ns | 1.09 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 541708.5 ns | 567729.5 ns | 0.95 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 538625 ns | 533292 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 531000 ns | 536958.5 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1079009 ns | 1091736 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6297000 ns | 6208000 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 857439 ns | 868528 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 19583 ns | 36042 ns | 0.54 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 21020.5 ns | 39083 ns | 0.54 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 23146 ns | 42458 ns | 0.55 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 20125 ns | 37041 ns | 0.54 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 113331 ns | 131591 ns | 0.86 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1478062.5 ns | 1464375 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 79341 ns | 87560 ns | 0.91 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 219375 ns | 215250.5 ns | 1.02 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 223208 ns | 215104.5 ns | 1.04 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 221333 ns | 215917 ns | 1.03 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 213250 ns | 219083.5 ns | 0.97 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 742408.5 ns | 768516 ns | 0.97 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7389833.5 ns | 7384104 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 536040.5 ns | 532425 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6459 ns | 5500 ns | 1.17 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6167 ns | 7000 ns | 0.88 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 8271 ns | 8542 ns | 0.97 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6709 ns | 6334 ns | 1.06 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 140459.5 ns | 140673 ns | 1.00 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal | 825583 ns | 772083 ns | 1.07 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 64880 ns | 67510 ns | 0.96 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10917 ns | 10271 ns | 1.06 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 10291.5 ns | 10292 ns | 1.00 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 11000 ns | 10833.5 ns | 1.02 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9042 ns | 10875 ns | 0.83 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 822871.5 ns | 833045.5 ns | 0.99 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal | 5686479 ns | 5336083 ns | 1.07 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 386913.5 ns | 390258.5 ns | 0.99 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4666 ns | 4709 ns | 0.99 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 4729 ns | 5500 ns | 0.86 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6166.5 ns | 6709 ns | 0.92 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 4729 ns | 6542 ns | 0.72 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 142533.5 ns | 144256 ns | 0.99 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal | 842166 ns | 792979 ns | 1.06 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 66671 ns | 66931 ns | 1.00 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7750 ns | 7125.5 ns | 1.09 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7520.5 ns | 7917 ns | 0.95 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7792 ns | 7917 ns | 0.98 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7334 ns | 7604.5 ns | 0.96 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 781913 ns | 790107 ns | 0.99 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal | 6165208 ns | 5547667 ns | 1.11 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 390103 ns | 391578.5 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 14542834 ns | 14365625 ns | 1.01 |
| batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 10133792 ns | 10109792 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 10143250 ns | 10132375 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 27710708 ns | 27659333 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/GPU/CUDA | 529129.5 ns | 534508 ns | 0.99 |
| batchedmm(128, Bsize=512)/forward/GPU/AMDGPU | 397218.5 ns | 392324 ns | 1.01 |
| batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 46516250 ns | 45855833 ns | 1.01 |
| batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 33510145.5 ns | 33506395.5 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 33544583 ns | 33525958 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 85322958 ns | 85233208 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2673514 ns | 2804828.5 ns | 0.95 |
| batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU | 3288493 ns | 3316671 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 68541 ns | 83646 ns | 0.82 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 66354 ns | 87875 ns | 0.76 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 69208 ns | 90333 ns | 0.77 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 66250 ns | 85687.5 ns | 0.77 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 120674 ns | 124763.5 ns | 0.97 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1454437.5 ns | 1478042 ns | 0.98 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 228432 ns | 248002.5 ns | 0.92 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 450875 ns | 442062 ns | 1.02 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 484020.5 ns | 451167 ns | 1.07 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 448417 ns | 444167 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 455791 ns | 441479 ns | 1.03 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 730668 ns | 747087 ns | 0.98 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7716375 ns | 7697145.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 781067 ns | 784227 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 542 ns | 1750 ns | 0.31 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 541 ns | 1875 ns | 0.29 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 584 ns | 2000 ns | 0.29 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 584 ns | 1750 ns | 0.33 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 32007 ns | 38564 ns | 0.83 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal | 466021 ns | 469896 ns | 0.99 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 48911 ns | 50030 ns | 0.98 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9375 ns | 10250 ns | 0.91 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 8125 ns | 10938 ns | 0.74 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9562.5 ns | 11042 ns | 0.87 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 8583 ns | 10625 ns | 0.81 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 284659 ns | 286642 ns | 0.99 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal | 5684229 ns | 4815583.5 ns | 1.18 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 375694 ns | 389993 ns | 0.96 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 9791 ns | 9833 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 9833 ns | 9834 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 9792 ns | 9833 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 9875 ns | 9792 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 22881 ns | 23280 ns | 0.98 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/Metal | 221417 ns | 228792 ns | 0.97 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 210902 ns | 204342 ns | 1.03 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 45708 ns | 49584 ns | 0.92 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 45875 ns | 50542 ns | 0.91 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 46292 ns | 50708 ns | 0.91 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 45875 ns | 49917 ns | 0.92 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 288887.5 ns | 308151 ns | 0.94 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/Metal | 1407145.5 ns | 1545500 ns | 0.91 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 605076 ns | 603836 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 56709 ns | 62834 ns | 0.90 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 57166 ns | 64292 ns | 0.89 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 57041 ns | 64333 ns | 0.89 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 57875 ns | 64250 ns | 0.90 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 28717 ns | 39886 ns | 0.72 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 606417 ns | 638041.5 ns | 0.95 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 202332 ns | 213412 ns | 0.95 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 458583 ns | 456084 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 511562 ns | 488791.5 ns | 1.05 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 468541 ns | 476146 ns | 0.98 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 444250 ns | 491750 ns | 0.90 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 246546 ns | 263616 ns | 0.94 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 9398625 ns | 9629125 ns | 0.98 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 886918.5 ns | 891718 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 622666.5 ns | 638875 ns | 0.97 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 650042 ns | 657062.5 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 626375 ns | 647917 ns | 0.97 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 609145.5 ns | 637833 ns | 0.96 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 206439.5 ns | 209655 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1405042 ns | 1377917 ns | 1.02 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 302253 ns | 308858 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2246916.5 ns | 2231042 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2234750 ns | 2234709 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2232959 ns | 2231770.5 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2238478.5 ns | 2224542 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 974225.5 ns | 969019 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 7985125 ns | 7164667 ns | 1.11 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1218663 ns | 1319082 ns | 0.92 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 23292 ns | 36750.5 ns | 0.63 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 19541 ns | 40083 ns | 0.49 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 23708.5 ns | 42416 ns | 0.56 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 20416 ns | 36146 ns | 0.56 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 113048.5 ns | 131167.5 ns | 0.86 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1461666 ns | 1489541.5 ns | 0.98 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 79571 ns | 89901 ns | 0.89 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 229583 ns | 221125 ns | 1.04 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 219083 ns | 231999.5 ns | 0.94 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 225166 ns | 223062.5 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 260792 ns | 220250 ns | 1.18 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 729682 ns | 745440.5 ns | 0.98 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7602084 ns | 7764958 ns | 0.98 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 553456 ns | 549685 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 542 ns | 6833 ns | 0.07932094248499927 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 500 ns | 7208 ns | 0.06936736958934517 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 667 ns | 7375 ns | 0.0904406779661017 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 583 ns | 6834 ns | 0.08530875036581798 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 22763 ns | 33512 ns | 0.68 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal | 508625 ns | 444542 ns | 1.14 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 47710 ns | 57271 ns | 0.83 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9104.5 ns | 15833.5 ns | 0.58 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9584 ns | 17146 ns | 0.56 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9750 ns | 17041 ns | 0.57 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9666 ns | 16687.5 ns | 0.58 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 266997 ns | 282963.5 ns | 0.94 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal | 6194875 ns | 5994417 ns | 1.03 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 405084 ns | 408498.5 ns | 0.99 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 9167 ns | 9458.5 ns | 0.97 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 8354 ns | 9167 ns | 0.91 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 10083 ns | 10416.5 ns | 0.97 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 9333 ns | 8541.5 ns | 1.09 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 119160 ns | 120739.5 ns | 0.99 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 906395.5 ns | 888833.5 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 71140 ns | 70321 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7437.5 ns | 7500 ns | 0.99 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7833 ns | 7667 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8042 ns | 7895.5 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7834 ns | 7417 ns | 1.06 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 505714 ns | 513454.5 ns | 0.98 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 4336416.5 ns | 3973625 ns | 1.09 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 318483 ns | 319713 ns | 1.00 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1542 ns | 9354 ns | 0.16 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1500 ns | 9542 ns | 0.16 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 2000 ns | 10583 ns | 0.19 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1459 ns | 9229 ns | 0.16 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 20988 ns | 24142 ns | 0.87 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/Metal | 308500 ns | 305208 ns | 1.01 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 188631.5 ns | 190361 ns | 0.99 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 3208 ns | 4145.5 ns | 0.77 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 3375 ns | 4208 ns | 0.80 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 3750 ns | 4750 ns | 0.79 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 3292 ns | 4167 ns | 0.79 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 215843 ns | 226431.5 ns | 0.95 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/Metal | 1789834 ns | 1679312.5 ns | 1.07 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 578526 ns | 577155 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 149521 ns | 155083 ns | 0.96 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 131625 ns | 136375 ns | 0.97 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 129375 ns | 140958 ns | 0.92 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 225125 ns | 232833.5 ns | 0.97 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 23704 ns | 26998 ns | 0.88 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/Metal | 277145.5 ns | 297875 ns | 0.93 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU | 41105.5 ns | 42431 ns | 0.97 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 160500 ns | 144458 ns | 1.11 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 127875 ns | 127291 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 111292 ns | 112104.5 ns | 0.99 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 285104 ns | 252250 ns | 1.13 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 215659.5 ns | 219049 ns | 0.98 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/Metal | 2088792 ns | 2074312 ns | 1.01 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU | 218702 ns | 265923 ns | 0.82 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7291 ns | 8583 ns | 0.85 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6042 ns | 7292 ns | 0.83 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6041 ns | 7292 ns | 0.83 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10125 ns | 11333 ns | 0.89 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 32743 ns | 38422 ns | 0.85 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 684042 ns | 374313 ns | 1.83 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 50791 ns | 51281 ns | 0.99 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 258458 ns | 221417 ns | 1.17 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 269000 ns | 229791 ns | 1.17 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 232916 ns | 230458.5 ns | 1.01 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 213583 ns | 214041.5 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 264603 ns | 259283 ns | 1.02 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8332437.5 ns | 8241896 ns | 1.01 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 596726 ns | 592306 ns | 1.01 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 15208 ns | 15479 ns | 0.98 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 14750 ns | 15375 ns | 0.96 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 16812.5 ns | 17458 ns | 0.96 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 15500 ns | 15542 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 139047.5 ns | 137835 ns | 1.01 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 826750 ns | 778728.5 ns | 1.06 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 231322 ns | 231852 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 24000 ns | 23417 ns | 1.02 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 23459 ns | 23791 ns | 0.99 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 24833 ns | 24000 ns | 1.03 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 23729.5 ns | 23937 ns | 0.99 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 863000.5 ns | 858271 ns | 1.01 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 5971562.5 ns | 5635500 ns | 1.06 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 676426 ns | 677086 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 9125 ns | 26604 ns | 0.34 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 9792 ns | 28250 ns | 0.35 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 11625 ns | 31333 ns | 0.37 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 9959 ns | 26812.5 ns | 0.37 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 123210 ns | 137010 ns | 0.90 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal | 822750 ns | 925417 ns | 0.89 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 73491 ns | 82411 ns | 0.89 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 13792 ns | 14792 ns | 0.93 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 13834 ns | 15708 ns | 0.88 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14708 ns | 16000 ns | 0.92 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 13959 ns | 15645.5 ns | 0.89 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 662447.5 ns | 668142 ns | 0.99 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal | 5427958 ns | 5325770.5 ns | 1.02 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 363743 ns | 366524 ns | 0.99 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 9500 ns | 9312.5 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 9917 ns | 9416 ns | 1.05 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 10917 ns | 10583 ns | 1.03 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 9500 ns | 9542 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 121865.5 ns | 121280 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal | 947416 ns | 932375 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 72530 ns | 72561 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12625 ns | 12354 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12791 ns | 13000 ns | 0.98 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13500 ns | 13042 ns | 1.04 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12792 ns | 12458.5 ns | 1.03 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 548713 ns | 545614 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal | 4635584 ns | 4752396 ns | 0.98 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 338153 ns | 340553.5 ns | 0.99 |
| batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 31166.5 ns | 26958 ns | 1.16 |
| batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 34604.5 ns | 34792 ns | 0.99 |
| batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 32000.5 ns | 32041.5 ns | 1.00 |
| batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 2041 ns | 1958.5 ns | 1.04 |
| batchedmm(2, Bsize=128)/forward/GPU/CUDA | 16060 ns | 16169 ns | 0.99 |
| batchedmm(2, Bsize=128)/forward/GPU/AMDGPU | 80571 ns | 80481 ns | 1.00 |
| batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 5459 ns | 6042 ns | 0.90 |
| batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 5229.5 ns | 6208 ns | 0.84 |
| batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 5500 ns | 6520.5 ns | 0.84 |
| batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 6375 ns | 6834 ns | 0.93 |
| batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 137726 ns | 141884.5 ns | 0.97 |
| batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU | 367863 ns | 371004 ns | 0.99 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 6458 ns | 0.0452152369154537 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 292 ns | 6834 ns | 0.04272753877670471 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 6875 ns | 0.05454545454545454 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 6375 ns | 0.04580392156862745 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 25579 ns | 34623 ns | 0.74 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 497667 ns | 457312.5 ns | 1.09 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 46710 ns | 56171 ns | 0.83 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6250 ns | 12916 ns | 0.48 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6333 ns | 13791 ns | 0.46 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6875 ns | 14084 ns | 0.49 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6417 ns | 13042 ns | 0.49 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 186593 ns | 198569 ns | 0.94 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 6148312.5 ns | 5453125 ns | 1.13 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 386914 ns | 396759 ns | 0.98 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 2000 ns | 8292 ns | 0.24 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 2042 ns | 8625 ns | 0.24 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 2042 ns | 8833 ns | 0.23 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 2042 ns | 8333 ns | 0.25 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 25958 ns | 35748 ns | 0.73 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal | 499875 ns | 324084 ns | 1.54 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 204552 ns | 215022 ns | 0.95 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 16250 ns | 22770.5 ns | 0.71 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 16375 ns | 23812.5 ns | 0.69 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 17208 ns | 24291.5 ns | 0.71 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 17208 ns | 22458 ns | 0.77 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 273322 ns | 284982 ns | 0.96 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal | 6425583 ns | 5718333 ns | 1.12 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 703277 ns | 709637 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 151375 ns | 149125 ns | 1.02 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 175375 ns | 155917 ns | 1.12 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 151625 ns | 152500 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 148542 ns | 148250 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 201338.5 ns | 200827 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1421833 ns | 1424250.5 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 235723 ns | 214342 ns | 1.10 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1323958 ns | 1322854.5 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1324646 ns | 1324334 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1329917 ns | 1306187.5 ns | 1.02 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1331958 ns | 1319750 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 910604 ns | 894838 ns | 1.02 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 6755792 ns | 6451042 ns | 1.05 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1099850.5 ns | 1104625 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 26000 ns | 25541.5 ns | 1.02 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 26917 ns | 25166 ns | 1.07 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 27959 ns | 27666 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 26333 ns | 24084 ns | 1.09 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 236208.5 ns | 236708 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1183958.5 ns | 1207792 ns | 0.98 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 115741 ns | 114481 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 131958.5 ns | 117291.5 ns | 1.13 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 174417 ns | 119125.5 ns | 1.46 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 118937.5 ns | 119021 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 134229.5 ns | 129000 ns | 1.04 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1078567 ns | 1066520 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6416541.5 ns | 6154750 ns | 1.04 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 608010.5 ns | 614935 ns | 0.99 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 250 ns | 6417 ns | 0.03895901511609787 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 292 ns | 6750 ns | 0.04325925925925926 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 6875 ns | 0.05454545454545454 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 334 ns | 6458 ns | 0.051718798389594305 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 22647 ns | 32046 ns | 0.71 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal | 486354 ns | 304791.5 ns | 1.60 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 46890 ns | 56421 ns | 0.83 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6354 ns | 12958 ns | 0.49 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6625 ns | 13958 ns | 0.47 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6917 ns | 14104 ns | 0.49 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6770.5 ns | 12979.5 ns | 0.52 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 203495.5 ns | 219681.5 ns | 0.93 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal | 6258459 ns | 5367125 ns | 1.17 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 385664 ns | 404804 ns | 0.95 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 7041 ns | 6042 ns | 1.17 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6125 ns | 6958 ns | 0.88 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7500 ns | 8000 ns | 0.94 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6792 ns | 5812.5 ns | 1.17 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 145023.5 ns | 143745 ns | 1.01 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal | 543645.5 ns | 721833 ns | 0.75 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 232412 ns | 232722 ns | 1.00 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9916.5 ns | 9875 ns | 1.00 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9917 ns | 10417 ns | 0.95 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10459 ns | 10375 ns | 1.01 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9917 ns | 10083.5 ns | 0.98 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 899150 ns | 893962 ns | 1.01 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal | 6366229 ns | 6022625 ns | 1.06 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 670016 ns | 667866 ns | 1.00 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 625 ns | 666 ns | 0.94 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 667 ns | 666 ns | 1.00 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 667 ns | 666 ns | 1.00 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 667 ns | 667 ns | 1.00 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 22252 ns | 22221.5 ns | 1.00 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/Metal | 328833 ns | 253958 ns | 1.29 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 206112 ns | 206192 ns | 1.00 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 4583 ns | 7958 ns | 0.58 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 4666 ns | 8833 ns | 0.53 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 4916 ns | 8875 ns | 0.55 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 4583 ns | 8041 ns | 0.57 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 226579.5 ns | 238671.5 ns | 0.95 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/Metal | 1758521 ns | 1611250 ns | 1.09 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 576735 ns | 575715 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 7792 ns | 24208 ns | 0.32 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 8459 ns | 26562.5 ns | 0.32 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 9666 ns | 29458 ns | 0.33 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 8749.5 ns | 25313 ns | 0.35 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 121871.5 ns | 134686.5 ns | 0.90 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 904542 ns | 819479.5 ns | 1.10 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 72230 ns | 82871 ns | 0.87 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8084 ns | 9833 ns | 0.82 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8917 ns | 10750 ns | 0.83 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8750 ns | 10584 ns | 0.83 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8771 ns | 9959 ns | 0.88 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 586988 ns | 592874 ns | 0.99 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 5366104 ns | 4586229.5 ns | 1.17 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 343713 ns | 342583 ns | 1.00 |
| batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 126750 ns | 125959 ns | 1.01 |
| batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 130333 ns | 129958 ns | 1.00 |
| batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 129959 ns | 130021 ns | 1.00 |
| batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 183083 ns | 181187.5 ns | 1.01 |
| batchedmm(128, Bsize=4)/forward/GPU/CUDA | 45749 ns | 45830 ns | 1.00 |
| batchedmm(128, Bsize=4)/forward/GPU/AMDGPU | 102711 ns | 105671 ns | 0.97 |
| batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 334625 ns | 325125 ns | 1.03 |
| batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 345916.5 ns | 323667 ns | 1.07 |
| batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 313708 ns | 316417 ns | 0.99 |
| batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 607375 ns | 616792 ns | 0.98 |
| batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 190009.5 ns | 194713 ns | 0.98 |
| batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU | 509425 ns | 508449.5 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 399042 ns | 400583 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 288041 ns | 290666 ns | 0.99 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 288083 ns | 291292 ns | 0.99 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 756334 ns | 759541 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 43243 ns | 51490 ns | 0.84 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/Metal | 421334 ns | 458875 ns | 0.92 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU | 81861 ns | 84931 ns | 0.96 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 1414229.5 ns | 1458459 ns | 0.97 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 1137312.5 ns | 1140687.5 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 1123812 ns | 1149666.5 ns | 0.98 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 2440958 ns | 2451791 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 246507.5 ns | 274619 ns | 0.90 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/Metal | 1867604 ns | 1914208 ns | 0.98 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU | 350093 ns | 358283 ns | 0.98 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 658792 ns | 633666 ns | 1.04 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 649229 ns | 663666.5 ns | 0.98 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 645250 ns | 645687.5 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 613562.5 ns | 632541 ns | 0.97 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 199471 ns | 200663 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1393834 ns | 1352979.5 ns | 1.03 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 313773 ns | 307532.5 ns | 1.02 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2462459 ns | 2467667 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2455083 ns | 2454750 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2436959 ns | 2454500 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2440000 ns | 2451167 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 993992.5 ns | 984218.5 ns | 1.01 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10375437 ns | 7766292 ns | 1.34 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1301638 ns | 1380642 ns | 0.94 |
| batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 33187 ns | 32292 ns | 1.03 |
| batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 36020.5 ns | 36875 ns | 0.98 |
| batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 34437 ns | 34000 ns | 1.01 |
| batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 833 ns | 958.5 ns | 0.87 |
| batchedmm(2, Bsize=32)/forward/GPU/CUDA | 15321 ns | 15278.5 ns | 1.00 |
| batchedmm(2, Bsize=32)/forward/GPU/AMDGPU | 76730 ns | 78690.5 ns | 0.98 |
| batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 3083.5 ns | 3792 ns | 0.81 |
| batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 3208 ns | 4333 ns | 0.74 |
| batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 3459 ns | 4583.5 ns | 0.75 |
| batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 3042 ns | 4124.5 ns | 0.74 |
| batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 136645 ns | 140987 ns | 0.97 |
| batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU | 338843.5 ns | 336043 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 408667 ns | 413209 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 408167 ns | 415792 ns | 0.98 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 408208 ns | 416395.5 ns | 0.98 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 419208 ns | 427145.5 ns | 0.98 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 42599 ns | 54475 ns | 0.78 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1163958 ns | 1198125 ns | 0.97 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 238593 ns | 250702 ns | 0.95 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3876875 ns | 3877833 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4000249.5 ns | 3995771 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 3985416.5 ns | 3886792 ns | 1.03 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3759708.5 ns | 3754728.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 242677 ns | 255856 ns | 0.95 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11849584 ns | 11943833 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1429804 ns | 1432843 ns | 1.00 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3917 ns | 3958 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3958 ns | 3917 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3958 ns | 3917 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3958 ns | 3917 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA | 33734 ns | 33843 ns | 1.00 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/Metal | 245792 ns | 177584 ns | 1.38 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU | 37910 ns | 37790.5 ns | 1.00 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 15500 ns | 19500 ns | 0.79 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 15750 ns | 20083 ns | 0.78 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 16000 ns | 20375 ns | 0.79 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 15750 ns | 19708 ns | 0.80 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA | 253719.5 ns | 265715 ns | 0.95 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/Metal | 877916 ns | 870334 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU | 168172 ns | 178112 ns | 0.94 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 404041 ns | 404500 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 295791 ns | 296167 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 295750 ns | 295667 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 760750 ns | 760375 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA | 112938 ns | 112966 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/Metal | 472396 ns | 439666 ns | 1.07 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU | 87831 ns | 87800 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 1425834 ns | 1477375 ns | 0.97 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 1164145.5 ns | 1147125 ns | 1.01 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 1154500 ns | 1158208 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 2463334 ns | 2470770.5 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA | 237064 ns | 253070 ns | 0.94 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/Metal | 1932604 ns | 1857708 ns | 1.04 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 351653 ns | 354333 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 500 ns | 6708 ns | 0.07453786523553965 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 583 ns | 7125 ns | 0.08182456140350877 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 583 ns | 7125 ns | 0.08182456140350877 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 584 ns | 6666 ns | 0.08760876087608761 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 25248 ns | 34849 ns | 0.72 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal | 487000 ns | 444292 ns | 1.10 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 206212 ns | 216212 ns | 0.95 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 7208 ns | 14000 ns | 0.51 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 7792 ns | 15125 ns | 0.52 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8000 ns | 15584 ns | 0.51 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 7958 ns | 14250 ns | 0.56 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 210355 ns | 223556.5 ns | 0.94 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal | 5898958 ns | 5345375.5 ns | 1.10 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 681746 ns | 696921.5 ns | 0.98 |
| batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) | 832291.5 ns | 832708 ns | 1.00 |
| batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) | 620334 ns | 618166 ns | 1.00 |
| batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) | 620375 ns | 611542 ns | 1.01 |
| batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) | 1554083 ns | 1540812.5 ns | 1.01 |
| batchedmm(128, Bsize=32)/forward/GPU/CUDA | 129543 ns | 130337.5 ns | 0.99 |
| batchedmm(128, Bsize=32)/forward/GPU/AMDGPU | 230863 ns | 224742 ns | 1.03 |
| batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) | 2682917 ns | 2662417 ns | 1.01 |
| batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) | 2002166.5 ns | 2007708 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) | 2003625 ns | 2003084 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) | 4890125 ns | 4932771 ns | 0.99 |
| batchedmm(128, Bsize=32)/zygote/GPU/CUDA | 259430 ns | 261909.5 ns | 0.99 |
| batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU | 768547 ns | 835813 ns | 0.92 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 291 ns | 1375 ns | 0.21 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 292 ns | 1542 ns | 0.19 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 1583 ns | 0.24 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 375 ns | 1375 ns | 0.27 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 31540 ns | 36888 ns | 0.86 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal | 476750 ns | 366667 ns | 1.30 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 46720 ns | 49661 ns | 0.94 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6083 ns | 7687.5 ns | 0.79 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6333 ns | 8542 ns | 0.74 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6667 ns | 8291 ns | 0.80 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6625 ns | 7958 ns | 0.83 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 229293 ns | 219381 ns | 1.05 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal | 5740833 ns | 4917833 ns | 1.17 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 358943 ns | 375813 ns | 0.96 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2382542 ns | 2401916.5 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2420709 ns | 2401583 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2392542 ns | 2379416 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2375083 ns | 2371833 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 201753.5 ns | 198341.5 ns | 1.02 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1502667 ns | 2274958 ns | 0.66 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 374933.5 ns | 374084 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4665833 ns | 4636458 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4657978.5 ns | 4653166.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4661000 ns | 4641125 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4630625 ns | 4652750 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 899813 ns | 889968 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 7005729 ns | 6404438 ns | 1.09 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1237162 ns | 1356447.5 ns | 0.91 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 6896 ns | 17208.5 ns | 0.40 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 7000 ns | 14583 ns | 0.48 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 7667 ns | 16313 ns | 0.47 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 6916.5 ns | 21229 ns | 0.33 |
| bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA | 23032 ns | 25470 ns | 0.90 |
| bias_activation(512, act=relu)(512 x 128)/forward/GPU/Metal | 249750 ns | 267750 ns | 0.93 |
| bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU | 37160.5 ns | 42811 ns | 0.87 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 45292 ns | 45146 ns | 1.00 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 33000 ns | 49833 ns | 0.66 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 45875 ns | 34417 ns | 1.33 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 32937.5 ns | 73000.5 ns | 0.45 |
| bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA | 215380 ns | 218060 ns | 0.99 |
| bias_activation(512, act=relu)(512 x 128)/zygote/GPU/Metal | 2165104 ns | 2129250 ns | 1.02 |
| bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 263512 ns | 268402 ns | 0.98 |
| batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) | 21604 ns | 20459 ns | 1.06 |
| batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) | 25667 ns | 26208 ns | 0.98 |
| batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) | 25291.5 ns | 25292 ns | 1.00 |
| batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) | 5250 ns | 5333.5 ns | 0.98 |
| batchedmm(2, Bsize=512)/forward/GPU/CUDA | 16144 ns | 16594 ns | 0.97 |
| batchedmm(2, Bsize=512)/forward/GPU/AMDGPU | 84401 ns | 83491 ns | 1.01 |
| batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) | 11917 ns | 12541 ns | 0.95 |
| batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) | 10125 ns | 11375 ns | 0.89 |
| batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) | 10833 ns | 11625 ns | 0.93 |
| batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) | 19021 ns | 19084 ns | 1.00 |
| batchedmm(2, Bsize=512)/zygote/GPU/CUDA | 225599 ns | 227944.5 ns | 0.99 |
| batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU | 371243.5 ns | 370203 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 406334 ns | 409416 ns | 0.99 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 296833 ns | 299958 ns | 0.99 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 297292 ns | 300250 ns | 0.99 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 762458 ns | 765750 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA | 45629.5 ns | 53976 ns | 0.85 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/Metal | 508917 ns | 442125 ns | 1.15 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU | 88701 ns | 94470.5 ns | 0.94 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 1436625 ns | 1489667 ns | 0.96 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 1165250 ns | 1171812 ns | 0.99 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 1164416.5 ns | 1175459 ns | 0.99 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 2470667 ns | 2480500 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA | 290172 ns | 311892 ns | 0.93 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/Metal | 2138521 ns | 2072208.5 ns | 1.03 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 373543.5 ns | 370933 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 434312.5 ns | 435250 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 436375 ns | 438084 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 437145.5 ns | 437333 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 446208 ns | 448333 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 54620 ns | 61295 ns | 0.89 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1097771 ns | 1135104 ns | 0.97 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 234342 ns | 237222 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3914542 ns | 3895917 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4034250 ns | 4001312.5 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4027166.5 ns | 3913375.5 ns | 1.03 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3797250 ns | 3807916.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 262204 ns | 261286 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10527209 ns | 9972333 ns | 1.06 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1358593 ns | 1208741 ns | 1.12 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 8709 ns | 11000 ns | 0.79 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 7667 ns | 10292 ns | 0.74 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 7666 ns | 10334 ns | 0.74 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 12417 ns | 14625 ns | 0.85 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA | 23305 ns | 30723 ns | 0.76 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/Metal | 229958 ns | 233208.5 ns | 0.99 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 209822 ns | 215396.5 ns | 0.97 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 44791 ns | 52791 ns | 0.85 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 45375 ns | 53583 ns | 0.85 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 45125 ns | 54083.5 ns | 0.83 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 45125 ns | 53125 ns | 0.85 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA | 345825.5 ns | 366013 ns | 0.94 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/Metal | 1263834 ns | 1891437.5 ns | 0.67 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 655157 ns | 643336 ns | 1.02 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 122229 ns | 94209 ns | 1.30 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 124458 ns | 90833 ns | 1.37 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 86958 ns | 85958 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 86333.5 ns | 126167 ns | 0.68 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 190067.5 ns | 190399.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2021459 ns | 1996458 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 225357.5 ns | 221047 ns | 1.02 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2020250 ns | 2017500 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2026812.5 ns | 2011417 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2019709 ns | 1801333.5 ns | 1.12 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2000187.5 ns | 1978875 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 534191 ns | 531205 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9193375 ns | 9357625 ns | 0.98 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 953884 ns | 1089565 ns | 0.88 |
This comment was automatically generated by workflow using github-action-benchmark.
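
Each benchmark entry above lists two timings in nanoseconds followed by a ratio. The numbers are consistent with the ratio being the first timing divided by the second. A minimal Python sketch of that arithmetic follows, assuming (as is usual for github-action-benchmark reports, but not stated in this comment) that the first timing belongs to the current commit and the second to the previous one. The `classify` helper and its 10% tolerance are hypothetical and are not part of the benchmark workflow.

```python
def ratio(first_ns: float, second_ns: float) -> float:
    """Return first timing / second timing, as in the last field of each entry."""
    if second_ns <= 0:
        raise ValueError("second_ns must be positive")
    return first_ns / second_ns


def classify(r: float, tolerance: float = 0.10) -> str:
    """Label a ratio; the 10% tolerance is an illustrative assumption."""
    if r > 1 + tolerance:
        return "slower"
    if r < 1 - tolerance:
        return "faster"
    return "unchanged"


# layernorm(4, act=relu, affine=true) forward, CPU, 1 thread
print(round(ratio(613562.5, 632541), 2))
# layernorm(4, act=relu, affine=true) zygote, GPU/Metal
print(classify(ratio(10375437, 7766292)))
```

Under that assumption, a ratio below 1 means the first (current) timing was shorter than the second.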