forked from FasterDecoding/Medusa
prof_file_medusa.csv
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg CPU Mem Self CPU Mem CUDA Mem Self CUDA Mem # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
[memory] 0.00% 0.000us 0.00% 0.000us 0.000us 0.000us 0.00% 0.000us 0.000us 0 b 0 b 13.96 Gb 13.96 Gb 2340
hipGetDevicePropertiesR0600 0.00% 5.261us 0.00% 5.261us 5.261us 0.000us 0.00% 0.000us 0.000us 0 b 0 b 0 b 0 b 1
hipDeviceGetStreamPriorityRange 0.00% 0.510us 0.00% 0.510us 0.510us 0.000us 0.00% 0.000us 0.000us 0 b 0 b 0 b 0 b 1
hipStreamIsCapturing 0.00% 284.261us 0.00% 284.261us 1.077us 0.000us 0.00% 0.000us 0.000us 0 b 0 b 0 b 0 b 264
hipMalloc 0.45% 32.886ms 0.45% 32.886ms 124.568us 0.000us 0.00% 0.000us 0.000us 0 b 0 b 0 b 0 b 264
hipMemcpyWithStream 99.31% 7.228s 99.31% 7.228s 17.293ms 0.000us 0.00% 0.000us 0.000us 0 b 0 b 0 b 0 b 418
Memcpy HtoD (Host -> Device) 0.00% 0.000us 0.00% 0.000us 0.000us 786.792ms 99.41% 786.792ms 1.882ms 0 b 0 b 0 b 0 b 418
hipMemGetInfo 0.00% 4.380us 0.00% 4.380us 4.380us 0.000us 0.00% 0.000us 0.000us 0 b 0 b 0 b 0 b 1
hipFree 0.00% 23.270us 0.00% 23.270us 23.270us 0.000us 0.00% 0.000us 0.000us 0 b 0 b 0 b 0 b 1
hipLaunchKernel 0.18% 13.039ms 0.18% 13.039ms 869.295us 0.000us 0.00% 0.000us 0.000us 0 b 0 b 0 b 0 b 15
void at::native::elementwise_kernel<512, 1, at::nati... 0.00% 0.000us 0.00% 0.000us 0.000us 4.704ms 0.59% 4.704ms 313.590us 0 b 0 b 0 b 0 b 15
hipDeviceSynchronize 0.05% 3.837ms 0.05% 3.837ms 3.837ms 0.000us 0.00% 0.000us 0.000us 0 b 0 b 0 b 0 b 1
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 7.278s
Self CUDA time total: 791.496ms
------------------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg CPU Mem Self CPU Mem # of Calls
------------------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
[memory] 0.00% 0.000us 0.00% 0.000us 0.000us 0 b 0 b 16
hipDeviceSynchronize 100.00% 1.899us 100.00% 1.899us 1.899us 0 b 0 b 1
------------------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 1.899us
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg CPU Mem Self CPU Mem CUDA Mem Self CUDA Mem # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
[memory] 0.00% 0.000us 0.00% 0.000us 0.000us 0.000us 0.00% 0.000us 0.000us 512 b 512 b 1.00 Gb 1.00 Gb 109128
hipMemcpyAsync 0.03% 453.264us 0.03% 453.264us 7.195us 0.000us 0.00% 0.000us 0.000us 0 b 0 b 0 b 0 b 63
Memcpy HtoD (Host -> Device) 0.00% 0.000us 0.00% 0.000us 0.000us 62.560us 0.00% 62.560us 8.937us 0 b 0 b 0 b 0 b 7
hipMemcpyWithStream 54.06% 788.460ms 54.06% 788.460ms 1.546ms 0.000us 0.00% 0.000us 0.000us 0 b 0 b 0 b 0 b 510
hipStreamIsCapturing 0.00% 4.352us 0.00% 4.352us 0.544us 0.000us 0.00% 0.000us 0.000us 0 b 0 b 0 b 0 b 8
hipMalloc 0.10% 1.507ms 0.10% 1.507ms 167.395us 0.000us 0.00% 0.000us 0.000us 0 b 0 b 0 b 0 b 9
hipLaunchKernel 30.79% 449.027ms 30.79% 449.027ms 10.735us 0.000us 0.00% 0.000us 0.000us 0 b 0 b 0 b 0 b 41829
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 1.225ms 0.08% 1.225ms 20.076us 0 b 0 b 0 b 0 b 61
void (anonymous namespace)::elementwise_kernel_with_... 0.00% 0.000us 0.00% 0.000us 0.000us 70.720us 0.00% 70.720us 2.210us 0 b 0 b 0 b 0 b 32
hipGetDevicePropertiesR0600 0.25% 3.697ms 0.25% 3.697ms 0.391us 0.000us 0.00% 0.000us 0.000us 0 b 0 b -2.50 Kb -2.50 Kb 9450
void at::native::(anonymous namespace)::indexSelectL... 0.00% 0.000us 0.00% 0.000us 0.000us 271.841us 0.02% 271.841us 8.769us 0 b 0 b 0 b 0 b 31
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 100.800us 0.01% 100.800us 3.252us 0 b 0 b 0 b 0 b 31
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 70.080us 0.00% 70.080us 2.261us 0 b 0 b 0 b 0 b 31
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 1.168ms 0.08% 1.168ms 5.148us 0 b 0 b 0 b 0 b 227
void at::native::elementwise_kernel<128, 4, at::nati... 0.00% 0.000us 0.00% 0.000us 0.000us 357.761us 0.02% 357.761us 11.541us 0 b 0 b 0 b 0 b 31
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 109.600us 0.01% 109.600us 3.535us 0 b 0 b 0 b 0 b 31
void at::native::elementwise_kernel<512, 1, at::nati... 0.00% 0.000us 0.00% 0.000us 0.000us 28.733ms 1.92% 28.733ms 9.458us 0 b 0 b 0 b 0 b 3038
void at::native::elementwise_kernel<128, 4, at::nati... 0.00% 0.000us 0.00% 0.000us 0.000us 600.801us 0.04% 600.801us 19.381us 0 b 0 b 0 b 0 b 31
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 165.120us 0.01% 165.120us 5.326us 0 b 0 b 0 b 0 b 31
void at::native::elementwise_kernel<512, 1, at::nati... 0.00% 0.000us 0.00% 0.000us 0.000us 142.080us 0.01% 142.080us 4.583us 0 b 0 b 0 b 0 b 31
Memcpy DtoD (Device -> Device) 0.00% 0.000us 0.00% 0.000us 0.000us 237.280us 0.02% 237.280us 3.827us 0 b 0 b 0 b 0 b 62
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 183.040us 0.01% 183.040us 5.905us 0 b 0 b 0 b 0 b 31
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 10.788ms 0.72% 10.788ms 4.972us 0 b 0 b 0 b 0 b 2170
void at::native::elementwise_kernel<512, 1, at::nati... 0.00% 0.000us 0.00% 0.000us 0.000us 11.829ms 0.79% 11.829ms 5.870us 0 b 0 b 0 b 0 b 2015
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 10.264ms 0.69% 10.264ms 5.094us 0 b 0 b 0 b 0 b 2015
void at::native::reduce_kernel<512, 1, at::native::R... 0.00% 0.000us 0.00% 0.000us 0.000us 24.453ms 1.63% 24.453ms 12.135us 0 b 0 b 0 b 0 b 2015
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 9.326ms 0.62% 9.326ms 4.629us 0 b 0 b 0 b 0 b 2015
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 11.231ms 0.75% 11.231ms 5.574us 0 b 0 b 0 b 0 b 2015
void at::native::elementwise_kernel<128, 2, at::nati... 0.00% 0.000us 0.00% 0.000us 0.000us 12.556ms 0.84% 12.556ms 6.231us 0 b 0 b 0 b 0 b 2015
void at::native::elementwise_kernel<128, 4, at::nati... 0.00% 0.000us 0.00% 0.000us 0.000us 44.723ms 2.99% 44.723ms 7.475us 0 b 0 b 0 b 0 b 5983
hipModuleLoadData 10.03% 146.253ms 10.03% 146.253ms 48.751ms 0.000us 0.00% 0.000us 0.000us 0 b 0 b 0 b 0 b 3
hipExtModuleLaunchKernel 3.10% 45.227ms 3.10% 45.227ms 4.879us 0.000us 0.00% 0.000us 0.000us 0 b 0 b 0 b 0 b 9269
Cijk_Alik_Bljk_HHS_BH_MT128x96x64_MI32x32x8x1_SE_1LD... 0.00% 0.000us 0.00% 0.000us 0.000us 98.910ms 6.61% 98.910ms 310.063us 0 b 0 b 0 b 0 b 319
void at::native::index_elementwise_kernel<128, 4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 32.377ms 2.16% 32.377ms 15.403us 0 b 0 b 0 b 0 b 2102
void at::native::elementwise_kernel<128, 4, at::nati... 0.00% 0.000us 0.00% 0.000us 0.000us 11.381ms 0.76% 11.381ms 5.736us 0 b 0 b 0 b 0 b 1984
void at::native::(anonymous namespace)::CatArrayBatc... 0.00% 0.000us 0.00% 0.000us 0.000us 22.439ms 1.50% 22.439ms 11.310us 0 b 0 b 0 b 0 b 1984
void at::native::elementwise_kernel<128, 4, at::nati... 0.00% 0.000us 0.00% 0.000us 0.000us 30.794ms 2.06% 30.794ms 10.348us 0 b 0 b 0 b 0 b 2976
void at::native::elementwise_kernel<128, 4, at::nati... 0.00% 0.000us 0.00% 0.000us 0.000us 20.668ms 1.38% 20.668ms 6.477us 0 b 0 b 0 b 0 b 3191
Cijk_Alik_Bljk_HHS_BH_MT64x16x32_MI16x16x16x1_SE_1LD... 0.00% 0.000us 0.00% 0.000us 0.000us 39.144ms 2.62% 39.144ms 39.459us 0 b 0 b 0 b 0 b 992
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 7.159ms 0.48% 7.159ms 7.216us 0 b 0 b 0 b 0 b 992
void (anonymous namespace)::softmax_warp_forward<c10... 0.00% 0.000us 0.00% 0.000us 0.000us 21.670ms 1.45% 21.670ms 21.844us 0 b 0 b 0 b 0 b 992
Cijk_Ailk_Bljk_HHS_BH_MT128x128x32_MI32x32x8x1_SN_1L... 0.00% 0.000us 0.00% 0.000us 0.000us 2.820ms 0.19% 2.820ms 88.140us 0 b 0 b 0 b 0 b 32
Cijk_Alik_Bljk_HHS_BH_MT160x256x64_MI32x32x8x1_SN_1L... 0.00% 0.000us 0.00% 0.000us 0.000us 533.199ms 35.64% 533.199ms 268.750us 0 b 0 b 0 b 0 b 1984
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 7.464ms 0.50% 7.464ms 6.507us 0 b 0 b 0 b 0 b 1147
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 7.039ms 0.47% 7.039ms 6.887us 0 b 0 b 0 b 0 b 1022
Cijk_Alik_Bljk_HHS_BH_MT128x96x64_MI32x32x8x1_SE_1LD... 0.00% 0.000us 0.00% 0.000us 0.000us 207.251ms 13.85% 207.251ms 208.923us 0 b 0 b 0 b 0 b 992
hipMemsetAsync 0.05% 678.691us 0.05% 678.691us 7.712us 0.000us 0.00% 0.000us 0.000us 0 b 0 b 512 b 512 b 88
hipOccupancyMaxPotentialBlockSize 1.57% 22.925ms 1.57% 22.925ms 764.173us 0.000us 0.00% 0.000us 0.000us 0 b 0 b 0 b 0 b 30
void at::native::(anonymous namespace)::CatArrayBatc... 0.00% 0.000us 0.00% 0.000us 0.000us 2.423ms 0.16% 2.423ms 78.158us 0 b 0 b 0 b 0 b 31
void at::native::reduce_kernel<512, 1, at::native::R... 0.00% 0.000us 0.00% 0.000us 0.000us 1.361ms 0.09% 1.361ms 23.465us 0 b 0 b 0 b 0 b 58
Memset (Device) 0.00% 0.000us 0.00% 0.000us 0.000us 477.441us 0.03% 477.441us 5.425us 0 b 0 b 0 b 0 b 88
void at::native::mbtopk::fill<unsigned int, unsigned... 0.00% 0.000us 0.00% 0.000us 0.000us 96.160us 0.01% 96.160us 3.205us 0 b 0 b 0 b 0 b 30
void at::native::mbtopk::radixFindKthValues<c10::Hal... 0.00% 0.000us 0.00% 0.000us 0.000us 711.043us 0.05% 711.043us 11.851us 0 b 0 b 0 b 0 b 60
void at::native::sbtopk::gatherTopK<c10::Half, unsig... 0.00% 0.000us 0.00% 0.000us 0.000us 1.505ms 0.10% 1.505ms 50.160us 0 b 0 b 0 b 0 b 30
void at::native::bitonicSortKVInPlace<2, -1, 16, 16,... 0.00% 0.000us 0.00% 0.000us 0.000us 501.762us 0.03% 501.762us 16.725us 0 b 0 b 0 b 0 b 30
void at::native::(anonymous namespace)::CatArrayBatc... 0.00% 0.000us 0.00% 0.000us 0.000us 254.402us 0.02% 254.402us 4.240us 0 b 0 b 0 b 0 b 60
void at::native::index_elementwise_kernel<128, 4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 478.722us 0.03% 478.722us 7.979us 0 b 0 b 0 b 0 b 60
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 101.280us 0.01% 101.280us 3.376us 0 b 0 b 0 b 0 b 30
void at::native::(anonymous namespace)::CatArrayBatc... 0.00% 0.000us 0.00% 0.000us 0.000us 204.320us 0.01% 204.320us 6.811us 0 b 0 b 0 b 0 b 30
void at::native::reduce_kernel<512, 1, at::native::R... 0.00% 0.000us 0.00% 0.000us 0.000us 529.600us 0.04% 529.600us 17.653us 0 b 0 b 0 b 0 b 30
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 97.601us 0.01% 97.601us 3.253us 0 b 0 b 0 b 0 b 30
void rocprim::detail::block_reduce_kernel<false, roc... 0.00% 0.000us 0.00% 0.000us 0.000us 115.040us 0.01% 115.040us 3.835us 0 b 0 b 0 b 0 b 30
void rocprim::detail::block_reduce_kernel<true, rocp... 0.00% 0.000us 0.00% 0.000us 0.000us 139.042us 0.01% 139.042us 4.635us 0 b 0 b 0 b 0 b 30
Memcpy DtoH (Device -> Host) 0.00% 0.000us 0.00% 0.000us 0.000us 6.417ms 0.43% 6.417ms 12.732us 0 b 0 b 0 b 0 b 504
void rocprim::detail::init_lookback_scan_state_kerne... 0.00% 0.000us 0.00% 0.000us 0.000us 235.040us 0.02% 235.040us 4.052us 0 b 0 b 0 b 0 b 58
void rocprim::detail::partition_kernel<(rocprim::det... 0.00% 0.000us 0.00% 0.000us 0.000us 595.201us 0.04% 595.201us 10.262us 0 b 0 b 0 b 0 b 58
void rocprim::detail::transform_kernel<rocprim::deta... 0.00% 0.000us 0.00% 0.000us 0.000us 254.720us 0.02% 254.720us 4.392us 0 b 0 b 0 b 0 b 58
void at::native::(anonymous namespace)::write_indice... 0.00% 0.000us 0.00% 0.000us 0.000us 316.961us 0.02% 316.961us 10.565us 0 b 0 b 0 b 0 b 30
void at::native::index_elementwise_kernel<128, 4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 578.241us 0.04% 578.241us 19.275us 0 b 0 b 0 b 0 b 30
Cijk_Alik_Bljk_HHS_BH_MT64x64x64_MI32x32x8x1_SE_1LDS... 0.00% 0.000us 0.00% 0.000us 0.000us 220.147ms 14.71% 220.147ms 55.175us 0 b 0 b 0 b 0 b 3990
Cijk_Ailk_Bljk_HHS_BH_MT64x64x32_MI32x32x8x1_SE_1LDS... 0.00% 0.000us 0.00% 0.000us 0.000us 38.735ms 2.59% 38.735ms 40.349us 0 b 0 b 0 b 0 b 960
void at::native::elementwise_kernel<128, 4, at::nati... 0.00% 0.000us 0.00% 0.000us 0.000us 1.173ms 0.08% 1.173ms 39.099us 0 b 0 b 0 b 0 b 30
void at::native::(anonymous namespace)::cunn_SoftMax... 0.00% 0.000us 0.00% 0.000us 0.000us 1.760ms 0.12% 1.760ms 58.656us 0 b 0 b 0 b 0 b 30
void at::native::_scatter_gather_elementwise_kernel<... 0.00% 0.000us 0.00% 0.000us 0.000us 167.200us 0.01% 167.200us 5.573us 0 b 0 b 0 b 0 b 30
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 565.280us 0.04% 565.280us 18.843us 0 b 0 b 0 b 0 b 30
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 766.724us 0.05% 766.724us 13.219us 0 b 0 b 0 b 0 b 58
void at::native::reduce_kernel<512, 1, at::native::R... 0.00% 0.000us 0.00% 0.000us 0.000us 894.242us 0.06% 894.242us 15.418us 0 b 0 b 0 b 0 b 58
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 278.720us 0.02% 278.720us 4.645us 0 b 0 b 0 b 0 b 60
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 277.762us 0.02% 277.762us 4.629us 0 b 0 b 0 b 0 b 60
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 156.320us 0.01% 156.320us 5.211us 0 b 0 b 0 b 0 b 30
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 151.840us 0.01% 151.840us 5.061us 0 b 0 b 0 b 0 b 30
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 163.360us 0.01% 163.360us 5.445us 0 b 0 b 0 b 0 b 30
void at::native::elementwise_kernel<512, 1, at::nati... 0.00% 0.000us 0.00% 0.000us 0.000us 151.840us 0.01% 151.840us 5.061us 0 b 0 b 0 b 0 b 30
void at::native::tensor_kernel_scan_innermost_dim<lo... 0.00% 0.000us 0.00% 0.000us 0.000us 148.480us 0.01% 148.480us 4.949us 0 b 0 b 0 b 0 b 30
void at::native::reduce_kernel<512, 1, at::native::R... 0.00% 0.000us 0.00% 0.000us 0.000us 349.440us 0.02% 349.440us 11.648us 0 b 0 b 0 b 0 b 30
void at::native::reduce_kernel<512, 1, at::native::R... 0.00% 0.000us 0.00% 0.000us 0.000us 365.280us 0.02% 365.280us 12.176us 0 b 0 b 0 b 0 b 30
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 156.800us 0.01% 156.800us 5.227us 0 b 0 b 0 b 0 b 30
hipHostMalloc 0.01% 135.427us 0.01% 135.427us 67.714us 0.000us 0.00% 0.000us 0.000us 0 b 0 b 0 b 0 b 2
void at::native::elementwise_kernel<128, 4, at::nati... 0.00% 0.000us 0.00% 0.000us 0.000us 235.841us 0.02% 235.841us 8.423us 0 b 0 b 0 b 0 b 28
void rocprim::detail::block_reduce_kernel<true, rocp... 0.00% 0.000us 0.00% 0.000us 0.000us 157.920us 0.01% 157.920us 5.640us 0 b 0 b 0 b 0 b 28
void at::native::(anonymous namespace)::CatArrayBatc... 0.00% 0.000us 0.00% 0.000us 0.000us 251.360us 0.02% 251.360us 8.379us 0 b 0 b 0 b 0 b 30
void at::native::vectorized_elementwise_kernel<2, at... 0.00% 0.000us 0.00% 0.000us 0.000us 180.000us 0.01% 180.000us 6.000us 0 b 0 b 0 b 0 b 30
void at::native::reduce_kernel<512, 1, at::native::R... 0.00% 0.000us 0.00% 0.000us 0.000us 429.440us 0.03% 429.440us 14.315us 0 b 0 b 0 b 0 b 30
void at::native::unrolled_elementwise_kernel<at::nat... 0.00% 0.000us 0.00% 0.000us 0.000us 62.720us 0.00% 62.720us 4.825us 0 b 0 b 0 b 0 b 13
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 144.160us 0.01% 144.160us 4.971us 0 b 0 b 0 b 0 b 29
void at::native::vectorized_elementwise_kernel<2, at... 0.00% 0.000us 0.00% 0.000us 0.000us 12.320us 0.00% 12.320us 6.160us 0 b 0 b 0 b 0 b 2
hipDeviceSynchronize 0.00% 1.830us 0.00% 1.830us 1.830us 0.000us 0.00% 0.000us 0.000us 0 b 0 b 0 b 0 b 1
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 1.458s
Self CUDA time total: 1.496s