Question about PR1084 Support for Mixed Input TensorOp #1117
`FragmentShuffler` shuffles registers for the int8 dtype, and I'm not very clear about the whole process. Can someone explain it?
Replies: 2 comments 1 reply
Data is loaded using `ldmatrix` from SMEM into registers. Each thread owns `s8x4` contiguous data in the operandB matrix. However, to issue `mma.sync` on `bf16` or `f16`, each thread needs to own `f16x2x2`. The two `f16x2` pairs (`f16x2 .... f16x2`) are separated by 6 elements. Please see the layout diagrams from the GTC 2020 talk for `IMMA` and `HMMA`. Basically, the data is loaded as if we were going to do `IMMA`, but the `FragmentShuffler` shuffles it so that it is ready in registers for `HMMA`. To see this, put the yellow operandB from slide 21 and slide 22 of the GTC 2020 talk side-by-side.