Question about PR1084 Support for Mixed Input TensorOp #1117

Answered by manishucsd
MARD1NO asked this question in Q&A

Data is loaded from SMEM into registers using ldmatrix. After the load, each thread owns four contiguous s8 values (s8x4) of the operand B matrix. However, to issue mma.sync on bf16 or f16, each thread needs to own two f16x2 pairs (f16x2x2), and those two f16x2 pairs are separated by 6 elements. Please see the layout diagrams for IMMA and HMMA in the GTC 2020 talk. Basically, the data is loaded as if we were going to do IMMA, and the FragmentShuffler then shuffles it so that it is ready in registers for HMMA. To see the difference, put the yellow operand B from slide 21 and slide 22 of the GTC 2020 talk side by side.
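
To make the mismatch concrete, here is a small host-side C++ sketch of which K indices of operand B one thread owns under each layout. It assumes the per-thread fragment layouts that the PTX ISA documents for `mma.sync.m16n8k32` (s8) and `mma.sync.m16n8k16` (f16), which may not be the exact shapes the answer has in mind. The function names `imma_b_row`, `hmma_b_row`, and `print_layouts` are hypothetical helpers, not CUTLASS APIs, and this is not the FragmentShuffler implementation.

```cpp
#include <cstdio>

// t = lane % 4 (the thread's index within its group of four lanes).
// i = index of the element inside the thread's fragment (0..3).
// Both functions return the K index (row of operand B) for one fixed column.

// Assumed integer layout (first register of the mma.m16n8k32 s8 B fragment):
// one 32-bit register holds four contiguous s8 values along K.
constexpr int imma_b_row(int t, int i) {
    return t * 4 + i;
}

// Assumed half layout (mma.m16n8k16 f16 B fragment):
// two f16x2 registers, the second one offset by 8 rows along K.
constexpr int hmma_b_row(int t, int i) {
    return t * 2 + (i & 1) + (i >= 2 ? 8 : 0);
}

// Between the last element of the first f16x2 pair and the first element of
// the second pair there are 6 elements owned by other threads.
static_assert(hmma_b_row(0, 2) - hmma_b_row(0, 1) - 1 == 6,
              "gap between the two f16x2 pairs");

// Print both ownership patterns for the four lanes of one group.
void print_layouts() {
    for (int t = 0; t < 4; ++t) {
        std::printf("t=%d  loaded (IMMA-style):", t);
        for (int i = 0; i < 4; ++i) std::printf(" %2d", imma_b_row(t, i));
        std::printf("   needed (HMMA-style):");
        for (int i = 0; i < 4; ++i) std::printf(" %2d", hmma_b_row(t, i));
        std::printf("\n");
    }
}
```

Calling `print_layouts()` from a `main` shows that, under these assumptions, only part of what a thread loads is what it needs for the f16 instruction. The remaining elements sit in other lanes' registers, which is why a shuffle step is required between the load and the mma.sync.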

Answer selected by MARD1NO