-
Hi, I have a use case where I want to do epilogue computation with more than one source. Specifically, in the Huggingface BERT model, there are a lot of
The problem is that the last
cc @Laurawly
-
I might try to make a new kernel that takes a vector of
-
We welcome more activation functions, too.
-
You can take a look at this unit test and its testbed. This special fused kernel can do
-
You can enhance the testbed like below to test all the features:
-
You also need to fix a bug like this:
-
The above diff is merged in #383.
-
UPDATE: It looks promising, but in `cutlass/include/cutlass/epilogue/thread/linear_combination_bias_elementwise.h` (lines 203 to 205 at ec4f7e5), I need `V` to be the same size as the product `AB` (`V` is the other operand for the elementwise add in the residual block) and `tmp_C` to be the per-channel bias. But `V` is referred to as a "broadcast vector" throughout the codebase.
Moreover, I need to apply an activation functor to the result of
-
Can you show me what you want, in a similar way to the code you quoted above? I think you need to write your own elementwise functor. This file is the same as what cuBLAS specifies in https://docs.nvidia.com/cuda/cublas/index.html, and we cannot change it.
-
`A x B`, and I can apply an optional activation functor (which can be the identity) to the result of the per-channel bias addition.
Using the sigmoid activation as an example, this is what I want. This will cover all the fusion possibilities in the models tested in my PR apache/tvm#9746.
The first step is to make the following diff work. It currently fails the validation check against the reference.
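A minimal sketch of the desired per-element computation, written as plain C++ with illustrative names (`accum`, `bias`, `residual`), not an actual CUTLASS functor:

```cpp
#include <cmath>

// Desired epilogue, per output element: apply the per-channel bias, then an
// optional activation (sigmoid here; could be the identity), then add the
// residual input V, which has the same shape as the product A x B.
float residual_block_epilogue(float accum, float bias, float residual) {
  float z = accum + bias;                          // per-channel bias addition
  float activated = 1.0f / (1.0f + std::exp(-z));  // sigmoid activation
  return activated + residual;                     // elementwise residual add
}
```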
-
I didn't know that cuBLAS has a concept of "bias". Maybe the impedance mismatch here is that I'm trying to abuse an API modeled after the BLAS API for fusing residual blocks in deep learning models :)
-
I am a bit lost about what you need. Do you need a
Forget about
-
Ok, here it is:
`cutlass/examples/17_fprop_per_channel_bias/fprop_per_channel_bias.cu` (lines 186 to 188 at ec4f7e5)
I think if we can swap the role of
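A paraphrase of the relevant part of example 17, using names from that example (so this snippet is not self-contained); the key trick is passing the bias where the `C` operand normally goes, with stride 0:

```cpp
// From examples/17_fprop_per_channel_bias (paraphrased): the 1 x N per-channel
// bias is passed as the source operand, with its stride set to 0 so every
// output row reads the same bias vector.
typename ImplicitGemm::Arguments arguments{
  problem_size,
  tensor_a.device_ref(),
  tensor_b.device_ref(),
  {tensor_c_bias.device_data(), LayoutOutput::Stride(0)},  // stride-0 broadcast
  tensor_d.device_ref(),
  {ElementComputeEpilogue(1)}  // alpha only
};
```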
-
Yes. You can add a template parameter to control the add
-
Thanks @hwu36, I got everything I wanted working. My change is at https://github.com/NVIDIA/cutlass/compare/master...masahi:epilogue-fusion-residual-block?expand=1
I need to fix the test for the split_k mode, but other than that, all tests pass. Next week I'll try to integrate this into TVM to fuse residual blocks optimally.
-
OK, just got e2e residual block fusion in ResNet-50 working. The performance improved from 3.16 msec in apache/tvm#9746 to 2.76 msec! It's very close to the TRT result (2.53 msec). cc @Laurawly. I'll do more benchmarking on other models and open a PR to add a new epilogue functor specialized for residual blocks.
-
@hwu36 I've encountered a slightly different input shape config like this, and I'm wondering if this can be supported by
The only difference is the shape of
I thought we could do this via the stride
Is there anything wrong with this?
-
Is this a typo? It is still
Should it be
Should `V`'s index be
I am a bit lost when reading your code, maybe due to the potential typos.
-
Sorry, yes, there is one typo. But
Corrected:
-
Okay. I think you should first get the first bias tensor working as expected. We can talk about the 2nd bias tensor, the one which is per-batch, per-channel, after this, because it requires a CUDA source code change.
-
Moreover,
-
I edited the code in the above 2 replies.
-
Yes, I already use
I've also got my original use case for multiple-source fusion,
Are you suggesting that I try something else? Because I think I'm already at the "first bias tensor is working as expected" stage and ready for
But if the per-batch bias cannot be supported by the
-
Give me some time to think about it. I will respond tomorrow.
-
Looking at lines such as
I assume CUTLASS only looks at the first element of a stride array. That makes sense, because it corresponds to the row stride in a 2D GEMM matrix. So the only possible broadcasting op is along the inner-most dimension, and per-batch broadcast using the stride
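A small illustration of this reading, with made-up names rather than actual CUTLASS source:

```cpp
#include <cstdint>

// If only stride[0] (the row stride) participates in the offset computation,
// then a stride of 0 collapses all rows onto one: a per-channel broadcast.
int64_t offset(int row, int col, int64_t row_stride) {
  return static_cast<int64_t>(row) * row_stride + col;
}
// row_stride == 0  =>  offset(row, col) == col for every row, so each output
// element (row, col) reads bias[col]. A per-batch broadcast would instead need
// the column index to be ignored, which a single row stride cannot express.
```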
-
You are correct. We need to make changes to the CUDA source code to let
First, set `Stride` to all 0s, like in https://github.com/NVIDIA/cutlass/tree/master/examples/17_fprop_per_channel_bias
The exact row number is calculated here, which is
to compute
to
I think it should work.
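A hypothetical sketch of the idea, with illustrative names rather than actual CUTLASS source: with all strides at 0, derive the bias row from the batch index instead of the GEMM output row.

```cpp
// Hypothetical illustration: for a batched problem flattened into
// (batch_count * M) output rows, a per-batch, per-channel bias of shape
// (batch_count, N) can be addressed by recomputing which bias row to load
// from the global output row.
int bias_row(int global_row, int rows_per_batch) {
  return global_row / rows_per_batch;  // batch index = row of the (batch, N) bias
}
```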
-
Thanks @hwu36! The following diff worked:
-
Thank you very much @masahi for working on integrating CUTLASS into TVM. We are very excited to see this moving forward.
As to your question, I assume `gamma` is a scalar constant and `C` and `D` have the same layout and data type. You need to make source code changes to CUTLASS to achieve this, basically replicating what we have for `beta x C`.
If you use `device::GEMM`, here are the things you need:
- a `TensorRef` for your new source here: https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/gemm/device/gemm.h#L280
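A minimal sketch of what such a change might look like, assuming concrete float/RowMajor types; `ref_C2` and `gamma` are illustrative names, not the actual upstream API:

```cpp
#include "cutlass/gemm_coord.h"
#include "cutlass/layout/matrix.h"
#include "cutlass/tensor_ref.h"

// Hypothetical extension of device::Gemm's Arguments: a second epilogue
// source ref_C2, mirroring how beta x C is already plumbed through, so the
// epilogue could compute D = alpha * AB + beta * C + gamma * C2.
struct Arguments {
  cutlass::gemm::GemmCoord problem_size;
  cutlass::TensorRef<float const, cutlass::layout::RowMajor> ref_A;
  cutlass::TensorRef<float const, cutlass::layout::RowMajor> ref_B;
  cutlass::TensorRef<float const, cutlass::layout::RowMajor> ref_C;
  cutlass::TensorRef<float const, cutlass::layout::RowMajor> ref_C2;  // new source
  cutlass::TensorRef<float, cutlass::layout::RowMajor> ref_D;
  float alpha;
  float beta;
  float gamma;  // scalar for the new source
};
```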