-
Hi, I have a use case where I want to do epilogue computation with more than one source. Specifically, in the Huggingface BERT model, there are a lot of
The problem is that the last
cc @Laurawly
-
I might try to make a new kernel that takes a vector of
-
We welcome more activation functions, too.
-
You can take a look at this unit test and its testbed. This special fused kernel can do
-
You can enhance the testbed like below to test all the features:
-
You also need to fix a bug like this:
-
The above diff is merged in #383.
-
UPDATE: It looks promising, but in `cutlass/include/cutlass/epilogue/thread/linear_combination_bias_elementwise.h` (lines 203 to 205 at ec4f7e5), I need `V` to be the same size as the product `AB` (`V` is the other operand for the elementwise add in the residual block) and `tmp_C` to be the per-channel bias. But `V` is referred to as a "broadcast vector" throughout the codebase.
Moreover, I need to apply an activation functor to the result of
-
Can you show me what you want, in a similar way to the code you quoted above? I think you need to write your own elementwise functor. This file is the same as what cuBLAS specifies in https://docs.nvidia.com/cuda/cublas/index.html, and we cannot change it.
-
`A x B`, and I can apply an optional activation functor (which can be the identity) to the result of the per-channel bias addition.
Using the sigmoid activation as an example, this is what I want. This will cover all the fusion possibilities in the models tested in my PR apache/tvm#9746.
The first step is to make the following diff work. It currently fails the validation check against the reference.
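A minimal sketch of the desired per-element computation, written as plain C++ with illustrative names (`accum`, `bias`, `residual`), not an actual CUTLASS functor:

```cpp
#include <cmath>

// Desired epilogue, per output element: apply the per-channel bias, then an
// optional activation (sigmoid here; could be the identity), then add the
// residual input V, which has the same shape as the product A x B.
float residual_block_epilogue(float accum, float bias, float residual) {
  float z = accum + bias;                          // per-channel bias addition
  float activated = 1.0f / (1.0f + std::exp(-z));  // sigmoid activation
  return activated + residual;                     // elementwise residual add
}
```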
-
I didn't know that cuBLAS has a concept of "bias". Maybe the impedance mismatch here is that I'm trying to abuse an API modeled after the BLAS API for fusing residual blocks in deep learning models :)
-
I am a bit lost about what you need. Do you need a
Forget about
-
Ok, here it is:
`cutlass/examples/17_fprop_per_channel_bias/fprop_per_channel_bias.cu` (lines 186 to 188 at ec4f7e5)
I think if we can swap the role of
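A paraphrase of the relevant part of example 17, using names from that example (so this snippet is not self-contained); the key trick is passing the bias where the `C` operand normally goes, with stride 0:

```cpp
// From examples/17_fprop_per_channel_bias (paraphrased): the 1 x N per-channel
// bias is passed as the source operand, with its stride set to 0 so every
// output row reads the same bias vector.
typename ImplicitGemm::Arguments arguments{
  problem_size,
  tensor_a.device_ref(),
  tensor_b.device_ref(),
  {tensor_c_bias.device_data(), LayoutOutput::Stride(0)},  // stride-0 broadcast
  tensor_d.device_ref(),
  {ElementComputeEpilogue(1)}  // alpha only
};
```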
-
Yes. You can add a template parameter to control the add
-
Thanks @hwu36, I got everything I wanted working. My change is at https://github.com/NVIDIA/cutlass/compare/master...masahi:epilogue-fusion-residual-block?expand=1
I need to fix the test for the split_k mode, but other than that, all tests pass. Next week I'll try to integrate this into TVM to fuse residual blocks optimally.
-
OK, just got e2e residual block fusion in ResNet-50 working. The performance improved from 3.16 msec in apache/tvm#9746 to 2.76 msec! It's very close to the TRT result (2.53 msec). cc @Laurawly. I'll do more benchmarking on other models and open a PR to add a new epilogue functor specialized for residual blocks.
-
@hwu36 I've encountered a slightly different input shape config like this, and I'm wondering if this can be supported by
The only difference is the shape of
I thought we could do this via the stride
Is there anything wrong with this?
-
Is this a typo? It is still
Should it be
Should `V`'s index be
I am a bit lost when reading your code, maybe due to the potential typos.
-
Sorry, yes, there is one typo. But
Corrected:
-
Okay. I think you should first get the first bias tensor working as expected. We can talk about the 2nd bias tensor, the one which is per-batch, per-channel, after this, because it requires a CUDA source code change.
-
Moreover,
-
I edited the code in the above 2 replies.
-
Yes, I already use
I've also got my original use case for multiple-source fusion,
Are you suggesting that I try something else? Because I think I'm already at the "first bias tensor is working as expected" stage and ready for
But if the per-batch bias cannot be supported by the
-
Give me some time to think about it. I will respond tomorrow.
-
Looking at lines such as
I assume CUTLASS only looks at the first element of a stride array. That makes sense, because it corresponds to the row stride in a 2D GEMM matrix. So the only possible broadcasting op is along the inner-most dimension, and per-batch broadcast using the stride
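A small illustration of this reading, with made-up names rather than actual CUTLASS source:

```cpp
#include <cstdint>

// If only stride[0] (the row stride) participates in the offset computation,
// then a stride of 0 collapses all rows onto one: a per-channel broadcast.
int64_t offset(int row, int col, int64_t row_stride) {
  return static_cast<int64_t>(row) * row_stride + col;
}
// row_stride == 0  =>  offset(row, col) == col for every row, so each output
// element (row, col) reads bias[col]. A per-batch broadcast would instead need
// the column index to be ignored, which a single row stride cannot express.
```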
-
You are correct. We need to make changes to the CUDA source code to let
First, set `Stride` to all 0s, like in https://github.com/NVIDIA/cutlass/tree/master/examples/17_fprop_per_channel_bias
The exact row number is calculated here, which is
to compute
to
I think it should work.
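A hypothetical sketch of the idea, with illustrative names rather than actual CUTLASS source: with all strides at 0, derive the bias row from the batch index instead of the GEMM output row.

```cpp
// Hypothetical illustration: for a batched problem flattened into
// (batch_count * M) output rows, a per-batch, per-channel bias of shape
// (batch_count, N) can be addressed by recomputing which bias row to load
// from the global output row.
int bias_row(int global_row, int rows_per_batch) {
  return global_row / rows_per_batch;  // batch index = row of the (batch, N) bias
}
```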
-
Thanks @hwu36! The following diff worked:
-
Thank you very much @masahi for working on integrating CUTLASS into TVM. We are very excited to see this moving forward.
As to your question, I assume `gamma` is a scalar constant and `C` and `D` have the same layout and data type. You need to make source code changes to CUTLASS to achieve this, basically replicating what we have for `beta x C`.
If you use `device::GEMM`, here are the things you need:
- a `TensorRef` for your new source here: https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/gemm/device/gemm.h#L280
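A minimal sketch of what such a change might look like, assuming concrete float/RowMajor types; `ref_C2` and `gamma` are illustrative names, not the actual upstream API:

```cpp
#include "cutlass/gemm_coord.h"
#include "cutlass/layout/matrix.h"
#include "cutlass/tensor_ref.h"

// Hypothetical extension of device::Gemm's Arguments: a second epilogue
// source ref_C2, mirroring how beta x C is already plumbed through, so the
// epilogue could compute D = alpha * AB + beta * C + gamma * C2.
struct Arguments {
  cutlass::gemm::GemmCoord problem_size;
  cutlass::TensorRef<float const, cutlass::layout::RowMajor> ref_A;
  cutlass::TensorRef<float const, cutlass::layout::RowMajor> ref_B;
  cutlass::TensorRef<float const, cutlass::layout::RowMajor> ref_C;
  cutlass::TensorRef<float const, cutlass::layout::RowMajor> ref_C2;  // new source
  cutlass::TensorRef<float, cutlass::layout::RowMajor> ref_D;
  float alpha;
  float beta;
  float gamma;  // scalar for the new source
};
```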