Why does this cutlass_tensorop_s1688gemm_f16_64x128_64x2_tn_align4 GEMM have blocks filled with zeros? #338
-
You are running an NT (col x row) GEMM, not a TN (row x col) GEMM. As you may have noticed, the cutlass profiler swaps and transposes the operands. (The reason is that cutlass only supports row-major output, but the profiler needs column-major output to match cublas, so A x B -> C becomes B' x A' -> C'.) However, the swap/transpose of NT is still NT, and the swap/transpose of TN is still TN. To avoid confusion, simply don't do any swap/transpose yourself; let cutlass handle it (https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/gemm/device/gemm.h#L528-L533).
BTW, your reference code does not include …
-
I modified example 12 to 1) use your tile size and problem size, 2) do bias but no ReLU, and 3) dump the result. Here is the diff: …
To run it, first apply the edits above, then: …
-
The latest CUDA 11.6 fixes the -G bug.
-
After running the profiler, I found that cutlass_tensorop_s1688gemm_f16_64x128_64x2_tn_align4 is the best GEMM for my problem size. I took the parameters from the profiler-generated source, then wrote a simple test program with a cutlass::gemm::device::Gemm reproducing the parameters of cutlass_tensorop_s1688gemm_f16_64x128_64x2_tn_align4. However, the resulting matrix has blocks filled with zeros, and I would like to understand why. The image compares the expected values (left) with the actual GEMM output (right).
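For context, a device::Gemm instantiation matching that kernel name might look roughly like the following. This is a sketch reconstructed from the name only, not the actual test program from this thread: the warp shape, epilogue, swizzle, and output layout are assumptions.

```cpp
#include <cutlass/gemm/device/gemm.h>

// Sketch of cutlass_tensorop_s1688gemm_f16_64x128_64x2_tn_align4:
//   f16 operands, f32 accumulate ("s1688" = Tensor Core 16x8x8 mma, Sm75),
//   64x128 threadblock tile, k-tile 64, 2 stages, TN layouts, alignment 4.
// Warp shape, epilogue, and row-major output below are guesses.
using Gemm = cutlass::gemm::device::Gemm<
    cutlass::half_t, cutlass::layout::RowMajor,      // A: "t" (transposed in BLAS terms)
    cutlass::half_t, cutlass::layout::ColumnMajor,   // B: "n"
    float,           cutlass::layout::RowMajor,      // C/D (assumed)
    float,                                           // accumulator
    cutlass::arch::OpClassTensorOp,
    cutlass::arch::Sm75,
    cutlass::gemm::GemmShape<64, 128, 64>,           // threadblock tile
    cutlass::gemm::GemmShape<32, 64, 64>,            // warp tile (assumed)
    cutlass::gemm::GemmShape<16, 8, 8>,              // instruction ("1688")
    cutlass::epilogue::thread::LinearCombination<float, 4, float, float>,
    cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<>,
    2,                                               // stages
    4, 4>;                                           // alignment of A and B
```

If any of these template arguments differ from the profiler-generated source (in particular the layouts and alignments), the kernel can silently fall back or misread operands, which is worth ruling out when output blocks come back zero.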
Test code to create and execute the GEMM and to calculate expected values: