From 6df5a9b0d680d95dab91f4eb5ae68b99ead0ff7e Mon Sep 17 00:00:00 2001
From: yuudiiii <162973048+yuudiiii@users.noreply.github.com>
Date: Wed, 11 Dec 2024 18:07:35 +0800
Subject: [PATCH 01/19] Update 01-vector-addition.md
---
docs/01-getting-started/tutorials/01-vector-addition.md | 2 ++
1 file changed, 2 insertions(+)
diff --git a/docs/01-getting-started/tutorials/01-vector-addition.md b/docs/01-getting-started/tutorials/01-vector-addition.md
index 079b6da..dda2c27 100644
--- a/docs/01-getting-started/tutorials/01-vector-addition.md
+++ b/docs/01-getting-started/tutorials/01-vector-addition.md
@@ -2,6 +2,8 @@
title: 向量相加
---
+[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/YSztKYdMWSL)
+
在本教程中,你将使用 Triton 编写一个简单的向量相加 (vector addition) 程序。
你将了解:
From 2cfb112b87e03be2d29f83c008be58190805b1b3 Mon Sep 17 00:00:00 2001
From: yuudiiii <162973048+yuudiiii@users.noreply.github.com>
Date: Wed, 11 Dec 2024 18:12:26 +0800
Subject: [PATCH 02/19] Update 02-fused-softmax.md
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Added a link to run the tutorial online
---
docs/01-getting-started/tutorials/02-fused-softmax.md | 2 ++
1 file changed, 2 insertions(+)
diff --git a/docs/01-getting-started/tutorials/02-fused-softmax.md b/docs/01-getting-started/tutorials/02-fused-softmax.md
index 371c0b4..e902385 100644
--- a/docs/01-getting-started/tutorials/02-fused-softmax.md
+++ b/docs/01-getting-started/tutorials/02-fused-softmax.md
@@ -2,6 +2,8 @@
title: 融合 Softmax (Fused Softmax)
---
+[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/QEhTxGYyzqY)
+
在本教程中,您将编写一个融合的 softmax 操作,该操作在某些类别的矩阵上比 PyTorch 的原生操作快得多:即那些可以适应 GPU 静态随机存取存储器 (SRAM) 的行。
From 6ebfe44a857b47049f7595ae32c505fa6cf616f3 Mon Sep 17 00:00:00 2001
From: yuudiiii <162973048+yuudiiii@users.noreply.github.com>
Date: Wed, 11 Dec 2024 18:12:46 +0800
Subject: [PATCH 03/19] Update 01-vector-addition.md
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Added a link to run the tutorial online
---
docs/01-getting-started/tutorials/01-vector-addition.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/docs/01-getting-started/tutorials/01-vector-addition.md b/docs/01-getting-started/tutorials/01-vector-addition.md
index dda2c27..536793d 100644
--- a/docs/01-getting-started/tutorials/01-vector-addition.md
+++ b/docs/01-getting-started/tutorials/01-vector-addition.md
@@ -2,7 +2,7 @@
title: 向量相加
---
-[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/YSztKYdMWSL)
+[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/YSztKYdMWSL)
在本教程中,你将使用 Triton 编写一个简单的向量相加 (vector addition) 程序。
From 3b0f1c789b28752b6608eac9ac7b182df1a2d782 Mon Sep 17 00:00:00 2001
From: yuudiiii <162973048+yuudiiii@users.noreply.github.com>
Date: Wed, 11 Dec 2024 18:13:44 +0800
Subject: [PATCH 04/19] Update 03-matrix-multiplication.md
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Added a link to run the tutorial online
---
docs/01-getting-started/tutorials/03-matrix-multiplication.md | 2 ++
1 file changed, 2 insertions(+)
diff --git a/docs/01-getting-started/tutorials/03-matrix-multiplication.md b/docs/01-getting-started/tutorials/03-matrix-multiplication.md
index 9593376..72aa6a7 100644
--- a/docs/01-getting-started/tutorials/03-matrix-multiplication.md
+++ b/docs/01-getting-started/tutorials/03-matrix-multiplication.md
@@ -2,6 +2,8 @@
title: 矩阵乘法
---
+[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/dheUrOfGo5m)
+
在本教程中,您将编写一个非常简短的高性能 FP16 矩阵乘法内核,其性能可以与 cuBLAS 或 rocBLAS 相媲美。
From cae2c1e9b9796ad9a886d2daae2a71f8dea1f491 Mon Sep 17 00:00:00 2001
From: yuudiiii <162973048+yuudiiii@users.noreply.github.com>
Date: Wed, 11 Dec 2024 18:18:33 +0800
Subject: [PATCH 05/19] Update 04-low-memory-dropout.md
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Added a link to run the tutorial online
---
docs/01-getting-started/tutorials/04-low-memory-dropout.md | 2 ++
1 file changed, 2 insertions(+)
diff --git a/docs/01-getting-started/tutorials/04-low-memory-dropout.md b/docs/01-getting-started/tutorials/04-low-memory-dropout.md
index e93c57a..41b73d5 100644
--- a/docs/01-getting-started/tutorials/04-low-memory-dropout.md
+++ b/docs/01-getting-started/tutorials/04-low-memory-dropout.md
@@ -2,6 +2,8 @@
title: 低内存 Dropout
---
+[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/mkRMwoRH87l)
+
在本教程中,您将编写一个内存高效的 Dropout 实现,其状态将由单个 int32 seed 组成。这与传统 Dropout 实现不同,传统实现通常由与输入 shape 相同的位掩码张量组成。
From 9974f675ffb760ae55125b05ffab5d2925055710 Mon Sep 17 00:00:00 2001
From: yuudiiii <162973048+yuudiiii@users.noreply.github.com>
Date: Wed, 11 Dec 2024 18:21:02 +0800
Subject: [PATCH 06/19] Update 05-layer-normalization.md
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Added a link to the online tutorial
---
docs/01-getting-started/tutorials/05-layer-normalization.md | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/docs/01-getting-started/tutorials/05-layer-normalization.md b/docs/01-getting-started/tutorials/05-layer-normalization.md
index eff47d9..64b6c17 100644
--- a/docs/01-getting-started/tutorials/05-layer-normalization.md
+++ b/docs/01-getting-started/tutorials/05-layer-normalization.md
@@ -2,12 +2,12 @@
title: 层标准化
---
-在本教程中,你将编写一个比 PyTorch 实现运行更快的高性能层标准化 (layer normalization) 内核。
+[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/EC3Euf81ZW2)
+在本教程中,你将编写一个比 PyTorch 实现运行更快的高性能层标准化 (layer normalization) 内核。
在此过程中,你将了解:
-
* 在 Triton 中实现反向传播 (backward pass)。
* 在 Triton 中实现并行归约 (parallel reduction)。
From 4c9235c5f6a8d22d6ccab3bdbe1d203ecb1a554a Mon Sep 17 00:00:00 2001
From: yuudiiii <162973048+yuudiiii@users.noreply.github.com>
Date: Wed, 11 Dec 2024 18:22:41 +0800
Subject: [PATCH 07/19] Update 06-fused-attention.md
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Added a link for running the tutorial
---
docs/01-getting-started/tutorials/06-fused-attention.md | 2 ++
1 file changed, 2 insertions(+)
diff --git a/docs/01-getting-started/tutorials/06-fused-attention.md b/docs/01-getting-started/tutorials/06-fused-attention.md
index d9f7f58..1c5a761 100644
--- a/docs/01-getting-started/tutorials/06-fused-attention.md
+++ b/docs/01-getting-started/tutorials/06-fused-attention.md
@@ -2,6 +2,8 @@
title: 融合注意力 (Fused Attention)
---
+[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/om2XKloXGTB)
+
这是根据 [Tri Dao 的 Flash Attention v2 算法](https://tridao.me/publications/flash2/flash2.pdf)的 Triton 实现。致谢:OpenAI 核心团队
From c5ed9815239ad12e06049ee7f2cbdb52b591e3fa Mon Sep 17 00:00:00 2001
From: yuudiiii <162973048+yuudiiii@users.noreply.github.com>
Date: Wed, 11 Dec 2024 18:23:48 +0800
Subject: [PATCH 08/19] Update 07-libdevice-tl.extra.libdevice-function.md
---
.../tutorials/07-libdevice-tl.extra.libdevice-function.md | 3 +++
1 file changed, 3 insertions(+)
diff --git a/docs/01-getting-started/tutorials/07-libdevice-tl.extra.libdevice-function.md b/docs/01-getting-started/tutorials/07-libdevice-tl.extra.libdevice-function.md
index dc3288e..04fcf35 100644
--- a/docs/01-getting-started/tutorials/07-libdevice-tl.extra.libdevice-function.md
+++ b/docs/01-getting-started/tutorials/07-libdevice-tl.extra.libdevice-function.md
@@ -1,6 +1,9 @@
---
title: Libdevice (tl_extra.libdevice) 函数
---
+
+[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/RFagQOhvTsc)
+
Triton 可以调用外部库中的自定义函数。在这个例子中,我们将使用 libdevice 库在张量上应用 asin 函数。请参考以下链接获取关于所有可用 libdevice 函数语义的详细信息:
* CUDA:https://docs.nvidia.com/cuda/libdevice-users-guide/index.html
From ca310a123bfb52f9e898bce6745f9c898cac4ef7 Mon Sep 17 00:00:00 2001
From: yuudiiii <162973048+yuudiiii@users.noreply.github.com>
Date: Wed, 11 Dec 2024 18:24:48 +0800
Subject: [PATCH 09/19] Update 08-group-gemm.md
---
docs/01-getting-started/tutorials/08-group-gemm.md | 2 ++
1 file changed, 2 insertions(+)
diff --git a/docs/01-getting-started/tutorials/08-group-gemm.md b/docs/01-getting-started/tutorials/08-group-gemm.md
index 781fe0e..8e1f476 100644
--- a/docs/01-getting-started/tutorials/08-group-gemm.md
+++ b/docs/01-getting-started/tutorials/08-group-gemm.md
@@ -2,6 +2,8 @@
title: 分组 GEMM
---
+[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/HTr2JbfRjsl)
+
分组 GEMM 内核通过启动固定数量的 CTA 来计算一组 gemms。调度是静态的,并且在设备上完成。

From e5f7c5cd6b8703357cd54fde0b8e1bf246060b46 Mon Sep 17 00:00:00 2001
From: yuudiiii <162973048+yuudiiii@users.noreply.github.com>
Date: Wed, 11 Dec 2024 18:25:42 +0800
Subject: [PATCH 10/19] Update 09-persistent-matmul.md
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Added a tutorial link
---
docs/01-getting-started/tutorials/09-persistent-matmul.md | 2 ++
1 file changed, 2 insertions(+)
diff --git a/docs/01-getting-started/tutorials/09-persistent-matmul.md b/docs/01-getting-started/tutorials/09-persistent-matmul.md
index b51de9f..6438cb0 100644
--- a/docs/01-getting-started/tutorials/09-persistent-matmul.md
+++ b/docs/01-getting-started/tutorials/09-persistent-matmul.md
@@ -2,6 +2,8 @@
title: 持久矩阵乘法 (Persistent Matmul)
---
+[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/HMjXImmXZFV)
+
该脚本展示了使用 Triton 进行矩阵乘法的持久化内核实现 (persistent kernel implementations)。包含多种矩阵乘法方法,例如基础的朴素方法 (naive)、持久化方法 (persistent) 以及基于张量内存加速器(TMA,Tensor Memory Accelerator)的方法。这些内核同时支持半精度浮点数(FP16)和 8 位浮点数(FP8)数据类型,但 FP8 的实现仅在计算能力大于等于 9.0 的 CUDA 设备上可用。
Triton 与 cuBLAS 的具体实现将会在多种各异的配置情形下开展基准测试工作,并通过质子分析器 (proton profiler) 进行评估。使用者可以通过命令行参数灵活指定矩阵的维度和迭代步骤。
From ebc932968277765281ead409bf242cced339c096 Mon Sep 17 00:00:00 2001
From: yuudiiii <162973048+yuudiiii@users.noreply.github.com>
Date: Wed, 11 Dec 2024 19:11:40 +0800
Subject: [PATCH 11/19] Update
docs/01-getting-started/tutorials/01-vector-addition.md
Co-authored-by: sparanoid
---
docs/01-getting-started/tutorials/01-vector-addition.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/docs/01-getting-started/tutorials/01-vector-addition.md b/docs/01-getting-started/tutorials/01-vector-addition.md
index 536793d..17d55b8 100644
--- a/docs/01-getting-started/tutorials/01-vector-addition.md
+++ b/docs/01-getting-started/tutorials/01-vector-addition.md
@@ -2,7 +2,7 @@
title: 向量相加
---
-[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/YSztKYdMWSL)
+[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/YSztKYdMWSL)
在本教程中,你将使用 Triton 编写一个简单的向量相加 (vector addition) 程序。
From b529a2eff6018492f094a9f1b9f6de80c4aea9d0 Mon Sep 17 00:00:00 2001
From: yuudiiii <162973048+yuudiiii@users.noreply.github.com>
Date: Wed, 11 Dec 2024 19:11:46 +0800
Subject: [PATCH 12/19] Update
docs/01-getting-started/tutorials/02-fused-softmax.md
Co-authored-by: sparanoid
---
docs/01-getting-started/tutorials/02-fused-softmax.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/docs/01-getting-started/tutorials/02-fused-softmax.md b/docs/01-getting-started/tutorials/02-fused-softmax.md
index e902385..1c89a99 100644
--- a/docs/01-getting-started/tutorials/02-fused-softmax.md
+++ b/docs/01-getting-started/tutorials/02-fused-softmax.md
@@ -2,7 +2,7 @@
title: 融合 Softmax (Fused Softmax)
---
-[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/QEhTxGYyzqY)
+[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/QEhTxGYyzqY)
在本教程中,您将编写一个融合的 softmax 操作,该操作在某些类别的矩阵上比 PyTorch 的原生操作快得多:即那些可以适应 GPU 静态随机存取存储器 (SRAM) 的行。
From d0263358d9bbf0362ee3d9836ff5b094924356ba Mon Sep 17 00:00:00 2001
From: yuudiiii <162973048+yuudiiii@users.noreply.github.com>
Date: Wed, 11 Dec 2024 19:11:52 +0800
Subject: [PATCH 13/19] Update
docs/01-getting-started/tutorials/03-matrix-multiplication.md
Co-authored-by: sparanoid
---
docs/01-getting-started/tutorials/03-matrix-multiplication.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/docs/01-getting-started/tutorials/03-matrix-multiplication.md b/docs/01-getting-started/tutorials/03-matrix-multiplication.md
index 72aa6a7..e769bd4 100644
--- a/docs/01-getting-started/tutorials/03-matrix-multiplication.md
+++ b/docs/01-getting-started/tutorials/03-matrix-multiplication.md
@@ -2,7 +2,7 @@
title: 矩阵乘法
---
-[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/dheUrOfGo5m)
+[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/dheUrOfGo5m)
在本教程中,您将编写一个非常简短的高性能 FP16 矩阵乘法内核,其性能可以与 cuBLAS 或 rocBLAS 相媲美。
From d0a9ca93cb2265d7dce8b4fc41d98ca1cb8c3f4b Mon Sep 17 00:00:00 2001
From: yuudiiii <162973048+yuudiiii@users.noreply.github.com>
Date: Wed, 11 Dec 2024 19:11:58 +0800
Subject: [PATCH 14/19] Update
docs/01-getting-started/tutorials/04-low-memory-dropout.md
Co-authored-by: sparanoid
---
docs/01-getting-started/tutorials/04-low-memory-dropout.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/docs/01-getting-started/tutorials/04-low-memory-dropout.md b/docs/01-getting-started/tutorials/04-low-memory-dropout.md
index 41b73d5..80e66b3 100644
--- a/docs/01-getting-started/tutorials/04-low-memory-dropout.md
+++ b/docs/01-getting-started/tutorials/04-low-memory-dropout.md
@@ -2,7 +2,7 @@
title: 低内存 Dropout
---
-[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/mkRMwoRH87l)
+[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/mkRMwoRH87l)
在本教程中,您将编写一个内存高效的 Dropout 实现,其状态将由单个 int32 seed 组成。这与传统 Dropout 实现不同,传统实现通常由与输入 shape 相同的位掩码张量组成。
From ab6d70220cce13f837bfa4ca128fc83205c3199a Mon Sep 17 00:00:00 2001
From: yuudiiii <162973048+yuudiiii@users.noreply.github.com>
Date: Wed, 11 Dec 2024 19:12:04 +0800
Subject: [PATCH 15/19] Update
docs/01-getting-started/tutorials/05-layer-normalization.md
Co-authored-by: sparanoid
---
docs/01-getting-started/tutorials/05-layer-normalization.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/docs/01-getting-started/tutorials/05-layer-normalization.md b/docs/01-getting-started/tutorials/05-layer-normalization.md
index 64b6c17..d06452e 100644
--- a/docs/01-getting-started/tutorials/05-layer-normalization.md
+++ b/docs/01-getting-started/tutorials/05-layer-normalization.md
@@ -2,7 +2,7 @@
title: 层标准化
---
-[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/EC3Euf81ZW2)
+[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/EC3Euf81ZW2)
在本教程中,你将编写一个比 PyTorch 实现运行更快的高性能层标准化 (layer normalization) 内核。
From fb9b3c192eb582245cde0e97da395829e7146fa4 Mon Sep 17 00:00:00 2001
From: yuudiiii <162973048+yuudiiii@users.noreply.github.com>
Date: Wed, 11 Dec 2024 19:12:10 +0800
Subject: [PATCH 16/19] Update
docs/01-getting-started/tutorials/06-fused-attention.md
Co-authored-by: sparanoid
---
docs/01-getting-started/tutorials/06-fused-attention.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/docs/01-getting-started/tutorials/06-fused-attention.md b/docs/01-getting-started/tutorials/06-fused-attention.md
index 1c5a761..8b7d787 100644
--- a/docs/01-getting-started/tutorials/06-fused-attention.md
+++ b/docs/01-getting-started/tutorials/06-fused-attention.md
@@ -2,7 +2,7 @@
title: 融合注意力 (Fused Attention)
---
-[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/om2XKloXGTB)
+[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/om2XKloXGTB)
这是根据 [Tri Dao 的 Flash Attention v2 算法](https://tridao.me/publications/flash2/flash2.pdf)的 Triton 实现。致谢:OpenAI 核心团队
From f627abcb2d260d4736dbd3b4e3ee470fc2bd684b Mon Sep 17 00:00:00 2001
From: yuudiiii <162973048+yuudiiii@users.noreply.github.com>
Date: Wed, 11 Dec 2024 19:12:16 +0800
Subject: [PATCH 17/19] Update
docs/01-getting-started/tutorials/07-libdevice-tl.extra.libdevice-function.md
Co-authored-by: sparanoid
---
.../tutorials/07-libdevice-tl.extra.libdevice-function.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/docs/01-getting-started/tutorials/07-libdevice-tl.extra.libdevice-function.md b/docs/01-getting-started/tutorials/07-libdevice-tl.extra.libdevice-function.md
index 04fcf35..27a8555 100644
--- a/docs/01-getting-started/tutorials/07-libdevice-tl.extra.libdevice-function.md
+++ b/docs/01-getting-started/tutorials/07-libdevice-tl.extra.libdevice-function.md
@@ -2,7 +2,7 @@
title: Libdevice (tl_extra.libdevice) 函数
---
-[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/RFagQOhvTsc)
+[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/RFagQOhvTsc)
Triton 可以调用外部库中的自定义函数。在这个例子中,我们将使用 libdevice 库在张量上应用 asin 函数。请参考以下链接获取关于所有可用 libdevice 函数语义的详细信息:
From ef6dd3ab51e92c0553a1a81c8b9cde5a90e00a2b Mon Sep 17 00:00:00 2001
From: yuudiiii <162973048+yuudiiii@users.noreply.github.com>
Date: Wed, 11 Dec 2024 19:12:21 +0800
Subject: [PATCH 18/19] Update
docs/01-getting-started/tutorials/09-persistent-matmul.md
Co-authored-by: sparanoid
---
docs/01-getting-started/tutorials/09-persistent-matmul.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/docs/01-getting-started/tutorials/09-persistent-matmul.md b/docs/01-getting-started/tutorials/09-persistent-matmul.md
index 6438cb0..8924784 100644
--- a/docs/01-getting-started/tutorials/09-persistent-matmul.md
+++ b/docs/01-getting-started/tutorials/09-persistent-matmul.md
@@ -2,7 +2,7 @@
title: 持久矩阵乘法 (Persistent Matmul)
---
-[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/HMjXImmXZFV)
+[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/HMjXImmXZFV)
该脚本展示了使用 Triton 进行矩阵乘法的持久化内核实现 (persistent kernel implementations)。包含多种矩阵乘法方法,例如基础的朴素方法 (naive)、持久化方法 (persistent) 以及基于张量内存加速器(TMA,Tensor Memory Accelerator)的方法。这些内核同时支持半精度浮点数(FP16)和 8 位浮点数(FP8)数据类型,但 FP8 的实现仅在计算能力大于等于 9.0 的 CUDA 设备上可用。
From 9a14798be29ce8f6d7c1e3b7bd8b66b0948a4766 Mon Sep 17 00:00:00 2001
From: yuudiiii <162973048+yuudiiii@users.noreply.github.com>
Date: Wed, 11 Dec 2024 19:12:27 +0800
Subject: [PATCH 19/19] Update
docs/01-getting-started/tutorials/08-group-gemm.md
Co-authored-by: sparanoid
---
docs/01-getting-started/tutorials/08-group-gemm.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/docs/01-getting-started/tutorials/08-group-gemm.md b/docs/01-getting-started/tutorials/08-group-gemm.md
index 8e1f476..4bf66c8 100644
--- a/docs/01-getting-started/tutorials/08-group-gemm.md
+++ b/docs/01-getting-started/tutorials/08-group-gemm.md
@@ -2,7 +2,7 @@
title: 分组 GEMM
---
-[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/HTr2JbfRjsl)
+[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/HTr2JbfRjsl)
分组 GEMM 内核通过启动固定数量的 CTA 来计算一组 gemms。调度是静态的,并且在设备上完成。