From 6df5a9b0d680d95dab91f4eb5ae68b99ead0ff7e Mon Sep 17 00:00:00 2001
From: yuudiiii <162973048+yuudiiii@users.noreply.github.com>
Date: Wed, 11 Dec 2024 18:07:35 +0800
Subject: [PATCH 01/19] Update 01-vector-addition.md

---
 docs/01-getting-started/tutorials/01-vector-addition.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/01-getting-started/tutorials/01-vector-addition.md b/docs/01-getting-started/tutorials/01-vector-addition.md
index 079b6da..dda2c27 100644
--- a/docs/01-getting-started/tutorials/01-vector-addition.md
+++ b/docs/01-getting-started/tutorials/01-vector-addition.md
@@ -2,6 +2,8 @@
 title: 向量相加
 ---
 
+[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/YSztKYdMWSL)
+
 在本教程中,你将使用 Triton 编写一个简单的向量相加 (vector addition) 程序。
 
 你将了解:

From 2cfb112b87e03be2d29f83c008be58190805b1b3 Mon Sep 17 00:00:00 2001
From: yuudiiii <162973048+yuudiiii@users.noreply.github.com>
Date: Wed, 11 Dec 2024 18:12:26 +0800
Subject: [PATCH 02/19] Update 02-fused-softmax.md
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

增加了在线运行教程
---
 docs/01-getting-started/tutorials/02-fused-softmax.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/01-getting-started/tutorials/02-fused-softmax.md b/docs/01-getting-started/tutorials/02-fused-softmax.md
index 371c0b4..e902385 100644
--- a/docs/01-getting-started/tutorials/02-fused-softmax.md
+++ b/docs/01-getting-started/tutorials/02-fused-softmax.md
@@ -2,6 +2,8 @@
 title: 融合 Softmax (Fused Softmax)
 ---
 
+[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/QEhTxGYyzqY)
+
 在本教程中,您将编写一个融合的 softmax 操作,该操作在某些类别的矩阵上比 PyTorch 的原生操作快得多:即那些可以适应 GPU 静态随机存取存储器 (SRAM) 的行。

From 6ebfe44a857b47049f7595ae32c505fa6cf616f3 Mon Sep 17 00:00:00 2001
From: yuudiiii <162973048+yuudiiii@users.noreply.github.com>
Date: Wed, 11 Dec 2024 18:12:46 +0800
Subject: [PATCH 03/19] Update 01-vector-addition.md
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

增加了在线运行教程链接
---
 docs/01-getting-started/tutorials/01-vector-addition.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/01-getting-started/tutorials/01-vector-addition.md b/docs/01-getting-started/tutorials/01-vector-addition.md
index dda2c27..536793d 100644
--- a/docs/01-getting-started/tutorials/01-vector-addition.md
+++ b/docs/01-getting-started/tutorials/01-vector-addition.md
@@ -2,7 +2,7 @@
 title: 向量相加
 ---
 
-[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/YSztKYdMWSL)
+[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/YSztKYdMWSL)
 
 在本教程中,你将使用 Triton 编写一个简单的向量相加 (vector addition) 程序。

From 3b0f1c789b28752b6608eac9ac7b182df1a2d782 Mon Sep 17 00:00:00 2001
From: yuudiiii <162973048+yuudiiii@users.noreply.github.com>
Date: Wed, 11 Dec 2024 18:13:44 +0800
Subject: [PATCH 04/19] Update 03-matrix-multiplication.md
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

增加了在线运行教程链接
---
 docs/01-getting-started/tutorials/03-matrix-multiplication.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/01-getting-started/tutorials/03-matrix-multiplication.md b/docs/01-getting-started/tutorials/03-matrix-multiplication.md
index 9593376..72aa6a7 100644
--- a/docs/01-getting-started/tutorials/03-matrix-multiplication.md
+++ b/docs/01-getting-started/tutorials/03-matrix-multiplication.md
@@ -2,6 +2,8 @@
 title: 矩阵乘法
 ---
 
+[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/dheUrOfGo5m)
+
 在本教程中,您将编写一个非常简短的高性能 FP16 矩阵乘法内核,其性能可以与 cuBLAS 或 rocBLAS 相媲美。

From cae2c1e9b9796ad9a886d2daae2a71f8dea1f491 Mon Sep 17 00:00:00 2001
From: yuudiiii <162973048+yuudiiii@users.noreply.github.com>
Date: Wed, 11 Dec 2024 18:18:33 +0800
Subject: [PATCH 05/19] Update 04-low-memory-dropout.md
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

增加了在线运行教程链接
---
 docs/01-getting-started/tutorials/04-low-memory-dropout.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/01-getting-started/tutorials/04-low-memory-dropout.md b/docs/01-getting-started/tutorials/04-low-memory-dropout.md
index e93c57a..41b73d5 100644
--- a/docs/01-getting-started/tutorials/04-low-memory-dropout.md
+++ b/docs/01-getting-started/tutorials/04-low-memory-dropout.md
@@ -2,6 +2,8 @@
 title: 低内存 Dropout
 ---
 
+[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/mkRMwoRH87l)
+
 在本教程中,您将编写一个内存高效的 Dropout 实现,其状态将由单个 int32 seed 组成。这与传统 Dropout 实现不同,传统实现通常由与输入 shape 相同的位掩码张量组成。

From 9974f675ffb760ae55125b05ffab5d2925055710 Mon Sep 17 00:00:00 2001
From: yuudiiii <162973048+yuudiiii@users.noreply.github.com>
Date: Wed, 11 Dec 2024 18:21:02 +0800
Subject: [PATCH 06/19] Update 05-layer-normalization.md
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

增加了在线教程链接
---
 docs/01-getting-started/tutorials/05-layer-normalization.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/01-getting-started/tutorials/05-layer-normalization.md b/docs/01-getting-started/tutorials/05-layer-normalization.md
index eff47d9..64b6c17 100644
--- a/docs/01-getting-started/tutorials/05-layer-normalization.md
+++ b/docs/01-getting-started/tutorials/05-layer-normalization.md
@@ -2,12 +2,12 @@
 title: 层标准化
 ---
 
-在本教程中,你将编写一个比 PyTorch 实现运行更快的高性能层标准化 (layer normalization) 内核。
+[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/EC3Euf81ZW2)
+在本教程中,你将编写一个比 PyTorch 实现运行更快的高性能层标准化 (layer normalization) 内核。
 
 在此过程中,你将了解:
-
 * 在 Triton 中实现反向传播 (backward pass)。
 * 在 Triton 中实现并行归约 (parallel reduction)。

From 4c9235c5f6a8d22d6ccab3bdbe1d203ecb1a554a Mon Sep 17 00:00:00 2001
From: yuudiiii <162973048+yuudiiii@users.noreply.github.com>
Date: Wed, 11 Dec 2024 18:22:41 +0800
Subject: [PATCH 07/19] Update 06-fused-attention.md
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

增加了教程运行链接
---
 docs/01-getting-started/tutorials/06-fused-attention.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/01-getting-started/tutorials/06-fused-attention.md b/docs/01-getting-started/tutorials/06-fused-attention.md
index d9f7f58..1c5a761 100644
--- a/docs/01-getting-started/tutorials/06-fused-attention.md
+++ b/docs/01-getting-started/tutorials/06-fused-attention.md
@@ -2,6 +2,8 @@
 title: 融合注意力 (Fused Attention)
 ---
 
+[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/om2XKloXGTB)
+
 这是根据 [Tri Dao 的 Flash Attention v2 算法](https://tridao.me/publications/flash2/flash2.pdf)的 Triton 实现。致谢:OpenAI 核心团队

From c5ed9815239ad12e06049ee7f2cbdb52b591e3fa Mon Sep 17 00:00:00 2001
From: yuudiiii <162973048+yuudiiii@users.noreply.github.com>
Date: Wed, 11 Dec 2024 18:23:48 +0800
Subject: [PATCH 08/19] Update 07-libdevice-tl.extra.libdevice-function.md

---
 .../tutorials/07-libdevice-tl.extra.libdevice-function.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/docs/01-getting-started/tutorials/07-libdevice-tl.extra.libdevice-function.md b/docs/01-getting-started/tutorials/07-libdevice-tl.extra.libdevice-function.md
index dc3288e..04fcf35 100644
--- a/docs/01-getting-started/tutorials/07-libdevice-tl.extra.libdevice-function.md
+++ b/docs/01-getting-started/tutorials/07-libdevice-tl.extra.libdevice-function.md
@@ -1,6 +1,9 @@
 ---
 title: Libdevice (tl_extra.libdevice) 函数
 ---
+
+[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/RFagQOhvTsc)
+
 Triton 可以调用外部库中的自定义函数。在这个例子中,我们将使用 libdevice 库在张量上应用 asin 函数。请参考以下链接获取关于所有可用 libdevice 函数语义的详细信息:
 
 * CUDA:https://docs.nvidia.com/cuda/libdevice-users-guide/index.html

From ca310a123bfb52f9e898bce6745f9c898cac4ef7 Mon Sep 17 00:00:00 2001
From: yuudiiii <162973048+yuudiiii@users.noreply.github.com>
Date: Wed, 11 Dec 2024 18:24:48 +0800
Subject: [PATCH 09/19] Update 08-group-gemm.md

---
 docs/01-getting-started/tutorials/08-group-gemm.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/01-getting-started/tutorials/08-group-gemm.md b/docs/01-getting-started/tutorials/08-group-gemm.md
index 781fe0e..8e1f476 100644
--- a/docs/01-getting-started/tutorials/08-group-gemm.md
+++ b/docs/01-getting-started/tutorials/08-group-gemm.md
@@ -2,6 +2,8 @@
 title: 分组 GEMM
 ---
 
+[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/HTr2JbfRjsl)
+
 分组 GEMM 内核通过启动固定数量的 CTA 来计算一组 gemms。调度是静态的,并且在设备上完成。
 
 ![图片](/img/docs/Tutorials/GroupGEMM/09.png)

From e5f7c5cd6b8703357cd54fde0b8e1bf246060b46 Mon Sep 17 00:00:00 2001
From: yuudiiii <162973048+yuudiiii@users.noreply.github.com>
Date: Wed, 11 Dec 2024 18:25:42 +0800
Subject: [PATCH 10/19] Update 09-persistent-matmul.md
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

增加了教程链接
---
 docs/01-getting-started/tutorials/09-persistent-matmul.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/01-getting-started/tutorials/09-persistent-matmul.md b/docs/01-getting-started/tutorials/09-persistent-matmul.md
index b51de9f..6438cb0 100644
--- a/docs/01-getting-started/tutorials/09-persistent-matmul.md
+++ b/docs/01-getting-started/tutorials/09-persistent-matmul.md
@@ -2,6 +2,8 @@
 title: 持久矩阵乘法 (Persistent Matmul)
 ---
 
+[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/HMjXImmXZFV)
+
 该脚本展示了使用 Triton 进行矩阵乘法的持久化内核实现 (persistent kernel implementations)。包含多种矩阵乘法方法,例如基础的朴素方法 (naive)、持久化方法 (persistent) 以及基于张量内存加速器(TMA,Tensor Memory Accelerator)的方法。这些内核同时支持半精度浮点数(FP16)和 8 位浮点数(FP8)数据类型,但 FP8 的实现仅在计算能力大于等于 9.0 的 CUDA 设备上可用。
 
 Triton 与 cuBLAS 的具体实现将会在多种各异的配置情形下开展基准测试工作,并通过质子分析器 (proton profiler) 进行评估。使用者可以通过命令行参数灵活指定矩阵的维度和迭代步骤。

From ebc932968277765281ead409bf242cced339c096 Mon Sep 17 00:00:00 2001
From: yuudiiii <162973048+yuudiiii@users.noreply.github.com>
Date: Wed, 11 Dec 2024 19:11:40 +0800
Subject: [PATCH 11/19] Update docs/01-getting-started/tutorials/01-vector-addition.md

Co-authored-by: sparanoid
---
 docs/01-getting-started/tutorials/01-vector-addition.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/01-getting-started/tutorials/01-vector-addition.md b/docs/01-getting-started/tutorials/01-vector-addition.md
index 536793d..17d55b8 100644
--- a/docs/01-getting-started/tutorials/01-vector-addition.md
+++ b/docs/01-getting-started/tutorials/01-vector-addition.md
@@ -2,7 +2,7 @@
 title: 向量相加
 ---
 
-[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/YSztKYdMWSL)
+[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/YSztKYdMWSL)
 
 在本教程中,你将使用 Triton 编写一个简单的向量相加 (vector addition) 程序。

From b529a2eff6018492f094a9f1b9f6de80c4aea9d0 Mon Sep 17 00:00:00 2001
From: yuudiiii <162973048+yuudiiii@users.noreply.github.com>
Date: Wed, 11 Dec 2024 19:11:46 +0800
Subject: [PATCH 12/19] Update docs/01-getting-started/tutorials/02-fused-softmax.md

Co-authored-by: sparanoid
---
 docs/01-getting-started/tutorials/02-fused-softmax.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/01-getting-started/tutorials/02-fused-softmax.md b/docs/01-getting-started/tutorials/02-fused-softmax.md
index e902385..1c89a99 100644
--- a/docs/01-getting-started/tutorials/02-fused-softmax.md
+++ b/docs/01-getting-started/tutorials/02-fused-softmax.md
@@ -2,7 +2,7 @@
 title: 融合 Softmax (Fused Softmax)
 ---
 
-[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/QEhTxGYyzqY)
+[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/QEhTxGYyzqY)
 
 在本教程中,您将编写一个融合的 softmax 操作,该操作在某些类别的矩阵上比 PyTorch 的原生操作快得多:即那些可以适应 GPU 静态随机存取存储器 (SRAM) 的行。

From d0263358d9bbf0362ee3d9836ff5b094924356ba Mon Sep 17 00:00:00 2001
From: yuudiiii <162973048+yuudiiii@users.noreply.github.com>
Date: Wed, 11 Dec 2024 19:11:52 +0800
Subject: [PATCH 13/19] Update docs/01-getting-started/tutorials/03-matrix-multiplication.md

Co-authored-by: sparanoid
---
 docs/01-getting-started/tutorials/03-matrix-multiplication.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/01-getting-started/tutorials/03-matrix-multiplication.md b/docs/01-getting-started/tutorials/03-matrix-multiplication.md
index 72aa6a7..e769bd4 100644
--- a/docs/01-getting-started/tutorials/03-matrix-multiplication.md
+++ b/docs/01-getting-started/tutorials/03-matrix-multiplication.md
@@ -2,7 +2,7 @@
 title: 矩阵乘法
 ---
 
-[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/dheUrOfGo5m)
+[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/dheUrOfGo5m)
 
 在本教程中,您将编写一个非常简短的高性能 FP16 矩阵乘法内核,其性能可以与 cuBLAS 或 rocBLAS 相媲美。

From d0a9ca93cb2265d7dce8b4fc41d98ca1cb8c3f4b Mon Sep 17 00:00:00 2001
From: yuudiiii <162973048+yuudiiii@users.noreply.github.com>
Date: Wed, 11 Dec 2024 19:11:58 +0800
Subject: [PATCH 14/19] Update docs/01-getting-started/tutorials/04-low-memory-dropout.md

Co-authored-by: sparanoid
---
 docs/01-getting-started/tutorials/04-low-memory-dropout.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/01-getting-started/tutorials/04-low-memory-dropout.md b/docs/01-getting-started/tutorials/04-low-memory-dropout.md
index 41b73d5..80e66b3 100644
--- a/docs/01-getting-started/tutorials/04-low-memory-dropout.md
+++ b/docs/01-getting-started/tutorials/04-low-memory-dropout.md
@@ -2,7 +2,7 @@
 title: 低内存 Dropout
 ---
 
-[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/mkRMwoRH87l)
+[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/mkRMwoRH87l)
 
 在本教程中,您将编写一个内存高效的 Dropout 实现,其状态将由单个 int32 seed 组成。这与传统 Dropout 实现不同,传统实现通常由与输入 shape 相同的位掩码张量组成。

From ab6d70220cce13f837bfa4ca128fc83205c3199a Mon Sep 17 00:00:00 2001
From: yuudiiii <162973048+yuudiiii@users.noreply.github.com>
Date: Wed, 11 Dec 2024 19:12:04 +0800
Subject: [PATCH 15/19] Update docs/01-getting-started/tutorials/05-layer-normalization.md

Co-authored-by: sparanoid
---
 docs/01-getting-started/tutorials/05-layer-normalization.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/01-getting-started/tutorials/05-layer-normalization.md b/docs/01-getting-started/tutorials/05-layer-normalization.md
index 64b6c17..d06452e 100644
--- a/docs/01-getting-started/tutorials/05-layer-normalization.md
+++ b/docs/01-getting-started/tutorials/05-layer-normalization.md
@@ -2,7 +2,7 @@
 title: 层标准化
 ---
 
-[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/EC3Euf81ZW2)
+[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/EC3Euf81ZW2)
 在本教程中,你将编写一个比 PyTorch 实现运行更快的高性能层标准化 (layer normalization) 内核。

From fb9b3c192eb582245cde0e97da395829e7146fa4 Mon Sep 17 00:00:00 2001
From: yuudiiii <162973048+yuudiiii@users.noreply.github.com>
Date: Wed, 11 Dec 2024 19:12:10 +0800
Subject: [PATCH 16/19] Update docs/01-getting-started/tutorials/06-fused-attention.md

Co-authored-by: sparanoid
---
 docs/01-getting-started/tutorials/06-fused-attention.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/01-getting-started/tutorials/06-fused-attention.md b/docs/01-getting-started/tutorials/06-fused-attention.md
index 1c5a761..8b7d787 100644
--- a/docs/01-getting-started/tutorials/06-fused-attention.md
+++ b/docs/01-getting-started/tutorials/06-fused-attention.md
@@ -2,7 +2,7 @@
 title: 融合注意力 (Fused Attention)
 ---
 
-[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/om2XKloXGTB)
+[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/om2XKloXGTB)
 
 这是根据 [Tri Dao 的 Flash Attention v2 算法](https://tridao.me/publications/flash2/flash2.pdf)的 Triton 实现。致谢:OpenAI 核心团队

From f627abcb2d260d4736dbd3b4e3ee470fc2bd684b Mon Sep 17 00:00:00 2001
From: yuudiiii <162973048+yuudiiii@users.noreply.github.com>
Date: Wed, 11 Dec 2024 19:12:16 +0800
Subject: [PATCH 17/19] Update docs/01-getting-started/tutorials/07-libdevice-tl.extra.libdevice-function.md

Co-authored-by: sparanoid
---
 .../tutorials/07-libdevice-tl.extra.libdevice-function.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/01-getting-started/tutorials/07-libdevice-tl.extra.libdevice-function.md b/docs/01-getting-started/tutorials/07-libdevice-tl.extra.libdevice-function.md
index 04fcf35..27a8555 100644
--- a/docs/01-getting-started/tutorials/07-libdevice-tl.extra.libdevice-function.md
+++ b/docs/01-getting-started/tutorials/07-libdevice-tl.extra.libdevice-function.md
@@ -2,7 +2,7 @@
 title: Libdevice (tl_extra.libdevice) 函数
 ---
 
-[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/RFagQOhvTsc)
+[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/RFagQOhvTsc)
 
 Triton 可以调用外部库中的自定义函数。在这个例子中,我们将使用 libdevice 库在张量上应用 asin 函数。请参考以下链接获取关于所有可用 libdevice 函数语义的详细信息:

From ef6dd3ab51e92c0553a1a81c8b9cde5a90e00a2b Mon Sep 17 00:00:00 2001
From: yuudiiii <162973048+yuudiiii@users.noreply.github.com>
Date: Wed, 11 Dec 2024 19:12:21 +0800
Subject: [PATCH 18/19] Update docs/01-getting-started/tutorials/09-persistent-matmul.md

Co-authored-by: sparanoid
---
 docs/01-getting-started/tutorials/09-persistent-matmul.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/01-getting-started/tutorials/09-persistent-matmul.md b/docs/01-getting-started/tutorials/09-persistent-matmul.md
index 6438cb0..8924784 100644
--- a/docs/01-getting-started/tutorials/09-persistent-matmul.md
+++ b/docs/01-getting-started/tutorials/09-persistent-matmul.md
@@ -2,7 +2,7 @@
 title: 持久矩阵乘法 (Persistent Matmul)
 ---
 
-[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/HMjXImmXZFV)
+[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/HMjXImmXZFV)
 
 该脚本展示了使用 Triton 进行矩阵乘法的持久化内核实现 (persistent kernel implementations)。包含多种矩阵乘法方法,例如基础的朴素方法 (naive)、持久化方法 (persistent) 以及基于张量内存加速器(TMA,Tensor Memory Accelerator)的方法。这些内核同时支持半精度浮点数(FP16)和 8 位浮点数(FP8)数据类型,但 FP8 的实现仅在计算能力大于等于 9.0 的 CUDA 设备上可用。

From 9a14798be29ce8f6d7c1e3b7bd8b66b0948a4766 Mon Sep 17 00:00:00 2001
From: yuudiiii <162973048+yuudiiii@users.noreply.github.com>
Date: Wed, 11 Dec 2024 19:12:27 +0800
Subject: [PATCH 19/19] Update docs/01-getting-started/tutorials/08-group-gemm.md

Co-authored-by: sparanoid
---
 docs/01-getting-started/tutorials/08-group-gemm.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/01-getting-started/tutorials/08-group-gemm.md b/docs/01-getting-started/tutorials/08-group-gemm.md
index 8e1f476..4bf66c8 100644
--- a/docs/01-getting-started/tutorials/08-group-gemm.md
+++ b/docs/01-getting-started/tutorials/08-group-gemm.md
@@ -2,7 +2,7 @@
 title: 分组 GEMM
 ---
 
-[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/HTr2JbfRjsl)
+[在线运行此教程](https://openbayes.com/console/hyperai-tutorials/containers/HTr2JbfRjsl)
 
 分组 GEMM 内核通过启动固定数量的 CTA 来计算一组 gemms。调度是静态的,并且在设备上完成。