Merge pull request #16 from yuudiiii/patch-6
Added links for running the tutorials online
sparanoid authored Dec 11, 2024
2 parents f5f0f5a + 9a14798 commit d2bee1f
Showing 9 changed files with 19 additions and 2 deletions.
docs/01-getting-started/tutorials/01-vector-addition.md (2 additions, 0 deletions)
@@ -2,6 +2,8 @@
 title: Vector Addition
 ---
 
+[Run this tutorial online](https://openbayes.com/console/hyperai-tutorials/containers/YSztKYdMWSL)
+
 In this tutorial, you will use Triton to write a simple vector addition program.
 
 You will learn about:
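For context on the tutorial this commit links to: a minimal sketch of a Triton vector-addition kernel, assuming a 1D grid and an illustrative block size of 1024 (a sketch of the technique, not the tutorial's exact code):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance processes one BLOCK_SIZE-wide slice of the inputs.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements          # guard the final, partial block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)       # one program per block of 1024 elements
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```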
docs/01-getting-started/tutorials/02-fused-softmax.md (2 additions, 0 deletions)
@@ -2,6 +2,8 @@
 title: Fused Softmax
 ---
 
+[Run this tutorial online](https://openbayes.com/console/hyperai-tutorials/containers/QEhTxGYyzqY)
+
 In this tutorial, you will write a fused softmax operation that is significantly faster than PyTorch's native op for a particular class of matrices: those whose rows fit in the GPU's static random-access memory (SRAM).
 
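Here, "fused" means each row is loaded into on-chip memory once, normalized there, and written back once. A rough sketch of such a kernel, assuming each row fits in a single block and BLOCK_SIZE is a power of two greater than or equal to the row length:

```python
import triton
import triton.language as tl

@triton.jit
def softmax_kernel(out_ptr, in_ptr, in_row_stride, out_row_stride, n_cols,
                   BLOCK_SIZE: tl.constexpr):
    # One program normalizes one row entirely in registers/SRAM.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    x = tl.load(in_ptr + row * in_row_stride + cols, mask=mask, other=-float('inf'))
    x = x - tl.max(x, axis=0)            # subtract the row max for numerical stability
    num = tl.exp(x)
    out = num / tl.sum(num, axis=0)
    tl.store(out_ptr + row * out_row_stride + cols, out, mask=mask)
```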
docs/01-getting-started/tutorials/03-matrix-multiplication.md (2 additions, 0 deletions)
@@ -2,6 +2,8 @@
 title: Matrix Multiplication
 ---
 
+[Run this tutorial online](https://openbayes.com/console/hyperai-tutorials/containers/dheUrOfGo5m)
+
 In this tutorial, you will write a very short, high-performance FP16 matrix multiplication kernel whose performance is comparable to cuBLAS or rocBLAS.
 
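A condensed sketch of a blocked FP16 matmul kernel of this kind, accumulating in FP32 (parameter names and tiling are illustrative; the real tutorial adds autotuning and smarter tile ordering):

```python
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # One program computes one BLOCK_M x BLOCK_N tile of C = A @ B.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    rk = tl.arange(0, BLOCK_K)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)  # accumulate in FP32
    for k in range(0, K, BLOCK_K):
        a = tl.load(a_ptr + rm[:, None] * stride_am + (rk[None, :] + k) * stride_ak,
                    mask=(rm[:, None] < M) & (rk[None, :] + k < K), other=0.0)
        b = tl.load(b_ptr + (rk[:, None] + k) * stride_bk + rn[None, :] * stride_bn,
                    mask=(rk[:, None] + k < K) & (rn[None, :] < N), other=0.0)
        acc += tl.dot(a, b)              # tensor-core matmul on the tile
    c_mask = (rm[:, None] < M) & (rn[None, :] < N)
    tl.store(c_ptr + rm[:, None] * stride_cm + rn[None, :] * stride_cn,
             acc.to(tl.float16), mask=c_mask)
```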
docs/01-getting-started/tutorials/04-low-memory-dropout.md (2 additions, 0 deletions)
@@ -2,6 +2,8 @@
 title: Low-Memory Dropout
 ---
 
+[Run this tutorial online](https://openbayes.com/console/hyperai-tutorials/containers/mkRMwoRH87l)
+
 In this tutorial, you will write a memory-efficient implementation of dropout whose state consists of a single int32 seed. This differs from traditional dropout implementations, whose state is usually a bit mask tensor of the same shape as the input.
 
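The single-seed design works because Triton's `tl.rand(seed, offset)` is a counter-based generator: the keep/drop decision for any element can be recomputed from the seed plus the element's offset, so no mask tensor is ever materialized. A minimal sketch along those lines:

```python
import triton
import triton.language as tl

@triton.jit
def seeded_dropout_kernel(x_ptr, out_ptr, n_elements, p, seed, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    r = tl.rand(seed, offsets)               # reproducible per (seed, offset)
    keep = r > p
    out = tl.where(keep, x / (1 - p), 0.0)   # inverted-dropout scaling
    tl.store(out_ptr + offsets, out, mask=mask)
```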
docs/01-getting-started/tutorials/05-layer-normalization.md (2 additions, 2 deletions)
@@ -2,12 +2,12 @@
 title: Layer Normalization
 ---
 
-In this tutorial, you will write a high-performance layer normalization kernel that runs faster than the PyTorch implementation.
+[Run this tutorial online](https://openbayes.com/console/hyperai-tutorials/containers/EC3Euf81ZW2)
 
+In this tutorial, you will write a high-performance layer normalization kernel that runs faster than the PyTorch implementation.
 
 Along the way, you will learn about:
 
 * Implementing the backward pass in Triton.
 * Implementing parallel reduction in Triton.
 
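For the forward pass, the parallel reduction shows up in the mean and variance computation, roughly like this sketch (one program per row; the backward pass is omitted):

```python
import triton
import triton.language as tl

@triton.jit
def layer_norm_fwd_kernel(x_ptr, y_ptr, w_ptr, b_ptr, stride, N, eps,
                          BLOCK_SIZE: tl.constexpr):
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < N
    x = tl.load(x_ptr + row * stride + cols, mask=mask, other=0.0).to(tl.float32)
    mean = tl.sum(x, axis=0) / N                     # parallel reduction
    diff = tl.where(mask, x - mean, 0.0)
    var = tl.sum(diff * diff, axis=0) / N            # parallel reduction
    rstd = 1.0 / tl.sqrt(var + eps)
    w = tl.load(w_ptr + cols, mask=mask)
    b = tl.load(b_ptr + cols, mask=mask)
    y = (x - mean) * rstd * w + b
    tl.store(y_ptr + row * stride + cols, y, mask=mask)
```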
docs/01-getting-started/tutorials/06-fused-attention.md (2 additions, 0 deletions)
@@ -2,6 +2,8 @@
 title: Fused Attention
 ---
 
+[Run this tutorial online](https://openbayes.com/console/hyperai-tutorials/containers/om2XKloXGTB)
+
 This is a Triton implementation of [Tri Dao's Flash Attention v2 algorithm](https://tridao.me/publications/flash2/flash2.pdf). Credits: the OpenAI core team.
 
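The core of Flash Attention is an online softmax: attention scores are processed in K/V blocks while a running row max and denominator rescale earlier partial sums. A plain PyTorch sketch of that recurrence (an illustration only, not the Triton kernel; no masking or dropout):

```python
import torch

def streaming_attention(q, k, v, block=128):
    # q: (M, d); k, v: (N, d). Processes K/V in chunks of `block` rows.
    q, k, v = q.float(), k.float(), v.float()       # compute in FP32 for clarity
    scale = q.shape[-1] ** -0.5
    acc = torch.zeros_like(q)                       # running sum of p @ V
    m = torch.full((q.shape[0], 1), float('-inf'), device=q.device)  # running row max
    l = torch.zeros((q.shape[0], 1), device=q.device)                # running denominator
    for s0 in range(0, k.shape[0], block):
        kb, vb = k[s0:s0 + block], v[s0:s0 + block]
        s = (q @ kb.T) * scale
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        alpha = torch.exp(m - m_new)                # rescale previously accumulated state
        p = torch.exp(s - m_new)
        l = l * alpha + p.sum(dim=-1, keepdim=True)
        acc = acc * alpha + p @ vb
        m = m_new
    return acc / l
```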
@@ -1,6 +1,9 @@
 ---
 title: Libdevice (tl_extra.libdevice) Functions
 ---
+
+[Run this tutorial online](https://openbayes.com/console/hyperai-tutorials/containers/RFagQOhvTsc)
+
 Triton can invoke custom functions from external libraries. In this example, we will use the libdevice library to apply the asin function to a tensor. Refer to the links below for details on the semantics of all available libdevice functions:
 
 * CUDA: https://docs.nvidia.com/cuda/libdevice-users-guide/index.html
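A sketch of what such a call looks like; note that the import path for libdevice has moved between Triton versions (`triton.language.extra.libdevice` in recent releases), so treat it as an assumption:

```python
import triton
import triton.language as tl
from triton.language.extra import libdevice  # path varies across Triton versions

@triton.jit
def asin_kernel(x_ptr, y_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = libdevice.asin(x)    # dispatches to the libdevice asin implementation
    tl.store(y_ptr + offsets, y, mask=mask)
```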
docs/01-getting-started/tutorials/08-group-gemm.md (2 additions, 0 deletions)
@@ -2,6 +2,8 @@
 title: Grouped GEMM
 ---
 
+[Run this tutorial online](https://openbayes.com/console/hyperai-tutorials/containers/HTr2JbfRjsl)
+
 The grouped GEMM kernel computes a group of GEMMs by launching a fixed number of CTAs. Scheduling is static and is performed on the device.
 
 ![Image](/img/docs/Tutorials/GroupGEMM/09.png)
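"Static scheduling with a fixed number of CTAs" means the tile-to-program mapping is a pure function of the problem sizes, so each program can derive its own work list on device without a shared queue. A hypothetical host-side Python illustration of that mapping (`tiles_for_program` is not from the tutorial):

```python
def tiles_for_program(pid, num_programs, problem_sizes, BLOCK_M=64, BLOCK_N=64):
    """List the (gemm_id, tile_id) pairs a given program would own."""
    owned, base = [], 0
    for gemm_id, (M, N, _K) in enumerate(problem_sizes):
        n_tiles = -(-M // BLOCK_M) * -(-N // BLOCK_N)   # ceil-div tile count per GEMM
        for t in range(n_tiles):
            if (base + t) % num_programs == pid:        # round-robin ownership
                owned.append((gemm_id, t))
        base += n_tiles
    return owned

# e.g. tiles_for_program(0, 8, [(128, 128, 64), (256, 64, 64)])
```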
docs/01-getting-started/tutorials/09-persistent-matmul.md (2 additions, 0 deletions)
@@ -2,6 +2,8 @@
 title: Persistent Matmul
 ---
 
+[Run this tutorial online](https://openbayes.com/console/hyperai-tutorials/containers/HMjXImmXZFV)
+
 This script demonstrates persistent kernel implementations of matrix multiplication in Triton. It includes several matmul variants, such as a naive approach, a persistent approach, and one based on the Tensor Memory Accelerator (TMA). These kernels support both half-precision (FP16) and 8-bit (FP8) floating-point data types, but the FP8 implementation is only available on CUDA devices with compute capability >= 9.0.
 
 The Triton and cuBLAS implementations are benchmarked under a variety of configurations and evaluated with the proton profiler. Users can specify matrix dimensions and iteration steps via command-line arguments.
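A "persistent" kernel launches roughly one program per SM and has each program loop over many output tiles, rather than launching one program per tile. A condensed Triton sketch of that outer loop (assumes M, N, K are multiples of the block sizes; the TMA and FP8 variants are omitted):

```python
import triton
import triton.language as tl

@triton.jit
def persistent_matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                             stride_am, stride_ak, stride_bk, stride_bn,
                             stride_cm, stride_cn,
                             BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                             BLOCK_K: tl.constexpr, NUM_SMS: tl.constexpr):
    # Launched with grid = (NUM_SMS,); each program strides over the tile space.
    pid = tl.program_id(0)
    num_tiles_m = tl.cdiv(M, BLOCK_M)
    num_tiles = num_tiles_m * tl.cdiv(N, BLOCK_N)
    for tile in range(pid, num_tiles, NUM_SMS):
        pid_m = tile % num_tiles_m
        pid_n = tile // num_tiles_m
        rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
        rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
        rk = tl.arange(0, BLOCK_K)
        acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
        for k in range(0, K, BLOCK_K):
            a = tl.load(a_ptr + rm[:, None] * stride_am + (rk[None, :] + k) * stride_ak)
            b = tl.load(b_ptr + (rk[:, None] + k) * stride_bk + rn[None, :] * stride_bn)
            acc += tl.dot(a, b)
        tl.store(c_ptr + rm[:, None] * stride_cm + rn[None, :] * stride_cn,
                 acc.to(tl.float16))
```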
