[Sparse Attention][Performance] Accelerate the performance of sparse attention + Benchmark #1397

sxjscience · 2020-10-21T19:59:16Z

We are having ongoing efforts about supporting sparse attention in GluonNLP: #1395. To better accelerate related kernels, we can compare the performance of these potential solutions, including:

Use BlockSparse kernel to implement the operator
We may try out these implementations
Directly implement window attention
- Use CUTLASS and implement our own version
- Use TVM + Ansor: https://tvm.apache.org/docs/tutorials/auto_scheduler/tune_conv2d_layer_cuda.html#sphx-glr-tutorials-auto-scheduler-tune-conv2d-layer-cuda-py

sxjscience · 2020-10-22T00:23:16Z

@ZiyueHuang Created the issue here about how we may use TVM to accelerate the speed.

sxjscience added enhancement New feature or request help wanted Extra attention is needed labels Oct 21, 2020

sxjscience changed the title ~~[Window Attention][Performance] Use TVM + Ansor to accelerate the kernel of window attention~~ [Sparse Attention][Performance] Accelerate the performance of sparse attention + Benchmark Oct 23, 2020

sxjscience added the performance Performance issues label Oct 24, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Sparse Attention][Performance] Accelerate the performance of sparse attention + Benchmark #1397

[Sparse Attention][Performance] Accelerate the performance of sparse attention + Benchmark #1397

sxjscience commented Oct 21, 2020 •

edited

Loading

sxjscience commented Oct 22, 2020

[Sparse Attention][Performance] Accelerate the performance of sparse attention + Benchmark #1397

[Sparse Attention][Performance] Accelerate the performance of sparse attention + Benchmark #1397

Comments

sxjscience commented Oct 21, 2020 • edited Loading

sxjscience commented Oct 22, 2020

sxjscience commented Oct 21, 2020 •

edited

Loading