
[Feature] Sage Attention Support Varlen & write kv cache BF16/FP16 #10360

Draft

wants to merge 49 commits into develop

Conversation

l1cacheDell (Contributor) commented Apr 8, 2025

PR types

New features

PR changes

APIs

Description

1. Write KV cache (BF16/FP16) with RoPE integration

This PR adds a new feature: Sage Attention integrated with the write-KV-cache kernels, offering a flexible C++ operator API similar to append attention that is easy to use.
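For context, here is a hypothetical C++ operator signature in the spirit of the append-attention-style API described above. The name and argument list are assumptions for illustration only, not this PR's actual interface:

```cpp
#include <vector>
#include "paddle/extension.h"

// Hypothetical fused "write KV cache with RoPE" operator interface
// (illustrative only; see the PR diff for the real operator and arguments).
std::vector<paddle::Tensor> SageAttentionWriteKVCacheFwd(
    const paddle::Tensor& qkv,          // fused QKV activations, BF16/FP16
    const paddle::Tensor& key_cache,    // KV cache tensors, written in place
    const paddle::Tensor& value_cache,
    const paddle::Tensor& rotary_embs,  // RoPE cos/sin table applied before caching
    const paddle::Tensor& seq_lens);    // per-sequence lengths for varlen batches
```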

2. Varlen support: allowing different seqlens in a single batch!

This PR also adds varlen support for different sequence lengths within a single batch. The key idea is to enlarge the gridDim (the number of blocks at kernel launch) to [max_seqlen, num_heads, bsz], where max_seqlen is the longest sequence length in the batch.
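A minimal launch sketch of that grid shape. The kernel and parameter names (sage_attn_fwd_kernel, cu_seqlens) are illustrative placeholders, not the actual kernels in this PR:

```cuda
#include <cuda_runtime.h>

// Stub standing in for the Sage Attention forward kernel.
__global__ void sage_attn_fwd_kernel(const int* cu_seqlens) {
    // blockIdx.x -> token position, blockIdx.y -> head, blockIdx.z -> batch
}

void launch_sage_attn_varlen(const int* cu_seqlens, int max_seqlen,
                             int num_heads, int bsz, cudaStream_t stream) {
    // gridDim = [max_seqlen, num_heads, bsz]: the grid is sized for the
    // longest sequence, so shorter sequences get extra blocks that must
    // exit early inside the kernel.
    dim3 grid(max_seqlen, num_heads, bsz);
    dim3 block(128);  // threads per block; kernel-specific choice
    sage_attn_fwd_kernel<<<grid, block, 0, stream>>>(cu_seqlens);
}
```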

In the kernel implementation, we let the remaining threads in the blocks that process the edge of a sequence keep executing to avoid accuracy loss (this introduces a little extra latency, but it is necessary). The kernel returns and stops processing once thread_base_token exceeds the current sequence length.
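A sketch of that early-exit guard, assuming a cumulative-offsets array cu_seqlens of length bsz + 1; the names and the tokens-per-block constant are illustrative, not necessarily this PR's exact layout:

```cuda
#include <cuda_runtime.h>

constexpr int TOKENS_PER_BLOCK = 1;  // tokens owned by one block; illustrative

__global__ void sage_attn_varlen_kernel(const int* cu_seqlens
                                        /*, q, k, v, out, ... */) {
    const int batch_id = blockIdx.z;
    // Length of the sequence this block belongs to.
    const int seq_len = cu_seqlens[batch_id + 1] - cu_seqlens[batch_id];

    // First token this block is responsible for.
    const int thread_base_token = blockIdx.x * TOKENS_PER_BLOCK;

    // Blocks launched only because the grid was padded to max_seqlen return
    // immediately; blocks straddling the sequence edge keep all their
    // threads running so the partial tile is still computed exactly.
    if (thread_base_token >= seq_len) {
        return;
    }

    // ... attention computation for tokens [thread_base_token, seq_len) ...
}
```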

Because of how the Sage Attention kernels are implemented, varlen has to be supported not only in the attention forward kernels but also in the quantization kernels (quant qk and quant_transpose_permute v kernels), which took more development and debugging time than estimated.
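To illustrate how the same guard carries over to the quantization path, here is a hypothetical per-token symmetric INT8 quantization kernel; it is not this PR's quant qk / quant_transpose_permute kernels, and the packed layout and warp-sized block are assumptions:

```cuda
#include <cuda_runtime.h>
#include <cuda_fp16.h>

// Launch with grid = [max_seqlen, num_heads, bsz] and blockDim.x == 32 (one warp).
__global__ void quant_per_token_varlen(const half* __restrict__ x,
                                       int8_t* __restrict__ x_q,
                                       float* __restrict__ scales,
                                       const int* __restrict__ cu_seqlens,
                                       int head_dim) {
    const int batch_id = blockIdx.z;
    const int head_id = blockIdx.y;
    const int token_id = blockIdx.x;
    const int num_heads = gridDim.y;

    const int seq_len = cu_seqlens[batch_id + 1] - cu_seqlens[batch_id];
    if (token_id >= seq_len) return;  // padding block from the max_seqlen grid

    // Row of this token in a packed [total_tokens, num_heads, head_dim] layout.
    const size_t row =
        (size_t)(cu_seqlens[batch_id] + token_id) * num_heads + head_id;
    const half* src = x + row * head_dim;

    // Per-token absmax, reduced across the warp.
    float absmax = 0.f;
    for (int i = threadIdx.x; i < head_dim; i += 32) {
        absmax = fmaxf(absmax, fabsf(__half2float(src[i])));
    }
    for (int offset = 16; offset > 0; offset >>= 1) {
        absmax = fmaxf(absmax, __shfl_down_sync(0xffffffff, absmax, offset));
    }
    absmax = __shfl_sync(0xffffffff, absmax, 0);

    // Symmetric INT8 quantization with one scale per (token, head).
    const float scale = absmax / 127.f + 1e-8f;
    for (int i = threadIdx.x; i < head_dim; i += 32) {
        x_q[row * head_dim + i] = (int8_t)lrintf(__half2float(src[i]) / scale);
    }
    if (threadIdx.x == 0) scales[row] = scale;
}
```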


paddle-bot bot commented Apr 8, 2025

Thanks for your contribution!


codecov bot commented Apr 9, 2025

Codecov Report

Attention: Patch coverage is 0% with 96 lines in your changes missing coverage. Please review.

Project coverage is 49.05%. Comparing base (e3ed3a3) to head (f5f2716).
Report is 2 commits behind head on develop.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| paddlenlp/ops/triton_ops/segment_mean.py | 0.00% | 61 Missing ⚠️ |
| ...erimental/transformers/fused_transformer_layers.py | 0.00% | 18 Missing ⚠️ |
| paddlenlp/experimental/transformers/utils.py | 0.00% | 17 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff             @@
##           develop   #10360      +/-   ##
===========================================
- Coverage    49.09%   49.05%   -0.05%     
===========================================
  Files          763      764       +1     
  Lines       125659   125767     +108     
===========================================
+ Hits         61688    61689       +1     
- Misses       63971    64078     +107     

☔ View full report in Codecov by Sentry.

l1cacheDell marked this pull request as draft April 9, 2025 11:47