
[Feature] Sage Attention Support Varlen & write kv cache BF16/FP16 #10360

Draft

wants to merge 49 commits into develop

Conversation

l1cacheDell (Contributor) commented Apr 8, 2025

PR types

New features

PR changes

APIs

Description

1. Write KV cache (BF16/FP16) with RoPE integration

This PR adds a new feature: Sage Attention integrated with the write-KV-cache kernels, offering a flexible C++ operator API similar to append attention that is easy to use.
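For context, here is a hypothetical C++ operator signature in the spirit of the append-attention-style API described above. The name and argument list are assumptions for illustration only, not this PR's actual interface:

```cpp
#include <vector>
#include "paddle/extension.h"

// Hypothetical fused "write KV cache with RoPE" operator interface
// (illustrative only; see the PR diff for the real operator and arguments).
std::vector<paddle::Tensor> SageAttentionWriteKVCacheFwd(
    const paddle::Tensor& qkv,          // fused QKV activations, BF16/FP16
    const paddle::Tensor& key_cache,    // KV cache tensors, written in place
    const paddle::Tensor& value_cache,
    const paddle::Tensor& rotary_embs,  // RoPE cos/sin table applied before caching
    const paddle::Tensor& seq_lens);    // per-sequence lengths for varlen batches
```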

2. Varlen support: allowing different seqlens in a single batch!

This PR also adds varlen support for different sequence lengths within a single batch. The key idea is to enlarge the gridDim (the number of blocks at kernel launch) to [max_seqlen, num_heads, bsz], where max_seqlen is the longest sequence length in the batch.
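A minimal launch sketch of that grid shape. The kernel and parameter names (sage_attn_fwd_kernel, cu_seqlens) are illustrative placeholders, not the actual kernels in this PR:

```cuda
#include <cuda_runtime.h>

// Stub standing in for the Sage Attention forward kernel.
__global__ void sage_attn_fwd_kernel(const int* cu_seqlens) {
    // blockIdx.x -> token position, blockIdx.y -> head, blockIdx.z -> batch
}

void launch_sage_attn_varlen(const int* cu_seqlens, int max_seqlen,
                             int num_heads, int bsz, cudaStream_t stream) {
    // gridDim = [max_seqlen, num_heads, bsz]: the grid is sized for the
    // longest sequence, so shorter sequences get extra blocks that must
    // exit early inside the kernel.
    dim3 grid(max_seqlen, num_heads, bsz);
    dim3 block(128);  // threads per block; kernel-specific choice
    sage_attn_fwd_kernel<<<grid, block, 0, stream>>>(cu_seqlens);
}
```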

In the kernel implementation, we let the remaining threads in the blocks that process the edge of a sequence keep executing to avoid accuracy loss (this introduces a little extra latency, but it is necessary). The kernel returns and stops processing once thread_base_token exceeds the current sequence length.
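A sketch of that early-exit guard, assuming a cumulative-offsets array cu_seqlens of length bsz + 1; the names and the tokens-per-block constant are illustrative, not necessarily this PR's exact layout:

```cuda
#include <cuda_runtime.h>

constexpr int TOKENS_PER_BLOCK = 1;  // tokens owned by one block; illustrative

__global__ void sage_attn_varlen_kernel(const int* cu_seqlens
                                        /*, q, k, v, out, ... */) {
    const int batch_id = blockIdx.z;
    // Length of the sequence this block belongs to.
    const int seq_len = cu_seqlens[batch_id + 1] - cu_seqlens[batch_id];

    // First token this block is responsible for.
    const int thread_base_token = blockIdx.x * TOKENS_PER_BLOCK;

    // Blocks launched only because the grid was padded to max_seqlen return
    // immediately; blocks straddling the sequence edge keep all their
    // threads running so the partial tile is still computed exactly.
    if (thread_base_token >= seq_len) {
        return;
    }

    // ... attention computation for tokens [thread_base_token, seq_len) ...
}
```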

Because of how the Sage Attention kernels are implemented, varlen has to be supported not only in the attention forward kernels but also in the quantization kernels (quant qk and quant_transpose_permute v kernels), which took more development and debugging time than estimated.
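To illustrate how the same guard carries over to the quantization path, here is a hypothetical per-token symmetric INT8 quantization kernel; it is not this PR's quant qk / quant_transpose_permute kernels, and the packed layout and warp-sized block are assumptions:

```cuda
#include <cuda_runtime.h>
#include <cuda_fp16.h>

// Launch with grid = [max_seqlen, num_heads, bsz] and blockDim.x == 32 (one warp).
__global__ void quant_per_token_varlen(const half* __restrict__ x,
                                       int8_t* __restrict__ x_q,
                                       float* __restrict__ scales,
                                       const int* __restrict__ cu_seqlens,
                                       int head_dim) {
    const int batch_id = blockIdx.z;
    const int head_id = blockIdx.y;
    const int token_id = blockIdx.x;
    const int num_heads = gridDim.y;

    const int seq_len = cu_seqlens[batch_id + 1] - cu_seqlens[batch_id];
    if (token_id >= seq_len) return;  // padding block from the max_seqlen grid

    // Row of this token in a packed [total_tokens, num_heads, head_dim] layout.
    const size_t row =
        (size_t)(cu_seqlens[batch_id] + token_id) * num_heads + head_id;
    const half* src = x + row * head_dim;

    // Per-token absmax, reduced across the warp.
    float absmax = 0.f;
    for (int i = threadIdx.x; i < head_dim; i += 32) {
        absmax = fmaxf(absmax, fabsf(__half2float(src[i])));
    }
    for (int offset = 16; offset > 0; offset >>= 1) {
        absmax = fmaxf(absmax, __shfl_down_sync(0xffffffff, absmax, offset));
    }
    absmax = __shfl_sync(0xffffffff, absmax, 0);

    // Symmetric INT8 quantization with one scale per (token, head).
    const float scale = absmax / 127.f + 1e-8f;
    for (int i = threadIdx.x; i < head_dim; i += 32) {
        x_q[row * head_dim + i] = (int8_t)lrintf(__half2float(src[i]) / scale);
    }
    if (threadIdx.x == 0) scales[row] = scale;
}
```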


paddle-bot bot commented Apr 8, 2025

Thanks for your contribution!


codecov bot commented Apr 9, 2025

Codecov Report

Attention: Patch coverage is 0% with 96 lines in your changes missing coverage. Please review.

Project coverage is 49.05%. Comparing base (e3ed3a3) to head (f5f2716).
Report is 2 commits behind head on develop.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| paddlenlp/ops/triton_ops/segment_mean.py | 0.00% | 61 Missing ⚠️ |
| ...erimental/transformers/fused_transformer_layers.py | 0.00% | 18 Missing ⚠️ |
| paddlenlp/experimental/transformers/utils.py | 0.00% | 17 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff             @@
##           develop   #10360      +/-   ##
===========================================
- Coverage    49.09%   49.05%   -0.05%     
===========================================
  Files          763      764       +1     
  Lines       125659   125767     +108     
===========================================
+ Hits         61688    61689       +1     
- Misses       63971    64078     +107     

☔ View full report in Codecov by Sentry.

l1cacheDell marked this pull request as draft April 9, 2025 11:47