vLLM-Ext: Full enabling of ALiBi #60

Open · wants to merge 1 commit into main
Conversation

tannervoas742 (Contributor)
Changes:

  • Added ALiBi biases back to the decode stage (see the decode-bias sketch after this list).
  • Optimized ALiBi memory usage.
    • Added environment variable "VLLM_PROMPT_ALIBI_MAX_SEQ_LEN" to allow large models to run with restricted prompt lengths.
    • Prompt biases are instantiated once in init rather than on each forward pass (see the sketch after this list).
    • Prompt and decode biases are shared across encoder/decoder layers.
  • Added environment variable "VLLM_ALIBI_USE_FLOAT32_BIASES" to resolve an accuracy issue on long sequences.
  • Updated jais, mpt, falcon, baichuan, and bloom to work with ALiBI.
    • Due to bloom's 176B parameter size I was unable to test that model; its changes are the simplest, though.
  • Works in both lazy and eager modes.
  • ALiBi is restricted to "VLLM_PROMPT_USE_FUSEDSDPA=false" and "VLLM_CONTIGUOUS_PA=true".
  • Added position offsets to improve quality at BS > 1 with sequences of varying length (see the decode sketch below).
  • BS > 1 may have accuracy issues on FW < 1.19.0 due to a limitation in softmax; this is resolved on FW >= 1.19.0.
  • Applied NTT patch for GQA.

tannervoas742 force-pushed the restore_alibi_for_flat_pa_final branch from de937c2 to 22afba6 on December 18, 2024.