ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference

training-free, high-throughput long-context LLM inference

Hanshi Sun^1,2, Li-Wen Chang², Wenlei Bao², Size Zheng², Ningxin Zheng², Xin Liu²,
Harry Dong¹, Yuejie Chi¹, Beidi Chen¹

¹Carnegie Mellon University ²ByteDance

[Paper] | [Blog]

ShadowKV Framework

Environment Set Up

To reproduce the results in the paper, you need to set up the environment as follows with a single A100 GPU:

# create env
conda create -n ShadowKV python=3.10 -y
conda activate ShadowKV

# install packages
pip install -r requirements.txt
pip install flash-attn --no-build-isolation

# nemo dependencies (for dataset building)
pip install wheel
pip install Cython
pip install youtokentome
pip install nemo_toolkit[all]==1.23

# flashinfer
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3/

# cutlass
mkdir 3rdparty
git clone https://github.com/NVIDIA/cutlass.git 3rdparty/cutlass

# build kernels for ShadowKV
python setup.py build_ext --inplace

Supported Models

Currently, we support the following LLMs:

Llama-3-8B-1M: gradientai/Llama-3-8B-Instruct-Gradient-1048k
GLM-4-9B-1M: THUDM/glm-4-9b-chat-1m
Llama-3.1-8B: meta-llama/Meta-Llama-3.1-8B-Instruct
Yi-9B-200K: 01-ai/Yi-9B-200K
Phi-3-Mini-128K: microsoft/Phi-3-mini-128k-instruct (only NIAH test supported)
Qwen2-7B-128K: Qwen/Qwen2-7B-Instruct (only NIAH test supported)

Accuracy Evaluations

Here we provide an example to build the dataset and run evaluation for the RULER benchmark with Llama-3-8B-1M.

Build Datasets

To build RULER dataset, please run the following command:

# build RULER
python -c "import nltk; nltk.download('punkt')"
cd data/ruler
bash create_dataset.sh "gradientai/Llama-3-8B-Instruct-Gradient-1048k" "llama-3"

Run Evaluations

For the accuracy evaluation, please run the following command with 8xA100 GPUs:

# Full attention
OMP_NUM_THREADS=48 torchrun --standalone --nnodes=1 --nproc_per_node 8 test/eval_acc.py --datalen 131072 --method full --dataset_name "ruler/niah_single_1,ruler/niah_single_2,ruler/niah_single_3,ruler/niah_multikey_1,ruler/niah_multikey_2,ruler/niah_multiquery,ruler/niah_multivalue,ruler/vt,ruler/fwe,ruler/qa_1,ruler/qa_2" --model_name "gradientai/Llama-3-8B-Instruct-Gradient-1048k"

# ShadowKV
OMP_NUM_THREADS=48 torchrun --standalone --nnodes=1 --nproc_per_node 8 test/eval_acc.py --datalen 131072 --method shadowkv --dataset_name "ruler/niah_single_1,ruler/niah_single_2,ruler/niah_single_3,ruler/niah_multikey_1,ruler/niah_multikey_2,ruler/niah_multiquery,ruler/niah_multivalue,ruler/vt,ruler/fwe,ruler/qa_1,ruler/qa_2" --sparse_budget 2048 --rank 160 --chunk_size 8

Compatibility with MInference

ShadowKV is compatible with pre-filling acceleration techniques, such as MInference. To enable MInference, please add the --minference flag to the command. For example:

# Full attention with MInference
OMP_NUM_THREADS=48 torchrun --standalone --nnodes=1 --nproc_per_node 8 test/eval_acc.py --datalen 131072 --method full --dataset_name "ruler/niah_single_1,ruler/niah_single_2,ruler/niah_single_3,ruler/niah_multikey_1,ruler/niah_multikey_2,ruler/niah_multiquery,ruler/niah_multivalue,ruler/vt,ruler/fwe,ruler/qa_1,ruler/qa_2" --minference

# ShadowKV with MInference
OMP_NUM_THREADS=48 torchrun --standalone --nnodes=1 --nproc_per_node 8 test/eval_acc.py --datalen 131072 --method shadowkv --dataset_name "ruler/niah_single_1,ruler/niah_single_2,ruler/niah_single_3,ruler/niah_multikey_1,ruler/niah_multikey_2,ruler/niah_multiquery,ruler/niah_multivalue,ruler/vt,ruler/fwe,ruler/qa_1,ruler/qa_2" --sparse_budget 2048 --rank 160 --chunk_size 8 --minference

Efficiency Evaluations

For the efficiency evaluation, please run the following command with a single A100 GPU:

python test/e2e.py --model_name "meta-llama/Meta-Llama-3.1-8B-Instruct" --datalen "122k"

Citation

If you find ShadowKV useful or relevant to your project and research, please kindly cite our paper:

@article{sun2024shadowkv,
  title={ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference},
  author={Sun, Hanshi and Chang, Li-Wen and Bao, Wenlei and Zheng, Size and Zheng, Ningxin and Liu, Xin and Dong, Harry and Chi, Yuejie and Chen, Beidi},
  journal={arXiv preprint arXiv:2410.21465},
  year={2024}
}

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
.github/ISSUE_TEMPLATE		.github/ISSUE_TEMPLATE
data		data
kernels		kernels
models		models
scripts		scripts
static		static
test		test
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
index.html		index.html
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference

Environment Set Up

Supported Models

Accuracy Evaluations

Build Datasets

Run Evaluations

Compatibility with MInference

Efficiency Evaluations

Citation

About

Releases

Packages

Contributors 2

Languages

License

bytedance/ShadowKV

Folders and files

Latest commit

History

Repository files navigation

ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference

Environment Set Up

Supported Models

Accuracy Evaluations

Build Datasets

Run Evaluations

Compatibility with MInference

Efficiency Evaluations

Citation

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages