Efficient LLM Scheduling by Learning to Rank [paper]

vllm-ltr is an efficient serving system that approximates Shortest Job First (SJF) scheduling using learning to rank.

Motivation

Most Large Language Model (LLM) serving systems schedule requests First-Come-First-Serve (FCFS) because request output lengths are hard to predict, which causes Head-Of-Line (HOL) blocking and degrades performance. While predicting exact output lengths is difficult, we show that requests can be ranked by their relative output lengths using learning to rank, and that this ranking enables more efficient scheduling. Our scheduler better approximates SJF than traditional methods, yielding substantial performance gains, such as a 2.8x reduction in latency for chatbot serving and a 6.5x increase in throughput for synthetic data generation.
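The idea can be illustrated with a toy sketch (an illustration only, not the actual vllm-ltr scheduler): a learned ranker assigns each waiting request a score that correlates with its expected output length, and the scheduler always serves the request with the lowest score first, approximating SJF. Here the predictor is a placeholder that simply uses prompt length; in vllm-ltr the scores come from a model fine-tuned with a learning-to-rank objective.

# Toy illustration of rank-based scheduling (not the vllm-ltr implementation).
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    score: float                       # predicted rank score (proxy for output length)
    prompt: str = field(compare=False) # payload, excluded from ordering

def predict_rank_score(prompt: str) -> float:
    # Placeholder predictor: stands in for a learned ranker whose scores
    # correlate with expected output length ("lower" means "likely shorter").
    return float(len(prompt))

waiting_queue: list[Request] = []

def enqueue(prompt: str) -> None:
    heapq.heappush(waiting_queue, Request(predict_rank_score(prompt), prompt))

def schedule_next() -> Request:
    # Serve the request predicted to finish soonest, approximating SJF
    # and avoiding HOL blocking behind long generations.
    return heapq.heappop(waiting_queue)

if __name__ == "__main__":
    for p in ["Write a haiku.", "Summarize this 10-page report in detail...", "Hi"]:
        enqueue(p)
    print(schedule_next().prompt)  # the request predicted to be shortest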

Installation

vllm-ltr is built on vLLM for inference and allRank for training. To install our modified version, follow these steps:

conda create -n vllm-ltr python=3.10
conda activate vllm-ltr
git clone https://github.com/hao-ai-lab/vllm-ltr.git
cd vllm-ltr
conda install pytorch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 pytorch-cuda=12.1 -c pytorch -c nvidia  # install PyTorch matching your CUDA version
pip install -e .  # install vllm-ltr from source
pip install flash-attn torchaudio==2.2.1 torchvision==0.17.1 numpy==1.25.2 fschat accelerate gcsfs scikit-learn scipy matplotlib evaluate  # extra dependencies
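After installation, a quick sanity check (assuming a CUDA-capable GPU and the activated conda environment) is to confirm the package imports:

python -c "import vllm; print(vllm.__version__)"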

Reproduce Results

For predictor training, refer to the ./train directory, and for end-to-end evaluation, check the ./benchmarks directory.

Fine-tuned predictors are available on Hugging Face.
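As a rough sketch of how a downloaded predictor could be used to score prompts (the model id, the sequence-classification head, and the score interpretation below are assumptions for illustration; see ./train for the actual training setup and checkpoint format):

# Hedged sketch: load a length predictor and score prompts with it.
# "your-org/your-length-predictor" is a placeholder, not a real model id.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "your-org/your-length-predictor"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=1)
model.eval()

prompts = ["Explain transformers in one sentence.", "Write a 2000-word essay on LLM serving."]
inputs = tokenizer(prompts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    scores = model(**inputs).logits.squeeze(-1)  # assumed: higher score ~ longer expected output
print(scores.tolist())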

Citation

@article{fu2024efficient,
  title={Efficient LLM Scheduling by Learning to Rank},
  author={Fu, Yichao and Zhu, Siqi and Su, Runlong and Qiao, Aurick and Stoica, Ion and Zhang, Hao},
  journal={arXiv preprint arXiv:2408.15792},
  year={2024}
}
