# Token Recycling ♻️

(Unofficial) implementation of the self-speculative LLM decoding method described in *Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling*.

🚀 Fast: ~2x speedup over the autoregressive baseline on Spec-Bench (A100), with ~2.5 mean accepted tokens (MAT).

🎮 Plug n Play: no training and no architecture changes.

🔮 Self-Speculative: no draft model needed.

## Installation

```shell
pip install -r requirements.txt
```

## Usage

```shell
python -m src.cli
```

or

```python
from src.token_recycling import TokenRecycling

model = TokenRecycling.from_pretrained("HuggingFaceTB/SmolLM2-135M")
output = model.generate("Your prompt here")
```
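
For intuition, here is a minimal sketch of the token-recycling idea, not this repo's internals: keep a `(vocab_size, K)` adjacency matrix whose row `t` holds the model's most recent top-K candidates for the token following `t`, draft from it, verify the draft in one forward pass, and refresh the matrix with the logits that pass produces anyway. The sketch assumes a Hugging Face causal LM and drafts a linear top-1 chain for brevity (the paper drafts a static tree over the top-K candidates); all names below are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "HuggingFaceTB/SmolLM2-135M"
model = AutoModelForCausalLM.from_pretrained(name)
tok = AutoTokenizer.from_pretrained(name)

K, DEPTH = 8, 6  # candidates kept per token, draft length
# Adjacency matrix: row t = top-K guesses for the token following t.
# All-zero rows make early drafts useless; the paper hot-starts the
# matrix and reuses it across prompts ("Cold Start" resets it instead).
adj = torch.zeros(model.config.vocab_size, K, dtype=torch.long)

ids = tok("Your prompt here", return_tensors="pt").input_ids[0]
for _ in range(32):
    # 1. Draft: walk the adjacency matrix from the last accepted token.
    draft, t = [], int(ids[-1])
    for _ in range(DEPTH):
        t = int(adj[t, 0])  # top-1 chain; the paper uses a top-K tree
        draft.append(t)

    # 2. Verify prompt + draft in a single forward pass.
    cand = torch.cat([ids, torch.tensor(draft)])
    with torch.no_grad():
        logits = model(cand.unsqueeze(0)).logits[0]

    # 3. Recycle: refresh the row of every token we just scored with
    #    the model's current top-K next-token candidates.
    adj[cand] = logits.topk(K, dim=-1).indices

    # 4. Accept the longest draft prefix that matches greedy decoding,
    #    plus the one token the model predicts after it (always >= 1).
    greedy = logits.argmax(dim=-1)  # greedy continuation at each position
    n = 0
    while n < DEPTH and int(greedy[len(ids) - 1 + n]) == draft[n]:
        n += 1
    new = draft[:n] + [int(greedy[len(ids) - 1 + n])]
    ids = torch.cat([ids, torch.tensor(new, dtype=torch.long)])

print(tok.decode(ids))
```

An end-of-sequence check, KV caching, and the paper's tree-structured draft with tree attention are omitted here to keep the sketch short.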

## Benchmarks

- Benchmark: Spec-Bench
- Device: a single NVIDIA A100 GPU (40GB) with 30 CPU cores
- Testing environment: PyTorch 2.5.1 under CUDA 12.4
- Experimental settings: greedy decoding, FP16 precision, batch size = 1
- Single run (not the average of 3 runs used by the official leaderboard)
- Cold Start means the Token Recycling adjacency matrix was reset for each prompt.
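
In terms of the sketch above, a cold start corresponds to zeroing the adjacency matrix before each prompt (e.g. `adj.zero_()`), whereas the warm variant carries over rows accumulated from earlier prompts.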

### Vicuna-7B-v1.3

> [!NOTE]
> This table only includes methods that don't require extra parameters. Methods that do, such as EAGLE and Hydra, score higher (+0.01-0.21x); refer to the official Leaderboard.

| Models | Multi-turn Conversation | Translation | Summarization | Question Answering | Mathematical Reasoning | Retrieval-aug. Generation | #Mean Accepted Tokens | Overall |
|---|---|---|---|---|---|---|---|---|
| Recycling | 2.24x | 1.87x | 2.08x | 1.99x | 2.50x | 1.80x | 2.67 | 2.08x |
| Recycling Cold Start | 2.07x | 1.30x | 2.23x | 1.70x | 2.30x | 1.95x | 2.55 | 1.93x |
| PLD | 1.56x | 1.00x | 2.54x | 1.13x | 1.55x | 1.80x | 1.75 | 1.60x |
| Lookahead | 1.45x | 1.13x | 1.31x | 1.20x | 1.50x | 1.16x | 1.64 | 1.30x |
