This is the GitHub repo for our survey paper Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models.
- LLM Architecture Design
- LLM Pre-Training
- LLM Fine-Tuning
- LLM Inference
- System Design
- LLM Resource Efficiency Leaderboards
Date | Keywords | Institute | Paper | Publication |
---|---|---|---|---|
2017-06 | Transformers | Google | Attention Is All You Need | NeurIPS |
Date | Keywords | Paper | Venue |
---|---|---|---|
2017 | Mixture of Experts | Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer | ICLR |
2022 | Mixture of Experts | Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity | JMLR |
2022 | Mixture of Experts | GLaM: Efficient Scaling of Language Models with Mixture-of-Experts | ICML |
2022 | Mixture of Experts | Mixture-of-Experts with Expert Choice Routing | NeurIPS |
2022 | Mixture of Experts | Efficient Large Scale Language Modeling with Mixtures of Experts | EMNLP |
2023 | RNN LM | RWKV: Reinventing RNNs for the Transformer Era | EMNLP-Findings |
Date | Keywords | Paper | Venue |
---|---|---|---|
2019 | Masking-based fine-tuning | SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization | ACL |
2021 | Masking-based fine-tuning | BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models | ACL |
2021 | Masking-based fine-tuning | Raise a Child in Large Language Model: Towards Effective and Generalizable Fine-tuning | EMNLP |
2021 | Masking-based fine-tuning | Unlearning Bias in Language Models by Partitioning Gradients | ACL |
2022 | Masking-based fine-tuning | Fine-Tuning Pre-Trained Language Models Effectively by Optimizing Subnetworks Adaptively | NeurIPS |
Date | Keywords | Paper | Venue |
---|---|---|---|
2023 | Unstructured Pruning | SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot | ICML |
2023 | Unstructured Pruning | A Simple and Effective Pruning Approach for Large Language Models | ICLR |
2023 | Unstructured Pruning | AccelTran: A Sparsity-Aware Accelerator for Dynamic Inference With Transformers | TCAD |
2023 | Structured Pruning | LLM-Pruner: On the Structural Pruning of Large Language Models | NeurIPS |
2023 | Structured Pruning | LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation | ICML |
2023 | Structured Pruning | Structured Pruning for Efficient Generative Pre-trained Language Models | ACL |
2023 | Structured Pruning | ZipLM: Inference-Aware Structured Pruning of Language Models | NeurIPS |
2023 | Contextual Pruning | Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time | ICML |
Date | Keywords | Paper | Venue |
---|---|---|---|
2021 | Score-based Token Removal | Efficient Sparse Attention Architecture with Cascade Token and Head Pruning | HPCA |
2022 | Score-based Token Removal | Learned Token Pruning for Transformers | KDD |
2023 | Score-based Token Removal | Constraint-aware and Ranking-distilled Token Pruning for Efficient Transformer Inference | KDD |
2021 | Learning-based Token Removal | TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference | NAACL |
2022 | Learning-based Token Removal | Transkimmer: Transformer Learns to Layer-wise Skim | ACL |
2023 | Learning-based Token Removal | PuMer: Pruning and Merging Tokens for Efficient Vision Language Models | ACL |
2023 | Learning-based Token Removal | Infor-Coef: Information Bottleneck-based Dynamic Token Downsampling for Compact and Efficient language model | arXiv |
2023 | Learning-based Token Removal | SmartTrim: Adaptive Tokens and Parameters Pruning for Efficient Vision-Language Models | arXiv |
Metric | Description | Example Usage |
---|---|---|
FLOPs (Floating-point operations) | the number of arithmetic operations on floating-point numbers | [FLOPs] |
Training Time | the total duration required for training, typically measured in wall-clock minutes, hours, or days | [minutes, days] [hours] |
Inference Time/Latency | the average time required to generate an output after receiving an input, typically measured in wall-clock time or CPU/GPU/TPU clock time in milliseconds or seconds | [end-to-end latency in seconds] [next token generation latency in milliseconds] |
Throughput | the rate of output token generation or task completion, typically measured in tokens per second (TPS) or queries per second (QPS) | [tokens/s] [queries/s] |
Speed-Up Ratio | the improvement in inference speed compared to a baseline model | [inference time speed-up] [throughput speed-up] |
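To make the computation metrics above concrete, here is a minimal, framework-agnostic sketch of how end-to-end latency, throughput, and a speed-up ratio are typically measured; `generate_fn` and the prompt are hypothetical placeholders for your own model call, not part of any specific framework.

```python
import time

def time_generation(generate_fn, prompt, n_runs=10):
    """Average end-to-end latency (seconds) and throughput (tokens/s).

    `generate_fn` is a hypothetical callable returning the generated
    token ids for a prompt; substitute your own model's generate call.
    """
    latencies, total_tokens = [], 0
    for _ in range(n_runs):
        start = time.perf_counter()
        output_tokens = generate_fn(prompt)
        latencies.append(time.perf_counter() - start)
        total_tokens += len(output_tokens)
    avg_latency = sum(latencies) / n_runs        # end-to-end latency in seconds
    throughput = total_tokens / sum(latencies)   # output tokens per second
    return avg_latency, throughput

# Speed-up ratio relative to a baseline model:
#   speed_up = baseline_latency / optimized_latency
```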
Metric | Description | Example Usage |
---|---|---|
Number of Parameters | the number of adjustable variables in the LLM’s neural network | [number of parameters] |
Model Size | the storage space required for storing the entire model | [peak memory usage in GB] |
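As a quick illustration of the memory metrics above, the PyTorch-style sketch below counts parameters and approximates model size from tensor element counts and dtypes; the small `torch.nn.Linear` module stands in for a real LLM.

```python
import torch

def parameter_count(model: torch.nn.Module) -> int:
    """Total number of parameters (trainable and frozen)."""
    return sum(p.numel() for p in model.parameters())

def model_size_gb(model: torch.nn.Module) -> float:
    """Approximate storage size in GB: element count times bytes per element."""
    total_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    return total_bytes / 1024 ** 3

# Small stand-in module just for demonstration:
model = torch.nn.Linear(4096, 4096)
print(parameter_count(model), f"{model_size_gb(model):.3f} GB")
```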
Metric | Description | Example Usage |
---|---|---|
Energy Consumption | the electrical power used during the LLM’s lifecycle | [kWh] |
Carbon Emission | the greenhouse gas emissions associated with the model’s energy usage | [kgCO2eq] |
The following software packages are designed for real-time tracking of energy consumption and carbon emissions; a minimal CodeCarbon usage sketch follows the list.
- CodeCarbon - a lightweight Python-compatible package that quantifies the carbon dioxide emissions generated by computing resources and provides methods for reducing the environmental impact.
- Carbontracker - a Python package that tracks and predicts the energy consumption and carbon footprint of training deep learning models.
- experiment-impact-tracker - a framework that logs energy usage, carbon emissions, and compute information for machine learning experiments.
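For orientation, the sketch below wraps CodeCarbon's `EmissionsTracker` start/stop pattern around a stand-in workload; the project name and the loop body are placeholders, and the reported figure is an estimate in kgCO2eq.

```python
from codecarbon import EmissionsTracker  # pip install codecarbon

tracker = EmissionsTracker(project_name="llm-finetune-demo")
tracker.start()
try:
    # Placeholder for the actual training or inference workload.
    for step in range(1000):
        pass
finally:
    emissions_kg = tracker.stop()  # estimated emissions in kgCO2eq

print(f"Estimated emissions: {emissions_kg:.6f} kgCO2eq")
```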
You might also find the following helpful for predicting the energy usage and carbon footprint before actual training or deployment; a back-of-the-envelope estimate is sketched after the list.
- ML CO2 Impact - a web-based tool that estimates the carbon emission of a model by estimating the electricity consumption of the training procedure.
- LLMCarbon - a modeling tool that predicts the end-to-end carbon footprint of large language models, covering both operational and embodied emissions, before training.
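For intuition about what such estimators compute, the sketch below does the standard back-of-the-envelope calculation (energy = device power × device count × time × PUE, emissions = energy × grid carbon intensity); every input value is an illustrative assumption, not a measurement.

```python
# All inputs below are illustrative assumptions, not measurements.
gpu_power_kw = 0.4       # average draw per GPU in kW (~400 W)
num_gpus = 8
hours = 72               # wall-clock training time
pue = 1.1                # data-center power usage effectiveness
carbon_intensity = 0.4   # kgCO2eq per kWh for the local grid

energy_kwh = gpu_power_kw * num_gpus * hours * pue
emissions_kg = energy_kwh * carbon_intensity
print(f"{energy_kwh:.1f} kWh, {emissions_kg:.1f} kgCO2eq")
```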
Metric | Description | Example Usage |
---|---|---|
Dollars per parameter | the total cost of training (or running) the LLM divided by its number of parameters | |
Metric | Description | Example Usage |
---|---|---|
Communication Volume | the total amount of data transmitted across the network during a specific LLM execution or training run | [communication volume in TB] |
Metric | Description | Example Usage |
---|---|---|
Compression Ratio | the size of the original model relative to that of the compressed model | [compression rate] [percentage of weights remaining] |
Loyalty/Fidelity | the resemblance between the teacher and student models in terms of prediction consistency and the alignment of their predicted probability distributions | [loyalty] [fidelity] |
Robustness | the resistance to adversarial attacks, where slight input modifications can potentially manipulate the model's output | [after-attack accuracy, query number] |
Pareto Optimality | the optimal trade-offs between various competing factors | [Pareto frontier (cost and accuracy)] [Pareto frontier (performance and FLOPs)] |
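The compression entries above reduce to simple ratios; the sketch below computes a compression ratio and the percentage of weights remaining after pruning, assuming PyTorch modules and treating zeroed weights as removed (unstructured pruning).

```python
import torch

def compression_ratio(original: torch.nn.Module, compressed: torch.nn.Module) -> float:
    """Original parameter count divided by the compressed model's count."""
    n_orig = sum(p.numel() for p in original.parameters())
    n_comp = sum(p.numel() for p in compressed.parameters())
    return n_orig / n_comp

def weights_remaining_pct(model: torch.nn.Module) -> float:
    """Percentage of nonzero weights left after zeroing-based pruning."""
    total = sum(p.numel() for p in model.parameters())
    nonzero = sum(int(p.count_nonzero()) for p in model.parameters())
    return 100.0 * nonzero / total
```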
If you find this paper list useful in your research, please consider citing:
@article{bai2024beyond,
title={Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models},
author={Bai, Guangji and Chai, Zheng and Ling, Chen and Wang, Shiyu and Lu, Jiaying and Zhang, Nan and Shi, Tingwei and Yu, Ziyang and Zhu, Mengdan and Zhang, Yifei and others},
journal={arXiv preprint arXiv:2401.00625},
year={2024}
}