
Awesome Resource-Efficient LLM Papers

WORK IN PROGRESS: A curated list of high-quality papers on resource-efficient LLMs.

This is the GitHub repo for our survey paper Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models.

Table of Contents

  • LLM Architecture Design
  • LLM Pre-Training
  • LLM Fine-Tuning
  • LLM Inference
  • System Design
  • Resource-Efficiency Evaluation Metrics & Benchmarks
  • Reference

Papers are listed in the following format:

| Date    | Keywords     | Institute | Paper                     | Publication |
| ------- | ------------ | --------- | ------------------------- | ----------- |
| 2017-06 | Transformers | Google    | Attention Is All You Need | NeurIPS     |

LLM Architecture Design

Efficient Transformer Architecture

  • Example - Description of an example paper.

Non-transformer Architecture

| Date | Keywords | Paper | Venue |
| ---- | -------- | ----- | ----- |
| 2017 | Mixture of Experts | Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer | ICLR |
| 2022 | Mixture of Experts | Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity | JMLR |
| 2022 | Mixture of Experts | GLaM: Efficient Scaling of Language Models with Mixture-of-Experts | ICML |
| 2022 | Mixture of Experts | Mixture-of-Experts with Expert Choice Routing | NeurIPS |
| 2022 | Mixture of Experts | Efficient Large Scale Language Modeling with Mixtures of Experts | EMNLP |
| 2023 | RNN LM | RWKV: Reinventing RNNs for the Transformer Era | EMNLP-Findings |

LLM Pre-Training

Memory Efficiency

  • Example - Description of an example paper.

Data Efficiency

  • Example - Description of an example paper.

LLM Fine-Tuning

Parameter-Efficient Fine-Tuning

| Date | Keywords | Paper | Venue |
| ---- | -------- | ----- | ----- |
| 2019 | Masking-based fine-tuning | SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization | ACL |
| 2021 | Masking-based fine-tuning | BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models | ACL |
| 2021 | Masking-based fine-tuning | Raise a Child in Large Language Model: Towards Effective and Generalizable Fine-tuning | EMNLP |
| 2021 | Masking-based fine-tuning | Unlearning Bias in Language Models by Partitioning Gradients | ACL |
| 2022 | Masking-based fine-tuning | Fine-Tuning Pre-Trained Language Models Effectively by Optimizing Subnetworks Adaptively | NeurIPS |

Full-Parameter Fine-Tuning

  • Example - Description of an example paper.

LLM Inference

Model Compression

Pruning

| Date | Keywords | Paper | Venue |
| ---- | -------- | ----- | ----- |
| 2023 | Unstructured Pruning | SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot | ICML |
| 2023 | Unstructured Pruning | A Simple and Effective Pruning Approach for Large Language Models | ICLR |
| 2023 | Unstructured Pruning | AccelTran: A Sparsity-Aware Accelerator for Dynamic Inference With Transformers | TCAD |
| 2023 | Structured Pruning | LLM-Pruner: On the Structural Pruning of Large Language Models | NeurIPS |
| 2023 | Structured Pruning | LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation | ICML |
| 2023 | Structured Pruning | Structured Pruning for Efficient Generative Pre-trained Language Models | ACL |
| 2023 | Structured Pruning | ZipLM: Inference-Aware Structured Pruning of Language Models | NeurIPS |
| 2023 | Contextual Pruning | Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time | ICML |

Dynamic Acceleration

Input Pruning

| Date | Keywords | Paper | Venue |
| ---- | -------- | ----- | ----- |
| 2021 | Score-based Token Removal | Efficient Sparse Attention Architecture with Cascade Token and Head Pruning | HPCA |
| 2022 | Score-based Token Removal | Learned Token Pruning for Transformers | KDD |
| 2023 | Score-based Token Removal | Constraint-aware and Ranking-distilled Token Pruning for Efficient Transformer Inference | KDD |
| 2021 | Learning-based Token Removal | TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference | NAACL |
| 2022 | Learning-based Token Removal | Transkimmer: Transformer Learns to Layer-wise Skim | ACL |
| 2023 | Learning-based Token Removal | PuMer: Pruning and Merging Tokens for Efficient Vision Language Models | ACL |
| 2023 | Learning-based Token Removal | Infor-Coef: Information Bottleneck-based Dynamic Token Downsampling for Compact and Efficient Language Model | arXiv |
| 2023 | Learning-based Token Removal | SmartTrim: Adaptive Tokens and Parameters Pruning for Efficient Vision-Language Models | arXiv |

System Design

Hardware Offloading

  • Example - Description of an example paper.

Collaborative Inference

  • Example - Description of an example paper.

Libraries

  • Example - Description of an example paper.

Edge Devices

  • Example - Description of an example paper.

Other Systems

  • Example - Description of an example paper.

Resource-Efficiency Evaluation Metrics & Benchmarks

🧮 Computation Metrics

| Metric | Description | Example Usage |
| ------ | ----------- | ------------- |
| FLOPs (Floating-point operations) | the number of arithmetic operations on floating-point numbers | [FLOPs] |
| Training Time | the total duration required for training, typically measured in wall-clock minutes, hours, or days | [minutes, days], [hours] |
| Inference Time/Latency | the average time required to generate an output after receiving an input, typically measured in wall-clock time or CPU/GPU/TPU clock time in milliseconds or seconds | [end-to-end latency in seconds], [next-token generation latency in milliseconds] |
| Throughput | the rate at which output tokens are generated or tasks are completed, typically measured in tokens per second (TPS) or queries per second (QPS) | [tokens/s], [queries/s] |
| Speed-Up Ratio | the improvement in inference speed compared to a baseline model | [inference time speed-up], [throughput speed-up] |
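To make the timing metrics concrete, below is a minimal, framework-agnostic sketch of how latency, throughput, and speed-up ratio are typically measured with wall-clock timers. `generate` is a hypothetical callable (not from any particular library) that takes a prompt and returns the generated tokens.

```python
import time

def measure_latency_and_throughput(generate, prompts, n_warmup=2):
    """Time a generation callable and report mean end-to-end latency and token throughput.

    `generate` is a hypothetical callable: prompt string in, list of output tokens out.
    """
    # Warm-up runs so one-time costs (e.g. CUDA context creation, cache population)
    # do not skew the measurements.
    for prompt in prompts[:n_warmup]:
        generate(prompt)

    latencies, total_tokens = [], 0
    for prompt in prompts:
        start = time.perf_counter()
        tokens = generate(prompt)
        latencies.append(time.perf_counter() - start)  # wall-clock seconds for this request
        total_tokens += len(tokens)

    return {
        "mean_latency_s": sum(latencies) / len(latencies),      # end-to-end latency in seconds
        "throughput_tok_per_s": total_tokens / sum(latencies),  # tokens/s
    }

def speed_up_ratio(baseline_latency_s, optimized_latency_s):
    """Speed-up ratio: baseline inference time divided by the optimized model's inference time."""
    return baseline_latency_s / optimized_latency_s
```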

💾 Memory Metrics

| Metric | Description | Example Usage |
| ------ | ----------- | ------------- |
| Number of Parameters | the number of adjustable variables in the LLM's neural network | [number of parameters] |
| Model Size | the storage space required for storing the entire model | [peak memory usage in GB] |
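As an illustration (assuming a PyTorch model; the toy module below stands in for a real LLM loaded from a checkpoint), the parameter count and weight storage can be computed as follows. Note that peak runtime memory will be higher than weight storage alone because of activations and optimizer state.

```python
import torch.nn as nn

def parameter_count(model: nn.Module) -> int:
    """Number of adjustable variables (parameters) in the network."""
    return sum(p.numel() for p in model.parameters())

def weight_storage_gb(model: nn.Module) -> float:
    """Storage needed for the model weights alone, in GB."""
    total_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    return total_bytes / 1024**3

toy = nn.Linear(4096, 4096)              # stand-in for a real LLM
print(parameter_count(toy))              # 16781312 parameters
print(round(weight_storage_gb(toy), 4))  # ~0.0625 GB at fp32
```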

⚡️ Energy Metrics

| Metric | Description | Example Usage |
| ------ | ----------- | ------------- |
| Energy Consumption | the electrical energy used during the LLM's lifecycle | [kWh] |
| Carbon Emission | the greenhouse gas emissions associated with the model's energy usage | [kgCO2eq] |

Several software packages are available for real-time tracking of energy consumption and carbon emission. You might also find the following helpful for predicting the energy usage and carbon footprint before actual training:

  • ML CO2 Impact - a web-based tool that estimates the carbon emission of a model by estimating the electricity consumption of the training procedure.
  • LLMCarbon - a modeling tool that projects the end-to-end carbon footprint of an LLM (covering training, inference, experimentation, and storage) before the model is actually built.
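As an example of how a real-time tracker is wired into a workload, here is a minimal sketch using CodeCarbon's EmissionsTracker, one commonly used open-source package for this purpose; `train_model` is a placeholder for the actual training or inference job.

```python
from codecarbon import EmissionsTracker  # pip install codecarbon

def train_model():
    """Placeholder for the actual training or inference workload."""
    pass

tracker = EmissionsTracker(project_name="llm-finetuning")
tracker.start()
try:
    train_model()
finally:
    emissions_kg = tracker.stop()  # estimated emissions for the tracked interval, in kgCO2eq
    print(f"Estimated emissions: {emissions_kg:.6f} kgCO2eq")
```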

💵 Financial Cost Metric

| Metric | Description | Example Usage |
| ------ | ----------- | ------------- |
| Dollars per parameter | the total cost of training (or running) the LLM divided by the number of parameters | |

📨 Network Communication Metric

| Metric | Description | Example Usage |
| ------ | ----------- | ------------- |
| Communication Volume | the total amount of data transmitted across the network during a specific LLM execution or training run | [communication volume in TB] |

💡 Other Metrics

| Metric | Description | Example Usage |
| ------ | ----------- | ------------- |
| Compression Ratio | the reduction in size of the compressed model compared to the original model | [compress rate], [percentage of weights remaining] |
| Loyalty/Fidelity | the resemblance between the teacher and student models, measured by the consistency of their predictions and the alignment of their predicted probability distributions | [loyalty], [fidelity] |
| Robustness | the resistance to adversarial attacks, where slight input modifications can potentially manipulate the model's output | [after-attack accuracy, query number] |
| Pareto Optimality | the optimal trade-offs between various competing factors | [Pareto frontier (cost and accuracy)], [Pareto frontier (performance and FLOPs)] |
  • Example - Description of an example paper.
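Since compression ratio is reported either as original size over compressed size or as the percentage of weights remaining, here is a small sketch of both conventions; the 7B/3.5B parameter counts are purely illustrative.

```python
def compression_ratio(original_params: int, compressed_params: int) -> float:
    """Original model size divided by compressed model size (higher means more compression)."""
    return original_params / compressed_params

def weights_remaining_pct(original_params: int, compressed_params: int) -> float:
    """Percentage of weights remaining after compression/pruning."""
    return 100.0 * compressed_params / original_params

# Illustrative only: pruning a 7B-parameter model down to 3.5B parameters.
print(compression_ratio(7_000_000_000, 3_500_000_000))      # 2.0
print(weights_remaining_pct(7_000_000_000, 3_500_000_000))  # 50.0
```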

Reference

If you find this paper list useful in your research, please consider citing:

@article{bai2024beyond,
  title={Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models},
  author={Bai, Guangji and Chai, Zheng and Ling, Chen and Wang, Shiyu and Lu, Jiaying and Zhang, Nan and Shi, Tingwei and Yu, Ziyang and Zhu, Mengdan and Zhang, Yifei and others},
  journal={arXiv preprint arXiv:2401.00625},
  year={2024}
}
