This is the GitHub repo for our survey paper Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models.
- LLM Architecture Design
- LLM Pre-Training
- LLM Fine-Tuning
- LLM Inference
- System Design
- LLM Resource Efficiency Leaderboards
Date | Keywords | Institute | Paper | Publication |
---|---|---|---|---|
2017-06 | Transformers | Google | Attention Is All You Need | NeurIPS |
Date | Keywords | Paper | Venue |
---|---|---|---|
2017 | Mixture of Experts | Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer | ICLR |
2022 | Mixture of Experts | Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity | JMLR |
2022 | Mixture of Experts | GLaM: Efficient Scaling of Language Models with Mixture-of-Experts | ICML |
2022 | Mixture of Experts | Mixture-of-Experts with Expert Choice Routing | NeurIPS |
2022 | Mixture of Experts | Efficient Large Scale Language Modeling with Mixtures of Experts | EMNLP |
2023 | RNN LM | RWKV: Reinventing RNNs for the Transformer Era | EMNLP-Findings |
Date | Keywords | Paper | Venue |
---|---|---|---|
2019 | Masking-based fine-tuning | SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization | ACL |
2021 | Masking-based fine-tuning | BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models | ACL |
2021 | Masking-based fine-tuning | Raise a Child in Large Language Model: Towards Effective and Generalizable Fine-tuning | EMNLP |
2021 | Masking-based fine-tuning | Unlearning Bias in Language Models by Partitioning Gradients | ACL |
2022 | Masking-based fine-tuning | Fine-Tuning Pre-Trained Language Models Effectively by Optimizing Subnetworks Adaptively | NeurIPS |
Date | Keywords | Paper | Venue |
---|---|---|---|
2023 | Unstructured Pruning | SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot | ICML |
2023 | Unstructured Pruning | A Simple and Effective Pruning Approach for Large Language Models | ICLR |
2023 | Unstructured Pruning | AccelTran: A Sparsity-Aware Accelerator for Dynamic Inference With Transformers | TCAD |
2023 | Structured Pruning | LLM-Pruner: On the Structural Pruning of Large Language Models | NeurIPS |
2023 | Structured Pruning | LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation | ICML |
2023 | Structured Pruning | Structured Pruning for Efficient Generative Pre-trained Language Models | ACL |
2023 | Structured Pruning | ZipLM: Inference-Aware Structured Pruning of Language Models | NeurIPS |
2023 | Contextual Pruning | Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time | ICML |
Date | Keywords | Paper | Venue |
---|---|---|---|
2021 | Score-based Token Removal | Efficient Sparse Attention Architecture with Cascade Token and Head Pruning | HPCA |
2022 | Score-based Token Removal | Learned Token Pruning for Transformers | KDD |
2023 | Score-based Token Removal | Constraint-aware and Ranking-distilled Token Pruning for Efficient Transformer Inference | KDD |
2021 | Learning-based Token Removal | TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference | NAACL |
2022 | Learning-based Token Removal | Transkimmer: Transformer Learns to Layer-wise Skim | ACL |
2023 | Learning-based Token Removal | PuMer: Pruning and Merging Tokens for Efficient Vision Language Models | ACL |
2023 | Learning-based Token Removal | Infor-Coef: Information Bottleneck-based Dynamic Token Downsampling for Compact and Efficient language model | arXiv |
2023 | Learning-based Token Removal | SmartTrim: Adaptive Tokens and Parameters Pruning for Efficient Vision-Language Models | arXiv |
Metric | Description | Example Usage |
---|---|---|
FLOPs (Floating-point operations) | the number of arithmetic operations on floating-point numbers | [FLOPs] |
Training Time | the total duration required for training, typically measured in wall-clock minutes, hours, or days | [minutes, days] [hours] |
Inference Time/Latency | the average time required to generate an output after receiving an input, typically measured in wall-clock time or CPU/GPU/TPU clock time in milliseconds or seconds | [end-to-end latency in seconds] [next token generation latency in milliseconds] |
Throughput | the rate of output token generation or task completion, typically measured in tokens per second (TPS) or queries per second (QPS) | [tokens/s] [queries/s] |
Speed-Up Ratio | the improvement in inference speed compared to a baseline model | [inference time speed-up] [throughput speed-up] |
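To make the computation metrics above concrete, here is a minimal, framework-agnostic sketch of how end-to-end latency, throughput, and a speed-up ratio are typically measured; `generate_fn` and the prompt are hypothetical placeholders for your own model call, not part of any specific framework.

```python
import time

def time_generation(generate_fn, prompt, n_runs=10):
    """Average end-to-end latency (seconds) and throughput (tokens/s).

    `generate_fn` is a hypothetical callable returning the generated
    token ids for a prompt; substitute your own model's generate call.
    """
    latencies, total_tokens = [], 0
    for _ in range(n_runs):
        start = time.perf_counter()
        output_tokens = generate_fn(prompt)
        latencies.append(time.perf_counter() - start)
        total_tokens += len(output_tokens)
    avg_latency = sum(latencies) / n_runs        # end-to-end latency in seconds
    throughput = total_tokens / sum(latencies)   # output tokens per second
    return avg_latency, throughput

# Speed-up ratio relative to a baseline model:
#   speed_up = baseline_latency / optimized_latency
```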
Metric | Description | Example Usage |
---|---|---|
Number of Parameters | the number of adjustable variables in the LLM’s neural network | [number of parameters] |
Model Size | the storage space required for storing the entire model | [peak memory usage in GB] |
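As a quick illustration of the memory metrics above, the PyTorch-style sketch below counts parameters and approximates model size from tensor element counts and dtypes; the small `torch.nn.Linear` module stands in for a real LLM.

```python
import torch

def parameter_count(model: torch.nn.Module) -> int:
    """Total number of parameters (trainable and frozen)."""
    return sum(p.numel() for p in model.parameters())

def model_size_gb(model: torch.nn.Module) -> float:
    """Approximate storage size in GB: element count times bytes per element."""
    total_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    return total_bytes / 1024 ** 3

# Small stand-in module just for demonstration:
model = torch.nn.Linear(4096, 4096)
print(parameter_count(model), f"{model_size_gb(model):.3f} GB")
```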
Metric | Description | Example Usage |
---|---|---|
Energy Consumption | the electrical power used during the LLM’s lifecycle | [kWh] |
Carbon Emission | the greenhouse gas emissions associated with the model’s energy usage | [kgCO2eq] |
The following software packages are designed for real-time tracking of energy consumption and carbon emissions; a minimal CodeCarbon usage sketch follows the list.
- CodeCarbon - a lightweight Python-compatible package that quantifies the carbon dioxide emissions generated by computing resources and provides methods for reducing the environmental impact.
- Carbontracker - a Python package that tracks and predicts the energy consumption and carbon footprint of training deep learning models.
- experiment-impact-tracker - a framework that logs energy usage, carbon emissions, and compute information for machine learning experiments.
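For orientation, the sketch below wraps CodeCarbon's `EmissionsTracker` start/stop pattern around a stand-in workload; the project name and the loop body are placeholders, and the reported figure is an estimate in kgCO2eq.

```python
from codecarbon import EmissionsTracker  # pip install codecarbon

tracker = EmissionsTracker(project_name="llm-finetune-demo")
tracker.start()
try:
    # Placeholder for the actual training or inference workload.
    for step in range(1000):
        pass
finally:
    emissions_kg = tracker.stop()  # estimated emissions in kgCO2eq

print(f"Estimated emissions: {emissions_kg:.6f} kgCO2eq")
```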
You might also find the following helpful for predicting the energy usage and carbon footprint before actual training or deployment; a back-of-the-envelope estimate is sketched after the list.
- ML CO2 Impact - a web-based tool that estimates the carbon emission of a model by estimating the electricity consumption of the training procedure.
- LLMCarbon - a modeling tool that predicts the end-to-end carbon footprint of large language models, covering both operational and embodied emissions, before training.
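For intuition about what such estimators compute, the sketch below does the standard back-of-the-envelope calculation (energy = device power × device count × time × PUE, emissions = energy × grid carbon intensity); every input value is an illustrative assumption, not a measurement.

```python
# All inputs below are illustrative assumptions, not measurements.
gpu_power_kw = 0.4       # average draw per GPU in kW (~400 W)
num_gpus = 8
hours = 72               # wall-clock training time
pue = 1.1                # data-center power usage effectiveness
carbon_intensity = 0.4   # kgCO2eq per kWh for the local grid

energy_kwh = gpu_power_kw * num_gpus * hours * pue
emissions_kg = energy_kwh * carbon_intensity
print(f"{energy_kwh:.1f} kWh, {emissions_kg:.1f} kgCO2eq")
```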
Metric | Description | Example Usage |
---|---|---|
Dollars per parameter | the total cost of training (or running) the LLM divided by its number of parameters | |
Metric | Description | Example Usage |
---|---|---|
Communication Volume | the total amount of data transmitted across the network during a specific LLM execution or training run | [communication volume in TB] |
Metric | Description | Example Usage |
---|---|---|
Compression Ratio | the size of the original model relative to that of the compressed model | [compression rate] [percentage of weights remaining] |
Loyalty/Fidelity | the resemblance between the teacher and student models in terms of prediction consistency and the alignment of their predicted probability distributions | [loyalty] [fidelity] |
Robustness | the resistance to adversarial attacks, where slight input modifications can potentially manipulate the model's output | [after-attack accuracy, query number] |
Pareto Optimality | the optimal trade-offs between various competing factors | [Pareto frontier (cost and accuracy)] [Pareto frontier (performance and FLOPs)] |
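The compression entries above reduce to simple ratios; the sketch below computes a compression ratio and the percentage of weights remaining after pruning, assuming PyTorch modules and treating zeroed weights as removed (unstructured pruning).

```python
import torch

def compression_ratio(original: torch.nn.Module, compressed: torch.nn.Module) -> float:
    """Original parameter count divided by the compressed model's count."""
    n_orig = sum(p.numel() for p in original.parameters())
    n_comp = sum(p.numel() for p in compressed.parameters())
    return n_orig / n_comp

def weights_remaining_pct(model: torch.nn.Module) -> float:
    """Percentage of nonzero weights left after zeroing-based pruning."""
    total = sum(p.numel() for p in model.parameters())
    nonzero = sum(int(p.count_nonzero()) for p in model.parameters())
    return 100.0 * nonzero / total
```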
If you find this paper list useful in your research, please consider citing:
@article{bai2024beyond,
title={Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models},
author={Bai, Guangji and Chai, Zheng and Ling, Chen and Wang, Shiyu and Lu, Jiaying and Zhang, Nan and Shi, Tingwei and Yu, Ziyang and Zhu, Mengdan and Zhang, Yifei and others},
journal={arXiv preprint arXiv:2401.00625},
year={2024}
}