Releases: DefTruth/Awesome-LLM-Inference

v2.6.1

14 Oct 05:08
7ba03a6

What's Changed

  • [From Author] Link CacheGen and CacheBlend to LMCache by @KuntaiDu in #80
  • 🔥[LORC] Low-Rank Compression for LLMs' KV Cache with a Progressive Compression Strategy by @DefTruth in #81
  • Large Language Model Performance Benchmarking on Mobile Platforms: A Thorough Evaluation by @DefTruth in #82
  • [LLM Inference] LARGE LANGUAGE MODEL INFERENCE ACCELERATION: A COMPREHENSIVE HARDWARE PERSPECTIVE by @DefTruth in #83
  • 🔥[PARALLELSPEC] PARALLELSPEC: PARALLEL DRAFTER FOR EFFICIENT SPECULATIVE DECODING by @DefTruth in #84
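
The PARALLELSPEC entry above targets speculative decoding. For orientation, here is a minimal sketch of the vanilla draft-then-verify loop that parallel drafters plug into; the `target` and `draft` callables are hypothetical stand-ins, and this is not the PARALLELSPEC algorithm itself:

```python
# Vanilla greedy speculative decoding: a small draft model proposes k tokens
# sequentially, then the large target model verifies all of them in a single
# forward pass. `target`/`draft` are hypothetical callables returning
# (1, seq, vocab) logits.
import torch

@torch.no_grad()
def speculative_step(target, draft, ids, k=4):
    """Extend `ids` of shape (1, n) by up to k + 1 verified tokens."""
    proposal = ids
    for _ in range(k):                                   # sequential drafting
        logits = draft(proposal)
        nxt = logits[:, -1].argmax(-1, keepdim=True)
        proposal = torch.cat([proposal, nxt], dim=-1)

    logits = target(proposal[:, :-1])                    # one verification pass
    n = ids.shape[1]
    checks = logits[:, n - 1:].argmax(-1)                # target's own choices
    drafts = proposal[:, n:]
    accept = (checks == drafts).long().cumprod(-1).sum().item()
    # keep the accepted prefix plus the target's token at the first mismatch
    return torch.cat([ids, drafts[:, :accept], checks[:, accept:accept + 1]], dim=-1)
```

The sequential drafting loop is the bottleneck that parallel-drafter methods like PARALLELSPEC aim to collapse into a single multi-token proposal step.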

New Contributors

  • @KuntaiDu made their first contribution in #80

Full Changelog: v2.6...v2.6.1

v2.6

03 Oct 01:02
c3f1409

What's Changed

  • 🔥[VPTQ] VPTQ: EXTREME LOW-BIT VECTOR POST-TRAINING QUANTIZATION FOR LARGE LANGUAGE MODELS by @DefTruth in #70
  • fix typo by @DefTruth in #71
  • 🔥🔥[INT-FLASHATTENTION] INT-FLASHATTENTION: ENABLING FLASH ATTENTION FOR INT8 QUANTIZATION by @DefTruth in #72
  • [Low-bit] A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms by @DefTruth in #73
  • 🔥🔥[HiFloat8] Ascend HiFloat8 Format for Deep Learning by @DefTruth in #74
  • 🔥[AlignedKV] AlignedKV: Reducing Memory Access of KV-Cache with Precision-Aligned Quantization by @DefTruth in #75
  • 🔥🔥[Tensor Cores] Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores by @DefTruth in #76
  • 🔥[KV-COMPRESS] PAGED KV-CACHE COMPRESSION WITH VARIABLE COMPRESSION RATES PER ATTENTION HEAD by @DefTruth in #77
  • 🔥[LayerKV] Optimizing Large Language Model Serving with Layer-wise KV Cache Management by @DefTruth in #78
  • Bump up to v2.6 by @DefTruth in #79
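
Several v2.6 entries (INT-FLASHATTENTION, AlignedKV, KV-COMPRESS) revolve around quantizing or compressing attention tensors. A generic per-channel symmetric INT8 round-trip on a KV-cache tensor shows the basic storage/precision trade these works refine; it illustrates the common idea only, not any single paper's algorithm:

```python
# Symmetric per-channel INT8 quantization of a KV-cache tensor and the
# corresponding dequantization. Generic illustration only.
import torch

def quantize_int8(x, dim=-1):
    """Per-channel symmetric INT8 quantization along `dim`."""
    scale = x.abs().amax(dim=dim, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.round(x / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.float() * scale

k = torch.randn(2, 8, 128, 64)        # (batch, heads, seq_len, head_dim)
q8, s = quantize_int8(k)              # int8 payload + fp32 scales
err = (dequantize_int8(q8, s) - k).abs().max()
print(f"max abs reconstruction error: {err:.4f}")
```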

Full Changelog: v2.5...v2.6

v2.5

26 Sep 03:25
3e43647

What's Changed

  • 🔥[InstInfer] InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference by @DefTruth in #65
  • Update codebase of paper "parallel speculative decoding with adaptive draft length" by @smart-lty in #66
  • move RetrievalAttention -> long context by @DefTruth in #67
  • 🔥🔥[CRITIPREFILL] CRITIPREFILL: A SEGMENT-WISE CRITICALITY-BASED APPROACH FOR PREFILLING ACCELERATION IN LLMS by @DefTruth in #68
  • Bump up to v2.5 by @DefTruth in #69

New Contributors

  • @smart-lty made their first contribution in #66

Full Changelog: v2.4...v2.5

v2.4

18 Sep 05:10
829da5a

What's Changed

  • 🔥[RetrievalAttention] Accelerating Long-Context LLM Inference via Vector Retrieval by @DefTruth in #62
  • 🔥[Inf-MLLM] Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU by @DefTruth in #63
  • Bump up to v2.4 by @DefTruth in #64

Full Changelog: v2.3...v2.4

v2.3

09 Sep 01:25
f0860e8

What's Changed

  • 🔥[CHESS] CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification by @DefTruth in #59
  • 🔥[SpMM] High Performance Unstructured SpMM Computation Using Tensor Cores by @DefTruth in #60
  • Bump up to v2.3 by @DefTruth in #61
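
The SpMM entry concerns making unstructured sparse-times-dense products fast on Tensor Cores. For reference, the operation itself in stock PyTorch sparse form; the cited paper contributes a high-performance kernel, which this snippet does not attempt:

```python
# Unstructured SpMM: sparse (m, k) weights times dense (k, n) activations.
# torch.sparse.mm uses a generic kernel; the cited work is about doing this
# efficiently on Tensor Cores.
import torch

w = torch.randn(256, 512)
w[torch.rand_like(w) > 0.1] = 0.0      # ~90% unstructured sparsity
w_sp = w.to_sparse()                   # COO sparse storage
x = torch.randn(512, 64)
y = torch.sparse.mm(w_sp, x)           # SpMM
print(y.shape)                         # torch.Size([256, 64])
```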

Full Changelog: v2.2...v2.3

v2.2

04 Sep 06:22
6d7e9f8

What's Changed

  • Add NanoFlow code link by @DefTruth in #51
  • 🔥[ACTIVATION SPARSITY] TRAINING-FREE ACTIVATION SPARSITY IN LARGE LANGUAGE MODELS by @DefTruth in #52
  • 🔥[Decentralized LLM] Decentralized LLM Inference over Edge Networks with Energy Harvesting by @DefTruth in #53
  • 🔥[SJF Scheduling] Efficient LLM Scheduling by Learning to Rank by @DefTruth in #54
  • 🔥[Speculative Decoding] Boosting Lossless Speculative Decoding via Feature Sampling and Partial Alignment Distillation by @DefTruth in #55
  • 🔥🔥[Prompt Compression] Prompt Compression with Context-Aware Sentence Encoding for Fast and Improved LLM Inference by @DefTruth in #56
  • 🔥🔥[Context Distillation] Efficient LLM Context Distillation by @DefTruth in #57
  • Bump up to v2.2 by @DefTruth in #58
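
Of the v2.2 entries, training-free activation sparsity is the most direct to illustrate: zero out small activations at inference time, with no retraining. The top-k threshold below is a generic stand-in, not the paper's selection rule:

```python
# Toy magnitude-based activation sparsification: keep only the largest
# `keep_ratio` fraction of each row's activations, zero the rest.
import torch

def sparsify(x, keep_ratio=0.3):
    k = max(1, int(keep_ratio * x.shape[-1]))
    thresh = x.abs().topk(k, dim=-1).values[..., -1:]   # per-row k-th largest
    return torch.where(x.abs() >= thresh, x, torch.zeros_like(x))

h = torch.randn(4, 4096)               # hypothetical MLP activations
h_sparse = sparsify(h)
print(f"density: {(h_sparse != 0).float().mean():.2f}")   # ~0.30
```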

Full Changelog: v2.1...v2.2

v2.1

28 Aug 01:53
74f887c

What's Changed

  • Update README.md by @DefTruth in #40
  • 🔥[Speculative Decoding] Parallel Speculative Decoding with Adaptive Draft Length by @DefTruth in #41
  • 🔥[FocusLLM] FocusLLM: Scaling LLM’s Context by Parallel Decoding by @DefTruth in #42
  • 🔥[NanoFlow] NanoFlow: Towards Optimal Large Language Model Serving Throughput by @DefTruth in #43
  • 🔥[MagicDec] MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding by @DefTruth in #44
  • Add ABQ-LLM code link by @DefTruth in #46
  • 🔥🔥[MARLIN] MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models by @DefTruth in #47
  • 🔥[1-bit LLMs] Matmul or No Matmul in the Era of 1-bit LLMs by @DefTruth in #48
  • 🔥🔥[FLA] FLA: A Triton-Based Library for Hardware-Efficient Implementa… by @DefTruth in #49
  • Bump up to v2.1 by @DefTruth in #50
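
Assuming the truncated FLA title refers to hardware-efficient linear-attention kernels (the flash-linear-attention library), the mechanism those Triton kernels accelerate fits in a few lines of plain PyTorch. This is the generic non-causal form with an elu + 1 feature map, not FLA's API:

```python
# Non-causal linear attention: softmax(Q K^T) V is replaced by
# phi(Q) (phi(K)^T V), avoiding the quadratic (seq x seq) matrix.
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """q, k, v: (batch, heads, seq, dim); phi(x) = elu(x) + 1."""
    phi_q, phi_k = F.elu(q) + 1, F.elu(k) + 1
    kv = torch.einsum("bhsd,bhse->bhde", phi_k, v)        # (b, h, d, d)
    z = torch.einsum("bhsd,bhd->bhs", phi_q, phi_k.sum(2)) + eps
    return torch.einsum("bhsd,bhde->bhse", phi_q, kv) / z.unsqueeze(-1)

q = k = v = torch.randn(1, 2, 16, 8)
print(linear_attention(q, k, v).shape)   # torch.Size([1, 2, 16, 8])
```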

Full Changelog: v2.0...v2.1

v2.0

19 Aug 01:22
8c0b51d

What's Changed

  • 🔥🔥[LUT TENSOR CORE] Lookup Table Enables Efficient Low-Bit LLM Inference Acceleration by @DefTruth in #33
  • 🔥🔥[Eigen Attention] Attention in Low-Rank Space for KV Cache Compression by @DefTruth in #34
  • KOALA: Enhancing Speculative Decoding for LLM via Multi-Layer Draft Heads with Adversarial Learning by @DefTruth in #35
  • Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference by @DefTruth in #36
  • 🔥[ABQ-LLM] Arbitrary-Bit Quantized Inference Acceleration for Large Language Models by @DefTruth in #37
  • [Token Recycling] Turning Trash into Treasure: Accelerating Inference… by @DefTruth in #38
  • Bump up to v2.0 by @DefTruth in #39
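
The LUT TENSOR CORE entry above speeds up low-bit inference by replacing arithmetic dequantization with table lookups. Schematically, quantized codes index a small codebook; the uniform codebook below is a hypothetical placeholder, and the paper's Tensor-Core kernel is far more involved:

```python
# LUT-style dequantization: 4-bit codes (stored unpacked here) index a
# 16-entry fp16 codebook, so "dequant" is just a gather.
import torch

codes = torch.randint(0, 16, (4096,), dtype=torch.uint8)  # 4-bit weight codes
lut = torch.linspace(-1.0, 1.0, 16, dtype=torch.float16)  # hypothetical codebook
weights = lut[codes.long()]                               # gather == dequantize
print(weights.shape, weights.dtype)      # torch.Size([4096]) torch.float16
```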

Full Changelog: v1.9...v2.0

v1.9

12 Aug 01:27
e6b8cf4

What's Changed

  • 🔥[DynamoLLM] DynamoLLM: Designing LLM Inference Clusters for Performa… by @DefTruth in #28
  • 🔥[Zero-Delay QKV Compression] Zero-Delay QKV Compression for Mitigati… by @DefTruth in #29
  • 🔥[Automatic Inference Engine Tuning] Towards SLO-Optimized LLM Servin… by @DefTruth in #30
  • 🔥🔥[500xCompressor] 500xCompressor: Generalized Prompt Compression for… by @DefTruth in #31
  • Bump up to v1.9 by @DefTruth in #32

Full Changelog: v1.8...v1.9

v1.8

05 Aug 02:33
6bb8818

What's Changed

  • 🔥[flashinfer] FlashInfer: Kernel Library for LLM Serving(@flashinfer-ai) by @DefTruth in #24
  • 🔥[Palu] Palu: Compressing KV-Cache with Low-Rank Projection(@nycu.edu… by @DefTruth in #25
  • 🔥[SentenceVAE] SentenceVAE: Faster, Longer and More Accurate Inferenc… by @DefTruth in #26
  • Bump up to v1.8 by @DefTruth in #27
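
The Palu entry compresses the KV cache via low-rank projection. In the simplest reading, one caches a rank-r latent per token and reconstructs keys on the fly; the random matrices below are shape-only stand-ins for learned factors, not Palu's method:

```python
# Low-rank KV-cache compression, schematically: store a rank-r latent per
# token instead of the full head_dim key, up-project at attention time.
import torch

d, r, seq = 128, 32, 1024              # head_dim, latent rank, cached tokens
down = torch.randn(d, r) / d ** 0.5    # hypothetical down-projection
up = torch.randn(r, d) / r ** 0.5      # hypothetical up-projection

k_full = torch.randn(seq, d)           # keys as produced by the model
cache = k_full @ down                  # store (seq, r): 4x smaller when r = d/4
k_hat = cache @ up                     # reconstruct (seq, d) when needed
print(cache.shape, k_hat.shape)        # torch.Size([1024, 32]) torch.Size([1024, 128])
```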

Full Changelog: v1.7...v1.8