From de47317b9851524f602dfe556c50b17ed2d23f4d Mon Sep 17 00:00:00 2001
From: mobicham <37179323+mobicham@users.noreply.github.com>
Date: Mon, 13 Nov 2023 09:56:37 +0100
Subject: [PATCH] Update index.html - fix typos and links

---
 index.html | 48 +++++++++++++++++++++++-------------------------
 1 file changed, 23 insertions(+), 25 deletions(-)

diff --git a/index.html b/index.html
index e1b844f..78cd305 100644
--- a/index.html
+++ b/index.html
@@ -56,7 +56,7 @@

Half Quadratic Quantization of Large Machine Learning Models

Large Language Models (LLMs) have revolutionized various subfields of machine learning like natural language processing, speech recognition and computer vision, enabling machines to understand and generate outputs with unprecedented accuracy and fluency. However, one of the most critical challenges in deploying LLMs is their expensive memory requirements, for both training and inference. Quantization methods such as bitsandbytes, GPTQ and AWQ have made it possible to use large models such as the popular LLama2 with significantly less memory, enabling the machine learning community to conduct remarkable research using a single consumer-grade GPU.

-In this article, we propose a new quantization technique called Half Quadratic Quantization (HQQ).Our approach, requiring no calibration data, significantly speeds up the quantization of large models, while offering compression quality competitive with that of calibration-based methods.. For instance, HQQ takes less than 8 minutes to process the colossal LLama2-70B, that’s 27x faster compared to the widely adopted GPTQ, while significantly outperforming it for extreme low-bit quantization.
+In this article, we propose a new quantization technique called Half-Quadratic Quantization (HQQ). Our approach, requiring no calibration data, significantly speeds up the quantization of large models, while offering compression quality competitive with that of calibration-based methods. For instance, HQQ takes less than 8 minutes to process the colossal LLama2-70B, that’s 27x faster compared to the widely adopted GPTQ, while significantly outperforming it for extreme low-bit quantization.