Merge pull request #590 from zackproser/post-what-is-lora-overview
Add post: The Rich Don't Fine-tune Like You And I
zackproser authored Sep 22, 2024
2 parents 9ab55ce + f02efc8 commit c7eb2ab
Showing 13 changed files with 281 additions and 2 deletions.
271 changes: 271 additions & 0 deletions src/app/blog/what-is-lora-and-qlora/page.mdx
@@ -0,0 +1,271 @@
import { ArticleLayout } from '@/components/ArticleLayout'
import { Button } from '@/components/Button'
import Image from 'next/image'

import richFineTuning from '@/images/rich-fine-tuning.webp'
import transferLearning from '@/images/transfer-learning.webp'
import aiResourceGap from '@/images/ai-resource-gap.webp'
import fineTuneAndroid from '@/images/fine-tune-android.webp'
import peft from '@/images/peft.webp'
import hackerJoy from '@/images/hacker-joy.webp'
import pumpkinPatch from '@/images/pumpkin-patch.webp'
import pumpkinLora from '@/images/pumpkin-lora.webp'
import halloweenLora from '@/images/halloween-lora.webp'
import civitai from '@/images/civitai.webp'
import pytorch from '@/images/pytorch.webp'

import { createMetadata } from '@/utils/createMetadata'

export const metadata = createMetadata({
  author: "Zachary Proser",
  date: "2024-09-22",
  title: "The Rich Don't Fine-tune Like You and Me: Intro to LoRA and QLoRA",
  description: "LoRA and QLoRA are two important innovations related to fine-tuning large language models like Llama and GPT.",
  image: richFineTuning,
  slug: '/blog/what-is-lora-and-qlora'
});

export default (props) => <ArticleLayout metadata={metadata} {...props} />

<Image src={richFineTuning} alt="Rich people fine-tuning models" />
<figcaption>People with Hundreds of Millions of Dollars in Cloud Computing Credits to Burn</figcaption>

Your choice of method for fine-tuning Large Language Models comes down to your answer to one question:

> **"Do you have access to hundreds of millions of dollars in capital?"**

If your answer is no, you're probably looking at single or multi-device LoRA or qLoRA: two methods of "steering" a foundational model toward different outputs without incurring the massive resource and time costs of full fine-tuning.

Transfer learning enables models trained on large datasets to be adapted for specific tasks or domains through fine-tuning, a crucial aspect of modern AI development.

<Image src={transferLearning} alt="Transfer learning" />
<figcaption>Transfer learning is the process of taking a model trained on one task and using it for another task.</figcaption>

However, the resource requirements for traditional fine-tuning methods starkly divide the AI community, separating those with access to enormous computational resources from those without.

<Image src={aiResourceGap} alt="AI Resource Gap" />
<figcaption>The AI Resource Gap: People with Hundreds of Millions of Dollars in Cloud Computing Credits vs. People with $0 in Cloud Computing Credits and Old Gaming Rigs Lying Around</figcaption>

Fine-tuning is crucial because it allows us to take a general-purpose model and specialize it for particular applications, improving its performance on specific tasks without the need to train a new model from scratch.

For example, fine-tuning can be used to adapt a language model to generate more accurate customer support responses for a specific company or improve a medical image classifier's ability to detect certain diseases from X-rays.

For those without access to vast financial resources, techniques like LoRA (Low-Rank Adaptation) and qLoRA (Quantized LoRA) have emerged as powerful alternatives.

These methods allow for efficient "steering" of foundational models towards desired outputs, bypassing the need for extensive computational resources and time typically associated with full fine-tuning.

## The Resource Divide in AI Model Adaptation

### Full Fine-tuning: The Resource-Intensive Approach

Full fine-tuning is the process of updating all of a pre-trained model's parameters to adapt it to a new task or domain. This method can potentially yield the best results, as it allows the model to adjust its entire knowledge base to the new data.

<Image src={fineTuneAndroid} alt="Fine-tuning on an Android" />
<figcaption>Full Fine-Tuning: The Resource-Intensive Approach</figcaption>

However, the resource requirements for this approach are staggering and often prohibitive for all but the largest tech companies and research institutions.

To put this into perspective, consider OpenAI's GPT-4. Sam Altman, CEO of OpenAI, has publicly stated that training this model cost over $100 million.

This level of investment is simply not feasible for the vast majority of developers, researchers, or even medium-sized companies.

<Image src={hackerJoy} alt="Hacker Joy" />
<figcaption>A machine learning engineer burning through $100,000 of cloud computing credits on a single fine-tuning run.</figcaption>

## PEFT: Democratizing Model Adaptation

Given the prohibitive resource requirements of full fine-tuning, researchers have been actively seeking more efficient alternatives. This search has led to the development of Parameter-Efficient Fine-Tuning (PEFT) techniques.

PEFT methods aim to achieve comparable performance to full fine-tuning while updating only a small subset of the model's parameters.

<Image src={peft} alt="PEFT" />
<figcaption>Parameter Efficient Fine-Tuning: Democratizing Model Adaptation</figcaption>

Several PEFT techniques have emerged in recent years, each with its own approach to efficient model adaptation:

1. **Prefix Tuning**: This method prepends a trainable prefix to the input of each layer in the model. By optimizing only these prefixes, the model can be adapted to new tasks with minimal parameter updates.
1. **Prompt Tuning**: Similar to prefix tuning, this technique focuses on optimizing a small set of continuous task-specific vectors (soft prompts) that are prepended to the input.
1. **Adapter Layers**: This approach involves inserting small trainable modules (adapters) between the layers of a pre-trained model. Only these adapter layers are updated during fine-tuning, leaving the original model parameters unchanged; a minimal sketch follows this list.
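
To make the adapter idea concrete, here's a minimal PyTorch sketch of a bottleneck adapter in the spirit of Houlsby et al.; the dimensions, activation, and placement are illustrative assumptions, not a specific published architecture:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """A small bottleneck module inserted after a frozen layer.

    Only these parameters are trained; the surrounding model stays frozen.
    """
    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)  # project down
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, hidden_dim)    # project back up

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection: the adapter learns a small correction to x
        return x + self.up(self.act(self.down(x)))
```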

Among these PEFT techniques, one method has gained particular prominence due to its effectiveness and efficiency: Low-Rank Adaptation, or LoRA.


## What is LoRA?

Low-Rank Adaptation (LoRA) is a technique that efficiently fine-tunes large language models without modifying all their parameters.

Developed by Microsoft, LoRA has gained traction for delivering similar performance to full fine-tuning while significantly reducing memory and computational costs. LoRA works by adding small, trainable rank decomposition matrices to the model's existing weights, allowing it to adapt to new tasks or domains without altering most of its original parameters.

For a practical example, consider the image generation platform Civit.ai, where users can create and share LoRA adapters.

<Image src={civitai} alt="Civit.ai, a website for sharing LoRA adapters and other models" />
<figcaption>Civit.ai: A platform for creating and sharing LoRA adapters</figcaption>

Suppose you want to generate images of a pumpkin patch for Halloween. A base model might produce fall-themed images—pumpkins, hot cider, and hay bales.

<Image src={pumpkinPatch} alt="Pumpkin Patch" />
<figcaption>Your standard, run-of-the-mill pumpkin patch</figcaption>

However, by applying a Halloween-themed LoRA adapter, you can add spooky elements without retraining the entire model.

<Image src={halloweenLora} alt="Halloween LoRA adapter" />
<figcaption>A LoRA adapter that will "steer" a base model toward a Halloween theme</figcaption>

This method efficiently alters the base model's output without retraining, and you get a spooky pumpkin patch.

<Image src={pumpkinLora} alt="Pumpkin patch with a Halloween theme" />
<figcaption>LoRA adapter result: A pumpkin patch with a Halloween theme</figcaption>

The same technique can be applied to text generation models to adapt them to specific writing styles or topics.

## How LoRA Works

LoRA uses low-rank matrix decomposition to update a model's behavior efficiently. Instead of modifying the dense weight matrices of the pre-trained model, LoRA introduces two smaller matrices—down-projection and up-projection—alongside the original weights.

The down-projection matrix A maps the input into a lower-dimensional space of rank r, and the up-projection matrix B maps it back. Together they form a low-rank update ΔW = BA; only A and B are trained, leaving the original parameters unchanged.
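
In code, the core idea fits in a few lines. Here's a minimal, illustrative PyTorch sketch (not the implementation from the LoRA paper or any particular library):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable low-rank update.

    Sketch of the idea: y = W x + (alpha / r) * B(A(x)), training only A and B.
    """
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the original weights
        self.lora_A = nn.Linear(base.in_features, r, bias=False)   # down-projection
        self.lora_B = nn.Linear(r, base.out_features, bias=False)  # up-projection
        nn.init.zeros_(self.lora_B.weight)  # start as a no-op, as in the LoRA paper
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))
```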

"Inference" means using a trained model (or a base model plus a LoRA adapter) to generate an output, or "prediction," from some input, usually a user prompt.

At inference time, the LoRA update can be merged into the original weights (W' = W + BA), so the merged model incurs no extra computational overhead. Here's an example in Python using Hugging Face's Transformers and PEFT libraries, where I load a LoRA adapter I trained and merge its weights with the base model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from peft import PeftModel
import torch

# Specify the base model and the LoRA adapter - both available on Hugging Face's model hub
model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
new_model = "zackproser/Meta-Llama-3.1-8B-instruct-zp-writing-ft-qlora"

# Load the base model in half precision to reduce memory usage
base_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Load the LoRA adapter on top of the base model
model = PeftModel.from_pretrained(base_model, new_model)

# Merge the LoRA adapter's weights into the base model and drop the adapter wrappers
model = model.merge_and_unload()

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Create a pipeline for text generation on the first GPU
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=2000, device=0)

# Inference: generate a response using the base model merged with the LoRA adapter's weights
prompt = "Write me an article about getting faster as a developer"
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result)
```

### Advantages of LoRA

1. **Memory Efficiency**: LoRA reduces memory use by updating fewer parameters, allowing larger models to run on consumer hardware.
1. **Faster Training**: With fewer parameters to update, training is faster, enabling quicker iterations.
1. **Reduced Computational Requirements**: LoRA lowers the computational burden, making model adaptation accessible to more developers.
1. **Flexibility**: LoRA adapters can be easily swapped or combined, enabling modular experimentation without retraining (see the sketch after this list).
1. **Preservation of Original Model**: Since LoRA doesn’t modify original weights, you can revert to the base model or switch adaptations seamlessly.
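
To illustrate that flexibility, here's a sketch using Hugging Face's PEFT library to load two adapters onto one base model and switch between them. The adapter repositories here are hypothetical placeholders:

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

# Load a first adapter and give it a name (hypothetical repo)
model = PeftModel.from_pretrained(base, "your-org/writing-style-lora", adapter_name="writing")

# Load a second adapter alongside it (hypothetical repo)
model.load_adapter("your-org/halloween-lora", adapter_name="halloween")

# Switch between adaptations without reloading the base model
model.set_adapter("halloween")
model.set_adapter("writing")
```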

### LoRA vs. Full Fine-tuning

While LoRA offers substantial efficiency gains, it's useful to compare its performance and resource requirements to full fine-tuning:

| Aspect | Full Fine-tuning | LoRA |
|--------|------------------|------|
| Parameter Updates | All model parameters | Only LoRA matrices (typically < 1% of model size) |
| Memory Requirements | Very High (Full model + gradients) | Low (Only LoRA parameters + gradients) |
| Training Time | Long (Days to weeks for large models) | Short (Hours to days) |
| Performance | Potentially highest | Comparable to full fine-tuning in many cases |
| Flexibility | Limited (entire model is task-specific) | High (adapters can be swapped/combined) |
| Hardware Requirements | High-end GPUs, often multiple | Can run on consumer-grade hardware |

While full fine-tuning can potentially achieve the best performance, the trade-off in terms of resources and time is substantial. LoRA provides a compelling alternative that can achieve comparable results in many cases, with only a fraction of the resources.
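
To make the "< 1% of model size" figure concrete, here's a back-of-the-envelope calculation for a single attention projection; the dimensions are illustrative:

```python
# One 4096 x 4096 projection matrix vs. its LoRA update at rank 8
d = 4096                      # hidden dimension (illustrative)
r = 8                         # LoRA rank

full_params = d * d           # 16,777,216 parameters in the original matrix
lora_params = d * r + r * d   # 65,536 parameters in A and B combined

print(f"LoRA adds {lora_params / full_params:.2%} of the original parameters")
# -> LoRA adds 0.39% of the original parameters
```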

## Practical Example: Using Torchtune for LoRA and qLoRA

Torchtune is a native PyTorch library that helps you implement LoRA (Low-Rank Adaptation) and qLoRA (Quantized Low-Rank Adaptation) for efficient model fine-tuning.

<Image src={pytorch} alt="PyTorch" />
<figcaption>PyTorch's Torchtune: The Library I Used to Implement LoRA</figcaption>

### Accessing and Customizing Recipes

Torchtune provides pre-configured recipes for various fine-tuning scenarios. Users can view available recipes with the `tune ls` command:

```bash
tune ls
RECIPE                          CONFIG
full_finetune_single_device     llama2/7B_full_low_memory
                                mistral/7B_full_low_memory
full_finetune_distributed       llama2/7B_full
                                llama2/13B_full
                                mistral/7B_full
lora_finetune_single_device     llama2/7B_lora_single_device
                                llama2/7B_qlora_single_device
                                mistral/7B_lora_single_device
```

To use a recipe, copy it to your local directory with `tune cp`:

```bash
tune cp llama3_1/8B_qlora_single_device my_conf
```

This creates a local copy of the YAML configuration file that you can customize.

### Configuration File Structure

The resulting YAML file contains detailed settings for the fine-tuning process. Key sections include:

1. Model Arguments: Specifies the base model and LoRA parameters.
2. Tokenizer: Defines the tokenizer path and settings.
3. Checkpointer: Manages model checkpoints.
4. Dataset and Sampler: Configures the training data.
5. Optimizer and Scheduler: Sets learning rate and optimization parameters.
6. Training: Defines epochs, batch size, and other training specifics.
7. Logging: Configures metrics logging, often using Weights & Biases.
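
As a rough sketch, the skeleton of one of these files looks something like the following. Field names and component paths vary between Torchtune versions, so treat this as illustrative rather than a copy of any shipped recipe:

```yaml
# Illustrative skeleton of a Torchtune config (not an exact shipped recipe)
model:
  _component_: torchtune.models.llama3_1.qlora_llama3_1_8b

tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /tmp/Meta-Llama-3.1-8B-Instruct/original/tokenizer.model

checkpointer:
  _component_: torchtune.utils.FullModelHFCheckpointer
  checkpoint_dir: /tmp/Meta-Llama-3.1-8B-Instruct/

dataset:
  _component_: torchtune.datasets.alpaca_cleaned_dataset

optimizer:
  _component_: torch.optim.AdamW
  lr: 3e-4

epochs: 1
batch_size: 2

metric_logger:
  _component_: torchtune.utils.metric_logging.WandBLogger
```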

### LoRA-specific Configuration

The configuration file includes LoRA-specific parameters:

```yaml
lora_attn_modules: ['q_proj', 'v_proj', 'k_proj', 'output_proj']
apply_lora_to_mlp: True
apply_lora_to_output: False
lora_rank: 8
lora_alpha: 16
```
These settings determine which model components are adapted and the extent of the adaptation: `lora_rank` sets the dimensionality of the low-rank matrices, and `lora_alpha` scales the learned update (the effective scaling factor is `lora_alpha / lora_rank`).

### qLoRA Implementation

For qLoRA, Torchtune uses quantization to reduce memory usage while maintaining performance. The configuration specifies:

```yaml
_component_: torchtune.models.llama3_1.qlora_llama3_1_8b
```

This component applies quantization alongside LoRA: the frozen base-model weights are stored in 4-bit precision while the trainable LoRA matrices remain in higher precision, enabling efficient fine-tuning on consumer-grade hardware.
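
Torchtune wires this up for you, but for intuition, here's roughly what the equivalent looks like in the Hugging Face stack, as a hedged sketch using `bitsandbytes` quantization with PEFT (exact arguments are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Store the frozen base weights in 4-bit NF4, computing in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
)
base = prepare_model_for_kbit_training(base)

# Attach trainable LoRA matrices (kept in higher precision) to the quantized model;
# note Hugging Face names the output projection "o_proj" where Torchtune uses "output_proj"
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only a small fraction of parameters train
```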

### Practical Application

To initiate fine-tuning with the configured recipe, users run:

```bash
tune run lora_finetune_single_device --config llama3_1/8B_qlora_single_device
```

This command applies the qLoRA technique to the specified Llama 3.1 8B model, using the parameters defined in the configuration file. To use your customized copy instead, point `--config` at its path (for example, `--config my_conf.yaml`).

Torchtune's LoRA and qLoRA implementations enable efficient model adaptation to specific tasks or domains, minimizing the computational costs of full fine-tuning—ideal for resource-constrained projects and rapid iteration.

## Further Reading

If you're interested in learning more about LoRA and related techniques, here are some valuable resources:

* **Original LoRA Paper**: "LoRA: Low-Rank Adaptation of Large Language Models" by Edward J. Hu et al. (2021) - https://arxiv.org/abs/2106.09685
* **Hugging Face's PEFT Library Documentation**: https://huggingface.co/docs/peft/index
* **"Parameter-Efficient Transfer Learning for NLP"** by Neil Houlsby et al. (2019) - https://arxiv.org/abs/1902.00751
* **"The Power of Scale for Parameter-Efficient Prompt Tuning"** by Brian Lester et al. (2021) - https://arxiv.org/abs/2104.08691
* **QLoRA Paper**: "QLoRA: Efficient Finetuning of Quantized LLMs" by Tim Dettmers et al. (2023) - https://arxiv.org/abs/2305.14314

## Thanks for reading!

If you have any questions, feel free to [reach out to me](/contact).
12 changes: 10 additions & 2 deletions src/app/layout.tsx
@@ -1,3 +1,4 @@
+import { Noto_Sans } from 'next/font/google';
 import { Analytics } from '@vercel/analytics/react';
 import { GoogleAnalytics } from '@next/third-parties/google'
 import { SpeedInsights } from '@vercel/speed-insights/next';
@@ -7,13 +8,20 @@ import { SimpleNav } from '@/components/SimpleNav'
 import '@/styles/tailwind.css'
 import '@/styles/global.css'
 
+// Initialize the Noto Sans font
+const notoSans = Noto_Sans({
+  subsets: ['latin'],
+  weight: ['400', '700'], // Add any weights you need
+  variable: '--font-noto-sans',
+});
+
 export default function RootLayout({
   children,
 }: {
   children: React.ReactNode
 }) {
   return (
-    <html lang="en" className="h-full antialiased">
+    <html lang="en" className={`h-full antialiased ${notoSans.variable}`}>
       <head>
         <link
           rel="alternate"
@@ -28,7 +36,7 @@ export default function RootLayout({
       </head>
       <body
         suppressHydrationWarning={true}
-        className="flex h-full bg-gray-100 dark:bg-black"
+        className={`flex h-full bg-gray-100 dark:bg-black font-sans`}
       >
         <SessionProvider>
           <Providers>
Binary file added src/images/ai-resource-gap.webp
Binary file modified src/images/civitai.webp
Binary file added src/images/fine-tune-android.webp
Binary file added src/images/hacker-joy.webp
Binary file added src/images/halloween-lora.webp
Binary file added src/images/peft.webp
Binary file added src/images/pumpkin-lora.webp
Binary file added src/images/pumpkin-patch.webp
Binary file added src/images/pytorch.webp
Binary file added src/images/rich-fine-tuning.webp
Binary file added src/images/transfer-learning.webp
