proof read 6.mdx on tgi and vllm
burtenshaw committed Feb 20, 2025
1 parent d31cbf0 commit f858963
Showing 1 changed file with 15 additions and 13 deletions.
28 changes: 15 additions & 13 deletions chapters/en/chapter12/6.mdx
@@ -1,29 +1,31 @@
# Optimized Inference Deployment

- In this chapter, we'll explore advanced frameworks for optimizing LLM deployments: Text Generation Inference (TGI) and vLLM. We'll cover how these tools maximize inference efficiency and simplify production deployments of Large Language Models.
+ In this section, we'll explore advanced frameworks for optimizing LLM deployments: Text Generation Inference (TGI) and vLLM. These applications are primarily used in production environments to serve LLMs to users.
+
+ We'll cover how these tools maximize inference efficiency and simplify production deployments of Large Language Models.

## Framework Selection Guide

TGI and vLLM serve similar purposes but have distinct characteristics that make them better suited for different use cases. Let's look at the key differences between the two. We'll focus on two key areas: performance and integration.

### Memory Management and Performance

- **TGI** is designed to be very stable and predictable in production, using fixed sequence lengths to keep memory usage consistent. TGI manages memory using Flash Attention 2 and continuous batching techniques. This means it can process attention calculations very efficiently and keep the GPU busy by constantly feeding it work. The system can move parts of the model between CPU and GPU when needed, which helps handle larger models.
+ **TGI** is designed to be stable and predictable in production, using fixed sequence lengths to keep memory usage consistent. TGI manages memory using Flash Attention 2 and continuous batching techniques. This means it can process attention calculations very efficiently and keep the GPU busy by constantly feeding it work. The system can move parts of the model between CPU and GPU when needed, which helps handle larger models.

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tgi/flash-attn.png" alt="Flash Attention" />

<Tip title="How Flash Attention Works">
- Flash Attention is a technique that optimizes the attention mechanism in transformer models by addressing memory bandwidth bottlenecks. The standard attention mechanism has quadratic complexity in both time and memory usage, making it inefficient for long sequences.
+ Flash Attention is a technique that optimizes the attention mechanism in transformer models by addressing memory bandwidth bottlenecks. As discussed earlier in [section 12.3](2.mdx), the attention mechanism has quadratic time and memory complexity, making it inefficient for long sequences.

- The key innovation is in how it manages memory transfers between High Bandwidth Memory (HBM) and faster SRAM cache. Traditional attention repeatedly transfers data between HBM and SRAM, creating a bottleneck. Flash Attention loads data once into SRAM and performs all calculations there, minimizing expensive memory transfers.
+ The key innovation is in how it manages memory transfers between High Bandwidth Memory (HBM) and the faster SRAM cache. Traditional attention repeatedly transfers data between HBM and SRAM, creating bottlenecks by leaving the GPU idle. Flash Attention loads data once into SRAM and performs all calculations there, minimizing expensive memory transfers.

While the benefits are most significant during training, Flash Attention's reduced VRAM usage and improved efficiency make it valuable for inference as well, enabling faster and more scalable LLM serving.
</Tip>
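
To make the memory contrast concrete, here is a minimal PyTorch sketch (not taken from TGI, and with illustrative tensor shapes) comparing a naive attention implementation, which materializes the full attention score matrix, with `torch.nn.functional.scaled_dot_product_attention`, which can dispatch to a fused Flash Attention kernel on supported GPUs:

```python
import math
import torch
import torch.nn.functional as F

# Illustrative shapes: (batch, heads, sequence length, head dimension)
q = torch.randn(1, 8, 1024, 64)
k = torch.randn(1, 8, 1024, 64)
v = torch.randn(1, 8, 1024, 64)

# Naive attention: materializes a (1024 x 1024) score matrix per head,
# so memory grows quadratically with sequence length.
scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
naive_out = scores.softmax(dim=-1) @ v

# Fused attention: on supported hardware this call can dispatch to a
# Flash Attention kernel that never materializes the full score matrix.
fused_out = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(naive_out, fused_out, atol=1e-5))  # the two paths agree numerically
```

TGI's use of Flash Attention 2 applies the same idea at the kernel level, which is what keeps long-sequence attention from being bound by memory transfers.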

**vLLM** takes a different approach by using PagedAttention. Just like how a computer manages its memory in pages, vLLM splits the model's memory into smaller blocks. This clever system means it can handle different-sized requests more flexibly and doesn't waste memory space. It's particularly good at sharing memory between different requests and reduces memory fragmentation, which makes the whole system more efficient.

<Tip title="How Paged Attention Works">
- Paged Attention is a technique that addresses another critical bottleneck in LLM inference: KV cache memory management. During text generation, the model needs to store attention keys and values (KV cache) for each generated token, which can become enormous, especially with long sequences or multiple concurrent requests.
+ Paged Attention is a technique that addresses another critical bottleneck in LLM inference: KV cache memory management. As discussed in [section 12.3](2.mdx), during text generation, the model stores attention keys and values (KV cache) for each generated token to reduce redundant computations. The KV cache can become enormous, especially with long sequences or multiple concurrent requests.

vLLM's key innovation lies in how it manages this cache:

@@ -32,7 +34,7 @@ vLLM's key innovation lies in how it manages this cache:
3. **Page Table Management**: A page table tracks which pages belong to which sequence, enabling efficient lookup and access.
4. **Memory Sharing**: For operations like parallel sampling, pages storing the KV cache for the prompt can be shared across multiple sequences.

- This approach can lead to up to 24x higher throughput compared to traditional methods, making it a game-changer for production LLM deployments. If you want to go really deep into how PagedAttention works, you can read the [the guide from the vLLM documentation](https://docs.vllm.ai/en/latest/design/kernel/paged_attention.html).
+ The PagedAttention approach can lead to up to 24x higher throughput compared to traditional methods, making it a game-changer for production LLM deployments. If you want to go really deep into how PagedAttention works, you can read [the guide from the vLLM documentation](https://docs.vllm.ai/en/latest/design/kernel/paged_attention.html).
</Tip>
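
As a purely illustrative sketch (a toy model, not vLLM's actual code; the block size, pool size, and tensor shapes are invented for the example), the page-table bookkeeping described above might look like this:

```python
import torch

BLOCK_SIZE = 16   # tokens per KV cache block (illustrative)
NUM_BLOCKS = 64   # physical blocks in the shared pool (illustrative)
NUM_HEADS, HEAD_DIM = 8, 64

# One shared pool of physical blocks for keys (a real cache also stores values).
key_pool = torch.zeros(NUM_BLOCKS, BLOCK_SIZE, NUM_HEADS, HEAD_DIM)
free_blocks = list(range(NUM_BLOCKS))
# The "page table": maps a sequence id to the physical blocks it owns.
block_table: dict[int, list[int]] = {}

def append_key(seq_id: int, token_index: int, key: torch.Tensor) -> None:
    """Store one token's key vector, allocating a new block only when needed."""
    blocks = block_table.setdefault(seq_id, [])
    if token_index % BLOCK_SIZE == 0:      # all owned blocks are full
        blocks.append(free_blocks.pop())   # grab a free block from the pool
    block_id = blocks[token_index // BLOCK_SIZE]
    key_pool[block_id, token_index % BLOCK_SIZE] = key

# Two sequences of very different lengths share one physical pool, so no
# memory is reserved up front for a worst-case maximum sequence length.
for t in range(20):
    append_key(seq_id=0, token_index=t, key=torch.randn(NUM_HEADS, HEAD_DIM))
for t in range(5):
    append_key(seq_id=1, token_index=t, key=torch.randn(NUM_HEADS, HEAD_DIM))

print(block_table)  # e.g. {0: [63, 62], 1: [61]}
```

In this toy model, sharing a prompt's KV cache across parallel samples would simply mean letting several page-table entries point at the same physical blocks.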

We can summarize the memory differences in the following table:
@@ -85,7 +87,7 @@ docker run --gpus all \
-p 8080:80 \
-v ~/.cache/huggingface:/data \
ghcr.io/huggingface/text-generation-inference:latest \
- --model-id mistralai/Mistral-7B-Instruct-v0.2
+ --model-id HuggingFaceTB/SmolLM2-360M-Instruct
```
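
If you want to sanity-check the server before wiring up a client, you can also call TGI's native `/generate` route directly. The snippet below is a minimal sketch with a placeholder prompt and parameters, assuming the port mapping from the command above:

```python
import requests

# The container above maps the server's port 80 to 8080 on the host.
response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Tell me a short story about a robot.",  # placeholder prompt
        "parameters": {"max_new_tokens": 64, "temperature": 0.7},
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["generated_text"])
```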

Then interact with it using the OpenAI client:
@@ -101,7 +103,7 @@ client = OpenAI(

# Chat completion
response = client.chat.completions.create(
model="mistralai/Mistral-7B-Instruct-v0.2",
model="HuggingFaceTB/SmolLM2-360M-Instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Tell me a story"}
@@ -120,7 +122,7 @@ First, launch the vLLM OpenAI-compatible server:

```bash
python -m vllm.entrypoints.openai.api_server \
- --model mistralai/Mistral-7B-Instruct-v0.2 \
+ --model HuggingFaceTB/SmolLM2-360M-Instruct \
--host 0.0.0.0 \
--port 8000
```
@@ -138,7 +140,7 @@ client = OpenAI(

# Chat completion
response = client.chat.completions.create(
model="mistralai/Mistral-7B-Instruct-v0.2",
model="HuggingFaceTB/SmolLM2-360M-Instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Tell me a story"}
@@ -164,7 +166,7 @@ docker run --gpus all \
-p 8080:80 \
-v ~/.cache/huggingface:/data \
ghcr.io/huggingface/text-generation-inference:latest \
- --model-id mistralai/Mistral-7B-Instruct-v0.2 \
+ --model-id HuggingFaceTB/SmolLM2-360M-Instruct \
--max-total-tokens 4096 \
--max-input-length 3072 \
--max-batch-total-tokens 8192 \
@@ -182,7 +184,7 @@ client = OpenAI(

# Advanced parameters example
response = client.chat.completions.create(
model="mistralai/Mistral-7B-Instruct-v0.2",
model="HuggingFaceTB/SmolLM2-360M-Instruct",
messages=[
{"role": "system", "content": "You are a creative storyteller."},
{"role": "user", "content": "Write a creative story"}
@@ -200,7 +202,7 @@ from vllm import LLM, SamplingParams

# Initialize the model with advanced parameters
llm = LLM(
model="mistralai/Mistral-7B-Instruct-v0.2",
model="HuggingFaceTB/SmolLM2-360M-Instruct",
gpu_memory_utilization=0.85,
max_num_batched_tokens=8192,
max_num_seqs=256,
