From f858963fa243337f2ea75002c9cd8f0c89f70af2 Mon Sep 17 00:00:00 2001
From: burtenshaw
Date: Thu, 20 Feb 2025 21:21:46 +0100
Subject: [PATCH] proof read 6.mdx on tgi and vllm

---
 chapters/en/chapter12/6.mdx | 28 +++++++++++++++-------------
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/chapters/en/chapter12/6.mdx b/chapters/en/chapter12/6.mdx
index 107f1472c..24b9fcf09 100644
--- a/chapters/en/chapter12/6.mdx
+++ b/chapters/en/chapter12/6.mdx
@@ -1,6 +1,8 @@
 # Optimized Inference Deployment
 
-In this chapter, we'll explore advanced frameworks for optimizing LLM deployments: Text Generation Inference (TGI) and vLLM. We'll cover how these tools maximize inference efficiency and simplify production deployments of Large Language Models.
+In this section, we'll explore advanced frameworks for optimizing LLM deployments: Text Generation Inference (TGI) and vLLM. These applications are primarily used in production environments to serve LLMs to users.
+
+We'll cover how these tools maximize inference efficiency and simplify production deployments of Large Language Models.
 
 ## Framework Selection Guide
 
@@ -8,14 +10,14 @@ TGI and vLLM serve similar purposes but have distinct characteristics that make
 ### Memory Management and Performance
 
-**TGI** is designed to be very stable and predictable in production, using fixed sequence lengths to keep memory usage consistent. TGI manages memory using Flash Attention 2 and continuous batching techniques. This means it can process attention calculations very efficiently and keep the GPU busy by constantly feeding it work. The system can move parts of the model between CPU and GPU when needed, which helps handle larger models.
+**TGI** is designed to be stable and predictable in production, using fixed sequence lengths to keep memory usage consistent. TGI manages memory using Flash Attention 2 and continuous batching techniques. This means it can process attention calculations very efficiently and keep the GPU busy by constantly feeding it work. The system can move parts of the model between CPU and GPU when needed, which helps handle larger models.
 
 Flash Attention
 
-Flash Attention is a technique that optimizes the attention mechanism in transformer models by addressing memory bandwidth bottlenecks. The standard attention mechanism has quadratic complexity in both time and memory usage, making it inefficient for long sequences.
+Flash Attention is a technique that optimizes the attention mechanism in transformer models by addressing memory bandwidth bottlenecks. As discussed earlier in [section 12.3](2.mdx), the attention mechanism has quadratic complexity in both time and memory usage, making it inefficient for long sequences.
 
-The key innovation is in how it manages memory transfers between High Bandwidth Memory (HBM) and faster SRAM cache. Traditional attention repeatedly transfers data between HBM and SRAM, creating a bottleneck. Flash Attention loads data once into SRAM and performs all calculations there, minimizing expensive memory transfers.
+The key innovation is in how it manages memory transfers between High Bandwidth Memory (HBM) and faster SRAM cache. Traditional attention repeatedly transfers data between HBM and SRAM, creating bottlenecks by leaving the GPU idle. Flash Attention loads data once into SRAM and performs all calculations there, minimizing expensive memory transfers.
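To see what enabling Flash Attention looks like in practice, here is a minimal sketch using the `transformers` option `attn_implementation="flash_attention_2"`. It assumes the `flash-attn` package is installed and a compatible CUDA GPU is available, and it reuses the `HuggingFaceTB/SmolLM2-360M-Instruct` model from the serving examples later in this patch; TGI and vLLM enable their optimized attention kernels automatically, so this snippet is only an illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-360M-Instruct"  # same model as the serving examples below

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # Flash Attention 2 requires half precision (fp16/bf16)
    attn_implementation="flash_attention_2",  # swap the standard attention kernel for Flash Attention 2
).to("cuda")

# Generation is unchanged; only the attention kernel differs.
inputs = tokenizer("Tell me a story", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```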
 While the benefits are most significant during training, Flash Attention's reduced VRAM usage and improved efficiency make it valuable for inference as well, enabling faster and more scalable LLM serving.
 
@@ -23,7 +25,7 @@
 
 **vLLM** takes a different approach by using PagedAttention. Just like how a computer manages its memory in pages, vLLM splits the model's memory into smaller blocks. This clever system means it can handle different-sized requests more flexibly and doesn't waste memory space. It's particularly good at sharing memory between different requests and reduces memory fragmentation, which makes the whole system more efficient.
 
-Paged Attention is a technique that addresses another critical bottleneck in LLM inference: KV cache memory management. During text generation, the model needs to store attention keys and values (KV cache) for each generated token, which can become enormous, especially with long sequences or multiple concurrent requests.
+Paged Attention is a technique that addresses another critical bottleneck in LLM inference: KV cache memory management. As discussed in [section 12.3](2.mdx), during text generation, the model stores attention keys and values (KV cache) for each generated token to reduce redundant computations. The KV cache can become enormous, especially with long sequences or multiple concurrent requests.
 
 vLLM's key innovation lies in how it manages this cache:
 
@@ -32,7 +34,7 @@ vLLM's key innovation lies in how it manages this cache:
 3. **Page Table Management**: A page table tracks which pages belong to which sequence, enabling efficient lookup and access.
 4. **Memory Sharing**: For operations like parallel sampling, pages storing the KV cache for the prompt can be shared across multiple sequences.
 
-This approach can lead to up to 24x higher throughput compared to traditional methods, making it a game-changer for production LLM deployments. If you want to go really deep into how PagedAttention works, you can read the [the guide from the vLLM documentation](https://docs.vllm.ai/en/latest/design/kernel/paged_attention.html).
+The PagedAttention approach can lead to up to 24x higher throughput compared to traditional methods, making it a game-changer for production LLM deployments. If you want to go really deep into how PagedAttention works, you can read [the guide from the vLLM documentation](https://docs.vllm.ai/en/latest/design/kernel/paged_attention.html).
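To make the paging idea concrete, the following is a small, self-contained Python sketch of the bookkeeping a PagedAttention-style allocator performs: a pool of fixed-size KV-cache blocks plus a page table mapping each sequence to the blocks it owns. It is an illustration only, not vLLM's actual implementation, and the block and pool sizes are arbitrary.

```python
BLOCK_SIZE = 16        # tokens per KV-cache block (illustrative value)
NUM_BLOCKS = 8         # total blocks in the memory pool (illustrative value)

free_blocks = list(range(NUM_BLOCKS))    # physical block ids not currently in use
page_table: dict[int, list[int]] = {}    # sequence id -> block ids it owns
token_counts: dict[int, int] = {}        # sequence id -> number of tokens cached

def append_token(seq_id: int) -> None:
    """Reserve KV-cache space for one newly generated token of a sequence."""
    count = token_counts.get(seq_id, 0)
    if count % BLOCK_SIZE == 0:          # current block is full (or this is the first token)
        if not free_blocks:
            raise MemoryError("KV-cache block pool exhausted")
        page_table.setdefault(seq_id, []).append(free_blocks.pop())
    token_counts[seq_id] = count + 1

def free_sequence(seq_id: int) -> None:
    """Return a finished sequence's blocks to the pool for reuse."""
    free_blocks.extend(page_table.pop(seq_id, []))
    token_counts.pop(seq_id, None)

# Two concurrent requests of different lengths share one block pool without
# reserving a worst-case contiguous region for either of them.
for _ in range(20):
    append_token(seq_id=0)   # 20 tokens -> occupies 2 blocks
for _ in range(5):
    append_token(seq_id=1)   # 5 tokens  -> occupies 1 block
print(page_table)            # {0: [7, 6], 1: [5]}
free_sequence(0)             # seq 0 finishes; its blocks become reusable immediately
```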
 We can summarize the memory differences in the following table:
 
@@ -85,7 +87,7 @@ docker run --gpus all \
     -p 8080:80 \
     -v ~/.cache/huggingface:/data \
     ghcr.io/huggingface/text-generation-inference:latest \
-    --model-id mistralai/Mistral-7B-Instruct-v0.2
+    --model-id HuggingFaceTB/SmolLM2-360M-Instruct
 ```
 
 Then interact with it using the OpenAI client:
 
@@ -101,7 +103,7 @@ client = OpenAI(
 
 # Chat completion
 response = client.chat.completions.create(
-    model="mistralai/Mistral-7B-Instruct-v0.2",
+    model="HuggingFaceTB/SmolLM2-360M-Instruct",
     messages=[
         {"role": "system", "content": "You are a helpful assistant."},
         {"role": "user", "content": "Tell me a story"}
@@ -120,7 +122,7 @@ First, launch the vLLM OpenAI-compatible server:
 
 ```bash
 python -m vllm.entrypoints.openai.api_server \
-    --model mistralai/Mistral-7B-Instruct-v0.2 \
+    --model HuggingFaceTB/SmolLM2-360M-Instruct \
     --host 0.0.0.0 \
     --port 8000
 ```
 
@@ -138,7 +140,7 @@ client = OpenAI(
 
 # Chat completion
 response = client.chat.completions.create(
-    model="mistralai/Mistral-7B-Instruct-v0.2",
+    model="HuggingFaceTB/SmolLM2-360M-Instruct",
     messages=[
         {"role": "system", "content": "You are a helpful assistant."},
         {"role": "user", "content": "Tell me a story"}
@@ -164,7 +166,7 @@ docker run --gpus all \
     -p 8080:80 \
     -v ~/.cache/huggingface:/data \
     ghcr.io/huggingface/text-generation-inference:latest \
-    --model-id mistralai/Mistral-7B-Instruct-v0.2 \
+    --model-id HuggingFaceTB/SmolLM2-360M-Instruct \
     --max-total-tokens 4096 \
     --max-input-length 3072 \
     --max-batch-total-tokens 8192 \
@@ -182,7 +184,7 @@ client = OpenAI(
 
 # Advanced parameters example
 response = client.chat.completions.create(
-    model="mistralai/Mistral-7B-Instruct-v0.2",
+    model="HuggingFaceTB/SmolLM2-360M-Instruct",
    messages=[
         {"role": "system", "content": "You are a creative storyteller."},
         {"role": "user", "content": "Write a creative story"}
@@ -200,7 +202,7 @@ from vllm import LLM, SamplingParams
 
 # Initialize the model with advanced parameters
 llm = LLM(
-    model="mistralai/Mistral-7B-Instruct-v0.2",
+    model="HuggingFaceTB/SmolLM2-360M-Instruct",
     gpu_memory_utilization=0.85,
     max_num_batched_tokens=8192,
     max_num_seqs=256,