proof read 6.mdx on tgi and vllm
burtenshaw committed Feb 20, 2025
1 parent d31cbf0 commit f858963
Showing 1 changed file with 15 additions and 13 deletions.
28 changes: 15 additions & 13 deletions chapters/en/chapter12/6.mdx
@@ -1,29 +1,31 @@
# Optimized Inference Deployment

- In this chapter, we'll explore advanced frameworks for optimizing LLM deployments: Text Generation Inference (TGI) and vLLM. We'll cover how these tools maximize inference efficiency and simplify production deployments of Large Language Models.
+ In this section, we'll explore advanced frameworks for optimizing LLM deployments: Text Generation Inference (TGI) and vLLM. These applications are primarily used in production environments to serve LLMs to users.
+
+ We'll cover how these tools maximize inference efficiency and simplify production deployments of Large Language Models.

## Framework Selection Guide

TGI and vLLM serve similar purposes but have distinct characteristics that make them better suited for different use cases. Let's look at the key differences between the two. We'll focus on two key areas: performance and integration.

### Memory Management and Performance

- **TGI** is designed to be very stable and predictable in production, using fixed sequence lengths to keep memory usage consistent. TGI manages memory using Flash Attention 2 and continuous batching techniques. This means it can process attention calculations very efficiently and keep the GPU busy by constantly feeding it work. The system can move parts of the model between CPU and GPU when needed, which helps handle larger models.
+ **TGI** is designed to be stable and predictable in production, using fixed sequence lengths to keep memory usage consistent. TGI manages memory using Flash Attention 2 and continuous batching techniques. This means it can process attention calculations very efficiently and keep the GPU busy by constantly feeding it work. The system can move parts of the model between CPU and GPU when needed, which helps handle larger models.

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tgi/flash-attn.png" alt="Flash Attention" />

<Tip title="How Flash Attention Works">
- Flash Attention is a technique that optimizes the attention mechanism in transformer models by addressing memory bandwidth bottlenecks. The standard attention mechanism has quadratic complexity in both time and memory usage, making it inefficient for long sequences.
+ Flash Attention is a technique that optimizes the attention mechanism in transformer models by addressing memory bandwidth bottlenecks. As discussed earlier in [section 12.3](2.mdx), the attention mechanism has quadratic time and memory complexity, making it inefficient for long sequences.

- The key innovation is in how it manages memory transfers between High Bandwidth Memory (HBM) and faster SRAM cache. Traditional attention repeatedly transfers data between HBM and SRAM, creating a bottleneck. Flash Attention loads data once into SRAM and performs all calculations there, minimizing expensive memory transfers.
+ The key innovation is in how it manages memory transfers between High Bandwidth Memory (HBM) and the faster SRAM cache. Traditional attention repeatedly transfers data between HBM and SRAM, creating bottlenecks by leaving the GPU idle. Flash Attention loads data once into SRAM and performs all calculations there, minimizing expensive memory transfers.

While the benefits are most significant during training, Flash Attention's reduced VRAM usage and improved efficiency make it valuable for inference as well, enabling faster and more scalable LLM serving.
</Tip>
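
To make the memory contrast concrete, here is a minimal PyTorch sketch (not taken from TGI, and with illustrative tensor shapes) comparing a naive attention implementation, which materializes the full attention score matrix, with `torch.nn.functional.scaled_dot_product_attention`, which can dispatch to a fused Flash Attention kernel on supported GPUs:

```python
import math
import torch
import torch.nn.functional as F

# Illustrative shapes: (batch, heads, sequence length, head dimension)
q = torch.randn(1, 8, 1024, 64)
k = torch.randn(1, 8, 1024, 64)
v = torch.randn(1, 8, 1024, 64)

# Naive attention: materializes a (1024 x 1024) score matrix per head,
# so memory grows quadratically with sequence length.
scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
naive_out = scores.softmax(dim=-1) @ v

# Fused attention: on supported hardware this call can dispatch to a
# Flash Attention kernel that never materializes the full score matrix.
fused_out = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(naive_out, fused_out, atol=1e-5))  # the two paths agree numerically
```

TGI's use of Flash Attention 2 applies the same idea at the kernel level, which is what keeps long-sequence attention from being bound by memory transfers.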

**vLLM** takes a different approach by using PagedAttention. Just like how a computer manages its memory in pages, vLLM splits the model's memory into smaller blocks. This clever system means it can handle different-sized requests more flexibly and doesn't waste memory space. It's particularly good at sharing memory between different requests and reduces memory fragmentation, which makes the whole system more efficient.

<Tip title="How Paged Attention Works">
- Paged Attention is a technique that addresses another critical bottleneck in LLM inference: KV cache memory management. During text generation, the model needs to store attention keys and values (KV cache) for each generated token, which can become enormous, especially with long sequences or multiple concurrent requests.
+ Paged Attention is a technique that addresses another critical bottleneck in LLM inference: KV cache memory management. As discussed in [section 12.3](2.mdx), during text generation, the model stores attention keys and values (KV cache) for each generated token to reduce redundant computations. The KV cache can become enormous, especially with long sequences or multiple concurrent requests.

vLLM's key innovation lies in how it manages this cache:

@@ -32,7 +34,7 @@ vLLM's key innovation lies in how it manages this cache:
3. **Page Table Management**: A page table tracks which pages belong to which sequence, enabling efficient lookup and access.
4. **Memory Sharing**: For operations like parallel sampling, pages storing the KV cache for the prompt can be shared across multiple sequences.

- This approach can lead to up to 24x higher throughput compared to traditional methods, making it a game-changer for production LLM deployments. If you want to go really deep into how PagedAttention works, you can read the [the guide from the vLLM documentation](https://docs.vllm.ai/en/latest/design/kernel/paged_attention.html).
+ The PagedAttention approach can lead to up to 24x higher throughput compared to traditional methods, making it a game-changer for production LLM deployments. If you want to go really deep into how PagedAttention works, you can read [the guide from the vLLM documentation](https://docs.vllm.ai/en/latest/design/kernel/paged_attention.html).
</Tip>
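
As a purely illustrative sketch (a toy model, not vLLM's actual code; the block size, pool size, and tensor shapes are invented for the example), the page-table bookkeeping described above might look like this:

```python
import torch

BLOCK_SIZE = 16   # tokens per KV cache block (illustrative)
NUM_BLOCKS = 64   # physical blocks in the shared pool (illustrative)
NUM_HEADS, HEAD_DIM = 8, 64

# One shared pool of physical blocks for keys (a real cache also stores values).
key_pool = torch.zeros(NUM_BLOCKS, BLOCK_SIZE, NUM_HEADS, HEAD_DIM)
free_blocks = list(range(NUM_BLOCKS))
# The "page table": maps a sequence id to the physical blocks it owns.
block_table: dict[int, list[int]] = {}

def append_key(seq_id: int, token_index: int, key: torch.Tensor) -> None:
    """Store one token's key vector, allocating a new block only when needed."""
    blocks = block_table.setdefault(seq_id, [])
    if token_index % BLOCK_SIZE == 0:      # all owned blocks are full
        blocks.append(free_blocks.pop())   # grab a free block from the pool
    block_id = blocks[token_index // BLOCK_SIZE]
    key_pool[block_id, token_index % BLOCK_SIZE] = key

# Two sequences of very different lengths share one physical pool, so no
# memory is reserved up front for a worst-case maximum sequence length.
for t in range(20):
    append_key(seq_id=0, token_index=t, key=torch.randn(NUM_HEADS, HEAD_DIM))
for t in range(5):
    append_key(seq_id=1, token_index=t, key=torch.randn(NUM_HEADS, HEAD_DIM))

print(block_table)  # e.g. {0: [63, 62], 1: [61]}
```

In this toy model, sharing a prompt's KV cache across parallel samples would simply mean letting several page-table entries point at the same physical blocks.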

We can summarize the memory differences in the following table:
@@ -85,7 +87,7 @@ docker run --gpus all \
-p 8080:80 \
-v ~/.cache/huggingface:/data \
ghcr.io/huggingface/text-generation-inference:latest \
- --model-id mistralai/Mistral-7B-Instruct-v0.2
+ --model-id HuggingFaceTB/SmolLM2-360M-Instruct
```
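
If you want to sanity-check the server before wiring up a client, you can also call TGI's native `/generate` route directly. The snippet below is a minimal sketch with a placeholder prompt and parameters, assuming the port mapping from the command above:

```python
import requests

# The container above maps the server's port 80 to 8080 on the host.
response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Tell me a short story about a robot.",  # placeholder prompt
        "parameters": {"max_new_tokens": 64, "temperature": 0.7},
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["generated_text"])
```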

Then interact with it using the OpenAI client:
@@ -101,7 +103,7 @@ client = OpenAI(

# Chat completion
response = client.chat.completions.create(
model="mistralai/Mistral-7B-Instruct-v0.2",
model="HuggingFaceTB/SmolLM2-360M-Instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Tell me a story"}
@@ -120,7 +122,7 @@ First, launch the vLLM OpenAI-compatible server:

```bash
python -m vllm.entrypoints.openai.api_server \
- --model mistralai/Mistral-7B-Instruct-v0.2 \
+ --model HuggingFaceTB/SmolLM2-360M-Instruct \
--host 0.0.0.0 \
--port 8000
```
@@ -138,7 +140,7 @@ client = OpenAI(

# Chat completion
response = client.chat.completions.create(
model="mistralai/Mistral-7B-Instruct-v0.2",
model="HuggingFaceTB/SmolLM2-360M-Instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Tell me a story"}
@@ -164,7 +166,7 @@ docker run --gpus all \
-p 8080:80 \
-v ~/.cache/huggingface:/data \
ghcr.io/huggingface/text-generation-inference:latest \
- --model-id mistralai/Mistral-7B-Instruct-v0.2 \
+ --model-id HuggingFaceTB/SmolLM2-360M-Instruct \
--max-total-tokens 4096 \
--max-input-length 3072 \
--max-batch-total-tokens 8192 \
@@ -182,7 +184,7 @@ client = OpenAI(

# Advanced parameters example
response = client.chat.completions.create(
model="mistralai/Mistral-7B-Instruct-v0.2",
model="HuggingFaceTB/SmolLM2-360M-Instruct",
messages=[
{"role": "system", "content": "You are a creative storyteller."},
{"role": "user", "content": "Write a creative story"}
@@ -200,7 +202,7 @@ from vllm import LLM, SamplingParams

# Initialize the model with advanced parameters
llm = LLM(
model="mistralai/Mistral-7B-Instruct-v0.2",
model="HuggingFaceTB/SmolLM2-360M-Instruct",
gpu_memory_utilization=0.85,
max_num_batched_tokens=8192,
max_num_seqs=256,
