Commit 5d1ac5e: proof 4.mdx on pipeline
burtenshaw committed Feb 20, 2025 (1 parent: d50f0cb)
Showing 1 changed file: chapters/en/chapter12/4.mdx (28 additions, 41 deletions)
# Basic Pipeline Inference

Let's move on to inferring with LLMs for real! In this chapter, we'll explore the `pipeline` abstraction in Transformers, which we have covered in [the previous chapter](chapters/en/chapter2/3.mdx). We learnt that whether we're generating text, analyzing sentiment, or translating languages, pipelines make it incredibly easy to get started. Here we'll look at how to use the pipeline for text generation.

<Tip>

Think of the `pipeline` as a handy utility that handles all the complex machinery for you.
</Tip>

## How Pipelines Work

If you're building a machine learning application, there's usually a lot of setup and technical detail to handle. That's where the Transformers pipeline comes in. It automates the process from raw input to human-readable output through three stages:

**1. Preprocessing Stage: Getting Your Data Model-Ready**

The pipeline first prepares your inputs so that the model can understand them. For text, it performs tokenization, breaking down your words and sentences into bite-sized pieces (tokens) that the model can work with.

**2. Model Inference: The Magic Happens**

During this stage, the pipeline:
- Batches your inputs for efficiency
- Automatically chooses the best available device for computation (CPU or GPU)
- Applies optimizations like half-precision (FP16) inference where it makes sense
- Handles all the technical complexity of running the model

**3. Postprocessing Stage: Output the Results**

Finally, the pipeline transforms the model's raw outputs into something useful for humans:
- Converts token IDs back into readable text
- Transforms model-specific outputs (logits) into meaningful probability scores
- Uses probability scores to generate the most likely tokens
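
To make these stages concrete, here's a minimal sketch of what the pipeline does under the hood, written out by hand (the checkpoint name is illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM2-1.7B-Instruct"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# 1. Preprocessing: tokenize the raw text into model-ready input IDs
inputs = tokenizer("The future of AI is", return_tensors="pt")

# 2. Model inference: generate new token IDs from the input IDs
outputs = model.generate(**inputs, max_new_tokens=20)

# 3. Postprocessing: decode the token IDs back into readable text
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```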

As mentioned before, there's a lot here that the pipeline abstracts away for us, so we're not optimizing for production systems, but it's a great way to get started. That said, there are still some things that we can optimize with the pipeline. Let's look at some of the key configuration options and how to use them.

## Basic Usage

The pipeline works in two steps:

1. Create and configure the pipeline
2. Generate text with the pipeline

It's easy to use a pipeline for text generation. Here's a simple example that will get you up and running in no time:

```python
from transformers import pipeline

# 1. Create and configure the pipeline (the model name here is illustrative)
generator = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-1.7B-Instruct")

# 2. Generate text with the pipeline
response = generator("Write a short poem about coding")

print(response[0]['generated_text'])
```

## Key Configuration Options

How can we customize the pipeline to suit our needs?

### Model Loading: Choosing Where to Run

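Here's a sketch of what that looks like, loading the same pipeline once on the CPU and once on the first GPU (the model name is illustrative):

```python
from transformers import pipeline

# Load the pipeline on the CPU
generator = pipeline(
    "text-generation",
    model="HuggingFaceTB/SmolLM2-1.7B-Instruct",  # illustrative checkpoint
    device="cpu",
)

# Load the same pipeline on the first GPU
generator = pipeline(
    "text-generation",
    model="HuggingFaceTB/SmolLM2-1.7B-Instruct",
    device=0,
)
```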

In this example we loaded the model twice: once with the device set to `cpu`, and once with the device set to `0`, which is the first GPU. Alternatively, you can pass `device_map="auto"` and let the pipeline pick the best available device for the model.

### Generation Parameters: Fine-tuning Your Output

Earlier we covered [the token selection process](chapters/en/chapter12/2.mdx) and how the model uses probabilities to select the next token. Here we look at the parameters that we can use to fine-tune the output with the pipeline.

Think of these parameters as the creative controls for your model's generation. Just like adjusting the settings on a camera for the perfect photo, these options help you get the most out of the model. Like a camera, poor configuration can lead to blurry photos from even the best model.

```python
response = generator(
    "Write a short poem about coding",  # the prompt (illustrative)
    max_new_tokens=100,  # cap how many tokens are generated
    do_sample=True,      # sample from the distribution instead of picking greedily
    temperature=0.7,     # below 1.0 sharpens the distribution, above 1.0 flattens it
    top_k=50,            # only sample from the 50 most likely tokens
    top_p=0.95,          # nucleus sampling: keep tokens covering 95% of probability mass
)
```
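
As a rule of thumb, a low `temperature` makes the output more focused and repeatable, while a higher one adds variety; `top_k` and `top_p` limit which tokens the model is allowed to sample at each step.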

### Advanced Optimization: Making Things Faster and More Efficient

Let's move beyond token selection and look at some of the optimizations that we can use to make the pipeline faster.

#### 1. Quantization: Shrinking Models With Reduced Precision

Quantization is like compressing a file: it makes the model smaller and faster while keeping most of its capabilities. We won't cover the details of quantization here, but we'll look at how to use it in the pipeline. In [the next section](chapters/en/chapter12/5.mdx) we'll look at some of the tools that we can use to quantize the model.

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Configure 4-bit quantization (an illustrative configuration)
quantization_config = BitsAndBytesConfig(load_in_4bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM2-1.7B-Instruct",  # illustrative checkpoint
    quantization_config=quantization_config,
)
```
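
Once quantized, the model can be passed straight into a pipeline. A minimal sketch, assuming the tokenizer comes from the same checkpoint:

```python
from transformers import AutoTokenizer, pipeline

# Reuse the quantized model from above; the tokenizer must match the checkpoint
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B-Instruct")
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
```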

#### 2. Memory Management: Keeping Your GPU purring

GPUs are high performance devices that perform a lot of operations in parallel. To get the most out of them, we need to manage their memory efficiently. In short, we want to make sure they're not sitting around idle waiting for the next batch of data.

```python
import torch

# Release cached GPU memory that is no longer referenced
torch.cuda.empty_cache()

# Print a human-readable summary of current GPU memory usage
print(torch.cuda.memory_summary())
```

The `torch.cuda.empty_cache()` function clears the GPU's memory cache. This is useful when you're running a lot of operations and want to free up memory.

## Processing Multiple Inputs: Batch Processing for Efficiency

One of the coolest features of the pipeline is that it can handle multiple inputs at once, making everything run faster. Here's how you can process several prompts in one go:

```python
# Let's prepare a bunch of different prompts to process
prompts = [
    "Write a haiku about programming",  # illustrative prompts
    "Explain what an API is in one sentence",
    "Tell me a fun fact about Python",
]

# Passing a list lets the pipeline batch the inputs for efficiency
responses = generator(prompts, max_new_tokens=50, batch_size=2)

for prompt, response in zip(prompts, responses):
    print(f"πŸ“ Prompt: {prompt}")
    print(f"πŸ€– Response: {response[0]['generated_text']}\n")
```


## Good to Know: Pipeline Limitations

While pipelines are amazing for getting started and prototyping, there are a few things to keep in mind:

- **Speed Tradeoffs**: Pipelines prioritize ease of use over maximum performance
- **Limited Advanced Features**: Some advanced optimizations aren't available out of the box
- **Production Considerations**: For high-traffic applications, you might want to explore specialized serving solutions

For production systems that need to handle lots of requests, consider using Text Generation Inference (TGI) or other dedicated serving solutions that can better optimize for your specific needs.

## Learn More

Want to dive deeper? Here are some great resources to explore:

- [Hugging Face Pipeline Tutorial](https://huggingface.co/docs/transformers/en/pipeline_tutorial) - Your complete guide to pipelines
- [Pipeline API Reference](https://huggingface.co/docs/transformers/en/main_classes/pipelines) - All the technical details
- [Text Generation Parameters](https://huggingface.co/docs/transformers/en/main_classes/text_generation) - Master the art of text generation
- [Model Quantization Guide](https://huggingface.co/docs/transformers/en/perf_infer_gpu_one) - Speed up your models

Remember, the pipeline is just the beginning of your journey with πŸ€— Transformers. As you get more comfortable, you'll discover many more powerful features and optimizations to explore!
