update vLLM integration page #182

Merged (2 commits) on Feb 16, 2024
64 changes: 58 additions & 6 deletions integrations/vllm.md
---
layout: integration
name: vLLM Invocation Layer
description: Use the vLLM inference engine with Haystack
authors:
- name: Lukas Kreussel
socials:
Simply use [vLLM](https://github.com/vllm-project/vllm) in your Haystack pipelines.

### Table of Contents

- [Overview](#overview)
- [Haystack 2.x](#haystack-2x)
- [Installation](#installation)
- [Usage](#usage)
- [Haystack 1.x](#haystack-1x)
- [Installation (1.x)](#installation-1x)
- [Usage (1.x)](#usage-1x)

## Overview

[vLLM](https://github.com/vllm-project/vllm) is a high-throughput and memory-efficient inference and serving engine for LLMs.
It is an open-source project that lets you serve open models in production when you have GPU resources available.

For Haystack 1.x, the integration is available as a separate package, while for Haystack 2.x, the integration comes out of the box.

## Haystack 2.x

vLLM can be deployed as a server that implements the OpenAI API protocol.
This allows vLLM to be used with the `OpenAIGenerator` and `OpenAIChatGenerator` components in Haystack.

For an end-to-end example of [vLLM + Haystack 2.x, see this notebook](https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/vllm_inference_engine.ipynb).


### Installation
First, install vLLM:
- With `pip`: `pip install vllm` (see the [vLLM documentation](https://docs.vllm.ai/en/latest/getting_started/installation.html) for more information).
- For production use cases, there are several other options, including Docker ([docs](https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html)).

### Usage
You first need to run a vLLM OpenAI-compatible server. You can do that using [Python](https://docs.vllm.ai/en/latest/getting_started/quickstart.html#openai-compatible-server) or [Docker](https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html).

Then, you can use the `OpenAIGenerator` and `OpenAIChatGenerator` components in Haystack to query the vLLM server.

```python
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret

generator = OpenAIChatGenerator(
    api_key=Secret.from_token("VLLM-PLACEHOLDER-API-KEY"),  # for compatibility with the OpenAI API, a placeholder api_key is needed
    model="mistralai/Mistral-7B-Instruct-v0.1",
    api_base_url="http://localhost:8000/v1",
    generation_kwargs={"max_tokens": 512},
)

response = generator.run(messages=[ChatMessage.from_user("Hi. Can you help me plan my next trip to Italy?")])
```
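
The same server also works with the non-chat `OpenAIGenerator`. Here is a minimal sketch; the model name, server URL, and prompt are placeholders, and it assumes an OpenAI-compatible vLLM server is already running locally:

```python
from haystack.components.generators import OpenAIGenerator
from haystack.utils import Secret

# Assumes a vLLM OpenAI-compatible server is already running, e.g. started with:
#   python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.1
generator = OpenAIGenerator(
    api_key=Secret.from_token("VLLM-PLACEHOLDER-API-KEY"),  # placeholder key for OpenAI API compatibility
    model="mistralai/Mistral-7B-Instruct-v0.1",
    api_base_url="http://localhost:8000/v1",
    generation_kwargs={"max_tokens": 512},
)

result = generator.run(prompt="Briefly describe the Amalfi Coast.")
print(result["replies"][0])  # generated completions are returned under the "replies" key
```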

## Haystack 1.x

### Installation (1.x)
Install the wrapper via pip: `pip install vllm-haystack`

### Usage (1.x)
This integration provides two invocation layers:
- `vLLMInvocationLayer`: To use models hosted on a vLLM server
- `vLLMLocalInvocationLayer`: To use locally hosted vLLM models

#### Use a Model Hosted on a vLLM Server
To utilize the wrapper, use the `vLLMInvocationLayer`.

Here is a simple example of how a `PromptNode` can be created with the wrapper.
prompt_node = PromptNode(model_name_or_path=model, top_k=1, max_length=256)
The model name is inferred from the model served by the vLLM server.
For more configuration examples, take a look at the unit tests.
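
For reference, a fuller version of this setup might look roughly as follows. This is only a sketch: the `vllm_haystack` import path, the `api_key` value, and the `api_base` model kwarg are assumptions, so check the package README for the exact arguments.

```python
from haystack.nodes import PromptModel, PromptNode

# Sketch only: the import path and parameters below are assumptions based on the
# vllm-haystack package; consult its README for the authoritative example.
from vllm_haystack import vLLMInvocationLayer

model = PromptModel(
    model_name_or_path="mistralai/Mistral-7B-Instruct-v0.1",
    invocation_layer_class=vLLMInvocationLayer,
    max_length=256,
    api_key="EMPTY",  # the vLLM server ignores the key, but one must be provided
    model_kwargs={"api_base": "http://localhost:8000/v1"},  # URL of the vLLM OpenAI-compatible server
)

prompt_node = PromptNode(model_name_or_path=model, top_k=1, max_length=256)
```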

##### Hosting a vLLM Server

To create an *OpenAI-Compatible Server* via vLLM you can follow the steps in the
Quickstart section of their [documentation](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html#openai-compatible-server).

#### Use a Model Hosted Locally
⚠️ To run vLLM locally, you need `vllm` installed and a supported GPU.

If you don't want to use an API server, this wrapper also provides a `vLLMLocalInvocationLayer`, which runs vLLM on the same node Haystack is running on.
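
A local setup might then look roughly like this (again a sketch: the `vllm_haystack` import path and the model name are assumptions):

```python
from haystack.nodes import PromptModel, PromptNode

# Sketch only: requires `vllm` installed locally and a supported GPU.
from vllm_haystack import vLLMLocalInvocationLayer

model = PromptModel(
    model_name_or_path="mistralai/Mistral-7B-Instruct-v0.1",  # any model supported by vLLM
    invocation_layer_class=vLLMLocalInvocationLayer,
    max_length=256,
)

prompt_node = PromptNode(model_name_or_path=model, top_k=1, max_length=256)
```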