From d7c2d5f0e9f396c7fcfba4979a567462aa185007 Mon Sep 17 00:00:00 2001
From: anakin87
Date: Fri, 16 Feb 2024 13:13:43 +0100
Subject: [PATCH 1/2] update vLLM integration page

---
 integrations/vllm.md | 64 +++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 58 insertions(+), 6 deletions(-)

diff --git a/integrations/vllm.md b/integrations/vllm.md
index 8638864c..9bb3fa6e 100644
--- a/integrations/vllm.md
+++ b/integrations/vllm.md
@@ -1,7 +1,7 @@
 ---
 layout: integration
 name: vLLM Invocation Layer
-description: Use a vLLM server or locally hosted instance in your Prompt Node
+description: Use the vLLM inference engine with Haystack
 authors:
   - name: Lukas Kreussel
     socials:
@@ -25,15 +25,67 @@ Simply use [vLLM](https://github.com/vllm-project/vllm) in your haystack pipelin

-## Installation
+### Table of Contents
+
+- [Overview](#overview)
+- [Haystack 2.x](#haystack-2x)
+  - [Installation](#installation)
+  - [Usage](#usage)
+- [Haystack 1.x](#haystack-1x)
+  - [Installation (1.x)](#installation-1x)
+  - [Usage (1.x)](#usage-1x)
+
+## Overview
+
+[vLLM](https://github.com/vllm-project/vllm) is a high-throughput and memory-efficient inference and serving engine for LLMs.
+It is an open-source project that allows serving open models in production when you have GPU resources available.
+
+For Haystack 1.x, the integration is available as a separate package, while for Haystack 2.x, the integration comes out of the box.
+
+## Haystack 2.x
+
+vLLM can be deployed as a server that implements the OpenAI API protocol.
+This allows vLLM to be used with the `OpenAIGenerator` and `OpenAIChatGenerator` components in Haystack.
+
+For an end-to-end example of vLLM + Haystack 2.x, [see this notebook](https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/vllm_inference_engine.ipynb).
+
+
+### Installation
+vLLM must be installed first:
+- you can install it with `pip`: `pip install vllm` (more information in the [vLLM documentation](https://docs.vllm.ai/en/latest/getting_started/installation.html))
+- for production use cases, there are several other options, including Docker ([docs](https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html))
+
+### Usage
+You first need to run a vLLM OpenAI-compatible server. You can do that using [Python](https://docs.vllm.ai/en/latest/getting_started/quickstart.html#openai-compatible-server) or [Docker](https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html).
+
+Then, you can use the `OpenAIGenerator` and `OpenAIChatGenerator` components in Haystack to query the vLLM server.
+
+```python
+from haystack.components.generators.chat import OpenAIChatGenerator
+from haystack.dataclasses import ChatMessage
+from haystack.utils import Secret
+
+generator = OpenAIChatGenerator(
+    api_key=Secret.from_token("VLLM-PLACEHOLDER-API-KEY"),  # for compatibility with the OpenAI API, a placeholder api_key is needed
+    model="mistralai/Mistral-7B-Instruct-v0.1",
+    api_base_url="http://localhost:8000/v1",
+    generation_kwargs={"max_tokens": 512}
+)
+
+response = generator.run(messages=[ChatMessage.from_user("Hi. Can you help me plan my next trip to Italy?")])
+```
+
+## Haystack 1.x
+
+### Installation (1.x)
 Install the wrapper via pip: `pip install vllm-haystack`
 
-## Usage
+### Usage (1.x)
 This integration provides two invocation layers:
 - `vLLMInvocationLayer`: To use models hosted on a vLLM server
 - `vLLMLocalInvocationLayer`: To use locally hosted vLLM models
 
-### Use a Model Hosted on a vLLM Server
+#### Use a Model Hosted on a vLLM Server
 To utilize the wrapper the `vLLMInvocationLayer` has to be used.
 
 Here is a simple example of how a `PromptNode` can be created with the wrapper.
@@ -52,12 +104,12 @@ prompt_node = PromptNode(model_name_or_path=model, top_k=1, max_length=256)
 The model will be inferred based on the model served on the vLLM server.
 For more configuration examples, take a look at the unit-tests.
 
-#### Hosting a vLLM Server
+##### Hosting a vLLM Server
 To create an *OpenAI-Compatible Server* via vLLM you can follow the steps in the Quickstart section of their [documentation](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html#openai-compatible-server).
-### Use a Model Hosted Locally
+#### Use a Model Hosted Locally
 ⚠️To run `vLLM` locally you need to have `vllm` installed and a supported GPU.
 
 If you don't want to use an API-Server this wrapper also provides a `vLLMLocalInvocationLayer` which executes the vLLM on the same node Haystack is running on.

From 8ae6b80e6785deb6bc425e66239feb97b18f614f Mon Sep 17 00:00:00 2001
From: anakin87
Date: Fri, 16 Feb 2024 15:24:32 +0100
Subject: [PATCH 2/2] address feedback

---
 integrations/vllm.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/integrations/vllm.md b/integrations/vllm.md
index 9bb3fa6e..1dbf953b 100644
--- a/integrations/vllm.md
+++ b/integrations/vllm.md
@@ -11,6 +11,7 @@ repo: https://github.com/LLukas22/vLLM-haystack-adapter
 type: Model Provider
 report_issue: https://github.com/LLukas22/vLLM-haystack-adapter/issues
 logo: /logos/vllm.png
+version: Haystack 2.0
 toc: true
 ---
 [![PyPI - Version](https://img.shields.io/pypi/v/vllm-haystack.svg)](https://pypi.org/project/vllm-haystack)
@@ -45,7 +46,7 @@ For Haystack 1.x, the integration is available as a separate package, while for
 ## Haystack 2.x
 
 vLLM can be deployed as a server that implements the OpenAI API protocol.
-This allows vLLM to be used with the `OpenAIGenerator` and `OpenAIChatGenerator` components in Haystack.
+This allows vLLM to be used with the [`OpenAIGenerator`](https://docs.haystack.deepset.ai/v2.0/docs/openaigenerator) and [`OpenAIChatGenerator`](https://docs.haystack.deepset.ai/v2.0/docs/openaichatgenerator) components in Haystack.
 
 For an end-to-end example of vLLM + Haystack 2.x, [see this notebook](https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/vllm_inference_engine.ipynb).
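For the Haystack 2.x path added in the first patch, the page only shows the `OpenAIChatGenerator`. A minimal sketch of the non-chat `OpenAIGenerator` against the same server is below; it assumes a vLLM OpenAI-compatible server is already running locally on port 8000 and serving `mistralai/Mistral-7B-Instruct-v0.1` (the placeholder API key mirrors the chat example, and the server-start command in the comment follows the vLLM quickstart for that era of vLLM and may differ in newer releases).

```python
from haystack.components.generators import OpenAIGenerator
from haystack.utils import Secret

# Start the server first, for example:
#   python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.1
# (command taken from the vLLM quickstart; adjust it to your installed vLLM version)

generator = OpenAIGenerator(
    api_key=Secret.from_token("VLLM-PLACEHOLDER-API-KEY"),  # placeholder, only needed for OpenAI-API compatibility
    model="mistralai/Mistral-7B-Instruct-v0.1",             # must match the model served by vLLM
    api_base_url="http://localhost:8000/v1",                # vLLM's OpenAI-compatible endpoint
    generation_kwargs={"max_tokens": 512},
)

result = generator.run(prompt="Summarize what vLLM is in one sentence.")
print(result["replies"][0])
```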
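For the Haystack 1.x path, the `PromptNode` example referenced in the first patch is elided by the diff context (only the final `prompt_node = PromptNode(...)` line is visible in the hunk header). A sketch following the `vllm-haystack` adapter's documented pattern might look like the following; the `PromptModel` keyword arguments shown here (in particular `api_base` and `maximum_context_length` inside `model_kwargs`) are assumptions to verify against the adapter's README.

```python
from haystack.nodes import PromptNode, PromptModel
from vllm_haystack import vLLMInvocationLayer

# Point the invocation layer at a running vLLM OpenAI-compatible server.
model = PromptModel(
    model_name_or_path="",                       # left empty: the served model is taken from the server
    invocation_layer_class=vLLMInvocationLayer,
    max_length=256,
    api_key="EMPTY",                             # placeholder, as in the 2.x example
    model_kwargs={
        "api_base": "http://localhost:8000/v1",  # URL of the vLLM server (assumed)
        "maximum_context_length": 2048,          # assumed adapter setting
    },
)

prompt_node = PromptNode(model_name_or_path=model, top_k=1, max_length=256)
result = prompt_node("What is the capital of Italy?")
```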
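Similarly, for the locally hosted case at the end of the first patch, a sketch with `vLLMLocalInvocationLayer` could look like this. It assumes `vllm` is installed on the same node as Haystack and a supported GPU is available; the `model_kwargs` shown are again an assumption rather than a confirmed adapter API.

```python
from haystack.nodes import PromptNode, PromptModel
from vllm_haystack import vLLMLocalInvocationLayer

MODEL = "mistralai/Mistral-7B-Instruct-v0.1"

# Runs vLLM in-process on the node Haystack is running on; requires a supported GPU.
model = PromptModel(
    model_name_or_path=MODEL,
    invocation_layer_class=vLLMLocalInvocationLayer,
    max_length=256,
    model_kwargs={"maximum_context_length": 2048},  # assumed adapter setting
)

prompt_node = PromptNode(model_name_or_path=model, top_k=1, max_length=256)
```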