Llama-2-chat integration #9

Closed · wants to merge 30 commits

Changes from all commits

Commits (30)
c004d7b
llama-2-chat integration [wip]
eryk-dsai Aug 31, 2023
3337d6a
fixing problem with private pipeline
eryk-dsai Aug 31, 2023
26e10fc
_generate returns ChatResult
eryk-dsai Sep 1, 2023
2aba4cd
batch calls example in notebook, handling chat messages
eryk-dsai Sep 1, 2023
82f71f1
_format_messages_as_text docstring
eryk-dsai Sep 1, 2023
933f5f4
you can pass stop words now
eryk-dsai Sep 1, 2023
36519e0
format_messages_as_text test
eryk-dsai Sep 1, 2023
9a92e08
formatter
eryk-dsai Sep 1, 2023
cc53251
fix lint issues
eryk-dsai Sep 1, 2023
6a0cd87
removal of redundant notebook cell
eryk-dsai Sep 1, 2023
cc578b5
refactor: update naming to indicate Hugging Face usage
eryk-dsai Sep 4, 2023
cb33d48
small refactor
eryk-dsai Sep 4, 2023
440571d
fix lint errors, running formatter
eryk-dsai Sep 4, 2023
f860326
moving stopping criteria class out of function, correct typing
eryk-dsai Sep 4, 2023
9b6e79d
code review suggestions
eryk-dsai Sep 4, 2023
900d55c
run formatter and lint
eryk-dsai Sep 4, 2023
3b9c030
StoppingCriteria are correctly placed on the same device as pipeline
eryk-dsai Sep 5, 2023
9715b81
Merge branch 'llama2-chat' of https://github.com/deepsense-ai/langcha…
eryk-dsai Sep 5, 2023
3885de5
run formatter, lint
eryk-dsai Sep 5, 2023
814131d
removal of the redundant notebook cell
eryk-dsai Sep 5, 2023
bd6e2fe
moved StoppingCriteria import to method
eryk-dsai Sep 5, 2023
35b5e09
fixing type annotation
eryk-dsai Sep 5, 2023
1228bfc
fixing Enum tests
eryk-dsai Sep 5, 2023
926c02f
Editing the huggingface llama 2 notebook
eryk-dsai Sep 5, 2023
8a8a03a
Merge branch 'master' into llama2-chat
eryk-dsai Sep 5, 2023
964b579
typos, better name for custom StoppingCriteria subclass
eryk-dsai Sep 13, 2023
87597a1
Generic Hugging Face Pipeline Chat Model
eryk-dsai Sep 13, 2023
7756700
Merge branch 'langchain-ai:master' into llama2-chat
eryk-dsai Oct 6, 2023
d9e9ef3
simplifying HF Chat Model, by making use of HF Chat Templates
eryk-dsai Oct 10, 2023
cf203b9
removing incorrect check from validate_environment method
eryk-dsai Oct 10, 2023
369 changes: 369 additions & 0 deletions docs/extras/integrations/chat/huggingface_pipeline.ipynb
@@ -0,0 +1,369 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Hugging Face Pipelines as LangChain Chat Model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook demonstrates how to use Hugging Face models as LangChain Chat models, using the Llama 2 Chat model as an example. We use the Hugging Face tokenizer's 'apply_chat_template' method to handle different instruction tuned models with different prompting templates. If you want to change the prompt templateing behavior, you can find instructions in the Hugging Face [guide](https://huggingface.co/docs/transformers/main/en/chat_templating)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Hugging Face imports:\n",
"import torch\n",
"from transformers import (\n",
" AutoModelForCausalLM,\n",
" AutoTokenizer,\n",
" BitsAndBytesConfig,\n",
" pipeline,\n",
")\n",
"\n",
"# LangChain imports:\n",
"from langchain.chat_models import ChatHuggingFacePipeline\n",
"from langchain.schema import AIMessage, HumanMessage, SystemMessage"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook assumes that you were granted with access to the Llama 2 models in the Hugging Face models hub. To use the model locally, you need to be [logged in](https://huggingface.co/docs/huggingface_hub/quick-start#login) with a Hugging Face account. \n",
"\n",
"To log in using CLI run the following command in your terminal:\n",
"```\n",
"huggingface-cli login\n",
"```\n",
"or using an environment variable\n",
"```\n",
"huggingface-cli login --token $HUGGINGFACE_TOKEN\n",
"```"
]
},
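{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, you can log in from Python with `huggingface_hub.login`. The sketch below is optional and assumes that a token is available in the `HUGGINGFACE_TOKEN` environment variable:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# import os\n",
"# from huggingface_hub import login\n",
"#\n",
"# # Assumes HUGGINGFACE_TOKEN is set in the environment:\n",
"# login(token=os.environ[\"HUGGINGFACE_TOKEN\"])"
]
},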
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating Hugging Face Pipeline instance:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following section loads the 7b version of the Llama 2 Chat model and uses the `bitsandbytes` library to load a model in 4bit using NF4 quantization with double quantization and compute dtype bfloat16, which speeds up the underlying matrix multiplications.\n",
"\n",
"More information about these techniques can be found at: [link](https://huggingface.co/blog/4bit-transformers-bitsandbytes)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"model_name = \"meta-llama/Llama-2-7b-chat-hf\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To load the model in 4bit, make sure that the `accelerate`, `transformers` and `bitsandbytes` libraries are installed:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# !pip install -q -U bitsandbytes\n",
"# !pip install -q -U git+https://github.com/huggingface/transformers.git\n",
"# !pip install -q -U git+https://github.com/huggingface/peft.git\n",
"# !pip install -q -U git+https://github.com/huggingface/accelerate.git"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"bnb_config = BitsAndBytesConfig(\n",
" load_in_4bit=True,\n",
" bnb_4bit_quant_type=\"nf4\",\n",
" bnb_4bit_use_double_quant=True,\n",
" bnb_4bit_compute_dtype=torch.bfloat16,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "8f5fbc7100b445f98d363702e53692fd",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
"# disabling the default System Message of the Llama model \n",
"tokenizer.use_default_system_prompt = False\n",
"\n",
"model_4bit = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config, device_map=\"auto\")"
]
},
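{
"cell_type": "markdown",
"metadata": {},
"source": [
"As mentioned above, the chat model relies on the tokenizer's `apply_chat_template` method to turn a list of messages into a single prompt string. The sketch below shows what the Llama 2 chat template produces for a short conversation; the message-dict format is the one expected by `apply_chat_template`, and the exact output depends on your `transformers` version and the model's template:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: inspect the prompt string that the chat template produces.\n",
"chat_messages = [\n",
"    {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n",
"    {\"role\": \"user\", \"content\": \"Tell me a joke.\"},\n",
"]\n",
"prompt = tokenizer.apply_chat_template(\n",
"    chat_messages, tokenize=False, add_generation_prompt=True\n",
")\n",
"print(prompt)"
]
},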
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"pipe = pipeline(\n",
" \"text-generation\",\n",
" model=model_4bit,\n",
" tokenizer=tokenizer,\n",
" torch_dtype=torch.float16,\n",
" device_map=\"auto\",\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Initializing the Chat Model instance"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"chat = ChatHuggingFacePipeline(pipeline=pipe)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Besides defining arguments for `Pipeline` initialization, we can also control the generation process, by enabling sampling, chaning temperature or defining maximum length of single generation."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"# Generation kwargs:\n",
"pipeline_kwargs = {\n",
" \"do_sample\": True,\n",
" \"top_p\": 0.95,\n",
" \"temperature\": 0.7,\n",
" \"eos_token_id\": tokenizer.eos_token_id,\n",
" \"max_length\": 512, \n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Single calls:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can get chat completions by passing one or more messages to the chat model. The response will be a message:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Sure, I'd be happy to help! Here's the translation of \"I love programming\" from English to French:\n",
"Je adore le programming.\n",
"\n",
"I hope that helps! Let me know if you have any other sentences you'd like me to translate.\n"
]
}
],
"source": [
"messages = [\n",
" SystemMessage(\n",
" content=\"You are a helpful assistant that translates English to French.\"\n",
" ),\n",
" HumanMessage(\n",
" content=\"Translate this sentence from English to French. I love programming.\"\n",
" ),\n",
"]\n",
"result = chat(messages, **pipeline_kwargs)\n",
"print(result.content)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Single calls with stop words"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By utilizing Hugging Face [Stopping Criteria](https://huggingface.co/docs/transformers/internal/generation_utils#transformers.StoppingCriteria) under the hood, we can provide phrases that, if generated by the model, will cause the generation process to stop."
]
},
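{
"cell_type": "markdown",
"metadata": {},
"source": [
"Under the hood, stop words can be implemented with a custom `StoppingCriteria` subclass. The sketch below is illustrative rather than the exact class used by `ChatHuggingFacePipeline`: it stops generation as soon as any stop phrase appears in the newly generated text. Criteria like this are passed to a `transformers` pipeline call via `stopping_criteria=StoppingCriteriaList([...])`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from transformers import StoppingCriteria\n",
"\n",
"\n",
"class StopOnWords(StoppingCriteria):\n",
"    \"\"\"Illustrative sketch: stop once any stop word appears in the generation.\"\"\"\n",
"\n",
"    def __init__(self, stop_words, tokenizer, prompt_length):\n",
"        self.stop_words = stop_words\n",
"        self.tokenizer = tokenizer\n",
"        # Number of prompt tokens to skip when decoding:\n",
"        self.prompt_length = prompt_length\n",
"\n",
"    def __call__(self, input_ids, scores, **kwargs):\n",
"        # Decode only the newly generated tokens, not the prompt:\n",
"        text = self.tokenizer.decode(input_ids[0][self.prompt_length:])\n",
"        return any(word in text for word in self.stop_words)"
]
},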
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Of course! Artificial\n"
]
}
],
"source": [
"messages = [\n",
" SystemMessage(\n",
" content=\"You are a helpful assistant.\"\n",
" ),\n",
" HumanMessage(\n",
" content=\"Tell me the history of AI.\"\n",
" ),\n",
"]\n",
"result = chat(messages, stop=[\"Artificial\", \"Inteligence\"], **pipeline_kwargs)\n",
"print(result.content)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Batch calls:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also go one step further and generate completions for multiple sets of messages:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"batch_messages = [\n",
" [\n",
" SystemMessage(content=\"You are a helpful assistant that translates English to French.\"),\n",
" HumanMessage(content=\"I love programming.\")\n",
" ],\n",
" [\n",
" SystemMessage(content=\"You are a helpful assistant that translates English to French.\"),\n",
" HumanMessage(content=\"I love artificial intelligence.\")\n",
" ],\n",
"]\n",
"result = chat.generate(batch_messages)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Response #0:\n",
" Great! \"Programmation\" is the French word for \"programming\".\n",
"\n",
"So, you love programmation? (programme)\n",
"\n",
"Response #1:\n",
" \"Je suis heureux que vous aimiez l'intelligence artificielle.\" (I am happy that you love artificial intelligence.)\n",
"\n"
]
}
],
"source": [
"for i, generation in enumerate(result.generations):\n",
" print(f\"Response #{i}:\\n{generation[0].text}\", end=\"\\n\\n\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.10.12 ('langchain_venv': venv)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "d1d3a3c58a58885896c5459933a599607cdbb9917d7e1ad7516c8786c51f2dd2"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}
2 changes: 2 additions & 0 deletions libs/langchain/langchain/chat_models/__init__.py
@@ -27,6 +27,7 @@
from langchain.chat_models.fake import FakeListChatModel
from langchain.chat_models.fireworks import ChatFireworks
from langchain.chat_models.google_palm import ChatGooglePalm
from langchain.chat_models.huggingface_pipeline import ChatHuggingFacePipeline
from langchain.chat_models.human import HumanInputChatModel
from langchain.chat_models.javelin_ai_gateway import ChatJavelinAIGateway
from langchain.chat_models.jinachat import JinaChat
@@ -50,6 +51,7 @@
"ChatGooglePalm",
"ChatMLflowAIGateway",
"ChatOllama",
"ChatHuggingFacePipeline",
"ChatVertexAI",
"JinaChat",
"HumanInputChatModel",