From a228f340f1d75654f6a84921c26bb683b9691507 Mon Sep 17 00:00:00 2001 From: Manuel Soria Date: Wed, 1 Nov 2023 20:21:34 -0300 Subject: [PATCH] Semantic search within postgreSQL using pgvector (#12365) Cookbook showing how to incoporate RAG search within a postgreSQL database using pgvector. --------- Co-authored-by: Lance Martin Co-authored-by: Bagatur Co-authored-by: Erick Friis --- cookbook/README.md | 1 + cookbook/retrieval_in_sql.ipynb | 688 ++++++++++++++++++++++++++++++++ 2 files changed, 689 insertions(+) create mode 100644 cookbook/retrieval_in_sql.ipynb diff --git a/cookbook/README.md b/cookbook/README.md index 271c784e97d79..63487ed65ef3f 100644 --- a/cookbook/README.md +++ b/cookbook/README.md @@ -42,6 +42,7 @@ Notebook | Description [plan_and_execute_agent.ipynb](https://github.com/langchain-ai/langchain/tree/master/cookbook/plan_and_execute_agent.ipynb) | Create plan-and-execute agents that accomplish objectives by planning tasks with a language model (llm) and executing them with a separate agent. [press_releases.ipynb](https://github.com/langchain-ai/langchain/tree/master/cookbook/press_releases.ipynb) | Retrieve and query company press release data powered by [Kay.ai](https://kay.ai). [program_aided_language_model.i...](https://github.com/langchain-ai/langchain/tree/master/cookbook/program_aided_language_model.ipynb) | Implement program-aided language models as described in the provided research paper. +[retrieval_in_sql.ipynb](https://github.com/langchain-ai/langchain/tree/master/cookbook/retrieval_in_sql.ipynb) | Perform retrieval-augmented-generation (rag) on a PostgreSQL database using pgvector. [sales_agent_with_context.ipynb](https://github.com/langchain-ai/langchain/tree/master/cookbook/sales_agent_with_context.ipynb) | Implement a context-aware ai sales agent, salesgpt, that can have natural sales conversations, interact with other systems, and use a product knowledge base to discuss a company's offerings. 
[self_query_hotel_search.ipynb](https://github.com/langchain-ai/langchain/tree/master/cookbook/self_query_hotel_search.ipynb) | Build a hotel room search feature with self-querying retrieval, using a specific hotel recommendation dataset. [smart_llm.ipynb](https://github.com/langchain-ai/langchain/tree/master/cookbook/smart_llm.ipynb) | Implement a smartllmchain, a self-critique chain that generates multiple output proposals, critiques them to find the best one, and then improves upon it to produce a final output. diff --git a/cookbook/retrieval_in_sql.ipynb b/cookbook/retrieval_in_sql.ipynb new file mode 100644 index 0000000000000..bc720fa92260f --- /dev/null +++ b/cookbook/retrieval_in_sql.ipynb @@ -0,0 +1,688 @@ +{ + "cells": [ + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Incorporating semantic similarity in tabular databases\n", + "\n", + "In this notebook we will cover how to run semantic search over a specific table column within a single SQL query, combining tabular query with RAG.\n", + "\n", + "\n", + "### Overall workflow\n", + "\n", + "1. Generating embeddings for a specific column\n", + "2. Storing the embeddings in a new column (if column has low cardinality, it's better to use another table containing unique values and their embeddings)\n", + "3. Querying using standard SQL queries with [PGVector](https://github.com/pgvector/pgvector) extension which allows using L2 distance (`<->`), Cosine distance (`<=>` or cosine similarity using `1 - <=>`) and Inner product (`<#>`)\n", + "4. Running standard SQL query\n", + "\n", + "### Requirements\n", + "\n", + "We will need a PostgreSQL database with [pgvector](https://github.com/pgvector/pgvector) extension enabled. For this example, we will use a `Chinook` database using a local PostgreSQL server." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import getpass\n", + "\n", + "os.environ[\"OPENAI_API_KEY\"] = os.environ.get(\"OPENAI_API_KEY\") or getpass.getpass(\n", + " \"OpenAI API Key:\"\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.sql_database import SQLDatabase\n", + "from langchain.chat_models import ChatOpenAI\n", + "\n", + "CONNECTION_STRING = \"postgresql+psycopg2://postgres:test@localhost:5432/vectordb\" # Replace with your own\n", + "db = SQLDatabase.from_uri(CONNECTION_STRING)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Embedding the song titles" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For this example, we will run queries based on semantic meaning of song titles. In order to do this, let's start by adding a new column in the table for storing the embeddings:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "# db.run('ALTER TABLE \"Track\" ADD COLUMN \"embeddings\" vector;')" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's generate the embedding for each *track title* and store it as a new column in our \"Track\" table" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.embeddings import OpenAIEmbeddings\n", + "\n", + "embeddings_model = OpenAIEmbeddings()" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "3503" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "tracks = db.run('SELECT \"Name\" FROM \"Track\"')\n", + "song_titles = [s[0] 
for s in eval(tracks)]\n", + "title_embeddings = embeddings_model.embed_documents(song_titles)\n", + "len(title_embeddings)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now let's insert the embeddings into the new column of our table" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "from tqdm import tqdm\n", + "\n", + "for i in tqdm(range(len(title_embeddings))):\n", + " title = song_titles[i].replace(\"'\", \"''\")\n", + " embedding = title_embeddings[i]\n", + " sql_command = (\n", + " f'UPDATE \"Track\" SET \"embeddings\" = ARRAY{embedding} WHERE \"Name\" ='\n", + " + f\"'{title}'\"\n", + " )\n", + " db.run(sql_command)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can test the semantic search by running the following query:" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'[(\"Tomorrow\\'s Dream\",), (\\'Remember Tomorrow\\',), (\\'Remember Tomorrow\\',), (\\'The Best Is Yet To Come\\',), (\"Thinking \\'Bout Tomorrow\",)]'" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "embeded_title = embeddings_model.embed_query(\"hope about the future\")\n", + "query = (\n", + " 'SELECT \"Track\".\"Name\" FROM \"Track\" WHERE \"Track\".\"embeddings\" IS NOT NULL ORDER BY \"embeddings\" <-> '\n", + " + f\"'{embeded_title}' LIMIT 5\"\n", + ")\n", + "db.run(query)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Creating the SQL Chain" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's start by defining useful functions to get info from database and running the query:" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + 
"source": [ + "def get_schema(_):\n", + " return db.get_table_info()\n", + "\n", + "\n", + "def run_query(query):\n", + " return db.run(query)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now let's build the **prompt** we will use. This prompt is an extension from [text-to-postgres-sql](https://smith.langchain.com/hub/jacob/text-to-postgres-sql?organizationId=f9b614b8-5c3a-4e7c-afbc-6d7ad4fd8892) prompt" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.prompts import ChatPromptTemplate\n", + "\n", + "template = \"\"\"You are a Postgres expert. Given an input question, first create a syntactically correct Postgres query to run, then look at the results of the query and return the answer to the input question.\n", + "Unless the user specifies in the question a specific number of examples to obtain, query for at most 5 results using the LIMIT clause as per Postgres. You can order the results to return the most informative data in the database.\n", + "Never query for all columns from a table. You must query only the columns that are needed to answer the question. Wrap each column name in double quotes (\") to denote them as delimited identifiers.\n", + "Pay attention to use only the column names you can see in the tables below. Be careful to not query for columns that do not exist. Also, pay attention to which column is in which table.\n", + "Pay attention to use date('now') function to get the current date, if the question involves \"today\".\n", + "\n", + "You can use an extra extension which allows you to run semantic similarity using <-> operator on tables containing columns named \"embeddings\".\n", + "<-> operator can ONLY be used on embeddings columns.\n", + "The embeddings value for a given row typically represents the semantic meaning of that row.\n", + "The vector represents an embedding representation of the question, given below. 
\n", + "Do NOT fill in the vector values directly, but rather specify a `[search_word]` placeholder, which should contain the word that would be embedded for filtering.\n", + "For example, if the user asks for songs about 'the feeling of loneliness' the query could be:\n", + "'SELECT \"[whatever_table_name]\".\"SongName\" FROM \"[whatever_table_name]\" ORDER BY \"embeddings\" <-> '[loneliness]' LIMIT 5'\n", + "\n", + "Use the following format:\n", + "\n", + "Question: \n", + "SQLQuery: \n", + "SQLResult: \n", + "Answer: \n", + "\n", + "Only use the following tables:\n", + "\n", + "{schema}\n", + "\"\"\"\n", + "\n", + "\n", + "prompt = ChatPromptTemplate.from_messages(\n", + " [(\"system\", template), (\"human\", \"{question}\")]\n", + ")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And we can create the chain using **[LangChain Expression Language](https://python.langchain.com/docs/expression_language/)**:" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.chat_models import ChatOpenAI\n", + "from langchain.schema.output_parser import StrOutputParser\n", + "from langchain.schema.runnable import RunnablePassthrough\n", + "\n", + "db = SQLDatabase.from_uri(\n", + " CONNECTION_STRING\n", + ") # We reconnect to db so the new columns are loaded as well.\n", + "llm = ChatOpenAI(model_name=\"gpt-4\", temperature=0)\n", + "\n", + "sql_query_chain = (\n", + " RunnablePassthrough.assign(schema=get_schema)\n", + " | prompt\n", + " | llm.bind(stop=[\"\\nSQLResult:\"])\n", + " | StrOutputParser()\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'SQLQuery: SELECT \"Track\".\"Name\" FROM \"Track\" JOIN \"Genre\" ON \"Track\".\"GenreId\" = \"Genre\".\"GenreId\" WHERE \"Genre\".\"Name\" = \\'Rock\\' ORDER BY \"Track\".\"embeddings\" <-> \\'[dispair]\\' LIMIT 5'" + 
] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sql_query_chain.invoke(\n", + " {\n", + " \"question\": \"Which are the 5 rock songs with titles about deep feeling of dispair?\"\n", + " }\n", + ")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This chain simply generates the query. Now we will create the full chain that also handles the execution and the final result for the user:" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "import re\n", + "from langchain.schema.runnable import RunnableLambda\n", + "\n", + "\n", + "def replace_brackets(match):\n", + " words_inside_brackets = match.group(1).split(\", \")\n", + " embedded_words = [\n", + " str(embeddings_model.embed_query(word)) for word in words_inside_brackets\n", + " ]\n", + " return \"', '\".join(embedded_words)\n", + "\n", + "\n", + "def get_query(query):\n", + " sql_query = re.sub(r\"\\[([\\w\\s,]+)\\]\", replace_brackets, query)\n", + " return sql_query\n", + "\n", + "\n", + "template = \"\"\"Based on the table schema below, question, sql query, and sql response, write a natural language response:\n", + "{schema}\n", + "\n", + "Question: {question}\n", + "SQL Query: {query}\n", + "SQL Response: {response}\"\"\"\n", + "\n", + "prompt = ChatPromptTemplate.from_messages(\n", + " [(\"system\", template), (\"human\", \"{question}\")]\n", + ")\n", + "\n", + "full_chain = (\n", + " RunnablePassthrough.assign(query=sql_query_chain)\n", + " | RunnablePassthrough.assign(\n", + " schema=get_schema,\n", + " response=RunnableLambda(lambda x: db.run(get_query(x[\"query\"]))),\n", + " )\n", + " | prompt\n", + " | llm\n", + ")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Using the Chain" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Example 
1: Filtering a column based on semantic meaning" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's say we want to retrieve songs that express `deep feeling of dispair`, but filtering based on genre:" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "AIMessage(content=\"The 5 rock songs with titles that convey a deep feeling of despair are 'Sea Of Sorrow', 'Surrender', 'Indifference', 'Hard Luck Woman', and 'Desire'.\")" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "full_chain.invoke(\n", + " {\n", + " \"question\": \"Which are the 5 rock songs with titles about deep feeling of dispair?\"\n", + " }\n", + ")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "What is substantially different in implementing this method is that we have combined:\n", + "- Semantic search (songs that have titles with some semantic meaning)\n", + "- Traditional tabular querying (running JOIN statements to filter track based on genre)\n", + "\n", + "This is something we _could_ potentially achieve using metadata filtering, but it's more complex to do so (we would need to use a vector database containing the embeddings, and use metadata filtering based on genre).\n", + "\n", + "However, for other use cases metadata filtering **wouldn't be enough**." 
+ ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Example 2: Combining filters" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "AIMessage(content=\"The three albums which have the most amount of songs in the top 150 saddest songs are 'International Superhits' with 5 songs, 'Ten' with 4 songs, and 'Album Of The Year' with 3 songs.\")" + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "full_chain.invoke(\n", + " {\n", + " \"question\": \"I want to know the 3 albums which have the most amount of songs in the top 150 saddest songs\"\n", + " }\n", + ")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "So we have the 3 albums with the most songs among the top 150 saddest ones. This **wouldn't** be possible using only standard metadata filtering. Without this _hybrid query_, we would need some postprocessing to get the result.\n", + "\n", + "Another similar example:" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "AIMessage(content=\"The 6 albums with the shortest titles that contain songs which are in the 20 saddest song list are 'Ten', 'Core', 'Big Ones', 'One By One', 'Black Album', and 'Miles Ahead'.\")" + ] + }, + "execution_count": 30, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "full_chain.invoke(\n", + " {\n", + " \"question\": \"I need the 6 albums with shortest title, as long as they contain songs which are in the 20 saddest song list.\"\n", + " }\n", + ")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's see what the query looks like to double check:" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": {}, + "outputs": [ + { + 
"name": "stdout", + "output_type": "stream", + "text": [ + "WITH \"SadSongs\" AS (\n", + " SELECT \"TrackId\" FROM \"Track\" \n", + " ORDER BY \"embeddings\" <-> '[sad]' LIMIT 20\n", + "),\n", + "\"SadAlbums\" AS (\n", + " SELECT DISTINCT \"AlbumId\" FROM \"Track\" \n", + " WHERE \"TrackId\" IN (SELECT \"TrackId\" FROM \"SadSongs\")\n", + ")\n", + "SELECT \"Album\".\"Title\" FROM \"Album\" \n", + "WHERE \"AlbumId\" IN (SELECT \"AlbumId\" FROM \"SadAlbums\") \n", + "ORDER BY \"title_len\" ASC \n", + "LIMIT 6\n" + ] + } + ], + "source": [ + "print(\n", + " sql_query_chain.invoke(\n", + " {\n", + " \"question\": \"I need the 6 albums with shortest title, as long as they contain songs which are in the 20 saddest song list.\"\n", + " }\n", + " )\n", + ")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Example 3: Combining two separate semantic searches\n", + "\n", + "One interesting aspect of this approach which is **substantially different from using standar RAG** is that we can even **combine** two semantic search filters:\n", + "- _Get 5 saddest songs..._\n", + "- _**...obtained from albums with \"lovely\" titles**_\n", + "\n", + "This could generalize to **any kind of combined RAG** (paragraphs discussing _X_ topic belonging from books about _Y_, replies to a tweet about _ABC_ topic that express _XYZ_ feeling)\n", + "\n", + "We will combine semantic search on songs and album titles, so we need to do the same for `Album` table:\n", + "1. Generate the embeddings\n", + "2. 
Add them to the table as a new column (which we need to add in the table)" + ] + }, + { + "cell_type": "code", + "execution_count": 60, + "metadata": {}, + "outputs": [], + "source": [ + "# db.run('ALTER TABLE \"Album\" ADD COLUMN \"embeddings\" vector;')" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 347/347 [00:01<00:00, 179.64it/s]\n" + ] + } + ], + "source": [ + "albums = db.run('SELECT \"Title\" FROM \"Album\"')\n", + "album_titles = [title[0] for title in eval(albums)]\n", + "album_title_embeddings = embeddings_model.embed_documents(album_titles)\n", + "for i in tqdm(range(len(album_title_embeddings))):\n", + " album_title = album_titles[i].replace(\"'\", \"''\")\n", + " album_embedding = album_title_embeddings[i]\n", + " sql_command = (\n", + " f'UPDATE \"Album\" SET \"embeddings\" = ARRAY{album_embedding} WHERE \"Title\" ='\n", + " + f\"'{album_title}'\"\n", + " )\n", + " db.run(sql_command)" + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/plain": [ + "\"[('Realize',), ('Morning Dance',), ('Into The Light',), ('New Adventures In Hi-Fi',), ('Miles Ahead',)]\"" + ] + }, + "execution_count": 45, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "embeded_title = embeddings_model.embed_query(\"hope about the future\")\n", + "query = (\n", + " 'SELECT \"Album\".\"Title\" FROM \"Album\" WHERE \"Album\".\"embeddings\" IS NOT NULL ORDER BY \"embeddings\" <-> '\n", + " + f\"'{embeded_title}' LIMIT 5\"\n", + ")\n", + "db.run(query)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we can combine both filters:" + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "metadata": {}, + "outputs": [], + "source": [ + "db = SQLDatabase.from_uri(\n", + " 
CONNECTION_STRING\n", + ") # We reconnect to db so the new columns are loaded as well." + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "AIMessage(content='The songs about breakouts obtained from the top 5 albums about love are \'Royal Orleans\', \"Nobody\'s Fault But Mine\", \'Achilles Last Stand\', \'For Your Life\', and \'Hots On For Nowhere\'.')" + ] + }, + "execution_count": 49, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "full_chain.invoke(\n", + " {\n", + " \"question\": \"I want to know songs about breakouts obtained from top 5 albums about love\"\n", + " }\n", + ")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This is something **different** that **couldn't be achieved** using standard metadata filtering over a vectordb." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.18" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}