From 0666fabbfb4de90b2d485f834dfb10827f08d816 Mon Sep 17 00:00:00 2001 From: Averi Kitsch Date: Fri, 4 Apr 2025 17:08:40 -0700 Subject: [PATCH] docs(migration): Add migration script --- README.md | 2 +- .../migrate_pgvector_to_pgvectorstore.ipynb | 329 ++++++++++++++++++ 2 files changed, 330 insertions(+), 1 deletion(-) create mode 100644 examples/migrate_pgvector_to_pgvectorstore.ipynb diff --git a/README.md b/README.md index 0b73949..af1c810 100644 --- a/README.md +++ b/README.md @@ -31,7 +31,7 @@ pip install -U langchain-postgres > See example for the [PGVector vectorstore here](https://github.com/langchain-ai/langchain-postgres/blob/main/examples/vectorstore.ipynb) `PGVector` is being deprecated. Please migrate to `PGVectorStore`. `PGVectorStore` is used for improved performance and manageability. -See the [migration guide](https://github.com/langchain-ai/langchain-postgres/blob/main/examples/migrate_pgvector_to_pgvectorstore.md) for details on how to migrate from `PGVector` to `PGVectorStore`. +See the [migration script](https://github.com/langchain-ai/langchain-postgres/blob/main/examples/migrate_pgvector_to_pgvectorstore.ipynb) for details on how to migrate from `PGVector` to `PGVectorStore`. 
> [!TIP] > All synchronous functions have corresponding asynchronous functions diff --git a/examples/migrate_pgvector_to_pgvectorstore.ipynb b/examples/migrate_pgvector_to_pgvectorstore.ipynb new file mode 100644 index 0000000..afe631a --- /dev/null +++ b/examples/migrate_pgvector_to_pgvectorstore.ipynb @@ -0,0 +1,329 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Migrate a `PGVector` vector store to `PGVectorStore`\n", + "\n", + "This guide shows how to migrate from the [`PGVector`](https://github.com/langchain-ai/langchain-postgres/blob/main/langchain_postgres/vectorstores.py) vector store class to the [`PGVectorStore`](https://github.com/langchain-ai/langchain-postgres/blob/main/langchain_postgres/vectorstore.py) class.\n", + "\n", + "## Why migrate?\n", + "\n", + "The migration moves your vector data from a PGVector-style database (two tables) to a PGVectorStore-style database (one table per collection) for improved performance and manageability.\n", + "\n", + "Migrating to the PGVectorStore interface provides the following benefits:\n", + "\n", + "- **Simplified management**: A single table contains data corresponding to a single collection, making it easier to query, update, and maintain.\n", + "- **Improved metadata handling**: Metadata is stored in columns instead of JSON, resulting in significant performance improvements.\n", + "- **Schema flexibility**: The interface allows users to add tables into any database schema.\n", + "- **Improved performance**: The single-table schema can lead to faster query execution, especially for large collections.\n", + "- **Clear separation**: Table creation and extension creation are separate steps, allowing for distinct permissions and streamlined workflows.\n", + "- **Secure connections**: The PGVectorStore interface creates a secure connection pool that can be easily shared across your application using the `engine` object.\n", + "\n", + "## Migration process\n", + "\n", + "> 
**_NOTE:_** The `langchain-core` library is installed to provide the `FakeEmbeddings` service. To use a different embedding service, you'll need to install the appropriate library for your chosen provider. Choose an embedding service from [LangChain's Embedding models](https://python.langchain.com/v0.2/docs/integrations/text_embedding/)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IR54BmgvdHT_" + }, + "source": [ + "### Library Installation\n", + "Install the integration library, `langchain-postgres`, along with `langchain-core` and `SQLAlchemy`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + }, + "id": "0ZITIDE160OD", + "outputId": "e184bc0d-6541-4e0a-82d2-1e216db00a2d" + }, + "outputs": [], + "source": [ + "%pip install --upgrade --quiet langchain-postgres langchain-core SQLAlchemy" + ] + }, + { + "cell_type": "markdown", + "id": "f8f2830ee9ca1e01", + "metadata": { + "id": "f8f2830ee9ca1e01" + }, + "source": [ + "## Data Migration" + ] + }, + { + "cell_type": "markdown", + "id": "OMvzMWRrR6n7", + "metadata": { + "id": "OMvzMWRrR6n7" + }, + "source": [ + "### Set the Postgres connection URL\n", + "\n", + "`PGVectorStore` can be used with the `asyncpg` and `psycopg3` drivers." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "irl7eMFnSPZr", + "metadata": { + "id": "irl7eMFnSPZr" + }, + "outputs": [], + "source": [ + "# @title Set Your Values Here { display-mode: \"form\" }\n", + "POSTGRES_USER = \"langchain\" # @param {type: \"string\"}\n", + "POSTGRES_PASSWORD = \"langchain\" # @param {type: \"string\"}\n", + "POSTGRES_HOST = \"localhost\" # @param {type: \"string\"}\n", + "POSTGRES_PORT = \"6024\" # @param {type: \"string\"}\n", + "POSTGRES_DB = \"langchain\" # @param {type: \"string\"}" + ] + }, + { + "cell_type": "markdown", + "id": "QuQigs4UoFQ2", + "metadata": { + "id": "QuQigs4UoFQ2" + }, + "source": [ + "### PGEngine Connection Pool\n", + "\n", + "Using PostgreSQL as a vector store requires a `PGEngine` object. The `PGEngine` configures a shared connection pool to your Postgres database. This is an industry best practice to manage the number of connections and to reduce latency through cached database connections.\n", + "\n", + "To create a `PGEngine` using `PGEngine.from_connection_string()` you need to provide:\n", + "\n", + "1. `url` : Connection string using the `postgresql+asyncpg` driver.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Note:** This tutorial demonstrates the async interface. All async methods have corresponding sync methods." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Requires a running Postgres instance with the pgvector extension enabled.\n", + "CONNECTION_STRING = (\n", + " f\"postgresql+asyncpg://{POSTGRES_USER}:{POSTGRES_PASSWORD}@{POSTGRES_HOST}\"\n", + " f\":{POSTGRES_PORT}/{POSTGRES_DB}\"\n", + ")\n", + "# To use the psycopg3 driver, set your connection string to `postgresql+psycopg://`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_postgres import PGEngine\n", + "\n", + "engine = PGEngine.from_connection_string(url=CONNECTION_STRING)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To create a `PGEngine` using `PGEngine.from_engine()` you need to provide:\n", + "\n", + "1. `engine` : An `AsyncEngine` object." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sqlalchemy.ext.asyncio import create_async_engine\n", + "\n", + "# Create an SQLAlchemy Async Engine\n", + "pool = create_async_engine(\n", + " CONNECTION_STRING,\n", + ")\n", + "\n", + "engine = PGEngine.from_engine(engine=pool)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Get all collections\n", + "\n", + "This script migrates each collection to a new vector store table." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_postgres.utils.pgvector_migrator import alist_pgvector_collection_names\n", + "\n", + "all_collection_names = await alist_pgvector_collection_names(engine)\n", + "print(all_collection_names)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "D9Xs2qhm6X56" + }, + "source": [ + "### Create new table(s) for the migrated data\n", + "The `PGVectorStore` class requires a database table. 
The `PGEngine` engine has a helper method `ainit_vectorstore_table()` that can be used to create a table with the proper schema for you." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can also specify a schema name by passing `schema_name` wherever you pass `table_name`. For example:\n", + "\n", + "```python\n", + "SCHEMA_NAME = \"my_schema\"\n", + "\n", + "await engine.ainit_vectorstore_table(\n", + " table_name=TABLE_NAME,\n", + " vector_size=768,\n", + " schema_name=SCHEMA_NAME, # Default: \"public\"\n", + ")\n", + "```\n", + "\n", + "When creating your vectorstore table, you have the flexibility to define custom metadata and ID columns. This is particularly useful for:\n", + "\n", + "- **Filtering**: Metadata columns allow you to easily filter your data within the vectorstore. For example, you might store the document source, date, or author as metadata for efficient retrieval.\n", + "- **Non-UUID Identifiers**: By default, the `id_column` uses UUIDs. If you need to use a different type of ID (e.g., an integer or string), you can define a custom `id_column`.\n", + "\n", + "```python\n", + "metadata_columns = [\n", + " Column(f\"col_0_{collection_name}\", \"VARCHAR\"),\n", + " Column(f\"col_1_{collection_name}\", \"VARCHAR\"),\n", + "]\n", + "await engine.ainit_vectorstore_table(\n", + " table_name=\"destination_table\",\n", + " vector_size=VECTOR_SIZE,\n", + " metadata_columns=metadata_columns,\n", + " id_column=Column(\"langchain_id\", \"VARCHAR\"),\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "avlyHEMn6gzU" + }, + "outputs": [], + "source": [ + "# Vertex AI embeddings use a vector size of 768.\n", + "# Adjust this according to your embeddings service.\n", + "VECTOR_SIZE = 768\n", + "for collection_name in all_collection_names:\n", + " await engine.ainit_vectorstore_table(\n", + " table_name=collection_name,\n", + " vector_size=VECTOR_SIZE,\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": 
{}, + "source": [ + "### Create a vector store and migrate data\n", + "\n", + "> **_NOTE:_** The `FakeEmbeddings` embedding service is only used to initialize a vector store object, not to generate any embeddings. The embeddings are copied directly from the PGVector table.\n", + "\n", + "If you customized the metadata or ID columns, pass them to the vector store as follows:\n", + "\n", + "```python\n", + "from langchain_postgres import PGVectorStore\n", + "from langchain_core.embeddings import FakeEmbeddings\n", + "\n", + "destination_vector_store = await PGVectorStore.create(\n", + " engine,\n", + " embedding_service=FakeEmbeddings(size=VECTOR_SIZE),\n", + " table_name=\"destination_table\",\n", + " metadata_columns=[col.name for col in metadata_columns],\n", + " id_column=\"langchain_id\",\n", + ")\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "z-AZyzAQ7bsf" + }, + "outputs": [], + "source": [ + "from langchain_core.embeddings import FakeEmbeddings\n", + "from langchain_postgres import PGVectorStore\n", + "from langchain_postgres.utils.pgvector_migrator import amigrate_pgvector_collection\n", + "\n", + "for collection_name in all_collection_names:\n", + " destination_vector_store = await PGVectorStore.create(\n", + " engine,\n", + " embedding_service=FakeEmbeddings(size=VECTOR_SIZE),\n", + " table_name=collection_name,\n", + " )\n", + "\n", + " await amigrate_pgvector_collection(\n", + " engine,\n", + " # Set collection name here\n", + " collection_name=collection_name,\n", + " vector_store=destination_vector_store,\n", + " # This deletes data from the original table upon migration. 
You can choose to turn it off.\n", + " # The data will only be deleted from the original table once all of it has been successfully copied to the destination table.\n", + " delete_pg_collection=True,\n", + " )" + ] + } + ], + "metadata": { + "colab": { + "provenance": [], + "toc_visible": true + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.3" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +}