From 0666fabbfb4de90b2d485f834dfb10827f08d816 Mon Sep 17 00:00:00 2001
From: Averi Kitsch <akitsch@google.com>
Date: Fri, 4 Apr 2025 17:08:40 -0700
Subject: [PATCH] docs(migration): Add migration script

---
 README.md                                     |   2 +-
 .../migrate_pgvector_to_pgvectorstore.ipynb   | 329 ++++++++++++++++++
 2 files changed, 330 insertions(+), 1 deletion(-)
 create mode 100644 examples/migrate_pgvector_to_pgvectorstore.ipynb

diff --git a/README.md b/README.md
index 0b73949..af1c810 100644
--- a/README.md
+++ b/README.md
@@ -31,7 +31,7 @@ pip install -U langchain-postgres
 > See example for the [PGVector vectorstore here](https://github.com/langchain-ai/langchain-postgres/blob/main/examples/vectorstore.ipynb)
 `PGVector` is being deprecated. Please migrate to `PGVectorStore`.
 `PGVectorStore` is used for improved performance and manageability.
-See the [migration guide](https://github.com/langchain-ai/langchain-postgres/blob/main/examples/migrate_pgvector_to_pgvectorstore.md) for details on how to migrate from `PGVector` to `PGVectorStore`.
+See the [migration script](https://github.com/langchain-ai/langchain-postgres/blob/main/examples/migrate_pgvector_to_pgvectorstore.ipynb) for details on how to migrate from `PGVector` to `PGVectorStore`.
 
 > [!TIP]
 > All synchronous functions have corresponding asynchronous functions
diff --git a/examples/migrate_pgvector_to_pgvectorstore.ipynb b/examples/migrate_pgvector_to_pgvectorstore.ipynb
new file mode 100644
index 0000000..afe631a
--- /dev/null
+++ b/examples/migrate_pgvector_to_pgvectorstore.ipynb
@@ -0,0 +1,329 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Migrate a `PGVector` vector store to `PGVectorStore`\n",
+    "\n",
+    "This guide shows how to migrate from the [`PGVector`](https://github.com/langchain-ai/langchain-postgres/blob/main/langchain_postgres/vectorstores.py) vector store class to the [`PGVectorStore`](https://github.com/langchain-ai/langchain-postgres/blob/main/langchain_postgres/vectorstore.py) class.\n",
+    "\n",
+    "## Why migrate?\n",
+    "\n",
+    "This guide explains how to migrate your vector data from a PGVector-style database (two tables) to an PGVectoStore-style database (one table per collection) for improved performance and manageability.\n",
+    "\n",
+    "Migrating to the PGVectorStore interface provides the following benefits:\n",
+    "\n",
+    "- **Simplified management**: A single table contains data corresponding to a single collection, making it easier to query, update, and maintain.\n",
+    "- **Improved metadata handling**: It stores metadata in columns instead of JSON, resulting in significant performance improvements.\n",
+    "- **Schema flexibility**: The interface allows users to add tables into any database schema.\n",
+    "- **Improved performance**: The single-table schema can lead to faster query execution, especially for large collections.\n",
+    "- **Clear separation**: Clearly separate table and extension creation, allowing for distinct permissions and streamlined workflows.\n",
+    "- **Secure Connections:** The PGVectorStore interface creates a secure connection pool that can be easily shared across your application using the `engine` object.\n",
+    "\n",
+    "## Migration process\n",
+    "\n",
+    "> **_NOTE:_**  The langchain-core library is installed to use the Fake embeddings service. To use a different embedding service, you'll need to install the appropriate library for your chosen provider. Choose embeddings services from [LangChain's Embedding models](https://python.langchain.com/v0.2/docs/integrations/text_embedding/)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "IR54BmgvdHT_"
+   },
+   "source": [
+    "### Library Installation\n",
+    "Install the integration library, `langchain-postgres`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/",
+     "height": 1000
+    },
+    "id": "0ZITIDE160OD",
+    "outputId": "e184bc0d-6541-4e0a-82d2-1e216db00a2d"
+   },
+   "outputs": [],
+   "source": [
+    "%pip install --upgrade --quiet  langchain-postgres langchain-core SQLAlchemy"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f8f2830ee9ca1e01",
+   "metadata": {
+    "id": "f8f2830ee9ca1e01"
+   },
+   "source": [
+    "## Data Migration"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "OMvzMWRrR6n7",
+   "metadata": {
+    "id": "OMvzMWRrR6n7"
+   },
+   "source": [
+    "### Set the postgres connection url\n",
+    "\n",
+    "`PGVectorStore` can be used with the `asyncpg` and `psycopg3` drivers."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "irl7eMFnSPZr",
+   "metadata": {
+    "id": "irl7eMFnSPZr"
+   },
+   "outputs": [],
+   "source": [
+    "# @title Set Your Values Here { display-mode: \"form\" }\n",
+    "POSTGRES_USER = \"langchain\"  # @param {type: \"string\"}\n",
+    "POSTGRES_PASSWORD = \"langchain\"  # @param {type: \"string\"}\n",
+    "POSTGRES_HOST = \"localhost\"  # @param {type: \"string\"}\n",
+    "POSTGRES_PORT = \"6024\"  # @param {type: \"string\"}\n",
+    "POSTGRES_DB = \"langchain\"  # @param {type: \"string\"}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "QuQigs4UoFQ2",
+   "metadata": {
+    "id": "QuQigs4UoFQ2"
+   },
+   "source": [
+    "### PGEngine Connection Pool\n",
+    "\n",
+    "One of the requirements and arguments to establish PostgreSQL as a vector store is a `PGEngine` object. The `PGEngine`  configures a shared connection pool  to your Postgres database. This is an industry best practice to manage number of connections and to reduce latency through cached database connections.\n",
+    "\n",
+    "To create a `PGEngine` using `PGEngine.from_connection_string()` you need to provide:\n",
+    "\n",
+    "1. `url` : Connection string using the `postgresql+asyncpg` driver.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Note:** This tutorial demonstrates the async interface. All async methods have corresponding sync methods."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# See docker command above to launch a Postgres instance with pgvector enabled.\n",
+    "CONNECTION_STRING = (\n",
+    "    f\"postgresql+asyncpg://{POSTGRES_USER}:{POSTGRES_PASSWORD}@{POSTGRES_HOST}\"\n",
+    "    f\":{POSTGRES_PORT}/{POSTGRES_DB}\"\n",
+    ")\n",
+    "# To use psycopg3 driver, set your connection string to `postgresql+psycopg://`"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain_postgres import PGEngine\n",
+    "\n",
+    "engine = PGEngine.from_connection_string(url=CONNECTION_STRING)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To create a `PGEngine` using `PGEngine.from_engine()` you need to provide:\n",
+    "\n",
+    "1. `engine` : An object of `AsyncEngine`"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from sqlalchemy.ext.asyncio import create_async_engine\n",
+    "\n",
+    "# Create an SQLAlchemy Async Engine\n",
+    "pool = create_async_engine(\n",
+    "    CONNECTION_STRING,\n",
+    ")\n",
+    "\n",
+    "engine = PGEngine.from_engine(engine=pool)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Get all collections\n",
+    "\n",
+    "This script migrates each collection to a new Vector Store table."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain_postgres.utils.pgvector_migrator import alist_pgvector_collection_names\n",
+    "\n",
+    "all_collection_names = await alist_pgvector_collection_names(engine)\n",
+    "print(all_collection_names)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "D9Xs2qhm6X56"
+   },
+   "source": [
+    "### Create a new table(s) to migrate existing data\n",
+    "The `PGVectorStore` class requires a database table. The `PGEngine` engine has a helper method `ainit_vectorstore_table()` that can be used to create a table with the proper schema for you."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "You can also specify a schema name by passing `schema_name` wherever you pass `table_name`. Eg:\n",
+    "\n",
+    "```python\n",
+    "SCHEMA_NAME=\"my_schema\"\n",
+    "\n",
+    "await engine.ainit_vectorstore_table(\n",
+    "    table_name=TABLE_NAME,\n",
+    "    vector_size=768,\n",
+    "    schema_name=SCHEMA_NAME,    # Default: \"public\"\n",
+    ")\n",
+    "```\n",
+    "\n",
+    "When creating your vectorstore table, you have the flexibility to define custom metadata and ID columns. This is particularly useful for:\n",
+    "\n",
+    "- **Filtering**: Metadata columns allow you to easily filter your data within the vectorstore. For example, you might store the document source, date, or author as metadata for efficient retrieval.\n",
+    "- **Non-UUID Identifiers**: By default, the id_column uses UUIDs. If you need to use a different type of ID (e.g., an integer or string), you can define a custom id_column.\n",
+    "\n",
+    "```python\n",
+    "metadata_columns = [\n",
+    "    Column(f\"col_0_{collection_name}\", \"VARCHAR\"),\n",
+    "    Column(f\"col_1_{collection_name}\", \"VARCHAR\"),\n",
+    "]\n",
+    "engine.init_vectorstore_table(\n",
+    "    table_name=\"destination_table\",\n",
+    "    vector_size=VECTOR_SIZE,\n",
+    "    metadata_columns=metadata_columns,\n",
+    "    id_column=Column(\"langchain_id\", \"VARCHAR\"),\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "avlyHEMn6gzU"
+   },
+   "outputs": [],
+   "source": [
+    "# Vertex AI embeddings uses a vector size of 768.\n",
+    "# Adjust this according to your embeddings service.\n",
+    "VECTOR_SIZE = 768\n",
+    "for collection_name in all_collection_names:\n",
+    "    engine.init_vectorstore_table(\n",
+    "        table_name=collection_name,\n",
+    "        vector_size=VECTOR_SIZE,\n",
+    "    )"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Create a vector store and migrate data\n",
+    "\n",
+    "> **_NOTE:_** The `FakeEmbeddings` embedding service is only used to initialize a vector store object, not to generate any embeddings. The embeddings are copied directly from the PGVector table.\n",
+    "\n",
+    "If you have any customizations on the metadata or the id columns, add them to the vector store as follows:\n",
+    "\n",
+    "```python\n",
+    "from langchain_postgres import PGVectorStore\n",
+    "from langchain_core.embeddings import FakeEmbeddings\n",
+    "\n",
+    "destination_vector_store = PGVectorStore.create_sync(\n",
+    "    engine,\n",
+    "    embedding_service=FakeEmbeddings(size=VECTOR_SIZE),\n",
+    "    table_name=DESTINATION_TABLE_NAME,\n",
+    "    metadata_columns=[col.name for col in metadata_columns],\n",
+    "    id_column=\"langchain_id\",\n",
+    ")\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "z-AZyzAQ7bsf"
+   },
+   "outputs": [],
+   "source": [
+    "from langchain_core.embeddings import FakeEmbeddings\n",
+    "from langchain_postgres import PGVectorStore\n",
+    "from langchain_postgres.utils.pgvector_migrator import amigrate_pgvector_collection\n",
+    "\n",
+    "for collection_name in all_collection_names:\n",
+    "    destination_vector_store = await PGVectorStore.create(\n",
+    "        engine,\n",
+    "        embedding_service=FakeEmbeddings(size=VECTOR_SIZE),\n",
+    "        table_name=collection_name,\n",
+    "    )\n",
+    "\n",
+    "    await amigrate_pgvector_collection(\n",
+    "        engine,\n",
+    "        # Set collection name here\n",
+    "        collection_name=collection_name,\n",
+    "        vector_store=destination_vector_store,\n",
+    "        # This deletes data from the original table upon migration. You can choose to turn it off.\n",
+    "        # The data will only be deleted from the original table once all of it has been successfully copied to the destination table.\n",
+    "        delete_pg_collection=True,\n",
+    "    )"
+   ]
+  }
+ ],
+ "metadata": {
+  "colab": {
+   "provenance": [],
+   "toc_visible": true
+  },
+  "kernelspec": {
+   "display_name": "Python 3",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}