docs(migration): Add migration script #178

Merged: 4 commits, Apr 10, 2025
README.md: 2 changes (1 addition & 1 deletion)
@@ -28,7 +28,7 @@ pip install -U langchain-postgres
> [!WARNING]
> In v0.0.14+, `PGVector` is deprecated. Please migrate to `PGVectorStore`
> for improved performance and manageability.
-> See the [migration guide](https://github.com/langchain-ai/langchain-postgres/blob/main/examples/migrate_pgvector_to_pgvectorstore.md) for details on how to migrate from `PGVector` to `PGVectorStore`.
+> See the [migration guide](https://github.com/langchain-ai/langchain-postgres/blob/main/examples/migrate_pgvector_to_pgvectorstore.ipynb) for details on how to migrate from `PGVector` to `PGVectorStore`.

### Documentation

examples/migrate_pgvector_to_pgvectorstore.ipynb: 329 changes (329 additions & 0 deletions)
@@ -0,0 +1,329 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Migrate a `PGVector` vector store to `PGVectorStore`\n",
"\n",
"This guide shows how to migrate from the [`PGVector`](https://github.com/langchain-ai/langchain-postgres/blob/main/langchain_postgres/vectorstores.py) vector store class to the [`PGVectorStore`](https://github.com/langchain-ai/langchain-postgres/blob/main/langchain_postgres/vectorstore.py) class.\n",
"\n",
"## Why migrate?\n",
"\n",
"This guide explains how to migrate your vector data from a PGVector-style database (two tables) to an PGVectoStore-style database (one table per collection) for improved performance and manageability.\n",
"\n",
"Migrating to the PGVectorStore interface provides the following benefits:\n",
"\n",
"- **Simplified management**: A single table contains data corresponding to a single collection, making it easier to query, update, and maintain.\n",
"- **Improved metadata handling**: It stores metadata in columns instead of JSON, resulting in significant performance improvements.\n",
"- **Schema flexibility**: The interface allows users to add tables into any database schema.\n",
"- **Improved performance**: The single-table schema can lead to faster query execution, especially for large collections.\n",
"- **Clear separation**: Clearly separate table and extension creation, allowing for distinct permissions and streamlined workflows.\n",
"- **Secure Connections:** The PGVectorStore interface creates a secure connection pool that can be easily shared across your application using the `engine` object.\n",
"\n",
"## Migration process\n",
"\n",
"> **_NOTE:_** The langchain-core library is installed to use the Fake embeddings service. To use a different embedding service, you'll need to install the appropriate library for your chosen provider. Choose embeddings services from [LangChain's Embedding models](https://python.langchain.com/v0.2/docs/integrations/text_embedding/)."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "IR54BmgvdHT_"
},
"source": [
"### Library Installation\n",
"Install the integration library, `langchain-postgres`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1000
},
"id": "0ZITIDE160OD",
"outputId": "e184bc0d-6541-4e0a-82d2-1e216db00a2d"
},
"outputs": [],
"source": [
"%pip install --upgrade --quiet langchain-postgres langchain-core SQLAlchemy"
]
},
{
"cell_type": "markdown",
"id": "f8f2830ee9ca1e01",
"metadata": {
"id": "f8f2830ee9ca1e01"
},
"source": [
"## Data Migration"
]
},
{
"cell_type": "markdown",
"id": "OMvzMWRrR6n7",
"metadata": {
"id": "OMvzMWRrR6n7"
},
"source": [
"### Set the postgres connection url\n",
"\n",
"`PGVectorStore` can be used with the `asyncpg` and `psycopg3` drivers."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "irl7eMFnSPZr",
"metadata": {
"id": "irl7eMFnSPZr"
},
"outputs": [],
"source": [
"# @title Set Your Values Here { display-mode: \"form\" }\n",
"POSTGRES_USER = \"langchain\" # @param {type: \"string\"}\n",
"POSTGRES_PASSWORD = \"langchain\" # @param {type: \"string\"}\n",
"POSTGRES_HOST = \"localhost\" # @param {type: \"string\"}\n",
"POSTGRES_PORT = \"6024\" # @param {type: \"string\"}\n",
"POSTGRES_DB = \"langchain\" # @param {type: \"string\"}"
]
},
{
"cell_type": "markdown",
"id": "QuQigs4UoFQ2",
"metadata": {
"id": "QuQigs4UoFQ2"
},
"source": [
"### PGEngine Connection Pool\n",
"\n",
"One of the requirements and arguments to establish PostgreSQL as a vector store is a `PGEngine` object. The `PGEngine` configures a shared connection pool to your Postgres database. This is an industry best practice to manage number of connections and to reduce latency through cached database connections.\n",
"\n",
"To create a `PGEngine` using `PGEngine.from_connection_string()` you need to provide:\n",
"\n",
"1. `url` : Connection string using the `postgresql+asyncpg` driver.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note:** This tutorial demonstrates the async interface. All async methods have corresponding sync methods."
]
},
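{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, the helpers used in this guide pair an async method (prefixed with `a`, or the `create` classmethod) with a sync counterpart (`init_vectorstore_table`, `create_sync`). A quick sketch using the `engine` object created below; the table name `\"my_table\"` and the `embeddings` object are illustrative placeholders only:\n",
"\n",
"```python\n",
"# Async (used throughout this tutorial)\n",
"await engine.ainit_vectorstore_table(table_name=\"my_table\", vector_size=768)\n",
"store = await PGVectorStore.create(engine, embedding_service=embeddings, table_name=\"my_table\")\n",
"\n",
"# Sync equivalents\n",
"engine.init_vectorstore_table(table_name=\"my_table\", vector_size=768)\n",
"store = PGVectorStore.create_sync(engine, embedding_service=embeddings, table_name=\"my_table\")\n",
"```"
]
},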
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# See docker command above to launch a Postgres instance with pgvector enabled.\n",
"CONNECTION_STRING = (\n",
" f\"postgresql+asyncpg://{POSTGRES_USER}:{POSTGRES_PASSWORD}@{POSTGRES_HOST}\"\n",
" f\":{POSTGRES_PORT}/{POSTGRES_DB}\"\n",
")\n",
"# To use psycopg3 driver, set your connection string to `postgresql+psycopg://`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain_postgres import PGEngine\n",
"\n",
"engine = PGEngine.from_connection_string(url=CONNECTION_STRING)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To create a `PGEngine` using `PGEngine.from_engine()` you need to provide:\n",
"\n",
"1. `engine` : An object of `AsyncEngine`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sqlalchemy.ext.asyncio import create_async_engine\n",
"\n",
"# Create an SQLAlchemy Async Engine\n",
"pool = create_async_engine(\n",
" CONNECTION_STRING,\n",
")\n",
"\n",
"engine = PGEngine.from_engine(engine=pool)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Get all collections\n",
"\n",
"This script migrates each collection to a new Vector Store table."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain_postgres.utils.pgvector_migrator import alist_pgvector_collection_names\n",
"\n",
"all_collection_names = await alist_pgvector_collection_names(engine)\n",
"print(all_collection_names)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "D9Xs2qhm6X56"
},
"source": [
"### Create a new table(s) to migrate existing data\n",
"The `PGVectorStore` class requires a database table. The `PGEngine` engine has a helper method `ainit_vectorstore_table()` that can be used to create a table with the proper schema for you."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also specify a schema name by passing `schema_name` wherever you pass `table_name`. Eg:\n",
"\n",
"```python\n",
"SCHEMA_NAME=\"my_schema\"\n",
"\n",
"await engine.ainit_vectorstore_table(\n",
" table_name=TABLE_NAME,\n",
" vector_size=768,\n",
" schema_name=SCHEMA_NAME, # Default: \"public\"\n",
")\n",
"```\n",
"\n",
"When creating your vectorstore table, you have the flexibility to define custom metadata and ID columns. This is particularly useful for:\n",
"\n",
"- **Filtering**: Metadata columns allow you to easily filter your data within the vectorstore. For example, you might store the document source, date, or author as metadata for efficient retrieval.\n",
"- **Non-UUID Identifiers**: By default, the id_column uses UUIDs. If you need to use a different type of ID (e.g., an integer or string), you can define a custom id_column.\n",
"\n",
"```python\n",
"metadata_columns = [\n",
" Column(f\"col_0_{collection_name}\", \"VARCHAR\"),\n",
" Column(f\"col_1_{collection_name}\", \"VARCHAR\"),\n",
"]\n",
"engine.init_vectorstore_table(\n",
" table_name=\"destination_table\",\n",
" vector_size=VECTOR_SIZE,\n",
" metadata_columns=metadata_columns,\n",
" id_column=Column(\"langchain_id\", \"VARCHAR\"),\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "avlyHEMn6gzU"
},
"outputs": [],
"source": [
"# Vertex AI embeddings uses a vector size of 768.\n",
"# Adjust this according to your embeddings service.\n",
"VECTOR_SIZE = 768\n",
"for collection_name in all_collection_names:\n",
" engine.init_vectorstore_table(\n",
" table_name=collection_name,\n",
" vector_size=VECTOR_SIZE,\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create a vector store and migrate data\n",
"\n",
"> **_NOTE:_** The `FakeEmbeddings` embedding service is only used to initialize a vector store object, not to generate any embeddings. The embeddings are copied directly from the PGVector table.\n",
"\n",
"If you have any customizations on the metadata or the id columns, add them to the vector store as follows:\n",
"\n",
"```python\n",
"from langchain_postgres import PGVectorStore\n",
"from langchain_core.embeddings import FakeEmbeddings\n",
"\n",
"destination_vector_store = PGVectorStore.create_sync(\n",
" engine,\n",
" embedding_service=FakeEmbeddings(size=VECTOR_SIZE),\n",
" table_name=DESTINATION_TABLE_NAME,\n",
" metadata_columns=[col.name for col in metadata_columns],\n",
" id_column=\"langchain_id\",\n",
")\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "z-AZyzAQ7bsf"
},
"outputs": [],
"source": [
"from langchain_core.embeddings import FakeEmbeddings\n",
"from langchain_postgres import PGVectorStore\n",
"from langchain_postgres.utils.pgvector_migrator import amigrate_pgvector_collection\n",
"\n",
"for collection_name in all_collection_names:\n",
" destination_vector_store = await PGVectorStore.create(\n",
" engine,\n",
" embedding_service=FakeEmbeddings(size=VECTOR_SIZE),\n",
" table_name=collection_name,\n",
" )\n",
"\n",
" await amigrate_pgvector_collection(\n",
" engine,\n",
" # Set collection name here\n",
" collection_name=collection_name,\n",
" vector_store=destination_vector_store,\n",
" # This deletes data from the original table upon migration. You can choose to turn it off.\n",
" # The data will only be deleted from the original table once all of it has been successfully copied to the destination table.\n",
" delete_pg_collection=True,\n",
" )"
]
},
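{
"cell_type": "markdown",
"metadata": {},
"source": [
"### (Optional) Verify the migration\n",
"\n",
"A minimal sketch of a spot check, assuming at least one collection was migrated in the loop above. Because `FakeEmbeddings` produces placeholder vectors, this only confirms that rows were copied and are readable, not that the results are semantically relevant; the query text and `k` value are illustrative.\n",
"\n",
"```python\n",
"# Spot check: query the last destination vector store created in the loop above.\n",
"docs = await destination_vector_store.asimilarity_search(\"test query\", k=5)\n",
"print(f\"Retrieved {len(docs)} documents from '{collection_name}'\")\n",
"```"
]
}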
],
"metadata": {
"colab": {
"provenance": [],
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}