feat: add Google BigQueryVectorSearch in vectorstore #14829

Merged
merged 31 commits into from
Jan 2, 2024
Changes from 15 commits
fdd5b2d
feat: add BigQueryVectorSearch on vectorstore
ashleyxuu Dec 17, 2023
c7fc182
feat: add BigQueryVectorSearch in vectorstore
ashleyxuu Dec 18, 2023
1c1b654
feat: add BigQueryVectorSearch on vectorstore
ashleyxuu Dec 17, 2023
c5bd203
resolve merge conflicts
ashleyxuu Dec 18, 2023
2da6077
resolve merge conflicts
ashleyxuu Dec 18, 2023
2546c6d
Merge branch 'master' into bq-vectorstore
ashleyxuu Dec 18, 2023
9e8084e
migrate the files abd add docs to google platform
ashleyxuu Dec 18, 2023
857af81
migrate the files and add docs in google platform
ashleyxuu Dec 18, 2023
ed2e4d1
fix minor typo
ashleyxuu Dec 18, 2023
a27d86c
add test file
ashleyxuu Dec 20, 2023
96bbeb5
Merge branch 'master' into bq-vectorstore
ashleyxuu Dec 20, 2023
78f5236
support more vector store methods
ashleyxuu Dec 22, 2023
d1a3072
Merge branch 'master' into bq-vectorstore
ashleyxuu Dec 22, 2023
2ec231d
resolve merging
ashleyxuu Dec 22, 2023
a64f1b6
resolve merging
ashleyxuu Dec 22, 2023
54b8ee4
fix formatting and address comments
ashleyxuu Dec 22, 2023
8019aa5
BigQueryVectorSearch block in vectorstores/__init__.py
vladkol Dec 22, 2023
547dd12
bigquery_vector_search.ipynb cleanup
vladkol Dec 23, 2023
1581d2b
address comments
ashleyxuu Dec 23, 2023
8055fe8
address the comments
ashleyxuu Dec 23, 2023
9388f7b
More linting fixes
vladkol Dec 23, 2023
700e493
Merging fixes
vladkol Dec 23, 2023
0318dd8
More formartting fixes
vladkol Dec 23, 2023
f714fd4
more lint fixes
ashleyxuu Dec 23, 2023
9546dc8
Test fixes
vladkol Dec 23, 2023
6f16d57
Linting fixes
vladkol Dec 23, 2023
8181525
Linting fixes
vladkol Dec 23, 2023
f13613b
Linting fixes
vladkol Dec 23, 2023
8e91a2f
remove the checks for minimum rows
ashleyxuu Jan 2, 2024
7f93159
Method to add textx with existing embeddings. Index creation refactor…
vladkol Jan 2, 2024
9ee97fa
Formatting fix
vladkol Jan 2, 2024
22 changes: 22 additions & 0 deletions docs/docs/integrations/platforms/google.mdx
@@ -202,6 +202,28 @@ See a [usage example](/docs/integrations/vectorstores/matchingengine).
```python
from langchain.vectorstores import MatchingEngine
```

### Google BigQuery Vector Search

> [Google BigQuery](https://cloud.google.com/bigquery) is a serverless and cost-effective enterprise data warehouse in Google Cloud.
>
> BigQuery vector search lets you use GoogleSQL to do semantic search, using vector indexes for fast but approximate results, or brute force for exact results.
>
> It can calculate Euclidean or cosine distance. The LangChain integration defaults to Euclidean distance.

Install the required Python package:

```bash
pip install google-cloud-bigquery
```

See a [usage example](/docs/integrations/vectorstores/bigquery_vector_search).

```python
from langchain.vectorstores import BigQueryVectorSearch
```
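
To make the two distance measures mentioned above concrete, here is a minimal pure-Python sketch of Euclidean and cosine distance between embedding vectors. It is illustrative only: the helper names `euclidean_distance` and `cosine_distance` and the toy vectors are assumptions, not part of the integration, and BigQuery computes these distances on the server side rather than in Python.

```python
import math
from typing import Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line (L2) distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; depends on direction only, not magnitude."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


# Toy 2-dimensional "embeddings" (made-up values for illustration).
u = [3.0, 4.0]
v = [6.0, 8.0]  # same direction as u, twice the length

print(euclidean_distance(u, v))  # 5.0
print(cosine_distance(u, v))     # 0.0 (same direction)
```

The two measures can disagree: `u` and `v` are 5 units apart in Euclidean terms but have zero cosine distance, so the choice of distance strategy can change which rows rank as nearest when embeddings are not normalized.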

### Google ScaNN

>[Google ScaNN](https://github.com/google-research/google-research/tree/master/scann)
349 changes: 349 additions & 0 deletions docs/docs/integrations/vectorstores/bigquery_vector_search.ipynb
@@ -0,0 +1,349 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "E_RJy7C1bpCT"
},
"source": [
"# BigQueryVectorSearch\n",
"> **BigQueryVectorSearch**:\n",
"BigQuery vector search lets you use GoogleSQL to do semantic search, using vector indexes for fast but approximate results, or using brute force for exact results.\n",
"\n",
"\n",
"This tutorial illustrates how to work with an end-to-end data and embedding management system in LangChain, and how to provide scalable semantic search in BigQuery."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "EmPJkpOCckyh"
},
"source": [
"## Getting started\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "IR54BmgvdHT_"
},
"source": [
"### Install the library"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1000
},
"id": "0ZITIDE160OD",
"outputId": "e184bc0d-6541-4e0a-82d2-1e216db00a2d"
},
"outputs": [],
"source": [
"! pip install google-cloud-aiplatform langchain==0.0.316 google-cloud-bigquery pydantic==1.10.8 typing-inspect==0.8.0 typing_extensions==4.5.0 pandas openai==0.28.1 tiktoken datasets google-api-python-client pypdf faiss-cpu transformers config --upgrade --user"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "v40bB_GMcr9f"
},
"source": [
"**Colab only:** Uncomment the following cell to restart the kernel, or use the restart button. For Vertex AI Workbench, you can restart the kernel using the button at the top."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "6o0iGVIdDD6K"
},
"outputs": [],
"source": [
"# # Automatically restart kernel after installs so that your environment can access the new packages\n",
"# import IPython\n",
"\n",
"# app = IPython.Application.instance()\n",
"# app.kernel.do_shutdown(True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Before you begin"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Set your project ID\n",
"\n",
"If you don't know your project ID, try the following:\n",
"* Run `gcloud config list`.\n",
"* Run `gcloud projects list`.\n",
"* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"PROJECT_ID = \"\" # @param {type:\"string\"}\n",
"\n",
"# Set the project id\n",
"! gcloud config set project {PROJECT_ID}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Set the region\n",
"\n",
"You can also change the `REGION` variable used by BigQuery. Learn more about [BigQuery regions](https://cloud.google.com/bigquery/docs/locations#supported_locations)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"REGION = \"US\" # @param {type: \"string\"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Authenticating your notebook environment\n",
"\n",
"- If you are using **Colab** to run this notebook, run the cell below and continue.\n",
"- If you are using **Vertex AI Workbench**, check out the setup instructions [here](https://github.com/GoogleCloudPlatform/generative-ai/tree/main/setup-env)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from google.colab import auth as google_auth\n",
"google_auth.authenticate_user()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "AD3yG49BdLlr"
},
"source": [
"## Demo: BigQueryVectorSearch"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain.vectorstores import BigQueryVectorSearch"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create an embedding class instance"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Vb2RJocV9_LQ",
"outputId": "37f5dc74-2512-47b2-c135-f34c10afdcf4"
},
"outputs": [],
"source": [
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
"import os\n",
"import getpass\n",
"\n",
"# We want to use OpenAIEmbeddings so we have to get the OpenAI API Key.\n",
"os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "z0Ksm9Tk_xhq"
},
"outputs": [],
"source": [
"embedding = OpenAIEmbeddings()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Initialize VectorStore from a list of strings + embeddings"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"all_texts = [\n",
" \"Apples and oranges\",\n",
" \"Cars and airplanes\",\n",
" \"Pineapple\",\n",
" \"Train\",\n",
" \"Banana\"\n",
"]\n",
"\n",
"store = BigQueryVectorSearch(\n",
" embedding,\n",
" project_id=PROJECT_ID,\n",
" dataset_name=\"<your_dataset>\",\n",
" table_name=\"<your_table>\",\n",
" location=REGION)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Initialize VectorStore with an existing dataset with embedding columns in BQ"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain.vectorstores.utils import DistanceStrategy\n",
"\n",
"DEFAULT_DISTANCE_STRATEGY = DistanceStrategy.EUCLIDEAN_DISTANCE\n",
"\n",
"bq_vector_search = BigQueryVectorSearch(\n",
" project_id=PROJECT_ID,\n",
" dataset_name=\"<your_dataset>\",\n",
" table_name=\"<your_table>\",\n",
" # Column {content_field} must be of STRING type\n",
" content_field=\"<your_content>\",\n",
" # Column {text_embedding_field} must be of ARRAY<FLOAT64> type\n",
" text_embedding_field=\"<your_embedding>\",\n",
" embedding=embedding,\n",
" distance_strategy=DEFAULT_DISTANCE_STRATEGY,\n",
" location=REGION)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Add texts"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"all_texts = [\n",
" \"Apples and oranges\",\n",
" \"Cars and airplanes\",\n",
" \"Pineapple\",\n",
" \"Train\",\n",
" \"Banana\"\n",
"]\n",
"\n",
"store.add_texts(all_texts)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Search for documents"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"query = \"I'd like a fruit.\"\n",
"docs = store.similarity_search(query)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Search for documents by vector"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"query_vector = embedding.embed_query(query)\n",
"docs = store.similarity_search_by_vector(query_vector, k=2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Search for documents with metadata filter"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"docs = store.similarity_search_by_vector(query_vector, filter={\"float_t\": 1.23})"
]
}
],
"metadata": {
"colab": {
"provenance": [],
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
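
The search cells in the notebook above can be pictured as a brute-force nearest-neighbor scan. The sketch below is a conceptual, in-memory stand-in and not the `BigQueryVectorSearch` implementation: the `Row` type, the `brute_force_search` function, the exact-match filter semantics, and the toy vectors are all assumptions made for illustration. It ranks rows by Euclidean distance to the query vector, optionally keeps only rows whose metadata matches every key in `filter`, and returns the `k` closest.

```python
import math
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Sequence


@dataclass
class Row:
    """Hypothetical in-memory stand-in for one table row."""
    content: str
    embedding: Sequence[float]
    metadata: Dict[str, Any] = field(default_factory=dict)


def brute_force_search(
    rows: List[Row],
    query_vector: Sequence[float],
    k: int = 4,
    filter: Optional[Dict[str, Any]] = None,  # name mirrors the notebook's keyword
) -> List[Row]:
    """Return the k rows closest to query_vector (exact, O(n) scan)."""
    candidates = [
        row for row in rows
        # Assumed semantics: every filter key must match exactly.
        if not filter
        or all(row.metadata.get(key) == val for key, val in filter.items())
    ]
    candidates.sort(key=lambda row: math.dist(row.embedding, query_vector))
    return candidates[:k]


# Toy 2-d vectors (made up; real embeddings have many more dimensions).
rows = [
    Row("Apples and oranges", [0.9, 0.1], {"float_t": 1.23}),
    Row("Cars and airplanes", [0.1, 0.9], {"float_t": 4.56}),
    Row("Pineapple", [0.8, 0.2], {"float_t": 1.23}),
    Row("Train", [0.0, 1.0], {"float_t": 4.56}),
    Row("Banana", [1.0, 0.0], {"float_t": 7.89}),
]
query_vector = [1.0, 0.0]  # pretend embedding of "I'd like a fruit."

print([r.content for r in brute_force_search(rows, query_vector, k=2)])
# ['Banana', 'Apples and oranges']
print([r.content for r in brute_force_search(rows, query_vector, filter={"float_t": 1.23})])
# ['Apples and oranges', 'Pineapple']
```

A scan like this is exact but touches every row, which is the trade-off the description draws against vector indexes: an index answers faster at the cost of approximate results.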
4 changes: 4 additions & 0 deletions docs/vercel.json
@@ -344,6 +344,10 @@
"source": "/docs/integrations/providers/google_vertexai_matchingengine",
"destination": "/docs/integrations/platforms/google"
},
{
"source": "/docs/integrations/providers/google_bigquery_vector_search",
"destination": "/docs/integrations/platforms/google"
},
ashleyxuu marked this conversation as resolved.
{
"source": "/docs/integrations/providers/aws_s3",
"destination": "/docs/integrations/platforms/aws"