-
Notifications
You must be signed in to change notification settings - Fork 15.8k
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- **Description**: Adding vectorstore wrapper for [SemaDB](https://rapidapi.com/semafind-semadb/api/semadb). - **Issue**: None - **Dependencies**: None - **Twitter handle**: semafind Checks performed: - [x] `make format` - [x] `make lint` - [x] `make test` - [x] `make spell_check` - [x] `make docs_build` Documentation added: - SemaDB vectorstore wrapper tutorial
- Loading branch information
Showing
4 changed files
with
599 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
# SemaDB | ||
|
||
>[SemaDB](https://semafind.com/) is a no fuss vector similarity search engine. It provides a low-cost cloud hosted version to help you build AI applications with ease. | ||
With SemaDB Cloud, our hosted version, no fuss means no pod size calculations, no schema definitions, no partition settings, no parameter tuning, no search algorithm tuning, no complex installation, no complex API. It is integrated with [RapidAPI](https://rapidapi.com/semafind-semadb/api/semadb) providing transparent billing, automatic sharding and an interactive API playground. | ||
|
||
## Installation | ||
|
||
None required, get started directly with SemaDB Cloud at [RapidAPI](https://rapidapi.com/semafind-semadb/api/semadb). | ||
|
||
## Vector Store | ||
|
||
There is a basic wrapper around `SemaDB` collections allowing you to use it as a vectorstore. | ||
|
||
```python | ||
from langchain.vectorstores import SemaDB | ||
``` | ||
|
||
You can follow a tutorial on how to use the wrapper in [this notebook](/docs/integrations/vectorstores/semadb.html). |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,299 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"id": "fe1cf4b8-4fee-49d9-aad5-18adabaca692", | ||
"metadata": {}, | ||
"source": [ | ||
"# SemaDB\n", | ||
"\n", | ||
"> SemaDB is a no fuss vector similarity database for building AI applications. The hosted SemaDB Cloud offers a no fuss developer experience to get started.\n", | ||
"\n", | ||
"The full documentation of the API along with examples and an interactive playground is available on [RapidAPI](https://rapidapi.com/semafind-semadb/api/semadb).\n", | ||
"\n", | ||
"This notebook demonstrates how the `langchain` wrapper can be used with SemaDB Cloud." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "aa8c1970-52f0-4834-8f06-3ca8f7fac857", | ||
"metadata": {}, | ||
"source": [ | ||
"## Load document embeddings\n", | ||
"\n", | ||
"To run things locally, we are using [Sentence Transformers](https://www.sbert.net/) which are commonly used for embedding sentences. You can use any embedding model LangChain offers." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "386a6b49-edee-45f2-9c0e-ebc125507ece", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"!pip install sentence_transformers" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"id": "5bd07a44-34fd-4318-8033-4c8dbd327559", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from langchain.embeddings import HuggingFaceEmbeddings\n", | ||
"\n", | ||
"embeddings = HuggingFaceEmbeddings()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"id": "b0079bdf-b3cd-4856-85d5-f7787f5d93d5", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"114\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"from langchain.text_splitter import CharacterTextSplitter\n", | ||
"from langchain.document_loaders import TextLoader\n", | ||
"\n", | ||
"loader = TextLoader(\"../../modules/state_of_the_union.txt\")\n", | ||
"documents = loader.load()\n", | ||
"text_splitter = CharacterTextSplitter(chunk_size=400, chunk_overlap=0)\n", | ||
"docs = text_splitter.split_documents(documents)\n", | ||
"print(len(docs))" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "92ed5523-330d-4697-9008-c910044ac45a", | ||
"metadata": {}, | ||
"source": [ | ||
"## Connect to SemaDB\n", | ||
"\n", | ||
"SemaDB Cloud uses [RapidAPI keys](https://rapidapi.com/semafind-semadb/api/semadb) to authenticate. You can obtain yours by creating a free RapidAPI account." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 4, | ||
"id": "c4ffeeef-e6f5-4bcc-8c97-0e4222ca8282", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdin", | ||
"output_type": "stream", | ||
"text": [ | ||
"SemaDB API Key: ········\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"import getpass\n", | ||
"import os\n", | ||
"\n", | ||
"os.environ['SEMADB_API_KEY'] = getpass.getpass(\"SemaDB API Key:\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 5, | ||
"id": "ba5f7a81-0f59-448a-93a8-5d8bf3bfc0f9", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from langchain.vectorstores import SemaDB\n", | ||
"from langchain.vectorstores.utils import DistanceStrategy" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "320f743c-39ae-456c-8c20-0683196358a4", | ||
"metadata": {}, | ||
"source": [ | ||
"The parameters to the SemaDB vector store reflect the API directly:\n", | ||
"\n", | ||
"- \"mycollection\": is the collection name in which we will store these vectors.\n", | ||
"- 768: is dimensions of the vectors. In our case, the sentence transformer embeddings yield 768 dimensional vectors.\n", | ||
"- API_KEY: is your RapidAPI key.\n", | ||
"- embeddings: correspond to how the embeddings of documents, texts and queries will be generated.\n", | ||
"- DistanceStrategy: is the distance metric used. The wrapper automatically normalises vectors if COSINE is used." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 6, | ||
"id": "c1cb1f78-c25e-41a7-8001-6c84d51514ea", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"True" | ||
] | ||
}, | ||
"execution_count": 6, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"db = SemaDB(\"mycollection\", 768, embeddings, DistanceStrategy.COSINE)\n", | ||
"\n", | ||
"# Create collection if running for the first time. If the collection\n", | ||
"# already exists this will fail.\n", | ||
"db.create_collection()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "44348469-1d1f-4f3e-9af3-a955aec3dd71", | ||
"metadata": {}, | ||
"source": [ | ||
"The SemaDB vector store wrapper adds the document text as point metadata to collect later. Storing large chunks of text is *not recommended*. If you are indexing a large collection, we instead recommend storing references to the documents such as external Ids." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 7, | ||
"id": "9adca5d3-e534-4fd2-aace-f436de4630ed", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"['813c7ef3-9797-466b-8afa-587115592c6c',\n", | ||
" 'fc392f7f-082b-4932-bfcc-06800db5e017']" | ||
] | ||
}, | ||
"execution_count": 7, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"db.add_documents(docs)[:2]" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "fb177b0d-148b-4cbc-86cc-b62dff135a9d", | ||
"metadata": {}, | ||
"source": [ | ||
"## Similarity Search\n", | ||
"\n", | ||
"We use the default LangChain similarity search interface to search for the most similar sentences." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 8, | ||
"id": "7536aba2-a757-4a3f-beda-79cfee5c34cf", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"query = \"What did the president say about Ketanji Brown Jackson\"\n", | ||
"docs = db.similarity_search(query)\n", | ||
"print(docs[0].page_content)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 9, | ||
"id": "a51e940e-487e-484d-9dc4-1aa1a6371660", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"(Document(page_content='And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': '../../modules/state_of_the_union.txt', 'text': 'And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.'}),\n", | ||
" 0.42369342)" | ||
] | ||
}, | ||
"execution_count": 9, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"docs = db.similarity_search_with_score(query)\n", | ||
"docs[0]" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "79aec3f4-d4d8-4c51-b4b2-074b6c22c3c0", | ||
"metadata": {}, | ||
"source": [ | ||
"## Clean up\n", | ||
"\n", | ||
"You can delete the collection to remove all data." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 10, | ||
"id": "b00afad5-8ec1-4c19-be6b-1c2ae2d5fead", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"True" | ||
] | ||
}, | ||
"execution_count": 10, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"db.delete_collection()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "239a0bca-5c88-401f-9828-1cb0b652e7d0", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.10.12" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.