Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

community: Added integrations for ThirdAI's NeuralDB as a Retriever #17334

Merged
merged 14 commits into from
Apr 16, 2024
148 changes: 148 additions & 0 deletions docs/docs/integrations/retrievers/thirdai_neuraldb.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,148 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# **NeuralDB**\n",
"NeuralDB is a CPU-friendly and fine-tunable retrieval engine developed by ThirdAI.\n",
"\n",
"### **Initialization**\n",
"There are two initialization methods:\n",
"- From Scratch: Basic model\n",
"- From Checkpoint: Load a model that was previously saved\n",
"\n",
"For all of the following initialization methods, the `thirdai_key` parameter can be ommitted if the `THIRDAI_KEY` environment variable is set.\n",
"\n",
"ThirdAI API keys can be obtained at https://www.thirdai.com/try-bolt/"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain.retrievers import NeuralDBRetriever\n",
"\n",
"# From scratch\n",
"retriever = NeuralDBRetriever.from_scratch(thirdai_key=\"your-thirdai-key\")\n",
"\n",
"# From checkpoint\n",
"retriever = NeuralDBRetriever.from_checkpoint(\n",
" # Path to a NeuralDB checkpoint. For example, if you call\n",
" # retriever.save(\"/path/to/checkpoint.ndb\") in one script, then you can\n",
" # call NeuralDBRetriever.from_checkpoint(\"/path/to/checkpoint.ndb\") in\n",
" # another script to load the saved model.\n",
" checkpoint=\"/path/to/checkpoint.ndb\",\n",
" thirdai_key=\"your-thirdai-key\",\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### **Inserting document sources**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"retriever.insert(\n",
" # If you have PDF, DOCX, or CSV files, you can directly pass the paths to the documents\n",
" sources=[\"/path/to/doc.pdf\", \"/path/to/doc.docx\", \"/path/to/doc.csv\"],\n",
" # When True this means that the underlying model in the NeuralDB will\n",
" # undergo unsupervised pretraining on the inserted files. Defaults to True.\n",
" train=True,\n",
" # Much faster insertion with a slight drop in performance. Defaults to True.\n",
" fast_mode=True,\n",
")\n",
"\n",
"from thirdai import neural_db as ndb\n",
"\n",
"retriever.insert(\n",
" # If you have files in other formats, or prefer to configure how\n",
" # your files are parsed, then you can pass in NeuralDB document objects\n",
" # like this.\n",
" sources=[\n",
" ndb.PDF(\n",
" \"/path/to/doc.pdf\",\n",
" version=\"v2\",\n",
" chunk_size=100,\n",
" metadata={\"published\": 2022},\n",
" ),\n",
" ndb.Unstructured(\"/path/to/deck.pptx\"),\n",
" ]\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### **Retrieving documents**\n",
"To query the retriever, you can use the standard LangChain retriever method `get_relevant_documents`, which returns a list of LangChain Document objects. Each document object represents a chunk of text from the indexed files. For example, it may contain a paragraph from one of the indexed PDF files. In addition to the text, the document's metadata field contains information such as the document's ID, the source of this document (which file it came from), and the score of the document."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# This returns a list of LangChain Document objects\n",
"documents = retriever.get_relevant_documents(\"query\", top_k=10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### **Fine tuning**\n",
"NeuralDBRetriever can be fine-tuned to user behavior and domain-specific knowledge. It can be fine-tuned in two ways:\n",
"1. Association: the retriever associates a source phrase with a target phrase. When the retriever sees the source phrase, it will also consider results that are relevant to the target phrase.\n",
"2. Upvoting: the retriever upweights the score of a document for a specific query. This is useful when you want to fine-tune the retriever to user behavior. For example, if a user searches \"how is a car manufactured\" and likes the returned document with id 52, then we can upvote the document with id 52 for the query \"how is a car manufactured\"."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"retriever.associate(source=\"source phrase\", target=\"target phrase\")\n",
"retriever.associate_batch(\n",
" [\n",
" (\"source phrase 1\", \"target phrase 1\"),\n",
" (\"source phrase 2\", \"target phrase 2\"),\n",
" ]\n",
")\n",
"\n",
"retriever.upvote(query=\"how is a car manufactured\", document_id=52)\n",
"retriever.upvote_batch(\n",
" [\n",
" (\"query 1\", 52),\n",
" (\"query 2\", 20),\n",
" ]\n",
")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "langchain",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.10.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
14 changes: 1 addition & 13 deletions docs/docs/integrations/vectorstores/thirdai_neuraldb.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -10,9 +10,8 @@
"\n",
"## Initialization\n",
"\n",
"There are three initialization methods:\n",
"There are two initialization methods:\n",
"- From Scratch: Basic model\n",
"- From Bazaar: Download a pretrained base model from our model bazaar for better performance\n",
"- From Checkpoint: Load a model that was previously saved\n",
"\n",
"For all of the following initialization methods, the `thirdai_key` parameter can be omitted if the `THIRDAI_KEY` environment variable is set.\n",
Expand All @@ -31,17 +30,6 @@
"# From scratch\n",
"vectorstore = NeuralDBVectorStore.from_scratch(thirdai_key=\"your-thirdai-key\")\n",
"\n",
"# From bazaar\n",
"vectorstore = NeuralDBVectorStore.from_bazaar(\n",
" # Name of base model to be downloaded from model bazaar.\n",
" # \"General QnA\" gives better performance on question-answering.\n",
" base=\"General QnA\",\n",
" # Path to a directory that caches models to prevent repeated downloading.\n",
" # Defaults to {CWD}/model_bazaar\n",
" bazaar_cache=\"/path/to/bazaar_cache\",\n",
" thirdai_key=\"your-thirdai-key\",\n",
")\n",
"\n",
"# From checkpoint\n",
"vectorstore = NeuralDBVectorStore.from_checkpoint(\n",
" # Path to a NeuralDB checkpoint. For example, if you call\n",
Expand Down
1 change: 1 addition & 0 deletions libs/community/langchain_community/retrievers/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -212,6 +212,7 @@
"YouRetriever": "langchain_community.retrievers.you",
"ZepRetriever": "langchain_community.retrievers.zep",
"ZillizRetriever": "langchain_community.retrievers.zilliz",
"NeuralDBRetriever": "langchain_community.retrievers.thirdai_neuraldb",
}


Expand Down
Loading
Loading