databricks: add vector search and embeddings (#25648)

### Summary

Add `DatabricksVectorSearch` and `DatabricksEmbeddings` classes to the `langchain-databricks` partner package. Core functionality is unchanged, but the vector search class is largely refactored for readability and maintainability.

This PR does not add integration tests yet. They will be added once the Databricks test workspace is ready.

Tagging @efriis as POC

### Tracker

[✅] Create a package and migrate ChatDatabricks
[✍️] Migrate DatabricksVectorSearch, DatabricksEmbeddings, and their docs
~[ ] Migrate UCFunctionToolkit and its doc~
[ ] Add provider document and update README.md
[ ] Add integration tests and set up secrets (after moved to an external package)
[ ] Add deprecation note to the community implementations.

---------

Signed-off-by: B-Step62 <[email protected]>
Co-authored-by: Erick Friis <[email protected]>
langchain-ai · Aug 24, 2024 · c7a8af2
1 parent 71c0395
commit c7a8af2

Showing 12 changed files with 2,265 additions and 159 deletions.
diff --git a/docs/docs/integrations/text_embedding/databricks.ipynb b/docs/docs/integrations/text_embedding/databricks.ipynb
@@ -1,22 +1,34 @@
 {
  "cells": [
   {
-   "attachments": {},
+   "cell_type": "raw",
+   "id": "afaf8039",
+   "metadata": {},
+   "source": [
+    "---\n",
+    "sidebar_label: Databricks\n",
+    "---"
+   ]
+  },
+  {
    "cell_type": "markdown",
+   "id": "9a3d6f34",
    "metadata": {},
    "source": [
-    "# Databricks\n",
+    "# DatabricksEmbeddings\n",
     "\n",
     "> [Databricks](https://www.databricks.com/) Lakehouse Platform unifies data, analytics, and AI on one platform.\n",
     "\n",
-    "This notebook provides a quick overview for getting started with Databricks [embedding models](/docs/concepts/#embedding-models). For detailed documentation of all DatabricksEmbeddings features and configurations head to the [API reference](https://python.langchain.com/v0.2/api_reference/community/embeddings/langchain_community.embeddings.databricks.DatabricksEmbeddings.html).\n",
+    "This notebook provides a quick overview for getting started with Databricks [embedding models](/docs/concepts/#embedding-models). For detailed documentation of all `DatabricksEmbeddings` features and configurations head to the [API reference](https://python.langchain.com/v0.2/api_reference/community/embeddings/langchain_community.embeddings.databricks.DatabricksEmbeddings.html).\n",
     "\n",
     "\n",
     "\n",
     "## Overview\n",
+    "### Integration details\n",
     "\n",
-    "`DatabricksEmbeddings` class wraps an embedding model endpoint hosted on [Databricks Model Serving](https://docs.databricks.com/en/machine-learning/model-serving/index.html). This example notebook shows how to wrap your serving endpoint and use it as a embedding model in your LangChain application.\n",
-    "\n",
+    "| Class | Package |\n",
+    "| :--- | :--- |\n",
+    "| [DatabricksEmbeddings](https://api.python.langchain.com/en/latest/embeddings/langchain_databricks.embeddings.DatabricksEmbeddings.html) | [langchain-databricks](https://api.python.langchain.com/en/latest/databricks_api_reference.html) |\n",
     "\n",
     "### Supported Methods\n",
     "\n",
@@ -30,13 +42,9 @@
     "1. Foundation Models - Curated list of state-of-the-art foundation models such as BAAI General Embedding (BGE). These endpoint are ready to use in your Databricks workspace without any set up.\n",
     "2. Custom Models - You can also deploy custom embedding models to a serving endpoint via MLflow with\n",
     "your choice of framework such as LangChain, Pytorch, Transformers, etc.\n",
-    "3. External Models - Databricks endpoints can serve models that are hosted outside Databricks as a proxy, such as proprietary model service like OpenAI text-embedding-3.\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
+    "3. External Models - Databricks endpoints can serve models that are hosted outside Databricks as a proxy, such as proprietary model service like OpenAI text-embedding-3.\n",
+    "\n",
+    "\n",
     "## Setup\n",
     "\n",
     "To access Databricks models you'll need to create a Databricks account, set up credentials (only if you are outside Databricks workspace), and install required packages.\n",
@@ -51,6 +59,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "36521c2a",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -63,33 +72,27 @@
   },
   {
    "cell_type": "markdown",
+   "id": "d9664366",
    "metadata": {},
    "source": [
     "### Installation\n",
     "\n",
-    "The LangChain Databricks integration lives in the `langchain-community` package. Also, `mlflow >= 2.9 ` is required to run the code in this notebook."
+    "The LangChain Databricks integration lives in the `langchain-databricks` package:"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "64853226",
    "metadata": {},
    "outputs": [],
    "source": [
-    "%pip install -qU langchain-community mlflow>=2.9.0"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "We first demonstrates how to query BGE model hosted as Foundation Models endpoint with `DatabricksEmbeddings`.\n",
-    "\n",
-    "For other type of endpoints, there are some difference in how to set up the endpoint itself, however, once the endpoint is ready, there is no difference in how to query it."
+    "%pip install -qU langchain-databricks"
    ]
   },
   {
    "cell_type": "markdown",
+   "id": "45dd1724",
    "metadata": {},
    "source": [
     "## Instantiation"
@@ -98,10 +101,11 @@
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "9ea7a09b",
    "metadata": {},
    "outputs": [],
    "source": [
-    "from langchain_community.embeddings import DatabricksEmbeddings\n",
+    "from langchain_databricks import DatabricksEmbeddings\n",
     "\n",
     "embeddings = DatabricksEmbeddings(\n",
     "    endpoint=\"databricks-bge-large-en\",\n",
@@ -113,65 +117,131 @@
   },
   {
    "cell_type": "markdown",
+   "id": "77d271b6",
+   "metadata": {},
+   "source": [
+    "## Indexing and Retrieval\n",
+    "\n",
+    "Embedding models are often used in retrieval-augmented generation (RAG) flows, both as part of indexing data as well as later retrieving it. For more detailed instructions, please see our RAG tutorials under the [working with external knowledge tutorials](/docs/tutorials/#working-with-external-knowledge).\n",
+    "\n",
+    "Below, see how to index and retrieve data using the `embeddings` object we initialized above. In this example, we will index and retrieve a sample document in the `InMemoryVectorStore`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d817716b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Create a vector store with a sample text\n",
+    "from langchain_core.vectorstores import InMemoryVectorStore\n",
+    "\n",
+    "text = \"LangChain is the framework for building context-aware reasoning applications\"\n",
+    "\n",
+    "vectorstore = InMemoryVectorStore.from_texts(\n",
+    "    [text],\n",
+    "    embedding=embeddings,\n",
+    ")\n",
+    "\n",
+    "# Use the vectorstore as a retriever\n",
+    "retriever = vectorstore.as_retriever()\n",
+    "\n",
+    "# Retrieve the most similar text\n",
+    "retrieved_document = retriever.invoke(\"What is LangChain?\")\n",
+    "\n",
+    "# show the retrieved document's content\n",
+    "retrieved_document[0].page_content"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e02b9855",
    "metadata": {},
    "source": [
-    "## Embed single text"
+    "## Direct Usage\n",
+    "\n",
+    "Under the hood, the vectorstore and retriever implementations are calling `embeddings.embed_documents(...)` and `embeddings.embed_query(...)` to create embeddings for the text(s) used in `from_texts` and retrieval `invoke` operations, respectively.\n",
+    "\n",
+    "You can directly call these methods to get embeddings for your own use cases.\n",
+    "\n",
+    "### Embed single texts\n",
+    "\n",
+    "You can embed single texts or documents with `embed_query`:"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 2,
+   "execution_count": null,
+   "id": "0d2befcd",
    "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "[0.051055908203125, 0.007221221923828125, 0.003879547119140625]\n"
-     ]
-    }
-   ],
+   "outputs": [],
    "source": [
-    "embeddings.embed_query(\"hello\")[:3]"
+    "single_vector = embeddings.embed_query(text)\n",
+    "print(str(single_vector)[:100])  # Show the first 100 characters of the vector"
    ]
   },
   {
    "cell_type": "markdown",
+   "id": "1b5a7d03",
    "metadata": {},
    "source": [
-    "## Embed documents"
+    "### Embed multiple texts\n",
+    "\n",
+    "You can embed multiple texts with `embed_documents`:"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "2f4d6e97",
    "metadata": {},
    "outputs": [],
    "source": [
-    "documents = [\"This is a dummy document.\", \"This is another dummy document.\"]\n",
-    "response = embeddings.embed_documents(documents)\n",
-    "print([e[:3] for e in response])  # Show first 3 elements of each embedding"
+    "text2 = (\n",
+    "    \"LangGraph is a library for building stateful, multi-actor applications with LLMs\"\n",
+    ")\n",
+    "two_vectors = embeddings.embed_documents([text, text2])\n",
+    "for vector in two_vectors:\n",
+    "    print(str(vector)[:100])  # Show the first 100 characters of the vector"
    ]
   },
   {
    "cell_type": "markdown",
+   "id": "98785c12",
    "metadata": {},
    "source": [
-    "## Wrapping Other Types of Endpoints\n",
+    "### Async Usage\n",
+    "\n",
+    "You can also use `aembed_query` and `aembed_documents` for producing embeddings asynchronously:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4c3bef91",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import asyncio\n",
+    "\n",
+    "\n",
+    "async def async_example():\n",
+    "    single_vector = await embeddings.aembed_query(text)\n",
+    "    print(str(single_vector)[:100])  # Show the first 100 characters of the vector\n",
     "\n",
-    "The example above uses an embedding model hosted as a Foundation Models API. To learn about how to use the other endpoint types, please refer to the documentation for `ChatDatabricks`. While the model type is different, required steps are the same.\n",
     "\n",
-    "* [Custom Model Endpoint](https://python.langchain.com/v0.2/docs/integrations/chat/databricks/#wrapping-custom-model-endpoint)\n",
-    "* [External Models](https://python.langchain.com/v0.2/docs/integrations/chat/databricks/#wrapping-external-models)"
+    "asyncio.run(async_example())"
    ]
   },
   {
    "cell_type": "markdown",
+   "id": "0d053b64",
    "metadata": {},
    "source": [
-    "## API reference\n",
+    "## API Reference\n",
     "\n",
-    "For detailed documentation of all ChatDatabricks features and configurations head to the API reference: https://python.langchain.com/v0.2/api_reference/community/embeddings/langchain_community.embeddings.databricks.DatabricksEmbeddings.html"
+    "For detailed documentation on `DatabricksEmbeddings` features and configuration options, please refer to the [API reference](https://python.langchain.com/v0.2/api_reference/community/embeddings/langchain_community.embeddings.databricks.DatabricksEmbeddings.html).\n"
    ]
   }
  ],
@@ -191,9 +261,9 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.10.12"
+   "version": "3.10.5"
   }
  },
  "nbformat": 4,
- "nbformat_minor": 4
+ "nbformat_minor": 5
 }