Skip to content

Commit

Permalink
initial commit
Browse files Browse the repository at this point in the history
  • Loading branch information
ryuta-yoshimatsu authored Oct 8, 2024
0 parents commit d669a1f
Show file tree
Hide file tree
Showing 21 changed files with 2,230 additions and 0 deletions.
168 changes: 168 additions & 0 deletions 00_introduction.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,168 @@
# Databricks notebook source
# MAGIC %md
# MAGIC This solution accelerator notebook is available at [Databricks Industry Solutions](https://github.com/databricks-industry-solutions).

# COMMAND ----------

# MAGIC %md
# MAGIC <img src='https://github.com/databricks-industry-solutions/.github/raw/main/profile/solacc_logo_wide.png' width="1000" ></img>
# MAGIC
# MAGIC # Semantic Cache Solution Accelerator
# MAGIC
# MAGIC Generative AI models are increasingly revolutionizing industries, with techniques like Retrieval Augmented Generation (RAG) and Compound AI systems leading the charge. These models empower organizations by enhancing capabilities such as information retrieval, decision-making, and content generation. However, the implementation of these systems is often accompanied by significant costs, especially in terms of computational resources. Despite these challenges, the rapid advancement of AI platforms and the development of more efficient algorithms are enabling businesses to optimize costs and scale AI-driven solutions more effectively.
# MAGIC
# MAGIC Semantic cache is a technique that is adopted by many enterprises to reduce the computational load of AI-driven systems. As generative AI models handle increasingly complex queries, there is often semantic overlap between different queries, such as users asking variations of the same question. Without semantic caching, these systems would need to repeatedly perform resource-intensive computations, leading to inefficiencies. By storing the previously processed queries and responses, semantic caching allows AI models to retrieve relevant information without recalculating, thereby reducing latency, lowering server load, and conserving computational resources. This becomes especially important as AI applications scale, ensuring cost-effectiveness and maintaining high performance, particularly in natural language processing, where nuanced query variations are frequent.

# COMMAND ----------

# MAGIC %md
# MAGIC ## How Semantic Caching works
# MAGIC
# MAGIC Semantic caching leverages a vector database to store and retrieve answers based on the meaning or semantics of a question rather than just its keywords. In this system, each question is embedded as a vector, and the cached answers are stored. When a new query is submitted, the system searches the database for similar vectors, returning a cached response when a match is found. When a suitable match is not found, the system proceeds to execute the standard pipeline to generate a response, and in turn persists the new question and answer pair in the database.
# MAGIC
# MAGIC This technique is particularly effective for handling high-volume, repetitive queries such as those often found in customer FAQs, where users frequently ask the same or similar questions. Some of the key business benefits of semantic cache are:
# MAGIC
# MAGIC - Reduce Costs: With fewer computationally expensive model calls, businesses will see significant cost savings. The system bypasses the need to generate new answers for questions that have already been asked, leading to reduced usage of cloud resources and lower operational costs.
# MAGIC - Faster Response Time: Customer satisfaction is closely tied to how quickly they receive answers. With semantic caching, chatbots can instantly retrieve answers from the cache, dramatically reducing the time it takes to respond to queries.
# MAGIC - Scalability: As businesses scale, so do the number of customer inquiries. Caching frequently asked questions ensures the chatbot can handle increased volumes without a corresponding increase in costs or latency.
# MAGIC
# MAGIC Some use cases we see in the market that are especially suitable for semantic caching include:
# MAGIC
# MAGIC - FAQs: Questions that customers frequently ask—such as product details, order statuses, or return policies—are prime candidates for caching. Businesses can quickly address these repetitive queries without taxing the system.
# MAGIC - Support Tickets: For companies that manage large-scale customer support, semantic caching can be implemented to address commonly recurring issues.
# MAGIC - Internal Knowledge Bases: Employees often ask the same internal queries, and caching these answers can improve productivity by providing instant access to stored knowledge.
# MAGIC

# COMMAND ----------

# MAGIC %md
# MAGIC ## Semantic Cache on Databricks Mosaic AI
# MAGIC
# MAGIC Databricks provides an optimal platform for building AI agents with semantic caching capabilities. With Databricks Mosaic AI, users have access to all necessary components such as a vector database, agent development framework, agent serving, and an agent evaluation framework on a unified, highly governed platform. This ensures that key assets, including data, vector indexes, models, agents, and endpoints, are centrally managed under robust governance.
# MAGIC
# MAGIC Mosaic AI also offers an open architecture, allowing users to experiment with various models for embeddings and generation. Leveraging the Mosaic AI Agent Framework and Evaluation tools, users can rapidly iterate on applications until they meet production-level standards. Once deployed, KPIs like hit ratios and latency can be monitored using MLflow traces, which are automatically logged in Inference Tables for easy tracking.
# MAGIC
# MAGIC If you're looking to implement semantic caching for your AI system on Databricks, we're excited to introduce the Semantic Cache Solution Accelerator. This accelerator is designed to help you get started quickly and efficiently, providing a streamlined path to implementation.

# COMMAND ----------

# MAGIC %md
# MAGIC ## Case Study
# MAGIC
# MAGIC Imagine you're operating an AI chatbot on your products' public documentation page. This chatbot answers visitors' questions about your products using a retriever-augmented generation (RAG) architecture. After reviewing the submitted user questions and responses, you notice a large number of redundant queries—phrased differently but carrying the same meaning. You're getting feedback from the users that the chatbot's response time is too long and also facing pressure from management to reduce the operational costs of the chatbot.
# MAGIC
# MAGIC In the following notebooks, we'll explore how semantic caching can significantly lower both total cost and latency, with only a minor trade-off in response quality.

# COMMAND ----------

# MAGIC %md
# MAGIC ## Exploratory Data Analysis
# MAGIC
# MAGIC The dataset we use for this solution accelrator was synthesized and is stored inside `./data/`. We first generated a list of 10 questions related to Databricks Machine Learning product features using [dbrx-instruct](https://docs.databricks.com/en/machine-learning/foundation-models/supported-models.html#dbrx-instruct). We then bootstrapped these questions to generate 100 questions. We reformulate each of the 10 questions slightly differently without changing the meaning. We used [Meta Llama 3.1 70B Instruct](https://docs.databricks.com/en/machine-learning/foundation-models/supported-models.html#meta-llama-31-70b-instruct) for this.
# MAGIC
# MAGIC
# MAGIC The goal of this exploratory data analysis is to identify the optimal similarity score threshold that separates semantically similar questions from non-similar ones. This threshold should maximize the cache hit rate while minimizing false positives. A synthesized dataset is helpful as it provides ground truth labels, meaning that the questions bootstrapped from the same original question belong to the same semantic class. This is captured in the colume `base` in the `data/synthetic_questions_100.csv dataset`. This dataset allows for accurate validation of the threshold's performance in separating similar and non-similar questions, which we we see in the following.
# MAGIC
# MAGIC Let's first load in the configuration parameters (find more information about `Config` in the next notebook).

# COMMAND ----------

# DBTITLE 1,Load parameters
from config import Config
config = Config()

# COMMAND ----------

# MAGIC %md
# MAGIC We will read in the dataset as a pandas DataFrame and apply an embedding model to the questions.

# COMMAND ----------

import pandas as pd
import mlflow.deployments
from pyspark.sql.functions import udf, pandas_udf
from pyspark.sql.types import StringType

df = pd.read_csv('data/synthetic_questions_100.csv')[['base', 'question']]

deploy_client = mlflow.deployments.get_deploy_client("databricks")
def get_embedding(question):
response = deploy_client.predict(endpoint=config.EMBEDDING_MODEL_SERVING_ENDPOINT_NAME, inputs={"input": question})
return response.data[0]["embedding"]

# Apply an embedding model to the 'question' column and create a new column 'embedding'
df["embedding"] = df["question"].apply(lambda x: get_embedding(x))

display(df)

# COMMAND ----------

# MAGIC %md
# MAGIC We will perform a cross join between all the questions to calculate the similarity score for every possible pair of combinations, which will result in 10,000 rows.

# COMMAND ----------

df = df.merge(df, how='cross')

# COMMAND ----------

# MAGIC %md
# MAGIC [Databricks Mosaic AI Vector Search](https://docs.databricks.com/en/generative-ai/vector-search.html) uses L2 distance as a similarity score:
# MAGIC
# MAGIC $$\frac{1}{(1 + dist(q,x)^2)}$$
# MAGIC
# MAGIC where dist is the Euclidean distance between the query q and the index entry x, defined as:
# MAGIC
# MAGIC $$dist(q,x) = \sqrt{(q_1-x_1)^2 + (q_2-x_2)^2 + \ldots + (q_d-x_d)^2}.$$
# MAGIC
# MAGIC We will calculate this metric for each combination of questions. The `similar` column shown below indicates whether both questions in the pair belong to the same semantic class.

# COMMAND ----------

import numpy as np

def get_similarity_score(embedding_x, embedding_y):
l_norm = np.linalg.norm(np.array(embedding_x) - np.array(embedding_y))
score = 1.0/(1.0 + l_norm*l_norm)
return score

# Apply an embedding model to the 'question' column and create a new column 'embedding'
df["score"] = df.apply(lambda x: get_similarity_score(x["embedding_x"], x["embedding_y"]), axis=1)
df = df.loc[df["score"] != 1] # Exclude the self-similar combinations
df ["similar"] = df.apply(lambda x: True if x["base_x"] == x["base_y"] else False, axis=1)
df = df[["similar", "score"]]

display(df)

# COMMAND ----------

# MAGIC %md
# MAGIC Let's look at the summary statistics and the distribution of the simiar and non-similar pairs.

# COMMAND ----------

df.groupby('similar').describe().T

# COMMAND ----------

df.groupby('similar')['score'].plot(
kind='hist',
bins=50,
alpha=0.65,
density=True,
figsize=(10, 6),
grid=True,
legend=True,
)

# COMMAND ----------

# MAGIC %md
# MAGIC The analysis shows that the similar and non-similar questions synthesized for this demo exhibit distinct distributions. However, there is a notable overlap between the two distributions, presenting a critical decision point for the solution. If we prioritize the hit rate and set a low similarity threshold (e.g., 0.005), we can achieve a recall of over 0.75, but this will come at the expense of precision. On the other hand, setting a higher threshold (e.g., 0.015) to prioritize precision will limit recall to around 0.25. This trade-off must be carefully evaluated by the team in collaboration with business stakeholders.
# MAGIC
# MAGIC In the following notebook, we will set the threshold to 0.01 as a balanced starting point.

# COMMAND ----------

# MAGIC %md
# MAGIC © 2024 Databricks, Inc. All rights reserved. The source in this notebook is provided subject to the Databricks License.
137 changes: 137 additions & 0 deletions 01_data_preparation.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,137 @@
# Databricks notebook source
# MAGIC %md
# MAGIC This solution accelerator notebook is available at [Databricks Industry Solutions](https://github.com/databricks-industry-solutions).

# COMMAND ----------

# MAGIC %md
# MAGIC #Set up Vector Search for RAG
# MAGIC
# MAGIC Our AI chatbot utilizes a retriever-augmented generation (RAG) approach. Before implementing semantic caching, we’ll first set up the vector database that supports this RAG architecture. For this, we’ll use [Databricks Mosaic AI Vector Search](https://docs.databricks.com/en/generative-ai/vector-search.html).

# COMMAND ----------

# MAGIC %md
# MAGIC ## Cluster configuration
# MAGIC We recommend using a cluster with the following specifications to run this solution accelerator:
# MAGIC - Unity Catalog enabled cluster
# MAGIC - Databricks Runtime 15.4 LTS ML or above
# MAGIC - Single-node cluster: e.g. `m6id.2xlarge` on AWS or `Standard_D8ds_v4` on Azure Databricks.

# COMMAND ----------

# MAGIC %md
# MAGIC We install the required packages from the `requirements.txt` file into the current session.

# COMMAND ----------

# DBTITLE 1,Install requirements
# MAGIC %pip install -r requirements.txt --quiet
# MAGIC dbutils.library.restartPython()

# COMMAND ----------

# MAGIC %md
# MAGIC `config.py` is a key file that holds all the essential parameters for the application. Open the file and define the values for the parameters according to your specific setup, such as the embedding/generation model endpoint, catalog, schema, vector search endpoint, and more. The following cell will load these parameters into the `config` variable.

# COMMAND ----------

# DBTITLE 1,Load parameters
from config import Config
config = Config()

# COMMAND ----------

# MAGIC %md
# MAGIC In the next cell, we run the `99_init` notebook, which sets up the logging policy and downloads the chunked Databricks product documentation (if it doesn't already exist) into the specified tables under the catalog and schema you defined in config.py.

# COMMAND ----------

# DBTITLE 1,Run init notebok
# MAGIC %run ./99_init $reset_all_data=false

# COMMAND ----------

# MAGIC %md
# MAGIC ## Create Vector Search endpoint
# MAGIC
# MAGIC We create a Vector Search endpoint using custom functions defined in the `utils.py` script.

# COMMAND ----------

import utils
from databricks.vector_search.client import VectorSearchClient

# Instantiate the Vector Search Client
vsc = VectorSearchClient(disable_notice=True)

# Check if the endpoint exists, if not create it
if not utils.vs_endpoint_exists(vsc, config.VECTOR_SEARCH_ENDPOINT_NAME):
utils.create_or_wait_for_endpoint(vsc, config.VECTOR_SEARCH_ENDPOINT_NAME)

print(f"Endpoint named {config.VECTOR_SEARCH_ENDPOINT_NAME} is ready.")

# COMMAND ----------

# MAGIC %md
# MAGIC ## Create Vector Search index
# MAGIC Create a Vector Search index from the chunks of documents loaded in the previous cell. We use custom functions defined in the `utils.py` script.

# COMMAND ----------

from databricks.sdk import WorkspaceClient

# Check if the index exists, if not create it
if not utils.index_exists(vsc, config.VECTOR_SEARCH_ENDPOINT_NAME, config.VS_INDEX_FULLNAME):

print(f"Creating index {config.VS_INDEX_FULLNAME} on endpoint {config.VECTOR_SEARCH_ENDPOINT_NAME}...")

# Create a delta sync index
vsc.create_delta_sync_index(
endpoint_name=config.VECTOR_SEARCH_ENDPOINT_NAME,
index_name=config.VS_INDEX_FULLNAME,
source_table_name=config.SOURCE_TABLE_FULLNAME,
pipeline_type="TRIGGERED",
primary_key="id",
embedding_source_column='content', # The column containing our text
embedding_model_endpoint_name=config.EMBEDDING_MODEL_SERVING_ENDPOINT_NAME, #The embedding endpoint used to create the embeddings
)

# Let's wait for the index to be ready and all our embeddings to be created and indexed
utils.wait_for_index_to_be_ready(vsc, config.VECTOR_SEARCH_ENDPOINT_NAME, config.VS_INDEX_FULLNAME)
else:
# Trigger a sync to update our vs content with the new data saved in the table
utils.wait_for_index_to_be_ready(vsc, config.VECTOR_SEARCH_ENDPOINT_NAME, config.VS_INDEX_FULLNAME)
vsc.get_index(config.VECTOR_SEARCH_ENDPOINT_NAME, config.VS_INDEX_FULLNAME).sync()

print(f"index {config.VS_INDEX_FULLNAME} on table {config.SOURCE_TABLE_FULLNAME} is ready")

# COMMAND ----------

# MAGIC %md
# MAGIC ## Query Vector Search index
# MAGIC
# MAGIC Let's see if we can run a similarity search against the index.

# COMMAND ----------

# Let's search for the chunks that are most relevant to the query "What is Model Serving?"
results = vsc.get_index(
config.VECTOR_SEARCH_ENDPOINT_NAME,
config.VS_INDEX_FULLNAME
).similarity_search(
query_text="What is Model Serving?",
columns=["url", "content"],
num_results=1)
docs = results.get('result', {}).get('data_array', [])
docs

# COMMAND ----------

# MAGIC %md
# MAGIC We have successfully set up the vector database for our RAG chatbot. In the next `02_rag_chatbot` notebook, we will build a standard RAG chatbot without semantic caching, which will serve as a benchmark. Later, in the `03_rag_chatbot_with_cache notebook`, we will introduce semantic caching and compare its performance.

# COMMAND ----------

# MAGIC %md
# MAGIC © 2024 Databricks, Inc. All rights reserved. The source in this notebook is provided subject to the Databricks License.
Loading

0 comments on commit d669a1f

Please sign in to comment.