diff --git a/gemini/evaluation/evaluate_langchain_chains.ipynb b/gemini/evaluation/evaluate_langchain_chains.ipynb index b642da7a84b..fe292c127d6 100644 --- a/gemini/evaluation/evaluate_langchain_chains.ipynb +++ b/gemini/evaluation/evaluate_langchain_chains.ipynb @@ -1,929 +1,953 @@ { - "cells": [ - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "OsXAs2gcIpbC" - }, - "outputs": [], - "source": [ - "# Copyright 2024 Google LLC\n", - "#\n", - "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", - "# you may not use this file except in compliance with the License.\n", - "# You may obtain a copy of the License at\n", - "#\n", - "# https://www.apache.org/licenses/LICENSE-2.0\n", - "#\n", - "# Unless required by applicable law or agreed to in writing, software\n", - "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", - "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", - "# See the License for the specific language governing permissions and\n", - "# limitations under the License." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "7ZX50cNFOFBt" - }, - "source": [ - " # Evaluate LangChain | Rapid Evaluation SDK Tutorial\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
\n", - " \n", - " \"Google
Open in Colab\n", - "
\n", - "
\n", - " \n", - " \"Google
Open in Colab Enterprise\n", - "
\n", - "
\n", - " \n", - " \"Vertex
Open in Workbench\n", - "
\n", - "
\n", - " \n", - " \"GitHub
View on GitHub\n", - "
\n", - "
" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "usd0d_LiOFBt" - }, - "source": [ - "| | |\n", - "|-|-|\n", - "|Author(s) | [Elia Secchi](https://github.com/eliasecchig) |" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "MjDmmmDaOFBt" - }, - "source": [ - "## Overview\n", - "\n", - "With this tutorial, you learn how to evaluate the performance of a conversational LangChain chain using the Vertex AI Rapid Evaluation SDK. The notebook utilizes a dummy chatbot designed to provide recipe suggestions.\n", - "\n", - "The tutorial goes trough:\n", - "1. Data preparation\n", - "2. Setting up the LangChain chain\n", - "3. Set-up a custom metric\n", - "4. Run evaluation with a combination of custom metrics and built-in metrics.\n", - "5. Log results into an experiment run and analyze different runs.\n", - "\n", - "### Costs\n", - "\n", - "This tutorial uses billable components of Google Cloud:\n", - "\n", - "- Vertex AI\n", - "\n", - "Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing) and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "w-OcPSC8_FUX" - }, - "source": [ - "## Getting Started" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "-7Jso8-FO4N8" - }, - "source": [ - "### Install Vertex AI SDK for Rapid Evaluation" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "tUat7NRq5JDC" - }, - "outputs": [], - "source": [ - "%pip install --quiet --upgrade nest_asyncio\n", - "%pip install --upgrade --user --quiet langchain-core langchain-google-vertexai langchain\n", - "%pip install --upgrade --user --quiet \"google-cloud-aiplatform[rapid_evaluation]\"" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "R5Xep4W9lq-Z" - }, - "source": [ - "### Restart runtime\n", - "\n", - "To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which restarts the current kernel.\n", - "\n", - "The restart might take a minute or longer. After it's restarted, continue to the next step." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "XRvKdaPDTznN" - }, - "outputs": [], - "source": [ - "import IPython\n", - "\n", - "app = IPython.Application.instance()\n", - "app.kernel.do_shutdown(True)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "SbmM4z7FOBpM" - }, - "source": [ - "
\n", - "⚠️ The kernel is going to restart. Wait until it's finished before continuing to the next step. ⚠️\n", - "
\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "dmWOrTJ3gx13" - }, - "source": [ - "### Authenticate your notebook environment (Colab only)\n", - "\n", - "If you're running this notebook on Google Colab, run the cell below to authenticate your environment." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "NyKGtVQjgx13" - }, - "outputs": [], - "source": [ - "# import sys\n", - "\n", - "# if \"google.colab\" in sys.modules:\n", - "# from google.colab import auth\n", - "\n", - "# auth.authenticate_user()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "DF4l8DTdWgPY" - }, - "source": [ - "### Set Google Cloud project information and initialize Vertex AI SDK\n", - "\n", - "To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).\n", - "\n", - "Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "Nqwi-5ufWp_B" - }, - "outputs": [], - "source": [ - "PROJECT_ID = \"[your-project-id]\" # @param {type:\"string\"}\n", - "LOCATION = \"us-central1\" # @param {type:\"string\"}\n", - "\n", - "\n", - "import vertexai\n", - "\n", - "vertexai.init(project=PROJECT_ID, location=LOCATION)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "dvhI92xhQTzk" - }, - "source": [ - "### Import libraries" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "qP4ihOCkEBje" - }, - "outputs": [], - "source": [ - "from concurrent.futures import ThreadPoolExecutor\n", - "from functools import partial\n", - "import json\n", - "from json.decoder import JSONDecodeError\n", - "import logging\n", - "from typing import Any\n", - "import warnings\n", - "\n", - "from google.cloud import aiplatform\n", - "from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder\n", - "from langchain_google_vertexai import ChatVertexAI, VertexAI\n", - "import nest_asyncio\n", - "import pandas as pd\n", - "from tqdm import tqdm\n", - "\n", - "# Main\n", - "import vertexai\n", - "from vertexai.preview.evaluation import EvalTask, make_metric\n", - "\n", - "# General\n", - "import yaml" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "bDmLyf-K5nRz" - }, - "source": [ - "### Library settings" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "KRkatYS95mLP" - }, - "outputs": [], - "source": [ - "logging.getLogger(\"urllib3.connectionpool\").setLevel(logging.ERROR)\n", - "nest_asyncio.apply()\n", - "warnings.filterwarnings(\"ignore\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "l26gX-cHOFBu" - }, - "source": [ - "### Helper functions" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "gT_OJBHfCg4Q" - }, - "outputs": [], - "source": [ - "def generate_multiturn_history(df: pd.DataFrame):\n", - " \"\"\"Processes a DataFrame of messages to add conversation history for each message.\n", - "\n", - " This function takes a DataFrame containing message data and iterates through each row.\n", - " For each message in a row, it constructs the conversation history up to that point by\n", - " accumulating previous user and AI messages. 
This conversation history is then added\n", - " to the message data, and the processed messages are returned as a new DataFrame.\n", - "\n", - " Args:\n", - " df: A DataFrame containing message data. It is expected to have a column named\n", - " \"messages\" where each entry is a list of dictionaries representing messages in\n", - " a conversation. Each message dictionary should have \"user\" and \"reference\" keys.\n", - "\n", - " Returns:\n", - " A DataFrame with the processed messages. Each message dictionary will now have an\n", - " additional \"conversation_history\" key containing a list of tuples representing the\n", - " conversation history leading up to that message. The tuples are of the form\n", - " (\"user\", message_text) or (\"ai\", message_text).\n", - " \"\"\"\n", - " processed_messages = []\n", - " for i, row in df.iterrows():\n", - " conversation_history = []\n", - " for message in row[\"messages\"]:\n", - " message[\"conversation_history\"] = conversation_history\n", - " processed_messages.append(message)\n", - " conversation_history = conversation_history + [\n", - " (\"user\", message[\"user\"]),\n", - " (\"ai\", message[\"reference\"]),\n", - " ]\n", - " return pd.DataFrame(processed_messages)\n", - "\n", - "\n", - "def pairwise(iterable):\n", - " \"\"\"Creates an iterable with tuples paired together\n", - " e.g s -> (s0, s1), (s2, s3), (s4, s5), ...\n", - " \"\"\"\n", - " a = iter(iterable)\n", - " return zip(a, a)\n", - "\n", - "\n", - "def batch_generate_message(row: dict, callable: Any) -> dict:\n", - " \"\"\"\n", - " Predicts a response from a chat agent.\n", - "\n", - " Args:\n", - " callable (ChatAgent): A chat agent.\n", - " row (dict): A message.\n", - " Returns:\n", - " dict: The predicted response.\n", - " \"\"\"\n", - " index, message = row\n", - "\n", - " messages = []\n", - " for user_message, ground_truth in pairwise(message.get(\"conversation_history\", [])):\n", - " messages.append((\"user\", user_message))\n", - " messages.append((\"ai\", ground_truth))\n", - " messages.append((\"user\", message[\"user\"]))\n", - " input_callable = {\"messages\": messages, **message.get(\"callable_kwargs\", {})}\n", - " response = callable.invoke(input_callable)\n", - " message[\"response\"] = response.content\n", - " message[\"response_obj\"] = response\n", - " return message\n", - "\n", - "\n", - "def batch_generate_messages(\n", - " messages: pd.DataFrame, callable: Any, max_workers: int = 4\n", - ") -> pd.DataFrame:\n", - " \"\"\"\n", - " Generates AI-powered responses to a series of user messages using a provided callable.\n", - "\n", - " This function efficiently processes a Pandas DataFrame containing user-AI message pairs,\n", - " utilizing the specified callable (either a LangChain Chain or a custom class with an\n", - " `invoke` method) to predict AI responses in parallel.\n", - "\n", - " Args:\n", - " callable (callable): A callable object (e.g., LangChain Chain, custom class) used\n", - " for response generation. Must have an `invoke(messages: dict) ->\n", - " langchain_core.messages.ai.AIMessage` method.\n", - " The `messages` dict follows this structure:\n", - " {\"messages\" [(\"user\", \"first\"),(\"ai\", \"a response\"), (\"user\", \"a follow up\")]}\n", - "\n", - " messages (pd.DataFrame): A DataFrame with one column named 'messages' containing\n", - " the list of user-AI message pairs as described above.\n", - "\n", - " max_workers (int, optional): The number of worker processes to use for parallel\n", - " prediction. 
Defaults to the number of available CPU cores.\n", - "\n", - " Returns:\n", - " pd.DataFrame: A DataFrame containing the original messages and a new column with the predicted AI responses.\n", - "\n", - " Example:\n", - " ```python\n", - " messages_df = pd.DataFrame({\n", - " \"messages\": [\n", - " [{\"user\": \"What's the weather today?\", \"reference\": \"It's sunny.\"}],\n", - " [{\"user\": \"Tell me a joke.\", \"reference\": \"Why did the scarecrow win an award?...\n", - " Because he was outstanding in his field!\"}]\n", - " ]\n", - " })\n", - "\n", - " responses_df = batch_generate_messages(my_callable, messages_df)\n", - " ```\n", - " \"\"\"\n", - " logging.info(\"Executing batch scoring\")\n", - " predicted_messages = []\n", - " with ThreadPoolExecutor(max_workers) as pool:\n", - " partial_func = partial(batch_generate_message, callable=callable)\n", - " for message in tqdm(\n", - " pool.map(partial_func, messages.iterrows()), total=len(messages)\n", - " ):\n", - " predicted_messages.append(message)\n", - " return pd.DataFrame(predicted_messages)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "SZMXiJhKD2TY" - }, - "source": [ - "## Import ground truth data for evaluation\n", - "\n", - "In this sample, we will use 2 conversations as a ground truth. Every message in the conversations, along with the relative history, will be used to produce a response with a foundational model.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "jwqlusT7OJ4J" - }, - "outputs": [], - "source": [ - "%%writefile chats.yaml\n", - "- messages:\n", - " - user: Hi\n", - " reference: Hi, how can I help you?\n", - " - user: I'm looking for a recipe for a healthy dinner. Do you have any recommendations?\n", - " reference: Sure, I can help you with that. What are your dietary restrictions? Are you vegetarian, vegan, gluten-free, or anything else?\n", - " - user: I'm not vegetarian or vegan, but I am gluten-free.\n", - " reference: 'Okay, I ll keep that in mind. Here are a few recipes that I think you might like:\n", - " * **Grilled Salmon with Roasted Vegetables:** This is a delicious and healthy recipe that is perfect for a weeknight meal. The salmon is grilled to perfection and the roasted vegetables add a touch of sweetness.\n", - " * **Chicken Stir-Fry:** This is a quick and easy stir-fry that is perfect for busy weeknights. The chicken is cooked with vegetables and a light sauce.\n", - " * **Lentil Soup:** This is a hearty and healthy soup that is perfect for a cold winter day. The lentils are packed with protein and fiber, and the soup is also gluten-free.'\n", - " - user: Those all sound great! I think I'm going to try the grilled salmon with roasted vegetables.\n", - " reference: That's a great choice! I hope you enjoy it.\n", - " - user: Thanks for your help!\n", - " reference: You're welcome! Is there anything else I can help you with today?\n", - " - user: No, that's all. Thanks again!\n", - " reference: You're welcome! Have a great day!\n", - "- messages:\n", - " - user: Hi\n", - " reference: Hi, how can I help you?\n", - " - user: I'm looking for a recipe for a romantic dinner. Do you have any recommendations?\n", - " reference: Sure, I can help you with that. What are your dietary restrictions? Are you vegetarian, vegan, gluten-free, or anything else?\n", - " - user: I'm vegetarian.\n", - " reference: 'Sure, I can help you find a healthy vegetarian dinner recipe. 
Here are a few ideas:\n", - " * **Burnt aubergine veggie chilli:** This is a hearty and flavorful dish that is packed with nutrients. The roasted aubergine gives it a smoky flavor, and the lentils and beans add protein and fiber.\n", - " * **Simple mushroom curry:** This is a quick and easy curry that is perfect for a weeknight meal. The mushrooms are cooked in a creamy sauce with spices, and the whole dish is ready in under 30 minutes.\n", - " * **Vegetarian enchiladas:** This is a classic Mexican dish that is easy to make vegetarian. The enchiladas are filled with a variety of vegetables, and they are topped with a delicious sauce.\n", - " * **Braised sesame tofu:** This is a flavorful and satisfying dish that is perfect for a cold night. The tofu is braised in a sauce with sesame, ginger, and garlic, and it is served over rice or noodles.\n", - " * **Roast garlic & tahini spinach:** This is a light and healthy dish that is perfect for a spring or summer meal. The spinach is roasted with garlic and tahini, and it is served with a side of pita bread.\n", - "\n", - " These are just a few ideas to get you started. There are many other great vegetarian dinner recipes out there, so you are sure to find something that you will enjoy.'\n", - " - user: Those all sound great! I like the Burnt aubergine veggie chilli\n", - " reference: That's a great choice! I hope you enjoy it." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ie7KTd_OjwdE" - }, - "source": [ - "Let's now load all the messages into a Pandas DataFrame:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "j0eSy_avGCW7" - }, - "outputs": [], - "source": [ - "y = yaml.safe_load(open(\"chats.yaml\"))\n", - "df = pd.DataFrame(y)\n", - "df" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "xA7G0bDTj84z" - }, - "source": [ - "**Decomposing the chat message input/output pairs**\n", - "\n", - "We are now ready for decomposing multi-turn history. This is essential to enable batch prediction.\n", - "\n", - "We decompose the `messages` list into single input/output pairs. The input is always composed by `user message`, `reference message` and `conversation_history`.\n", - "\n", - "**Example:**\n", - "\n", - "Given the following chat:\n", - "\n", - "```yaml\n", - "- messages:\n", - "- user: Hi\n", - "reference: Hi, how can I help you?\n", - "- user: I'm looking for a recipe for a healthy dinner. Do you have any recommendations?\n", - "reference: Sure, I can help you with that. What are your dietary restrictions? Are you vegetarian, vegan, gluten-free, or anything else?\n", - "```\n", - "\n", - "We can generate these two input/output samples:\n", - "\n", - "```yaml\n", - "- user: Hi\n", - " reference: Hi, how can I help you?\n", - " conversation_history: []\n", - "\n", - "- user: I'm looking for a recipe for a healthy dinner....\n", - " reference: Sure, I can help you with that. What are your ...\n", - " conversation_history:\n", - " - user: Hi\n", - " reference: Hi, how can I help you?\n", - "```" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "1FL0R3YXPj3e" - }, - "outputs": [], - "source": [ - "df_processed = generate_multiturn_history(df)\n", - "df_processed" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "8OhQLR1lGsEq" - }, - "source": [ - "## Let's define our LangChain chain!\n", - "\n", - "We now need to define our LangChain Chain. 
For this tutorial, we will create a simple conversational chain capable of producing cooking recipes for users." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "u5B0ufczjfA_" - }, - "outputs": [], - "source": [ - "llm = ChatVertexAI(model_name=\"gemini-1.5-flash-001\", temperature=0)\n", - "\n", - "template = ChatPromptTemplate.from_messages(\n", - " [\n", - " (\n", - " \"system\",\n", - " \"\"\"You are a conversational bot that produce nice recipes for users based on a question.\"\"\",\n", - " ),\n", - " MessagesPlaceholder(variable_name=\"messages\"),\n", - " ]\n", - ")\n", - "\n", - "chain = template | llm" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "MS9rlK6eq5qY" - }, - "source": [ - "We can test our chain:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "qzMotN92qrqv" - }, - "outputs": [], - "source": [ - "chain.invoke([(\"human\", \"Hi there!\")])" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "NqdVpTMkq_pc" - }, - "source": [ - "## Batch scoring\n", - "\n", - "We are now ready to perform batch scoring. To perform batch scoring we will leverage the utility function `batch_generate_messages`\n", - "\n", - "Have a look at the definition to see the expected input format." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "Qc4DiVYOKk6H" - }, - "outputs": [], - "source": [ - "help(batch_generate_messages)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "Pip0WjopXMsu" - }, - "outputs": [], - "source": [ - "scored_data = batch_generate_messages(messages=df_processed, callable=chain)\n", - "scored_data" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "7Tf1bssiSBk7" - }, - "source": [ - "## Evaluation\n", - "\n", - "We'll utilize [Vertex AI Rapid Evaluation](https://cloud.google.com/vertex-ai/generative-ai/docs/models/rapid-evaluation) to assess our generative AI model's performance. This service within Vertex AI streamlines the evaluation process, integrates with [Vertex AI Experiments](https://cloud.google.com/vertex-ai/docs/experiments/intro-vertex-ai-experiments) for tracking, and offers a range of [pre-built metrics](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval#task-and-metrics) and the capability to define custom ones.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "5ZSzjLI_rBSp" - }, - "source": [ - "#### Define a CustomMetric using Gemini model\n", - "\n", - "Define a customized Gemini model-based metric function, with explanations for the score. The registered custom metrics are computed on the client side, without using online evaluation service APIs." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "aT0uclHrSBlC" - }, - "outputs": [], - "source": [ - "evaluator_llm = VertexAI(model_name=\"gemini-1.5-flash-001\", temperature=0)\n", - "\n", - "\n", - "def custom_faithfulness(instance):\n", - " prompt = f\"\"\"You are examining written text content. 
Here is the text:\n", - "************\n", - "Written content: {instance[\"response\"]}\n", - "************\n", - "Original source data: {instance[\"reference\"]}\n", - "\n", - "Examine the text and determine whether the text is faithful or not.\n", - "Faithfulness refers to how accurately a generated summary reflects the essential information and key concepts present in the original source document.\n", - "A faithful summary stays true to the facts and meaning of the source text, without introducing distortions, hallucinations, or information that wasn't originally there.\n", - "\n", - "Your response must be an explanation of your thinking along with single integer number on a scale of 0-5, 0\n", - "the least faithful and 5 being the most faithful.\n", - "\n", - "Produce results in JSON\n", - "\n", - "Expected format:\n", - "\n", - "```json\n", - "{{\n", - " \"explanation\": \"< your explanation>\",\n", - " \"custom_faithfulness\": \n", - "}}\n", - "```\n", - "\"\"\"\n", - "\n", - " result = evaluator_llm.invoke(prompt)\n", - " try:\n", - " result = json.loads(result.replace(\"```\", \"\").replace(\"json\", \"\"))\n", - " except JSONDecodeError:\n", - " result = {\"explanation\": None, \"custom_faithfulness\": None}\n", - " return result\n", - "\n", - "\n", - "# Register Custom Metric\n", - "custom_faithfulness_metric = make_metric(\n", - " name=\"custom_faithfulness\",\n", - " metric_function=custom_faithfulness,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "JpdRqYlLq53t" - }, - "source": [ - "### Run Evaluation with CustomMetric" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "w4iU_mIhoY93" - }, - "outputs": [], - "source": [ - "experiment_name = \"rapid-eval-langchain-eval\" # @param {type:\"string\"}" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "U3r3NiKNqn3u" - }, - "source": [ - "We are now ready to run the evaluation. We will use different metrics, combining the custom metric we defined above with some pre-built metrics.\n", - "\n", - "Results of the evaluation will be automatically tagged into the experiment_name we defined.\n", - "\n", - "You can click `View Experiment`, to see the experiment in Google Cloud Console." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "zeVra-g1rAFV" - }, - "outputs": [], - "source": [ - "metrics = [\"fluency\", \"coherence\", \"safety\", custom_faithfulness_metric]\n", - "eval_task = EvalTask(dataset=scored_data, metrics=metrics, experiment=experiment_name)\n", - "eval_result = eval_task.evaluate()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ka3tZCL_uurD" - }, - "source": [ - "Once an eval result is produced, we are able to display summary metrics:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "KheOvIvtiRlz" - }, - "outputs": [], - "source": [ - "eval_result.summary_metrics" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "JcALGGlwu0p_" - }, - "source": [ - "We are also able to display a pandas dataframe containing a detailed summary of how our eval dataset performed and relative granular metrics." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "9zJ686YYiWJC" - }, - "outputs": [], - "source": [ - "eval_result.metrics_table" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "L1NsUKA3vEu8" - }, - "source": [ - "## Iterating over the prompt\n", - "\n", - "Let's perform some simple changes to our chain to see how our evaluation results change." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "O_8PnvLSv7Nu" - }, - "outputs": [], - "source": [ - "template = ChatPromptTemplate.from_messages(\n", - " [\n", - " (\n", - " \"system\",\n", - " \"\"\"You are a conversational bot that produce nice recipes for users based on a question.\n", - "Before suggesting a recipe, you should ask for the dietary requirements..\"\"\",\n", - " ),\n", - " MessagesPlaceholder(variable_name=\"messages\"),\n", - " ]\n", - ")\n", - "\n", - "new_chain = template | llm" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "JF_Po81twPbr" - }, - "outputs": [], - "source": [ - "scored_data = batch_generate_messages(messages=df_processed, callable=new_chain)\n", - "scored_data.rename(columns={\"text\": \"response\"}, inplace=True)\n", - "scored_data" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "snIQ0itfwUZa" - }, - "outputs": [], - "source": [ - "metrics = [\"fluency\", \"coherence\", \"safety\", custom_faithfulness_metric]\n", - "eval_task = EvalTask(dataset=scored_data, metrics=metrics, experiment=experiment_name)\n", - "eval_result = eval_task.evaluate()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "CT6Ma5FLwfbF" - }, - "outputs": [], - "source": [ - "eval_result.summary_metrics" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "b42l5juJsyyY" - }, - "source": [ - "#### Let's compare both eval results\n", - "\n", - "We can do that by using the method `display_runs` for a given `eval task` object to see which prompt template performed best on our dataset." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "HP0Zcm1yvh95" - }, - "outputs": [], - "source": [ - "df = vertexai.preview.get_experiment_df(experiment_name).T\n", - "df" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "TpV-iwP9qw9c" - }, - "source": [ - "## Cleaning up\n", - "\n", - "To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud\n", - "project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.\n", - "\n", - "Otherwise, you can delete the individual resources you created in this tutorial." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "sx_vKniMq9ZX" - }, - "outputs": [], - "source": [ - "import os\n", - "\n", - "# Delete Experiments\n", - "delete_experiments = True\n", - "if delete_experiments or os.getenv(\"IS_TESTING\"):\n", - " experiments_list = aiplatform.Experiment.list()\n", - " for experiment in experiments_list:\n", - " experiment.delete()" - ] - } - ], - "metadata": { - "colab": { - "name": "evaluate_langchain_chains.ipynb", - "toc_visible": true - }, - "kernelspec": { - "display_name": "Python 3", - "name": "python3" - } - }, - "nbformat": 4, - "nbformat_minor": 0 -} + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "OsXAs2gcIpbC" + }, + "outputs": [], + "source": [ + "# Copyright 2024 Google LLC\n", + "#\n", + "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "# you may not use this file except in compliance with the License.\n", + "# You may obtain a copy of the License at\n", + "#\n", + "# https://www.apache.org/licenses/LICENSE-2.0\n", + "#\n", + "# Unless required by applicable law or agreed to in writing, software\n", + "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "# See the License for the specific language governing permissions and\n", + "# limitations under the License." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7ZX50cNFOFBt" + }, + "source": [ + " # Evaluate LangChain | Rapid Evaluation SDK Tutorial\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \"Google
Open in Colab\n", + "
\n", + "
\n", + " \n", + " \"Google
Open in Colab Enterprise\n", + "
\n", + "
\n", + " \n", + " \"Vertex
Open in Workbench\n", + "
\n", + "
\n", + " \n", + " \"GitHub
View on GitHub\n", + "
\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "usd0d_LiOFBt" + }, + "source": [ + "| | |\n", + "|-|-|\n", + "|Author(s) | [Elia Secchi](https://github.com/eliasecchig) |" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MjDmmmDaOFBt" + }, + "source": [ + "## Overview\n", + "\n", + "With this tutorial, you learn how to evaluate the performance of a conversational LangChain chain using the Vertex AI Rapid Evaluation SDK. The notebook utilizes a dummy chatbot designed to provide recipe suggestions.\n", + "\n", + "The tutorial goes trough:\n", + "1. Data preparation\n", + "2. Setting up the LangChain chain\n", + "3. Set-up a custom metric\n", + "4. Run evaluation with a combination of custom metrics and built-in metrics.\n", + "5. Log results into an experiment run and analyze different runs.\n", + "\n", + "### Costs\n", + "\n", + "This tutorial uses billable components of Google Cloud:\n", + "\n", + "- Vertex AI\n", + "\n", + "Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing) and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "w-OcPSC8_FUX" + }, + "source": [ + "## Getting Started" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-7Jso8-FO4N8" + }, + "source": [ + "### Install Vertex AI SDK for Rapid Evaluation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "tUat7NRq5JDC" + }, + "outputs": [], + "source": [ + "%pip install --quiet --upgrade nest_asyncio\n", + "%pip install --upgrade --user --quiet langchain-core langchain-google-vertexai langchain\n", + "%pip install --upgrade --user --quiet \"google-cloud-aiplatform[rapid_evaluation]\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "R5Xep4W9lq-Z" + }, + "source": [ + "### Restart runtime\n", + "\n", + "To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which restarts the current kernel.\n", + "\n", + "The restart might take a minute or longer. After it's restarted, continue to the next step." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "XRvKdaPDTznN" + }, + "outputs": [], + "source": [ + "import IPython\n", + "\n", + "app = IPython.Application.instance()\n", + "app.kernel.do_shutdown(True)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SbmM4z7FOBpM" + }, + "source": [ + "
\n", + "⚠️ The kernel is going to restart. Wait until it's finished before continuing to the next step. ⚠️\n", + "
\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dmWOrTJ3gx13" + }, + "source": [ + "### Authenticate your notebook environment (Colab only)\n", + "\n", + "If you're running this notebook on Google Colab, run the cell below to authenticate your environment." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "NyKGtVQjgx13" + }, + "outputs": [], + "source": [ + "import sys\n", + "\n", + "if \"google.colab\" in sys.modules:\n", + " from google.colab import auth\n", + "\n", + " auth.authenticate_user()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DF4l8DTdWgPY" + }, + "source": [ + "### Set Google Cloud project information and initialize Vertex AI SDK\n", + "\n", + "To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).\n", + "\n", + "Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Nqwi-5ufWp_B" + }, + "outputs": [], + "source": [ + "PROJECT_ID = \"genai-blackbelt-fishfooding\" # @param {type:\"string\"}\n", + "LOCATION = \"us-central1\" # @param {type:\"string\"}\n", + "\n", + "\n", + "import vertexai\n", + "\n", + "vertexai.init(project=PROJECT_ID, location=LOCATION)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dvhI92xhQTzk" + }, + "source": [ + "### Import libraries" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "qP4ihOCkEBje" + }, + "outputs": [], + "source": [ + "# General\n", + "import yaml\n", + "import json\n", + "import logging\n", + "import warnings\n", + "from concurrent.futures import ThreadPoolExecutor\n", + "from functools import partial\n", + "from json.decoder import JSONDecodeError\n", + "from typing import Any\n", + "import nest_asyncio\n", + "import pandas as pd\n", + "\n", + "# Main\n", + "import vertexai\n", + "from google.cloud import aiplatform\n", + "from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder\n", + "from langchain_google_vertexai import ChatVertexAI, VertexAI\n", + "from tqdm import tqdm\n", + "from vertexai.evaluation import EvalTask, CustomMetric" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bDmLyf-K5nRz" + }, + "source": [ + "### Library settings" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "KRkatYS95mLP" + }, + "outputs": [], + "source": [ + "logging.getLogger(\"urllib3.connectionpool\").setLevel(logging.ERROR)\n", + "nest_asyncio.apply()\n", + "warnings.filterwarnings(\"ignore\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "l26gX-cHOFBu" + }, + "source": [ + "### Helper functions" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "gT_OJBHfCg4Q" + }, + "outputs": [], + "source": [ + "def generate_multiturn_history(df: pd.DataFrame):\n", + " \"\"\"Processes a DataFrame of messages to add conversation history for each message.\n", + "\n", + " This function takes a DataFrame containing message data and iterates through each row.\n", + " For each message in a row, it constructs the conversation history up to that point by\n", + " accumulating previous user and AI messages. 
+    "    to the message data, and the processed messages are returned as a new DataFrame.\n",
+    "\n",
+    "    Args:\n",
+    "        df: A DataFrame containing message data. It is expected to have a column named\n",
+    "            \"messages\" where each entry is a list of dictionaries representing messages in\n",
+    "            a conversation. Each message dictionary should have \"user\" and \"reference\" keys.\n",
+    "\n",
+    "    Returns:\n",
+    "        A DataFrame with the processed messages. Each message dictionary will now have an\n",
+    "        additional \"conversation_history\" key containing a list of tuples representing the\n",
+    "        conversation history leading up to that message. The tuples are of the form\n",
+    "        (\"user\", message_text) or (\"ai\", message_text).\n",
+    "    \"\"\"\n",
+    "    processed_messages = []\n",
+    "    for i, row in df.iterrows():\n",
+    "        conversation_history = []\n",
+    "        for message in row[\"messages\"]:\n",
+    "            message[\"conversation_history\"] = conversation_history\n",
+    "            processed_messages.append(message)\n",
+    "            conversation_history = conversation_history + [\n",
+    "                (\"user\", message[\"user\"]),\n",
+    "                (\"ai\", message[\"reference\"]),\n",
+    "            ]\n",
+    "    return pd.DataFrame(processed_messages)\n",
+    "\n",
+    "\n",
+    "def pairwise(iterable):\n",
+    "    \"\"\"Creates an iterable with tuples paired together\n",
+    "    e.g. s -> (s0, s1), (s2, s3), (s4, s5), ...\n",
+    "    \"\"\"\n",
+    "    a = iter(iterable)\n",
+    "    return zip(a, a)\n",
+    "\n",
+    "\n",
+    "def batch_generate_message(row: dict, callable: Any) -> dict:\n",
+    "    \"\"\"\n",
+    "    Predicts a response from a chat agent.\n",
+    "\n",
+    "    Args:\n",
+    "        callable (ChatAgent): A chat agent.\n",
+    "        row (dict): A message.\n",
+    "    Returns:\n",
+    "        dict: The predicted response.\n",
+    "    \"\"\"\n",
+    "    index, message = row\n",
+    "\n",
+    "    messages = []\n",
+    "    for user_message, ground_truth in pairwise(message.get(\"conversation_history\", [])):\n",
+    "        messages.append((\"user\", user_message))\n",
+    "        messages.append((\"ai\", ground_truth))\n",
+    "    messages.append((\"user\", message[\"user\"]))\n",
+    "    input_callable = {\"messages\": messages, **message.get(\"callable_kwargs\", {})}\n",
+    "    response = callable.invoke(input_callable)\n",
+    "    message[\"response\"] = response.content\n",
+    "    message[\"response_obj\"] = response\n",
+    "    return message\n",
+    "\n",
+    "\n",
+    "def batch_generate_messages(\n",
+    "    messages: pd.DataFrame, callable: Any, max_workers: int = 4\n",
+    ") -> pd.DataFrame:\n",
+    "    \"\"\"\n",
+    "    Generates AI-powered responses to a series of user messages using a provided callable.\n",
+    "\n",
+    "    This function efficiently processes a Pandas DataFrame containing user-AI message pairs,\n",
+    "    utilizing the specified callable (either a LangChain Chain or a custom class with an\n",
+    "    `invoke` method) to predict AI responses in parallel.\n",
+    "\n",
+    "    Args:\n",
+    "        callable (callable): A callable object (e.g., LangChain Chain, custom class) used\n",
+    "            for response generation. Must have an `invoke(messages: dict) ->\n",
+    "            langchain_core.messages.ai.AIMessage` method.\n",
+    "            The `messages` dict follows this structure:\n",
+    "            {\"messages\": [(\"user\", \"first\"), (\"ai\", \"a response\"), (\"user\", \"a follow up\")]}\n",
+    "\n",
+    "        messages (pd.DataFrame): A DataFrame with one column named 'messages' containing\n",
+    "            the list of user-AI message pairs as described above.\n",
+    "\n",
+    "        max_workers (int, optional): The number of worker threads to use for parallel\n",
+    "            prediction. Defaults to 4.\n",
+    "\n",
+    "    Returns:\n",
+    "        pd.DataFrame: A DataFrame containing the original messages and a new column with the predicted AI responses.\n",
+    "\n",
+    "    Example:\n",
+    "        ```python\n",
+    "        messages_df = pd.DataFrame({\n",
+    "            \"messages\": [\n",
+    "                [{\"user\": \"What's the weather today?\", \"reference\": \"It's sunny.\"}],\n",
+    "                [{\"user\": \"Tell me a joke.\", \"reference\": \"Why did the scarecrow win an award?...\n",
+    "                Because he was outstanding in his field!\"}]\n",
+    "            ]\n",
+    "        })\n",
+    "\n",
+    "        responses_df = batch_generate_messages(messages=messages_df, callable=my_callable)\n",
+    "        ```\n",
+    "    \"\"\"\n",
+    "    logging.info(\"Executing batch scoring\")\n",
+    "    predicted_messages = []\n",
+    "    with ThreadPoolExecutor(max_workers) as pool:\n",
+    "        partial_func = partial(batch_generate_message, callable=callable)\n",
+    "        for message in tqdm(\n",
+    "            pool.map(partial_func, messages.iterrows()), total=len(messages)\n",
+    "        ):\n",
+    "            predicted_messages.append(message)\n",
+    "    return pd.DataFrame(predicted_messages)"
+   ]
+  },
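+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a quick illustration of how `pairwise` re-assembles `(user, ai)` turns from a flat history list (a minimal sketch with made-up strings, not part of the evaluation flow):\n",
+    "\n",
+    "```python\n",
+    "# Pairs consecutive items: user turn followed by the AI reply\n",
+    "list(pairwise([\"Hi\", \"Hello!\", \"Any vegan recipes?\", \"Sure, here are a few...\"]))\n",
+    "# -> [('Hi', 'Hello!'), ('Any vegan recipes?', 'Sure, here are a few...')]\n",
+    "```"
+   ]
+  },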
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "SZMXiJhKD2TY"
+   },
+   "source": [
+    "## Import ground truth data for evaluation\n",
+    "\n",
+    "In this sample, we will use two conversations as ground truth. Every message in the conversations, along with its conversation history, will be used to produce a response with a foundation model.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "jwqlusT7OJ4J"
+   },
+   "outputs": [],
+   "source": [
+    "%%writefile chats.yaml\n",
+    "- messages:\n",
+    "  - user: Hi\n",
+    "    reference: Hi, how can I help you?\n",
+    "  - user: I'm looking for a recipe for a healthy dinner. Do you have any recommendations?\n",
+    "    reference: Sure, I can help you with that. What are your dietary restrictions? Are you vegetarian, vegan, gluten-free, or anything else?\n",
+    "  - user: I'm not vegetarian or vegan, but I am gluten-free.\n",
+    "    reference: 'Okay, I''ll keep that in mind. Here are a few recipes that I think you might like:\n",
+    "      * **Grilled Salmon with Roasted Vegetables:** This is a delicious and healthy recipe that is perfect for a weeknight meal. The salmon is grilled to perfection and the roasted vegetables add a touch of sweetness.\n",
+    "      * **Chicken Stir-Fry:** This is a quick and easy stir-fry that is perfect for busy weeknights. The chicken is cooked with vegetables and a light sauce.\n",
+    "      * **Lentil Soup:** This is a hearty and healthy soup that is perfect for a cold winter day. The lentils are packed with protein and fiber, and the soup is also gluten-free.'\n",
+    "  - user: Those all sound great! I think I'm going to try the grilled salmon with roasted vegetables.\n",
+    "    reference: That's a great choice! I hope you enjoy it.\n",
+    "  - user: Thanks for your help!\n",
+    "    reference: You're welcome! Is there anything else I can help you with today?\n",
+    "  - user: No, that's all. Thanks again!\n",
+    "    reference: You're welcome! Have a great day!\n",
+    "- messages:\n",
+    "  - user: Hi\n",
+    "    reference: Hi, how can I help you?\n",
+    "  - user: I'm looking for a recipe for a romantic dinner. Do you have any recommendations?\n",
+    "    reference: Sure, I can help you with that. What are your dietary restrictions? Are you vegetarian, vegan, gluten-free, or anything else?\n",
+    "  - user: I'm vegetarian.\n",
+    "    reference: 'Sure, I can help you find a healthy vegetarian dinner recipe. Here are a few ideas:\n",
+    "      * **Burnt aubergine veggie chilli:** This is a hearty and flavorful dish that is packed with nutrients. The roasted aubergine gives it a smoky flavor, and the lentils and beans add protein and fiber.\n",
+    "      * **Simple mushroom curry:** This is a quick and easy curry that is perfect for a weeknight meal. The mushrooms are cooked in a creamy sauce with spices, and the whole dish is ready in under 30 minutes.\n",
+    "      * **Vegetarian enchiladas:** This is a classic Mexican dish that is easy to make vegetarian. The enchiladas are filled with a variety of vegetables, and they are topped with a delicious sauce.\n",
+    "      * **Braised sesame tofu:** This is a flavorful and satisfying dish that is perfect for a cold night. The tofu is braised in a sauce with sesame, ginger, and garlic, and it is served over rice or noodles.\n",
+    "      * **Roast garlic & tahini spinach:** This is a light and healthy dish that is perfect for a spring or summer meal. The spinach is roasted with garlic and tahini, and it is served with a side of pita bread.\n",
+    "\n",
+    "      These are just a few ideas to get you started. There are many other great vegetarian dinner recipes out there, so you are sure to find something that you will enjoy.'\n",
+    "  - user: Those all sound great! I like the Burnt aubergine veggie chilli\n",
+    "    reference: That's a great choice! I hope you enjoy it."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "ie7KTd_OjwdE"
+   },
+   "source": [
+    "Let's now load all the messages into a Pandas DataFrame:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "j0eSy_avGCW7"
+   },
+   "outputs": [],
+   "source": [
+    "with open(\"chats.yaml\") as f:\n",
+    "    y = yaml.safe_load(f)\n",
+    "df = pd.DataFrame(y)\n",
+    "df"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "xA7G0bDTj84z"
+   },
+   "source": [
+    "**Decomposing the chat message input/output pairs**\n",
+    "\n",
+    "We are now ready to decompose the multi-turn history. This is essential to enable batch prediction.\n",
+    "\n",
+    "We decompose the `messages` list into single input/output pairs. The input is always composed of the `user` message, the `reference` message, and the `conversation_history`.\n",
+    "\n",
+    "**Example:**\n",
+    "\n",
+    "Given the following chat:\n",
+    "\n",
+    "```yaml\n",
+    "- messages:\n",
+    "  - user: Hi\n",
+    "    reference: Hi, how can I help you?\n",
+    "  - user: I'm looking for a recipe for a healthy dinner. Do you have any recommendations?\n",
+    "    reference: Sure, I can help you with that. What are your dietary restrictions? Are you vegetarian, vegan, gluten-free, or anything else?\n",
+    "```\n",
+    "\n",
+    "We can generate these two input/output samples:\n",
+    "\n",
+    "```yaml\n",
+    "- user: Hi\n",
+    "  reference: Hi, how can I help you?\n",
+    "  conversation_history: []\n",
+    "\n",
+    "- user: I'm looking for a recipe for a healthy dinner....\n",
+    "  reference: Sure, I can help you with that. What are your ...\n",
+    "  conversation_history:\n",
+    "  - user: Hi\n",
+    "    reference: Hi, how can I help you?\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "1FL0R3YXPj3e"
+   },
+   "outputs": [],
+   "source": [
+    "df_processed = generate_multiturn_history(df)\n",
+    "df_processed"
+   ]
+  },
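+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a quick sanity check, we can peek at the history attached to one decomposed row (a sketch; index `3` is an assumption tied to the two sample conversations above, pointing at the fourth message of the first conversation):\n",
+    "\n",
+    "```python\n",
+    "row = df_processed.iloc[3]\n",
+    "print(row[\"user\"])\n",
+    "print(row[\"conversation_history\"][:2])  # the first (user, ai) turns preceding it\n",
+    "```"
+   ]
+  },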
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "8OhQLR1lGsEq"
+   },
+   "source": [
+    "## Let's define our LangChain chain!\n",
+    "\n",
+    "We now need to define our LangChain chain. For this tutorial, we will create a simple conversational chain capable of producing cooking recipes for users."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "u5B0ufczjfA_"
+   },
+   "outputs": [],
+   "source": [
+    "llm = ChatVertexAI(model_name=\"gemini-1.5-flash-001\", temperature=0)\n",
+    "\n",
+    "template = ChatPromptTemplate.from_messages(\n",
+    "    [\n",
+    "        (\n",
+    "            \"system\",\n",
+    "            \"\"\"You are a conversational bot that produces nice recipes for users based on a question.\"\"\",\n",
+    "        ),\n",
+    "        MessagesPlaceholder(variable_name=\"messages\"),\n",
+    "    ]\n",
+    ")\n",
+    "\n",
+    "chain = template | llm"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "MS9rlK6eq5qY"
+   },
+   "source": [
+    "We can test our chain:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "qzMotN92qrqv"
+   },
+   "outputs": [],
+   "source": [
+    "chain.invoke([(\"human\", \"Hi there!\")])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "NqdVpTMkq_pc"
+   },
+   "source": [
+    "## Batch scoring\n",
+    "\n",
+    "We are now ready to perform batch scoring, leveraging the utility function `batch_generate_messages`.\n",
+    "\n",
+    "Have a look at its definition to see the expected input format."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "Qc4DiVYOKk6H"
+   },
+   "outputs": [],
+   "source": [
+    "help(batch_generate_messages)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "Pip0WjopXMsu"
+   },
+   "outputs": [],
+   "source": [
+    "scored_data = batch_generate_messages(messages=df_processed, callable=chain)\n",
+    "scored_data"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "7Tf1bssiSBk7"
+   },
+   "source": [
+    "## Evaluation\n",
+    "\n",
+    "We'll utilize [Vertex AI Rapid Evaluation](https://cloud.google.com/vertex-ai/generative-ai/docs/models/rapid-evaluation) to assess our generative AI model's performance. This service within Vertex AI streamlines the evaluation process, integrates with [Vertex AI Experiments](https://cloud.google.com/vertex-ai/docs/experiments/intro-vertex-ai-experiments) for tracking, and offers a range of [pre-built metrics](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval#task-and-metrics) and the capability to define custom ones.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "5ZSzjLI_rBSp"
+   },
+   "source": [
+    "#### Define a CustomMetric using a Gemini model\n",
+    "\n",
+    "Define a customized Gemini model-based metric function, with explanations for the score. The registered custom metrics are computed on the client side, without using online evaluation service APIs."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "aT0uclHrSBlC"
+   },
+   "outputs": [],
+   "source": [
+    "evaluator_llm = ChatVertexAI(\n",
+    "    model_name=\"gemini-1.5-flash-001\",\n",
+    "    temperature=0,\n",
+    "    response_mime_type=\"application/json\",\n",
+    ")\n",
+    "\n",
+    "\n",
+    "def custom_faithfulness(instance):\n",
+    "    prompt = f\"\"\"You are examining written text content. Here is the text:\n",
+    "************\n",
+    "Written content: {instance[\"response\"]}\n",
+    "************\n",
+    "Original source data: {instance[\"reference\"]}\n",
+    "\n",
+    "Examine the text and determine whether the text is faithful or not.\n",
+    "Faithfulness refers to how accurately a generated summary reflects the essential information and key concepts present in the original source document.\n",
+    "A faithful summary stays true to the facts and meaning of the source text, without introducing distortions, hallucinations, or information that wasn't originally there.\n",
+    "\n",
+    "Your response must be an explanation of your thinking along with a single integer on a scale of 0-5, with 0\n",
+    "being the least faithful and 5 being the most faithful.\n",
+    "\n",
+    "Produce the result in JSON.\n",
+    "\n",
+    "Expected format:\n",
+    "\n",
+    "```json\n",
+    "{{\n",
+    "  \"explanation\": \"<your explanation>\",\n",
+    "  \"custom_faithfulness\": <your score>\n",
+    "}}\n",
+    "```\n",
+    "\"\"\"\n",
+    "\n",
+    "    result = evaluator_llm.invoke([(\"human\", prompt)])\n",
+    "    result = json.loads(result.content)\n",
+    "    return result\n",
+    "\n",
+    "\n",
+    "# Register Custom Metric\n",
+    "custom_faithfulness_metric = CustomMetric(\n",
+    "    name=\"custom_faithfulness\",\n",
+    "    metric_function=custom_faithfulness,\n",
+    ")"
+   ]
+  },
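+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Before wiring the metric into an evaluation, we can sanity-check it on a single instance (a minimal sketch; the `instance` dict only needs the `response` and `reference` keys the function reads):\n",
+    "\n",
+    "```python\n",
+    "sample = {\n",
+    "    \"response\": \"Grilled salmon with roasted vegetables is a healthy pick.\",\n",
+    "    \"reference\": \"Grilled Salmon with Roasted Vegetables: a delicious and healthy recipe.\",\n",
+    "}\n",
+    "custom_faithfulness(sample)\n",
+    "# e.g. {'explanation': '...', 'custom_faithfulness': 4}\n",
+    "```"
+   ]
+  },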
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "JpdRqYlLq53t"
+   },
+   "source": [
+    "### Run Evaluation with CustomMetric"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "w4iU_mIhoY93"
+   },
+   "outputs": [],
+   "source": [
+    "experiment_name = \"rapid-eval-langchain-eval\"  # @param {type:\"string\"}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "U3r3NiKNqn3u"
+   },
+   "source": [
+    "We are now ready to run the evaluation. We will use different metrics, combining the custom metric we defined above with some pre-built metrics.\n",
+    "\n",
+    "Our DataFrame stores the model input in the `user` column, so we use `metric_column_mapping` to map it to the `prompt` field that the pre-built metrics expect.\n",
+    "\n",
+    "Results of the evaluation will be automatically logged into the `experiment_name` we defined.\n",
+    "\n",
+    "You can click `View Experiment` to see the experiment in the Google Cloud console."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "zeVra-g1rAFV"
+   },
+   "outputs": [],
+   "source": [
+    "metrics = [\"fluency\", \"coherence\", \"safety\", custom_faithfulness_metric]\n",
+    "\n",
+    "eval_task = EvalTask(\n",
+    "    dataset=scored_data,\n",
+    "    metrics=metrics,\n",
+    "    experiment=experiment_name,\n",
+    "    metric_column_mapping={\"prompt\": \"user\"},\n",
+    ")\n",
+    "eval_result = eval_task.evaluate()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "ka3tZCL_uurD"
+   },
+   "source": [
+    "Once an eval result is produced, we can display its summary metrics:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "KheOvIvtiRlz"
+   },
+   "outputs": [],
+   "source": [
+    "eval_result.summary_metrics"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "JcALGGlwu0p_"
+   },
+   "source": [
+    "We can also display a pandas DataFrame with a detailed, per-instance breakdown of how our eval dataset performed on each metric."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "9zJ686YYiWJC"
+   },
+   "outputs": [],
+   "source": [
+    "eval_result.metrics_table"
+   ]
+  },
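+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "For a quicker overview, we can narrow the table down to the score-bearing columns (a sketch; the column-name filter is an assumption about the SDK's naming convention, which may vary across versions):\n",
+    "\n",
+    "```python\n",
+    "metrics_df = eval_result.metrics_table\n",
+    "score_cols = [c for c in metrics_df.columns if c.endswith(\"/score\") or \"faithfulness\" in c]\n",
+    "metrics_df[score_cols].describe()\n",
+    "```"
+   ]
+  },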
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "9zJ686YYiWJC" + }, + "outputs": [], + "source": [ + "eval_result.metrics_table" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "L1NsUKA3vEu8" + }, + "source": [ + "## Iterating over the prompt\n", + "\n", + "Let's perform some simple changes to our chain to see how our evaluation results change." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "O_8PnvLSv7Nu" + }, + "outputs": [], + "source": [ + "template = ChatPromptTemplate.from_messages(\n", + " [\n", + " (\n", + " \"system\",\n", + " \"\"\"You are a conversational bot that produce nice recipes for users based on a question.\n", + "Before suggesting a recipe, you should ask for the dietary requirements..\"\"\",\n", + " ),\n", + " MessagesPlaceholder(variable_name=\"messages\"),\n", + " ]\n", + ")\n", + "\n", + "new_chain = template | llm" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "JF_Po81twPbr" + }, + "outputs": [], + "source": [ + "scored_data = batch_generate_messages(messages=df_processed, callable=new_chain)\n", + "scored_data.rename(columns={\"text\": \"response\"}, inplace=True)\n", + "scored_data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "snIQ0itfwUZa" + }, + "outputs": [], + "source": [ + "metrics = [\"fluency\", \"coherence\", \"safety\", custom_faithfulness_metric]\n", + "eval_task = EvalTask(\n", + " dataset=scored_data,\n", + " metrics=metrics,\n", + " experiment=experiment_name,\n", + " metric_column_mapping={\"prompt\": \"user\"},\n", + ")\n", + "eval_result = eval_task.evaluate()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "CT6Ma5FLwfbF" + }, + "outputs": [], + "source": [ + "eval_result.summary_metrics" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "b42l5juJsyyY" + }, + "source": [ + "#### Let's compare both eval results\n", + "\n", + "We can do that by using the method `display_runs` for a given `eval task` object to see which prompt template performed best on our dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "HP0Zcm1yvh95" + }, + "outputs": [], + "source": [ + "df = vertexai.preview.get_experiment_df(experiment_name).T\n", + "df" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TpV-iwP9qw9c" + }, + "source": [ + "## Cleaning up\n", + "\n", + "To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud\n", + "project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.\n", + "\n", + "Otherwise, you can delete the individual resources you created in this tutorial." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "sx_vKniMq9ZX" + }, + "outputs": [], + "source": [ + "import os\n", + "\n", + "# Delete Experiments\n", + "delete_experiments = True\n", + "if delete_experiments or os.getenv(\"IS_TESTING\"):\n", + " experiments_list = aiplatform.Experiment.list()\n", + " for experiment in experiments_list:\n", + " experiment.delete()" + ] + } + ], + "metadata": { + "colab": { + "toc_visible": true, + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.9" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file