diff --git a/docs/evaluation/tutorials/backtesting.mdx b/docs/evaluation/tutorials/backtesting.mdx
index 674afe1b..f4fc090a 100644
--- a/docs/evaluation/tutorials/backtesting.mdx
+++ b/docs/evaluation/tutorials/backtesting.mdx
@@ -4,21 +4,24 @@ sidebar_position: 5
 
 # Backtesting
 
-Deploying your app into production is just one step in a longer journey continuous improvement. You'll likely want to develop other candidate systems that improve on your production model using improved prompts, llms, indexing strategies, and other techniques. While you may have a set of offline datasets already created by this point, it's often useful to compare system performance on more recent production data.
+Deploying your app into production is just one step in a longer journey of continuous improvement.
+You'll likely want to develop other candidate systems that improve on your production model using
+improved prompts, LLMs, tools, and other techniques. While you may have a set of offline datasets
+you can compare your new system against, it's often useful to compare system performance on more recent production data.
+This gives you a good sense of how the new system will perform in the real world.
 
-This notebook shows how to do this in LangSmith.
-
-The basic steps are:
+The basic steps for doing such backtesting are:
 
 1. Sample runs to test against from your production tracing project.
-2. Convert runs to dataset + initial experiment.
-3. Run new system against the dataset to compare.
+2. Convert these runs to a dataset + initial experiment.
+3. Run your new system against the inputs of those runs and compare the results.
 
-You will then have a new dataset of representative inputs you can you can version and backtest your models against.
+You will then have a new dataset of representative inputs you can version and backtest your models against.
 
-![](./static/dataset_page.png)
-
-**Note:** In most cases, you won't have "ground truth" answers in this case, but you can manually compare and label or use reference-free evaluators to score the outputs.(If your application DOES permit capturing ground-truth labels, then we obviously recommend you use those.
+:::note Ground Truth Data
+In most cases you won't have "ground truth" answers, but you can manually compare and label the outputs, or use reference-free evaluators to score them.
+If your application does permit capturing ground-truth labels, we recommend you use those.
+:::
 
 ## Prerequisites
 
@@ -33,65 +36,85 @@ Install + set environment variables. This requires `langsmith>=0.1.29` to use th
 ```python
 import os
 
 # Set the project name to whichever project you'd like to be testing against
-project_name = "Tweet Critic"
+project_name = "Tweet Writing Task"
+os.environ["LANGCHAIN_PROJECT"] = project_name
 os.environ["LANGCHAIN_API_KEY"] = "YOUR API KEY"
 os.environ["LANGCHAIN_TRACING_V2"] = "true"
-os.environ["ANTHROPIC_API_KEY"] = "YOUR ANTHROPIC API KEY"
-os.environ["LANGCHAIN_PROJECT"] = project_name
+os.environ["OPENAI_API_KEY"] = "YOUR OPENAI API KEY"
+# Optional: you can use DuckDuckGo instead, which requires no API key
+os.environ["TAVILY_API_KEY"] = "YOUR TAVILY API KEY"
 ```
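+
+If you prefer not to hard-code keys, a prompt-based setup also works. This is an optional sketch that only uses the standard library:
+
+```python
+import getpass
+
+# Prompt only for the variables that are not already set in the environment
+for var in ["LANGCHAIN_API_KEY", "OPENAI_API_KEY", "TAVILY_API_KEY"]:
+    if not os.environ.get(var):
+        os.environ[var] = getpass.getpass(f"{var}: ")
+```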
 
-#### (Preliminary) Production Deployment
+## Create some production runs
 
-You likely have a project already and can skip this step.
+If you already have a tracing project with production runs in it, you can skip this step.
+For those who don't, we provide some starter code below so you can follow along with the rest of the tutorial.
 
-We'll simulate one here so no one reading this notebook gets left out.
-Our example app is a "tweet critic" that revises tweets we put out.
+The example app we will build is a tweet writer:
 
 ```python
-from langchain import hub
-from langchain_anthropic import ChatAnthropic
-from langchain_core.messages import HumanMessage
-from langchain_core.output_parsers import StrOutputParser
+from langchain_openai import ChatOpenAI
+from langchain_core.messages import HumanMessage, convert_to_openai_messages
+from langgraph.prebuilt import create_react_agent
+from langchain_community.tools import DuckDuckGoSearchRun, TavilySearchResults
+from langchain_core.runnables import RunnableLambda
+from langchain_core.rate_limiters import InMemoryRateLimiter
+
+rate_limiter = InMemoryRateLimiter(
+    requests_per_second=0.08  # If you have a higher-tier API plan you can increase this
+)
+
+# We will use GPT-3.5 Turbo as the baseline and compare against GPT-4o
+gpt_3_5_turbo = ChatOpenAI(model="gpt-3.5-turbo")
+
+# The modifier is passed as a system message to the agent
+modifier = (
+    "You are a tweet writing assistant. Use at least 3 emojis in each tweet and remember that tweets are constrained to 280 characters. "
+    "Please use the search tool to gather recent information on the tweet topic. Respond only with the tweet content. "
+    "Make sure the tweet uses information you retrieved during your search, and also mention the source of the information. Make your tweet as engaging as possible."
+)
 
-prompt = hub.pull("wfh/tweet-critic:7e4f539e")
-llm = ChatAnthropic(model="claude-3-haiku-20240307")
-system = prompt | llm | StrOutputParser()
+# Define the tools our agent can use
+tools = [TavilySearchResults(max_results=3, rate_limiter=rate_limiter)]
+# Use DuckDuckGo if you don't have a Tavily API key
+# tools = [DuckDuckGoSearchRun(rate_limiter=rate_limiter)]
+agent = create_react_agent(gpt_3_5_turbo, tools=tools, state_modifier=modifier)
 
+def target(inputs):
+    raw = agent.invoke(inputs)
+    # Convert the LangChain message objects to OpenAI-format dicts for easier downstream use
+    return {"messages": convert_to_openai_messages(raw["messages"])}
+
+agent_runnable = RunnableLambda(target)
 
 inputs = [
-    """RAG From Scratch: Our RAG From Scratch video series covers some important RAG concepts in short, focused videos with code. This is the 10th video and it discusses query routing. Problem: We sometimes have multiple datastores (e.g., different vector DBs, SQL DBs, etc) and prompts to choose from based on a user query. Idea: Logical routing can use an LLM to decide which datastore is more appropriate. Semantic routing embeds the query and prompts, then chooses the best prompt based on similarity. Video: https://youtu.be/pfpIndq7Fi8 Code: https://github.com/langchain-ai/rag-from-scratch/blob/main/rag_from_scratch_10_and_11.ipynb""",
-    """@Voyage_AI_ Embedding Integration Package Use the same custom embeddings that power Chat LangChain via the new langchain-voyageai package! Voyage AI builds custom embedding models that can improve retrieval quality. ChatLangChain: https://chat.langchain.com Python Docs: https://python.langchain.com/docs/integrations/providers/voyageai""",
-    """Implementing RAG: How to Write a Graph Retrieval Query in LangChain Our friends at @neo4j have a nice guide on combining LLMs and graph databases. Blog:""",
-    """Text-to-PowerPoint with LangGraph.js You can now generate PowerPoint presentations from text! @TheGreatBonnie wrote a guide showing how to use LangGraph.js, @tavilyai, and @CopilotKit to build a Next.js app for this. Tutorial: https://dev.to/copilotkit/how-to-build-an-ai-powered-powerpoint-generator-langchain-copilotkit-openai-nextjs-4c76 Repo: https://github.com/TheGreatBonnie/aipoweredpowerpointapp""",
-    """Build an Answer Engine Using Groq, Mixtral, Langchain, Brave & OpenAI in 10 Min Our friends at @Dev__Digest have a tutorial on building an answer engine over the internet. Code: https://github.com/developersdigest/llm-answer-engine YouTube: https://youtube.com/watch?v=43ZCeBTcsS8&t=96s""",
-    """Building a RAG Pipeline with LangChain and Amazon Bedrock Amazon Bedrock has great models for building LLM apps. This guide covers how to get started with them to build a RAG pipeline. https://gettingstarted.ai/langchain-bedrock/""",
-    """SF Meetup on March 27! Join our meetup to hear from LangChain and Pulumi experts and learn about building AI-enabled capabilities. Sign up: https://meetup.com/san-francisco-pulumi-user-group/events/299491923/?utm_campaign=FY2024Q3_Meetup_PUG%20SF&utm_content=286236214&utm_medium=social&utm_source=twitter&hss_channel=tw-837770064870817792""",
-    """Chat model response metadata @LangChainAI chat model invocations now include metadata like logprobs directly in the output. Upgrade your version of `langchain-core` to try it. PY: https://python.langchain.com/docs/modules/model_io/chat/logprobs JS: https://js.langchain.com/docs/integrations/chat/openai#generation-metadata""",
-    """Benchmarking Query Analysis in High Cardinality Situations Handling high-cardinality categorical values can be challenging. This blog explores 6 different approaches you can take in these situations. Blog: https://blog.langchain.dev/high-cardinality""",
-    """Building Google's Dramatron with LangGraph.js & Claude 3 We just released a long YouTube video (1.5 hours!) on building Dramatron using LangGraphJS and @AnthropicAI's Claude 3 "Haiku" model. It's a perfect fit for LangGraph.js and Haiku's speed. Check out the tutorial: https://youtube.com/watch?v=alHnQjyn7hg""",
-    """Document Loading Webinar with @AirbyteHQ Join a webinar on document loading with PyAirbyte and LangChain on 3/14 at 10am PDT. Features our founding engineer @eyfriis and the @aaronsteers and Bindi Pankhudi team. Register: https://airbyte.com/session/airbyte-monthly-ai-demo""",
+    "LangChain recent news",
+    "The 2024 NFL Season",
+    "Agents built using LangGraph",
+    "The 2024 United States presidential election",
+    "Research related to mechanistic interpretability",
+    "The latest frontier LLM models",
+    "Host cities for the 2026 World Cup",
+    "The stock market over the last week",
+    "Willow, Google's Quantum Computing Chip",
+    "The Billboard 100 for the first week of December 2024"
 ]
 
-_ = system.batch(
+agent_runnable.batch(
     [{"messages": [HumanMessage(content=content)]} for content in inputs],
-    {"max_concurrency": 3},
 )
 ```
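+
+If you want to inspect a single trace before (or after) generating the whole batch, you can also invoke the target on one input. This is an optional sketch that reuses the `target` function defined above:
+
+```python
+# Spot-check a single input; the resulting trace will appear in the "Tweet Writing Task" project
+single_result = target({"messages": [HumanMessage(content="LangChain recent news")]})
+print(single_result["messages"][-1]["content"])  # the generated tweet
+```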
 
-## Convert Prod Runs to Experiment
+## Convert Production Traces to Experiment
 
 The first step is to generate a dataset based on the production _inputs_.
-Then copy over all the traces to serve as a baseline run.
+Then copy over all the traces to serve as a baseline experiment.
 
-`convert_runs_to_test` is a function which takes some runs and does the following:
+### Select runs to backtest on
 
-1. The inputs, and optionally the outputs, are saved to a dataset as Examples.
-2. The inputs and outputs are stored as an experiment, as if you had run the `evaluate`
-   function and received those outputs.
+You can select the runs to backtest on using the `filter` argument of `list_runs`.
+The `filter` argument uses the LangSmith [trace query syntax](/reference/data_formats/trace_query_syntax) to select runs.
 
 ```python
 from datetime import datetime, timedelta, timezone
-
+from uuid import uuid4
 from langsmith import Client
 from langsmith.beta import convert_runs_to_test
@@ -111,9 +134,21 @@ prod_runs = list(
         filter=run_filter,
     )
 )
+baseline_experiment_name = f"prod-baseline-gpt-3.5-turbo-{str(uuid4())[:4]}"
+```
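+
+Before converting anything, it can be worth a quick sanity check that the filter picked up the runs you expect. A minimal, optional sketch:
+
+```python
+# Optional: confirm the sample looks right before creating the dataset
+print(f"Sampled {len(prod_runs)} production runs")
+print(prod_runs[0].inputs)  # the captured inputs of the first sampled run
+```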
+
+### Convert runs to experiment
+
+`convert_runs_to_test` is a function which takes some runs and does the following:
+
+1. The inputs, and optionally the outputs, are saved to a dataset as Examples.
+2. The inputs and outputs are stored as an experiment, as if you had run the `evaluate`
+   function and received those outputs.
+
+```python
 # Name of the dataset we want to create
 dataset_name = f'{project_name}-backtesting {start_time.strftime("%Y-%m-%d")}-{end_time.strftime("%Y-%m-%d")}'
+
 # This converts the runs to a dataset + experiment
 # It does not actually invoke your model
 convert_runs_to_test(
@@ -125,48 +160,99 @@ convert_runs_to_test(
     # Whether to include the full traces in the resulting experiment
     # (default is to just include the root run)
     load_child_runs=True,
+    # Name of the experiment so we can apply evaluators to it afterwards
+    test_project_name=baseline_experiment_name,
 )
 ```
 
-## Benchmark new system
+Once this step is complete, you should see a new dataset in your LangSmith workspace
+called "Tweet Writing Task-backtesting" followed by the start and end dates, with a single experiment like so:
+
+![](./static/baseline_experiment.png)
+
+## Benchmark against new system
+
+Now we can start the process of benchmarking our production runs against a new system.
+
+### Define evaluators
+
+First, let's define the evaluators we will use to compare the two systems.
+
+```python
+import emoji
+from pydantic import BaseModel, Field
+
+class Grounded(BaseModel):
+    """Whether the tweet was grounded in the retrieved context"""
+    grounded: bool = Field(..., description="Is the majority of the tweet content supported by the retrieved context?")
+
+grounded_model = ChatOpenAI(model="gpt-4o").with_structured_output(Grounded)
+
+def length_evaluator(outputs):
+    return {"key": "satisfies_tweet_length", "score": int(len(outputs["messages"][-1]["content"]) <= 280)}
+
+def emoji_evaluator(outputs):
+    return {"key": "satisfies_emoji_requirement", "score": int(len(emoji.emoji_list(outputs["messages"][-1]["content"])) >= 3)}
+
+def grounded_evaluator(outputs):
+    tweet = outputs["messages"][-1]["content"]
+    context = ""
+    for message in outputs["messages"]:
+        if message["role"] == "tool" and ("status" in message and message["status"] != "error"):
+            # Tool message outputs are the results returned from the Tavily/DuckDuckGo tool
+            context += message["content"]
+    # Pass both the tweet and the retrieved context to the grader
+    grade = grounded_model.invoke(f"Tweet: {tweet}\n\nRetrieved context: {context}")
+    return {"key": "grounded_in_context", "score": int(grade.grounded)}
+```
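+
+Before scoring real experiments, you can sanity-check these evaluators on a hand-written output. The snippet below is a minimal sketch; the sample messages are made up purely for illustration:
+
+```python
+# A made-up output in the same OpenAI message format the agent returns
+sample_output = {
+    "messages": [
+        {"role": "user", "content": "LangChain recent news"},
+        {"role": "tool", "content": "LangChain announced a new release this week.", "status": "success"},
+        {"role": "assistant", "content": "🚀 LangChain shipped a new release this week! 🔥 Check it out 🎉 (source: langchain.dev)"},
+    ]
+}
+
+print(length_evaluator(sample_output))  # score should be 1 (well under 280 characters)
+print(emoji_evaluator(sample_output))   # score should be 1 (three emojis)
+# grounded_evaluator also calls the gpt-4o grader, so trying it costs an API call
+```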
 
-Now we have the dataset and prod runs saved as an experiment.
+### Evaluate baseline
 
-Let's run inference on our new system to compare.
+Now, let's run our evaluators against the baseline experiment.
 
 ```python
 from langsmith import evaluate
 
-def predict(example_input: dict):
-    # The dataset includes serialized messages that we
-    # must convert to a format accepted by our system.
-    messages = {
-        "messages": [
-            (message["type"], message["content"])
-            for message in example_input["messages"]
-        ]
-    }
-    return system.invoke(messages)
-
-
-# Use an updated version of the prompt
-prompt = hub.pull("wfh/tweet-critic:34c57e4f")
-llm = ChatAnthropic(model="claude-3-haiku-20240307")
-system = prompt | llm | StrOutputParser()
-
-test_results = evaluate(
-    predict, data=dataset_name, experiment_prefix="HaikuBenchmark", max_concurrency=3
+evaluate(
+    baseline_experiment_name,
+    evaluators=[length_evaluator, emoji_evaluator, grounded_evaluator],
+)
+```
+
+### Define and evaluate new system
+
+Now, let's define and evaluate our new system. In this example, our new system
+will be the same as the old one, but it will use GPT-4o instead of GPT-3.5 Turbo.
+
+```python
+gpt_4o = ChatOpenAI(model="gpt-4o")
+new_agent = create_react_agent(gpt_4o, tools=tools, state_modifier=modifier)
+
+def new_target(inputs):
+    # Respect the shared rate limit before each invocation
+    rate_limiter.acquire()
+    raw = new_agent.invoke(inputs)
+    return {"messages": convert_to_openai_messages(raw["messages"])}
+
+evaluate(
+    new_target,
+    data=dataset_name,
+    evaluators=[length_evaluator, emoji_evaluator, grounded_evaluator],
+    experiment_prefix="new-baseline-gpt-4o",
 )
 ```
 
-## Review runs
+## Compare results
+
+Your dataset should now have two experiments:
+
+![](./static/dataset_page.png)
 
-You can now compare the outputs in the UI.
+We can see that the GPT-4o model does a better job of writing tweets that are
+under 280 characters. We can open the comparison view to see the exact runs on
+which GPT-4o outperforms GPT-3.5 Turbo:
 
 ![](./static/comparison_view.png)
 
-## Conclusion
+## Next steps
 
-Congrats! You've sampled production runs and started benchmarking other systems against them.
-In this exercise, we chose not to apply any evaluators to simplify things (since we lack ground-truth answers for this task).
-You can manually review the results in LangSmith and/or apply a reference-free evaluator to the results to generate metrics instead.
+This was a simple example to show how you can backtest your production system against a new one.
+There are many different ways to do this backtesting, including using targeted filters
+to select the production runs you want to improve on, or using human preference
+data instead of evaluators defined in code.
\ No newline at end of file
diff --git a/docs/evaluation/tutorials/static/baseline_experiment.png b/docs/evaluation/tutorials/static/baseline_experiment.png
new file mode 100644
index 00000000..c1b280f4
Binary files /dev/null and b/docs/evaluation/tutorials/static/baseline_experiment.png differ
diff --git a/docs/evaluation/tutorials/static/comparison_view.png b/docs/evaluation/tutorials/static/comparison_view.png
index 1f5423fd..80940a63 100644
Binary files a/docs/evaluation/tutorials/static/comparison_view.png and b/docs/evaluation/tutorials/static/comparison_view.png differ
diff --git a/docs/evaluation/tutorials/static/dataset_page.png b/docs/evaluation/tutorials/static/dataset_page.png
index 625ef570..4817a063 100644
Binary files a/docs/evaluation/tutorials/static/dataset_page.png and b/docs/evaluation/tutorials/static/dataset_page.png differ