From 73089115e812fe881bb471cf7df7970bdf6514b2 Mon Sep 17 00:00:00 2001 From: Bagatur Date: Fri, 20 Dec 2024 11:36:40 -0800 Subject: [PATCH 1/3] bulletted --- docs/evaluation/concepts/index.mdx | 577 +++++++++++++++++------------ 1 file changed, 344 insertions(+), 233 deletions(-) diff --git a/docs/evaluation/concepts/index.mdx b/docs/evaluation/concepts/index.mdx index 36000d7a..8746418e 100644 --- a/docs/evaluation/concepts/index.mdx +++ b/docs/evaluation/concepts/index.mdx @@ -1,16 +1,21 @@ -# Evaluation concepts +# Evaluation Concepts The pace of AI application development is often limited by high-quality evaluations. Evaluations are methods designed to assess the performance and capabilities of AI applications. -Good evaluations make it easy to iteratively improve prompts, select models, test architectures, and ensure that deployed applications continue to perform as expected. -LangSmith makes building high-quality evaluations easy. +Good evaluations enable you to: +- Iteratively improve prompts +- Select optimal models +- Test different architectures +- Ensure deployed applications maintain expected performance This guide explains the key concepts behind the LangSmith evaluation framework and evaluations for AI applications more broadly. -The core components of LangSmith evaluations are: -- [**Datasets**:](/evaluation/concepts#datasets) Collections of test inputs and, optionally, reference outputs for your applications. -- [**Evaluators**](/evaluation/concepts#evaluators): Functions for scoring the outputs generated by applications given dataset inputs. +## Core Components + +LangSmith evaluations consist of two essential parts: +- [**Datasets**](#datasets): Collections of test inputs and optional reference outputs +- [**Evaluators**](#evaluators): Functions that score outputs based on dataset inputs ## Datasets @@ -20,346 +25,452 @@ A dataset contains a collection of examples used for evaluating an application. ### Examples -Each example consists of: +Each example in a dataset consists of three components: -- **Inputs**: a dictionary of input variables to pass to your application. -- **Reference outputs** (optional): a dictionary of reference outputs. These do not get passed to your application, they are only used in evaluators. -- **Metadata** (optional): a dictionary of additional information that can be used to create filtered views of a dataset. +1. **Inputs**: A dictionary of input variables passed to your application +2. **Reference outputs** (optional): A dictionary of expected outputs used only by evaluators +3. **Metadata** (optional): Additional information for creating filtered dataset views ![Example](./static/example_concept.png) -### Dataset curation - -There are various ways to build datasets for evaluation, including: +### Dataset Curation -#### Manually curated examples +There are several effective approaches to building evaluation datasets: -This is how we typically recommend people get started creating datasets. -From building your application, you probably have some idea of what types of inputs you expect your application to be able to handle, -and what "good" responses may be. -You probably want to cover a few different common edge cases or situations you can imagine. -Even 10-20 high-quality, manually-curated examples can go a long way. +#### 1. Manually Curated Examples +This is our recommended starting point for dataset creation. 
Benefits include: +- Leverages your understanding of expected application inputs +- Allows definition of "good" response criteria +- Enables coverage of common edge cases +- Even 10-20 high-quality examples can provide valuable insights -#### Historical traces +#### 2. Historical Traces +Once your application is in production, real user interactions become valuable examples. Selection methods include: -Once you have an application in production, you start getting valuable information: how are users actually using it? -These real-world runs make for great examples because they're, well, the most realistic! +- **User Feedback**: + - Collect end user responses + - Focus on examples with negative feedback + - Identify areas where application underperformed -If you're getting a lot of traffic, how can you determine which runs are valuable to add to a dataset? -There are a few techniques you can use: +- **Heuristics**: + - Identify "interesting" datapoints + - Flag runs with longer completion times + - Track unusual patterns -- **User feedback**: If possible - try to collect end user feedback. You can then see which datapoints got negative feedback. - That is super valuable! These are spots where your application did not perform well. - You should add these to your dataset to test against in the future. -- **Heuristics**: You can also use other heuristics to identify "interesting" datapoints. For example, runs that took a long time to complete could be interesting to look at and add to a dataset. -- **LLM feedback**: You can use another LLM to detect noteworthy runs. For example, you could use an LLM to label chatbot conversations where the user had to rephrase their question or correct the model in some way, indicating the chatbot did not initially respond correctly. +- **LLM Feedback**: + - Use LLMs to detect noteworthy runs + - Flag conversations requiring user rephrasing + - Identify incorrect initial responses -#### Synthetic data - -Once you have a few examples, you can try to artificially generate some more. -It's generally advised to have a few good hand-crafted examples before this, as this synthetic data will often resemble them in some way. -This can be a useful way to get a lot of datapoints, quickly. +#### 3. Synthetic Data +After establishing baseline examples, synthetic data can expand your dataset: +- Build upon existing hand-crafted examples +- Useful for rapid dataset growth +- Maintain quality through careful generation ### Splits -When setting up your evaluation, you may want to partition your dataset into different splits. For example, you might use a smaller split for many rapid and cheap iterations and a larger split for your final evaluation. In addition, splits can be important for the interpretability of your experiments. For example, if you have a RAG application, you may want your dataset splits to focus on different types of questions (e.g., factual, opinion, etc) and to evaluate your application on each split separately. +Dataset splits serve multiple purposes in evaluation: +- Enable rapid iteration on smaller subsets +- Support final evaluation on larger datasets +- Improve experiment interpretability + +For example, in RAG applications, splits can focus on different question types: +- Factual questions +- Opinion-based questions +- Complex reasoning tasks Learn how to [create and manage dataset splits](/evaluation/how_to_guides/manage_datasets_in_application#create-and-manage-dataset-splits). 
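To make the curation workflow concrete, here is a minimal sketch of uploading a small, hand-curated dataset with the LangSmith Python SDK, assuming `langsmith.Client` is available and an API key is configured in your environment; the dataset name, questions, and metadata keys are illustrative only:

```python
from langsmith import Client

client = Client()  # assumes a LangSmith API key is configured via environment variables

# Create a small, manually curated dataset.
dataset = client.create_dataset(
    dataset_name="qa-smoke-tests",
    description="Hand-picked questions covering common cases and edge cases.",
)

examples = [
    {
        "inputs": {"question": "What is LangSmith used for?"},
        "outputs": {"answer": "Tracing, evaluating, and monitoring LLM applications."},
        "metadata": {"source": "manual", "category": "factual"},
    },
    {
        "inputs": {"question": "Summarize the refund policy in one sentence."},
        "outputs": {"answer": "Refunds are available within 30 days of purchase."},
        "metadata": {"source": "manual", "category": "opinion"},
    },
]

client.create_examples(
    inputs=[e["inputs"] for e in examples],
    outputs=[e["outputs"] for e in examples],
    metadata=[e["metadata"] for e in examples],
    dataset_id=dataset.id,
)
```

The metadata values can later back filtered views of the dataset or serve as the basis for the splits described above.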
### Versions -Datasets are [versioned](/evaluation/how_to_guides/version_datasets) such that every time you add, update, or delete examples in your dataset, a new version of the dataset is created. -This makes it easy to inspect and revert changes to your dataset in case you make a mistake. -You can also [tag versions](/evaluation/how_to_guides/version_datasets#tag-a-version) of your dataset to give them a more human-readable name. -This can be useful for marking important milestones in your dataset's history. +Dataset versioning is a key feature that: +- Creates new versions upon any modification (add/update/delete) +- Enables inspection and reverting of changes +- Supports [version tagging](/evaluation/how_to_guides/version_datasets#tag-a-version) for human-readable milestones -You can run evaluations on specific versions of a dataset. This can be useful when running evaluations in CI, to make sure that a dataset update doesn't accidentally break your CI pipelines. +Version-specific evaluations are particularly valuable for CI pipelines, ensuring dataset updates don't disrupt existing processes. ## Evaluators -Evaluators are functions that score how well your application performs on a particular example. - -#### Evaluator inputs - -Evaluators receive these inputs: - -- [Example](/evaluation/concepts#examples): The example(s) from your [Dataset](/evaluation/concepts#datasets). Contains inputs, (reference) outputs, and metadata. -- [Run](/observability/concepts#runs): The actual outputs and intermediate steps (child runs) from passing the example inputs to the application. - -#### Evaluator outputs - -An evaluator returns one or more metrics. These should be returned as a dictionary or list of dictionaries of the form: - -- `key`: The name of the metric. -- `score` | `value`: The value of the metric. Use `score` if it's a numerical metric and `value` if it's categorical. -- `comment` (optional): The reasoning or additional string information justifying the score. - -#### Defining evaluators - -There are a number of ways to define and run evaluators: - -- **Custom code**: Define [custom evaluators](/evaluation/how_to_guides/custom_evaluator) as Python or TypeScript functions and run them client-side using the SDKs or server-side via the UI. -- **Built-in evaluators**: LangSmith has a number of built-in evaluators that you can configure and run via the UI. - -You can run evaluators using the LangSmith SDK ([Python](https://docs.smith.langchain.com/reference/python) and TypeScript), via the [Prompt Playground](../../prompt_engineering/concepts#prompt-playground), or by configuring [Rules](../../observability/how_to_guides/monitoring/rules) to automatically run them on particular tracing projects or datasets. - -#### Evaluation techniques - -There are a few high-level approaches to LLM evaluation: - -### Human - -Human evaluation is [often a great starting point for evaluation](https://hamel.dev/blog/posts/evals/#looking-at-your-traces). LangSmith makes it easy to review your LLM application outputs as well as the traces (all intermediate steps). - -LangSmith's [annotation queues](/evaluation/concepts#annotation-queues) make it easy to get human feedback on your application's outputs. - -### Heuristic - -Heuristic evaluators are deterministic, rule-based functions. These are good for simple checks like making sure that a chatbot's response isn't empty, that a snippet of generated code can be compiled, or that a classification is exactly correct. 
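As a concrete illustration of the output format and the heuristic approach described above, here is a minimal sketch of a custom evaluator written in Python; the `run` and `example` arguments correspond to the evaluator inputs above, and the `"answer"` output key is an assumption about your application's output schema:

```python
def is_concise(run, example) -> dict:
    """Heuristic check: the response must be non-empty and at most 200 words."""
    answer = (run.outputs or {}).get("answer", "")  # assumes the app returns an "answer" field
    word_count = len(answer.split())
    passed = 0 < word_count <= 200
    return {
        "key": "is_concise",
        "score": int(passed),  # numeric metric: 1 = pass, 0 = fail
        "comment": f"Output contained {word_count} words.",
    }
```

Because it makes no model calls, an evaluator like this runs quickly and deterministically, which is what makes heuristic checks attractive for simple validations.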
- -### LLM-as-judge - -LLM-as-judge evaluators use LLMs to score the application's output. To use them, you typically encode the grading rules / criteria in the LLM prompt. They can be reference-free (e.g., check if system output contains offensive content or adheres to specific criteria). Or, they can compare task output to a reference output (e.g., check if the output is factually accurate relative to the reference). - -With LLM-as-judge evaluators, it is important to carefully review the resulting scores and tune the grader prompt if needed. Often it is helpful to write these as few-shot evaluators, where you provide examples of inputs, outputs, and expected grades as part of the grader prompt. - -Learn about [how to define an LLM-as-a-judge evaluator](/evaluation/how_to_guides/llm_as_judge). - -### Pairwise - -Pairwise evaluators allow you to compare the outputs of two versions of an application. -Think [LMSYS Chatbot Arena](https://chat.lmsys.org/) - this is the same concept, but applied to AI applications more generally, not just models! -This can use either a heuristic ("which response is longer"), an LLM (with a specific pairwise prompt), or human (asking them to manually annotate examples). - -**When should you use pairwise evaluation?** - -Pairwise evaluation is helpful when it is difficult to directly score an LLM output, but easier to compare two outputs. -This can be the case for tasks like summarization - it may be hard to give a summary an absolute score, but easy to choose which of two summaries is more informative. - -Learn [how run pairwise evaluations](/evaluation/how_to_guides/evaluate_pairwise). +Evaluators are functions that assess application performance on specific examples. + +### Evaluator Components + +#### Inputs +Evaluators receive: +- [Example](/evaluation/concepts#examples): Dataset examples containing: + - Inputs + - Reference outputs (optional) + - Metadata +- [Run](/observability/concepts#runs): Actual application performance data: + - Outputs + - Intermediate steps (child runs) + +#### Outputs +Evaluators return metrics as dictionaries or lists of dictionaries containing: +- `key`: Metric name +- `score` | `value`: Metric result (numerical or categorical) +- `comment` (optional): Reasoning or additional context + +### Evaluator Implementation + +Multiple approaches are available: +1. **Custom code**: + - Define evaluators in Python or TypeScript + - Run client-side via SDKs + - Execute server-side through UI + +2. **Built-in evaluators**: + - Use LangSmith's pre-configured evaluators + - Configure via UI + +Evaluators can be executed through: +- LangSmith SDK ([Python](https://docs.smith.langchain.com/reference/python) and TypeScript) +- [Prompt Playground](../../prompt_engineering/concepts#prompt-playground) +- [Rules](../../observability/how_to_guides/monitoring/rules) for automated evaluation + +### Evaluation Techniques + +#### 1. Human Evaluation +- Often the best starting point +- Enables detailed review of outputs and traces +- Facilitated by LangSmith's [annotation queues](/evaluation/concepts#annotation-queues) + +#### 2. Heuristic Evaluation +- Deterministic, rule-based functions +- Ideal for simple checks: + - Non-empty chatbot responses + - Code compilation verification + - Exact classification matching + +#### 3. 
LLM-as-judge +Evaluators using LLMs to score outputs: +- Encode grading criteria in prompts +- Can be reference-free or reference-based +- Require careful review and prompt tuning +- Benefit from few-shot examples in grader prompts + +Learn about [implementing LLM-as-judge evaluators](/evaluation/how_to_guides/llm_as_judge). + +#### 4. Pairwise Evaluation +Compare outputs from two application versions: +- Similar to [LMSYS Chatbot Arena](https://chat.lmsys.org/) +- Applicable to general AI applications +- Can use: + - Heuristics + - LLM-based comparison + - Human annotation + +**When to use pairwise evaluation:** +- Direct scoring is difficult +- Comparing outputs is more straightforward +- Example: Summarization quality assessment + +Learn [how to implement pairwise evaluations](/evaluation/how_to_guides/evaluate_pairwise). ## Experiment -Each time we evaluate an application on a dataset, we are conducting an experiment. -An experiment is a single execution of the example inputs in your dataset through your application. -Typically, we will run multiple experiments on a given dataset, testing different configurations of our application (e.g., different prompts or LLMs). -In LangSmith, you can easily view all the experiments associated with your dataset. -Additionally, you can [compare multiple experiments in a comparison view](/evaluation/how_to_guides/compare_experiment_results). +An experiment represents a single evaluation run of your application on a dataset: +- Tests specific application configurations +- Enables comparison of different: + - Prompts + - LLMs + - Architectures +- Supports [multi-experiment comparison](/evaluation/how_to_guides/compare_experiment_results) ![Example](./static/comparing_multiple_experiments.png) -## Annotation queues +## Annotation Queues -Human feedback is often the most valuable feedback you can gather on your application. -With [annotation queues](/evaluation/how_to_guides/annotation_queues) you can flag runs of your application for annotation. -Human annotators then have a streamlined view to review and provide feedback on the runs in a queue. -Often (some subset of) these annotated runs are then transferred to a [dataset](/evaluation/concepts#datasets) for future evaluations. -While you can always [annotate runs inline](/evaluation/how_to_guides/annotate_traces_inline), annotation queues provide another option to group runs together, specify annotation criteria, and configure permissions. +Human feedback provides crucial insights for application improvement. [Annotation queues](/evaluation/how_to_guides/annotation_queues) enable: +- Systematic flagging of runs for review +- Streamlined interface for annotators +- Efficient feedback collection + +While [inline run annotation](/evaluation/how_to_guides/annotate_traces_inline) is always available, annotation queues offer additional benefits: +- Grouped run organization +- Specific annotation criteria +- Configurable permissions +- Easy transfer to evaluation datasets Learn more about [annotation queues and human feedback](/evaluation/how_to_guides#annotation-queues-and-human-feedback). -## Offline evaluation +## Offline Evaluation -Evaluating an application on a dataset is what we call "offline" evaluation. -It is offline because we're evaluating on a pre-compiled set of data. -An online evaluation, on the other hand, is one in which we evaluate a deployed application's outputs on real traffic, in near realtime. -Offline evaluations are used for testing a version(s) of your application pre-deployment. 
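To tie the pieces above together, here is a minimal sketch of an offline evaluation that runs a toy target function over a dataset and grades it with a small LLM-as-judge evaluator. It assumes a recent LangSmith Python SDK that exposes `evaluate`, the `openai` package with an API key configured, and a dataset named `qa-smoke-tests` containing reference answers; all of those names are illustrative:

```python
from langsmith import evaluate
from openai import OpenAI

judge_client = OpenAI()  # assumes OPENAI_API_KEY is set

def target(inputs: dict) -> dict:
    # Call your real chain, agent, or model here; a canned response keeps the sketch self-contained.
    return {"answer": f"Placeholder answer to: {inputs['question']}"}

def correctness_judge(run, example) -> dict:
    """LLM-as-judge: grade the generated answer against the reference answer."""
    prompt = (
        "Grade the submitted answer against the reference answer. "
        "Reply with only PASS or FAIL.\n\n"
        f"Question: {example.inputs['question']}\n"
        f"Reference answer: {example.outputs['answer']}\n"
        f"Submitted answer: {run.outputs['answer']}"
    )
    verdict = (
        judge_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        .choices[0]
        .message.content.strip()
    )
    return {
        "key": "correctness",
        "score": int(verdict.upper().startswith("PASS")),
        "comment": verdict,
    }

results = evaluate(
    target,
    data="qa-smoke-tests",          # hypothetical dataset name
    evaluators=[correctness_judge],
    experiment_prefix="baseline",
)
```

Each call like this produces a new experiment in LangSmith, so rerunning it after a prompt or model change gives you directly comparable results.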
+Offline evaluation involves testing applications on pre-compiled datasets before deployment, contrasting with online evaluation which assesses live application performance. -You can run offline evaluations client-side using the LangSmith SDK ([Python](https://docs.smith.langchain.com/reference/python) and TypeScript). You can run them server-side via the [Prompt Playground](../../prompt_engineering/concepts#prompt-playground) or by configuring [automations](/observability/how_to_guides/monitoring/rules) to run certain evaluators on every new experiment against a specific dataset. +You can run offline evaluations: +- Client-side: Using LangSmith SDK ([Python](https://docs.smith.langchain.com/reference/python) and TypeScript) +- Server-side: Via [Prompt Playground](../../prompt_engineering/concepts#prompt-playground) +- Automated: Through [automations](/observability/how_to_guides/monitoring/rules) ![Offline](./static/offline.png) ### Benchmarking -Perhaps the most common type of offline evaluation is one in which we curate a dataset of representative inputs, define the key performance metrics, and benchmark multiple versions of our application to find the best one. -Benchmarking can be laborious because for many use cases you have to curate a dataset with gold-standard reference outputs and design good metrics for comparing experimental outputs to them. -For a RAG Q&A bot this might look like a dataset of questions and reference answers, and an LLM-as-judge evaluator that determines if the actual answer is semantically equivalent to the reference answer. -For a ReACT agent this might look like a dataset of user requests and a reference set of all the tool calls the model is supposed to make, and a heuristic evaluator that checks if all of the reference tool calls were made. +Benchmarking is a common offline evaluation approach that involves: +1. Curating representative input datasets +2. Defining key performance metrics +3. Testing multiple application versions -### Unit tests +Challenges include: +- Dataset curation effort +- Creating gold-standard references +- Designing effective metrics -Unit tests are used in software development to verify the correctness of individual system components. -[Unit tests in the context of LLMs are often rule-based assertions](https://hamel.dev/blog/posts/evals/#level-1-unit-tests) on LLM inputs or outputs (e.g., checking that LLM-generated code can be compiled, JSON can be loaded, etc.) that validate basic functionality. +**Example**: RAG Q&A bot evaluation +- Dataset: Questions with reference answers +- Evaluator: LLM-as-judge for semantic equivalence +- Metrics: Answer accuracy and relevance -Unit tests are often written with the expectation that they should always pass. -These types of tests are nice to run as part of CI. -Note that when doing so it is useful to set up a cache to minimize LLM calls (because those can quickly rack up!). +**Example**: ReACT agent evaluation +- Dataset: User requests with expected tool calls +- Evaluator: Heuristic checking of tool call sequences +- Metrics: Tool call accuracy and completeness -### Regression tests +### Unit Tests -Regression tests are used to measure performance across versions of your application over time. -They are used to, at the very least, ensure that a new app version does not regress on examples that your current version correctly handles, and ideally to measure how much better your new version is relative to the current. -Often these are triggered when you are making app updates (e.g. 
updating models or architectures) that are expected to influence the user experience. +Unit tests verify individual component functionality: +- Focus on basic assertions +- Expected to consistently pass +- Ideal for CI integration -LangSmith's comparison view has native support for regression testing, allowing you to quickly see examples that have changed relative to the baseline. -Regressions are highlighted red, improvements green. +⚠️ **Important**: Consider caching for CI to manage LLM costs -![Regression](./static/regression.png) +### Regression Tests -### Backtesting +Regression tests track performance across versions: +- Ensure new versions maintain existing functionality +- Measure improvements over previous versions +- Typically triggered by significant updates: + - Model changes + - Architecture modifications + - Major prompt revisions -Backtesting is an approach that combines dataset creation (discussed above) with evaluation. If you have a collection of production logs, you can turn them into a dataset. Then, you can re-run those production examples with newer application versions. This allows you to assess performance on past and realistic user inputs. +LangSmith's comparison view highlights: +- Regressions (red) +- Improvements (green) +- Performance changes relative to baseline -This is commonly used to evaluate new model versions. -Anthropic dropped a new model? No problem! Grab the 1000 most recent runs through your application and pass them through the new model. -Then compare those results to what actually happened in production. +![Regression](./static/regression.png) -### Pairwise evaluation +### Backtesting -For some tasks [it is easier](https://www.oreilly.com/radar/what-we-learned-from-a-year-of-building-with-llms-part-i/) for a human or LLM grader to determine if "version A is better than B" than to assign an absolute score to either A or B. -Pairwise evaluations are just this — a scoring of the outputs of two versions against each other as opposed to against some reference output or absolute criteria. -Pairwise evaluations are often useful when using LLM-as-judge evaluators on more general tasks. -For example, if you have a summarizer application, it may be easier for an LLM-as-judge to determine "Which of these two summaries is more clear and concise?" than to give an absolute score like "Give this summary a score of 1-10 in terms of clarity and concision." +Backtesting combines dataset creation with evaluation: +1. Convert production logs to evaluation datasets +2. Run updated versions on historical inputs +3. Compare against original production results -Learn [how run pairwise evaluations](/evaluation/how_to_guides/evaluate_pairwise). +**Common Use Case**: Model Version Testing +- Collect recent production runs +- Test with new model versions +- Compare performance metrics -## Online evaluation +### Pairwise Evaluation -Evaluating a deployed application's outputs in (roughly) realtime is what we call "online" evaluation. -In this case there is no dataset involved and no possibility of reference outputs — we're running evaluators on real inputs and real outputs as they're produced. -This is be useful for monitoring your application and flagging unintended behavior. -Online evaluation can also work hand-in-hand with offline evaluation: for example, an online evaluator can be used to classify input questions into a set of categories that can be later used to curate a dataset for offline evaluation. 
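Rule-based checks like the unit and regression tests described above can live directly in a CI test suite. Here is a minimal pytest-style sketch, where `my_app.generate_structured_answer` is a hypothetical stand-in for your own application entry point:

```python
import json

import pytest

from my_app import generate_structured_answer  # hypothetical application entry point

@pytest.mark.parametrize(
    "question",
    ["What is LangSmith?", "How do I create a dataset?"],
)
def test_output_is_valid_json(question):
    """Rule-based assertion: the app must always return parseable JSON with an 'answer' field."""
    raw = generate_structured_answer(question)
    payload = json.loads(raw)  # raises, and fails the test, if the output is not valid JSON
    assert payload.get("answer"), "expected a non-empty 'answer' field"
```

Caching or stubbing the underlying LLM calls keeps a suite like this cheap enough to run on every commit.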
+Pairwise evaluation is particularly valuable when: +- Absolute scoring is challenging +- Relative comparisons are more reliable +- Evaluating general tasks -Online evaluators are generally intended to be run server-side. LangSmith has built-in [LLM-as-judge evaluators](/evaluation/how_to_guides/llm_as_judge) that you can configure, or you can define custom code evaluators that are also run within LangSmith. +For example, in summarization: +- Instead of: "Rate this summary 1-10" +- Better: "Which summary is clearer and more concise?" -![Online](./static/online.png) +Learn [how to implement pairwise evaluations](/evaluation/how_to_guides/evaluate_pairwise). -## Application-specific techniques +## Online Evaluation -Below, we will discuss evaluation of a few specific, popular LLM applications. +Online evaluation assesses deployed applications in real-time: +- No pre-existing dataset +- No reference outputs available +- Focus on immediate behavior assessment -### Agents +Use cases include: +- Application monitoring +- Behavior flagging +- Performance tracking -[LLM-powered autonomous agents](https://lilianweng.github.io/posts/2023-06-23-agent/) combine three components (1) Tool calling, (2) Memory, and (3) Planning. Agents [use tool calling](https://python.langchain.com/v0.1/docs/modules/agents/agent_types/tool_calling/) with planning (e.g., often via prompting) and memory (e.g., often short-term message history) to generate responses. [Tool calling](https://python.langchain.com/v0.1/docs/modules/model_io/chat/function_calling/) allows a model to respond to a given prompt by generating two things: (1) a tool to invoke and (2) the input arguments required. +Online evaluators typically run server-side using: +- Built-in [LLM-as-judge evaluators](/evaluation/how_to_guides/llm_as_judge) +- Custom code evaluators within LangSmith -![Tool use](./static/tool_use.png) +![Online](./static/online.png) -Below is a tool-calling agent in [LangGraph](https://langchain-ai.github.io/langgraph/tutorials/introduction/). The `assistant node` is an LLM that determines whether to invoke a tool based upon the input. The `tool condition` sees if a tool was selected by the `assistant node` and, if so, routes to the `tool node`. The `tool node` executes the tool and returns the output as a tool message to the `assistant node`. This loop continues until as long as the `assistant node` selects a tool. If no tool is selected, then the agent directly returns the LLM response. +## Application-specific Techniques -![Agent](./static/langgraph_agent.png) +### Agents -This sets up three general types of agent evaluations that users are often interested in: +[LLM-powered autonomous agents](https://lilianweng.github.io/posts/2023-06-23-agent/) combine three core components: +1. Tool calling +2. Memory +3. Planning -- `Final Response`: Evaluate the agent's final response. -- `Single step`: Evaluate any agent step in isolation (e.g., whether it selects the appropriate tool). -- `Trajectory`: Evaluate whether the agent took the expected path (e.g., of tool calls) to arrive at the final answer. +Agents [leverage tool calling](https://python.langchain.com/v0.1/docs/modules/agents/agent_types/tool_calling/) with planning (often via prompting) and memory (typically short-term message history) for response generation. 
[Tool calling](https://python.langchain.com/v0.1/docs/modules/model_io/chat/function_calling/) enables models to: +- Select appropriate tools +- Generate required input arguments -![Agent-eval](./static/agent_eval.png) +![Tool use](./static/tool_use.png) -Below we will cover what these are, the components (inputs, outputs, evaluators) needed for each one, and when you should consider this. -Note that you likely will want to do multiple (if not all!) of these types of evaluations - they are not mutually exclusive! +#### Agent Architecture Example -#### Evaluating an agent's final response +Below shows a tool-calling agent in [LangGraph](https://langchain-ai.github.io/langgraph/tutorials/introduction/): -One way to evaluate an agent is to assess its overall performance on a task. This basically involves treating the agent as a black box and simply evaluating whether or not it gets the job done. +1. `assistant node`: LLM determines tool usage based on input +2. `tool condition`: Routes to tool node if tool selected +3. `tool node`: Executes tool and returns output +4. Loop continues until no tool is selected -The inputs should be the user input and (optionally) a list of tools. In some cases, tool are hardcoded as part of the agent and they don't need to be passed in. In other cases, the agent is more generic, meaning it does not have a fixed set of tools and tools need to be passed in at run time. +![Agent](./static/langgraph_agent.png) -The output should be the agent's final response. +#### Evaluation Types -The evaluator varies depending on the task you are asking the agent to do. Many agents perform a relatively complex set of steps and the output a final text response. Similar to RAG, LLM-as-judge evaluators are often effective for evaluation in these cases because they can assess whether the agent got a job done directly from the text response. +This architecture enables three primary evaluation approaches: -However, there are several downsides to this type of evaluation. First, it usually takes a while to run. Second, you are not evaluating anything that happens inside the agent, so it can be hard to debug when failures occur. Third, it can sometimes be hard to define appropriate evaluation metrics. +1. **Final Response Evaluation** + - Assesses overall task completion + - Treats agent as black box + - Evaluates end result -#### Evaluating a single step of an agent +2. **Single Step Evaluation** + - Examines individual actions + - Focuses on tool selection + - Evaluates step accuracy -Agents generally perform multiple actions. While it is useful to evaluate them end-to-end, it can also be useful to evaluate these individual actions. This generally involves evaluating a single step of the agent - the LLM call where it decides what to do. +3. **Trajectory Evaluation** + - Analyzes complete action sequence + - Verifies expected tool call paths + - Assesses overall strategy -The inputs should be the input to a single step. Depending on what you are testing, this could just be the raw user input (e.g., a prompt and / or a set of tools) or it can also include previously completed steps. +![Agent-eval](./static/agent_eval.png) -The outputs are just the output of that step, which is usually the LLM response. The LLM response often contains tool calls, indicating what action the agent should take next. 
+#### Detailed Evaluation Approaches -The evaluator for this is usually some binary score for whether the correct tool call was selected, as well as some heuristic for whether the input to the tool was correct. The reference tool can be simply specified as a string. +##### 1. Final Response Evaluation -There are several benefits to this type of evaluation. It allows you to evaluate individual actions, which lets you hone in where your application may be failing. They are also relatively fast to run (because they only involve a single LLM call) and evaluation often uses simple heuristic evaluation of the selected tool relative to the reference tool. One downside is that they don't capture the full agent - only one particular step. Another downside is that dataset creation can be challenging, particular if you want to include past history in the agent input. It is pretty easy to generate a dataset for steps early on in an agent's trajectory (e.g., this may only include the input prompt), but it can be difficult to generate a dataset for steps later on in the trajectory (e.g., including numerous prior agent actions and responses). +**Components needed:** +- Inputs: User input and optional tool list +- Output: Final agent response +- Evaluator: Task-dependent, often LLM-as-judge -#### Evaluating an agent's trajectory +**Considerations:** +- (+) Evaluates complete task success +- (-) Longer execution time +- (-) Limited debugging insight +- (-) Complex metric definition -Evaluating an agent's trajectory involves evaluating all the steps an agent took. +##### 2. Single Step Evaluation -The inputs are again the inputs to the overall agent (the user input, and optionally a list of tools). +**Components needed:** +- Inputs: Single step context +- Outputs: LLM response/tool calls +- Evaluator: Binary scoring for tool selection -The outputs are a list of tool calls, which can be formulated as an "exact" trajectory (e.g., an expected sequence of tool calls) or simply a set of tool calls that are expected (in any order). +**Benefits:** +- Precise debugging +- Fast execution +- Simple evaluation metrics -The evaluator here is some function over the steps taken. Assessing the "exact" trajectory can use a single binary score that confirms an exact match for each tool name in the sequence. This is simple, but has some flaws. Sometimes there can be multiple correct paths. This evaluation also does not capture the difference between a trajectory being off by a single step versus being completely wrong. +**Challenges:** +- Dataset creation complexity +- Limited context consideration +- Focuses on individual steps only -To address these flaws, evaluation metrics can focused on the number of "incorrect" steps taken, which better accounts for trajectories that are close versus ones that deviate significantly. Evaluation metrics can also focus on whether all of the expected tools are called in any order. +##### 3. Trajectory Evaluation -However, none of these approaches evaluate the input to the tools; they only focus on the tools selected. In order to account for this, another evaluation technique is to pass the full agent's trajectory (along with a reference trajectory) as a set of messages (e.g., all LLM responses and tool calls) an LLM-as-judge. This can evaluate the complete behavior of the agent, but it is the most challenging reference to compile (luckily, using a framework like LangGraph can help with this!). Another downside is that evaluation metrics can be somewhat tricky to come up with. 
+**Evaluation methods:** +1. Exact sequence matching +2. Tool set verification +3. Full trajectory analysis -### Retrieval augmented generation (RAG) +**Metrics options:** +- Binary exact match +- Step deviation count +- Expected tool coverage +- LLM-based trajectory assessment -Retrieval Augmented Generation (RAG) is a powerful technique that involves retrieving relevant documents based on a user's input and passing them to a language model for processing. RAG enables AI applications to generate more informed and context-aware responses by leveraging external knowledge. +### RAG (Retrieval Augmented Generation) :::info - -For a comprehensive review of RAG concepts, see our [`RAG From Scratch` series](https://github.com/langchain-ai/rag-from-scratch). - +For comprehensive RAG understanding, see our [`RAG From Scratch` series](https://github.com/langchain-ai/rag-from-scratch). ::: -#### Dataset - -When evaluating RAG applications, a key consideration is whether you have (or can easily obtain) reference answers for each input question. Reference answers serve as ground truth for assessing the correctness of the generated responses. However, even in the absence of reference answers, various evaluations can still be performed using reference-free RAG evaluation prompts (examples provided below). +#### Dataset Considerations -#### Evaluator +Key factors: +- Reference answer availability +- Ease of obtaining ground truth +- Evaluation goals -`LLM-as-judge` is a commonly used evaluator for RAG because it's an effective way to evaluate factual accuracy or consistency between texts. +#### Evaluation Types ![rag-types.png](./static/rag-types.png) -When evaluating RAG applications, you can have evaluators that require reference outputs and those that don't: - -1. **Require reference output**: Compare the RAG chain's generated answer or retrievals against a reference answer (or retrievals) to assess its correctness. -2. **Don't require reference output**: Perform self-consistency checks using prompts that don't require a reference answer (represented by orange, green, and red in the above figure). +1. **Reference-based** + - Compares against ground truth + - Assesses correctness + - Requires reference outputs -#### Applying RAG Evaluation +2. **Reference-free** + - Self-consistency checks + - No reference required + - Multiple evaluation aspects -When applying RAG evaluation, consider the following approaches: +#### RAG Evaluation Framework -1. `Offline evaluation`: Use offline evaluation for any prompts that rely on a reference answer. This is most commonly used for RAG answer correctness evaluation, where the reference is a ground truth (correct) answer. 
+| Evaluator | Purpose | Reference Needed | LLM-as-judge | Pairwise Compatible | +|--------------------|----------------------------------------------------|------------------|------------------------------------------------------------------------------------|-------------------| +| Document relevance | Assess retrieval quality | No | [prompt](https://smith.langchain.com/hub/langchain-ai/rag-document-relevance) | No | +| Answer faithfulness | Verify grounding in documents | No | [prompt](https://smith.langchain.com/hub/langchain-ai/rag-answer-hallucination) | No | +| Answer helpfulness | Evaluate user value | No | [prompt](https://smith.langchain.com/hub/langchain-ai/rag-answer-helpfulness) | No | +| Answer correctness | Check reference consistency | Yes | [prompt](https://smith.langchain.com/hub/langchain-ai/rag-answer-vs-reference) | No | +| Pairwise comparison | Compare multiple versions | No | [prompt](https://smith.langchain.com/hub/langchain-ai/pairwise-evaluation-rag) | Yes | -2. `Online evaluation`: Employ online evaluation for any reference-free prompts. This allows you to assess the RAG application's performance in real-time scenarios. - -3. `Pairwise evaluation`: Utilize pairwise evaluation to compare answers produced by different RAG chains. This evaluation focuses on user-specified criteria (e.g., answer format or style) rather than correctness, which can be evaluated using self-consistency or a ground truth reference. +### Summarization -#### RAG evaluation summary +Summarization evaluation focuses on assessing free-form writing against specific criteria. -| Evaluator | Detail | Needs reference output | LLM-as-judge? | Pairwise relevant | -| ------------------- | ------------------------------------------------- | ---------------------- | ------------------------------------------------------------------------------------- | ----------------- | -| Document relevance | Are documents relevant to the question? | No | Yes - [prompt](https://smith.langchain.com/hub/langchain-ai/rag-document-relevance) | No | -| Answer faithfulness | Is the answer grounded in the documents? | No | Yes - [prompt](https://smith.langchain.com/hub/langchain-ai/rag-answer-hallucination) | No | -| Answer helpfulness | Does the answer help address the question? | No | Yes - [prompt](https://smith.langchain.com/hub/langchain-ai/rag-answer-helpfulness) | No | -| Answer correctness | Is the answer consistent with a reference answer? | Yes | Yes - [prompt](https://smith.langchain.com/hub/langchain-ai/rag-answer-vs-reference) | No | -| Pairwise comparison | How do multiple answer versions compare? | No | Yes - [prompt](https://smith.langchain.com/hub/langchain-ai/pairwise-evaluation-rag) | Yes | +#### Dataset Sources -### Summarization +1. **Developer Examples** + - Manually curated texts + - [Example dataset](https://smith.langchain.com/public/659b07af-1cab-4e18-b21a-91a69a4c3990/d) -Summarization is one specific type of free-form writing. The evaluation aim is typically to examine the writing (summary) relative to a set of criteria. +2. **Production Logs** + - Real user interactions + - Online evaluation compatible -`Developer curated examples` of texts to summarize are commonly used for evaluation (see a dataset example [here](https://smith.langchain.com/public/659b07af-1cab-4e18-b21a-91a69a4c3990/d)). However, `user logs` from a production (summarization) app can be used for online evaluation with any of the `Reference-free` evaluation prompts below. 
+#### Evaluation Framework -`LLM-as-judge` is typically used for evaluation of summarization (as well as other types of writing) using `Reference-free` prompts that follow provided criteria to grade a summary. It is less common to provide a particular `Reference` summary, because summarization is a creative task and there are many possible correct answers. +| Use Case | Purpose | Reference Needed | LLM-as-judge | +|-----------------|--------------------------------------------------------------|------------------|----------------------------------------------------------------------------------------| +| Factual accuracy | Verify source accuracy | No | [prompt](https://smith.langchain.com/hub/langchain-ai/summary-accurancy-evaluator) | +| Faithfulness | Check for hallucinations | No | [prompt](https://smith.langchain.com/hub/langchain-ai/summary-hallucination-evaluator) | +| Helpfulness | Assess user value | No | [prompt](https://smith.langchain.com/hub/langchain-ai/summary-helpfulness-evaluator) | -`Online` or `Offline` evaluation are feasible because of the `Reference-free` prompt used. `Pairwise` evaluation is also a powerful way to perform comparisons between different summarization chains (e.g., different summarization prompts or LLMs): +### Classification/Tagging -| Use Case | Detail | Needs reference output | LLM-as-judge? | Pairwise relevant | -| ---------------- | -------------------------------------------------------------------------- | ---------------------- | -------------------------------------------------------------------------------------------- | ----------------- | -| Factual accuracy | Is the summary accurate relative to the source documents? | No | Yes - [prompt](https://smith.langchain.com/hub/langchain-ai/summary-accurancy-evaluator) | Yes | -| Faithfulness | Is the summary grounded in the source documents (e.g., no hallucinations)? | No | Yes - [prompt](https://smith.langchain.com/hub/langchain-ai/summary-hallucination-evaluator) | Yes | -| Helpfulness | Is summary helpful relative to user need? | No | Yes - [prompt](https://smith.langchain.com/hub/langchain-ai/summary-helpfulness-evaluator) | Yes | +Classification evaluation approaches depend on reference label availability. -### Classification / Tagging +#### With Reference Labels -Classification / Tagging applies a label to a given input (e.g., for toxicity detection, sentiment analysis, etc). Classification / Tagging evaluation typically employs the following components, which we will review in detail below: +Standard metrics: +- Accuracy +- Precision +- Recall -A central consideration for Classification / Tagging evaluation is whether you have a dataset with `reference` labels or not. If not, users frequently want to define an evaluator that uses criteria to apply label (e.g., toxicity, etc) to an input (e.g., text, user-question, etc). However, if ground truth class labels are provided, then the evaluation objective is focused on scoring a Classification / Tagging chain relative to the ground truth class label (e.g., using metrics such as precision, recall, etc). +Implementation: Custom [heuristic evaluator](./how_to_guides/custom_evaluator) -If ground truth reference labels are provided, then it's common to simply define a [custom heuristic evaluator](./how_to_guides/custom_evaluator) to compare ground truth labels to the chain output. 
However, it is increasingly common given the emergence of LLMs simply use `LLM-as-judge` to perform the Classification / Tagging of an input based upon specified criteria (without a ground truth reference). +#### Without Reference Labels -`Online` or `Offline` evaluation is feasible when using `LLM-as-judge` with the `Reference-free` prompt used. In particular, this is well suited to `Online` evaluation when a user wants to tag / classify application input (e.g., for toxicity, etc). +Approach: +- LLM-as-judge evaluation +- Criteria-based assessment +- Suitable for online evaluation -| Use Case | Detail | Needs reference output | LLM-as-judge? | Pairwise relevant | -| --------- | ------------------- | ---------------------- | ------------- | ----------------- | -| Accuracy | Standard definition | Yes | No | No | -| Precision | Standard definition | Yes | No | No | -| Recall | Standard definition | Yes | No | No | +| Metric | Reference Required | LLM-as-judge | Best Use Case | +|-----------|-------------------|---------------|---------------------| +| Accuracy | Yes | No | Binary classification| +| Precision | Yes | No | False positive focus | +| Recall | Yes | No | False negative focus | \ No newline at end of file From 8b258f3552bf07db5019c138cad913920fd6e059 Mon Sep 17 00:00:00 2001 From: Bagatur Date: Fri, 20 Dec 2024 12:27:47 -0800 Subject: [PATCH 2/3] o1 --- docs/evaluation/concepts/index.mdx | 474 ++++++----------------------- 1 file changed, 91 insertions(+), 383 deletions(-) diff --git a/docs/evaluation/concepts/index.mdx b/docs/evaluation/concepts/index.mdx index 8746418e..1f6b1c27 100644 --- a/docs/evaluation/concepts/index.mdx +++ b/docs/evaluation/concepts/index.mdx @@ -1,476 +1,184 @@ -# Evaluation Concepts +Evaluation Concepts +=================== -The pace of AI application development is often limited by high-quality evaluations. -Evaluations are methods designed to assess the performance and capabilities of AI applications. +High-quality evaluations are essential for refining, testing, and iterating AI applications. Meaningful evaluations make it easier to tailor prompts, choose models, experiment with new architectures, and confirm that your deployed applications continue to function as intended. LangSmith is designed to simplify the process of constructing these effective evaluations. -Good evaluations enable you to: -- Iteratively improve prompts -- Select optimal models -- Test different architectures -- Ensure deployed applications maintain expected performance +This guide walks through LangSmith’s evaluation framework and the underlying concepts for evaluating AI applications. It explores: -This guide explains the key concepts behind the LangSmith evaluation framework and evaluations for AI applications more broadly. +• Datasets, which serve as test collections for your application’s inputs (and optionally reference outputs). +• Evaluators, which are functions that measure how well your application’s outputs meet certain criteria. -## Core Components +Datasets +-------- -LangSmith evaluations consist of two essential parts: -- [**Datasets**](#datasets): Collections of test inputs and optional reference outputs -- [**Evaluators**](#evaluators): Functions that score outputs based on dataset inputs +A dataset is a curated collection of examples—each containing inputs and optional reference outputs—that you use to measure your application’s performance. -## Datasets - -A dataset contains a collection of examples used for evaluating an application. 
- -![Dataset](./static/dataset_concept.png) +Illustration: Datasets consist of examples. Each example may include inputs, optional reference outputs, and metadata. ### Examples -Each example in a dataset consists of three components: - -1. **Inputs**: A dictionary of input variables passed to your application -2. **Reference outputs** (optional): A dictionary of expected outputs used only by evaluators -3. **Metadata** (optional): Additional information for creating filtered dataset views +Each example represents one test case and generally includes three parts. First, it has one or more inputs—organized in a dictionary—that your application receives during a run. Second, it may include reference outputs, also called target or gold-standard outputs. These reference outputs are typically reserved for evaluators (instead of being fed directly into your application). Lastly, you can attach metadata in a dictionary format to keep track of any descriptive notes or tags you want to associate with the example. This metadata can then be used to filter or slice your dataset when performing evaluations. -![Example](./static/example_concept.png) +Illustration: An example has inputs, possible reference outputs, and optional metadata. ### Dataset Curation -There are several effective approaches to building evaluation datasets: - -#### 1. Manually Curated Examples -This is our recommended starting point for dataset creation. Benefits include: -- Leverages your understanding of expected application inputs -- Allows definition of "good" response criteria -- Enables coverage of common edge cases -- Even 10-20 high-quality examples can provide valuable insights - -#### 2. Historical Traces -Once your application is in production, real user interactions become valuable examples. Selection methods include: - -- **User Feedback**: - - Collect end user responses - - Focus on examples with negative feedback - - Identify areas where application underperformed +When building datasets to represent your application’s use cases, there are a few methods you can follow. -- **Heuristics**: - - Identify "interesting" datapoints - - Flag runs with longer completion times - - Track unusual patterns +Manually Curated Examples. This approach is a strong starting point if you already know the kinds of tasks your app needs to handle and what good outputs look like. Carefully selected examples can catch both typical cases and edge cases. Even a small set—perhaps 10 to 20 entries—can offer significant insights. -- **LLM Feedback**: - - Use LLMs to detect noteworthy runs - - Flag conversations requiring user rephrasing - - Identify incorrect initial responses +Historical Traces. Once your system is active in production, you can gather real-world runs to see how actual users interact with your application. You might pick out runs flagged by user complaints or poor ratings, examine cases where runtime anomalies occurred, or programmatically detect interesting patterns (such as users repeating themselves when the system didn’t address their query effectively). -#### 3. Synthetic Data -After establishing baseline examples, synthetic data can expand your dataset: -- Build upon existing hand-crafted examples -- Useful for rapid dataset growth -- Maintain quality through careful generation +Synthetic Generation. You can bolster your dataset by asking a language model to generate fresh test examples. This scales efficiently but works best if you have previously curated a small batch of high-quality examples for the model to emulate. 
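For instance, here is a minimal sketch of generating synthetic examples from a few hand-written seeds, assuming the `openai` Python package with an API key configured; the model name and seed content are illustrative, and real usage should validate the generated output before adding it to a dataset:

```python
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

seed_examples = [
    {"question": "How do I reset my password?", "answer": "Use the 'Forgot password' link on the login page."},
    {"question": "Can I change my billing email?", "answer": "Yes, under Settings > Billing."},
]

prompt = (
    "Write three new question/answer pairs in the same style and domain as the examples below. "
    "Respond with only a JSON array of objects with 'question' and 'answer' keys.\n\n"
    f"{json.dumps(seed_examples, indent=2)}"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)

# A sketch only: production code should handle malformed model output gracefully.
synthetic_examples = json.loads(response.choices[0].message.content)
```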
### Splits -Dataset splits serve multiple purposes in evaluation: -- Enable rapid iteration on smaller subsets -- Support final evaluation on larger datasets -- Improve experiment interpretability +Datasets in LangSmith can be partitioned into one or more splits. Splits enable you to separate your data in ways that help you run cost-effective experiments on a smaller slice while retaining more extensive tests for comprehensive evaluation. For instance, with a retrieval-augmented generation (RAG) system, you could divide data between factual queries and opinion-based queries, testing each category independently. -For example, in RAG applications, splits can focus on different question types: -- Factual questions -- Opinion-based questions -- Complex reasoning tasks - -Learn how to [create and manage dataset splits](/evaluation/how_to_guides/manage_datasets_in_application#create-and-manage-dataset-splits). +Learn more about how to create and manage dataset splits here: +(/evaluation/how_to_guides/manage_datasets_in_application#create-and-manage-dataset-splits) ### Versions -Dataset versioning is a key feature that: -- Creates new versions upon any modification (add/update/delete) -- Enables inspection and reverting of changes -- Supports [version tagging](/evaluation/how_to_guides/version_datasets#tag-a-version) for human-readable milestones - -Version-specific evaluations are particularly valuable for CI pipelines, ensuring dataset updates don't disrupt existing processes. - -## Evaluators +Every time you modify a dataset—adding, editing, or removing examples—LangSmith automatically creates a new version. This versioning allows you to revisit or revert earlier dataset states, making it easier to keep track of your changes as your application evolves. You can label these versions with meaningful tags that denote specific milestones or stable states of the dataset. You can also run evaluations on specific dataset versions if you want to lock a particular set of tests into a continuous integration (CI) pipeline. +More details on dataset versioning are available here: +(/evaluation/how_to_guides/version_datasets) -Evaluators are functions that assess application performance on specific examples. +Evaluators +---------- -### Evaluator Components +Evaluators are functions that assign one or more metrics to your application’s outputs. They provide “grades” indicating how closely the application’s outputs align with the desired criteria. -#### Inputs -Evaluators receive: -- [Example](/evaluation/concepts#examples): Dataset examples containing: - - Inputs - - Reference outputs (optional) - - Metadata -- [Run](/observability/concepts#runs): Actual application performance data: - - Outputs - - Intermediate steps (child runs) +### Evaluator Inputs -#### Outputs -Evaluators return metrics as dictionaries or lists of dictionaries containing: -- `key`: Metric name -- `score` | `value`: Metric result (numerical or categorical) -- `comment` (optional): Reasoning or additional context +Evaluators receive both the example (which supplies the input data and any reference outputs) and the run (the actual output produced by your application). The run may include the final output and any intermediate steps that occurred along the way, such as tool calls. -### Evaluator Implementation +### Evaluator Outputs -Multiple approaches are available: -1. 
**Custom code**: - - Define evaluators in Python or TypeScript - - Run client-side via SDKs - - Execute server-side through UI +Evaluators produce metrics in a dictionary or list of dictionaries. Typically, these metrics will have: +• A “key” which names the metric. +• A “score” or “value” that holds either a numeric measure or a categorical label. +• An optional “comment” that explains how the evaluator arrived at the score or label. -2. **Built-in evaluators**: - - Use LangSmith's pre-configured evaluators - - Configure via UI +### Defining Evaluators -Evaluators can be executed through: -- LangSmith SDK ([Python](https://docs.smith.langchain.com/reference/python) and TypeScript) -- [Prompt Playground](../../prompt_engineering/concepts#prompt-playground) -- [Rules](../../observability/how_to_guides/monitoring/rules) for automated evaluation +You can define and run LangSmith evaluators in a variety of ways. You can write your own custom evaluators in Python or TypeScript or rely on built-in evaluators that come with LangSmith. Evaluation can be triggered through the LangSmith SDK (in Python or TypeScript), the Prompt Playground (a feature within LangSmith), or via automated rules you set up in your project. ### Evaluation Techniques -#### 1. Human Evaluation -- Often the best starting point -- Enables detailed review of outputs and traces -- Facilitated by LangSmith's [annotation queues](/evaluation/concepts#annotation-queues) - -#### 2. Heuristic Evaluation -- Deterministic, rule-based functions -- Ideal for simple checks: - - Non-empty chatbot responses - - Code compilation verification - - Exact classification matching - -#### 3. LLM-as-judge -Evaluators using LLMs to score outputs: -- Encode grading criteria in prompts -- Can be reference-free or reference-based -- Require careful review and prompt tuning -- Benefit from few-shot examples in grader prompts - -Learn about [implementing LLM-as-judge evaluators](/evaluation/how_to_guides/llm_as_judge). +When building evaluators for large language model (LLM) applications, you can choose from several common strategies: -#### 4. Pairwise Evaluation -Compare outputs from two application versions: -- Similar to [LMSYS Chatbot Arena](https://chat.lmsys.org/) -- Applicable to general AI applications -- Can use: - - Heuristics - - LLM-based comparison - - Human annotation +Human Review. You or your team members can manually examine outputs for correctness and user satisfaction. This direct feedback is crucial, particularly in the early stages of development. LangSmith Annotation Queues allow you to structure this process for efficiency, including permissions and guidelines. -**When to use pairwise evaluation:** -- Direct scoring is difficult -- Comparing outputs is more straightforward -- Example: Summarization quality assessment +Heuristic Checking. Basic, rule-based evaluators can check for empty responses, monitoring how long responses are, or ensuring that certain keywords appear or do not appear. -Learn [how to implement pairwise evaluations](/evaluation/how_to_guides/evaluate_pairwise). +LLM-as-Judge. You can use a language model to evaluate or grade outputs. This approach often involves encoding your evaluation instructions in a prompt. It can be used in situations either with or without reference outputs. -## Experiment +Pairwise Comparisons. When you’re testing two versions of your application, you can have an evaluator decide which version performed better for a given example. 
This is often simpler for tasks like summarization, where “which is better?” is easier to judge than producing an absolute numeric performance score. -An experiment represents a single evaluation run of your application on a dataset: -- Tests specific application configurations -- Enables comparison of different: - - Prompts - - LLMs - - Architectures -- Supports [multi-experiment comparison](/evaluation/how_to_guides/compare_experiment_results) +Experiment +---------- -![Example](./static/comparing_multiple_experiments.png) +Every time you run your dataset’s inputs through your application, you’re effectively launching a new experiment. Using LangSmith, you can track every experiment linked to a dataset, making it simple to compare different versions of your application side by side. This comparison helps you detect regressions or measure gains accurately as you refine prompts, models, or other system components. -## Annotation Queues +Illustration: Compare multiple experiments side by side to see changes in scores or outputs. -Human feedback provides crucial insights for application improvement. [Annotation queues](/evaluation/how_to_guides/annotation_queues) enable: -- Systematic flagging of runs for review -- Streamlined interface for annotators -- Efficient feedback collection +Annotation Queues +----------------- -While [inline run annotation](/evaluation/how_to_guides/annotate_traces_inline) is always available, annotation queues offer additional benefits: -- Grouped run organization -- Specific annotation criteria -- Configurable permissions -- Easy transfer to evaluation datasets +Gathering real user input is a key part of refining your system. With annotation queues, you can sort runs into a review flow where human annotators examine outputs and assign feedback or corrective notes. These collected annotations can eventually form a dataset for future evaluations. In some cases, you might label only a sample of runs; in others, you may label them all. Annotation queues provide a structured environment for capturing this feedback while making it easy to keep track of who reviewed what. -Learn more about [annotation queues and human feedback](/evaluation/how_to_guides#annotation-queues-and-human-feedback). +To learn more about annotation queues and best practices for managing human feedback, see: +(/evaluation/how_to_guides#annotation-queues-and-human-feedback) -## Offline Evaluation +Offline Evaluation +------------------ -Offline evaluation involves testing applications on pre-compiled datasets before deployment, contrasting with online evaluation which assesses live application performance. +Offline evaluation is done on a static dataset rather than live end-user queries. It’s often the best way to verify changes to your model or your workflow prior to deployment, since you can test your system on curated or historical examples and measure the results in a controlled environment. -You can run offline evaluations: -- Client-side: Using LangSmith SDK ([Python](https://docs.smith.langchain.com/reference/python) and TypeScript) -- Server-side: Via [Prompt Playground](../../prompt_engineering/concepts#prompt-playground) -- Automated: Through [automations](/observability/how_to_guides/monitoring/rules) - -![Offline](./static/offline.png) +Illustration: Offline evaluations let you pass many curated or historical inputs into your application and systematically measure performance. ### Benchmarking -Benchmarking is a common offline evaluation approach that involves: -1. 
Curating representative input datasets -2. Defining key performance metrics -3. Testing multiple application versions - -Challenges include: -- Dataset curation effort -- Creating gold-standard references -- Designing effective metrics - -**Example**: RAG Q&A bot evaluation -- Dataset: Questions with reference answers -- Evaluator: LLM-as-judge for semantic equivalence -- Metrics: Answer accuracy and relevance - -**Example**: ReACT agent evaluation -- Dataset: User requests with expected tool calls -- Evaluator: Heuristic checking of tool call sequences -- Metrics: Tool call accuracy and completeness +Benchmarking involves running your application against a carefully assembled dataset to compare one or more metrics. For instance, you might supply question-answer pairs and then measure semantic similarity between your application’s answers and the reference answers. Alternatively, you could rely on LLM-as-judge prompts to assess correctness or helpfulness. Because creating and maintaining large reference datasets can be expensive, teams often reserve thorough benchmarking for high-stakes or major version releases. ### Unit Tests -Unit tests verify individual component functionality: -- Focus on basic assertions -- Expected to consistently pass -- Ideal for CI integration - -⚠️ **Important**: Consider caching for CI to manage LLM costs +When dealing with LLMs, you can still implement traditional “unit tests” in your codebase. In many cases, these tests are rule-based checks that verify whether a generated response meets basic criteria—such as being valid JSON or avoiding empty outputs. Including them in your continuous integration (CI) pipeline can automate detection of small but critical issues any time your underlying system changes. ### Regression Tests -Regression tests track performance across versions: -- Ensure new versions maintain existing functionality -- Measure improvements over previous versions -- Typically triggered by significant updates: - - Model changes - - Architecture modifications - - Major prompt revisions - -LangSmith's comparison view highlights: -- Regressions (red) -- Improvements (green) -- Performance changes relative to baseline +Regression tests examine how your system handles a known set of examples after an update. Suppose you tweak your prompt or switch to a new model. By re-running the same dataset and comparing the old outputs against the new outputs, you can quickly spot examples where performance has worsened. The LangSmith dashboard presents this visually, highlighting negative changes in red and improvements in green. -![Regression](./static/regression.png) +Illustration: Regression view highlights newly broken examples in red, improvements in green. ### Backtesting -Backtesting combines dataset creation with evaluation: -1. Convert production logs to evaluation datasets -2. Run updated versions on historical inputs -3. Compare against original production results +Backtesting replays your stored production traces—queries from real user sessions—through a newer version of your system. By comparing real user interactions against the new model’s outputs, you get a clear idea of whether the next model release will benefit your user base before you adopt it in production. 
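As a rough sketch, a backtest with the LangSmith Python SDK might look like the following. The `my-chat-app` project name and the `new_app` function are placeholders for your own tracing project and updated application, and the exact client methods can vary slightly between SDK versions, so treat this as an outline rather than a drop-in script.

```python
from langsmith import Client, evaluate

client = Client()

# 1. Pull recent production traces (assumes they were logged to a tracing
#    project called "my-chat-app"; only root runs are needed here).
runs = list(client.list_runs(project_name="my-chat-app", is_root=True))

# 2. Turn those traces into a dataset. The old outputs are stored as
#    reference outputs purely for comparison; they are not ground truth.
dataset = client.create_dataset("my-chat-app-backtest")
client.create_examples(
    inputs=[run.inputs for run in runs],
    outputs=[run.outputs for run in runs],
    dataset_id=dataset.id,
)

# 3. Replay the historical inputs through the candidate system. `new_app`
#    is a placeholder that maps a dataset input dict to an output dict.
def new_app(inputs: dict) -> dict:
    return {"answer": "..."}  # call your updated chain or model here

evaluate(
    new_app,
    data="my-chat-app-backtest",
    experiment_prefix="backtest-new-model",
)
```

Comparing the resulting experiment against the original production outputs then shows where the new version helps and where it regresses.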
-**Common Use Case**: Model Version Testing -- Collect recent production runs -- Test with new model versions -- Compare performance metrics +### Pairwise Evaluation (Offline) -### Pairwise Evaluation +Offline pairwise evaluation directly compares outputs from two different system versions on the same collection of examples. Rather than trying to score a single run in isolation, you simply choose which of two outputs is superior. This approach is particularly helpful in tasks like summarization. -Pairwise evaluation is particularly valuable when: -- Absolute scoring is challenging -- Relative comparisons are more reliable -- Evaluating general tasks +Online Evaluation +----------------- -For example, in summarization: -- Instead of: "Rate this summary 1-10" -- Better: "Which summary is clearer and more concise?" +Online evaluation continuously measures your application’s performance in a live setting. Rather than waiting until an offline batch test is complete, online evaluation monitors production runs in near real time, allowing you to detect errors or performance deterioration the moment they appear. This can be done with heuristic methods, reference-free LLM prompts that check for common failure modes, or any custom-coded logic you choose to deploy in production. -Learn [how to implement pairwise evaluations](/evaluation/how_to_guides/evaluate_pairwise). +Illustration: Online evaluation actively checks real-time runs for undesired application outputs. -## Online Evaluation +Application-Specific Techniques +------------------------------- -Online evaluation assesses deployed applications in real-time: -- No pre-existing dataset -- No reference outputs available -- Focus on immediate behavior assessment - -Use cases include: -- Application monitoring -- Behavior flagging -- Performance tracking - -Online evaluators typically run server-side using: -- Built-in [LLM-as-judge evaluators](/evaluation/how_to_guides/llm_as_judge) -- Custom code evaluators within LangSmith - -![Online](./static/online.png) - -## Application-specific Techniques +Below are a few evaluation strategies tailored to specific LLM application patterns. ### Agents -[LLM-powered autonomous agents](https://lilianweng.github.io/posts/2023-06-23-agent/) combine three core components: -1. Tool calling -2. Memory -3. Planning - -Agents [leverage tool calling](https://python.langchain.com/v0.1/docs/modules/agents/agent_types/tool_calling/) with planning (often via prompting) and memory (typically short-term message history) for response generation. [Tool calling](https://python.langchain.com/v0.1/docs/modules/model_io/chat/function_calling/) enables models to: -- Select appropriate tools -- Generate required input arguments - -![Tool use](./static/tool_use.png) +Autonomous LLM-driven agents combine an LLM for decision-making with tools for calls and memory for context. Each agent step typically involves the LLM deciding whether to invoke a tool, how to parse the user’s request, and what to do next based on prior steps. -#### Agent Architecture Example +Illustration: The agent uses an LLM to decide whether to call a tool and how. -Below shows a tool-calling agent in [LangGraph](https://langchain-ai.github.io/langgraph/tutorials/introduction/): +You can evaluate agents by focusing on: -1. `assistant node`: LLM determines tool usage based on input -2. `tool condition`: Routes to tool node if tool selected -3. `tool node`: Executes tool and returns output -4. 
Loop continues until no tool is selected +• Final Response: Assess whether the ultimate answer is correct or helpful, ignoring the chain of actions the agent took. +• Single Step: Look at each decision independently. Did the agent choose the correct tool or produce the correct query at each stage? +• Trajectory: Check whether the sequence of actions is logical. You could compare the agent’s chosen tools with a reference “ideal” list, or see if the agent’s overall plan leads to the correct outcome. -![Agent](./static/langgraph_agent.png) +#### Evaluating an Agent’s Final Response -#### Evaluation Types +If your concern is whether the end result is correct, you can evaluate it just as you would any other LLM-generated answer. This method disregards intermediate steps, so it’s simpler to implement but doesn’t highlight at which point errors occur. -This architecture enables three primary evaluation approaches: +#### Evaluating a Single Step -1. **Final Response Evaluation** - - Assesses overall task completion - - Treats agent as black box - - Evaluates end result +An agent often makes multiple decisions in sequence. By evaluating each step individually, you can catch smaller mistakes immediately. This requires more granular data on which tool was chosen at each step and why, and makes data collection slightly more complex. -2. **Single Step Evaluation** - - Examines individual actions - - Focuses on tool selection - - Evaluates step accuracy +#### Evaluating an Agent’s Trajectory -3. **Trajectory Evaluation** - - Analyzes complete action sequence - - Verifies expected tool call paths - - Assesses overall strategy +With trajectory-based evaluation, you consider the entire path from start to finish. This might involve matching the agent’s tool usage and outputs against a known “correct” chain of thought or simply passing the full trace to an evaluator (human or model) for a holistic verdict. Trajectory evaluations provide the richest feedback but require more setup and careful dataset construction. -![Agent-eval](./static/agent_eval.png) +### Retrieval Augmented Generation (RAG) -#### Detailed Evaluation Approaches +A RAG system fetches relevant documents to feed to the LLM. This is useful in tasks such as question-answering, enterprise search, or knowledge-based chat experiences. -##### 1. Final Response Evaluation +Comprehensive RAG details: +https://github.com/langchain-ai/rag-from-scratch -**Components needed:** -- Inputs: User input and optional tool list -- Output: Final agent response -- Evaluator: Task-dependent, often LLM-as-judge +#### Dataset -**Considerations:** -- (+) Evaluates complete task success -- (-) Longer execution time -- (-) Limited debugging insight -- (-) Complex metric definition +For RAG, your dataset generally consists of queries and possibly reference answers. If reference answers exist, you can compare generated answers to these references for offline evaluation. If no reference answers exist, you can still measure whether relevant documents were retrieved or ask an LLM-as-judge to check whether the answer is faithful to the retrieved passages. -##### 2. Single Step Evaluation +#### Evaluator -**Components needed:** -- Inputs: Single step context -- Outputs: LLM response/tool calls -- Evaluator: Binary scoring for tool selection - -**Benefits:** -- Precise debugging -- Fast execution -- Simple evaluation metrics - -**Challenges:** -- Dataset creation complexity -- Limited context consideration -- Focuses on individual steps only - -##### 3. 
Trajectory Evaluation - -**Evaluation methods:** -1. Exact sequence matching -2. Tool set verification -3. Full trajectory analysis - -**Metrics options:** -- Binary exact match -- Step deviation count -- Expected tool coverage -- LLM-based trajectory assessment - -### RAG (Retrieval Augmented Generation) - -:::info -For comprehensive RAG understanding, see our [`RAG From Scratch` series](https://github.com/langchain-ai/rag-from-scratch). -::: - -#### Dataset Considerations - -Key factors: -- Reference answer availability -- Ease of obtaining ground truth -- Evaluation goals - -#### Evaluation Types - -![rag-types.png](./static/rag-types.png) - -1. **Reference-based** - - Compares against ground truth - - Assesses correctness - - Requires reference outputs - -2. **Reference-free** - - Self-consistency checks - - No reference required - - Multiple evaluation aspects - -#### RAG Evaluation Framework - -| Evaluator | Purpose | Reference Needed | LLM-as-judge | Pairwise Compatible | -|--------------------|----------------------------------------------------|------------------|------------------------------------------------------------------------------------|-------------------| -| Document relevance | Assess retrieval quality | No | [prompt](https://smith.langchain.com/hub/langchain-ai/rag-document-relevance) | No | -| Answer faithfulness | Verify grounding in documents | No | [prompt](https://smith.langchain.com/hub/langchain-ai/rag-answer-hallucination) | No | -| Answer helpfulness | Evaluate user value | No | [prompt](https://smith.langchain.com/hub/langchain-ai/rag-answer-helpfulness) | No | -| Answer correctness | Check reference consistency | Yes | [prompt](https://smith.langchain.com/hub/langchain-ai/rag-answer-vs-reference) | No | -| Pairwise comparison | Compare multiple versions | No | [prompt](https://smith.langchain.com/hub/langchain-ai/pairwise-evaluation-rag) | Yes | +Evaluators for RAG systems often revolve around factual accuracy and alignment with retrieved documents. You can assess how relevant the retrieved documents were and whether or not the final answer relies on accurate information. This can be done offline if reference answers are available, online if you want immediate monitoring in production, or through pairwise evaluations to compare different retrieval strategies. ### Summarization -Summarization evaluation focuses on assessing free-form writing against specific criteria. - -#### Dataset Sources - -1. **Developer Examples** - - Manually curated texts - - [Example dataset](https://smith.langchain.com/public/659b07af-1cab-4e18-b21a-91a69a4c3990/d) - -2. **Production Logs** - - Real user interactions - - Online evaluation compatible - -#### Evaluation Framework - -| Use Case | Purpose | Reference Needed | LLM-as-judge | -|-----------------|--------------------------------------------------------------|------------------|----------------------------------------------------------------------------------------| -| Factual accuracy | Verify source accuracy | No | [prompt](https://smith.langchain.com/hub/langchain-ai/summary-accurancy-evaluator) | -| Faithfulness | Check for hallucinations | No | [prompt](https://smith.langchain.com/hub/langchain-ai/summary-hallucination-evaluator) | -| Helpfulness | Assess user value | No | [prompt](https://smith.langchain.com/hub/langchain-ai/summary-helpfulness-evaluator) | - -### Classification/Tagging - -Classification evaluation approaches depend on reference label availability. 
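When reference labels are available, the comparison itself is usually a small piece of custom code. Below is a minimal sketch of such an evaluator; the run-plus-example signature follows the evaluator inputs described earlier, and the `category` output key is a made-up name you would replace with your own schema.

```python
from langsmith.schemas import Example, Run

def correct_label(run: Run, example: Example) -> dict:
    """Exact-match check of the predicted label against the reference label."""
    # Assumption: the application returns {"category": ...} and the dataset
    # example stores its reference label under the same key.
    predicted = (run.outputs or {}).get("category")
    expected = (example.outputs or {}).get("category")
    return {
        "key": "correct_label",
        "score": int(predicted == expected),
        "comment": f"predicted={predicted!r}, expected={expected!r}",
    }
```

Averaging this score across a dataset gives plain accuracy; per-class variants of the same idea yield precision and recall.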
- -#### With Reference Labels - -Standard metrics: -- Accuracy -- Precision -- Recall +When summarizing text, there isn’t always a single “correct” summary. Consequently, LLM-based evaluators are popular. The model can be asked to check clarity, factual accuracy, or faithfulness to the original text. You can conduct these evaluations offline on a curated set of source documents or run them online in near real time for user-generated inputs. Pairwise evaluation is also a common approach, since it may be easier to choose which of two summaries is better than to assign an absolute quality score to a single summary. -Implementation: Custom [heuristic evaluator](./how_to_guides/custom_evaluator) +### Classification / Tagging -#### Without Reference Labels +Classification tasks assign labels or tags to inputs. If you already have a labeled dataset, standard precision, recall, and accuracy metrics can be calculated. If labels are not available, you can use an LLM-as-judge to categorize inputs according to specified criteria and check for consistency. -Approach: -- LLM-as-judge evaluation -- Criteria-based assessment -- Suitable for online evaluation +When reference labels exist, you can build a custom evaluator that compares predictions to ground truth and produces numeric performance metrics. Without reference labels, you could still rely on carefully designed prompts that instruct your model to classify inputs appropriately. Pairwise comparisons can be beneficial if you are testing two different classification systems and want to see which approach yields more satisfactory labels according to certain guidelines. -| Metric | Reference Required | LLM-as-judge | Best Use Case | -|-----------|-------------------|---------------|---------------------| -| Accuracy | Yes | No | Binary classification| -| Precision | Yes | No | False positive focus | -| Recall | Yes | No | False negative focus | \ No newline at end of file +All the techniques discussed—offline or online evaluation, pairwise comparisons, heuristic checks, and more—can help ensure your classification tasks remain reliable as your system evolves. \ No newline at end of file From 4e5b7561b1bf7767bfacf779c7f1c58320264347 Mon Sep 17 00:00:00 2001 From: Bagatur Date: Fri, 20 Dec 2024 12:39:52 -0800 Subject: [PATCH 3/3] o1 v2 --- docs/evaluation/concepts/index.mdx | 135 +++++++++++++---------------- 1 file changed, 62 insertions(+), 73 deletions(-) diff --git a/docs/evaluation/concepts/index.mdx b/docs/evaluation/concepts/index.mdx index 1f6b1c27..2c41ca4c 100644 --- a/docs/evaluation/concepts/index.mdx +++ b/docs/evaluation/concepts/index.mdx @@ -1,184 +1,173 @@ Evaluation Concepts =================== -High-quality evaluations are essential for refining, testing, and iterating AI applications. Meaningful evaluations make it easier to tailor prompts, choose models, experiment with new architectures, and confirm that your deployed applications continue to function as intended. LangSmith is designed to simplify the process of constructing these effective evaluations. +High-quality evaluations are key to creating, refining, and validating AI applications. In LangSmith, you’ll find the tools you need to structure these evaluations so that you can iterate efficiently, confirm that changes to your application improve performance, and ensure that your system continues to work as intended. -This guide walks through LangSmith’s evaluation framework and the underlying concepts for evaluating AI applications. 
It explores: +This guide explores LangSmith’s evaluation framework and core concepts, including: -• Datasets, which serve as test collections for your application’s inputs (and optionally reference outputs). -• Evaluators, which are functions that measure how well your application’s outputs meet certain criteria. +- **Datasets**, which hold test examples for your application’s inputs (and, optionally, reference outputs). +- **Evaluators**, which assess how well your application’s outputs align with the desired criteria. Datasets -------- -A dataset is a curated collection of examples—each containing inputs and optional reference outputs—that you use to measure your application’s performance. +A dataset is a curated set of test examples. Each example can include inputs (the data you feed into your application), optional reference outputs (the “gold-standard” or target answers), and any metadata you find helpful. -Illustration: Datasets consist of examples. Each example may include inputs, optional reference outputs, and metadata. +![Dataset](./static/dataset_concept.png) ### Examples -Each example represents one test case and generally includes three parts. First, it has one or more inputs—organized in a dictionary—that your application receives during a run. Second, it may include reference outputs, also called target or gold-standard outputs. These reference outputs are typically reserved for evaluators (instead of being fed directly into your application). Lastly, you can attach metadata in a dictionary format to keep track of any descriptive notes or tags you want to associate with the example. This metadata can then be used to filter or slice your dataset when performing evaluations. +Each example corresponds to a single test case. In most scenarios, an example has three components. First, it has one or more inputs provided as a dictionary. Next, there can be reference outputs (if you have a known response to compare against). Finally, you can attach metadata in a dictionary format to store notes or tags, making it easy to slice, filter, or categorize examples later. -Illustration: An example has inputs, possible reference outputs, and optional metadata. +![Example](./static/example_concept.png) ### Dataset Curation -When building datasets to represent your application’s use cases, there are a few methods you can follow. +When constructing datasets, there are a few common ways to ensure they match real-world use cases: -Manually Curated Examples. This approach is a strong starting point if you already know the kinds of tasks your app needs to handle and what good outputs look like. Carefully selected examples can catch both typical cases and edge cases. Even a small set—perhaps 10 to 20 entries—can offer significant insights. - -Historical Traces. Once your system is active in production, you can gather real-world runs to see how actual users interact with your application. You might pick out runs flagged by user complaints or poor ratings, examine cases where runtime anomalies occurred, or programmatically detect interesting patterns (such as users repeating themselves when the system didn’t address their query effectively). - -Synthetic Generation. You can bolster your dataset by asking a language model to generate fresh test examples. This scales efficiently but works best if you have previously curated a small batch of high-quality examples for the model to emulate. +- **Manually Curated Examples**. 
This includes handpicking representative tasks and responses that illustrate normal usage and tricky edge cases. Even a small selection of 10 to 20 examples can yield substantial insights. +- **Historical Traces**. If your system is already in production, gather actual production runs, including examples flagged as problematic by users or system logs. Filtering based on complaints, repeating questions, anomaly detection, or LLM-as-judge feedback can provide a realistic snapshot of real-world usage. +- **Synthetic Generation**. A language model can help you automatically generate new test scenarios, which is especially efficient if you have a baseline set of high-quality examples to guide it. ### Splits -Datasets in LangSmith can be partitioned into one or more splits. Splits enable you to separate your data in ways that help you run cost-effective experiments on a smaller slice while retaining more extensive tests for comprehensive evaluation. For instance, with a retrieval-augmented generation (RAG) system, you could divide data between factual queries and opinion-based queries, testing each category independently. +To organize your dataset, LangSmith allows you to create one or more splits. Splits let you isolate subsets of data for targeted experiments. For instance, you might keep a small “dev” split for rapid iterating and a larger “test” split for comprehensive performance checks. In a retrieval-augmented generation (RAG) system, for example, you could divide data between factual vs. opinion-oriented queries. -Learn more about how to create and manage dataset splits here: -(/evaluation/how_to_guides/manage_datasets_in_application#create-and-manage-dataset-splits) +Read more about creating and managing splits [here](/evaluation/how_to_guides/manage_datasets_in_application#create-and-manage-dataset-splits). ### Versions -Every time you modify a dataset—adding, editing, or removing examples—LangSmith automatically creates a new version. This versioning allows you to revisit or revert earlier dataset states, making it easier to keep track of your changes as your application evolves. You can label these versions with meaningful tags that denote specific milestones or stable states of the dataset. You can also run evaluations on specific dataset versions if you want to lock a particular set of tests into a continuous integration (CI) pipeline. -More details on dataset versioning are available here: -(/evaluation/how_to_guides/version_datasets) +Every time your dataset changes—if you add new examples, edit existing ones, or remove any entries—LangSmith creates a new version automatically. This ensures you can always revert or revisit earlier states if needed. You can label versions with meaningful tags to mark particular stages of your dataset, and you can run evaluations on any specific version for consistent comparisons over time. + +Further details on dataset versioning are provided [here](/evaluation/how_to_guides/version_datasets). Evaluators ---------- -Evaluators are functions that assign one or more metrics to your application’s outputs. They provide “grades” indicating how closely the application’s outputs align with the desired criteria. - -### Evaluator Inputs +Evaluators assign metrics or grades to your application’s outputs, making it easier to see how well those outputs meet your desired standards. -Evaluators receive both the example (which supplies the input data and any reference outputs) and the run (the actual output produced by your application). 
The run may include the final output and any intermediate steps that occurred along the way, such as tool calls. +### Techniques -### Evaluator Outputs +Below are common strategies for evaluating outputs from large language models (LLMs): -Evaluators produce metrics in a dictionary or list of dictionaries. Typically, these metrics will have: -• A “key” which names the metric. -• A “score” or “value” that holds either a numeric measure or a categorical label. -• An optional “comment” that explains how the evaluator arrived at the score or label. +- **Human Review**. You and your team can manually assess outputs for correctness and user satisfaction. Use LangSmith Annotation Queues for a structured workflow, including permissions, guidelines, and progress tracking. +- **Heuristic Checking**. Basic rule-based evaluators help detect issues such as empty responses, excessive length, or missing essential keywords. +- **LLM-as-Judge**. A language model can serve as the evaluator, typically via a dedicated prompt that checks correctness, helpfulness, or style. This method works with or without reference outputs. +- **Pairwise Comparisons**. When deciding between two application versions, it can be simpler to ask, “Which output is better?” rather than to assign absolute scores, especially in creative tasks like summarization. ### Defining Evaluators -You can define and run LangSmith evaluators in a variety of ways. You can write your own custom evaluators in Python or TypeScript or rely on built-in evaluators that come with LangSmith. Evaluation can be triggered through the LangSmith SDK (in Python or TypeScript), the Prompt Playground (a feature within LangSmith), or via automated rules you set up in your project. +You can use LangSmith’s built-in evaluators or build your own in Python or TypeScript. You can then run these evaluators through the LangSmith SDK, the Prompt Playground (inside LangSmith), or in any automation pipeline you set up. +#### Evaluator Inputs -### Evaluation Techniques +An evaluator has access to both the example (input data and optional reference outputs) and the run (the live output from your application). Because each run often includes details like the final answer or any intermediate steps (e.g., tool calls), evaluators can capture nuanced performance metrics. -When building evaluators for large language model (LLM) applications, you can choose from several common strategies: +#### Evaluator Outputs -Human Review. You or your team members can manually examine outputs for correctness and user satisfaction. This direct feedback is crucial, particularly in the early stages of development. LangSmith Annotation Queues allow you to structure this process for efficiency, including permissions and guidelines. +Evaluators usually produce responses in the form of dictionaries (or lists of dictionaries). Each entry typically contains: -Heuristic Checking. Basic, rule-based evaluators can check for empty responses, monitoring how long responses are, or ensuring that certain keywords appear or do not appear. - -LLM-as-Judge. You can use a language model to evaluate or grade outputs. This approach often involves encoding your evaluation instructions in a prompt. It can be used in situations either with or without reference outputs. - -Pairwise Comparisons. When you’re testing two versions of your application, you can have an evaluator decide which version performed better for a given example. 
This is often simpler for tasks like summarization, where “which is better?” is easier to judge than producing an absolute numeric performance score.

Experiment
----------

-Every time you run your dataset’s inputs through your application, you’re effectively launching a new experiment. Using LangSmith, you can track every experiment linked to a dataset, making it simple to compare different versions of your application side by side. This comparison helps you detect regressions or measure gains accurately as you refine prompts, models, or other system components.
+Any time you pass your dataset’s inputs into your application—whether you’re testing a new prompt, a new model, or a new system configuration—you’re effectively starting an experiment. LangSmith keeps track of these experiments so you can compare differences in outputs side by side. This makes it easier to catch regressions, confirm improvements, and refine your system step by step.

-Illustration: Compare multiple experiments side by side to see changes in scores or outputs.
+![Experiment](./static/comparing_multiple_experiments.png)

Annotation Queues
-----------------

-Gathering real user input is a key part of refining your system. With annotation queues, you can sort runs into a review flow where human annotators examine outputs and assign feedback or corrective notes. These collected annotations can eventually form a dataset for future evaluations. In some cases, you might label only a sample of runs; in others, you may label them all. Annotation queues provide a structured environment for capturing this feedback while making it easy to keep track of who reviewed what.
+Annotation queues power the process of collecting real user feedback. They let you direct runs into a pipeline where human annotators can label, grade, or comment on outputs. You might label every run, or just a sample if your traffic is large. Over time, these labels can form their own dataset for further offline evaluation. Annotation queues are thus a key tool for harnessing human feedback in a consistent, transparent manner.

-To learn more about annotation queues and best practices for managing human feedback, see:
-(/evaluation/how_to_guides#annotation-queues-and-human-feedback)
+To learn more about annotation queues, see the [annotation queues guide](/evaluation/how_to_guides#annotation-queues-and-human-feedback).

Offline Evaluation
------------------

-Offline evaluation is done on a static dataset rather than live end-user queries. It’s often the best way to verify changes to your model or your workflow prior to deployment, since you can test your system on curated or historical examples and measure the results in a controlled environment.
+Offline evaluation focuses on a static dataset rather than live user queries. It’s an excellent practice to verify changes before deployment or measure how your system handles historical use cases.

-Illustration: Offline evaluations let you pass many curated or historical inputs into your application and systematically measure performance.
+![Offline](./static/offline.png)

### Benchmarking

-Benchmarking involves running your application against a carefully assembled dataset to compare one or more metrics. For instance, you might supply question-answer pairs and then measure semantic similarity between your application’s answers and the reference answers. 
Alternatively, you could rely on LLM-as-judge prompts to assess correctness or helpfulness. Because creating and maintaining large reference datasets can be expensive, teams often reserve thorough benchmarking for high-stakes or major version releases. +Benchmarking compares your system’s outputs to some fixed standard. For question-answering tasks, you might compare the model’s responses against reference answers and compute similarity. Or you could use an LLM-as-judge approach. Typically, large-scale benchmarking is reserved for major system updates, since it requires maintaining extensive curated datasets. ### Unit Tests -When dealing with LLMs, you can still implement traditional “unit tests” in your codebase. In many cases, these tests are rule-based checks that verify whether a generated response meets basic criteria—such as being valid JSON or avoiding empty outputs. Including them in your continuous integration (CI) pipeline can automate detection of small but critical issues any time your underlying system changes. +Classic “unit tests” can still be applied to LLMs. You can write logic-based checks looking for empty strings, invalid JSON, or other fundamental errors. These tests can run during your continuous integration (CI) process, catching critical issues anytime you change prompts, models, or other code. ### Regression Tests -Regression tests examine how your system handles a known set of examples after an update. Suppose you tweak your prompt or switch to a new model. By re-running the same dataset and comparing the old outputs against the new outputs, you can quickly spot examples where performance has worsened. The LangSmith dashboard presents this visually, highlighting negative changes in red and improvements in green. +Regression tests help ensure that today’s improvements don’t break yesterday’s successes. After a prompt tweak or model update, you can re-run the same dataset and directly compare new results against old ones. LangSmith’s dashboard highlights any degradations in red and improvements in green, making it easy to see how the changes affect overall performance. Illustration: Regression view highlights newly broken examples in red, improvements in green. ### Backtesting -Backtesting replays your stored production traces—queries from real user sessions—through a newer version of your system. By comparing real user interactions against the new model’s outputs, you get a clear idea of whether the next model release will benefit your user base before you adopt it in production. +Backtesting replays past production runs against your updated system. By comparing new outputs to what you served previously, you gain a real-world perspective on whether the upgrade will solve user pain points or potentially introduce new problems—all without impacting live users. -### Pairwise Evaluation (Offline) +### Pairwise Evaluation -Offline pairwise evaluation directly compares outputs from two different system versions on the same collection of examples. Rather than trying to score a single run in isolation, you simply choose which of two outputs is superior. This approach is particularly helpful in tasks like summarization. +Sometimes it’s more natural to decide which output is better rather than relying on absolute scoring. With offline pairwise evaluation, you run both system versions on the same set of inputs and directly compare each example’s outputs. This is commonly used for tasks such as summarization, where multiple outputs may be valid but differ in overall quality. 
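As a concrete illustration of the judging step, the sketch below asks an LLM to pick between two candidate summaries of the same source text. The prompt wording, the `gpt-4o-mini` model name, and the function name are all placeholders; the point is simply that the judge returns a preference rather than an absolute score.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def prefer_summary(source_text: str, summary_a: str, summary_b: str) -> str:
    """Ask an LLM judge which of two candidate summaries is better."""
    prompt = (
        "You are comparing two candidate summaries of the same source text.\n\n"
        f"Source text:\n{source_text}\n\n"
        f"Summary A:\n{summary_a}\n\n"
        f"Summary B:\n{summary_b}\n\n"
        "Answer with exactly 'A' or 'B', choosing the summary that is "
        "clearer, more faithful to the source, and more concise."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder: any capable chat model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```

Recording which side wins on each example, and how often each version wins overall, gives you a simple head-to-head comparison between the two system versions.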
Online Evaluation ----------------- -Online evaluation continuously measures your application’s performance in a live setting. Rather than waiting until an offline batch test is complete, online evaluation monitors production runs in near real time, allowing you to detect errors or performance deterioration the moment they appear. This can be done with heuristic methods, reference-free LLM prompts that check for common failure modes, or any custom-coded logic you choose to deploy in production. +Online evaluation measures performance in production, giving you near real-time feedback on potential issues. Instead of waiting for a batch evaluation to conclude, you can detect errors or regressions as soon as they arise. This immediate visibility can be achieved through heuristic checks, LLM-based evaluators, or any custom logic you deploy alongside your live application. -Illustration: Online evaluation actively checks real-time runs for undesired application outputs. +![Online](./static/online.png) Application-Specific Techniques ------------------------------- -Below are a few evaluation strategies tailored to specific LLM application patterns. +LangSmith evaluations can be tailored to fit a variety of common LLM application patterns. Below are some popular scenarios and potential evaluation approaches. ### Agents -Autonomous LLM-driven agents combine an LLM for decision-making with tools for calls and memory for context. Each agent step typically involves the LLM deciding whether to invoke a tool, how to parse the user’s request, and what to do next based on prior steps. +Agents use an LLM to manage decisions, often with access to external tools and memory. Agents break problems into multiple steps, deciding whether to call a tool, how to parse user instructions, and how to proceed based on the results of prior steps. -Illustration: The agent uses an LLM to decide whether to call a tool and how. +You can assess agents in several ways: -You can evaluate agents by focusing on: - -• Final Response: Assess whether the ultimate answer is correct or helpful, ignoring the chain of actions the agent took. -• Single Step: Look at each decision independently. Did the agent choose the correct tool or produce the correct query at each stage? -• Trajectory: Check whether the sequence of actions is logical. You could compare the agent’s chosen tools with a reference “ideal” list, or see if the agent’s overall plan leads to the correct outcome. +- **Final Response**. Measure the correctness or helpfulness of the final answer alone, ignoring intermediate steps. +- **Single Step**. Look at each decision in isolation to catch small mistakes earlier in the process. +- **Trajectory**. Examine the agent’s entire chain of actions to see whether it deployed the correct tools or if a suboptimal decision early on led to overall failure. #### Evaluating an Agent’s Final Response -If your concern is whether the end result is correct, you can evaluate it just as you would any other LLM-generated answer. This method disregards intermediate steps, so it’s simpler to implement but doesn’t highlight at which point errors occur. +If your main concern is whether the agent’s end answer is correct, you can evaluate it as you would any LLM output. This avoids complexity but may not show where a chain of reasoning went awry. #### Evaluating a Single Step -An agent often makes multiple decisions in sequence. By evaluating each step individually, you can catch smaller mistakes immediately. 
This requires more granular data on which tool was chosen at each step and why, and makes data collection slightly more complex. +Agents can make multiple decisions in a single run. Evaluating each step separately allows you to spot incremental errors. This approach requires storing detailed run histories for each choice or tool invocation. #### Evaluating an Agent’s Trajectory -With trajectory-based evaluation, you consider the entire path from start to finish. This might involve matching the agent’s tool usage and outputs against a known “correct” chain of thought or simply passing the full trace to an evaluator (human or model) for a holistic verdict. Trajectory evaluations provide the richest feedback but require more setup and careful dataset construction. +A trajectory-based approach looks at the entire flow, from the initial prompt to the final answer. This might involve comparing the agent’s chain of tool calls to a known “ideal” chain or having an LLM or human reviewer judge the agent’s reasoning. It’s the most thorough method but also the most involved to set up. ### Retrieval Augmented Generation (RAG) -A RAG system fetches relevant documents to feed to the LLM. This is useful in tasks such as question-answering, enterprise search, or knowledge-based chat experiences. +RAG systems fetch context or documentation from external sources to shape the LLM’s output. These are often used for Q&A applications, enterprise searches, or knowledge-based interactions. -Comprehensive RAG details: +Comprehensive details on building RAG systems can be found here: https://github.com/langchain-ai/rag-from-scratch #### Dataset -For RAG, your dataset generally consists of queries and possibly reference answers. If reference answers exist, you can compare generated answers to these references for offline evaluation. If no reference answers exist, you can still measure whether relevant documents were retrieved or ask an LLM-as-judge to check whether the answer is faithful to the retrieved passages. +For RAG, you typically have queries (and possibly reference answers) in your dataset. With reference answers, offline evaluations can measure how accurately your final output matches the ground truth. Even without reference answers, you can still evaluate by checking whether retrieved documents are relevant and whether the system’s answer is faithful to those documents. #### Evaluator -Evaluators for RAG systems often revolve around factual accuracy and alignment with retrieved documents. You can assess how relevant the retrieved documents were and whether or not the final answer relies on accurate information. This can be done offline if reference answers are available, online if you want immediate monitoring in production, or through pairwise evaluations to compare different retrieval strategies. +RAG evaluators commonly focus on factual correctness and faithfulness to the retrieved information. You can carry out these checks offline (with reference answers), online (in near real-time for live queries), or in pairwise comparisons (to compare different ranking or retrieval methods). ### Summarization -When summarizing text, there isn’t always a single “correct” summary. Consequently, LLM-based evaluators are popular. The model can be asked to check clarity, factual accuracy, or faithfulness to the original text. You can conduct these evaluations offline on a curated set of source documents or run them online in near real time for user-generated inputs. 
Pairwise evaluation is also a common approach, since it may be easier to choose which of two summaries is better than to assign an absolute quality score to a single summary. +Summarization tasks are often subjective, making it challenging to define a single “correct” output. In this context, LLM-as-judge strategies are particularly useful. By asking a language model to grade clarity, accuracy, or coverage, you can track your summarizer’s performance. Alternatively, offline pairwise comparisons can help you see which summary outperforms the other—especially if you’re testing new prompt styles or models. ### Classification / Tagging -Classification tasks assign labels or tags to inputs. If you already have a labeled dataset, standard precision, recall, and accuracy metrics can be calculated. If labels are not available, you can use an LLM-as-judge to categorize inputs according to specified criteria and check for consistency. - -When reference labels exist, you can build a custom evaluator that compares predictions to ground truth and produces numeric performance metrics. Without reference labels, you could still rely on carefully designed prompts that instruct your model to classify inputs appropriately. Pairwise comparisons can be beneficial if you are testing two different classification systems and want to see which approach yields more satisfactory labels according to certain guidelines. +Classification tasks apply labels or tags to inputs. If you have reference labels, you can compute metrics like accuracy, precision, or recall. If not, you can still apply LLM-as-judge techniques, instructing the model to validate whether a predicted label matches labeling guidelines. Pairwise evaluation is also an option if you need to compare two classification systems. -All the techniques discussed—offline or online evaluation, pairwise comparisons, heuristic checks, and more—can help ensure your classification tasks remain reliable as your system evolves. \ No newline at end of file +In all these application patterns, LangSmith’s offline and online tools—and the combination of heuristics, LLM-based evaluations, human feedback, and pairwise comparisons—can help maintain and improve performance as your system evolves. \ No newline at end of file