wip: eval how to revamp #525

Merged · 30 commits · Nov 23, 2024
11 changes: 6 additions & 5 deletions docs/evaluation/concepts/index.mdx
@@ -1,4 +1,4 @@
# Concepts
# Evaluation concepts

The pace of AI application development is often rate-limited by high-quality evaluations because there is a paradox of choice: developers often wonder how to engineer their prompt or which LLM best balances accuracy, latency, and cost. High-quality evaluations can help you rapidly answer these types of questions with confidence.

@@ -130,7 +130,8 @@ See documentation on our workflow to audit and manually correct evaluator scores

### Pairwise

Pairwise evaluators pick the better of two task outputs based upon some criteria.
Pairwise evaluators allow you to compare the outputs of two versions of an application.
Think [LMSYS Chatbot Arena](https://chat.lmsys.org/) - this is the same concept, but applied to AI applications more generally, not just models!
This can be done with a heuristic ("which response is longer"), an LLM (with a specific pairwise prompt), or a human (asking them to manually annotate examples).

**When should you use pairwise evaluation?** Pairwise evaluation is helpful when it is difficult to directly score an LLM output, but easier to compare two outputs.
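
To make the heuristic case concrete, here is a minimal sketch in plain Python (illustrative only; it is not the LangSmith pairwise API, which the pairwise evaluation how-to guide covers):

```python
# Illustrative pairwise heuristic: given two candidate responses to the same input,
# prefer the shorter one.
def prefer_shorter(response_a: str, response_b: str) -> dict:
    """Return which candidate is preferred under a simple length heuristic."""
    return {"preferred": "a" if len(response_a) <= len(response_b) else "b"}

print(prefer_shorter("Paris.", "The capital of France is Paris, a city which..."))
# {'preferred': 'a'}
```
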
@@ -224,7 +225,7 @@ LangSmith evaluations are kicked off using a single function, `evaluate`, which

:::tip

See documentation on using `evaluate` [here](https://docs.smith.langchain.com/how_to_guides/evaluation/evaluate_llm_application#step-4-run-the-evaluation-and-view-the-results).
See documentation on using `evaluate` [here](https://docs.smith.langchain.com/how_to_guides/evaluation/evaluate_llm_application).

:::

@@ -235,7 +236,7 @@ One of the most common questions when evaluating AI applications is: how can I b
:::tip

- See the [video on `Repetitions` in our LangSmith Evaluation series](https://youtu.be/Pvz24JdzzF8)
- See our documentation on [`Repetitions`](https://docs.smith.langchain.com/how_to_guides/evaluation/evaluate_llm_application#evaluate-on-a-dataset-with-repetitions)
- See our documentation on [`Repetitions`](https://docs.smith.langchain.com/how_to_guides/evaluation/repetition)

:::

@@ -433,7 +434,7 @@ Classification / Tagging applies a label to a given input (e.g., for toxicity de

A central consideration for Classification / Tagging evaluation is whether you have a dataset with `reference` labels or not. If not, users frequently want to define an evaluator that uses criteria to apply a label (e.g., toxicity) to an input (e.g., text, a user question). However, if ground truth class labels are provided, then the evaluation objective is to score a Classification / Tagging chain relative to the ground truth class label (e.g., using metrics such as precision and recall).

If ground truth reference labels are provided, then it's common to simply define a [custom heuristic evaluator](https://docs.smith.langchain.com/how_to_guides/evaluation/evaluate_llm_application#use-custom-evaluators) to compare ground truth labels to the chain output. However, given the emergence of LLMs, it is increasingly common to simply use `LLM-as-judge` to perform the Classification / Tagging of an input based upon specified criteria (without a ground truth reference).
If ground truth reference labels are provided, then it's common to simply define a [custom heuristic evaluator](https://docs.smith.langchain.com/how_to_guides/evaluation/custom_evaluator) to compare ground truth labels to the chain output. However, given the emergence of LLMs, it is increasingly common to simply use `LLM-as-judge` to perform the Classification / Tagging of an input based upon specified criteria (without a ground truth reference).

`Online` or `Offline` evaluation is feasible when using `LLM-as-judge` with the `Reference-free` prompt used. In particular, this is well suited to `Online` evaluation when a user wants to tag / classify application input (e.g., for toxicity, etc).
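
For the ground-truth case, a minimal sketch of such a heuristic (plain Python, with a made-up `toxic` label; not a specific LangSmith evaluator signature) could compute precision and recall over predicted vs. reference labels:

```python
# Compare predicted class labels against ground-truth labels for one "positive" class.
def label_metrics(predicted: list[str], reference: list[str], positive: str = "toxic") -> dict:
    tp = sum(p == positive and r == positive for p, r in zip(predicted, reference))
    fp = sum(p == positive and r != positive for p, r in zip(predicted, reference))
    fn = sum(p != positive and r == positive for p, r in zip(predicted, reference))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"precision": precision, "recall": recall}

print(label_metrics(["toxic", "ok", "toxic"], ["toxic", "toxic", "ok"]))
# {'precision': 0.5, 'recall': 0.5}
```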

@@ -2,7 +2,7 @@
sidebar_position: 6
---

# Export filtered traces from experiment to dataset
# How to export filtered traces from experiment to dataset

After running an offline evaluation in LangSmith, you may want to export traces that met some evaluation criteria to a dataset.

@@ -2,7 +2,7 @@
sidebar_position: 1
---

# Manage datasets in the application
# How to manage datasets in the UI

:::tip Recommended Reading
Before diving into this content, it might be helpful to read the following:
@@ -8,7 +8,7 @@ import {
TypeScriptBlock,
} from "@site/src/components/InstructionsWithCode";

# Manage datasets programmatically
# How to manage datasets programmatically

You can use the Python and TypeScript SDK to manage datasets programmatically. This includes creating, updating, and deleting datasets, as well as adding examples to them.
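
For orientation, a minimal sketch of the create/add/list flow with the Python `Client` (the dataset name and example contents are made up; the same `create_dataset` / `create_examples` calls appear in the examples added later in this PR):

```python
from langsmith import Client

client = Client()

# Create a dataset and add a couple of examples to it.
dataset = client.create_dataset("my-qa-dataset")
client.create_examples(
    dataset_name=dataset.name,
    inputs=[{"question": "What is LangSmith?"}, {"question": "What is an evaluator?"}],
    outputs=[{"answer": "A platform for tracing and evaluating LLM apps."}, {"answer": "A function that scores outputs."}],
)

# Read the examples back.
for example in client.list_examples(dataset_name=dataset.name):
    print(example.inputs, example.outputs)
```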

@@ -382,9 +382,9 @@ Additionally, you can also chain multiple filters together using the `and` operator
tabs={[
PythonBlock(
`examples = client.list_examples(
dataset_name=dataset_name,
filter='and(not(has(metadata, \\'{"foo": "bar"}\\')), exists(metadata, "tenant_id"))'
)`
dataset_name=dataset_name,
filter='and(not(has(metadata, \\'{"foo": "bar"}\\')), exists(metadata, "tenant_id"))'
)`
),
TypeScriptBlock(
`const examples = await client.listExamples({datasetName: datasetName, filter: 'and(not(has(metadata, \\'{"foo": "bar"}\\')), exists(metadata, "tenant_id"))'});`
2 changes: 1 addition & 1 deletion docs/evaluation/how_to_guides/datasets/share_dataset.mdx
@@ -4,7 +4,7 @@ sidebar_position: 4

import { RegionalUrl } from "@site/src/components/RegionalUrls";

# Share or unshare a dataset publicly
# How to share or unshare a dataset publicly

:::caution

4 changes: 2 additions & 2 deletions docs/evaluation/how_to_guides/datasets/version_datasets.mdx
@@ -2,7 +2,7 @@
sidebar_position: 3
---

# Version datasets
# How to version datasets

In LangSmith, datasets are versioned. This means that every time you add, update, or delete examples in your dataset, a new version of the dataset is created.
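
As a rough sketch of what this enables (a minimal example assuming the `update_dataset_tag` call shown in the hunk below; the dataset name, tag, and the `as_of` filter on `list_examples` are illustrative assumptions):

```python
from datetime import datetime, timezone
from langsmith import Client

client = Client()

# Tag the dataset version that exists right now so it can be referenced later.
client.update_dataset_tag(
    dataset_name="my-qa-dataset",
    as_of=datetime.now(timezone.utc),  # which version to tag, identified by timestamp
    tag="prod",
)

# Later, read the examples as they were at the tagged version.
examples = list(client.list_examples(dataset_name="my-qa-dataset", as_of="prod"))
```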

@@ -46,4 +46,4 @@ client.update_dataset_tag(
)
```

To run an evaluation on a particular tagged version of a dataset, you can follow [this guide](../evaluation/evaluate_llm_application#evaluate-on-a-particular-version-of-a-dataset).
To run an evaluation on a particular tagged version of a dataset, you can follow [this guide](../evaluation/dataset_version).
80 changes: 80 additions & 0 deletions docs/evaluation/how_to_guides/evaluation/async.mdx
@@ -0,0 +1,80 @@
import { CodeTabs, python } from "@site/src/components/InstructionsWithCode";

# How to run an evaluation asynchronously

:::info Key concepts

[Evaluations](../../concepts#applying-evaluations) | [Evaluators](../../concepts#evaluators) | [Datasets](../../concepts#datasets) | [Experiments](../../concepts#experiments)

:::

We can run evaluations asynchronously via the SDK using [aevaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._arunner.aevaluate.html),
which accepts all of the same arguments as [evaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._runner.evaluate.html) but expects the application function to be asynchronous.
You can learn more about how to use the `evaluate()` function [here](../../how_to_guides/evaluation/evaluate_llm_application).

:::info Python only

This guide is only relevant when using the Python SDK.
In JS/TS the `evaluate()` function is already async.
You can see how to use it [here](../../how_to_guides/evaluation/evaluate_llm_application).

:::

## Use `aevaluate()`

<CodeTabs
groupId="client-language"
tabs={[
python({caption: "Requires `langsmith>=0.1.145`"})`
from langsmith import aevaluate, traceable, wrappers, Client
from openai import AsyncOpenAI

# Optionally wrap the OpenAI client to trace all model calls.
oai_client = wrappers.wrap_openai(AsyncOpenAI())

# Optionally add the 'traceable' decorator to trace the inputs/outputs of this function.
@traceable
async def researcher_app(inputs: dict) -> str:
instructions = """You are an excellent researcher. Given a high-level research idea, \\

list 5 concrete questions that should be investigated to determine if the idea is worth pursuing."""

response = await oai_client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": instructions},
{"role": "user", "content": inputs["idea"]},
],
)
return response.choices[0].message.content

# Evaluator functions can be sync or async
def concise(inputs: dict, outputs: dict) -> bool:
return len(outputs["output"]) < 3 * len(inputs["idea"])

ls_client = Client()

examples = ["universal basic income", "nuclear fusion", "hyperloop", "nuclear powered rockets"]
dataset = ls_client.create_dataset("research ideas")
ls_client.create_examples(
dataset_name=dataset.name,
inputs=[{"idea": e} for e in examples,
)

results = await aevaluate(
researcher_app,
data=dataset,
evaluators=[concise],
# Optional, no max_concurrency by default but it is recommended to set one.
max_concurrency=2,
experiment_prefix="gpt-4o-mini-baseline" # Optional, random by default.
)
`,

]}
/>

## Related

- [Run an evaluation (synchronously)](../../how_to_guides/evaluation/evaluate_llm_application)
- [Handle model rate limits](../../how_to_guides/evaluation/rate_limiting)
@@ -8,7 +8,7 @@ import {
python,
} from "@site/src/components/InstructionsWithCode";

# Audit evaluator scores
# How to audit evaluator scores

LLM-as-a-judge evaluators don't always get it right. Because of this, it is often useful for a human to manually audit the scores left by an evaluator and correct them where necessary. LangSmith allows you to make corrections on evaluator scores in the UI or SDK.

@@ -2,7 +2,7 @@
sidebar_position: 2
---

# Bind an evaluator to a dataset in the UI
# How to bind an evaluator to a dataset in the UI

While you can specify evaluators to grade the results of your experiments programmatically (see [this guide](./evaluate_llm_application) for more information), you can also bind evaluators to a dataset in the UI.
This allows you to configure automatic evaluators that grade your experiment results. We support both LLM-based evaluators and custom Python code evaluators.
@@ -2,7 +2,7 @@
sidebar_position: 5
---

# Compare experiment results
# How to compare experiment results

Oftentimes, when you are iterating on your LLM application (such as changing the model or the prompt), you will want to compare the results of different experiments.

@@ -2,7 +2,7 @@
sidebar_position: 10
---

# Create few-shot evaluators
# How to create few-shot evaluators

Using LLM-as-a-Judge evaluators can be very helpful when you can't evaluate your system programmatically. However, improving/iterating on these prompts can add unnecessary
overhead to the development process of an LLM-based application - you now need to maintain both your application **and** your evaluators. To make this process easier, LangSmith allows
142 changes: 142 additions & 0 deletions docs/evaluation/how_to_guides/evaluation/custom_evaluator.mdx
@@ -0,0 +1,142 @@
import {
CodeTabs,
python,
typescript,
} from "@site/src/components/InstructionsWithCode";

# How to define a custom evaluator

:::info Key concepts

- [Evaluators](../../concepts#evaluators)

:::

Custom evaluators are just functions that take a dataset example and the resulting application output, and return one or more metrics.
These functions can be passed directly into [evaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._runner.evaluate.html) / [aevaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._arunner.aevaluate.html).

## Basic example

<CodeTabs
groupId="client-language"
tabs={[
python`
from langsmith import evaluate

def correct(outputs: dict, reference_outputs: dict) -> bool:
"""Check if the answer exactly matches the expected answer."""
return outputs["answer"] == reference_outputs["answer"]

def dummy_app(inputs: dict) -> dict:
return {"answer": "hmm i'm not sure", "reasoning": "i didn't understand the question"}

results = evaluate(
dummy_app,
data="dataset_name",
evaluators=[correct]
)
`,
typescript`
import type { EvaluationResult } from "langsmith/evaluation";
import type { Run, Example } from "langsmith/schemas";

function correct(run: Run, example: Example): EvaluationResult {
const score = run.outputs?.output === example.outputs?.output;
return { key: "correct", score };
}
`,

]}
/>

## Evaluator args

Custom evaluator functions must have specific argument names. They can take any subset of the following arguments:

Python and JS/TS

- `run: langsmith.schemas.Run`: The full Run object generated by the application on the given example.
- `example: langsmith.schemas.Example`: The full dataset Example, including the example inputs, outputs (if available), and metadata (if available).

Currently Python only

- `inputs: dict`: A dictionary of the inputs corresponding to a single example in a dataset.
- `outputs: dict`: A dictionary of the outputs generated by the application on the given `inputs`.
- `reference_outputs: dict`: A dictionary of the reference outputs associated with the example, if available.

For most use cases you'll only need `inputs`, `outputs`, and `reference_outputs`. `run` and `example` are useful only if you need some extra trace or example metadata outside of the actual inputs and outputs of the application.
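
For example, here is a sketch of an evaluator that only needs `run` and `example` metadata (the 5-second latency budget and the `answer` output key are assumptions for illustration):

```python
from langsmith.schemas import Example, Run

def answered_within_budget(run: Run, example: Example) -> bool:
    """Pass only if the app produced an answer and the trace finished within 5 seconds."""
    has_answer = bool((run.outputs or {}).get("answer"))
    if run.end_time is None:
        return False
    latency_s = (run.end_time - run.start_time).total_seconds()
    return has_answer and latency_s < 5
```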

## Evaluator output

Custom evaluators are expected to return one of the following types:

Python and JS/TS

- `dict`: dicts of the form `{"score" | "value": ..., "name": ...}` allow you to customize the metric type ("score" for numerical and "value" for categorical) and metric name. This is useful if, for example, you want to log an integer as a categorical metric.

Currently Python only

- `int | float | bool`: this is interpreted as a continuous metric that can be averaged, sorted, etc. The function name is used as the name of the metric.
- `str`: this is interpreted as a categorical metric. The function name is used as the name of the metric.
- `list[dict]`: return multiple metrics using a single function (see the sketch below).
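
A sketch of the dict-based return styles listed above (the metric names and the "sincerely" heuristic are made up for illustration):

```python
# Log a categorical metric by returning "value" with a custom "name".
def tone(outputs: dict) -> dict:
    is_formal = "sincerely" in outputs["answer"].lower()
    return {"name": "tone", "value": "formal" if is_formal else "casual"}

# Return several metrics from a single evaluator as a list of dicts.
def shape(outputs: dict) -> list[dict]:
    answer = outputs["answer"]
    return [
        {"name": "num_chars", "score": len(answer)},
        {"name": "num_lines", "score": answer.count("\n") + 1},
    ]
```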

## Additional examples

<CodeTabs
groupId="client-language"
tabs={[
python({caption: "Requires `langsmith>=0.1.145`"})`
from langsmith import evaluate, wrappers
from openai import AsyncOpenAI
# Assumes you've installed pydantic.
from pydantic import BaseModel

# Compare actual and reference outputs
def correct(outputs: dict, reference_outputs: dict) -> bool:
"""Check if the answer exactly matches the expected answer."""
return outputs["answer"] == reference_outputs["answer"]

# Just evaluate actual outputs
def concision(outputs: dict) -> int:
"""Score how concise the answer is. 1 is the most concise, 5 is the least concise."""
return min(len(outputs["answer"]) // 1000, 4) + 1

# Use an LLM-as-a-judge
oai_client = wrappers.wrap_openai(AsyncOpenAI())

async def valid_reasoning(inputs: dict, outputs: dict) -> bool:
"""Use an LLM to judge if the reasoning and the answer are consistent."""

instructions = """\\

Given the following question, answer, and reasoning, determine if the reasoning for the \\
answer is logically valid and consistent with question and the answer."""

class Response(BaseModel):
reasoning_is_valid: bool

msg = f"Question: {inputs['question']}\\nAnswer: {outputs['answer']}\\nReasoning: {outputs['reasoning']}"
response = await oai_client.beta.chat.completions.parse(
model="gpt-4o-mini",
messages=[{"role": "system", "content": instructions,}, {"role": "user", "content": msg}],
response_format=Response
)
return response.choices[0].message.parsed.reasoning_is_valid

def dummy_app(inputs: dict) -> dict:
return {"answer": "hmm i'm not sure", "reasoning": "i didn't understand the question"}

results = evaluate(
dummy_app,
data="dataset_name",
evaluators=[correct, concision, valid_reasoning]
)
`,

]}
/>

## Related

- [Evaluate aggregate experiment results](../../how_to_guides/evaluation/summary): Define summary evaluators, which compute metrics for an entire experiment.
- [Run an evaluation comparing two experiments](../../how_to_guides/evaluation/evaluate_pairwise): Define pairwise evaluators, which compute metrics by comparing two (or more) experiments against each other.