wip: eval how to revamp #525

Merged · 30 commits · Nov 23, 2024
11 changes: 6 additions & 5 deletions docs/evaluation/concepts/index.mdx
@@ -1,4 +1,4 @@
# Concepts
# Evaluation concepts

The pace of AI application development is often rate-limited by high-quality evaluations because there is a paradox of choice: developers often wonder how to engineer their prompt or which LLM best balances accuracy, latency, and cost. High-quality evaluations can help you rapidly answer these types of questions with confidence.

@@ -130,7 +130,8 @@ See documentation on our workflow to audit and manually correct evaluator scores

### Pairwise

Pairwise evaluators pick the better of two task outputs based upon some criteria.
Pairwise evaluators allow you to compare the outputs of two versions of an application.
Think [LMSYS Chatbot Arena](https://chat.lmsys.org/) - this is the same concept, but applied to AI applications more generally, not just models!
This can be done with a heuristic ("which response is longer"), an LLM (with a specific pairwise prompt), or a human (asking them to manually annotate examples).

**When should you use pairwise evaluation?** Pairwise evaluation is helpful when it is difficult to directly score an LLM output, but easier to compare two outputs.
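
To make the heuristic case concrete, here is a minimal sketch in plain Python (illustrative only; it is not the LangSmith pairwise API, which the pairwise evaluation how-to guide covers):

```python
# Illustrative pairwise heuristic: given two candidate responses to the same input,
# prefer the shorter one.
def prefer_shorter(response_a: str, response_b: str) -> dict:
    """Return which candidate is preferred under a simple length heuristic."""
    return {"preferred": "a" if len(response_a) <= len(response_b) else "b"}

print(prefer_shorter("Paris.", "The capital of France is Paris, a city which..."))
# {'preferred': 'a'}
```
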
@@ -224,7 +225,7 @@ LangSmith evaluations are kicked off using a single function, `evaluate`, which

:::tip

See documentation on using `evaluate` [here](https://docs.smith.langchain.com/how_to_guides/evaluation/evaluate_llm_application#step-4-run-the-evaluation-and-view-the-results).
See documentation on using `evaluate` [here](https://docs.smith.langchain.com/how_to_guides/evaluation/evaluate_llm_application).

:::

@@ -235,7 +236,7 @@ One of the most common questions when evaluating AI applications is: how can I b
:::tip

- See the [video on `Repetitions` in our LangSmith Evaluation series](https://youtu.be/Pvz24JdzzF8)
- See our documentation on [`Repetitions`](https://docs.smith.langchain.com/how_to_guides/evaluation/evaluate_llm_application#evaluate-on-a-dataset-with-repetitions)
- See our documentation on [`Repetitions`](https://docs.smith.langchain.com/how_to_guides/evaluation/repetition)

:::

@@ -433,7 +434,7 @@ Classification / Tagging applies a label to a given input (e.g., for toxicity de

A central consideration for Classification / Tagging evaluation is whether you have a dataset with `reference` labels or not. If not, users frequently want to define an evaluator that uses criteria to apply a label (e.g., toxicity) to an input (e.g., text, a user question). However, if ground truth class labels are provided, then the evaluation objective is to score a Classification / Tagging chain relative to the ground truth class label (e.g., using metrics such as precision and recall).

If ground truth reference labels are provided, then it's common to simply define a [custom heuristic evaluator](https://docs.smith.langchain.com/how_to_guides/evaluation/evaluate_llm_application#use-custom-evaluators) to compare ground truth labels to the chain output. However, given the emergence of LLMs, it is increasingly common to simply use `LLM-as-judge` to perform the Classification / Tagging of an input based upon specified criteria (without a ground truth reference).
If ground truth reference labels are provided, then it's common to simply define a [custom heuristic evaluator](https://docs.smith.langchain.com/how_to_guides/evaluation/custom_evaluator) to compare ground truth labels to the chain output. However, given the emergence of LLMs, it is increasingly common to simply use `LLM-as-judge` to perform the Classification / Tagging of an input based upon specified criteria (without a ground truth reference).

`Online` or `Offline` evaluation is feasible when using `LLM-as-judge` with the `Reference-free` prompt used. In particular, this is well suited to `Online` evaluation when a user wants to tag / classify application input (e.g., for toxicity, etc).
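
For the ground-truth case, a minimal sketch of such a heuristic (plain Python, with a made-up `toxic` label; not a specific LangSmith evaluator signature) could compute precision and recall over predicted vs. reference labels:

```python
# Compare predicted class labels against ground-truth labels for one "positive" class.
def label_metrics(predicted: list[str], reference: list[str], positive: str = "toxic") -> dict:
    tp = sum(p == positive and r == positive for p, r in zip(predicted, reference))
    fp = sum(p == positive and r != positive for p, r in zip(predicted, reference))
    fn = sum(p != positive and r == positive for p, r in zip(predicted, reference))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"precision": precision, "recall": recall}

print(label_metrics(["toxic", "ok", "toxic"], ["toxic", "toxic", "ok"]))
# {'precision': 0.5, 'recall': 0.5}
```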

@@ -2,7 +2,7 @@
sidebar_position: 6
---

# Export filtered traces from experiment to dataset
# How to export filtered traces from experiment to dataset

After running an offline evaluation in LangSmith, you may want to export traces that met some evaluation criteria to a dataset.

@@ -2,7 +2,7 @@
sidebar_position: 1
---

# Manage datasets in the application
# How to manage datasets in the UI

:::tip Recommended Reading
Before diving into this content, it might be helpful to read the following:
@@ -8,7 +8,7 @@ import {
TypeScriptBlock,
} from "@site/src/components/InstructionsWithCode";

# Manage datasets programmatically
# How to manage datasets programmatically

You can use the Python and TypeScript SDK to manage datasets programmatically. This includes creating, updating, and deleting datasets, as well as adding examples to them.
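
For orientation, a minimal sketch of the create/add/list flow with the Python `Client` (the dataset name and example contents are made up; the same `create_dataset` / `create_examples` calls appear in the examples added later in this PR):

```python
from langsmith import Client

client = Client()

# Create a dataset and add a couple of examples to it.
dataset = client.create_dataset("my-qa-dataset")
client.create_examples(
    dataset_name=dataset.name,
    inputs=[{"question": "What is LangSmith?"}, {"question": "What is an evaluator?"}],
    outputs=[{"answer": "A platform for tracing and evaluating LLM apps."}, {"answer": "A function that scores outputs."}],
)

# Read the examples back.
for example in client.list_examples(dataset_name=dataset.name):
    print(example.inputs, example.outputs)
```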

@@ -382,9 +382,9 @@ Additionally, you can also chain multiple filters together using the `and` operator
tabs={[
PythonBlock(
`examples = client.list_examples(
dataset_name=dataset_name,
filter='and(not(has(metadata, \\'{"foo": "bar"}\\')), exists(metadata, "tenant_id"))'
)`
dataset_name=dataset_name,
filter='and(not(has(metadata, \\'{"foo": "bar"}\\')), exists(metadata, "tenant_id"))'
)`
),
TypeScriptBlock(
`const examples = await client.listExamples({datasetName: datasetName, filter: 'and(not(has(metadata, \\'{"foo": "bar"}\\')), exists(metadata, "tenant_id"))'});`
2 changes: 1 addition & 1 deletion docs/evaluation/how_to_guides/datasets/share_dataset.mdx
@@ -4,7 +4,7 @@ sidebar_position: 4

import { RegionalUrl } from "@site/src/components/RegionalUrls";

# Share or unshare a dataset publicly
# How to share or unshare a dataset publicly

:::caution

4 changes: 2 additions & 2 deletions docs/evaluation/how_to_guides/datasets/version_datasets.mdx
@@ -2,7 +2,7 @@
sidebar_position: 3
---

# Version datasets
# How to version datasets

In LangSmith, datasets are versioned. This means that every time you add, update, or delete examples in your dataset, a new version of the dataset is created.
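
As a rough sketch of what this enables (a minimal example assuming the `update_dataset_tag` call shown in the hunk below; the dataset name, tag, and the `as_of` filter on `list_examples` are illustrative assumptions):

```python
from datetime import datetime, timezone
from langsmith import Client

client = Client()

# Tag the dataset version that exists right now so it can be referenced later.
client.update_dataset_tag(
    dataset_name="my-qa-dataset",
    as_of=datetime.now(timezone.utc),  # which version to tag, identified by timestamp
    tag="prod",
)

# Later, read the examples as they were at the tagged version.
examples = list(client.list_examples(dataset_name="my-qa-dataset", as_of="prod"))
```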

@@ -46,4 +46,4 @@ client.update_dataset_tag(
)
```

To run an evaluation on a particular tagged version of a dataset, you can follow [this guide](../evaluation/evaluate_llm_application#evaluate-on-a-particular-version-of-a-dataset).
To run an evaluation on a particular tagged version of a dataset, you can follow [this guide](../evaluation/dataset_version).
80 changes: 80 additions & 0 deletions docs/evaluation/how_to_guides/evaluation/async.mdx
@@ -0,0 +1,80 @@
import { CodeTabs, python } from "@site/src/components/InstructionsWithCode";

# How to run an evaluation asynchronously

:::info Key concepts

[Evaluations](../../concepts#applying-evaluations) | [Evaluators](../../concepts#evaluators) | [Datasets](../../concepts#datasets) | [Experiments](../../concepts#experiments)

:::

We can run evaluations asynchronously via the SDK using [aevaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._arunner.aevaluate.html),
which accepts all of the same arguments as [evaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._runner.evaluate.html) but expects the application function to be asynchronous.
You can learn more about how to use the `evaluate()` function [here](../../how_to_guides/evaluation/evaluate_llm_application).

:::info Python only

This guide is only relevant when using the Python SDK.
In JS/TS the `evaluate()` function is already async.
You can see how to use it [here](../../how_to_guides/evaluation/evaluate_llm_application).

:::

## Use `aevaluate()`

<CodeTabs
groupId="client-language"
tabs={[
python({caption: "Requires `langsmith>=0.1.145`"})`
from langsmith import aevaluate, traceable, wrappers, Client
from openai import AsyncOpenAI

# Optionally wrap the OpenAI client to trace all model calls.
oai_client = wrappers.wrap_openai(AsyncOpenAI())

# Optionally add the 'traceable' decorator to trace the inputs/outputs of this function.
@traceable
async def researcher_app(inputs: dict) -> str:
instructions = """You are an excellent researcher. Given a high-level research idea, \\

list 5 concrete questions that should be investigated to determine if the idea is worth pursuing."""

response = await oai_client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": instructions},
{"role": "user", "content": inputs["idea"]},
],
)
return response.choices[0].message.content

# Evaluator functions can be sync or async
def concise(inputs: dict, outputs: dict) -> bool:
return len(outputs["output"]) < 3 * len(inputs["idea"])

ls_client = Client()

examples = ["universal basic income", "nuclear fusion", "hyperloop", "nuclear powered rockets"]
dataset = ls_client.create_dataset("research ideas")
ls_client.create_examples(
dataset_name=dataset.name,
inputs=[{"idea": e} for e in examples,
)

results = await aevaluate(
researcher_app,
data=dataset,
evaluators=[concise],
# Optional, no max_concurrency by default but it is recommended to set one.
max_concurrency=2,
experiment_prefix="gpt-4o-mini-baseline" # Optional, random by default.
)
`,

]}
/>

## Related

- [Run an evaluation (synchronously)](../../how_to_guides/evaluation/evaluate_llm_application)
- [Handle model rate limits](../../how_to_guides/evaluation/rate_limiting)
@@ -8,7 +8,7 @@ import {
python,
} from "@site/src/components/InstructionsWithCode";

# Audit evaluator scores
# How to audit evaluator scores

LLM-as-a-judge evaluators don't always get it right. Because of this, it is often useful for a human to manually audit the scores left by an evaluator and correct them where necessary. LangSmith allows you to make corrections on evaluator scores in the UI or SDK.

@@ -2,7 +2,7 @@
sidebar_position: 2
---

# Bind an evaluator to a dataset in the UI
# How to bind an evaluator to a dataset in the UI

While you can specify evaluators to grade the results of your experiments programmatically (see [this guide](./evaluate_llm_application) for more information), you can also bind evaluators to a dataset in the UI.
This allows you to configure automatic evaluators that grade your experiment results. We support both LLM-based evaluators and custom Python code evaluators.
@@ -2,7 +2,7 @@
sidebar_position: 5
---

# Compare experiment results
# How to compare experiment results

Oftentimes, when you are iterating on your LLM application (such as changing the model or the prompt), you will want to compare the results of different experiments.

@@ -2,7 +2,7 @@
sidebar_position: 10
---

# Create few-shot evaluators
# How to create few-shot evaluators

Using LLM-as-a-Judge evaluators can be very helpful when you can't evaluate your system programmatically. However, improving/iterating on these prompts can add unnecessary
overhead to the development process of an LLM-based application - you now need to maintain both your application **and** your evaluators. To make this process easier, LangSmith allows
142 changes: 142 additions & 0 deletions docs/evaluation/how_to_guides/evaluation/custom_evaluator.mdx
@@ -0,0 +1,142 @@
import {
CodeTabs,
python,
typescript,
} from "@site/src/components/InstructionsWithCode";

# How to define a custom evaluator

:::info Key concepts

- [Evaluators](../../concepts#evaluators)

:::

Custom evaluators are just functions that take a dataset example and the resulting application output, and return one or more metrics.
These functions can be passed directly into [evaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._runner.evaluate.html) / [aevaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._arunner.aevaluate.html).

## Basic example

<CodeTabs
groupId="client-language"
tabs={[
python`
from langsmith import evaluate

def correct(outputs: dict, reference_outputs: dict) -> bool:
"""Check if the answer exactly matches the expected answer."""
return outputs["answer"] == reference_outputs["answer"]

def dummy_app(inputs: dict) -> dict:
return {"answer": "hmm i'm not sure", "reasoning": "i didn't understand the question"}

results = evaluate(
dummy_app,
data="dataset_name",
evaluators=[correct]
)
`,
typescript`
import type { EvaluationResult } from "langsmith/evaluation";
import type { Run, Example } from "langsmith/schemas";

function correct(run: Run, example: Example): EvaluationResult {
const score = run.outputs?.output === example.outputs?.output;
return { key: "correct", score };
}
`,

]}
/>

## Evaluator args

Custom evaluator functions must have specific argument names. They can take any subset of the following arguments:

Python and JS/TS

- `run: langsmith.schemas.Run`: The full Run object generated by the application on the given example.
- `example: langsmith.schemas.Example`: The full dataset Example, including the example inputs, outputs (if available), and metadata (if available).

Currently Python only

- `inputs: dict`: A dictionary of the inputs corresponding to a single example in a dataset.
- `outputs: dict`: A dictionary of the outputs generated by the application on the given `inputs`.
- `reference_outputs: dict`: A dictionary of the reference outputs associated with the example, if available.

For most use cases you'll only need `inputs`, `outputs`, and `reference_outputs`. `run` and `example` are useful only if you need some extra trace or example metadata outside of the actual inputs and outputs of the application.
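
For example, here is a sketch of an evaluator that only needs `run` and `example` metadata (the 5-second latency budget and the `answer` output key are assumptions for illustration):

```python
from langsmith.schemas import Example, Run

def answered_within_budget(run: Run, example: Example) -> bool:
    """Pass only if the app produced an answer and the trace finished within 5 seconds."""
    has_answer = bool((run.outputs or {}).get("answer"))
    if run.end_time is None:
        return False
    latency_s = (run.end_time - run.start_time).total_seconds()
    return has_answer and latency_s < 5
```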

## Evaluator output

Custom evaluators are expected to return one of the following types:

Python and JS/TS

- `dict`: dicts of the form `{"score" | "value": ..., "name": ...}` allow you to customize the metric type ("score" for numerical and "value" for categorical) and metric name. This is useful if, for example, you want to log an integer as a categorical metric.

Currently Python only

- `int | float | bool`: this is interpreted as a continuous metric that can be averaged, sorted, etc. The function name is used as the name of the metric.
- `str`: this is interpreted as a categorical metric. The function name is used as the name of the metric.
- `list[dict]`: return multiple metrics using a single function (see the sketch below).
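
A sketch of the dict-based return styles listed above (the metric names and the "sincerely" heuristic are made up for illustration):

```python
# Log a categorical metric by returning "value" with a custom "name".
def tone(outputs: dict) -> dict:
    is_formal = "sincerely" in outputs["answer"].lower()
    return {"name": "tone", "value": "formal" if is_formal else "casual"}

# Return several metrics from a single evaluator as a list of dicts.
def shape(outputs: dict) -> list[dict]:
    answer = outputs["answer"]
    return [
        {"name": "num_chars", "score": len(answer)},
        {"name": "num_lines", "score": answer.count("\n") + 1},
    ]
```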

## Additional examples

<CodeTabs
groupId="client-language"
tabs={[
python({caption: "Requires `langsmith>=0.1.145`"})`
from langsmith import evaluate, wrappers
from openai import AsyncOpenAI
# Assumes you've installed pydantic.
from pydantic import BaseModel

# Compare actual and reference outputs
def correct(outputs: dict, reference_outputs: dict) -> bool:
"""Check if the answer exactly matches the expected answer."""
return outputs["answer"] == reference_outputs["answer"]

# Just evaluate actual outputs
def concision(outputs: dict) -> int:
"""Score how concise the answer is. 1 is the most concise, 5 is the least concise."""
return min(len(outputs["answer"]) // 1000, 4) + 1

# Use an LLM-as-a-judge
oai_client = wrappers.wrap_openai(AsyncOpenAI())

async def valid_reasoning(inputs: dict, outputs: dict) -> bool:
"""Use an LLM to judge if the reasoning and the answer are consistent."""

instructions = """\\

Given the following question, answer, and reasoning, determine if the reasoning for the \\
answer is logically valid and consistent with question and the answer."""

class Response(BaseModel):
reasoning_is_valid: bool

msg = f"Question: {inputs['question']}\\nAnswer: {outputs['answer']}\\nReasoning: {outputs['reasoning']}"
response = await oai_client.beta.chat.completions.parse(
model="gpt-4o-mini",
messages=[{"role": "system", "content": instructions,}, {"role": "user", "content": msg}],
response_format=Response
)
return response.choices[0].message.parsed.reasoning_is_valid

def dummy_app(inputs: dict) -> dict:
return {"answer": "hmm i'm not sure", "reasoning": "i didn't understand the question"}

results = evaluate(
dummy_app,
data="dataset_name",
evaluators=[correct, concision, valid_reasoning]
)
`,

]}
/>

## Related

- [Evaluate aggregate experiment results](../../how_to_guides/evaluation/summary): Define summary evaluators, which compute metrics for an entire experiment.
- [Run an evaluation comparing two experiments](../../how_to_guides/evaluation/evaluate_pairwise): Define pairwise evaluators, which compute metrics by comparing two (or more) experiments against each other.