Flatten eval how to guides directory #552

Merged: 11 commits, Nov 26, 2024
@@ -144,7 +144,7 @@ If the header is not present, operations will default to the workspace the API key
## Security Settings

:::note
"Shared resources" in this context refer to [public prompts](../../../prompt_engineering/how_to_guides/prompts/create_a_prompt#save-your-prompt), [shared runs](../../../observability/how_to_guides/tracing/share_trace), and [shared datasets](../../../evaluation/how_to_guides/datasets/share_dataset.mdx).
"Shared resources" in this context refer to [public prompts](../../../prompt_engineering/how_to_guides/prompts/create_a_prompt#save-your-prompt), [shared runs](../../../observability/how_to_guides/tracing/share_trace), and [shared datasets](../../../evaluation/how_to_guides/share_dataset.mdx).
:::

- <RegionalUrl
18 changes: 9 additions & 9 deletions docs/evaluation/concepts/index.mdx
@@ -66,7 +66,7 @@ When setting up your evaluation, you may want to partition your dataset into different splits
To learn more about creating dataset splits in LangSmith:

- See our video on [`dataset splits`](https://youtu.be/FQMn_FQV-fI?feature=shared) in the LangSmith Evaluation series.
- See our documentation [here](https://docs.smith.langchain.com/how_to_guides/datasets/manage_datasets_in_application#create-and-manage-dataset-splits).
- See our documentation [here](./how_to_guides/manage_datasets_in_application#create-and-manage-dataset-splits).

:::

@@ -105,7 +105,7 @@ Heuristic evaluators are hard-coded functions that perform computations to determine
For some tasks, like code generation, custom heuristic evaluations (e.g., checking imports or executing the generated code) are often extremely useful and superior to other approaches (e.g., LLM-as-judge, discussed below).

- Watch the [`Custom evaluator` video in our LangSmith Evaluation series](https://www.youtube.com/watch?v=w31v_kFvcNw) for a comprehensive overview.
- Read our [documentation](https://docs.smith.langchain.com/how_to_guides/evaluation/evaluate_on_intermediate_steps#3-define-your-custom-evaluators) on custom evaluators.
- Read our [documentation](./how_to_guides/custom_evaluator) on custom evaluators.
- See our [blog](https://blog.langchain.dev/code-execution-with-langgraph/) using custom evaluators for code generation.

:::
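
As a rough illustration, a heuristic evaluator for code generation might simply check that the generated code parses. The sketch below assumes a hypothetical `code` field in the run outputs and uses the `(run, example)` evaluator signature; a more thorough variant could execute the code in a sandbox, as the blog post above does.

```python
import ast

from langsmith.schemas import Example, Run


def code_parses(run: Run, example: Example) -> dict:
    """Heuristic check: does the generated code at least parse as valid Python?"""
    code = (run.outputs or {}).get("code", "")  # `code` is an assumed output field
    try:
        ast.parse(code)
        return {"key": "code_parses", "score": 1}
    except SyntaxError:
        return {"key": "code_parses", "score": 0}
```
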
@@ -124,7 +124,7 @@ With LLM-as-judge evaluators, it is important to carefully review the resulting scores

:::tip

See documentation on our workflow to audit and manually correct evaluator scores [here](https://docs.smith.langchain.com/how_to_guides/evaluation/audit_evaluator_scores).
See documentation on our workflow to audit and manually correct evaluator scores [here](./how_to_guides/audit_evaluator_scores).

:::
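
Reviewing judge output is easier when the evaluator records the model's reasoning alongside the score. Below is a minimal, hedged sketch of such an LLM-as-judge evaluator; the judge model, prompt, and `concise` criterion are illustrative assumptions rather than a prescribed setup.

```python
import json

from openai import OpenAI
from langsmith.schemas import Example, Run

judge = OpenAI()


def concise(run: Run, example: Example) -> dict:
    """LLM-as-judge: score whether the answer is concise, and keep the judge's reasoning."""
    answer = (run.outputs or {}).get("answer", "")  # `answer` is an assumed output field
    resp = judge.chat.completions.create(
        model="gpt-4o-mini",  # illustrative judge model
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": 'Grade the answer for conciseness. Reply as JSON: {"score": 0 or 1, "reasoning": "<why>"}',
            },
            {"role": "user", "content": answer},
        ],
    )
    graded = json.loads(resp.choices[0].message.content)
    # Returning a comment makes the judge's reasoning visible when auditing scores.
    return {"key": "concise", "score": graded["score"], "comment": graded.get("reasoning")}
```
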

@@ -225,7 +225,7 @@ LangSmith evaluations are kicked off using a single function, `evaluate`, which

:::tip

See documentation on using `evaluate` [here](https://docs.smith.langchain.com/how_to_guides/evaluation/evaluate_llm_application).
See documentation on using `evaluate` [here](./how_to_guides/evaluate_llm_application).

:::
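
For orientation, a minimal sketch of an `evaluate()` call is shown below; the target function, evaluator, and dataset name are placeholders, and the full set of supported arguments is covered in the guide linked above.

```python
from langsmith.evaluation import evaluate
from langsmith.schemas import Example, Run


def my_app(inputs: dict) -> dict:
    # Stand-in for your real application logic.
    return {"answer": inputs["question"].upper()}


def exact_match(run: Run, example: Example) -> dict:
    # Reference-based check: does the output match the reference answer exactly?
    predicted = (run.outputs or {}).get("answer")
    expected = (example.outputs or {}).get("answer")
    return {"key": "exact_match", "score": int(predicted == expected)}


results = evaluate(
    my_app,                       # the application being tested
    data="my-dataset",            # hypothetical dataset name
    evaluators=[exact_match],
    experiment_prefix="baseline",
)
```
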

@@ -236,7 +236,7 @@ One of the most common questions when evaluating AI applications is: how can I b
:::tip

- See the [video on `Repetitions` in our LangSmith Evaluation series](https://youtu.be/Pvz24JdzzF8)
- See our documentation on [`Repetitions`](https://docs.smith.langchain.com/how_to_guides/evaluation/repetition)
- See our documentation on [`Repetitions`](./how_to_guides/repetition)

:::
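
As a sketch, repetitions are requested on the same `evaluate()` call, assuming the Python SDK's `num_repetitions` argument; the target and dataset name below are placeholders.

```python
from langsmith.evaluation import evaluate

results = evaluate(
    lambda inputs: {"answer": "..."},  # stand-in for your application
    data="my-dataset",                 # hypothetical dataset name
    evaluators=[],
    num_repetitions=3,                 # run every example 3 times so score variance is visible
)
```
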

@@ -281,7 +281,7 @@ However, there are several downsides to this type of evaluation. First, it usually

:::tip

See our tutorial on [evaluating agent response](https://docs.smith.langchain.com/tutorials/Developers/agents#response-evaluation).
See our tutorial on [evaluating agent response](./tutorials/agents).

:::

@@ -299,7 +299,7 @@ There are several benefits to this type of evaluation. It allows you to evaluate

:::tip

See our tutorial on [evaluating a single step of an agent](https://docs.smith.langchain.com/tutorials/Developers/agents#single-step-evaluation).
See our tutorial on [evaluating a single step of an agent](./tutorials/agents#single-step-evaluation).

:::

@@ -319,7 +319,7 @@ However, none of these approaches evaluate the input to the tools; they only focus

:::tip

See our tutorial on [evaluating agent trajectory](https://docs.smith.langchain.com/tutorials/Developers/agents#trajectory).
See our tutorial on [evaluating agent trajectory](./tutorials/agents#trajectory).

:::

@@ -434,7 +434,7 @@ Classification / Tagging applies a label to a given input (e.g., for toxicity detection

A central consideration for Classification / Tagging evaluation is whether you have a dataset with `reference` labels or not. If not, users frequently want to define an evaluator that uses criteria to apply a label (e.g., toxicity) to an input (e.g., text, a user question). However, if ground truth class labels are provided, then the evaluation objective is to score a Classification / Tagging chain relative to those labels (e.g., using metrics such as precision and recall).

If ground truth reference labels are provided, then it's common to simply define a [custom heuristic evaluator](https://docs.smith.langchain.com/how_to_guides/evaluation/custom_evaluator) to compare ground truth labels to the chain output. However, given the emergence of LLMs, it is increasingly common to simply use `LLM-as-judge` to perform the Classification / Tagging of an input based upon specified criteria (without a ground truth reference).
If ground truth reference labels are provided, then it's common to simply define a [custom heuristic evaluator](./how_to_guides/custom_evaluator) to compare ground truth labels to the chain output. However, given the emergence of LLMs, it is increasingly common to simply use `LLM-as-judge` to perform the Classification / Tagging of an input based upon specified criteria (without a ground truth reference).
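
For example, when reference labels exist, experiment-level metrics such as precision can be computed with a summary evaluator over all runs; the sketch below assumes a hypothetical `label` field on both the outputs and the reference outputs.

```python
from langsmith.schemas import Example, Run


def precision(runs: list[Run], examples: list[Example]) -> dict:
    """Summary metric: precision of predicted 'toxic' labels against the reference labels."""
    tp = fp = 0
    for run, example in zip(runs, examples):
        predicted = (run.outputs or {}).get("label")     # assumed output field
        expected = (example.outputs or {}).get("label")  # assumed reference field
        if predicted == "toxic":
            tp += int(expected == "toxic")
            fp += int(expected != "toxic")
    return {"key": "precision", "score": tp / (tp + fp) if (tp + fp) else 0.0}


# A function like this would be passed via evaluate(..., summary_evaluators=[precision]).
```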

`Online` or `Offline` evaluation is feasible when using `LLM-as-judge` with the `Reference-free` prompt used. In particular, this is well suited to `Online` evaluation when a user wants to tag / classify application input (e.g., for toxicity, etc).

@@ -4,19 +4,19 @@ import { CodeTabs, python } from "@site/src/components/InstructionsWithCode";

:::info Key concepts

[Evaluations](../../concepts#applying-evaluations) | [Evaluators](../../concepts#evaluators) | [Datasets](../../concepts#datasets) | [Experiments](../../concepts#experiments)
[Evaluations](../concepts#applying-evaluations) | [Evaluators](../concepts#evaluators) | [Datasets](../concepts#datasets) | [Experiments](../concepts#experiments)

:::

We can run evaluations asynchronously via the SDK using [aevaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._arunner.aevaluate.html),
which accepts all of the same arguments as [evaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._runner.evaluate.html) but expects the application function to be asynchronous.
You can learn more about how to use the `evaluate()` function [here](../../how_to_guides/evaluation/evaluate_llm_application).
You can learn more about how to use the `evaluate()` function [here](./evaluate_llm_application).

:::info Python only

This guide is only relevant when using the Python SDK.
In JS/TS the `evaluate()` function is already async.
You can see how to use it [here](../../how_to_guides/evaluation/evaluate_llm_application).
You can see how to use it [here](./evaluate_llm_application).

:::
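
A minimal sketch of an `aevaluate()` run, with an async placeholder target and a hypothetical dataset name:

```python
import asyncio

from langsmith.evaluation import aevaluate


async def my_app(inputs: dict) -> dict:
    # Stand-in for your real (async) application logic.
    return {"answer": f"You asked: {inputs['question']}"}


async def main() -> None:
    await aevaluate(
        my_app,
        data="my-dataset",  # hypothetical dataset name
        evaluators=[],      # same evaluator interface as evaluate()
        max_concurrency=4,  # cap concurrent target/evaluator calls
    )


asyncio.run(main())
```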

@@ -76,5 +76,5 @@ list 5 concrete questions that should be investigated to determine if the idea is

## Related

- [Run an evaluation (synchronously)](../../how_to_guides/evaluation/evaluate_llm_application)
- [Handle model rate limits](../../how_to_guides/evaluation/rate_limiting)
- [Run an evaluation (synchronously)](./evaluate_llm_application)
- [Handle model rate limits](./rate_limiting)
@@ -18,13 +18,13 @@ In the comparison view, you may click on any feedback tag to bring up the feedback details
If you would like, you may also attach an explanation to your correction. This is useful if you are using a [few-shot evaluator](./create_few_shot_evaluators): the explanation will be automatically inserted into your few-shot examples in place of the `few_shot_explanation` prompt variable.

![Audit Evaluator Comparison View](../evaluation/static/corrections_comparison_view.png)
![Audit Evaluator Comparison View](./static/corrections_comparison_view.png)

## In the runs table

In the runs table, find the "Feedback" column and click on the feedback tag to bring up the feedback details. Again, click the "edit" icon on the right to bring up the corrections view.

![Audit Evaluator Runs Table](../evaluation/static/corrections_runs_table.png)
![Audit Evaluator Runs Table](./static/corrections_runs_table.png)

## In the SDK

@@ -23,7 +23,7 @@ The next steps vary based on the evaluator type.
1. **Select the LLM as judge type evaluator**
2. **Give your evaluator a name** and **set an inline prompt or load a prompt from the prompt hub** that will be used to evaluate the results of the runs in the experiment.

![Add evaluator name and prompt](../evaluation/static/create_evaluator.png)
![Add evaluator name and prompt](./static/create_evaluator.png)

Importantly, evaluator prompts can only contain the following input variables:

@@ -42,11 +42,11 @@ LangSmith currently doesn't support setting up evaluators in the application that

You can specify the scoring criteria in the "schema" field. In this example, we are asking the LLM to grade the "correctness" of the output with respect to the reference, with a boolean output of 0 or 1. The name of the field in the schema will be interpreted as the feedback key, and its type will be the type of the score.

![Evaluator prompt](../evaluation/static/evaluator_prompt.png)
![Evaluator prompt](./static/evaluator_prompt.png)

3. **Save the evaluator** and navigate back to the dataset details page. Each **subsequent** experiment run from the dataset will now be evaluated by the evaluator you configured. Note that in the below image, each run in the experiment has a "correctness" score.

![Playground evaluator results](../evaluation/static/playground_evaluator_results.png)
![Playground evaluator results](./static/playground_evaluator_results.png)

## Custom code evaluators

@@ -70,7 +70,7 @@ You can specify the scoring criteria in the "schema" field. In this example, we

In the UI, you will see a panel that lets you write your code inline, with some starter code:

![](../evaluation/static/code-autoeval-popup.png)
![](./static/code-autoeval-popup.png)

Custom Code evaluators take in two arguments:

@@ -127,8 +127,8 @@ To visualize the feedback left on new experiments, try running a new experiment
On the dataset page, if you now go to the `experiments` tab -> `+ Experiment` -> `Run in Playground`, you can see the results in action.
Your runs in your experiments will be automatically marked with the key specified in your code sample above (here, `formatted`):

![](../evaluation/static/show-feedback-from-autoeval-code.png)
![](./static/show-feedback-from-autoeval-code.png)

And if you navigate back to your dataset, you'll see summary stats for said experiment in the `experiments` tab:

![](../evaluation/static/experiments-tab-code-results.png)
![](./static/experiments-tab-code-results.png)
@@ -8,65 +8,65 @@ Oftentimes, when you are iterating on your LLM application (such as changing the

LangSmith supports a powerful comparison view that lets you hone in on key differences, regressions, and improvements between different experiments.

![](../evaluation/static/regression_test.gif)
![](./static/regression_test.gif)

## Open the comparison view

To open the comparison view, select two or more experiments from the "Experiments" tab from a given dataset page. Then, click on the "Compare" button at the bottom of the page.

![](../evaluation/static/open_comparison_view.png)
![](./static/open_comparison_view.png)

## Toggle different views

You can toggle between different views by clicking on the "Display" dropdown at the top right of the page and selecting which views to display.

Toggling Full Text will show the full text of the input, output and reference output for each run. If the reference output is too long to display in the table, you can click on expand to view the full content.

![](../evaluation/static/toggle_views.png)
![](./static/toggle_views.png)

## View regressions and improvements

In the LangSmith comparison view, runs that _regressed_ on your specified feedback key against your baseline experiment will be highlighted in red, while runs that _improved_
will be highlighted in green. At the top of each column, you can see how many runs in that experiment did better and how many did worse than your baseline experiment.

![Regressions](../evaluation/static/regression_view.png)
![Regressions](./static/regression_view.png)

## Filter on regressions or improvements

Click on the regressions or improvements buttons on the top of each column to filter to the runs that regressed or improved in that specific experiment.

![Regressions Filter](../evaluation/static/filter_to_regressions.png)
![Regressions Filter](./static/filter_to_regressions.png)

## Update baseline experiment

In order to track regressions, you need a baseline experiment against which to compare. This will be automatically assigned as the first experiment in your comparison, but you can
change it from the dropdown at the top of the page.

![Baseline](../evaluation/static/select_baseline.png)
![Baseline](./static/select_baseline.png)

## Select feedback key

You will also want to select the feedback key (evaluation metric) that you would like to focus on. This can be selected via another dropdown at the top. Again, one will be assigned by
default, but you can adjust as needed.

![Feedback](../evaluation/static/select_feedback.png)
![Feedback](./static/select_feedback.png)

## Open a trace

If tracing is enabled for the evaluation run, you can click on the trace icon in the hover state of any experiment cell to open the trace view for that run. This will open up a trace in the side panel.

![](../evaluation/static/open_trace_comparison.png)
![](./static/open_trace_comparison.png)

## Expand detailed view

From any cell, you can click on the expand icon in the hover state to open up a detailed view of all experiment results on that particular example input, along with feedback keys and scores.

![](../evaluation/static/expanded_view.png)
![](./static/expanded_view.png)

## Update display settings

You can adjust the display settings for comparison view by clicking on "Display" in the top right corner.

Here, you'll be able to toggle feedback, metrics, summary charts, and expand full text.

![](../evaluation/static/update_display.png)
![](./static/update_display.png)
@@ -34,7 +34,7 @@ as your output key. For example, if your main prompt has variables `question` and
You may also specify the number of few-shot examples to use. The default is 5. If your examples tend to be very long, you may want to set this number lower to save tokens, whereas if they tend to be short, you can set a higher number to give your evaluator more examples to learn from. If you have more examples in your dataset than this number, we will randomly choose them for you.

![Use corrections as few-shot examples](../evaluation/static/use_corrections_as_few_shot.png)
![Use corrections as few-shot examples](./static/use_corrections_as_few_shot.png)

Note that few-shot examples are not currently supported in evaluators that use Hub prompts.

@@ -51,20 +51,20 @@ begin seeing examples populated inside your corrections dataset. As you make corrections
The inputs to the few-shot examples will be the relevant fields from the inputs, outputs, and reference (if this is an offline evaluator) of your chain/dataset.
The outputs will be the corrected evaluator score and the explanations that you created when you left the corrections. Feel free to edit these to your liking. Here is an example of a few-shot example in a corrections dataset:

![Few-shot example](../evaluation/static/few_shot_example.png)
![Few-shot example](./static/few_shot_example.png)

Note that the corrections may take a minute or two to be populated into your few-shot dataset. Once they are there, future runs of your evaluator will include them in the prompt!

## View your corrections dataset

In order to view your corrections dataset, go to your rule and click "Edit Rule" (or "Edit Evaluator" from a dataset):

![Edit Evaluator](../evaluation/static/edit_evaluator.png)
![Edit Evaluator](./static/edit_evaluator.png)

If this is an online evaluator (in a tracing project), you will need to click to edit your prompt:

![Edit Prompt](../evaluation/static/click_to_edit_prompt.png)
![Edit Prompt](./static/click_to_edit_prompt.png)

From this screen, you will see a button that says "View few-shot dataset". Clicking this will bring you to your dataset of corrections, where you can view and update your few-shot examples:

![View few-shot dataset](../evaluation/static/view_few_shot_ds.png)
![View few-shot dataset](./static/view_few_shot_ds.png)
@@ -8,7 +8,7 @@ import {

:::info Key concepts

- [Evaluators](../../concepts#evaluators)
- [Evaluators](../concepts#evaluators)

:::

@@ -138,5 +138,5 @@ answer is logically valid and consistent with question and the answer."""

## Related

- [Evaluate aggregate experiment results](../../how_to_guides/evaluation/summary): Define summary evaluators, which compute metrics for an entire experiment.
- [Run an evaluation comparing two experiments](../../how_to_guides/evaluation/evaluate_pairwise): Define pairwise evaluators, which compute metrics by comparing two (or more) experiments against each other.
- [Evaluate aggregate experiment results](./summary): Define summary evaluators, which compute metrics for an entire experiment.
- [Run an evaluation comparing two experiments](./evaluate_pairwise): Define pairwise evaluators, which compute metrics by comparing two (or more) experiments against each other.
@@ -10,8 +10,8 @@ import {

Before diving into this content, it might be helpful to read:

- [guide on fetching examples](../datasets/manage_datasets_programmatically#fetch-examples).
- [guide on creating/managing dataset splits](../datasets/manage_datasets_in_application#create-and-manage-dataset-splits)
- [guide on fetching examples](./manage_datasets_programmatically#fetch-examples).
- [guide on creating/managing dataset splits](./manage_datasets_in_application#create-and-manage-dataset-splits)

:::

@@ -49,7 +49,7 @@ One common workflow is to fetch examples that have a certain metadata key-value pair
]}
/>

For more advanced filtering capabilities see this [how-to guide](../datasets/manage_datasets_programmatically#list-examples-by-structured-filter).
For more advanced filtering capabilities see this [how-to guide](./manage_datasets_programmatically#list-examples-by-structured-filter).
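
Putting the two steps together, a rough Python sketch might fetch the filtered examples and pass them straight to `evaluate()`; the dataset name and metadata values here are hypothetical.

```python
from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()

# Fetch only the examples carrying a hypothetical metadata key-value pair...
examples = client.list_examples(
    dataset_name="my-dataset",          # hypothetical dataset name
    metadata={"source": "production"},  # hypothetical metadata filter
)

# ...and run the evaluation against just that subset.
results = evaluate(
    lambda inputs: {"answer": "..."},   # stand-in for your application
    data=examples,
    evaluators=[],
)
```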

## Evaluate on a dataset split

@@ -85,4 +85,4 @@ You can use the `list_examples` / `listExamples` method to evaluate on one or multiple

## Related

- More on [how to filter datasets](../datasets/manage_datasets_programmatically#list-examples-by-structured-filter)
- More on [how to filter datasets](./manage_datasets_programmatically#list-examples-by-structured-filter)
@@ -8,8 +8,8 @@ import {

:::tip Recommended reading

Before diving into this content, it might be helpful to read the [guide on versioning datasets](../datasets/version_datasets).
Additionally, it might be helpful to read the [guide on fetching examples](../datasets/manage_datasets_programmatically#fetch-examples).
Before diving into this content, it might be helpful to read the [guide on versioning datasets](./version_datasets).
Additionally, it might be helpful to read the [guide on fetching examples](./manage_datasets_programmatically#fetch-examples).

:::

5 changes: 0 additions & 5 deletions docs/evaluation/how_to_guides/datasets/_category_.json

This file was deleted.
