Commit

fix
baskaryan committed Nov 23, 2024
1 parent d8b3fc4 commit a6df8ea
Showing 7 changed files with 14 additions and 7 deletions.
6 changes: 3 additions & 3 deletions docs/evaluation/concepts/index.mdx
@@ -225,7 +225,7 @@ LangSmith evaluations are kicked off using a single function, `evaluate`, which

:::tip

See documentation on using `evaluate` [here](https://docs.smith.langchain.com/how_to_guides/evaluation/evaluate_llm_application#step-4-run-the-evaluation-and-view-the-results).
See documentation on using `evaluate` [here](https://docs.smith.langchain.com/how_to_guides/evaluation/evaluate_llm_application).

:::
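
For context, a minimal sketch of kicking off an evaluation with `evaluate` (the dataset name, target function, and evaluator below are illustrative assumptions, not code from this commit):

```python
from langsmith.evaluation import evaluate

# Hypothetical application under test.
def my_app(inputs: dict) -> dict:
    return {"answer": f"Echo: {inputs['question']}"}

# Hypothetical custom evaluator: exact match against the reference answer.
def exact_match(run, example) -> dict:
    return {
        "key": "exact_match",
        "score": int(run.outputs["answer"] == example.outputs["answer"]),
    }

# Assumes a dataset named "my-dataset" already exists in LangSmith.
results = evaluate(
    my_app,
    data="my-dataset",
    evaluators=[exact_match],
    experiment_prefix="baseline",
)
```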

@@ -236,7 +236,7 @@ One of the most common questions when evaluating AI applications is: how can I b
:::tip

- See the [video on `Repetitions` in our LangSmith Evaluation series](https://youtu.be/Pvz24JdzzF8)
- See our documentation on [`Repetitions`](https://docs.smith.langchain.com/how_to_guides/evaluation/evaluate_llm_application#evaluate-on-a-dataset-with-repetitions)
- See our documentation on [`Repetitions`](https://docs.smith.langchain.com/how_to_guides/evaluation/repetition)

:::
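
As a rough sketch of the repetitions setup (parameter usage assumed from the linked guide; names are illustrative):

```python
from langsmith.evaluation import evaluate

# Hypothetical non-deterministic application under test.
def my_app(inputs: dict) -> dict:
    return {"answer": inputs["question"].upper()}

# Hypothetical evaluator.
def exact_match(run, example) -> dict:
    return {
        "key": "exact_match",
        "score": int(run.outputs["answer"] == example.outputs["answer"]),
    }

# Each example runs 3 times, so per-example score variance can be inspected.
results = evaluate(
    my_app,
    data="my-dataset",         # assumed dataset name
    evaluators=[exact_match],
    num_repetitions=3,         # assumed parameter, per the repetitions guide
)
```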

@@ -434,7 +434,7 @@ Classification / Tagging applies a label to a given input (e.g., for toxicity de

A central consideration for Classification / Tagging evaluation is whether you have a dataset with `reference` labels or not. If not, users frequently want to define an evaluator that uses criteria to apply a label (e.g., toxicity, etc.) to an input (e.g., text, user question, etc.). However, if ground truth class labels are provided, then the evaluation objective is to score a Classification / Tagging chain relative to the ground truth class label (e.g., using metrics such as precision, recall, etc.).

If ground truth reference labels are provided, then it's common to simply define a [custom heuristic evaluator](https://docs.smith.langchain.com/how_to_guides/evaluation/evaluate_llm_application#use-custom-evaluators) to compare ground truth labels to the chain output. However, given the emergence of LLMs, it is increasingly common to simply use `LLM-as-judge` to perform the Classification / Tagging of an input based on specified criteria (without a ground truth reference).
If ground truth reference labels are provided, then it's common to simply define a [custom heuristic evaluator](https://docs.smith.langchain.com/how_to_guides/evaluation/custom_evaluator) to compare ground truth labels to the chain output. However, given the emergence of LLMs, it is increasingly common to simply use `LLM-as-judge` to perform the Classification / Tagging of an input based on specified criteria (without a ground truth reference).

`Online` or `Offline` evaluation is feasible when using `LLM-as-judge` with a `Reference-free` prompt. In particular, this is well suited to `Online` evaluation when a user wants to tag / classify application input (e.g., for toxicity, etc.).
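
To make the reference-labeled case concrete, here is a hedged sketch of a custom heuristic evaluator for Classification / Tagging (the `label` field names are assumptions about the dataset and chain output schema):

```python
def correct_label(run, example) -> dict:
    """Heuristic evaluator: exact match against the ground truth class label."""
    predicted = run.outputs.get("label")     # label produced by the chain (assumed field)
    expected = example.outputs.get("label")  # ground truth reference label (assumed field)
    return {"key": "correct_label", "score": int(predicted == expected)}

# Aggregate metrics such as precision and recall can then be computed across
# the experiment from these per-example scores.
```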

@@ -46,4 +46,4 @@ client.update_dataset_tag(
)
```

To run an evaluation on a particular tagged version of a dataset, you can follow [this guide](../evaluation/evaluate_llm_application#evaluate-on-a-particular-version-of-a-dataset).
To run an evaluation on a particular tagged version of a dataset, you can follow [this guide](../evaluation/dataset_version).
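
As an illustrative sketch (the tag name, dataset name, and target function are assumptions), this typically means listing examples as of the tag and passing them to `evaluate`:

```python
from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()

# Placeholder application under test.
def my_app(inputs: dict) -> dict:
    return {"answer": "..."}

results = evaluate(
    my_app,
    # `as_of` pins the dataset to a tagged version, e.g. a "prod" tag
    # previously set with client.update_dataset_tag.
    data=client.list_examples(dataset_name="my-dataset", as_of="prod"),
    evaluators=[],
)
```
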
Empty file.
4 changes: 4 additions & 0 deletions docs/evaluation/how_to_guides/evaluation/metric_type.mdx
@@ -65,3 +65,7 @@ Here are some examples:

]}
/>

## Related

- [Return multiple metrics in one evaluator](./how_to_guides/evaluation/multiple_scores)
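
For orientation, a hedged sketch of the distinction, using the common LangSmith evaluator return shapes (assumed, not taken from this diff): a numerical metric returns a `score`, while a categorical metric returns a string `value`:

```python
def numeric_accuracy(run, example) -> dict:
    # Continuous / numerical metric: return a "score".
    return {"key": "accuracy", "score": 0.87}

def categorical_tone(run, example) -> dict:
    # Categorical metric: return a string "value".
    return {"key": "tone", "value": "friendly"}
```
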
4 changes: 4 additions & 0 deletions docs/evaluation/how_to_guides/evaluation/multiple_scores.mdx
@@ -72,3 +72,7 @@ Example:
Rows from the resulting experiment will display each of the scores.

![](../evaluation/static/multiple_scores.png)

## Related

- [Return categorical vs numerical metrics](./how_to_guides/evaluation/metric_type)
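
As a minimal sketch (the metric names and output fields are illustrative assumptions), one evaluator can return several metrics at once by nesting them under a `results` key:

```python
def precision_recall(run, example) -> dict:
    predicted = set(run.outputs.get("entities", []))      # assumed output field
    expected = set(example.outputs.get("entities", []))   # assumed reference field
    tp = len(predicted & expected)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(expected) if expected else 0.0
    # Both metrics are returned together under "results".
    return {
        "results": [
            {"key": "precision", "score": precision},
            {"key": "recall", "score": recall},
        ]
    }
```
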
1 change: 0 additions & 1 deletion docs/evaluation/how_to_guides/index.md
@@ -31,7 +31,6 @@ Evaluate and improve your application before deploying it.
- [Evaluate intermediate steps](./how_to_guides/evaluation/evaluate_on_intermediate_steps)
- [Return multiple metrics in one evaluator](./how_to_guides/evaluation/multiple_scores)
- [Return categorical vs numerical metrics](./how_to_guides/evaluation/metric_type)
- [Check your evaluator setup](./how_to_guides/evaluation/check_evaluator)

### Configure the evaluation data

4 changes: 2 additions & 2 deletions docs/evaluation/tutorials/agents.mdx
@@ -460,7 +460,7 @@ See the full overview of single step evaluation in our [conceptual guide](https:

:::

We can check a specific tool call using [a custom evaluator](https://docs.smith.langchain.com/how_to_guides/evaluation/evaluate_llm_application#use-custom-evaluators):
We can check a specific tool call using [a custom evaluator](https://docs.smith.langchain.com/how_to_guides/evaluation/custom_evaluator):

- Here, we just invoke the assistant, `assistant_runnable`, with a prompt and check if the resulting tool call is as expected.
- Here, we are using a specialized agent where the tools are hard-coded (rather than passed with the dataset input).
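
A hedged sketch of such a single-step check (the message and tool-call field names follow common LangChain conventions and are assumptions here, not code from this tutorial):

```python
def right_tool(run, example) -> dict:
    """Check that the first tool call made by the assistant matches the expected tool."""
    messages = run.outputs.get("messages", [])                    # assumed output field
    tool_calls = getattr(messages[-1], "tool_calls", []) if messages else []
    called = tool_calls[0]["name"] if tool_calls else None
    expected = example.outputs.get("expected_tool")               # assumed reference field
    return {"key": "right_tool", "score": int(called == expected)}
```
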
@@ -507,7 +507,7 @@ experiment_results = evaluate(

### Trajectory

We can check a trajectory of tool calls using [custom evaluators](https://docs.smith.langchain.com/how_to_guides/evaluation/evaluate_llm_application#use-custom-evaluators):
We can check a trajectory of tool calls using [custom evaluators](https://docs.smith.langchain.com/how_to_guides/evaluation/custom_evaluator):

- Here, we just invoke the agent, `graph.invoke`, with a prompt.
- Here, we are using a specialized agent where the tools are hard-coded (rather than passed with the dataset input).
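
And a hedged sketch of a trajectory check (field names again assumed): collect the names of all tools called during the run and compare them against an expected ordered list:

```python
def trajectory_matches(run, example) -> dict:
    """Compare the ordered list of tool calls against the expected trajectory."""
    messages = run.outputs.get("messages", [])                      # assumed output field
    trajectory = [
        call["name"]
        for message in messages
        for call in (getattr(message, "tool_calls", None) or [])
    ]
    expected = example.outputs.get("expected_trajectory", [])       # assumed reference field
    return {"key": "trajectory_match", "score": int(trajectory == expected)}
```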
