
Commit

fix links
baskaryan committed Nov 25, 2024
1 parent b2f1118 commit 57baca8
Showing 14 changed files with 43 additions and 43 deletions.
14 changes: 7 additions & 7 deletions docs/evaluation/how_to_guides/async.mdx
@@ -4,19 +4,19 @@ import { CodeTabs, python } from "@site/src/components/InstructionsWithCode";

:::info Key concepts

[Evaluations](../../concepts#applying-evaluations) | [Evaluators](../../concepts#evaluators) | [Datasets](../../concepts#datasets) | [Experiments](../../concepts#experiments)
[Evaluations](../concepts#applying-evaluations) | [Evaluators](../concepts#evaluators) | [Datasets](../concepts#datasets) | [Experiments](../concepts#experiments)

:::

We can run evaluations asynchronously via the SDK using [aevaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._arunner.aevaluate.html),
which accepts all of the same arguments as [evaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._runner.evaluate.html) but expects the application function to be asynchronous.
You can learn more about how to use the `evaluate()` function [here](../../how_to_guides/evaluate_llm_application).
We can run evaluations asynchronously via the SDK using [aevaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._arunner.aevaluate.html),
which accepts all of the same arguments as [evaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._runner.evaluate.html) but expects the application function to be asynchronous.
You can learn more about how to use the `evaluate()` function [here](./evaluate_llm_application).

:::info Python only

This guide is only relevant when using the Python SDK.
In JS/TS the `evaluate()` function is already async.
You can see how to use it [here](../../how_to_guides/evaluate_llm_application).
You can see how to use it [here](./evaluate_llm_application).

:::
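
As a rough sketch of what such an async evaluation can look like (the dataset name, target function, and evaluator below are illustrative placeholders, assuming a recent `langsmith` Python SDK):

```python
import asyncio

from langsmith.evaluation import aevaluate

async def my_app(inputs: dict) -> dict:
    # Placeholder async application; a real target would await an LLM call here.
    await asyncio.sleep(0)
    return {"answer": inputs["question"].upper()}

def correct(outputs: dict, reference_outputs: dict) -> bool:
    # Exact-match check of the actual output against the reference output.
    return outputs["answer"] == reference_outputs["answer"]

async def main() -> None:
    # Assumes a dataset named "my-dataset" already exists in LangSmith.
    await aevaluate(
        my_app,
        data="my-dataset",
        evaluators=[correct],
        experiment_prefix="async-sketch",
    )

asyncio.run(main())
```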

@@ -76,5 +76,5 @@ list 5 concrete questions that should be investigated to determine if the idea i

## Related

- [Run an evaluation (synchronously)](../../how_to_guides/evaluate_llm_application)
- [Handle model rate limits](../../how_to_guides/rate_limiting)
- [Run an evaluation (synchronously)](./evaluate_llm_application)
- [Handle model rate limits](./rate_limiting)
8 changes: 4 additions & 4 deletions docs/evaluation/how_to_guides/custom_evaluator.mdx
@@ -8,12 +8,12 @@ import {

:::info Key concepts

- [Evaluators](../../concepts#evaluators)
- [Evaluators](../concepts#evaluators)

:::

Custom evaluators are just functions that take a dataset example and the resulting application output, and return one or more metrics.
These functions can be passed directly into [evaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._runner.evaluate.html) / [aevaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._arunner.aevaluate.html).
These functions can be passed directly into [evaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._runner.evaluate.html) / [aevaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._arunner.aevaluate.html).
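
For instance, a bare-bones sketch of that shape (the metric name, keys, and threshold are invented for illustration; the documented version follows in the basic example below):

```python
def concision(outputs: dict, reference_outputs: dict) -> dict:
    # Hypothetical metric: is the answer at most twice as long as the reference answer?
    score = int(len(outputs["answer"]) <= 2 * len(reference_outputs["answer"]))
    return {"key": "concision", "score": score}
```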

## Basic example

@@ -138,5 +138,5 @@ answer is logically valid and consistent with question and the answer."""

## Related

- [Evaluate aggregate experiment results](../../how_to_guides/summary): Define summary evaluators, which compute metrics for an entire experiment.
- [Run an evaluation comparing two experiments](../../how_to_guides/evaluate_pairwise): Define pairwise evaluators, which compute metrics by comparing two (or more) experiments against each other.
- [Evaluate aggregate experiment results](./summary): Define summary evaluators, which compute metrics for an entire experiment.
- [Run an evaluation comparing two experiments](./evaluate_pairwise): Define pairwise evaluators, which compute metrics by comparing two (or more) experiments against each other.
28 changes: 14 additions & 14 deletions docs/evaluation/how_to_guides/evaluate_llm_application.mdx
@@ -12,17 +12,17 @@ import {

:::info Key concepts

[Evaluations](../../concepts#applying-evaluations) | [Evaluators](../../concepts#evaluators) | [Datasets](../../concepts#datasets)
[Evaluations](../concepts#applying-evaluations) | [Evaluators](../concepts#evaluators) | [Datasets](../concepts#datasets)

:::

In this guide we'll go over how to evaluate an application using the [evaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._runner.evaluate.html) method in the LangSmith SDK.
In this guide we'll go over how to evaluate an application using the [evaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._runner.evaluate.html) method in the LangSmith SDK.

:::tip

For larger evaluation jobs in Python we recommend using [aevaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._arunner.aevaluate.html), the asynchronous version of `evaluate()`.
For larger evaluation jobs in Python we recommend using [aevaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._arunner.aevaluate.html), the asynchronous version of `evaluate()`.
It is still worthwhile to read this guide first, as the two have nearly identical interfaces,
and then read the how-to guide on [running an evaluation asynchronously](../../how_to_guides/async).
and then read the how-to guide on [running an evaluation asynchronously](./async).

:::

@@ -92,7 +92,7 @@ To understand how to annotate your code for tracing, please refer to [this guide

## Create or select a dataset

We need a [Dataset](../../concepts#datasets) to evaluate our application on. Our dataset will contain labeled [examples](../../concepts#examples) of toxic and non-toxic text.
We need a [Dataset](../concepts#datasets) to evaluate our application on. Our dataset will contain labeled [examples](../concepts#examples) of toxic and non-toxic text.

<CodeTabs
groupId="client-language"
@@ -150,11 +150,11 @@ We need a [Dataset](../../concepts#datasets) to evaluate our application on. Our
]}
/>

See [here](../../how_to_guides#dataset-management) for more on dataset management.
See [here](.#dataset-management) for more on dataset management.
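
As a rough Python sketch of creating such a dataset programmatically (the dataset name and examples are invented for illustration):

```python
from langsmith import Client

client = Client()  # assumes LANGSMITH_API_KEY is set in the environment

dataset = client.create_dataset(dataset_name="toxic-queries-demo")
client.create_examples(
    inputs=[{"text": "Shut up, idiot"}, {"text": "You're a wonderful person"}],
    outputs=[{"label": "Toxic"}, {"label": "Not toxic"}],
    dataset_id=dataset.id,
)
```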

## Define an evaluator

[Evaluators](../../concepts#evaluators) are functions for scoring your application's outputs. They take in the example inputs, actual outputs, and, when present, the reference outputs.
[Evaluators](../concepts#evaluators) are functions for scoring your application's outputs. They take in the example inputs, actual outputs, and, when present, the reference outputs.
Since we have labels for this task, our evaluator can directly check if the actual outputs match the reference outputs.

<CodeTabs
@@ -176,11 +176,11 @@ Since we have labels for this task, our evaluator can directly check if the actu
]}
/>

See [here](../../how_to_guides#define-an-evaluator) for more on how to define evaluators.
See [here](.#define-an-evaluator) for more on how to define evaluators.

## Run the evaluation

We'll use the [evaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._runner.evaluate.html) / [aevaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._arunner.aevaluate.html) methods to run the evaluation.
We'll use the [evaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._runner.evaluate.html) / [aevaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._arunner.aevaluate.html) methods to run the evaluation.

The key arguments are:

@@ -214,11 +214,11 @@ The key arguments are:
]}
/>

See [here](../../how_to_guides#run-an-evaluation) for other ways to kick off evaluations and [here](../../how_to_guides#configure-an-evaluation-job) for how to configure evaluation jobs.
See [here](.#run-an-evaluation) for other ways to kick off evaluations and [here](.#configure-an-evaluation-job) for how to configure evaluation jobs.
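
Putting the pieces together, a minimal sketch of the call (the target function, dataset name, and evaluator here are simplified placeholders, not the guide's own implementation):

```python
from langsmith.evaluation import evaluate

def toxicity_classifier(inputs: dict) -> dict:
    # Placeholder application; a real target would call an LLM here.
    return {"label": "Toxic" if "idiot" in inputs["text"].lower() else "Not toxic"}

def correct(outputs: dict, reference_outputs: dict) -> bool:
    # Exact-match check against the labeled reference output.
    return outputs["label"] == reference_outputs["label"]

results = evaluate(
    toxicity_classifier,         # the application, or "target"
    data="toxic-queries-demo",   # assumed to be an existing dataset name
    evaluators=[correct],
    experiment_prefix="toxicity-baseline",
)
```

Each call like this creates a new experiment on the dataset, which is what the next section explores.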

## Explore the results

Each invocation of `evaluate()` creates an [Experiment](../../concepts#experiments) which can be viewed in the LangSmith UI or queried via the SDK.
Each invocation of `evaluate()` creates an [Experiment](../concepts#experiments) which can be viewed in the LangSmith UI or queried via the SDK.
Evaluation scores are stored against each actual output as feedback.

_If you've annotated your code for tracing, you can open the trace of each row in a side panel view._
@@ -364,6 +364,6 @@ _If you've annotated your code for tracing, you can open the trace of each row i

## Related

- [Run an evaluation asynchronously](../../how_to_guides/async)
- [Run an evaluation via the REST API](../../how_to_guides/run_evals_api_only)
- [Run an evaluation from the prompt playground](../../how_to_guides/run_evaluation_from_prompt_playground)
- [Run an evaluation asynchronously](./async)
- [Run an evaluation via the REST API](./run_evals_api_only)
- [Run an evaluation from the prompt playground](./run_evaluation_from_prompt_playground)
Original file line number Diff line number Diff line change
@@ -391,4 +391,4 @@ The experiment will contain the results of the evaluation, including the scores

## Related

- [Evaluate a `langgraph` graph](../evaluation/langgraph)
- [Evaluate a `langgraph` graph](./langgraph)
4 changes: 2 additions & 2 deletions docs/evaluation/how_to_guides/evaluate_pairwise.mdx
@@ -13,14 +13,14 @@ import {

:::info Key concepts

- [Pairwise evaluations](../../concepts#pairwise)
- [Pairwise evaluations](../concepts#pairwise)

:::

LangSmith supports evaluating **existing** experiments in a comparative manner.
This allows you to score the outputs from multiple experiments against each other, rather than being confined to evaluating outputs one at a time.
Think [LMSYS Chatbot Arena](https://chat.lmsys.org/) - this is the same concept!
To do this, use the [evaluate_comparative](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._runner.evaluate_comparative.html) / `evaluateComparative` function with two existing experiments.
To do this, use the [evaluate_comparative](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._runner.evaluate_comparative.html) / `evaluateComparative` function with two existing experiments.

If you haven't already created experiments to compare, check out our [quick start](https://docs.smith.langchain.com/#5-run-your-first-evaluation) or our [how-to guide](https://docs.smith.langchain.com/how_to_guides/evaluate_llm_application) to get started with evaluations.
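
As a very rough sketch of the shape of such a call (the experiment names and preference logic are placeholders, and the simplified `inputs`/`outputs` evaluator signature is an assumption to check against the SDK reference):

```python
from langsmith.evaluation import evaluate_comparative

def shorter_is_preferred(inputs: dict, outputs: list[dict]) -> list[int]:
    # Toy preference: rank the experiment with the shorter answer higher.
    lengths = [len(o.get("answer", "")) for o in outputs]
    return [1 if length == min(lengths) else 0 for length in lengths]

evaluate_comparative(
    ["experiment-1-name-or-id", "experiment-2-name-or-id"],  # two existing experiments
    evaluators=[shorter_is_preferred],
)
```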

2 changes: 1 addition & 1 deletion docs/evaluation/how_to_guides/langchain_runnable.mdx
@@ -136,4 +136,4 @@ The runnable is traced appropriately for each output.

## Related

- [How to evaluate a `langgraph` graph](../evaluation/langgraph)
- [How to evaluate a `langgraph` graph](./langgraph)
2 changes: 1 addition & 1 deletion docs/evaluation/how_to_guides/langgraph.mdx
@@ -239,7 +239,7 @@ If we need access to information about intermediate steps that isn't in state, w

:::tip Custom evaluators

See more about what arguments you can pass to custom evaluators in this [how-to guide](../evaluation/custom_evaluator).
See more about what arguments you can pass to custom evaluators in this [how-to guide](./custom_evaluator).

:::

6 changes: 3 additions & 3 deletions docs/evaluation/how_to_guides/llm_as_judge.mdx
@@ -8,7 +8,7 @@ import {

:::info Key concepts

- [LLM-as-a-judge evaluator](../../concepts#llm-as-judge)
- [LLM-as-a-judge evaluator](../concepts#llm-as-judge)

:::

@@ -72,8 +72,8 @@ for the answer is logically valid and consistent with question and the answer.\\
]}
/>

See [here](../../how_to_guides/custom_evaluator) for more on how to write a custom evaluator.
See [here](./custom_evaluator) for more on how to write a custom evaluator.
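
As an illustrative sketch of hand-rolling such a judge with the OpenAI client (the model name, prompt, metric, and input/output keys are assumptions, not this guide's own example):

```python
from openai import OpenAI

oai = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def conciseness_judge(inputs: dict, outputs: dict) -> bool:
    # Ask a model to grade the answer; return a boolean feedback score.
    response = oai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Reply with only 'Y' or 'N'."},
            {
                "role": "user",
                "content": (
                    f"Question: {inputs['question']}\n"
                    f"Answer: {outputs['answer']}\n"
                    "Is the answer concise? (Y/N)"
                ),
            },
        ],
    )
    return response.choices[0].message.content.strip().upper().startswith("Y")
```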

## Prebuilt evaluator via `langchain`

See [here](../../how_to_guides/use_langchain_off_the_shelf_evaluators) for how to use prebuilt evaluators from `langchain`.
See [here](./use_langchain_off_the_shelf_evaluators) for how to use prebuilt evaluators from `langchain`.
Original file line number Diff line number Diff line change
@@ -7,7 +7,7 @@ sidebar_position: 1
:::tip Recommended Reading
Before diving into this content, it might be helpful to read the following:

- [Concepts guide on evaluation and datasets](../../concepts#datasets-and-examples)
- [Concepts guide on evaluation and datasets](../concepts#datasets-and-examples)

:::

@@ -36,14 +36,14 @@ Certain fields in your schema have a `+ Transformations` option.
Transformations are preprocessing steps that, if enabled, update your examples when you add them to the dataset.
For example, the `convert to OpenAI messages` transformation will convert message-like objects, such as LangChain messages, to the OpenAI message format.

For the full list of available transformations, see [our reference](/reference/evaluation/dataset_transformations).
For the full list of available transformations, see [our reference](/reference/evaluation/dataset_transformations).

:::note
If you plan to collect production traces in your dataset from LangChain
[ChatModels](https://python.langchain.com/docs/concepts/chat_models/)
or from OpenAI calls using the [LangSmith OpenAI wrapper](/observability/how_to_guides/tracing/annotate_code#wrap-the-openai-client), we offer a prebuilt Chat Model schema that converts messages and tools into industry-standard OpenAI formats that can be used downstream with any model for testing. You can also customize the template settings to match your use case.

Please see the [dataset transformations reference](/reference/evaluation/dataset_transformations) for more information.
Please see the [dataset transformations reference](/reference/evaluation/dataset_transformations) for more information.
:::

## Add runs to your dataset in the UI
4 changes: 2 additions & 2 deletions docs/evaluation/how_to_guides/metric_type.mdx
@@ -6,7 +6,7 @@ import {

# How to return categorical vs numerical metrics

LangSmith supports both categorical and numerical metrics, and you can return either when writing a [custom evaluator](../../how_to_guides/custom_evaluator).
LangSmith supports both categorical and numerical metrics, and you can return either when writing a [custom evaluator](./custom_evaluator).

For an evaluator result to be logged as a numerical metric, it must be returned as:

@@ -68,4 +68,4 @@ Here are some examples:

## Related

- [Return multiple metrics in one evaluator](../../how_to_guides/multiple_scores)
- [Return multiple metrics in one evaluator](./multiple_scores)
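
As a rough illustration of the distinction above (assuming the LangSmith convention that numerical feedback uses a `score` field and categorical feedback uses a `value` field; the metric names and values are placeholders):

```python
def similarity(outputs: dict, reference_outputs: dict) -> dict:
    # Numerical metric: report it under "score".
    return {"key": "similarity", "score": 0.87}

def tone(outputs: dict) -> dict:
    # Categorical metric: report it under "value".
    return {"key": "tone", "value": "formal"}
```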
4 changes: 2 additions & 2 deletions docs/evaluation/how_to_guides/multiple_scores.mdx
@@ -6,7 +6,7 @@ import {

# How to return multiple scores in one evaluator

Sometimes it is useful for a [custom evaluator function](../../how_to_guides/custom_evaluator) or [summary evaluator function](../../how_to_guides/summary) to return multiple metrics.
Sometimes it is useful for a [custom evaluator function](./custom_evaluator) or [summary evaluator function](./summary) to return multiple metrics.
For example, if you have multiple metrics being generated by an LLM judge, you can save time and money by making a single LLM call that generates multiple metrics instead of making multiple LLM calls.

To return multiple scores using the Python SDK, simply return a list of dictionaries/objects of the following form:
@@ -75,4 +75,4 @@ Rows from the resulting experiment will display each of the scores.

## Related

- [Return categorical vs numerical metrics](../../how_to_guides/metric_type)
- [Return categorical vs numerical metrics](./metric_type)
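
For instance, a sketch of a single evaluator that emits two metrics from one pass over the outputs (the metric names and the `entities` keys are illustrative assumptions, not the guide's own example):

```python
def precision_recall(outputs: dict, reference_outputs: dict) -> list[dict]:
    predicted = set(outputs.get("entities", []))
    expected = set(reference_outputs.get("entities", []))
    true_positives = len(predicted & expected)
    return [
        {"key": "precision", "score": true_positives / len(predicted) if predicted else 0.0},
        {"key": "recall", "score": true_positives / len(expected) if expected else 0.0},
    ]
```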
2 changes: 1 addition & 1 deletion docs/evaluation/how_to_guides/rate_limiting.mdx
@@ -76,7 +76,7 @@ See some examples of how to do this in the [OpenAI docs](https://platform.openai
## Limiting max_concurrency

Limiting the number of concurrent calls you're making to your application and evaluators is another way to decrease the frequency of model calls you're making, and in that way avoid rate limit errors.
`max_concurrency` can be set directly on the [evaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._runner.evaluate.html) / [aevaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._arunner.aevaluate.html) functions.
`max_concurrency` can be set directly on the [evaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._runner.evaluate.html) / [aevaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._arunner.aevaluate.html) functions.
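
For example, a minimal sketch (the target, dataset, and evaluator are placeholders):

```python
from langsmith.evaluation import evaluate

def my_app(inputs: dict) -> dict:       # placeholder target
    return {"answer": inputs["question"]}

def non_empty(outputs: dict) -> bool:   # placeholder evaluator
    return bool(outputs["answer"])

evaluate(
    my_app,
    data="my-dataset",       # assumed to be an existing dataset
    evaluators=[non_empty],
    max_concurrency=4,       # cap on concurrent target and evaluator calls
)
```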

<CodeTabs
groupId="client-language"
2 changes: 1 addition & 1 deletion docs/evaluation/how_to_guides/run_evals_api_only.mdx
@@ -191,7 +191,7 @@ for model_name in model_names:
## Run a pairwise experiment

Next, we'll demonstrate how to run a pairwise experiment. In a pairwise experiment, you compare two examples against each other.
For more information, check out [this guide](../evaluation/evaluate_pairwise).
For more information, check out [this guide](./evaluate_pairwise).

```python
# A comparative experiment allows you to provide a preferential ranking on the outputs of two or more experiments
2 changes: 1 addition & 1 deletion docs/evaluation/how_to_guides/version_datasets.mdx
@@ -46,4 +46,4 @@ client.update_dataset_tag(
)
```

To run an evaluation on a particular tagged version of a dataset, you can follow [this guide](../evaluation/dataset_version).
To run an evaluation on a particular tagged version of a dataset, you can follow [this guide](./dataset_version).
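
As a rough sketch of the idea (the dataset name, tag, and target are placeholders, and `list_examples(..., as_of=...)` is assumed from recent SDK versions):

```python
from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()

# Pull the examples as they existed at the hypothetical "prod" tag.
prod_examples = client.list_examples(dataset_name="my-dataset", as_of="prod")

def echo(inputs: dict) -> dict:  # placeholder target
    return {"answer": inputs["question"]}

evaluate(echo, data=prod_examples)
```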
