diff --git a/docs/evaluation/concepts/index.mdx b/docs/evaluation/concepts/index.mdx
index fb4056ea..bb9b8704 100644
--- a/docs/evaluation/concepts/index.mdx
+++ b/docs/evaluation/concepts/index.mdx
@@ -66,7 +66,7 @@ When setting up your evaluation, you may want to partition your dataset into dif
 To learn more about creating dataset splits in LangSmith:
 
 - See our video on [`dataset splits`](https://youtu.be/FQMn_FQV-fI?feature=shared) in the LangSmith Evaluation series.
-- See our documentation [here](../how_to_guides/manage_datasets_in_application#create-and-manage-dataset-splits).
+- See our documentation [here](./how_to_guides/manage_datasets_in_application#create-and-manage-dataset-splits).
 
 :::
 
@@ -105,7 +105,7 @@ Heuristic evaluators are hard-coded functions that perform computations to deter
 For some tasks, like code generation, custom heuristic evaluation (e.g., import and code execution-evaluation) are often extremely useful and superior to other evaluations (e.g., LLM-as-judge, discussed below).
 
 - Watch the [`Custom evaluator` video in our LangSmith Evaluation series](https://www.youtube.com/watch?v=w31v_kFvcNw) for a comprehensive overview.
-- Read our [documentation](../how_to_guides/custom_evaluator) on custom evaluators.
+- Read our [documentation](./how_to_guides/custom_evaluator) on custom evaluators.
 - See our [blog](https://blog.langchain.dev/code-execution-with-langgraph/) using custom evaluators for code generation.
 
 :::
@@ -124,7 +124,7 @@ With LLM-as-judge evaluators, it is important to carefully review the resulting
 
 :::tip
 
-See documentation on our workflow to audit and manually correct evaluator scores [here](../how_to_guides/audit_evaluator_scores).
+See documentation on our workflow to audit and manually correct evaluator scores [here](./how_to_guides/audit_evaluator_scores).
 
 :::
 
@@ -225,7 +225,7 @@ LangSmith evaluations are kicked off using a single function, `evaluate`, which
 
 :::tip
 
-See documentation on using `evaluate` [here](../how_to_guides/evaluate_llm_application).
+See documentation on using `evaluate` [here](./how_to_guides/evaluate_llm_application).
 
 :::
 
@@ -236,7 +236,7 @@ One of the most common questions when evaluating AI applications is: how can I b
 :::tip
 
 - See the [video on `Repetitions` in our LangSmith Evaluation series](https://youtu.be/Pvz24JdzzF8)
-- See our documentation on [`Repetitions`](../how_to_guides/repetition)
+- See our documentation on [`Repetitions`](./how_to_guides/repetition)
 
 :::
 
@@ -252,7 +252,7 @@ Below, we will discuss evaluation of a few specific, popular LLM applications.
 
 ![Tool use](../concepts/static/tool_use.png)
 
-Below is a tool-calling agent in [LangGraph](https://langchain-ai.github.io/langgraph/tutorials/introduction/). The `assistant node` is an LLM that determines whether to invoke a tool based upon the input. The `tool condition` sees if a tool was selected by the `assistant node` and, if so, routes to the `tool node`. The `tool node` executes the tool and returns the output as a tool message to the `assistant node`. This loop continues until as long as the `assistant node` selects a tool. If no tool is selected, then the agent directly returns the LLM response.
+Below is a tool-calling agent in [LangGraph](https://langchain-ai.github.io/langgraph/tutorials/introduction/). The `assistant node` is an LLM that determines whether to invoke a tool based upon the input. The `tool condition` sees if a tool was selected by the `assistant node` and, if so, routes to the `tool node`. The `tool node` executes the tool and returns the output as a tool message to the `assistant node`. This loop continues as long as the `assistant node` selects a tool. If no tool is selected, then the agent directly returns the LLM response.
 
 ![Agent](../concepts/static/langgraph_agent.png)
 
@@ -281,7 +281,7 @@ However, there are several downsides to this type of evaluation. First, it usual
 
 :::tip
 
-See our tutorial on [evaluating agent response](../tutorials/agents).
+See our tutorial on [evaluating agent response](./tutorials/agents).
 
 :::
 
@@ -299,7 +299,7 @@ There are several benefits to this type of evaluation. It allows you to evaluate
 
 :::tip
 
-See our tutorial on [evaluating a single step of an agent](../tutorials/agents#single-step-evaluation).
+See our tutorial on [evaluating a single step of an agent](./tutorials/agents#single-step-evaluation).
 
 :::
 
@@ -319,7 +319,7 @@ However, none of these approaches evaluate the input to the tools; they only foc
 
 :::tip
 
-See our tutorial on [evaluating agent trajectory](../tutorials/agents#trajectory).
+See our tutorial on [evaluating agent trajectory](./tutorials/agents#trajectory).
 
 :::
 
@@ -434,7 +434,7 @@ Classification / Tagging applies a label to a given input (e.g., for toxicity de
 
 A central consideration for Classification / Tagging evaluation is whether you have a dataset with `reference` labels or not. If not, users frequently want to define an evaluator that uses criteria to apply label (e.g., toxicity, etc) to an input (e.g., text, user-question, etc). However, if ground truth class labels are provided, then the evaluation objective is focused on scoring a Classification / Tagging chain relative to the ground truth class label (e.g., using metrics such as precision, recall, etc).
 
-If ground truth reference labels are provided, then it's common to simply define a [custom heuristic evaluator](../how_to_guides/custom_evaluator) to compare ground truth labels to the chain output. However, it is increasingly common given the emergence of LLMs simply use `LLM-as-judge` to perform the Classification / Tagging of an input based upon specified criteria (without a ground truth reference).
+If ground truth reference labels are provided, then it's common to simply define a [custom heuristic evaluator](./how_to_guides/custom_evaluator) to compare ground truth labels to the chain output. However, given the emergence of LLMs, it is increasingly common to simply use `LLM-as-judge` to perform the Classification / Tagging of an input based upon specified criteria (without a ground truth reference).
 
 `Online` or `Offline` evaluation is feasible when using `LLM-as-judge` with the `Reference-free` prompt used. In particular, this is well suited to `Online` evaluation when a user wants to tag / classify application input (e.g., for toxicity, etc).
 
diff --git a/docs/evaluation/how_to_guides/dataset_subset.mdx b/docs/evaluation/how_to_guides/dataset_subset.mdx
index ca51c10e..efc914c9 100644
--- a/docs/evaluation/how_to_guides/dataset_subset.mdx
+++ b/docs/evaluation/how_to_guides/dataset_subset.mdx
@@ -10,8 +10,8 @@ import {
 
 Before diving into this content, it might be helpful to read:
 
-- [guide on fetching examples](../datasets/manage_datasets_programmatically#fetch-examples).
-- [guide on creating/managing dataset splits](../datasets/manage_datasets_in_application#create-and-manage-dataset-splits)
+- [guide on fetching examples](./manage_datasets_programmatically#fetch-examples).
+- [guide on creating/managing dataset splits](./manage_datasets_in_application#create-and-manage-dataset-splits)
 
 :::
 
@@ -49,7 +49,7 @@ One common workflow is to fetch examples that have a certain metadata key-value
   ]}
 />
 
-For more advanced filtering capabilities see this [how-to guide](../datasets/manage_datasets_programmatically#list-examples-by-structured-filter).
+For more advanced filtering capabilities, see this [how-to guide](./manage_datasets_programmatically#list-examples-by-structured-filter).
 
 ## Evaluate on a dataset split
 
@@ -85,4 +85,4 @@ You can use the `list_examples` / `listExamples` method to evaluate on one or mu
 
 ## Related
 
-- More on [how to filter datasets](../datasets/manage_datasets_programmatically#list-examples-by-structured-filter)
+- More on [how to filter datasets](./manage_datasets_programmatically#list-examples-by-structured-filter)
diff --git a/docs/evaluation/how_to_guides/dataset_version.mdx b/docs/evaluation/how_to_guides/dataset_version.mdx
index e592bcad..564c1295 100644
--- a/docs/evaluation/how_to_guides/dataset_version.mdx
+++ b/docs/evaluation/how_to_guides/dataset_version.mdx
@@ -8,8 +8,8 @@ import {
 
 :::tip Recommended reading
 
-Before diving into this content, it might be helpful to read the [guide on versioning datasets](../datasets/version_datasets).
-Additionally, it might be helpful to read the [guide on fetching examples](../datasets/manage_datasets_programmatically#fetch-examples).
+Before diving into this content, it might be helpful to read the [guide on versioning datasets](./version_datasets).
+Additionally, it might be helpful to read the [guide on fetching examples](./manage_datasets_programmatically#fetch-examples).
 
 :::
 
diff --git a/docs/evaluation/how_to_guides/run_evals_api_only.mdx b/docs/evaluation/how_to_guides/run_evals_api_only.mdx
index 77c125cc..40fc5fbd 100644
--- a/docs/evaluation/how_to_guides/run_evals_api_only.mdx
+++ b/docs/evaluation/how_to_guides/run_evals_api_only.mdx
@@ -26,7 +26,7 @@ This guide will show you how to run evals using the REST API, using the `request
 
 ## Create a dataset
 
-Here, we are using the python SDK for convenience. You can also use the API directly use the UI, see [this guide](../datasets/manage_datasets_in_application) for more information.
+Here, we are using the Python SDK for convenience. You can also use the API directly or use the UI; see [this guide](./manage_datasets_in_application) for more information.
 
 ```python
 import openai