
Commit

fix formatting (#584)

baskaryan authored Dec 11, 2024
1 parent 23913ef commit 15b1e47
Showing 2 changed files with 12 additions and 12 deletions.
16 changes: 8 additions & 8 deletions docs/evaluation/concepts/index.mdx
@@ -7,7 +7,7 @@ LangSmith allows you to build high-quality evaluations for your AI application.
- `Dataset`: These are the inputs to your application used for conducting evaluations.
- `Evaluator`: An evaluator is a function responsible for scoring your AI application based on the provided dataset.

-![Summary](../concepts/static/langsmith_summary.png)
+![Summary](./static/langsmith_summary.png)

## Datasets

@@ -141,11 +141,11 @@ This can be the case for tasks like summarization - it may be hard to give a sum

We can visualize the above ideas collectively in the diagram below. To review, `datasets` are composed of `examples` that can be curated from a variety of sources, such as historical logs or user-curated examples. `Evaluators` are functions that score how well your application performs on each `example` in your `dataset`. Evaluators can use different scoring functions, such as `human`, `heuristic`, `LLM-as-judge`, or `pairwise`. And if the `dataset` contains `reference` outputs, then the evaluator can compare the application output to the `reference`.

-![Overview](../concepts/static/langsmith_overview.png)
+![Overview](./static/langsmith_overview.png)
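
As a concrete illustration, a heuristic evaluator that uses a reference output can be as simple as a plain function comparing the application's output to the example's reference. A minimal sketch (the `answer` key and the returned `key`/`score` fields are illustrative placeholders, not a required signature):

```python
# Minimal heuristic evaluator: exact-match scoring against a reference output.
# The dict keys used here are placeholders for whatever your application returns.
def exact_match(outputs: dict, reference_outputs: dict) -> dict:
    matched = outputs.get("answer") == reference_outputs.get("answer")
    return {"key": "exact_match", "score": int(matched)}
```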

Each time we run an evaluation, we are conducting an experiment. An experiment is a single execution of all the example inputs in your `dataset` through your `task`. Typically, we will run multiple experiments on a given `dataset`, testing different tweaks to our `task` (e.g., different prompts or LLMs). In LangSmith, you can easily view all the experiments associated with your `dataset` and track your application's performance over time. Additionally, you can compare multiple experiments in a comparison view.

-![Example](../concepts/static/comparing_multiple_experiments.png)
+![Example](./static/comparing_multiple_experiments.png)
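
For example, a single experiment can be kicked off with the `evaluate` helper from the `langsmith` SDK, roughly as sketched below (assuming a recent SDK version that exposes `evaluate` at the top level and accepts function evaluators with `outputs`-style arguments; the dataset name, target function, and evaluator are placeholders):

```python
from langsmith import evaluate  # in some SDK versions this lives in langsmith.evaluation

def my_app(inputs: dict) -> dict:
    # Placeholder target: call your real chain or agent here.
    return {"answer": "..."}

def has_answer(outputs: dict) -> dict:
    # Trivial heuristic evaluator, for illustration only.
    return {"key": "has_answer", "score": int(bool(outputs.get("answer")))}

# Each call to evaluate() runs every example in the dataset through the task,
# i.e. it produces one experiment.
evaluate(
    my_app,
    data="my-dataset",              # name of a dataset in LangSmith (placeholder)
    evaluators=[has_answer],
    experiment_prefix="prompt-v2",  # label the tweak being tested
)
```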

In the `Dataset` section above, we discussed a few ways to build datasets (e.g., from historical logs or manual curation). One common way to use these datasets is offline evaluation, which is usually conducted prior to deployment of your LLM application. Below we'll discuss a few common paradigms for offline evaluation.

@@ -178,7 +178,7 @@ They are also commonly done when evaluating new or different models.

LangSmith's comparison view has native support for regression testing, allowing you to quickly see examples that have changed relative to the baseline (with regressions on specific examples shown in red and improvements in green):

-![Regression](../concepts/static/regression.png)
+![Regression](./static/regression.png)

### Back-testing

@@ -250,19 +250,19 @@ Below, we will discuss evaluation of a few specific, popular LLM applications.

[LLM-powered autonomous agents](https://lilianweng.github.io/posts/2023-06-23-agent/) combine three components: (1) Tool calling, (2) Memory, and (3) Planning. Agents [use tool calling](https://python.langchain.com/v0.1/docs/modules/agents/agent_types/tool_calling/) with planning (e.g., often via prompting) and memory (e.g., often short-term message history) to generate responses. [Tool calling](https://python.langchain.com/v0.1/docs/modules/model_io/chat/function_calling/) allows a model to respond to a given prompt by generating two things: (1) a tool to invoke and (2) the input arguments required.

-![Tool use](../concepts/static/tool_use.png)
+![Tool use](./static/tool_use.png)
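
As a rough sketch of what this looks like in code (assuming the `langchain-openai` package; the `multiply` tool and the model name are illustrative placeholders):

```python
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

@tool
def multiply(a: int, b: int) -> int:
    """Multiply two integers."""
    return a * b

# Bind the tool so the model can choose to call it.
llm_with_tools = ChatOpenAI(model="gpt-4o-mini").bind_tools([multiply])

msg = llm_with_tools.invoke("What is 6 times 7?")
# The response names the tool to invoke and the input arguments the model generated.
print(msg.tool_calls)  # e.g. [{"name": "multiply", "args": {"a": 6, "b": 7}, ...}]
```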

Below is a tool-calling agent in [LangGraph](https://langchain-ai.github.io/langgraph/tutorials/introduction/). The `assistant node` is an LLM that determines whether to invoke a tool based upon the input. The `tool condition` checks whether a tool was selected by the `assistant node` and, if so, routes to the `tool node`. The `tool node` executes the tool and returns the output as a tool message to the `assistant node`. This loop continues as long as the `assistant node` selects a tool. If no tool is selected, then the agent directly returns the LLM response.

-![Agent](../concepts/static/langgraph_agent.png)
+![Agent](./static/langgraph_agent.png)
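
A minimal version of this graph might be wired up as sketched below (assuming the `langgraph` and `langchain-openai` packages; the `get_weather` tool and the model name are placeholders):

```python
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, MessagesState, START
from langgraph.prebuilt import ToolNode, tools_condition

@tool
def get_weather(city: str) -> str:
    """Return a canned weather report for a city."""
    return f"It is sunny in {city}."

llm = ChatOpenAI(model="gpt-4o-mini").bind_tools([get_weather])

def assistant(state: MessagesState):
    # The assistant node decides whether to call a tool or answer directly.
    return {"messages": [llm.invoke(state["messages"])]}

builder = StateGraph(MessagesState)
builder.add_node("assistant", assistant)
builder.add_node("tools", ToolNode([get_weather]))
builder.add_edge(START, "assistant")
# Route to the tool node if a tool was selected, otherwise end.
builder.add_conditional_edges("assistant", tools_condition)
builder.add_edge("tools", "assistant")  # loop back until no tool is selected
graph = builder.compile()
```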

This sets up three general types of agent evaluations that users are often interested in:

- `Final Response`: Evaluate the agent's final response.
- `Single step`: Evaluate any agent step in isolation (e.g., whether it selects the appropriate tool).
- `Trajectory`: Evaluate whether the agent took the expected path (e.g., of tool calls) to arrive at the final answer.

-![Agent-eval](../concepts/static/agent_eval.png)
+![Agent-eval](./static/agent_eval.png)
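
For instance, a simple `Trajectory` evaluator might just compare the sequence of tool calls the agent made against an expected sequence. A pure-Python sketch (the `tool_calls` key and the list-of-names format are illustrative assumptions about how the trajectory is logged):

```python
# Sketch of a trajectory evaluator: did the agent call the expected tools, in order?
def trajectory_matches(outputs: dict, reference_outputs: dict) -> dict:
    actual = outputs.get("tool_calls", [])        # e.g. ["search", "calculator"]
    expected = reference_outputs.get("tool_calls", [])
    return {"key": "trajectory_match", "score": int(actual == expected)}
```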

Below we will cover what each of these is, the components (inputs, outputs, evaluators) it needs, and when you should consider using it.
Note that you will likely want to do multiple (if not all!) of these types of evaluations - they are not mutually exclusive!
@@ -355,7 +355,7 @@ When evaluating RAG applications, a key consideration is whether you have (or ca

`LLM-as-judge` is a commonly used evaluator for RAG because it's an effective way to evaluate factual accuracy or consistency between texts.

-![rag-types.png](../concepts/static/rag-types.png)
+![rag-types.png](./static/rag-types.png)
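
A bare-bones LLM-as-judge for answer correctness might look roughly like the sketch below (assuming the `langchain-openai` package; the prompt wording, model name, and the `question`/`answer` keys are simplified placeholders):

```python
from langchain_openai import ChatOpenAI

judge = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def correctness(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    # Ask the judge model to grade the generated answer against the reference answer.
    prompt = (
        "You are grading a RAG answer for factual consistency with a reference.\n"
        f"Question: {inputs['question']}\n"
        f"Reference answer: {reference_outputs['answer']}\n"
        f"Generated answer: {outputs['answer']}\n"
        "Reply with exactly one word: CORRECT or INCORRECT."
    )
    verdict = judge.invoke(prompt).content.strip().upper()
    return {"key": "correctness", "score": int(verdict.startswith("CORRECT"))}
```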

When evaluating RAG applications, you have two main options:

8 changes: 4 additions & 4 deletions docs/evaluation/how_to_guides/unit_testing.mdx
@@ -4,25 +4,25 @@ sidebar_position: 7

# How to unit test applications (Python only)

-LangSmith functional tests are assertions and expectations designed to **quickly** identify obvious bugs and regressions in your AI system.
+LangSmith functional tests are assertions and expectations designed to **quickly** identify obvious bugs and regressions in your AI system.
Relative to evaluations, tests are typically designed to be **fast** and **cheap** to run, focusing on **specific** functionality and edge cases with binary assertions.
We recommend using LangSmith to track any unit tests, end-to-end integration tests, or other specific assertions that touch an LLM or other non-deterministic part of your AI system.
Ideally these run on every commit in your CI pipeline to catch regressions early.

:::info Version requirement
-`@unit` requires `langsmith` Python version `>=0.1.74`.
+`@unit` requires `langsmith` Python version `>=0.1.74`.
:::

:::info TypeScript support
-If you are interested in unit testing functionality in TypeScript or other languages, please upvote/comment on [this GitHub Issue](https://github.com/langchain-ai/langsmith-sdk/issues/1321).
+If you are interested in unit testing functionality in TypeScript or other languages, please upvote/comment on [this GitHub Issue](https://github.com/langchain-ai/langsmith-sdk/issues/1321).
:::

## Write a @unit

To write a LangSmith functional test, decorate your test function with `@unit`.
If you want to track the full nested trace of the system or component being tested, you can mark those functions with `@traceable`. For example:

-```python
+```python
# my_app/main.py
from langsmith import traceable
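
# Illustrative sketch: the function and test below are placeholders, not the
# exact code from this guide.
@traceable
def generate_answer(question: str) -> str:
    return "Paris" if "capital of France" in question else "I don't know"


# my_app/tests/test_main.py  (hypothetical test module)
from langsmith import unit  # assumes langsmith>=0.1.74, which provides the @unit decorator

from my_app.main import generate_answer


@unit
def test_generate_answer() -> None:
    # A fast, binary assertion that LangSmith tracks as a functional test.
    assert generate_answer("What is the capital of France?") == "Paris"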

