v0.2 docs #563

Merged
merged 24 commits into from
Dec 5, 2024
Changes from 19 commits

24 changes: 24 additions & 0 deletions Makefile
@@ -0,0 +1,24 @@
install-vercel-deps:
	yum -y update
	yum install gcc bzip2-devel libffi-devel zlib-devel wget tar gzip rsync -y

PYTHON = .venv/bin/python

build-api-ref:
	git clone --depth=1 https://github.com/langchain-ai/langsmith-sdk.git
	python3 -m venv .venv
	. .venv/bin/activate
	$(PYTHON) -m pip install --upgrade pip
	$(PYTHON) -m pip install --upgrade uv
	cd langsmith-sdk && ../$(PYTHON) -m uv pip install -r python/docs/requirements.txt
	$(PYTHON) langsmith-sdk/python/docs/create_api_rst.py
	LC_ALL=C $(PYTHON) -m sphinx -T -E -b html -d langsmith-sdk/python/docs/_build/doctrees -c langsmith-sdk/python/docs langsmith-sdk/python/docs langsmith-sdk/python/docs/_build/html -j auto
	$(PYTHON) langsmith-sdk/python/docs/scripts/custom_formatter.py langsmith-sdk/docs/_build/html/


vercel-build: install-vercel-deps build-api-ref
	mkdir -p static/reference/python
	mv langsmith-sdk/python/docs/_build/html/* static/reference/python/
	rm -rf langsmith-sdk
	NODE_OPTIONS="--max-old-space-size=5000" yarn run docusaurus build

13 changes: 8 additions & 5 deletions docs/evaluation/how_to_guides/async.mdx
@@ -8,8 +8,8 @@ import { CodeTabs, python } from "@site/src/components/InstructionsWithCode";

:::

We can run evaluations asynchronously via the SDK using [aevaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._arunner.aevaluate.html),
which accepts all of the same arguments as [evaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._runner.evaluate.html) but expects the application function to be asynchronous.
We can run evaluations asynchronously via the SDK using [aevaluate()](https://langsmith-docs-git-bagatur-rfcbuiltinsdkref-langchain.vercel.app/reference/python/evaluation/langsmith.evaluation._arunner.aevaluate),
which accepts all of the same arguments as [evaluate()](https://langsmith-docs-git-bagatur-rfcbuiltinsdkref-langchain.vercel.app/reference/python/evaluation/langsmith.evaluation._runner.evaluate) but expects the application function to be asynchronous.
You can learn more about how to use the `evaluate()` function [here](./evaluate_llm_application).
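
As a rough illustration of what an asynchronous application function looks like, the sketch below defines an async target and runs it concurrently over a few inputs with plain `asyncio`. It does not call LangSmith at all: `asyncio.sleep(0)` stands in for an awaited model call, and the `output` key and the `run_all` helper are hypothetical names chosen for this example.

```python
import asyncio


async def researcher_app(inputs: dict) -> dict:
    """Async target: takes one example's inputs and returns a dict of outputs."""
    # Stand-in for an awaited model call (assumption, no real client is used).
    await asyncio.sleep(0)
    return {"output": f"Questions about: {inputs['idea']}"}


async def run_all(ideas: list) -> list:
    """Hypothetical helper: run the target concurrently over several inputs."""
    return await asyncio.gather(*(researcher_app({"idea": idea}) for idea in ideas))


results = asyncio.run(run_all(["universal basic income", "nuclear fusion"]))
```

A coroutine function with this shape (inputs dict in, outputs dict out) is the kind of target that would be passed as the first argument to `aevaluate()`.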

:::info Python only
@@ -25,8 +25,8 @@ You can see how to use it [here](./evaluate_llm_application).
<CodeTabs
groupId="client-language"
tabs={[
python({caption: "Requires `langsmith>=0.1.145`"})`
from langsmith import aevaluate, wrappers, Client
python({caption: "Requires `langsmith>=0.2.0`"})`
from langsmith import wrappers, Client
from openai import AsyncOpenAI

# Optionally wrap the OpenAI client to trace all model calls.
@@ -61,7 +61,10 @@ list 5 concrete questions that should be investigated to determine if the idea i
inputs=[{"idea": e} for e in examples],
)

results = await aevaluate(
# Can equivalently use the 'aevaluate' function directly:
# from langsmith import aevaluate
# await aevaluate(...)
results = await ls_client.aevaluate(
researcher_app,
data=dataset,
evaluators=[concise],
@@ -2,7 +2,7 @@
sidebar_position: 10
---

How to create few-shot evaluators
# How to create few-shot evaluators

Using LLM-as-a-Judge evaluators can be very helpful when you can't evaluate your system programmatically. However, improving/iterating on these prompts can add unnecessary
overhead to the development process of an LLM-based application - you now need to maintain both your application **and** your evaluators. To make this process easier, LangSmith allows
29 changes: 14 additions & 15 deletions docs/evaluation/how_to_guides/custom_evaluator.mdx
@@ -13,14 +13,14 @@ import {
:::

Custom evaluators are just functions that take a dataset example and the resulting application output, and return one or more metrics.
These functions can be passed directly into [evaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._runner.evaluate.html) / [aevaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._arunner.aevaluate.html).
These functions can be passed directly into [evaluate()](https://langsmith-docs-git-bagatur-rfcbuiltinsdkref-langchain.vercel.app/reference/python/evaluation/langsmith.evaluation._runner.evaluate) / [aevaluate()](https://langsmith-docs-git-bagatur-rfcbuiltinsdkref-langchain.vercel.app/reference/python/evaluation/langsmith.evaluation._arunner.aevaluate).

## Basic example

<CodeTabs
groupId="client-language"
tabs={[
python({caption: "Requires `langsmith>=0.1.145`"})`
python({caption: "Requires `langsmith>=0.2.0`"})`
from langsmith import evaluate

def correct(outputs: dict, reference_outputs: dict) -> bool:
@@ -36,12 +36,14 @@ These functions can be passed directly into [evaluate()](https://langsmith-sdk.r
evaluators=[correct]
)
`,
typescript`
typescript({caption: "Requires `langsmith>=0.2.9`"})`
import type { EvaluationResult } from "langsmith/evaluation";
import type { Run, Example } from "langsmith/schemas";

function correct(run: Run, example: Example): EvaluationResult {
const score = run.outputs?.output === example.outputs?.output;
const correct = async ({ outputs, referenceOutputs }: {
outputs: Record<string, any>;
referenceOutputs?: Record<string, any>;
}): Promise<EvaluationResult> => {
const score = outputs?.answer === referenceOutputs?.answer;
return { key: "correct", score };
}
`,
@@ -53,19 +55,16 @@ These functions can be passed directly into [evaluate()](https://langsmith-sdk.r

Custom evaluator functions must have specific argument names. They can take any subset of the following arguments:

Python and JS/TS

- `run: langsmith.schemas.Run`: The full Run object generated by the application on the given example.
- `example: langsmith.schemas.Example`: The full dataset Example, including the example inputs, outputs (if available), and metadata (if available).

Currently Python only

- `run: Run`: The full [Run](/reference/data_formats/run_data_format) object generated by the application on the given example.
- `example: Example`: The full dataset [Example](/reference/data_formats/example_data_format), including the example inputs, outputs (if available), and metadata (if available).
- `inputs: dict`: A dictionary of the inputs corresponding to a single example in a dataset.
- `outputs: dict`: A dictionary of the outputs generated by the application on the given `inputs`.
- `reference_outputs: dict`: A dictionary of the reference outputs associated with the example, if available.
- `reference_outputs/referenceOutputs: dict`: A dictionary of the reference outputs associated with the example, if available.

For most use cases you'll only need `inputs`, `outputs`, and `reference_outputs`. `run` and `example` are useful only if you need some extra trace or example metadata outside of the actual inputs and outputs of the application.

When using JS/TS these should all be passed in as part of a single object argument.
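
To make the argument-name convention concrete, here is a minimal Python sketch of two evaluators that each take a different subset of the supported arguments. It runs without LangSmith by calling the functions directly on plain dictionaries. The `question` and `answer` keys, the length threshold in `concise`, and the sample data are all assumptions made for illustration.

```python
def correct(outputs: dict, reference_outputs: dict) -> bool:
    """Exact match between the application's answer and the reference answer."""
    return outputs["answer"] == reference_outputs["answer"]


def concise(inputs: dict, outputs: dict) -> bool:
    """Pass when the answer is at most three times as long as the question."""
    return len(outputs["answer"]) <= 3 * len(inputs["question"])


# Hypothetical data standing in for one dataset example and one application run.
example_inputs = {"question": "What is 2 + 2?"}
example_reference = {"answer": "4"}
app_outputs = {"answer": "4"}

is_correct = correct(outputs=app_outputs, reference_outputs=example_reference)
is_concise = concise(inputs=example_inputs, outputs=app_outputs)
```

Because each evaluator only names the arguments it needs, both could appear in the same `evaluators=[...]` list.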

## Evaluator output

Custom evaluators are expected to return one of the following types:
@@ -85,7 +84,7 @@ Currently Python only
<CodeTabs
groupId="client-language"
tabs={[
python({caption: "Requires `langsmith>=0.1.145`"})`
python({caption: "Requires `langsmith>=0.2.0`"})`
from langsmith import evaluate, wrappers
from openai import AsyncOpenAI
# Assumes you've installed pydantic.
2 changes: 1 addition & 1 deletion docs/evaluation/how_to_guides/dataset_subset.mdx
@@ -85,4 +85,4 @@ You can use the `list_examples` / `listExamples` method to evaluate on one or mu

## Related

- More on [how to filter datasets](./manage_datasets_programmatically#list-examples-by-structured-filter)
- Learn more about how to fetch views of a dataset [here](./manage_datasets_programmatically#fetch-datasets)
46 changes: 32 additions & 14 deletions docs/evaluation/how_to_guides/dataset_version.mdx
@@ -13,35 +13,53 @@ Additionally, it might be helpful to read the [guide on fetching examples](./man

:::

You can take advantage of the fact that `evaluate` allows passing in an iterable of examples to evaluate on a particular version of a dataset.
Simply use `list_examples` / `listExamples` to fetch examples from a particular version tag using `as_of` / `asOf`.
## Using `list_examples`

You can take advantage of the fact that `evaluate` / `aevaluate` accept an iterable of examples to evaluate on a particular version of a dataset.
Simply use `list_examples` / `listExamples` to fetch examples from a particular version tag using `as_of` / `asOf`, and pass the result to the `data` argument.

<CodeTabs
groupId="client-language"
tabs={[
python`
from langsmith import evaluate

latest_data=client.list_examples(dataset_name=toxic_dataset_name, as_of="latest")

results = evaluate(
lambda inputs: label_text(inputs["text"]),
data=latest_data,
evaluators=[correct_label],
experiment_prefix="Toxic Queries",
from langsmith import Client

ls_client = Client()

# Assumes actual outputs have a 'class' key.
# Assumes example outputs have a 'label' key.
def correct(outputs: dict, reference_outputs: dict) -> bool:
    return outputs["class"] == reference_outputs["label"]

results = ls_client.evaluate(
    lambda inputs: {"class": "Not toxic"},
    # Pass in filtered data here:
    # highlight-next-line
    data=ls_client.list_examples(
        # highlight-next-line
        dataset_name="Toxic Queries",
        # highlight-next-line
        as_of="latest",  # specify version here
    # highlight-next-line
    ),
    evaluators=[correct],
)
`,
typescript`
import { evaluate } from "langsmith/evaluation";

await evaluate((inputs) => labelText(inputs["input"]), {
data: langsmith.listExamples({
datasetName: datasetName,
asOf: "latest",
}),
evaluators: [correctLabel],
experimentPrefix: "Toxic Queries",
});
`,
]}

]}
/>

## Related

- Learn more about how to fetch views of a dataset [here](./manage_datasets_programmatically#fetch-datasets)
56 changes: 8 additions & 48 deletions docs/evaluation/how_to_guides/evaluate_existing_experiment.mdx
@@ -4,60 +4,20 @@ sidebar_position: 6

# How to evaluate an existing experiment (Python only)

:::note
Currently, `evaluate_existing` is only supported in the Python SDK.
:::info
Evaluation of existing experiments is currently only supported in the Python SDK.
:::

If you have already run an experiment and want to add additional evaluation metrics, you
can apply any evaluators to the experiment using the `evaluate_existing` method.

```python
from langsmith import evaluate_existing

def always_half(run, example):
    return {"score": 0.5}

experiment_name = "my-experiment:abcd123" # Replace with an actual experiment name or ID
evaluate_existing(experiment_name, evaluators=[always_half])
```

## Example

Suppose you are evaluating a semantic router. You may first run an experiment:
can apply any evaluators to the experiment using the `evaluate()` / `aevaluate()` methods as before.
Just pass in the experiment name / ID instead of a target function:

```python
from langsmith import evaluate
def semantic_router(inputs: dict):
    return {"class": 1}

def accuracy(run, example):
    prediction = run.outputs["class"]
    expected = example.outputs["label"]
    return {"score": prediction == expected}

results = evaluate(semantic_router, data="Router Classification Dataset", evaluators=[accuracy])
experiment_name = results.experiment_name
```

Later, you realize you want to add precision and recall summary metrics. The `evaluate_existing` method accepts the same arguments as the `evaluate` method, replacing the `target` system with the `experiment` you wish to add metrics to, meaning
you can add both instance-level `evaluator`'s and aggregate `summary_evaluator`'s.

```python
from langsmith import evaluate_existing
def always_half(inputs: dict, outputs: dict) -> float:
    return 0.5

def precision(runs: list, examples: list):
    true_positives = sum([1 for run, example in zip(runs, examples) if run.outputs["class"] == example.outputs["label"]])
    false_positives = sum([1 for run, example in zip(runs, examples) if run.outputs["class"] != example.outputs["label"]])
    return {"score": true_positives / (true_positives + false_positives)}

def recall(runs: list, examples: list):
    true_positives = sum([1 for run, example in zip(runs, examples) if run.outputs["class"] == example.outputs["label"]])
    false_negatives = sum([1 for run, example in zip(runs, examples) if run.outputs["class"] != example.outputs["label"]])
    return {"score": true_positives / (true_positives + false_negatives)}

evaluate_existing(experiment_name, summary_evaluators=[precision, recall])
experiment_name = "my-experiment:abc" # Replace with an actual experiment name or ID
evaluate(experiment_name, evaluators=[always_half])
```

The precision and recall metrics will now be available in the LangSmith UI for the `experiment_name` experiment.

As is the case with the `evaluate` function, there is an identical, asynchronous `aevaluate_existing` function that can be used to evaluate experiments asynchronously.
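
Summary evaluators such as `precision` and `recall` receive the full lists of runs and examples rather than a single pair. Counting every mismatch as both a false positive and a false negative makes the two metrics coincide, so the sketch below separates them for a binary task. It is an illustration under stated assumptions, not the SDK's own implementation: the positive class is assumed to be `1`, the `_confusion_counts` helper is hypothetical, and `SimpleNamespace` objects stand in for real `Run` and `Example` objects so the code runs without LangSmith.

```python
from types import SimpleNamespace

POSITIVE = 1  # Assumption: binary labels, with 1 as the positive class.


def _confusion_counts(runs: list, examples: list) -> tuple:
    """Hypothetical helper: count true positives, false positives, false negatives."""
    tp = fp = fn = 0
    for run, example in zip(runs, examples):
        predicted = run.outputs["class"]
        expected = example.outputs["label"]
        if predicted == POSITIVE and expected == POSITIVE:
            tp += 1
        elif predicted == POSITIVE:
            fp += 1
        elif expected == POSITIVE:
            fn += 1
    return tp, fp, fn


def precision(runs: list, examples: list) -> dict:
    tp, fp, _ = _confusion_counts(runs, examples)
    # Guard against division by zero when nothing was predicted positive.
    return {"key": "precision", "score": tp / (tp + fp) if tp + fp else 0.0}


def recall(runs: list, examples: list) -> dict:
    tp, _, fn = _confusion_counts(runs, examples)
    # Guard against division by zero when there are no positive examples.
    return {"key": "recall", "score": tp / (tp + fn) if tp + fn else 0.0}


# Stand-in data: predictions [1, 1, 1, 0] against labels [1, 0, 0, 1].
runs = [SimpleNamespace(outputs={"class": c}) for c in (1, 1, 1, 0)]
examples = [SimpleNamespace(outputs={"label": l}) for l in (1, 0, 0, 1)]

precision_result = precision(runs, examples)
recall_result = recall(runs, examples)
```

With real experiments, functions of this shape would be passed through the `summary_evaluators` argument alongside the experiment name.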