
add attachments to evaluate #1237

Merged on Dec 10, 2024 (21 commits).
Commits (the diff below shows changes from 3 of the 21 commits):
9bd5969  wip (isahers1, Nov 20, 2024)
a72a268  rip keys (isahers1, Nov 20, 2024)
16e5e69  Merge branch 'isaac/multipartstuff' into isaac/addattachmentsevaluator (isahers1, Dec 2, 2024)
0b6e2c4  Merge branch 'isaac/multipartstuff' into isaac/addattachmentsevaluator (isahers1, Dec 6, 2024)
7307836  changes (isahers1, Dec 6, 2024)
cf53bbe  fmt (isahers1, Dec 7, 2024)
2a87196  Merge branch 'isaac/multipartstuff' into isaac/addattachmentsevaluator (isahers1, Dec 9, 2024)
0cb2118  refactor (isahers1, Dec 9, 2024)
3c92c38  Merge branch 'isaac/multipartstuff' into isaac/addattachmentsevaluator (isahers1, Dec 9, 2024)
a342f86  Merge branch 'isaac/multipartstuff' into isaac/addattachmentsevaluator (isahers1, Dec 9, 2024)
9289225  Merge branch 'isaac/multipartstuff' into isaac/addattachmentsevaluator (isahers1, Dec 9, 2024)
8986216  Merge branch 'isaac/multipartstuff' into isaac/addattachmentsevaluator (isahers1, Dec 9, 2024)
799f69c  fmt (isahers1, Dec 9, 2024)
ac16178  Merge branch 'isaac/multipartstuff' into isaac/addattachmentsevaluator (isahers1, Dec 9, 2024)
bc36039  Merge branch 'isaac/multipartstuff' into isaac/addattachmentsevaluator (isahers1, Dec 9, 2024)
cedd8af  attachment_urls -> attachments (isahers1, Dec 9, 2024)
8cc8ce3  Merge branch 'isaac/multipartstuff' into isaac/addattachmentsevaluator (isahers1, Dec 9, 2024)
14130fa  Merge branch 'isaac/multipartstuff' into isaac/addattachmentsevaluator (isahers1, Dec 9, 2024)
b99cdc4  fmt (isahers1, Dec 10, 2024)
4d27f41  Merge branch 'isaac/multipartstuff' into isaac/addattachmentsevaluator (isahers1, Dec 10, 2024)
24d0159  fmt (isahers1, Dec 10, 2024)
14 changes: 12 additions & 2 deletions python/langsmith/client.py
@@ -1,4 +1,4 @@
"""Client for interacting with the LangSmith API.

GitHub Actions / benchmark (notices on line 1 of python/langsmith/client.py):

Benchmark results: create_5_000_run_trees 718 ms ± 87 ms; create_10_000_run_trees 1.44 s ± 0.21 s; create_20_000_run_trees 1.39 s ± 0.18 s; dumps_class_nested_py_branch_and_leaf_200x400 718 µs ± 60 µs; dumps_class_nested_py_leaf_50x100 24.8 ms ± 0.3 ms; dumps_class_nested_py_leaf_100x200 102 ms ± 0 ms; dumps_dataclass_nested_50x100 25.3 ms ± 0.4 ms; dumps_pydantic_nested_50x100 72.9 ms ± 17.3 ms; dumps_pydanticv1_nested_50x100 201 ms ± 5 ms. (pyperf flagged several results as potentially unstable, with standard deviations of 12% to 24% of the mean, and suggests more runs/values/loops or 'python -m pyperf system tune' to reduce system jitter.)

Comparison against main (geometric mean: 1.01x slower):

Benchmark                                     | main     | changes
dumps_pydanticv1_nested_50x100                | 218 ms   | 201 ms (1.09x faster)
dumps_class_nested_py_leaf_100x200            | 103 ms   | 102 ms (1.02x faster)
dumps_class_nested_py_leaf_50x100             | 24.9 ms  | 24.8 ms (1.00x faster)
dumps_dataclass_nested_50x100                 | 25.3 ms  | 25.3 ms (1.00x slower)
create_5_000_run_trees                        | 707 ms   | 718 ms (1.02x slower)
create_20_000_run_trees                       | 1.37 sec | 1.39 sec (1.02x slower)
dumps_class_nested_py_branch_and_leaf_200x400 | 700 µs   | 718 µs (1.03x slower)
create_10_000_run_trees                       | 1.39 sec | 1.44 sec (1.04x slower)
dumps_pydantic_nested_50x100                  | 65.8 ms  | 72.9 ms (1.11x slower)

Use the client to customize API keys / workspace connections, SSL certs,
etc. for tracing.
@@ -140,6 +140,16 @@
URLLIB3_SUPPORTS_BLOCKSIZE = "key_blocksize" in signature(PoolKey).parameters


class AutoSeekBytesIO(io.BytesIO):
"""BytesIO class that resets on read."""

def read(self, *args, **kwargs):
"""Reset on read."""
data = super().read(*args, **kwargs)
self.seek(0)
return data


def _parse_token_or_url(
url_or_token: Union[str, uuid.UUID],
api_url: str,
@@ -3808,7 +3818,7 @@
for key, value in example["attachment_urls"].items():
response = requests.get(value["presigned_url"], stream=True)
response.raise_for_status()
reader = io.BytesIO(response.content)
reader = AutoSeekBytesIO(response.content)
attachment_urls[key.split(".")[1]] = (
value["presigned_url"],
reader,
@@ -3895,7 +3905,7 @@
for key, value in example["attachment_urls"].items():
response = requests.get(value["presigned_url"], stream=True)
response.raise_for_status()
reader = io.BytesIO(response.content)
reader = AutoSeekBytesIO(response.content)
attachment_urls[key.split(".")[1]] = (
value["presigned_url"],
reader,
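For context, here is a minimal standalone sketch of what AutoSeekBytesIO buys over a plain io.BytesIO: because read() seeks back to position 0, the same attachment reader can be consumed by the target function and then again by each evaluator without anyone rewinding it manually. The class below is a simplified copy for illustration, not an import from langsmith, and the sample bytes are arbitrary.

import io


class AutoSeekBytesIO(io.BytesIO):
    """BytesIO that rewinds to the start after every read."""

    def read(self, *args, **kwargs):
        data = super().read(*args, **kwargs)
        self.seek(0)  # rewind so the next consumer sees the full payload again
        return data


# A plain BytesIO is exhausted after the first read...
plain = io.BytesIO(b"fake image data for testing")
assert plain.read() == b"fake image data for testing"
assert plain.read() == b""  # nothing left for a second consumer

# ...while AutoSeekBytesIO can be read repeatedly, e.g. by the target and then by an evaluator.
auto = AutoSeekBytesIO(b"fake image data for testing")
assert auto.read() == b"fake image data for testing"
assert auto.read() == b"fake image data for testing"
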
11 changes: 10 additions & 1 deletion python/langsmith/evaluation/evaluator.py
@@ -624,7 +624,14 @@ def _normalize_evaluator_func(
Callable[[Run, Optional[Example]], _RUNNABLE_OUTPUT],
Callable[[Run, Optional[Example]], Awaitable[_RUNNABLE_OUTPUT]],
]:
supported_args = ("run", "example", "inputs", "outputs", "reference_outputs")
supported_args = (
"run",
"example",
"inputs",
"outputs",
"reference_outputs",
"attachments",
)
sig = inspect.signature(func)
positional_args = [
pname
@@ -659,6 +666,7 @@ async def awrapper(
"example": example,
"inputs": example.inputs if example else {},
"outputs": run.outputs or {},
"attachments": example.attachment_urls or {},
"reference_outputs": example.outputs or {} if example else {},
}
args = (arg_map[arg] for arg in positional_args)
@@ -679,6 +687,7 @@ def wrapper(run: Run, example: Example) -> _RUNNABLE_OUTPUT:
"example": example,
"inputs": example.inputs if example else {},
"outputs": run.outputs or {},
"attachments": example.attachment_urls or {},
"reference_outputs": example.outputs or {} if example else {},
}
args = (arg_map[arg] for arg in positional_args)
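With "attachments" added to supported_args and to arg_map, an evaluator can be declared directly against the named arguments and will receive the example's attachments alongside inputs and outputs. A hedged sketch of the resulting call shape follows; the evaluator body and the "image" attachment name are illustrative, not part of this PR.

from typing import Any, Dict


def attachment_aware_evaluator(
    outputs: dict, reference_outputs: dict, attachments: dict
) -> Dict[str, Any]:
    # attachments maps name -> (presigned_url, reader), matching what
    # client.py builds from example["attachment_urls"] above.
    _url, reader = attachments["image"]  # "image" is an illustrative key
    image_bytes = reader.read()          # the reader rewinds itself after read()
    return {
        "score": float(outputs.get("answer") == reference_outputs.get("answer")),
        "comment": f"saw {len(image_bytes)} attachment bytes",
    }


# _normalize_evaluator_func inspects the parameter names above and fills them
# from arg_map, so the wrapper effectively calls:
#   attachment_aware_evaluator(
#       outputs=run.outputs or {},
#       reference_outputs=example.outputs or {},
#       attachments=example.attachment_urls or {},
#   )
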
129 changes: 109 additions & 20 deletions python/tests/integration_tests/test_client.py
@@ -20,7 +20,7 @@
from requests_toolbelt import MultipartEncoder, MultipartEncoderMonitor

from langsmith.client import ID_TYPE, Client
from langsmith.evaluation import evaluate
from langsmith.evaluation import aevaluate, evaluate
from langsmith.schemas import (
DataType,
Example,
@@ -1233,9 +1233,6 @@ def create_encoder(*args, **kwargs):
assert not caplog.records


@pytest.mark.skip(
reason="Need to land https://github.com/langchain-ai/langsmith-sdk/pull/1209 first"
)
def test_list_examples_attachments_keys(langchain_client: Client) -> None:
"""Test list_examples returns same keys with and without attachments."""
dataset_name = "__test_list_examples_attachments" + uuid4().hex[:4]
@@ -1271,24 +1268,16 @@ def test_list_examples_attachments_keys(langchain_client: Client) -> None:
langchain_client.delete_dataset(dataset_id=dataset.id)


@pytest.mark.skip(
reason="Need to land https://github.com/langchain-ai/langsmith-sdk/pull/1209 first"
)
def test_evaluate_with_attachments(langchain_client: Client) -> None:
"""Test evaluating examples with attachments."""
dataset_name = "__test_evaluate_attachments" + uuid4().hex[:4]
langchain_client = Client(
api_key="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",
api_url="https://dev.api.smith.langchain.com",
)
# 1. Create dataset

dataset = langchain_client.create_dataset(
dataset_name,
description="Test dataset for evals with attachments",
data_type=DataType.kv,
)

# 2. Create example with attachments
example = ExampleUpsertWithAttachments(
dataset_id=dataset.id,
inputs={"question": "What is shown in the image?"},
@@ -1300,23 +1289,25 @@ def test_evaluate_with_attachments(langchain_client: Client) -> None:

langchain_client.upsert_examples_multipart(upserts=[example])

# 3. Define target function that uses attachments
def target(inputs: Dict[str, Any], attachments: Dict[str, Any]) -> Dict[str, Any]:
# Verify we receive the attachment data
assert "image" in attachments
image_url, image_data = attachments["image"]
assert image_data.read() == b"fake image data for testing"
return {"answer": "test image"}

# 4. Define simple evaluator
def evaluator(run: Run, example: Example) -> Dict[str, Any]:
def evaluator(
outputs: dict, reference_outputs: dict, attachments: dict
) -> Dict[str, Any]:
assert "image" in attachments
image_url, image_data = attachments["image"]
assert image_data.read() == b"fake image data for testing"
return {
"score": float(
run.outputs.get("answer") == example.outputs.get("answer") # type: ignore
reference_outputs.get("answer") == outputs.get("answer") # type: ignore
)
}

# 5. Run evaluation
results = evaluate(
target,
data=dataset_name,
@@ -1325,12 +1316,10 @@ def evaluator(run: Run, example: Example) -> Dict[str, Any]:
num_repetitions=2,
)

# 6. Verify results
assert len(results) == 2
for result in results:
assert result["evaluation_results"]["results"][0].score == 1.0

# Cleanup
langchain_client.delete_dataset(dataset_name=dataset_name)


@@ -1381,6 +1370,106 @@ def evaluator(run: Run, example: Example) -> Dict[str, Any]:

langchain_client.delete_dataset(dataset_name=dataset_name)

async def test_aevaluate_with_attachments(langchain_client: Client) -> None:
"""Test evaluating examples with attachments."""
dataset_name = "__test_aevaluate_attachments" + uuid4().hex[:4]
dataset = langchain_client.create_dataset(
dataset_name,
description="Test dataset for evals with attachments",
data_type=DataType.kv,
)

example = ExampleUpsertWithAttachments(
dataset_id=dataset.id,
inputs={"question": "What is shown in the image?"},
outputs={"answer": "test image"},
attachments={
"image": ("image/png", b"fake image data for testing"),
},
)

langchain_client.upsert_examples_multipart(upserts=[example])

async def target(
inputs: Dict[str, Any], attachments: Dict[str, Any]
) -> Dict[str, Any]:
# Verify we receive the attachment data
assert "image" in attachments
image_url, image_data = attachments["image"]
assert image_data.read() == b"fake image data for testing"
return {"answer": "test image"}

async def evaluator(
outputs: dict, reference_outputs: dict, attachments: dict
) -> Dict[str, Any]:
assert "image" in attachments
image_url, image_data = attachments["image"]
assert image_data.read() == b"fake image data for testing"
return {
"score": float(
reference_outputs.get("answer") == outputs.get("answer") # type: ignore
)
}

results = await aevaluate(
target, data=dataset_name, evaluators=[evaluator], client=langchain_client
)

assert len(results) == 1
async for result in results:
assert result["evaluation_results"]["results"][0].score == 1.0

langchain_client.delete_dataset(dataset_name=dataset_name)


async def test_aevaluate_with_no_attachments(langchain_client: Client) -> None:
"""Test evaluating examples without attachments using a target with attachments."""
dataset_name = "__test_aevaluate_no_attachments" + uuid4().hex[:4]
dataset = langchain_client.create_dataset(
dataset_name,
description="Test dataset for evals without attachments",
data_type=DataType.kv,
)

# Create example using old way, attachments should be set to {}
langchain_client.create_example(
dataset_id=dataset.id,
inputs={"question": "What is 2+2?"},
outputs={"answer": "4"},
)

# Verify we can create example the new way without attachments
example = ExampleUpsertWithAttachments(
dataset_id=dataset.id,
inputs={"question": "What is 3+1?"},
outputs={"answer": "4"},
)
langchain_client.upsert_examples_multipart(upserts=[example])

async def target(
inputs: Dict[str, Any], attachments: Dict[str, Any]
) -> Dict[str, Any]:
# Verify we receive an empty attachments dict
assert isinstance(attachments, dict)
assert len(attachments) == 0
return {"answer": "4"}

async def evaluator(run: Run, example: Example) -> Dict[str, Any]:
return {
"score": float(
run.outputs.get("answer") == example.outputs.get("answer") # type: ignore
)
}

results = await aevaluate(
target, data=dataset_name, evaluators=[evaluator], client=langchain_client
)

assert len(results) == 2
async for result in results:
assert result["evaluation_results"]["results"][0].score == 1.0

langchain_client.delete_dataset(dataset_name=dataset_name)

def test_examples_length_validation(langchain_client: Client) -> None:
"""Test that mismatched lengths raise ValueError for create and update examples."""
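Taken together, the tests above exercise the full flow this PR enables. A condensed, hedged sketch of the end-to-end usage follows; the dataset name and attachment bytes are illustrative, and it assumes a configured LangSmith client plus the ExampleUpsertWithAttachments schema used by the tests.

from typing import Any, Dict

from langsmith import Client
from langsmith.evaluation import evaluate
from langsmith.schemas import DataType, ExampleUpsertWithAttachments

client = Client()
dataset_name = "attachment-eval-demo"  # illustrative name

dataset = client.create_dataset(
    dataset_name,
    description="Demo dataset for attachment-aware evals",
    data_type=DataType.kv,
)
client.upsert_examples_multipart(
    upserts=[
        ExampleUpsertWithAttachments(
            dataset_id=dataset.id,
            inputs={"question": "What is shown in the image?"},
            outputs={"answer": "test image"},
            attachments={"image": ("image/png", b"fake image data for testing")},
        )
    ]
)


def target(inputs: Dict[str, Any], attachments: Dict[str, Any]) -> Dict[str, Any]:
    _url, reader = attachments["image"]  # (presigned_url, reader) tuple
    _ = reader.read()                    # the reader rewinds itself after read()
    return {"answer": "test image"}


def evaluator(
    outputs: dict, reference_outputs: dict, attachments: dict
) -> Dict[str, Any]:
    return {"score": float(outputs.get("answer") == reference_outputs.get("answer"))}


results = evaluate(target, data=dataset_name, evaluators=[evaluator], client=client)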