LangSmith 0.2.x #1247

Merged: 24 commits, Dec 5, 2024
Changes shown from 16 commits
2 changes: 1 addition & 1 deletion js/package.json
@@ -1,6 +1,6 @@
{

GitHub Actions / benchmark: Benchmark results

create_5_000_run_trees: Mean +- std dev: 683 ms +- 114 ms (warning: result may be unstable; std dev is 17% of the mean)
create_10_000_run_trees: Mean +- std dev: 1.32 sec +- 0.16 sec (warning: result may be unstable; std dev is 12% of the mean)
create_20_000_run_trees: Mean +- std dev: 1.41 sec +- 0.18 sec (warning: result may be unstable; std dev is 13% of the mean)
dumps_class_nested_py_branch_and_leaf_200x400: Mean +- std dev: 705 us +- 9 us
dumps_class_nested_py_leaf_50x100: Mean +- std dev: 25.4 ms +- 1.8 ms
dumps_class_nested_py_leaf_100x200: Mean +- std dev: 104 ms +- 2 ms
dumps_dataclass_nested_50x100: Mean +- std dev: 25.2 ms +- 0.3 ms
dumps_pydantic_nested_50x100: Mean +- std dev: 70.3 ms +- 15.6 ms (warning: result may be unstable; std dev is 22% of the mean)
dumps_pydanticv1_nested_50x100: Mean +- std dev: 196 ms +- 3 ms

For the unstable results, pyperf suggests rerunning with more runs, values and/or loops, running `python -m pyperf system tune` to reduce system jitter, using pyperf stats, pyperf dump and pyperf hist to analyze results, and --quiet to hide these warnings.

GitHub Actions / benchmark: Comparison against main

| Benchmark                                     | main     | changes                |
|-----------------------------------------------|----------|------------------------|
| dumps_pydanticv1_nested_50x100                | 223 ms   | 196 ms: 1.14x faster   |
| create_5_000_run_trees                        | 731 ms   | 683 ms: 1.07x faster   |
| create_10_000_run_trees                       | 1.41 sec | 1.32 sec: 1.07x faster |
| dumps_dataclass_nested_50x100                 | 25.6 ms  | 25.2 ms: 1.02x faster  |
| dumps_class_nested_py_leaf_100x200            | 105 ms   | 104 ms: 1.01x faster   |
| dumps_class_nested_py_leaf_50x100             | 25.4 ms  | 25.4 ms: 1.00x slower  |
| dumps_class_nested_py_branch_and_leaf_200x400 | 704 us   | 705 us: 1.00x slower   |
| create_20_000_run_trees                       | 1.41 sec | 1.41 sec: 1.00x slower |
| dumps_pydantic_nested_50x100                  | 68.3 ms  | 70.3 ms: 1.03x slower  |
| Geometric mean                                | (ref)    | 1.03x faster           |
"name": "langsmith",
"version": "0.2.8",
"version": "0.2.9",
"description": "Client library to connect to the LangSmith LLM Tracing and Evaluation Platform.",
"packageManager": "[email protected]",
"files": [
2 changes: 1 addition & 1 deletion js/src/index.ts
@@ -18,4 +18,4 @@ export { RunTree, type RunTreeConfig } from "./run_trees.js";
export { overrideFetchImplementation } from "./singletons/fetch.js";

// Update using yarn bump-version
export const __version__ = "0.2.8";
export const __version__ = "0.2.9";
16 changes: 9 additions & 7 deletions python/langsmith/client.py
@@ -5825,7 +5825,7 @@ def evaluate(
metadata: Optional[dict] = None,
experiment_prefix: Optional[str] = None,
description: Optional[str] = None,
max_concurrency: Optional[int] = None,
max_concurrency: Optional[int] = 0,
num_repetitions: int = 1,
blocking: bool = True,
experiment: Optional[EXPERIMENT_T] = None,
@@ -5844,7 +5844,7 @@ def evaluate(
metadata: Optional[dict] = None,
experiment_prefix: Optional[str] = None,
description: Optional[str] = None,
max_concurrency: Optional[int] = None,
max_concurrency: Optional[int] = 0,
num_repetitions: int = 1,
blocking: bool = True,
experiment: Optional[EXPERIMENT_T] = None,
@@ -5866,7 +5866,7 @@ def evaluate(
metadata: Optional[dict] = None,
experiment_prefix: Optional[str] = None,
description: Optional[str] = None,
max_concurrency: Optional[int] = None,
max_concurrency: Optional[int] = 0,
num_repetitions: int = 1,
blocking: bool = True,
experiment: Optional[EXPERIMENT_T] = None,
@@ -5894,7 +5894,8 @@ def evaluate(
Defaults to None.
description (str | None): A free-form text description for the experiment.
max_concurrency (int | None): The maximum number of concurrent
evaluations to run. Defaults to None (max number of workers).
evaluations to run. If None then no limit is set. If 0 then no concurrency.
Defaults to 0.
blocking (bool): Whether to block until the evaluation is complete.
Defaults to True.
num_repetitions (int): The number of times to run the evaluation.
@@ -6077,7 +6078,7 @@ async def aevaluate(
metadata: Optional[dict] = None,
experiment_prefix: Optional[str] = None,
description: Optional[str] = None,
max_concurrency: Optional[int] = None,
max_concurrency: Optional[int] = 0,
num_repetitions: int = 1,
blocking: bool = True,
experiment: Optional[Union[schemas.TracerSession, str, uuid.UUID]] = None,
@@ -6102,8 +6103,9 @@ async def aevaluate(
experiment_prefix (Optional[str]): A prefix to provide for your experiment name.
Defaults to None.
description (Optional[str]): A description of the experiment.
max_concurrency (Optional[int]): The maximum number of concurrent
evaluations to run. Defaults to None.
max_concurrency (int | None): The maximum number of concurrent
evaluations to run. If None then no limit is set. If 0 then no concurrency.
Defaults to 0.
num_repetitions (int): The number of times to run the evaluation.
Each item in the dataset will be run and evaluated this many times.
Defaults to 1.
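A minimal usage sketch of the new `max_concurrency` default on `Client.evaluate` (the dataset name and target below are hypothetical; the evaluator shape mirrors this PR's unit tests):

```python
from langsmith import Client

client = Client()

def target(inputs: dict) -> dict:
    # Hypothetical target function; swap in your own chain or model call.
    return {"answer": "42"}

def score_value(run, example):
    # Evaluator shape as used in this PR's unit tests.
    return {"score": 0.7}

# New 0.2.x default (max_concurrency=0): evaluations run with no concurrency.
results = client.evaluate(target, data="my-dataset", evaluators=[score_value])

# Pass None to remove the limit, or a positive integer to cap concurrency.
results = client.evaluate(
    target,
    data="my-dataset",
    evaluators=[score_value],
    max_concurrency=4,
)
```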
13 changes: 8 additions & 5 deletions python/langsmith/evaluation/_arunner.py
@@ -84,7 +84,7 @@ async def aevaluate(
metadata: Optional[dict] = None,
experiment_prefix: Optional[str] = None,
description: Optional[str] = None,
max_concurrency: Optional[int] = None,
max_concurrency: Optional[int] = 0,
num_repetitions: int = 1,
client: Optional[langsmith.Client] = None,
blocking: bool = True,
@@ -110,8 +110,9 @@
experiment_prefix (Optional[str]): A prefix to provide for your experiment name.
Defaults to None.
description (Optional[str]): A description of the experiment.
max_concurrency (Optional[int]): The maximum number of concurrent
evaluations to run. Defaults to None.
max_concurrency (int | None): The maximum number of concurrent
evaluations to run. If None then no limit is set. If 0 then no concurrency.
Defaults to 0.
num_repetitions (int): The number of times to run the evaluation.
Each item in the dataset will be run and evaluated this many times.
Defaults to 1.
@@ -332,7 +333,7 @@ async def aevaluate_existing(
evaluators: Optional[Sequence[Union[EVALUATOR_T, AEVALUATOR_T]]] = None,
summary_evaluators: Optional[Sequence[SUMMARY_EVALUATOR_T]] = None,
metadata: Optional[dict] = None,
max_concurrency: Optional[int] = None,
max_concurrency: Optional[int] = 0,
client: Optional[langsmith.Client] = None,
load_nested: bool = False,
blocking: bool = True,
@@ -345,7 +346,9 @@
summary_evaluators (Optional[Sequence[SUMMARY_EVALUATOR_T]]): Optional sequence of evaluators
to apply over the entire dataset.
metadata (Optional[dict]): Optional metadata to include in the evaluation results.
max_concurrency (Optional[int]): Optional maximum number of concurrent evaluations.
max_concurrency (int | None): The maximum number of concurrent
evaluations to run. If None then no limit is set. If 0 then no concurrency.
Defaults to 0.
client (Optional[langsmith.Client]): Optional Langsmith client to use for evaluation.
load_nested: Whether to load all child runs for the experiment.
Default is to only load the top-level root runs.
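The async entry point follows the same convention; a short sketch under the same assumptions (hypothetical dataset name and target):

```python
import asyncio

from langsmith import aevaluate

async def predict(inputs: dict) -> dict:
    # Hypothetical async target.
    return {"answer": "42"}

def score_value(run, example):
    return {"score": 0.7}

async def main():
    # Default max_concurrency=0: evaluations run with no concurrency.
    await aevaluate(predict, data="my-dataset", evaluators=[score_value])
    # max_concurrency=None removes the limit (the previous default behaviour).
    await aevaluate(
        predict, data="my-dataset", evaluators=[score_value], max_concurrency=None
    )

asyncio.run(main())
```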
33 changes: 23 additions & 10 deletions python/langsmith/evaluation/_runner.py
@@ -101,7 +101,7 @@ def evaluate(
metadata: Optional[dict] = None,
experiment_prefix: Optional[str] = None,
description: Optional[str] = None,
max_concurrency: Optional[int] = None,
max_concurrency: Optional[int] = 0,
num_repetitions: int = 1,
client: Optional[langsmith.Client] = None,
blocking: bool = True,
@@ -121,7 +121,7 @@ def evaluate(
metadata: Optional[dict] = None,
experiment_prefix: Optional[str] = None,
description: Optional[str] = None,
max_concurrency: Optional[int] = None,
max_concurrency: Optional[int] = 0,
num_repetitions: int = 1,
client: Optional[langsmith.Client] = None,
blocking: bool = True,
@@ -142,7 +142,7 @@ def evaluate(
metadata: Optional[dict] = None,
experiment_prefix: Optional[str] = None,
description: Optional[str] = None,
max_concurrency: Optional[int] = None,
max_concurrency: Optional[int] = 0,
num_repetitions: int = 1,
client: Optional[langsmith.Client] = None,
blocking: bool = True,
@@ -171,7 +171,8 @@ def evaluate(
Defaults to None.
description (str | None): A free-form text description for the experiment.
max_concurrency (int | None): The maximum number of concurrent
evaluations to run. Defaults to None (max number of workers).
evaluations to run. If None then no limit is set. If 0 then no concurrency.
Defaults to 0.
client (langsmith.Client | None): The LangSmith client to use.
Defaults to None.
blocking (bool): Whether to block until the evaluation is complete.
@@ -440,7 +441,7 @@ def evaluate_existing(
evaluators: Optional[Sequence[EVALUATOR_T]] = None,
summary_evaluators: Optional[Sequence[SUMMARY_EVALUATOR_T]] = None,
metadata: Optional[dict] = None,
max_concurrency: Optional[int] = None,
max_concurrency: Optional[int] = 0,
client: Optional[langsmith.Client] = None,
load_nested: bool = False,
blocking: bool = True,
@@ -454,7 +455,9 @@
summary_evaluators (Optional[Sequence[SUMMARY_EVALUATOR_T]]): Optional sequence of evaluators
to apply over the entire dataset.
metadata (Optional[dict]): Optional metadata to include in the evaluation results.
max_concurrency (Optional[int]): Optional maximum number of concurrent evaluations.
max_concurrency (int | None): The maximum number of concurrent
evaluations to run. If None then no limit is set. If 0 then no concurrency.
Defaults to 0.
client (Optional[langsmith.Client]): Optional Langsmith client to use for evaluation.
load_nested: Whether to load all child runs for the experiment.
Default is to only load the top-level root runs.
@@ -1597,7 +1600,7 @@ def _score(
(e.g. from a previous prediction step)
"""
with ls_utils.ContextThreadPoolExecutor(
max_workers=max_concurrency
max_workers=max_concurrency or 1
) as executor:
if max_concurrency == 0:
context = copy_context()
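For context, a simplified sketch of the dispatch pattern this hunk implements (not the library's exact code), using stdlib primitives in place of `ls_utils.ContextThreadPoolExecutor`:

```python
from concurrent.futures import ThreadPoolExecutor
from contextvars import copy_context

def run_all(tasks, max_concurrency):
    # `max_concurrency or 1` avoids an invalid max_workers=0 when no
    # concurrency is requested; the pool goes unused in that case.
    with ThreadPoolExecutor(max_workers=max_concurrency or 1) as executor:
        if max_concurrency == 0:
            # No concurrency: run each task inline in a copy of the current
            # context so context variables (e.g. tracing state) carry over.
            context = copy_context()
            return [context.run(task) for task in tasks]
        # Otherwise fan the tasks out to the worker pool.
        futures = [executor.submit(task) for task in tasks]
        return [f.result() for f in futures]
```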
@@ -1815,14 +1818,24 @@ def _get_run(r: rt.RunTree) -> None:
return _ForwardResults(run=cast(schemas.Run, run), example=example)


def _is_valid_uuid(value: str) -> bool:
try:
uuid.UUID(value)
return True
except ValueError:
return False


def _resolve_data(
data: DATA_T, *, client: langsmith.Client
) -> Iterable[schemas.Example]:
"""Return the examples for the given dataset."""
if isinstance(data, str):
return client.list_examples(dataset_name=data)
elif isinstance(data, uuid.UUID):
if isinstance(data, uuid.UUID):
return client.list_examples(dataset_id=data)
elif isinstance(data, str) and _is_valid_uuid(data):
return client.list_examples(dataset_id=uuid.UUID(data))
elif isinstance(data, str):
return client.list_examples(dataset_name=data)
elif isinstance(data, schemas.Dataset):
return client.list_examples(dataset_id=data.id)
return data
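With the reordered checks and the new `_is_valid_uuid` helper, `_resolve_data` treats a UUID-formatted string as a dataset ID rather than a dataset name. A sketch of the three equivalent ways to reference a dataset (name and ID below are hypothetical):

```python
import uuid

from langsmith import Client, evaluate

client = Client()
dataset_id = uuid.UUID("12345678-1234-5678-1234-567812345678")  # hypothetical ID

def target(inputs: dict) -> dict:
    return {"answer": "42"}  # hypothetical target

evaluate(target, data="my-dataset", client=client)      # resolved by dataset name
evaluate(target, data=dataset_id, client=client)        # resolved by dataset ID
evaluate(target, data=str(dataset_id), client=client)   # UUID string: now also resolved by ID
```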
9 changes: 3 additions & 6 deletions python/poetry.lock

Some generated files are not rendered by default.

4 changes: 2 additions & 2 deletions python/pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "langsmith"
version = "0.1.147"
version = "0.2.0"
description = "Client library to connect to the LangSmith LLM Tracing and Evaluation Platform."
authors = ["LangChain <[email protected]>"]
license = "MIT"
@@ -25,7 +25,7 @@ packages = [{ include = "langsmith" }]
langsmith = "langsmith.cli.main:main"

[tool.poetry.dependencies]
python = ">=3.8.1,<4.0"
python = ">=3.9,<4.0"
pydantic = [
{ version = ">=1,<3", python = "<3.12.4" },
{ version = "^2.7.4", python = ">=3.12.4" },
2 changes: 1 addition & 1 deletion python/tests/evaluation/test_evaluation.py
@@ -474,7 +474,7 @@ async def predict(inputs: dict):
data=ds_name,
)

with pytest.raises(ValueError, match=match_val):
with pytest.raises(ValueError, match="Must specify 'data'"):
await aevaluate(
predict,
data=[],
11 changes: 5 additions & 6 deletions python/tests/unit_tests/evaluation/test_runner.py
@@ -16,11 +16,8 @@

import pytest

from langsmith import evaluate
from langsmith import Client, aevaluate, evaluate
from langsmith import schemas as ls_schemas
from langsmith.client import Client
from langsmith.evaluation._arunner import aevaluate, aevaluate_existing
from langsmith.evaluation._runner import evaluate_existing
from langsmith.evaluation.evaluator import (
_normalize_comparison_evaluator_func,
_normalize_evaluator_func,
@@ -276,6 +273,7 @@ def summary_eval_outputs_reference(outputs, reference_outputs):
num_repetitions=NUM_REPETITIONS,
blocking=blocking,
upload_results=upload_results,
max_concurrency=None,
)
if not blocking:
deltas = []
@@ -327,7 +325,7 @@ def summary_eval_outputs_reference(outputs, reference_outputs):
def score_value(run, example):
return {"score": 0.7}

ex_results = evaluate_existing(
ex_results = evaluate(
fake_request.created_session["name"],
evaluators=[score_value],
client=client,
@@ -549,6 +547,7 @@ def summary_eval_outputs_reference(outputs, reference_outputs):
num_repetitions=NUM_REPETITIONS,
blocking=blocking,
upload_results=upload_results,
max_concurrency=None,
)
if not blocking:
deltas = []
Expand Down Expand Up @@ -606,7 +605,7 @@ async def score_value(run, example):
return {"score": 0.7}

if upload_results:
ex_results = await aevaluate_existing(
ex_results = await aevaluate(
fake_request.created_session["name"],
evaluators=[score_value],
client=client,
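As exercised in the updated tests above, `evaluate` and `aevaluate` can now be pointed directly at an existing experiment instead of going through `evaluate_existing`/`aevaluate_existing`; a short sketch with a hypothetical experiment name:

```python
from langsmith import Client, evaluate

client = Client()

def score_value(run, example):
    return {"score": 0.7}

# Re-score an already-created experiment by passing its name (or ID) as the target.
results = evaluate(
    "my-existing-experiment",  # hypothetical experiment name
    evaluators=[score_value],
    client=client,
)
```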