fix: Add local & enforced tracing modes for comparative evaluation #1337

Open · wants to merge 1 commit into base: main
66 changes: 41 additions & 25 deletions python/langsmith/evaluation/_runner.py
@@ -1,4 +1,4 @@
"""V2 Evaluation Interface."""

GitHub Actions / benchmark (check notice on line 1 in python/langsmith/evaluation/_runner.py)

Benchmark results (mean +- std dev):

create_5_000_run_trees:                        680 ms +- 107 ms   (unstable: std dev is 16% of the mean)
create_10_000_run_trees:                       1.47 sec +- 0.19 sec (unstable: std dev is 13% of the mean)
create_20_000_run_trees:                       1.42 sec +- 0.18 sec (unstable: std dev is 13% of the mean)
dumps_class_nested_py_branch_and_leaf_200x400: 685 us +- 7 us
dumps_class_nested_py_leaf_50x100:             24.9 ms +- 0.2 ms
dumps_class_nested_py_leaf_100x200:            104 ms +- 3 ms
dumps_dataclass_nested_50x100:                 25.2 ms +- 0.2 ms
dumps_pydantic_nested_50x100:                  71.8 ms +- 16.6 ms  (unstable: std dev is 23% of the mean)
dumps_pydanticv1_nested_50x100:                198 ms +- 3 ms

(For the unstable results, pyperf suggests rerunning with more runs/values/loops and running 'python -m pyperf system tune' to reduce system jitter.)

GitHub Actions / benchmark (check notice on line 1 in python/langsmith/evaluation/_runner.py)

Comparison against main:

+-----------------------------------------------+----------+------------------------+
| Benchmark                                     | main     | changes                |
+===============================================+==========+========================+
| dumps_pydanticv1_nested_50x100                | 217 ms   | 198 ms: 1.09x faster   |
+-----------------------------------------------+----------+------------------------+
| create_5_000_run_trees                        | 721 ms   | 680 ms: 1.06x faster   |
+-----------------------------------------------+----------+------------------------+
| dumps_class_nested_py_branch_and_leaf_200x400 | 695 us   | 685 us: 1.01x faster   |
+-----------------------------------------------+----------+------------------------+
| dumps_class_nested_py_leaf_50x100             | 24.8 ms  | 24.9 ms: 1.00x slower  |
+-----------------------------------------------+----------+------------------------+
| dumps_dataclass_nested_50x100                 | 25.2 ms  | 25.2 ms: 1.00x slower  |
+-----------------------------------------------+----------+------------------------+
| dumps_class_nested_py_leaf_100x200            | 103 ms   | 104 ms: 1.01x slower   |
+-----------------------------------------------+----------+------------------------+
| create_20_000_run_trees                       | 1.36 sec | 1.42 sec: 1.04x slower |
+-----------------------------------------------+----------+------------------------+
| create_10_000_run_trees                       | 1.37 sec | 1.47 sec: 1.07x slower |
+-----------------------------------------------+----------+------------------------+
| dumps_pydantic_nested_50x100                  | 64.9 ms  | 71.8 ms: 1.11x slower  |
+-----------------------------------------------+----------+------------------------+
| Geometric mean                                | (ref)    | 1.01x slower           |
+-----------------------------------------------+----------+------------------------+

from __future__ import annotations

@@ -651,6 +651,7 @@
metadata: Optional[dict] = None,
load_nested: bool = False,
randomize_order: bool = False,
upload_results: bool = True,
) -> ComparativeExperimentResults:
r"""Evaluate existing experiment runs against each other.

@@ -675,6 +676,8 @@
Default is to only load the top-level root runs.
randomize_order (bool): Whether to randomize the order of the outputs for each evaluation.
Default is False.
upload_results (bool): Whether to upload the results to LangSmith.
Default is True.

Returns:
ComparativeExperimentResults: The results of the comparative evaluation.
@@ -910,6 +913,7 @@

comparators = [comparison_evaluator(evaluator) for evaluator in evaluators or []]
results: dict = {}
tracing_mode = "local" if not upload_results else True

def evaluate_and_submit_feedback(
runs_list: list[schemas.Run],
@@ -920,10 +924,26 @@
feedback_group_id = uuid.uuid4()
if randomize_order:
random.shuffle(runs_list)
with rh.tracing_context(project_name="evaluators", client=client):
current_context = rh.get_tracing_context()
metadata = (current_context["metadata"] or {}) | {
"experiment": comparative_experiment.name,
"experiment_id": comparative_experiment.id,
"reference_example_id": example.id,
"reference_run_ids": [r.id for r in runs_list],
}
with rh.tracing_context(
**(
current_context
| {
"project_name": "evaluators",
"metadata": metadata,
"enabled": tracing_mode,
"client": client,
}
)
):
result = comparator.compare_runs(runs_list, example)
if client is None:
raise ValueError("Client is required to submit feedback.")

comments = (
{str(rid): result.comment for rid in result.scores}
if isinstance(result.comment, str)
@@ -1548,22 +1568,21 @@
executor: cf.ThreadPoolExecutor,
) -> ExperimentResultRow:
current_context = rh.get_tracing_context()
metadata = {
**(current_context["metadata"] or {}),
**{
"experiment": self.experiment_name,
"reference_example_id": current_results["example"].id,
"reference_run_id": current_results["run"].id,
},
metadata = (current_context["metadata"] or {}) | {
"experiment": self.experiment_name,
"reference_example_id": current_results["example"].id,
"reference_run_id": current_results["run"].id,
}
with rh.tracing_context(
**{
**current_context,
"project_name": "evaluators",
"metadata": metadata,
"enabled": "local" if not self._upload_results else True,
"client": self.client,
}
**(
current_context
| {
"project_name": "evaluators",
"metadata": metadata,
"enabled": "local" if not self._upload_results else True,
"client": self.client,
}
)
):
run = current_results["run"]
example = current_results["example"]
@@ -1681,16 +1700,13 @@
with ls_utils.ContextThreadPoolExecutor() as executor:
project_id = self._get_experiment().id if self._upload_results else None
current_context = rh.get_tracing_context()
metadata = {
**(current_context["metadata"] or {}),
**{
"experiment": self.experiment_name,
"experiment_id": project_id,
},
metadata = (current_context["metadata"] or {}) | {
"experiment": self.experiment_name,
"experiment_id": project_id,
}
with rh.tracing_context(
**{
**current_context,
current_context
| {
"project_name": "evaluators",
"metadata": metadata,
"client": self.client,
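For context, a minimal usage sketch of what this change enables: running a comparative evaluation whose evaluator traces stay local rather than being uploaded. This assumes the new `upload_results` flag on `evaluate_comparative` lands as shown in the diff; the experiment names and the toy pairwise evaluator below are hypothetical, and the evaluator's return shape follows the comparative-evaluator pattern this module uses but should be treated as an assumption rather than a spec.

```python
# Hypothetical sketch, not part of this PR's diff.
from langsmith.evaluation import evaluate_comparative


def prefer_longer_output(runs, example):
    # Toy pairwise evaluator: scores each run by the length of its output.
    # A real evaluator would judge answer quality; this is illustrative only.
    return {
        "key": "preference",
        "scores": {str(run.id): float(len(str(run.outputs or {}))) for run in runs},
    }


results = evaluate_comparative(
    ["my-experiment-a", "my-experiment-b"],  # hypothetical experiment names
    evaluators=[prefer_longer_output],
    randomize_order=True,
    upload_results=False,  # new flag in this PR: evaluator tracing runs in "local" mode
)
```

With `upload_results=False`, the comparator calls are wrapped in `rh.tracing_context(..., enabled="local")` per the diff, so the intent appears to be that scores are still computed and returned while evaluator traces are kept local instead of being written to LangSmith.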