LangSmith 0.2.x #1247

Merged · 24 commits · Dec 5, 2024
2 changes: 1 addition & 1 deletion js/package.json
@@ -1,6 +1,6 @@
{

GitHub Actions / benchmark: Benchmark results (annotation on js/package.json, line 1)

create_5_000_run_trees:                        Mean +- std dev: 700 ms +- 76 ms      (WARNING: may be unstable; std dev 76.2 ms is 11% of the mean)
create_10_000_run_trees:                       Mean +- std dev: 1.45 sec +- 0.20 sec (WARNING: may be unstable; std dev 197 ms is 14% of the mean)
create_20_000_run_trees:                       Mean +- std dev: 1.46 sec +- 0.15 sec (WARNING: may be unstable; std dev 146 ms is 10% of the mean)
dumps_class_nested_py_branch_and_leaf_200x400: Mean +- std dev: 701 us +- 7 us
dumps_class_nested_py_leaf_50x100:             Mean +- std dev: 25.2 ms +- 0.5 ms
dumps_class_nested_py_leaf_100x200:            Mean +- std dev: 104 ms +- 3 ms
dumps_dataclass_nested_50x100:                 Mean +- std dev: 25.2 ms +- 0.2 ms
dumps_pydantic_nested_50x100:                  Mean +- std dev: 72.4 ms +- 17.4 ms   (WARNING: may be unstable; std dev 17.4 ms is 24% of the mean)
dumps_pydanticv1_nested_50x100:                Mean +- std dev: 200 ms +- 3 ms

For the unstable results, pyperf suggests rerunning with more runs, values and/or loops, running 'python -m pyperf system tune' to reduce system jitter, using pyperf stats, pyperf dump and pyperf hist to analyze results, and passing --quiet to hide these warnings.

GitHub Actions / benchmark: Comparison against main (annotation on js/package.json, line 1)

+-----------------------------------------------+----------+------------------------+
| Benchmark                                     | main     | changes                |
+===============================================+==========+========================+
| dumps_pydanticv1_nested_50x100                | 218 ms   | 200 ms: 1.09x faster   |
+-----------------------------------------------+----------+------------------------+
| create_5_000_run_trees                        | 712 ms   | 700 ms: 1.02x faster   |
+-----------------------------------------------+----------+------------------------+
| dumps_class_nested_py_branch_and_leaf_200x400 | 702 us   | 701 us: 1.00x faster   |
+-----------------------------------------------+----------+------------------------+
| dumps_dataclass_nested_50x100                 | 25.2 ms  | 25.2 ms: 1.00x faster  |
+-----------------------------------------------+----------+------------------------+
| dumps_class_nested_py_leaf_100x200            | 104 ms   | 104 ms: 1.00x slower   |
+-----------------------------------------------+----------+------------------------+
| dumps_class_nested_py_leaf_50x100             | 25.2 ms  | 25.2 ms: 1.00x slower  |
+-----------------------------------------------+----------+------------------------+
| create_10_000_run_trees                       | 1.38 sec | 1.45 sec: 1.05x slower |
+-----------------------------------------------+----------+------------------------+
| create_20_000_run_trees                       | 1.37 sec | 1.46 sec: 1.06x slower |
+-----------------------------------------------+----------+------------------------+
| dumps_pydantic_nested_50x100                  | 64.8 ms  | 72.4 ms: 1.12x slower  |
+-----------------------------------------------+----------+------------------------+
| Geometric mean                                | (ref)    | 1.01x slower           |
+-----------------------------------------------+----------+------------------------+
"name": "langsmith",
"version": "0.2.8",
"version": "0.2.9",
"description": "Client library to connect to the LangSmith LLM Tracing and Evaluation Platform.",
"packageManager": "[email protected]",
"files": [
70 changes: 39 additions & 31 deletions js/src/evaluation/_runner.ts
@@ -58,17 +58,17 @@ export type SummaryEvaluatorT =
| DeprecatedSyncSummaryEvaluator
| DeprecatedAsyncSummaryEvaluator
| ((args: {
runs?: Array<Run>;
examples?: Array<Example>;
inputs?: Array<Record<string, any>>;
outputs?: Array<Record<string, any>>;
runs: Array<Run>;
examples: Array<Example>;
inputs: Array<Record<string, any>>;
outputs: Array<Record<string, any>>;
referenceOutputs?: Array<Record<string, any>>;
}) => EvaluationResult | EvaluationResults)
| ((args: {
runs?: Array<Run>;
examples?: Array<Example>;
inputs?: Array<Record<string, any>>;
outputs?: Array<Record<string, any>>;
runs: Array<Run>;
examples: Array<Example>;
inputs: Array<Record<string, any>>;
outputs: Array<Record<string, any>>;
referenceOutputs?: Array<Record<string, any>>;
}) => Promise<EvaluationResult | EvaluationResults>);

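For illustration, a minimal summary evaluator written against the tightened signature (the four array arguments are now required; only `referenceOutputs` stays optional) might look like the sketch below. The `exactMatchRate` name, the scoring logic, and the import paths are assumptions, not part of this PR.

```ts
import type { Run, Example } from "langsmith/schemas";
import type { EvaluationResult } from "langsmith/evaluation";

// Hypothetical summary evaluator: reports the fraction of runs whose outputs
// exactly match the corresponding reference outputs.
function exactMatchRate(args: {
  runs: Run[];
  examples: Example[];
  inputs: Record<string, any>[];
  outputs: Record<string, any>[];
  referenceOutputs?: Record<string, any>[];
}): EvaluationResult {
  const matches = args.outputs.filter(
    (output, i) =>
      JSON.stringify(output) === JSON.stringify(args.referenceOutputs?.[i])
  ).length;
  return { key: "exact_match_rate", score: matches / args.outputs.length };
}
```

Such a function would then be passed via `summaryEvaluators: [exactMatchRate]` in the evaluate options.
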
@@ -93,17 +93,17 @@ export type EvaluatorT =
| DeprecatedFunctionEvaluator
| DeprecatedAsyncFunctionEvaluator
| ((args: {
run?: Run;
example?: Example;
inputs?: Record<string, any>;
outputs?: Record<string, any>;
run: Run;
example: Example;
inputs: Record<string, any>;
outputs: Record<string, any>;
referenceOutputs?: Record<string, any>;
}) => EvaluationResult | EvaluationResults)
| ((args: {
run?: Run;
example?: Example;
inputs?: Record<string, any>;
outputs?: Record<string, any>;
run: Run;
example: Example;
inputs: Record<string, any>;
outputs: Record<string, any>;
referenceOutputs?: Record<string, any>;
}) => Promise<EvaluationResult | EvaluationResults>);

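A corresponding sketch of a row-level evaluator under the new typing, assuming an `answer` field in the outputs; because `run`, `example`, `inputs`, and `outputs` are now required, the defensive `if (!run || !example)` guards seen in older examples are no longer needed.

```ts
import type { Run, Example } from "langsmith/schemas";
import type { EvaluationResult } from "langsmith/evaluation";

// Hypothetical row-level evaluator comparing one run's output to the
// example's reference output.
const correctness = (args: {
  run: Run;
  example: Example;
  inputs: Record<string, any>;
  outputs: Record<string, any>;
  referenceOutputs?: Record<string, any>;
}): EvaluationResult => ({
  key: "correctness",
  score: args.outputs.answer === args.referenceOutputs?.answer ? 1 : 0,
});
```
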
@@ -130,11 +130,6 @@ interface _ExperimentManagerArgs {
}

type BaseEvaluateOptions = {
/**
* The dataset to evaluate on. Can be a dataset name, a list of
* examples, or a generator of examples.
*/
data: DataT;
/**
* Metadata to attach to the experiment.
* @default undefined
@@ -178,6 +173,11 @@ export interface EvaluateOptions extends BaseEvaluateOptions {
* @default undefined
*/
summaryEvaluators?: Array<SummaryEvaluatorT>;
/**
* The dataset to evaluate on. Can be a dataset name, a list of
* examples, or a generator of examples.
*/
data: DataT;
}

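Since `data` now lives on `EvaluateOptions` rather than the shared base options, a standard (non-comparative) evaluation still passes it directly; it can be a dataset name or an `Example[]`, which is what the `Array.isArray` branch in `_evaluate` below distinguishes. A hedged sketch, reusing the evaluator sketches above (the dataset name and target are placeholders):

```ts
import { evaluate } from "langsmith/evaluation";

// Hypothetical target function for illustration only.
const target = async (inputs: Record<string, any>) => ({
  answer: `echo: ${inputs.question}`,
});

await evaluate(target, {
  data: "my-qa-dataset", // placeholder dataset name; an Example[] also works
  evaluators: [correctness], // row-level sketch above
  summaryEvaluators: [exactMatchRate], // summary sketch above
  experimentPrefix: "langsmith-0.2-sketch",
});
```
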
export interface ComparativeEvaluateOptions extends BaseEvaluateOptions {
@@ -934,8 +934,10 @@ async function _evaluate(
);

let manager = await new _ExperimentManager({
data: Array.isArray(fields.data) ? undefined : fields.data,
examples: Array.isArray(fields.data) ? fields.data : undefined,
data: Array.isArray(standardFields.data) ? undefined : standardFields.data,
examples: Array.isArray(standardFields.data)
? standardFields.data
: undefined,
client,
metadata: fields.metadata,
experiment: experiment_ ?? fields.experimentPrefix,
@@ -1063,10 +1065,12 @@ function _resolveData(
async function wrapSummaryEvaluators(
evaluators: SummaryEvaluatorT[],
optionsArray?: Partial<RunTreeConfig>[]
): Promise<SummaryEvaluatorT[]> {
): Promise<
Array<DeprecatedAsyncSummaryEvaluator | DeprecatedSyncSummaryEvaluator>
> {
async function _wrap(
evaluator: SummaryEvaluatorT
): Promise<SummaryEvaluatorT> {
): Promise<DeprecatedAsyncSummaryEvaluator | DeprecatedSyncSummaryEvaluator> {
const evalName = evaluator.name || "BatchEvaluator";

const wrapperInner = (
@@ -1087,10 +1091,10 @@ async function wrapSummaryEvaluators(
return Promise.resolve(
(
evaluator as (args: {
runs?: Run[];
examples?: Example[];
inputs?: Record<string, any>[];
outputs?: Record<string, any>[];
runs: Run[];
examples: Example[];
inputs: Record<string, any>[];
outputs: Record<string, any>[];
referenceOutputs?: Record<string, any>[];
}) => EvaluationResult | EvaluationResults
)({
@@ -1103,7 +1107,9 @@ async function wrapSummaryEvaluators(
);
}
// Otherwise use the traditional (runs, examples) signature
return Promise.resolve(evaluator(runs, examples));
return Promise.resolve(
(evaluator as DeprecatedSyncSummaryEvaluator)(runs, examples)
);
},
{ ...optionsArray, name: evalName }
);
@@ -1119,7 +1125,9 @@ async function wrapSummaryEvaluators(
return wrapperInner;
}

const results: SummaryEvaluatorT[] = [];
const results: Array<
DeprecatedAsyncSummaryEvaluator | DeprecatedSyncSummaryEvaluator
> = [];
for (let i = 0; i < evaluators.length; i++) {
results.push(await _wrap(evaluators[i]));
}
2 changes: 1 addition & 1 deletion js/src/evaluation/evaluate_comparative.ts
@@ -79,7 +79,7 @@ export type _ComparativeEvaluator = (args: {
runs: Run[];
example: Example;
inputs: Record<string, any>;
outputs?: Record<string, any>[];
outputs: Record<string, any>[];
referenceOutputs?: Record<string, any>;
}) => ComparisonEvaluationResultRow | Promise<ComparisonEvaluationResultRow>;

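With `outputs` now required on `_ComparativeEvaluator`, a comparative evaluator can index into the per-experiment outputs without a null check. A minimal sketch, mirroring the shape used in the integration test later in this diff (the name, the `answer` field, and the assumption that `outputs` is ordered like `runs` are illustrative only):

```ts
import type { Run, Example } from "langsmith/schemas";

// Hypothetical comparative evaluator: `outputs` holds one output record per
// compared run (order assumed to match `runs`).
const preferExactMatch = (args: {
  runs: Run[];
  example: Example;
  inputs: Record<string, any>;
  outputs: Record<string, any>[];
  referenceOutputs?: Record<string, any>;
}) => ({
  key: "prefer_exact_match",
  scores: Object.fromEntries(
    args.runs.map((run, i) => [
      run.id,
      args.outputs[i]?.answer === args.referenceOutputs?.answer ? 1 : 0,
    ])
  ),
});
```
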
21 changes: 13 additions & 8 deletions js/src/evaluation/evaluator.ts
@@ -96,18 +96,23 @@ export type RunEvaluatorLike =
example?: Example
) => Promise<EvaluationResult | EvaluationResults>)
| ((run: Run, example?: Example) => EvaluationResult | EvaluationResults)
| ((
run: Run,
example: Example
) => Promise<EvaluationResult | EvaluationResults>)
| ((run: Run, example: Example) => EvaluationResult | EvaluationResults)
| ((args: {
run?: Run;
example?: Example;
inputs?: Record<string, any>;
outputs?: Record<string, any>;
run: Run;
example: Example;
inputs: Record<string, any>;
outputs: Record<string, any>;
referenceOutputs?: Record<string, any>;
}) => EvaluationResult | EvaluationResults)
| ((args: {
run?: Run;
example?: Example;
inputs?: Record<string, any>;
outputs?: Record<string, any>;
run: Run;
example: Example;
inputs: Record<string, any>;
outputs: Record<string, any>;
referenceOutputs?: Record<string, any>;
}) => Promise<EvaluationResult | EvaluationResults>);

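The newly added overloads also cover the positional `(run, example)` form with a non-optional `example`. A brief sketch (names and fields assumed):

```ts
import type { Run, Example } from "langsmith/schemas";

// Hypothetical positional-style evaluator; `example` is required under the
// new overload, so no undefined check is needed.
const matchesReference = (run: Run, example: Example) => ({
  key: "matches_reference",
  score: run.outputs?.answer === example.outputs?.answer ? 1 : 0,
});
```
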
2 changes: 1 addition & 1 deletion js/src/index.ts
@@ -18,4 +18,4 @@ export { RunTree, type RunTreeConfig } from "./run_trees.js";
export { overrideFetchImplementation } from "./singletons/fetch.js";

// Update using yarn bump-version
export const __version__ = "0.2.8";
export const __version__ = "0.2.9";
65 changes: 11 additions & 54 deletions js/src/tests/evaluate.int.test.ts
@@ -3,6 +3,7 @@ import {
EvaluationResults,
} from "../evaluation/evaluator.js";
import { evaluate } from "../evaluation/_runner.js";
import { waitUntilRunFound } from "./utils.js";
import { Example, Run, TracerSession } from "../schemas.js";
import { Client } from "../index.js";
import { afterAll, beforeAll } from "@jest/globals";
@@ -1115,6 +1116,8 @@ test("evaluate handles partial summary evaluator parameters correctly", async ()
});

test("evaluate handles comparative target with ComparativeEvaluateOptions", async () => {
const client = new Client();

// First, create two experiments to compare
const targetFunc1 = (input: Record<string, any>) => {
return {
@@ -1139,13 +1142,18 @@ test("evaluate handles comparative target with ComparativeEvaluateOptions", asyn
description: "Second experiment for comparison",
});

await Promise.all(
[exp1, exp2].flatMap(({ results }) =>
results.flatMap(({ run }) => waitUntilRunFound(client, run.id))
)
);
// Create comparative evaluator
const comparativeEvaluator = ({
runs,
example,
}: {
runs?: Run[];
example?: Example;
runs: Run[];
example: Example;
}) => {
if (!runs || !example) throw new Error("Missing required parameters");

@@ -1167,7 +1175,6 @@ test("evaluate handles comparative target with ComparativeEvaluateOptions", asyn
const compareRes = await evaluate(
[exp1.experimentName, exp2.experimentName],
{
data: TESTING_DATASET_NAME,
evaluators: [comparativeEvaluator],
description: "Comparative evaluation test",
randomizeOrder: true,
@@ -1177,6 +1184,7 @@ test("evaluate handles comparative target with ComparativeEvaluateOptions", asyn

// Verify we got ComparisonEvaluationResults
expect(compareRes.experimentName).toBeDefined();
expect(compareRes.experimentName).toBeDefined();
expect(compareRes.results).toBeDefined();
expect(Array.isArray(compareRes.results)).toBe(true);

@@ -1212,59 +1220,8 @@ test("evaluate enforces correct evaluator types for comparative evaluation at ru
await expect(
// @ts-expect-error - Should error because standardEvaluator is not a ComparativeEvaluator
evaluate([exp1.experimentName, exp2.experimentName], {
data: TESTING_DATASET_NAME,
evaluators: [standardEvaluator],
description: "Should fail at runtime",
})
).rejects.toThrow(); // You might want to be more specific about the error message
});

test("evaluate comparative options includes comparative-specific fields", async () => {
const exp1 = await evaluate(
(input: Record<string, any>) => ({ foo: input.input + 1 }),
{
data: TESTING_DATASET_NAME,
}
);

const exp2 = await evaluate(
(input: Record<string, any>) => ({ foo: input.input + 2 }),
{
data: TESTING_DATASET_NAME,
}
);

const comparativeEvaluator = ({
runs,
example,
}: {
runs?: Run[];
example?: Example;
}) => {
if (!runs || !example) throw new Error("Missing required parameters");
return {
key: "comparative_score",
scores: Object.fromEntries(
runs.map((run) => [
run.id,
run.outputs?.foo === example.outputs?.output ? 1 : 0,
])
),
};
};

// Test that comparative-specific options work
const compareRes = await evaluate(
[exp1.experimentName, exp2.experimentName],
{
data: TESTING_DATASET_NAME,
evaluators: [comparativeEvaluator],
randomizeOrder: true, // Comparative-specific option
loadNested: true, // Comparative-specific option
description: "Testing comparative-specific options",
}
);

expect(compareRes.experimentName).toBeDefined();
expect(compareRes.results).toBeDefined();
});
21 changes: 14 additions & 7 deletions python/langsmith/client.py
@@ -5842,7 +5842,7 @@ def evaluate(
metadata: Optional[dict] = None,
experiment_prefix: Optional[str] = None,
description: Optional[str] = None,
max_concurrency: Optional[int] = None,
max_concurrency: Optional[int] = 0,
num_repetitions: int = 1,
blocking: bool = True,
experiment: Optional[EXPERIMENT_T] = None,
@@ -5861,7 +5861,7 @@ def evaluate(
metadata: Optional[dict] = None,
experiment_prefix: Optional[str] = None,
description: Optional[str] = None,
max_concurrency: Optional[int] = None,
max_concurrency: Optional[int] = 0,
num_repetitions: int = 1,
blocking: bool = True,
experiment: Optional[EXPERIMENT_T] = None,
@@ -5883,7 +5883,7 @@ def evaluate(
metadata: Optional[dict] = None,
experiment_prefix: Optional[str] = None,
description: Optional[str] = None,
max_concurrency: Optional[int] = None,
max_concurrency: Optional[int] = 0,
num_repetitions: int = 1,
blocking: bool = True,
experiment: Optional[EXPERIMENT_T] = None,
@@ -5911,7 +5911,8 @@ def evaluate(
Defaults to None.
description (str | None): A free-form text description for the experiment.
max_concurrency (int | None): The maximum number of concurrent
evaluations to run. Defaults to None (max number of workers).
evaluations to run. If None then no limit is set. If 0 then no concurrency.
Defaults to 0.
blocking (bool): Whether to block until the evaluation is complete.
Defaults to True.
num_repetitions (int): The number of times to run the evaluation.
@@ -6053,6 +6054,8 @@ def evaluate(
... summary_evaluators=[precision],
... ) # doctest: +ELLIPSIS
View the evaluation results for experiment:...

.. versionadded:: 0.2.0
""" # noqa: E501
from langsmith.evaluation._runner import evaluate as evaluate_

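To make the changed default concrete: a hedged Python sketch of calling `Client.evaluate` under the new semantics, where `max_concurrency=0` means run serially (the new default), `None` means no limit, and a positive integer caps concurrency; the same default applies to `aevaluate` below. The dataset name, target, and evaluator here are placeholders, not part of this PR.

```python
from langsmith import Client

client = Client()

def target(inputs: dict) -> dict:
    # Placeholder target function for illustration only.
    return {"answer": str(inputs["question"]).upper()}

def exact_match(outputs: dict, reference_outputs: dict) -> bool:
    # Placeholder evaluator for illustration only.
    return outputs["answer"] == reference_outputs["answer"]

results = client.evaluate(
    target,
    data="my-qa-dataset",   # placeholder dataset name
    evaluators=[exact_match],
    max_concurrency=4,      # explicit cap; omitting it now means serial (0)
)
```
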
@@ -6094,7 +6097,7 @@ async def aevaluate(
metadata: Optional[dict] = None,
experiment_prefix: Optional[str] = None,
description: Optional[str] = None,
max_concurrency: Optional[int] = None,
max_concurrency: Optional[int] = 0,
num_repetitions: int = 1,
blocking: bool = True,
experiment: Optional[Union[schemas.TracerSession, str, uuid.UUID]] = None,
@@ -6119,8 +6122,9 @@ async def aevaluate(
experiment_prefix (Optional[str]): A prefix to provide for your experiment name.
Defaults to None.
description (Optional[str]): A description of the experiment.
max_concurrency (Optional[int]): The maximum number of concurrent
evaluations to run. Defaults to None.
max_concurrency (int | None): The maximum number of concurrent
evaluations to run. If None then no limit is set. If 0 then no concurrency.
Defaults to 0.
num_repetitions (int): The number of times to run the evaluation.
Each item in the dataset will be run and evaluated this many times.
Defaults to 1.
@@ -6259,6 +6263,9 @@ async def aevaluate(
... )
... ) # doctest: +ELLIPSIS
View the evaluation results for experiment:...

.. versionadded:: 0.2.0

""" # noqa: E501
from langsmith.evaluation._arunner import aevaluate as aevaluate_
