wip: eval how to revamp #525
Conversation
Some flyby comments, but overall I like the structure.
docs/evaluation/how_to_guides/datasets/export_filtered_traces_to_dataset.mdx
- [Run an evaluation using the SDK](./how_to_guides/evaluation/evaluate_llm_application)
- [Run an evaluation asynchronously](./how_to_guides/evaluation/async)
- [Run an evaluation comparing two experiments](./how_to_guides/evaluation/evaluate_pairwise)
- [Run an evaluation of a LangChain / LangGraph object](./how_to_guides/evaluation/langchain_runnable)
Probably worth it to cover LangChain and LangGraph in separate guides?
There are some more interesting things in LG, like evaluating a single node, a trajectory, etc. It might be worth calling out the documentation for running a larger eval in the LG eval guide.
We have this tutorial I hacked together with Lance a while back, but it's definitely overly complicated, and I wonder if bits of it can be adapted into a more concise how-to guide: https://docs.smith.langchain.com/evaluation/tutorials/agents
I kinda think this should live in the LangGraph docs and be linked from here.
### Run an evaluation
- [Run an evaluation using the SDK](./how_to_guides/evaluation/evaluate_llm_application)
- [Run an evaluation asynchronously](./how_to_guides/evaluation/async)
- [Run an evaluation comparing two experiments](./how_to_guides/evaluation/evaluate_pairwise)
nit: Run a comparative evaluation?
@agola11 can review this one
const score = rootRun.outputs?.outputs === example.outputs?.output;
return { key: "correct_label", score };
return { key: "correct", score };
Is this code correct? The `score` doesn't have a corresponding value...
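A side note that may resolve this: `{ key: "correct", score }` is object property shorthand for `{ key: "correct", score: score }`, so the boolean from the comparison above does end up as the score's value. The part that looks more suspect is the `outputs?.outputs` vs `outputs?.output` key mismatch. A tiny illustration of the shorthand (variable names here are just for the example):

```typescript
// Object property shorthand: both objects below are identical feedback results.
const score = true;
const explicit = { key: "correct", score: score };
const shorthand = { key: "correct", score }; // same as `explicit`
```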
`,
typescript`
// ToDo
Do we need to fill this out before merging?
# How to define an LLM-as-a-judge evaluator

:::info Key concepts
Shouldn't be part of this PR since it'll take too much time, but users want a lot more guidance on good LLM-as-a-judge definitions than we give right now. Even our concepts doc isn't very helpful.
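Agreed that fuller guidance is out of scope for this PR, but even a short end-to-end sketch might help anchor the guide. For example, a minimal LLM-as-a-judge evaluator in TypeScript — the model name, prompt, and output keys below are illustrative assumptions, not the documented setup:

```typescript
import OpenAI from "openai";
import type { EvaluationResult } from "langsmith/evaluation";
import type { Run, Example } from "langsmith/schemas";

const client = new OpenAI();

// Hypothetical LLM-as-a-judge evaluator: asks a model to grade the run's output
// against the example's reference output and maps the verdict to a boolean score.
async function correctnessJudge(rootRun: Run, example: Example): Promise<EvaluationResult> {
  const prediction = rootRun.outputs?.output;
  const reference = example.outputs?.output;

  const response = await client.chat.completions.create({
    model: "gpt-4o-mini", // illustrative model choice
    messages: [
      { role: "system", content: "You are a grader. Reply with exactly CORRECT or INCORRECT." },
      { role: "user", content: `Reference answer:\n${reference}\n\nPredicted answer:\n${prediction}` },
    ],
  });

  const verdict = response.choices[0].message.content?.trim();
  return { key: "correctness", score: verdict === "CORRECT" };
}
```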
typescript`
import type { EvaluationResult } from "langsmith/evaluation";
import type { Run, Example } from "langsmith/schemas";
needs code?
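If code is added here, one minimal possibility using these imports — the evaluator name and output keys are placeholders, assuming the same run/example convention as the other snippets in this PR:

```typescript
import type { EvaluationResult } from "langsmith/evaluation";
import type { Run, Example } from "langsmith/schemas";

// Sketch of a custom evaluator using the imported types: compares the run's
// output to the example's reference output and returns a boolean feedback score.
function exactMatch(rootRun: Run, example: Example): EvaluationResult {
  const score = rootRun.outputs?.output === example.outputs?.output;
  return { key: "exact_match", score };
}
```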
LangSmith supports both categorical and numerical metrics, and you can return either when writing a [custom evaluator](../../how_to_guides/evaluation/custom_evaluator).

For an evaluator result to be logged as a numerical metric, it must be returned as:
This is only true for Python, since JS can't return just an int, right? We should be clear about this.
For an evaluator result to be logged as a numerical metric, it must be returned as:

- an `int`, `float`, or `bool`
- a dict of the form `{"key": "metric_name", "score": int | float | bool}`
nit: "a feedback [link to reference] dict of the form"
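To make the JS side concrete: per the comment above, a bare number apparently isn't accepted there, so a TypeScript evaluator would put the numerical value in the feedback object. A sketch, with the metric name and scoring logic purely illustrative:

```typescript
import type { EvaluationResult } from "langsmith/evaluation";
import type { Run, Example } from "langsmith/schemas";

// Returning a numerical metric from a TS evaluator: wrap the number in a
// feedback object rather than returning it directly.
function outputLengthRatio(rootRun: Run, example: Example): EvaluationResult {
  const predicted = String(rootRun.outputs?.output ?? "");
  const reference = String(example.outputs?.output ?? "");
  const score = reference.length === 0 ? 0 : predicted.length / reference.length;
  return { key: "output_length_ratio", score };
}
```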
docs/evaluation/how_to_guides/evaluation/run_evals_api_only.mdx
typescript,
} from "@site/src/components/InstructionsWithCode";

# How to run an aggregate evaluation
Any way to make it clear to users who want to calculate things like F1 score that they should go here from our indexes? I feel like no one ever finds this page.
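One option is to cross-link from the metric-type pages and open this guide with a concrete aggregate metric, e.g. F1. A sketch of a summary evaluator, assuming the JS `evaluate` options accept a `summaryEvaluators` list (the dataset name, target function, and output keys are placeholders worth checking against the SDK):

```typescript
import { evaluate } from "langsmith/evaluation";
import type { EvaluationResult } from "langsmith/evaluation";
import type { Run, Example } from "langsmith/schemas";

// Sketch of an aggregate (summary) evaluator: computes F1 over the whole
// experiment instead of scoring each run independently.
function f1Score(runs: Run[], examples: Example[]): EvaluationResult {
  let tp = 0, fp = 0, fn = 0;
  runs.forEach((run, i) => {
    const predicted = run.outputs?.output === true;
    const actual = examples[i].outputs?.output === true;
    if (predicted && actual) tp += 1;
    else if (predicted && !actual) fp += 1;
    else if (!predicted && actual) fn += 1;
  });
  const precision = tp / (tp + fp || 1);
  const recall = tp / (tp + fn || 1);
  const score = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { key: "f1", score };
}

// Hypothetical usage (dataset name and target are placeholders):
await evaluate((inputs) => ({ output: true }), {
  data: "my-dataset",
  summaryEvaluators: [f1Score],
});
```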