Flatten eval how to guides directory #552

Merged: 11 commits, Nov 26, 2024
@@ -144,7 +144,7 @@ If the header is not present, operations will default to the workspace the API key
## Security Settings

:::note
"Shared resources" in this context refer to [public prompts](../../../prompt_engineering/how_to_guides/prompts/create_a_prompt#save-your-prompt), [shared runs](../../../observability/how_to_guides/tracing/share_trace), and [shared datasets](../../../evaluation/how_to_guides/datasets/share_dataset.mdx).
"Shared resources" in this context refer to [public prompts](../../../prompt_engineering/how_to_guides/prompts/create_a_prompt#save-your-prompt), [shared runs](../../../observability/how_to_guides/tracing/share_trace), and [shared datasets](../../../evaluation/how_to_guides/share_dataset.mdx).
:::

- <RegionalUrl
18 changes: 9 additions & 9 deletions docs/evaluation/concepts/index.mdx
@@ -66,7 +66,7 @@ When setting up your evaluation, you may want to partition your dataset into different splits
To learn more about creating dataset splits in LangSmith:

- See our video on [`dataset splits`](https://youtu.be/FQMn_FQV-fI?feature=shared) in the LangSmith Evaluation series.
- See our documentation [here](https://docs.smith.langchain.com/how_to_guides/datasets/manage_datasets_in_application#create-and-manage-dataset-splits).
- See our documentation [here](./how_to_guides/manage_datasets_in_application#create-and-manage-dataset-splits).

:::

@@ -105,7 +105,7 @@ Heuristic evaluators are hard-coded functions that perform computations to determine
For some tasks, like code generation, custom heuristic evaluations (e.g., checking imports or executing the generated code) are often extremely useful and superior to other approaches (e.g., LLM-as-judge, discussed below).

- Watch the [`Custom evaluator` video in our LangSmith Evaluation series](https://www.youtube.com/watch?v=w31v_kFvcNw) for a comprehensive overview.
- Read our [documentation](https://docs.smith.langchain.com/how_to_guides/evaluation/evaluate_on_intermediate_steps#3-define-your-custom-evaluators) on custom evaluators.
- Read our [documentation](./how_to_guides/custom_evaluator) on custom evaluators.
- See our [blog](https://blog.langchain.dev/code-execution-with-langgraph/) using custom evaluators for code generation.

:::
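
As a rough illustration, a heuristic evaluator for code generation might simply check that the generated code parses. The sketch below assumes a hypothetical `code` field in the run outputs and uses the `(run, example)` evaluator signature; a more thorough variant could execute the code in a sandbox, as the blog post above does.

```python
import ast

from langsmith.schemas import Example, Run


def code_parses(run: Run, example: Example) -> dict:
    """Heuristic check: does the generated code at least parse as valid Python?"""
    code = (run.outputs or {}).get("code", "")  # `code` is an assumed output field
    try:
        ast.parse(code)
        return {"key": "code_parses", "score": 1}
    except SyntaxError:
        return {"key": "code_parses", "score": 0}
```
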
@@ -124,7 +124,7 @@ With LLM-as-judge evaluators, it is important to carefully review the resulting scores

:::tip

See documentation on our workflow to audit and manually correct evaluator scores [here](https://docs.smith.langchain.com/how_to_guides/evaluation/audit_evaluator_scores).
See documentation on our workflow to audit and manually correct evaluator scores [here](./how_to_guides/audit_evaluator_scores).

:::
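
Reviewing judge output is easier when the evaluator records the model's reasoning alongside the score. Below is a minimal, hedged sketch of such an LLM-as-judge evaluator; the judge model, prompt, and `concise` criterion are illustrative assumptions rather than a prescribed setup.

```python
import json

from openai import OpenAI
from langsmith.schemas import Example, Run

judge = OpenAI()


def concise(run: Run, example: Example) -> dict:
    """LLM-as-judge: score whether the answer is concise, and keep the judge's reasoning."""
    answer = (run.outputs or {}).get("answer", "")  # `answer` is an assumed output field
    resp = judge.chat.completions.create(
        model="gpt-4o-mini",  # illustrative judge model
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": 'Grade the answer for conciseness. Reply as JSON: {"score": 0 or 1, "reasoning": "<why>"}',
            },
            {"role": "user", "content": answer},
        ],
    )
    graded = json.loads(resp.choices[0].message.content)
    # Returning a comment makes the judge's reasoning visible when auditing scores.
    return {"key": "concise", "score": graded["score"], "comment": graded.get("reasoning")}
```
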

@@ -225,7 +225,7 @@ LangSmith evaluations are kicked off using a single function, `evaluate`, which

:::tip

See documentation on using `evaluate` [here](https://docs.smith.langchain.com/how_to_guides/evaluation/evaluate_llm_application).
See documentation on using `evaluate` [here](./how_to_guides/evaluate_llm_application).

:::
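
For orientation, a minimal sketch of an `evaluate()` call is shown below; the target function, evaluator, and dataset name are placeholders, and the full set of supported arguments is covered in the guide linked above.

```python
from langsmith.evaluation import evaluate
from langsmith.schemas import Example, Run


def my_app(inputs: dict) -> dict:
    # Stand-in for your real application logic.
    return {"answer": inputs["question"].upper()}


def exact_match(run: Run, example: Example) -> dict:
    # Reference-based check: does the output match the reference answer exactly?
    predicted = (run.outputs or {}).get("answer")
    expected = (example.outputs or {}).get("answer")
    return {"key": "exact_match", "score": int(predicted == expected)}


results = evaluate(
    my_app,                       # the application being tested
    data="my-dataset",            # hypothetical dataset name
    evaluators=[exact_match],
    experiment_prefix="baseline",
)
```
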

@@ -236,7 +236,7 @@ One of the most common questions when evaluating AI applications is: how can I b
:::tip

- See the [video on `Repetitions` in our LangSmith Evaluation series](https://youtu.be/Pvz24JdzzF8)
- See our documentation on [`Repetitions`](https://docs.smith.langchain.com/how_to_guides/evaluation/repetition)
- See our documentation on [`Repetitions`](./how_to_guides/repetition)

:::
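
As a sketch, repetitions are requested on the same `evaluate()` call, assuming the Python SDK's `num_repetitions` argument; the target and dataset name below are placeholders.

```python
from langsmith.evaluation import evaluate

results = evaluate(
    lambda inputs: {"answer": "..."},  # stand-in for your application
    data="my-dataset",                 # hypothetical dataset name
    evaluators=[],
    num_repetitions=3,                 # run every example 3 times so score variance is visible
)
```
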

@@ -281,7 +281,7 @@ However, there are several downsides to this type of evaluation. First, it usually

:::tip

See our tutorial on [evaluating agent response](https://docs.smith.langchain.com/tutorials/Developers/agents#response-evaluation).
See our tutorial on [evaluating agent response](./tutorials/agents).

:::

@@ -299,7 +299,7 @@ There are several benefits to this type of evaluation. It allows you to evaluate

:::tip

See our tutorial on [evaluating a single step of an agent](https://docs.smith.langchain.com/tutorials/Developers/agents#single-step-evaluation).
See our tutorial on [evaluating a single step of an agent](./tutorials/agents#single-step-evaluation).

:::

@@ -319,7 +319,7 @@ However, none of these approaches evaluate the input to the tools; they only focus

:::tip

See our tutorial on [evaluating agent trajectory](https://docs.smith.langchain.com/tutorials/Developers/agents#trajectory).
See our tutorial on [evaluating agent trajectory](./tutorials/agents#trajectory).

:::

@@ -434,7 +434,7 @@ Classification / Tagging applies a label to a given input (e.g., for toxicity detection

A central consideration for Classification / Tagging evaluation is whether you have a dataset with `reference` labels or not. If not, users frequently want to define an evaluator that uses criteria to apply a label (e.g., toxicity) to an input (e.g., text, a user question). However, if ground truth class labels are provided, then the evaluation objective is to score a Classification / Tagging chain relative to those labels (e.g., using metrics such as precision and recall).

If ground truth reference labels are provided, then it's common to simply define a [custom heuristic evaluator](https://docs.smith.langchain.com/how_to_guides/evaluation/custom_evaluator) to compare ground truth labels to the chain output. However, given the emergence of LLMs, it is increasingly common to simply use `LLM-as-judge` to perform the Classification / Tagging of an input based upon specified criteria (without a ground truth reference).
If ground truth reference labels are provided, then it's common to simply define a [custom heuristic evaluator](./how_to_guides/custom_evaluator) to compare ground truth labels to the chain output. However, given the emergence of LLMs, it is increasingly common to simply use `LLM-as-judge` to perform the Classification / Tagging of an input based upon specified criteria (without a ground truth reference).
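
For example, when reference labels exist, experiment-level metrics such as precision can be computed with a summary evaluator over all runs; the sketch below assumes a hypothetical `label` field on both the outputs and the reference outputs.

```python
from langsmith.schemas import Example, Run


def precision(runs: list[Run], examples: list[Example]) -> dict:
    """Summary metric: precision of predicted 'toxic' labels against the reference labels."""
    tp = fp = 0
    for run, example in zip(runs, examples):
        predicted = (run.outputs or {}).get("label")     # assumed output field
        expected = (example.outputs or {}).get("label")  # assumed reference field
        if predicted == "toxic":
            tp += int(expected == "toxic")
            fp += int(expected != "toxic")
    return {"key": "precision", "score": tp / (tp + fp) if (tp + fp) else 0.0}


# A function like this would be passed via evaluate(..., summary_evaluators=[precision]).
```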

`Online` or `Offline` evaluation is feasible when using `LLM-as-judge` with the `Reference-free` prompt used. In particular, this is well suited to `Online` evaluation when a user wants to tag / classify application input (e.g., for toxicity, etc).

@@ -4,19 +4,19 @@ import { CodeTabs, python } from "@site/src/components/InstructionsWithCode";

:::info Key concepts

[Evaluations](../../concepts#applying-evaluations) | [Evaluators](../../concepts#evaluators) | [Datasets](../../concepts#datasets) | [Experiments](../../concepts#experiments)
[Evaluations](../concepts#applying-evaluations) | [Evaluators](../concepts#evaluators) | [Datasets](../concepts#datasets) | [Experiments](../concepts#experiments)

:::

We can run evaluations asynchronously via the SDK using [aevaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._arunner.aevaluate.html),
which accepts all of the same arguments as [evaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._runner.evaluate.html) but expects the application function to be asynchronous.
You can learn more about how to use the `evaluate()` function [here](../../how_to_guides/evaluation/evaluate_llm_application).
You can learn more about how to use the `evaluate()` function [here](./evaluate_llm_application).

:::info Python only

This guide is only relevant when using the Python SDK.
In JS/TS the `evaluate()` function is already async.
You can see how to use it [here](../../how_to_guides/evaluation/evaluate_llm_application).
You can see how to use it [here](./evaluate_llm_application).

:::
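
A minimal sketch of an `aevaluate()` run, with an async placeholder target and a hypothetical dataset name:

```python
import asyncio

from langsmith.evaluation import aevaluate


async def my_app(inputs: dict) -> dict:
    # Stand-in for your real (async) application logic.
    return {"answer": f"You asked: {inputs['question']}"}


async def main() -> None:
    await aevaluate(
        my_app,
        data="my-dataset",  # hypothetical dataset name
        evaluators=[],      # same evaluator interface as evaluate()
        max_concurrency=4,  # cap concurrent target/evaluator calls
    )


asyncio.run(main())
```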

@@ -76,5 +76,5 @@ list 5 concrete questions that should be investigated to determine if the idea is

## Related

- [Run an evaluation (synchronously)](../../how_to_guides/evaluation/evaluate_llm_application)
- [Handle model rate limits](../../how_to_guides/evaluation/rate_limiting)
- [Run an evaluation (synchronously)](./evaluate_llm_application)
- [Handle model rate limits](./rate_limiting)
@@ -18,13 +18,13 @@ In the comparison view, you may click on any feedback tag to bring up the feedback details
If you would like, you may also attach an explanation to your correction. This is useful if you are using a [few-shot evaluator](./create_few_shot_evaluators): the explanation will be automatically inserted into your few-shot examples in place of the `few_shot_explanation` prompt variable.

![Audit Evaluator Comparison View](../evaluation/static/corrections_comparison_view.png)
![Audit Evaluator Comparison View](./static/corrections_comparison_view.png)

## In the runs table

In the runs table, find the "Feedback" column and click on the feedback tag to bring up the feedback details. Again, click the "edit" icon on the right to bring up the corrections view.

![Audit Evaluator Runs Table](../evaluation/static/corrections_runs_table.png)
![Audit Evaluator Runs Table](./static/corrections_runs_table.png)

## In the SDK

@@ -23,7 +23,7 @@ The next steps vary based on the evaluator type.
1. **Select the LLM as judge type evaluator**
2. **Give your evaluator a name** and **set an inline prompt or load a prompt from the prompt hub** that will be used to evaluate the results of the runs in the experiment.

![Add evaluator name and prompt](../evaluation/static/create_evaluator.png)
![Add evaluator name and prompt](./static/create_evaluator.png)

Importantly, evaluator prompts can only contain the following input variables:

@@ -42,11 +42,11 @@ LangSmith currently doesn't support setting up evaluators in the application that

You can specify the scoring criteria in the "schema" field. In this example, we are asking the LLM to grade the "correctness" of the output with respect to the reference, with a boolean output of 0 or 1. The name of the field in the schema will be interpreted as the feedback key, and its type will be the type of the score.

![Evaluator prompt](../evaluation/static/evaluator_prompt.png)
![Evaluator prompt](./static/evaluator_prompt.png)

3. **Save the evaluator** and navigate back to the dataset details page. Each **subsequent** experiment run from the dataset will now be evaluated by the evaluator you configured. Note that in the below image, each run in the experiment has a "correctness" score.

![Playground evaluator results](../evaluation/static/playground_evaluator_results.png)
![Playground evaluator results](./static/playground_evaluator_results.png)

## Custom code evaluators

@@ -70,7 +70,7 @@ You can specify the scoring criteria in the "schema" field. In this example, we

In the UI, you will see a panel that lets you write your code inline, with some starter code:

![](../evaluation/static/code-autoeval-popup.png)
![](./static/code-autoeval-popup.png)

Custom Code evaluators take in two arguments:

@@ -127,8 +127,8 @@ To visualize the feedback left on new experiments, try running a new experiment
On the dataset page, if you now go to the `experiments` tab -> `+ Experiment` -> `Run in Playground`, you can see the results in action.
Your runs in your experiments will be automatically marked with the key specified in your code sample above (here, `formatted`):

![](../evaluation/static/show-feedback-from-autoeval-code.png)
![](./static/show-feedback-from-autoeval-code.png)

And if you navigate back to your dataset, you'll see summary stats for said experiment in the `experiments` tab:

![](../evaluation/static/experiments-tab-code-results.png)
![](./static/experiments-tab-code-results.png)
@@ -8,65 +8,65 @@ Oftentimes, when you are iterating on your LLM application (such as changing the

LangSmith supports a powerful comparison view that lets you hone in on key differences, regressions, and improvements between different experiments.

![](../evaluation/static/regression_test.gif)
![](./static/regression_test.gif)

## Open the comparison view

To open the comparison view, select two or more experiments from the "Experiments" tab from a given dataset page. Then, click on the "Compare" button at the bottom of the page.

![](../evaluation/static/open_comparison_view.png)
![](./static/open_comparison_view.png)

## Toggle different views

You can toggle between different views by clicking on the "Display" dropdown at the top right of the page and selecting which views to display.

Toggling Full Text will show the full text of the input, output and reference output for each run. If the reference output is too long to display in the table, you can click on expand to view the full content.

![](../evaluation/static/toggle_views.png)
![](./static/toggle_views.png)

## View regressions and improvements

In the LangSmith comparison view, runs that _regressed_ on your specified feedback key against your baseline experiment will be highlighted in red, while runs that _improved_
will be highlighted in green. At the top of each column, you can see how many runs in that experiment did better and how many did worse than your baseline experiment.

![Regressions](../evaluation/static/regression_view.png)
![Regressions](./static/regression_view.png)

## Filter on regressions or improvements

Click on the regressions or improvements buttons on the top of each column to filter to the runs that regressed or improved in that specific experiment.

![Regressions Filter](../evaluation/static/filter_to_regressions.png)
![Regressions Filter](./static/filter_to_regressions.png)

## Update baseline experiment

In order to track regressions, you need a baseline experiment against which to compare. This will be automatically assigned as the first experiment in your comparison, but you can
change it from the dropdown at the top of the page.

![Baseline](../evaluation/static/select_baseline.png)
![Baseline](./static/select_baseline.png)

## Select feedback key

You will also want to select the feedback key (evaluation metric) that you would like to focus on. This can be selected via another dropdown at the top. Again, one will be assigned by
default, but you can adjust as needed.

![Feedback](../evaluation/static/select_feedback.png)
![Feedback](./static/select_feedback.png)

## Open a trace

If tracing is enabled for the evaluation run, you can click on the trace icon in the hover state of any experiment cell to open the trace view for that run. This will open up a trace in the side panel.

![](../evaluation/static/open_trace_comparison.png)
![](./static/open_trace_comparison.png)

## Expand detailed view

From any cell, you can click on the expand icon in the hover state to open up a detailed view of all experiment results on that particular example input, along with feedback keys and scores.

![](../evaluation/static/expanded_view.png)
![](./static/expanded_view.png)

## Update display settings

You can adjust the display settings for comparison view by clicking on "Display" in the top right corner.

Here, you'll be able to toggle feedback, metrics, summary charts, and expand full text.

![](../evaluation/static/update_display.png)
![](./static/update_display.png)
@@ -34,7 +34,7 @@ as your output key. For example, if your main prompt has variables `question` and
You may also specify the number of few-shot examples to use. The default is 5. If your examples tend to be very long, you may want to set this number lower to save tokens, whereas if they tend to be short, you can set a higher number to give your evaluator more examples to learn from. If you have more examples in your dataset than this number, we will randomly choose them for you.

![Use corrections as few-shot examples](../evaluation/static/use_corrections_as_few_shot.png)
![Use corrections as few-shot examples](./static/use_corrections_as_few_shot.png)

Note that few-shot examples are not currently supported in evaluators that use Hub prompts.

@@ -51,20 +51,20 @@ begin seeing examples populated inside your corrections dataset. As you make corrections
The inputs to the few-shot examples will be the relevant fields from the inputs, outputs, and reference (if this is an offline evaluator) of your chain/dataset.
The outputs will be the corrected evaluator score and the explanations that you created when you left the corrections. Feel free to edit these to your liking. Here is an example of a few-shot example in a corrections dataset:

![Few-shot example](../evaluation/static/few_shot_example.png)
![Few-shot example](./static/few_shot_example.png)

Note that the corrections may take a minute or two to be populated into your few-shot dataset. Once they are there, future runs of your evaluator will include them in the prompt!

## View your corrections dataset

In order to view your corrections dataset, go to your rule and click "Edit Rule" (or "Edit Evaluator" from a dataset):

![Edit Evaluator](../evaluation/static/edit_evaluator.png)
![Edit Evaluator](./static/edit_evaluator.png)

If this is an online evaluator (in a tracing project), you will need to click to edit your prompt:

![Edit Prompt](../evaluation/static/click_to_edit_prompt.png)
![Edit Prompt](./static/click_to_edit_prompt.png)

From this screen, you will see a button that says "View few-shot dataset". Clicking this will bring you to your dataset of corrections, where you can view and update your few-shot examples:

![View few-shot dataset](../evaluation/static/view_few_shot_ds.png)
![View few-shot dataset](./static/view_few_shot_ds.png)
@@ -8,7 +8,7 @@ import {

:::info Key concepts

- [Evaluators](../../concepts#evaluators)
- [Evaluators](../concepts#evaluators)

:::

@@ -138,5 +138,5 @@ answer is logically valid and consistent with question and the answer."""

## Related

- [Evaluate aggregate experiment results](../../how_to_guides/evaluation/summary): Define summary evaluators, which compute metrics for an entire experiment.
- [Run an evaluation comparing two experiments](../../how_to_guides/evaluation/evaluate_pairwise): Define pairwise evaluators, which compute metrics by comparing two (or more) experiments against each other.
- [Evaluate aggregate experiment results](./summary): Define summary evaluators, which compute metrics for an entire experiment.
- [Run an evaluation comparing two experiments](./evaluate_pairwise): Define pairwise evaluators, which compute metrics by comparing two (or more) experiments against each other.
@@ -10,8 +10,8 @@ import {

Before diving into this content, it might be helpful to read:

- [guide on fetching examples](../datasets/manage_datasets_programmatically#fetch-examples).
- [guide on creating/managing dataset splits](../datasets/manage_datasets_in_application#create-and-manage-dataset-splits)
- [guide on fetching examples](./manage_datasets_programmatically#fetch-examples).
- [guide on creating/managing dataset splits](./manage_datasets_in_application#create-and-manage-dataset-splits)

:::

@@ -49,7 +49,7 @@ One common workflow is to fetch examples that have a certain metadata key-value pair
]}
/>

For more advanced filtering capabilities see this [how-to guide](../datasets/manage_datasets_programmatically#list-examples-by-structured-filter).
For more advanced filtering capabilities see this [how-to guide](./manage_datasets_programmatically#list-examples-by-structured-filter).
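
Putting the two steps together, a rough Python sketch might fetch the filtered examples and pass them straight to `evaluate()`; the dataset name and metadata values here are hypothetical.

```python
from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()

# Fetch only the examples carrying a hypothetical metadata key-value pair...
examples = client.list_examples(
    dataset_name="my-dataset",          # hypothetical dataset name
    metadata={"source": "production"},  # hypothetical metadata filter
)

# ...and run the evaluation against just that subset.
results = evaluate(
    lambda inputs: {"answer": "..."},   # stand-in for your application
    data=examples,
    evaluators=[],
)
```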

## Evaluate on a dataset split

@@ -85,4 +85,4 @@ You can use the `list_examples` / `listExamples` method to evaluate on one or multiple

## Related

- More on [how to filter datasets](../datasets/manage_datasets_programmatically#list-examples-by-structured-filter)
- More on [how to filter datasets](./manage_datasets_programmatically#list-examples-by-structured-filter)
@@ -8,8 +8,8 @@ import {

:::tip Recommended reading

Before diving into this content, it might be helpful to read the [guide on versioning datasets](../datasets/version_datasets).
Additionally, it might be helpful to read the [guide on fetching examples](../datasets/manage_datasets_programmatically#fetch-examples).
Before diving into this content, it might be helpful to read the [guide on versioning datasets](./version_datasets).
Additionally, it might be helpful to read the [guide on fetching examples](./manage_datasets_programmatically#fetch-examples).

:::

5 changes: 0 additions & 5 deletions docs/evaluation/how_to_guides/datasets/_category_.json

This file was deleted.
