diff --git a/docs/administration/how_to_guides/organization_management/manage_organization_by_api.mdx b/docs/administration/how_to_guides/organization_management/manage_organization_by_api.mdx index 385408f8..558f52f2 100644 --- a/docs/administration/how_to_guides/organization_management/manage_organization_by_api.mdx +++ b/docs/administration/how_to_guides/organization_management/manage_organization_by_api.mdx @@ -144,7 +144,7 @@ If the header is not present, operations will default to the workspace the API k ## Security Settings :::note -"Shared resources" in this context refer to [public prompts](../../../prompt_engineering/how_to_guides/prompts/create_a_prompt#save-your-prompt), [shared runs](../../../observability/how_to_guides/tracing/share_trace), and [shared datasets](../../../evaluation/how_to_guides/datasets/share_dataset.mdx). +"Shared resources" in this context refer to [public prompts](../../../prompt_engineering/how_to_guides/prompts/create_a_prompt#save-your-prompt), [shared runs](../../../observability/how_to_guides/tracing/share_trace), and [shared datasets](../../../evaluation/how_to_guides/share_dataset.mdx). ::: - `+ Experiment` -> `Run in Playground`, you can see the results in action. Your runs in your experiments will be automatically marked with the key specified in your code sample above (here, `formatted`): -![](../evaluation/static/show-feedback-from-autoeval-code.png) +![](./static/show-feedback-from-autoeval-code.png) And if you navigate back to your dataset, you'll see summary stats for said experiment in the `experiments` tab: -![](../evaluation/static/experiments-tab-code-results.png) +![](./static/experiments-tab-code-results.png) diff --git a/docs/evaluation/how_to_guides/compare_experiment_results.mdx b/docs/evaluation/how_to_guides/compare_experiment_results.mdx index 9875f4db..2dc96583 100644 --- a/docs/evaluation/how_to_guides/compare_experiment_results.mdx +++ b/docs/evaluation/how_to_guides/compare_experiment_results.mdx @@ -8,52 +8,52 @@ Oftentimes, when you are iterating on your LLM application (such as changing the LangSmith supports a powerful comparison view that lets you hone in on key differences, regressions, and improvements between different experiments. -![](../evaluation/static/regression_test.gif) +![](./static/regression_test.gif) ## Open the comparison view To open the comparison view, select two or more experiments from the "Experiments" tab from a given dataset page. Then, click on the "Compare" button at the bottom of the page. -![](../evaluation/static/open_comparison_view.png) +![](./static/open_comparison_view.png) ## View regressions and improvements In the LangSmith comparison view, runs that _regressed_ on your specified feedback key against your baseline experiment will be highlighted in red, while runs that _improved_ will be highlighted in green. At the top of each column, you can see how many runs in that experiment did better and how many did worse than your baseline experiment. -![Regressions](../evaluation/static/regression_view.png) +![Regressions](./static/regression_view.png) ## Filter on regressions or improvements Click on the regressions or improvements buttons on the top of each column to filter to the runs that regressed or improved in that specific experiment. -![Regressions Filter](../evaluation/static/filter_to_regressions.png) +![Regressions Filter](./static/filter_to_regressions.png) ## Update baseline experiment In order to track regressions, you need a baseline experiment against which to compare. 
This will be automatically assigned as the first experiment in your comparison, but you can change it from the dropdown at the top of the page. -![Baseline](../evaluation/static/select_baseline.png) +![Baseline](./static/select_baseline.png) ## Select feedback key You will also want to select the feedback key (evaluation metric) on which you would like focus on. This can be selected via another dropdown at the top. Again, one will be assigned by default, but you can adjust as needed. -![Feedback](../evaluation/static/select_feedback.png) +![Feedback](./static/select_feedback.png) ## Open a trace If tracing is enabled for the evaluation run, you can click on the trace icon in the hover state of any experiment cell to open the trace view for that run. This will open up a trace in the side panel. -![](../evaluation/static/open_trace_comparison.png) +![](./static/open_trace_comparison.png) ## Expand detailed view From any cell, you can click on the expand icon in the hover state to open up a detailed view of all experiment results on that particular example input, along with feedback keys and scores. -![](../evaluation/static/expanded_view.png) +![](./static/expanded_view.png) ## Update display settings @@ -61,4 +61,4 @@ You can adjust the display settings for comparison view by clicking on "Display" Here, you'll be able to toggle feedback, metrics, summary charts, and expand full text. -![](../evaluation/static/update_display.png) +![](./static/update_display.png) diff --git a/docs/evaluation/how_to_guides/create_few_shot_evaluators.mdx b/docs/evaluation/how_to_guides/create_few_shot_evaluators.mdx index e50b3965..4bf8f696 100644 --- a/docs/evaluation/how_to_guides/create_few_shot_evaluators.mdx +++ b/docs/evaluation/how_to_guides/create_few_shot_evaluators.mdx @@ -34,7 +34,7 @@ as your output key. For example, if your main prompt has variables `question` an You may also specify the number of few-shot examples to use. The default is 5. If your examples will tend to be very long, you may want to set this number lower to save tokens - whereas if your examples tend to be short, you can set a higher number in order to give your evaluator more examples to learn from. If you have more examples in your dataset than this number, we will randomly choose them for you. -![Use corrections as few-shot examples](../evaluation/static/use_corrections_as_few_shot.png) +![Use corrections as few-shot examples](./static/use_corrections_as_few_shot.png) Note that few-shot examples are not currently supported in evaluators that use Hub prompts. @@ -51,7 +51,7 @@ begin seeing examples populated inside your corrections dataset. As you make cor The inputs to the few-shot examples will be the relevant fields from the inputs, outputs, and reference (if this an offline evaluator) of your chain/dataset. The outputs will be the corrected evaluator score and the explanations that you created when you left the corrections. Feel free to edit these to your liking. Here is an example of a few-shot example in a corrections dataset: -![Few-shot example](../evaluation/static/few_shot_example.png) +![Few-shot example](./static/few_shot_example.png) Note that the corrections may take a minute or two to be populated into your few-shot dataset. Once they are there, future runs of your evaluator will include them in the prompt! 
@@ -59,12 +59,12 @@ Note that the corrections may take a minute or two to be populated into your few In order to view your corrections dataset, go to your rule and click "Edit Rule" (or "Edit Evaluator" from a dataset): -![Edit Evaluator](../evaluation/static/edit_evaluator.png) +![Edit Evaluator](./static/edit_evaluator.png) If this is an online evaluator (in a tracing project), you will need to click to edit your prompt: -![Edit Prompt](../evaluation/static/click_to_edit_prompt.png) +![Edit Prompt](./static/click_to_edit_prompt.png) From this screen, you will see a button that says "View few-shot dataset". Clicking this will bring you to your dataset of corrections, where you can view and update your few-shot examples: -![View few-shot dataset](../evaluation/static/view_few_shot_ds.png) +![View few-shot dataset](./static/view_few_shot_ds.png) diff --git a/docs/evaluation/how_to_guides/custom_evaluator.mdx b/docs/evaluation/how_to_guides/custom_evaluator.mdx index bce7b66d..9086b696 100644 --- a/docs/evaluation/how_to_guides/custom_evaluator.mdx +++ b/docs/evaluation/how_to_guides/custom_evaluator.mdx @@ -138,5 +138,5 @@ answer is logically valid and consistent with question and the answer.""" ## Related -- [Evaluate aggregate experiment results](../../how_to_guides/evaluation/summary): Define summary evaluators, which compute metrics for an entire experiment. -- [Run an evaluation comparing two experiments](../../how_to_guides/evaluation/evaluate_pairwise): Define pairwise evaluators, which compute metrics by comparing two (or more) experiments against each other. +- [Evaluate aggregate experiment results](../../how_to_guides/summary): Define summary evaluators, which compute metrics for an entire experiment. +- [Run an evaluation comparing two experiments](../../how_to_guides/evaluate_pairwise): Define pairwise evaluators, which compute metrics by comparing two (or more) experiments against each other. diff --git a/docs/evaluation/how_to_guides/evaluate_llm_application.mdx b/docs/evaluation/how_to_guides/evaluate_llm_application.mdx index fdefed61..a6b886c6 100644 --- a/docs/evaluation/how_to_guides/evaluate_llm_application.mdx +++ b/docs/evaluation/how_to_guides/evaluate_llm_application.mdx @@ -22,7 +22,7 @@ In this guide we'll go over how to evaluate an application using the [evaluate() For larger evaluation jobs in Python we recommend using [aevaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._arunner.aevaluate.html), the asynchronous version of `evaluate()`. It is still worthwhile to read this guide first, as the two have nearly identical interfaces, -and then read the how-to guide on [running an evaluation asynchronously](../../how_to_guides/evaluation/async). +and then read the how-to guide on [running an evaluation asynchronously](../../how_to_guides/async). ::: @@ -223,7 +223,7 @@ Evaluation scores are stored against each actual output as feedback. 
_If you've annotated your code for tracing, you can open the trace of each row in a side panel view._ -![](../evaluation/static/view_experiment.gif) +![](./static/view_experiment.gif) ## Reference code @@ -364,6 +364,6 @@ _If you've annotated your code for tracing, you can open the trace of each row i ## Related -- [Run an evaluation asynchronously](../../how_to_guides/evaluation/async) -- [Run an evaluation via the REST API](../../how_to_guides/evaluation/run_evals_api_only) -- [Run an evaluation from the prompt playground](../../how_to_guides/evaluation/run_evaluation_from_prompt_playground) +- [Run an evaluation asynchronously](../../how_to_guides/async) +- [Run an evaluation via the REST API](../../how_to_guides/run_evals_api_only) +- [Run an evaluation from the prompt playground](../../how_to_guides/run_evaluation_from_prompt_playground) diff --git a/docs/evaluation/how_to_guides/evaluate_on_intermediate_steps.mdx b/docs/evaluation/how_to_guides/evaluate_on_intermediate_steps.mdx index 39e1041a..4685864e 100644 --- a/docs/evaluation/how_to_guides/evaluate_on_intermediate_steps.mdx +++ b/docs/evaluation/how_to_guides/evaluate_on_intermediate_steps.mdx @@ -167,7 +167,7 @@ def rag_pipeline(question): /> This pipeline will produce a trace that looks something like: -![](../evaluation/static/evaluation_intermediate_trace.png) +![](./static/evaluation_intermediate_trace.png) ## 2. Create a dataset and examples to evaluate the pipeline @@ -387,7 +387,7 @@ Finally, we'll run `evaluate` with the custom evaluators defined above. /> The experiment will contain the results of the evaluation, including the scores and comments from the evaluators: -![](../evaluation/static/evaluation_intermediate_experiment.png) +![](./static/evaluation_intermediate_experiment.png) ## Related diff --git a/docs/evaluation/how_to_guides/evaluate_pairwise.mdx b/docs/evaluation/how_to_guides/evaluate_pairwise.mdx index d68b48b7..55c2857f 100644 --- a/docs/evaluation/how_to_guides/evaluate_pairwise.mdx +++ b/docs/evaluation/how_to_guides/evaluate_pairwise.mdx @@ -22,7 +22,7 @@ This allows you to score the outputs from multiple experiments against each othe Think [LMSYS Chatbot Arena](https://chat.lmsys.org/) - this is the same concept! To do this, use the [evaluate_comparative](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._runner.evaluate_comparative.html) / `evaluateComparative` function with two existing experiments. -If you haven't already created experiments to compare, check out our [quick start](https://docs.smith.langchain.com/#5-run-your-first-evaluation) or oue [how-to guide](https://docs.smith.langchain.com/how_to_guides/evaluation/evaluate_llm_application) to get started with evaluations. +If you haven't already created experiments to compare, check out our [quick start](https://docs.smith.langchain.com/#5-run-your-first-evaluation) or our [how-to guide](https://docs.smith.langchain.com/how_to_guides/evaluate_llm_application) to get started with evaluations.
## `evaluate_comparative` args @@ -240,12 +240,12 @@ In the Python example below, we are pulling [this structured prompt](https://smi Navigate to the "Pairwise Experiments" tab from the dataset page: -![Pairwise Experiments Tab](../evaluation/static/pairwise_from_dataset.png) +![Pairwise Experiments Tab](./static/pairwise_from_dataset.png) Click on a pairwise experiment that you would like to inspect, and you will be brought to the Comparison View: -![Pairwise Comparison View](../evaluation/static/pairwise_comparison_view.png) +![Pairwise Comparison View](./static/pairwise_comparison_view.png) You may filter to runs where the first experiment was better or vice versa by clicking the thumbs up/thumbs down buttons in the table header: -![Pairwise Filtering](../evaluation/static/filter_pairwise.png) +![Pairwise Filtering](./static/filter_pairwise.png) diff --git a/docs/evaluation/how_to_guides/filter_experiments_ui.mdx b/docs/evaluation/how_to_guides/filter_experiments_ui.mdx index 6f32dfc2..eff983ef 100644 --- a/docs/evaluation/how_to_guides/filter_experiments_ui.mdx +++ b/docs/evaluation/how_to_guides/filter_experiments_ui.mdx @@ -74,20 +74,20 @@ and a known ID of the prompt: In the UI, we see all experiments that have been run by default. -![](../evaluation/static/filter-all-experiments.png) +![](./static/filter-all-experiments.png) If we, say, have a preference for openai models, we can easily filter down and see scores within just openai models first: -![](../evaluation/static/filter-openai.png) +![](./static/filter-openai.png) We can stack filters, allowing us to filter out low scores on correctness to make sure we only compare relevant experiments: -![](../evaluation/static/filter-feedback.png) +![](./static/filter-feedback.png) Finally, we can clear and reset filters. For example, if we see there is clear there's a winner with the `singleminded` prompt, we can change filtering settings to see if any other model providers' models work as well with it: -![](../evaluation/static/filter-singleminded.png) +![](./static/filter-singleminded.png) diff --git a/docs/evaluation/how_to_guides/langchain_runnable.mdx b/docs/evaluation/how_to_guides/langchain_runnable.mdx index 3993abfa..806a3e9f 100644 --- a/docs/evaluation/how_to_guides/langchain_runnable.mdx +++ b/docs/evaluation/how_to_guides/langchain_runnable.mdx @@ -132,7 +132,7 @@ To evaluate our chain we can pass it directly to the `evaluate()` / `aevaluate() The runnable is traced appropriately for each output. -![](../evaluation/static/runnable_eval.png) +![](./static/runnable_eval.png) ## Related diff --git a/docs/evaluation/how_to_guides/llm_as_judge.mdx b/docs/evaluation/how_to_guides/llm_as_judge.mdx index c8a0b8f7..b098bf18 100644 --- a/docs/evaluation/how_to_guides/llm_as_judge.mdx +++ b/docs/evaluation/how_to_guides/llm_as_judge.mdx @@ -72,8 +72,8 @@ for the answer is logically valid and consistent with question and the answer.\\ ]} /> -See [here](../../how_to_guides/evaluation/custom_evaluator) for more on how to write a custom evaluator. +See [here](../../how_to_guides/custom_evaluator) for more on how to write a custom evaluator. ## Prebuilt evaluator via `langchain` -See [here](../../how_to_guides/evaluation/use_langchain_off_the_shelf_evaluators) for how to use prebuilt evaluators from `langchain`. +See [here](../../how_to_guides/use_langchain_off_the_shelf_evaluators) for how to use prebuilt evaluators from `langchain`. 
diff --git a/docs/evaluation/how_to_guides/metric_type.mdx b/docs/evaluation/how_to_guides/metric_type.mdx index a3aa401a..a59e4355 100644 --- a/docs/evaluation/how_to_guides/metric_type.mdx +++ b/docs/evaluation/how_to_guides/metric_type.mdx @@ -6,7 +6,7 @@ import { # How to return categorical vs numerical metrics -LangSmith supports both categorical and numerical metrics, and you can return either when writing a [custom evaluator](../../how_to_guides/evaluation/custom_evaluator). +LangSmith supports both categorical and numerical metrics, and you can return either when writing a [custom evaluator](../../how_to_guides/custom_evaluator). For an evaluator result to be logged as a numerical metric, it must returned as: @@ -68,4 +68,4 @@ Here are some examples: ## Related -- [Return multiple metrics in one evaluator](../../how_to_guides/evaluation/multiple_scores) +- [Return multiple metrics in one evaluator](../../how_to_guides/multiple_scores) diff --git a/docs/evaluation/how_to_guides/multiple_scores.mdx b/docs/evaluation/how_to_guides/multiple_scores.mdx index 2a433002..a69989a6 100644 --- a/docs/evaluation/how_to_guides/multiple_scores.mdx +++ b/docs/evaluation/how_to_guides/multiple_scores.mdx @@ -6,7 +6,7 @@ import { # How to return multiple scores in one evaluator -Sometimes it is useful for a [custom evaluator function](../../how_to_guides/evaluation/custom_evaluator) or [summary evaluator function](../../how_to_guides/evaluation/summary) to return multiple metrics. +Sometimes it is useful for a [custom evaluator function](../../how_to_guides/custom_evaluator) or [summary evaluator function](../../how_to_guides/summary) to return multiple metrics. For example, if you have multiple metrics being generated by an LLM judge, you can save time and money by making a single LLM call that generates multiple metrics instead of making multiple LLM calls. To return multiple scores using the Python SDK, simply return a list of dictionaries/objects of the following form: @@ -71,8 +71,8 @@ Example: Rows from the resulting experiment will display each of the scores. -![](../evaluation/static/multiple_scores.png) +![](./static/multiple_scores.png) ## Related -- [Return categorical vs numerical metrics](../../how_to_guides/evaluation/metric_type) +- [Return categorical vs numerical metrics](../../how_to_guides/metric_type) diff --git a/docs/evaluation/how_to_guides/run_evaluation_from_prompt_playground.mdx b/docs/evaluation/how_to_guides/run_evaluation_from_prompt_playground.mdx index b2dee48b..726b2935 100644 --- a/docs/evaluation/how_to_guides/run_evaluation_from_prompt_playground.mdx +++ b/docs/evaluation/how_to_guides/run_evaluation_from_prompt_playground.mdx @@ -12,12 +12,12 @@ This allows you to test your prompt / model configuration over a series of input 1. **Navigate to the prompt playground** by clicking on "Prompts" in the sidebar, then selecting a prompt from the list of available prompts or creating a new one. 2. **Select the "Switch to dataset" button** to switch to the dataset you want to use for the experiment. Please note that the dataset keys of the dataset inputs must match the input variables of the prompt. In the below sections, note that the selected dataset has inputs with keys "text", which correctly match the input variable of the prompt. Also note that there is a max capacity of 15 inputs for the prompt playground. - ![Switch to dataset](../evaluation/static/switch_to_dataset.png) + ![Switch to dataset](./static/switch_to_dataset.png) 3. 
**Click on the "Start" button** or CMD+Enter to start the experiment. This will run the prompt over all the examples in the dataset and create an entry for the experiment in the dataset details page. Note that you need to commit the prompt to the prompt hub before you can start the experiment to ensure it can be referenced in the experiment. The result for each input will be streamed and displayed inline for each input in the dataset. - ![Input variables](../evaluation/static/input_variables_playground.png) + ![Input variables](./static/input_variables_playground.png) 4. **View the results** by clicking on the "View Experiment" button at the bottom of the page. This will take you to the experiment details page where you can see the results of the experiment. 5. **Navigate back to the commit page** by clicking on the "View Commit" button. This will take you back to the prompt page where you can make changes to the prompt and run more experiments. The "View Commit" button is available to all experiments that were run from the prompt playground. The experiment is prefixed with the prompt repository name, a unique identifier, and the date and time the experiment was run. - ![Playground experiment results](../evaluation/static/playground_experiment_results.png) + ![Playground experiment results](./static/playground_experiment_results.png) ## Add evaluation scores to the experiment diff --git a/docs/evaluation/how_to_guides/summary.mdx b/docs/evaluation/how_to_guides/summary.mdx index 97fd68bf..761043eb 100644 --- a/docs/evaluation/how_to_guides/summary.mdx +++ b/docs/evaluation/how_to_guides/summary.mdx @@ -73,4 +73,4 @@ You can then pass this evaluator to the `evaluate` method as follows: In the LangSmith UI, you'll the summary evaluator's score displayed with the corresponding key. -![](../evaluation/static/summary_eval.png) +![](./static/summary_eval.png) diff --git a/docs/evaluation/how_to_guides/unit_testing.mdx b/docs/evaluation/how_to_guides/unit_testing.mdx index b43eab1b..a6ce4b06 100644 --- a/docs/evaluation/how_to_guides/unit_testing.mdx +++ b/docs/evaluation/how_to_guides/unit_testing.mdx @@ -57,7 +57,7 @@ Each time you run this test suite, LangSmith collects the pass/fail rate and oth The test suite syncs to a corresponding dataset named after your package or github repository. -![Test Example](../evaluation/static/unit-test-suite.png) +![Test Example](./static/unit-test-suite.png) ## Going further diff --git a/docs/evaluation/how_to_guides/upload_existing_experiments.mdx b/docs/evaluation/how_to_guides/upload_existing_experiments.mdx index c9c8551d..caa2901a 100644 --- a/docs/evaluation/how_to_guides/upload_existing_experiments.mdx +++ b/docs/evaluation/how_to_guides/upload_existing_experiments.mdx @@ -260,12 +260,12 @@ information in the request body). ## View the experiment in the UI Now, login to the UI and click on your newly-created dataset! 
You should see a single experiment: -![Uploaded experiments table](../evaluation/static/uploaded_dataset.png) +![Uploaded experiments table](./static/uploaded_dataset.png) Your examples will have been uploaded: -![Uploaded examples](../evaluation/static/uploaded_dataset_examples.png) +![Uploaded examples](./static/uploaded_dataset_examples.png) Clicking on your experiment will bring you to the comparison view: -![Uploaded experiment comparison view](../evaluation/static/uploaded_experiment.png) +![Uploaded experiment comparison view](./static/uploaded_experiment.png) As you upload more experiments to your dataset, you will be able to compare the results and easily identify regressions in the comparison view. diff --git a/docs/evaluation/index.mdx b/docs/evaluation/index.mdx index a88c782f..f8d368a3 100644 --- a/docs/evaluation/index.mdx +++ b/docs/evaluation/index.mdx @@ -116,7 +116,7 @@ groupId="client-language" Click the link printed out by your evaluation run to access the LangSmith Experiments UI, and explore the results of your evaluation. -![](./how_to_guides/evaluation/static/view_experiment.gif) +![](./how_to_guides/static/view_experiment.gif) ## Next steps diff --git a/docs/evaluation/tutorials/agents.mdx b/docs/evaluation/tutorials/agents.mdx index 9efd0f73..83ae815f 100644 --- a/docs/evaluation/tutorials/agents.mdx +++ b/docs/evaluation/tutorials/agents.mdx @@ -460,7 +460,7 @@ See the full overview of single step evaluation in our [conceptual guide](https: ::: -We can check a specific tool call using [a custom evaluator](https://docs.smith.langchain.com/how_to_guides/evaluation/custom_evaluator): +We can check a specific tool call using [a custom evaluator](https://docs.smith.langchain.com/how_to_guides/custom_evaluator): - Here, we just invoke the assistant, `assistant_runnable`, with a prompt and check if the resulting tool call is as expected. - Here, we are using a specialized agent where the tools are hard-coded (rather than passed with the dataset input). @@ -507,7 +507,7 @@ experiment_results = evaluate( ### Trajectory -We can check a trajectory of tool calls using [custom evaluators](https://docs.smith.langchain.com/how_to_guides/evaluation/custom_evaluator): +We can check a trajectory of tool calls using [custom evaluators](https://docs.smith.langchain.com/how_to_guides/custom_evaluator): - Here, we just invoke the agent, `graph.invoke`, with a prompt. - Here, we are using a specialized agent where the tools are hard-coded (rather than passed with the dataset input). diff --git a/docs/evaluation/tutorials/rag.mdx b/docs/evaluation/tutorials/rag.mdx index 3ff6eddf..39b52d71 100644 --- a/docs/evaluation/tutorials/rag.mdx +++ b/docs/evaluation/tutorials/rag.mdx @@ -406,7 +406,7 @@ However, we will show that this is not required. We can isolate them as intermediate chain steps. -See detail on isolating intermediate chain steps [here](https://docs.smith.langchain.com/how_to_guides/evaluation/evaluate_on_intermediate_steps). +See detail on isolating intermediate chain steps [here](https://docs.smith.langchain.com/how_to_guides/evaluate_on_intermediate_steps). 
Here is the a video from our LangSmith evaluation series for reference: diff --git a/docs/evaluation/tutorials/swe-benchmark.mdx b/docs/evaluation/tutorials/swe-benchmark.mdx index aa7ee4b0..c1f9b00b 100644 --- a/docs/evaluation/tutorials/swe-benchmark.mdx +++ b/docs/evaluation/tutorials/swe-benchmark.mdx @@ -72,7 +72,7 @@ dataset = client.upload_csv( ### Create dataset split for quicker testing -Since running the SWE-bench evaluator takes a long time when run on all examples, you can create a "test" split for quickly testing the evaluator and your code. Read [this guide](../../evaluation/how_to_guides/datasets/manage_datasets_in_application#create-and-manage-dataset-splits) to learn more about managing dataset splits, or watch this short video that shows how to do it (to get to the starting page of the video, just click on your dataset created above and go to the `Examples` tab): +Since running the SWE-bench evaluator takes a long time when run on all examples, you can create a "test" split for quickly testing the evaluator and your code. Read [this guide](../../evaluation/how_to_guides/manage_datasets_in_application#create-and-manage-dataset-splits) to learn more about managing dataset splits, or watch this short video that shows how to do it (to get to the starting page of the video, just click on your dataset created above and go to the `Examples` tab): import creating_split from "./static/creating_split.mp4"; diff --git a/docs/observability/concepts/index.mdx b/docs/observability/concepts/index.mdx index b4acdff7..f007e0fc 100644 --- a/docs/observability/concepts/index.mdx +++ b/docs/observability/concepts/index.mdx @@ -50,9 +50,9 @@ Feedback can currently be continuous or discrete (categorical), and you can reus Collecting feedback on runs can be done in a number of ways: -1. [Sent up along with a trace](/evaluation/how_to_guides/human_feedback/attach_user_feedback) from the LLM application -2. Generated by a user in the app [inline](/evaluation/how_to_guides/human_feedback/annotate_traces_inline) or in an [annotation queue](../evaluation/how_to_guides/human_feedback/annotation_queues) -3. Generated by an automatic evaluator during [offline evaluation](/evaluation/how_to_guides/evaluation/evaluate_llm_application) +1. [Sent up along with a trace](/evaluation/how_to_guides/attach_user_feedback) from the LLM application +2. Generated by a user in the app [inline](/evaluation/how_to_guides/annotate_traces_inline) or in an [annotation queue](../evaluation/how_to_guides/annotation_queues) +3. Generated by an automatic evaluator during [offline evaluation](/evaluation/how_to_guides/evaluate_llm_application) 4. Generated by an [online evaluator](./how_to_guides/monitoring/online_evaluations) To learn more about how feedback is stored in the application, see [this reference guide](../reference/data_formats/feedback_data_format). diff --git a/docs/observability/how_to_guides/monitoring/rules.mdx b/docs/observability/how_to_guides/monitoring/rules.mdx index bedaa787..898cdb37 100644 --- a/docs/observability/how_to_guides/monitoring/rules.mdx +++ b/docs/observability/how_to_guides/monitoring/rules.mdx @@ -31,7 +31,7 @@ _Alternatively_, you can access rules in settings by navigating to