diff --git a/docs/administration/how_to_guides/organization_management/manage_organization_by_api.mdx b/docs/administration/how_to_guides/organization_management/manage_organization_by_api.mdx index 385408f8..558f52f2 100644 --- a/docs/administration/how_to_guides/organization_management/manage_organization_by_api.mdx +++ b/docs/administration/how_to_guides/organization_management/manage_organization_by_api.mdx @@ -144,7 +144,7 @@ If the header is not present, operations will default to the workspace the API k ## Security Settings :::note -"Shared resources" in this context refer to [public prompts](../../../prompt_engineering/how_to_guides/prompts/create_a_prompt#save-your-prompt), [shared runs](../../../observability/how_to_guides/tracing/share_trace), and [shared datasets](../../../evaluation/how_to_guides/datasets/share_dataset.mdx). +"Shared resources" in this context refer to [public prompts](../../../prompt_engineering/how_to_guides/prompts/create_a_prompt#save-your-prompt), [shared runs](../../../observability/how_to_guides/tracing/share_trace), and [shared datasets](../../../evaluation/how_to_guides/share_dataset.mdx). ::: - `+ Experiment` -> `Run in Playground`, you can see the results in action. Your runs in your experiments will be automatically marked with the key specified in your code sample above (here, `formatted`): -![](../evaluation/static/show-feedback-from-autoeval-code.png) +![](./static/show-feedback-from-autoeval-code.png) And if you navigate back to your dataset, you'll see summary stats for said experiment in the `experiments` tab: -![](../evaluation/static/experiments-tab-code-results.png) +![](./static/experiments-tab-code-results.png) diff --git a/docs/evaluation/how_to_guides/compare_experiment_results.mdx b/docs/evaluation/how_to_guides/compare_experiment_results.mdx index 9875f4db..2dc96583 100644 --- a/docs/evaluation/how_to_guides/compare_experiment_results.mdx +++ b/docs/evaluation/how_to_guides/compare_experiment_results.mdx @@ -8,52 +8,52 @@ Oftentimes, when you are iterating on your LLM application (such as changing the LangSmith supports a powerful comparison view that lets you hone in on key differences, regressions, and improvements between different experiments. -![](../evaluation/static/regression_test.gif) +![](./static/regression_test.gif) ## Open the comparison view To open the comparison view, select two or more experiments from the "Experiments" tab from a given dataset page. Then, click on the "Compare" button at the bottom of the page. -![](../evaluation/static/open_comparison_view.png) +![](./static/open_comparison_view.png) ## View regressions and improvements In the LangSmith comparison view, runs that _regressed_ on your specified feedback key against your baseline experiment will be highlighted in red, while runs that _improved_ will be highlighted in green. At the top of each column, you can see how many runs in that experiment did better and how many did worse than your baseline experiment. -![Regressions](../evaluation/static/regression_view.png) +![Regressions](./static/regression_view.png) ## Filter on regressions or improvements Click on the regressions or improvements buttons on the top of each column to filter to the runs that regressed or improved in that specific experiment. -![Regressions Filter](../evaluation/static/filter_to_regressions.png) +![Regressions Filter](./static/filter_to_regressions.png) ## Update baseline experiment In order to track regressions, you need a baseline experiment against which to compare. 
This will be automatically assigned as the first experiment in your comparison, but you can change it from the dropdown at the top of the page. -![Baseline](../evaluation/static/select_baseline.png) +![Baseline](./static/select_baseline.png) ## Select feedback key You will also want to select the feedback key (evaluation metric) on which you would like focus on. This can be selected via another dropdown at the top. Again, one will be assigned by default, but you can adjust as needed. -![Feedback](../evaluation/static/select_feedback.png) +![Feedback](./static/select_feedback.png) ## Open a trace If tracing is enabled for the evaluation run, you can click on the trace icon in the hover state of any experiment cell to open the trace view for that run. This will open up a trace in the side panel. -![](../evaluation/static/open_trace_comparison.png) +![](./static/open_trace_comparison.png) ## Expand detailed view From any cell, you can click on the expand icon in the hover state to open up a detailed view of all experiment results on that particular example input, along with feedback keys and scores. -![](../evaluation/static/expanded_view.png) +![](./static/expanded_view.png) ## Update display settings @@ -61,4 +61,4 @@ You can adjust the display settings for comparison view by clicking on "Display" Here, you'll be able to toggle feedback, metrics, summary charts, and expand full text. -![](../evaluation/static/update_display.png) +![](./static/update_display.png) diff --git a/docs/evaluation/how_to_guides/create_few_shot_evaluators.mdx b/docs/evaluation/how_to_guides/create_few_shot_evaluators.mdx index e50b3965..4bf8f696 100644 --- a/docs/evaluation/how_to_guides/create_few_shot_evaluators.mdx +++ b/docs/evaluation/how_to_guides/create_few_shot_evaluators.mdx @@ -34,7 +34,7 @@ as your output key. For example, if your main prompt has variables `question` an You may also specify the number of few-shot examples to use. The default is 5. If your examples will tend to be very long, you may want to set this number lower to save tokens - whereas if your examples tend to be short, you can set a higher number in order to give your evaluator more examples to learn from. If you have more examples in your dataset than this number, we will randomly choose them for you. -![Use corrections as few-shot examples](../evaluation/static/use_corrections_as_few_shot.png) +![Use corrections as few-shot examples](./static/use_corrections_as_few_shot.png) Note that few-shot examples are not currently supported in evaluators that use Hub prompts. @@ -51,7 +51,7 @@ begin seeing examples populated inside your corrections dataset. As you make cor The inputs to the few-shot examples will be the relevant fields from the inputs, outputs, and reference (if this an offline evaluator) of your chain/dataset. The outputs will be the corrected evaluator score and the explanations that you created when you left the corrections. Feel free to edit these to your liking. Here is an example of a few-shot example in a corrections dataset: -![Few-shot example](../evaluation/static/few_shot_example.png) +![Few-shot example](./static/few_shot_example.png) Note that the corrections may take a minute or two to be populated into your few-shot dataset. Once they are there, future runs of your evaluator will include them in the prompt! 
@@ -59,12 +59,12 @@ Note that the corrections may take a minute or two to be populated into your few In order to view your corrections dataset, go to your rule and click "Edit Rule" (or "Edit Evaluator" from a dataset): -![Edit Evaluator](../evaluation/static/edit_evaluator.png) +![Edit Evaluator](./static/edit_evaluator.png) If this is an online evaluator (in a tracing project), you will need to click to edit your prompt: -![Edit Prompt](../evaluation/static/click_to_edit_prompt.png) +![Edit Prompt](./static/click_to_edit_prompt.png) From this screen, you will see a button that says "View few-shot dataset". Clicking this will bring you to your dataset of corrections, where you can view and update your few-shot examples: -![View few-shot dataset](../evaluation/static/view_few_shot_ds.png) +![View few-shot dataset](./static/view_few_shot_ds.png) diff --git a/docs/evaluation/how_to_guides/custom_evaluator.mdx b/docs/evaluation/how_to_guides/custom_evaluator.mdx index bce7b66d..9086b696 100644 --- a/docs/evaluation/how_to_guides/custom_evaluator.mdx +++ b/docs/evaluation/how_to_guides/custom_evaluator.mdx @@ -138,5 +138,5 @@ answer is logically valid and consistent with question and the answer.""" ## Related -- [Evaluate aggregate experiment results](../../how_to_guides/evaluation/summary): Define summary evaluators, which compute metrics for an entire experiment. -- [Run an evaluation comparing two experiments](../../how_to_guides/evaluation/evaluate_pairwise): Define pairwise evaluators, which compute metrics by comparing two (or more) experiments against each other. +- [Evaluate aggregate experiment results](../../how_to_guides/summary): Define summary evaluators, which compute metrics for an entire experiment. +- [Run an evaluation comparing two experiments](../../how_to_guides/evaluate_pairwise): Define pairwise evaluators, which compute metrics by comparing two (or more) experiments against each other. diff --git a/docs/evaluation/how_to_guides/evaluate_llm_application.mdx b/docs/evaluation/how_to_guides/evaluate_llm_application.mdx index fdefed61..a6b886c6 100644 --- a/docs/evaluation/how_to_guides/evaluate_llm_application.mdx +++ b/docs/evaluation/how_to_guides/evaluate_llm_application.mdx @@ -22,7 +22,7 @@ In this guide we'll go over how to evaluate an application using the [evaluate() For larger evaluation jobs in Python we recommend using [aevaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._arunner.aevaluate.html), the asynchronous version of `evaluate()`. It is still worthwhile to read this guide first, as the two have nearly identical interfaces, -and then read the how-to guide on [running an evaluation asynchronously](../../how_to_guides/evaluation/async). +and then read the how-to guide on [running an evaluation asynchronously](../../how_to_guides/async). ::: @@ -223,7 +223,7 @@ Evaluation scores are stored against each actual output as feedback. 
_If you've annotated your code for tracing, you can open the trace of each row in a side panel view._ -![](../evaluation/static/view_experiment.gif) +![](./static/view_experiment.gif) ## Reference code @@ -364,6 +364,6 @@ _If you've annotated your code for tracing, you can open the trace of each row i ## Related -- [Run an evaluation asynchronously](../../how_to_guides/evaluation/async) -- [Run an evaluation via the REST API](../../how_to_guides/evaluation/run_evals_api_only) -- [Run an evaluation from the prompt playground](../../how_to_guides/evaluation/run_evaluation_from_prompt_playground) +- [Run an evaluation asynchronously](../../how_to_guides/async) +- [Run an evaluation via the REST API](../../how_to_guides/run_evals_api_only) +- [Run an evaluation from the prompt playground](../../how_to_guides/run_evaluation_from_prompt_playground) diff --git a/docs/evaluation/how_to_guides/evaluate_on_intermediate_steps.mdx b/docs/evaluation/how_to_guides/evaluate_on_intermediate_steps.mdx index 39e1041a..4685864e 100644 --- a/docs/evaluation/how_to_guides/evaluate_on_intermediate_steps.mdx +++ b/docs/evaluation/how_to_guides/evaluate_on_intermediate_steps.mdx @@ -167,7 +167,7 @@ def rag_pipeline(question): /> This pipeline will produce a trace that looks something like: -![](../evaluation/static/evaluation_intermediate_trace.png) +![](./static/evaluation_intermediate_trace.png) ## 2. Create a dataset and examples to evaluate the pipeline @@ -387,7 +387,7 @@ Finally, we'll run `evaluate` with the custom evaluators defined above. /> The experiment will contain the results of the evaluation, including the scores and comments from the evaluators: -![](../evaluation/static/evaluation_intermediate_experiment.png) +![](./static/evaluation_intermediate_experiment.png) ## Related diff --git a/docs/evaluation/how_to_guides/evaluate_pairwise.mdx b/docs/evaluation/how_to_guides/evaluate_pairwise.mdx index d68b48b7..55c2857f 100644 --- a/docs/evaluation/how_to_guides/evaluate_pairwise.mdx +++ b/docs/evaluation/how_to_guides/evaluate_pairwise.mdx @@ -22,7 +22,7 @@ This allows you to score the outputs from multiple experiments against each othe Think [LMSYS Chatbot Arena](https://chat.lmsys.org/) - this is the same concept! To do this, use the [evaluate_comparative](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._runner.evaluate_comparative.html) / `evaluateComparative` function with two existing experiments. -If you haven't already created experiments to compare, check out our [quick start](https://docs.smith.langchain.com/#5-run-your-first-evaluation) or oue [how-to guide](https://docs.smith.langchain.com/how_to_guides/evaluation/evaluate_llm_application) to get started with evaluations. +If you haven't already created experiments to compare, check out our [quick start](https://docs.smith.langchain.com/#5-run-your-first-evaluation) or our [how-to guide](https://docs.smith.langchain.com/how_to_guides/evaluate_llm_application) to get started with evaluations.
## `evaluate_comparative` args @@ -240,12 +240,12 @@ In the Python example below, we are pulling [this structured prompt](https://smi Navigate to the "Pairwise Experiments" tab from the dataset page: -![Pairwise Experiments Tab](../evaluation/static/pairwise_from_dataset.png) +![Pairwise Experiments Tab](./static/pairwise_from_dataset.png) Click on a pairwise experiment that you would like to inspect, and you will be brought to the Comparison View: -![Pairwise Comparison View](../evaluation/static/pairwise_comparison_view.png) +![Pairwise Comparison View](./static/pairwise_comparison_view.png) You may filter to runs where the first experiment was better or vice versa by clicking the thumbs up/thumbs down buttons in the table header: -![Pairwise Filtering](../evaluation/static/filter_pairwise.png) +![Pairwise Filtering](./static/filter_pairwise.png) diff --git a/docs/evaluation/how_to_guides/filter_experiments_ui.mdx b/docs/evaluation/how_to_guides/filter_experiments_ui.mdx index 6f32dfc2..eff983ef 100644 --- a/docs/evaluation/how_to_guides/filter_experiments_ui.mdx +++ b/docs/evaluation/how_to_guides/filter_experiments_ui.mdx @@ -74,20 +74,20 @@ and a known ID of the prompt: In the UI, we see all experiments that have been run by default. -![](../evaluation/static/filter-all-experiments.png) +![](./static/filter-all-experiments.png) If we, say, have a preference for openai models, we can easily filter down and see scores within just openai models first: -![](../evaluation/static/filter-openai.png) +![](./static/filter-openai.png) We can stack filters, allowing us to filter out low scores on correctness to make sure we only compare relevant experiments: -![](../evaluation/static/filter-feedback.png) +![](./static/filter-feedback.png) Finally, we can clear and reset filters. For example, if we see there is clear there's a winner with the `singleminded` prompt, we can change filtering settings to see if any other model providers' models work as well with it: -![](../evaluation/static/filter-singleminded.png) +![](./static/filter-singleminded.png) diff --git a/docs/evaluation/how_to_guides/langchain_runnable.mdx b/docs/evaluation/how_to_guides/langchain_runnable.mdx index 3993abfa..806a3e9f 100644 --- a/docs/evaluation/how_to_guides/langchain_runnable.mdx +++ b/docs/evaluation/how_to_guides/langchain_runnable.mdx @@ -132,7 +132,7 @@ To evaluate our chain we can pass it directly to the `evaluate()` / `aevaluate() The runnable is traced appropriately for each output. -![](../evaluation/static/runnable_eval.png) +![](./static/runnable_eval.png) ## Related diff --git a/docs/evaluation/how_to_guides/llm_as_judge.mdx b/docs/evaluation/how_to_guides/llm_as_judge.mdx index c8a0b8f7..b098bf18 100644 --- a/docs/evaluation/how_to_guides/llm_as_judge.mdx +++ b/docs/evaluation/how_to_guides/llm_as_judge.mdx @@ -72,8 +72,8 @@ for the answer is logically valid and consistent with question and the answer.\\ ]} /> -See [here](../../how_to_guides/evaluation/custom_evaluator) for more on how to write a custom evaluator. +See [here](../../how_to_guides/custom_evaluator) for more on how to write a custom evaluator. ## Prebuilt evaluator via `langchain` -See [here](../../how_to_guides/evaluation/use_langchain_off_the_shelf_evaluators) for how to use prebuilt evaluators from `langchain`. +See [here](../../how_to_guides/use_langchain_off_the_shelf_evaluators) for how to use prebuilt evaluators from `langchain`. 
diff --git a/docs/evaluation/how_to_guides/metric_type.mdx b/docs/evaluation/how_to_guides/metric_type.mdx index a3aa401a..a59e4355 100644 --- a/docs/evaluation/how_to_guides/metric_type.mdx +++ b/docs/evaluation/how_to_guides/metric_type.mdx @@ -6,7 +6,7 @@ import { # How to return categorical vs numerical metrics -LangSmith supports both categorical and numerical metrics, and you can return either when writing a [custom evaluator](../../how_to_guides/evaluation/custom_evaluator). +LangSmith supports both categorical and numerical metrics, and you can return either when writing a [custom evaluator](../../how_to_guides/custom_evaluator). For an evaluator result to be logged as a numerical metric, it must returned as: @@ -68,4 +68,4 @@ Here are some examples: ## Related -- [Return multiple metrics in one evaluator](../../how_to_guides/evaluation/multiple_scores) +- [Return multiple metrics in one evaluator](../../how_to_guides/multiple_scores) diff --git a/docs/evaluation/how_to_guides/multiple_scores.mdx b/docs/evaluation/how_to_guides/multiple_scores.mdx index 2a433002..a69989a6 100644 --- a/docs/evaluation/how_to_guides/multiple_scores.mdx +++ b/docs/evaluation/how_to_guides/multiple_scores.mdx @@ -6,7 +6,7 @@ import { # How to return multiple scores in one evaluator -Sometimes it is useful for a [custom evaluator function](../../how_to_guides/evaluation/custom_evaluator) or [summary evaluator function](../../how_to_guides/evaluation/summary) to return multiple metrics. +Sometimes it is useful for a [custom evaluator function](../../how_to_guides/custom_evaluator) or [summary evaluator function](../../how_to_guides/summary) to return multiple metrics. For example, if you have multiple metrics being generated by an LLM judge, you can save time and money by making a single LLM call that generates multiple metrics instead of making multiple LLM calls. To return multiple scores using the Python SDK, simply return a list of dictionaries/objects of the following form: @@ -71,8 +71,8 @@ Example: Rows from the resulting experiment will display each of the scores. -![](../evaluation/static/multiple_scores.png) +![](./static/multiple_scores.png) ## Related -- [Return categorical vs numerical metrics](../../how_to_guides/evaluation/metric_type) +- [Return categorical vs numerical metrics](../../how_to_guides/metric_type) diff --git a/docs/evaluation/how_to_guides/run_evaluation_from_prompt_playground.mdx b/docs/evaluation/how_to_guides/run_evaluation_from_prompt_playground.mdx index b2dee48b..726b2935 100644 --- a/docs/evaluation/how_to_guides/run_evaluation_from_prompt_playground.mdx +++ b/docs/evaluation/how_to_guides/run_evaluation_from_prompt_playground.mdx @@ -12,12 +12,12 @@ This allows you to test your prompt / model configuration over a series of input 1. **Navigate to the prompt playground** by clicking on "Prompts" in the sidebar, then selecting a prompt from the list of available prompts or creating a new one. 2. **Select the "Switch to dataset" button** to switch to the dataset you want to use for the experiment. Please note that the dataset keys of the dataset inputs must match the input variables of the prompt. In the below sections, note that the selected dataset has inputs with keys "text", which correctly match the input variable of the prompt. Also note that there is a max capacity of 15 inputs for the prompt playground. - ![Switch to dataset](../evaluation/static/switch_to_dataset.png) + ![Switch to dataset](./static/switch_to_dataset.png) 3. 
**Click on the "Start" button** or CMD+Enter to start the experiment. This will run the prompt over all the examples in the dataset and create an entry for the experiment in the dataset details page. Note that you need to commit the prompt to the prompt hub before you can start the experiment to ensure it can be referenced in the experiment. The result for each input will be streamed and displayed inline for each input in the dataset. - ![Input variables](../evaluation/static/input_variables_playground.png) + ![Input variables](./static/input_variables_playground.png) 4. **View the results** by clicking on the "View Experiment" button at the bottom of the page. This will take you to the experiment details page where you can see the results of the experiment. 5. **Navigate back to the commit page** by clicking on the "View Commit" button. This will take you back to the prompt page where you can make changes to the prompt and run more experiments. The "View Commit" button is available to all experiments that were run from the prompt playground. The experiment is prefixed with the prompt repository name, a unique identifier, and the date and time the experiment was run. - ![Playground experiment results](../evaluation/static/playground_experiment_results.png) + ![Playground experiment results](./static/playground_experiment_results.png) ## Add evaluation scores to the experiment diff --git a/docs/evaluation/how_to_guides/summary.mdx b/docs/evaluation/how_to_guides/summary.mdx index 97fd68bf..761043eb 100644 --- a/docs/evaluation/how_to_guides/summary.mdx +++ b/docs/evaluation/how_to_guides/summary.mdx @@ -73,4 +73,4 @@ You can then pass this evaluator to the `evaluate` method as follows: In the LangSmith UI, you'll the summary evaluator's score displayed with the corresponding key. -![](../evaluation/static/summary_eval.png) +![](./static/summary_eval.png) diff --git a/docs/evaluation/how_to_guides/unit_testing.mdx b/docs/evaluation/how_to_guides/unit_testing.mdx index b43eab1b..a6ce4b06 100644 --- a/docs/evaluation/how_to_guides/unit_testing.mdx +++ b/docs/evaluation/how_to_guides/unit_testing.mdx @@ -57,7 +57,7 @@ Each time you run this test suite, LangSmith collects the pass/fail rate and oth The test suite syncs to a corresponding dataset named after your package or github repository. -![Test Example](../evaluation/static/unit-test-suite.png) +![Test Example](./static/unit-test-suite.png) ## Going further diff --git a/docs/evaluation/how_to_guides/upload_existing_experiments.mdx b/docs/evaluation/how_to_guides/upload_existing_experiments.mdx index c9c8551d..caa2901a 100644 --- a/docs/evaluation/how_to_guides/upload_existing_experiments.mdx +++ b/docs/evaluation/how_to_guides/upload_existing_experiments.mdx @@ -260,12 +260,12 @@ information in the request body). ## View the experiment in the UI Now, login to the UI and click on your newly-created dataset! 
You should see a single experiment: -![Uploaded experiments table](../evaluation/static/uploaded_dataset.png) +![Uploaded experiments table](./static/uploaded_dataset.png) Your examples will have been uploaded: -![Uploaded examples](../evaluation/static/uploaded_dataset_examples.png) +![Uploaded examples](./static/uploaded_dataset_examples.png) Clicking on your experiment will bring you to the comparison view: -![Uploaded experiment comparison view](../evaluation/static/uploaded_experiment.png) +![Uploaded experiment comparison view](./static/uploaded_experiment.png) As you upload more experiments to your dataset, you will be able to compare the results and easily identify regressions in the comparison view. diff --git a/docs/evaluation/index.mdx b/docs/evaluation/index.mdx index a88c782f..f8d368a3 100644 --- a/docs/evaluation/index.mdx +++ b/docs/evaluation/index.mdx @@ -116,7 +116,7 @@ groupId="client-language" Click the link printed out by your evaluation run to access the LangSmith Experiments UI, and explore the results of your evaluation. -![](./how_to_guides/evaluation/static/view_experiment.gif) +![](./how_to_guides/static/view_experiment.gif) ## Next steps diff --git a/docs/evaluation/tutorials/agents.mdx b/docs/evaluation/tutorials/agents.mdx index 9efd0f73..83ae815f 100644 --- a/docs/evaluation/tutorials/agents.mdx +++ b/docs/evaluation/tutorials/agents.mdx @@ -460,7 +460,7 @@ See the full overview of single step evaluation in our [conceptual guide](https: ::: -We can check a specific tool call using [a custom evaluator](https://docs.smith.langchain.com/how_to_guides/evaluation/custom_evaluator): +We can check a specific tool call using [a custom evaluator](https://docs.smith.langchain.com/how_to_guides/custom_evaluator): - Here, we just invoke the assistant, `assistant_runnable`, with a prompt and check if the resulting tool call is as expected. - Here, we are using a specialized agent where the tools are hard-coded (rather than passed with the dataset input). @@ -507,7 +507,7 @@ experiment_results = evaluate( ### Trajectory -We can check a trajectory of tool calls using [custom evaluators](https://docs.smith.langchain.com/how_to_guides/evaluation/custom_evaluator): +We can check a trajectory of tool calls using [custom evaluators](https://docs.smith.langchain.com/how_to_guides/custom_evaluator): - Here, we just invoke the agent, `graph.invoke`, with a prompt. - Here, we are using a specialized agent where the tools are hard-coded (rather than passed with the dataset input). diff --git a/docs/evaluation/tutorials/rag.mdx b/docs/evaluation/tutorials/rag.mdx index 3ff6eddf..39b52d71 100644 --- a/docs/evaluation/tutorials/rag.mdx +++ b/docs/evaluation/tutorials/rag.mdx @@ -406,7 +406,7 @@ However, we will show that this is not required. We can isolate them as intermediate chain steps. -See detail on isolating intermediate chain steps [here](https://docs.smith.langchain.com/how_to_guides/evaluation/evaluate_on_intermediate_steps). +See detail on isolating intermediate chain steps [here](https://docs.smith.langchain.com/how_to_guides/evaluate_on_intermediate_steps). 
Here is the a video from our LangSmith evaluation series for reference: diff --git a/docs/evaluation/tutorials/swe-benchmark.mdx b/docs/evaluation/tutorials/swe-benchmark.mdx index aa7ee4b0..c1f9b00b 100644 --- a/docs/evaluation/tutorials/swe-benchmark.mdx +++ b/docs/evaluation/tutorials/swe-benchmark.mdx @@ -72,7 +72,7 @@ dataset = client.upload_csv( ### Create dataset split for quicker testing -Since running the SWE-bench evaluator takes a long time when run on all examples, you can create a "test" split for quickly testing the evaluator and your code. Read [this guide](../../evaluation/how_to_guides/datasets/manage_datasets_in_application#create-and-manage-dataset-splits) to learn more about managing dataset splits, or watch this short video that shows how to do it (to get to the starting page of the video, just click on your dataset created above and go to the `Examples` tab): +Since running the SWE-bench evaluator takes a long time when run on all examples, you can create a "test" split for quickly testing the evaluator and your code. Read [this guide](../../evaluation/how_to_guides/manage_datasets_in_application#create-and-manage-dataset-splits) to learn more about managing dataset splits, or watch this short video that shows how to do it (to get to the starting page of the video, just click on your dataset created above and go to the `Examples` tab): import creating_split from "./static/creating_split.mp4"; diff --git a/docs/observability/concepts/index.mdx b/docs/observability/concepts/index.mdx index b4acdff7..f007e0fc 100644 --- a/docs/observability/concepts/index.mdx +++ b/docs/observability/concepts/index.mdx @@ -50,9 +50,9 @@ Feedback can currently be continuous or discrete (categorical), and you can reus Collecting feedback on runs can be done in a number of ways: -1. [Sent up along with a trace](/evaluation/how_to_guides/human_feedback/attach_user_feedback) from the LLM application -2. Generated by a user in the app [inline](/evaluation/how_to_guides/human_feedback/annotate_traces_inline) or in an [annotation queue](../evaluation/how_to_guides/human_feedback/annotation_queues) -3. Generated by an automatic evaluator during [offline evaluation](/evaluation/how_to_guides/evaluation/evaluate_llm_application) +1. [Sent up along with a trace](/evaluation/how_to_guides/attach_user_feedback) from the LLM application +2. Generated by a user in the app [inline](/evaluation/how_to_guides/annotate_traces_inline) or in an [annotation queue](../evaluation/how_to_guides/annotation_queues) +3. Generated by an automatic evaluator during [offline evaluation](/evaluation/how_to_guides/evaluate_llm_application) 4. Generated by an [online evaluator](./how_to_guides/monitoring/online_evaluations) To learn more about how feedback is stored in the application, see [this reference guide](../reference/data_formats/feedback_data_format). diff --git a/docs/observability/how_to_guides/monitoring/rules.mdx b/docs/observability/how_to_guides/monitoring/rules.mdx index bedaa787..898cdb37 100644 --- a/docs/observability/how_to_guides/monitoring/rules.mdx +++ b/docs/observability/how_to_guides/monitoring/rules.mdx @@ -31,7 +31,7 @@ _Alternatively_, you can access rules in settings by navigating to