docs: updated the formula for some metrics (#1834)
sahusiddharth authored Jan 11, 2025
1 parent 91393e6 commit f265cf5
Showing 13 changed files with 168 additions and 56 deletions.
37 changes: 27 additions & 10 deletions docs/concepts/metrics/available_metrics/agents.md
@@ -52,17 +52,24 @@ AIMessage(content="I found a great recipe for chocolate cake! Would you like the


sample = MultiTurnSample(user_input=sample_input_4, reference_topics=["science"])
scorer = TopicAdherenceScore(mode="precision")
scorer.llm = openai_model
scorer = TopicAdherenceScore(llm = evaluator_llm, mode="precision")
await scorer.multi_turn_ascore(sample)
```
Output
```
0.6666666666444444
```


To change the mode to recall, set the `mode` parameter to `recall`.

```python
scorer = TopicAdherenceScore(mode="recall")
scorer = TopicAdherenceScore(llm = evaluator_llm, mode="recall")
```
Output
```
0.99999999995
```



@@ -96,10 +103,13 @@ sample = MultiTurnSample(
]
)

scorer = ToolCallAccuracy()
scorer.llm = your_llm
scorer = ToolCallAccuracy(llm = evaluator_llm)
await scorer.multi_turn_ascore(sample)
```
Output
```
1.0
```

The tool call sequence specified in `reference_tool_calls` is used as the ideal outcome. If the tool calls made by the AI do not match the order or sequence of the `reference_tool_calls`, the metric will return a score of 0. This helps to ensure that the AI is able to identify and call the required tools in the correct order to complete a given task.
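For intuition, here is a minimal sketch of this all-or-nothing sequence check in plain Python. It is not the ragas implementation; the `SimpleToolCall` class, the tool names, and the arguments below are invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class SimpleToolCall:
    name: str
    args: dict


def sequence_accuracy(predicted: list, reference: list) -> float:
    """Return 1.0 only if every tool call matches the reference in order,
    name, and arguments; otherwise return 0.0, mirroring the behaviour
    described above."""
    if len(predicted) != len(reference):
        return 0.0
    for pred, ref in zip(predicted, reference):
        if pred.name != ref.name or pred.args != ref.args:
            return 0.0
    return 1.0


# Hypothetical example: the two calls are correct individually,
# but their order differs from the reference, so the score is 0.
reference = [
    SimpleToolCall("weather_check", {"location": "New York"}),
    SimpleToolCall("restaurant_search", {"location": "New York"}),
]
predicted = [
    SimpleToolCall("restaurant_search", {"location": "New York"}),
    SimpleToolCall("weather_check", {"location": "New York"}),
]
print(sequence_accuracy(predicted, reference))  # 0.0
```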

@@ -109,7 +119,7 @@ By default the tool names and arguments are compared using exact string matching
from ragas.metrics._string import NonLLMStringSimilarity
from ragas.metrics._tool_call_accuracy import ToolCallAccuracy

metric = ToolCallAccuracy()
metric = ToolCallAccuracy(llm = evaluator_llm)
metric.arg_comparison_metric = NonLLMStringSimilarity()
```

@@ -146,10 +156,13 @@ sample = MultiTurnSample(user_input=[
],
reference="Table booked at one of the chinese restaurants at 8 pm")

scorer = AgentGoalAccuracyWithReference()
scorer.llm = your_llm
scorer = AgentGoalAccuracyWithReference(llm = evaluator_llm)
await scorer.multi_turn_ascore(sample)

```
Output
```
1.0
```

### Without reference
@@ -181,7 +194,11 @@ sample = MultiTurnSample(user_input=[
HumanMessage(content="thanks"),
])

scorer = AgentGoalAccuracyWithoutReference()
await metric.multi_turn_ascore(sample)
scorer = AgentGoalAccuracyWithoutReference(llm = evaluator_llm)
await scorer.multi_turn_ascore(sample)

```
Output
```
1.0
```
36 changes: 21 additions & 15 deletions docs/concepts/metrics/available_metrics/answer_relevance.md
@@ -1,27 +1,29 @@
## Response Relevancy

`ResponseRelevancy` metric focuses on assessing how pertinent the generated answer is to the given prompt. A lower score is assigned to answers that are incomplete or contain redundant information, and higher scores indicate better relevancy. This metric is computed using the `user_input`, the `retrieved_contexts` and the `response`.

The Answer Relevancy is defined as the mean cosine similarity of the original `user_input` to a number of artificial questions, which were generated (reverse-engineered) based on the `response`:

$$
\text{answer relevancy} = \frac{1}{N} \sum_{i=1}^{N} \cos(E_{g_i}, E_o)
$$

$$
\text{answer relevancy} = \frac{1}{N} \sum_{i=1}^{N} \frac{E_{g_i} \cdot E_o}{\|E_{g_i}\|\|E_o\|}
$$

Where:

* $E_{g_i}$ is the embedding of the generated question $i$.
* $E_o$ is the embedding of the original question.
* $N$ is the number of generated questions, which is 3 by default.

Please note that even though in practice the score will range between 0 and 1 most of the time, this is not mathematically guaranteed, due to the nature of the cosine similarity ranging from -1 to 1.

An answer is deemed relevant when it directly and appropriately addresses the original question. Importantly, our assessment of answer relevance does not consider factuality but instead penalizes cases where the answer lacks completeness or contains redundant details. To calculate this score, the LLM is prompted to generate an appropriate question for the generated answer multiple times, and the mean cosine similarity between these generated questions and the original question is measured. The underlying idea is that if the generated answer accurately addresses the initial question, the LLM should be able to generate questions from the answer that align with the original question.

The `ResponseRelevancy` metric measures how relevant a response is to the user input. Higher scores indicate better alignment with the user input, while lower scores are given if the response is incomplete or includes redundant information.

This metric is calculated using the `user_input` and the `response` as follows:

1. Generate a set of artificial questions (default is 3) based on the response. These questions are designed to reflect the content of the response.
2. Compute the cosine similarity between the embedding of the user input ($E_o$) and the embedding of each generated question ($E_{g_i}$).
3. Take the average of these cosine similarity scores to get the **Answer Relevancy**:

$$
\text{Answer Relevancy} = \frac{1}{N} \sum_{i=1}^{N} \text{cosine similarity}(E_{g_i}, E_o)
$$

$$
\text{Answer Relevancy} = \frac{1}{N} \sum_{i=1}^{N} \frac{E_{g_i} \cdot E_o}{\|E_{g_i}\| \|E_o\|}
$$

Where:

- $E_{g_i}$: Embedding of the $i^{th}$ generated question.
- $E_o$: Embedding of the user input.
- $N$: Number of generated questions (default is 3).

**Note**: While the score usually falls between 0 and 1, it is not guaranteed due to cosine similarity's mathematical range of -1 to 1.

An answer is considered relevant if it directly and appropriately addresses the original question. This metric focuses on how well the answer matches the intent of the question, without evaluating factual accuracy. It penalizes answers that are incomplete or include unnecessary details.
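For intuition, the averaging step can be sketched with plain NumPy. The embedding vectors and variable names below are made up for illustration; in practice the embeddings come from the configured embedding model.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cos(E_g, E_o) = (E_g . E_o) / (||E_g|| * ||E_o||)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Embedding of the user input (E_o) and of N = 3 generated questions (E_g_i),
# with hypothetical values purely for illustration.
E_o = np.array([0.10, 0.30, 0.50])
E_g = [
    np.array([0.12, 0.28, 0.55]),
    np.array([0.09, 0.31, 0.47]),
    np.array([0.20, 0.10, 0.40]),
]

# Answer Relevancy = mean cosine similarity between E_o and each generated question.
answer_relevancy = np.mean([cosine_similarity(e, E_o) for e in E_g])
print(answer_relevancy)
```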

### Example

@@ -37,9 +39,13 @@ sample = SingleTurnSample(
]
)

scorer = ResponseRelevancy()
scorer = ResponseRelevancy(llm=evaluator_llm, embeddings=evaluator_embedding)
await scorer.single_turn_ascore(sample)
```
Output
```
0.9165088378587264
```

### How It’s Calculated

1 change: 1 addition & 0 deletions docs/concepts/metrics/available_metrics/aspect_critic.md
@@ -38,6 +38,7 @@ scorer.llm = openai_model
await scorer.single_turn_ascore(sample)
```


## Calculation

Critics are essentially basic LLM calls using the defined criteria. For example, let's see how the harmfulness critic works:
33 changes: 21 additions & 12 deletions docs/concepts/metrics/available_metrics/context_entities_recall.md
@@ -2,10 +2,15 @@

The `ContextEntityRecall` metric gives the measure of recall of the retrieved context, based on the number of entities present in both `reference` and `retrieved_contexts` relative to the number of entities present in the `reference` alone. Simply put, it measures what fraction of entities from the `reference` are recalled in the `retrieved_contexts`. This metric is useful in fact-based use cases like tourism help desks, historical QA, etc. It can help evaluate the retrieval mechanism for entities by comparing against the entities present in the `reference`, because in cases where entities matter, we need `retrieved_contexts` that cover them.

To compute this metric, we use two sets, $GE$ and $CE$, the set of entities present in `reference` and the set of entities present in `retrieved_contexts` respectively. We then take the number of elements in the intersection of these sets and divide it by the number of elements present in $GE$, given by the formula:
To compute this metric, we use two sets:

- **$RE$**: The set of entities in the reference.
- **$RCE$**: The set of entities in the retrieved contexts.

We calculate the number of entities common to both sets ($RCE \cap RE$) and divide it by the total number of entities in the reference ($RE$). The formula is:

$$
\text{context entity recall} = \frac{| CE \cap GE |}{| GE |}
$$

$$
\text{Context Entity Recall} = \frac{\text{Number of common entities between $RCE$ and $RE$}}{\text{Total number of entities in $RE$}}
$$


@@ -20,10 +25,14 @@ sample = SingleTurnSample(
retrieved_contexts=["The Eiffel Tower is located in Paris."],
)

scorer = ContextEntityRecall()
scorer = ContextEntityRecall(llm=evaluator_llm)

await scorer.single_turn_ascore(sample)
```
Output
```
0.999999995
```

### How It’s Calculated

@@ -34,25 +43,25 @@ await scorer.single_turn_ascore(sample)
**High entity recall context**: The Taj Mahal is a symbol of love and architectural marvel located in Agra, India. It was built by the Mughal emperor Shah Jahan in memory of his beloved wife, Mumtaz Mahal. The structure is renowned for its intricate marble work and beautiful gardens surrounding it.
**Low entity recall context**: The Taj Mahal is an iconic monument in India. It is a UNESCO World Heritage Site and attracts millions of visitors annually. The intricate carvings and stunning architecture make it a must-visit destination.

Let us consider the ground truth and the contexts given above.
Let us consider the reference and the retrieved contexts given above.

- **Step-1**: Find entities present in the ground truths.
    - Entities in ground truth (GE) - ['Taj Mahal', 'Yamuna', 'Agra', '1631', 'Shah Jahan', 'Mumtaz Mahal']
- **Step-2**: Find entities present in the context.
    - Entities in context (CE1) - ['Taj Mahal', 'Agra', 'Shah Jahan', 'Mumtaz Mahal', 'India']
    - Entities in context (CE2) - ['Taj Mahal', 'UNESCO', 'India']
- **Step-1**: Find entities present in the reference.
    - Entities in reference (RE) - ['Taj Mahal', 'Yamuna', 'Agra', '1631', 'Shah Jahan', 'Mumtaz Mahal']
- **Step-2**: Find entities present in the retrieved contexts.
    - Entities in context (RCE1) - ['Taj Mahal', 'Agra', 'Shah Jahan', 'Mumtaz Mahal', 'India']
    - Entities in context (RCE2) - ['Taj Mahal', 'UNESCO', 'India']
- **Step-3**: Use the formula given above to calculate entity recall.

$$
\text{context entity recall 1} = \frac{| CE1 \cap GE |}{| GE |}
$$

$$
\text{context entity recall 1} = \frac{| RCE1 \cap RE |}{| RE |}
= 4/6
= 0.666
$$

$$
\text{context entity recall 2} = \frac{| CE2 \cap GE |}{| GE |}
$$

$$
\text{context entity recall 2} = \frac{| RCE2 \cap RE |}{| RE |}
= 1/6
$$

We can see that the first context had a high entity recall, because it has a better entity coverage given the ground truth. If these two contexts were fetched by two retrieval mechanisms on same set of documents, we could say that the first mechanism was better than the other in use-cases where entities are of importance.
We can see that the first context had a high entity recall, because it has better entity coverage given the reference. If these two retrieved contexts were fetched by two retrieval mechanisms on the same set of documents, we could say that the first mechanism was better than the other in use cases where entities are of importance.
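The arithmetic above can be reproduced with ordinary Python sets. This is only a sketch of the formula using the entities listed in the example; the actual metric extracts the entities with an LLM before applying it.

```python
# Entities from the reference and the two retrieved contexts in the example above.
RE = {"Taj Mahal", "Yamuna", "Agra", "1631", "Shah Jahan", "Mumtaz Mahal"}
RCE1 = {"Taj Mahal", "Agra", "Shah Jahan", "Mumtaz Mahal", "India"}
RCE2 = {"Taj Mahal", "UNESCO", "India"}


def context_entity_recall(rce: set, re_: set) -> float:
    # |RCE ∩ RE| / |RE|
    return len(rce & re_) / len(re_)


print(context_entity_recall(RCE1, RE))  # 4/6 ≈ 0.666
print(context_entity_recall(RCE2, RE))  # 1/6 ≈ 0.166
```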

16 changes: 14 additions & 2 deletions docs/concepts/metrics/available_metrics/context_precision.md
@@ -25,7 +25,7 @@ The following metrics use an LLM to identify if a retrieved context is relevant or
from ragas import SingleTurnSample
from ragas.metrics import LLMContextPrecisionWithoutReference

context_precision = LLMContextPrecisionWithoutReference()
context_precision = LLMContextPrecisionWithoutReference(llm=evaluator_llm)

sample = SingleTurnSample(
user_input="Where is the Eiffel Tower located?",
@@ -36,6 +36,10 @@

await context_precision.single_turn_ascore(sample)
```
Output
```
0.9999999999
```

### Context Precision with reference

@@ -47,7 +51,7 @@ await context_precision.single_turn_ascore(sample)
from ragas import SingleTurnSample
from ragas.metrics import LLMContextPrecisionWithReference

context_precision = LLMContextPrecisionWithReference()
context_precision = LLMContextPrecisionWithReference(llm=evaluator_llm)

sample = SingleTurnSample(
user_input="Where is the Eiffel Tower located?",
Expand All @@ -57,6 +61,10 @@ sample = SingleTurnSample(

await context_precision.single_turn_ascore(sample)
```
Output
```
0.9999999999
```

## Non LLM Based Context Precision

@@ -80,4 +88,8 @@ sample = SingleTurnSample(
)

await context_precision.single_turn_ascore(sample)
```
Output
```
0.9999999999
```
12 changes: 10 additions & 2 deletions docs/concepts/metrics/available_metrics/context_recall.md
@@ -13,7 +13,7 @@ In short, recall is about not missing anything important. Since it is about not
The formula for calculating context recall is as follows:

$$
\text{context recall} = {|\text{GT claims that can be attributed to context}| \over |\text{Number of claims in GT}|}
$$

$$
\text{Context Recall} = \frac{\text{Number of claims in the reference supported by the retrieved context}}{\text{Total number of claims in the reference}}
$$

### Example
@@ -29,9 +29,13 @@ sample = SingleTurnSample(
retrieved_contexts=["Paris is the capital of France."],
)

context_recall = LLMContextRecall()
context_recall = LLMContextRecall(llm=evaluator_llm)
await context_recall.single_turn_ascore(sample)

```
Output
```
1.0
```

## Non LLM Based Context Recall
@@ -61,4 +65,8 @@ context_recall = NonLLMContextRecall()
await context_recall.single_turn_ascore(sample)


```
Output
```
0.5
```
17 changes: 14 additions & 3 deletions docs/concepts/metrics/available_metrics/factual_correctness.md
@@ -42,15 +42,22 @@ sample = SingleTurnSample(
reference="The Eiffel Tower is located in Paris. I has a height of 1000ft."
)

scorer = FactualCorrectness()
scorer.llm = openai_model
scorer = FactualCorrectness(llm = evaluator_llm)
await scorer.single_turn_ascore(sample)
```
Output
```
0.67
```

By default, the mode is set to `F1`. You can change it to `precision` or `recall` by setting the `mode` parameter.

```python
scorer = FactualCorrectness(mode="precision")
scorer = FactualCorrectness(llm = evaluator_llm, mode="precision")
```
Output
```
1.0
```
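For intuition, the sketch below reproduces the scores shown above from claim verdicts. The decomposition into claims is an informal reading of the Eiffel Tower example (in ragas the claims are generated and verified by the evaluator LLM), so treat it as an illustration of the three modes rather than the exact implementation.

```python
# Claim verdicts for the Eiffel Tower example above (informal decomposition):
#   response claim  "The Eiffel Tower is located in Paris."  -> supported by the reference (TP)
#   reference claim "It has a height of 1000ft."             -> missing from the response (FN)
tp, fp, fn = 1, 0, 1

precision = tp / (tp + fp)                           # 1.0   (mode="precision")
recall = tp / (tp + fn)                              # 0.5   (mode="recall")
f1 = 2 * precision * recall / (precision + recall)   # ≈0.67 (default mode="F1")

print(round(precision, 2), round(recall, 2), round(f1, 2))
```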

### Controlling the Number of Claims
@@ -63,6 +70,10 @@ Each sentence in the response and reference can be broken down into one or more
```python
scorer = FactualCorrectness(mode="precision", atomicity="low")
```
Output
```
1.0
```


#### Understanding Atomicity and Coverage
19 changes: 14 additions & 5 deletions docs/concepts/metrics/available_metrics/faithfulness.md
@@ -1,11 +1,16 @@
## Faithfulness

The `Faithfulness` metric measures the factual consistency of the generated answer against the given context. It is calculated from the answer and the retrieved context. The score is scaled to the (0,1) range; higher is better.
The **Faithfulness** metric measures how factually consistent a `response` is with the `retrieved context`. It ranges from 0 to 1, with higher scores indicating better consistency.

The generated answer is regarded as faithful if all the claims made in the answer can be inferred from the given context. To calculate this, a set of claims from the generated answer is first identified. Then each of these claims is cross-checked with the given context to determine if it can be inferred from the context. The faithfulness score is given by:
A response is considered **faithful** if all its claims can be supported by the retrieved context.

To calculate this:
1. Identify all the claims in the response.
2. Check each claim to see if it can be inferred from the retrieved context.
3. Compute the faithfulness score using the formula:

$$
\text{Faithfulness score} = {|\text{Number of claims in the generated answer that can be inferred from given context}| \over |\text{Total number of claims in the generated answer}|}
$$

$$
\text{Faithfulness Score} = \frac{\text{Number of claims in the response supported by the retrieved context}}{\text{Total number of claims in the response}}
$$
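As a small illustration of the three steps, the sketch below computes the score from a list of claim verdicts. The response claims and verdicts are hypothetical; in practice they are produced by the evaluator LLM against the retrieved context.

```python
# Step 1: claims extracted from a hypothetical response about the first Super Bowl.
# Step 2: whether each claim can be inferred from the retrieved context.
verdicts = {
    "The first AFL–NFL World Championship Game was played on January 15, 1967.": True,
    "It was played at the Los Angeles Memorial Coliseum.": True,
    "It was played in New Orleans.": False,  # not supported by the context
}

# Step 3: supported claims / total claims.
faithfulness = sum(verdicts.values()) / len(verdicts)
print(faithfulness)  # 0.666...
```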


@@ -22,9 +27,13 @@ sample = SingleTurnSample(
"The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles."
]
)
scorer = Faithfulness()
scorer = Faithfulness(llm=evaluator_llm)
await scorer.single_turn_ascore(sample)
```
Output
```
1.0
```


## Faithfulness with HHEM-2.1-Open
@@ -43,7 +52,7 @@ sample = SingleTurnSample(
"The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles."
]
)
scorer = FaithfulnesswithHHEM()
scorer = FaithfulnesswithHHEM(llm=evaluator_llm)
await scorer.single_turn_ascore(sample)

```
