docs: updated the formula for some metrics (#1834)
sahusiddharth authored Jan 11, 2025
1 parent 91393e6 commit f265cf5
Showing 13 changed files with 168 additions and 56 deletions.
37 changes: 27 additions & 10 deletions docs/concepts/metrics/available_metrics/agents.md
@@ -52,17 +52,24 @@ AIMessage(content="I found a great recipe for chocolate cake! Would you like the


sample = MultiTurnSample(user_input=sample_input_4, reference_topics=["science"])
scorer = TopicAdherenceScore(mode="precision")
scorer.llm = openai_model
scorer = TopicAdherenceScore(llm = evaluator_llm, mode="precision")
await scorer.multi_turn_ascore(sample)
```
Output
```
0.6666666666444444
```


To change the mode to recall, set the `mode` parameter to `recall`.

```python
scorer = TopicAdherenceScore(mode="recall")
scorer = TopicAdherenceScore(llm = evaluator_llm, mode="recall")
```
Output
```
0.99999999995
```



@@ -96,10 +103,13 @@ sample = MultiTurnSample(
]
)

scorer = ToolCallAccuracy()
scorer.llm = your_llm
scorer = ToolCallAccuracy(llm = evaluator_llm)
await scorer.multi_turn_ascore(sample)
```
Output
```
1.0
```

The tool call sequence specified in `reference_tool_calls` is used as the ideal outcome. If the tool calls made by the AI do not match the order or sequence of the `reference_tool_calls`, the metric will return a score of 0. This helps to ensure that the AI is able to identify and call the required tools in the correct order to complete a given task.
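For intuition, here is a minimal sketch of this all-or-nothing sequence check in plain Python. It is not the ragas implementation; the `SimpleToolCall` class, the tool names, and the arguments below are invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class SimpleToolCall:
    name: str
    args: dict


def sequence_accuracy(predicted: list, reference: list) -> float:
    """Return 1.0 only if every tool call matches the reference in order,
    name, and arguments; otherwise return 0.0, mirroring the behaviour
    described above."""
    if len(predicted) != len(reference):
        return 0.0
    for pred, ref in zip(predicted, reference):
        if pred.name != ref.name or pred.args != ref.args:
            return 0.0
    return 1.0


# Hypothetical example: the two calls are correct individually,
# but their order differs from the reference, so the score is 0.
reference = [
    SimpleToolCall("weather_check", {"location": "New York"}),
    SimpleToolCall("restaurant_search", {"location": "New York"}),
]
predicted = [
    SimpleToolCall("restaurant_search", {"location": "New York"}),
    SimpleToolCall("weather_check", {"location": "New York"}),
]
print(sequence_accuracy(predicted, reference))  # 0.0
```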

@@ -109,7 +119,7 @@ By default the tool names and arguments are compared using exact string matching
from ragas.metrics._string import NonLLMStringSimilarity
from ragas.metrics._tool_call_accuracy import ToolCallAccuracy

metric = ToolCallAccuracy()
metric = ToolCallAccuracy(llm = evaluator_llm)
metric.arg_comparison_metric = NonLLMStringSimilarity()
```

@@ -146,10 +156,13 @@ sample = MultiTurnSample(user_input=[
],
reference="Table booked at one of the chinese restaurants at 8 pm")

scorer = AgentGoalAccuracyWithReference()
scorer.llm = your_llm
scorer = AgentGoalAccuracyWithReference(llm = evaluator_llm)
await scorer.multi_turn_ascore(sample)

```
Output
```
1.0
```

### Without reference
@@ -181,7 +194,11 @@ sample = MultiTurnSample(user_input=[
HumanMessage(content="thanks"),
])

scorer = AgentGoalAccuracyWithoutReference()
await metric.multi_turn_ascore(sample)
scorer = AgentGoalAccuracyWithoutReference(llm = evaluator_llm)
await scorer.multi_turn_ascore(sample)

```
Output
```
1.0
```
36 changes: 21 additions & 15 deletions docs/concepts/metrics/available_metrics/answer_relevance.md
@@ -1,27 +1,29 @@
## Response Relevancy

`ResponseRelevancy` metric focuses on assessing how pertinent the generated answer is to the given prompt. A lower score is assigned to answers that are incomplete or contain redundant information, and higher scores indicate better relevancy. This metric is computed using the `user_input`, the `retrieved_contexts` and the `response`.

The Answer Relevancy is defined as the mean cosine similarity of the original `user_input` to a number of artificial questions, which were generated (reverse-engineered) based on the `response`:

$$
\text{answer relevancy} = \frac{1}{N} \sum_{i=1}^{N} \cos(E_{g_i}, E_o)
$$

$$
\text{answer relevancy} = \frac{1}{N} \sum_{i=1}^{N} \frac{E_{g_i} \cdot E_o}{\|E_{g_i}\|\|E_o\|}
$$

Where:

* $E_{g_i}$ is the embedding of the generated question $i$.
* $E_o$ is the embedding of the original question.
* $N$ is the number of generated questions, which is 3 by default.

Please note that even though in practice the score will range between 0 and 1 most of the time, this is not mathematically guaranteed, due to the nature of the cosine similarity ranging from -1 to 1.

An answer is deemed relevant when it directly and appropriately addresses the original question. Importantly, our assessment of answer relevance does not consider factuality but instead penalizes cases where the answer lacks completeness or contains redundant details. To calculate this score, the LLM is prompted to generate an appropriate question for the generated answer multiple times, and the mean cosine similarity between these generated questions and the original question is measured. The underlying idea is that if the generated answer accurately addresses the initial question, the LLM should be able to generate questions from the answer that align with the original question.

The `ResponseRelevancy` metric measures how relevant a response is to the user input. Higher scores indicate better alignment with the user input, while lower scores are given if the response is incomplete or includes redundant information.

This metric is calculated using the `user_input` and the `response` as follows:

1. Generate a set of artificial questions (default is 3) based on the response. These questions are designed to reflect the content of the response.
2. Compute the cosine similarity between the embedding of the user input ($E_o$) and the embedding of each generated question ($E_{g_i}$).
3. Take the average of these cosine similarity scores to get the **Answer Relevancy**:

$$
\text{Answer Relevancy} = \frac{1}{N} \sum_{i=1}^{N} \text{cosine similarity}(E_{g_i}, E_o)
$$

$$
\text{Answer Relevancy} = \frac{1}{N} \sum_{i=1}^{N} \frac{E_{g_i} \cdot E_o}{\|E_{g_i}\| \|E_o\|}
$$

Where:

- $E_{g_i}$: Embedding of the $i^{th}$ generated question.
- $E_o$: Embedding of the user input.
- $N$: Number of generated questions (default is 3).

**Note**: While the score usually falls between 0 and 1, it is not guaranteed due to cosine similarity's mathematical range of -1 to 1.

An answer is considered relevant if it directly and appropriately addresses the original question. This metric focuses on how well the answer matches the intent of the question, without evaluating factual accuracy. It penalizes answers that are incomplete or include unnecessary details.
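For intuition, the averaging step can be sketched with plain NumPy. The embedding vectors and variable names below are made up for illustration; in practice the embeddings come from the configured embedding model.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cos(E_g, E_o) = (E_g . E_o) / (||E_g|| * ||E_o||)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Embedding of the user input (E_o) and of N = 3 generated questions (E_g_i),
# with hypothetical values purely for illustration.
E_o = np.array([0.10, 0.30, 0.50])
E_g = [
    np.array([0.12, 0.28, 0.55]),
    np.array([0.09, 0.31, 0.47]),
    np.array([0.20, 0.10, 0.40]),
]

# Answer Relevancy = mean cosine similarity between E_o and each generated question.
answer_relevancy = np.mean([cosine_similarity(e, E_o) for e in E_g])
print(answer_relevancy)
```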

### Example

@@ -37,9 +39,13 @@ sample = SingleTurnSample(
]
)

scorer = ResponseRelevancy()
scorer = ResponseRelevancy(llm=evaluator_llm, embeddings=evaluator_embedding)
await scorer.single_turn_ascore(sample)
```
Output
```
0.9165088378587264
```

### How It’s Calculated

1 change: 1 addition & 0 deletions docs/concepts/metrics/available_metrics/aspect_critic.md
@@ -38,6 +38,7 @@ scorer.llm = openai_model
await scorer.single_turn_ascore(sample)
```


## Calculation

Critics are essentially basic LLM calls using the defined criteria. For example, let's see how the harmfulness critic works:
33 changes: 21 additions & 12 deletions docs/concepts/metrics/available_metrics/context_entities_recall.md
@@ -2,10 +2,15 @@

The `ContextEntityRecall` metric gives the measure of recall of the retrieved context, based on the number of entities present in both `reference` and `retrieved_contexts` relative to the number of entities present in the `reference` alone. Simply put, it measures what fraction of entities from the `reference` are recalled in the `retrieved_contexts`. This metric is useful in fact-based use cases like tourism help desks, historical QA, etc. It can help evaluate the retrieval mechanism for entities by comparing against the entities present in the `reference`, because in cases where entities matter, we need `retrieved_contexts` that cover them.

To compute this metric, we use two sets, $GE$ and $CE$, the set of entities present in `reference` and the set of entities present in `retrieved_contexts` respectively. We then take the number of elements in the intersection of these sets and divide it by the number of elements present in $GE$, given by the formula:
To compute this metric, we use two sets:

- **$RE$**: The set of entities in the reference.
- **$RCE$**: The set of entities in the retrieved contexts.

We calculate the number of entities common to both sets ($RCE \cap RE$) and divide it by the total number of entities in the reference ($RE$). The formula is:

$$
\text{context entity recall} = \frac{| CE \cap GE |}{| GE |}
$$

$$
\text{Context Entity Recall} = \frac{\text{Number of common entities between $RCE$ and $RE$}}{\text{Total number of entities in $RE$}}
$$


@@ -20,10 +25,14 @@ sample = SingleTurnSample(
retrieved_contexts=["The Eiffel Tower is located in Paris."],
)

scorer = ContextEntityRecall()
scorer = ContextEntityRecall(llm=evaluator_llm)

await scorer.single_turn_ascore(sample)
```
Output
```
0.999999995
```

### How It’s Calculated

@@ -34,25 +43,25 @@ await scorer.single_turn_ascore(sample)
**High entity recall context**: The Taj Mahal is a symbol of love and architectural marvel located in Agra, India. It was built by the Mughal emperor Shah Jahan in memory of his beloved wife, Mumtaz Mahal. The structure is renowned for its intricate marble work and beautiful gardens surrounding it.
**Low entity recall context**: The Taj Mahal is an iconic monument in India. It is a UNESCO World Heritage Site and attracts millions of visitors annually. The intricate carvings and stunning architecture make it a must-visit destination.

Let us consider the ground truth and the contexts given above.
Let us consider the reference and the retrieved contexts given above.

- **Step-1**: Find entities present in the ground truths.
    - Entities in ground truth (GE) - ['Taj Mahal', 'Yamuna', 'Agra', '1631', 'Shah Jahan', 'Mumtaz Mahal']
- **Step-2**: Find entities present in the context.
    - Entities in context (CE1) - ['Taj Mahal', 'Agra', 'Shah Jahan', 'Mumtaz Mahal', 'India']
    - Entities in context (CE2) - ['Taj Mahal', 'UNESCO', 'India']
- **Step-1**: Find entities present in the reference.
    - Entities in reference (RE) - ['Taj Mahal', 'Yamuna', 'Agra', '1631', 'Shah Jahan', 'Mumtaz Mahal']
- **Step-2**: Find entities present in the retrieved contexts.
    - Entities in context (RCE1) - ['Taj Mahal', 'Agra', 'Shah Jahan', 'Mumtaz Mahal', 'India']
    - Entities in context (RCE2) - ['Taj Mahal', 'UNESCO', 'India']
- **Step-3**: Use the formula given above to calculate entity recall.

$$
\text{context entity recall 1} = \frac{| CE1 \cap GE |}{| GE |}
$$

$$
\text{context entity recall 1} = \frac{| RCE1 \cap RE |}{| RE |}
= 4/6
= 0.666
$$

$$
\text{context entity recall 2} = \frac{| CE2 \cap GE |}{| GE |}
$$

$$
\text{context entity recall 2} = \frac{| RCE2 \cap RE |}{| RE |}
= 1/6
$$

We can see that the first context had a high entity recall, because it has a better entity coverage given the ground truth. If these two contexts were fetched by two retrieval mechanisms on same set of documents, we could say that the first mechanism was better than the other in use-cases where entities are of importance.
We can see that the first context had a high entity recall, because it has better entity coverage given the reference. If these two retrieved contexts were fetched by two retrieval mechanisms on the same set of documents, we could say that the first mechanism was better than the other in use cases where entities are of importance.
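The arithmetic above can be reproduced with ordinary Python sets. This is only a sketch of the formula using the entities listed in the example; the actual metric extracts the entities with an LLM before applying it.

```python
# Entities from the reference and the two retrieved contexts in the example above.
RE = {"Taj Mahal", "Yamuna", "Agra", "1631", "Shah Jahan", "Mumtaz Mahal"}
RCE1 = {"Taj Mahal", "Agra", "Shah Jahan", "Mumtaz Mahal", "India"}
RCE2 = {"Taj Mahal", "UNESCO", "India"}


def context_entity_recall(rce: set, re_: set) -> float:
    # |RCE ∩ RE| / |RE|
    return len(rce & re_) / len(re_)


print(context_entity_recall(RCE1, RE))  # 4/6 ≈ 0.666
print(context_entity_recall(RCE2, RE))  # 1/6 ≈ 0.166
```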

16 changes: 14 additions & 2 deletions docs/concepts/metrics/available_metrics/context_precision.md
@@ -25,7 +25,7 @@ The following metrics use an LLM to identify if a retrieved context is relevant or
from ragas import SingleTurnSample
from ragas.metrics import LLMContextPrecisionWithoutReference

context_precision = LLMContextPrecisionWithoutReference()
context_precision = LLMContextPrecisionWithoutReference(llm=evaluator_llm)

sample = SingleTurnSample(
user_input="Where is the Eiffel Tower located?",
@@ -36,6 +36,10 @@

await context_precision.single_turn_ascore(sample)
```
Output
```
0.9999999999
```

### Context Precision with reference

@@ -47,7 +51,7 @@ await context_precision.single_turn_ascore(sample)
from ragas import SingleTurnSample
from ragas.metrics import LLMContextPrecisionWithReference

context_precision = LLMContextPrecisionWithReference()
context_precision = LLMContextPrecisionWithReference(llm=evaluator_llm)

sample = SingleTurnSample(
user_input="Where is the Eiffel Tower located?",
Expand All @@ -57,6 +61,10 @@ sample = SingleTurnSample(

await context_precision.single_turn_ascore(sample)
```
Output
```
0.9999999999
```

## Non LLM Based Context Precision

@@ -80,4 +88,8 @@ sample = SingleTurnSample(
)

await context_precision.single_turn_ascore(sample)
```
Output
```
0.9999999999
```
12 changes: 10 additions & 2 deletions docs/concepts/metrics/available_metrics/context_recall.md
@@ -13,7 +13,7 @@ In short, recall is about not missing anything important. Since it is about not
The formula for calculating context recall is as follows:

$$
\text{context recall} = {|\text{GT claims that can be attributed to context}| \over |\text{Number of claims in GT}|}
$$

$$
\text{Context Recall} = \frac{\text{Number of claims in the reference supported by the retrieved context}}{\text{Total number of claims in the reference}}
$$

### Example
@@ -29,9 +29,13 @@ sample = SingleTurnSample(
retrieved_contexts=["Paris is the capital of France."],
)

context_recall = LLMContextRecall()
context_recall = LLMContextRecall(llm=evaluator_llm)
await context_recall.single_turn_ascore(sample)

```
Output
```
1.0
```

## Non LLM Based Context Recall
@@ -61,4 +65,8 @@ context_recall = NonLLMContextRecall()
await context_recall.single_turn_ascore(sample)


```
Output
```
0.5
```
17 changes: 14 additions & 3 deletions docs/concepts/metrics/available_metrics/factual_correctness.md
@@ -42,15 +42,22 @@ sample = SingleTurnSample(
reference="The Eiffel Tower is located in Paris. I has a height of 1000ft."
)

scorer = FactualCorrectness()
scorer.llm = openai_model
scorer = FactualCorrectness(llm = evaluator_llm)
await scorer.single_turn_ascore(sample)
```
Output
```
0.67
```

By default, the mode is set to `F1`. You can change it to `precision` or `recall` by setting the `mode` parameter.

```python
scorer = FactualCorrectness(mode="precision")
scorer = FactualCorrectness(llm = evaluator_llm, mode="precision")
```
Output
```
1.0
```
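For intuition, the sketch below reproduces the scores shown above from claim verdicts. The decomposition into claims is an informal reading of the Eiffel Tower example (in ragas the claims are generated and verified by the evaluator LLM), so treat it as an illustration of the three modes rather than the exact implementation.

```python
# Claim verdicts for the Eiffel Tower example above (informal decomposition):
#   response claim  "The Eiffel Tower is located in Paris."  -> supported by the reference (TP)
#   reference claim "It has a height of 1000ft."             -> missing from the response (FN)
tp, fp, fn = 1, 0, 1

precision = tp / (tp + fp)                           # 1.0   (mode="precision")
recall = tp / (tp + fn)                              # 0.5   (mode="recall")
f1 = 2 * precision * recall / (precision + recall)   # ≈0.67 (default mode="F1")

print(round(precision, 2), round(recall, 2), round(f1, 2))
```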

### Controlling the Number of Claims
@@ -63,6 +70,10 @@ Each sentence in the response and reference can be broken down into one or more
```python
scorer = FactualCorrectness(mode="precision", atomicity="low")
```
Output
```
1.0
```


#### Understanding Atomicity and Coverage
19 changes: 14 additions & 5 deletions docs/concepts/metrics/available_metrics/faithfulness.md
@@ -1,11 +1,16 @@
## Faithfulness

The `Faithfulness` metric measures the factual consistency of the generated answer against the given context. It is calculated from the answer and the retrieved context. The score is scaled to the (0,1) range; higher is better.
The **Faithfulness** metric measures how factually consistent a `response` is with the `retrieved context`. It ranges from 0 to 1, with higher scores indicating better consistency.

The generated answer is regarded as faithful if all the claims made in the answer can be inferred from the given context. To calculate this, a set of claims from the generated answer is first identified. Then each of these claims is cross-checked with the given context to determine if it can be inferred from the context. The faithfulness score is given by:
A response is considered **faithful** if all its claims can be supported by the retrieved context.

To calculate this:
1. Identify all the claims in the response.
2. Check each claim to see if it can be inferred from the retrieved context.
3. Compute the faithfulness score using the formula:

$$
\text{Faithfulness score} = {|\text{Number of claims in the generated answer that can be inferred from given context}| \over |\text{Total number of claims in the generated answer}|}
$$

$$
\text{Faithfulness Score} = \frac{\text{Number of claims in the response supported by the retrieved context}}{\text{Total number of claims in the response}}
$$
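As a small illustration of the three steps, the sketch below computes the score from a list of claim verdicts. The response claims and verdicts are hypothetical; in practice they are produced by the evaluator LLM against the retrieved context.

```python
# Step 1: claims extracted from a hypothetical response about the first Super Bowl.
# Step 2: whether each claim can be inferred from the retrieved context.
verdicts = {
    "The first AFL–NFL World Championship Game was played on January 15, 1967.": True,
    "It was played at the Los Angeles Memorial Coliseum.": True,
    "It was played in New Orleans.": False,  # not supported by the context
}

# Step 3: supported claims / total claims.
faithfulness = sum(verdicts.values()) / len(verdicts)
print(faithfulness)  # 0.666...
```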


@@ -22,9 +27,13 @@ sample = SingleTurnSample(
"The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles."
]
)
scorer = Faithfulness()
scorer = Faithfulness(llm=evaluator_llm)
await scorer.single_turn_ascore(sample)
```
Output
```
1.0
```


## Faithfulness with HHEM-2.1-Open
@@ -43,7 +52,7 @@ sample = SingleTurnSample(
"The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles."
]
)
scorer = FaithfulnesswithHHEM()
scorer = FaithfulnesswithHHEM(llm=evaluator_llm)
await scorer.single_turn_ascore(sample)

```
