
Commit 27c8277
feat: replaced NLTK's implementation of BLEU with sacrebleu's implementation (#1744)

NLTK's implementation of BLEU is limited. In particular, only 20% of my attempts to compute BLEU return a score, because NLTK's implementation requires the candidate and the reference to contain the same number of sentences. The implementation of BLEU by [sacrebleu](https://github.com/mjpost/sacrebleu) is recommended because it is more robust. In my PR, I have modified `_bleu_score.py` to use sacrebleu's implementation.
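
A minimal sketch of the kind of sacrebleu call this PR adopts; the sentences below are invented for illustration and are not part of the commit:

```python
# Minimal sketch of sacrebleu's corpus-level BLEU (illustrative data only).
from sacrebleu import corpus_bleu

hypotheses = ["The cat sat on the mat.", "It rained all night."]        # candidate sentences
references = [["The cat is on the mat.", "It was raining all night."]]  # one aligned reference stream

result = corpus_bleu(hypotheses, references)
print(result.score)  # corpus-level BLEU, reported by sacrebleu on a 0-100 scale
```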
trent-sp authored Dec 12, 2024
1 parent 45c9bcb commit 27c8277
Showing 3 changed files with 9 additions and 17 deletions.
docs/concepts/metrics/available_metrics/traditional.md (10 changes: 2 additions & 8 deletions)
````diff
@@ -29,7 +29,7 @@ scorer = NonLLMStringSimilarity(distance_measure=DistanceMeasure.HAMMING)
 
 ## BLEU Score
 
-The `BleuScore` score is a metric used to evaluate the quality of `response` by comparing it with `reference`. It measures the similarity between the response and the reference based on n-gram precision and brevity penalty. BLEU score was originally designed to evaluate machine translation systems, but it is also used in other natural language processing tasks. Since it was designed to evaluate machine translation systems, it expects the response and reference to contain same number of sentences. The comparison is done at sentence level. BLEU score ranges from 0 to 1, where 1 indicates a perfect match between the response and the reference. This is a non LLM based metric.
+The `BleuScore` score is a metric used to evaluate the quality of `response` by comparing it with `reference`. It measures the similarity between the response and the reference based on n-gram precision and brevity penalty. BLEU score was originally designed to evaluate machine translation systems, but it is also used in other natural language processing tasks. BLEU score ranges from 0 to 1, where 1 indicates a perfect match between the response and the reference. This is a non LLM based metric.
 
 ### Example
 ```python
@@ -44,12 +44,6 @@ sample = SingleTurnSample(
 scorer = BleuScore()
 await scorer.single_turn_ascore(sample)
 ```
-Custom weights may be supplied to fine-tune the BLEU score further. A tuple of float weights for unigrams, bigrams, trigrams and so on can be given by
-
-```python
-scorer = BleuScore(weights=(0.25, 0.25, 0.25, 0.25))
-```
-
 
 
 ## ROUGE Score
@@ -110,4 +104,4 @@ sample = SingleTurnSample(
 )
 scorer = StringPresence()
 await scorer.single_turn_ascore(sample)
-```
+```
````
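
For completeness, a runnable sketch of the documented usage after this change — a hedged example that assumes ragas (with sacrebleu) is installed and uses made-up sample strings:

```python
# Hedged end-to-end sketch of the documented BleuScore usage
# (assumes ragas with sacrebleu installed; sample strings are made up).
import asyncio

from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import BleuScore


async def main() -> None:
    sample = SingleTurnSample(
        response="The Eiffel Tower is located in Paris, France.",
        reference="The Eiffel Tower stands in Paris, France.",
    )
    scorer = BleuScore()
    score = await scorer.single_turn_ascore(sample)
    print(score)  # a float in [0, 1], per the docs above


asyncio.run(main())
```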
requirements/dev.txt (3 changes: 2 additions & 1 deletion)
````diff
@@ -11,7 +11,8 @@ transformers
 fastembed
 graphene
 rouge_score
+sacrebleu
 nltk
 rapidfuzz
 pandas
-datacompy
+datacompy
````
src/ragas/metrics/_bleu_score.py (13 changes: 5 additions & 8 deletions)
````diff
@@ -15,21 +15,18 @@ class BleuScore(SingleTurnMetric):
     _required_columns: t.Dict[MetricType, t.Set[str]] = field(
         default_factory=lambda: {MetricType.SINGLE_TURN: {"reference", "response"}}
     )
-    weights: t.Tuple[float, ...] = (0.25, 0.25, 0.25, 0.25)
     sentence_segmenter: t.Optional[HasSegmentMethod] = None
     language: str = "english"
 
     def __post_init__(self):
         try:
-            from nltk.tokenize import word_tokenize
-            from nltk.translate.bleu_score import corpus_bleu
+            from sacrebleu import corpus_bleu
         except ImportError:
             raise ImportError(
-                "nltk is required for bleu score. Please install it using `pip install nltk`"
+                "sacrebleu is required for bleu score. Please install it using `pip install sacrebleu`"
             )
         if not self.sentence_segmenter:
             self.sentence_segmenter = get_segmenter(language=self.language, clean=False)
-        self.word_tokenizer = word_tokenize
         self.corpus_bleu = corpus_bleu
 
     def init(self, run_config: RunConfig):
@@ -46,10 +43,10 @@ async def _single_turn_ascore(
         response_sentences = self.sentence_segmenter.segment(sample.response)
 
         reference = [
-            [self.word_tokenizer(reference)] for reference in reference_sentences
+            [reference] for reference in reference_sentences
         ]
-        response = [self.word_tokenizer(response) for response in response_sentences]
-        score = self.corpus_bleu(reference, response, weights=self.weights)
+        response = response_sentences
+        score = self.corpus_bleu(response, reference).score / 100
         assert isinstance(score, float), "Expecting a float"
         return score
````

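A note on the `.score / 100` in the new `_single_turn_ascore`: NLTK's `corpus_bleu` returns a value in the 0-1 range, while sacrebleu reports BLEU as a percentage, so the division keeps the metric within the 0-1 range described in the docs. A small hedged check (the sentence is made up):

```python
# Hedged sanity check of the scale conversion performed by `.score / 100`.
from sacrebleu import corpus_bleu

raw = corpus_bleu(
    ["the quick brown fox jumps over the lazy dog"],    # hypothetical hypothesis
    [["the quick brown fox jumps over the lazy dog"]],  # identical single reference
).score

assert 0.0 <= raw <= 100.0      # sacrebleu reports BLEU on a 0-100 scale
assert 0.0 <= raw / 100 <= 1.0  # dividing by 100 restores the 0-1 range used in the docs
```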
