How is accuracy on OPUS-100 computed? #117

bminixhofer · 2022-08-27T11:14:53Z

Hi! Thanks for this library.

The OPUS-100 dataset has no notion of documents, so it is not clear to me how accuracy is computed. I tried a naive approach that joins adjacent sentences pairwise:

from datasets import load_dataset
import pysbd

if __name__ == "__main__":
    sentences = [
        sample["de"].strip()
        for sample in load_dataset("opus100", "de-en", split="test")["translation"]
    ]

    correct = 0
    total = 0

    segmenter = pysbd.Segmenter(language="de")

    for sent1, sent2 in zip(sentences, sentences[1:]):
        out = tuple(
            s.strip() for s in segmenter.segment(sent1 + " " + sent2)
        )

        total += 1

        if out == (sent1, sent2):
            correct += 1

    print(f"{correct}/{total} = {correct / total}")

But I get 1011/1999 = 50.6% accuracy, which is not close to the 80.95% accuracy reported in the paper.

Thanks for any help!

nipunsadvilkar · 2022-11-17T21:45:35Z

Hey @bminixhofer,

I don't remember exactly, but it was something like this:

from datasets import load_dataset
import pysbd

if __name__ == "__main__":
    sentences = [
        sample["de"].strip()
        for sample in load_dataset("opus100", "de-en", split="test")["translation"]
    ]
    text = " ".join(sentences)
    total = len(sentences)

    segmenter = pysbd.Segmenter(language="de")
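    segments = segmenter.segment(text)
    # Count gold sentences that reappear among the predicted segments
    # (segments are stripped so they can match the stripped gold strings).
    correct = len(set(sentences).intersection(s.strip() for s in segments))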
    print(f"{correct}/{total} = {correct / total}")

Also, note that I didn't use the datasets library; I used the OPUS-100 dataset in raw format, downloaded from the official source: https://opus.nlpl.eu/opus-100.php
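
The two scripts in this thread measure different things, which may account for part of the gap. Here is a minimal sketch that runs both metrics on a tiny hand-made example. `toy_segment` is a hypothetical regex splitter standing in for `pysbd.Segmenter.segment`, and the sample sentences are made up. The joined metric uses a `Counter` rather than a `set` so that repeated sentences are each counted. That choice is an assumption here, not something the thread or the paper confirms.

```python
import re
from collections import Counter


def toy_segment(text):
    """Hypothetical stand-in for a real segmenter: split after ., ! or ? plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def pairwise_accuracy(sentences, segment):
    """Join each adjacent pair; a pair counts only if both sentences come back exactly."""
    pairs = list(zip(sentences, sentences[1:]))
    correct = sum(
        tuple(s.strip() for s in segment(a + " " + b)) == (a, b) for a, b in pairs
    )
    return correct, len(pairs)


def joined_accuracy(sentences, segment):
    """Join all sentences; count gold sentences that reappear among the segments."""
    predicted = Counter(s.strip() for s in segment(" ".join(sentences)))
    gold = Counter(sentences)
    # Multiset intersection keeps duplicates, unlike a plain set intersection.
    correct = sum((gold & predicted).values())
    return correct, len(sentences)


gold = ["Hello there.", "Dr. Smith is here.", "How are you?", "Fine."]
print(pairwise_accuracy(gold, toy_segment))  # (1, 3)
print(joined_accuracy(gold, toy_segment))    # (3, 4)
```

In this example the single badly split sentence ("Dr. Smith is here.") costs one point under the joined metric but spoils both pairs it takes part in under the pairwise metric, so the pairwise score is lower for the same segmenter output.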
