How is accuracy on OPUS-100 computed? #117

Open
bminixhofer opened this issue Aug 27, 2022 · 1 comment

bminixhofer commented Aug 27, 2022

Hi! Thanks for this library.

Since there is no notion of documents in the OPUS-100 dataset it is not clear to me how accuracy is computed. I tried a naive approach using pairwise joining of sentences:

from datasets import load_dataset
import pysbd

if __name__ == "__main__":
    sentences = [
        sample["de"].strip()
        for sample in load_dataset("opus100", "de-en", split="test")["translation"]
    ]

    correct = 0
    total = 0

    segmenter = pysbd.Segmenter(language="de")

    for sent1, sent2 in zip(sentences, sentences[1:]):
        out = tuple(
            s.strip() for s in segmenter.segment(sent1 + " " + sent2)
        )

        total += 1

        if out == (sent1, sent2):
            correct += 1

    print(f"{correct}/{total} = {correct / total}")

But I get 1011/1999 = 50.6% accuracy, which is far from the 80.95% accuracy reported in the paper.

Thanks for any help!
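
The pairwise check above counts a pair as correct only when the segmenter returns exactly the two original sentences, so a spurious split inside either sentence, or a missed boundary between them, fails the whole pair. Because every sentence takes part in two pairs, one hard sentence can cost two points, which may be part of why this number comes out lower than a per-sentence score. Below is a minimal sketch of that metric as a reusable function. The `naive_split` regex splitter and the sample sentences are illustrative assumptions standing in for pysbd and OPUS-100; they are not part of either.

```python
import re
from typing import Callable, List, Sequence


def naive_split(text: str) -> List[str]:
    """Toy stand-in for a real segmenter: split after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def pairwise_accuracy(
    sentences: Sequence[str], segment: Callable[[str], List[str]]
) -> float:
    """Fraction of adjacent sentence pairs recovered exactly as two segments."""
    pairs = list(zip(sentences, sentences[1:]))
    correct = 0
    for sent1, sent2 in pairs:
        out = tuple(s.strip() for s in segment(sent1 + " " + sent2))
        if out == (sent1, sent2):
            correct += 1
    return correct / len(pairs) if pairs else 0.0


# Hypothetical sample: the abbreviation "Dr." trips the toy splitter,
# and that one sentence appears in both pairs.
gold = ["Das ist ein Test.", "Dr. Meier kommt morgen.", "Alles gut?"]
print(pairwise_accuracy(gold, naive_split))
```

To try this with the real library, `pysbd.Segmenter(language="de").segment` can be passed as the `segment` argument.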

nipunsadvilkar (Owner) commented Nov 17, 2022

Hey @bminixhofer,

I don't remember exactly, but it was something like this:

from datasets import load_dataset
import pysbd

if __name__ == "__main__":
    sentences = [
        sample["de"].strip()
        for sample in load_dataset("opus100", "de-en", split="test")["translation"]
    ]
    text = " ".join(sentences)
    total = len(sentences)

    segmenter = pysbd.Segmenter(language="de")
    segments = segmenter.segment(text)
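    # Count the gold sentences that come back as segments
    # (segments are stripped because pysbd can keep trailing whitespace).
    correct = len(set(sentences).intersection(s.strip() for s in segments))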
    print(f"{correct}/{total} = {correct / total}")

Also, note that I didn't use the datasets library. I used the OPUS-100 data in raw format, downloaded from the official source: https://opus.nlpl.eu/opus-100.php
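
For comparison with the pairwise approach, here is a minimal sketch of the joined-text evaluation described in this reply, written so it runs without downloading anything. Several things are assumptions rather than the author's exact method: the `naive_split` regex splitter is a toy stand-in for pysbd, `read_sentences` assumes the raw download is a plain-text file with one sentence per line, and matches are counted with a `Counter` (multiset) instead of a set so that repeated sentences are not collapsed while `total` still counts them.

```python
import re
from collections import Counter
from pathlib import Path
from typing import Callable, List, Sequence


def read_sentences(path: str) -> List[str]:
    """Read one sentence per line (assumed file layout), skipping empty lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def naive_split(text: str) -> List[str]:
    """Toy stand-in for a real segmenter: split after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def joined_text_accuracy(
    sentences: Sequence[str], segment: Callable[[str], List[str]]
) -> float:
    """Join all sentences, segment once, and count gold sentences recovered verbatim."""
    text = " ".join(sentences)
    predicted = Counter(s.strip() for s in segment(text))
    gold_counts = Counter(sentences)
    # Multiset intersection: each gold sentence is matched at most as often
    # as it appears among the predicted segments.
    correct = sum((gold_counts & predicted).values())
    return correct / len(sentences) if sentences else 0.0


# Hypothetical sample sentences, not taken from OPUS-100.
gold = ["Das ist ein Test.", "Dr. Meier kommt morgen.", "Alles gut?"]
print(joined_text_accuracy(gold, naive_split))
```

On this sample the sentence containing "Dr." is the only one lost, whereas the pairwise metric would fail every pair that includes it, so the two procedures are not expected to give the same number on the same data.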
