
Evaluations of NLLB-1.3B Distilled on FLoRes200 are incorrect, and may be duplicates of FLoRes101. #1


shauncassini commented Nov 2, 2023

Hello. First off, thanks for conducting such extensive evaluations on all of these models; I am finding them very useful for checking my own results. However, when looking into your evaluation files, I noticed the following:

nllb-200-distilled-1.3B/flores101-devtest.eng-deu.eval:

chrF2++|nrefs:1|case:mixed|eff:yes|nc:6|nw:2|space:no|version:2.3.1 = 0.59321
BLEU|nrefs:1|case:mixed|eff:no|tok:flores200|smooth:exp|version:2.3.1 = 41.4 68.6/51.0/40.2/32.2 (BP = 0.897 ratio = 0.902 hyp_len = 35747 ref_len = 39633)
BLEU|nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.3.1 = 35.2 67.6/43.8/31.1/22.7 (BP = 0.926 ratio = 0.929 hyp_len = 23307 ref_len = 25094)
COMET+default = 0.5955
chrF2|nrefs:1|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.3.1 = 0.61757

is exactly the same as

nllb-200-distilled-1.3B/flores200-devtest.eng-deu.eval:

chrF2++|nrefs:1|case:mixed|eff:yes|nc:6|nw:2|space:no|version:2.3.1 = 0.59321
BLEU|nrefs:1|case:mixed|eff:no|tok:flores200|smooth:exp|version:2.3.1 = 41.4 68.6/51.0/40.2/32.2 (BP = 0.897 ratio = 0.902 hyp_len = 35747 ref_len = 39633)
BLEU|nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.3.1 = 35.2 67.6/43.8/31.1/22.7 (BP = 0.926 ratio = 0.929 hyp_len = 23307 ref_len = 25094)
COMET+default = 0.5955
chrF2|nrefs:1|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.3.1 = 0.61757

Furthermore, when I run sacrebleu on the model output files for FLoRes200, I get different results. It seems likely that these eval files are duplicates; perhaps the flores101 output was evaluated twice?
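For what it's worth, here is a minimal sketch of how I would re-score one of the FLoRes200 output files with sacrebleu's Python API, using the same settings as the signatures above (tok:flores200 for BLEU, nc:6 / nw:2 for chrF2++). The file paths and variable names are placeholders, not the actual file names in this repository:

```python
# Re-scoring sketch, assuming sacrebleu 2.3.x and placeholder file names.
# Metric settings follow the signatures in the eval files above:
#   BLEU with tok:flores200, chrF2++ with nc:6 / nw:2.
import sacrebleu

with open("flores200-devtest.eng-deu.output", encoding="utf-8") as f:
    hypotheses = [line.rstrip("\n") for line in f]
with open("flores200-devtest.eng-deu.ref", encoding="utf-8") as f:
    references = [line.rstrip("\n") for line in f]

# BLEU with the flores200 SPM tokenizer
# (the flores tokenizers require the sentencepiece package to be installed).
bleu = sacrebleu.corpus_bleu(hypotheses, [references], tokenize="flores200")

# chrF2++: character n-gram order 6, word n-gram order 2.
chrf = sacrebleu.corpus_chrf(hypotheses, [references], word_order=2)

print(bleu)  # prints the formatted BLEU score line
print(chrf)  # prints the formatted chrF2++ score line
```

Running this separately on the flores101-devtest and flores200-devtest outputs should make it clear whether the same output file was scored twice.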
