Low inter-annotator agreement? #2

kocmitom · 2021-10-21T21:02:25Z

Hello,

I have been analyzing your results, and maybe I missed something important, but when you take into account only the sentences that do NOT change [1], you get the following graph:

In other words, not changing anything helps HT score better. It can also be visualized in the following way: if you take only the scores for sentences that didn't change and compare how the ranking changes between BTS and ATS, you get this distribution:

This shows that the ranking of MT vs. HT changes in one direction or the other for almost half of the sentences (only 389 sentences for MT_Y and 440 sentences for MT_Z stay consistent), even though the sentences themselves didn't change. This points to low inter-annotator agreement, so I don't think the claims in the paper can be concluded from this data. Or am I missing something?
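
For reference, here is a minimal sketch of the consistency count I mean. It is not the code in `score.py`; the record layout (`changed`, `mt_bts`, `ht_bts`, `mt_ats`, `ht_ats`) and the toy scores are assumptions made up for illustration only.

```python
def ranking(mt_score, ht_score):
    """Return 1 if MT scores higher than HT, -1 if lower, 0 if tied."""
    return (mt_score > ht_score) - (mt_score < ht_score)


def ranking_consistency(records):
    """Count unchanged sentences whose MT-vs-HT ranking is identical in BTS and ATS.

    `records` is an iterable of dicts with the hypothetical keys
    'changed', 'mt_bts', 'ht_bts', 'mt_ats', 'ht_ats'.
    Returns (number of consistent sentences, number of unchanged sentences).
    """
    unchanged = [r for r in records if not r["changed"]]
    consistent = sum(
        ranking(r["mt_bts"], r["ht_bts"]) == ranking(r["mt_ats"], r["ht_ats"])
        for r in unchanged
    )
    return consistent, len(unchanged)


# Toy data (invented, not taken from the repository).
records = [
    # MT ranked above HT in both rounds: consistent
    {"changed": False, "mt_bts": 80, "ht_bts": 70, "mt_ats": 85, "ht_ats": 60},
    # MT above HT in BTS, below HT in ATS: ranking flipped
    {"changed": False, "mt_bts": 80, "ht_bts": 70, "mt_ats": 50, "ht_ats": 90},
    # Sentence was edited, so it is excluded from the count
    {"changed": True, "mt_bts": 40, "ht_bts": 90, "mt_ats": 90, "ht_ats": 40},
    # Tied in both rounds: counted as consistent
    {"changed": False, "mt_bts": 60, "ht_bts": 60, "mt_ats": 60, "ht_ats": 60},
]

consistent, total = ranking_consistency(records)
print(f"{consistent} of {total} unchanged sentences keep the same MT-vs-HT ranking")
```

If the annotations were reliable, I would expect nearly all unchanged sentences to land in the consistent bucket, since nothing about the sentence differs between the two rounds.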

[1] By changing two lines at https://github.com/ahrii-kim/suboptimal_test_set/blob/master/evaluation/score.py#L117 to "F".
