Hello,
I have been analyzing your results, and maybe I missed something important, but when you take into account only the sentences that do NOT change [1], you get the following graph:
In other words, leaving a sentence unchanged still helps HT score better. This can also be visualized in another way: if you take only the scores for sentences that didn't change and compare how the ranking changes between BTS and ATS, you get this distribution:
This shows that the ranking of MT vs HT changes, in one direction or the other, for almost half of the sentences (only 389 sentences for MT_Y and 440 sentences for MT_Z stay consistent), even though the sentences themselves didn't change. This indicates low inter-annotator agreement, so the claims in the paper cannot be concluded from these results. Or am I missing something?
[1] By changing two lines at https://github.com/ahrii-kim/suboptimal_test_set/blob/master/evaluation/score.py#L117 to "F".
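
For reference, the ranking comparison described above could be sketched roughly as follows. This is only a minimal illustration, not the code from the repository: the row layout (`changed`, `mt_bts`, `ht_bts`, `mt_ats`, `ht_ats`) and the function names are hypothetical, and the toy scores are made up purely to show the counting.

```python
from collections import Counter


def ranking(mt_score, ht_score):
    """Return which side is ranked higher for one sentence."""
    if mt_score > ht_score:
        return "MT"
    if mt_score < ht_score:
        return "HT"
    return "tie"


def ranking_consistency(rows):
    """Count (BTS ranking, ATS ranking) pairs over unchanged sentences.

    Each row is a dict with hypothetical keys:
    changed, mt_bts, ht_bts, mt_ats, ht_ats.
    """
    transitions = Counter()
    for row in rows:
        if row["changed"]:
            continue  # keep only sentences that did not change
        before = ranking(row["mt_bts"], row["ht_bts"])
        after = ranking(row["mt_ats"], row["ht_ats"])
        transitions[(before, after)] += 1
    # A sentence is consistent when MT vs HT is ranked the same way both times.
    consistent = sum(n for (b, a), n in transitions.items() if b == a)
    return transitions, consistent


# Made-up toy data, only to illustrate the counting.
toy_rows = [
    {"changed": False, "mt_bts": 80, "ht_bts": 70, "mt_ats": 60, "ht_ats": 75},
    {"changed": False, "mt_bts": 50, "ht_bts": 90, "mt_ats": 55, "ht_ats": 85},
    {"changed": True, "mt_bts": 40, "ht_bts": 60, "mt_ats": 65, "ht_ats": 45},
    {"changed": False, "mt_bts": 70, "ht_bts": 70, "mt_ats": 72, "ht_ats": 71},
]

transitions, consistent = ranking_consistency(toy_rows)
total_unchanged = sum(transitions.values())
print(f"{consistent} of {total_unchanged} unchanged sentences keep the same ranking")
```

If the scores for an identical sentence were reliable, nearly all of the mass in `transitions` would sit on pairs like `("MT", "MT")` or `("HT", "HT")`; a large share of flipped pairs is what I mean by low agreement.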