Vacillating Human Correlation of SacreBLEU in Unprotected Languages

This repository provides datasets and codes for MT evaluation employed in the given publication, written by Ahrii Kim (김아리) and Jinhyeon Kim (김진현) and submitted at Preprints.org (1st ver.) and HumEval 2022) (final ver.).

Abstract

SacreBLEU, by incorporating a text normalizing step in the pipeline, has become a rising automatic evaluation metric in recent MT studies. With agglutinative languages such as Korean, however, the lexical-level metric cannot provide a conceivable result without a customized pre-tokenization. This paper endeavors to examine the influence of diversified tokenization schemes –word, morpheme, subword, character, and consonants & vowels (CV)– on the metric after its protective layer is peeled off.

By performing meta-evaluation with manually-constructed into-Korean resources, our empirical study demonstrates that the human correlation of the surface-based metric and other homogeneous ones (as an extension) vacillates greatly by the token type. Moreover, the human correlation of the metric often deteriorates due to some tokenization, with CV one of its culprits. Guiding through the proper usage of tokenizers for the given metric, we discover i) the feasibility of the character tokens and ii) the deficit of CV in the Korean MT evaluation.

Dataset

Base
- Source Text: English from WMT 20 English III-type (2,048 sentences / 61 documents)
- Reference Text: Korean* (manually created)
- System Translation: 4 online APIs
Judgment
- Human: Direct Assessment (DA) of adequacy & fluency
- Automatic:
  - BLEU, TER, and ChrF from SacreBLEU (Post, 2018)
  - NLTK_BLEU (Papineni et al., 2002)
  - GLEU (Wu et al., 2016)
  - RIBES (Isozaki et al., 2010)
  - NIST
  - EED (Wang et al., 2016)
  - CharacTER (Stanchev et al., 2019)
    *For legal issue, a sample of the reference set is publicly available.

Tokenization

Word Level
Morpheme Level
- KoNLPy (Park and Cho, 2014)
- Korean Intelligent Word Identifier (Kiwi)
- Kakao Hangul Analyzer III (Khaiii)
Subword Level
- SentencePiece (Kudo and Richardson, 2018)
Character Level
CV Level
- Jamo

Examplary Tokens

Type	Tokens
Sentence	"모델 레옹 데임은 아직 그 누구도 시도한 적 없는 방식으로 캣워크를 활보했다"
Word	`['모델', '레옹', '데임은', '아직', '그', '누구도', '시도한', '적', '없는', '방식으로', '캣워크를', '활보했다']`
MeCab-ko	`['모델', '레옹', '데임', '은', '아직', '그', '누구', '도', '시도', '한', '적', '없', '는', '방식', '으로', '캣', '워크', '를', '활보', '했', '다']`
Kiwi	`['모델', '레옹', '데이', 'ᆷ', '은', '아직', '그', '누구', '도', '시도', '하', 'ᆫ', '적', '없', '는', '방식', '으로', '캣워크', '를', '활보', '하', '었', '다']`
Khaiii	`['모델', '레옹', '데임', '은', '아직', '그', '누구', '도', '시도', '하', 'ㄴ', '적', '없', '는', '방식', '으로', '캣워크', '를', '활보', '하', '였', '다']`
SPM	`['모델', '레', '옹', '데', '임', '은', '아직', '그', '누구', '도', '시도', '한', '적', '없', '는', '방식', '으로', '', '캣', '워크', '를', '활', '보', '했', '다']`
Character	`['모', '델', '레', '옹', '데', '임', '은', '아', '직', '그', '누', '구', '도', '시', '도', '한', '적', '없', '는', '방', '식', '으', '로', '캣', '워', '크', '를', '활', '보', '했', '다']`
CV	['ㅁ', 'ㅗ', 'ㄷ', 'ㅔ', 'ㄹ', ' ', 'ㄹ', 'ㅔ', 'ㅇ', 'ㅗ', 'ㅇ', ' ', 'ㄷ', 'ㅔ', 'ㅇ', 'ㅣ', 'ㅁ', 'ㅇ', 'ㅡ', 'ㄴ', ' ', 'ㅇ', 'ㅏ', 'ㅈ', 'ㅣ', 'ㄱ', ' ', 'ㄱ', 'ㅡ', ' ', 'ㄴ', 'ㅜ', 'ㄱ', 'ㅜ', 'ㄷ', 'ㅗ', ' ', 'ㅅ', 'ㅣ', 'ㄷ', 'ㅗ', 'ㅎ', 'ㅏ', 'ㄴ', ' ', 'ㅈ', 'ㅓ', 'ㄱ', ' ', 'ㅇ', 'ㅓ', 'ㅄ', 'ㄴ', 'ㅡ', 'ㄴ', ' ', 'ㅂ', 'ㅏ', 'ㅇ', 'ㅅ', 'ㅣ', 'ㄱ', 'ㅇ', 'ㅡ', 'ㄹ', 'ㅗ', ' ', 'ㅋ', 'ㅐ', 'ㅅ', 'ㅇ', 'ㅝ', 'ㅋ', 'ㅡ', 'ㄹ', 'ㅡ', 'ㄹ', ' ', 'ㅎ', 'ㅘ', 'ㄹ', 'ㅂ', 'ㅗ', 'ㅎ', 'ㅐ', 'ㅆ', 'ㄷ', 'ㅏ']

Tools & Results

Using the tools

Before implementation, we remind you to have all tokenizers installed. To test all the tools applied for our experiment (rank clustering, bootstrap resampling, tokenized samples, and automatic metrics scores), use the following code:

# for instance, metric = gleu
$ python3 script.py gleu

The given two metrics are not included in the script above, as they aer copied from the original libraries. You can test them seperately with our example with the following code:

[EED]
$ python3 ./tool/metric/ExtendedEditDistance/EED.py \
                -ref data/ref_example.txt \
                -hyp data/hyp_example.txt

[CharacTER]
$ python3 ./tool/metric/CharacTER/CharacTER.py \
                -r data/ref_example.txt \
                -o data/hyp_example.txt

Reproducing the experiment results

The regeneration of the figures of our paper is with the code given below. The options are to choose either the segment or corpus level and to save the images using --save.

$ python3 tool/draw_graph.py 'corpus'

Computing time

You can check the time to tokenize our sample text with the following code.

$ bash tool/tokenizer.sh data/sample tsv "Ref Hyp" "Kkma Hannanum Okt Komoran Mecab Khaiii Kiwi Spm Syllable CV"

Citation

@inproceedings{kim-kim-2022-vacillating,
    title = "Vacillating Human Correlation of {S}acre{BLEU} in Unprotected Languages",
    author = "Kim, Ahrii  and Kim, Jinhyeon",
    booktitle = "Proceedings of the 2nd Workshop on Human Evaluation of NLP Systems (HumEval)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.humeval-1.1",
    pages = "1--15",
    abstract = "SacreBLEU, by incorporating a text normalizing step in the pipeline, has become a rising                automatic evaluation metric in recent MT studies. With agglutinative languages such as Korean, however, the                 lexical-level metric cannot provide a conceivable result without a customized pre-tokenization. This paper endeavors to ex- amine the influence of diversified tokenization schemes {--}word, morpheme, subword, character, and consonants {\&} vowels (CV){--} on the metric after its protective layer is peeled off.By performing meta-evaluation with manually- constructed into-Korean resources, our empirical study demonstrates that the human correlation of the surface-based metric and other homogeneous ones (as an extension) vacillates greatly by the token type. Moreover, the human correlation of the metric often deteriorates due to some tokenization, with CV one of its culprits. Guiding through the proper usage of tokenizers for the given metric,      we discover i) the feasibility of the character tokens and ii) the deficit of CV in the Korean MT evaluation.",
    }

License

Apache License Version 2.0

Name		Name	Last commit message	Last commit date
Latest commit History 87 Commits
data		data
img		img
tool		tool
.gitignore		.gitignore
.gitmodules		.gitmodules
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt
script.py		script.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Vacillating Human Correlation of SacreBLEU in Unprotected Languages

Abstract

Dataset

Tokenization

Examplary Tokens

Tools & Results

Using the tools

Reproducing the experiment results

Computing time

Citation

License

About

Releases

Packages

Languages

License

trotacodigos/Korean_SacreBLEU

Folders and files

Latest commit

History

Repository files navigation

Vacillating Human Correlation of SacreBLEU in Unprotected Languages

Abstract

Dataset

Tokenization

Examplary Tokens

Tools & Results

Using the tools

Reproducing the experiment results

Computing time

Citation

License

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages