We extend the original ACL 2020 work (Morpheus) by Tan et al., to allow for multiple source languages.
First, download the package and install the requirements:
git clone https://github.com/murali1996/morpheus_multilingual
cd morpheus_multilingual
pip install -e .
The required packages are listed in requirements.txt and can also be installed
using pip install -r requirements.txt
Then, install fairseq package:
pip install fairseq fastBPE
If you face any issues w/ fastBPE installation, see here or here to resolve issues for fastBPE installation and complete installing fairseq.
Second, prepare the language-specific tools,
mkdir -p resources
# setup unimorph_inflect
git clone https://github.com/antonisa/unimorph_inflect.git ./resources/unimorph_inflect
cd resources/unimorph_inflect
python3 setup.py install
cd ../../
# download ud-compatibility for format conversion from stanza to unimorph
git clone https://github.com/unimorph/ud-compatibility ./resources/ud-compatibility
sed -i 's/from\ collections\.abc\ import\ Set/from\ typing\ import\ Set/g' resources/ud-compatibility/UD_UM/utils.py
# sed -i '' 's/from\ collections\.abc\ import\ Set/from\ typing\ import\ Set/g' resources/ud-compatibility/UD_UM/utils.py # for macOS
# download language specific resources
python run_setup.py deu
# can pass any number of languages
# python run_setup.py deu rus ukr
Irrespective of number of data files you would like to create adversaries for, you only need to
run run_setup.py
only once per language.
To obtain adversarial evaluation sets on the provided sample MT data, follow the steps below.
# Use pretrained NMT models from fairseq to evaluate their morphological robustness on a
# sample of newstest2018 evaluation suite from WMT
python run_create_adversaries.py -l deu -f "./sample_mt_data/deu-eng/newstest2018-dev-sample.orig.deu-eng" -m "" --use_pretrained_fairseq_model -b 4
To observe stanza outputs and obtain the tagged Morpho-Syntactic Descriptions (MSDs) for your MT file's source text, follow the steps below. Run the command as-is to obtain results on the provided sample MT data.
Given source language text file, we first collect its vocabulary and then generate all possible
morphological forms. We use Stanza
to generate the morpho-syntactic description (MSD) for each
token in the source text. Then, we use ud-comptability
toolkit to map the UD-style MSD to UniMorph
format. Finally, we make use of UniMorph dictionaries and unimorph_inflect
toolkit to generate all
possible morphological forms.
# generate seed set for adversarial replacement
python -W ignore morpheus_multilingual/utils/create_dictionary.py -l deu -f "./sample_mt_data/deu-eng/newstest2018-dev-sample.orig.deu-eng"
- This might take anywhere between few seconds and few minutes to run,
- And creates 4 files in the output folder (defaults to folder of the mt_file provided above)
with the following extensions:
.stanza
(stanza processed file),.married
(stanza processed file with MSDs converted to UniMorph format),.dict
(all the unique combinations of (lemma, set(MSD)) with set(MSD) containing one of ["V", "N", "ADJ"]),.dict_frequency
(the observed frequency of the unique combinations).
- If your source side texts are already tokenized, you can run the above command with
flag
--pretokenized
to disable any tokenization by Stanza - To use a GPU while running Stanza, use the flag
--use_gpu
Once all the MSDs from the MT data file are obtain, the next step is to observe what can be the potential replacement candidates (aka. reinflections) given the lemma and their MSD. These reinflections are used in finding adversarial examples.
# generate candidate set for adversarial replacement
python -W ignore morpheus_multilingual/utils/create_reinflections.py -l deu -f "./sample_mt_data/deu-eng/candidates/newstest2018-dev-sample.orig.deu-eng.stanza.married.dict"
- This run might take a bit longer than the one to observe stanza outputs
- And creates one file consisting of reinflections
python run_download_ted_data.py
If you use this tool in your research, consider adding the following citation,
@inproceedings{jayanthi-pratapa-2021-study,
title = "A Study of Morphological Robustness of Neural Machine Translation",
author = "Jayanthi, Sai Muralidhar and
Pratapa, Adithya",
booktitle = "Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.sigmorphon-1.6",
doi = "10.18653/v1/2021.sigmorphon-1.6",
pages = "49--59"
}
We also recommend citing the original MORPHEUS work from ACL 2020 that formed the basis of this tool,
@inproceedings{tan-etal-2020-morphin,
title = "It{'}s Morphin{'} Time! {C}ombating Linguistic Discrimination with Inflectional Perturbations",
author = "Tan, Samson and
Joty, Shafiq and
Kan, Min-Yen and
Socher, Richard",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-main.263",
pages = "2920--2935",
}