This repository contains the code for an experiment in translating the English SQuAD2.0 dataset into Finnish (or other languages).
Download the original SQuAD2.0 train and dev files and put them into the squad2-en/ folder. Then run the script to convert them from .json to .docx (run pip install -r requirements.txt first if needed):
python3 squad2doc.py squad2-en/dev-v2.0.json squad2-en/train-v2.0.json
This creates a set of .docx files, each within the size limit of the DeepL translation service, and a meta.jsonl file that contains the information needed to map the answers to the questions after the .docx files have been translated.
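The chunking idea can be sketched as follows. This is only an illustration of the approach, not the code in squad2doc.py: the function names, the metadata fields, and the MAX_CHARS budget are all assumptions, and the sketch works on plain strings instead of writing real .docx files. It relies only on the published SQuAD2.0 JSON layout (data, paragraphs, context, qas, answers).

```python
import json

# Hypothetical character budget per chunk. The actual DeepL limit and the
# threshold used by squad2doc.py are not stated in this README.
MAX_CHARS = 50_000


def iter_squad_texts(squad):
    """Yield (record, text) pairs for every context, question and answer."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            yield {"type": "context", "title": article["title"]}, paragraph["context"]
            for qa in paragraph["qas"]:
                yield {"type": "question", "id": qa["id"]}, qa["question"]
                for answer in qa["answers"]:
                    yield {"type": "answer", "id": qa["id"]}, answer["text"]


def chunk_texts(pairs, max_chars=MAX_CHARS):
    """Group texts into chunks whose combined length stays within max_chars.

    A single text longer than max_chars still gets a chunk of its own.
    Returns the chunks plus one metadata record per text that stores
    where the text ended up (chunk number and position inside the chunk).
    """
    chunks, meta = [[]], []
    size = 0
    for record, text in pairs:
        if chunks[-1] and size + len(text) > max_chars:
            chunks.append([])  # start a new chunk (one chunk = one document)
            size = 0
        meta.append(dict(record, chunk=len(chunks) - 1, index=len(chunks[-1])))
        chunks[-1].append(text)
        size += len(text)
    return chunks, meta


def to_jsonl(meta):
    """Serialize metadata records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(m, ensure_ascii=False) for m in meta)
```

Because each metadata record stores the chunk number and the position inside the chunk, a translated text can be matched back to its question ID by position alone, without relying on the translated wording.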
Feed the .docx files to DeepL and save the translated files into the squad2-fi-raw/ folder.
HTML files are easier to parse with Python, so it makes sense to convert the translated .docx files to HTML. An easy way to do this is with something like LibreOffice:
for FILE in squad2-fi-raw/*.docx ; do libreoffice --convert-to html --outdir squad2-fi-raw/html "$FILE" ; done
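If a shell loop is not convenient (for example on Windows), the same conversion could be driven from Python. The sketch below simply wraps the command shown above. The function names are hypothetical, and it assumes the libreoffice executable is on the PATH.

```python
import subprocess
from pathlib import Path


def build_convert_command(docx_path, outdir):
    """Return the same LibreOffice command line as the shell loop above."""
    return ["libreoffice", "--convert-to", "html",
            "--outdir", str(outdir), str(docx_path)]


def convert_all(src_dir="squad2-fi-raw", outdir="squad2-fi-raw/html"):
    """Convert every .docx file in src_dir, stopping at the first failure."""
    for docx_path in sorted(Path(src_dir).glob("*.docx")):
        # check=True raises CalledProcessError if LibreOffice exits non-zero.
        subprocess.run(build_convert_command(docx_path, outdir), check=True)
```

Calling convert_all() with no arguments uses the same folders as the shell command.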
The last step is to parse the HTML files to create the final Finnish JSON files (this will take a while):
python3 html2squad.py
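To give an idea of what parsing the HTML involves, here is a minimal sketch that pulls paragraph texts out of an HTML string using only the standard library. It is not the implementation in html2squad.py. It assumes that each translated text sits in its own <p> element, which may not match the real export exactly, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text content of every non-empty <p> element."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True decodes entities like &auml;
        self.paragraphs = []
        self._buffer = None  # None means we are currently outside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            # Join the pieces and collapse runs of whitespace and newlines.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


def extract_paragraphs(html_text):
    """Return the paragraph texts of an HTML document, in document order."""
    parser = ParagraphExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.paragraphs
```

Inline markup such as bold or italic tags inside a paragraph is ignored, so only the text remains. The extracted paragraphs can then be matched against meta.jsonl by position.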
The final dataset is then written to the squad2_fi/ folder.
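A quick sanity check of the result might look like the sketch below. It assumes the output files follow the original SQuAD2.0 JSON layout, which this README does not state explicitly. The function names are hypothetical, and the file path is left as an argument because the output filenames are not listed here.

```python
import json


def summarize_squad(squad):
    """Count questions, unanswerable questions and misaligned answer spans."""
    total = impossible = misaligned = 0
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                total += 1
                if qa.get("is_impossible", False):
                    impossible += 1
                for answer in qa.get("answers", []):
                    start = answer["answer_start"]
                    end = start + len(answer["text"])
                    # The answer text should appear in the context at answer_start.
                    if context[start:end] != answer["text"]:
                        misaligned += 1
    return {"questions": total, "impossible": impossible,
            "misaligned_answers": misaligned}


def summarize_file(path):
    """Load a SQuAD-style JSON file and summarize it."""
    with open(path, encoding="utf-8") as f:
        return summarize_squad(json.load(f))
```

A non-zero misaligned_answers count would suggest that some translated answers no longer match a span of their translated context.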