Explanations and visualisations
- Crash Course Linguistics #16
- Sebastian Ruder's blog: The State of Multilingual AI
- Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT
- The State and Fate of Linguistic Diversity and Inclusion in the NLP World
- No Language Left Behind: Scaling Human-Centered Machine Translation
- Connecting Language Technologies with Rich, Diverse Data Sources Covering Thousands of Languages
- In principle, we can pre-train a model on one language and use it to process texts in another language
- Most often, we pre-train models on multiple languages and then use this multilingual model to process texts in any given single language
- The target language is the language in which we want to perform a task (process texts)
- The transfer language is the language in which we have a lot of labelled data; we fine-tune a pre-trained multilingual model on the transfer language and then apply it (often zero-shot) to a different target language
- Double transfer: when we transfer a model across languages and fine-tune it for a given task, there are two transfer steps, one across languages and one across tasks
- (Continued) training: if we have some unlabelled data in the target language, we can continue training the model with the pre-training objective before fine-tuning it for a given task
- Zero-shot: we apply the model to the target language without any fine-tuning or continued training in that language (see the sketch after this list)
- The pre-trained model can be a bare LM or a model already trained/fine-tuned for a specific task
- An interesting example is the Helsinki team's solution to the AmericasNLP task of translating from Spanish into low-resource languages: for each Spanish-TARGET pair, train on Spanish-English for 90% of the time, then continue on Spanish-TARGET for the remaining 10%; this gave the best results for all TARGET languages
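A minimal sketch of the zero-shot setup, using Hugging Face transformers with XNLI as the task; the checkpoint, split sizes and hyperparameters are illustrative assumptions, not the setup of any system mentioned above:

```python
# Zero-shot cross-lingual transfer: fine-tune a multilingual model on labelled
# data in the transfer language (English), then evaluate it directly on a
# target language it never saw labelled data for. Continued training would add
# an MLM step on unlabelled target-language text before this fine-tuning.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=3)   # XNLI labels: entailment / neutral / contradiction

def encode(batch):
    return tok(batch["premise"], batch["hypothesis"],
               truncation=True, padding="max_length", max_length=128)

train = load_dataset("xnli", "en", split="train[:2000]").map(encode, batched=True)  # transfer language
test = load_dataset("xnli", "sw", split="test").map(encode, batched=True)           # target language (Swahili)

args = TrainingArguments(output_dir="xnli-zero-shot", num_train_epochs=1,
                         per_device_train_batch_size=16, report_to="none")
trainer = Trainer(model=model, args=args, train_dataset=train)
trainer.train()                              # fine-tuning on English only
print(trainer.evaluate(eval_dataset=test))   # zero-shot: eval loss on Swahili, no Swahili training
```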
Text only (unannotated)
- Bible 100, 103 languages, 30 families, majority non-Indo-European
- mBERT pre-training data, 97 languages, 15 families, the 100 largest Wikipedias plus Thai and Mongolian
Parallel data (for machine translation)
- OPUS, 744 languages
- FLORES, 200 languages
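For a quick look at what such parallel data looks like, a minimal loading sketch; the dataset ID and language-pair config are examples of how the OPUS-100 subset is published on the Hugging Face Hub, not part of these notes:

```python
# Pull a small slice of English-French parallel data from the OPUS-100 subset.
from datasets import load_dataset

pairs = load_dataset("Helsinki-NLP/opus-100", "en-fr", split="train[:3]")
for ex in pairs:
    print(ex["translation"]["en"], "->", ex["translation"]["fr"])
```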
Annotated for text parsing
- Universal Dependencies (UD) 150 languages
Annotated for semantic NLP tasks (sentiment, similarity, inference, question-answering, ...)
- XTREME, 40 languages (used for evaluating XLM-R)
- XGLUE, 19 languages
- XNLI, 15 languages
- XCOPA, 11 languages
- TyDiQA, 11 languages
- XQuAD, 12 languages
Many multilingual data sets are created from a selection of data taken from Common Crawl.
BERT-type
- mBERT was the first, trained on the 100 largest Wikipedias plus a few additional languages
- XLM-R (XLM-RoBERTa), currently the most popular starting point for multilingual experiments (see the loading sketch after this overview)
GPT-type
- BLOOM
- Falcon
- Phi
Full (encoder-decoder) Transformers
- mT5
Dedicated encoder-decoder MT models (not general-purpose pre-trained LMs)
- NLLB
Other pre-trained models are typically trained for a single language or a group of languages (e.g. Indic BERT, AraBERT, BERTić)
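A small sketch of how these model families are typically loaded, with one example per family; the checkpoint names are common Hugging Face Hub IDs chosen for illustration, not an exhaustive or recommended list:

```python
# One representative checkpoint per pre-trained multilingual family,
# loaded with the matching Auto class.
from transformers import (AutoModelForMaskedLM,    # BERT-type: encoder with a masked-LM head
                          AutoModelForCausalLM,    # GPT-type: decoder-only LM
                          AutoModelForSeq2SeqLM)   # encoder-decoder

mbert = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")       # mBERT
xlmr = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")                    # XLM-R
bloom = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")              # BLOOM (smallest variant)
mt5 = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")                    # mT5
nllb = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")   # NLLB
```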
- If the target language was seen during pre-training, performance is better
- There is a trade-off between the size of the transfer-language training data and its closeness to the target language
- It is not easy to predict which transfer-target pairs will work well
- Often English-only BERT base works best even if the target language is very distant from English
- To measure distances between languages, we represent each language as a vector (a toy sketch follows this list)
- One-hot encodings are used as language IDs
- Typological feature values can be treated as vectors, but the features in existing databases first need to be processed; the two most common methods are conversion into binary values and interpolation (filling in missing values)
- Vectors learned from text samples are called language model (LM) vectors: typically a special token is added to each sentence (the same token for all sentences of a given language), and this token is expected to learn a representation of that language
- Typological databases: WALS, Glottolog, URIEL (derived from WALS, Glottolog and some other sources), Grambank
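A toy sketch of the distance computation mentioned above; the binary "typological" feature values below are invented for illustration, while real values would come from WALS, URIEL or Grambank after binarisation and interpolation:

```python
# Measure language distance from (toy) binary typological feature vectors.
import numpy as np

features = {                    # hypothetical binary typological vectors
    "eng": np.array([1, 0, 1, 1, 0, 1]),
    "deu": np.array([1, 0, 1, 0, 0, 1]),
    "tur": np.array([0, 1, 0, 0, 1, 0]),
}

def cosine_distance(u, v):
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

for lang in ("deu", "tur"):
    print(f"eng vs {lang}: {cosine_distance(features['eng'], features[lang]):.2f}")

# One-hot language IDs, by contrast, place every pair of distinct languages at
# exactly the same distance: they identify languages without encoding similarity.
ids = np.eye(3)                              # one-hot vectors for three languages
print(cosine_distance(ids[0], ids[1]))       # 1.0 for any distinct pair
```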
- Linguistic and machine-learning: bigger challenges lead to better approaches, e.g. subword tokenisation
- Cultural and normative: better representation of real-world knowledge
- Cognitive: learning interlingual abstractions