
11. Cross-lingual transfer and multilingual NLP

Explanations and visualisations (figures): finetune, continuepretrain, test

Cross-lingual transfer

  • In principle, we can pre-train a model on one language and use it to process texts in another language
  • Most often, we pre-train models on multiple languages and then use this multilingual model to process texts in any given single language
  • Target language is the language in which we want to perform a task (process texts)
  • Transfer language is the language in which we have a lot of labelled data; we fine-tune a pre-trained multilingual model on the transfer language and then apply it, often zero-shot, to a different target language (see the sketch after this list)
  • Double transfer: when we transfer a model across languages and fine-tune it for a given task, there are two transfer steps, one across languages and one across tasks
  • (Continued) training: if we have some unlabelled data in the target language, we can continue training the model with the pre-training objective before fine-tuning it for a given task
  • Zero-shot: we attempt to perform a task without fine-tuning or continued training
  • The pre-trained model can be a bare LM or a model trained/fine-tuned for a specific task
  • An interesting example is the Helsinki team's solution to the AmericasNLP task of translating from Spanish into low-resource languages: for each pair Spanish - TARGET, train a model on Spanish - English for 90% of the time, then continue on Spanish - TARGET for the remaining 10%; this gave the best results for all TARGET languages
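
To make the fine-tune-then-transfer idea concrete, here is a minimal sketch of zero-shot cross-lingual transfer with the Hugging Face transformers library. The checkpoint name (xlm-roberta-base), the toy English NLI pairs and the training settings are illustrative assumptions, not a prescribed recipe: the model is fine-tuned on the transfer language (English) and then applied directly to the target language (Spanish).

```python
# Minimal sketch of zero-shot cross-lingual transfer (toy data and settings).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "xlm-roberta-base"          # multilingual encoder covering ~100 languages
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=3)

# Tiny hand-made English NLI data (transfer language);
# labels: 0 = entailment, 1 = neutral, 2 = contradiction
premises = ["A man is playing a guitar.", "A dog sleeps on the sofa."]
hypotheses = ["A person is making music.", "The dog is running outside."]
labels = torch.tensor([0, 2])

batch = tok(premises, hypotheses, padding=True, truncation=True, return_tensors="pt")
optim = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):                  # a few toy update steps on the transfer language
    out = model(**batch, labels=labels)
    out.loss.backward()
    optim.step()
    optim.zero_grad()

# Zero-shot step: apply the English-fine-tuned model to Spanish (target language)
model.eval()
es = tok("Un hombre toca la guitarra.", "Una persona hace música.", return_tensors="pt")
with torch.no_grad():
    pred = model(**es).logits.argmax(dim=-1).item()
print("predicted label:", pred)
```

Continued training would add an intermediate step: further training with the pre-training objective on unlabelled target-language text before the fine-tuning loop above.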

 

Multilingual data sets

Only text

  • Bible 100: 103 languages, 30 families, majority non-Indo-European
  • mBERT corpus: 97 languages, 15 families, the 100 largest Wikipedias plus Thai and Mongolian

 

Parallel data (for machine translation)

  • OPUS, 744 languages
  • FLORES, 200 languages

 

Annotated for text parsing

  • Universal Dependencies (UD), 150 languages

 

Annotated for semantic NLP tasks (sentiment, similarity, inference, question-answering, ...)

  • XTREME, 40 languages (used for evaluating XLM-R)
  • XGLUE, 19 languages
  • XNLI, 15 languages
  • XCOPA, 11 languages
  • TyDiQA, 11 languages
  • XQuAD, 12 languages

Many multilingual data sets are created from a selection of data taken from Common Crawl.
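
As an illustration of how such benchmarks are commonly accessed, here is a minimal sketch using the Hugging Face datasets library; the dataset identifier "xnli" and the language configuration name are assumptions about the Hub's naming.

```python
# Minimal sketch: inspecting a multilingual benchmark with the `datasets` library.
from datasets import load_dataset

# French portion of XNLI: premise / hypothesis / label (entailment, neutral, contradiction)
xnli_fr = load_dataset("xnli", "fr", split="validation")
print(xnli_fr[0])          # one annotated example
print(xnli_fr.features)    # column names and label classes
```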

 

Multilingual pre-trained models

BERT-type

  • mBERT was the first, trained on the top-100 Wikipedia languages plus a few arbitrarily chosen ones
  • XLM-R, a RoBERTa-based model, is currently the most popular starting point for multilingual experiments

GPT-type

  • BLOOM
  • Falcon
  • Phi

Full Transformers

  • mT5

Multilingual encoder-decoder models for machine translation

  • NLLB

Other pre-trained models are typically trained for a single language or a group of languages (e.g. Indic BERT, AraBERT, BERTić)
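
A brief sketch of what using a multilingual pre-trained model looks like in practice: one checkpoint (xlm-roberta-base is assumed here) is loaded once, and its shared subword vocabulary handles sentences from different languages.

```python
# Minimal sketch: one multilingual checkpoint, many languages.
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")

for text in ["The weather is nice today.",      # English
             "Das Wetter ist heute schön.",     # German
             "Vrijeme je danas lijepo."]:       # Croatian
    print(tok.tokenize(text))                   # shared subword segmentation

# The same encoder produces contextual embeddings for any of its languages.
enc = tok("Vrijeme je danas lijepo.", return_tensors="pt")
hidden = model(**enc).last_hidden_state
print(hidden.shape)                             # (1, sequence_length, 768)
```

Reusing one tokenizer and one encoder for every language is what makes the transfer scenarios above possible in the first place.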

 

Language similarity and transfer

 


  • If the target language was seen in pre-training, performance is generally better
  • There is a trade-off between the size of the transfer-language training data and its closeness to the target language (see the toy sketch after this list)
  • It is not easy to predict which transfer-target pairs will work well
  • Often English-only BERT base works best even when the target language is very distant
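
The trade-off mentioned above can be pictured with a purely hypothetical scoring heuristic: reward the amount of transfer-language data, penalise distance to the target language. The weights and the candidate figures below are invented for illustration and do not come from any published method.

```python
# Illustrative sketch only: a made-up heuristic for ranking transfer languages.
import math

def transfer_score(corpus_size_tokens, distance, alpha=1.0, beta=5.0):
    """Higher is better: reward data volume, penalise typological distance."""
    return alpha * math.log10(corpus_size_tokens) - beta * distance

candidates = {              # (corpus size in tokens, distance in [0, 1]) -- invented values
    "English": (3e12, 0.65),
    "Russian": (3e11, 0.40),
    "Czech":   (5e10, 0.15),
}
for lang, (size, dist) in sorted(candidates.items(),
                                 key=lambda kv: -transfer_score(*kv[1])):
    print(f"{lang:8s} score = {transfer_score(size, dist):.2f}")
```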

 

Language vectors

  • To measure distances between languages, we represent each language as a vector
  • One-hot encodings are used as language IDs
  • Typological feature values can be treated as vectors, but the features in existing databases need to be processed first; the two most common methods are conversion into binary values and interpolation (filling in missing values) (see the sketch after this list)
  • Vectors learned from text samples are called language model (LM) vectors: typically a special token is appended to each sentence (the same token for all sentences of a given language), and its learned embedding is expected to represent the language
  • Typological databases: WALS, Glottolog, URIEL (derived from WALS, Glottolog and some other sources), Grambank
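
A minimal sketch of the vector representations listed above, assuming invented binary typological features; real feature values would come from databases such as WALS or URIEL. It contrasts one-hot IDs, under which all languages are equidistant, with typological vectors, under which related languages come out closer.

```python
# Minimal sketch of language vectors and distances (feature values are invented).
import numpy as np

languages = ["eng", "deu", "hrv", "rus"]

# One-hot language IDs: every language is equally distant from every other one.
one_hot = np.eye(len(languages))

# Hypothetical binary typological features (e.g. has-articles, SVO order, has-cases, pro-drop)
typo = np.array([
    [1, 1, 0, 0],   # eng
    [1, 1, 1, 0],   # deu
    [0, 1, 1, 1],   # hrv
    [0, 1, 1, 1],   # rus
], dtype=float)

def cosine_distance(a, b):
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

for i, src in enumerate(languages):
    for j, tgt in enumerate(languages):
        if i < j:
            print(f"{src}-{tgt}: one-hot {cosine_distance(one_hot[i], one_hot[j]):.2f}, "
                  f"typological {cosine_distance(typo[i], typo[j]):.2f}")
```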

 

Benefits of multilingual NLP

  • Linguistic and machine learning: bigger challenges lead to better approaches, e.g. subword tokenisation
  • Cultural and normative: better representation of real-world knowledge
  • Cognitive: learn interlingual abstractions