umLabeller is an inspection tool for characterizing the semantic compositionality of subword tokenization, based on morphological information retrieved from UniMorph. Given a word w and its subword tokenization s = (s1, ..., sn), where every subword si belongs to the tokenizer vocabulary V, umLabeller assigns one of four labels: vocab, alien, morph, or n/a:
- vocabulary subword (vocab): the word w is itself a subword in the vocabulary, i.e. w ∈ V;
- alien composition (alien): the subword sequence s contains at least two subwords si and sj that are not meaningful with respect to the meaning of w;
- morphological composition (morph): the subword sequence s is neither a vocabulary subword nor an alien composition;
- n/a: UniMorph has no information on the word w.
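The four definitions above can be read as a small decision procedure. The sketch below is a toy illustration only, not umLabeller's implementation: the function name `toy_label`, the `MEANINGFUL` lookup, and its contents are hypothetical, hand-written stand-ins for the morphological information that umLabeller retrieves from UniMorph. The order of the checks (vocab before n/a) is also an assumption.

```python
# Toy sketch of the four-way decision; NOT umLabeller's implementation.
# MEANINGFUL is hand-written illustrative data standing in for UniMorph:
# for each word, the strings treated here as meaningful parts of it.
MEANINGFUL = {
    "jogging": {"jog", "ing"},
    "neutralised": {"neutral", "neutralise", "ise", "ised", "d"},
    "stepstones": {"step", "stone", "stones", "s"},
    "swappiness": {"swap", "swappy", "ness"},
}

# Word-start markers used by common tokenizers (assumed set).
WORD_START_MARKERS = "Ġ_▁"


def toy_label(word, subwords, meaningful=MEANINGFUL):
    """Return 'vocab', 'alien', 'morph', or 'n/a' for a tokenized word."""
    # Strip word-start markers so the pieces can be compared with the word.
    pieces = [s.lstrip(WORD_START_MARKERS) for s in subwords]
    if "".join(pieces) != word:
        raise ValueError("subwords do not spell the word")
    # vocab: the tokenizer kept w whole, so w itself is in V.
    if len(pieces) == 1:
        return "vocab"
    # n/a: the lookup has no information on w.
    if word not in meaningful:
        return "n/a"
    # alien: at least two subwords are not meaningful for w; otherwise morph.
    unmeaningful = [p for p in pieces if p not in meaningful[word]]
    return "alien" if len(unmeaningful) >= 2 else "morph"


print(toy_label("swappiness", ["_sw", "appiness"]))
print(toy_label("swappiness", ["_swap", "pi", "ness"]))
```

Under this reading, a tokenization with exactly one non-meaningful subword still counts as morph, because the alien label requires at least two such subwords.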
umLabeller can characterize over half a million English words and is compatible with most modern tokenizers.
input word | subword tokenization | output label |
---|---|---|
jogging | _j ogging | alien |
neutralised | _neutral ised | morph |
stepstones | _steps tones | alien |
swappiness | _sw appiness | alien |
swappiness | _swap pi ness | morph |
jogging | _jogging | vocab |
To install from source, run the following commands:
!git clone https://github.com/unimorph/umLabeller.git
cd umLabeller
!pip install .
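Note: The instructions above have been tested on Google Colab. The leading ! runs a shell command from a notebook cell; omit it when working in a regular terminal.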
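from umLabeller.umLabeller import UniMorphLabeller
uml = UniMorphLabeller()
# Pass the word and its list of subwords; Ġ is the word-start marker
# produced by byte-level BPE tokenizers such as GPT-2's.
print(uml.auto_classify('stepstones', ['Ġsteps', 'tones']))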
Output:
alien
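When inspecting a tokenizer over many words, it can help to summarize the labels as proportions. The following is a minimal sketch, assuming only that the classifier is called as in the example above (a word plus its list of subwords, returning a label string). The helper `label_distribution` and the stand-in `stub_classify` are hypothetical names introduced for illustration and are not part of umLabeller.

```python
from collections import Counter


def label_distribution(classify, tokenized_words):
    """Return the fraction of each label over (word, subwords) pairs."""
    counts = Counter(classify(word, subwords) for word, subwords in tokenized_words)
    total = sum(counts.values())
    # An empty input yields an empty distribution instead of dividing by zero.
    return {label: n / total for label, n in counts.items()} if total else {}


# Stand-in classifier so the sketch runs without umLabeller installed.
# It is NOT umLabeller's logic; replace it with uml.auto_classify
# to label real tokenizations.
def stub_classify(word, subwords):
    return "vocab" if len(subwords) == 1 else "alien"


pairs = [
    ("stepstones", ["Ġsteps", "tones"]),
    ("jogging", ["Ġjogging"]),
]
print(label_distribution(stub_classify, pairs))
```

With umLabeller installed, the same helper could be called as label_distribution(uml.auto_classify, pairs) to compare how often different tokenizers produce alien versus morph compositions.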
License: CC BY-SA 3.0, https://creativecommons.org/licenses/by-sa/3.0/
For more details, see the following article:
Khuyagbaatar Batsuren, Ekaterina Vylomova, Verna Dankers, Tsetsuukhei Delgerbaatar, Omri Uzan, Yuval Pinter, Gábor Bella. Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge. https://arxiv.org/abs/2404.13292