umLabeller

umLabeller is an inspection tool for characterizing the semantic compositionality of subword tokenization, based on morphological information retrieved from UniMorph. Given a word w and its subword tokenization s = (s1, ..., sn), where every subword si belongs to the tokenizer vocabulary V, umLabeller assigns one of four categories: vocab, alien, morph, or n/a:

  • vocabulary subword: the given word w is itself a subword in the vocabulary, i.e. w ∈ V;
  • alien composition: the given subword sequence s is an alien subword composition if we find at least two subwords si and sj in s that are not meaningful with respect to the meaning of w;
  • morphological composition: the subword sequence s is morphological if it is neither a vocabulary nor an alien subword composition;
  • n/a: UniMorph has no information on the word.
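
The four categories can be illustrated with a small toy sketch. It is not umLabeller's implementation: it assumes a hypothetical hand-written table of morpheme boundary offsets (`TOY_BOUNDARIES`) in place of UniMorph, and it approximates "not meaningful" as "a subword boundary that does not fall on a morpheme boundary". The order of the vocab and n/a checks is also an assumption of this sketch.

```python
from itertools import accumulate

# Hypothetical toy lexicon: word -> character offsets of its morpheme boundaries.
# The offsets are hand-written for illustration and are NOT taken from UniMorph.
TOY_BOUNDARIES = {
    "jogging": {3},          # jog | ging (toy choice)
    "neutralised": {7, 10},  # neutral | ise | d
    "stepstones": {4, 9},    # step | stone | s
    "swappiness": {4, 6},    # swap | pi | ness
}

# Common word-start markers used by tokenizers (assumed list).
WORD_START_MARKERS = ("_", "▁", "Ġ")


def toy_label(word, subwords):
    """Return 'vocab', 'alien', 'morph', or 'n/a' under the toy assumptions above."""
    pieces = list(subwords)
    # Strip a word-start marker from the first subword only.
    for marker in WORD_START_MARKERS:
        if pieces[0].startswith(marker):
            pieces[0] = pieces[0][len(marker):]
            break
    if "".join(pieces) != word:
        raise ValueError("subwords do not spell the word")
    if len(pieces) == 1:
        return "vocab"  # the whole word is a single vocabulary entry
    if word not in TOY_BOUNDARIES:
        return "n/a"    # no morphological information available
    # Character offsets at which the tokenizer cut the word.
    cuts = set(accumulate(len(p) for p in pieces[:-1]))
    # Every cut on a morpheme boundary -> morph; any stray cut -> alien.
    return "morph" if cuts <= TOY_BOUNDARIES[word] else "alien"


print(toy_label("stepstones", ["_steps", "tones"]))
print(toy_label("swappiness", ["_swap", "pi", "ness"]))
```

Under these assumptions, cutting "stepstones" after "steps" splits the morpheme "stone", so the composition is treated as alien, while "swap", "pi", "ness" only cuts at the listed boundaries and is treated as morphological.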

umLabeller can characterize over half a million English words and is compatible with most modern tokenizers.

Examples

| Input word | Subword tokenization | Output label |
|---|---|---|
| jogging | _j ogging | alien |
| neutralised | _neutral ised | morph |
| stepstones | _steps tones | alien |
| swappiness | _sw appiness | alien |
| swappiness | _swap pi ness | morph |
| jogging | _jogging | vocab |

Installation

To install from source, run the following commands:

!git clone https://github.com/unimorph/umLabeller.git
cd umLabeller
!pip install .

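Note: The instructions above have been tested on Google Colab. The leading ! is notebook syntax for running a shell command; omit it when working in a regular terminal.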

Usage

from umLabeller.umLabeller import UniMorphLabeller

uml = UniMorphLabeller()
print(uml.auto_classify('stepstones',['Ġsteps','tones']))

Output:

alien
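
To inspect a whole tokenizer instead of a single word, the labels can be tallied over many words. The following is a minimal sketch, not part of umLabeller: `label_distribution` is a hypothetical helper, and `fake_classify` is a stand-in so the snippet runs without the package installed. It only assumes a classifier with the same call shape as `auto_classify(word, subwords)` shown above.

```python
from collections import Counter


def label_distribution(classify, tokenized_words):
    """Count the labels that classify(word, subwords) returns over (word, subwords) pairs."""
    return Counter(classify(word, subwords) for word, subwords in tokenized_words)


# Stand-in classifier (assumption for illustration only): it does NOT
# reproduce umLabeller's labels, it just makes the sketch runnable.
def fake_classify(word, subwords):
    return "vocab" if len(subwords) == 1 else "alien"


rows = [
    ("stepstones", ["Ġsteps", "tones"]),
    ("jogging", ["Ġjogging"]),
]
counts = label_distribution(fake_classify, rows)
print(counts)
```

With umLabeller installed, passing `uml.auto_classify` in place of `fake_classify` should give the real label counts for the same rows.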

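License

umLabeller is released under the Creative Commons Attribution-ShareAlike 3.0 license (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/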

References

For more details, see the following article:

Khuyagbaatar Batsuren, Ekaterina Vylomova, Verna Dankers, Tsetsuukhei Delgerbaatar, Omri Uzan, Yuval Pinter, Gábor Bella. Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge. https://arxiv.org/abs/2404.13292