Serbian Language Tools is a PHP library for working with text written in the Serbian language. It features:
- Tokenizer
- Diacritic restoration tool
- Transliterator between Serbian Cyrillic and Latin alphabets
- Alphabet detection
This library requires PHP 7.4 or greater with the sqlite3, intl and mbstring extensions.
You can install the package via Composer:
composer require turanjanin/serbian-language-tools
To use the library, you first need to tokenize the string. Tokenization is the process of splitting a string into a sequence of tokens, each of which is a run of related characters. This library can recognize the following tokens: Word, Whitespace, URI (which includes URLs, hashtags and at-mentions), Interpunction, HTML and Emoticon.
Tokenizing is done by creating a new instance of the Text class using the named constructor:
use Turanjanin\SerbianLanguageTools\Text;
$text = Text::fromString('Zdravo svete, ovo je primer teksta!');
The Text object will now contain an array of tokens that can be processed. You can use this object like any other PHP array, since it implements the ArrayAccess interface.
echo count($text) . "\n"; // 13
echo get_class($text[1]) . "\n"; // Turanjanin\SerbianLanguageTools\Tokens\Whitespace
echo $text[9] . "\n"; // primer
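To make the token sequence above concrete, here is a minimal sketch that splits the same sentence with a single regular expression. It is not the library's tokenizer: the `sketchTokenize` function and its pattern are assumptions made for illustration, and they only distinguish words, whitespace and single punctuation characters (no URI, HTML or Emoticon handling).

```php
<?php
// Hypothetical helper, NOT part of Serbian Language Tools.
// Splits text into runs of letters, runs of whitespace,
// and single characters of any other kind.
function sketchTokenize(string $text): array
{
    preg_match_all('/\p{L}+|\s+|[^\p{L}\s]/u', $text, $matches);

    return $matches[0];
}

$tokens = sketchTokenize('Zdravo svete, ovo je primer teksta!');

echo count($tokens) . "\n"; // 13
echo $tokens[9] . "\n";     // primer
```

Every character of the input ends up in exactly one token, so joining the tokens back together reproduces the original string.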
The Serbian Latin alphabet includes several characters that are not found in the ASCII table. These characters carry diacritics (č, ć, š, ž, dž, đ), which are often omitted in everyday communication (social media, emails and SMS), mainly due to the widespread use of English keyboard layouts.
This degraded Latin alphabet is easily understood by human readers, but it poses a significant challenge for search engines and natural language processing. Therefore, this library features an algorithm for automated restoration of diacritics in ASCII text, using a dictionary of Serbian words and phrases for context disambiguation.
The algorithm inspects all Word tokens and looks for restoration candidates: words containing the characters s, c, z or dj. After that, the following two steps are applied:
- The most common phrases are searched for in the text and, if found, their words are replaced with the diacritical equivalents. This step takes word context into consideration, which allows a less frequent variation to win where the context calls for it. For example, sto hiljada won't be replaced with što hiljada, even though the form što (why) has a much greater frequency than the word sto (hundred).
- Every restoration candidate is looked up in the dictionary and, if there are known variations, the token is replaced with RestoredWord (if there is only one possible variation) or MultipleRestoredWord (if there are several possible variations). When there is more than one variation, the one with the highest frequency is marked as preferred.
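The two steps can be sketched in plain PHP as follows. This is a simplified illustration under stated assumptions, not the library's DiacriticRestorer: the function names, the in-memory arrays and the frequencies are all invented, only two-word phrases separated by a single space are handled, and letter case is not preserved.

```php
<?php
// Hypothetical sketch, NOT the library's implementation.

// A word is a restoration candidate if it contains s, c, z or dj.
function sketchIsCandidate(string $word): bool
{
    return preg_match('/[scz]|dj/i', $word) === 1;
}

// Step 2: dictionary lookup, preferring the most frequent variation.
// $dictionary maps an ASCII form to [restored form => frequency].
function sketchRestoreWord(string $word, array $dictionary): string
{
    $key = mb_strtolower($word, 'UTF-8');
    if (!sketchIsCandidate($word) || !isset($dictionary[$key])) {
        return $word;
    }

    $variations = $dictionary[$key];
    arsort($variations); // highest frequency first

    return array_key_first($variations);
}

function sketchRestore(string $text, array $dictionary, array $phrases): string
{
    // Split into word and non-word chunks, keeping every character.
    preg_match_all('/\p{L}+|[^\p{L}]+/u', $text, $matches);
    $tokens = $matches[0];
    $done = []; // indexes already settled by the phrase step

    // Step 1: known phrases win over single-word frequencies.
    for ($i = 0; $i + 2 < count($tokens); $i++) {
        $phrase = mb_strtolower($tokens[$i] . ' ' . $tokens[$i + 2], 'UTF-8');
        if ($tokens[$i + 1] === ' ' && isset($phrases[$phrase])) {
            [$tokens[$i], $tokens[$i + 2]] = explode(' ', $phrases[$phrase]);
            $done[$i] = $done[$i + 2] = true;
        }
    }

    // Step 2: look up every remaining chunk in the dictionary.
    foreach ($tokens as $i => $token) {
        if (!isset($done[$i])) {
            $tokens[$i] = sketchRestoreWord($token, $dictionary);
        }
    }

    return implode('', $tokens);
}

// Invented sample data, for illustration only.
$dictionary = [
    'sto'   => ['što' => 900, 'sto' => 100],
    'skoli' => ['školi' => 50],
];
$phrases = ['sto hiljada' => 'sto hiljada'];

echo sketchRestore('sto hiljada', $dictionary, $phrases) . "\n";       // sto hiljada
echo sketchRestore('sto radis u skoli', $dictionary, $phrases) . "\n"; // što radis u školi
```

In the second call, radis is a candidate but is missing from the sample dictionary, so it is left unchanged.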
Diacritic restoration can be performed by calling the invokable class:
use Turanjanin\SerbianLanguageTools\Text;
use Turanjanin\SerbianLanguageTools\Transformers\DiacriticRestorer;
$text = Text::fromString('Cetiri cavke cuceci dzangrizavo cijucu u zeleznickoj skoli.');
echo (new DiacriticRestorer)($text); // Četiri čavke čučeći džangrizavo cijuču u železničkoj školi.
The dictionary needed for this algorithm is stored in a custom-made SQLite database that is included with the library. You can extend this database or use a different storage solution by providing a custom implementation of the Turanjanin\SerbianLanguageTools\Dictionary\Dictionary interface.
The library supports transliteration of text between the Cyrillic, Latin and ASCII alphabets. Transliteration is performed by calling the appropriate invokable class:
use Turanjanin\SerbianLanguageTools\Text;
use Turanjanin\SerbianLanguageTools\Transformers\ToAsciiLatin;
use Turanjanin\SerbianLanguageTools\Transformers\ToCyrillic;
use Turanjanin\SerbianLanguageTools\Transformers\ToLatin;
$cyrillic = Text::fromString('Ово је ћирилични текст');
$latin = Text::fromString('Primer latiničnog teksta');
echo (new ToLatin)($cyrillic); // Ovo je ćirilični tekst
echo (new ToCyrillic)($latin); // Пример латиничног текста
echo (new ToAsciiLatin)($cyrillic); // Ovo je cirilicni tekst
If you only need transliteration between the Latin and Cyrillic alphabets, take a look at the simpler library turanjanin/serbian-transliterator.
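For intuition, transliteration from Cyrillic is essentially a character-by-character mapping, which can be sketched with PHP's built-in `strtr`. The two helpers below are hypothetical and are not the library's ToLatin or ToAsciiLatin transformers. The sketch assumes UTF-8 input and title-cases uppercase digraphs (Љ becomes Lj), so fully capitalised words are not handled properly.

```php
<?php
// Hypothetical helpers, NOT part of Serbian Language Tools.

function sketchCyrillicToLatin(string $text): string
{
    // Lowercase Serbian Cyrillic letters and their Latin counterparts.
    $lower = [
        'а' => 'a', 'б' => 'b', 'в' => 'v', 'г' => 'g', 'д' => 'd', 'ђ' => 'đ',
        'е' => 'e', 'ж' => 'ž', 'з' => 'z', 'и' => 'i', 'ј' => 'j', 'к' => 'k',
        'л' => 'l', 'љ' => 'lj', 'м' => 'm', 'н' => 'n', 'њ' => 'nj', 'о' => 'o',
        'п' => 'p', 'р' => 'r', 'с' => 's', 'т' => 't', 'ћ' => 'ć', 'у' => 'u',
        'ф' => 'f', 'х' => 'h', 'ц' => 'c', 'ч' => 'č', 'џ' => 'dž', 'ш' => 'š',
    ];

    $map = $lower;
    foreach ($lower as $cyrillic => $latin) {
        // Derive the uppercase pairs, e.g. Љ => Lj, Ч => Č.
        $map[mb_strtoupper($cyrillic, 'UTF-8')] = mb_convert_case($latin, MB_CASE_TITLE, 'UTF-8');
    }

    return strtr($text, $map);
}

function sketchLatinToAscii(string $text): string
{
    // Drop diacritics; đ becomes the digraph dj.
    return strtr($text, [
        'č' => 'c', 'ć' => 'c', 'š' => 's', 'ž' => 'z', 'đ' => 'dj',
        'Č' => 'C', 'Ć' => 'C', 'Š' => 'S', 'Ž' => 'Z', 'Đ' => 'Dj',
    ]);
}

$latin = sketchCyrillicToLatin('Ово је ћирилични текст');

echo $latin . "\n";                     // Ovo je ćirilični tekst
echo sketchLatinToAscii($latin) . "\n"; // Ovo je cirilicni tekst
```

The reverse direction (Latin to Cyrillic) is harder, because digraphs such as lj, nj and dž must be told apart from sequences of separate letters, which a plain lookup table like this one cannot do on its own.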
The library can detect whether text is written in the Serbian Cyrillic or Latin alphabet:
use Turanjanin\SerbianLanguageTools\Text;
Text::fromString('Ovo je latinica')->isLatin(); // true
Text::fromString('Ovo je latinica')->isCyrillic(); // false
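One plausible way to make such a decision is to count the letters of each script and pick the majority. The sketch below does this with PCRE Unicode script properties. It is an assumption about how detection could work, not a description of how isLatin() and isCyrillic() are implemented, and the `sketchDetectAlphabet` function is hypothetical.

```php
<?php
// Hypothetical helper, NOT part of Serbian Language Tools.
function sketchDetectAlphabet(string $text): string
{
    // preg_match_all() returns the number of matches found.
    $cyrillic = preg_match_all('/\p{Cyrillic}/u', $text);
    $latin = preg_match_all('/\p{Latin}/u', $text);

    if ($cyrillic === $latin) {
        return 'unknown'; // no letters at all, or an even mix
    }

    return $cyrillic > $latin ? 'cyrillic' : 'latin';
}

echo sketchDetectAlphabet('Ovo je latinica') . "\n"; // latin
echo sketchDetectAlphabet('Ово је ћирилица') . "\n"; // cyrillic
echo sketchDetectAlphabet('12345') . "\n";           // unknown
```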
The MIT License (MIT). Please see License File for more information.