- Arabic is one of the most widely spoken languages in the world. Although the use of Arabic on the Internet has grown, the Arabic NLP community still lags behind those of other languages. One of the features that distinguishes Arabic is diacritics. Diacritics are short vowel marks that are pronounced but usually omitted from written Arabic text, since Arabic speakers can easily infer them.
- The same word in Arabic can have different meanings and different pronunciations depending on how it is diacritized. Restoring these diacritics in text is very useful for many NLP systems, such as Text-To-Speech (TTS) and machine translation, because diacritics remove ambiguity in both pronunciation and meaning. Here is an example of Arabic text diacritization:
- Built using Python.
- You can view the Data Set that was used to train the model
- Project Report
Real input | Golden Output |
---|---|
ذهب علي إلى الشاطئ | ذَهَبَ عَلِي إِلَى اَلشَّاطِئِ |
- First, install the required packages:
pip install -r requirements.txt
- Folder Structure
├───dataset
├───src
│ ├──utils
│ ├──constants.py
│ ├──evaluation.py
│ ├──featureExtraction.py
│ ├──models.py
│ ├──preprocessing.py
│ └──train.py
├───trained_models
├───requirements.txt
...
- Navigate to the src directory
cd src
- Run the train.py file to train the model
python train.py
- Run the evaluation.py file to test the model
python evaluation.py
- Features are saved in ../trained_models
- Data Cleaning: The first step is always to clean the sentences read from the corpus. We define the basic Arabic letters in all their forms (36 unique letters), the main diacritics of the language (15 unique diacritics), and all punctuation and whitespace characters. Anything other than these characters is filtered out.
- Tokenization: The approach that yielded the best results is to divide the corpus into sentences of a fixed size (a window of length 1000). If a sentence exceeds the window size, we move backward to the nearest preceding space and cut there; the split word becomes the first word of the next sentence, and we continue this way through the corpus. If a sentence is shorter than the window size, we pad the remaining positions so all sentences have the same length.
- Encoding: The last step is to encode each character and diacritic into a specific index defined in our character-to-index and diacritic-to-index dictionaries, transforming letters and diacritics into a numerical form that can be fed to the model (a sketch of the full preprocessing pipeline follows this list).
- Failed Trials: We tried feeding the model not a full sentence but a small sliding window of flexible size, letting us choose how many previous and following words to include.
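The sketch below outlines the three preprocessing steps under illustrative assumptions: the character sets, the helpers `clean`, `split_into_windows`, and `encode`, and the lookup dictionaries are hypothetical stand-ins for what the repository defines in src/constants.py and src/preprocessing.py.

```python
# Illustrative character sets; the real ones live in src/constants.py
# (36 basic Arabic letters; the project counts 15 diacritic classes,
# including shadda combinations, while only the 8 base marks are listed here).
ARABIC_LETTERS = set("ءآأؤإئابةتثجحخدذرزسشصضطظعغفقكلمنهوىي")
DIACRITICS = set("ًٌٍَُِّْ")
PUNCTUATION = set(".،؛:؟!")
WINDOW_SIZE = 1000
PAD_CHAR = " "

def clean(text: str) -> str:
    """Keep only Arabic letters, diacritics, punctuation and whitespace."""
    allowed = ARABIC_LETTERS | DIACRITICS | PUNCTUATION | {" ", "\n", "\t"}
    return "".join(ch for ch in text if ch in allowed)

def split_into_windows(text: str, window: int = WINDOW_SIZE) -> list[str]:
    """Cut the corpus into chunks of at most `window` characters, backing up
    to the previous space so no word is split, then pad short chunks."""
    sentences = []
    start = 0
    while start < len(text):
        end = min(start + window, len(text))
        if end < len(text):
            cut = text.rfind(" ", start, end)
            if cut > start:          # back up to the last space before the cut
                end = cut
        sentences.append(text[start:end].ljust(window, PAD_CHAR))
        start = end + 1 if end < len(text) else end
    return sentences

# Hypothetical lookup tables; the project builds its own in preprocessing.
char2idx = {ch: i + 1 for i, ch in enumerate(sorted(ARABIC_LETTERS | PUNCTUATION | {" "}))}
diac2idx = {d: i + 1 for i, d in enumerate(sorted(DIACRITICS))}

def encode(sentence: str) -> tuple[list[int], list[int]]:
    """Map each letter to its index and the diacritic that follows it (0 = none)."""
    chars, diacs = [], []
    for ch in sentence:
        if ch in DIACRITICS:
            if diacs:                # attach the mark to the preceding letter
                diacs[-1] = diac2idx[ch]
        else:
            chars.append(char2idx.get(ch, 0))
            diacs.append(0)
    return chars, diacs
```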
- Trainable Embeddings (Used): Here we use the Embedding layer provided by torch.nn, which gives us trainable embeddings at the character level. This layer transforms discrete input elements, in our case character indices, into continuous vector representations; each unique input element is associated with a learnable vector, and these vectors capture semantic relationships between the elements (a short sketch follows this list).
- AraVec CBOW Word Embeddings: AraVec is an open-source project that offers pre-trained distributed word representations, specifically designed for the Arabic natural language processing (NLP) research community.
- AraBERT: Pre-trained Transformers for Arabic Language Understanding and Generation (Arabic BERT, Arabic GPT2, Arabic ELECTRA)
- Bag Of Words: We use the CountVectorizer class, trained on the whole corpus. After fitting the data, the vectorizer exposes the feature names, which we save to a CSV file representing the bag-of-words model. We can index any word to get its corresponding vector, which describes its count in the corpus but carries no information about the word's position.
- TF-IDF: The TfidfVectorizer initializes our model, and we choose to turn off lowercasing. After fitting and transforming the input data, we extract the feature names; these form our word set and become the column headings of the output CSV file (see the second sketch after this list).
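A minimal sketch of the trainable character embeddings, assuming a hypothetical vocabulary size and embedding dimension (the actual values are configured in the project's model code):

```python
import torch
import torch.nn as nn

# Hypothetical sizes: ~40 character indices plus a padding index,
# embedded into 128-dimensional vectors.
VOCAB_SIZE = 40
EMBEDDING_DIM = 128

embedding = nn.Embedding(num_embeddings=VOCAB_SIZE, embedding_dim=EMBEDDING_DIM, padding_idx=0)

# A batch of encoded sentences (character indices), shape (batch, seq_len).
char_ids = torch.tensor([[5, 12, 3, 0, 0],
                         [7, 7, 21, 9, 0]])
vectors = embedding(char_ids)   # shape (2, 5, 128), trained jointly with the rest of the model
print(vectors.shape)
```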
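A minimal sketch of the bag-of-words and TF-IDF features using scikit-learn, with a toy corpus and hypothetical output file names standing in for the project's real data and paths:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus standing in for the full dataset.
corpus = ["ذهب علي إلى الشاطئ", "ذهب الولد إلى المدرسة"]

# Bag of words: per-word counts, no positional information.
bow = CountVectorizer()
bow_matrix = bow.fit_transform(corpus)
pd.DataFrame(bow_matrix.toarray(),
             columns=bow.get_feature_names_out()).to_csv("bag_of_words.csv", index=False)

# TF-IDF: same vocabulary as column headings, weighted by term frequency
# and inverse document frequency; lowercasing is turned off.
tfidf = TfidfVectorizer(lowercase=False)
tfidf_matrix = tfidf.fit_transform(corpus)
pd.DataFrame(tfidf_matrix.toarray(),
             columns=tfidf.get_feature_names_out()).to_csv("tfidf.csv", index=False)
```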
Fitting the training data and labels into a 5-layer bidirectional LSTM gave us 97% accuracy.
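A sketch of what such a model could look like in PyTorch; the layer sizes, dropout, and the class name `DiacritizerBiLSTM` are illustrative assumptions rather than the exact trained configuration:

```python
import torch.nn as nn

class DiacritizerBiLSTM(nn.Module):
    """Character-level BiLSTM tagger: predicts one diacritic class per input character."""

    def __init__(self, vocab_size=40, embedding_dim=128, hidden_dim=256, num_classes=15 + 1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=5,
                            bidirectional=True, batch_first=True, dropout=0.3)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, char_ids):
        x = self.embedding(char_ids)   # (batch, seq_len, embedding_dim)
        x, _ = self.lstm(x)            # (batch, seq_len, 2 * hidden_dim)
        return self.classifier(x)      # per-character diacritic logits
```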
- Abdelrahman Hamdy
- Beshoy Morad
- Abdelrahman Noaman
- Eslam Ashraf
Note: This software is licensed under the MIT License. See License for more information. © AbdelrahmanHamdyy.