## How to help Kurdish language processing?

One of our main objectives in this project is to promote collaborative projects with open-source outcomes. If you would like to volunteer and help the Kurdish language, there are three ways to do so:

1. If you are a native Kurdish speaker with general knowledge of Kurdish and are comfortable working with a computer, contributing to collaboratively-curated resources is the best starting point, particularly:
    - Wîkîferheng - the Kurdish Wiktionary
    - Wikipedia in Sorani and in Kurmanji
2. If you have expertise in Kurdish linguistics, you can take part in annotation tasks. A basic understanding of computational linguistics is a plus but not a must. Please get in touch by joining the KurdishNLP community on Gitter. Depending on the task, our collaborations often lead to a scientific paper. Please check the following repositories to find out about some of our previous projects:
    - Kurdish tokenization
    - Kurdish Hunspell
    - Kurdish transliteration
3. If neither 1 nor 2 applies to you but you have basic knowledge of Kurdish, particularly writing in Kurdish, you are invited to create content online. You can start a blog or tweet in Kurdish. After all, every single person is a contributor as well.

In any case, please follow this project and introduce it to your friends. Test the tool and raise issues so that we can fix them.

## License

This project is created by Sina Ahmadi and is publicly available under a Creative Commons Attribution-ShareAlike 4.0 International Public License (https://creativecommons.org/licenses/by-sa/4.0/).
## About the current version

Please note that KLPT is under development and some functionalities will appear in future versions. You can follow the progress of each task in the Projects section. The current version includes the following tasks:
| Modules | Tasks | Sorani (ckb) | Kurmanji (kmr) |
|---|---|---|---|
| `preprocess` | normalization | ✓ (v0.1.0) | ✓ (v0.1.0) |
| | standardization | ✓ (v0.1.0) | ✓ (v0.1.0) |
| | unification of numerals | ✓ (v0.1.0) | ✓ (v0.1.0) |
| `tokenize` | word tokenization | ✓ (v0.1.0) | ✓ (v0.1.0) |
| | MWE tokenization | ✓ (v0.1.0) | ✓ (v0.1.0) |
| | sentence tokenization | ✓ (v0.1.0) | ✓ (v0.1.0) |
| `transliterate` | Arabic to Latin | ✓ (v0.1.0) | ✓ (v0.1.0) |
| | Latin to Arabic | ✓ (v0.1.0) | ✓ (v0.1.0) |
| | detection of u/w and î/y | ✓ (v0.1.0) | ✓ (v0.1.0) |
| | detection of Bizroke (i) | ✗ | ✗ |
| `stem` | morphological analysis | ✓ (v0.1.0) | ✗ |
| | morphological generation | ✓ (v0.1.0) | ✗ |
| | stemming | ✗ | ✗ |
| | lemmatization | ✗ | ✗ |
| | spell error detection and correction | ✓ (v0.1.0) | ✗ |
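Each module in the table corresponds to a class importable from the package. As a quick orientation, here is a minimal sketch of the imports for the three modules documented in the User Guide below (the `transliterate` module is listed in the table as well, but its class interface is not documented on this page, so it is omitted here):

```python
# imports for the documented modules; see the User Guide sections below
from klpt.preprocess import Preprocess
from klpt.tokenize import Tokenize
from klpt.stem import Stem
```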
## Become a sponsor

Please consider donating to the project. Data annotation and resource creation require a tremendous amount of time and linguistic expertise. Even a small donation will make a difference. You can become a sponsor to accompany me in this journey and help the Kurdish language find a better place among the natural languages on the Web. Depending on your support:

- You can be an official sponsor
- You will get a GitHub sponsor badge on your profile
- If you have any questions, I will focus on them
- If you want, I will add your name or company logo on the front page of your preferred project
- Your contribution will be acknowledged in one of my future papers in a field of your choice

### Our sponsors

Be the first one! 🙂

| Name/company | Donation ($) | URL |
|---|---|---|
| | | |
## Welcome / Hûn bi xêr hatin / بە خێر بێن! 🙂

### Introduction

Language technology is an increasingly important field in our information era, one that depends both on our knowledge of human language and on computational methods to process it. Unlike the latter, which progresses constantly as new methods and more efficient techniques are invented, the processability of human languages does not evolve at the same pace. This is particularly the case for languages with scarce resources and limited grammars, also known as less-resourced languages.

Despite a plethora of performant tools and frameworks for natural language processing (NLP), such as NLTK, Stanza and spaCy, progress on less-resourced languages is often hindered not only by the lack of basic tools and resources but also by the inaccessibility of previous studies under an open-source licence. This is particularly the case for Kurdish.

### Kurdish language

Kurdish is a less-resourced Indo-European language spoken by 20-30 million speakers in the Kurdish regions of Turkey, Iraq, Iran and Syria, as well as among the Kurdish diaspora around the world. It is mainly spoken in four dialects (also referred to as languages):
- Northern Kurdish (or Kurmanji) `kmr`
- Central Kurdish (or Sorani) `ckb`
- Southern Kurdish `sdh`
- Laki `lki`
Kurdish has historically been written in various scripts, namely Cyrillic, Armenian, Latin and Arabic, among which the latter two are still widely in use. Efforts to standardize the Kurdish alphabets and orthographies have not succeeded in being universally adopted by Kurdish speakers in all regions. As such, Kurmanji is mostly written in the Latin-based script, while Sorani, Southern Kurdish and Laki are mostly written in the Arabic-based script.
### KLPT - The Kurdish Language Processing Toolkit

KLPT - the Kurdish Language Processing Toolkit - is an NLP toolkit for the Kurdish language. The current version (0.1) comes with four core modules, namely `preprocess`, `stem`, `transliterate` and `tokenize`, and addresses basic language processing tasks such as text preprocessing, stemming, tokenization, spell error detection and correction, and morphological analysis for the Sorani and Kurmanji dialects of Kurdish. More importantly, it is an open-source project!

To find out more about how to use the tool, please check the "User Guide" section of this website.
### Cite this project

Please consider citing this paper if you use any part of the data or the tool (bib file):

```bibtex
@inproceedings{ahmadi2020klpt,
    title = "{KLPT} {--} {K}urdish Language Processing Toolkit",
    author = "Ahmadi, Sina",
    booktitle = "Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.nlposs-1.11",
    pages = "72--84"
}
```
You can also watch the presentation of this paper at https://slideslive.com/38939750/klpt-kurdish-language-processing-toolkit.

### License

Kurdish Language Processing Toolkit by Sina Ahmadi is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License, which means:

- You are free to share, copy and redistribute the material in any medium or format and also adapt, remix, transform, and build upon the material for any purpose, even commercially.
- You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
"); - } -} - -function doSearch () { - var query = document.getElementById('mkdocs-search-query').value; - if (query.length > min_search_length) { - if (!window.Worker) { - displayResults(search(query)); - } else { - searchWorker.postMessage({query: query}); - } - } else { - // Clear results for short queries - displayResults([]); - } -} - -function initSearch () { - var search_input = document.getElementById('mkdocs-search-query'); - if (search_input) { - search_input.addEventListener("keyup", doSearch); - } - var term = getSearchTermFromLocation(); - if (term) { - search_input.value = term; - doSearch(); - } -} - -function onWorkerMessage (e) { - if (e.data.allowSearch) { - initSearch(); - } else if (e.data.results) { - var results = e.data.results; - displayResults(results); - } else if (e.data.config) { - min_search_length = e.data.config.min_search_length-1; - } -} - -if (!window.Worker) { - console.log('Web Worker API not supported'); - // load index in main thread - $.getScript(joinUrl(base_url, "search/worker.js")).done(function () { - console.log('Loaded worker'); - init(); - window.postMessage = function (msg) { - onWorkerMessage({data: msg}); - }; - }).fail(function (jqxhr, settings, exception) { - console.error('Could not load worker.js'); - }); -} else { - // Wrap search in a web worker - var searchWorker = new Worker(joinUrl(base_url, "search/worker.js")); - searchWorker.postMessage({init: true}); - searchWorker.onmessage = onWorkerMessage; -} diff --git a/site/search/search_index.json b/site/search/search_index.json deleted file mode 100644 index 086f62f..0000000 --- a/site/search/search_index.json +++ /dev/null @@ -1 +0,0 @@ -{"config":{"lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"Welcome / H\u00fbn bi x\u00ear hatin / \u0628\u06d5 \u062e\u06ce\u0631 \u0628\u06ce\u0646! \ud83d\ude42 Introduction Language technology is an increasingly important field in our information era which is dependent on our knowledge of the human language and computational methods to process it. Unlike the latter which undergoes constant progress with new methods and more efficient techniques being invented, the processability of human languages does not evolve with the same pace. This is particularly the case of languages with scarce resources and limited grammars, also known as less-resourced languages . Despite a plethora of performant tools and specific frameworks for natural language processing (NLP), such as NLTK , Stanza and spaCy , the progress with respect to less-resourced languages is often hindered by not only the lack of basic tools and resources but also the accessibility of the previous studies under an open-source licence. This is particularly the case of Kurdish. Kurdish Language Kurdish is a less-resourced Indo-European language which is spoken by 20-30 million speakers in the Kurdish regions of Turkey, Iraq, Iran and Syria and also, among the Kurdish diaspora around the world. It is mainly spoken in four dialects (also referred to as languages): Northern Kurdish (or Kurmanji) kmr Central Kurdish (or Sorani) ckb Southern Kurdish sdh Laki lki Kurdish has been historically written in various scripts, namely Cyrillic, Armenian, Latin and Arabic among which the latter two are still widely in use. Efforts in standardization of the Kurdish alphabets and orthographies have not succeeded to be globally followed by all Kurdish speakers in all regions. 
As such, the Kurmanji dialect is mostly written in the Latin-based script while the Sorani, Southern Kurdish and Laki are mostly written in the Arabic-based script. KLPT - The Kurdish Language Processing Toolkit KLPT - the Kurdish Language Processing Toolkit is an NLP toolkit for the Kurdish language . The current version (0.1) comes with four core modules, namely preprocess , stem , transliterate and tokenize , and addresses basic language processing tasks such as text preprocessing, stemming, tokenziation, spell error detection and correction, and morphological analysis for the Sorani and Kurmanji dialects of Kurdish. More importantly, it is an open-source project ! To find out more about how to use the tool, please check the \"User Guide\" section of this website. Cite this project Please consider citing this paper , if you use any part of the data or the tool ( bib file ): @inproceedings{ahmadi2020klpt, title = \"{KLPT} {--} {K}urdish Language Processing Toolkit\", author = \"Ahmadi, Sina\", booktitle = \"Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)\", month = nov, year = \"2020\", address = \"Online\", publisher = \"Association for Computational Linguistics\", url = \"https://www.aclweb.org/anthology/2020.nlposs-1.11\", pages = \"72--84\" } You can also watch the presentation of this paper at https://slideslive.com/38939750/klpt-kurdish-language-processing-toolkit . License Kurdish Language Processing Toolkit by Sina Ahmadi is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License which means: You are free to share , copy and redistribute the material in any medium or format and also adapt, remix, transform, and build upon the material for any purpose, even commercially . You must give appropriate credit , provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use. If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original .","title":"Home"},{"location":"#welcome-hun-bi-xer-hatin","text":"","title":"Welcome / H\u00fbn bi x\u00ear hatin / \u0628\u06d5 \u062e\u06ce\u0631 \u0628\u06ce\u0646! \ud83d\ude42"},{"location":"#introduction","text":"Language technology is an increasingly important field in our information era which is dependent on our knowledge of the human language and computational methods to process it. Unlike the latter which undergoes constant progress with new methods and more efficient techniques being invented, the processability of human languages does not evolve with the same pace. This is particularly the case of languages with scarce resources and limited grammars, also known as less-resourced languages . Despite a plethora of performant tools and specific frameworks for natural language processing (NLP), such as NLTK , Stanza and spaCy , the progress with respect to less-resourced languages is often hindered by not only the lack of basic tools and resources but also the accessibility of the previous studies under an open-source licence. This is particularly the case of Kurdish.","title":"Introduction"},{"location":"#kurdish-language","text":"Kurdish is a less-resourced Indo-European language which is spoken by 20-30 million speakers in the Kurdish regions of Turkey, Iraq, Iran and Syria and also, among the Kurdish diaspora around the world. 
It is mainly spoken in four dialects (also referred to as languages): Northern Kurdish (or Kurmanji) kmr Central Kurdish (or Sorani) ckb Southern Kurdish sdh Laki lki Kurdish has been historically written in various scripts, namely Cyrillic, Armenian, Latin and Arabic among which the latter two are still widely in use. Efforts in standardization of the Kurdish alphabets and orthographies have not succeeded to be globally followed by all Kurdish speakers in all regions. As such, the Kurmanji dialect is mostly written in the Latin-based script while the Sorani, Southern Kurdish and Laki are mostly written in the Arabic-based script.","title":"Kurdish Language"},{"location":"#klpt-the-kurdish-language-processing-toolkit","text":"KLPT - the Kurdish Language Processing Toolkit is an NLP toolkit for the Kurdish language . The current version (0.1) comes with four core modules, namely preprocess , stem , transliterate and tokenize , and addresses basic language processing tasks such as text preprocessing, stemming, tokenziation, spell error detection and correction, and morphological analysis for the Sorani and Kurmanji dialects of Kurdish. More importantly, it is an open-source project ! To find out more about how to use the tool, please check the \"User Guide\" section of this website.","title":"KLPT - The Kurdish Language Processing Toolkit"},{"location":"#cite-this-project","text":"Please consider citing this paper , if you use any part of the data or the tool ( bib file ): @inproceedings{ahmadi2020klpt, title = \"{KLPT} {--} {K}urdish Language Processing Toolkit\", author = \"Ahmadi, Sina\", booktitle = \"Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)\", month = nov, year = \"2020\", address = \"Online\", publisher = \"Association for Computational Linguistics\", url = \"https://www.aclweb.org/anthology/2020.nlposs-1.11\", pages = \"72--84\" } You can also watch the presentation of this paper at https://slideslive.com/38939750/klpt-kurdish-language-processing-toolkit .","title":"Cite this project"},{"location":"#license","text":"Kurdish Language Processing Toolkit by Sina Ahmadi is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License which means: You are free to share , copy and redistribute the material in any medium or format and also adapt, remix, transform, and build upon the material for any purpose, even commercially . You must give appropriate credit , provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use. If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original .","title":"License"},{"location":"about/contributing/","text":"How to help Kurdish language processing? One of our main objectives in this project is to promote collaborative projects with open-source outcomes. If you are generous and passionate to volunteer and help the Kurdish language, there are three ways you can do so: If you are a native Kurdish speaker with general knowledge about Kurdish and are comfortable working with computer, contributing to collaboratively-curated resources is the best starting point, particularly to: W\u00eek\u00eeferheng - the Kurdish Wiktionary Wikipedia in Sorani and in Kurmanji If you have expertise in Kurdish linguistics, you can take part in annotation tasks. Having a basic understanding on computational linguistics is a plus but not a must. 
Please get in touch by joining the KurdishNLP community on Gitter . Our collaborations oftentimes lead to a scientific paper depending on the task. Please check the following repositories to find out about some of our previous projects: Kurdish tokenization Kurdish Hunspell Kurdish transliteration If you are not included in 1 and 2 but have basic knowledge about Kurdish, particularly writing in Kurdish, you are invited to create content online. You can start creating a blog or tweet in Kurdish. After all, every single person is a contributor as well . In any case, please follow this project and introduce it to your friends. Test the tool and raise your issues so that we can fix them.","title":"Contributing"},{"location":"about/contributing/#how-to-help-kurdish-language-processing","text":"One of our main objectives in this project is to promote collaborative projects with open-source outcomes. If you are generous and passionate to volunteer and help the Kurdish language, there are three ways you can do so: If you are a native Kurdish speaker with general knowledge about Kurdish and are comfortable working with computer, contributing to collaboratively-curated resources is the best starting point, particularly to: W\u00eek\u00eeferheng - the Kurdish Wiktionary Wikipedia in Sorani and in Kurmanji If you have expertise in Kurdish linguistics, you can take part in annotation tasks. Having a basic understanding on computational linguistics is a plus but not a must. Please get in touch by joining the KurdishNLP community on Gitter . Our collaborations oftentimes lead to a scientific paper depending on the task. Please check the following repositories to find out about some of our previous projects: Kurdish tokenization Kurdish Hunspell Kurdish transliteration If you are not included in 1 and 2 but have basic knowledge about Kurdish, particularly writing in Kurdish, you are invited to create content online. You can start creating a blog or tweet in Kurdish. After all, every single person is a contributor as well . In any case, please follow this project and introduce it to your friends. Test the tool and raise your issues so that we can fix them.","title":"How to help Kurdish language processing?"},{"location":"about/license/","text":"License This project is created by Sina Ahmadi and is publicly available under a Creative Commons Attribution-ShareAlike 4.0 International Public License https://creativecommons.org/licenses/by-sa/4.0/ .","title":"License"},{"location":"about/license/#license","text":"This project is created by Sina Ahmadi and is publicly available under a Creative Commons Attribution-ShareAlike 4.0 International Public License https://creativecommons.org/licenses/by-sa/4.0/ .","title":"License"},{"location":"about/release-notes/","text":"About the current version Please note that KLPT is under development and some of the functionalities will appear in the future versions. You can find out regarding the progress of each task at the Projects section. 
In the current version, the following tasks are included: Modules Tasks Sorani (ckb) Kurmanji (kmr) preprocess normalization \u2713 (v0.1.0) \u2713 (v0.1.0) standardization \u2713 (v0.1.0) \u2713 (v0.1.0) unification of numerals \u2713 (v0.1.0) \u2713 (v0.1.0) tokenize word tokenization \u2713 (v0.1.0) \u2713 (v0.1.0) MWE tokenization \u2713 (v0.1.0) \u2713 (v0.1.0) sentence tokenization \u2713 (v0.1.0) \u2713 (v0.1.0) transliterate Arabic to Latin \u2713 (v0.1.0) \u2713 (v0.1.0) Latin to Arabic \u2713 (v0.1.0) \u2713 (v0.1.0) Detection of u/w and \u00ee/y \u2713 (v0.1.0) \u2713 (v0.1.0) Detection of Bizroke ( i ) \u2717 \u2717 stem morphological analysis \u2713 (v0.1.0) \u2717 morphological generation \u2713 (v0.1.0) \u2717 stemming \u2717 \u2717 lemmatization \u2717 \u2717 spell error detection and correction \u2713 (v0.1.0) \u2717","title":"Release Notes"},{"location":"about/release-notes/#about-the-current-version","text":"Please note that KLPT is under development and some of the functionalities will appear in the future versions. You can find out regarding the progress of each task at the Projects section. In the current version, the following tasks are included: Modules Tasks Sorani (ckb) Kurmanji (kmr) preprocess normalization \u2713 (v0.1.0) \u2713 (v0.1.0) standardization \u2713 (v0.1.0) \u2713 (v0.1.0) unification of numerals \u2713 (v0.1.0) \u2713 (v0.1.0) tokenize word tokenization \u2713 (v0.1.0) \u2713 (v0.1.0) MWE tokenization \u2713 (v0.1.0) \u2713 (v0.1.0) sentence tokenization \u2713 (v0.1.0) \u2713 (v0.1.0) transliterate Arabic to Latin \u2713 (v0.1.0) \u2713 (v0.1.0) Latin to Arabic \u2713 (v0.1.0) \u2713 (v0.1.0) Detection of u/w and \u00ee/y \u2713 (v0.1.0) \u2713 (v0.1.0) Detection of Bizroke ( i ) \u2717 \u2717 stem morphological analysis \u2713 (v0.1.0) \u2717 morphological generation \u2713 (v0.1.0) \u2717 stemming \u2717 \u2717 lemmatization \u2717 \u2717 spell error detection and correction \u2713 (v0.1.0) \u2717","title":"About the current version"},{"location":"about/sponsors/","text":"Become a sponsor Please consider donating to the project. Data annotation and resource creation requires tremendous amount of time and linguistic expertise. Even a trivial donation will make a difference. You can do so by becoming a sponsor to accompany me in this journey and help the Kurdish language have a better place within other natural languages on the Web. Depending on your support, You can be an official sponsor You will get a GitHub sponsor badge on your profile If you have any questions, I will focus on it If you want, I will add your name or company logo on the front page of your preferred project Your contribution will be acknowledged in one of my future papers in a field of your choice Our sponsors: Be the first one! \ud83d\ude42 Name/company donation ($) URL","title":"Sponsors"},{"location":"about/sponsors/#become-a-sponsor","text":"Please consider donating to the project. Data annotation and resource creation requires tremendous amount of time and linguistic expertise. Even a trivial donation will make a difference. You can do so by becoming a sponsor to accompany me in this journey and help the Kurdish language have a better place within other natural languages on the Web. 
## Install KLPT

KLPT is implemented in Python and requires basic knowledge of programming, particularly of the Python language. Find out more about Python at https://www.python.org/.

### Requirements

- Operating system: macOS / OS X · Linux · Windows (Cygwin, MinGW, Visual Studio)
- Python version: Python 3.5+
- Package managers: pip
- cyhunspell >= 2.0.1

### pip

Using pip, KLPT releases are available as source packages and binary wheels. Please make sure that a compatible Python version is installed:

```bash
pip install klpt
```

All the data files, including lexicons and morphological rules, are installed with the package. Although KLPT does not depend on any other NLP toolkit, there is one important requirement, particularly for the `stem` module: cyhunspell, which should be installed with version >= 2.0.1.

### Import klpt

Once the package is installed, you can import the toolkit as follows:

```python
import klpt
```

As a principle, the following parameters are used widely across the toolkit:

- `dialect`: the name of the dialect, as Sorani or Kurmanji (ISO 639-3 codes will also be added)
- `script`: the script of your input text, as "Arabic" or "Latin"
- `numeral`: the type of the numerals, as
    - Arabic [١٢٣٤٥٦٧٨٩٠]
    - Farsi [۱۲۳۴۵۶۷۸۹۰]
    - Latin [1234567890]
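As a quick illustration of these parameters, here is a minimal sketch instantiating one of the module classes documented below; the calls are taken from the `preprocess` examples later in this guide:

```python
from klpt.preprocess import Preprocess

# a preprocessor for Sorani text in the Arabic script, converting
# all numerals to their Latin form
preprocessor_ckb = Preprocess("Sorani", "Arabic", numeral="Latin")

print(preprocessor_ckb.unify_numerals("٢٠٢٠"))  # -> 2020
```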
## preprocess package

This module deals with normalizing scripts and orthographies by using writing conventions based on dialects and scripts. The goal is not to correct the orthography but to normalize the text in terms of encoding and common writing rules. The input encoding should be UTF-8 only. To this end, three functions are provided:

- `normalize`: deals with different encodings and unifies characters based on dialects and scripts
- `standardize`: given a normalized text, returns standardized text based on the Kurdish orthographies, following recommendations for Kurmanji and Sorani
- `unify_numerals`: converts the various types of numerals used in Kurdish texts

It is recommended that the output of this module be used as the input of subsequent tasks in an NLP pipeline.

Examples:

```python
>>> from klpt.preprocess import Preprocess
>>> preprocessor_ckb = Preprocess("Sorani", "Arabic", numeral="Latin")
>>> preprocessor_ckb.normalize("لە ســـاڵەکانی ١٩٥٠دا")
'لە ساڵەکانی 1950دا'
>>> preprocessor_ckb.standardize("راستە لەو ووڵاتەدا")
'ڕاستە لەو وڵاتەدا'
>>> preprocessor_ckb.unify_numerals("٢٠٢٠")
'2020'
>>> preprocessor_kmr = Preprocess("Kurmanji", "Latin")
>>> preprocessor_kmr.standardize("di sala 2018-an")
'di sala 2018an'
>>> preprocessor_kmr.standardize("hêviya")
'hêvîya'
```

The preprocessing rules are provided in `data/preprocess_map.json`.

### `__init__(self, dialect, script, numeral="Latin")`

Initialization of the `Preprocess` class.

Arguments:

- `dialect` (str): the name of the dialect or its ISO 639-3 code
- `script` (str): the name of the script
- `numeral` (str): the type of the numerals (default: `"Latin"`)

Source code in `klpt/preprocess.py`:

```python
def __init__(self, dialect, script, numeral="Latin"):
    with open(klpt.get_data("data/preprocess_map.json")) as preprocess_file:
        self.preprocess_map = json.load(preprocess_file)

    configuration = Configuration({"dialect": dialect, "script": script, "numeral": numeral})
    self.dialect = configuration.dialect
    self.script = configuration.script
    self.numeral = configuration.numeral
```
### `normalize(self, text)`

Text normalization. This function deals with different encodings and unifies characters based on dialects and scripts as follows:

- Sorani-Arabic:
    - replace frequent Arabic characters with their equivalent Kurdish ones, e.g. "ي" by "ی" and "ك" by "ک"
    - replace "ه" followed by a zero-width non-joiner (ZWNJ, U+200C) with "ە", removing the ZWNJ ("ره‌زبه‌ر" is converted to "رەزبەر"); ZWNJ in HTML is also taken into account
    - replace "هـ" with "ھ" (U+06BE, ARABIC LETTER HEH DOACHASHMEE)
    - remove Kashida "ـ"
    - replace "ھ" in the middle of a word by "ه" (U+0647)
    - replace different types of y, such as ARABIC LETTER ALEF MAKSURA (U+0649)

It should be noted that the order of the replacements is important. Check the provided files for further details and test cases.

Arguments: `text` (str): a string. Returns: str: normalized text.

Source code in `klpt/preprocess.py`:

```python
def normalize(self, text):
    temp_text = " " + self.unify_numerals(text) + " "
    for normalization_type in ["universal", self.dialect]:
        for rep in self.preprocess_map["normalizer"][normalization_type][self.script]:
            rep_tar = self.preprocess_map["normalizer"][normalization_type][self.script][rep]
            temp_text = re.sub(rf"{rep}", rf"{rep_tar}", temp_text, flags=re.I)
    return temp_text.strip()
```

### `preprocess(self, text)`

A single function for normalization, standardization and unification of numerals.

Arguments: `text` (str): a string. Returns: str: preprocessed text.

Source code in `klpt/preprocess.py`:

```python
def preprocess(self, text):
    return self.unify_numerals(self.standardize(self.normalize(text)))
```
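Since `preprocess` chains the three steps, a single call can replace separate calls to `normalize`, `standardize` and `unify_numerals`. A minimal sketch, reusing the Sorani example from the doctests above:

```python
from klpt.preprocess import Preprocess

preprocessor_ckb = Preprocess("Sorani", "Arabic", numeral="Latin")

# one call runs normalization, standardization and numeral unification in order
cleaned = preprocessor_ckb.preprocess("لە ســـاڵەکانی ١٩٥٠دا")
print(cleaned)
```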
### `standardize(self, text)`

Standardization of Kurdish orthographies. Given a normalized text, it returns standardized text based on the Kurdish orthographies:

- Sorani-Arabic:
    - replace the alveolar flap ر (/ɾ/) at the beginning of a word by the alveolar trill ڕ (/r/)
    - replace double rr and ll with ř and ł respectively
- Kurmanji-Latin:
    - replace "-an" or "'an" in dates and numerals ("di sala 2018'an" and "di sala 2018-an" -> "di sala 2018an")

Open issues:

- replace " وە " by " و "? This is not always possible, as in "min bo we" (ریزگـرتنا من بو وە نە ئە وە ئــە ز)
- "pirtükê": "pirtûkê"?
- Should ı (LATIN SMALL LETTER DOTLESS I, U+0131) be replaced by i?

Arguments: `text` (str): a string. Returns: str: standardized text.

Source code in `klpt/preprocess.py`:

```python
def standardize(self, text):
    temp_text = " " + self.unify_numerals(text) + " "
    for standardization_type in [self.dialect]:
        for rep in self.preprocess_map["standardizer"][standardization_type][self.script]:
            rep_tar = self.preprocess_map["standardizer"][standardization_type][self.script][rep]
            temp_text = re.sub(rf"{rep}", rf"{rep_tar}", temp_text, flags=re.I)
    return temp_text.strip()
```

### `unify_numerals(self, text)`

Convert numerals to the desired type. There are three types of numerals:

- Arabic [١٢٣٤٥٦٧٨٩٠]
- Farsi [۱۲۳۴۵۶۷۸۹۰]
- Latin [1234567890]

Arguments: `text` (str): a string. Returns: str: text with unified numerals.

Source code in `klpt/preprocess.py`:

```python
def unify_numerals(self, text):
    for i, j in self.preprocess_map["normalizer"]["universal"]["numerals"][self.numeral].items():
        text = text.replace(i, j)
    return text
```
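To make the rule application concrete, here is a self-contained sketch of the same regex-substitution loop over a toy rule map; the two rules are taken from the `normalize` description above, while the real rules ship with the package in `data/preprocess_map.json`:

```python
import re

# toy stand-in for one branch of data/preprocess_map.json
rules = {"ك": "ک", "ي": "ی"}

def apply_rules(text, rules):
    temp_text = " " + text + " "
    for rep, rep_tar in rules.items():
        # the same re.sub pattern used by normalize and standardize
        temp_text = re.sub(rf"{rep}", rf"{rep_tar}", temp_text, flags=re.I)
    return temp_text.strip()

print(apply_rules("كوردي", rules))  # -> کوردی
```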
## stem package

The Stem module deals with various tasks, mainly through the following functions:

- `check_spelling`: spell error detection
- `correct_spelling`: spell error correction
- `analyze`: morphological analysis

Please note that only Sorani is supported by this module in the current version. The module is based on the Kurdish Hunspell project.
Examples:

```python
>>> from klpt.stem import Stem
>>> stemmer = Stem("Sorani", "Arabic")
>>> stemmer.check_spelling("سوتاندبووت")
False
>>> stemmer.correct_spelling("سوتاندبووت")
(False, ['ستاندبووت', 'سووتاندبووت', 'سووڕاندبووت', 'ڕووتاندبووت', 'فەوتاندبووت', 'بووژاندبووت'])
>>> stemmer.analyze("دیتبامن")
[{'pos': 'verb', 'description': 'past_stem_transitive_active', 'base': 'دیت', 'terminal_suffix': 'بامن'}]
```

### `analyze(self, word_form)`

Morphological analysis of a given word. More details regarding Kurdish morphological analysis can be found at https://github.com/sinaahmadi/KurdishHunspell. The analysis is returned as a dictionary as follows:

- "pos": the part-of-speech of the word-form according to the Universal Dependencies tag set
- "description": the `is` flag
- "terminal_suffix": anything except the `ts` flag
- "formation": if the `ds` flag is set, its value is assigned to "description" and the value of "formation" is set to "derivational". Although the majority of our morphological rules cover inflectional forms, it is not accurate to say all of them are inflectional; therefore, we only set this value to "derivational" where we are sure.
- "base": the `ts` flag. The definition of terminal suffix is a bit tricky in Hunspell. According to the Hunspell documentation, "Terminal suffix fields are inflectional suffix fields 'removed' by additional (not terminal) suffixes". In other words, the `ts` flag in Hunspell represents whatever is left after stripping all affixes; therefore, it is the morphological base.

If the input cannot be analyzed morphologically, an empty list is returned.

Arguments: `word_form` (str): a single word-form. Raises: TypeError if the input is not a string. Returns: list(dict): a list of all possible morphological analyses according to the defined morphological rules.
Source code in `klpt/stem.py`:

```python
def analyze(self, word_form):
    if not isinstance(word_form, str):
        raise TypeError("Only a word (str) is allowed.")
    # given the morphological analysis of a word-form with Hunspell flags,
    # extract the relevant information and return a dictionary
    word_analysis = list()
    for analysis in list(self.huns.analyze(word_form)):
        analysis_dict = dict()
        for item in analysis.split():
            if ":" not in item:
                continue
            if item.split(":")[1] == "ts":
                # the ts flag exceptionally appears after the value, as value:key, in the Hunspell output
                analysis_dict["base"] = item.split(":")[0]
                # anything except the terminal suffix is considered to be the base
                analysis_dict[self.hunspell_flags[item.split(":")[1]]] = word_form.replace(item.split(":")[0], "")
            elif item.split(":")[0] in self.hunspell_flags.keys():
                # assign the key:value pairs from the Hunspell string output to the dictionary output;
                # for the ds flag, add derivational as the formation type, otherwise inflection
                if item.split(":")[0] == "ds":
                    analysis_dict[self.hunspell_flags[item.split(":")[0]]] = "derivational"
                    analysis_dict[self.hunspell_flags["is"]] = item.split(":")[1]
                else:
                    analysis_dict[self.hunspell_flags[item.split(":")[0]]] = item.split(":")[1]

        # if there is no value assigned to the ts flag, the terminal suffix is a zero-morpheme 0
        if self.hunspell_flags["ts"] not in analysis_dict or analysis_dict[self.hunspell_flags["ts"]] == "":
            analysis_dict[self.hunspell_flags["ts"]] = "0"

        word_analysis.append(analysis_dict)
    return word_analysis
```

### `check_spelling(self, word)`

Check the spelling of a word.

Arguments: `word` (str): input word to be spell-checked. Raises: TypeError if the input is not a string. Returns: bool: True if the spelling is correct, False otherwise.

Source code in `klpt/stem.py`:

```python
def check_spelling(self, word):
    if not isinstance(word, str):
        raise TypeError("Only a word (str) is allowed.")
    return self.huns.spell(word)
```
### `correct_spelling(self, word)`

Correct spelling errors if the input word is incorrect. It returns a tuple whose first element indicates the correctness of the word (True if correct, False if incorrect). If the word is correct, the suggestion list is empty, as in (True, []). If the word is incorrect, suggestions are provided as a list in the second element, as in (False, [...]); if no suggestion is available, that list is empty.

Arguments: `word` (str): input word to be spell-checked. Raises: TypeError if the input is not a string. Returns: tuple (bool, list).

Source code in `klpt/stem.py`:

```python
def correct_spelling(self, word):
    if not isinstance(word, str):
        raise TypeError("Only a word (str) is allowed.")
    if self.check_spelling(word):
        return (True, [])
    return (False, list(self.huns.suggest(word)))
```
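To see how these functions work together, here is a minimal sketch of a spell-checking loop, assuming the `Stem` instance from the examples above; the words are taken from the doctests:

```python
from klpt.stem import Stem

stemmer = Stem("Sorani", "Arabic")

for word in ["سوتاندبووت", "دیتبامن"]:
    correct, suggestions = stemmer.correct_spelling(word)
    if correct:
        # the word is known; inspect its possible morphological analyses
        print(word, stemmer.analyze(word))
    else:
        print(word, "is misspelled; suggestions:", suggestions)
```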
hunspell_flags [ \"ts\" ] not in analysis_dict or analysis_dict [ self . hunspell_flags [ \"ts\" ]] == \"\" : analysis_dict [ self . hunspell_flags [ \"ts\" ]] = \"0\" word_analysis . append ( analysis_dict ) return word_analysis","title":"analyze()"},{"location":"user-guide/stem/#klpt.stem.Stem.check_spelling","text":"Check spelling of a word Parameters: Name Type Description Default word str input word to be spell-checked required Exceptions: Type Description TypeError only string as input Returns: Type Description bool True if the spelling is correct, False if the spelling is incorrect Source code in klpt/stem.py def check_spelling ( self , word ): \"\"\"Check spelling of a word Args: word (str): input word to be spell-checked Raises: TypeError: only string as input Returns: bool: True if the spelling is correct, False if the spelling is incorrect \"\"\" if not isinstance ( word , str ): raise TypeError ( \"Only a word (str) is allowed.\" ) else : return self . huns . spell ( word )","title":"check_spelling()"},{"location":"user-guide/stem/#klpt.stem.Stem.correct_spelling","text":"Correct spelling errors if the input word is incorrect. It returns a tuple where the first element indicates the correctness of the word (True if correct, False if incorrect). If the input word is incorrect, suggestions are provided in a list as the second element of the tuple, as (False, []). If no suggestion is available, the list is returned empty as (True, []). Parameters: Name Type Description Default word str input word to be spell-checked required Exceptions: Type Description TypeError only string as input Returns: Type Description tuple (boolean, list) Source code in klpt/stem.py def correct_spelling ( self , word ): \"\"\" Correct spelling errors if the input word is incorrect. It returns a tuple where the first element indicates the correctness of the word (True if correct, False if incorrect). If the input word is incorrect, suggestions are provided in a list as the second element of the tuple, as (False, []). If no suggestion is available, the list is returned empty as (True, []). Args: word (str): input word to be spell-checked Raises: TypeError: only string as input Returns: tuple (boolean, list) \"\"\" if not isinstance ( word , str ): raise TypeError ( \"Only a word (str) is allowed.\" ) else : if self . check_spelling ( word ): return ( True , []) return ( False , list ( self . huns . suggest ( word )))","title":"correct_spelling()"},{"location":"user-guide/tokenize/","text":"tokenize package This module focuses on the tokenization of both Kurmanji and Sorani dialects of Kurdish with the following functions: word_tokenize : tokenization of texts into tokens (both multi-word expressions and single-word tokens). mwe_tokenize : tokenization of texts by only taking compound forms into account sent_tokenize : tokenization of texts into sentences The module is based on the Kurdish tokenization project . Examples: >>> from klpt.tokenize import Tokenize >>> tokenizer = Tokenize(\"Kurmanji\", \"Latin\") >>> tokenizer.word_tokenize(\"ji bo fort\u00ea xwe av\u00eatin\") ['\u2581ji\u2581', 'bo', '\u2581\u2581fort\u00ea\u2012xwe\u2012av\u00eatin\u2581\u2581'] >>> tokenizer.mwe_tokenize(\"bi serok\u00ea huk\u00fbmeta her\u00eama Kurdistan\u00ea Prof. Salih re saz kir.\") 'bi serok\u00ea huk\u00fbmeta her\u00eama Kurdistan\u00ea Prof . Salih re saz kir .' 
>>> tokenizer_ckb = Tokenize(\"Sorani\", \"Arabic\") >>> tokenizer_ckb.word(\"\u0628\u06d5 \u0647\u06d5\u0645\u0648\u0648 \u0647\u06d5\u0645\u0648\u0648\u0627\u0646\u06d5\u0648\u06d5 \u0695\u06ce\u06a9 \u06a9\u06d5\u0648\u062a\u0646\") ['\u2581\u0628\u06d5\u2581', '\u2581\u0647\u06d5\u0645\u0648\u0648\u2581', '\u0647\u06d5\u0645\u0648\u0648\u0627\u0646\u06d5\u0648\u06d5', '\u2581\u2581\u0695\u06ce\u06a9\u2012\u06a9\u06d5\u0648\u062a\u0646\u2581\u2581'] mwe_tokenize ( self , sentence , separator = '\u2581\u2581' , in_separator = '\u2012' , punct_marked = False , keep_form = False ) Multi-word expression tokenization Parameters: Name Type Description Default sentence str sentence to be split by multi-word expressions required separator str a specific token to specify a multi-word expression. By default two \u2581 (\u2581\u2581) are used for this purpose. '\u2581\u2581' in_separator str a specific token to specify the composing parts of a multi-word expression. By default a dash - is used for this purpose. '\u2012' keep_form boolean if set to True, the original form of the multi-word expression is returned the same way provided in the input. On the other hand, if set to False, the lemma form is used where the parts are delimited by a dash \u2012, as in \"dab\u2012\u00fb\u2012ner\u00eet\" False Returns: Type Description str sentence containing d multi-word expressions using the separator Source code in klpt/tokenize.py def mwe_tokenize ( self , sentence , separator = \"\u2581\u2581\" , in_separator = \"\u2012\" , punct_marked = False , keep_form = False ): \"\"\" Multi-word expression tokenization Args: sentence (str): sentence to be split by multi-word expressions separator (str): a specific token to specify a multi-word expression. By default two \u2581 (\u2581\u2581) are used for this purpose. in_separator (str): a specific token to specify the composing parts of a multi-word expression. By default a dash - is used for this purpose. keep_form (boolean): if set to True, the original form of the multi-word expression is returned the same way provided in the input. On the other hand, if set to False, the lemma form is used where the parts are delimited by a dash \u2012, as in \"dab\u2012\u00fb\u2012ner\u00eet\" Returns: str: sentence containing d multi-word expressions using the separator \"\"\" sentence = \" \" + sentence + \" \" if not punct_marked : # find punctuation marks and add a space around for punct in self . tokenize_map [ \"word_tokenize\" ][ self . dialect ][ self . script ][ \"punctuation\" ]: if punct in sentence : sentence = sentence . replace ( punct , \" \" + punct + \" \" ) # look for compound words and delimit them by double the separator for compound_lemma in self . mwe_lexicon : compound_lemma_context = \" \" + compound_lemma + \" \" if compound_lemma_context in sentence : if keep_form : sentence = sentence . replace ( compound_lemma_context , \" \u2581\u2581\" + compound_lemma + \"\u2581\u2581 \" ) else : sentence = sentence . replace ( compound_lemma_context , \" \u2581\u2581\" + compound_lemma . replace ( \"-\" , in_separator ) + \"\u2581\u2581 \" ) # check the possible word forms available for each compound lemma in the lex files, too # Note: compound forms don't have any hyphen or separator in the lex files for compound_form in self . mwe_lexicon [ compound_lemma ][ \"token_forms\" ]: compound_form_context = \" \" + compound_form + \" \" if compound_form_context in sentence : if keep_form : sentence = sentence . 
replace ( compound_form_context , \" \u2581\u2581\" + compound_form + \"\u2581\u2581 \" ) else : sentence = sentence . replace ( compound_form_context , \" \u2581\u2581\" + compound_lemma . replace ( \"-\" , in_separator ) + \"\u2581\u2581 \" ) # print(sentence) return sentence . replace ( \" \" , \" \" ) . replace ( \"\u2581\u2581\" , separator ) . strip () sent_tokenize ( self , text ) Sentence tokenizer Parameters: Name Type Description Default text [str] [input text to be tokenized by sentences] required Returns: Type Description [list] [a list of sentences] Source code in klpt/tokenize.py def sent_tokenize ( self , text ): \"\"\"Sentence tokenizer Args: text ([str]): [input text to be tokenized by sentences] Returns: [list]: [a list of sentences] \"\"\" text = \" \" + text + \" \" text = text . replace ( \" \\n \" , \" \" ) text = re . sub ( self . prefixes , \" \\\\ 1KLPT is implemented in Python and requires basic knowledge on programming and particularly the Python language. Find out more about Python at https://www.python.org/.
-cyhunspell
>= 2.0.1Using pip, KLPT releases are available as source packages and binary wheels. Please make sure that a compatible Python version is installed:
-pip install klpt
-
-
-All the data files including lexicons and morphological rules are also installed with the package.
-Although KLPT is not dependent on any NLP toolkit, there is one important requirement, particularly for the stem
module. That is cyhunspell
which should be installed with a version >= 2.0.1.
klpt
Once the package is installed, you can import the toolkit as follows:
-import klpt
-
-
-As a principle, the following parameters are widely used in the toolkit:
-- dialect: the name of the dialect, as Sorani or Kurmanji (ISO 639-3 codes will also be added)
-- script: the script of your input text, as "Arabic" or "Latin"
-- numeral: the type of the numerals, as Arabic [١٢٣٤٥٦٧٨٩٠], Farsi [۱۲۳۴۵۶۷۸۹۰] or Latin [1234567890]
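-For instance, the classes documented below are all initialized with these parameters. The following is a minimal sketch reusing the constructor calls shown in the examples of this guide:
-from klpt.preprocess import Preprocess
-from klpt.tokenize import Tokenize
-
-# dialect, script and, where applicable, numeral are set at initialization
-preprocessor = Preprocess("Sorani", "Arabic", numeral="Latin")
-tokenizer = Tokenize("Kurmanji", "Latin")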
-preprocess package
-This module deals with normalizing scripts and orthographies by using writing conventions based on dialects and scripts. The goal is not to correct the orthography but to normalize the text in terms of the encoding and common writing rules. The input encoding should be UTF-8 only. To this end, three functions are provided as follows:
-- normalize: deals with different encodings and unifies characters based on dialects and scripts
-- standardize: given a normalized text, it returns standardized text based on the Kurdish orthographies, following recommendations for Kurmanji and Sorani
-- unify_numerals: converts the various types of numerals used in Kurdish texts
-It is recommended that the output of this module be used as the input of subsequent tasks in an NLP pipeline.
-Examples:
->>> from klpt.preprocess import Preprocess
-
->>> preprocessor_ckb = Preprocess("Sorani", "Arabic", numeral="Latin")
->>> preprocessor_ckb.normalize("لە ســـاڵەکانی ١٩٥٠دا")
-'لە ساڵەکانی 1950دا'
->>> preprocessor_ckb.standardize("راستە لەو ووڵاتەدا")
-'ڕاستە لەو وڵاتەدا'
->>> preprocessor_ckb.unify_numerals("٢٠٢٠")
-'2020'
-
->>> preprocessor_kmr = Preprocess("Kurmanji", "Latin")
->>> preprocessor_kmr.standardize("di sala 2018-an")
-'di sala 2018an'
->>> preprocessor_kmr.standardize("hêviya")
-'hêvîya'
-
-The preprocessing rules are provided in data/preprocess_map.json.
-__init__(self, dialect, script, numeral='Latin')
-
-
-Initialization of the Preprocess class
-Parameters:
-| Name | Type | Description | Default |
-|---|---|---|---|
-| dialect | str | the name of the dialect or its ISO 639-3 code | required |
-| script | str | the name of the script | required |
-| numeral | str | the type of the numeral | 'Latin' |
-
klpt/preprocess.py
def __init__(self, dialect, script, numeral="Latin"):
- """
- Initialization of the Preprocess class
-
- Arguments:
- dialect (str): the name of the dialect or its ISO 639-3 code
- script (str): the name of the script
- numeral (str): the type of the numeral
-
- """
- with open(klpt.get_data("data/preprocess_map.json")) as preprocess_file:
- self.preprocess_map = json.load(preprocess_file)
-
- configuration = Configuration({"dialect": dialect, "script": script, "numeral": numeral})
- self.dialect = configuration.dialect
- self.script = configuration.script
- self.numeral = configuration.numeral
-
-
-normalize(self, text)
-Text normalization
-This function deals with different encodings and unifies characters based on dialects and scripts as follows:
-Sorani-Arabic:
-- replace frequent Arabic characters with their equivalent Kurdish ones, e.g. "ي" by "ی" and "ك" by "ک"
-- replace "ه" followed by a zero-width non-joiner (ZWNJ, U+200C) with "ە", removing the ZWNJ ("رهزبهر" is converted to "رەزبەر"); ZWNJ in HTML is also taken into account
-- replace "هـ" with "ھ" (U+06BE, ARABIC LETTER HEH DOACHASHMEE)
-- remove Kashida "ـ"
-- replace "ھ" in the middle of a word by "ه" (U+0647)
-- replace different types of y, such as ARABIC LETTER ALEF MAKSURA (U+0649)
-It should be noted that the order of the replacements is important. Check out the provided files for further details and test cases.
-Parameters:
-| Name | Type | Description | Default |
-|---|---|---|---|
-| text | str | a string | required |
-
-Returns:
-
-| Type | Description |
-|---|---|
-| str | normalized text |
-
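-As an illustration of the ZWNJ rule listed above, using the conversion cited in the docstring below and the preprocessor_ckb instance from the examples:
->>> preprocessor_ckb.normalize("رهزبهر")
-'رەزبەر'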
klpt/preprocess.py
def normalize(self, text):
- """
- Text normalization
-
- This function deals with different encodings and unifies characters based on dialects and scripts as follows:
-
- - Sorani-Arabic:
-
- - replace frequent Arabic characters with their equivalent Kurdish ones, e.g. "ي" by "ی" and "ك" by "ک"
- - replace "ه" followed by zero-width non-joiner (ZWNJ, U+200C) with "ە" where ZWNJ is removed ("رهزبهر" is converted to "رەزبەر"). ZWNJ in HTML is also taken into account.
- - replace "هـ" with "ھ" (U+06BE, ARABIC LETTER HEH DOACHASHMEE)
- - remove Kashida "ـ"
- - "ھ" in the middle of a word is replaced by ه (U+0647)
- - replace different types of y, such as 'ARABIC LETTER ALEF MAKSURA' (U+0649)
-
- It should be noted that the order of the replacements is important. Check out provided files for further details and test cases.
-
- Arguments:
- text (str): a string
-
- Returns:
- str: normalized text
-
- """
- temp_text = " " + self.unify_numerals(text) + " "
-
- for normalization_type in ["universal", self.dialect]:
- for rep in self.preprocess_map["normalizer"][normalization_type][self.script]:
- rep_tar = self.preprocess_map["normalizer"][normalization_type][self.script][rep]
- temp_text = re.sub(rf"{rep}", rf"{rep_tar}", temp_text, flags=re.I)
-
- return temp_text.strip()
-
-
-preprocess(self, text)
-One single function for normalization, standardization and unification of numerals
-Parameters:
-| Name | Type | Description | Default |
-|---|---|---|---|
-| text | str | a string | required |
-
-Returns:
-
-| Type | Description |
-|---|---|
-| str | preprocessed text |
-
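-Since preprocess simply chains the three functions, the call below is a minimal sketch using the preprocessor_ckb instance from the examples above:
-preprocessed = preprocessor_ckb.preprocess("لە ســـاڵەکانی ١٩٥٠دا")
-# one call covering normalization, standardization and unification of numerals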
klpt/preprocess.py
def preprocess(self, text):
- """
- One single function for normalization, standardization and unification of numerals
-
- Arguments:
- text (str): a string
-
- Returns:
- str: preprocessed text
- """
- return self.unify_numerals(self.standardize(self.normalize(text)))
-
-
-standardize(self, text)
-Method of standardization of Kurdish orthographies
-Given a normalized text, it returns standardized text based on the Kurdish orthographies.
-Sorani-Arabic:
-- replace the alveolar flap ر (/ɾ/) at the beginning of a word by the alveolar trill ڕ (/r/)
-- replace double rr and ll with ř and ł respectively
-Kurmanji-Latin:
-- replace "-an" or "'an" in dates and numerals ("di sala 2018'an" and "di sala 2018-an" -> "di sala 2018an")
-Open issues:
-- replace " وە " by " و "? But this is not always possible, "min bo we" (ریزگـرتنا من بو وە نە ئە وە ئــە ز)
-- "pirtükê": "pirtûkê"?
-- Should ı (LATIN SMALL LETTER DOTLESS I) be replaced by i?
-Parameters:
-| Name | Type | Description | Default |
-|---|---|---|---|
-| text | str | a string | required |
-
-Returns:
-
-| Type | Description |
-|---|---|
-| str | standardized text |
-
klpt/preprocess.py
def standardize(self, text):
- """
- Method of standardization of Kurdish orthographies
-
- Given a normalized text, it returns standardized text based on the Kurdish orthographies.
-
- - Sorani-Arabic:
- - replace alveolar flap ر (/ɾ/) at the beginning of the word by the alveolar trill ڕ (/r/)
- - replace double rr and ll with ř and ł respectively
-
- - Kurmanji-Latin:
- - replace "-an" or "'an" in dates and numerals ("di sala 2018'an" and "di sala 2018-an" -> "di sala 2018an")
-
- Open issues:
- - replace " وە " by " و "? But this is not always possible, "min bo we" (ریزگـرتنا من بو وە نە ئە وە ئــە ز)
- - "pirtükê": "pirtûkê"?
- - Should [ı (LATIN SMALL LETTER DOTLESS I](https://www.compart.com/en/unicode/U+0131) be replaced by i?
-
- Arguments:
- text (str): a string
-
- Returns:
- str: standardized text
-
- """
- temp_text = " " + self.unify_numerals(text) + " "
-
- for standardization_type in [self.dialect]:
- for rep in self.preprocess_map["standardizer"][standardization_type][self.script]:
- rep_tar = self.preprocess_map["standardizer"][standardization_type][self.script][rep]
- temp_text = re.sub(rf"{rep}", rf"{rep_tar}", temp_text, flags=re.I)
-
- return temp_text.strip()
-
-
-unify_numerals(self, text)
-Convert numerals to the desired one
-There are three types of numerals:
-- Arabic [١٢٣٤٥٦٧٨٩٠]
-- Farsi [۱۲۳۴۵۶۷۸۹۰]
-- Latin [1234567890]
-Parameters:
-| Name | Type | Description | Default |
-|---|---|---|---|
-| text | str | a string | required |
-
-Returns:
-
-| Type | Description |
-|---|---|
-| str | text with unified numerals |
-
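-A sketch with Farsi numerals, assuming the numeral type of the instance is "Latin" as in the examples above:
->>> preprocessor_ckb.unify_numerals("ساڵی ۱۹۹۸")
-'ساڵی 1998'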
klpt/preprocess.py
def unify_numerals(self, text):
- """
- Convert numerals to the desired one
-
- There are three types of numerals:
- - Arabic [١٢٣٤٥٦٧٨٩٠]
- - Farsi [۱۲۳۴۵۶۷۸۹۰]
- - Latin [1234567890]
-
- Arguments:
- text (str): a string
-
- Returns:
- str: text with unified numerals
-
- """
- for i, j in self.preprocess_map["normalizer"]["universal"]["numerals"][self.numeral].items():
- text = text.replace(i, j)
- return text
-
-stem package
-The Stem module deals with various tasks, mainly through the following functions:
-- check_spelling: spell error detection
-- correct_spelling: spell error correction
-- analyze: morphological analysis
-Please note that only Sorani is supported by this module in the current version. The module is based on the Kurdish Hunspell project.
-Examples:
->>> from klpt.stem import Stem
->>> stemmer = Stem("Sorani", "Arabic")
->>> stemmer.check_spelling("سوتاندبووت")
-False
->>> stemmer.correct_spelling("سوتاندبووت")
-(False, ['ستاندبووت', 'سووتاندبووت', 'سووڕاندبووت', 'ڕووتاندبووت', 'فەوتاندبووت', 'بووژاندبووت'])
->>> stemmer.analyze("دیتبامن")
-[{'pos': 'verb', 'description': 'past_stem_transitive_active', 'base': 'دیت', 'terminal_suffix': 'بامن'}]
-
-
-analyze(self, word_form)
-
-
- Morphological analysis of a given word. More details regarding Kurdish morphological analysis can be found at https://github.com/sinaahmadi/KurdishHunspell.
-It returns morphological analyses. The morphological analysis is returned as a dictionary with the following keys:
-- "pos": the part-of-speech of the word-form according to the Universal Dependency tag set
-- "description": the value of the is flag
-- "terminal_suffix": anything except the ts flag
-- "formation": if the ds flag is set, its value is assigned to description and the value of formation is set to derivational. Although the majority of our morphological rules cover inflectional forms, it is not accurate to say all of them are inflectional. Therefore, we only set this value to derivational wherever we are sure.
-- "base": the ts flag. The definition of terminal suffix is a bit tricky in Hunspell. According to the Hunspell documentation, "Terminal suffix fields are inflectional suffix fields 'removed' by additional (not terminal) suffixes". In other words, the ts flag in Hunspell represents whatever is left after stripping all affixes; it is therefore the morphological base.
-If the input cannot be analyzed morphologically, an empty list is returned.
-Parameters:
-| Name | Type | Description | Default |
-|---|---|---|---|
-| word_form | str | a single word-form | required |
-
-Exceptions:
-
-| Type | Description |
-|---|---|
-| TypeError | only string as input |
-
-Returns:
-
-| Type | Description |
-|---|---|
-| (list(dict)) | a list of all possible morphological analyses according to the defined morphological rules |
-
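-Each analysis in the returned list can then be inspected through its documented keys. A minimal sketch reusing the stemmer instance from the examples above:
-for analysis in stemmer.analyze("دیتبامن"):
-    # e.g. 'verb', 'دیت' and 'بامن' for the example documented above
-    print(analysis.get("pos"), analysis.get("base"), analysis.get("terminal_suffix"))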
klpt/stem.py
def analyze(self, word_form):
- """
- Morphological analysis of a given word. More details regarding Kurdish morphological analysis can be found at [https://github.com/sinaahmadi/KurdishHunspell](https://github.com/sinaahmadi/KurdishHunspell).
-
- It returns morphological analyses. The morphological analysis is returned as a dictionary as follows:
-
- - "pos": the part-of-speech of the word-form according to [the Universal Dependency tag set](https://universaldependencies.org/u/pos/index.html).
- - "description": is flag
- - "terminal_suffix": anything except ts flag
- - "formation": if ds flag is set, its value is assigned to description and the value of formation is set to derivational. Although the majority of our morphological rules cover inflectional forms, it is not accurate to say all of them are inflectional. Therefore, we only set this value to derivational wherever we are sure.
- - "base": `ts` flag. The definition of terminal suffix is a bit tricky in Hunspell. According to [the Hunspell documentation](http://manpages.ubuntu.com/manpages/trusty/en/man4/hunspell.4.html), "Terminal suffix fields are inflectional suffix fields "removed" by additional (not terminal) suffixes". In other words, the ts flag in Hunspell represents whatever is left after stripping all affixes. Therefore, it is the morphological base.
-
- If the input cannot be analyzed morphologically, an empty list is returned.
-
- Args:
- word_form (str): a single word-form
-
- Raises:
- TypeError: only string as input
-
- Returns:
- (list(dict)): a list of all possible morphological analyses according to the defined morphological rules
-
- """
- if not isinstance(word_form, str):
- raise TypeError("Only a word (str) is allowed.")
- else:
- # Given the morphological analysis of a word-form with Hunspell flags, extract relevant information and return a dictionary
- word_analysis = list()
- for analysis in list(self.huns.analyze(word_form)):
- analysis_dict = dict()
- for item in analysis.split():
- if ":" not in item:
- continue
- if item.split(":")[1] == "ts":
- # ts flag exceptionally appears after the value as value:key in the Hunspell output
- analysis_dict["base"] = item.split(":")[0]
- # anything except the terminal_suffix is considered to be the base
- analysis_dict[self.hunspell_flags[item.split(":")[1]]] = word_form.replace(item.split(":")[0], "")
- elif item.split(":")[0] in self.hunspell_flags.keys():
- # assign the key:value pairs from the Hunspell string output to the dictionary output of the current function
- # for ds flag, add derivation as the formation type, otherwise inflection
- if item.split(":")[0] == "ds":
- analysis_dict[self.hunspell_flags[item.split(":")[0]]] = "derivational"
- analysis_dict[self.hunspell_flags["is"]] = item.split(":")[1]
- else:
- analysis_dict[self.hunspell_flags[item.split(":")[0]]] = item.split(":")[1]
-
- # if there is no value assigned to the ts flag, the terminal suffix is a zero-morpheme 0
- if self.hunspell_flags["ts"] not in analysis_dict or analysis_dict[self.hunspell_flags["ts"]] == "":
- analysis_dict[self.hunspell_flags["ts"]] = "0"
-
- word_analysis.append(analysis_dict)
-
- return word_analysis
-
-
-check_spelling(self, word)
-
-
- Check spelling of a word
-Parameters:
-| Name | Type | Description | Default |
-|---|---|---|---|
-| word | str | input word to be spell-checked | required |
-
-Exceptions:
-
-| Type | Description |
-|---|---|
-| TypeError | only string as input |
-
-Returns:
-
-| Type | Description |
-|---|---|
-| bool | True if the spelling is correct, False if the spelling is incorrect |
-
klpt/stem.py
def check_spelling(self, word):
- """Check spelling of a word
-
- Args:
- word (str): input word to be spell-checked
-
- Raises:
- TypeError: only string as input
-
- Returns:
- bool: True if the spelling is correct, False if the spelling is incorrect
- """
- if not isinstance(word, str):
- raise TypeError("Only a word (str) is allowed.")
- else:
- return self.huns.spell(word)
-
-
-correct_spelling(self, word)
-
-
- Correct spelling errors if the input word is incorrect. It returns a tuple whose first element indicates the correctness of the word (True if correct, False if incorrect); when the word is correct, the second element is an empty list, as in (True, []). If the input word is incorrect, suggestions are provided in a list as the second element of the tuple, as in (False, [suggestions]); if no suggestion is available, the list is returned empty, as in (False, []).
-Parameters:
-| Name | Type | Description | Default |
-|---|---|---|---|
-| word | str | input word to be spell-checked | required |
-
-Exceptions:
-
-| Type | Description |
-|---|---|
-| TypeError | only string as input |
-
-Returns:
-
-| Type | Description |
-|---|---|
-| tuple | (boolean, list) |
-
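-Accordingly, the returned tuple can be unpacked directly. A minimal sketch reusing the stemmer instance from the examples above:
-correct, suggestions = stemmer.correct_spelling("سوتاندبووت")
-if not correct:
-    print(suggestions)  # the Hunspell suggestions listed in the example above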
klpt/stem.py
def correct_spelling(self, word):
- """
- Correct spelling errors if the input word is incorrect. It returns a tuple where the first element indicates the correctness of the word (True if correct, False if incorrect).
- If the input word is incorrect, suggestions are provided in a list as the second element of the tuple, as (False, [suggestions]).
- If no suggestion is available, the list is returned empty, as (False, []).
-
- Args:
- word (str): input word to be spell-checked
-
- Raises:
- TypeError: only string as input
-
- Returns:
- tuple (boolean, list)
-
- """
- if not isinstance(word, str):
- raise TypeError("Only a word (str) is allowed.")
- else:
- if self.check_spelling(word):
- return (True, [])
- return (False, list(self.huns.suggest(word)))
-
-tokenize package
-This module focuses on the tokenization of both Kurmanji and Sorani dialects of Kurdish with the following functions:
-- word_tokenize: tokenization of texts into tokens (both multi-word expressions and single-word tokens)
-- mwe_tokenize: tokenization of texts by taking only compound forms into account
-- sent_tokenize: tokenization of texts into sentences
-The module is based on the Kurdish tokenization project.
-Examples:
->>> from klpt.tokenize import Tokenize
-
->>> tokenizer = Tokenize("Kurmanji", "Latin")
->>> tokenizer.word_tokenize("ji bo fortê xwe avêtin")
-['▁ji▁', 'bo', '▁▁fortê‒xwe‒avêtin▁▁']
->>> tokenizer.mwe_tokenize("bi serokê hukûmeta herêma Kurdistanê Prof. Salih re saz kir.")
-'bi serokê hukûmeta herêma Kurdistanê Prof . Salih re saz kir .'
-
->>> tokenizer_ckb = Tokenize("Sorani", "Arabic")
->>> tokenizer_ckb.word_tokenize("بە هەموو هەمووانەوە ڕێک کەوتن")
-['▁بە▁', '▁هەموو▁', 'هەمووانەوە', '▁▁ڕێک‒کەوتن▁▁']
-
-
-mwe_tokenize(self, sentence, separator='▁▁', in_separator='‒', punct_marked=False, keep_form=False)
-Multi-word expression tokenization
-Parameters:
-| Name | Type | Description | Default |
-|---|---|---|---|
-| sentence | str | sentence to be split by multi-word expressions | required |
-| separator | str | a specific token marking a multi-word expression; by default, two ▁ characters (▁▁) are used for this purpose | '▁▁' |
-| in_separator | str | a specific token marking the component parts of a multi-word expression; by default, a dash ‒ is used for this purpose | '‒' |
-| punct_marked | boolean | if set to True, punctuation marks are assumed to be already delimited by spaces | False |
-| keep_form | boolean | if set to True, the original form of the multi-word expression is returned the same way it is provided in the input; if set to False, the lemma form is used, with the parts delimited by a dash ‒, as in "dab‒û‒nerît" | False |
-
-Returns:
-
-| Type | Description |
-|---|---|
-| str | sentence with multi-word expressions delimited by the separator |
-
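-A sketch with a custom separator, assuming the expression is present in the MWE lexicon as in the word_tokenize example above:
-tagged = tokenizer.mwe_tokenize("ji bo fortê xwe avêtin", separator="[MWE]")
-# the compound "fortê xwe avêtin" is wrapped with "[MWE]" instead of the default "▁▁"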
klpt/tokenize.py
def mwe_tokenize(self, sentence, separator="▁▁", in_separator="‒", punct_marked=False, keep_form=False):
- """
- Multi-word expression tokenization
-
- Args:
- sentence (str): sentence to be split by multi-word expressions
- separator (str): a specific token to specify a multi-word expression. By default two ▁ (▁▁) are used for this purpose.
- in_separator (str): a specific token to specify the composing parts of a multi-word expression. By default a dash - is used for this purpose.
- keep_form (boolean): if set to True, the original form of the multi-word expression is returned the same way provided in the input. On the other hand, if set to False, the lemma form is used where the parts are delimited by a dash ‒, as in "dab‒û‒nerît"
- punct_marked (boolean): if set to True, punctuation marks are assumed to be already delimited by spaces
-
- Returns:
- str: sentence with multi-word expressions delimited using the separator
-
- """
- sentence = " " + sentence + " "
-
- if not punct_marked:
- # find punctuation marks and add a space around
- for punct in self.tokenize_map["word_tokenize"][self.dialect][self.script]["punctuation"]:
- if punct in sentence:
- sentence = sentence.replace(punct, " " + punct + " ")
-
- # look for compound words and delimit them by double the separator
- for compound_lemma in self.mwe_lexicon:
- compound_lemma_context = " " + compound_lemma + " "
- if compound_lemma_context in sentence:
- if keep_form:
- sentence = sentence.replace(compound_lemma_context, " ▁▁" + compound_lemma + "▁▁ ")
- else:
- sentence = sentence.replace(compound_lemma_context, " ▁▁" + compound_lemma.replace("-", in_separator) + "▁▁ ")
- # check the possible word forms available for each compound lemma in the lex files, too
- # Note: compound forms don't have any hyphen or separator in the lex files
- for compound_form in self.mwe_lexicon[compound_lemma]["token_forms"]:
- compound_form_context = " " + compound_form + " "
- if compound_form_context in sentence:
- if keep_form:
- sentence = sentence.replace(compound_form_context, " ▁▁" + compound_form + "▁▁ ")
- else:
- sentence = sentence.replace(compound_form_context, " ▁▁" + compound_lemma.replace("-", in_separator) + "▁▁ ")
-
-
- # print(sentence)
- return sentence.replace(" ", " ").replace("▁▁", separator).strip()
-
-
-sent_tokenize(self, text)
-Sentence tokenizer
-Parameters:
-| Name | Type | Description | Default |
-|---|---|---|---|
-| text | str | input text to be tokenized into sentences | required |
-
-Returns:
-
-| Type | Description |
-|---|---|
-| list | a list of sentences |
-
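-A minimal sketch with illustrative sentences; the exact boundaries depend on the punctuation and abbreviation lists loaded by the class:
-sentences = tokenizer.sent_tokenize("Ez hatim. Tu çûyî.")
-# e.g. ['Ez hatim.', 'Tu çûyî.'], split at the sentence-final punctuation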
klpt/tokenize.py
def sent_tokenize(self, text):
- """Sentence tokenizer
-
- Args:
- text (str): input text to be tokenized into sentences
-
- Returns:
- list: a list of sentences
-
- """
- text = " " + text + " "
- text = text.replace("\n", " ")
- text = re.sub(self.prefixes, "\\1<prd>", text)
- text = re.sub(self.websites, "<prd>\\1", text)
- text = re.sub("\s" + self.alphabets + "[.] ", " \\1<prd> ", text)
- text = re.sub(self.acronyms + " " + self.starters, "\\1<stop> \\2", text)
- text = re.sub(self.alphabets + "[.]" + self.alphabets + "[.]" + self.alphabets + "[.]", "\\1<prd>\\2<prd>\\3<prd>", text)
- text = re.sub(self.alphabets + "[.]" + self.alphabets + "[.]", "\\1<prd>\\2<prd>", text)
- text = re.sub(" " + self.suffixes + "[.] " + self.starters, " \\1<stop> \\2", text)
- text = re.sub(" " + self.suffixes + "[.]", " \\1<prd>", text)
- text = re.sub(self.digits + "[.]" + self.digits, "\\1<prd>\\2", text)
-
- # for punct in self.tokenize_map[self.dialect][self.script]["compound_puncts"]:
- # if punct in text:
- # text = text.replace("." + punct, punct + ".")
-
- for punct in self.tokenize_map["sent_tokenize"][self.dialect][self.script]["punct_boundary"]:
- text = text.replace(punct, punct + "<stop>")
-
- text = text.replace("<prd>", ".")
- sentences = text.split("<stop>")
- sentences = [s.strip() for s in sentences if len(s.strip())]
-
- return sentences
-
-
-word_tokenize(self, sentence, separator='▁', mwe_separator='▁▁', keep_form=False)
-Word tokenizer
-Parameters:
-| Name | Type | Description | Default |
-|---|---|---|---|
-| sentence | str | sentence or text to be tokenized | required |
-| separator | str | token used to mark single-word tokens found in the lexicon | '▁' |
-| mwe_separator | str | token used to mark multi-word expressions | '▁▁' |
-| keep_form | boolean | if set to True, multi-word expressions keep their original form (see mwe_tokenize) | False |
-
-Returns:
-
-| Type | Description |
-|---|---|
-| list | a list of words |
-
klpt/tokenize.py
def word_tokenize(self, sentence, separator="▁", mwe_separator="▁▁", keep_form=False):
- """Word tokenizer
-
- Args:
- sentence (str): sentence or text to be tokenized
- separator (str): token used to mark single-word tokens found in the lexicon
- mwe_separator (str): token used to mark multi-word expressions
- keep_form (boolean): if set to True, multi-word expressions keep their original form (see mwe_tokenize)
-
- Returns:
- list: a list of words
-
- """
- # find multi-word expressions in the sentence
- sentence = self.mwe_tokenize(sentence, keep_form=keep_form)
-
- # find punctuation marks and add a space around
- for punct in self.tokenize_map["word_tokenize"][self.dialect][self.script]["punctuation"]:
- if punct in sentence:
- sentence = sentence.replace(punct, " " + punct + " ")
-
- # print(sentence)
- tokens = list()
- # split the sentence by space and look for identifiable tokens
- for word in sentence.strip().split():
- if "▁▁" in word:
- # the word is previously detected as a compound word
- tokens.append(word)
- else:
- if word in self.lexicon:
- # check if the word exists in the lexicon
- tokens.append("▁" + word + "▁")
- else:
- # the word is neither a lemma nor a compound
- # morphological analysis by identifying affixes and clitics
- token_identified = False
-
- for preposition in self.morphemes["prefixes"]:
- if word.startswith(preposition) and len(word.split(preposition, 1)) > 1:
- if word.split(preposition, 1)[1] in self.lexicon:
- word = "▁".join(["", self.morphemes["prefixes"][preposition], word.split(preposition, 1)[1], ""])
- token_identified = True
- break
- elif self.mwe_tokenize(word.split(preposition, 1)[1], keep_form=keep_form) != word.split(preposition, 1)[1]:
- word = "▁" + self.morphemes["prefixes"][preposition] + self.mwe_tokenize(word.split(preposition, 1)[1], keep_form=keep_form)
- token_identified = True
- break
-
- if not token_identified:
- for postposition in self.morphemes["suffixes"]:
- if word.endswith(postposition) and len(word.rpartition(postposition)[0]):
- if word.rpartition(postposition)[0] in self.lexicon:
- word = "▁" + word.rpartition(postposition)[0] + "▁" + self.morphemes["suffixes"][postposition]
- break
- elif self.mwe_tokenize(word.rpartition(postposition)[0], keep_form=keep_form) != word.rpartition(postposition)[0]:
- word = ("▁" + self.mwe_tokenize(word.rpartition(postposition)[0], keep_form=keep_form) + "▁" + self.morphemes["suffixes"][postposition] + "▁").replace("▁▁▁", "▁▁")
- break
-
- tokens.append(word)
- # print(tokens)
- return " ".join(tokens).replace("▁▁", mwe_separator).replace("▁", separator).split()
-
-transliterate package
-This module aims at transliterating one script of Kurdish into another one. Currently, only the Latin-based and the Arabic-based scripts of Sorani and Kurmanji are supported. The main function in this module is transliterate(), which also takes care of detecting the correct form of double-usage graphemes, namely و ↔ w/u and ی ↔ î/y. In some specific cases, it can also predict the placement of the missing i (also known as Bizroke/بزرۆکە).
-The module is based on the Kurdish transliteration project.
-Examples:
->>> from klpt.transliterate import Transliterate
->>> transliterate = Transliterate("Kurmanji", "Latin", target_script="Arabic")
->>> transliterate.transliterate("rojhilata navîn")
-'رۆژهلاتا ناڤین'
-
->>> transliterate_ckb = Transliterate("Sorani", "Arabic", target_script="Latin")
->>> transliterate_ckb.transliterate("لە وڵاتەکانی دیکەدا")
-'le wiłatekanî dîkeda'
-
-
-__init__(self, dialect, script, target_script, unknown='�', numeral='Latin')
-
-
-Initializing using a Configuration object
-To do:
-- "لە ئیسپانیا ژنان لە دژی ‘patriarkavirus’ ڕێپێوانیان کرد": "le îspanya jinan le dijî ‘patriarkavirus’ řêpêwanyan kird"
-- "egerçî damezrandnî rêkxrawe kurdîyekan her rêpênedraw mabûnewe Inzîbat.": "ئەگەرچی دامەزراندنی ڕێکخراوە کوردییەکان هەر رێپێنەدراو مابوونەوە ئنزیبات."
-Parameters:
-| Name | Type | Description | Default |
-|---|---|---|---|
-| dialect | str | the name of the dialect or its ISO 639-3 code | required |
-| script | str | the script of the input text | required |
-| target_script | str | the script to transliterate the text into | required |
-| unknown | str | the token used for characters that cannot be transliterated | '�' |
-| numeral | str | the type of the numeral; defaults to "Latin" and is modifiable only if the source script is Arabic, otherwise the value will be Latin | 'Latin' |
-
-Exceptions:
-
-| Type | Description |
-|---|---|
-| ValueError | if an invalid configuration is given, e.g. an unsupported transliteration pair or an empty unknown token |
-
klpt/transliterate.py
def __init__(self, dialect, script, target_script, unknown="�", numeral="Latin"):
- """Initializing using a Configuration object
-
- To do:
- - "لە ئیسپانیا ژنان لە دژی ‘patriarkavirus’ ڕێپێوانیان کرد": "le îspanya jinan le dijî ‘patriarkavirus’ řêpêwanyan kird"
- - "egerçî damezrandnî rêkxrawe kurdîyekan her rêpênedraw mabûnewe Inzîbat.": "ئەگەرچی دامەزراندنی ڕێکخراوە کوردییەکان هەر رێپێنەدراو مابوونەوە ئنزیبات.",
-
- Args:
- dialect (str): the name of the dialect or its ISO 639-3 code
- script (str): the script of the input text
- target_script (str): the script to transliterate the text into
- unknown (str, optional): the token used for characters that cannot be transliterated. Defaults to "�".
- numeral (str, optional): the type of the numeral. Defaults to "Latin". Modifiable only if the source script is in Arabic. Otherwise, the default value will be Latin.
-
- Raises:
- ValueError: if an invalid configuration is given (e.g. an unsupported transliteration pair or an empty unknown token)
-
- """
- # with open("data/default-options.json") as f:
- # options = json.load(f)
-
- self.UNKNOWN = "�"
- with open(klpt.get_data("data/wergor.json")) as f:
- self.wergor_configurations = json.load(f)
-
- with open(klpt.get_data("data/preprocess_map.json")) as f:
- self.preprocess_map = json.load(f)["normalizer"]
-
- configuration = Configuration({"dialect": dialect, "script": script, "numeral": numeral, "target_script": target_script, "unknown": unknown})
- # self.preprocess_map = object.preprocess_map["normalizer"]
- self.dialect = configuration.dialect
- self.script = configuration.script
- self.numeral = configuration.numeral
- self.mode = configuration.mode
- self.target_script = configuration.target_script
- self.user_UNKNOWN = configuration.user_UNKNOWN
-
- # self.mode = mode
- # if mode=="arabic_to_latin":
- # target_script = "Latin"
- # elif mode=="latin_to_arabic":
- # target_script = "Arabic"
- # else:
- # raise ValueError(f'Unknown transliteration option. Available options: {options["transliterator"]}')
-
- # if len(unknown):
- # self.user_UNKNOWN = unknown
- # else:
- # raise ValueError(f'Unknown unknown tag. Select a non-empty token (e.g. <UNK>.')
-
- self.characters_mapping = self.wergor_configurations["characters_mapping"]
- self.digits_mapping = self.preprocess_map["universal"]["numerals"][self.target_script]
- self.digits_mapping_all = list(set(list(self.preprocess_map["universal"]["numerals"][self.target_script].keys()) + list(self.preprocess_map["universal"]["numerals"][self.target_script].values())))
- self.punctuation_mapping = self.wergor_configurations["punctuation"][self.target_script]
- self.punctuation_mapping_all = list(set(list(self.wergor_configurations["punctuation"][self.target_script].keys()) +
- list(self.wergor_configurations["punctuation"][self.target_script].values())))
- # self.tricky_characters = self.wergor_configurations["characters_mapping"]
- self.wy_mappings = self.wergor_configurations["wy_mappings"]
-
- self.hemze = self.wergor_configurations["hemze"]
- self.bizroke = self.wergor_configurations["bizroke"]
- self.uw_iy_forms = self.wergor_configurations["uw_iy_forms"]
- self.target_char = self.wergor_configurations["target_char"]
- self.arabic_vowels = self.wergor_configurations["arabic_vowels"]
- self.arabic_cons = self.wergor_configurations["arabic_cons"]
- self.latin_vowels = self.wergor_configurations["latin_vowels"]
- self.latin_cons = self.wergor_configurations["latin_cons"]
-
- self.characters_pack = {"arabic_to_latin": self.characters_mapping.values(), "latin_to_arabic": self.characters_mapping.keys()}
- if self.target_script == "Arabic":
- self.prep = Preprocess("Sorani", "Latin", numeral=self.numeral)
- else:
- self.prep = Preprocess("Sorani", "Latin", numeral="Latin")
-
-
-arabic_to_latin(self, char)
-Mapping Arabic-based characters to the Latin-based equivalents
-klpt/transliterate.py
def arabic_to_latin(self, char):
- """Mapping Arabic-based characters to the Latin-based equivalents"""
- if char != "":
- if char in list(self.characters_mapping.values()):
- return list(self.characters_mapping.keys())[list(self.characters_mapping.values()).index(char)]
- elif char in self.punctuation_mapping:
- return self.punctuation_mapping[char]
- return char
-
-
-bizroke_finder(self, word)
-Detection of the "i" character in the Arabic-based script. Incomplete version.
-klpt/transliterate.py
def bizroke_finder(self, word):
- """Detection of the "i" character in the Arabic-based script. Incomplete version."""
- word = list(word)
- if len(word) > 2 and word[0] in self.latin_cons and word[1] in self.latin_cons and word[1] != "w" and word[1] != "y":
- word.insert(1, "i")
- return "".join(word)
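-By the rule in the source above, a two-consonant onset receives an "i". A sketch, assuming "k" and "r" are listed among the Latin consonants:
-word = transliterate_ckb.bizroke_finder("krd")
-# -> "kird": "i" is inserted after the first of the two leading consonants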
-
-
-latin_to_arabic(self, char)
-Mapping Latin-based characters to the Arabic-based equivalents
-klpt/transliterate.py
def latin_to_arabic(self, char):
- """Mapping Latin-based characters to the Arabic-based equivalents"""
- # check if the character is in upper case
- mapped_char = ""
-
- if char.lower() != "":
- if char.lower() in self.wy_mappings.keys():
- mapped_char = self.wy_mappings[char.lower()]
- elif char.lower() in self.characters_mapping.keys():
- mapped_char = self.characters_mapping[char.lower()]
- elif char.lower() in self.punctuation_mapping:
- mapped_char = self.punctuation_mapping[char.lower()]
- # elif char.lower() in self.digits_mapping.values():
- # mapped_char = self.digits_mapping.keys()[self.digits_mapping.values().index(char.lower())]
-
- if len(mapped_char):
- if char.isupper():
- return mapped_char.upper()
- return mapped_char
- else:
- return char
-
-
-preprocessor(self, word)
-Preprocessing by normalizing text encoding and removing embedding characters
-klpt/transliterate.py
def preprocessor(self, word):
- """Preprocessing by normalizing text encoding and removing embedding characters"""
- # replace this by the normalization part
- word = list(word.replace('\u202b', "").replace('\u202c', "").replace('\u202a', "").replace(u"وو", "û").replace("\u200c", "").replace("ـ", ""))
- # for char_index in range(len(word)):
- # if(word[char_index] in self.tricky_characters.keys()):
- # word[char_index] = self.tricky_characters[word[char_index]]
- return "".join(word)
-
-
-syllable_detector(self, word)
-Detection of the syllable based on the given pattern. May be used for transcription applications.
-klpt/transliterate.py
def syllable_detector(self, word):
- """Detection of the syllable based on the given pattern. May be used for transcription applications."""
- syllable_templates = ["V", "VC", "VCC", "CV", "CVC", "CVCCC"]
- CV_converted_list = ""
- for char in word:
- if char in self.latin_vowels:
- CV_converted_list += "V"
- else:
- CV_converted_list += "C"
-
- syllables = list()
- for i in range(1, len(CV_converted_list)):
- syllable_templates_permutated = [p for p in itertools.product(syllable_templates, repeat=i)]
- for syl in syllable_templates_permutated:
- if len("".join(syl)) == len(CV_converted_list):
- if CV_converted_list == "".join(syl) and "VV" not in "".join(syl):
- syllables.append(syl)
- return syllables
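-For instance, a word such as "ker" maps to the consonant-vowel pattern "CVC", which matches a single template. A sketch, assuming "e" is listed among the Latin vowels:
-syllables = transliterate_ckb.syllable_detector("ker")
-# -> [('CVC',)]: the only template combination reproducing the pattern without "VV"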
-
-
-to_pieces(self, token)
-Given a token, find other segments composed of numbers and punctuation marks not separated by space ▁
-klpt/transliterate.py
def to_pieces(self, token):
- """Given a token, find other segments composed of numbers and punctuation marks not seperated by space ▁"""
- tokens_dict = dict()
- flag = False # True if a token is a \w
- i = 0
-
- for char_index in range(len(token)):
- if token[char_index] in self.digits_mapping_all or token[char_index] in self.punctuation_mapping_all:
- tokens_dict[char_index] = token[char_index]
- flag = False
- i = 0
- elif token[char_index] in self.characters_pack[self.mode] or \
- token[char_index] in self.target_char or \
- token[char_index] == self.hemze or token[char_index].lower() == self.bizroke:
- if flag:
- tokens_dict[char_index-i] = tokens_dict[char_index-i] + token[char_index]
- else:
- tokens_dict[char_index] = token[char_index]
- flag = True
- i += 1
- else:
- tokens_dict[char_index] = self.UNKNOWN
-
- return tokens_dict
-
-
-transliterate(self, text)
-The main method of the class:
-- find word boundaries by splitting the text using spaces and then retrieve words mixed with other characters (without space)
-- map characters
-- detect double-usage characters w/u and y/î
-- find the possible position of Bizroke (to be completed - 2017)
-Notice: the text format should not be changed at all (no lower-casing, no replacement of \t, \n etc.). If the source and the target scripts are identical, the input text should be returned without any further processing.
-klpt/transliterate.py
def transliterate(self, text):
- """The main method of the class:
-
- - find word boundaries by splitting it using spaces and then retrieve words mixed with other characters (without space)
- - map characters
- - detect double-usage characters w/u and y/î
- - find possible position of Bizroke (to be completed - 2017)
-
- Notice: text format should not be changed at all (no lower case, no style replacement \t, \n etc.).
- If the source and the target scripts are identical, the input text should be returned without any further processing.
-
- """
- text = self.prep.unify_numerals(text).split("\n")
- transliterated_text = list()
-
- for line in text:
- transliterated_line = list()
- for token in line.split():
- trans_token = ""
- # try:
- token = self.preprocessor(token) # This is not correct as the capital letter should be kept the way it is given.
- tokens_dict = self.to_pieces(token)
- # Transliterate words
- for token_key in tokens_dict:
- if len(tokens_dict[token_key]):
- word = tokens_dict[token_key]
- if self.mode == "arabic_to_latin":
- # w/y detection based on the priority in "word"
- for char in word:
- if char in self.target_char:
- word = self.uw_iy_Detector(word, char)
- if word[0] == self.hemze and word[1] in self.arabic_vowels:
- word = word[1:]
- word = list(word)
- for char_index in range(len(word)):
- word[char_index] = self.arabic_to_latin(word[char_index])
- word = "".join(word)
- word = self.bizroke_finder(word)
- elif self.mode == "latin_to_arabic":
- if len(word):
- word = list(word)
- for char_index in range(len(word)):
- word[char_index] = self.latin_to_arabic(word[char_index])
- if word[0] in self.arabic_vowels or word[0].lower() == self.bizroke:
- word.insert(0, self.hemze)
- word = "".join(word).replace("û", "وو").replace(self.bizroke.lower(), "").replace(self.bizroke.upper(), "")
- # else:
- # return self.UNKNOWN
-
- trans_token = trans_token + word
-
- transliterated_line.append(trans_token)
- transliterated_text.append(" ".join(transliterated_line).replace(u" w ", u" û "))
-
- # standardize the output
- # replace UNKOWN by the user's choice
- if self.user_UNKNOWN != self.UNKNOWN:
- return "\n".join(transliterated_text).replace(self.UNKNOWN, self.user_UNKNOWN)
- else:
- return "\n".join(transliterated_text)
-
-
-uw_iy_Detector(self, word, target_char)
-Detection of "و" and "ی" in the Arabic-based script
-klpt/transliterate.py
def uw_iy_Detector(self, word, target_char):
- """Detection of "و" and "ی" in the Arabic-based script"""
- word = list(word)
- if target_char == "و":
- dic_index = 1
- else:
- dic_index = 0
-
- if word == [target_char]:  # the whole word is a single target character
- word = self.uw_iy_forms["target_char_cons"][dic_index]
- else:
- for index in range(len(word)):
- if word[index] == self.hemze and index + 1 < len(word) and word[index+1] == target_char:
- word[index+1] = self.uw_iy_forms["target_char_vowel"][dic_index]
- index += 1
- else:
- if word[index] == target_char:
- if index == 0:
- word[index] = self.uw_iy_forms["target_char_cons"][dic_index]
- else:
- if word[index-1] in self.arabic_vowels:
- word[index] = self.uw_iy_forms["target_char_cons"][dic_index]
- else:
- if index+1 < len(word):
- if word[index+1] in self.arabic_vowels:
- word[index] = self.uw_iy_forms["target_char_cons"][dic_index]
- else:
- word[index] = self.uw_iy_forms["target_char_vowel"][dic_index]
- else:
- word[index] = self.uw_iy_forms["target_char_vowel"][dic_index]
-
- word = "".join(word).replace(self.hemze+self.uw_iy_forms["target_char_vowel"][dic_index], self.uw_iy_forms["target_char_vowel"][dic_index])
- return word
-
-