@@ -23,7 +26,7 @@
### Welcome / *Hûn bi xêr hatin* / بە خێر بێن! 🙂
-Kurdish Language Processing Toolkit--KLPT is a [natural language processing](https://en.wikipedia.org/wiki/Natural_language_processing) (NLP) toolkit in Python for the [Kurdish language](https://en.wikipedia.org/wiki/Kurdish_languages). The current version comes with four core modules, namely `preprocess`, `stem`, `transliterate` and `tokenize` and addresses basic language processing tasks such as text preprocessing, stemming, tokenziation, spell-checking and morphological analysis for the [Sorani](https://en.wikipedia.org/wiki/Sorani) and the [Kurmanji](https://en.wikipedia.org/wiki/Kurmanji) dialects of Kurdish.
+Kurdish Language Processing Toolkit--KLPT is a [natural language processing](https://en.wikipedia.org/wiki/Natural_language_processing) (NLP) toolkit in Python for the [Kurdish language](https://en.wikipedia.org/wiki/Kurdish_languages). The current version comes with four core modules, namely `preprocess`, `stem`, `transliterate` and `tokenize` and addresses basic language processing tasks such as text preprocessing, stemming, tokenization, spell-checking and morphological analysis for the [Sorani](https://en.wikipedia.org/wiki/Sorani) and the [Kurmanji](https://en.wikipedia.org/wiki/Kurmanji) dialects of Kurdish.
## Install KLPT
diff --git a/cinder/base.html b/cinder/base.html
index f818eac..ef8fa64 100644
--- a/cinder/base.html
+++ b/cinder/base.html
@@ -88,7 +88,7 @@
diff --git a/docs/about/contributing.md b/docs/about/contributing.md
index 9456f26..9ba7113 100644
--- a/docs/about/contributing.md
+++ b/docs/about/contributing.md
@@ -1,18 +1,19 @@
-## How to help
+# How can you help Kurdish language processing?
-One of our main objectives in this project is to promote collaborative projects with **open-source** outcomes. If you are generous enough to volunteer, like us, and help the Kurdish language, there are three ways you can do to:
+One of our main objectives in this project is to promote collaborative projects with **open-source** outcomes. If you are generous and passionate about volunteering to help the Kurdish language, there are three ways you can do so:
-1- If you have expertise in Kurdish linguistics, you can take part in annotation tasks. Having a basic understanding on computational linguistics is a plus but not a must.
-2- If you are iffy about your knowledge in Kurdish but have expertise in computer programming, you can also contribute to this project.
-3- If you are not included in 1 and 2 but have basic knowledge about Kurdish, particularly writing in Kurdish, you can contribute to lexicon development. The current lexicons include less than 20,000 headwords which should be further extended.
+1. If you are a native Kurdish speaker with general knowledge about Kurdish and are comfortable working with computers, [contributing to collaboratively-curated resources](https://en.wikipedia.org/wiki/Wikipedia:Contributing_to_Wikipedia) is the best starting point, particularly to:
+ - [Wîkîferheng - the Kurdish Wiktionary](https://ku.wiktionary.org/wiki/Destp%C3%AAk)
+ - Wikipedia in [Sorani](https://ckb.wikipedia.org/wiki/%D8%AF%DB%95%D8%B3%D8%AA%D9%BE%DB%8E%DA%A9) and in [Kurmanji](https://ku.wikipedia.org/wiki/Destp%C3%AAk)
+2. If you have expertise in Kurdish linguistics, you can take part in annotation tasks. Having a basic understanding of computational linguistics is a plus but not a must. Please get in touch by joining the [KurdishNLP community on Gitter](https://gitter.im/KurdishNLP). **Our collaborations oftentimes lead to a scientific paper**, depending on the task. Please check the following repositories to find out about some of our previous projects:
+ - [Kurdish tokenization](https://github.com/sinaahmadi/KurdishTokenization)
+ - [Kurdish Hunspell](https://github.com/sinaahmadi/KurdishHunspell)
+ - [Kurdish transliteration](https://github.com/sinaahmadi/wergor)
-In any case, please follow this project and introduce it to your surrounding. Test the tool and raise your issues so that we can fix them.
+3. If you are not included in 1 and 2 but have basic knowledge about Kurdish, particularly writing in Kurdish, you are invited to create content online. You can start a blog or tweet in Kurdish. After all, **every single person can be a contributor as well**.
+In any case, please follow this project and introduce it to your friends. Test the tool and raise your issues so that we can fix them.
-## What is next?
-
-
-I am aware that many Kurds are interested in their language and many times, they invest their passion into literature. Well, that's amazing, but we are living in the IT era. We do need poets and novelists, but programmer and NLP engineers too. Therefore, I am planning to initiate an NLP course in Kurdish in the coming months.
\ No newline at end of file
diff --git a/docs/about/license.md b/docs/about/license.md
index 93bbf1f..64b31ed 100644
--- a/docs/about/license.md
+++ b/docs/about/license.md
@@ -1,108 +1,5 @@
-Attribution-ShareAlike 4.0 International Public License (https://creativecommons.org/licenses/by-sa/4.0/)
-Copyright (c) 2020
+# License
-Sina Ahmadi (ahmadi.sina@outlook.com)
-=======================================================================
-Creative Commons Attribution-ShareAlike 4.0 International Public License
+This project was created by [Sina Ahmadi](https://sinaahmadi.github.io/) and is publicly available under a Creative Commons Attribution-ShareAlike 4.0 International Public License [https://creativecommons.org/licenses/by-sa/4.0/](https://creativecommons.org/licenses/by-sa/4.0/).
-By exercising the Licensed Rights (defined below), You accept and agree to be bound by the terms and conditions of this Creative Commons Attribution-ShareAlike 4.0 International Public License ("Public License"). To the extent this Public License may be interpreted as a contract, You are granted the Licensed Rights in consideration of Your acceptance of these terms and conditions, and the Licensor grants You such rights in consideration of benefits the Licensor receives from making the Licensed Material available under these terms and conditions.
-Section 1 – Definitions.
-
- Adapted Material means material subject to Copyright and Similar Rights that is derived from or based upon the Licensed Material and in which the Licensed Material is translated, altered, arranged, transformed, or otherwise modified in a manner requiring permission under the Copyright and Similar Rights held by the Licensor. For purposes of this Public License, where the Licensed Material is a musical work, performance, or sound recording, Adapted Material is always produced where the Licensed Material is synched in timed relation with a moving image.
- Adapter's License means the license You apply to Your Copyright and Similar Rights in Your contributions to Adapted Material in accordance with the terms and conditions of this Public License.
- BY-SA Compatible License means a license listed at creativecommons.org/compatiblelicenses, approved by Creative Commons as essentially the equivalent of this Public License.
- Copyright and Similar Rights means copyright and/or similar rights closely related to copyright including, without limitation, performance, broadcast, sound recording, and Sui Generis Database Rights, without regard to how the rights are labeled or categorized. For purposes of this Public License, the rights specified in Section 2(b)(1)-(2) are not Copyright and Similar Rights.
- Effective Technological Measures means those measures that, in the absence of proper authority, may not be circumvented under laws fulfilling obligations under Article 11 of the WIPO Copyright Treaty adopted on December 20, 1996, and/or similar international agreements.
- Exceptions and Limitations means fair use, fair dealing, and/or any other exception or limitation to Copyright and Similar Rights that applies to Your use of the Licensed Material.
- License Elements means the license attributes listed in the name of a Creative Commons Public License. The License Elements of this Public License are Attribution and ShareAlike.
- Licensed Material means the artistic or literary work, database, or other material to which the Licensor applied this Public License.
- Licensed Rights means the rights granted to You subject to the terms and conditions of this Public License, which are limited to all Copyright and Similar Rights that apply to Your use of the Licensed Material and that the Licensor has authority to license.
- Licensor means the individual(s) or entity(ies) granting rights under this Public License.
- Share means to provide material to the public by any means or process that requires permission under the Licensed Rights, such as reproduction, public display, public performance, distribution, dissemination, communication, or importation, and to make material available to the public including in ways that members of the public may access the material from a place and at a time individually chosen by them.
- Sui Generis Database Rights means rights other than copyright resulting from Directive 96/9/EC of the European Parliament and of the Council of 11 March 1996 on the legal protection of databases, as amended and/or succeeded, as well as other essentially equivalent rights anywhere in the world.
- You means the individual or entity exercising the Licensed Rights under this Public License. Your has a corresponding meaning.
-
-Section 2 – Scope.
-
- License grant.
- Subject to the terms and conditions of this Public License, the Licensor hereby grants You a worldwide, royalty-free, non-sublicensable, non-exclusive, irrevocable license to exercise the Licensed Rights in the Licensed Material to:
- reproduce and Share the Licensed Material, in whole or in part; and
- produce, reproduce, and Share Adapted Material.
- Exceptions and Limitations. For the avoidance of doubt, where Exceptions and Limitations apply to Your use, this Public License does not apply, and You do not need to comply with its terms and conditions.
- Term. The term of this Public License is specified in Section 6(a).
- Media and formats; technical modifications allowed. The Licensor authorizes You to exercise the Licensed Rights in all media and formats whether now known or hereafter created, and to make technical modifications necessary to do so. The Licensor waives and/or agrees not to assert any right or authority to forbid You from making technical modifications necessary to exercise the Licensed Rights, including technical modifications necessary to circumvent Effective Technological Measures. For purposes of this Public License, simply making modifications authorized by this Section 2(a)(4) never produces Adapted Material.
- Downstream recipients.
- Offer from the Licensor – Licensed Material. Every recipient of the Licensed Material automatically receives an offer from the Licensor to exercise the Licensed Rights under the terms and conditions of this Public License.
- Additional offer from the Licensor – Adapted Material. Every recipient of Adapted Material from You automatically receives an offer from the Licensor to exercise the Licensed Rights in the Adapted Material under the conditions of the Adapter’s License You apply.
- No downstream restrictions. You may not offer or impose any additional or different terms or conditions on, or apply any Effective Technological Measures to, the Licensed Material if doing so restricts exercise of the Licensed Rights by any recipient of the Licensed Material.
- No endorsement. Nothing in this Public License constitutes or may be construed as permission to assert or imply that You are, or that Your use of the Licensed Material is, connected with, or sponsored, endorsed, or granted official status by, the Licensor or others designated to receive attribution as provided in Section 3(a)(1)(A)(i).
-
- Other rights.
- Moral rights, such as the right of integrity, are not licensed under this Public License, nor are publicity, privacy, and/or other similar personality rights; however, to the extent possible, the Licensor waives and/or agrees not to assert any such rights held by the Licensor to the limited extent necessary to allow You to exercise the Licensed Rights, but not otherwise.
- Patent and trademark rights are not licensed under this Public License.
- To the extent possible, the Licensor waives any right to collect royalties from You for the exercise of the Licensed Rights, whether directly or through a collecting society under any voluntary or waivable statutory or compulsory licensing scheme. In all other cases the Licensor expressly reserves any right to collect such royalties.
-
-Section 3 – License Conditions.
-
-Your exercise of the Licensed Rights is expressly made subject to the following conditions.
-
- Attribution.
-
- If You Share the Licensed Material (including in modified form), You must:
- retain the following if it is supplied by the Licensor with the Licensed Material:
- identification of the creator(s) of the Licensed Material and any others designated to receive attribution, in any reasonable manner requested by the Licensor (including by pseudonym if designated);
- a copyright notice;
- a notice that refers to this Public License;
- a notice that refers to the disclaimer of warranties;
- a URI or hyperlink to the Licensed Material to the extent reasonably practicable;
- indicate if You modified the Licensed Material and retain an indication of any previous modifications; and
- indicate the Licensed Material is licensed under this Public License, and include the text of, or the URI or hyperlink to, this Public License.
- You may satisfy the conditions in Section 3(a)(1) in any reasonable manner based on the medium, means, and context in which You Share the Licensed Material. For example, it may be reasonable to satisfy the conditions by providing a URI or hyperlink to a resource that includes the required information.
- If requested by the Licensor, You must remove any of the information required by Section 3(a)(1)(A) to the extent reasonably practicable.
- ShareAlike.
-
- In addition to the conditions in Section 3(a), if You Share Adapted Material You produce, the following conditions also apply.
- The Adapter’s License You apply must be a Creative Commons license with the same License Elements, this version or later, or a BY-SA Compatible License.
- You must include the text of, or the URI or hyperlink to, the Adapter's License You apply. You may satisfy this condition in any reasonable manner based on the medium, means, and context in which You Share Adapted Material.
- You may not offer or impose any additional or different terms or conditions on, or apply any Effective Technological Measures to, Adapted Material that restrict exercise of the rights granted under the Adapter's License You apply.
-
-Section 4 – Sui Generis Database Rights.
-
-Where the Licensed Rights include Sui Generis Database Rights that apply to Your use of the Licensed Material:
-
- for the avoidance of doubt, Section 2(a)(1) grants You the right to extract, reuse, reproduce, and Share all or a substantial portion of the contents of the database;
- if You include all or a substantial portion of the database contents in a database in which You have Sui Generis Database Rights, then the database in which You have Sui Generis Database Rights (but not its individual contents) is Adapted Material, including for purposes of Section 3(b); and
- You must comply with the conditions in Section 3(a) if You Share all or a substantial portion of the contents of the database.
-
-For the avoidance of doubt, this Section 4 supplements and does not replace Your obligations under this Public License where the Licensed Rights include other Copyright and Similar Rights.
-
-Section 5 – Disclaimer of Warranties and Limitation of Liability.
-
- Unless otherwise separately undertaken by the Licensor, to the extent possible, the Licensor offers the Licensed Material as-is and as-available, and makes no representations or warranties of any kind concerning the Licensed Material, whether express, implied, statutory, or other. This includes, without limitation, warranties of title, merchantability, fitness for a particular purpose, non-infringement, absence of latent or other defects, accuracy, or the presence or absence of errors, whether or not known or discoverable. Where disclaimers of warranties are not allowed in full or in part, this disclaimer may not apply to You.
- To the extent possible, in no event will the Licensor be liable to You on any legal theory (including, without limitation, negligence) or otherwise for any direct, special, indirect, incidental, consequential, punitive, exemplary, or other losses, costs, expenses, or damages arising out of this Public License or use of the Licensed Material, even if the Licensor has been advised of the possibility of such losses, costs, expenses, or damages. Where a limitation of liability is not allowed in full or in part, this limitation may not apply to You.
-
- The disclaimer of warranties and limitation of liability provided above shall be interpreted in a manner that, to the extent possible, most closely approximates an absolute disclaimer and waiver of all liability.
-
-Section 6 – Term and Termination.
-
- This Public License applies for the term of the Copyright and Similar Rights licensed here. However, if You fail to comply with this Public License, then Your rights under this Public License terminate automatically.
-
- Where Your right to use the Licensed Material has terminated under Section 6(a), it reinstates:
- automatically as of the date the violation is cured, provided it is cured within 30 days of Your discovery of the violation; or
- upon express reinstatement by the Licensor.
- For the avoidance of doubt, this Section 6(b) does not affect any right the Licensor may have to seek remedies for Your violations of this Public License.
- For the avoidance of doubt, the Licensor may also offer the Licensed Material under separate terms or conditions or stop distributing the Licensed Material at any time; however, doing so will not terminate this Public License.
- Sections 1, 5, 6, 7, and 8 survive termination of this Public License.
-
-Section 7 – Other Terms and Conditions.
-
- The Licensor shall not be bound by any additional or different terms or conditions communicated by You unless expressly agreed.
- Any arrangements, understandings, or agreements regarding the Licensed Material not stated herein are separate from and independent of the terms and conditions of this Public License.
-
-Section 8 – Interpretation.
-
- For the avoidance of doubt, this Public License does not, and shall not be interpreted to, reduce, limit, restrict, or impose conditions on any use of the Licensed Material that could lawfully be made without permission under this Public License.
- To the extent possible, if any provision of this Public License is deemed unenforceable, it shall be automatically reformed to the minimum extent necessary to make it enforceable. If the provision cannot be reformed, it shall be severed from this Public License without affecting the enforceability of the remaining terms and conditions.
- No term or condition of this Public License will be waived and no failure to comply consented to unless expressly agreed to by the Licensor.
- Nothing in this Public License constitutes or may be interpreted as a limitation upon, or waiver of, any privileges and immunities that apply to the Licensor or You, including from the legal processes of any jurisdiction or authority.
diff --git a/docs/about/release-notes.md b/docs/about/release-notes.md
index b41eb25..cd68bdd 100644
--- a/docs/about/release-notes.md
+++ b/docs/about/release-notes.md
@@ -1 +1,95 @@
-release-notes.md
\ No newline at end of file
+### About the current version
+
+Please note that KLPT is under development and some functionalities will appear in future versions. You can follow the progress of each task in the [Projects](https://github.com/sinaahmadi/KLPT/projects) section. The current version includes the following tasks:
+
+| Modules | Tasks | Sorani (ckb) | Kurmanji (kmr) |
+|---------------|-------|--------------|----------------|
+| `preprocess` | normalization | ✓ (v0.1.0) | ✓ (v0.1.0) |
+| | standardization | ✓ (v0.1.0) | ✓ (v0.1.0) |
+| | unification of numerals | ✓ (v0.1.0) | ✓ (v0.1.0) |
+| `tokenize` | word tokenization | ✓ (v0.1.0) | ✓ (v0.1.0) |
+| | MWE tokenization | ✓ (v0.1.0) | ✓ (v0.1.0) |
+| | sentence tokenization | ✓ (v0.1.0) | ✓ (v0.1.0) |
+| `transliterate` | Arabic to Latin | ✓ (v0.1.0) | ✓ (v0.1.0) |
+| | Latin to Arabic | ✓ (v0.1.0) | ✓ (v0.1.0) |
+| | detection of u/w and î/y | ✓ (v0.1.0) | ✓ (v0.1.0) |
+| | detection of Bizroke ( i ) | ✗ | ✗ |
+| `stem` | morphological analysis | ✓ (v0.1.0) | ✗ |
+| | morphological generation | ✓ (v0.1.0) | ✗ |
+| | stemming | ✗ | ✗ |
+| | lemmatization | ✗ | ✗ |
+| | spell error detection and correction | ✓ (v0.1.0) | ✗ |
\ No newline at end of file
diff --git a/docs/about/sponsors.md b/docs/about/sponsors.md
new file mode 100644
index 0000000..d4dd651
--- /dev/null
+++ b/docs/about/sponsors.md
@@ -0,0 +1,18 @@
+## Become a sponsor
+
+Please consider donating to the project. Data annotation and resource creation require a tremendous amount of time and linguistic expertise. Even a small donation will make a difference. You can do so by [becoming a sponsor](https://github.com/sponsors/sinaahmadi) to accompany me in this journey and help the Kurdish language find a better place among other natural languages on the Web. Depending on your support,
+
+- You can be an official sponsor
+- You will get a GitHub sponsor badge on your profile
+- If you have any questions, they will be given priority
+- If you want, I will add your name or company logo on the front page of your preferred project
+- Your contribution will be acknowledged in one of my future papers in a field of your choice
+
+### Our sponsors:
+
+**Be the first one!** 🙂
+
+| Name/company | donation ($) | URL |
+|-------------- |-------------- |----- |
+| | | |
+| | | |
\ No newline at end of file
diff --git a/docs/img/favicon.ico b/docs/img/favicon.ico
new file mode 100644
index 0000000..7b6643c
Binary files /dev/null and b/docs/img/favicon.ico differ
diff --git a/docs/index.md b/docs/index.md
index 3f22ca0..4d73d6f 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -1,11 +1,59 @@
+
+
+
-# Kurdish Language Processing Toolkit
+### **Welcome / *Hûn bi xêr hatin* / بە خێر بێن!** 🙂
-
-API documentation
+### Introduction
+[Language technology](https://en.wikipedia.org/wiki/Language_technology) is an increasingly important field in our information era, relying both on our knowledge of human language and on computational methods to process it. Unlike the latter, which undergoes constant progress as new methods and more efficient techniques are invented, the processability of human languages does not evolve at the same pace. This is particularly the case for languages with scarce resources and limited grammars, also known as *less-resourced languages*.
+
+Despite a plethora of performant tools and specific frameworks for [natural language processing](https://en.wikipedia.org/wiki/Natural_language_processing) (NLP), such as [NLTK](https://www.nltk.org/), [Stanza](https://stanfordnlp.github.io/stanza/) and [spaCy](https://github.com/explosion/spaCy), progress on less-resourced languages is often hindered not only by the lack of basic tools and resources but also by the inaccessibility of previous studies under an open-source licence. This is particularly the case for Kurdish.
+
+### Kurdish Language
+
+Kurdish is a less-resourced Indo-European language spoken by 20-30 million speakers in the Kurdish regions of Turkey, Iraq, Iran and Syria, as well as among the Kurdish diaspora around the world. It is mainly spoken in four dialects (also referred to as languages):
+
+- [Northern Kurdish](https://en.wikipedia.org/wiki/Kurmanji) (or Kurmanji) `kmr`
+- [Central Kurdish](https://en.wikipedia.org/wiki/Sorani) (or Sorani) `ckb`
+- [Southern Kurdish](https://en.wikipedia.org/wiki/Southern_Kurdish) `sdh`
+- [Laki](https://en.wikipedia.org/wiki/Laki_language) `lki`
+
+Kurdish has historically been written in various scripts, namely Cyrillic, Armenian, Latin and Arabic, of which the latter two are still widely in use. Efforts to standardize the Kurdish alphabets and orthographies have not yet been universally adopted by Kurdish speakers in all regions. As such, the Kurmanji dialect is mostly written in the Latin-based script while Sorani, Southern Kurdish and Laki are mostly written in the Arabic-based script.
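As an illustrative aside (this is a documentation sketch, not a function of KLPT), the distinction between the two scripts in use today can be detected with a few lines of Python:

```python
# Illustrative sketch only -- not part of the KLPT API.
# Guess whether a Kurdish string is written in the Arabic-based or the
# Latin-based script by comparing character counts in each range.
def guess_script(text: str) -> str:
    arabic = sum(1 for ch in text if "\u0600" <= ch <= "\u06FF")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    return "Arabic" if arabic >= latin else "Latin"

print(guess_script("ساڵ"))             # Arabic
print(guess_script("di sala 2018an"))  # Latin
```

A heuristic like this is only a rough guide; KLPT itself expects the script to be stated explicitly via its `script` parameter.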
+
+### KLPT - The Kurdish Language Processing Toolkit
+
+KLPT - the Kurdish Language Processing Toolkit is an NLP toolkit for the [Kurdish language](https://en.wikipedia.org/wiki/Kurdish_languages). The current version (0.1) comes with four core modules, namely `preprocess`, `stem`, `transliterate` and `tokenize`, and addresses basic language processing tasks such as text preprocessing, stemming, tokenization, spell error detection and correction, and morphological analysis for the [Sorani](https://en.wikipedia.org/wiki/Sorani) and [Kurmanji](https://en.wikipedia.org/wiki/Kurmanji) dialects of Kurdish. More importantly, **it is an open-source project**!
+
+To find out more about how to use the tool, please check the "User Guide" section of this website.
+
+### Cite this project
+
+Please consider citing [this paper](https://sinaahmadi.github.io/docs/articles/ahmadi2020klpt.pdf), if you use any part of the data or the tool ([`bib` file](https://sinaahmadi.github.io/bibliography/ahmadi2020klpt.txt)):
+
+ @inproceedings{ahmadi2020klpt,
+ title = "{KLPT} {--} {K}urdish Language Processing Toolkit",
+ author = "Ahmadi, Sina",
+ booktitle = "Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)",
+ month = nov,
+ year = "2020",
+ address = "Online",
+ publisher = "Association for Computational Linguistics",
+ url = "https://www.aclweb.org/anthology/2020.nlposs-1.11",
+ pages = "72--84"
+ }
+
+You can also watch the presentation of this paper at [https://slideslive.com/38939750/klpt-kurdish-language-processing-toolkit](https://slideslive.com/38939750/klpt-kurdish-language-processing-toolkit).
+
+### License
+
+Kurdish Language Processing Toolkit by Sina Ahmadi is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License, which means:
+
+- **You are free to share**, copy and redistribute the material in any medium or format and also adapt, remix, transform, and build upon the material
+for any purpose, **even commercially**.
+- **You must give appropriate credit**, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
+- If you remix, transform, or build upon the material, **you must distribute your contributions under the same license as the original**.
-KLPT - the Kurdish Language Processing Toolkit is a [natural language processing](https://en.wikipedia.org/wiki/Natural_language_processing) (NLP) toolkit for the [Kurdish language](https://en.wikipedia.org/wiki/Kurdish_languages). The current version (0.1) comes with four core modules, namely `preprocess`, `stem`, `transliterate` and `tokenize`, and addresses basic language processing tasks such as text preprocessing, stemming, tokenziation, spell error detection and correction, and morphological analysis for the [Sorani](https://en.wikipedia.org/wiki/Sorani) and [Kurmanji](https://en.wikipedia.org/wiki/Kurmanji) dialects of Kurdish.
-# Reference
\ No newline at end of file
diff --git a/docs/user-guide/configuration.md b/docs/user-guide/configuration.md
deleted file mode 100644
index 0443f59..0000000
--- a/docs/user-guide/configuration.md
+++ /dev/null
@@ -1 +0,0 @@
-::: klpt.configuration.Configuration
\ No newline at end of file
diff --git a/docs/user-guide/getting-started.md b/docs/user-guide/getting-started.md
index e69de29..80f1795 100644
--- a/docs/user-guide/getting-started.md
+++ b/docs/user-guide/getting-started.md
@@ -0,0 +1,42 @@
+
+## Install KLPT
+
+KLPT is implemented in Python and requires basic knowledge of programming, particularly in the Python language. Find out more about Python at [https://www.python.org/](https://www.python.org/).
+
+### Requirements
+
+- **Operating system**: macOS / OS X · Linux · Windows (Cygwin, MinGW, Visual
+ Studio)
+- **Python version**: Python 3.5+
+- **Package managers**: [pip](https://pypi.org/project/klpt/)
+- [`cyhunspell`](https://pypi.org/project/cyhunspell/) >= 2.0.1
+
+
+### pip
+
+KLPT releases are available via pip as source packages and binary wheels. Please make sure that a compatible Python version is installed:
+
+```bash
+pip install klpt
+```
+
+All the data files including lexicons and morphological rules are also installed with the package.
+
+Although KLPT does not depend on any NLP toolkit, there is one important requirement, particularly for the `stem` module: [`cyhunspell`](https://pypi.org/project/cyhunspell/), which should be installed with a version >= 2.0.1.
+
+
+### Import `klpt`
+Once the package is installed, you can import the toolkit as follows:
+
+```python
+import klpt
+```
+
+As a principle, the following parameters are widely used in the toolkit:
+
+- `dialect`: the name of the dialect as `Sorani` or `Kurmanji` (ISO 639-3 codes will also be added)
+- `script`: the script of your input text as "Arabic" or "Latin"
+- `numeral`: the type of the numerals as
+ - Arabic [١٢٣٤٥٦٧٨٩٠]
+ - Farsi [۱۲۳۴۵۶۷۸۹۰]
+ - Latin [1234567890]
\ No newline at end of file
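To illustrate what the `numeral` parameter refers to, the mapping between the three numeral systems above can be sketched in plain Python (a simplified stand-in, not KLPT's own implementation):

```python
# Simplified sketch of numeral unification -- not KLPT's implementation.
ARABIC_DIGITS = "١٢٣٤٥٦٧٨٩٠"
FARSI_DIGITS = "۱۲۳۴۵۶۷۸۹۰"
LATIN_DIGITS = "1234567890"

# Translate both Eastern digit sets to their Latin equivalents.
TO_LATIN = str.maketrans(ARABIC_DIGITS + FARSI_DIGITS, LATIN_DIGITS * 2)

def unify_numerals(text: str) -> str:
    """Map Arabic and Farsi digits in `text` to Latin digits."""
    return text.translate(TO_LATIN)

print(unify_numerals("٢٠٢٠"))  # 2020
print(unify_numerals("۱۹۵۰"))  # 1950
```

In KLPT itself, this behaviour is configured through the `numeral` parameter of the `Preprocess` class rather than called directly like this.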
diff --git a/docs/user-guide/preprocess.md b/docs/user-guide/preprocess.md
index 5f324d9..3085b5a 100644
--- a/docs/user-guide/preprocess.md
+++ b/docs/user-guide/preprocess.md
@@ -1,4 +1,3 @@
-
-## Preprocess package
+## `preprocess` package
::: klpt.preprocess.Preprocess
\ No newline at end of file
diff --git a/docs/user-guide/stem.md b/docs/user-guide/stem.md
index e69de29..01df96e 100644
--- a/docs/user-guide/stem.md
+++ b/docs/user-guide/stem.md
@@ -0,0 +1,3 @@
+## `stem` package
+
+::: klpt.stem.Stem
\ No newline at end of file
diff --git a/docs/user-guide/tokenize.md b/docs/user-guide/tokenize.md
index e69de29..186dabd 100644
--- a/docs/user-guide/tokenize.md
+++ b/docs/user-guide/tokenize.md
@@ -0,0 +1,3 @@
+## `tokenize` package
+
+::: klpt.tokenize.Tokenize
\ No newline at end of file
diff --git a/docs/user-guide/transliterate.md b/docs/user-guide/transliterate.md
index e69de29..3eea0eb 100644
--- a/docs/user-guide/transliterate.md
+++ b/docs/user-guide/transliterate.md
@@ -0,0 +1,3 @@
+## `transliterate` package
+
+::: klpt.transliterate.Transliterate
\ No newline at end of file
diff --git a/klpt/preprocess.py b/klpt/preprocess.py
index 67eb946..dd270f0 100644
--- a/klpt/preprocess.py
+++ b/klpt/preprocess.py
@@ -8,7 +8,7 @@
* created: 2020/05/10 02:10:54
* author: Sina Ahmadi
-
+
"""
import json
@@ -20,10 +20,35 @@
class Preprocess:
"""
- Text preprocessing by normalizing encodings and standardizing orthographies
-
- A class to deal with various text preprocessing tasks, particularly encoding normalization and orthographic standardization. Input encoding only in UTF-8.
-
+ This module deals with normalizing scripts and orthographies by using writing conventions based on dialects and scripts. The goal is not to correct the orthography but to normalize the text in terms of the encoding and common writing rules. The input encoding should be in UTF-8 only. To this end, three functions are provided as follows:
+
+ - `normalize`: deals with different encodings and unifies characters based on dialects and scripts
+ - `standardize`: given a normalized text, it returns standardized text based on the Kurdish orthographies following recommendations for [Kurmanji](https://books.google.ie/books?id=Z7lDnwEACAAJ) and [Sorani](http://yageyziman.com/Renusi_Kurdi.htm)
+ - `unify_numerals`: conversion of the various types of numerals used in Kurdish texts
+
+ It is recommended that the output of this module be used as the input of subsequent tasks in an NLP pipeline.
+
+ Example:
+
+ ```python
+ >>> from klpt.preprocess import Preprocess
+
+ >>> preprocessor_ckb = Preprocess("Sorani", "Arabic", numeral="Latin")
+ >>> preprocessor_ckb.normalize("لە ســـاڵەکانی ١٩٥٠دا")
+ 'لە ساڵەکانی 1950دا'
+ >>> preprocessor_ckb.standardize("راستە لەو ووڵاتەدا")
+ 'ڕاستە لەو وڵاتەدا'
+ >>> preprocessor_ckb.unify_numerals("٢٠٢٠")
+ '2020'
+
+ >>> preprocessor_kmr = Preprocess("Kurmanji", "Latin")
+ >>> preprocessor_kmr.standardize("di sala 2018-an")
+ 'di sala 2018an'
+ >>> preprocessor_kmr.standardize("hêviya")
+ 'hêvîya'
+ ```
+
+ The preprocessing rules are provided at [`data/preprocess_map.json`](https://github.com/sinaahmadi/klpt/blob/master/klpt/data/preprocess_map.json).
"""
def __init__(self, dialect, script, numeral="Latin"):
@@ -34,9 +59,7 @@ def __init__(self, dialect, script, numeral="Latin"):
dialect (str): the name of the dialect or its ISO 639-3 code
script (str): the name of the script
numeral (str): the type of the numeral
-
- preprocess_map (dict): a dictionary exported from a JSON file containing the mapping rules
-
+
"""
with open(klpt.get_data("data/preprocess_map.json")) as preprocess_file:
self.preprocess_map = json.load(preprocess_file)
@@ -141,5 +164,8 @@ def preprocess(self, text):
Arguments:
text (str): a string
+
+ Returns:
+ str: preprocessed text
"""
return self.unify_numerals(self.standardize(self.normalize(text)))
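The `unify_numerals` step described above can be sketched as a simple character translation, assuming a mapping from Eastern Arabic and Farsi digits onto Latin ones. This is a minimal illustration of the idea, not the toolkit's actual implementation (which reads its rules from `preprocess_map.json`).

```python
# Minimal sketch of numeral unification: map Eastern Arabic (٠-٩)
# and Farsi (۰-۹) digits onto Latin digits with a translation table.
# An illustration of the idea, not klpt's implementation.

ARABIC = "٠١٢٣٤٥٦٧٨٩"
FARSI = "۰۱۲۳۴۵۶۷۸۹"
LATIN = "0123456789"

_TABLE = str.maketrans(ARABIC + FARSI, LATIN + LATIN)

def unify_numerals(text):
    """Replace every Eastern Arabic or Farsi digit with its Latin counterpart."""
    return text.translate(_TABLE)

print(unify_numerals("٢٠٢٠"))  # → 2020
```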
diff --git a/klpt/stem.py b/klpt/stem.py
index 83158d2..3d55c3c 100644
--- a/klpt/stem.py
+++ b/klpt/stem.py
@@ -14,13 +14,27 @@
sys.path.append('../klpt')
import klpt
-class Stem():
- """The Stem class deals with various tasks as follows:
- - spell error detection and correction
- - morphological analysis
- - stemming
+class Stem:
+ """
- These tasks are carried out in the `Kurdish Hunspell project `_.
+ The Stem module deals with various tasks, mainly through the following functions:
+ - `check_spelling`: spell error detection
+ - `correct_spelling`: spell error correction
+ - `analyze`: morphological analysis
+
+ Please note that only Sorani is supported by this module in the current version. The module is based on the [Kurdish Hunspell project](https://github.com/sinaahmadi/KurdishHunspell).
+
+ Example:
+ ```python
+ >>> from klpt.stem import Stem
+ >>> stemmer = Stem("Sorani", "Arabic")
+ >>> stemmer.check_spelling("سوتاندبووت")
+ False
+ >>> stemmer.correct_spelling("سوتاندبووت")
+ (False, ['ستاندبووت', 'سووتاندبووت', 'سووڕاندبووت', 'ڕووتاندبووت', 'فەوتاندبووت', 'بووژاندبووت'])
+ >>> stemmer.analyze("دیتبامن")
+ [{'pos': 'verb', 'description': 'past_stem_transitive_active', 'base': 'دیت', 'terminal_suffix': 'بامن'}]
+ ```
"""
@@ -57,7 +71,10 @@ def check_spelling(self, word):
return self.huns.spell(word)
def correct_spelling(self, word):
- """Correct spelling errors if the input word is incorrect
+ """
+ Correct spelling errors if the input word is incorrect. It returns a tuple whose first element indicates the correctness of the word (True if correct, False if incorrect).
+ If the input word is incorrect, suggestions are provided in a list as the second element of the tuple, as (False, [...]).
+ If the word is incorrect but no suggestion is available, the list is returned empty, as (False, []).
Args:
word (str): input word to be spell-checked
@@ -66,9 +83,8 @@ def correct_spelling(self, word):
TypeError: only string as input
Returns:
- tuple (boolean, list): a tuple where the first element indicates the correctness of the word (True if correct, False if incorrect).
- If the input word is incorrect, suggestions are provided in a list as the second element of the tuple, as (False, []).
- If no suggestion is available, the list is returned empty as (True, []).
+ tuple (boolean, list)
+
"""
if not isinstance(word, str):
raise TypeError("Only a word (str) is allowed.")
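The `(is_correct, suggestions)` tuple convention described above can be consumed as sketched below. Since the real `Stem.correct_spelling` needs the Kurdish Hunspell dictionaries installed, a hypothetical stand-in function is used here purely to illustrate the return shape; it is not part of klpt's API.

```python
# Sketch of handling a correct_spelling-style (is_correct, suggestions)
# tuple. mock_correct_spelling is a hypothetical stand-in for
# Stem.correct_spelling, which requires the Hunspell dictionaries.

def mock_correct_spelling(word, lexicon, suggestions):
    """Return (True, []) for a known word, else (False, [candidates])."""
    if word in lexicon:
        return (True, [])
    return (False, suggestions.get(word, []))

lexicon = {"سووتاندبووت"}
suggestions = {"سوتاندبووت": ["سووتاندبووت"]}

is_correct, candidates = mock_correct_spelling("سوتاندبووت", lexicon, suggestions)
if not is_correct and candidates:
    print("did you mean:", candidates[0])
```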
@@ -78,8 +94,18 @@ def correct_spelling(self, word):
return (False, list(self.huns.suggest(word)))
def analyze(self, word_form):
- """Morphological analysis of a given word
- More details regarding Kurdish morphological analysis can be found at https://github.com/sinaahmadi/KurdishHunspell
+ """
+ Morphological analysis of a given word. More details regarding Kurdish morphological analysis can be found at [https://github.com/sinaahmadi/KurdishHunspell](https://github.com/sinaahmadi/KurdishHunspell).
+
+ It returns morphological analyses. The morphological analysis is returned as a dictionary as follows:
+
+ - "pos": the part-of-speech of the word-form according to [the Universal Dependency tag set](https://universaldependencies.org/u/pos/index.html).
+ - "description": the value of the `is` flag
+ - "terminal_suffix": anything except the `ts` flag
+ - "formation": if ds flag is set, its value is assigned to description and the value of formation is set to derivational. Although the majority of our morphological rules cover inflectional forms, it is not accurate to say all of them are inflectional. Therefore, we only set this value to derivational wherever we are sure.
+ - "base": `ts` flag. The definition of terminal suffix is a bit tricky in Hunspell. According to [the Hunspell documentation](http://manpages.ubuntu.com/manpages/trusty/en/man4/hunspell.4.html), "Terminal suffix fields are inflectional suffix fields "removed" by additional (not terminal) suffixes". In other words, the ts flag in Hunspell represents whatever is left after stripping all affixes. Therefore, it is the morphological base.
+
+ If the input cannot be analyzed morphologically, an empty list is returned.
Args:
word_form (str): a single word-form
@@ -90,14 +116,6 @@ def analyze(self, word_form):
Returns:
(list(dict)): a list of all possible morphological analyses according to the defined morphological rules
- The morphological analysis is returned as a dictionary as follows:
- - "pos": the part-of-speech of the word-form according to `the Universal Dependency tag set `_
- - "description": is flag
- - "terminal_suffix": anything except ts flag
- - "formation": if ds flag is set, its value is assigned to description and the value of formation is set to derivational. Although the majority of our morphological rules cover inflectional forms, it is not accurate to say all of them are inflectional. Therefore, we only set this value to derivational wherever we are sure.
- - "base": `ts` flag. The definition of terminal suffix is a bit tricky in Hunspell. According to `the Hunspell documentation `_, "Terminal suffix fields are inflectional suffix fields "removed" by additional (not terminal) suffixes". In other words, the ts flag in Hunspell represents whatever is left after stripping all affixes. Therefore, it is the morphological base.
-
- If the input cannot be analyzed morphologically, an empty list is returned.
"""
if not isinstance(word_form, str):
raise TypeError("Only a word (str) is allowed.")
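Given the dictionary fields described above (`pos`, `description`, `base`, `terminal_suffix`), a caller might extract morphological bases from the analyses as follows. This is a sketch over the sample output shown in the `Stem` docstring, not part of klpt itself.

```python
# Sketch: pull the morphological base out of analyze()-style output.
# The sample analysis below is copied from the Stem docstring example.

analyses = [{"pos": "verb",
             "description": "past_stem_transitive_active",
             "base": "دیت",
             "terminal_suffix": "بامن"}]

def bases(analyses):
    """Collect the 'base' field of every analysis; empty input stays empty."""
    return [a["base"] for a in analyses]

print(bases(analyses))   # → ['دیت']
print(bases([]))         # unanalyzable input yields an empty list
```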
diff --git a/klpt/tokenize.py b/klpt/tokenize.py
index a3fc98e..26dc960 100644
--- a/klpt/tokenize.py
+++ b/klpt/tokenize.py
@@ -19,7 +19,32 @@
import klpt
class Tokenize:
- """A class for tokenizing text in Sorani and Kurmanji Kurdish
+ """
+
+ This module focuses on the tokenization of both Kurmanji and Sorani dialects of Kurdish with the following functions:
+
+ - `word_tokenize`: tokenization of texts into tokens (both [multi-word expressions](https://aclweb.org/aclwiki/Multiword_Expressions) and single-word tokens).
+ - `mwe_tokenize`: tokenization of texts by only taking compound forms into account
+ - `sent_tokenize`: tokenization of texts into sentences
+
+ The module is based on the [Kurdish tokenization project](https://github.com/sinaahmadi/KurdishTokenization).
+
+ Example:
+
+ ```python
+ >>> from klpt.tokenize import Tokenize
+
+ >>> tokenizer = Tokenize("Kurmanji", "Latin")
+ >>> tokenizer.word_tokenize("ji bo fortê xwe avêtin")
+ ['▁ji▁', 'bo', '▁▁fortê‒xwe‒avêtin▁▁']
+ >>> tokenizer.mwe_tokenize("bi serokê hukûmeta herêma Kurdistanê Prof. Salih re saz kir.")
+ 'bi serokê hukûmeta herêma Kurdistanê Prof . Salih re saz kir .'
+
+ >>> tokenizer_ckb = Tokenize("Sorani", "Arabic")
+ >>> tokenizer_ckb.word_tokenize("بە هەموو هەمووانەوە ڕێک کەوتن")
+ ['▁بە▁', '▁هەموو▁', 'هەمووانەوە', '▁▁ڕێک‒کەوتن▁▁']
+ ```
+
"""
def __init__(self, dialect, script, numeral="Latin", separator='▁'):
@@ -51,7 +76,8 @@ def __init__(self, dialect, script, numeral="Latin", separator='▁'):
def mwe_tokenize(self, sentence, separator="▁▁", in_separator="‒", punct_marked=False, keep_form=False):
- """Multi-word expression tokenization
+ """
+ Multi-word expression tokenization
Args:
sentence (str): sentence to be split by multi-word expressions
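The separator marks in `word_tokenize`'s output ('▁' around single tokens, '▁▁' around and '‒' inside multi-word expressions) can be stripped in post-processing. The sketch below is based on the docstring example and assumes the default separators; it is not a klpt function.

```python
# Sketch: turn word_tokenize-style output back into plain tokens by
# stripping the '▁' token separators and replacing the intra-MWE
# joiner '‒' with a space. Input list copied from the docstring example.

tokens = ['▁ji▁', 'bo', '▁▁fortê‒xwe‒avêtin▁▁']

def detag(tokens, separator='▁', in_separator='‒'):
    """Strip separator marks and rejoin multi-word expressions with spaces."""
    return [t.strip(separator).replace(in_separator, ' ') for t in tokens]

print(detag(tokens))  # → ['ji', 'bo', 'fortê xwe avêtin']
```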
diff --git a/klpt/transliterate.py b/klpt/transliterate.py
index f7fc0b5..a926bce 100644
--- a/klpt/transliterate.py
+++ b/klpt/transliterate.py
@@ -21,7 +21,22 @@
class Transliterate:
"""
- A class for transliterating various Kurdish scripts.
+ This module transliterates one Kurdish script into another. Currently, only the Latin-based and the Arabic-based scripts of Sorani and Kurmanji are supported. The main function in this module is `transliterate()`, which also takes care of detecting the correct form of double-usage graphemes, namely و ↔ w/u and ی ↔ î/y. On some occasions, it can also predict the placement of the missing *i* (also known as *Bizroke/بزرۆکە*).
+
+ The module is based on the [Kurdish transliteration project](https://github.com/sinaahmadi/wergor).
+
+ Example:
+ ```python
+ >>> from klpt.transliterate import Transliterate
+ >>> transliterate = Transliterate("Kurmanji", "Latin", target_script="Arabic")
+ >>> transliterate.transliterate("rojhilata navîn")
+ 'رۆژهلاتا ناڤین'
+
+ >>> transliterate_ckb = Transliterate("Sorani", "Arabic", target_script="Latin")
+ >>> transliterate_ckb.transliterate("لە وڵاتەکانی دیکەدا")
+ 'le wiłatekanî dîkeda'
+ ```
+
"""
def __init__(self, dialect, script, target_script, unknown="�", numeral="Latin"):
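The character-mapping core of such a transliterator can be sketched with a toy table covering a few graphemes whose Latin↔Arabic correspondence is one-to-one. The table below is a tiny illustrative subset, not the project's actual mapping, which must additionally disambiguate و (w/u) and ی (î/y) and handle digraphs.

```python
# Toy sketch of Latin→Arabic script mapping for a few Kurdish graphemes
# with a one-to-one correspondence. The real transliterator also
# resolves the ambiguous graphemes و (w/u) and ی (î/y).

LATIN_TO_ARABIC = {
    "b": "ب", "r": "ر", "z": "ز", "m": "م", "n": "ن",
    "s": "س", "t": "ت", "k": "ک", "a": "ا", "ş": "ش",
}

def toy_transliterate(word, unknown="�"):
    """Map each character through the table; unmapped characters become `unknown`."""
    return "".join(LATIN_TO_ARABIC.get(ch, unknown) for ch in word)

print(toy_transliterate("bar"))  # → بار
```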
diff --git a/mkdocs.yml b/mkdocs.yml
index e2c1e6a..5a4ffc2 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -1,35 +1,37 @@
use_directory_urls: true
site_name: KLPT
-site_url: https://sinaahmadi.github.io/KLPT/
+site_url: https://sinaahmadi.github.io/klpt/
site_description: "Kurdish Language Processing Project"
site_author: Sina Ahmadi
copyright: Kurdish Language Processing Toolkit by Sina Ahmadi is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License
-repo_url: "https://github.com/sinaahmadi/KLPT"
+repo_url: "https://github.com/sinaahmadi/klpt"
theme:
name: null
custom_dir: cinder
- colorscheme: github
+ colorscheme: dracula
highlightjs: true
hljs_languages:
- python
+ - json
nav:
- Home: index.md
- User Guide:
- Getting started: user-guide/getting-started.md
- Preprocess: user-guide/preprocess.md
- - Stem: user-guide/stem.md
- Tokenize: user-guide/tokenize.md
- Transliterate: user-guide/transliterate.md
- - Configuration: user-guide/configuration.md
+ - Stem: user-guide/stem.md
- About:
- Release Notes: about/release-notes.md
- Contributing: about/contributing.md
+ - Sponsors: about/sponsors.md
- License: about/license.md
plugins:
- search
- mkdocstrings
+ - git-revision-date
diff --git a/site/about/contributing/index.html b/site/about/contributing/index.html
new file mode 100644
index 0000000..c0228e5
--- /dev/null
+++ b/site/about/contributing/index.html
@@ -0,0 +1,317 @@
+ Contributing - KLPT
One of our main objectives in this project is to promote collaborative projects with open-source outcomes. If you are generous and passionate to volunteer and help the Kurdish language, there are three ways you can do so:
+
+
If you are a native Kurdish speaker with general knowledge about Kurdish and are comfortable working with computers, contributing to collaboratively-curated resources is the best starting point, particularly to:
If you have expertise in Kurdish linguistics, you can take part in annotation tasks. Having a basic understanding of computational linguistics is a plus but not a must. Please get in touch by joining the KurdishNLP community on Gitter. Our collaborations oftentimes lead to a scientific paper, depending on the task. Please check the following repositories to find out about some of our previous projects:
If you are not included in 1 and 2 but have basic knowledge about Kurdish, particularly writing in Kurdish, you are invited to create content online. You can start a blog or tweet in Kurdish. After all, every single person is a contributor as well.
+
+
+
In any case, please follow this project and introduce it to your friends. Test the tool and raise your issues so that we can fix them.
Please note that KLPT is under development and some of the functionalities will appear in future versions. You can follow the progress of each task in the Projects section. In the current version, the following tasks are included:

| Modules | Tasks | Sorani (ckb) | Kurmanji (kmr) |
|---|---|---|---|
| `preprocess` | normalization | ✓ (v0.1.0) | ✓ (v0.1.0) |
| | standardization | ✓ (v0.1.0) | ✓ (v0.1.0) |
| | unification of numerals | ✓ (v0.1.0) | ✓ (v0.1.0) |
| `tokenize` | word tokenization | ✓ (v0.1.0) | ✓ (v0.1.0) |
| | MWE tokenization | ✓ (v0.1.0) | ✓ (v0.1.0) |
| | sentence tokenization | ✓ (v0.1.0) | ✓ (v0.1.0) |
| `transliterate` | Arabic to Latin | ✓ (v0.1.0) | ✓ (v0.1.0) |
| | Latin to Arabic | ✓ (v0.1.0) | ✓ (v0.1.0) |
| | detection of u/w and î/y | ✓ (v0.1.0) | ✓ (v0.1.0) |
| | detection of Bizroke (*i*) | ✗ | ✗ |
| `stem` | morphological analysis | ✓ (v0.1.0) | ✗ |
| | morphological generation | ✓ (v0.1.0) | ✗ |
| | stemming | ✗ | ✗ |
| | lemmatization | ✗ | ✗ |
| | spell error detection and correction | ✓ (v0.1.0) | ✗ |
Please consider donating to the project. Data annotation and resource creation require a tremendous amount of time and linguistic expertise. Even a trivial donation will make a difference. You can do so by becoming a sponsor to accompany me in this journey and help the Kurdish language have a better place among other natural languages on the Web. Depending on your support,
+
+
You can be an official sponsor
+
You will get a GitHub sponsor badge on your profile
+
If you have any questions, I will prioritize them
+
If you want, I will add your name or company logo on the front page of your preferred project
+
Your contribution will be acknowledged in one of my future papers in a field of your choice
+
+
Our sponsors:
+
Be the first one! 🙂
+
+
+
+
| Name/company | donation ($) | URL |
|---|---|---|
Language technology is an increasingly important field in our information era, dependent both on our knowledge of human language and on computational methods to process it. Unlike the latter, which undergoes constant progress as new methods and more efficient techniques are invented, the processability of human languages does not evolve at the same pace. This is particularly the case for languages with scarce resources and limited grammars, also known as less-resourced languages.
+
Despite a plethora of performant tools and specific frameworks for natural language processing (NLP), such as NLTK, Stanza and spaCy, progress with respect to less-resourced languages is often hindered not only by the lack of basic tools and resources but also by previous studies not being accessible under an open-source licence. This is particularly the case for Kurdish.
+
Kurdish Language
+
Kurdish is a less-resourced Indo-European language spoken by 20-30 million speakers in the Kurdish regions of Turkey, Iraq, Iran and Syria, and also among the Kurdish diaspora around the world. It is mainly spoken in four dialects (also referred to as languages):
Kurdish has been historically written in various scripts, namely Cyrillic, Armenian, Latin and Arabic, among which the latter two are still widely in use. Efforts to standardize the Kurdish alphabets and orthographies have not been universally adopted by Kurdish speakers in all regions. As such, the Kurmanji dialect is mostly written in the Latin-based script while Sorani, Southern Kurdish and Laki are mostly written in the Arabic-based script.
+
KLPT - The Kurdish Language Processing Toolkit
+
KLPT - the Kurdish Language Processing Toolkit is an NLP toolkit for the Kurdish language. The current version (0.1) comes with four core modules, namely preprocess, stem, transliterate and tokenize, and addresses basic language processing tasks such as text preprocessing, stemming, tokenization, spell error detection and correction, and morphological analysis for the Sorani and Kurmanji dialects of Kurdish. More importantly, it is an open-source project!
+
To find out more about how to use the tool, please check the "User Guide" section of this website.
+
Cite this project
+
Please consider citing this paper if you use any part of the data or the tool (bib file):
+
@inproceedings{ahmadi2020klpt,
+ title = "{KLPT} {--} {K}urdish Language Processing Toolkit",
+ author = "Ahmadi, Sina",
+ booktitle = "Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)",
+ month = nov,
+ year = "2020",
+ address = "Online",
+ publisher = "Association for Computational Linguistics",
+ url = "https://www.aclweb.org/anthology/2020.nlposs-1.11",
+ pages = "72--84"
+}
+
You are free to share, copy and redistribute the material in any medium or format and also adapt, remix, transform, and build upon the material
+for any purpose, even commercially.
+
You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
+
If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
+
diff --git a/site/search/lunr.js b/site/search/lunr.js
new file mode 100644
index 0000000..c353765
--- /dev/null
+++ b/site/search/lunr.js
@@ -0,0 +1,3475 @@
+/**
+ * lunr - http://lunrjs.com - A bit like Solr, but much smaller and not as bright - 2.3.8
+ * Copyright (C) 2019 Oliver Nightingale
+ * @license MIT
+ */
+
+;(function(){
+
+/**
+ * A convenience function for configuring and constructing
+ * a new lunr Index.
+ *
+ * A lunr.Builder instance is created and the pipeline setup
+ * with a trimmer, stop word filter and stemmer.
+ *
+ * This builder object is yielded to the configuration function
+ * that is passed as a parameter, allowing the list of fields
+ * and other builder parameters to be customised.
+ *
+ * All documents _must_ be added within the passed config function.
+ *
+ * @example
+ * var idx = lunr(function () {
+ * this.field('title')
+ * this.field('body')
+ * this.ref('id')
+ *
+ * documents.forEach(function (doc) {
+ * this.add(doc)
+ * }, this)
+ * })
+ *
+ * @see {@link lunr.Builder}
+ * @see {@link lunr.Pipeline}
+ * @see {@link lunr.trimmer}
+ * @see {@link lunr.stopWordFilter}
+ * @see {@link lunr.stemmer}
+ * @namespace {function} lunr
+ */
+var lunr = function (config) {
+ var builder = new lunr.Builder
+
+ builder.pipeline.add(
+ lunr.trimmer,
+ lunr.stopWordFilter,
+ lunr.stemmer
+ )
+
+ builder.searchPipeline.add(
+ lunr.stemmer
+ )
+
+ config.call(builder, builder)
+ return builder.build()
+}
+
+lunr.version = "2.3.8"
+/*!
+ * lunr.utils
+ * Copyright (C) 2019 Oliver Nightingale
+ */
+
+/**
+ * A namespace containing utils for the rest of the lunr library
+ * @namespace lunr.utils
+ */
+lunr.utils = {}
+
+/**
+ * Print a warning message to the console.
+ *
+ * @param {String} message The message to be printed.
+ * @memberOf lunr.utils
+ * @function
+ */
+lunr.utils.warn = (function (global) {
+ /* eslint-disable no-console */
+ return function (message) {
+ if (global.console && console.warn) {
+ console.warn(message)
+ }
+ }
+ /* eslint-enable no-console */
+})(this)
+
+/**
+ * Convert an object to a string.
+ *
+ * In the case of `null` and `undefined` the function returns
+ * the empty string, in all other cases the result of calling
+ * `toString` on the passed object is returned.
+ *
+ * @param {Any} obj The object to convert to a string.
+ * @return {String} string representation of the passed object.
+ * @memberOf lunr.utils
+ */
+lunr.utils.asString = function (obj) {
+ if (obj === void 0 || obj === null) {
+ return ""
+ } else {
+ return obj.toString()
+ }
+}
+
+/**
+ * Clones an object.
+ *
+ * Will create a copy of an existing object such that any mutations
+ * on the copy cannot affect the original.
+ *
+ * Only shallow objects are supported, passing a nested object to this
+ * function will cause a TypeError.
+ *
+ * Objects with primitives, and arrays of primitives are supported.
+ *
+ * @param {Object} obj The object to clone.
+ * @return {Object} a clone of the passed object.
+ * @throws {TypeError} when a nested object is passed.
+ * @memberOf Utils
+ */
+lunr.utils.clone = function (obj) {
+ if (obj === null || obj === undefined) {
+ return obj
+ }
+
+ var clone = Object.create(null),
+ keys = Object.keys(obj)
+
+ for (var i = 0; i < keys.length; i++) {
+ var key = keys[i],
+ val = obj[key]
+
+ if (Array.isArray(val)) {
+ clone[key] = val.slice()
+ continue
+ }
+
+ if (typeof val === 'string' ||
+ typeof val === 'number' ||
+ typeof val === 'boolean') {
+ clone[key] = val
+ continue
+ }
+
+ throw new TypeError("clone is not deep and does not support nested objects")
+ }
+
+ return clone
+}
+lunr.FieldRef = function (docRef, fieldName, stringValue) {
+ this.docRef = docRef
+ this.fieldName = fieldName
+ this._stringValue = stringValue
+}
+
+lunr.FieldRef.joiner = "/"
+
+lunr.FieldRef.fromString = function (s) {
+ var n = s.indexOf(lunr.FieldRef.joiner)
+
+ if (n === -1) {
+ throw "malformed field ref string"
+ }
+
+ var fieldRef = s.slice(0, n),
+ docRef = s.slice(n + 1)
+
+ return new lunr.FieldRef (docRef, fieldRef, s)
+}
+
+lunr.FieldRef.prototype.toString = function () {
+ if (this._stringValue == undefined) {
+ this._stringValue = this.fieldName + lunr.FieldRef.joiner + this.docRef
+ }
+
+ return this._stringValue
+}
+/*!
+ * lunr.Set
+ * Copyright (C) 2019 Oliver Nightingale
+ */
+
+/**
+ * A lunr set.
+ *
+ * @constructor
+ */
+lunr.Set = function (elements) {
+ this.elements = Object.create(null)
+
+ if (elements) {
+ this.length = elements.length
+
+ for (var i = 0; i < this.length; i++) {
+ this.elements[elements[i]] = true
+ }
+ } else {
+ this.length = 0
+ }
+}
+
+/**
+ * A complete set that contains all elements.
+ *
+ * @static
+ * @readonly
+ * @type {lunr.Set}
+ */
+lunr.Set.complete = {
+ intersect: function (other) {
+ return other
+ },
+
+ union: function (other) {
+ return other
+ },
+
+ contains: function () {
+ return true
+ }
+}
+
+/**
+ * An empty set that contains no elements.
+ *
+ * @static
+ * @readonly
+ * @type {lunr.Set}
+ */
+lunr.Set.empty = {
+ intersect: function () {
+ return this
+ },
+
+ union: function (other) {
+ return other
+ },
+
+ contains: function () {
+ return false
+ }
+}
+
+/**
+ * Returns true if this set contains the specified object.
+ *
+ * @param {object} object - Object whose presence in this set is to be tested.
+ * @returns {boolean} - True if this set contains the specified object.
+ */
+lunr.Set.prototype.contains = function (object) {
+ return !!this.elements[object]
+}
+
+/**
+ * Returns a new set containing only the elements that are present in both
+ * this set and the specified set.
+ *
+ * @param {lunr.Set} other - set to intersect with this set.
+ * @returns {lunr.Set} a new set that is the intersection of this and the specified set.
+ */
+
+lunr.Set.prototype.intersect = function (other) {
+ var a, b, elements, intersection = []
+
+ if (other === lunr.Set.complete) {
+ return this
+ }
+
+ if (other === lunr.Set.empty) {
+ return other
+ }
+
+ if (this.length < other.length) {
+ a = this
+ b = other
+ } else {
+ a = other
+ b = this
+ }
+
+ elements = Object.keys(a.elements)
+
+ for (var i = 0; i < elements.length; i++) {
+ var element = elements[i]
+ if (element in b.elements) {
+ intersection.push(element)
+ }
+ }
+
+ return new lunr.Set (intersection)
+}
+
+/**
+ * Returns a new set combining the elements of this and the specified set.
+ *
+ * @param {lunr.Set} other - set to union with this set.
+ * @return {lunr.Set} a new set that is the union of this and the specified set.
+ */
+
+lunr.Set.prototype.union = function (other) {
+ if (other === lunr.Set.complete) {
+ return lunr.Set.complete
+ }
+
+ if (other === lunr.Set.empty) {
+ return this
+ }
+
+ return new lunr.Set(Object.keys(this.elements).concat(Object.keys(other.elements)))
+}
+/**
+ * A function to calculate the inverse document frequency for
+ * a posting. This is shared between the builder and the index
+ *
+ * @private
+ * @param {object} posting - The posting for a given term
+ * @param {number} documentCount - The total number of documents.
+ */
+lunr.idf = function (posting, documentCount) {
+ var documentsWithTerm = 0
+
+ for (var fieldName in posting) {
+    if (fieldName == '_index') continue // Ignore the term index, it's not a field
+ documentsWithTerm += Object.keys(posting[fieldName]).length
+ }
+
+ var x = (documentCount - documentsWithTerm + 0.5) / (documentsWithTerm + 0.5)
+
+ return Math.log(1 + Math.abs(x))
+}
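The IDF formula above is the BM25-style variant: terms appearing in few documents score high, terms appearing in most documents score near zero. A standalone sketch of just the arithmetic (taking the document-with-term count directly, rather than lunr's posting object):

```javascript
// Sketch of lunr.idf's formula, with documentsWithTerm passed in directly.
function idf(documentsWithTerm, documentCount) {
  var x = (documentCount - documentsWithTerm + 0.5) / (documentsWithTerm + 0.5)
  return Math.log(1 + Math.abs(x))
}

// rare terms score much higher than common ones
idf(1, 100)  // ≈ 4.21
idf(90, 100) // ≈ 0.11
```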
+
+/**
+ * A token wraps a string representation of a token
+ * as it is passed through the text processing pipeline.
+ *
+ * @constructor
+ * @param {string} [str=''] - The string token being wrapped.
+ * @param {object} [metadata={}] - Metadata associated with this token.
+ */
+lunr.Token = function (str, metadata) {
+ this.str = str || ""
+ this.metadata = metadata || {}
+}
+
+/**
+ * Returns the token string that is being wrapped by this object.
+ *
+ * @returns {string}
+ */
+lunr.Token.prototype.toString = function () {
+ return this.str
+}
+
+/**
+ * A token update function is used when updating or optionally
+ * when cloning a token.
+ *
+ * @callback lunr.Token~updateFunction
+ * @param {string} str - The string representation of the token.
+ * @param {Object} metadata - All metadata associated with this token.
+ */
+
+/**
+ * Applies the given function to the wrapped string token.
+ *
+ * @example
+ * token.update(function (str, metadata) {
+ * return str.toUpperCase()
+ * })
+ *
+ * @param {lunr.Token~updateFunction} fn - A function to apply to the token string.
+ * @returns {lunr.Token}
+ */
+lunr.Token.prototype.update = function (fn) {
+ this.str = fn(this.str, this.metadata)
+ return this
+}
+
+/**
+ * Creates a clone of this token. Optionally a function can be
+ * applied to the cloned token.
+ *
+ * @param {lunr.Token~updateFunction} [fn] - An optional function to apply to the cloned token.
+ * @returns {lunr.Token}
+ */
+lunr.Token.prototype.clone = function (fn) {
+ fn = fn || function (s) { return s }
+ return new lunr.Token (fn(this.str, this.metadata), this.metadata)
+}
+/*!
+ * lunr.tokenizer
+ * Copyright (C) 2019 Oliver Nightingale
+ */
+
+/**
+ * A function for splitting a string into tokens ready to be inserted into
+ * the search index. Uses `lunr.tokenizer.separator` to split strings; change
+ * the value of this property to change how strings are split into tokens.
+ *
+ * This tokenizer will convert its parameter to a string by calling `toString` and
+ * then will split this string on the character in `lunr.tokenizer.separator`.
+ * Arrays will have their elements converted to strings and wrapped in a lunr.Token.
+ *
+ * Optional metadata can be passed to the tokenizer; this metadata will be cloned and
+ * added as metadata to every token that is created from the object to be tokenized.
+ *
+ * @static
+ * @param {?(string|object|object[])} obj - The object to convert into tokens
+ * @param {?object} metadata - Optional metadata to associate with every token
+ * @returns {lunr.Token[]}
+ * @see {@link lunr.Pipeline}
+ */
+lunr.tokenizer = function (obj, metadata) {
+ if (obj == null || obj == undefined) {
+ return []
+ }
+
+ if (Array.isArray(obj)) {
+ return obj.map(function (t) {
+ return new lunr.Token(
+ lunr.utils.asString(t).toLowerCase(),
+ lunr.utils.clone(metadata)
+ )
+ })
+ }
+
+ var str = obj.toString().toLowerCase(),
+ len = str.length,
+ tokens = []
+
+ for (var sliceEnd = 0, sliceStart = 0; sliceEnd <= len; sliceEnd++) {
+ var char = str.charAt(sliceEnd),
+ sliceLength = sliceEnd - sliceStart
+
+ if ((char.match(lunr.tokenizer.separator) || sliceEnd == len)) {
+
+ if (sliceLength > 0) {
+ var tokenMetadata = lunr.utils.clone(metadata) || {}
+ tokenMetadata["position"] = [sliceStart, sliceLength]
+ tokenMetadata["index"] = tokens.length
+
+ tokens.push(
+ new lunr.Token (
+ str.slice(sliceStart, sliceEnd),
+ tokenMetadata
+ )
+ )
+ }
+
+ sliceStart = sliceEnd + 1
+ }
+
+ }
+
+ return tokens
+}
+
+/**
+ * The separator used to split a string into tokens. Override this property to change
+ * the behaviour of `lunr.tokenizer` when tokenizing strings. By default this splits on whitespace and hyphens.
+ *
+ * @static
+ * @see lunr.tokenizer
+ */
+lunr.tokenizer.separator = /[\s\-]+/
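The splitting behaviour above can be illustrated with a simplified sketch: lowercase the input, then break it on the separator. (The real tokenizer additionally records position and index metadata on each `lunr.Token`, which this sketch omits.)

```javascript
// Simplified sketch of lunr.tokenizer's splitting: lowercase, then split
// on whitespace and hyphens, discarding empty slices.
var separator = /[\s\-]+/

function tokenize(str) {
  return str.toString().toLowerCase().split(separator).filter(function (t) {
    return t.length > 0
  })
}

tokenize('Hello-World  foo') // ['hello', 'world', 'foo']
```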
+/*!
+ * lunr.Pipeline
+ * Copyright (C) 2019 Oliver Nightingale
+ */
+
+/**
+ * lunr.Pipelines maintain an ordered list of functions to be applied to all
+ * tokens in documents entering the search index and queries being run against
+ * the index.
+ *
+ * An instance of lunr.Index created with the lunr shortcut will contain a
+ * pipeline with a stop word filter and an English language stemmer. Extra
+ * functions can be added before or after either of these functions or these
+ * default functions can be removed.
+ *
+ * When run the pipeline will call each function in turn, passing a token, the
+ * index of that token in the original list of all tokens and finally a list of
+ * all the original tokens.
+ *
+ * The output of functions in the pipeline will be passed to the next function
+ * in the pipeline. To exclude a token from entering the index the function
+ * should return undefined; the rest of the pipeline will not be called with
+ * this token.
+ *
+ * For serialisation of pipelines to work, all functions used in an instance of
+ * a pipeline should be registered with lunr.Pipeline. Registered functions can
+ * then be loaded. If trying to load a serialised pipeline that uses functions
+ * that are not registered an error will be thrown.
+ *
+ * If not planning on serialising the pipeline then registering pipeline functions
+ * is not necessary.
+ *
+ * @constructor
+ */
+lunr.Pipeline = function () {
+ this._stack = []
+}
+
+lunr.Pipeline.registeredFunctions = Object.create(null)
+
+/**
+ * A pipeline function maps lunr.Token to lunr.Token. A lunr.Token contains the token
+ * string as well as all known metadata. A pipeline function can mutate the token string
+ * or mutate (or add) metadata for a given token.
+ *
+ * A pipeline function can indicate that the passed token should be discarded by returning
+ * null, undefined or an empty string. This token will not be passed to any downstream pipeline
+ * functions and will not be added to the index.
+ *
+ * Multiple tokens can be returned by returning an array of tokens. Each token will be passed
+ * to any downstream pipeline functions and all returned tokens will be added to the index.
+ *
+ * Any number of pipeline functions may be chained together using a lunr.Pipeline.
+ *
+ * @interface lunr.PipelineFunction
+ * @param {lunr.Token} token - A token from the document being processed.
+ * @param {number} i - The index of this token in the complete list of tokens for this document/field.
+ * @param {lunr.Token[]} tokens - All tokens for this document/field.
+ * @returns {(?lunr.Token|lunr.Token[])}
+ */
+
+/**
+ * Register a function with the pipeline.
+ *
+ * Functions that are used in the pipeline should be registered if the pipeline
+ * needs to be serialised, or a serialised pipeline needs to be loaded.
+ *
+ * Registering a function does not add it to a pipeline, functions must still be
+ * added to instances of the pipeline for them to be used when running a pipeline.
+ *
+ * @param {lunr.PipelineFunction} fn - The function to register.
+ * @param {String} label - The label to register this function with.
+ */
+lunr.Pipeline.registerFunction = function (fn, label) {
+ if (label in this.registeredFunctions) {
+ lunr.utils.warn('Overwriting existing registered function: ' + label)
+ }
+
+ fn.label = label
+ lunr.Pipeline.registeredFunctions[fn.label] = fn
+}
+
+/**
+ * Warns if the function is not registered as a Pipeline function.
+ *
+ * @param {lunr.PipelineFunction} fn - The function to check for.
+ * @private
+ */
+lunr.Pipeline.warnIfFunctionNotRegistered = function (fn) {
+ var isRegistered = fn.label && (fn.label in this.registeredFunctions)
+
+ if (!isRegistered) {
+ lunr.utils.warn('Function is not registered with pipeline. This may cause problems when serialising the index.\n', fn)
+ }
+}
+
+/**
+ * Loads a previously serialised pipeline.
+ *
+ * All functions to be loaded must already be registered with lunr.Pipeline.
+ * If any function from the serialised data has not been registered then an
+ * error will be thrown.
+ *
+ * @param {Object} serialised - The serialised pipeline to load.
+ * @returns {lunr.Pipeline}
+ */
+lunr.Pipeline.load = function (serialised) {
+ var pipeline = new lunr.Pipeline
+
+ serialised.forEach(function (fnName) {
+ var fn = lunr.Pipeline.registeredFunctions[fnName]
+
+ if (fn) {
+ pipeline.add(fn)
+ } else {
+ throw new Error('Cannot load unregistered function: ' + fnName)
+ }
+ })
+
+ return pipeline
+}
+
+/**
+ * Adds new functions to the end of the pipeline.
+ *
+ * Logs a warning if the function has not been registered.
+ *
+ * @param {lunr.PipelineFunction[]} functions - Any number of functions to add to the pipeline.
+ */
+lunr.Pipeline.prototype.add = function () {
+ var fns = Array.prototype.slice.call(arguments)
+
+ fns.forEach(function (fn) {
+ lunr.Pipeline.warnIfFunctionNotRegistered(fn)
+ this._stack.push(fn)
+ }, this)
+}
+
+/**
+ * Adds a single function after a function that already exists in the
+ * pipeline.
+ *
+ * Logs a warning if the function has not been registered.
+ *
+ * @param {lunr.PipelineFunction} existingFn - A function that already exists in the pipeline.
+ * @param {lunr.PipelineFunction} newFn - The new function to add to the pipeline.
+ */
+lunr.Pipeline.prototype.after = function (existingFn, newFn) {
+ lunr.Pipeline.warnIfFunctionNotRegistered(newFn)
+
+ var pos = this._stack.indexOf(existingFn)
+ if (pos == -1) {
+ throw new Error('Cannot find existingFn')
+ }
+
+ pos = pos + 1
+ this._stack.splice(pos, 0, newFn)
+}
+
+/**
+ * Adds a single function before a function that already exists in the
+ * pipeline.
+ *
+ * Logs a warning if the function has not been registered.
+ *
+ * @param {lunr.PipelineFunction} existingFn - A function that already exists in the pipeline.
+ * @param {lunr.PipelineFunction} newFn - The new function to add to the pipeline.
+ */
+lunr.Pipeline.prototype.before = function (existingFn, newFn) {
+ lunr.Pipeline.warnIfFunctionNotRegistered(newFn)
+
+ var pos = this._stack.indexOf(existingFn)
+ if (pos == -1) {
+ throw new Error('Cannot find existingFn')
+ }
+
+ this._stack.splice(pos, 0, newFn)
+}
+
+/**
+ * Removes a function from the pipeline.
+ *
+ * @param {lunr.PipelineFunction} fn The function to remove from the pipeline.
+ */
+lunr.Pipeline.prototype.remove = function (fn) {
+ var pos = this._stack.indexOf(fn)
+ if (pos == -1) {
+ return
+ }
+
+ this._stack.splice(pos, 1)
+}
+
+/**
+ * Runs the current list of functions that make up the pipeline against the
+ * passed tokens.
+ *
+ * @param {Array} tokens The tokens to run through the pipeline.
+ * @returns {Array}
+ */
+lunr.Pipeline.prototype.run = function (tokens) {
+ var stackLength = this._stack.length
+
+ for (var i = 0; i < stackLength; i++) {
+ var fn = this._stack[i]
+ var memo = []
+
+ for (var j = 0; j < tokens.length; j++) {
+ var result = fn(tokens[j], j, tokens)
+
+ if (result === null || result === void 0 || result === '') continue
+
+ if (Array.isArray(result)) {
+ for (var k = 0; k < result.length; k++) {
+ memo.push(result[k])
+ }
+ } else {
+ memo.push(result)
+ }
+ }
+
+ tokens = memo
+ }
+
+ return tokens
+}
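The `run` semantics above are worth seeing in isolation: each stage may return a token (kept), an array (expanded into several tokens), or nothing (dropped). A standalone sketch with hypothetical stages, using plain strings instead of `lunr.Token`:

```javascript
// Sketch of Pipeline.run: each stage maps a token to a token,
// an array of tokens, or undefined (which drops the token).
function runPipeline(stack, tokens) {
  for (var i = 0; i < stack.length; i++) {
    var memo = []
    for (var j = 0; j < tokens.length; j++) {
      var result = stack[i](tokens[j], j, tokens)
      if (result === null || result === undefined || result === '') continue
      if (Array.isArray(result)) memo = memo.concat(result)
      else memo.push(result)
    }
    tokens = memo
  }
  return tokens
}

var stages = [
  function (t) { return t === 'the' ? undefined : t },          // drop a stop word
  function (t) { return t === 'full-text' ? t.split('-') : t }  // expand into two tokens
]

runPipeline(stages, ['the', 'full-text', 'search']) // ['full', 'text', 'search']
```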
+
+/**
+ * Convenience method for passing a string through a pipeline and getting
+ * strings out. This method takes care of wrapping the passed string in a
+ * token and mapping the resulting tokens back to strings.
+ *
+ * @param {string} str - The string to pass through the pipeline.
+ * @param {?object} metadata - Optional metadata to associate with the token
+ * passed to the pipeline.
+ * @returns {string[]}
+ */
+lunr.Pipeline.prototype.runString = function (str, metadata) {
+ var token = new lunr.Token (str, metadata)
+
+ return this.run([token]).map(function (t) {
+ return t.toString()
+ })
+}
+
+/**
+ * Resets the pipeline by removing any existing processors.
+ *
+ */
+lunr.Pipeline.prototype.reset = function () {
+ this._stack = []
+}
+
+/**
+ * Returns a representation of the pipeline ready for serialisation.
+ *
+ * Logs a warning if the function has not been registered.
+ *
+ * @returns {Array}
+ */
+lunr.Pipeline.prototype.toJSON = function () {
+ return this._stack.map(function (fn) {
+ lunr.Pipeline.warnIfFunctionNotRegistered(fn)
+
+ return fn.label
+ })
+}
+/*!
+ * lunr.Vector
+ * Copyright (C) 2019 Oliver Nightingale
+ */
+
+/**
+ * A vector is used to construct the vector space of documents and queries. These
+ * vectors support operations to determine the similarity between two documents or
+ * a document and a query.
+ *
+ * Normally no parameters are required for initializing a vector, but in the case of
+ * loading a previously dumped vector the raw elements can be provided to the constructor.
+ *
+ * For performance reasons vectors are implemented with a flat array, where an element's
+ * index is immediately followed by its value. E.g. [index, value, index, value]. This
+ * allows the underlying array to be as sparse as possible and still offer decent
+ * performance when being used for vector calculations.
+ *
+ * @constructor
+ * @param {Number[]} [elements] - The flat list of element index and element value pairs.
+ */
+lunr.Vector = function (elements) {
+ this._magnitude = 0
+ this.elements = elements || []
+}
+
+
+/**
+ * Calculates the position within the vector to insert a given index.
+ *
+ * This is used internally by insert and upsert. If there are duplicate indexes then
+ * the position is returned as if the value for that index were to be updated, but it
+ * is the caller's responsibility to check whether there is a duplicate at that index.
+ *
+ * @param {Number} index - The index whose position within the vector should be calculated.
+ * @returns {Number}
+ */
+lunr.Vector.prototype.positionForIndex = function (index) {
+ // For an empty vector the tuple can be inserted at the beginning
+ if (this.elements.length == 0) {
+ return 0
+ }
+
+ var start = 0,
+ end = this.elements.length / 2,
+ sliceLength = end - start,
+ pivotPoint = Math.floor(sliceLength / 2),
+ pivotIndex = this.elements[pivotPoint * 2]
+
+ while (sliceLength > 1) {
+ if (pivotIndex < index) {
+ start = pivotPoint
+ }
+
+ if (pivotIndex > index) {
+ end = pivotPoint
+ }
+
+ if (pivotIndex == index) {
+ break
+ }
+
+ sliceLength = end - start
+ pivotPoint = start + Math.floor(sliceLength / 2)
+ pivotIndex = this.elements[pivotPoint * 2]
+ }
+
+ if (pivotIndex == index) {
+ return pivotPoint * 2
+ }
+
+ if (pivotIndex > index) {
+ return pivotPoint * 2
+ }
+
+ if (pivotIndex < index) {
+ return (pivotPoint + 1) * 2
+ }
+}
+
+/**
+ * Inserts an element at an index within the vector.
+ *
+ * Does not allow duplicates; will throw an error if there is already an entry
+ * for this index.
+ *
+ * @param {Number} insertIdx - The index at which the element should be inserted.
+ * @param {Number} val - The value to be inserted into the vector.
+ */
+lunr.Vector.prototype.insert = function (insertIdx, val) {
+ this.upsert(insertIdx, val, function () {
+ throw "duplicate index"
+ })
+}
+
+/**
+ * Inserts or updates an existing index within the vector.
+ *
+ * @param {Number} insertIdx - The index at which the element should be inserted.
+ * @param {Number} val - The value to be inserted into the vector.
+ * @param {function} fn - A function that is called for updates, the existing value and the
+ * requested value are passed as arguments
+ */
+lunr.Vector.prototype.upsert = function (insertIdx, val, fn) {
+ this._magnitude = 0
+ var position = this.positionForIndex(insertIdx)
+
+ if (this.elements[position] == insertIdx) {
+ this.elements[position + 1] = fn(this.elements[position + 1], val)
+ } else {
+ this.elements.splice(position, 0, insertIdx, val)
+ }
+}
+
+/**
+ * Calculates the magnitude of this vector.
+ *
+ * @returns {Number}
+ */
+lunr.Vector.prototype.magnitude = function () {
+ if (this._magnitude) return this._magnitude
+
+ var sumOfSquares = 0,
+ elementsLength = this.elements.length
+
+ for (var i = 1; i < elementsLength; i += 2) {
+ var val = this.elements[i]
+ sumOfSquares += val * val
+ }
+
+ return this._magnitude = Math.sqrt(sumOfSquares)
+}
+
+/**
+ * Calculates the dot product of this vector and another vector.
+ *
+ * @param {lunr.Vector} otherVector - The vector to compute the dot product with.
+ * @returns {Number}
+ */
+lunr.Vector.prototype.dot = function (otherVector) {
+ var dotProduct = 0,
+ a = this.elements, b = otherVector.elements,
+ aLen = a.length, bLen = b.length,
+ aVal = 0, bVal = 0,
+ i = 0, j = 0
+
+ while (i < aLen && j < bLen) {
+ aVal = a[i], bVal = b[j]
+ if (aVal < bVal) {
+ i += 2
+ } else if (aVal > bVal) {
+ j += 2
+ } else if (aVal == bVal) {
+ dotProduct += a[i + 1] * b[j + 1]
+ i += 2
+ j += 2
+ }
+ }
+
+ return dotProduct
+}
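Because both vectors store `[index, value, index, value, ...]` pairs sorted by index, the dot product above is a single merge-style pass over the two index lists. A standalone sketch of that walk:

```javascript
// Sketch of the sparse dot product: advance whichever pointer has the
// smaller index; multiply values only when the indexes match.
function dot(a, b) {
  var sum = 0, i = 0, j = 0
  while (i < a.length && j < b.length) {
    if (a[i] < b[j]) i += 2        // index only present in a
    else if (a[i] > b[j]) j += 2   // index only present in b
    else { sum += a[i + 1] * b[j + 1]; i += 2; j += 2 }
  }
  return sum
}

// a = {1: 2, 5: 3}, b = {1: 4, 5: 1, 7: 2}
dot([1, 2, 5, 3], [1, 4, 5, 1, 7, 2]) // 2*4 + 3*1 = 11
```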
+
+/**
+ * Calculates the similarity between this vector and another vector.
+ *
+ * @param {lunr.Vector} otherVector - The other vector to calculate the
+ * similarity with.
+ * @returns {Number}
+ */
+lunr.Vector.prototype.similarity = function (otherVector) {
+ return this.dot(otherVector) / this.magnitude() || 0
+}
+
+/**
+ * Converts the vector to an array of the elements within the vector.
+ *
+ * @returns {Number[]}
+ */
+lunr.Vector.prototype.toArray = function () {
+ var output = new Array (this.elements.length / 2)
+
+ for (var i = 1, j = 0; i < this.elements.length; i += 2, j++) {
+ output[j] = this.elements[i]
+ }
+
+ return output
+}
+
+/**
+ * A JSON serializable representation of the vector.
+ *
+ * @returns {Number[]}
+ */
+lunr.Vector.prototype.toJSON = function () {
+ return this.elements
+}
+/* eslint-disable */
+/*!
+ * lunr.stemmer
+ * Copyright (C) 2019 Oliver Nightingale
+ * Includes code from - http://tartarus.org/~martin/PorterStemmer/js.txt
+ */
+
+/**
+ * lunr.stemmer is an english language stemmer, this is a JavaScript
+ * implementation of the PorterStemmer taken from http://tartarus.org/~martin
+ *
+ * @static
+ * @implements {lunr.PipelineFunction}
+ * @param {lunr.Token} token - The string to stem
+ * @returns {lunr.Token}
+ * @see {@link lunr.Pipeline}
+ * @function
+ */
+lunr.stemmer = (function(){
+ var step2list = {
+ "ational" : "ate",
+ "tional" : "tion",
+ "enci" : "ence",
+ "anci" : "ance",
+ "izer" : "ize",
+ "bli" : "ble",
+ "alli" : "al",
+ "entli" : "ent",
+ "eli" : "e",
+ "ousli" : "ous",
+ "ization" : "ize",
+ "ation" : "ate",
+ "ator" : "ate",
+ "alism" : "al",
+ "iveness" : "ive",
+ "fulness" : "ful",
+ "ousness" : "ous",
+ "aliti" : "al",
+ "iviti" : "ive",
+ "biliti" : "ble",
+ "logi" : "log"
+ },
+
+ step3list = {
+ "icate" : "ic",
+ "ative" : "",
+ "alize" : "al",
+ "iciti" : "ic",
+ "ical" : "ic",
+ "ful" : "",
+ "ness" : ""
+ },
+
+ c = "[^aeiou]", // consonant
+ v = "[aeiouy]", // vowel
+ C = c + "[^aeiouy]*", // consonant sequence
+ V = v + "[aeiou]*", // vowel sequence
+
+ mgr0 = "^(" + C + ")?" + V + C, // [C]VC... is m>0
+ meq1 = "^(" + C + ")?" + V + C + "(" + V + ")?$", // [C]VC[V] is m=1
+ mgr1 = "^(" + C + ")?" + V + C + V + C, // [C]VCVC... is m>1
+ s_v = "^(" + C + ")?" + v; // vowel in stem
+
+ var re_mgr0 = new RegExp(mgr0);
+ var re_mgr1 = new RegExp(mgr1);
+ var re_meq1 = new RegExp(meq1);
+ var re_s_v = new RegExp(s_v);
+
+ var re_1a = /^(.+?)(ss|i)es$/;
+ var re2_1a = /^(.+?)([^s])s$/;
+ var re_1b = /^(.+?)eed$/;
+ var re2_1b = /^(.+?)(ed|ing)$/;
+ var re_1b_2 = /.$/;
+ var re2_1b_2 = /(at|bl|iz)$/;
+ var re3_1b_2 = new RegExp("([^aeiouylsz])\\1$");
+ var re4_1b_2 = new RegExp("^" + C + v + "[^aeiouwxy]$");
+
+ var re_1c = /^(.+?[^aeiou])y$/;
+ var re_2 = /^(.+?)(ational|tional|enci|anci|izer|bli|alli|entli|eli|ousli|ization|ation|ator|alism|iveness|fulness|ousness|aliti|iviti|biliti|logi)$/;
+
+ var re_3 = /^(.+?)(icate|ative|alize|iciti|ical|ful|ness)$/;
+
+ var re_4 = /^(.+?)(al|ance|ence|er|ic|able|ible|ant|ement|ment|ent|ou|ism|ate|iti|ous|ive|ize)$/;
+ var re2_4 = /^(.+?)(s|t)(ion)$/;
+
+ var re_5 = /^(.+?)e$/;
+ var re_5_1 = /ll$/;
+ var re3_5 = new RegExp("^" + C + v + "[^aeiouwxy]$");
+
+ var porterStemmer = function porterStemmer(w) {
+ var stem,
+ suffix,
+ firstch,
+ re,
+ re2,
+ re3,
+ re4;
+
+ if (w.length < 3) { return w; }
+
+ firstch = w.substr(0,1);
+ if (firstch == "y") {
+ w = firstch.toUpperCase() + w.substr(1);
+ }
+
+ // Step 1a
+ re = re_1a
+ re2 = re2_1a;
+
+ if (re.test(w)) { w = w.replace(re,"$1$2"); }
+ else if (re2.test(w)) { w = w.replace(re2,"$1$2"); }
+
+ // Step 1b
+ re = re_1b;
+ re2 = re2_1b;
+ if (re.test(w)) {
+ var fp = re.exec(w);
+ re = re_mgr0;
+ if (re.test(fp[1])) {
+ re = re_1b_2;
+ w = w.replace(re,"");
+ }
+ } else if (re2.test(w)) {
+ var fp = re2.exec(w);
+ stem = fp[1];
+ re2 = re_s_v;
+ if (re2.test(stem)) {
+ w = stem;
+ re2 = re2_1b_2;
+ re3 = re3_1b_2;
+ re4 = re4_1b_2;
+ if (re2.test(w)) { w = w + "e"; }
+ else if (re3.test(w)) { re = re_1b_2; w = w.replace(re,""); }
+ else if (re4.test(w)) { w = w + "e"; }
+ }
+ }
+
+ // Step 1c - replace suffix y or Y by i if preceded by a non-vowel which is not the first letter of the word (so cry -> cri, by -> by, say -> say)
+ re = re_1c;
+ if (re.test(w)) {
+ var fp = re.exec(w);
+ stem = fp[1];
+ w = stem + "i";
+ }
+
+ // Step 2
+ re = re_2;
+ if (re.test(w)) {
+ var fp = re.exec(w);
+ stem = fp[1];
+ suffix = fp[2];
+ re = re_mgr0;
+ if (re.test(stem)) {
+ w = stem + step2list[suffix];
+ }
+ }
+
+ // Step 3
+ re = re_3;
+ if (re.test(w)) {
+ var fp = re.exec(w);
+ stem = fp[1];
+ suffix = fp[2];
+ re = re_mgr0;
+ if (re.test(stem)) {
+ w = stem + step3list[suffix];
+ }
+ }
+
+ // Step 4
+ re = re_4;
+ re2 = re2_4;
+ if (re.test(w)) {
+ var fp = re.exec(w);
+ stem = fp[1];
+ re = re_mgr1;
+ if (re.test(stem)) {
+ w = stem;
+ }
+ } else if (re2.test(w)) {
+ var fp = re2.exec(w);
+ stem = fp[1] + fp[2];
+ re2 = re_mgr1;
+ if (re2.test(stem)) {
+ w = stem;
+ }
+ }
+
+ // Step 5
+ re = re_5;
+ if (re.test(w)) {
+ var fp = re.exec(w);
+ stem = fp[1];
+ re = re_mgr1;
+ re2 = re_meq1;
+ re3 = re3_5;
+ if (re.test(stem) || (re2.test(stem) && !(re3.test(stem)))) {
+ w = stem;
+ }
+ }
+
+ re = re_5_1;
+ re2 = re_mgr1;
+ if (re.test(w) && re2.test(w)) {
+ re = re_1b_2;
+ w = w.replace(re,"");
+ }
+
+ // and turn initial Y back to y
+
+ if (firstch == "y") {
+ w = firstch.toLowerCase() + w.substr(1);
+ }
+
+ return w;
+ };
+
+ return function (token) {
+ return token.update(porterStemmer);
+ }
+})();
+
+lunr.Pipeline.registerFunction(lunr.stemmer, 'stemmer')
+/*!
+ * lunr.stopWordFilter
+ * Copyright (C) 2019 Oliver Nightingale
+ */
+
+/**
+ * lunr.generateStopWordFilter builds a stopWordFilter function from the provided
+ * list of stop words.
+ *
+ * The built in lunr.stopWordFilter is built using this generator and can be used
+ * to generate custom stopWordFilters for applications or non-English languages.
+ *
+ * @function
+ * @param {Array} stopWords - The list of stop words to filter out.
+ * @returns {lunr.PipelineFunction}
+ * @see lunr.Pipeline
+ * @see lunr.stopWordFilter
+ */
+lunr.generateStopWordFilter = function (stopWords) {
+ var words = stopWords.reduce(function (memo, stopWord) {
+ memo[stopWord] = stopWord
+ return memo
+ }, {})
+
+ return function (token) {
+ if (token && words[token.toString()] !== token.toString()) return token
+ }
+}
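The generator above builds a lookup table once, then returns a closure that only passes tokens absent from it; returning nothing for a stop word is what drops it from the pipeline. A self-contained copy of the same logic, applied to plain strings:

```javascript
// Same logic as lunr.generateStopWordFilter, used standalone on strings.
function generateStopWordFilter(stopWords) {
  var words = stopWords.reduce(function (memo, stopWord) {
    memo[stopWord] = stopWord
    return memo
  }, {})

  return function (token) {
    // only return (i.e. keep) the token when it is not a stop word
    if (token && words[token.toString()] !== token.toString()) return token
  }
}

var filter = generateStopWordFilter(['the', 'and'])
filter('kurdish') // 'kurdish'
filter('the')     // undefined (token is dropped)
```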
+
+/**
+ * lunr.stopWordFilter is an English language stop word list filter; any words
+ * contained in the list will not be passed through the filter.
+ *
+ * This is intended to be used in the Pipeline. If the token does not pass the
+ * filter then undefined will be returned.
+ *
+ * @function
+ * @implements {lunr.PipelineFunction}
+ * @param {lunr.Token} token - A token to check for being a stop word.
+ * @returns {lunr.Token}
+ * @see {@link lunr.Pipeline}
+ */
+lunr.stopWordFilter = lunr.generateStopWordFilter([
+ 'a',
+ 'able',
+ 'about',
+ 'across',
+ 'after',
+ 'all',
+ 'almost',
+ 'also',
+ 'am',
+ 'among',
+ 'an',
+ 'and',
+ 'any',
+ 'are',
+ 'as',
+ 'at',
+ 'be',
+ 'because',
+ 'been',
+ 'but',
+ 'by',
+ 'can',
+ 'cannot',
+ 'could',
+ 'dear',
+ 'did',
+ 'do',
+ 'does',
+ 'either',
+ 'else',
+ 'ever',
+ 'every',
+ 'for',
+ 'from',
+ 'get',
+ 'got',
+ 'had',
+ 'has',
+ 'have',
+ 'he',
+ 'her',
+ 'hers',
+ 'him',
+ 'his',
+ 'how',
+ 'however',
+ 'i',
+ 'if',
+ 'in',
+ 'into',
+ 'is',
+ 'it',
+ 'its',
+ 'just',
+ 'least',
+ 'let',
+ 'like',
+ 'likely',
+ 'may',
+ 'me',
+ 'might',
+ 'most',
+ 'must',
+ 'my',
+ 'neither',
+ 'no',
+ 'nor',
+ 'not',
+ 'of',
+ 'off',
+ 'often',
+ 'on',
+ 'only',
+ 'or',
+ 'other',
+ 'our',
+ 'own',
+ 'rather',
+ 'said',
+ 'say',
+ 'says',
+ 'she',
+ 'should',
+ 'since',
+ 'so',
+ 'some',
+ 'than',
+ 'that',
+ 'the',
+ 'their',
+ 'them',
+ 'then',
+ 'there',
+ 'these',
+ 'they',
+ 'this',
+ 'tis',
+ 'to',
+ 'too',
+ 'twas',
+ 'us',
+ 'wants',
+ 'was',
+ 'we',
+ 'were',
+ 'what',
+ 'when',
+ 'where',
+ 'which',
+ 'while',
+ 'who',
+ 'whom',
+ 'why',
+ 'will',
+ 'with',
+ 'would',
+ 'yet',
+ 'you',
+ 'your'
+])
+
+lunr.Pipeline.registerFunction(lunr.stopWordFilter, 'stopWordFilter')
+/*!
+ * lunr.trimmer
+ * Copyright (C) 2019 Oliver Nightingale
+ */
+
+/**
+ * lunr.trimmer is a pipeline function for trimming non word
+ * characters from the beginning and end of tokens before they
+ * enter the index.
+ *
+ * This implementation may not work correctly for non-latin
+ * characters and should either be removed or adapted for use
+ * with languages with non-latin characters.
+ *
+ * @static
+ * @implements {lunr.PipelineFunction}
+ * @param {lunr.Token} token The token to pass through the filter
+ * @returns {lunr.Token}
+ * @see lunr.Pipeline
+ */
+lunr.trimmer = function (token) {
+ return token.update(function (s) {
+ return s.replace(/^\W+/, '').replace(/\W+$/, '')
+ })
+}
+
+lunr.Pipeline.registerFunction(lunr.trimmer, 'trimmer')
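The trimmer's two regexes strip non-word characters from each end only, so punctuation inside a token survives. A quick standalone illustration of that pair of replacements:

```javascript
// Sketch of lunr.trimmer's string transform: strip leading and trailing
// non-word characters, leaving inner punctuation intact.
function trim(s) {
  return s.replace(/^\W+/, '').replace(/\W+$/, '')
}

trim('"hello!"') // 'hello'
trim("it's")     // "it's" (inner apostrophe survives)
```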
+/*!
+ * lunr.TokenSet
+ * Copyright (C) 2019 Oliver Nightingale
+ */
+
+/**
+ * A token set is used to store the unique list of all tokens
+ * within an index. Token sets are also used to represent an
+ * incoming query to the index; this query token set and index
+ * token set are then intersected to find which tokens to look
+ * up in the inverted index.
+ *
+ * A token set can hold multiple tokens, as in the case of the
+ * index token set, or it can hold a single token as in the
+ * case of a simple query token set.
+ *
+ * Additionally token sets are used to perform wildcard matching.
+ * Leading, contained and trailing wildcards are supported, and
+ * from this edit distance matching can also be provided.
+ *
+ * Token sets are implemented as a minimal finite state automaton,
+ * where both common prefixes and suffixes are shared between tokens.
+ * This helps to reduce the space used for storing the token set.
+ *
+ * @constructor
+ */
+lunr.TokenSet = function () {
+ this.final = false
+ this.edges = {}
+ this.id = lunr.TokenSet._nextId
+ lunr.TokenSet._nextId += 1
+}
+
+/**
+ * Keeps track of the next, auto increment, identifier to assign
+ * to a new tokenSet.
+ *
+ * TokenSets require a unique identifier to be correctly minimised.
+ *
+ * @private
+ */
+lunr.TokenSet._nextId = 1
+
+/**
+ * Creates a TokenSet instance from the given sorted array of words.
+ *
+ * @param {String[]} arr - A sorted array of strings to create the set from.
+ * @returns {lunr.TokenSet}
+ * @throws Will throw an error if the input array is not sorted.
+ */
+lunr.TokenSet.fromArray = function (arr) {
+ var builder = new lunr.TokenSet.Builder
+
+ for (var i = 0, len = arr.length; i < len; i++) {
+ builder.insert(arr[i])
+ }
+
+ builder.finish()
+ return builder.root
+}
+
+/**
+ * Creates a token set from a query clause.
+ *
+ * @private
+ * @param {Object} clause - A single clause from lunr.Query.
+ * @param {string} clause.term - The query clause term.
+ * @param {number} [clause.editDistance] - The optional edit distance for the term.
+ * @returns {lunr.TokenSet}
+ */
+lunr.TokenSet.fromClause = function (clause) {
+ if ('editDistance' in clause) {
+ return lunr.TokenSet.fromFuzzyString(clause.term, clause.editDistance)
+ } else {
+ return lunr.TokenSet.fromString(clause.term)
+ }
+}
+
+/**
+ * Creates a token set representing a single string with a specified
+ * edit distance.
+ *
+ * Insertions, deletions, substitutions and transpositions are each
+ * treated as an edit distance of 1.
+ *
+ * Increasing the allowed edit distance will have a dramatic impact
+ * on the performance of both creating and intersecting these TokenSets.
+ * It is advised to keep the edit distance less than 3.
+ *
+ * @param {string} str - The string to create the token set from.
+ * @param {number} editDistance - The allowed edit distance to match.
+ * @returns {lunr.TokenSet}
+ */
+lunr.TokenSet.fromFuzzyString = function (str, editDistance) {
+ var root = new lunr.TokenSet
+
+ var stack = [{
+ node: root,
+ editsRemaining: editDistance,
+ str: str
+ }]
+
+ while (stack.length) {
+ var frame = stack.pop()
+
+ // no edit
+ if (frame.str.length > 0) {
+ var char = frame.str.charAt(0),
+ noEditNode
+
+ if (char in frame.node.edges) {
+ noEditNode = frame.node.edges[char]
+ } else {
+ noEditNode = new lunr.TokenSet
+ frame.node.edges[char] = noEditNode
+ }
+
+ if (frame.str.length == 1) {
+ noEditNode.final = true
+ }
+
+ stack.push({
+ node: noEditNode,
+ editsRemaining: frame.editsRemaining,
+ str: frame.str.slice(1)
+ })
+ }
+
+ if (frame.editsRemaining == 0) {
+ continue
+ }
+
+ // insertion
+ if ("*" in frame.node.edges) {
+ var insertionNode = frame.node.edges["*"]
+ } else {
+ var insertionNode = new lunr.TokenSet
+ frame.node.edges["*"] = insertionNode
+ }
+
+ if (frame.str.length == 0) {
+ insertionNode.final = true
+ }
+
+ stack.push({
+ node: insertionNode,
+ editsRemaining: frame.editsRemaining - 1,
+ str: frame.str
+ })
+
+ // deletion
+ // can only do a deletion if we have enough edits remaining
+ // and if there are characters left to delete in the string
+ if (frame.str.length > 1) {
+ stack.push({
+ node: frame.node,
+ editsRemaining: frame.editsRemaining - 1,
+ str: frame.str.slice(1)
+ })
+ }
+
+ // deletion
+ // just removing the last character from the str
+ if (frame.str.length == 1) {
+ frame.node.final = true
+ }
+
+ // substitution
+ // can only do a substitution if we have enough edits remaining
+ // and if there are characters left to substitute
+ if (frame.str.length >= 1) {
+ if ("*" in frame.node.edges) {
+ var substitutionNode = frame.node.edges["*"]
+ } else {
+ var substitutionNode = new lunr.TokenSet
+ frame.node.edges["*"] = substitutionNode
+ }
+
+ if (frame.str.length == 1) {
+ substitutionNode.final = true
+ }
+
+ stack.push({
+ node: substitutionNode,
+ editsRemaining: frame.editsRemaining - 1,
+ str: frame.str.slice(1)
+ })
+ }
+
+ // transposition
+ // can only do a transposition if there are edits remaining
+ // and there are enough characters to transpose
+ if (frame.str.length > 1) {
+ var charA = frame.str.charAt(0),
+ charB = frame.str.charAt(1),
+ transposeNode
+
+ if (charB in frame.node.edges) {
+ transposeNode = frame.node.edges[charB]
+ } else {
+ transposeNode = new lunr.TokenSet
+ frame.node.edges[charB] = transposeNode
+ }
+
+ if (frame.str.length == 1) {
+ transposeNode.final = true
+ }
+
+ stack.push({
+ node: transposeNode,
+ editsRemaining: frame.editsRemaining - 1,
+ str: charA + frame.str.slice(2)
+ })
+ }
+ }
+
+ return root
+}
+
+/**
+ * Creates a TokenSet from a string.
+ *
+ * The string may contain one or more wildcard characters (*)
+ * that will allow wildcard matching when intersecting with
+ * another TokenSet.
+ *
+ * @param {string} str - The string to create a TokenSet from.
+ * @returns {lunr.TokenSet}
+ */
+lunr.TokenSet.fromString = function (str) {
+ var node = new lunr.TokenSet,
+ root = node
+
+ /*
+ * Iterates through all characters within the passed string
+ * appending a node for each character.
+ *
+ * When a wildcard character is found then a self
+ * referencing edge is introduced to continually match
+ * any number of any characters.
+ */
+ for (var i = 0, len = str.length; i < len; i++) {
+ var char = str[i],
+ final = (i == len - 1)
+
+ if (char == "*") {
+ node.edges[char] = node
+ node.final = final
+
+ } else {
+ var next = new lunr.TokenSet
+ next.final = final
+
+ node.edges[char] = next
+ node = next
+ }
+ }
+
+ return root
+}
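As a standalone sketch (not lunr's own code) of the construction in `fromString` above: each character appends a node, and `*` introduces a self-referencing edge so any run of characters keeps matching.

```javascript
// Build a tiny character trie from a string, mirroring the wildcard
// handling above: "*" creates a self loop on the current node.
function trieFromString(str) {
  var root = { final: false, edges: {} }
  var node = root
  for (var i = 0; i < str.length; i++) {
    var char = str[i]
    var final = i === str.length - 1
    if (char === "*") {
      node.edges[char] = node // self loop: matches zero or more characters
      node.final = final
    } else {
      var next = { final: final, edges: {} }
      node.edges[char] = next
      node = next
    }
  }
  return root
}

var t = trieFromString("fo*")
// walking f -> o lands on the node carrying the wildcard self loop,
// which is also final so "fo" itself matches
var o = t.edges["f"].edges["o"]
```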
+
+/**
+ * Converts this TokenSet into an array of strings
+ * contained within the TokenSet.
+ *
+ * This is not intended to be used on a TokenSet that
+ * contains wildcards, in these cases the results are
+ * undefined and are likely to cause an infinite loop.
+ *
+ * @returns {string[]}
+ */
+lunr.TokenSet.prototype.toArray = function () {
+ var words = []
+
+ var stack = [{
+ prefix: "",
+ node: this
+ }]
+
+ while (stack.length) {
+ var frame = stack.pop(),
+ edges = Object.keys(frame.node.edges),
+ len = edges.length
+
+ if (frame.node.final) {
+ /* In Safari, at this point the prefix is sometimes corrupted, see:
+ * https://github.com/olivernn/lunr.js/issues/279 Calling any
+ * String.prototype method forces Safari to "cast" this string to what
+ * it's supposed to be, fixing the bug. */
+ frame.prefix.charAt(0)
+ words.push(frame.prefix)
+ }
+
+ for (var i = 0; i < len; i++) {
+ var edge = edges[i]
+
+ stack.push({
+ prefix: frame.prefix.concat(edge),
+ node: frame.node.edges[edge]
+ })
+ }
+ }
+
+ return words
+}
+
+/**
+ * Generates a string representation of a TokenSet.
+ *
+ * This is intended to allow TokenSets to be used as keys
+ * in objects, largely to aid the construction and minimisation
+ * of a TokenSet. As such it is not designed to be a human
+ * friendly representation of the TokenSet.
+ *
+ * @returns {string}
+ */
+lunr.TokenSet.prototype.toString = function () {
+ // NOTE: Using Object.keys here as this.edges is very likely
+ // to enter 'hash-mode' with many keys being added
+ //
+ // avoiding a for-in loop here as it leads to the function
+ // being de-optimised (at least in V8). From some simple
+ // benchmarks the performance is comparable, but allowing
+ // V8 to optimize may mean easy performance wins in the future.
+
+ if (this._str) {
+ return this._str
+ }
+
+ var str = this.final ? '1' : '0',
+ labels = Object.keys(this.edges).sort(),
+ len = labels.length
+
+ for (var i = 0; i < len; i++) {
+ var label = labels[i],
+ node = this.edges[label]
+
+ str = str + label + node.id
+ }
+
+ return str
+}
+
+/**
+ * Returns a new TokenSet that is the intersection of
+ * this TokenSet and the passed TokenSet.
+ *
+ * This intersection will take into account any wildcards
+ * contained within the TokenSet.
+ *
+ * @param {lunr.TokenSet} b - Another TokenSet to intersect with.
+ * @returns {lunr.TokenSet}
+ */
+lunr.TokenSet.prototype.intersect = function (b) {
+ var output = new lunr.TokenSet,
+ frame = undefined
+
+ var stack = [{
+ qNode: b,
+ output: output,
+ node: this
+ }]
+
+ while (stack.length) {
+ frame = stack.pop()
+
+ // NOTE: As with the #toString method, we are using
+ // Object.keys and a for loop instead of a for-in loop
+ // as both of these objects enter 'hash' mode, causing
+ // the function to be de-optimised in V8
+ var qEdges = Object.keys(frame.qNode.edges),
+ qLen = qEdges.length,
+ nEdges = Object.keys(frame.node.edges),
+ nLen = nEdges.length
+
+ for (var q = 0; q < qLen; q++) {
+ var qEdge = qEdges[q]
+
+ for (var n = 0; n < nLen; n++) {
+ var nEdge = nEdges[n]
+
+ if (nEdge == qEdge || qEdge == '*') {
+ var node = frame.node.edges[nEdge],
+ qNode = frame.qNode.edges[qEdge],
+ final = node.final && qNode.final,
+ next = undefined
+
+ if (nEdge in frame.output.edges) {
+ // an edge already exists for this character
+ // no need to create a new node, just set the finality
+ // bit unless this node is already final
+ next = frame.output.edges[nEdge]
+ next.final = next.final || final
+
+ } else {
+ // no edge exists yet, must create one
+ // set the finality bit and insert it
+ // into the output
+ next = new lunr.TokenSet
+ next.final = final
+ frame.output.edges[nEdge] = next
+ }
+
+ stack.push({
+ qNode: qNode,
+ output: next,
+ node: node
+ })
+ }
+ }
+ }
+ }
+
+ return output
+}
+lunr.TokenSet.Builder = function () {
+ this.previousWord = ""
+ this.root = new lunr.TokenSet
+ this.uncheckedNodes = []
+ this.minimizedNodes = {}
+}
+
+lunr.TokenSet.Builder.prototype.insert = function (word) {
+ var node,
+ commonPrefix = 0
+
+ if (word < this.previousWord) {
+ throw new Error ("Out of order word insertion")
+ }
+
+ for (var i = 0; i < word.length && i < this.previousWord.length; i++) {
+ if (word[i] != this.previousWord[i]) break
+ commonPrefix++
+ }
+
+ this.minimize(commonPrefix)
+
+ if (this.uncheckedNodes.length == 0) {
+ node = this.root
+ } else {
+ node = this.uncheckedNodes[this.uncheckedNodes.length - 1].child
+ }
+
+ for (var i = commonPrefix; i < word.length; i++) {
+ var nextNode = new lunr.TokenSet,
+ char = word[i]
+
+ node.edges[char] = nextNode
+
+ this.uncheckedNodes.push({
+ parent: node,
+ char: char,
+ child: nextNode
+ })
+
+ node = nextNode
+ }
+
+ node.final = true
+ this.previousWord = word
+}
+
+lunr.TokenSet.Builder.prototype.finish = function () {
+ this.minimize(0)
+}
+
+lunr.TokenSet.Builder.prototype.minimize = function (downTo) {
+ for (var i = this.uncheckedNodes.length - 1; i >= downTo; i--) {
+ var node = this.uncheckedNodes[i],
+ childKey = node.child.toString()
+
+ if (childKey in this.minimizedNodes) {
+ node.parent.edges[node.char] = this.minimizedNodes[childKey]
+ } else {
+ // Cache the key for this node since
+ // we know it can't change anymore
+ node.child._str = childKey
+
+ this.minimizedNodes[childKey] = node.child
+ }
+
+ this.uncheckedNodes.pop()
+ }
+}
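As a standalone sketch (not lunr's own code) of why `minimize` works: nodes are registered under a canonical string key, so structurally identical sub-tries (shared suffixes) collapse to a single shared node. Here the key is computed recursively rather than cached via `_str` as above.

```javascript
function makeNode() { return { final: false, edges: {} } }

// naive (unminimized) trie insertion
function insertWord(root, word) {
  var node = root
  for (var i = 0; i < word.length; i++) {
    node = node.edges[word[i]] || (node.edges[word[i]] = makeNode())
  }
  node.final = true
  return root
}

// canonical key: finality bit plus sorted edges, recursively
function canonicalKey(node) {
  return (node.final ? "1" : "0") + Object.keys(node.edges).sort()
    .map(function (c) { return c + "(" + canonicalKey(node.edges[c]) + ")" })
    .join("")
}

var trie = insertWord(insertWord(makeNode(), "eating"), "rating")
// the sub-tries below the shared suffix "ating" have equal keys,
// so a minimizer may merge them into one shared node
var sharedA = canonicalKey(trie.edges["e"].edges["a"])
var sharedB = canonicalKey(trie.edges["r"].edges["a"])
```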
+/*!
+ * lunr.Index
+ * Copyright (C) 2019 Oliver Nightingale
+ */
+
+/**
+ * An index contains the built index of all documents and provides a query interface
+ * to the index.
+ *
+ * Usually instances of lunr.Index will not be created using this constructor, instead
+ * lunr.Builder should be used to construct new indexes, or lunr.Index.load should be
+ * used to load previously built and serialized indexes.
+ *
+ * @constructor
+ * @param {Object} attrs - The attributes of the built search index.
+ * @param {Object} attrs.invertedIndex - An index of term/field to document reference.
+ * @param {Object} attrs.fieldVectors - Field vectors
+ * @param {lunr.TokenSet} attrs.tokenSet - An set of all corpus tokens.
+ * @param {string[]} attrs.fields - The names of indexed document fields.
+ * @param {lunr.Pipeline} attrs.pipeline - The pipeline to use for search terms.
+ */
+lunr.Index = function (attrs) {
+ this.invertedIndex = attrs.invertedIndex
+ this.fieldVectors = attrs.fieldVectors
+ this.tokenSet = attrs.tokenSet
+ this.fields = attrs.fields
+ this.pipeline = attrs.pipeline
+}
+
+/**
+ * A result contains details of a document matching a search query.
+ * @typedef {Object} lunr.Index~Result
+ * @property {string} ref - The reference of the document this result represents.
+ * @property {number} score - A number between 0 and 1 representing how similar this document is to the query.
+ * @property {lunr.MatchData} matchData - Contains metadata about this match including which term(s) caused the match.
+ */
+
+/**
+ * Although lunr provides the ability to create queries using lunr.Query, it also provides a simple
+ * query language which itself is parsed into an instance of lunr.Query.
+ *
+ * For programmatically building queries it is advised to directly use lunr.Query, the query language
+ * is best used for human entered text rather than program generated text.
+ *
+ * At its simplest queries can just be a single term, e.g. `hello`, multiple terms are also supported
+ * and will be combined with OR, e.g. `hello world` will match documents that contain either 'hello'
+ * or 'world', though those that contain both will rank higher in the results.
+ *
+ * Wildcards can be included in terms to match one or more unspecified characters, these wildcards can
+ * be inserted anywhere within the term, and more than one wildcard can exist in a single term. Adding
+ * wildcards will increase the number of documents that will be found but can also have a negative
+ * impact on query performance, especially with wildcards at the beginning of a term.
+ *
+ * Terms can be restricted to specific fields, e.g. `title:hello`, only documents with the term
+ * hello in the title field will match this query. Using a field not present in the index will lead
+ * to an error being thrown.
+ *
+ * Modifiers can also be added to terms, lunr supports edit distance and boost modifiers on terms. A term
+ * boost will make documents matching that term score higher, e.g. `foo^5`. Edit distance is also supported
+ * to provide fuzzy matching, e.g. 'hello~2' will match documents with hello with an edit distance of 2.
+ * Avoid large values for edit distance to improve query performance.
+ *
+ * Each term also supports a presence modifier. By default a term's presence in document is optional, however
+ * this can be changed to either required or prohibited. For a term's presence to be required in a document the
+ * term should be prefixed with a '+', e.g. `+foo bar` is a search for documents that must contain 'foo' and
+ * optionally contain 'bar'. Conversely a leading '-' sets the term's presence to prohibited, i.e. it must not
+ * appear in a document, e.g. `-foo bar` is a search for documents that do not contain 'foo' but may contain 'bar'.
+ *
+ * To escape special characters the backslash character '\' can be used, this allows searches to include
+ * characters that would normally be considered modifiers, e.g. `foo\~2` will search for a term "foo~2" instead
+ * of attempting to apply a boost of 2 to the search term "foo".
+ *
+ * @typedef {string} lunr.Index~QueryString
+ * @example <caption>Simple single term query</caption>
+ * hello
+ * @example <caption>Multiple term query</caption>
+ * hello world
+ * @example <caption>term scoped to a field</caption>
+ * title:hello
+ * @example <caption>term with a boost of 10</caption>
+ * hello^10
+ * @example <caption>term with an edit distance of 2</caption>
+ * hello~2
+ * @example <caption>terms with presence modifiers</caption>
+ * -foo +bar baz
+ */
+
+/**
+ * Performs a search against the index using lunr query syntax.
+ *
+ * Results will be returned sorted by their score, the most relevant results
+ * will be returned first. For details on how the score is calculated, please see
+ * the {@link https://lunrjs.com/guides/searching.html#scoring|guide}.
+ *
+ * For more programmatic querying use lunr.Index#query.
+ *
+ * @param {lunr.Index~QueryString} queryString - A string containing a lunr query.
+ * @throws {lunr.QueryParseError} If the passed query string cannot be parsed.
+ * @returns {lunr.Index~Result[]}
+ */
+lunr.Index.prototype.search = function (queryString) {
+ return this.query(function (query) {
+ var parser = new lunr.QueryParser(queryString, query)
+ parser.parse()
+ })
+}
+
+/**
+ * A query builder callback provides a query object to be used to express
+ * the query to perform on the index.
+ *
+ * @callback lunr.Index~queryBuilder
+ * @param {lunr.Query} query - The query object to build up.
+ * @this lunr.Query
+ */
+
+/**
+ * Performs a query against the index using the yielded lunr.Query object.
+ *
+ * If performing programmatic queries against the index, this method is preferred
+ * over lunr.Index#search so as to avoid the additional query parsing overhead.
+ *
+ * A query object is yielded to the supplied function which should be used to
+ * express the query to be run against the index.
+ *
+ * Note that although this function takes a callback parameter it is _not_ an
+ * asynchronous operation, the callback is just yielded a query object to be
+ * customized.
+ *
+ * @param {lunr.Index~queryBuilder} fn - A function that is used to build the query.
+ * @returns {lunr.Index~Result[]}
+ */
+lunr.Index.prototype.query = function (fn) {
+ // for each query clause
+ // * process terms
+ // * expand terms from token set
+ // * find matching documents and metadata
+ // * get document vectors
+ // * score documents
+
+ var query = new lunr.Query(this.fields),
+ matchingFields = Object.create(null),
+ queryVectors = Object.create(null),
+ termFieldCache = Object.create(null),
+ requiredMatches = Object.create(null),
+ prohibitedMatches = Object.create(null)
+
+ /*
+ * To support field level boosts a query vector is created per
+ * field. An empty vector is eagerly created to support negated
+ * queries.
+ */
+ for (var i = 0; i < this.fields.length; i++) {
+ queryVectors[this.fields[i]] = new lunr.Vector
+ }
+
+ fn.call(query, query)
+
+ for (var i = 0; i < query.clauses.length; i++) {
+ /*
+ * Unless the pipeline has been disabled for this term, which is
+ * the case for terms with wildcards, we need to pass the clause
+ * term through the search pipeline. A pipeline returns an array
+ * of processed terms. Pipeline functions may expand the passed
+ * term, which means we may end up performing multiple index lookups
+ * for a single query term.
+ */
+ var clause = query.clauses[i],
+ terms = null,
+ clauseMatches = lunr.Set.complete
+
+ if (clause.usePipeline) {
+ terms = this.pipeline.runString(clause.term, {
+ fields: clause.fields
+ })
+ } else {
+ terms = [clause.term]
+ }
+
+ for (var m = 0; m < terms.length; m++) {
+ var term = terms[m]
+
+ /*
+ * Each term returned from the pipeline needs to use the same query
+ * clause object, e.g. the same boost and/or edit distance. The
+ * simplest way to do this is to re-use the clause object but mutate
+ * its term property.
+ */
+ clause.term = term
+
+ /*
+ * From the term in the clause we create a token set which will then
+ * be used to intersect the index's token set to get a list of terms
+ * to lookup in the inverted index
+ */
+ var termTokenSet = lunr.TokenSet.fromClause(clause),
+ expandedTerms = this.tokenSet.intersect(termTokenSet).toArray()
+
+ /*
+ * If a term marked as required does not exist in the tokenSet it is
+ * impossible for the search to return any matches. We set all the field
+ * scoped required matches set to empty and stop examining any further
+ * clauses.
+ */
+ if (expandedTerms.length === 0 && clause.presence === lunr.Query.presence.REQUIRED) {
+ for (var k = 0; k < clause.fields.length; k++) {
+ var field = clause.fields[k]
+ requiredMatches[field] = lunr.Set.empty
+ }
+
+ break
+ }
+
+ for (var j = 0; j < expandedTerms.length; j++) {
+ /*
+ * For each term get the posting and termIndex, this is required for
+ * building the query vector.
+ */
+ var expandedTerm = expandedTerms[j],
+ posting = this.invertedIndex[expandedTerm],
+ termIndex = posting._index
+
+ for (var k = 0; k < clause.fields.length; k++) {
+ /*
+ * For each field that this query term is scoped by (by default
+ * all fields are in scope) we need to get all the document refs
+ * that have this term in that field.
+ *
+ * The posting is the entry in the invertedIndex for the matching
+ * term from above.
+ */
+ var field = clause.fields[k],
+ fieldPosting = posting[field],
+ matchingDocumentRefs = Object.keys(fieldPosting),
+ termField = expandedTerm + "/" + field,
+ matchingDocumentsSet = new lunr.Set(matchingDocumentRefs)
+
+ /*
+ * if the presence of this term is required ensure that the matching
+ * documents are added to the set of required matches for this clause.
+ *
+ */
+ if (clause.presence == lunr.Query.presence.REQUIRED) {
+ clauseMatches = clauseMatches.union(matchingDocumentsSet)
+
+ if (requiredMatches[field] === undefined) {
+ requiredMatches[field] = lunr.Set.complete
+ }
+ }
+
+ /*
+ * if the presence of this term is prohibited ensure that the matching
+ * documents are added to the set of prohibited matches for this field,
+ * creating that set if it does not yet exist.
+ */
+ if (clause.presence == lunr.Query.presence.PROHIBITED) {
+ if (prohibitedMatches[field] === undefined) {
+ prohibitedMatches[field] = lunr.Set.empty
+ }
+
+ prohibitedMatches[field] = prohibitedMatches[field].union(matchingDocumentsSet)
+
+ /*
+ * Prohibited matches should not be part of the query vector used for
+ * similarity scoring and no metadata should be extracted so we continue
+ * to the next field
+ */
+ continue
+ }
+
+ /*
+ * The query field vector is populated using the termIndex found for
+ * the term and a unit value with the appropriate boost applied.
+ * Using upsert because there could already be an entry in the vector
+ * for the term we are working with. In that case we just add the scores
+ * together.
+ */
+ queryVectors[field].upsert(termIndex, clause.boost, function (a, b) { return a + b })
+
+ /**
+ * If we've already seen this term, field combo then we've already collected
+ * the matching documents and metadata, no need to go through all that again
+ */
+ if (termFieldCache[termField]) {
+ continue
+ }
+
+ for (var l = 0; l < matchingDocumentRefs.length; l++) {
+ /*
+ * All metadata for this term/field/document triple
+ * are then extracted and collected into an instance
+ * of lunr.MatchData ready to be returned in the query
+ * results
+ */
+ var matchingDocumentRef = matchingDocumentRefs[l],
+ matchingFieldRef = new lunr.FieldRef (matchingDocumentRef, field),
+ metadata = fieldPosting[matchingDocumentRef],
+ fieldMatch
+
+ if ((fieldMatch = matchingFields[matchingFieldRef]) === undefined) {
+ matchingFields[matchingFieldRef] = new lunr.MatchData (expandedTerm, field, metadata)
+ } else {
+ fieldMatch.add(expandedTerm, field, metadata)
+ }
+
+ }
+
+ termFieldCache[termField] = true
+ }
+ }
+ }
+
+ /**
+ * If the presence was required we need to update the requiredMatches field sets.
+ * We do this after all fields for the term have collected their matches because
+ * the clause term's presence is required in _any_ of the fields not _all_ of the
+ * fields.
+ */
+ if (clause.presence === lunr.Query.presence.REQUIRED) {
+ for (var k = 0; k < clause.fields.length; k++) {
+ var field = clause.fields[k]
+ requiredMatches[field] = requiredMatches[field].intersect(clauseMatches)
+ }
+ }
+ }
+
+ /**
+ * Need to combine the field scoped required and prohibited
+ * matching documents into a global set of required and prohibited
+ * matches
+ */
+ var allRequiredMatches = lunr.Set.complete,
+ allProhibitedMatches = lunr.Set.empty
+
+ for (var i = 0; i < this.fields.length; i++) {
+ var field = this.fields[i]
+
+ if (requiredMatches[field]) {
+ allRequiredMatches = allRequiredMatches.intersect(requiredMatches[field])
+ }
+
+ if (prohibitedMatches[field]) {
+ allProhibitedMatches = allProhibitedMatches.union(prohibitedMatches[field])
+ }
+ }
+
+ var matchingFieldRefs = Object.keys(matchingFields),
+ results = [],
+ matches = Object.create(null)
+
+ /*
+ * If the query is negated (contains only prohibited terms)
+ * we need to get _all_ fieldRefs currently existing in the
+ * index. This is only done when we know that the query is
+ * entirely prohibited terms to avoid any cost of getting all
+ * fieldRefs unnecessarily.
+ *
+ * Additionally, blank MatchData must be created to correctly
+ * populate the results.
+ */
+ if (query.isNegated()) {
+ matchingFieldRefs = Object.keys(this.fieldVectors)
+
+ for (var i = 0; i < matchingFieldRefs.length; i++) {
+ var matchingFieldRef = matchingFieldRefs[i]
+ var fieldRef = lunr.FieldRef.fromString(matchingFieldRef)
+ matchingFields[matchingFieldRef] = new lunr.MatchData
+ }
+ }
+
+ for (var i = 0; i < matchingFieldRefs.length; i++) {
+ /*
+ * Currently we have document fields that match the query, but we
+ * need to return documents. The matchData and scores are combined
+ * from multiple fields belonging to the same document.
+ *
+ * Scores are calculated by field, using the query vectors created
+ * above, and combined into a final document score using addition.
+ */
+ var fieldRef = lunr.FieldRef.fromString(matchingFieldRefs[i]),
+ docRef = fieldRef.docRef
+
+ if (!allRequiredMatches.contains(docRef)) {
+ continue
+ }
+
+ if (allProhibitedMatches.contains(docRef)) {
+ continue
+ }
+
+ var fieldVector = this.fieldVectors[fieldRef],
+ score = queryVectors[fieldRef.fieldName].similarity(fieldVector),
+ docMatch
+
+ if ((docMatch = matches[docRef]) !== undefined) {
+ docMatch.score += score
+ docMatch.matchData.combine(matchingFields[fieldRef])
+ } else {
+ var match = {
+ ref: docRef,
+ score: score,
+ matchData: matchingFields[fieldRef]
+ }
+ matches[docRef] = match
+ results.push(match)
+ }
+ }
+
+ /*
+ * Sort the results objects by score, highest first.
+ */
+ return results.sort(function (a, b) {
+ return b.score - a.score
+ })
+}
+
+/**
+ * Prepares the index for JSON serialization.
+ *
+ * The schema for this JSON blob will be described in a
+ * separate JSON schema file.
+ *
+ * @returns {Object}
+ */
+lunr.Index.prototype.toJSON = function () {
+ var invertedIndex = Object.keys(this.invertedIndex)
+ .sort()
+ .map(function (term) {
+ return [term, this.invertedIndex[term]]
+ }, this)
+
+ var fieldVectors = Object.keys(this.fieldVectors)
+ .map(function (ref) {
+ return [ref, this.fieldVectors[ref].toJSON()]
+ }, this)
+
+ return {
+ version: lunr.version,
+ fields: this.fields,
+ fieldVectors: fieldVectors,
+ invertedIndex: invertedIndex,
+ pipeline: this.pipeline.toJSON()
+ }
+}
+
+/**
+ * Loads a previously serialized lunr.Index
+ *
+ * @param {Object} serializedIndex - A previously serialized lunr.Index
+ * @returns {lunr.Index}
+ */
+lunr.Index.load = function (serializedIndex) {
+ var attrs = {},
+ fieldVectors = {},
+ serializedVectors = serializedIndex.fieldVectors,
+ invertedIndex = Object.create(null),
+ serializedInvertedIndex = serializedIndex.invertedIndex,
+ tokenSetBuilder = new lunr.TokenSet.Builder,
+ pipeline = lunr.Pipeline.load(serializedIndex.pipeline)
+
+ if (serializedIndex.version != lunr.version) {
+ lunr.utils.warn("Version mismatch when loading serialised index. Current version of lunr '" + lunr.version + "' does not match serialized index '" + serializedIndex.version + "'")
+ }
+
+ for (var i = 0; i < serializedVectors.length; i++) {
+ var tuple = serializedVectors[i],
+ ref = tuple[0],
+ elements = tuple[1]
+
+ fieldVectors[ref] = new lunr.Vector(elements)
+ }
+
+ for (var i = 0; i < serializedInvertedIndex.length; i++) {
+ var tuple = serializedInvertedIndex[i],
+ term = tuple[0],
+ posting = tuple[1]
+
+ tokenSetBuilder.insert(term)
+ invertedIndex[term] = posting
+ }
+
+ tokenSetBuilder.finish()
+
+ attrs.fields = serializedIndex.fields
+
+ attrs.fieldVectors = fieldVectors
+ attrs.invertedIndex = invertedIndex
+ attrs.tokenSet = tokenSetBuilder.root
+ attrs.pipeline = pipeline
+
+ return new lunr.Index(attrs)
+}
+/*!
+ * lunr.Builder
+ * Copyright (C) 2019 Oliver Nightingale
+ */
+
+/**
+ * lunr.Builder performs indexing on a set of documents and
+ * returns instances of lunr.Index ready for querying.
+ *
+ * All configuration of the index is done via the builder, the
+ * fields to index, the document reference, the text processing
+ * pipeline and document scoring parameters are all set on the
+ * builder before indexing.
+ *
+ * @constructor
+ * @property {string} _ref - Internal reference to the document reference field.
+ * @property {string[]} _fields - Internal reference to the document fields to index.
+ * @property {object} invertedIndex - The inverted index maps terms to document fields.
+ * @property {object} documentTermFrequencies - Keeps track of document term frequencies.
+ * @property {object} documentLengths - Keeps track of the length of documents added to the index.
+ * @property {lunr.tokenizer} tokenizer - Function for splitting strings into tokens for indexing.
+ * @property {lunr.Pipeline} pipeline - The pipeline performs text processing on tokens before indexing.
+ * @property {lunr.Pipeline} searchPipeline - A pipeline for processing search terms before querying the index.
+ * @property {number} documentCount - Keeps track of the total number of documents indexed.
+ * @property {number} _b - A parameter to control field length normalization, setting this to 0 disables normalization, 1 fully normalizes field lengths, the default value is 0.75.
+ * @property {number} _k1 - A parameter to control how quickly an increase in term frequency results in term frequency saturation, the default value is 1.2.
+ * @property {number} termIndex - A counter incremented for each unique term, used to identify a term's position in the vector space.
+ * @property {array} metadataWhitelist - A list of metadata keys that have been whitelisted for entry in the index.
+ */
+lunr.Builder = function () {
+ this._ref = "id"
+ this._fields = Object.create(null)
+ this._documents = Object.create(null)
+ this.invertedIndex = Object.create(null)
+ this.fieldTermFrequencies = {}
+ this.fieldLengths = {}
+ this.tokenizer = lunr.tokenizer
+ this.pipeline = new lunr.Pipeline
+ this.searchPipeline = new lunr.Pipeline
+ this.documentCount = 0
+ this._b = 0.75
+ this._k1 = 1.2
+ this.termIndex = 0
+ this.metadataWhitelist = []
+}
+
+/**
+ * Sets the document field used as the document reference. Every document must have this field.
+ * The type of this field in the document should be a string, if it is not a string it will be
+ * coerced into a string by calling toString.
+ *
+ * The default ref is 'id'.
+ *
+ * The ref should _not_ be changed during indexing, it should be set before any documents are
+ * added to the index. Changing it during indexing can lead to inconsistent results.
+ *
+ * @param {string} ref - The name of the reference field in the document.
+ */
+lunr.Builder.prototype.ref = function (ref) {
+ this._ref = ref
+}
+
+/**
+ * A function that is used to extract a field from a document.
+ *
+ * Lunr expects a field to be at the top level of a document, if however the field
+ * is deeply nested within a document an extractor function can be used to extract
+ * the right field for indexing.
+ *
+ * @callback fieldExtractor
+ * @param {object} doc - The document being added to the index.
+ * @returns {?(string|object|object[])} obj - The object that will be indexed for this field.
+ * @example <caption>Extracting a nested field</caption>
+ * function (doc) { return doc.nested.field }
+ */
+
+/**
+ * Adds a field to the list of document fields that will be indexed. Every document being
+ * indexed should have this field. Null values for this field in indexed documents will
+ * not cause errors but will limit the chance of that document being retrieved by searches.
+ *
+ * All fields should be added before adding documents to the index. Adding fields after
+ * a document has been indexed will have no effect on already indexed documents.
+ *
+ * Fields can be boosted at build time. This allows terms within that field to have more
+ * importance when ranking search results. Use a field boost to specify that matches within
+ * one field are more important than other fields.
+ *
+ * @param {string} fieldName - The name of a field to index in all documents.
+ * @param {object} attributes - Optional attributes associated with this field.
+ * @param {number} [attributes.boost=1] - Boost applied to all terms within this field.
+ * @param {fieldExtractor} [attributes.extractor] - Function to extract a field from a document.
+ * @throws {RangeError} fieldName cannot contain unsupported characters '/'
+ */
+lunr.Builder.prototype.field = function (fieldName, attributes) {
+ if (/\//.test(fieldName)) {
+ throw new RangeError ("Field '" + fieldName + "' contains illegal character '/'")
+ }
+
+ this._fields[fieldName] = attributes || {}
+}
+
+/**
+ * A parameter to tune the amount of field length normalisation that is applied when
+ * calculating relevance scores. A value of 0 will completely disable any normalisation
+ * and a value of 1 will fully normalise field lengths. The default is 0.75. Values of b
+ * will be clamped to the range 0 - 1.
+ *
+ * @param {number} number - The value to set for this tuning parameter.
+ */
+lunr.Builder.prototype.b = function (number) {
+ if (number < 0) {
+ this._b = 0
+ } else if (number > 1) {
+ this._b = 1
+ } else {
+ this._b = number
+ }
+}
+
+/**
+ * A parameter that controls the speed at which a rise in term frequency results in term
+ * frequency saturation. The default value is 1.2. Setting this to a higher value will give
+ * slower saturation levels, a lower value will result in quicker saturation.
+ *
+ * @param {number} number - The value to set for this tuning parameter.
+ */
+lunr.Builder.prototype.k1 = function (number) {
+ this._k1 = number
+}
+
+/**
+ * Adds a document to the index.
+ *
+ * Before adding fields to the index the index should have been fully setup, with the document
+ * ref and all fields to index already having been specified.
+ *
+ * The document must have a field name as specified by the ref (by default this is 'id') and
+ * it should have all fields defined for indexing, though null or undefined values will not
+ * cause errors.
+ *
+ * Entire documents can be boosted at build time. Applying a boost to a document indicates that
+ * this document should rank higher in search results than other documents.
+ *
+ * @param {object} doc - The document to add to the index.
+ * @param {object} attributes - Optional attributes associated with this document.
+ * @param {number} [attributes.boost=1] - Boost applied to all terms within this document.
+ */
+lunr.Builder.prototype.add = function (doc, attributes) {
+ var docRef = doc[this._ref],
+ fields = Object.keys(this._fields)
+
+ this._documents[docRef] = attributes || {}
+ this.documentCount += 1
+
+ for (var i = 0; i < fields.length; i++) {
+ var fieldName = fields[i],
+ extractor = this._fields[fieldName].extractor,
+ field = extractor ? extractor(doc) : doc[fieldName],
+ tokens = this.tokenizer(field, {
+ fields: [fieldName]
+ }),
+ terms = this.pipeline.run(tokens),
+ fieldRef = new lunr.FieldRef (docRef, fieldName),
+ fieldTerms = Object.create(null)
+
+ this.fieldTermFrequencies[fieldRef] = fieldTerms
+ this.fieldLengths[fieldRef] = 0
+
+ // store the length of this field for this document
+ this.fieldLengths[fieldRef] += terms.length
+
+ // calculate term frequencies for this field
+ for (var j = 0; j < terms.length; j++) {
+ var term = terms[j]
+
+ if (fieldTerms[term] == undefined) {
+ fieldTerms[term] = 0
+ }
+
+ fieldTerms[term] += 1
+
+ // add to inverted index
+ // create an initial posting if one doesn't exist
+ if (this.invertedIndex[term] == undefined) {
+ var posting = Object.create(null)
+ posting["_index"] = this.termIndex
+ this.termIndex += 1
+
+ for (var k = 0; k < fields.length; k++) {
+ posting[fields[k]] = Object.create(null)
+ }
+
+ this.invertedIndex[term] = posting
+ }
+
+ // add an entry for this term/fieldName/docRef to the invertedIndex
+ if (this.invertedIndex[term][fieldName][docRef] == undefined) {
+ this.invertedIndex[term][fieldName][docRef] = Object.create(null)
+ }
+
+ // store all whitelisted metadata about this token in the
+ // inverted index
+ for (var l = 0; l < this.metadataWhitelist.length; l++) {
+ var metadataKey = this.metadataWhitelist[l],
+ metadata = term.metadata[metadataKey]
+
+ if (this.invertedIndex[term][fieldName][docRef][metadataKey] == undefined) {
+ this.invertedIndex[term][fieldName][docRef][metadataKey] = []
+ }
+
+ this.invertedIndex[term][fieldName][docRef][metadataKey].push(metadata)
+ }
+ }
+
+ }
+}
+
+/**
+ * Calculates the average document length for this index
+ *
+ * @private
+ */
+lunr.Builder.prototype.calculateAverageFieldLengths = function () {
+
+ var fieldRefs = Object.keys(this.fieldLengths),
+ numberOfFields = fieldRefs.length,
+ accumulator = {},
+ documentsWithField = {}
+
+ for (var i = 0; i < numberOfFields; i++) {
+ var fieldRef = lunr.FieldRef.fromString(fieldRefs[i]),
+ field = fieldRef.fieldName
+
+ documentsWithField[field] || (documentsWithField[field] = 0)
+ documentsWithField[field] += 1
+
+ accumulator[field] || (accumulator[field] = 0)
+ accumulator[field] += this.fieldLengths[fieldRef]
+ }
+
+ var fields = Object.keys(this._fields)
+
+ for (var i = 0; i < fields.length; i++) {
+ var fieldName = fields[i]
+ accumulator[fieldName] = accumulator[fieldName] / documentsWithField[fieldName]
+ }
+
+ this.averageFieldLength = accumulator
+}
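The accumulator bookkeeping above can be hard to follow from the loop alone. The following standalone sketch (a hypothetical `averageFieldLengths` helper, not lunr's implementation) shows the same per-field averaging over `"fieldName/docRef"` keys, which is how `lunr.FieldRef` serialises its two parts:

```javascript
function averageFieldLengths (fieldLengths) {
  var totals = {}, counts = {}

  Object.keys(fieldLengths).forEach(function (fieldRef) {
    // lunr.FieldRef serialises as "fieldName/docRef"
    var field = fieldRef.split("/")[0]
    totals[field] = (totals[field] || 0) + fieldLengths[fieldRef]
    counts[field] = (counts[field] || 0) + 1
  })

  var averages = {}
  Object.keys(totals).forEach(function (field) {
    averages[field] = totals[field] / counts[field]
  })

  return averages
}

// two documents whose "title" fields contain 2 and 4 tokens
var avg = averageFieldLengths({ "title/doc1": 2, "title/doc2": 4 })
// avg.title === 3
```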
+
+/**
+ * Builds a vector space model of every document using lunr.Vector
+ *
+ * @private
+ */
+lunr.Builder.prototype.createFieldVectors = function () {
+ var fieldVectors = {},
+ fieldRefs = Object.keys(this.fieldTermFrequencies),
+ fieldRefsLength = fieldRefs.length,
+ termIdfCache = Object.create(null)
+
+ for (var i = 0; i < fieldRefsLength; i++) {
+ var fieldRef = lunr.FieldRef.fromString(fieldRefs[i]),
+ fieldName = fieldRef.fieldName,
+ fieldLength = this.fieldLengths[fieldRef],
+ fieldVector = new lunr.Vector,
+ termFrequencies = this.fieldTermFrequencies[fieldRef],
+ terms = Object.keys(termFrequencies),
+ termsLength = terms.length
+
+
+ var fieldBoost = this._fields[fieldName].boost || 1,
+ docBoost = this._documents[fieldRef.docRef].boost || 1
+
+ for (var j = 0; j < termsLength; j++) {
+ var term = terms[j],
+ tf = termFrequencies[term],
+ termIndex = this.invertedIndex[term]._index,
+ idf, score, scoreWithPrecision
+
+ if (termIdfCache[term] === undefined) {
+ idf = lunr.idf(this.invertedIndex[term], this.documentCount)
+ termIdfCache[term] = idf
+ } else {
+ idf = termIdfCache[term]
+ }
+
+ score = idf * ((this._k1 + 1) * tf) / (this._k1 * (1 - this._b + this._b * (fieldLength / this.averageFieldLength[fieldName])) + tf)
+ score *= fieldBoost
+ score *= docBoost
+ scoreWithPrecision = Math.round(score * 1000) / 1000
+ // Converts 1.23456789 to 1.234.
+ // Reducing the precision so that the vectors take up less
+ // space when serialised. Doing it now so that they behave
+ // the same before and after serialisation. Also, this is
+ // the fastest approach to reducing a number's precision in
+ // JavaScript.
+
+ fieldVector.insert(termIndex, scoreWithPrecision)
+ }
+
+ fieldVectors[fieldRef] = fieldVector
+ }
+
+ this.fieldVectors = fieldVectors
+}
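The `score` expression inside `createFieldVectors` is a BM25-style term weight. Here is a self-contained sketch of just that computation (the `bm25Score` name is illustrative; the `k1 = 1.2` and `b = 0.75` values in the usage line assume lunr's default `_k1`/`_b` settings):

```javascript
function bm25Score (idf, tf, fieldLength, avgFieldLength, k1, b) {
  var score = idf * ((k1 + 1) * tf) /
    (k1 * (1 - b + b * (fieldLength / avgFieldLength)) + tf)

  // same precision reduction as the builder: keep three decimal places
  // so vectors serialise compactly and identically before/after a round trip
  return Math.round(score * 1000) / 1000
}

// a term with idf 1.5 appearing once in a field of exactly average length
var w = bm25Score(1.5, 1, 10, 10, 1.2, 0.75)  // rounds to 1.5
```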
+
+/**
+ * Creates a token set of all tokens in the index using lunr.TokenSet
+ *
+ * @private
+ */
+lunr.Builder.prototype.createTokenSet = function () {
+ this.tokenSet = lunr.TokenSet.fromArray(
+ Object.keys(this.invertedIndex).sort()
+ )
+}
+
+/**
+ * Builds the index, creating an instance of lunr.Index.
+ *
+ * This completes the indexing process and should only be called
+ * once all documents have been added to the index.
+ *
+ * @returns {lunr.Index}
+ */
+lunr.Builder.prototype.build = function () {
+ this.calculateAverageFieldLengths()
+ this.createFieldVectors()
+ this.createTokenSet()
+
+ return new lunr.Index({
+ invertedIndex: this.invertedIndex,
+ fieldVectors: this.fieldVectors,
+ tokenSet: this.tokenSet,
+ fields: Object.keys(this._fields),
+ pipeline: this.searchPipeline
+ })
+}
+
+/**
+ * Applies a plugin to the index builder.
+ *
+ * A plugin is a function that is called with the index builder as its context.
+ * Plugins can be used to customise or extend the behaviour of the index
+ * in some way. A plugin is just a function that encapsulates the custom
+ * behaviour that should be applied when building the index.
+ *
+ * The plugin function will be called with the index builder as its first
+ * argument; additional arguments can also be passed when calling use.
+ *
+ * @param {Function} plugin The plugin to apply.
+ */
+lunr.Builder.prototype.use = function (fn) {
+ var args = Array.prototype.slice.call(arguments, 1)
+ args.unshift(this)
+ fn.apply(this, args)
+}
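The contract described above is small enough to illustrate with a stand-in builder object (not lunr's): `use` passes the builder as both context and first argument, followed by any extra arguments.

```javascript
// Minimal stand-in demonstrating the plugin calling convention of use():
// the plugin receives the builder first, then any additional arguments.
var builder = {
  applied: [],
  use: function (fn) {
    var args = Array.prototype.slice.call(arguments, 1)
    args.unshift(this)
    fn.apply(this, args)
  }
}

// a plugin is just a function; here it records the label it was given
function myPlugin (builder, label) {
  builder.applied.push(label)
}

builder.use(myPlugin, "stemmer-config")
// builder.applied is now ["stemmer-config"]
```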
+/**
+ * Contains and collects metadata about a matching document.
+ * A single instance of lunr.MatchData is returned as part of every
+ * lunr.Index~Result.
+ *
+ * @constructor
+ * @param {string} term - The term this match data is associated with
+ * @param {string} field - The field in which the term was found
+ * @param {object} metadata - The metadata recorded about this term in this field
+ * @property {object} metadata - A cloned collection of metadata associated with this document.
+ * @see {@link lunr.Index~Result}
+ */
+lunr.MatchData = function (term, field, metadata) {
+ var clonedMetadata = Object.create(null),
+ metadataKeys = Object.keys(metadata || {})
+
+ // Cloning the metadata to prevent the original
+ // being mutated during match data combination.
+ // Metadata is kept in an array within the inverted
+ // index so cloning the data can be done with
+ // Array#slice
+ for (var i = 0; i < metadataKeys.length; i++) {
+ var key = metadataKeys[i]
+ clonedMetadata[key] = metadata[key].slice()
+ }
+
+ this.metadata = Object.create(null)
+
+ if (term !== undefined) {
+ this.metadata[term] = Object.create(null)
+ this.metadata[term][field] = clonedMetadata
+ }
+}
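The `Array#slice` cloning mentioned in the comment above is shallow, but that is sufficient here because later combination only pushes to or concatenates the per-key arrays. A minimal illustration of why the copy protects the original:

```javascript
// metadata values are arrays in the inverted index, so a per-key slice()
// is enough to decouple the clone from the original
var original = { position: [[0, 3]] }
var cloned = Object.create(null)

Object.keys(original).forEach(function (key) {
  cloned[key] = original[key].slice()
})

cloned.position.push([7, 3])
// original.position still has one entry; only the clone grew
```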
+
+/**
+ * An instance of lunr.MatchData will be created for every term that matches a
+ * document. However only one instance is required in a lunr.Index~Result. This
+ * method combines metadata from another instance of lunr.MatchData with this
+ * object's metadata.
+ *
+ * @param {lunr.MatchData} otherMatchData - Another instance of match data to merge with this one.
+ * @see {@link lunr.Index~Result}
+ */
+lunr.MatchData.prototype.combine = function (otherMatchData) {
+ var terms = Object.keys(otherMatchData.metadata)
+
+ for (var i = 0; i < terms.length; i++) {
+ var term = terms[i],
+ fields = Object.keys(otherMatchData.metadata[term])
+
+ if (this.metadata[term] == undefined) {
+ this.metadata[term] = Object.create(null)
+ }
+
+ for (var j = 0; j < fields.length; j++) {
+ var field = fields[j],
+ keys = Object.keys(otherMatchData.metadata[term][field])
+
+ if (this.metadata[term][field] == undefined) {
+ this.metadata[term][field] = Object.create(null)
+ }
+
+ for (var k = 0; k < keys.length; k++) {
+ var key = keys[k]
+
+ if (this.metadata[term][field][key] == undefined) {
+ this.metadata[term][field][key] = otherMatchData.metadata[term][field][key]
+ } else {
+ this.metadata[term][field][key] = this.metadata[term][field][key].concat(otherMatchData.metadata[term][field][key])
+ }
+
+ }
+ }
+ }
+}
+
+/**
+ * Add metadata for a term/field pair to this instance of match data.
+ *
+ * @param {string} term - The term this match data is associated with
+ * @param {string} field - The field in which the term was found
+ * @param {object} metadata - The metadata recorded about this term in this field
+ */
+lunr.MatchData.prototype.add = function (term, field, metadata) {
+ if (!(term in this.metadata)) {
+ this.metadata[term] = Object.create(null)
+ this.metadata[term][field] = metadata
+ return
+ }
+
+ if (!(field in this.metadata[term])) {
+ this.metadata[term][field] = metadata
+ return
+ }
+
+ var metadataKeys = Object.keys(metadata)
+
+ for (var i = 0; i < metadataKeys.length; i++) {
+ var key = metadataKeys[i]
+
+ if (key in this.metadata[term][field]) {
+ this.metadata[term][field][key] = this.metadata[term][field][key].concat(metadata[key])
+ } else {
+ this.metadata[term][field][key] = metadata[key]
+ }
+ }
+}
+/**
+ * A lunr.Query provides a programmatic way of defining queries to be performed
+ * against a {@link lunr.Index}.
+ *
+ * Prefer constructing a lunr.Query using the {@link lunr.Index#query} method
+ * so the query object is pre-initialized with the right index fields.
+ *
+ * @constructor
+ * @property {lunr.Query~Clause[]} clauses - An array of query clauses.
+ * @property {string[]} allFields - An array of all available fields in a lunr.Index.
+ */
+lunr.Query = function (allFields) {
+ this.clauses = []
+ this.allFields = allFields
+}
+
+/**
+ * Constants for indicating what kind of automatic wildcard insertion will be used when constructing a query clause.
+ *
+ * This allows wildcards to be added to the beginning and end of a term without having to manually do any string
+ * concatenation.
+ *
+ * The wildcard constants can be bitwise combined to select both leading and trailing wildcards.
+ *
+ * @constant
+ * @default
+ * @property {number} wildcard.NONE - The term will have no wildcards inserted, this is the default behaviour
+ * @property {number} wildcard.LEADING - Prepend the term with a wildcard, unless a leading wildcard already exists
+ * @property {number} wildcard.TRAILING - Append a wildcard to the term, unless a trailing wildcard already exists
+ * @see lunr.Query~Clause
+ * @see lunr.Query#clause
+ * @see lunr.Query#term
+ * @example
+ * query.term('foo', {
+ * wildcard: lunr.Query.wildcard.LEADING | lunr.Query.wildcard.TRAILING
+ * })
+ */
+
+lunr.Query.wildcard = new String ("*")
+lunr.Query.wildcard.NONE = 0
+lunr.Query.wildcard.LEADING = 1
+lunr.Query.wildcard.TRAILING = 2
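Because the constants above are distinct bits, they can be OR-ed together. This sketch (a hypothetical `applyWildcards` helper) mirrors the checks that `lunr.Query#clause` performs when it prepends or appends `*`:

```javascript
// bit flags matching the constants defined on lunr.Query.wildcard
var NONE = 0, LEADING = 1, TRAILING = 2

function applyWildcards (term, flags) {
  // only add a wildcard where one is not already present
  if ((flags & LEADING) && term.charAt(0) !== "*") term = "*" + term
  if ((flags & TRAILING) && term.slice(-1) !== "*") term = term + "*"
  return term
}

var both = applyWildcards("foo", LEADING | TRAILING)  // "*foo*"
var kept = applyWildcards("foo*", TRAILING)           // "foo*" unchanged
```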
+
+/**
+ * Constants for indicating what kind of presence a term must have in matching documents.
+ *
+ * @constant
+ * @enum {number}
+ * @see lunr.Query~Clause
+ * @see lunr.Query#clause
+ * @see lunr.Query#term
+ * @example <caption>query term with required presence</caption>
+ * query.term('foo', { presence: lunr.Query.presence.REQUIRED })
+ */
+lunr.Query.presence = {
+ /**
+ * Term's presence in a document is optional, this is the default value.
+ */
+ OPTIONAL: 1,
+
+ /**
+ * Term's presence in a document is required, documents that do not contain
+ * this term will not be returned.
+ */
+ REQUIRED: 2,
+
+ /**
+ * Term's presence in a document is prohibited, documents that do contain
+ * this term will not be returned.
+ */
+ PROHIBITED: 3
+}
+
+/**
+ * A single clause in a {@link lunr.Query} contains a term and details on how to
+ * match that term against a {@link lunr.Index}.
+ *
+ * @typedef {Object} lunr.Query~Clause
+ * @property {string[]} fields - The fields in an index this clause should be matched against.
+ * @property {number} [boost=1] - Any boost that should be applied when matching this clause.
+ * @property {number} [editDistance] - Whether the term should have fuzzy matching applied, and how fuzzy the match should be.
+ * @property {boolean} [usePipeline] - Whether the term should be passed through the search pipeline.
+ * @property {number} [wildcard=lunr.Query.wildcard.NONE] - Whether the term should have wildcards appended or prepended.
+ * @property {number} [presence=lunr.Query.presence.OPTIONAL] - The term's presence in any matching documents.
+ */
+
+/**
+ * Adds a {@link lunr.Query~Clause} to this query.
+ *
+ * Unless the clause specifies which fields should be matched, all fields will be matched. In addition
+ * a default boost of 1 is applied to the clause.
+ *
+ * @param {lunr.Query~Clause} clause - The clause to add to this query.
+ * @see lunr.Query~Clause
+ * @returns {lunr.Query}
+ */
+lunr.Query.prototype.clause = function (clause) {
+ if (!('fields' in clause)) {
+ clause.fields = this.allFields
+ }
+
+ if (!('boost' in clause)) {
+ clause.boost = 1
+ }
+
+ if (!('usePipeline' in clause)) {
+ clause.usePipeline = true
+ }
+
+ if (!('wildcard' in clause)) {
+ clause.wildcard = lunr.Query.wildcard.NONE
+ }
+
+ if ((clause.wildcard & lunr.Query.wildcard.LEADING) && (clause.term.charAt(0) != lunr.Query.wildcard)) {
+ clause.term = "*" + clause.term
+ }
+
+ if ((clause.wildcard & lunr.Query.wildcard.TRAILING) && (clause.term.slice(-1) != lunr.Query.wildcard)) {
+ clause.term = "" + clause.term + "*"
+ }
+
+ if (!('presence' in clause)) {
+ clause.presence = lunr.Query.presence.OPTIONAL
+ }
+
+ this.clauses.push(clause)
+
+ return this
+}
+
+/**
+ * A negated query is one in which every clause has a presence of
+ * prohibited. These queries require some special processing to return
+ * the expected results.
+ *
+ * @returns boolean
+ */
+lunr.Query.prototype.isNegated = function () {
+ for (var i = 0; i < this.clauses.length; i++) {
+ if (this.clauses[i].presence != lunr.Query.presence.PROHIBITED) {
+ return false
+ }
+ }
+
+ return true
+}
+
+/**
+ * Adds a term to the current query, under the covers this will add a {@link lunr.Query~Clause}
+ * to the list of clauses that make up this query.
+ *
+ * The term is used as is, i.e. no tokenization will be performed by this method. Instead, conversion
+ * to a token or token-like string should be done before calling this method.
+ *
+ * The term will be converted to a string by calling `toString`. Multiple terms can be passed as an
+ * array, each term in the array will share the same options.
+ *
+ * @param {object|object[]} term - The term(s) to add to the query.
+ * @param {object} [options] - Any additional properties to add to the query clause.
+ * @returns {lunr.Query}
+ * @see lunr.Query#clause
+ * @see lunr.Query~Clause
+ * @example <caption>adding a single term to a query</caption>
+ * query.term("foo")
+ * @example <caption>adding a single term to a query and specifying search fields, term boost and automatic trailing wildcard</caption>
';
+}
+
+function displayResults (results) {
+ var search_results = document.getElementById("mkdocs-search-results");
+ while (search_results.firstChild) {
+ search_results.removeChild(search_results.firstChild);
+ }
+ if (results.length > 0){
+ for (var i=0; i < results.length; i++){
+ var result = results[i];
+ var html = formatResult(result.location, result.title, result.summary);
+ search_results.insertAdjacentHTML('beforeend', html);
+ }
+ } else {
+    search_results.insertAdjacentHTML('beforeend', "<p>No results found</p>");
+ }
+}
+
+function doSearch () {
+ var query = document.getElementById('mkdocs-search-query').value;
+ if (query.length > min_search_length) {
+ if (!window.Worker) {
+ displayResults(search(query));
+ } else {
+ searchWorker.postMessage({query: query});
+ }
+ } else {
+ // Clear results for short queries
+ displayResults([]);
+ }
+}
+
+function initSearch () {
+ var search_input = document.getElementById('mkdocs-search-query');
+ if (search_input) {
+ search_input.addEventListener("keyup", doSearch);
+ }
+ var term = getSearchTermFromLocation();
+ if (term) {
+ search_input.value = term;
+ doSearch();
+ }
+}
+
+function onWorkerMessage (e) {
+ if (e.data.allowSearch) {
+ initSearch();
+ } else if (e.data.results) {
+ var results = e.data.results;
+ displayResults(results);
+ } else if (e.data.config) {
+ min_search_length = e.data.config.min_search_length-1;
+ }
+}
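Note the `- 1` in the config branch above: it lets `doSearch`'s strict `>` comparison fire exactly when the query reaches the configured minimum length (this site's `search_index.json` sets `min_search_length` to 3). A small sketch of that threshold behaviour:

```javascript
// mirrors onWorkerMessage storing min_search_length - 1 so that
// doSearch's "query.length > min_search_length" test passes once the
// query reaches the configured minimum
var configured = 3
var min_search_length = configured - 1

function longEnough (query) {
  return query.length > min_search_length
}

var tooShort = longEnough("ab")   // false: results are cleared
var searches = longEnough("abc")  // true: the search runs
```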
+
+if (!window.Worker) {
+ console.log('Web Worker API not supported');
+ // load index in main thread
+ $.getScript(joinUrl(base_url, "search/worker.js")).done(function () {
+ console.log('Loaded worker');
+ init();
+ window.postMessage = function (msg) {
+ onWorkerMessage({data: msg});
+ };
+ }).fail(function (jqxhr, settings, exception) {
+ console.error('Could not load worker.js');
+ });
+} else {
+ // Wrap search in a web worker
+ var searchWorker = new Worker(joinUrl(base_url, "search/worker.js"));
+ searchWorker.postMessage({init: true});
+ searchWorker.onmessage = onWorkerMessage;
+}
diff --git a/site/search/search_index.json b/site/search/search_index.json
new file mode 100644
index 0000000..086f62f
--- /dev/null
+++ b/site/search/search_index.json
@@ -0,0 +1 @@
+{"config":{"lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"Welcome / H\u00fbn bi x\u00ear hatin / \u0628\u06d5 \u062e\u06ce\u0631 \u0628\u06ce\u0646! \ud83d\ude42 Introduction Language technology is an increasingly important field in our information era which is dependent on our knowledge of the human language and computational methods to process it. Unlike the latter which undergoes constant progress with new methods and more efficient techniques being invented, the processability of human languages does not evolve with the same pace. This is particularly the case of languages with scarce resources and limited grammars, also known as less-resourced languages . Despite a plethora of performant tools and specific frameworks for natural language processing (NLP), such as NLTK , Stanza and spaCy , the progress with respect to less-resourced languages is often hindered by not only the lack of basic tools and resources but also the accessibility of the previous studies under an open-source licence. This is particularly the case of Kurdish. Kurdish Language Kurdish is a less-resourced Indo-European language which is spoken by 20-30 million speakers in the Kurdish regions of Turkey, Iraq, Iran and Syria and also, among the Kurdish diaspora around the world. It is mainly spoken in four dialects (also referred to as languages): Northern Kurdish (or Kurmanji) kmr Central Kurdish (or Sorani) ckb Southern Kurdish sdh Laki lki Kurdish has been historically written in various scripts, namely Cyrillic, Armenian, Latin and Arabic among which the latter two are still widely in use. Efforts in standardization of the Kurdish alphabets and orthographies have not succeeded to be globally followed by all Kurdish speakers in all regions. As such, the Kurmanji dialect is mostly written in the Latin-based script while the Sorani, Southern Kurdish and Laki are mostly written in the Arabic-based script. 
KLPT - The Kurdish Language Processing Toolkit KLPT - the Kurdish Language Processing Toolkit is an NLP toolkit for the Kurdish language . The current version (0.1) comes with four core modules, namely preprocess , stem , transliterate and tokenize , and addresses basic language processing tasks such as text preprocessing, stemming, tokenziation, spell error detection and correction, and morphological analysis for the Sorani and Kurmanji dialects of Kurdish. More importantly, it is an open-source project ! To find out more about how to use the tool, please check the \"User Guide\" section of this website. Cite this project Please consider citing this paper , if you use any part of the data or the tool ( bib file ): @inproceedings{ahmadi2020klpt, title = \"{KLPT} {--} {K}urdish Language Processing Toolkit\", author = \"Ahmadi, Sina\", booktitle = \"Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)\", month = nov, year = \"2020\", address = \"Online\", publisher = \"Association for Computational Linguistics\", url = \"https://www.aclweb.org/anthology/2020.nlposs-1.11\", pages = \"72--84\" } You can also watch the presentation of this paper at https://slideslive.com/38939750/klpt-kurdish-language-processing-toolkit . License Kurdish Language Processing Toolkit by Sina Ahmadi is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License which means: You are free to share , copy and redistribute the material in any medium or format and also adapt, remix, transform, and build upon the material for any purpose, even commercially . You must give appropriate credit , provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use. 
If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original .","title":"Home"},{"location":"#welcome-hun-bi-xer-hatin","text":"","title":"Welcome / H\u00fbn bi x\u00ear hatin / \u0628\u06d5 \u062e\u06ce\u0631 \u0628\u06ce\u0646! \ud83d\ude42"},{"location":"#introduction","text":"Language technology is an increasingly important field in our information era which is dependent on our knowledge of the human language and computational methods to process it. Unlike the latter which undergoes constant progress with new methods and more efficient techniques being invented, the processability of human languages does not evolve with the same pace. This is particularly the case of languages with scarce resources and limited grammars, also known as less-resourced languages . Despite a plethora of performant tools and specific frameworks for natural language processing (NLP), such as NLTK , Stanza and spaCy , the progress with respect to less-resourced languages is often hindered by not only the lack of basic tools and resources but also the accessibility of the previous studies under an open-source licence. This is particularly the case of Kurdish.","title":"Introduction"},{"location":"#kurdish-language","text":"Kurdish is a less-resourced Indo-European language which is spoken by 20-30 million speakers in the Kurdish regions of Turkey, Iraq, Iran and Syria and also, among the Kurdish diaspora around the world. It is mainly spoken in four dialects (also referred to as languages): Northern Kurdish (or Kurmanji) kmr Central Kurdish (or Sorani) ckb Southern Kurdish sdh Laki lki Kurdish has been historically written in various scripts, namely Cyrillic, Armenian, Latin and Arabic among which the latter two are still widely in use. Efforts in standardization of the Kurdish alphabets and orthographies have not succeeded to be globally followed by all Kurdish speakers in all regions. 
As such, the Kurmanji dialect is mostly written in the Latin-based script while the Sorani, Southern Kurdish and Laki are mostly written in the Arabic-based script.","title":"Kurdish Language"},{"location":"#klpt-the-kurdish-language-processing-toolkit","text":"KLPT - the Kurdish Language Processing Toolkit is an NLP toolkit for the Kurdish language . The current version (0.1) comes with four core modules, namely preprocess , stem , transliterate and tokenize , and addresses basic language processing tasks such as text preprocessing, stemming, tokenziation, spell error detection and correction, and morphological analysis for the Sorani and Kurmanji dialects of Kurdish. More importantly, it is an open-source project ! To find out more about how to use the tool, please check the \"User Guide\" section of this website.","title":"KLPT - The Kurdish Language Processing Toolkit"},{"location":"#cite-this-project","text":"Please consider citing this paper , if you use any part of the data or the tool ( bib file ): @inproceedings{ahmadi2020klpt, title = \"{KLPT} {--} {K}urdish Language Processing Toolkit\", author = \"Ahmadi, Sina\", booktitle = \"Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)\", month = nov, year = \"2020\", address = \"Online\", publisher = \"Association for Computational Linguistics\", url = \"https://www.aclweb.org/anthology/2020.nlposs-1.11\", pages = \"72--84\" } You can also watch the presentation of this paper at https://slideslive.com/38939750/klpt-kurdish-language-processing-toolkit .","title":"Cite this project"},{"location":"#license","text":"Kurdish Language Processing Toolkit by Sina Ahmadi is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License which means: You are free to share , copy and redistribute the material in any medium or format and also adapt, remix, transform, and build upon the material for any purpose, even commercially . 
You must give appropriate credit , provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use. If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original .","title":"License"},{"location":"about/contributing/","text":"How to help Kurdish language processing? One of our main objectives in this project is to promote collaborative projects with open-source outcomes. If you are generous and passionate to volunteer and help the Kurdish language, there are three ways you can do so: If you are a native Kurdish speaker with general knowledge about Kurdish and are comfortable working with computer, contributing to collaboratively-curated resources is the best starting point, particularly to: W\u00eek\u00eeferheng - the Kurdish Wiktionary Wikipedia in Sorani and in Kurmanji If you have expertise in Kurdish linguistics, you can take part in annotation tasks. Having a basic understanding on computational linguistics is a plus but not a must. Please get in touch by joining the KurdishNLP community on Gitter . Our collaborations oftentimes lead to a scientific paper depending on the task. Please check the following repositories to find out about some of our previous projects: Kurdish tokenization Kurdish Hunspell Kurdish transliteration If you are not included in 1 and 2 but have basic knowledge about Kurdish, particularly writing in Kurdish, you are invited to create content online. You can start creating a blog or tweet in Kurdish. After all, every single person is a contributor as well . In any case, please follow this project and introduce it to your friends. 
Test the tool and raise your issues so that we can fix them.","title":"Contributing"},{"location":"about/contributing/#how-to-help-kurdish-language-processing","text":"One of our main objectives in this project is to promote collaborative projects with open-source outcomes. If you are generous and passionate to volunteer and help the Kurdish language, there are three ways you can do so: If you are a native Kurdish speaker with general knowledge about Kurdish and are comfortable working with computer, contributing to collaboratively-curated resources is the best starting point, particularly to: W\u00eek\u00eeferheng - the Kurdish Wiktionary Wikipedia in Sorani and in Kurmanji If you have expertise in Kurdish linguistics, you can take part in annotation tasks. Having a basic understanding on computational linguistics is a plus but not a must. Please get in touch by joining the KurdishNLP community on Gitter . Our collaborations oftentimes lead to a scientific paper depending on the task. Please check the following repositories to find out about some of our previous projects: Kurdish tokenization Kurdish Hunspell Kurdish transliteration If you are not included in 1 and 2 but have basic knowledge about Kurdish, particularly writing in Kurdish, you are invited to create content online. You can start creating a blog or tweet in Kurdish. After all, every single person is a contributor as well . In any case, please follow this project and introduce it to your friends. 
Test the tool and raise your issues so that we can fix them.","title":"How to help Kurdish language processing?"},{"location":"about/license/","text":"License This project is created by Sina Ahmadi and is publicly available under a Creative Commons Attribution-ShareAlike 4.0 International Public License https://creativecommons.org/licenses/by-sa/4.0/ .","title":"License"},{"location":"about/license/#license","text":"This project is created by Sina Ahmadi and is publicly available under a Creative Commons Attribution-ShareAlike 4.0 International Public License https://creativecommons.org/licenses/by-sa/4.0/ .","title":"License"},{"location":"about/release-notes/","text":"About the current version Please note that KLPT is under development and some of the functionalities will appear in the future versions. You can find out regarding the progress of each task at the Projects section. In the current version, the following tasks are included: Modules Tasks Sorani (ckb) Kurmanji (kmr) preprocess normalization \u2713 (v0.1.0) \u2713 (v0.1.0) standardization \u2713 (v0.1.0) \u2713 (v0.1.0) unification of numerals \u2713 (v0.1.0) \u2713 (v0.1.0) tokenize word tokenization \u2713 (v0.1.0) \u2713 (v0.1.0) MWE tokenization \u2713 (v0.1.0) \u2713 (v0.1.0) sentence tokenization \u2713 (v0.1.0) \u2713 (v0.1.0) transliterate Arabic to Latin \u2713 (v0.1.0) \u2713 (v0.1.0) Latin to Arabic \u2713 (v0.1.0) \u2713 (v0.1.0) Detection of u/w and \u00ee/y \u2713 (v0.1.0) \u2713 (v0.1.0) Detection of Bizroke ( i ) \u2717 \u2717 stem morphological analysis \u2713 (v0.1.0) \u2717 morphological generation \u2713 (v0.1.0) \u2717 stemming \u2717 \u2717 lemmatization \u2717 \u2717 spell error detection and correction \u2713 (v0.1.0) \u2717","title":"Release Notes"},{"location":"about/release-notes/#about-the-current-version","text":"Please note that KLPT is under development and some of the functionalities will appear in the future versions. 
You can find out regarding the progress of each task at the Projects section. In the current version, the following tasks are included: Modules Tasks Sorani (ckb) Kurmanji (kmr) preprocess normalization \u2713 (v0.1.0) \u2713 (v0.1.0) standardization \u2713 (v0.1.0) \u2713 (v0.1.0) unification of numerals \u2713 (v0.1.0) \u2713 (v0.1.0) tokenize word tokenization \u2713 (v0.1.0) \u2713 (v0.1.0) MWE tokenization \u2713 (v0.1.0) \u2713 (v0.1.0) sentence tokenization \u2713 (v0.1.0) \u2713 (v0.1.0) transliterate Arabic to Latin \u2713 (v0.1.0) \u2713 (v0.1.0) Latin to Arabic \u2713 (v0.1.0) \u2713 (v0.1.0) Detection of u/w and \u00ee/y \u2713 (v0.1.0) \u2713 (v0.1.0) Detection of Bizroke ( i ) \u2717 \u2717 stem morphological analysis \u2713 (v0.1.0) \u2717 morphological generation \u2713 (v0.1.0) \u2717 stemming \u2717 \u2717 lemmatization \u2717 \u2717 spell error detection and correction \u2713 (v0.1.0) \u2717","title":"About the current version"},{"location":"about/sponsors/","text":"Become a sponsor Please consider donating to the project. Data annotation and resource creation requires tremendous amount of time and linguistic expertise. Even a trivial donation will make a difference. You can do so by becoming a sponsor to accompany me in this journey and help the Kurdish language have a better place within other natural languages on the Web. Depending on your support, You can be an official sponsor You will get a GitHub sponsor badge on your profile If you have any questions, I will focus on it If you want, I will add your name or company logo on the front page of your preferred project Your contribution will be acknowledged in one of my future papers in a field of your choice Our sponsors: Be the first one! \ud83d\ude42 Name/company donation ($) URL","title":"Sponsors"},{"location":"about/sponsors/#become-a-sponsor","text":"Please consider donating to the project. Data annotation and resource creation requires tremendous amount of time and linguistic expertise. 
Even a trivial donation will make a difference. You can do so by becoming a sponsor to accompany me in this journey and help the Kurdish language have a better place within other natural languages on the Web. Depending on your support, You can be an official sponsor You will get a GitHub sponsor badge on your profile If you have any questions, I will focus on it If you want, I will add your name or company logo on the front page of your preferred project Your contribution will be acknowledged in one of my future papers in a field of your choice","title":"Become a sponsor"},{"location":"about/sponsors/#our-sponsors","text":"Be the first one! \ud83d\ude42 Name/company donation ($) URL","title":"Our sponsors:"},{"location":"user-guide/getting-started/","text":"Install KLPT KLPT is implemented in Python and requires basic knowledge on programming and particularly the Python language. Find out more about Python at https://www.python.org/ . Requirements Operating system : macOS / OS X \u00b7 Linux \u00b7 Windows (Cygwin, MinGW, Visual Studio) Python version : Python 3.5+ Package managers : pip cyhunspell >= 2.0.1 pip Using pip, KLPT releases are available as source packages and binary wheels. Please make sure that a compatible Python version is installed: pip install klpt All the data files including lexicons and morphological rules are also installed with the package. Although KLPT is not dependent on any NLP toolkit, there is one important requirement, particularly for the stem module. That is cyhunspell which should be installed with a version >= 2.0.1. 
Import klpt Once the package is installed, you can import the toolkit as follows: import klpt As a principle, the following parameters are widely used in the toolkit: dialect : the name of the dialect as Sorani or Kurmanji (ISO 639-3 code will be also added) script : the script of your input text as \"Arabic\" or \"Latin\" numeral : the type of the numerals as Arabic [\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669\u0660] Farsi [\u06f1\u06f2\u06f3\u06f4\u06f5\u06f6\u06f7\u06f8\u06f9\u06f0] Latin [1234567890]","title":"Getting started"},{"location":"user-guide/getting-started/#install-klpt","text":"KLPT is implemented in Python and requires basic knowledge on programming and particularly the Python language. Find out more about Python at https://www.python.org/ .","title":"Install KLPT"},{"location":"user-guide/getting-started/#requirements","text":"Operating system : macOS / OS X \u00b7 Linux \u00b7 Windows (Cygwin, MinGW, Visual Studio) Python version : Python 3.5+ Package managers : pip cyhunspell >= 2.0.1","title":"Requirements"},{"location":"user-guide/getting-started/#pip","text":"Using pip, KLPT releases are available as source packages and binary wheels. Please make sure that a compatible Python version is installed: pip install klpt All the data files including lexicons and morphological rules are also installed with the package. Although KLPT is not dependent on any NLP toolkit, there is one important requirement, particularly for the stem module. 
That is cyhunspell, which should be installed with version >= 2.0.1.","title":"pip"},{"location":"user-guide/getting-started/#import-klpt","text":"Once the package is installed, you can import the toolkit as follows: import klpt As a principle, the following parameters are widely used in the toolkit: dialect : the name of the dialect as Sorani or Kurmanji (ISO 639-3 codes will also be added) script : the script of your input text as \"Arabic\" or \"Latin\" numeral : the type of the numerals as Arabic [\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669\u0660] Farsi [\u06f1\u06f2\u06f3\u06f4\u06f5\u06f6\u06f7\u06f8\u06f9\u06f0] Latin [1234567890]","title":"Import klpt"},{"location":"user-guide/preprocess/","text":"preprocess package This module deals with normalizing scripts and orthographies by using writing conventions based on dialects and scripts. The goal is not to correct the orthography but to normalize the text in terms of the encoding and common writing rules. The input encoding should be in UTF-8 only. To this end, three functions are provided as follows: normalize : deals with different encodings and unifies characters based on dialects and scripts standardize : given a normalized text, it returns standardized text based on the Kurdish orthographies following recommendations for Kurmanji and Sorani unify_numerals : conversion of the various types of numerals used in Kurdish texts It is recommended that the output of this module be used as the input of subsequent tasks in an NLP pipeline. 
Examples: >>> from klpt.preprocess import Preprocess >>> preprocessor_ckb = Preprocess(\"Sorani\", \"Arabic\", numeral=\"Latin\") >>> preprocessor_ckb.normalize(\"\u0644\u06d5 \u0633\u0640\u0640\u0640\u0627\u06b5\u06d5\u06a9\u0627\u0646\u06cc \u0661\u0669\u0665\u0660\u062f\u0627\") '\u0644\u06d5 \u0633\u0627\u06b5\u06d5\u06a9\u0627\u0646\u06cc 1950\u062f\u0627' >>> preprocessor_ckb.standardize(\"\u0631\u0627\u0633\u062a\u06d5 \u0644\u06d5\u0648 \u0648\u0648\u06b5\u0627\u062a\u06d5\u062f\u0627\") '\u0695\u0627\u0633\u062a\u06d5 \u0644\u06d5\u0648 \u0648\u06b5\u0627\u062a\u06d5\u062f\u0627' >>> preprocessor_ckb.unify_numerals(\"\u0662\u0660\u0662\u0660\") '2020' >>> preprocessor_kmr = Preprocess(\"Kurmanji\", \"Latin\") >>> preprocessor_kmr.standardize(\"di sala 2018-an\") 'di sala 2018an' >>> preprocessor_kmr.standardize(\"h\u00eaviya\") 'h\u00eav\u00eeya' The preprocessing rules are provided at data/preprocess_map.json . __init__ ( self , dialect , script , numeral = 'Latin' ) special Initialization of the Preprocess class Parameters: Name Type Description Default dialect str the name of the dialect or its ISO 639-3 code required script str the name of the script required numeral str the type of the numeral 'Latin' Source code in klpt/preprocess.py def __init__ ( self , dialect , script , numeral = \"Latin\" ): \"\"\" Initialization of the Preprocess class Arguments: dialect (str): the name of the dialect or its ISO 639-3 code script (str): the name of the script numeral (str): the type of the numeral \"\"\" with open ( klpt . get_data ( \"data/preprocess_map.json\" )) as preprocess_file : self . preprocess_map = json . load ( preprocess_file ) configuration = Configuration ({ \"dialect\" : dialect , \"script\" : script , \"numeral\" : numeral }) self . dialect = configuration . dialect self . script = configuration . script self . numeral = configuration . 
numeral normalize ( self , text ) Text normalization This function deals with different encodings and unifies characters based on dialects and scripts as follows: Sorani-Arabic: replace frequent Arabic characters with their equivalent Kurdish ones, e.g. \"\u064a\" by \"\u06cc\" and \"\u0643\" by \"\u06a9\" replace \"\u0647\" followed by zero-width non-joiner (ZWNJ, U+200C) with \"\u06d5\" where ZWNJ is removed (\"\u0631\u0647\u200c\u0632\u0628\u0647\u200c\u0631\" is converted to \"\u0631\u06d5\u0632\u0628\u06d5\u0631\"). ZWNJ in HTML is also taken into account. replace \"\u0647\u0640\" with \"\u06be\" (U+06BE, ARABIC LETTER HEH DOACHASHMEE) remove Kashida \"\u0640\" \"\u06be\" in the middle of a word is replaced by \u0647 (U+0647) replace different types of y, such as 'ARABIC LETTER ALEF MAKSURA' (U+0649) It should be noted that the order of the replacements is important. Check out provided files for further details and test cases. Parameters: Name Type Description Default text str a string required Returns: Type Description str normalized text Source code in klpt/preprocess.py def normalize ( self , text ): \"\"\" Text normalization This function deals with different encodings and unifies characters based on dialects and scripts as follows: - Sorani-Arabic: - replace frequent Arabic characters with their equivalent Kurdish ones, e.g. \"\u064a\" by \"\u06cc\" and \"\u0643\" by \"\u06a9\" - replace \"\u0647\" followed by zero-width non-joiner (ZWNJ, U+200C) with \"\u06d5\" where ZWNJ is removed (\"\u0631\u0647\u200c\u0632\u0628\u0647\u200c\u0631\" is converted to \"\u0631\u06d5\u0632\u0628\u06d5\u0631\"). ZWNJ in HTML is also taken into account. - replace \"\u0647\u0640\" with \"\u06be\" (U+06BE, ARABIC LETTER HEH DOACHASHMEE) - remove Kashida \"\u0640\" - \"\u06be\" in the middle of a word is replaced by \u0647 (U+0647) - replace different types of y, such as 'ARABIC LETTER ALEF MAKSURA' (U+0649) It should be noted that the order of the replacements is important. 
Check out provided files for further details and test cases. Arguments: text (str): a string Returns: str: normalized text \"\"\" temp_text = \" \" + self . unify_numerals ( text ) + \" \" for normalization_type in [ \"universal\" , self . dialect ]: for rep in self . preprocess_map [ \"normalizer\" ][ normalization_type ][ self . script ]: rep_tar = self . preprocess_map [ \"normalizer\" ][ normalization_type ][ self . script ][ rep ] temp_text = re . sub ( rf \" { rep } \" , rf \" { rep_tar } \" , temp_text , flags = re . I ) return temp_text . strip () preprocess ( self , text ) One single function for normalization, standardization and unification of numerals Parameters: Name Type Description Default text str a string required Returns: Type Description str preprocessed text Source code in klpt/preprocess.py def preprocess ( self , text ): \"\"\" One single function for normalization, standardization and unification of numerals Arguments: text (str): a string Returns: str: preprocessed text \"\"\" return self . unify_numerals ( self . standardize ( self . normalize ( text ))) standardize ( self , text ) Method of standardization of Kurdish orthographies Given a normalized text, it returns standardized text based on the Kurdish orthographies. Sorani-Arabic: replace alveolar flap \u0631 (/\u027e/) at the beginning of the word by the alveolar trill \u0695 (/r/) replace double rr and ll with \u0159 and \u0142 respectively Kurmanji-Latin: replace \"-an\" or \"'an\" in dates and numerals (\"di sala 2018'an\" and \"di sala 2018-an\" -> \"di sala 2018an\") Open issues: - replace \" \u0648\u06d5 \" by \" \u0648 \"? But this is not always possible, \"min bo we\" (\u0631\u06cc\u0632\u06af\u0640\u0631\u062a\u0646\u0627 \u0645\u0646 \u0628\u0648 \u0648\u06d5 \u0646\u06d5 \u0626\u06d5 \u0648\u06d5 \u0626\u0640\u0640\u06d5 \u0632) - \"pirt\u00fck\u00ea\": \"pirt\u00fbk\u00ea\"? - Should \u0131 (LATIN SMALL LETTER DOTLESS I) be replaced by i? 
Parameters: Name Type Description Default text str a string required Returns: Type Description str standardized text Source code in klpt/preprocess.py def standardize ( self , text ): \"\"\" Method of standardization of Kurdish orthographies Given a normalized text, it returns standardized text based on the Kurdish orthographies. - Sorani-Arabic: - replace alveolar flap \u0631 (/\u027e/) at the beginning of the word by the alveolar trill \u0695 (/r/) - replace double rr and ll with \u0159 and \u0142 respectively - Kurmanji-Latin: - replace \"-an\" or \"'an\" in dates and numerals (\"di sala 2018'an\" and \"di sala 2018-an\" -> \"di sala 2018an\") Open issues: - replace \" \u0648\u06d5 \" by \" \u0648 \"? But this is not always possible, \"min bo we\" (\u0631\u06cc\u0632\u06af\u0640\u0631\u062a\u0646\u0627 \u0645\u0646 \u0628\u0648 \u0648\u06d5 \u0646\u06d5 \u0626\u06d5 \u0648\u06d5 \u0626\u0640\u0640\u06d5 \u0632) - \"pirt\u00fck\u00ea\": \"pirt\u00fbk\u00ea\"? - Should [\u0131 (LATIN SMALL LETTER DOTLESS I](https://www.compart.com/en/unicode/U+0131) be replaced by i? Arguments: text (str): a string Returns: str: standardized text \"\"\" temp_text = \" \" + self . unify_numerals ( text ) + \" \" for standardization_type in [ self . dialect ]: for rep in self . preprocess_map [ \"standardizer\" ][ standardization_type ][ self . script ]: rep_tar = self . preprocess_map [ \"standardizer\" ][ standardization_type ][ self . script ][ rep ] temp_text = re . sub ( rf \" { rep } \" , rf \" { rep_tar } \" , temp_text , flags = re . I ) return temp_text . 
strip () unify_numerals ( self , text ) Convert numerals to the desired one There are three types of numerals: - Arabic [\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669\u0660] - Farsi [\u06f1\u06f2\u06f3\u06f4\u06f5\u06f6\u06f7\u06f8\u06f9\u06f0] - Latin [1234567890] Parameters: Name Type Description Default text str a string required Returns: Type Description str text with unified numerals Source code in klpt/preprocess.py def unify_numerals ( self , text ): \"\"\" Convert numerals to the desired one There are three types of numerals: - Arabic [\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669\u0660] - Farsi [\u06f1\u06f2\u06f3\u06f4\u06f5\u06f6\u06f7\u06f8\u06f9\u06f0] - Latin [1234567890] Arguments: text (str): a string Returns: str: text with unified numerals \"\"\" for i , j in self . preprocess_map [ \"normalizer\" ][ \"universal\" ][ \"numerals\" ][ self . numeral ] . items (): text = text . replace ( i , j ) return text","title":"Preprocess"},{"location":"user-guide/preprocess/#preprocess-package","text":"","title":"preprocess package"},{"location":"user-guide/preprocess/#klpt.preprocess.Preprocess","text":"This module deals with normalizing scripts and orthographies by using writing conventions based on dialects and scripts. The goal is not to correct the orthography but to normalize the text in terms of the encoding and common writing rules. The input encoding should be in UTF-8 only. To this end, three functions are provided as follows: normalize : deals with different encodings and unifies characters based on dialects and scripts standardize : given a normalized text, it returns standardized text based on the Kurdish orthographies following recommendations for Kurmanji and Sorani unify_numerals : conversion of the various types of numerals used in Kurdish texts It is recommended that the output of this module be used as the input of subsequent tasks in an NLP pipeline. 
Examples: >>> from klpt.preprocess import Preprocess >>> preprocessor_ckb = Preprocess(\"Sorani\", \"Arabic\", numeral=\"Latin\") >>> preprocessor_ckb.normalize(\"\u0644\u06d5 \u0633\u0640\u0640\u0640\u0627\u06b5\u06d5\u06a9\u0627\u0646\u06cc \u0661\u0669\u0665\u0660\u062f\u0627\") '\u0644\u06d5 \u0633\u0627\u06b5\u06d5\u06a9\u0627\u0646\u06cc 1950\u062f\u0627' >>> preprocessor_ckb.standardize(\"\u0631\u0627\u0633\u062a\u06d5 \u0644\u06d5\u0648 \u0648\u0648\u06b5\u0627\u062a\u06d5\u062f\u0627\") '\u0695\u0627\u0633\u062a\u06d5 \u0644\u06d5\u0648 \u0648\u06b5\u0627\u062a\u06d5\u062f\u0627' >>> preprocessor_ckb.unify_numerals(\"\u0662\u0660\u0662\u0660\") '2020' >>> preprocessor_kmr = Preprocess(\"Kurmanji\", \"Latin\") >>> preprocessor_kmr.standardize(\"di sala 2018-an\") 'di sala 2018an' >>> preprocessor_kmr.standardize(\"h\u00eaviya\") 'h\u00eav\u00eeya' The preprocessing rules are provided at data/preprocess_map.json .","title":"klpt.preprocess.Preprocess"},{"location":"user-guide/preprocess/#klpt.preprocess.Preprocess.__init__","text":"Initialization of the Preprocess class Parameters: Name Type Description Default dialect str the name of the dialect or its ISO 639-3 code required script str the name of the script required numeral str the type of the numeral 'Latin' Source code in klpt/preprocess.py def __init__ ( self , dialect , script , numeral = \"Latin\" ): \"\"\" Initialization of the Preprocess class Arguments: dialect (str): the name of the dialect or its ISO 639-3 code script (str): the name of the script numeral (str): the type of the numeral \"\"\" with open ( klpt . get_data ( \"data/preprocess_map.json\" )) as preprocess_file : self . preprocess_map = json . load ( preprocess_file ) configuration = Configuration ({ \"dialect\" : dialect , \"script\" : script , \"numeral\" : numeral }) self . dialect = configuration . dialect self . script = configuration . script self . numeral = configuration . 
numeral","title":"__init__()"},{"location":"user-guide/preprocess/#klpt.preprocess.Preprocess.normalize","text":"Text normalization This function deals with different encodings and unifies characters based on dialects and scripts as follows: Sorani-Arabic: replace frequent Arabic characters with their equivalent Kurdish ones, e.g. \"\u064a\" by \"\u06cc\" and \"\u0643\" by \"\u06a9\" replace \"\u0647\" followed by zero-width non-joiner (ZWNJ, U+200C) with \"\u06d5\" where ZWNJ is removed (\"\u0631\u0647\u200c\u0632\u0628\u0647\u200c\u0631\" is converted to \"\u0631\u06d5\u0632\u0628\u06d5\u0631\"). ZWNJ in HTML is also taken into account. replace \"\u0647\u0640\" with \"\u06be\" (U+06BE, ARABIC LETTER HEH DOACHASHMEE) remove Kashida \"\u0640\" \"\u06be\" in the middle of a word is replaced by \u0647 (U+0647) replace different types of y, such as 'ARABIC LETTER ALEF MAKSURA' (U+0649) It should be noted that the order of the replacements is important. Check out provided files for further details and test cases. Parameters: Name Type Description Default text str a string required Returns: Type Description str normalized text Source code in klpt/preprocess.py def normalize ( self , text ): \"\"\" Text normalization This function deals with different encodings and unifies characters based on dialects and scripts as follows: - Sorani-Arabic: - replace frequent Arabic characters with their equivalent Kurdish ones, e.g. \"\u064a\" by \"\u06cc\" and \"\u0643\" by \"\u06a9\" - replace \"\u0647\" followed by zero-width non-joiner (ZWNJ, U+200C) with \"\u06d5\" where ZWNJ is removed (\"\u0631\u0647\u200c\u0632\u0628\u0647\u200c\u0631\" is converted to \"\u0631\u06d5\u0632\u0628\u06d5\u0631\"). ZWNJ in HTML is also taken into account. 
- replace \"\u0647\u0640\" with \"\u06be\" (U+06BE, ARABIC LETTER HEH DOACHASHMEE) - remove Kashida \"\u0640\" - \"\u06be\" in the middle of a word is replaced by \u0647 (U+0647) - replace different types of y, such as 'ARABIC LETTER ALEF MAKSURA' (U+0649) It should be noted that the order of the replacements is important. Check out provided files for further details and test cases. Arguments: text (str): a string Returns: str: normalized text \"\"\" temp_text = \" \" + self . unify_numerals ( text ) + \" \" for normalization_type in [ \"universal\" , self . dialect ]: for rep in self . preprocess_map [ \"normalizer\" ][ normalization_type ][ self . script ]: rep_tar = self . preprocess_map [ \"normalizer\" ][ normalization_type ][ self . script ][ rep ] temp_text = re . sub ( rf \" { rep } \" , rf \" { rep_tar } \" , temp_text , flags = re . I ) return temp_text . strip ()","title":"normalize()"},{"location":"user-guide/preprocess/#klpt.preprocess.Preprocess.preprocess","text":"One single function for normalization, standardization and unification of numerals Parameters: Name Type Description Default text str a string required Returns: Type Description str preprocessed text Source code in klpt/preprocess.py def preprocess ( self , text ): \"\"\" One single function for normalization, standardization and unification of numerals Arguments: text (str): a string Returns: str: preprocessed text \"\"\" return self . unify_numerals ( self . standardize ( self . normalize ( text )))","title":"preprocess()"},{"location":"user-guide/preprocess/#klpt.preprocess.Preprocess.standardize","text":"Method of standardization of Kurdish orthographies Given a normalized text, it returns standardized text based on the Kurdish orthographies. 
Sorani-Arabic: replace alveolar flap \u0631 (/\u027e/) at the beginning of the word by the alveolar trill \u0695 (/r/) replace double rr and ll with \u0159 and \u0142 respectively Kurmanji-Latin: replace \"-an\" or \"'an\" in dates and numerals (\"di sala 2018'an\" and \"di sala 2018-an\" -> \"di sala 2018an\") Open issues: - replace \" \u0648\u06d5 \" by \" \u0648 \"? But this is not always possible, \"min bo we\" (\u0631\u06cc\u0632\u06af\u0640\u0631\u062a\u0646\u0627 \u0645\u0646 \u0628\u0648 \u0648\u06d5 \u0646\u06d5 \u0626\u06d5 \u0648\u06d5 \u0626\u0640\u0640\u06d5 \u0632) - \"pirt\u00fck\u00ea\": \"pirt\u00fbk\u00ea\"? - Should \u0131 (LATIN SMALL LETTER DOTLESS I) be replaced by i? Parameters: Name Type Description Default text str a string required Returns: Type Description str standardized text Source code in klpt/preprocess.py def standardize ( self , text ): \"\"\" Method of standardization of Kurdish orthographies Given a normalized text, it returns standardized text based on the Kurdish orthographies. - Sorani-Arabic: - replace alveolar flap \u0631 (/\u027e/) at the beginning of the word by the alveolar trill \u0695 (/r/) - replace double rr and ll with \u0159 and \u0142 respectively - Kurmanji-Latin: - replace \"-an\" or \"'an\" in dates and numerals (\"di sala 2018'an\" and \"di sala 2018-an\" -> \"di sala 2018an\") Open issues: - replace \" \u0648\u06d5 \" by \" \u0648 \"? But this is not always possible, \"min bo we\" (\u0631\u06cc\u0632\u06af\u0640\u0631\u062a\u0646\u0627 \u0645\u0646 \u0628\u0648 \u0648\u06d5 \u0646\u06d5 \u0626\u06d5 \u0648\u06d5 \u0626\u0640\u0640\u06d5 \u0632) - \"pirt\u00fck\u00ea\": \"pirt\u00fbk\u00ea\"? - Should [\u0131 (LATIN SMALL LETTER DOTLESS I](https://www.compart.com/en/unicode/U+0131) be replaced by i? Arguments: text (str): a string Returns: str: standardized text \"\"\" temp_text = \" \" + self . unify_numerals ( text ) + \" \" for standardization_type in [ self . dialect ]: for rep in self . 
preprocess_map [ \"standardizer\" ][ standardization_type ][ self . script ]: rep_tar = self . preprocess_map [ \"standardizer\" ][ standardization_type ][ self . script ][ rep ] temp_text = re . sub ( rf \" { rep } \" , rf \" { rep_tar } \" , temp_text , flags = re . I ) return temp_text . strip ()","title":"standardize()"},{"location":"user-guide/preprocess/#klpt.preprocess.Preprocess.unify_numerals","text":"Convert numerals to the desired one There are three types of numerals: - Arabic [\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669\u0660] - Farsi [\u06f1\u06f2\u06f3\u06f4\u06f5\u06f6\u06f7\u06f8\u06f9\u06f0] - Latin [1234567890] Parameters: Name Type Description Default text str a string required Returns: Type Description str text with unified numerals Source code in klpt/preprocess.py def unify_numerals ( self , text ): \"\"\" Convert numerals to the desired one There are three types of numerals: - Arabic [\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669\u0660] - Farsi [\u06f1\u06f2\u06f3\u06f4\u06f5\u06f6\u06f7\u06f8\u06f9\u06f0] - Latin [1234567890] Arguments: text (str): a string Returns: str: text with unified numerals \"\"\" for i , j in self . preprocess_map [ \"normalizer\" ][ \"universal\" ][ \"numerals\" ][ self . numeral ] . items (): text = text . replace ( i , j ) return text","title":"unify_numerals()"},{"location":"user-guide/stem/","text":"stem package The Stem module deals with various tasks, mainly through the following functions: - check_spelling : spell error detection - correct_spelling : spell error correction - analyze : morphological analysis Please note that only Sorani is currently supported in this module. The module is based on the Kurdish Hunspell project . 
Examples: >>> from klpt.stem import Stem >>> stemmer = Stem(\"Sorani\", \"Arabic\") >>> stemmer.check_spelling(\"\u0633\u0648\u062a\u0627\u0646\u062f\u0628\u0648\u0648\u062a\") False >>> stemmer.correct_spelling(\"\u0633\u0648\u062a\u0627\u0646\u062f\u0628\u0648\u0648\u062a\") (False, ['\u0633\u062a\u0627\u0646\u062f\u0628\u0648\u0648\u062a', '\u0633\u0648\u0648\u062a\u0627\u0646\u062f\u0628\u0648\u0648\u062a', '\u0633\u0648\u0648\u0695\u0627\u0646\u062f\u0628\u0648\u0648\u062a', '\u0695\u0648\u0648\u062a\u0627\u0646\u062f\u0628\u0648\u0648\u062a', '\u0641\u06d5\u0648\u062a\u0627\u0646\u062f\u0628\u0648\u0648\u062a', '\u0628\u0648\u0648\u0698\u0627\u0646\u062f\u0628\u0648\u0648\u062a']) >>> stemmer.analyze(\"\u062f\u06cc\u062a\u0628\u0627\u0645\u0646\") [{'pos': 'verb', 'description': 'past_stem_transitive_active', 'base': '\u062f\u06cc\u062a', 'terminal_suffix': '\u0628\u0627\u0645\u0646'}] analyze ( self , word_form ) Morphological analysis of a given word. More details regarding Kurdish morphological analysis can be found at https://github.com/sinaahmadi/KurdishHunspell . It returns morphological analyses. The morphological analysis is returned as a dictionary as follows: \"pos\": the part-of-speech of the word-form according to the Universal Dependency tag set . \"description\": is flag \"terminal_suffix\": anything except ts flag \"formation\": if ds flag is set, its value is assigned to description and the value of formation is set to derivational. Although the majority of our morphological rules cover inflectional forms, it is not accurate to say all of them are inflectional. Therefore, we only set this value to derivational wherever we are sure. \"base\": ts flag. The definition of terminal suffix is a bit tricky in Hunspell. According to the Hunspell documentation , \"Terminal suffix fields are inflectional suffix fields \"removed\" by additional (not terminal) suffixes\". 
In other words, the ts flag in Hunspell represents whatever is left after stripping all affixes. Therefore, it is the morphological base. If the input cannot be analyzed morphologically, an empty list is returned. Parameters: Name Type Description Default word_form str a single word-form required Exceptions: Type Description TypeError only string as input Returns: Type Description (list(dict)) a list of all possible morphological analyses according to the defined morphological rules Source code in klpt/stem.py def analyze ( self , word_form ): \"\"\" Morphological analysis of a given word. More details regarding Kurdish morphological analysis can be found at [https://github.com/sinaahmadi/KurdishHunspell](https://github.com/sinaahmadi/KurdishHunspell). It returns morphological analyses. The morphological analysis is returned as a dictionary as follows: - \"pos\": the part-of-speech of the word-form according to [the Universal Dependency tag set](https://universaldependencies.org/u/pos/index.html). - \"description\": is flag - \"terminal_suffix\": anything except ts flag - \"formation\": if ds flag is set, its value is assigned to description and the value of formation is set to derivational. Although the majority of our morphological rules cover inflectional forms, it is not accurate to say all of them are inflectional. Therefore, we only set this value to derivational wherever we are sure. - \"base\": `ts` flag. The definition of terminal suffix is a bit tricky in Hunspell. According to [the Hunspell documentation](http://manpages.ubuntu.com/manpages/trusty/en/man4/hunspell.4.html), \"Terminal suffix fields are inflectional suffix fields \"removed\" by additional (not terminal) suffixes\". In other words, the ts flag in Hunspell represents whatever is left after stripping all affixes. Therefore, it is the morphological base. If the input cannot be analyzed morphologically, an empty list is returned. 
Args: word_form (str): a single word-form Raises: TypeError: only string as input Returns: (list(dict)): a list of all possible morphological analyses according to the defined morphological rules \"\"\" if not isinstance ( word_form , str ): raise TypeError ( \"Only a word (str) is allowed.\" ) else : # Given the morphological analysis of a word-form with Hunspell flags, extract relevant information and return a dictionary word_analysis = list () for analysis in list ( self . huns . analyze ( word_form )): analysis_dict = dict () for item in analysis . split (): if \":\" not in item : continue if item . split ( \":\" )[ 1 ] == \"ts\" : # ts flag exceptionally appears after the value as value:key in the Hunspell output analysis_dict [ \"base\" ] = item . split ( \":\" )[ 0 ] # anything except the terminal_suffix is considered to be the base analysis_dict [ self . hunspell_flags [ item . split ( \":\" )[ 1 ]]] = word_form . replace ( item . split ( \":\" )[ 0 ], \"\" ) elif item . split ( \":\" )[ 0 ] in self . hunspell_flags . keys (): # assign the key:value pairs from the Hunspell string output to the dictionary output of the current function # for ds flag, add derivation as the formation type, otherwise inflection if item . split ( \":\" )[ 0 ] == \"ds\" : analysis_dict [ self . hunspell_flags [ item . split ( \":\" )[ 0 ]]] = \"derivational\" analysis_dict [ self . hunspell_flags [ \"is\" ]] = item . split ( \":\" )[ 1 ] else : analysis_dict [ self . hunspell_flags [ item . split ( \":\" )[ 0 ]]] = item . split ( \":\" )[ 1 ] # if there is no value assigned to the ts flag, the terminal suffix is a zero-morpheme 0 if self . hunspell_flags [ \"ts\" ] not in analysis_dict or analysis_dict [ self . hunspell_flags [ \"ts\" ]] == \"\" : analysis_dict [ self . hunspell_flags [ \"ts\" ]] = \"0\" word_analysis . 
append ( analysis_dict ) return word_analysis check_spelling ( self , word ) Check spelling of a word Parameters: Name Type Description Default word str input word to be spell-checked required Exceptions: Type Description TypeError only string as input Returns: Type Description bool True if the spelling is correct, False if the spelling is incorrect Source code in klpt/stem.py def check_spelling ( self , word ): \"\"\"Check spelling of a word Args: word (str): input word to be spell-checked Raises: TypeError: only string as input Returns: bool: True if the spelling is correct, False if the spelling is incorrect \"\"\" if not isinstance ( word , str ): raise TypeError ( \"Only a word (str) is allowed.\" ) else : return self . huns . spell ( word ) correct_spelling ( self , word ) Correct spelling errors if the input word is incorrect. It returns a tuple where the first element indicates the correctness of the word (True if correct, False if incorrect). If the input word is incorrect, suggestions are provided in a list as the second element of the tuple, as (False, []). If no suggestion is available, the list is returned empty as (True, []). Parameters: Name Type Description Default word str input word to be spell-checked required Exceptions: Type Description TypeError only string as input Returns: Type Description tuple (boolean, list) Source code in klpt/stem.py def correct_spelling ( self , word ): \"\"\" Correct spelling errors if the input word is incorrect. It returns a tuple where the first element indicates the correctness of the word (True if correct, False if incorrect). If the input word is incorrect, suggestions are provided in a list as the second element of the tuple, as (False, []). If no suggestion is available, the list is returned empty as (True, []). 
Args: word (str): input word to be spell-checked Raises: TypeError: only string as input Returns: tuple (boolean, list) \"\"\" if not isinstance ( word , str ): raise TypeError ( \"Only a word (str) is allowed.\" ) else : if self . check_spelling ( word ): return ( True , []) return ( False , list ( self . huns . suggest ( word )))","title":"Stem"},{"location":"user-guide/stem/#stem-package","text":"","title":"stem package"},{"location":"user-guide/stem/#klpt.stem.Stem","text":"The Stem module deals with various tasks, mainly through the following functions: - check_spelling : spell error detection - correct_spelling : spell error correction - analyze : morphological analysis Please note that only Sorani is currently supported in this module. The module is based on the Kurdish Hunspell project . Examples: >>> from klpt.stem import Stem >>> stemmer = Stem(\"Sorani\", \"Arabic\") >>> stemmer.check_spelling(\"\u0633\u0648\u062a\u0627\u0646\u062f\u0628\u0648\u0648\u062a\") False >>> stemmer.correct_spelling(\"\u0633\u0648\u062a\u0627\u0646\u062f\u0628\u0648\u0648\u062a\") (False, ['\u0633\u062a\u0627\u0646\u062f\u0628\u0648\u0648\u062a', '\u0633\u0648\u0648\u062a\u0627\u0646\u062f\u0628\u0648\u0648\u062a', '\u0633\u0648\u0648\u0695\u0627\u0646\u062f\u0628\u0648\u0648\u062a', '\u0695\u0648\u0648\u062a\u0627\u0646\u062f\u0628\u0648\u0648\u062a', '\u0641\u06d5\u0648\u062a\u0627\u0646\u062f\u0628\u0648\u0648\u062a', '\u0628\u0648\u0648\u0698\u0627\u0646\u062f\u0628\u0648\u0648\u062a']) >>> stemmer.analyze(\"\u062f\u06cc\u062a\u0628\u0627\u0645\u0646\") [{'pos': 'verb', 'description': 'past_stem_transitive_active', 'base': '\u062f\u06cc\u062a', 'terminal_suffix': '\u0628\u0627\u0645\u0646'}]","title":"klpt.stem.Stem"},{"location":"user-guide/stem/#klpt.stem.Stem.analyze","text":"Morphological analysis of a given word. More details regarding Kurdish morphological analysis can be found at https://github.com/sinaahmadi/KurdishHunspell . It returns morphological analyses. 
The morphological analysis is returned as a dictionary as follows: \"pos\": the part-of-speech of the word-form according to the Universal Dependency tag set . \"description\": is flag \"terminal_suffix\": anything except ts flag \"formation\": if ds flag is set, its value is assigned to description and the value of formation is set to derivational. Although the majority of our morphological rules cover inflectional forms, it is not accurate to say all of them are inflectional. Therefore, we only set this value to derivational wherever we are sure. \"base\": ts flag. The definition of terminal suffix is a bit tricky in Hunspell. According to the Hunspell documentation , \"Terminal suffix fields are inflectional suffix fields \"removed\" by additional (not terminal) suffixes\". In other words, the ts flag in Hunspell represents whatever is left after stripping all affixes. Therefore, it is the morphological base. If the input cannot be analyzed morphologically, an empty list is returned. Parameters: Name Type Description Default word_form str a single word-form required Exceptions: Type Description TypeError only string as input Returns: Type Description (list(dict)) a list of all possible morphological analyses according to the defined morphological rules Source code in klpt/stem.py def analyze ( self , word_form ): \"\"\" Morphological analysis of a given word. More details regarding Kurdish morphological analysis can be found at [https://github.com/sinaahmadi/KurdishHunspell](https://github.com/sinaahmadi/KurdishHunspell). It returns morphological analyses. The morphological analysis is returned as a dictionary as follows: - \"pos\": the part-of-speech of the word-form according to [the Universal Dependency tag set](https://universaldependencies.org/u/pos/index.html). - \"description\": is flag - \"terminal_suffix\": anything except ts flag - \"formation\": if ds flag is set, its value is assigned to description and the value of formation is set to derivational. 
Although the majority of our morphological rules cover inflectional forms, it is not accurate to say all of them are inflectional. Therefore, we only set this value to derivational wherever we are sure. - \"base\": `ts` flag. The definition of terminal suffix is a bit tricky in Hunspell. According to [the Hunspell documentation](http://manpages.ubuntu.com/manpages/trusty/en/man4/hunspell.4.html), \"Terminal suffix fields are inflectional suffix fields \"removed\" by additional (not terminal) suffixes\". In other words, the ts flag in Hunspell represents whatever is left after stripping all affixes. Therefore, it is the morphological base. If the input cannot be analyzed morphologically, an empty list is returned. Args: word_form (str): a single word-form Raises: TypeError: only string as input Returns: (list(dict)): a list of all possible morphological analyses according to the defined morphological rules \"\"\" if not isinstance ( word_form , str ): raise TypeError ( \"Only a word (str) is allowed.\" ) else : # Given the morphological analysis of a word-form with Hunspell flags, extract relevant information and return a dictionary word_analysis = list () for analysis in list ( self . huns . analyze ( word_form )): analysis_dict = dict () for item in analysis . split (): if \":\" not in item : continue if item . split ( \":\" )[ 1 ] == \"ts\" : # ts flag exceptionally appears after the value as value:key in the Hunspell output analysis_dict [ \"base\" ] = item . split ( \":\" )[ 0 ] # anything except the terminal_suffix is considered to be the base analysis_dict [ self . hunspell_flags [ item . split ( \":\" )[ 1 ]]] = word_form . replace ( item . split ( \":\" )[ 0 ], \"\" ) elif item . split ( \":\" )[ 0 ] in self . hunspell_flags . keys (): # assign the key:value pairs from the Hunspell string output to the dictionary output of the current function # for ds flag, add derivation as the formation type, otherwise inflection if item . 
split ( \":\" )[ 0 ] == \"ds\" : analysis_dict [ self . hunspell_flags [ item . split ( \":\" )[ 0 ]]] = \"derivational\" analysis_dict [ self . hunspell_flags [ \"is\" ]] = item . split ( \":\" )[ 1 ] else : analysis_dict [ self . hunspell_flags [ item . split ( \":\" )[ 0 ]]] = item . split ( \":\" )[ 1 ] # if there is no value assigned to the ts flag, the terminal suffix is a zero-morpheme 0 if self . hunspell_flags [ \"ts\" ] not in analysis_dict or analysis_dict [ self . hunspell_flags [ \"ts\" ]] == \"\" : analysis_dict [ self . hunspell_flags [ \"ts\" ]] = \"0\" word_analysis . append ( analysis_dict ) return word_analysis","title":"analyze()"},{"location":"user-guide/stem/#klpt.stem.Stem.check_spelling","text":"Check spelling of a word Parameters: Name Type Description Default word str input word to be spell-checked required Exceptions: Type Description TypeError only string as input Returns: Type Description bool True if the spelling is correct, False if the spelling is incorrect Source code in klpt/stem.py def check_spelling ( self , word ): \"\"\"Check spelling of a word Args: word (str): input word to be spell-checked Raises: TypeError: only string as input Returns: bool: True if the spelling is correct, False if the spelling is incorrect \"\"\" if not isinstance ( word , str ): raise TypeError ( \"Only a word (str) is allowed.\" ) else : return self . huns . spell ( word )","title":"check_spelling()"},{"location":"user-guide/stem/#klpt.stem.Stem.correct_spelling","text":"Correct spelling errors if the input word is incorrect. It returns a tuple where the first element indicates the correctness of the word (True if correct, False if incorrect). If the input word is incorrect, suggestions are provided in a list as the second element of the tuple, as (False, []). If no suggestion is available, the list is returned empty as (True, []). 
Parameters: Name Type Description Default word str input word to be spell-checked required Exceptions: Type Description TypeError only string as input Returns: Type Description tuple (boolean, list) Source code in klpt/stem.py def correct_spelling ( self , word ): \"\"\" Correct spelling errors if the input word is incorrect. It returns a tuple where the first element indicates the correctness of the word (True if correct, False if incorrect). If the input word is incorrect, suggestions are provided in a list as the second element of the tuple, as (False, []). If no suggestion is available, the list is returned empty as (True, []). Args: word (str): input word to be spell-checked Raises: TypeError: only string as input Returns: tuple (boolean, list) \"\"\" if not isinstance ( word , str ): raise TypeError ( \"Only a word (str) is allowed.\" ) else : if self . check_spelling ( word ): return ( True , []) return ( False , list ( self . huns . suggest ( word )))","title":"correct_spelling()"},{"location":"user-guide/tokenize/","text":"tokenize package This module focuses on the tokenization of both Kurmanji and Sorani dialects of Kurdish with the following functions: word_tokenize : tokenization of texts into tokens (both multi-word expressions and single-word tokens). mwe_tokenize : tokenization of texts by only taking compound forms into account sent_tokenize : tokenization of texts into sentences The module is based on the Kurdish tokenization project . Examples: >>> from klpt.tokenize import Tokenize >>> tokenizer = Tokenize(\"Kurmanji\", \"Latin\") >>> tokenizer.word_tokenize(\"ji bo fort\u00ea xwe av\u00eatin\") ['\u2581ji\u2581', 'bo', '\u2581\u2581fort\u00ea\u2012xwe\u2012av\u00eatin\u2581\u2581'] >>> tokenizer.mwe_tokenize(\"bi serok\u00ea huk\u00fbmeta her\u00eama Kurdistan\u00ea Prof. Salih re saz kir.\") 'bi serok\u00ea huk\u00fbmeta her\u00eama Kurdistan\u00ea Prof . Salih re saz kir .' 
>>> tokenizer_ckb = Tokenize(\"Sorani\", \"Arabic\") >>> tokenizer_ckb.word(\"\u0628\u06d5 \u0647\u06d5\u0645\u0648\u0648 \u0647\u06d5\u0645\u0648\u0648\u0627\u0646\u06d5\u0648\u06d5 \u0695\u06ce\u06a9 \u06a9\u06d5\u0648\u062a\u0646\") ['\u2581\u0628\u06d5\u2581', '\u2581\u0647\u06d5\u0645\u0648\u0648\u2581', '\u0647\u06d5\u0645\u0648\u0648\u0627\u0646\u06d5\u0648\u06d5', '\u2581\u2581\u0695\u06ce\u06a9\u2012\u06a9\u06d5\u0648\u062a\u0646\u2581\u2581'] mwe_tokenize ( self , sentence , separator = '\u2581\u2581' , in_separator = '\u2012' , punct_marked = False , keep_form = False ) Multi-word expression tokenization Parameters: Name Type Description Default sentence str sentence to be split by multi-word expressions required separator str a specific token to specify a multi-word expression. By default two \u2581 (\u2581\u2581) are used for this purpose. '\u2581\u2581' in_separator str a specific token to specify the composing parts of a multi-word expression. By default a dash - is used for this purpose. '\u2012' keep_form boolean if set to True, the original form of the multi-word expression is returned the same way provided in the input. On the other hand, if set to False, the lemma form is used where the parts are delimited by a dash \u2012, as in \"dab\u2012\u00fb\u2012ner\u00eet\" False Returns: Type Description str sentence containing d multi-word expressions using the separator Source code in klpt/tokenize.py def mwe_tokenize ( self , sentence , separator = \"\u2581\u2581\" , in_separator = \"\u2012\" , punct_marked = False , keep_form = False ): \"\"\" Multi-word expression tokenization Args: sentence (str): sentence to be split by multi-word expressions separator (str): a specific token to specify a multi-word expression. By default two \u2581 (\u2581\u2581) are used for this purpose. in_separator (str): a specific token to specify the composing parts of a multi-word expression. By default a dash - is used for this purpose. 
keep_form (boolean): if set to True, the original form of the multi-word expression is returned the same way provided in the input. On the other hand, if set to False, the lemma form is used where the parts are delimited by a dash \u2012, as in \"dab\u2012\u00fb\u2012ner\u00eet\" Returns: str: sentence containing d multi-word expressions using the separator \"\"\" sentence = \" \" + sentence + \" \" if not punct_marked : # find punctuation marks and add a space around for punct in self . tokenize_map [ \"word_tokenize\" ][ self . dialect ][ self . script ][ \"punctuation\" ]: if punct in sentence : sentence = sentence . replace ( punct , \" \" + punct + \" \" ) # look for compound words and delimit them by double the separator for compound_lemma in self . mwe_lexicon : compound_lemma_context = \" \" + compound_lemma + \" \" if compound_lemma_context in sentence : if keep_form : sentence = sentence . replace ( compound_lemma_context , \" \u2581\u2581\" + compound_lemma + \"\u2581\u2581 \" ) else : sentence = sentence . replace ( compound_lemma_context , \" \u2581\u2581\" + compound_lemma . replace ( \"-\" , in_separator ) + \"\u2581\u2581 \" ) # check the possible word forms available for each compound lemma in the lex files, too # Note: compound forms don't have any hyphen or separator in the lex files for compound_form in self . mwe_lexicon [ compound_lemma ][ \"token_forms\" ]: compound_form_context = \" \" + compound_form + \" \" if compound_form_context in sentence : if keep_form : sentence = sentence . replace ( compound_form_context , \" \u2581\u2581\" + compound_form + \"\u2581\u2581 \" ) else : sentence = sentence . replace ( compound_form_context , \" \u2581\u2581\" + compound_lemma . replace ( \"-\" , in_separator ) + \"\u2581\u2581 \" ) # print(sentence) return sentence . replace ( \" \" , \" \" ) . replace ( \"\u2581\u2581\" , separator ) . 
strip () sent_tokenize ( self , text ) Sentence tokenizer Parameters: Name Type Description Default text [str] [input text to be tokenized by sentences] required Returns: Type Description [list] [a list of sentences] Source code in klpt/tokenize.py def sent_tokenize ( self , text ): \"\"\"Sentence tokenizer Args: text ([str]): [input text to be tokenized by sentences] Returns: [list]: [a list of sentences] \"\"\" text = \" \" + text + \" \" text = text . replace ( \" \\n \" , \" \" ) text = re . sub ( self . prefixes , \" \\\\ 1\" , text ) text = re . sub ( self . websites , \" \\\\ 1\" , text ) text = re . sub ( \"\\s\" + self . alphabets + \"[.] \" , \" \\\\ 1 \" , text ) text = re . sub ( self . acronyms + \" \" + self . starters , \" \\\\ 1 \\\\ 2\" , text ) text = re . sub ( self . alphabets + \"[.]\" + self . alphabets + \"[.]\" + self . alphabets + \"[.]\" , \" \\\\ 1 \\\\ 2 \\\\ 3\" , text ) text = re . sub ( self . alphabets + \"[.]\" + self . alphabets + \"[.]\" , \" \\\\ 1 \\\\ 2\" , text ) text = re . sub ( \" \" + self . suffixes + \"[.] \" + self . starters , \" \\\\ 1 \\\\ 2\" , text ) text = re . sub ( \" \" + self . suffixes + \"[.]\" , \" \\\\ 1\" , text ) text = re . sub ( self . digits + \"[.]\" + self . digits , \" \\\\ 1 \\\\ 2\" , text ) # for punct in self.tokenize_map[self.dialect][self.script][\"compound_puncts\"]: # if punct in text: # text = text.replace(\".\" + punct, punct + \".\") for punct in self . tokenize_map [ \"sent_tokenize\" ][ self . dialect ][ self . script ][ \"punct_boundary\" ]: text = text . replace ( punct , punct + \"\" ) text = text . replace ( \"\" , \".\" ) sentences = text . split ( \"\" ) sentences = [ s . strip () for s in sentences if len ( s . 
strip ())] return sentences word_tokenize ( self , sentence , separator = '\u2581' , mwe_separator = '\u2581\u2581' , keep_form = False ) Word tokenizer Parameters: Name Type Description Default sentence str sentence or text to be tokenized required Returns: Type Description [list] [a list of words] Source code in klpt/tokenize.py def word_tokenize ( self , sentence , separator = \"\u2581\" , mwe_separator = \"\u2581\u2581\" , keep_form = False ): \"\"\"Word tokenizer Args: sentence (str): sentence or text to be tokenized Returns: [list]: [a list of words] \"\"\" # find multi-word expressions in the sentence sentence = self . mwe_tokenize ( sentence , keep_form = keep_form ) # find punctuation marks and add a space around for punct in self . tokenize_map [ \"word_tokenize\" ][ self . dialect ][ self . script ][ \"punctuation\" ]: if punct in sentence : sentence = sentence . replace ( punct , \" \" + punct + \" \" ) # print(sentence) tokens = list () # split the sentence by space and look for identifiable tokens for word in sentence . strip () . split (): if \"\u2581\u2581\" in word : # the word is previously detected as a compound word tokens . append ( word ) else : if word in self . lexicon : # check if the word exists in the lexicon tokens . append ( \"\u2581\" + word + \"\u2581\" ) else : # the word is neither a lemma nor a compound # morphological analysis by identifying affixes and clitics token_identified = False for preposition in self . morphemes [ \"prefixes\" ]: if word . startswith ( preposition ) and len ( word . split ( preposition , 1 )) > 1 : if word . split ( preposition , 1 )[ 1 ] in self . lexicon : word = \"\u2581\" . join ([ \"\" , self . morphemes [ \"prefixes\" ][ preposition ], word . split ( preposition , 1 )[ 1 ], \"\" ]) token_identified = True break elif self . mwe_tokenize ( word . split ( preposition , 1 )[ 1 ], keep_form = keep_form ) != word . split ( preposition , 1 )[ 1 ]: word = \"\u2581\" + self . 
morphemes [ \"prefixes\" ][ preposition ] + self . mwe_tokenize ( word . split ( preposition , 1 )[ 1 ], keep_form = keep_form ) token_identified = True break if not token_identified : for postposition in self . morphemes [ \"suffixes\" ]: if word . endswith ( postposition ) and len ( word . rpartition ( postposition )[ 0 ]): if word . rpartition ( postposition )[ 0 ] in self . lexicon : word = \"\u2581\" + word . rpartition ( postposition )[ 0 ] + \"\u2581\" + self . morphemes [ \"suffixes\" ][ postposition ] break elif self . mwe_tokenize ( word . rpartition ( postposition )[ 0 ], keep_form = keep_form ) != word . rpartition ( postposition )[ 0 ]: word = ( \"\u2581\" + self . mwe_tokenize ( word . rpartition ( postposition )[ 0 ], keep_form = keep_form ) + \"\u2581\" + self . morphemes [ \"suffixes\" ][ postposition ] + \"\u2581\" ) . replace ( \"\u2581\u2581\u2581\" , \"\u2581\u2581\" ) break tokens . append ( word ) # print(tokens) return \" \" . join ( tokens ) . replace ( \"\u2581\u2581\" , mwe_separator ) . replace ( \"\u2581\" , separator ) . split ()","title":"Tokenize"},{"location":"user-guide/tokenize/#tokenize-package","text":"","title":"tokenize package"},{"location":"user-guide/tokenize/#klpt.tokenize.Tokenize","text":"This module focuses on the tokenization of both Kurmanji and Sorani dialects of Kurdish with the following functions: word_tokenize : tokenization of texts into tokens (both multi-word expressions and single-word tokens). mwe_tokenize : tokenization of texts by only taking compound forms into account sent_tokenize : tokenization of texts into sentences The module is based on the Kurdish tokenization project . 
Examples: >>> from klpt.tokenize import Tokenize >>> tokenizer = Tokenize(\"Kurmanji\", \"Latin\") >>> tokenizer.word_tokenize(\"ji bo fort\u00ea xwe av\u00eatin\") ['\u2581ji\u2581', 'bo', '\u2581\u2581fort\u00ea\u2012xwe\u2012av\u00eatin\u2581\u2581'] >>> tokenizer.mwe_tokenize(\"bi serok\u00ea huk\u00fbmeta her\u00eama Kurdistan\u00ea Prof. Salih re saz kir.\") 'bi serok\u00ea huk\u00fbmeta her\u00eama Kurdistan\u00ea Prof . Salih re saz kir .' >>> tokenizer_ckb = Tokenize(\"Sorani\", \"Arabic\") >>> tokenizer_ckb.word(\"\u0628\u06d5 \u0647\u06d5\u0645\u0648\u0648 \u0647\u06d5\u0645\u0648\u0648\u0627\u0646\u06d5\u0648\u06d5 \u0695\u06ce\u06a9 \u06a9\u06d5\u0648\u062a\u0646\") ['\u2581\u0628\u06d5\u2581', '\u2581\u0647\u06d5\u0645\u0648\u0648\u2581', '\u0647\u06d5\u0645\u0648\u0648\u0627\u0646\u06d5\u0648\u06d5', '\u2581\u2581\u0695\u06ce\u06a9\u2012\u06a9\u06d5\u0648\u062a\u0646\u2581\u2581']","title":"klpt.tokenize.Tokenize"},{"location":"user-guide/tokenize/#klpt.tokenize.Tokenize.mwe_tokenize","text":"Multi-word expression tokenization Parameters: Name Type Description Default sentence str sentence to be split by multi-word expressions required separator str a specific token to specify a multi-word expression. By default two \u2581 (\u2581\u2581) are used for this purpose. '\u2581\u2581' in_separator str a specific token to specify the composing parts of a multi-word expression. By default a dash - is used for this purpose. '\u2012' keep_form boolean if set to True, the original form of the multi-word expression is returned the same way provided in the input. 
On the other hand, if set to False, the lemma form is used where the parts are delimited by a dash \u2012, as in \"dab\u2012\u00fb\u2012ner\u00eet\" False Returns: Type Description str sentence containing d multi-word expressions using the separator Source code in klpt/tokenize.py def mwe_tokenize ( self , sentence , separator = \"\u2581\u2581\" , in_separator = \"\u2012\" , punct_marked = False , keep_form = False ): \"\"\" Multi-word expression tokenization Args: sentence (str): sentence to be split by multi-word expressions separator (str): a specific token to specify a multi-word expression. By default two \u2581 (\u2581\u2581) are used for this purpose. in_separator (str): a specific token to specify the composing parts of a multi-word expression. By default a dash - is used for this purpose. keep_form (boolean): if set to True, the original form of the multi-word expression is returned the same way provided in the input. On the other hand, if set to False, the lemma form is used where the parts are delimited by a dash \u2012, as in \"dab\u2012\u00fb\u2012ner\u00eet\" Returns: str: sentence containing d multi-word expressions using the separator \"\"\" sentence = \" \" + sentence + \" \" if not punct_marked : # find punctuation marks and add a space around for punct in self . tokenize_map [ \"word_tokenize\" ][ self . dialect ][ self . script ][ \"punctuation\" ]: if punct in sentence : sentence = sentence . replace ( punct , \" \" + punct + \" \" ) # look for compound words and delimit them by double the separator for compound_lemma in self . mwe_lexicon : compound_lemma_context = \" \" + compound_lemma + \" \" if compound_lemma_context in sentence : if keep_form : sentence = sentence . replace ( compound_lemma_context , \" \u2581\u2581\" + compound_lemma + \"\u2581\u2581 \" ) else : sentence = sentence . replace ( compound_lemma_context , \" \u2581\u2581\" + compound_lemma . 
replace ( \"-\" , in_separator ) + \"\u2581\u2581 \" ) # check the possible word forms available for each compound lemma in the lex files, too # Note: compound forms don't have any hyphen or separator in the lex files for compound_form in self . mwe_lexicon [ compound_lemma ][ \"token_forms\" ]: compound_form_context = \" \" + compound_form + \" \" if compound_form_context in sentence : if keep_form : sentence = sentence . replace ( compound_form_context , \" \u2581\u2581\" + compound_form + \"\u2581\u2581 \" ) else : sentence = sentence . replace ( compound_form_context , \" \u2581\u2581\" + compound_lemma . replace ( \"-\" , in_separator ) + \"\u2581\u2581 \" ) # print(sentence) return sentence . replace ( \" \" , \" \" ) . replace ( \"\u2581\u2581\" , separator ) . strip ()","title":"mwe_tokenize()"},{"location":"user-guide/tokenize/#klpt.tokenize.Tokenize.sent_tokenize","text":"Sentence tokenizer Parameters: Name Type Description Default text [str] [input text to be tokenized by sentences] required Returns: Type Description [list] [a list of sentences] Source code in klpt/tokenize.py def sent_tokenize ( self , text ): \"\"\"Sentence tokenizer Args: text ([str]): [input text to be tokenized by sentences] Returns: [list]: [a list of sentences] \"\"\" text = \" \" + text + \" \" text = text . replace ( \" \\n \" , \" \" ) text = re . sub ( self . prefixes , \" \\\\ 1\" , text ) text = re . sub ( self . websites , \" \\\\ 1\" , text ) text = re . sub ( \"\\s\" + self . alphabets + \"[.] \" , \" \\\\ 1 \" , text ) text = re . sub ( self . acronyms + \" \" + self . starters , \" \\\\ 1 \\\\ 2\" , text ) text = re . sub ( self . alphabets + \"[.]\" + self . alphabets + \"[.]\" + self . alphabets + \"[.]\" , \" \\\\ 1 \\\\ 2 \\\\ 3\" , text ) text = re . sub ( self . alphabets + \"[.]\" + self . alphabets + \"[.]\" , \" \\\\ 1 \\\\ 2\" , text ) text = re . sub ( \" \" + self . suffixes + \"[.] \" + self . starters , \" \\\\ 1 \\\\ 2\" , text ) text = re . 
sub ( \" \" + self . suffixes + \"[.]\" , \" \\\\ 1\" , text ) text = re . sub ( self . digits + \"[.]\" + self . digits , \" \\\\ 1 \\\\ 2\" , text ) # for punct in self.tokenize_map[self.dialect][self.script][\"compound_puncts\"]: # if punct in text: # text = text.replace(\".\" + punct, punct + \".\") for punct in self . tokenize_map [ \"sent_tokenize\" ][ self . dialect ][ self . script ][ \"punct_boundary\" ]: text = text . replace ( punct , punct + \"\" ) text = text . replace ( \"\" , \".\" ) sentences = text . split ( \"\" ) sentences = [ s . strip () for s in sentences if len ( s . strip ())] return sentences","title":"sent_tokenize()"},{"location":"user-guide/tokenize/#klpt.tokenize.Tokenize.word_tokenize","text":"Word tokenizer Parameters: Name Type Description Default sentence str sentence or text to be tokenized required Returns: Type Description [list] [a list of words] Source code in klpt/tokenize.py def word_tokenize ( self , sentence , separator = \"\u2581\" , mwe_separator = \"\u2581\u2581\" , keep_form = False ): \"\"\"Word tokenizer Args: sentence (str): sentence or text to be tokenized Returns: [list]: [a list of words] \"\"\" # find multi-word expressions in the sentence sentence = self . mwe_tokenize ( sentence , keep_form = keep_form ) # find punctuation marks and add a space around for punct in self . tokenize_map [ \"word_tokenize\" ][ self . dialect ][ self . script ][ \"punctuation\" ]: if punct in sentence : sentence = sentence . replace ( punct , \" \" + punct + \" \" ) # print(sentence) tokens = list () # split the sentence by space and look for identifiable tokens for word in sentence . strip () . split (): if \"\u2581\u2581\" in word : # the word is previously detected as a compound word tokens . append ( word ) else : if word in self . lexicon : # check if the word exists in the lexicon tokens . 
append ( \"\u2581\" + word + \"\u2581\" ) else : # the word is neither a lemma nor a compound # morphological analysis by identifying affixes and clitics token_identified = False for preposition in self . morphemes [ \"prefixes\" ]: if word . startswith ( preposition ) and len ( word . split ( preposition , 1 )) > 1 : if word . split ( preposition , 1 )[ 1 ] in self . lexicon : word = \"\u2581\" . join ([ \"\" , self . morphemes [ \"prefixes\" ][ preposition ], word . split ( preposition , 1 )[ 1 ], \"\" ]) token_identified = True break elif self . mwe_tokenize ( word . split ( preposition , 1 )[ 1 ], keep_form = keep_form ) != word . split ( preposition , 1 )[ 1 ]: word = \"\u2581\" + self . morphemes [ \"prefixes\" ][ preposition ] + self . mwe_tokenize ( word . split ( preposition , 1 )[ 1 ], keep_form = keep_form ) token_identified = True break if not token_identified : for postposition in self . morphemes [ \"suffixes\" ]: if word . endswith ( postposition ) and len ( word . rpartition ( postposition )[ 0 ]): if word . rpartition ( postposition )[ 0 ] in self . lexicon : word = \"\u2581\" + word . rpartition ( postposition )[ 0 ] + \"\u2581\" + self . morphemes [ \"suffixes\" ][ postposition ] break elif self . mwe_tokenize ( word . rpartition ( postposition )[ 0 ], keep_form = keep_form ) != word . rpartition ( postposition )[ 0 ]: word = ( \"\u2581\" + self . mwe_tokenize ( word . rpartition ( postposition )[ 0 ], keep_form = keep_form ) + \"\u2581\" + self . morphemes [ \"suffixes\" ][ postposition ] + \"\u2581\" ) . replace ( \"\u2581\u2581\u2581\" , \"\u2581\u2581\" ) break tokens . append ( word ) # print(tokens) return \" \" . join ( tokens ) . replace ( \"\u2581\u2581\" , mwe_separator ) . replace ( \"\u2581\" , separator ) . split ()","title":"word_tokenize()"},{"location":"user-guide/transliterate/","text":"transliterate package This module aims at transliterating one script of Kurdish into another one. 
Currently, only the Latin-based and the Arabic-based scripts of Sorani and Kurmanji are supported. The main function in this module is transliterate() which also takes care of detecting the correct form of double-usage graphemes, namely \u0648 \u2194 w/u and \u06cc \u2194 \u00ee/y. In some specific occasions, it can also predict the placement of the missing i (also known as Bizroke/\u0628\u0632\u0631\u06c6\u06a9\u06d5 ). The module is based on the Kurdish transliteration project . Examples: >>> from klpt.transliterate import Transliterate >>> transliterate = Transliterate(\"Kurmanji\", \"Latin\", target_script=\"Arabic\") >>> transliterate.transliterate(\"rojhilata nav\u00een\") '\u0631\u06c6\u0698\u0647\u0644\u0627\u062a\u0627 \u0646\u0627\u06a4\u06cc\u0646' >>> transliterate_ckb = Transliterate(\"Sorani\", \"Arabic\", target_script=\"Latin\") >>> transliterate_ckb.transliterate(\"\u0644\u06d5 \u0648\u06b5\u0627\u062a\u06d5\u06a9\u0627\u0646\u06cc \u062f\u06cc\u06a9\u06d5\u062f\u0627\") 'le wi\u0142atekan\u00ee d\u00eekeda' __init__ ( self , dialect , script , target_script , unknown = '\ufffd' , numeral = 'Latin' ) special Initializing using a Configuration object To do: - \"\u0644\u06d5 \u0626\u06cc\u0633\u067e\u0627\u0646\u06cc\u0627 \u0698\u0646\u0627\u0646 \u0644\u06d5 \u062f\u0698\u06cc \u2018patriarkavirus\u2019 \u0695\u06ce\u067e\u06ce\u0648\u0627\u0646\u06cc\u0627\u0646 \u06a9\u0631\u062f\": \"le \u00eespanya jinan le dij\u00ee \u2018patriarkavirus\u2019 \u0159\u00eap\u00eawanyan kird\" - \"eger\u00e7\u00ee damezrandn\u00ee r\u00eakxrawe kurd\u00eeyekan her r\u00eap\u00eanedraw mab\u00fbnewe Inz\u00eebat.\": \"\u0626\u06d5\u06af\u06d5\u0631\u0686\u06cc \u062f\u0627\u0645\u06d5\u0632\u0631\u0627\u0646\u062f\u0646\u06cc \u0695\u06ce\u06a9\u062e\u0631\u0627\u0648\u06d5 \u06a9\u0648\u0631\u062f\u06cc\u06cc\u06d5\u06a9\u0627\u0646 \u0647\u06d5\u0631 \u0631\u06ce\u067e\u06ce\u0646\u06d5\u062f\u0631\u0627\u0648 
\u0645\u0627\u0628\u0648\u0648\u0646\u06d5\u0648\u06d5 \u0626\u0646\u0632\u06cc\u0628\u0627\u062a.\", Parameters: Name Type Description Default mode [type] [description] required unknown str [description]. Defaults to \"\ufffd\". '\ufffd' numeral str [description]. Defaults to \"Latin\". Modifiable only if the source script is in Arabic. Otherwise, the Default value will be Latin. 'Latin' Exceptions: Type Description ValueError [description] ValueError [description] Source code in klpt/transliterate.py def __init__ ( self , dialect , script , target_script , unknown = \"\ufffd\" , numeral = \"Latin\" ): \"\"\"Initializing using a Configuration object To do: - \"\u0644\u06d5 \u0626\u06cc\u0633\u067e\u0627\u0646\u06cc\u0627 \u0698\u0646\u0627\u0646 \u0644\u06d5 \u062f\u0698\u06cc \u2018patriarkavirus\u2019 \u0695\u06ce\u067e\u06ce\u0648\u0627\u0646\u06cc\u0627\u0646 \u06a9\u0631\u062f\": \"le \u00eespanya jinan le dij\u00ee \u2018patriarkavirus\u2019 \u0159\u00eap\u00eawanyan kird\" - \"eger\u00e7\u00ee damezrandn\u00ee r\u00eakxrawe kurd\u00eeyekan her r\u00eap\u00eanedraw mab\u00fbnewe Inz\u00eebat.\": \"\u0626\u06d5\u06af\u06d5\u0631\u0686\u06cc \u062f\u0627\u0645\u06d5\u0632\u0631\u0627\u0646\u062f\u0646\u06cc \u0695\u06ce\u06a9\u062e\u0631\u0627\u0648\u06d5 \u06a9\u0648\u0631\u062f\u06cc\u06cc\u06d5\u06a9\u0627\u0646 \u0647\u06d5\u0631 \u0631\u06ce\u067e\u06ce\u0646\u06d5\u062f\u0631\u0627\u0648 \u0645\u0627\u0628\u0648\u0648\u0646\u06d5\u0648\u06d5 \u0626\u0646\u0632\u06cc\u0628\u0627\u062a.\", Args: mode ([type]): [description] unknown (str, optional): [description]. Defaults to \"\ufffd\". numeral (str, optional): [description]. Defaults to \"Latin\". Modifiable only if the source script is in Arabic. Otherwise, the Default value will be Latin. Raises: ValueError: [description] ValueError: [description] \"\"\" # with open(\"data/default-options.json\") as f: # options = json.load(f) self . UNKNOWN = \"\ufffd\" with open ( klpt . 
get_data ( \"data/wergor.json\" )) as f : self . wergor_configurations = json . load ( f ) with open ( klpt . get_data ( \"data/preprocess_map.json\" )) as f : self . preprocess_map = json . load ( f )[ \"normalizer\" ] configuration = Configuration ({ \"dialect\" : dialect , \"script\" : script , \"numeral\" : numeral , \"target_script\" : target_script , \"unknown\" : unknown }) # self.preprocess_map = object.preprocess_map[\"normalizer\"] self . dialect = configuration . dialect self . script = configuration . script self . numeral = configuration . numeral self . mode = configuration . mode self . target_script = configuration . target_script self . user_UNKNOWN = configuration . user_UNKNOWN # self.mode = mode # if mode==\"arabic_to_latin\": # target_script = \"Latin\" # elif mode==\"latin_to_arabic\": # target_script = \"Arabic\" # else: # raise ValueError(f'Unknown transliteration option. Available options: {options[\"transliterator\"]}') # if len(unknown): # self.user_UNKNOWN = unknown # else: # raise ValueError(f'Unknown unknown tag. Select a non-empty token (e.g. .') self . characters_mapping = self . wergor_configurations [ \"characters_mapping\" ] self . digits_mapping = self . preprocess_map [ \"universal\" ][ \"numerals\" ][ self . target_script ] self . digits_mapping_all = list ( set ( list ( self . preprocess_map [ \"universal\" ][ \"numerals\" ][ self . target_script ] . keys ()) + list ( self . preprocess_map [ \"universal\" ][ \"numerals\" ][ self . target_script ] . values ()))) self . punctuation_mapping = self . wergor_configurations [ \"punctuation\" ][ self . target_script ] self . punctuation_mapping_all = list ( set ( list ( self . wergor_configurations [ \"punctuation\" ][ self . target_script ] . keys ()) + list ( self . wergor_configurations [ \"punctuation\" ][ self . target_script ] . values ()))) # self.tricky_characters = self.wergor_configurations[\"characters_mapping\"] self . wy_mappings = self . 
wergor_configurations [ \"wy_mappings\" ] self . hemze = self . wergor_configurations [ \"hemze\" ] self . bizroke = self . wergor_configurations [ \"bizroke\" ] self . uw_iy_forms = self . wergor_configurations [ \"uw_iy_forms\" ] self . target_char = self . wergor_configurations [ \"target_char\" ] self . arabic_vowels = self . wergor_configurations [ \"arabic_vowels\" ] self . arabic_cons = self . wergor_configurations [ \"arabic_cons\" ] self . latin_vowels = self . wergor_configurations [ \"latin_vowels\" ] self . latin_cons = self . wergor_configurations [ \"latin_cons\" ] self . characters_pack = { \"arabic_to_latin\" : self . characters_mapping . values (), \"latin_to_arabic\" : self . characters_mapping . keys ()} if self . target_script == \"Arabic\" : self . prep = Preprocess ( \"Sorani\" , \"Latin\" , numeral = self . numeral ) else : self . prep = Preprocess ( \"Sorani\" , \"Latin\" , numeral = \"Latin\" ) arabic_to_latin ( self , char ) Mapping Arabic-based characters to the Latin-based equivalents Source code in klpt/transliterate.py def arabic_to_latin ( self , char ): \"\"\"Mapping Arabic-based characters to the Latin-based equivalents\"\"\" if char != \"\" : if char in list ( self . characters_mapping . values ()): return list ( self . characters_mapping . keys ())[ list ( self . characters_mapping . values ()) . index ( char )] elif char in self . punctuation_mapping : return self . punctuation_mapping [ char ] elif char in self . punctuation_mapping : return self . punctuation_mapping [ char ] return char bizroke_finder ( self , word ) Detection of the \"i\" character in the Arabic-based script. Incomplete version. Source code in klpt/transliterate.py def bizroke_finder ( self , word ): \"\"\"Detection of the \"i\" character in the Arabic-based script. Incomplete version.\"\"\" word = list ( word ) if len ( word ) > 2 and word [ 0 ] in self . latin_cons and word [ 1 ] in self . latin_cons and word [ 1 ] != \"w\" and word [ 1 ] != \"y\" : word . 
insert ( 1 , \"i\" ) return \"\" . join ( word ) latin_to_arabic ( self , char ) Mapping Latin-based characters to the Arabic-based equivalents Source code in klpt/transliterate.py def latin_to_arabic ( self , char ): \"\"\"Mapping Latin-based characters to the Arabic-based equivalents\"\"\" # check if the character is in upper case mapped_char = \"\" if char . lower () != \"\" : if char . lower () in self . wy_mappings . keys (): mapped_char = self . wy_mappings [ char . lower ()] elif char . lower () in self . characters_mapping . keys (): mapped_char = self . characters_mapping [ char . lower ()] elif char . lower () in self . punctuation_mapping : mapped_char = self . punctuation_mapping [ char . lower ()] # elif char.lower() in self.digits_mapping.values(): # mapped_char = self.digits_mapping.keys()[self.digits_mapping.values().index(char.lower())] if len ( mapped_char ): if char . isupper (): return mapped_char . upper () return mapped_char else : return char preprocessor ( self , word ) Preprocessing by normalizing text encoding and removing embedding characters Source code in klpt/transliterate.py def preprocessor ( self , word ): \"\"\"Preprocessing by normalizing text encoding and removing embedding characters\"\"\" # replace this by the normalization part word = list ( word . replace ( ' \\u202b ' , \"\" ) . replace ( ' \\u202c ' , \"\" ) . replace ( ' \\u202a ' , \"\" ) . replace ( u \"\u0648\u0648\" , \"\u00fb\" ) . replace ( \" \\u200c \" , \"\" ) . replace ( \"\u0640\" , \"\" )) # for char_index in range(len(word)): # if(word[char_index] in self.tricky_characters.keys()): # word[char_index] = self.tricky_characters[word[char_index]] return \"\" . join ( word ) syllable_detector ( self , word ) Detection of the syllable based on the given pattern. May be used for transcription applications. Source code in klpt/transliterate.py def syllable_detector ( self , word ): \"\"\"Detection of the syllable based on the given pattern. 
May be used for transcription applications.\"\"\" syllable_templates = [ \"V\" , \"VC\" , \"VCC\" , \"CV\" , \"CVC\" , \"CVCCC\" ] CV_converted_list = \"\" for char in word : if char in self . latin_vowels : CV_converted_list += \"V\" else : CV_converted_list += \"C\" syllables = list () for i in range ( 1 , len ( CV_converted_list )): syllable_templates_permutated = [ p for p in itertools . product ( syllable_templates , repeat = i )] for syl in syllable_templates_permutated : if len ( \"\" . join ( syl )) == len ( CV_converted_list ): if CV_converted_list == \"\" . join ( syl ) and \"VV\" not in \"\" . join ( syl ): syllables . append ( syl ) return syllables to_pieces ( self , token ) Given a token, find other segments composed of numbers and punctuation marks not seperated by space \u2581 Source code in klpt/transliterate.py def to_pieces ( self , token ): \"\"\"Given a token, find other segments composed of numbers and punctuation marks not seperated by space \u2581\"\"\" tokens_dict = dict () flag = False # True if a token is a \\w i = 0 for char_index in range ( len ( token )): if token [ char_index ] in self . digits_mapping_all or token [ char_index ] in self . punctuation_mapping_all : tokens_dict [ char_index ] = token [ char_index ] flag = False i = 0 elif token [ char_index ] in self . characters_pack [ self . mode ] or \\ token [ char_index ] in self . target_char or \\ token [ char_index ] == self . hemze or token [ char_index ] . lower () == self . bizroke : if flag : tokens_dict [ char_index - i ] = tokens_dict [ char_index - i ] + token [ char_index ] else : tokens_dict [ char_index ] = token [ char_index ] flag = True i += 1 else : tokens_dict [ char_index ] = self . 
UNKNOWN return tokens_dict transliterate ( self , text ) The main method of the class: - find word boundaries by splitting it using spaces and then retrieve words mixed with other characters (without space) - map characters - detect double-usage characters w/u and y/\u00ee - find possible position of Bizroke (to be completed - 2017) Notice: text format should not be changed at all (no lower case, no style replacement , etc.). If the source and the target scripts are identical, the input text should be returned without any further processing. Source code in klpt/transliterate.py def transliterate ( self , text ): \"\"\"The main method of the class: - find word boundaries by splitting it using spaces and then retrieve words mixed with other characters (without space) - map characters - detect double-usage characters w/u and y/\u00ee - find possible position of Bizroke (to be completed - 2017) Notice: text format should not be changed at all (no lower case, no style replacement \\t, \\n etc.). If the source and the target scripts are identical, the input text should be returned without any further processing. \"\"\" text = self . prep . unify_numerals ( text ) . split ( \" \\n \" ) transliterated_text = list () for line in text : transliterated_line = list () for token in line . split (): trans_token = \"\" # try: token = self . preprocessor ( token ) # This is not correct as the capital letter should be kept the way it is given. tokens_dict = self . to_pieces ( token ) # Transliterate words for token_key in tokens_dict : if len ( tokens_dict [ token_key ]): word = tokens_dict [ token_key ] if self . mode == \"arabic_to_latin\" : # w/y detection based on the priority in \"word\" for char in word : if char in self . target_char : word = self . uw_iy_Detector ( word , char ) if word [ 0 ] == self . hemze and word [ 1 ] in self . arabic_vowels : word = word [ 1 :] word = list ( word ) for char_index in range ( len ( word )): word [ char_index ] = self . 
arabic_to_latin ( word [ char_index ]) word = \"\" . join ( word ) word = self . bizroke_finder ( word ) elif self . mode == \"latin_to_arabic\" : if len ( word ): word = list ( word ) for char_index in range ( len ( word )): word [ char_index ] = self . latin_to_arabic ( word [ char_index ]) if word [ 0 ] in self . arabic_vowels or word [ 0 ] . lower () == self . bizroke : word . insert ( 0 , self . hemze ) word = \"\" . join ( word ) . replace ( \"\u00fb\" , \"\u0648\u0648\" ) . replace ( self . bizroke . lower (), \"\" ) . replace ( self . bizroke . upper (), \"\" ) # else: # return self.UNKNOWN trans_token = trans_token + word transliterated_line . append ( trans_token ) transliterated_text . append ( \" \" . join ( transliterated_line ) . replace ( u \" w \" , u \" \u00fb \" )) # standardize the output # replace UNKOWN by the user's choice if self . user_UNKNOWN != self . UNKNOWN : return \" \\n \" . join ( transliterated_text ) . replace ( self . UNKNOWN , self . user_UNKNOWN ) else : return \" \\n \" . join ( transliterated_text ) uw_iy_Detector ( self , word , target_char ) Detection of \"\u0648\" and \"\u06cc\" in the Arabic-based script Source code in klpt/transliterate.py def uw_iy_Detector ( self , word , target_char ): \"\"\"Detection of \"\u0648\" and \"\u06cc\" in the Arabic-based script\"\"\" word = list ( word ) if target_char == \"\u0648\" : dic_index = 1 else : dic_index = 0 if word == target_char : word = self . uw_iy_forms [ \"target_char_cons\" ][ dic_index ] else : for index in range ( len ( word )): if word [ index ] == self . hemze and word [ index + 1 ] == target_char : word [ index + 1 ] = self . uw_iy_forms [ \"target_char_vowel\" ][ dic_index ] index += 1 else : if word [ index ] == target_char : if index == 0 : word [ index ] = self . uw_iy_forms [ \"target_char_cons\" ][ dic_index ] else : if word [ index - 1 ] in self . arabic_vowels : word [ index ] = self . 
uw_iy_forms [ \"target_char_cons\" ][ dic_index ] else : if index + 1 < len ( word ): if word [ index + 1 ] in self . arabic_vowels : word [ index ] = self . uw_iy_forms [ \"target_char_cons\" ][ dic_index ] else : word [ index ] = self . uw_iy_forms [ \"target_char_vowel\" ][ dic_index ] else : word [ index ] = self . uw_iy_forms [ \"target_char_vowel\" ][ dic_index ] word = \"\" . join ( word ) . replace ( self . hemze + self . uw_iy_forms [ \"target_char_vowel\" ][ dic_index ], self . uw_iy_forms [ \"target_char_vowel\" ][ dic_index ]) return word","title":"Transliterate"},{"location":"user-guide/transliterate/#transliterate-package","text":"","title":"transliterate package"},{"location":"user-guide/transliterate/#klpt.transliterate.Transliterate","text":"This module aims at transliterating one script of Kurdish into another one. Currently, only the Latin-based and the Arabic-based scripts of Sorani and Kurmanji are supported. The main function in this module is transliterate() which also takes care of detecting the correct form of double-usage graphemes, namely \u0648 \u2194 w/u and \u06cc \u2194 \u00ee/y. In some specific occasions, it can also predict the placement of the missing i (also known as Bizroke/\u0628\u0632\u0631\u06c6\u06a9\u06d5 ). The module is based on the Kurdish transliteration project . 
Examples: >>> from klpt.transliterate import Transliterate >>> transliterate = Transliterate(\"Kurmanji\", \"Latin\", target_script=\"Arabic\") >>> transliterate.transliterate(\"rojhilata nav\u00een\") '\u0631\u06c6\u0698\u0647\u0644\u0627\u062a\u0627 \u0646\u0627\u06a4\u06cc\u0646' >>> transliterate_ckb = Transliterate(\"Sorani\", \"Arabic\", target_script=\"Latin\") >>> transliterate_ckb.transliterate(\"\u0644\u06d5 \u0648\u06b5\u0627\u062a\u06d5\u06a9\u0627\u0646\u06cc \u062f\u06cc\u06a9\u06d5\u062f\u0627\") 'le wi\u0142atekan\u00ee d\u00eekeda'","title":"klpt.transliterate.Transliterate"},{"location":"user-guide/transliterate/#klpt.transliterate.Transliterate.__init__","text":"Initializing using a Configuration object To do: - \"\u0644\u06d5 \u0626\u06cc\u0633\u067e\u0627\u0646\u06cc\u0627 \u0698\u0646\u0627\u0646 \u0644\u06d5 \u062f\u0698\u06cc \u2018patriarkavirus\u2019 \u0695\u06ce\u067e\u06ce\u0648\u0627\u0646\u06cc\u0627\u0646 \u06a9\u0631\u062f\": \"le \u00eespanya jinan le dij\u00ee \u2018patriarkavirus\u2019 \u0159\u00eap\u00eawanyan kird\" - \"eger\u00e7\u00ee damezrandn\u00ee r\u00eakxrawe kurd\u00eeyekan her r\u00eap\u00eanedraw mab\u00fbnewe Inz\u00eebat.\": \"\u0626\u06d5\u06af\u06d5\u0631\u0686\u06cc \u062f\u0627\u0645\u06d5\u0632\u0631\u0627\u0646\u062f\u0646\u06cc \u0695\u06ce\u06a9\u062e\u0631\u0627\u0648\u06d5 \u06a9\u0648\u0631\u062f\u06cc\u06cc\u06d5\u06a9\u0627\u0646 \u0647\u06d5\u0631 \u0631\u06ce\u067e\u06ce\u0646\u06d5\u062f\u0631\u0627\u0648 \u0645\u0627\u0628\u0648\u0648\u0646\u06d5\u0648\u06d5 \u0626\u0646\u0632\u06cc\u0628\u0627\u062a.\", Parameters: Name Type Description Default mode [type] [description] required unknown str [description]. Defaults to \"\ufffd\". '\ufffd' numeral str [description]. Defaults to \"Latin\". Modifiable only if the source script is in Arabic. Otherwise, the Default value will be Latin. 
'Latin' Exceptions: Type Description ValueError [description] ValueError [description] Source code in klpt/transliterate.py def __init__ ( self , dialect , script , target_script , unknown = \"\ufffd\" , numeral = \"Latin\" ): \"\"\"Initializing using a Configuration object To do: - \"\u0644\u06d5 \u0626\u06cc\u0633\u067e\u0627\u0646\u06cc\u0627 \u0698\u0646\u0627\u0646 \u0644\u06d5 \u062f\u0698\u06cc \u2018patriarkavirus\u2019 \u0695\u06ce\u067e\u06ce\u0648\u0627\u0646\u06cc\u0627\u0646 \u06a9\u0631\u062f\": \"le \u00eespanya jinan le dij\u00ee \u2018patriarkavirus\u2019 \u0159\u00eap\u00eawanyan kird\" - \"eger\u00e7\u00ee damezrandn\u00ee r\u00eakxrawe kurd\u00eeyekan her r\u00eap\u00eanedraw mab\u00fbnewe Inz\u00eebat.\": \"\u0626\u06d5\u06af\u06d5\u0631\u0686\u06cc \u062f\u0627\u0645\u06d5\u0632\u0631\u0627\u0646\u062f\u0646\u06cc \u0695\u06ce\u06a9\u062e\u0631\u0627\u0648\u06d5 \u06a9\u0648\u0631\u062f\u06cc\u06cc\u06d5\u06a9\u0627\u0646 \u0647\u06d5\u0631 \u0631\u06ce\u067e\u06ce\u0646\u06d5\u062f\u0631\u0627\u0648 \u0645\u0627\u0628\u0648\u0648\u0646\u06d5\u0648\u06d5 \u0626\u0646\u0632\u06cc\u0628\u0627\u062a.\", Args: mode ([type]): [description] unknown (str, optional): [description]. Defaults to \"\ufffd\". numeral (str, optional): [description]. Defaults to \"Latin\". Modifiable only if the source script is in Arabic. Otherwise, the Default value will be Latin. Raises: ValueError: [description] ValueError: [description] \"\"\" # with open(\"data/default-options.json\") as f: # options = json.load(f) self . UNKNOWN = \"\ufffd\" with open ( klpt . get_data ( \"data/wergor.json\" )) as f : self . wergor_configurations = json . load ( f ) with open ( klpt . get_data ( \"data/preprocess_map.json\" )) as f : self . preprocess_map = json . 
load ( f )[ \"normalizer\" ] configuration = Configuration ({ \"dialect\" : dialect , \"script\" : script , \"numeral\" : numeral , \"target_script\" : target_script , \"unknown\" : unknown }) # self.preprocess_map = object.preprocess_map[\"normalizer\"] self . dialect = configuration . dialect self . script = configuration . script self . numeral = configuration . numeral self . mode = configuration . mode self . target_script = configuration . target_script self . user_UNKNOWN = configuration . user_UNKNOWN # self.mode = mode # if mode==\"arabic_to_latin\": # target_script = \"Latin\" # elif mode==\"latin_to_arabic\": # target_script = \"Arabic\" # else: # raise ValueError(f'Unknown transliteration option. Available options: {options[\"transliterator\"]}') # if len(unknown): # self.user_UNKNOWN = unknown # else: # raise ValueError(f'Unknown unknown tag. Select a non-empty token (e.g. .') self . characters_mapping = self . wergor_configurations [ \"characters_mapping\" ] self . digits_mapping = self . preprocess_map [ \"universal\" ][ \"numerals\" ][ self . target_script ] self . digits_mapping_all = list ( set ( list ( self . preprocess_map [ \"universal\" ][ \"numerals\" ][ self . target_script ] . keys ()) + list ( self . preprocess_map [ \"universal\" ][ \"numerals\" ][ self . target_script ] . values ()))) self . punctuation_mapping = self . wergor_configurations [ \"punctuation\" ][ self . target_script ] self . punctuation_mapping_all = list ( set ( list ( self . wergor_configurations [ \"punctuation\" ][ self . target_script ] . keys ()) + list ( self . wergor_configurations [ \"punctuation\" ][ self . target_script ] . values ()))) # self.tricky_characters = self.wergor_configurations[\"characters_mapping\"] self . wy_mappings = self . wergor_configurations [ \"wy_mappings\" ] self . hemze = self . wergor_configurations [ \"hemze\" ] self . bizroke = self . wergor_configurations [ \"bizroke\" ] self . uw_iy_forms = self . 
wergor_configurations [ \"uw_iy_forms\" ] self . target_char = self . wergor_configurations [ \"target_char\" ] self . arabic_vowels = self . wergor_configurations [ \"arabic_vowels\" ] self . arabic_cons = self . wergor_configurations [ \"arabic_cons\" ] self . latin_vowels = self . wergor_configurations [ \"latin_vowels\" ] self . latin_cons = self . wergor_configurations [ \"latin_cons\" ] self . characters_pack = { \"arabic_to_latin\" : self . characters_mapping . values (), \"latin_to_arabic\" : self . characters_mapping . keys ()} if self . target_script == \"Arabic\" : self . prep = Preprocess ( \"Sorani\" , \"Latin\" , numeral = self . numeral ) else : self . prep = Preprocess ( \"Sorani\" , \"Latin\" , numeral = \"Latin\" )","title":"__init__()"},{"location":"user-guide/transliterate/#klpt.transliterate.Transliterate.arabic_to_latin","text":"Mapping Arabic-based characters to the Latin-based equivalents Source code in klpt/transliterate.py def arabic_to_latin ( self , char ): \"\"\"Mapping Arabic-based characters to the Latin-based equivalents\"\"\" if char != \"\" : if char in list ( self . characters_mapping . values ()): return list ( self . characters_mapping . keys ())[ list ( self . characters_mapping . values ()) . index ( char )] elif char in self . punctuation_mapping : return self . punctuation_mapping [ char ] elif char in self . punctuation_mapping : return self . punctuation_mapping [ char ] return char","title":"arabic_to_latin()"},{"location":"user-guide/transliterate/#klpt.transliterate.Transliterate.bizroke_finder","text":"Detection of the \"i\" character in the Arabic-based script. Incomplete version. Source code in klpt/transliterate.py def bizroke_finder ( self , word ): \"\"\"Detection of the \"i\" character in the Arabic-based script. Incomplete version.\"\"\" word = list ( word ) if len ( word ) > 2 and word [ 0 ] in self . latin_cons and word [ 1 ] in self . latin_cons and word [ 1 ] != \"w\" and word [ 1 ] != \"y\" : word . 
insert ( 1 , \"i\" ) return \"\" . join ( word )","title":"bizroke_finder()"},{"location":"user-guide/transliterate/#klpt.transliterate.Transliterate.latin_to_arabic","text":"Mapping Latin-based characters to the Arabic-based equivalents Source code in klpt/transliterate.py def latin_to_arabic ( self , char ): \"\"\"Mapping Latin-based characters to the Arabic-based equivalents\"\"\" # check if the character is in upper case mapped_char = \"\" if char . lower () != \"\" : if char . lower () in self . wy_mappings . keys (): mapped_char = self . wy_mappings [ char . lower ()] elif char . lower () in self . characters_mapping . keys (): mapped_char = self . characters_mapping [ char . lower ()] elif char . lower () in self . punctuation_mapping : mapped_char = self . punctuation_mapping [ char . lower ()] # elif char.lower() in self.digits_mapping.values(): # mapped_char = self.digits_mapping.keys()[self.digits_mapping.values().index(char.lower())] if len ( mapped_char ): if char . isupper (): return mapped_char . upper () return mapped_char else : return char","title":"latin_to_arabic()"},{"location":"user-guide/transliterate/#klpt.transliterate.Transliterate.preprocessor","text":"Preprocessing by normalizing text encoding and removing embedding characters Source code in klpt/transliterate.py def preprocessor ( self , word ): \"\"\"Preprocessing by normalizing text encoding and removing embedding characters\"\"\" # replace this by the normalization part word = list ( word . replace ( ' \\u202b ' , \"\" ) . replace ( ' \\u202c ' , \"\" ) . replace ( ' \\u202a ' , \"\" ) . replace ( u \"\u0648\u0648\" , \"\u00fb\" ) . replace ( \" \\u200c \" , \"\" ) . replace ( \"\u0640\" , \"\" )) # for char_index in range(len(word)): # if(word[char_index] in self.tricky_characters.keys()): # word[char_index] = self.tricky_characters[word[char_index]] return \"\" . 
join ( word )","title":"preprocessor()"},{"location":"user-guide/transliterate/#klpt.transliterate.Transliterate.syllable_detector","text":"Detection of the syllable based on the given pattern. May be used for transcription applications. Source code in klpt/transliterate.py def syllable_detector ( self , word ): \"\"\"Detection of the syllable based on the given pattern. May be used for transcription applications.\"\"\" syllable_templates = [ \"V\" , \"VC\" , \"VCC\" , \"CV\" , \"CVC\" , \"CVCCC\" ] CV_converted_list = \"\" for char in word : if char in self . latin_vowels : CV_converted_list += \"V\" else : CV_converted_list += \"C\" syllables = list () for i in range ( 1 , len ( CV_converted_list )): syllable_templates_permutated = [ p for p in itertools . product ( syllable_templates , repeat = i )] for syl in syllable_templates_permutated : if len ( \"\" . join ( syl )) == len ( CV_converted_list ): if CV_converted_list == \"\" . join ( syl ) and \"VV\" not in \"\" . join ( syl ): syllables . append ( syl ) return syllables","title":"syllable_detector()"},{"location":"user-guide/transliterate/#klpt.transliterate.Transliterate.to_pieces","text":"Given a token, find other segments composed of numbers and punctuation marks not seperated by space \u2581 Source code in klpt/transliterate.py def to_pieces ( self , token ): \"\"\"Given a token, find other segments composed of numbers and punctuation marks not seperated by space \u2581\"\"\" tokens_dict = dict () flag = False # True if a token is a \\w i = 0 for char_index in range ( len ( token )): if token [ char_index ] in self . digits_mapping_all or token [ char_index ] in self . punctuation_mapping_all : tokens_dict [ char_index ] = token [ char_index ] flag = False i = 0 elif token [ char_index ] in self . characters_pack [ self . mode ] or \\ token [ char_index ] in self . target_char or \\ token [ char_index ] == self . hemze or token [ char_index ] . lower () == self . 
bizroke : if flag : tokens_dict [ char_index - i ] = tokens_dict [ char_index - i ] + token [ char_index ] else : tokens_dict [ char_index ] = token [ char_index ] flag = True i += 1 else : tokens_dict [ char_index ] = self . UNKNOWN return tokens_dict","title":"to_pieces()"},{"location":"user-guide/transliterate/#klpt.transliterate.Transliterate.transliterate","text":"The main method of the class: - find word boundaries by splitting it using spaces and then retrieve words mixed with other characters (without space) - map characters - detect double-usage characters w/u and y/\u00ee - find possible position of Bizroke (to be completed - 2017) Notice: text format should not be changed at all (no lower case, no style replacement , etc.). If the source and the target scripts are identical, the input text should be returned without any further processing. Source code in klpt/transliterate.py def transliterate ( self , text ): \"\"\"The main method of the class: - find word boundaries by splitting it using spaces and then retrieve words mixed with other characters (without space) - map characters - detect double-usage characters w/u and y/\u00ee - find possible position of Bizroke (to be completed - 2017) Notice: text format should not be changed at all (no lower case, no style replacement \\t, \\n etc.). If the source and the target scripts are identical, the input text should be returned without any further processing. \"\"\" text = self . prep . unify_numerals ( text ) . split ( \" \\n \" ) transliterated_text = list () for line in text : transliterated_line = list () for token in line . split (): trans_token = \"\" # try: token = self . preprocessor ( token ) # This is not correct as the capital letter should be kept the way it is given. tokens_dict = self . to_pieces ( token ) # Transliterate words for token_key in tokens_dict : if len ( tokens_dict [ token_key ]): word = tokens_dict [ token_key ] if self . 
mode == \"arabic_to_latin\" : # w/y detection based on the priority in \"word\" for char in word : if char in self . target_char : word = self . uw_iy_Detector ( word , char ) if word [ 0 ] == self . hemze and word [ 1 ] in self . arabic_vowels : word = word [ 1 :] word = list ( word ) for char_index in range ( len ( word )): word [ char_index ] = self . arabic_to_latin ( word [ char_index ]) word = \"\" . join ( word ) word = self . bizroke_finder ( word ) elif self . mode == \"latin_to_arabic\" : if len ( word ): word = list ( word ) for char_index in range ( len ( word )): word [ char_index ] = self . latin_to_arabic ( word [ char_index ]) if word [ 0 ] in self . arabic_vowels or word [ 0 ] . lower () == self . bizroke : word . insert ( 0 , self . hemze ) word = \"\" . join ( word ) . replace ( \"\u00fb\" , \"\u0648\u0648\" ) . replace ( self . bizroke . lower (), \"\" ) . replace ( self . bizroke . upper (), \"\" ) # else: # return self.UNKNOWN trans_token = trans_token + word transliterated_line . append ( trans_token ) transliterated_text . append ( \" \" . join ( transliterated_line ) . replace ( u \" w \" , u \" \u00fb \" )) # standardize the output # replace UNKOWN by the user's choice if self . user_UNKNOWN != self . UNKNOWN : return \" \\n \" . join ( transliterated_text ) . replace ( self . UNKNOWN , self . user_UNKNOWN ) else : return \" \\n \" . join ( transliterated_text )","title":"transliterate()"},{"location":"user-guide/transliterate/#klpt.transliterate.Transliterate.uw_iy_Detector","text":"Detection of \"\u0648\" and \"\u06cc\" in the Arabic-based script Source code in klpt/transliterate.py def uw_iy_Detector ( self , word , target_char ): \"\"\"Detection of \"\u0648\" and \"\u06cc\" in the Arabic-based script\"\"\" word = list ( word ) if target_char == \"\u0648\" : dic_index = 1 else : dic_index = 0 if word == target_char : word = self . 
uw_iy_forms [ \"target_char_cons\" ][ dic_index ] else : for index in range ( len ( word )): if word [ index ] == self . hemze and word [ index + 1 ] == target_char : word [ index + 1 ] = self . uw_iy_forms [ \"target_char_vowel\" ][ dic_index ] index += 1 else : if word [ index ] == target_char : if index == 0 : word [ index ] = self . uw_iy_forms [ \"target_char_cons\" ][ dic_index ] else : if word [ index - 1 ] in self . arabic_vowels : word [ index ] = self . uw_iy_forms [ \"target_char_cons\" ][ dic_index ] else : if index + 1 < len ( word ): if word [ index + 1 ] in self . arabic_vowels : word [ index ] = self . uw_iy_forms [ \"target_char_cons\" ][ dic_index ] else : word [ index ] = self . uw_iy_forms [ \"target_char_vowel\" ][ dic_index ] else : word [ index ] = self . uw_iy_forms [ \"target_char_vowel\" ][ dic_index ] word = \"\" . join ( word ) . replace ( self . hemze + self . uw_iy_forms [ \"target_char_vowel\" ][ dic_index ], self . uw_iy_forms [ \"target_char_vowel\" ][ dic_index ]) return word","title":"uw_iy_Detector()"}]}
\ No newline at end of file
diff --git a/site/search/worker.js b/site/search/worker.js
new file mode 100644
index 0000000..9cce2f7
--- /dev/null
+++ b/site/search/worker.js
@@ -0,0 +1,130 @@
+var base_path = 'function' === typeof importScripts ? '.' : '/search/';
+var allowSearch = false;
+var index;
+var documents = {};
+var lang = ['en'];
+var data;
+
+function getScript(script, callback) {
+ console.log('Loading script: ' + script);
+ $.getScript(base_path + script).done(function () {
+ callback();
+ }).fail(function (jqxhr, settings, exception) {
+ console.log('Error: ' + exception);
+ });
+}
+
+function getScriptsInOrder(scripts, callback) {
+ if (scripts.length === 0) {
+ callback();
+ return;
+ }
+ getScript(scripts[0], function() {
+ getScriptsInOrder(scripts.slice(1), callback);
+ });
+}
+
+function loadScripts(urls, callback) {
+ if( 'function' === typeof importScripts ) {
+ importScripts.apply(null, urls);
+ callback();
+ } else {
+ getScriptsInOrder(urls, callback);
+ }
+}
+
+function onJSONLoaded () {
+ data = JSON.parse(this.responseText);
+ var scriptsToLoad = ['lunr.js'];
+ if (data.config && data.config.lang && data.config.lang.length) {
+ lang = data.config.lang;
+ }
+ if (lang.length > 1 || lang[0] !== "en") {
+ scriptsToLoad.push('lunr.stemmer.support.js');
+ if (lang.length > 1) {
+ scriptsToLoad.push('lunr.multi.js');
+ }
+ for (var i=0; i < lang.length; i++) {
+ if (lang[i] != 'en') {
+ scriptsToLoad.push(['lunr', lang[i], 'js'].join('.'));
+ }
+ }
+ }
+ loadScripts(scriptsToLoad, onScriptsLoaded);
+}
+
+function onScriptsLoaded () {
+ console.log('All search scripts loaded, building Lunr index...');
+ if (data.config && data.config.separator && data.config.separator.length) {
+ lunr.tokenizer.separator = new RegExp(data.config.separator);
+ }
+
+ if (data.index) {
+ index = lunr.Index.load(data.index);
+ data.docs.forEach(function (doc) {
+ documents[doc.location] = doc;
+ });
+ console.log('Lunr pre-built index loaded, search ready');
+ } else {
+ index = lunr(function () {
+ if (lang.length === 1 && lang[0] !== "en" && lunr[lang[0]]) {
+ this.use(lunr[lang[0]]);
+ } else if (lang.length > 1) {
+ this.use(lunr.multiLanguage.apply(null, lang)); // spread operator not supported in all browsers: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Spread_operator#Browser_compatibility
+ }
+ this.field('title');
+ this.field('text');
+ this.ref('location');
+
+ for (var i=0; i < data.docs.length; i++) {
+ var doc = data.docs[i];
+ this.add(doc);
+ documents[doc.location] = doc;
+ }
+ });
+ console.log('Lunr index built, search ready');
+ }
+ allowSearch = true;
+ postMessage({config: data.config});
+ postMessage({allowSearch: allowSearch});
+}
+
+function init () {
+ var oReq = new XMLHttpRequest();
+ oReq.addEventListener("load", onJSONLoaded);
+ var index_path = base_path + '/search_index.json';
+ if( 'function' === typeof importScripts ){
+ index_path = 'search_index.json';
+ }
+ oReq.open("GET", index_path);
+ oReq.send();
+}
+
+function search (query) {
+ if (!allowSearch) {
+ console.error('Assets for search still loading');
+ return;
+ }
+
+ var resultDocuments = [];
+ var results = index.search(query);
+ for (var i=0; i < results.length; i++){
+ var result = results[i];
+ doc = documents[result.ref];
+ doc.summary = doc.text.substring(0, 200);
+ resultDocuments.push(doc);
+ }
+ return resultDocuments;
+}
+
+if( 'function' === typeof importScripts ) {
+ onmessage = function (e) {
+ if (e.data.init) {
+ init();
+ } else if (e.data.query) {
+ postMessage({ results: search(e.data.query) });
+ } else {
+ console.error("Worker - Unrecognized message: " + e);
+ }
+ };
+}
diff --git a/site/sitemap.xml b/site/sitemap.xml
new file mode 100644
index 0000000..ff6d723
--- /dev/null
+++ b/site/sitemap.xml
@@ -0,0 +1,43 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"><url>
+     <loc>https://sinaahmadi.github.io/klpt/</loc>
+     <lastmod>2020-11-12</lastmod>
+     <changefreq>daily</changefreq>
+    </url><url>
+     <loc>https://sinaahmadi.github.io/klpt/user-guide/getting-started/</loc>
+     <lastmod>2020-11-12</lastmod>
+     <changefreq>daily</changefreq>
+    </url><url>
+     <loc>https://sinaahmadi.github.io/klpt/user-guide/preprocess/</loc>
+     <lastmod>2020-11-12</lastmod>
+     <changefreq>daily</changefreq>
+    </url><url>
+     <loc>https://sinaahmadi.github.io/klpt/user-guide/tokenize/</loc>
+     <lastmod>2020-11-12</lastmod>
+     <changefreq>daily</changefreq>
+    </url><url>
+     <loc>https://sinaahmadi.github.io/klpt/user-guide/transliterate/</loc>
+     <lastmod>2020-11-12</lastmod>
+     <changefreq>daily</changefreq>
+    </url><url>
+     <loc>https://sinaahmadi.github.io/klpt/user-guide/stem/</loc>
+     <lastmod>2020-11-12</lastmod>
+     <changefreq>daily</changefreq>
+    </url><url>
+     <loc>https://sinaahmadi.github.io/klpt/about/release-notes/</loc>
+     <lastmod>2020-11-12</lastmod>
+     <changefreq>daily</changefreq>
+    </url><url>
+     <loc>https://sinaahmadi.github.io/klpt/about/contributing/</loc>
+     <lastmod>2020-11-12</lastmod>
+     <changefreq>daily</changefreq>
+    </url><url>
+     <loc>https://sinaahmadi.github.io/klpt/about/sponsors/</loc>
+     <lastmod>2020-11-12</lastmod>
+     <changefreq>daily</changefreq>
+    </url><url>
+     <loc>https://sinaahmadi.github.io/klpt/about/license/</loc>
+     <lastmod>2020-11-12</lastmod>
+     <changefreq>daily</changefreq>
+    </url>
+</urlset>
\ No newline at end of file
diff --git a/site/sitemap.xml.gz b/site/sitemap.xml.gz
new file mode 100644
index 0000000..b73be2f
Binary files /dev/null and b/site/sitemap.xml.gz differ
diff --git a/site/user-guide/getting-started/index.html b/site/user-guide/getting-started/index.html
new file mode 100644
index 0000000..4399c04
--- /dev/null
+++ b/site/user-guide/getting-started/index.html
@@ -0,0 +1,335 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Getting started - KLPT
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
KLPT is implemented in Python and requires basic knowledge of programming, particularly in the Python language. Find out more about Python at https://www.python.org/.
+
Requirements
+
+
Operating system: macOS / OS X · Linux · Windows (Cygwin, MinGW, Visual Studio)
KLPT releases are available via pip as source packages and binary wheels. Please make sure that a compatible Python version is installed:
+
pip install klpt
+
+
+
All the data files including lexicons and morphological rules are also installed with the package.
+
Although KLPT does not depend on any other NLP toolkit, it has one important requirement, particularly for the stem module: cyhunspell, which should be installed with version >= 2.0.1.
+
Import klpt
+
Once the package is installed, you can import the toolkit as follows:
+
import klpt
+
+
+
As a principle, the following parameters are widely used in the toolkit:
+
+
dialect: the name of the dialect, i.e. Sorani or Kurmanji (ISO 639-3 codes will also be added)
+
script: the script of your input text as "Arabic" or "Latin"
+
numeral: the type of the numerals as
+
Arabic [١٢٣٤٥٦٧٨٩٠]
+
Farsi [۱۲۳۴۵۶۷۸۹۰]
+
Latin [1234567890]
+
+
+
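To show how these three parameters fit together, here is a small sanity-check helper mirroring the accepted values listed above. It is a hypothetical sketch for illustration, not part of KLPT (the toolkit performs its own validation internally):

```python
# Hypothetical validator for the three common KLPT parameters (not KLPT code).
VALID_OPTIONS = {
    "dialect": {"Sorani", "Kurmanji"},
    "script": {"Arabic", "Latin"},
    "numeral": {"Arabic", "Farsi", "Latin"},
}

def check_options(dialect, script, numeral="Latin"):
    """Raise ValueError on an unsupported value; otherwise return the options."""
    for name, value in (("dialect", dialect), ("script", script), ("numeral", numeral)):
        if value not in VALID_OPTIONS[name]:
            raise ValueError(f"unsupported {name}: {value!r}")
    return {"dialect": dialect, "script": script, "numeral": numeral}

print(check_options("Sorani", "Arabic"))
# → {'dialect': 'Sorani', 'script': 'Arabic', 'numeral': 'Latin'}
```

Passing any other value, e.g. an unsupported script name, raises a ValueError, which is roughly how invalid parameter combinations surface in the toolkit as well.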
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Search
+
+
+
+ From here you can search these documents. Enter
+ your search terms below.
+
This module deals with normalizing scripts and orthographies by using writing conventions based on dialects and scripts. The goal is not to correct the orthography but to normalize the text in terms of the encoding and common writing rules. The input encoding should be in UTF-8 only. To this end, three functions are provided as follows:
+
+
normalize: deals with different encodings and unifies characters based on dialects and scripts
+
standardize: given normalized text, returns standardized text based on the Kurdish orthographies, following the recommendations for Kurmanji and Sorani
+
unify_numerals: conversion of the various types of numerals used in Kurdish texts
+
+
It is recommended that the output of this module be used as the input of subsequent tasks in an NLP pipeline.
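To make the numeral conversion concrete, here is a minimal self-contained sketch of what unify_numerals does for a Latin target. This is an illustration only; KLPT reads its actual mappings from the preprocess_map.json data file:

```python
# Digit tables for the three numeral systems used in Kurdish texts.
ARABIC_DIGITS = "٠١٢٣٤٥٦٧٨٩"
FARSI_DIGITS = "۰۱۲۳۴۵۶۷۸۹"
LATIN_DIGITS = "0123456789"

def unify_numerals_sketch(text):
    """Map Arabic-Indic and Farsi digits onto their Latin equivalents."""
    table = {ord(a): l for a, l in zip(ARABIC_DIGITS, LATIN_DIGITS)}
    table.update({ord(f): l for f, l in zip(FARSI_DIGITS, LATIN_DIGITS)})
    return text.translate(table)

print(unify_numerals_sketch("ساڵی ٢٠٢٠"))  # → "ساڵی 2020"
```

Non-digit characters pass through untouched, so the sketch is safe to run on mixed-script text.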
def __init__(self, dialect, script, numeral="Latin"):
+    """
+    Initialization of the Preprocess class
+
+    Arguments:
+        dialect (str): the name of the dialect or its ISO 639-3 code
+        script (str): the name of the script
+        numeral (str): the type of the numeral
+
+    """
+    with open(klpt.get_data("data/preprocess_map.json")) as preprocess_file:
+        self.preprocess_map = json.load(preprocess_file)
+
+    configuration = Configuration({"dialect": dialect, "script": script, "numeral": numeral})
+    self.dialect = configuration.dialect
+    self.script = configuration.script
+    self.numeral = configuration.numeral
+
+
+
+
+
+
+
+
+normalize(self, text)
+
+
+
Text normalization
+
This function deals with different encodings and unifies characters based on dialects and scripts as follows:
+
+
+
Sorani-Arabic:
+
+
replace frequent Arabic characters with their equivalent Kurdish ones, e.g. "ي" by "ی" and "ك" by "ک"
+
replace "ه" followed by zero-width non-joiner (ZWNJ, U+200C) with "ە" where ZWNJ is removed ("رهزبهر" is converted to "رەزبەر"). ZWNJ in HTML is also taken into account.
+
replace "هـ" with "ھ" (U+06BE, ARABIC LETTER HEH DOACHASHMEE)
+
remove Kashida "ـ"
+
"ھ" in the middle of a word is replaced by ه (U+0647)
+
replace different types of y, such as 'ARABIC LETTER ALEF MAKSURA' (U+0649)
+
+
+
+
It should be noted that the order of the replacements is important. Check out provided files for further details and test cases.
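The point about ordering can be illustrated with a small self-contained sketch. The rules below are an illustrative subset only (the real, complete rule set lives in the preprocess_map.json data file); applying them in sequence shows why order matters, e.g. the heh+ZWNJ replacement must run before any later rule that would discard the ZWNJ:

```python
import re

# Illustrative subset of the Sorani-Arabic normalization rules described above.
RULES = [
    ("\u064a", "\u06cc"),        # Arabic yeh -> Kurdish yeh (ي -> ی)
    ("\u0643", "\u06a9"),        # Arabic kaf -> Kurdish kaf (ك -> ک)
    ("\u0647\u200c", "\u06d5"),  # heh followed by ZWNJ -> ە, ZWNJ removed
    ("\u0640", ""),              # remove Kashida (ـ)
]

def normalize_sketch(text):
    # Replacements are applied in list order, as in the description above.
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text

print(normalize_sketch("كتيب"))  # → "کتیب"
```

With these rules, the documented example "رهزبهر" normalizes to "رەزبەر" as well.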
+
Parameters:
+
+
+
+
Name
+
Type
+
Description
+
Default
+
+
+
+
+
text
+
str
+
+
a string
+
+
required
+
+
+
+
Returns:
+
+
+
+
Type
+
Description
+
+
+
+
+
str
+
+
normalized text
+
+
+
+
+
Source code in `klpt/preprocess.py`:

```python
def normalize(self, text):
    """
    Text normalization

    This function deals with different encodings and unifies characters based on dialects and scripts as follows:

    - Sorani-Arabic:

        - replace frequent Arabic characters with their equivalent Kurdish ones, e.g. "ي" by "ی" and "ك" by "ک"
        - replace "ه" followed by zero-width non-joiner (ZWNJ, U+200C) with "ە" where ZWNJ is removed ("رهزبهر" is converted to "رەزبەر"). ZWNJ in HTML is also taken into account.
        - replace "هـ" with "ھ" (U+06BE, ARABIC LETTER HEH DOACHASHMEE)
        - remove Kashida "ـ"
        - "ھ" in the middle of a word is replaced by ه (U+0647)
        - replace different types of y, such as 'ARABIC LETTER ALEF MAKSURA' (U+0649)

    It should be noted that the order of the replacements is important. Check out provided files for further details and test cases.

    Arguments:
        text (str): a string

    Returns:
        str: normalized text

    """
    temp_text = " " + self.unify_numerals(text) + " "

    for normalization_type in ["universal", self.dialect]:
        for rep in self.preprocess_map["normalizer"][normalization_type][self.script]:
            rep_tar = self.preprocess_map["normalizer"][normalization_type][self.script][rep]
            temp_text = re.sub(rf"{rep}", rf"{rep_tar}", temp_text, flags=re.I)

    return temp_text.strip()
```
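The map-driven replacement loop above can be illustrated stand-alone. The following is a minimal sketch of the same pattern, using a hypothetical three-entry replacement map in place of the shipped `data/preprocess_map.json`:

```python
import re

# Hypothetical miniature replacement map; klpt's real map in
# data/preprocess_map.json is much larger and dialect/script-specific.
NORMALIZER_MAP = {
    "ي": "ی",        # ARABIC LETTER YEH -> FARSI YEH
    "ك": "ک",        # ARABIC LETTER KAF -> KEHEH
    "ه\u200c": "ە",  # HEH followed by ZWNJ -> AE, ZWNJ removed
}

def normalize(text):
    """Apply each source->target replacement in order (order matters)."""
    temp_text = " " + text + " "
    for rep, rep_tar in NORMALIZER_MAP.items():
        temp_text = re.sub(rep, rep_tar, temp_text)
    return temp_text.strip()
```

Because the rules are applied in sequence, an entry added earlier in the map can feed later entries, which is why the documentation stresses that the order of replacements is important.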
### preprocess(self, text)

One single function for normalization, standardization and unification of numerals

**Parameters:**

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `text` | str | a string | required |

**Returns:**

| Type | Description |
|------|-------------|
| str | preprocessed text |
Source code in `klpt/preprocess.py`:

```python
def preprocess(self, text):
    """
    One single function for normalization, standardization and unification of numerals

    Arguments:
        text (str): a string

    Returns:
        str: preprocessed text
    """
    return self.unify_numerals(self.standardize(self.normalize(text)))
```
### standardize(self, text)

Method of standardization of Kurdish orthographies

Given a normalized text, it returns standardized text based on the Kurdish orthographies.

- Sorani-Arabic:
    - replace the alveolar flap ر (/ɾ/) at the beginning of a word by the alveolar trill ڕ (/r/)
    - replace double rr and ll with ř and ł respectively
- Kurmanji-Latin:
    - replace "-an" or "'an" in dates and numerals ("di sala 2018'an" and "di sala 2018-an" -> "di sala 2018an")

Open issues:

- replace " وە " by " و "? But this is not always possible, as in "min bo we" (ریزگـرتنا من بو وە نە ئە وە ئــە ز)
- "pirtükê": "pirtûkê"?
- Should [ı (LATIN SMALL LETTER DOTLESS I)](https://www.compart.com/en/unicode/U+0131) be replaced by i?

**Parameters:**

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `text` | str | a string | required |

**Returns:**

| Type | Description |
|------|-------------|
| str | standardized text |
Source code in `klpt/preprocess.py`:

```python
def standardize(self, text):
    """
    Method of standardization of Kurdish orthographies

    Given a normalized text, it returns standardized text based on the Kurdish orthographies.

    - Sorani-Arabic:
        - replace the alveolar flap ر (/ɾ/) at the beginning of a word by the alveolar trill ڕ (/r/)
        - replace double rr and ll with ř and ł respectively

    - Kurmanji-Latin:
        - replace "-an" or "'an" in dates and numerals ("di sala 2018'an" and "di sala 2018-an" -> "di sala 2018an")

    Open issues:
        - replace " وە " by " و "? But this is not always possible, "min bo we" (ریزگـرتنا من بو وە نە ئە وە ئــە ز)
        - "pirtükê": "pirtûkê"?
        - Should [ı (LATIN SMALL LETTER DOTLESS I)](https://www.compart.com/en/unicode/U+0131) be replaced by i?

    Arguments:
        text (str): a string

    Returns:
        str: standardized text

    """
    temp_text = " " + self.unify_numerals(text) + " "

    for standardization_type in [self.dialect]:
        for rep in self.preprocess_map["standardizer"][standardization_type][self.script]:
            rep_tar = self.preprocess_map["standardizer"][standardization_type][self.script][rep]
            temp_text = re.sub(rf"{rep}", rf"{rep_tar}", temp_text, flags=re.I)

    return temp_text.strip()
```
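The doubled-consonant rule for Sorani in the Latin script can be sketched on its own. This is a toy two-rule subset, not the actual map shipped with KLPT:

```python
import re

# Hypothetical subset of the Sorani-Latin standardization rules:
# doubled rr and ll are written with the single letters ř and ł.
STANDARDIZER_MAP = {
    "rr": "ř",
    "ll": "ł",
}

def standardize(text):
    """Apply each orthographic standardization rule in order."""
    temp_text = " " + text + " "
    for rep, rep_tar in STANDARDIZER_MAP.items():
        temp_text = re.sub(rep, rep_tar, temp_text)
    return temp_text.strip()
```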
### unify_numerals(self, text)

Convert numerals to the desired numeral system

There are three types of numerals:

- Arabic [١٢٣٤٥٦٧٨٩٠]
- Farsi [۱۲۳۴۵۶۷۸۹۰]
- Latin [1234567890]

**Parameters:**

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `text` | str | a string | required |

**Returns:**

| Type | Description |
|------|-------------|
| str | text with unified numerals |
Source code in `klpt/preprocess.py`:

```python
def unify_numerals(self, text):
    """
    Convert numerals to the desired one

    There are three types of numerals:
    - Arabic [١٢٣٤٥٦٧٨٩٠]
    - Farsi [۱۲۳۴۵۶۷۸۹۰]
    - Latin [1234567890]

    Arguments:
        text (str): a string

    Returns:
        str: text with unified numerals

    """
    for i, j in self.preprocess_map["normalizer"]["universal"]["numerals"][self.numeral].items():
        text = text.replace(i, j)
    return text
```
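Numeral unification is a plain character-for-character substitution over the three index-aligned digit sets listed above. A stand-alone sketch, independent of KLPT's JSON map:

```python
# The three numeral systems, index-aligned so that e.g. Arabic "١"
# corresponds to Farsi "۱" and Latin "1".
NUMERALS = {
    "Arabic": "١٢٣٤٥٦٧٨٩٠",
    "Farsi": "۱۲۳۴۵۶۷۸۹۰",
    "Latin": "1234567890",
}

def unify_numerals(text, target="Latin"):
    """Convert digits from any known numeral system to the target system."""
    for source, digits in NUMERALS.items():
        if source != target:
            text = text.translate(str.maketrans(digits, NUMERALS[target]))
    return text
```

`str.translate` is a natural fit here because each digit maps to exactly one target digit, so no regular expressions are needed.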
The Stem module deals with various tasks, mainly through the following functions:

- `check_spelling`: spell error detection
- `correct_spelling`: spell error correction
- `analyze`: morphological analysis

Please note that only Sorani is supported by this module in the current version. The module is based on the [Kurdish Hunspell project](https://github.com/sinaahmadi/KurdishHunspell).
"formation": if ds flag is set, its value is assigned to description and the value of formation is set to derivational. Although the majority of our morphological rules cover inflectional forms, it is not accurate to say all of them are inflectional. Therefore, we only set this value to derivational wherever we are sure.
+
"base": ts flag. The definition of terminal suffix is a bit tricky in Hunspell. According to the Hunspell documentation, "Terminal suffix fields are inflectional suffix fields "removed" by additional (not terminal) suffixes". In other words, the ts flag in Hunspell represents whatever is left after stripping all affixes. Therefore, it is the morphological base.
+
+
If the input cannot be analyzed morphologically, an empty list is returned.
+
**Parameters:**

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `word_form` | str | a single word-form | required |

**Exceptions:**

| Type | Description |
|------|-------------|
| TypeError | only string as input |

**Returns:**

| Type | Description |
|------|-------------|
| list(dict) | a list of all possible morphological analyses according to the defined morphological rules |
Source code in `klpt/stem.py`:

```python
def analyze(self, word_form):
    """
    Morphological analysis of a given word. More details regarding Kurdish morphological analysis can be found at [https://github.com/sinaahmadi/KurdishHunspell](https://github.com/sinaahmadi/KurdishHunspell).

    It returns morphological analyses. The morphological analysis is returned as a dictionary as follows:

    - "pos": the part-of-speech of the word-form according to [the Universal Dependency tag set](https://universaldependencies.org/u/pos/index.html).
    - "description": is flag
    - "terminal_suffix": anything except ts flag
    - "formation": if ds flag is set, its value is assigned to description and the value of formation is set to derivational. Although the majority of our morphological rules cover inflectional forms, it is not accurate to say all of them are inflectional. Therefore, we only set this value to derivational wherever we are sure.
    - "base": `ts` flag. The definition of terminal suffix is a bit tricky in Hunspell. According to [the Hunspell documentation](http://manpages.ubuntu.com/manpages/trusty/en/man4/hunspell.4.html), "Terminal suffix fields are inflectional suffix fields "removed" by additional (not terminal) suffixes". In other words, the ts flag in Hunspell represents whatever is left after stripping all affixes. Therefore, it is the morphological base.

    If the input cannot be analyzed morphologically, an empty list is returned.

    Args:
        word_form (str): a single word-form

    Raises:
        TypeError: only string as input

    Returns:
        (list(dict)): a list of all possible morphological analyses according to the defined morphological rules

    """
    if not isinstance(word_form, str):
        raise TypeError("Only a word (str) is allowed.")
    else:
        # Given the morphological analysis of a word-form with Hunspell flags,
        # extract relevant information and return a dictionary
        word_analysis = list()
        for analysis in list(self.huns.analyze(word_form)):
            analysis_dict = dict()
            for item in analysis.split():
                if ":" not in item:
                    continue
                if item.split(":")[1] == "ts":
                    # ts flag exceptionally appears after the value as value:key in the Hunspell output
                    analysis_dict["base"] = item.split(":")[0]
                    # anything except the terminal_suffix is considered to be the base
                    analysis_dict[self.hunspell_flags[item.split(":")[1]]] = word_form.replace(item.split(":")[0], "")
                elif item.split(":")[0] in self.hunspell_flags.keys():
                    # assign the key:value pairs from the Hunspell string output to the dictionary output of the current function
                    # for ds flag, add derivation as the formation type, otherwise inflection
                    if item.split(":")[0] == "ds":
                        analysis_dict[self.hunspell_flags[item.split(":")[0]]] = "derivational"
                        analysis_dict[self.hunspell_flags["is"]] = item.split(":")[1]
                    else:
                        analysis_dict[self.hunspell_flags[item.split(":")[0]]] = item.split(":")[1]

            # if there is no value assigned to the ts flag, the terminal suffix is a zero-morpheme 0
            if self.hunspell_flags["ts"] not in analysis_dict or analysis_dict[self.hunspell_flags["ts"]] == "":
                analysis_dict[self.hunspell_flags["ts"]] = "0"

            word_analysis.append(analysis_dict)

    return word_analysis
```
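The flag-string handling above can be illustrated without Hunspell installed. The following sketch mirrors the parsing logic; the flag-name mapping approximates the `hunspell_flags` attribute used in the source, and the sample analysis string and word are invented for illustration:

```python
# Approximate flag-name mapping in the spirit of klpt/stem.py;
# the real attribute may contain more entries.
HUNSPELL_FLAGS = {"po": "pos", "is": "description", "ds": "formation", "ts": "terminal_suffix"}

def parse_analysis(word_form, analysis):
    """Turn a Hunspell-style flag string into an analysis dictionary."""
    analysis_dict = {}
    for item in analysis.split():
        if ":" not in item:
            continue
        key, _, value = item.partition(":")
        if value == "ts":
            # ts exceptionally appears as value:ts in the Hunspell output
            analysis_dict["base"] = key
            # whatever remains after removing the base is the terminal suffix
            analysis_dict[HUNSPELL_FLAGS["ts"]] = word_form.replace(key, "")
        elif key in HUNSPELL_FLAGS:
            analysis_dict[HUNSPELL_FLAGS[key]] = value
    # an empty terminal suffix is represented as the zero-morpheme "0"
    if not analysis_dict.get(HUNSPELL_FLAGS["ts"]):
        analysis_dict[HUNSPELL_FLAGS["ts"]] = "0"
    return analysis_dict
```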
### check_spelling(self, word)
Check spelling of a word
**Parameters:**

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `word` | str | input word to be spell-checked | required |

**Exceptions:**

| Type | Description |
|------|-------------|
| TypeError | only string as input |

**Returns:**

| Type | Description |
|------|-------------|
| bool | True if the spelling is correct, False if the spelling is incorrect |
Source code in `klpt/stem.py`:

```python
def check_spelling(self, word):
    """Check spelling of a word

    Args:
        word (str): input word to be spell-checked

    Raises:
        TypeError: only string as input

    Returns:
        bool: True if the spelling is correct, False if the spelling is incorrect
    """
    if not isinstance(word, str):
        raise TypeError("Only a word (str) is allowed.")
    else:
        return self.huns.spell(word)
```
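The method delegates the actual lookup to Hunspell's `spell()`. A minimal sketch of the same interface, with a toy three-word lexicon standing in for the Hunspell dictionary:

```python
# Toy lexicon standing in for the Kurdish Hunspell dictionary.
LEXICON = {"ماڵ", "کتێب", "خۆشەویست"}

def check_spelling(word):
    """Return True if the word is in the lexicon, False otherwise."""
    if not isinstance(word, str):
        raise TypeError("Only a word (str) is allowed.")
    return word in LEXICON
```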
### correct_spelling(self, word)

Correct spelling errors if the input word is incorrect. It returns a tuple where the first element indicates the correctness of the word (True if correct, False if incorrect). If the input word is incorrect, suggestions are provided in a list as the second element of the tuple; if no suggestion is available, the list is returned empty, as in (False, []).
**Parameters:**

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `word` | str | input word to be spell-checked | required |

**Exceptions:**

| Type | Description |
|------|-------------|
| TypeError | only string as input |

**Returns:**

| Type | Description |
|------|-------------|
| tuple (bool, list) | the correctness flag and a list of suggestions |
Source code in `klpt/stem.py`:

```python
def correct_spelling(self, word):
    """
    Correct spelling errors if the input word is incorrect. It returns a tuple where the first element indicates the correctness of the word (True if correct, False if incorrect).
    If the input word is incorrect, suggestions are provided in a list as the second element of the tuple;
    if no suggestion is available, the list is returned empty, as in (False, []).

    Args:
        word (str): input word to be spell-checked

    Raises:
        TypeError: only string as input

    Returns:
        tuple (bool, list)

    """
    if not isinstance(word, str):
        raise TypeError("Only a word (str) is allowed.")
    else:
        if self.check_spelling(word):
            return (True, [])
        return (False, list(self.huns.suggest(word)))
```
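The same contract can be sketched without Hunspell: here `difflib.get_close_matches` from the standard library stands in for Hunspell's `suggest()`, over a toy lexicon:

```python
import difflib

# Toy lexicon standing in for the Kurdish Hunspell dictionary.
LEXICON = {"ماڵ", "ماڵەکان", "کتێب"}

def correct_spelling(word):
    """Return (True, []) for a correct word, else (False, suggestions)."""
    if not isinstance(word, str):
        raise TypeError("Only a word (str) is allowed.")
    if word in LEXICON:
        return (True, [])
    # difflib stands in for hunspell's suggest() in this sketch
    return (False, difflib.get_close_matches(word, LEXICON))
```

The key point is the tuple contract: callers can branch on the first element and only inspect the suggestion list when it is `False`.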
```python
>>> from klpt.tokenize import Tokenize

>>> tokenizer = Tokenize("Kurmanji", "Latin")
>>> tokenizer.word_tokenize("ji bo fortê xwe avêtin")
['▁ji▁', 'bo', '▁▁fortê‒xwe‒avêtin▁▁']
>>> tokenizer.mwe_tokenize("bi serokê hukûmeta herêma Kurdistanê Prof. Salih re saz kir.")
'bi serokê hukûmeta herêma Kurdistanê Prof . Salih re saz kir .'

>>> tokenizer_ckb = Tokenize("Sorani", "Arabic")
>>> tokenizer_ckb.word_tokenize("بە هەموو هەمووانەوە ڕێک کەوتن")
['▁بە▁', '▁هەموو▁', 'هەمووانەوە', '▁▁ڕێک‒کەوتن▁▁']
```
### mwe_tokenize(self, sentence, separator="▁▁", in_separator="‒", punct_marked=False, keep_form=False)

Multi-word expression tokenization

**Parameters:**

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `sentence` | str | sentence to be split by multi-word expressions | required |
| `separator` | str | a specific token to mark a multi-word expression. By default, two ▁ characters (▁▁) are used for this purpose. | `'▁▁'` |
| `in_separator` | str | a specific token to mark the composing parts of a multi-word expression. By default, a dash ‒ is used for this purpose. | `'‒'` |
| `keep_form` | boolean | if set to True, the original form of the multi-word expression is returned the same way it is provided in the input. If set to False, the lemma form is used, where the parts are delimited by a dash ‒, as in "dab‒û‒nerît". | `False` |

**Returns:**

| Type | Description |
|------|-------------|
| str | sentence with multi-word expressions marked using the separator |
Source code in `klpt/tokenize.py`:

```python
def mwe_tokenize(self, sentence, separator="▁▁", in_separator="‒", punct_marked=False, keep_form=False):
    """
    Multi-word expression tokenization

    Args:
        sentence (str): sentence to be split by multi-word expressions
        separator (str): a specific token to specify a multi-word expression. By default two ▁ (▁▁) are used for this purpose.
        in_separator (str): a specific token to specify the composing parts of a multi-word expression. By default a dash ‒ is used for this purpose.
        keep_form (boolean): if set to True, the original form of the multi-word expression is returned the same way provided in the input. On the other hand, if set to False, the lemma form is used where the parts are delimited by a dash ‒, as in "dab‒û‒nerît"

    Returns:
        str: sentence containing marked multi-word expressions using the separator

    """
    sentence = " " + sentence + " "

    if not punct_marked:
        # find punctuation marks and add a space around them
        for punct in self.tokenize_map["word_tokenize"][self.dialect][self.script]["punctuation"]:
            if punct in sentence:
                sentence = sentence.replace(punct, " " + punct + " ")

    # look for compound words and delimit them by doubling the separator
    for compound_lemma in self.mwe_lexicon:
        compound_lemma_context = " " + compound_lemma + " "
        if compound_lemma_context in sentence:
            if keep_form:
                sentence = sentence.replace(compound_lemma_context, " ▁▁" + compound_lemma + "▁▁ ")
            else:
                sentence = sentence.replace(compound_lemma_context, " ▁▁" + compound_lemma.replace("-", in_separator) + "▁▁ ")
        # check the possible word forms available for each compound lemma in the lex files, too
        # Note: compound forms don't have any hyphen or separator in the lex files
        for compound_form in self.mwe_lexicon[compound_lemma]["token_forms"]:
            compound_form_context = " " + compound_form + " "
            if compound_form_context in sentence:
                if keep_form:
                    sentence = sentence.replace(compound_form_context, " ▁▁" + compound_form + "▁▁ ")
                else:
                    sentence = sentence.replace(compound_form_context, " ▁▁" + compound_lemma.replace("-", in_separator) + "▁▁ ")

    return sentence.replace("  ", " ").replace("▁▁", separator).strip()
```
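The lexicon-driven marking above can be reduced to a short stand-alone sketch. The two MWE entries below are hypothetical, and the surface-form lookup from the `.lex` files is omitted:

```python
# Hypothetical two-entry MWE lexicon; klpt loads the real entries
# (lemmas plus their surface token forms) from its .lex data files.
MWE_LEXICON = {"dab-û-nerît", "fortê-xwe-avêtin"}

def mwe_tokenize(sentence, separator="▁▁", in_separator="‒"):
    """Mark known multi-word expressions with the separator token."""
    sentence = " " + sentence + " "
    for lemma in MWE_LEXICON:
        # the lemma's parts are hyphen-joined; its surface form is space-joined
        surface = " " + lemma.replace("-", " ") + " "
        if surface in sentence:
            marked = separator + lemma.replace("-", in_separator) + separator
            sentence = sentence.replace(surface, " " + marked + " ")
    return sentence.replace("  ", " ").strip()
```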
### sent_tokenize(self, text)

Sentence tokenizer
**Parameters:**

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `text` | str | input text to be tokenized by sentences | required |

**Returns:**

| Type | Description |
|------|-------------|
| list | a list of sentences |
Source code in `klpt/tokenize.py`:

```python
def sent_tokenize(self, text):
    """Sentence tokenizer

    Args:
        text (str): input text to be tokenized by sentences

    Returns:
        list: a list of sentences

    """
    text = " " + text + " "
    text = text.replace("\n", " ")
    text = re.sub(self.prefixes, "\\1<prd>", text)
    text = re.sub(self.websites, "<prd>\\1", text)
    text = re.sub("\s" + self.alphabets + "[.] ", " \\1<prd> ", text)
    text = re.sub(self.acronyms + " " + self.starters, "\\1<stop> \\2", text)
    text = re.sub(self.alphabets + "[.]" + self.alphabets + "[.]" + self.alphabets + "[.]", "\\1<prd>\\2<prd>\\3<prd>", text)
    text = re.sub(self.alphabets + "[.]" + self.alphabets + "[.]", "\\1<prd>\\2<prd>", text)
    text = re.sub(" " + self.suffixes + "[.] " + self.starters, " \\1<stop> \\2", text)
    text = re.sub(" " + self.suffixes + "[.]", " \\1<prd>", text)
    text = re.sub(self.digits + "[.]" + self.digits, "\\1<prd>\\2", text)

    for punct in self.tokenize_map["sent_tokenize"][self.dialect][self.script]["punct_boundary"]:
        text = text.replace(punct, punct + "<stop>")

    text = text.replace("<prd>", ".")
    sentences = text.split("<stop>")
    sentences = [s.strip() for s in sentences if len(s.strip())]

    return sentences
```
### word_tokenize(self, sentence, separator="▁", mwe_separator="▁▁", keep_form=False)

Source code in `klpt/tokenize.py`:

```python
def word_tokenize(self, sentence, separator="▁", mwe_separator="▁▁", keep_form=False):
    """Word tokenizer

    Args:
        sentence (str): sentence or text to be tokenized

    Returns:
        list: a list of words

    """
    # find multi-word expressions in the sentence
    sentence = self.mwe_tokenize(sentence, keep_form=keep_form)

    # find punctuation marks and add a space around them
    for punct in self.tokenize_map["word_tokenize"][self.dialect][self.script]["punctuation"]:
        if punct in sentence:
            sentence = sentence.replace(punct, " " + punct + " ")

    tokens = list()
    # split the sentence by space and look for identifiable tokens
    for word in sentence.strip().split():
        if "▁▁" in word:
            # the word was previously detected as a compound word
            tokens.append(word)
        else:
            if word in self.lexicon:
                # the word exists in the lexicon
                tokens.append("▁" + word + "▁")
            else:
                # the word is neither a lemma nor a compound;
                # analyze it morphologically by identifying affixes and clitics
                token_identified = False

                for preposition in self.morphemes["prefixes"]:
                    if word.startswith(preposition) and len(word.split(preposition, 1)) > 1:
                        if word.split(preposition, 1)[1] in self.lexicon:
                            word = "▁".join(["", self.morphemes["prefixes"][preposition], word.split(preposition, 1)[1], ""])
                            token_identified = True
                            break
                        elif self.mwe_tokenize(word.split(preposition, 1)[1], keep_form=keep_form) != word.split(preposition, 1)[1]:
                            word = "▁" + self.morphemes["prefixes"][preposition] + self.mwe_tokenize(word.split(preposition, 1)[1], keep_form=keep_form)
                            token_identified = True
                            break

                if not token_identified:
                    for postposition in self.morphemes["suffixes"]:
                        if word.endswith(postposition) and len(word.rpartition(postposition)[0]):
                            if word.rpartition(postposition)[0] in self.lexicon:
                                word = "▁" + word.rpartition(postposition)[0] + "▁" + self.morphemes["suffixes"][postposition]
                                break
                            elif self.mwe_tokenize(word.rpartition(postposition)[0], keep_form=keep_form) != word.rpartition(postposition)[0]:
                                word = ("▁" + self.mwe_tokenize(word.rpartition(postposition)[0], keep_form=keep_form) + "▁" + self.morphemes["suffixes"][postposition] + "▁").replace("▁▁▁", "▁▁")
                                break

                tokens.append(word)

    return " ".join(tokens).replace("▁▁", mwe_separator).replace("▁", separator).split()
```
This module aims at transliterating one script of Kurdish into another. Currently, only the Latin-based and the Arabic-based scripts of Sorani and Kurmanji are supported. The main function in this module is transliterate(), which also takes care of detecting the correct form of double-usage graphemes, namely و ↔ w/u and ی ↔ î/y. In some specific cases, it can also predict the placement of the missing i (also known as Bizroke/بزرۆکە).
### bizroke_finder(self, word)

Detection of the "i" character in the Arabic-based script. Incomplete version.

Source code in `klpt/transliterate.py`:

```python
def bizroke_finder(self, word):
    """Detection of the "i" character in the Arabic-based script. Incomplete version."""
    word = list(word)
    if len(word) > 2 and word[0] in self.latin_cons and word[1] in self.latin_cons and word[1] != "w" and word[1] != "y":
        word.insert(1, "i")
    return "".join(word)
```
### latin_to_arabic(self, char)

Mapping Latin-based characters to the Arabic-based equivalents

Source code in `klpt/transliterate.py`:

```python
def latin_to_arabic(self, char):
    """Mapping Latin-based characters to the Arabic-based equivalents"""
    # check if the character is in upper case
    mapped_char = ""

    if char.lower() != "":
        if char.lower() in self.wy_mappings.keys():
            mapped_char = self.wy_mappings[char.lower()]
        elif char.lower() in self.characters_mapping.keys():
            mapped_char = self.characters_mapping[char.lower()]
        elif char.lower() in self.punctuation_mapping:
            mapped_char = self.punctuation_mapping[char.lower()]

    if len(mapped_char):
        if char.isupper():
            return mapped_char.upper()
        return mapped_char
    else:
        return char
```
### preprocessor(self, word)

Preprocessing by normalizing text encoding and removing embedding characters

Source code in `klpt/transliterate.py`:

```python
def preprocessor(self, word):
    """Preprocessing by normalizing text encoding and removing embedding characters"""
    # replace this by the normalization part
    word = list(word.replace('\u202b', "").replace('\u202c', "").replace('\u202a', "").replace(u"وو", "û").replace("\u200c", "").replace("ـ", ""))
    return "".join(word)
```
### syllable_detector(self, word)

Detection of the syllable based on the given pattern. May be used for transcription applications.

Source code in `klpt/transliterate.py`:

```python
def syllable_detector(self, word):
    """Detection of the syllable based on the given pattern. May be used for transcription applications."""
    syllable_templates = ["V", "VC", "VCC", "CV", "CVC", "CVCCC"]
    CV_converted_list = ""
    for char in word:
        if char in self.latin_vowels:
            CV_converted_list += "V"
        else:
            CV_converted_list += "C"

    syllables = list()
    for i in range(1, len(CV_converted_list)):
        syllable_templates_permutated = [p for p in itertools.product(syllable_templates, repeat=i)]
        for syl in syllable_templates_permutated:
            if len("".join(syl)) == len(CV_converted_list):
                if CV_converted_list == "".join(syl) and "VV" not in "".join(syl):
                    syllables.append(syl)
    return syllables
```
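The first step of the detector, converting a word into its consonant/vowel skeleton, is easy to show on its own. The vowel set below is a simplified assumption standing in for the class's `latin_vowels` attribute:

```python
# Hypothetical simplification of the Latin-script Kurdish vowel set.
LATIN_VOWELS = set("aeiouêîû")

def cv_skeleton(word):
    """Map each letter to V (vowel) or C (consonant)."""
    return "".join("V" if ch in LATIN_VOWELS else "C" for ch in word)
```

The detector then matches this skeleton against concatenations of the syllable templates (`V`, `VC`, `CV`, `CVC`, ...) to enumerate possible syllabifications.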
### to_pieces(self, token)

Given a token, find other segments composed of numbers and punctuation marks not separated by space ▁

Source code in `klpt/transliterate.py`:

```python
def to_pieces(self, token):
    """Given a token, find other segments composed of numbers and punctuation marks not separated by space ▁"""
    tokens_dict = dict()
    flag = False  # True if a token is a \w
    i = 0

    for char_index in range(len(token)):
        if token[char_index] in self.digits_mapping_all or token[char_index] in self.punctuation_mapping_all:
            tokens_dict[char_index] = token[char_index]
            flag = False
            i = 0
        elif token[char_index] in self.characters_pack[self.mode] or \
                token[char_index] in self.target_char or \
                token[char_index] == self.hemze or token[char_index].lower() == self.bizroke:
            if flag:
                tokens_dict[char_index - i] = tokens_dict[char_index - i] + token[char_index]
            else:
                tokens_dict[char_index] = token[char_index]
            flag = True
            i += 1
        else:
            tokens_dict[char_index] = self.UNKNOWN

    return tokens_dict
```
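The grouping behavior above, where letters collect into runs while digits and punctuation stay single-character pieces, can be sketched with a simplified character test (`str.isalpha` stands in for the class's script-specific character sets):

```python
def to_pieces(token):
    """Group a token into {start_index: piece} with maximal letter runs."""
    pieces = {}
    run_start = None
    for i, ch in enumerate(token):
        if ch.isalpha():
            if run_start is None:
                # start a new letter run
                run_start = i
                pieces[i] = ch
            else:
                # extend the current run
                pieces[run_start] += ch
        else:
            # digits and punctuation are kept as single-character pieces
            pieces[i] = ch
            run_start = None
    return pieces
```

Keying pieces by their start index preserves the original character order, so the transliterated pieces can later be concatenated back into one token.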
### transliterate(self, text)

The main method of the class:

- find word boundaries by splitting the text on spaces, then retrieve words mixed with other characters (without space)
- map characters
- detect double-usage characters w/u and y/î
- find possible positions of Bizroke (to be completed - 2017)

Notice: the text format should not be changed at all (no lower-casing, no style replacement of \t, \n, etc.). If the source and the target scripts are identical, the input text should be returned without any further processing.

Source code in `klpt/transliterate.py`:

```python
def transliterate(self, text):
    """The main method of the class:

    - find word boundaries by splitting it using spaces and then retrieve words mixed with other characters (without space)
    - map characters
    - detect double-usage characters w/u and y/î
    - find possible position of Bizroke (to be completed - 2017)

    Notice: text format should not be changed at all (no lower case, no style replacement \t, \n, etc.).
    If the source and the target scripts are identical, the input text should be returned without any further processing.

    """
    text = self.prep.unify_numerals(text).split("\n")
    transliterated_text = list()

    for line in text:
        transliterated_line = list()
        for token in line.split():
            trans_token = ""
            token = self.preprocessor(token)  # This is not correct as the capital letter should be kept the way it is given.
            tokens_dict = self.to_pieces(token)
            # transliterate words
            for token_key in tokens_dict:
                if len(tokens_dict[token_key]):
                    word = tokens_dict[token_key]
                    if self.mode == "arabic_to_latin":
                        # w/y detection based on the priority in "word"
                        for char in word:
                            if char in self.target_char:
                                word = self.uw_iy_Detector(word, char)
                        if word[0] == self.hemze and word[1] in self.arabic_vowels:
                            word = word[1:]
                        word = list(word)
                        for char_index in range(len(word)):
                            word[char_index] = self.arabic_to_latin(word[char_index])
                        word = "".join(word)
                        word = self.bizroke_finder(word)
                    elif self.mode == "latin_to_arabic":
                        if len(word):
                            word = list(word)
                            for char_index in range(len(word)):
                                word[char_index] = self.latin_to_arabic(word[char_index])
                            if word[0] in self.arabic_vowels or word[0].lower() == self.bizroke:
                                word.insert(0, self.hemze)
                            word = "".join(word).replace("û", "وو").replace(self.bizroke.lower(), "").replace(self.bizroke.upper(), "")

                    trans_token = trans_token + word

            transliterated_line.append(trans_token)
        transliterated_text.append(" ".join(transliterated_line).replace(u" w ", u" û "))

    # standardize the output
    # replace UNKNOWN by the user's choice
    if self.user_UNKNOWN != self.UNKNOWN:
        return "\n".join(transliterated_text).replace(self.UNKNOWN, self.user_UNKNOWN)
    else:
        return "\n".join(transliterated_text)
```
### uw_iy_Detector(self, word, target_char)

Detection of "و" and "ی" in the Arabic-based script