-
Notifications
You must be signed in to change notification settings - Fork 114
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simplify diacratic removal; modify Latin & Greek preprocessors #724
Conversation
This looks good, just want to give some context on removing diacritics. In some languages, diacritics may be optional, or used only in texts for children and learners, dictionaries, where disambiguation is necessary, etc. Lookup needs work on texts with or without them. Instead of trying combinations of diacritics on text without them, it is more efficient to have the diacritic-less variations in the dictionary, and remove them with a textPreprocessor if the text does have them. This may however introduce ambiguity, so maybe the term-reading system could be used, with readings containing the diacritics. In some cases (e.g. German umlauts) diacritics are not generally omitted, so they can be in the dictionary, and there is no need to remove them. In the case of Latin, I was under the impression that the long vowel marks and such were optional, but used in some contexts, and so should be removed from the dictionary and in text preprocessing. Unlike me, you seem to actually know something about Latin, so I hope you can verify these assumptions. I'll also probably update the latin diacritic removal at KTY to match what you do here. |
I must prefix this by saying that I am a learner of Latin myself, but, as you say, the usage of macrons is not consistent in anything aside from dictionaries and learner texts. |
CodSpeed Performance ReportMerging #724 will not alter performanceComparing Summary
|
View Playwright Report (note: open the "playwright-report" artifact) |
Relevant, but I've also ran into problems using string.normalize before since it is somewhat indiscriminate with regard to what it replaces. Can create problems with dictionary forms of words not being searchable. Realistically, there should be some guidelines for dictionary creators for how the data should be formatted or normalized. |
Good information for not using it too willy-nilly. I assume that it should perform well on Lain-based languages (and I did not run into any problems myself), and issues related to it would crop up quickly in languages with so few characters. |
@martholomew Could you leave a similar warning on this function as a comment? Otherwise I feel it's quite likely someone might try to use it for a language with a more complicated writing system and introduce a bug by accident. |
Signed-off-by: Darius Jahandarie <[email protected]>
Signed-off-by: Matttttt <[email protected]>
Signed-off-by: Darius Jahandarie <[email protected]>
Signed-off-by: Darius Jahandarie <[email protected]>
CI unit test failure: "ext/js/language/la/latin-text-preprocessors.js", |
Signed-off-by: Matttttt <[email protected]>
@martholomew Are you able to figure out this TS error in CI or should I try to find someone who can help? |
Maybe just needs |
Sorry for the late response, I don't know how to fix the error, if I could get help that would be great! |
fix tests, merge master
I added a very simple function to remove diacritics that should also work with the other languages (besides just LA and GRC), but I do not know enough about them to be able to judge that.
Thanks!