Update hyphenation .pattern files #373
Comments
About the "should be renamed": don't do that too quickly. It will also need to be updated in frontend, along with some tweaks so that we can at least migrate from the old bad hyph dict filename saved as a setting to the new setting with a language tag (discussed among the comments in koreader/koreader#6072). |
Of course, there are other languages available for importing. |
@strn Can you run me through how you went about doing this for Serbian? |
@roshavagarga I just took the hyphenation file from that TeX GitHub page and passed it through a small shell script that produced the correct output. Is this what you wanted to know, or do you need more information? |
If you'd be up for sharing that I might have the time to update the other hyphenation patterns at some point 👍 |
Please see here: shell script to convert TeX hyphenation file to KOreader . Let me know if this was of any use. |
@strn It was useful for generating a new Bulgarian pattern, but it can't handle tex files that have more than 1 pattern per line, like the new Spanish one I quoted above. I believe those just need to be spaced out with 1 pattern per line, rather than anything more complicated. |
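For reference, the "one pattern per line" step mentioned above could be sketched roughly like this in Python. This is not the linked shell script: the function name and the sample input are made up for illustration, and it assumes the TeX file only contains `%` comments, a `\patterns{` wrapper and whitespace-separated patterns.

```python
def one_pattern_per_line(tex_text: str) -> list[str]:
    """Split TeX hyphenation patterns so that each pattern stands alone.

    Simplistic sketch: comments are cut at the first '%' on a line, and
    TeX control sequences and bare braces are skipped.
    """
    patterns = []
    for raw_line in tex_text.splitlines():
        # Drop everything from the first '%' to the end of the line.
        line = raw_line.split("%", 1)[0]
        for token in line.split():
            # Skip control sequences such as "\patterns{" and lone braces.
            if token.startswith("\\") or token in ("{", "}"):
                continue
            cleaned = token.strip("{}")
            if cleaned:
                patterns.append(cleaned)
    return patterns


if __name__ == "__main__":
    # Made-up sample input, not taken from any real pattern file.
    sample = "\\patterns{\n.a2b1 c1d e3f. % several patterns on one line\n1g\n}\n"
    print("\n".join(one_pattern_per_line(sample)))
```

The output can then be fed to whatever conversion step expects one pattern per line.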
For the record, even if I'm not here often these days, the french.pattern still has all my love and attention =) |
@cramoisi I found an updated Catalan tex file (available in first post) - what do we do with lines that have a literal - in them, not sure why the maintainer of that source added those. |
@roshavagarga Could you give me an example ? I'm on the go |
@cramoisi Sure thing, there's some lines like this one here and there:
The source looks like an improvement to the one we're using and the last activity was from December 2019, but I don't speak Catalan, so... |
Each part of each line should make a specific pattern rule. A dot is translated to a space. Hyphen parts are ignored (as crengine can't read them) if they translate the same as existing rules. As crengine considers a hyphen to be the end of a word (like a dot), the first two lines translate the same as lines 3 and 4, so patterns are made only with these:
|
@cramoisi What should I do about lines that are only available with a
|
You translate it as if it were a dot.
|
Consider (like crengine does) a space, a dot, a hyphen, a quote, a paren... any punctuation actually, as a word boundary. |
@poire-z Then why did we just get rid of patterns with |
Because their translations duplicate other rules. |
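To illustrate the reasoning in the last few comments, here is a rough Python sketch of "translate boundary markers to a space, then drop duplicates". The function names and sample patterns are hypothetical, it assumes dots and hyphens only occur as word-boundary markers at the start or end of a pattern, and it says nothing about how the result is written into a .pattern file.

```python
# Characters treated as word boundaries in this sketch (assumption: only
# the dot and the hyphen occur as boundary markers in the source patterns).
BOUNDARY_CHARS = ".-"
_TO_SPACE = str.maketrans(BOUNDARY_CHARS, " " * len(BOUNDARY_CHARS))


def normalise(pattern: str) -> str:
    """Translate every boundary marker in a pattern to a space."""
    return pattern.translate(_TO_SPACE)


def dedupe_patterns(patterns: list[str]) -> list[str]:
    """Normalise patterns and keep only the first occurrence of each."""
    seen = set()
    unique = []
    for pattern in patterns:
        translated = normalise(pattern)
        if translated not in seen:
            seen.add(translated)
            unique.append(translated)
    return unique


if __name__ == "__main__":
    # Made-up patterns: the hyphen variants collapse onto the dot variants.
    print(dedupe_patterns([".a1b", "-a1b", "c1d.", "c1d-", "e1f"]))
```

With this approach a hyphen-only pattern survives (translated like a dot one), while a hyphen pattern that has a dot twin is dropped as a duplicate.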
Actually, that might be wrong. @cramoisi, what's the thinking? Is it this: ?
I'm not part of the "we", but I guess it's because they were duplicates of others with a space instead? |
Yes
Yes, mostly @roshavagarga : you mean these ?
Because crengine can't read them properly and needs some new rules to process them, and you can't write those rules unless you're fluent in the language (as I did with French; I think I've already explained the difficulties of that). |
We don't disable hyphenation, but as there is no hyph dict for these lang tags, and because they have Unicode codepoints not present in any of our (possibly enabled) hyph dicts, they won't match any pattern. But some LTR English or anything present among the Arabic/Hebrew glyphs (making the text BiDi) may match some patterns and be hyphenated - even in the middle of a line :) |
@poire-z Honestly I just saw that the 'hyphenation patterns' for Farsi/Persian/Arabic basically told the software using them to not hyphenate, so I thought they never should be :) I have zero plans of trying to work towards hyphenation for those languages ;) |
@roshavagarga: User ichnilatis on MobileRead requested hyphenation patterns for Greek and Ancient Greek. They're both available at http://tug.org/tex-hyphen/#languages. Please consider them for inclusion. AFAICT ichnilatis is willing to test the results and provide feedback but not to do the work involved. |
@pazos The problem is I have no idea what the difference between the two Greek ones is. I believe our current Greek one is based off the Ancient Greek one, if I remember properly? If he can tell us which one is useful for what so that there's an understanding which one should be the default and which one's needed for what, I can whip up the files afterwards :) |
Let's wait for feedback, then 😄 |
I believe Greek as in |
@Frenzie As far as TeX, the issue is there are two types of files for modern - monotonic Greek and polytonic Greek. Those two are under |
@roshavagarga Native Greek speaker here. It's a long story but if you have to choose one, choose monotonic for modern ( |
Well, if you can offer any more info at any time, I'd be up for that. There's the option of having all of the above, just setting some as default ones. As I said, we have mono and poly for Modern Greek specifically and two separate files for Ancient Greek. Would the poly one for Modern really be useful for Ancient? I just noticed there's a second option for Ancient Greek called |
Oh! I got mixed up and thought you said there was a polytonic and monotonic option for both. Honestly, I would be surprised if there was a monotonic Ancient script, as no such thing exists formally to my knowledge.
I've taken the time to read the PDFs the files refer to, which can be found here. Skip to the bottom for my suggestion.
All in all, the articles make a strong case for the complex hyphenation rules that that group created. They also highlight that Ancient (which is polytonic as standard) has different needs to Polytonic Modern, so a separate ruleset was needed. They cite several examples that sound reasonable enough. I want to just highlight this passage with regard to
I'm not personally a TeX user but from my limited understanding based on the above paragraph and a bit of Googling:
Finally, you asked if modern polytonic is useful for Ancient Greek. The article briefly covers some rules that have been introduced 'recently' (this would be the early 2000s) about how syllables are broken down for Ancient Greek texts specifically, which differentiates them from polytonic. I must be frank here and state that no-one but the most ardent and pedantic of linguists/typography nuts will ever notice the position of the hyphen, much less make technically accurate documents. So I doubt these rules really matter either way, but if pushed, go with the Ancient Greek-specific rules for Ancient Greek.
Polytonic is a form of writing that can be applied to modern Greek, but realistically, most [1800s onwards] text in polytonic will either be influenced by Katharevousa or be in Katharevousa in itself. Katharevousa is still, fundamentally, a modern dialect, while Ancient is an intelligible but different language. So yes, I would reasonably expect them to follow different rules, and indeed the TeX files are radically different.
I really just want to stress that while there is an enthusiast community for polytonic that includes my brother, almost nothing has been published in it since the 80s. So don't feel too bad if you have to leave it out.
In summary, this would be the ideal solution:
Hope this helps! |
@superuser-does |
@poire-z Would you mind offering your 2 cents? I'm going through the Modern Greek files now and they seem to have a lot of stuff that I don't know what I should do with. Example:
or |
I'm afraid I don't have these 2 cents :) Although it looks like they should be different patterns: And the last comment may apply only to the last pattern: |
Still probably more than me, hahah.
@superuser-does If I make test .pattern files, would you be willing to test them out on your end and see if they do things properly? |
Here I am! (Unfortunately I couldn't use the nickname "ichnilatis", as it had already been taken by another user.) @superuser-does I would like to inform you that many books are published in polytonic Greek even nowadays. I'll be at your disposal if I can help somehow, even though I don't know anything about programming. Thank you again. |
@ichnilatis-gr Typically I take the tex files from either here or here. Are yours from any of these sources and, if not, where did you get them from and how do they compare? I'm guessing you agree with having mono as the default for Modern Greek? Would you be up for testing the files once they're created and do you have the knowledge required to know whether they're doing a proper job or not? I'm asking because I don't speak Greek, but in my own language (Bulgarian), hyphenation is instinctual and easy to guesstimate even for new loanwords from other languages :) |
I took the tex files from the link that pazos gave me (http://tug.org/tex-hyphen/#languages). Thank you again. |
Pazos told me on the MobileRead forum that the conversion of tex files can be done with https://gist.github.com/strn/f5c6d9c5242fdc9c49d09f21ecad1ffa |
@ichnilatis-gr I believe our current Modern Greek file is wrong and actually uses the Ancient Greek tex file, but I'm going off memory. Can you share your experience with the current hyphenation and whether it's proper or not? The conversion itself is easy, but sometimes things pop up like the ones I discussed with poire-z before. I can convert it fairly easily and do the changes I think might be okay, but we'd still need you or another fluent user to test them out and confirm whether they do things properly - sometimes mistakes pop up in the conversion or the files themselves are just incomplete or don't handle newer words. That's why I asked whether fluent users like yourself can intuitively sense whether the hyphenation being used is the proper one :) Edit: You can also edit your old comments and add information to them instead of posting a new comment, in case you didn't know. Thanks for the input and help! |
Which file do you want me to check? |
@ichnilatis-gr Open a modern Greek book in KOReader, test out the current hyphenation and tell me if it feels right. Besides that, I need to create the new files and tell you what to change and where to put your files so you can compare them to the current hyphenation. |
I would like to make clear that even Modern Greek can be polytonic, and I think that these polytonic files may use the hyphenation rules of Modern Greek. I noticed that KOReader uses the hyphenation rules of Ancient Greek when it indicates "Greek". Here are some differences between Ancient Greek and Modern Greek hyphenation (Ancient Greek / Modern Greek):
μ-π (ἔμ-πο-ρος) / μπ (έ-μπο-ρος)
There are also some differences if a word is a compound:
συν-άγω / συ-νά-γω
The most common compounds are with συν- and δυσ-. These 3 letters stay together in the same syllable in Ancient Greek, while they are separated in Modern Greek (συ-ν*-... / δυ-σ*...). But I don't know how these rules can be defined. Nevertheless, I have found a web page that uses these rules. You can see here. But I don't know if it can help. Finally, if you are going crazy with all these differences and rules, maybe it would be better to just do the conversion without any other fixes... |
@roshavagarga So, even before I put my wish into words, it was almost granted.. |
Just a quick update: unfortunately I nuked my HDD by mistake and lost all of my hyphenation work, so I'll start over from scratch at some point :) |
Bonus to-do list:
|
There's been some work in upstream (coolreader) crengine about licenses and ensuring it's fully/only GPL: buggins/coolreader#339 As part of it, @virxkane has updated many hyphenation files, and added some (i.e. Bengali). Removed files: buggins/coolreader@20ebd0a And there's some code change in buggins/coolreader@7925f6b to pick language, tags, and left/right min size from the files themselves - that I'd rather not pick. @uroybd: any interest in having Bengali hyphenated? I'm not sure if it would work as is, or if it would need more tweaks (because Bengali is considered cursive, and there are many things we avoid doing with cursive text segments). |
Let's not. It's complex and no one uses hyphenation in Bengali nowadays. |
You can mark the Russian hyphenation as complete. The remaining part is "Additional patterns with hyphen/dash", which aren't supported by crengine, as I understand it. And the remainder contains ad-hoc hyphenated words, which can be safely ignored. |
Most of the hyphenation .pattern files are from hyphenation.org or LibreOffice.
After a quick review, some of the files need to be updated, while others are missing exceptions - these can be implemented as can be seen at the end of the English_GB.pattern file.
In need of love:
Bonus to-do list: