Update hyphenation .pattern files #373

Open
1 of 10 tasks
roshavagarga opened this issue Aug 30, 2020 · 47 comments

Comments

@roshavagarga
Contributor

roshavagarga commented Aug 30, 2020

Most of the hyphenation .pattern files are from hyphenation.org or LibreOffice.

After a quick review, some of the files need to be updated, while others are missing exceptions. Exceptions can be added the same way as at the end of the English_GB.pattern file.

In need of love:

  • Icelandic - Switch to this source or this one or this one?
  • Catalan - Switch to this source, per confirmation that it is better here or here?
  • Czechoslovak - Switch multiple languages to this source?
  • Greek - Rename to Ancient Greek, add Monotonic Greek and Polytonic Greek, differentiate between all three.
  • Polish - Manually check if missing exceptions from hyph.org version are covered by the 3,0a.4 version we're using
  • Russian - additional hyphens part (last 3k+ lines) is missing
  • Ukrainian - additional hyphens part is missing
  • Add Portuguese (Brazil) and differentiate from pt_PT
  • Add Church Slavonic from this source?
  • Confirm whether Romanian is used for ro-MD (former Moldovan/Moldavian)?

Bonus to-do list:

  • Albanian - add from here.
  • Portuguese (BR) - update from here
  • Catalan - update file, follow the example of the other updated files, mark version 3 and point to here
  • Belarusian - add from here or here
  • Danish - point to actual file here
  • Dutch - same, here
  • English - same, here and here
  • Esperanto - update from here and switch link, or switch to here - native speaker required to ascertain which version is better.
  • Mongolian - add from here
  • Hungarian - update from here (might be the same, check) or switch to here
  • Indonesian - add from here
  • Italian - switch to here - seems like improved version?
  • Telugu - add from here
  • Oriya - add from here
  • Panjabi - add from here
  • Tamil - add from here
  • Finnish - switch to or add as separate from here - some sort of 'school rules' (simplified?) version of Finnish hyphenation
  • Portuguese - switch to this source?

@poire-z
Contributor

poire-z commented Aug 30, 2020

About the "should be renamed" part, don't do that too quickly: it would also need to be updated in the frontend, along with some tweaks to at least be able to migrate from the old bad hyph dict filename saved as a setting to the new setting with a language tag (discussed among the comments in koreader/koreader#6072).
(And people upgrading will have both old and renamed files, but that's a minor non-issue :)

@roshavagarga
Contributor Author

roshavagarga commented Aug 30, 2020

Of course, there are other languages available for importing.

@roshavagarga
Contributor Author

@strn Can you run me through how you went about doing this for Serbian?

@strn
Contributor

strn commented Sep 4, 2020

@roshavagarga I just took the hyphenation file from that TeX GitHub page and passed it through a small shell script that produced correct output. Is this what you wanted to know, or do you need more information?

@roshavagarga
Contributor Author

If you'd be up for sharing that I might have the time to update the other hyphenation patterns at some point 👍

@strn
Contributor

strn commented Sep 6, 2020

Please see here: shell script to convert TeX hyphenation file to KOReader. Let me know if this was of any use.

@roshavagarga
Contributor Author

roshavagarga commented Sep 11, 2020

@strn It was useful for generating a new Bulgarian pattern, but it can't handle tex files that have more than 1 pattern per line, like the new Spanish one I quoted above. I believe those just need to be spaced out with 1 pattern per line, rather than anything more complicated.
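
A minimal sketch of that "one pattern per line" step, written in Python rather than shell. It assumes the input is only the body of the \patterns{...} block, that % starts a comment, and that patterns are separated by whitespace. `split_tex_patterns` is a hypothetical helper name and is not part of the shell script mentioned above.

```python
def split_tex_patterns(lines):
    """Yield one pattern per item from TeX pattern lines.

    Assumptions (not confirmed by the thread): '%' starts a comment that runs
    to the end of the line, and patterns are separated by whitespace.
    """
    for line in lines:
        body = line.split("%", 1)[0]   # drop the comment part, if any
        for token in body.split():     # several patterns may share a line
            yield token


if __name__ == "__main__":
    sample = [
        "%verbal forms: infinitive and gerund",
        "u1ir. qu4ir. gu4ir.",
        ".cu4ir. .va4ir.",
    ]
    for pattern in split_tex_patterns(sample):
        print(pattern)
```

The output can then be fed to whatever per-line conversion is already in use.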

@cramoisi
Contributor

For the record, even if I'm not here often these days, the french.pattern still has all my love and attention =)

@roshavagarga
Contributor Author

@cramoisi I found an updated Catalan tex file (available in first post) - what do we do with lines that have a literal - in them? I'm not sure why the maintainer of that source added those.

@cramoisi
Contributor

@roshavagarga Could you give me an example? I'm on the go

@roshavagarga
Contributor Author

@cramoisi Sure thing, there's some lines like this one here and there:

%verbal forms: infinitive and gerund
u1ir. qu4ir. gu4ir. u1int. qu4int. gu4int.
e1ir. e1int. a1ir. a1int. o1ir. o1int. 
u1ir- qu4ir- gu4ir- u1int- qu4int- gu4int-
e1ir- e1int- a1ir- a1int- o1ir- o1int- 
.cu4ir. .va4ir.

The source looks like an improvement to the one we're using and the last activity was from December 2019, but I don't speak Catalan, so...

@cramoisi
Contributor

cramoisi commented Sep 22, 2020

Each entry on each line becomes its own pattern rule. A dot is translated to a space. Entries with a hyphen are dropped if they translate to the same thing as an existing rule (crengine can't read the hyphen as such). Since crengine treats a hyphen as the end of a word, just like a dot, lines 3 and 4 translate to the same patterns as the first two lines, so the patterns are built only from these:

u1ir. qu4ir. gu4ir. u1int. qu4int. gu4int.
e1ir. e1int. a1ir. a1int. o1ir. o1int. 
.cu4ir. .va4ir.
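
The translation described above could be sketched roughly like this. It is an illustration under assumptions, not the actual conversion script: `to_pattern_lines` is a hypothetical helper, it only looks at a leading or trailing dot or hyphen, and it drops any entry whose translation has already been produced.

```python
from xml.sax.saxutils import escape

# Both characters are treated as a word boundary, as described in the comment above.
BOUNDARY_CHARS = (".", "-")


def to_pattern_lines(tokens):
    """Translate TeX-style pattern tokens into <pattern> lines, skipping duplicates."""
    seen = set()
    lines = []
    for token in tokens:
        left = " " if token.startswith(BOUNDARY_CHARS) else ""
        right = " " if len(token) > 1 and token.endswith(BOUNDARY_CHARS) else ""
        core = token[len(left):len(token) - len(right)]
        if not core:
            continue  # nothing left once the boundary markers are removed
        translated = left + core + right
        if translated in seen:
            continue  # e.g. "u1ir-" translates the same as "u1ir."
        seen.add(translated)
        lines.append("<pattern>" + escape(translated) + "</pattern>")
    return lines


if __name__ == "__main__":
    for line in to_pattern_lines(["u1ir.", "u1ir-", ".cu4ir.", "hi4v-"]):
        print(line)
```

Entries containing ' are left untouched by this sketch; as discussed further down, those need language-specific handling.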

@roshavagarga
Contributor Author

roshavagarga commented Sep 22, 2020

@cramoisi What should I do about lines that are only available with a -? Replace it with a full stop/dot or just outright remove them like with lines that have ' or ''?
Example from Afrikaans file, I think the newer Catalan source had similar instances:

hi4sp
his5pa
hi4v-
2hl

@cramoisi
Contributor

You translate it as if it were a dot:

<pattern>hi4v </pattern>

@poire-z
Contributor

poire-z commented Sep 22, 2020

Consider (like crengine does) a space, a dot, a hyphen, a quote, a paren... any punctuation actually, as a word boundary.
And you mark patterns that should match only at boundaries by adding a space on the left or on the right in their <pattern>.
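
For reference, here is a small sketch of how boundary-anchored patterns behave in Liang's classic algorithm (the one TeX patterns are written for), with a space standing in for the word boundary. This is not crengine's code, and crengine's exact matching rules may differ. `parse_pattern` and `hyphenate` are illustrative names, the patterns in the usage example are made up, and minimum left/right fragment sizes are not handled.

```python
def parse_pattern(pattern):
    """Split a pattern such as 'hi4v ' into its letters and inter-letter values."""
    letters = []
    points = [0]
    for ch in pattern:
        if ch.isdigit():
            points[-1] = int(ch)   # value sits before the next letter
        else:
            letters.append(ch)
            points.append(0)
    return "".join(letters), points


def hyphenate(word, patterns):
    """Return the pieces of `word` split at odd-valued positions (classic Liang)."""
    table = dict(parse_pattern(p) for p in patterns)
    padded = " " + word.lower() + " "          # spaces mark the word boundaries
    scores = [0] * (len(padded) + 1)           # scores[i] is the value before padded[i]
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            points = table.get(padded[start:end])
            if points is None:
                continue
            for offset, value in enumerate(points):
                scores[start + offset] = max(scores[start + offset], value)
    pieces, last = [], 0
    for j in range(1, len(word)):              # candidate break before word[j]
        if scores[j + 1] % 2 == 1:             # odd value allows a break
            pieces.append(word[last:j])
            last = j
    pieces.append(word[last:])
    return pieces


if __name__ == "__main__":
    print(hyphenate("cabin", ["a1b"]))    # unanchored pattern
    print(hyphenate("cab", ["a1b "]))     # pattern anchored at the right boundary
    print(hyphenate("cabin", ["a1b "]))   # anchored pattern does not apply mid-word
```

In this sketch an even value (such as the 4 in hi4v) forbids a break, and the highest value at a position wins.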

@roshavagarga
Contributor Author

@poire-z Then why did we just get rid of patterns with ' or '' in a few of the previous patterns, or am I misremembering that one? Maybe they were copies of other lines and I didn't notice at the time tbh.

@cramoisi
Contributor

> Then why did we just get rid of patterns with ' or '' in a few of the previous patterns, or am I misremembering that one? Maybe they were copies of other lines and I didn't notice at the time tbh.

Because their translations duplicate other rules.

@poire-z
Contributor

poire-z commented Sep 22, 2020

> And you mark patterns that should match only at boundaries by adding a space on the left or on the right

Actually, that might be wrong. @cramoisi, what's the thinking? Is it this?
<pattern>hi4v</pattern> matches only when not at boundaries
<pattern>hi4v </pattern> matches it also when at boundary on the right

> Then why did we just get rid of patterns

I'm not part of the "we", but I guess it's because they were duplicates of others with a space instead?

@cramoisi
Contributor

cramoisi commented Sep 22, 2020

> <pattern>hi4v</pattern> matches only when not at boundaries
> <pattern>hi4v </pattern> matches it also when at boundary on the right

Yes

> I'm not part of the "we", but I guess, because they were duplicates of other with a space instead?

Yes, mostly

@roshavagarga: you mean these?

.l's8p
.l's8c
.l'f8t
.d's8p
.d's8c
.d'f8t
n'8hi

Because crengine can't read them properly and needs some new rules to process them. And you can't write these rules if you're not fluent in the language (as I did with French; I think I've already explained the difficulties of that).

@poire-z
Contributor

poire-z commented Sep 22, 2020

> Confirm whether hyphenating is disabled by default for Farsi/Persian and Arabic?

We don't disable hyphenation, but as there is no hyph dict for these lang tags, and because they have Unicode codepoints not present in any of our (possibly enabled) hyph dicts, they won't match any pattern.
(But if we ever get a hyph dict for Arabic, the code might need adjustment, as I'm quite sure we would put the hyphen on the right instead of the left :) so.. please don't think about adding Arabic.pattern! :) I'm done with BiDi hell)

But some LTR english or anything present among the arabic/hebrew glyphs (making the text BiDi) may match some patterns and be hyphenated - even in the middle of a line :)
koreader/koreader#5359 (comment) (in red)
koreader/koreader#5359 (comment) (said to be perfect)

@roshavagarga
Contributor Author

@poire-z Honestly I just saw that the 'hyphenation patterns' for Farsi/Persian/Arabic basically told the software using them to not hyphenate, so I thought they never should be :) I have zero plans of trying to work towards hyphenation for those languages ;)

@pazos
Member

pazos commented Dec 17, 2020

@roshavagarga: User ichnilatis on MobileRead requested hyphenation patterns for Greek and Ancient Greek. They're both available at http://tug.org/tex-hyphen/#languages.

Please consider them for inclusion.

AFAICT ichnilatis is willing to test the results and provide feedback but not to do the work involved.

@roshavagarga
Contributor Author

@pazos The problem is I have no idea what the difference between the two Greek ones is. I believe our current Greek one is based off the Ancient Greek one, if I remember properly? If he can tell us which one is useful for what so that there's an understanding which one should be the default and which one's needed for what, I can whip up the files afterwards :)

@pazos
Member

pazos commented Dec 17, 2020

Let's wait for feedback, then 😄

@Frenzie
Member

Frenzie commented Dec 17, 2020

I believe Greek as in el is modern, ancient is grc. See https://en.wikipedia.org/wiki/ISO_639-2 and https://en.wikipedia.org/wiki/ISO_639-3

@roshavagarga
Contributor Author

@Frenzie As far as TeX, the issue is there are two types of files for modern - monotonic Greek and polytonic Greek. Those two are under el, while Ancient Greek is separate under grc.

@superuser-does

@roshavagarga Native Greek speaker here. It's a long story but if you have to choose one, choose monotonic for modern (el) and polytonic for ancient (grc). Let me know if you have further questions.

@roshavagarga
Contributor Author

> @roshavagarga Native Greek speaker here. It's a long story but if you have to choose one, choose monotonic for modern (el) and polytonic for ancient (grc). Let me know if you have further questions.

Well, if you can offer any more info at any time, I'd be up for that. There's the option of having all of the above, just setting some as default ones. As I said, we have mono and poly for Modern Greek specifically and two separate files for Ancient Greek. Would the poly one for Modern really be useful for Ancient?

I just noticed there's a second option for Ancient Greek called ibycus?

@superuser-does

Oh! I got mixed up and thought you said there was a polytonic and monotonic option for both. Honestly, I would be surprised if there was a monotonic Ancient script, as no such thing exists formally to my knowledge.

I've taken the time to read the PDFs the files refer to, which can be found here. Skip to the bottom for my suggestion.

All in all, the articles make a strong case for the complex hyphenation rules that that group created. They also highlight that Ancient (which is polytonic as standard) has different needs to Polytonic Modern, so a separate ruleset was needed. They cite several examples that sound reasonable enough.

I want to just highlight this passage with regard to ibycus:

> In closing, we should say that the present hyphenation patterns have been written according to Levy's encoding of Greek for TeX [10] and are useful for typesetting with Dryllerakis's GreeKTeX [11] or with the Greek option of babel [12, 13]. The patterns may also be useful for other packages for typesetting Greek texts, such as ibycus [14] and Moschovakis's greektex [15], but in that case certain changes will need to be made. Finally, we hope that one day the hyphenation patterns for Ancient Greek will find their way even into the Omega program.

I'm not personally a TeX user but from my limited understanding based on the above paragraph and a bit of Googling:
This library is strictly for hyphenation and can be used in conjunction with ibycus or greektex (seems different from GreeKTex?), though it warns that modifications may be needed. From that I can guess that ibycus (which can be found on the CTAN site) has its own hyphenation rules. As for GreekTeX, it seems to be a full typesetting system that converts Latin input to Greek.

Finally, you asked if modern polytonic is useful for Ancient Greek. The article briefly covers some rules that have been introduced 'recently' (this would be the early 2000s) about how syllables are broken down for Ancient Greek texts specifically, which differentiates them from polytonic. I must be frank here and state that no-one but the most ardent and pedantic of linguists/typography nuts will ever notice the position of the hyphen, much less make technically accurate documents. So I doubt these rules really matter either way, but if pushed, go with the Ancient Greek-specific rules for Ancient Greek.

Polytonic is a form of writing that can be applied to modern Greek, but realistically, most [1800s onwards] text in polytonic will either be influenced by Katharevousa or be in Katharevousa in itself. Katharevousa is still, fundamentally, a modern dialect, while Ancient is an intelligible but different language. So yes, I would reasonably expect them to follow different rules, and indeed the TeX files are radically different.

I really just want to stress that while there is an enthusiast community for polytonic that includes my brother, almost nothing has been published in it since the 80s. So don't feel too bad if you have to leave it out.


In summary, this would be the ideal solution:

  • For Modern - set monotonic as the default. Allow modern-polytonic optionally.
  • For Ancient: Only use the grc package that @pazos linked, you won't need anything else.
    We can surmise from the article that the grc package above is better, and the article suggests it works as a full replacement for any prior rules. I imagine ibycus is retained on the list (though curiously, it's not available for download) for technical reasons i.e. to ensure that TeX documents typeset with the ibycus system compile. Also, the polytonic rules for modern won't be useful for Ancient.

Hope this helps!

@roshavagarga
Contributor Author

@superuser-does Ibycus is actually available and can be found here.

@roshavagarga
Contributor Author

@poire-z Would you mind offering your 2 cents? I'm going through the Modern Greek files now and they seem to have a lot of stuff that I don't know what I should do with.

Example:

υ2α
υ2ά υ2ά

or έ2ι έ2ι έ2ϊ έ2ϊ % 'e3i --- not to be separated: t`o rw-m'ei-ko (one way to pronounce it)

@poire-z
Contributor

poire-z commented Dec 18, 2020

> Would you mind offering your 2 cents?

I'm afraid I don't have these 2 cents :)
I know less than @cramoisi about the patterns files :/
And I don't know what these spaces in there are for (are they real spaces, or some other kind of Unicode space that could have a meaning in Greek?!)

Although it looks like they should be different patterns:
έ2ι έ2ι έ2ϊ έ2ϊ differ only by the final ϊ so it looks like it could be:
έ2ι
έ2ι
έ2ϊ
έ2ϊ

And the last comment may apply only to the last pattern:
% 'e3i --- not to be separated

@roshavagarga
Contributor Author

> I'm afraid I don't have these 2 cents :)
> I know less than @cramoisi about the patterns files :/

Still probably more than me, hahah.
My first instinctual reaction was to turn έ2ι έ2ι έ2ϊ έ2ϊ into:

<pattern>έ2ι</pattern>
<pattern>έ2ϊ</pattern>
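
One possible reason the TeX file lists each pattern twice (an assumption; nothing in this thread confirms it) is that the accented vowel is encoded once with tonos (U+03AD) and once with oxia (U+1F73), which look identical on screen. Under Unicode NFC normalization both collapse to the same code point, so deduplicating after normalization would leave the two patterns above. `dedupe_normalized` is a hypothetical helper name.

```python
import unicodedata


def dedupe_normalized(tokens):
    """NFC-normalize each pattern and keep only the first occurrence of each."""
    seen = set()
    result = []
    for token in tokens:
        normalized = unicodedata.normalize("NFC", token)
        if normalized not in seen:
            seen.add(normalized)
            result.append(normalized)
    return result


if __name__ == "__main__":
    epsilon_tonos = "\u03ad"   # GREEK SMALL LETTER EPSILON WITH TONOS
    epsilon_oxia = "\u1f73"    # GREEK SMALL LETTER EPSILON WITH OXIA
    iota = "\u03b9"
    iota_dialytika = "\u03ca"
    line = [
        epsilon_tonos + "2" + iota,
        epsilon_oxia + "2" + iota,
        epsilon_tonos + "2" + iota_dialytika,
        epsilon_oxia + "2" + iota_dialytika,
    ]
    for pattern in dedupe_normalized(line):
        print(pattern, [hex(ord(ch)) for ch in pattern])
```

Whether the engine normalizes the book text the same way is a separate question that would need checking before dropping either variant.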

@superuser-does If I make test .pattern files, would you be willing to test them out on your end and see if they do things properly?

@ichnilatis-gr

Here I am! (Unfortunately I couldn't use the nickname "ichnilatis", as it had already been taken by another user.)
Reading your conversation, I feel somewhat uncomfortable that I put you to such trouble.
You can also find the files I would like to be converted to .pattern files at this link.
I also have these files that come from OpenOffice for polytonic Greek and for Ancient Greek. I provide them in case they can help somehow.

@superuser-does I would like to inform you that many books are published in polytonic Greek even in our days.

I'll be at your disposal if I can help somehow, even though I don't know anything about programming.

Thank you again.

@roshavagarga
Contributor Author

@ichnilatis-gr Typically I take the tex files from either here or here. Are yours from any of these sources and, if not, where did you get them from and how do they compare? I'm guessing you agree with having mono as the default for Modern Greek?

Would you be up for testing the files once they're created and do you have the knowledge required to know whether they're doing a proper job or not? I'm asking because I don't speak Greek, but in my own language (Bulgarian), hyphenation is instinctual and easy to guesstimate even for new loanwords from other languages :)

@ichnilatis-gr

I have taken the tex files from the link that pazos provided to me (http://tug.org/tex-hyphen/#languages).
I'm interested more in polytonic and Ancient Greek, as there is already the choice for (modern) Greek in KOReader.
I have the knowledge to check if the pattern is correct.

Thank you again.

@ichnilatis-gr

ichnilatis-gr commented Dec 18, 2020

Pazos told me on the MobileRead forum that the conversion of TeX files can be done with https://gist.github.com/strn/f5c6d9c5242fdc9c49d09f21ecad1ffa
So, I thought it would be something simple for those who know how it works...

@roshavagarga
Contributor Author

roshavagarga commented Dec 18, 2020

@ichnilatis-gr I believe our current Modern Greek file is wrong and actually uses the Ancient Greek tex file, but I'm going off memory. Can you share your experience with the current hyphenation and whether it's proper or not?

The conversion itself is easy, but sometimes things pop up like the ones I discussed with poire-z before. I can convert it fairly easily and do the changes I think might be okay, but we'd still need you or another fluent user to test them out and confirm whether they do things properly - sometimes mistakes pop up in the conversion or the files themselves are just incomplete or don't handle newer words. That's why I asked whether fluent users like yourself can intuitively sense whether the hyphenation being used is the proper one :)

Edit: You can also edit your old comments and add information to them instead of posting a new comment, in case you didn't know. Thanks for the input and help!

@ichnilatis-gr

Which file do you want me to check?
Or do you just want me to tell you the difference of the rules between ancient and polytonic Greek?

@roshavagarga
Contributor Author

@ichnilatis-gr Open a modern Greek book in KOReader, test out the current hyphenation and tell me if it feels right.

Besides that, I need to create the new files and tell you what to change and where to put the files so you can compare them to the current hyphenation.

@ichnilatis-gr

ichnilatis-gr commented Dec 19, 2020

I would like to make clear that even Modern Greek can be polytonic and I think that these polytonic files may use the hyphenation rules of Modern Greek.

I noticed that KOReader uses the hyphenation rules of Ancient Greek when "Greek" is selected.

I give you some differences between Ancient Greek and Modern Greek hyphenation.

Ancient Greek / Modern Greek

μ-π (ἔμ-πο-ρος) / μπ (έ-μπο-ρος)
ν-τ (ἀν-τέ-χω) / ντ (α-ντέ-χω)
γμ (πρᾶ-γμα) / γ-μ (πράγ-μα)
θμ (ἀ-ρι-θμός) / θ-μ (α-ριθ-μός)
χμ (δρα-χμή) / χ-μ (δραχ-μή)
τν (φά-τνη) / τ-ν (φάτ-νη)
φν (δά-φνη) / φ-ν (δάφ-νη)

There are also some differences when a word is a compound.

συν-άγω / συ-νά-γω
ἐξ-έρ-χο-μαι / ε-ξέρ-χο-μαι

The most common compounds are with συν- and δυσ-. These 3 letters stay together in the same syllable in Ancient Greek, while they are separated in Modern Greek (συ-ν*-... / δυ-σ*...).

But I don't know how these rules can be defined.
Also, the files that I took from OpenOffice, which I also use in InDesign through this method, don't apply all of these rules (especially the first one mentioned above).

Nevertheless, I have found a web page that uses these rules. You can see here. But, I don't know if it can help.

Finally, if you are going crazy with all these differences and rules, maybe it would be better to just do the conversion without any other fixes...

@noembryo

noembryo commented Jan 2, 2021

@roshavagarga So, even before I put my wish into words, it was almost granted..
When ready, call for testing (using my limited knowledge of course ;o))

@roshavagarga
Contributor Author

Just a quick update: I unfortunately nuked my HDD by mistake and lost all of my hyphenation work, so I'll start over from scratch at some point :)

@poire-z
Contributor

poire-z commented Dec 8, 2022

There's been some work in upstream (coolreader) crengine about licenses and ensuring it's fully/only GPL: buggins/coolreader#339
I don't think/want/hope we need to care that much :) or have to follow it.

As part of it, @virxkane has updated many hyphenation files, and added some (e.g. Bengali).

Removed files: buggins/coolreader@20ebd0a
Added files: buggins/coolreader@79727db
No idea about the quality differences between removed and added.

And some code change in buggins/coolreader@7925f6b to pick language, tags, and left/right min size from the files themselves - that I'd rather not pick.

@uroybd : any interest in having Bengali hyphenated? I'm not sure if it would work as is, or if it would need more tweaks (because Bengali is considered cursive, and there are many things we avoid doing with cursive text segments).

@uroybd

uroybd commented Dec 9, 2022

> @uroybd : any interest in having Bengali hyphenated? I'm not sure if it would work as is, or if it would need more tweaks (because Bengali is considered cursive, and there are many things we avoid doing with cursive text segments).

Let's not. It's complex and no one uses hyphenation in Bengali nowadays.

@dmalinovsky
Contributor

You can mark Russian hyphenation as complete: the remaining part is "Additional patterns with hyphen/dash", which aren't supported by crengine, as far as I understand. The remainder contains ad-hoc hyphenated words, which can be safely ignored.
