Enhanced text layout: links, thoughts and discussion #307
Pinging some KOReader contributors whose PRs show they must be at ease with some of these languages I'm not: @frankyifei @houqp @chrox, who did some CJK work on crengine, mainly aiming at Chinese I guess. Do you read EPUBs in these languages with KOReader? Is the experience good enough, or not? What's missing, what looks like an easy fix? |
Polish and Czech typography rules require avoiding one-letter prepositions at line endings, so they should be connected to the following word with a non-breaking space. Incorrect (Tom went to a shop in the city to buy apples and rolls.):
Correct:
Many publishers (but not all) add
|
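As an aside, a minimal sketch of how the non-breaking-space rule described above could be applied as a pre-pass over the text (the helper and the one-letter word list are made up for illustration, this is not crengine code):

    #include <string>

    // Replace the ordinary space following a lone one-letter preposition
    // with U+00A0 so the line breaker can no longer split there.
    // The preposition list is an illustrative subset only.
    std::u32string glue_short_prepositions(const std::u32string& text) {
        const std::u32string prepositions = U"aiouwzAIOUWZ"; // pl/cs one-letter words (illustrative)
        std::u32string out = text;
        for (size_t i = 0; i + 1 < out.size(); ++i) {
            bool word_start = (i == 0) || out[i - 1] == U' ' || out[i - 1] == 0x00A0;
            if (word_start && out[i + 1] == U' '
                    && prepositions.find(out[i]) != std::u32string::npos)
                out[i + 1] = 0x00A0; // non-breaking space
        }
        return out;
    }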
I don't have much to say on the CJK/RTL/BIDI front, except to confirm that, yeah, we probably can't switch to pango, for pretty much the reasons you've explained. (Plus, it's possibly going to be rejigged as a thinner wrapper around harfbuzz, which sounds like a great idea, but is still far ahead on the horizon ;)). @shermp introduced libunibreak in FBInk for line-breaking purposes, so I have moderate experience with it. Basically, it parses a UTF-* string, and for each byte index in that string, it fills another buffer with a few different flags (like can't break, must break, can break, ...). What you do with that information is left to you ;). EDIT: That implies having a pretty fast & rock-solid UTF-*/Unicode encoder/decoder, and a pretty solid UTF-8 iterator. We've got a pretty great UTF-8 decoder in FBInk, and a decent one-way iterator (next codepoint) based on it, but a pretty shitty reverse one (prev codepoint), which has led to some pretty gnarly workarounds. |
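For reference, the buffer-filling API described above looks roughly like this (a minimal sketch against libunibreak's documented high-level entry point; error handling and real text omitted):

    #include <stdio.h>
    #include <string.h>
    #include <linebreak.h> // libunibreak

    int main(void) {
        const char *text = "Hello, wonderful world!";
        size_t len = strlen(text);
        char brks[64]; // one flag per input byte

        init_linebreak();
        set_linebreaks_utf8((const utf8_t *)text, len, "en", brks);

        for (size_t i = 0; i < len; i++) {
            // Possible values: LINEBREAK_MUSTBREAK, LINEBREAK_ALLOWBREAK,
            // LINEBREAK_NOBREAK, LINEBREAK_INSIDEACHAR (continuation bytes).
            if (brks[i] == LINEBREAK_ALLOWBREAK)
                printf("break allowed after byte %zu ('%c')\n", i, text[i]);
        }
        return 0;
    }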
@robert00s example raises a good point: I doubt UAX line-breaking rules deal with those kinds of grammatical/typographical conventions, do they? |
UAX line-breaking rules don't include a syntactic parser, nor do I think they should. ;-) Whether it's best to attach a single character like that to the previous word or the next word is something you can't really say otherwise. For example, in a Dutch sentence like "ik ga na" it'd be preferable to keep "ga na" together, but unless you know that it is a conjugated form of the verb "nagaan" (to check up) as opposed to "gaan" (to go) there's no way to tell. Those words could also occur in a different context, e.g., "ik ga na het concert naar huis" (I go after the concert to home). In that case there'd be no preference to keep "ga na" together, possibly the opposite to prevent a "check up" reading. |
I was hoping it would :) so we wouldn't have to bother with all these peculiarities... But it probably doesn't...
In French, it's also expected from publishers with quotation marks, where there is a space after the opening one and before the closing one. I added some code to avoid that in #237 even if the publisher forgot to put |
I don't think my German books tend to have spaces between words and quotation marks like the guillemets in French? (Not that it matters in that case.) |
https://german.stackexchange.com/questions/117/what-is-the-correct-way-to-denote-a-quotation-in-german |
libunibreak is probably the easiest stuff to integrate. I did some quick testing, and it would just come down to adding/replacing this in copyText():

    #if (USE_LIBUNIBREAK==1)
        const char * lang = "fr";
        if (!init_break_context_done) {
            lb_init_break_context(&lbCtx, (utf32_t)m_text[pos], lang);
            init_break_context_done = true;
        }
        else {
            int brk = lb_process_next_char(&lbCtx, (utf32_t)m_text[pos]);
            // printf("between <%c%c>: brk %d\n", m_text[pos-1], m_text[pos], brk);
            if (brk == LINEBREAK_ALLOWBREAK) {
                m_flags[pos-1] |= LCHAR_ALLOW_WRAP_AFTER;
            }
            else {
                m_flags[pos-1] &= ~LCHAR_ALLOW_WRAP_AFTER;
            }
        }
    #endif
        pos++;

and later:

    + #if (USE_LIBUNIBREAK==1)
    +     if (flags & LCHAR_ALLOW_WRAP_AFTER) {
    +         lastNormalWrap = i;
    +     }
    + #else
        if ((flags & LCHAR_ALLOW_WRAP_AFTER) || isCJKIdeograph(m_text[i])) {
            // Need to check if previous and next non-space char request a wrap on
            // this space (or CJK char) to be avoided

It exports its low level API, so we can feed it char by char without the need to allocate another long buffer to get the results. Unlike fribidi, which seems to want the full text buffer, and fills a flags buffer just as long :( I'm a bit surprised that it works this linearly, without any need to go back and correct decisions made 2 or 3 chars earlier - which may mean this UAX#14 algo is just some basic one the Unicode people felt they had to provide, and is just not good enough :)
At this point of text layout, crengine already works with Unicode codepoints, which is nice and is what most of these libraries prefer anyway. (The UTF-8 decoding is done at HTML parsing time, and if it's not fast or solid enough, that's another topic :) |
Of course. This should be an option (default disabled - almost all languages don't have this rule). |
Another note related to the use of fonts:
I hope we can stay with the 2 crengine fonts (main + fallback), to not add another layer of complexity :) Personally, I've made myself a somewhat complete fallback font by merging (with FontForge) a few of those we provide, because each individually has holes...:

    /* Run as : $ /C/Program\ Files\ \(x86\)/FontForge/bin/fontforge.exe -script */
    freesans = "FreeSans.ttf"
    freeserif = "FreeSerif.ttf"
    notosanscjk = "NotoSansCJK-Regular.ttf"
    newfont = "FreeSans-extended.ttf"
    tmpfont = "FreeSans-tmp.ttf"
    Open(freesans)
    /* merge not found glyphs from FreeSerif */
    MergeFonts(freeserif)
    /* segfault if not doing an in-between save */
    Generate(tmpfont, "", 4 /* remove bitmap */)
    Open(tmpfont)
    /* remove symbols (better/bigger in NotoSansCJK) */
    Select(0u2500, 0u27FF)
    DetachAndRemoveGlyphs()
    /* merge not found glyphs from NotoSansCJK */
    MergeFonts(notosanscjk)
    SetFontNames("FreeSansExtended", "FreeSans extended", "FreeSans extended", "Book")
    Generate(newfont, "", 4 /* remove bitmap */)
    Close()

NotoSansCJK has/had wrong glyphs for Greek and Arabic, so I couldn't even decode Greek - FreeSans and FreeSerif have good Greek and Hebrew, but no CJK - and my preferred font only has Latin. So I couldn't get both CJK and Hebrew shown in a same book. Using my preferred (for looks) Latin font, and this fallback font, I rarely see missing glyphs.
Would we be allowed, and would it make sense, to provide such a fallbackenstein font with KOReader? Or is there still too much user preference to make some decisions (like me preferring to start with FreeSans, which is morphologically nearer to my preferred font, than the thin and small but nice FreeSerif)? (not willing to undertake that font building, just asking :) |
Also pinging @virxkane , who brought harfbuzz support into coolreader, that we then picked - for info and advice, to make sure we will do things right for russian too :) |
As @NiLuJe stated, I introduced libunibreak to FBInk when I was implementing truetype/opentype rendering for it. It was chosen because I wanted better line breaking than what FBInk had at the time, and it was really easy to use. As far as paragraph justification goes, some sort of best/total-fit algorithm such as Knuth & Plass (as used in TeX) would be nice to have. The CSS working group actually had some discussions about this at w3c/csswg-drafts#672; one idea floated was using an n-line sliding window to improve line breaking. If you wanted to get really, really fancy, one could dive down the rabbit hole that is the Microtype system implemented in pdfTeX and LuaTeX, as documented here and here. (Ok, not really being serious here, that would be a LOT of work, if even possible...) |
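Purely to illustrate the idea of a total-fit breaker versus first-fit, here is a heavily simplified dynamic-programming sketch in that spirit (fixed-width characters, no stretch/shrink, no hyphenation, no penalties, so definitely not the actual Knuth & Plass algorithm):

    #include <cstdio>
    #include <limits>
    #include <string>
    #include <vector>

    // Break words into lines of at most `width` characters, choosing the set of
    // breakpoints that minimizes the sum of (unused space)^2 over all lines
    // except the last. Assumes no single word is longer than `width`.
    std::vector<std::string> total_fit(const std::vector<std::string>& words, int width) {
        int n = (int)words.size();
        std::vector<double> best(n + 1, std::numeric_limits<double>::infinity());
        std::vector<int> prev(n + 1, 0);
        best[0] = 0;
        for (int j = 1; j <= n; ++j) {           // j = number of words already laid out
            int len = -1;                        // width of a line holding words i..j-1
            for (int i = j - 1; i >= 0; --i) {   // i = first word of the candidate last line
                len += (int)words[i].size() + 1; // +1 for the inter-word space
                if (len > width) break;
                double badness = (j == n) ? 0.0 : double(width - len) * (width - len);
                if (best[i] + badness < best[j]) {
                    best[j] = best[i] + badness;
                    prev[j] = i;
                }
            }
        }
        std::vector<std::string> lines;          // rebuild lines from the chosen breakpoints
        for (int j = n; j > 0; j = prev[j]) {
            std::string line;
            for (int k = prev[j]; k < j; ++k)
                line += (k > prev[j] ? " " : "") + words[k];
            lines.insert(lines.begin(), line);
        }
        return lines;
    }

    int main() {
        std::vector<std::string> words = { "the", "quick", "brown", "fox", "jumps",
                                           "over", "the", "lazy", "dog" };
        for (const auto& line : total_fit(words, 13))
            std::printf("|%-13s|\n", line.c_str());
        return 0;
    }

The real algorithm adds stretchable/shrinkable glue, hyphenation points and penalties, which is where most of its complexity (and quality) comes from.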
Well, crengine indeed doesn't do complicated: it has to do it linearly, char by char, never going back. Anyway, I'm not really an aesthete regarding text layout and fonts, and I'm rarely stopped while reading thinking "wow, that's really ugly" because there would be too many consecutive hyphens or too large or narrow spacing (maybe once every 100 pages :) I'm more often stopped by rivers. So, as far as Latin text is concerned, I find what we have quite OK. What do others think? I'm more inclined for now to fix occasional embedded RTL words or sentences that currently mess up the surrounding Latin text, and possibly have proper western optical margin alignment, while making all that fine for CJK and RTL too. |
Fair call on the line breaking. I know from experience with FBInk that first-fit is difficult enough to get right, and I wasn't even trying to do justification or hyphenation. And yeah, floats would throw a spanner in the works, wouldn't they? Although I would have thought (and I could be totally wrong here) that the only floats that would cause concern would be those that protrude into the previous paragraph/block. I have to admit though that after reading so long using RMSDK, there's always been something slightly off (to me) about how both the Kepub renderer and crengine (among others) do line breaking. I have no idea what Adobe did, but their algorithm is (was) probably the best line breaking I've seen outside of typesetting software. It's a shame Adobe basically abandoned their renderer :( |
With negative margins you mean? Otherwise floats don't really do that. |
Yeah. And that's what I thought. Which is why I would think that so long as the algorithm can deal with differing line widths, floats shouldn't make that much of a difference for a multi-line algorithm. For floats that can be pre-positioned (like dropcaps), place them first before rendering text. For mid-paragraph floats, you could probably reset the line-breaking algorithm at the line the float starts. And of course, I could be talking out my arse, as I am probably completely wrong, so please feel free to ignore me. I'm more having a bit of a thought exercise at this point. |
Except that for such mid-paragraph (embedded) floats, when one is met, it may fit on the line, but if it can't it has to be delayed until the next line. And if you have a complex algo, it may after passing it decide to shorten the text, and oops, the float could have fit after all :) Do the other well-known line breaking algorithms support bidi text? Or are they supposed to work as-is with pure RTL text? edit: But crengine has detected whether floats are present before laying out lines - so this could allow having some enhanced line layout algo, as we could just switch to the current linear algo only when there are floats present (as 99.9999% of the paragraphs of the world don't have any :) |
I guess one potential option is to fallback to first-fit if a block of text contains an embedded float. But yeah. It's hard. I don't blame you at all if you want to stick with first-fit. As to breaking bidi text, I have absolutely no idea how that's supposed to be handled. UAX 14 appears to have a single paragraph on the matter: In bidirectional text, line breaks are determined before applying rule L1 of the Unicode Bidirectional Algorithm [Bidi]. However, line breaking is strictly independent of directional properties of the characters or of any auxiliary information determined by the application of rules of that algorithm. |
It is very difficult to follow all the CJK layout rules. For example, see here Chinese Compression Rules for Punctuation Marks. The practical way is to use the default behaviour of pango and implement some rules later if necessary. If the default result is close to what chrome and firefox give, that would be good enough. |
Thanks, added your link (and a few others) to the first post. Can you please answer my questions about the importance (or not) of the (invisible) grid (the wish for perfect vertical alignment of the ideographs) for Chinese text, which can get messed up (and is, in crengine) when there are some non-fixed-width Latin chars in the line, or when the last char is some left punctuation that needs to be pushed onto the next line (making a hole the size of one or two ideographs at the end of the line)? edit: OK, it's mentioned in your link https://www.w3.org/TR/clreq/#handling_of_grid_alignment_in_chinese_and_western_mixed_text_composition
But I'd still like your subjective opinion (because we currently don't have that, and you seem to be fine without it). I still feel we could have it with a few tricks. About that specific code with the ifs, I expect UAX#14 / libunibreak should set the proper allow/avoid break flags for these CJK punctuations, so we may get a more correct implementation and be able to avoid a bit of that code (or we may have to keep some of it if libunibreak is bad at that). |
@Frenzie @NiLuJe: any specific thoughts on my take on our fonts #307 (comment)? Another thing with crengine font handling is that a tag with |
It makes intuitive sense, but unless something changed in the past decade browsers don't do it that way. (Which needn't mean it's wrong, you'd have to check the spec for that, but it does mean there's no expectation for it.) |
@poire-z : Yeah, I wouldn't complexify the fallback mechanism. Unless someone one day decides to say "fuck it all" and switches to fontconfig ;p. |
So, bidi text support has been done with #309. And I've been contemplating the following idea: we could have/generate some settings per language tag,
that crengine could store/save/hash, and associate that hash to each paragraph so these settings are used when laying out that paragraph, without having to make any decision about the language/chars inside crengine: it would just follow the rules given. Another quicker or uglier option would be to just use CSS and have in our epub.css or in some style tweaks:
That might be super overkill as I don't know how much publishers do use it. But I dunno, it feels like the right generic way to go about that, and may allow easy incremental addition/enabling/disabling of features. (It also feels quite complex to implement, given that our hyphenation and font manager kerning method and fallback handling are global...) |
@poire-z Thanks for the RTL/BIDI support, it is great. I do a lot of Hebrew and can now use KOReader on my Kindle. I will address your questions (as a user) in koreader/koreader#5359 |
@poire-z |
OK, I'm making some progress on some of that stuff.
Would we all be OK with that? Replacing our
I have to keep the now-legacy hyph dict selection working (they will set a language and enable hyphenation), just to avoid the CoolReader devs having to rework their various frontends - and for us, it would work the same way with our current readerhyphenation.lua - but I have minor issues with it (left/right minimal sizes, that CoolReader is not using) that I'd need to rework. Caveats: all the languages/hyph dict names/mappings would be hardcoded into a textlang.cpp, so languages.json would no more be used (and so, not allowing customisation that easily). So, asking for permission :) Would that switch and that new menu be understandable by users? The technical idea is that each text node will get associated a TextLangCfg (from the selected language, or from the language specified in an upper node's lang= attribute), and various text rendering bits of code would use members of this object instead of global defaults:

    TextLangCfg::TextLangCfg( lString16 lang_tag ) {
        printf("TextLangCfg %s created\n", UnicodeToLocal(lang_tag).c_str());
        // Keep the provided and non-lowercase'd lang_tag
        _lang_tag = lang_tag;
        // But lowercase it for our tests
        lang_tag.lowercase();
        _hyph_method = TextLangMan::getHyphMethodForLang(lang_tag);
        // https://drafts.csswg.org/css-text-3/#script-tagging
        // XXX Check for Latn, Hant, Hrkt...
        // XXX 2nd fallback font
    #if USE_HARFBUZZ==1
        _hb_language = hb_language_from_string(UnicodeToLocal(_lang_tag).c_str(), -1);
    #endif
    #if USE_LIBUNIBREAK==1
        _lb_char_sub_func = NULL;
        if ( lang_tag.startsWith("de") ) {
            _lb_props = (struct LineBreakProperties *) lb_prop_cre_German;
        }
        else {
            _lb_props = (struct LineBreakProperties *) lb_prop_cre_Generic;
        }
        if ( lang_tag.startsWith("pl") ) {
            _lb_char_sub_func = &lb_char_sub_func_polish;
            // XXX also for pl: double real hyphen at start of next line
        }
    #endif
    }

and some hardcoded stuff like:
|
Sounds good to me. Assuming the default hyphenation dicts are decent, I'm not aware of anyone actively tweaking the list. And if they do, it'd be a PR away ;). |
Seems fine to me too. |
Found some excellent (I think :) resource on Chinese typography in... Chinese :/ For example, the article about hanging punctuation: And when I thought I could just fix Chinese spacing by squeezing punctuation glyphs (as they seemed to only occupy the left or right half of their square glyph), I learned there that it depends on the language: in Chinese, a
The same text, using the language specific glyphs:
Thanks to the TextLangMan typography stuff, we could delegate some decisions from lvtextfm.cpp to some typography specific functions per-language, but that means many things and combinations to test... |
I've been looking at implementing some alternative hanging punctuation code as envisioned in koreader/koreader#2844 (comment).
I figured I may need, in lvtextfm.cpp, some alternative methods for laying out lines and spacing words (but just that, not redesigning the whole thing!), and so I began looking at other topics like line breaking, bi-directional text, and proper CJK text layout, at least to see how that could fit in and to avoid taking early some wrong directions that would prevent working on these additional features later.
Sadly, I know nothing about languages and writing systems other than western ones... So I have tons of questions for CJK and RTL readers, that I may ask later in this issue, if there are some willing to help on that :)
I have no personal use for all that, as I only read western, but these are some quite interesting topics :)
I sometimes think that it could be simple to get these right by just using the appropriate third-party libraries. But at other moments, I feel that even the libraries won't do all that correctly, and there may be much manual tweaking needed, possibly per language... So, it ends up feeling like opening a can of worms...
Terminology:
CJK = Chinese, Japanese and Korean
RTL = Right To Left (Arabic, Persian, Hebrew... scripts)
LTR = Left To Right (Latin, western languages, CJK...)
Bidi = Bidirectional text (LTR and RTL mixed)
For now, just cutting and pasting and organizing my accumulation of URLs and thoughts.
Unicode text layout references and algorithms:
http://www.unicode.org/reports/tr14/ UAX#14: Unicode Line Breaking Algorithm
http://www.unicode.org/Public/UCD/latest/ucd/LineBreak.txt reference file
http://jkorpela.fi/unicode/linebr.html Unicode line breaking rules: explanations and criticism
https://www.unicode.org/reports/tr29/ UAX#29: Unicode Text Segmentation
http://www.unicode.org/reports/tr9/ UAX#9: Unicode Bidirectional Algorithm
http://www.unicode.org/reports/tr11/ UAX#11: East Asian Width
https://www.w3.org/TR/jlreq/ Requirements for Japanese Text Layout
https://www.w3.org/TR/clreq/ Requirements for Chinese Text Layout
https://w3c.github.io/typography/ International text layout and typography index (links)
https://unicode.org/cldr/utility/breaks.jsp Unicode Utilities (to test algorithms output)
https://drafts.csswg.org/css-text-3/ CSS take on all that enhanced typography
Appendix D, E, F give some insight about writing systems and the importance of the lang= attribute
Sites with valuable information about foreign scripts, languages, typography and chars
https://r12a.github.io/scripts/ Wonderful and complete descriptions of each script, usage, layout
https://r12a.github.io/scripts/phrases Sample phrases in various scripts
https://r12a.github.io/scripts/tutorial/summaries/wrapping Sample phrases for testing wrapping
http://www.alanwood.net/unicode/index.html Dated, but very complete
http://jkorpela.fi/chars/index.html Characters and encodings
http://jkorpela.fi/chars/spaces.html http://jkorpela.fi/dashes.html
https://jrgraphix.net/research/unicode.php Unicode Character Ranges
https://unicode.org/charts/
http://unifoundry.com/unifont/index.html large single image of the full unicode planes
Line breaking & justification
https://en.wikipedia.org/wiki/Line_wrap_and_word_wrap
https://en.wikipedia.org/wiki/Line_breaking_rules_in_East_Asian_languages
https://www.w3.org/International/articles/css3-text/ CSS and International Text (line breaking and text alignment)
https://www.w3.org/International/articles/typography/justification Approaches to full justification
http://w3c.github.io/i18n-drafts/articles/typography/linebreak.en Approaches to line breaking
https://www.w3.org/TR/2003/CR-css3-text-20030514/#justification-prop CSS justification options, describing the various ways to justify appropriately for some scripts
https://github.com/bramstein/typeset/ TeX line breaking algorithm in JavaScript
https://wiki.mozilla.org/Gecko:Line_Breaking Mozilla documentation about line breaking (obsolete? mentions it should switch to UAX#14), implemented in https://github.com/mozilla-services/services-central-legacy/blob/master/intl/lwbrk/src/nsJISx4501LineBreaker.cpp
Hanging punctuation / Optical margin alignment
https://en.wikipedia.org/wiki/Hanging_punctuation
https://en.wikipedia.org/wiki/Optical_margin_alignment
https://askfrance.me/q/comment-bien-choisir-saillie-pour-les-lettres-et-la-ponctuation-hors-36070225344
https://helpx.adobe.com/fr/photoshop/using/formatting-paragraphs.html#specify_hanging_punctuation_for_roman_fonts
https://french.stackexchange.com/questions/1432/whats-hanging-punctuation-in-french
https://drafts.csswg.org/css-text/#hanging https://www.w3.org/TR/css-text-3/#hanging-punctuation-property There is support in CSS, but it's very limited and targeted at CJK
Relevant commit about its implementation in crengine: 3ffe694 (extended to other ideographs by 81bbb8d 59377ba).
I figured we could have both CJK hanging punctuation and western optical margin alignment handled the same way, by using, for each candidate glyph, a percentage of its width to be pushed into the margin.
So, hanging punctuation in CJK can go fully into the margin, because the fixed-width ideogram glyphs have a good amount of blank space, and in the end the space taken in the margin is smaller than the ideogram width. For other western non-fixed-width glyphs (punctuation), we would use a percentage (see the small sketch after the links below). Some suggestions and discussions at:
https://www.w3.org/Mail/flatten/index?subject=Amending+hanging-punctuation+for+Western+typography&list=www-style
https://source.contextgarden.net/tex/context/base/mkiv/font-imp-quality.lua hanging punctuation percentage by char
https://lists.w3.org/Archives/Public/www-style/2011Apr/0276.html
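Purely as an illustration of that per-glyph percentage idea (the characters and values below are made up; real protrusion factors like the ConTeXt ones linked above would be tuned per font and per language):

    #include <cstdint>

    // Hypothetical table: how much of a glyph's width may hang into the margin.
    // 0 = never hangs, 100 = the whole advance width may protrude.
    static int hangPercent(uint32_t ch) {
        switch (ch) {
            case '.': case ',':        return 70;  // small western punctuation
            case '-': case 0x2010:     return 50;  // hyphens: about half
            case 0x201C: case 0x201D:  return 30;  // curly quotes: a little
            case 0x3001: case 0x3002:  return 100; // CJK comma / full stop: the ink fills
                                                   // only half the em square, so the whole
                                                   // glyph can hang without visual overflow
            default:                   return 0;
        }
    }

    // Usage idea: when the last glyph of a justified line is a hanging candidate,
    // let the line exceed the text area by width(ch) * hangPercent(ch) / 100.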
BIDI / RTL:
https://www.w3.org/International/articlelist#direction
https://www.w3.org/International/questions/qa-html-dir Q/A
https://www.w3.org/International/articles/inline-bidi-markup/uba-basics
https://www.w3.org/International/articles/inline-bidi-markup/index.en for inline elements
https://www.w3.org/International/questions/qa-html-dir.en for block elements
http://www.i18nguy.com/markup/right-to-left.html
https://www.mobileread.com/forums/showpost.php?p=3828770&postcount=406 sample-persian-book.epub with screenshots of the expected result
Unrelated to crengine, but to check if we want to make the UI RTL:
https://labs.spotify.com/2019/04/15/right-to-left-the-mirror-world/
https://material.io/design/usability/bidirectionality.html UI
Various articles about the text layout process
https://www.unicodeconference.org/presentations/S5T2-Röttsches-Esfahbod.pdf Text rendering in Chrome (by HarfBuzz author)
https://simoncozens.github.io/fonts-and-layout/ Some (unfinished book) about text layout.
http://litherum.blogspot.com/2015/02/end-to-end-tour-of-text-rendering.html
http://litherum.blogspot.com/2013/11/complex-text-handling-in-webkit-part-1.html Encoding
http://litherum.blogspot.com/2013/11/complex-text-handling-in-webkit-part-2.html Fonts
http://litherum.blogspot.com/2014/02/complex-text-handling-in-webkit-part-3.html Codepoint to Glyph
http://litherum.blogspot.com/2014/04/complex-text-handling-in-webkit-part-3.html Line breaking
http://litherum.blogspot.com/2014/11/complex-text-handling-in-webkit-part-5.html Bidi
http://litherum.blogspot.com/2014/11/complex-text-handling-in-webkit-part-5_22.html Run Layout
http://litherum.blogspot.com/2014/11/complex-text-handling-in-webkit-part-7.html Width Calculations
http://litherum.blogspot.com/2015/04/complex-glyph-positioning.html
http://litherum.blogspot.com/2015/07/knuth-plass-line-breaking-algorithm.html
http://litherum.blogspot.com/2015/10/vertical-text.html
http://litherum.blogspot.com/2017/05/relationship-between-glyphs-and-code.html
Available libraries that could help with that
For illustration, there is a Lua module that provides the full text rendering stack and uses many of these libraries, which is interesting to look at (as it's readable :) and may be the only small complete full stack I found; it shows the order in which things should be done.
https://luapower.com/tr unibreak, fribidi in lua
https://github.com/fribidi/fribidi/issues/30 interesting Q/A between the author and the HB people
https://github.com/luapower/tr/blob/master/tr_research.txt some short notes on these same topics
There is also this Lua layout engine, which has "just enough" wrappers, and has many specific tweaks per language:
https://github.com/simoncozens/sile/ (see justenoughharfbuzz.c, languages/fr.lua...)
https://github.com/Yoxem/sile/commits/master w.i.p. chinese zh.lua adapted from ja.lua
https://github.com/michal-h21/luatex-harfbuzz-shaper
utf8proc
https://github.com/JuliaStrings/utf8proc
http://juliastrings.github.io/utf8proc/doc/
Provides helpers for Unicode categorization (but a bit limited, as it does not provide all of them, like the unicode script - we can't use it to detect if some char is Chinese or Korean).
https://stackoverflow.com/questions/9868792/find-out-the-unicode-script-of-a-character gist in 1st answer gives a simple implementation for detecting script
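For illustration, a range-based check of the kind that gist implements could look like this (only a handful of well-known ranges shown; a real table would cover all of Unicode's Scripts.txt):

    #include <cstdint>

    enum SimpleScript { SCRIPT_OTHER, SCRIPT_HAN, SCRIPT_HANGUL, SCRIPT_HEBREW, SCRIPT_ARABIC };

    // Toy subset of script detection by codepoint range (ranges from the Unicode charts).
    static SimpleScript simpleScript(uint32_t ch) {
        if ((ch >= 0x4E00 && ch <= 0x9FFF) || (ch >= 0x3400 && ch <= 0x4DBF))
            return SCRIPT_HAN;      // CJK Unified Ideographs (+ Extension A)
        if ((ch >= 0xAC00 && ch <= 0xD7A3) || (ch >= 0x1100 && ch <= 0x11FF))
            return SCRIPT_HANGUL;   // Hangul syllables and jamo
        if (ch >= 0x0590 && ch <= 0x05FF)
            return SCRIPT_HEBREW;
        if ((ch >= 0x0600 && ch <= 0x06FF) || (ch >= 0x0750 && ch <= 0x077F))
            return SCRIPT_ARABIC;   // Arabic + Arabic Supplement
        return SCRIPT_OTHER;
    }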
harfbuzz
https://github.com/harfbuzz/harfbuzz
We already use it for font shaping in kerning "best" mode.
It can also provide useful things like direction and script detection of what we throw at it (https://harfbuzz.github.io/harfbuzz-hb-buffer.html), so it may complement utf8proc for some Unicode categorisation. (It includes UCDN https://harfbuzz.github.io/utilities-ucdn.html, so we get additional functions for free).
libunibreak
https://github.com/adah1972/libunibreak implements UAX#14 and UAX#29
https://luapower.com/libunibreak https://github.com/luapower/libunibreak Lua wrapper
https://github.com/HOST-Oman/libraqm/pull/76 open PR to use libunibreak in libraqm
https://github.com/adah1972/libunibreak/issues/16 word breaks are less obvious
This works only on the text nodes in logical order, and could be used in crengine src/lvtextfm.cpp copyText() to set/unset LCHAR_ALLOW_WRAP_AFTER, trusting it and removing our explicit check for isCJKIdeograph() in processParagraph() and other places.
I initially thought our check for isCJKIdeograph() was wrong as it allows breaks after any Korean glyph (which are like syllables), while Korean has words separated by spaces, so we should use spaces like in western scripts. But it looks like Korean, even with spaced words, allows a line break in the middle of such a word. So we're probably already fine with Korean.
libunibreak accepts a language parameter, but it's only used to add a few language-specific line breaking rules, mostly related to quotes (list in https://github.com/adah1972/libunibreak/blob/master/src/linebreakdef.c).
So, I discovered that German strangely closes on left angle quotation marks, and opens on right angle quotation marks :) (so, I guess what I put in #237 might give strange results on German text, unless they don't use spaces on both sides, and only French does that).
Anyway, I'd like us not to have to detect the document or text segment language, nor to have it provided by the frontend, to keep things simple. Dunno if that's a viable wish.
Some discussion about reshaping because of line breaks, and some unsafe_to_break flag that could/should complement our "is_ligature_tail" flag:
https://github.com/harfbuzz/harfbuzz/issues/1463#issuecomment-505592189
https://github.com/linebender/skribo/issues/4
We may also need to pass HB_BUFFER_FLAG_BOT / HB_BUFFER_FLAG_EOT to HarfBuzz for specific shaping at Begin/End Of Text (=paragraph).
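For reference, passing those flags and reading back the unsafe-to-break information looks roughly like this (a sketch against the public HarfBuzz API, needing HarfBuzz >= 1.5 for the glyph flags; buffer filling and shaping elided):

    #include <hb.h>

    void shape_with_paragraph_flags(hb_buffer_t *buf) {
        // Tell the shaper this run really is the beginning and end of the paragraph.
        hb_buffer_set_flags(buf, (hb_buffer_flags_t)
                            (HB_BUFFER_FLAG_BOT | HB_BUFFER_FLAG_EOT));
        // ... hb_buffer_add_utf16()/hb_shape() would go here ...
        unsigned int count;
        hb_glyph_info_t *infos = hb_buffer_get_glyph_infos(buf, &count);
        for (unsigned int i = 0; i < count; i++) {
            if (hb_glyph_info_get_glyph_flags(&infos[i]) & HB_GLYPH_FLAG_UNSAFE_TO_BREAK) {
                // Breaking the line just before this cluster would require re-shaping
                // the run, so a precomputed width can't be trusted here.
            }
        }
    }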
Other code using libunibreak:
https://github.com/geometer/FBReader/blob/master/zlibrary/text/src/area/ZLTextParagraphBuilder.cpp FBReader
https://git.enlightenment.org/core/efl.git/tree/src/lib/evas/canvas/evas_object_textblock.c enlightenment
Note: when a word is followed by multiple spaces, libunibreak sets the allowed break on the last space - crengine will want it on the first space; the others should be marked as collapsed spaces and end up at the beginning of the next word, where they will be ignored.
fribidi
https://github.com/fribidi/fribidi fribidi (implements UAX#9)
This works only on the text buffer in logical order, and fills another buffer (of lUint32, so as large as the text buffer) from which we can get the bidi level of each char (because English can be detected as embedded in some Arabic which is itself part of some English paragraph...). Could be used in crengine src/lvtextfm.cpp copyText() to set that level on each char.
We would then need, in measureText(), to split on bidi level changes to get a new text segment to measure (like we do when there is a font change), as a single text node can have both Latin and Hebrew in it, and harfbuzz expects its buffer to have a single direction and script.
After that, I guess, line breaking should be tweaked (in processParagraph(), and maybe in addLine()): when processing a text line in logical order and splitting words, it should re-order the words according to their origin text segment's bidi level....
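A minimal sketch of getting that per-char embedding level with fribidi's basic entry points (newer fribidi also offers an _ex variant that takes bracket types; this is illustrative, not crengine code, and assumes short text for brevity):

    #include <stdio.h>
    #include <fribidi.h>

    // Print the bidi embedding level of each codepoint (assumes len <= 256 here).
    // Odd levels are RTL runs, even levels LTR: a Latin word embedded in Hebrew
    // inside a Latin paragraph gets levels 0 / 1 / 2.
    void show_levels(const FriBidiChar *text, FriBidiStrIndex len) {
        FriBidiCharType types[256];
        FriBidiLevel levels[256];
        FriBidiParType base_dir = FRIBIDI_PAR_ON; // let fribidi detect the paragraph direction

        fribidi_get_bidi_types(text, len, types);
        fribidi_get_par_embedding_levels(types, len, &base_dir, levels);

        for (FriBidiStrIndex i = 0; i < len; i++)
            printf("U+%04X level %d\n", (unsigned)text[i], (int)levels[i]);
    }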
We've seen that harfbuzz already handles each individual RTL word and renders it correctly (though not so nicely the way we use it currently, see below). And according to https://github.com/fribidi/fribidi/issues/30, there's quite a bit less to do about bidi when we use harfbuzz.
So, it looks to me that we should indeed split lines in logical text order, and as harfbuzz renders a RTL word correctly, we just have to re-order the words.
https://github.com/fribidi/linear-reorder/blob/master/linear-reorder.c provides a generic algo. It looks like we could put a crengine formatted_word_t in that run_t to have them re-ordered. Dunno if it's as simple as that :)
(After that, there may be even more complicated things to do to have text selection and highlighting work with bidi and RTL...)
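To make the "just re-order the words" step concrete, here is a toy version of what such a reorder does (UAX#9 rule L2 applied to whole runs; fribidi's linear-reorder.c achieves the same in a single pass; the Run struct is hypothetical):

    #include <algorithm>
    #include <vector>

    struct Run { int level; /* in crengine this would carry a list of formatted_word_t */ };

    // UAX#9 rule L2 at run granularity: from the highest level down to the lowest
    // odd level, reverse every maximal sequence of runs at that level or higher.
    void reorder_runs(std::vector<Run>& runs) {
        int max_level = 0, min_odd = 255;
        for (const Run& r : runs) {
            if (r.level > max_level) max_level = r.level;
            if ((r.level & 1) && r.level < min_odd) min_odd = r.level;
        }
        for (int level = max_level; level >= min_odd; --level) {
            for (size_t i = 0; i < runs.size(); ) {
                if (runs[i].level < level) { ++i; continue; }
                size_t j = i;
                while (j < runs.size() && runs[j].level >= level) ++j;
                std::reverse(runs.begin() + i, runs.begin() + j);
                i = j;
            }
        }
    }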
Our current harfbuzz implementation ("best") is a bit buggy with text more complex than just western with ligatures.
I thought that it renders RTL words correctly, but even that is not done well: the measurements are all messed up (we don't process clusters correctly if there's some decomposed Unicode), and the way we use the fallback font (no harfbuzz re-shaping, using the main codepoint for all chars that are part of the cluster) gives wrong results.
And there are cases where the bidi algo doesn't say anything, like this reordering of the soft hyphen (and so, should we hyphenate LTR words in bidi text, where the hyphen may then end up in the middle of the line? :)
http://unicode.org/pipermail/unicode/2014-April/thread.html#353 Bidi reordering of soft hyphen
http://www.staroceans.org/myprojects/vlc/modules/text_renderer/freetype/text_layout.c one of the rare examples of the use of fribidi_reorder_line, which I guess we'll have to use too.
One interesting solution to re-shaping with fallback fonts is how it was done in Chrome:
https://lists.freedesktop.org/archives/harfbuzz/2015-October/005168.html font fallback in Chrome
https://chromium.googlesource.com/chromium/src/+/9f6a2b03ccb7091804f173b70b5facff7dffbd61%5E%21/#F8 chrome improved shaping
See also minikin Layout.cpp code below.
We may also need freetype rebuilt against harfbuzz.
libraqm
https://github.com/HOST-Oman/libraqm
http://gtk.10911.n7.nabble.com/pango-vs-libraqm-td94839.html
raqm does not do font fallback and line breaking currently, nor does it do font enumeration. Raqm is designed to add to applications that otherwise have a very simplistic view of text rendering. Ie. they use FreeType and a single font to render single-line text (think, movie subtitles...).
pango
https://developer.gnome.org/pango/stable/
https://github.com/GNOME/pango pango
https://gist.github.com/bert/262331/ sample usage
Pango and libraqm provide higher level functions. They do the full pipeline (unicode preprocess, shaping, bidi, linebreaking, rendering).
But we can't use their high level functions because they don't do as much as crengine (vertical text alignment, inline images, floats), so if we were to use them, we'd need to provide small segments, and we may as well do that with the lower level libraries. Or skip all the crengine services (fonts management, text drawing) and use it instead, and have to re-implement all the crengine higher level functions that pango does not provide. Not my plan :)
Pango has dependencies on glib and fontconfig, which does not look like fun.
The most interesting stuff in pango is in https://github.com/GNOME/pango/blob/master/pango/break.c, where it implements UAX#14 and UAX#29, like libunibreak, but in one single pass, with some additional tweaks for Arabic and Indic scripts (dunno if libunibreak does that as well or not).
Also in pango-layout.c justify_words(): for justification, it does as crengine does: it expands spaces. And if there is not a single one, it switches to adjusting letter spacing (which crengine does not do).
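A rough illustration of that space-expansion approach (hypothetical Word struct, not crengine's or pango's actual data structures): distribute the leftover line width over the expandable spaces, using pango's is_expandable_space test (U+0020 or U+00A0):

    // Spread the missing width of a justified line over its expandable spaces.
    struct Word { int width; bool is_expandable_space; };

    void justify(Word *words, int count, int line_width) {
        int used = 0, spaces = 0;
        for (int i = 0; i < count; i++) {
            used += words[i].width;
            if (words[i].is_expandable_space) spaces++;
        }
        if (spaces == 0 || used >= line_width)
            return;                     // nothing to expand (letter-spacing would be the fallback)
        int extra = line_width - used;
        for (int i = 0; i < count && extra > 0; i++) {
            if (!words[i].is_expandable_space) continue;
            int add = extra / spaces--; // spread the remainder over the remaining spaces
            words[i].width += add;
            extra -= add;
        }
    }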
Others developments/discussions
https://raphlinus.github.io/rust/skribo/text/2019/02/27/text-layout-kickoff.html work towards a rust library
https://gitlab.redox-os.org/redox-os/rusttype/issues/2
Text rendering/Font fallback in Chrome and other browsers
https://chromium.googlesource.com/chromium/src/+/master/third_party/blink/renderer/platform/fonts/README.md
https://gist.github.com/CrendKing/c162f5a16507d2163d58ee0cf542e695
minikin is the library used in Android for text layout with harfbuzz. It's quite tough to find some master authoritative version, because there are many divergent ones... (and the latest Android one does not include some changes provided by the HarfBuzz author, that are available in some other branches or forks). Here are a few links (the interesting file is Layout.cpp):
https://android.googlesource.com/platform/frameworks/minikin/ minikin main repo
https://dl.khadas.com/test/github/frameworks/minikin/libs/minikin/Layout.cpp
https://github.com/abarth/minikin
https://github.com/flutter/engine/blob/master/third_party/txt/src/minikin/Layout.cpp
https://source.codeaurora.org/quic/la/platform/frameworks/minikin with changes from HarfBuzz author
https://github.com/CyanogenMod/android_frameworks_minikin/blob/cm-12.0/libs/minikin/Layout.cpp with changes from Harfbuzz author
https://medium.com/mindorks/deep-dive-in-android-text-and-best-practices-part-1-6385b28eeb94 minikin (android text layout)
CJK (horizontal) layout
@frankyifei said in koreader/koreader#2844 (comment):
It looks like there is nothing special for CJK in pango. If a paragraph is pure CJK, it would do as crengine does, spacing each CJK char, and it would result in a nice regular ideograph grid.
If there is a single Latin letter or punctuation or normal space, the grid would be broken (like it is in the Wikipedia ZH I use for testing CJK).
It's up to us to do the right thing in lvtextfm.cpp. A few ideas: LTEXT_WORD_CAN_ADD_SPACE_AFTER is set on a space or a CJK ideograph. We could have multiple such flags, with different levels of priority, so a western space is preferred to a CJK ideograph, and a CJK punctuation is preferred to a CJK non-punctuation.
Anyway, Pango looks like it does none of that.
A question: with a pure CJK ideograph paragraph, and some line ending with two (or three) "CJK right punctuation" marks, both past the available width, how should that be dealt with?
If there were only one, it could be made hanging in the right margin, and the grid would be fine.
But with two? Would the 2 be hanging in the right margin? Or would they be pushed onto the next line, with the previous regular char, making a hole in the grid at the far right of the previous line (or breaking the grid if we justify that line, as crengine would do, it seems)?
What's the proper way to handle that?
Note: there may be some stuff to fix in crengine to also consider UNICODE_NO_BREAK_SPACE when expanding/shrinking spaces for justification (pango has, in break.c: attrs[i].is_expandable_space = (0x0020 == wc || 0x00A0 == wc);)
Vertical text layout
Low interest, because it looks so much more complicated.
Pinging @xelxebar, who showed interest in all that about vertical text.
Just some questions, because I have no idea how that should work.
I guess it's the whole block element that makes a vertical text section. What's the effect of a <BR>? Go back to the top of the next vertical line? What should happen when there are more <BR> than the max number of vertical lines that could fit in the available width?
I naively thought vertical text could be easily sized:
No idea about how variable font family/style/size, vertical-align, and inline images would work with that :)
How is that supposed to work for long paragraphs that would span multiple pages? Should scroll mode be aware of how page mode has cut the blocks, or can it lay out text on some possibly infinite vertical length? How do browsers (that don't have pages) do it?
If having a go at it, possibly first to fix harfbuzz rendering of embedded RTL, which means implementing full bidi support...
Should this possibly expensive new stuff be used when selecting kerning mode "best" (which is the only one where we use harfbuzz correctly, so a prerequisite), or would we need a "bestest" switch, which, in addition to harfbuzz, would trigger the use of the probably expensive bidi processing?
Or some additional gTextRenderingFlag to enable or not the use of any of the new features (like done for enhanced block rendering)?
I fear starting all that because of the spaghetti mess it will be with so many #ifdef USE_FRIBIDI / #ifdef USE_LIBUNIBREAK if we want crengine to still be able to compile and work without all these... Or a single #ifdef USE_ENHANCED_TEXT_LIBRARIES (which should include USE_HARFBUZZ?)