-
Notifications
You must be signed in to change notification settings - Fork 45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TextLangMan for text typography by language, use libunibreak #337
Conversation
9b110d1
to
3e3b6c8
Compare
Codacy Quality Review checks are just complains because of |
(Travis CI checks are faster in the european mornings :) it just ran in 42m, while yesterday evening, the 4 runs exceeded a timeout of 60m (or 50m, don't remember). |
3e3b6c8
to
d17c777
Compare
Mostly some refactoring to make the private LVBase64Stream in lvxml.cpp be public in lvxml.h.
d17c777
to
d89ae37
Compare
@poire-z The weekend confounds that further. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No real comments, looks pretty good to me 👍
} ent_def_t; | ||
|
||
// From https://html.spec.whatwg.org/multipage/named-characters.html#named-character-references |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hooray! :-)
crengine/src/lvxml.cpp
Outdated
if ( !lStr_cmp( def_entity_table[n].name, entname ) ) { | ||
code = def_entity_table[n].code; | ||
break; | ||
// Straight comparisons for the most common ones |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nbsp is definitely quite common, I've also seen quot and apos a fair bit but much less so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I had early checks for some of them.
But testing, &
took 12 loops (in the binary search), >
and <
around 10, and the ones you suggest may be 5-6.
Adding too many early tests (say 5 if checking for & < > ' nbsp) will make them still use 1-5 loops, and all other will then be +5.
So, I'm a bit torn :)
Going to re-check these numbers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without the early checks:
entities iterations
amp 7
gt 10
lt 8
nbsp 5
quot 9
apos 10
shy 10
eacute 10
Help me decide which are worth an early check (and so, adding a check that will give false to all others).
(nbsp is 5 and may not need an early check, but it's indeed one of the most common - not really sure amp, gt and lt are that popular and need that early check - not sure about apos & quot in ebooks (where the U+20xx left/right angled/not quotations marks have more chances to be used).
Soft hyphens shy is 10 - there might be thousands of them in some books.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
amp is quite widely used online (252 times on this very page); presumably much less so in ebooks because you won't have URL parameters.
I wasn't necessarily suggesting anything though, what was the rationale behind these ones specifically?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what was the rationale behind these ones specifically?
Just the ones I know I always have to substitute with their named entities in other web related projects. So, no real thinking about if it matters here in our ebook context :)
I think I'll go with 2 early checks, just for
and ­
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anyway, in the previous code, that used a linear iteration in a 350-items table, nbsp was first in that table, shy was 14th - and all others far further - so, we won't be slower than before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me! ^_^
d89ae37
to
45b29ca
Compare
This just adds generic support for libunibreak, which will be tweaked by next commit.
Parse and store values from lang= attributes, so we can propagate a TextlangCfg object to all calls dealing with text, which will allow to: - Use specific libunibreak rules for line breaking per lang (i.e. reverted quotation marks in German vs French). - Use the right hyphenation dictionary for each language - Add more specific line breaking tweaks for some languages (some single letter prepositions should not be at end of line in Polish and Czech, real hyphens should be duplicated at start of next line in Portuguese and Polish...) - Give the language tag to Harfbuzz so it can pick the right glyphs for the language (e.g. different glyphs for the same codepoint in zh-CN, zh-TW and ja, and for Bulgarian Cyrillic with some fonts). Update existing global HyphMan to use services from TextLangMan to ensure legacy single global hyphenation. TextLangMan still uses the hyphenation methods defined in hyphman.cpp.
45b29ca
to
e19f4ff
Compare
#include <linebreak.h> | ||
// linebreakdef.h is not wrapped by this, unlike linebreak.h | ||
// (not wrapping results in "undefined symbol" with the original | ||
// function name kinda obfuscated) | ||
#ifdef __cplusplus | ||
extern "C" { | ||
#endif | ||
#include <linebreakdef.h> | ||
#ifdef __cplusplus | ||
} | ||
#endif | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Umm, code does the inverse of comment?
I'd naively assume you'd want C linking on both, actually? It's a C API, it expects C unmangled symbols.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, it works because upstream's <linebreak.h>
already enforces C linking w/ C++, but NOT <linebreakdef.h>
.
TL;DR: It works as-is, but I'd still explicitly move both under C linking here, to avoid future readers having to delve into libunibreak's headers like I just did ;).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, it works because upstream's <linebreak.h> already enforces C linking w/ C++, but NOT <linebreakdef.h>.
Isn't what my comment says ?:
// linebreakdef.h is not wrapped by this, unlike linebreak.h
Guess my indentation (to make that stuff an aside) is confusing :)
I'd still explicitly move both under C linking here,
This would result in
extern "C" { extern "C" { <linebreak.h content> } }
right ? No issue with that ? It compiles.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, right, I see what you meant. I initially read that as "I'm not wrapping this...", while you actually meant the header itself ;).
I hadn't thought about the nested externs, but if it builds, I'll take it ,p.
Specs apparently say:
Linkage specifications nest. When linkage specifications nest, the innermost one determines the language linkage.
So, we're good to go ;).
if ( lang_cfg->hasLBCharSubFunc() ) { | ||
next_c = lang_cfg->getLBCharSubFunc()(txt+start, i+1, len-1 - (i+1)); | ||
} | ||
int brk = lb_process_next_char(&lbCtx, (utf32_t)next_c); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:D
(I cringe every time I remember CRe uses uint16_t for text, which is just wrong on Linux).
(IIRC, in this context, that shouldn't be an issue with libunibreak, stuff is sane if you happen to point to the middle of a multibyte codepoint).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, saving grace: it's actually a wchar_t, which is why stuff mostly works. Name is just confusing, because wrong on Linux (where wchar_t is actually sane and 32 bits, unlike on Windows where it's 16 bits for some probably stupid legacy reason).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, we had this conversation before :) #252 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note, wchar_t is 16 bits on Windows, because Windows uses UTF-16 for all unicode string handling. Therefore a null terminated array of wchar_t on Windows is a UTF-16 string.
Wheras on most other OS's, I imagine wchar_t is mainly used to store codepoints.
This also makes cross platform path handling a right PITA, because unicode filenames must be in UTF-16 (or UCS2), and fopen() doesn't work... :( The Win32 API SUCKS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shermp : can I request your native english speaker opinion about the use of the word "Honor" in Would you like to honor or ignore embedded lang tags by default?
, cf koreader/koreader#6072 (comment) and followup discussion ? Or alternative suggestions ? Thanks :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, honor (or honour, because I speak the queen's English, dammit) is probably the right term here.
Although one might just ask: (Do you wish/would you like) to ignore embedded lang tags by default?
with a yes/no, or perhaps, more verbosely: We honor embedded lang tags by default, would you like to ignore them instead?
Some slightly related observation: I have a book which uses inline-block for footnote links: Some of these footnotes links follow a closing quote French, which considers but if I select EnglishUS, which considers So, I'd be happier with that EnglishUS rendering - but it does not help with the first case above when there is no I considered for a moment adding an option to not enable language specific line breaking rules, that we could use with books that do it properly with appropriate nbsp, like this book does - but as this would not totally solve this situation (the first case above), I'm dropping the idea. Anyway, in that case, as it shows badly with Calibre too, I guess it's a publisher issue. In UAX#14:
I guess it's fine/better to break before/after images in general, so probaly best to not do anything in the code. Any idea how I could go at solving that (preventing a break before such inline-block), with style tweaks or else? Any other idea/thought? |
Not really I'm afraid. |
Regarding my issue above, I can now solve it after #345 in 2 ways: With: a.footnotecall { display: inline !important; }
a.footnotecall:before { content: "\2060" } or inlineBox { white-space: nowrap; } to prevent a wrap on both side of inline-block. Actually, I initially went to quickly implement pseudo elements, to be able to add a But I was frustrated with the inline-block issue, so I went to hack white-space to be able to specify Oh, and when all that was coded and ready to test, I had finished that book where I needed it... |
@virxkane : I regularly follow your https://github.com/virxkane/coolreader/commits/koreader-merge-post - I look and usually pick your stuff - but when you're cherry picking some of my (huge) commits, I can't really notice if you did fix some bug when adapting them. Could you keep letting me know if you find some bug and fix it as part of the cherry picked commit (just bugs or typos, not the needed adaptations you have to do for the few differences we have). I just by chance noticed this minor thing in your todays' picks - that I'll fix on my side: - friend TextLangCfg;
+ friend class TextLangCfg; Btw, for the TextLangMan stuff, dunno if you saw that in the first post of this PR:
Which means it should stay compatible with your current frontend code, and should not need any change as a first step: you can keep just setting hyphenation dicts with the current Hyphen:: methods, and it will pick the language associated.
|
I always try not to change the source while making cherry-pick (exception - conflict resolution). I do adaptation in the next commit. To prevent this from happening: your commit under someone else's authorship: plotn/coolreader@cba0e06. Or this is really you wrote?
This change is so small that I did not make a separate commit. But, I think, ommiting the keyword 'class' in 'friend' clause is not error.
At this moment not all you things work yet. I must do some work around this PR. Can you upload some test files wich you demonstated in #337 (comment)? |
Of course not :) This fork/branck is really a mess and totally unusable/unfollowable. Hopefully, it's mostly android frontend changes, and nothing much about the engine.
linebreaking_lang_test_files.zip |
Adapting for CoolReader... Sorry, but I can't not write this. This spaghetti code such... Om nom nom :) |
I initially thought the same about the whole crengine :) Seriously, which part ? |
Yes, crengine (HyphMan also) already spaghetti. |
Well, I made TextLangMan like HyphMan a single/global/static class instance - because somehow, that makes sense: hyphenation and TextLangCfg instances can (and should, to avoid duplicating hyphenation dicts or lang properties) be shared between multiple documents (you can have multiple docs on CR, we don't on KOReader). Oh, and yes: I think you should just use of one these 2 ! Either you use only the legacy Hyphman props like PROP_HYPHENATION_DICT - or you use the PROP_TEXTLANG_MAIN_LANG and friends. And yes, the interaction between TextLangMan <> Hyphman are complicated and tedious. I did not want to change Hyphman too much (mainly, because I want a clean git history with a real log of the past), otherwise, I would have just taken the hyph methods code, and drop the rest. For me, the only issue for you would have been with |
Ok, @poire-z thank you very much. |
I had 2 targets:
So, your adaptation looks a bit hybrid :) Dunno if you went looking at our frontend changes for this switch from HyphMan to TextLangMan: see koreader/koreader#6072. |
While testing this PR: if I replace ISO639-1 language code with ISO639-2 (or ISO639-3) tag 'lang' not work anymore - nor hyphenation nor HarfBuzz's font scripting selection. For example, replace 'bg' with 'bul'. Tag 'lang' specification: added: |
Ok, in https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry 3-ALPHA codes ommited if exist suitable 2-ALPHA code. But I unsure that all files in internet strictly conforms to the specification. |
I think 'eng' or 'bul' are not expected in HTML lang tags (and I guess HarfBuzz does not accept them, https://github.com/harfbuzz/harfbuzz/blob/d5439232946333b60f655d9ed37ec7dadf439287/src/hb-ot-tag-table.hh#L16-L114 ). But we may find them in books metadata. |
I found fb2 book with specified language 'eng'.
Ok, I'll think about it. |
@poire-z As you requested report about SEGFAULT (related to this PR). You introduce multiple construction |
You don't mention the line, but may be I've already fixed it in bc4500a that you may be have not yet picked ? |
It seems like that. |
@poire-z But still bug is not fixed. Try file Petra.AR.epub crengine/crengine/src/lvtextfm.cpp Line 1102 in c14ea51
If pos is equal zero asan tell me about heap-buffer-overflow. |
Am I corrected correctly? |
You mean you get a crash? I don't get any crash with the Petra.AR.epub from the DocumentsForTestingRTL.zip from buggins/coolreader#125 (comment) :/
Crash or just analyzer warning ? Of course, if pos=0 and we write at pos-1, it should complain. But aren't we wrapping this with
Looks correct. |
Or do you mean https://github.com/virxkane/coolreader/commit/086c571c8bfaa711a8c6f9e13b9e52f349fdcf12 did fix your asan issue and all is now fine. (And it's just that for some reason I don't really need to know, I did not get a crash.) |
it's not true crash, it's AddressSanitizer error log (this does not make the bugs less harmless).
Yes.
Ok, but you are overwriting some data on the heap:
Of course, line numbers are different ... |
OK, I get it - I witnessed in the past that I sometimes did not crash when I wrote just one byte to the left - I needed to write to the 2nd one to get a crash :) |
Use libunibreak for line breaking
Adds TextLangMan for text typography by language
Implement a bit more of the stuff discussed in #307.
What these commits will allow is detailed at #307 (comment)
So, this:
will render in "best" mode (full harfbuzz) as:
I'll bump this up to frontend first without any change to base and frontend, as it should work as-currently with our ReaderHyphenation module (just to have a nightly with this for reference).
And the next day, I'll do the ReaderHyphenation > ReaderTypography swap, that we can discuss in its PR.
One thing to note is that now, we might be loading and keep loaded multiple hyphenation dictionaries (which will use at max 1Mb of RAM per hyph dict). The TextLangCfg objects are also kept globally and will stick even when switching documents (but they are cheap).
Also note for CoolReader devs: CR on Android might use
HyphMan::activateDictionaryFromStream()
, which I tried to adapt and make right - but I couldn't test it.Also includes:
Add support for <img src="data:image/png;base64,...>
will allow closing koreader/koreader#5529
Text: fix standalone BR not making an empty line
Fix BR with "display: block" not making an empty line
Fix issues noticed at #172 (comment)
XML parsing: add more HTML5 named entities, optimize search
because why not ? (note that this may cause shifts in highlights in a text nodes that have some of the previously unsupported named entities...)
This change is