TextLangMan for text typography by language, use libunibreak #337

poire-z · 2020-04-17T19:38:11Z

Use libunibreak for line breaking
Adds TextLangMan for text typography by language
Implement a bit more of the stuff discussed in #307.
What these commits will allow is detailed at #307 (comment)

Parse and store values from lang= attributes, so we can
propagate a TextlangCfg object to all calls dealing with
text, which will allow us to:
- Use specific libunibreak rules for line breaking per lang
  (e.g. reversed quotation marks in German vs French).
- Use the right hyphenation dictionary for each language
- Add more specific line breaking tweaks for some languages
  (some single letter prepositions should not be at end of
  line in Polish and Czech, real hyphens should be duplicated
  at start of next line in Portuguese and Polish...)
- Give the language tag to Harfbuzz so it can pick the
  right glyphs for the language (e.g. different glyphs
  for the same codepoint in zh-CN, zh-TW and ja, and for
  Bulgarian Cyrillic with some fonts).

Update existing global HyphMan to use services from
TextLangMan to ensure legacy single global hyphenation.
TextLangMan still uses the hyphenation methods defined
in hyphman.cpp.

So, this:

will render in "best" mode (full harfbuzz) as:

I'll bump this up to frontend first without any change to base and frontend, as it should work as-currently with our ReaderHyphenation module (just to have a nightly with this for reference).
And the next day, I'll do the ReaderHyphenation > ReaderTypography swap, that we can discuss in its PR.

One thing to note is that we might now load and keep loaded multiple hyphenation dictionaries (each of which will use at most 1 MB of RAM). The TextLangCfg objects are also kept globally and will stick around even when switching documents (but they are cheap).

Also note for CoolReader devs: CR on Android might use HyphMan::activateDictionaryFromStream(), which I tried to adapt and make right - but I couldn't test it.

Also includes:

Add support for <img src="data:image/png;base64,...>
will allow closing koreader/koreader#5529

Text: fix standalone BR not making an empty line
Fix BR with "display: block" not making an empty line
Fix issues noticed at #172 (comment)

XML parsing: add more HTML5 named entities, optimize search
because why not? (Note that this may cause shifts in highlights in text nodes that contain some of the previously unsupported named entities...)


poire-z · 2020-04-17T20:04:00Z

The Codacy Quality Review checks are just complaining because of #ifdef and macros using variables that Codacy doesn't see. Added some comments to make that less confusing.

poire-z · 2020-04-18T08:09:23Z

(Travis CI checks are faster in the European mornings :) It just ran in 42m, while yesterday evening the 4 runs exceeded a timeout of 60m (or 50m, I don't remember).)

Mostly some refactoring to make the private LVBase64Stream in lvxml.cpp be public in lvxml.h.

Frenzie · 2020-04-18T09:22:49Z

@poire-z The weekend confounds that further.

Frenzie

No real comments, looks pretty good to me 👍

Frenzie · 2020-04-18T09:29:31Z

crengine/src/lvxml.cpp

 } ent_def_t;

+// From https://html.spec.whatwg.org/multipage/named-characters.html#named-character-references


Hooray! :-)

Frenzie · 2020-04-18T09:31:30Z

crengine/src/lvxml.cpp

-                        if ( !lStr_cmp( def_entity_table[n].name, entname ) ) {
-                            code = def_entity_table[n].code;
-                            break;
+                    // Straight comparisons for the most common ones


nbsp is definitely quite common, I've also seen quot and apos a fair bit but much less so.

Yeah, I had early checks for some of them.
But testing, amp took 12 loops (in the binary search), gt and lt around 10, and the ones you suggest may be 5-6.
Adding too many early tests (say 5 if checking for amp, lt, gt, apos and nbsp) will make them still use 1-5 loops, and all others will then be +5.
So, I'm a bit torn :)
Going to re-check these numbers.

Without the early checks:

| entities | iterations |
|----------|------------|
| amp      | 7          |
| gt       | 10         |
| lt       | 8          |
| nbsp     | 5          |
| quot     | 9          |
| apos     | 10         |
| shy      | 10         |
| eacute   | 10         |

Help me decide which are worth an early check (and so, adding a check that will give false to all others).
(nbsp is 5 and may not need an early check, but it's indeed one of the most common. I'm not really sure amp, gt and lt are that popular and need that early check, and I'm not sure about apos and quot in ebooks, where the U+20xx left/right, angled or not, quotation marks are more likely to be used.)
Soft hyphens shy is 10 - there might be thousands of them in some books.

amp is quite widely used online (252 times on this very page); presumably much less so in ebooks because you won't have URL parameters.

I wasn't necessarily suggesting anything though, what was the rationale behind these ones specifically?

what was the rationale behind these ones specifically?

Just the ones I know I always have to substitute with their named entities in other web related projects. So, no real thinking about if it matters here in our ebook context :)
I think I'll go with 2 early checks, just for nbsp and shy.

Anyway, in the previous code, that used a linear iteration in a 350-items table, nbsp was first in that table, shy was 14th - and all others far further - so, we won't be slower than before.

Sounds good to me! ^_^
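
To make the trade-off discussed above concrete, here is a minimal sketch of a lookup that tries two straight comparisons first and then falls back to a binary search over a name-sorted table. It is not the code from lvxml.cpp: the table is a tiny excerpt, the names `EntityDef`, `kEntities` and `lookupEntity` are made up for illustration, and plain `char` strings with `strcmp` stand in for whatever string type and comparison the engine really uses.

```cpp
#include <cstring>

struct EntityDef {
    const char* name;
    unsigned    code;
};

// Tiny excerpt of a named-entity table, sorted by name in strcmp order
// so that a binary search can be used on it.
static const EntityDef kEntities[] = {
    {"amp",    0x0026},
    {"apos",   0x0027},
    {"eacute", 0x00E9},
    {"gt",     0x003E},
    {"lt",     0x003C},
    {"nbsp",   0x00A0},
    {"quot",   0x0022},
    {"shy",    0x00AD},
};
static const int kEntityCount =
    static_cast<int>(sizeof(kEntities) / sizeof(kEntities[0]));

// Returns the code point for `name`, or 0 when the name is unknown.
unsigned lookupEntity(const char* name) {
    // Straight comparisons for the two entities assumed to be the most
    // frequent in ebooks; every other name pays for these two tests.
    if (std::strcmp(name, "nbsp") == 0) return 0x00A0;
    if (std::strcmp(name, "shy") == 0)  return 0x00AD;

    // Binary search for everything else.
    int lo = 0;
    int hi = kEntityCount - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        int cmp = std::strcmp(name, kEntities[mid].name);
        if (cmp == 0) return kEntities[mid].code;
        if (cmp < 0) hi = mid - 1;
        else         lo = mid + 1;
    }
    return 0;
}
```

With only two early checks, the cost added to every other entity stays at two failed string comparisons, which is the compromise settled on in this thread.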

This just adds generic support for libunibreak, which will be tweaked by next commit.


NiLuJe · 2020-04-19T00:42:07Z

crengine/include/textlang.h

+#include <linebreak.h>
+    // linebreakdef.h is not wrapped by this, unlike linebreak.h
+    // (not wrapping results in "undefined symbol" with the original
+    // function name kinda obfuscated)
+    #ifdef __cplusplus
+    extern "C" {
+    #endif
+#include <linebreakdef.h>
+    #ifdef __cplusplus
+    }
+    #endif
+#endif


Umm, code does the inverse of comment?

I'd naively assume you'd want C linking on both, actually? It's a C API, it expects C unmangled symbols.

Ah, it works because upstream's <linebreak.h> already enforces C linking w/ C++, but NOT <linebreakdef.h>.

TL;DR: It works as-is, but I'd still explicitly move both under C linking here, to avoid future readers having to delve into libunibreak's headers like I just did ;).

Ah, it works because upstream's <linebreak.h> already enforces C linking w/ C++, but NOT <linebreakdef.h>.

Isn't that what my comment says?:
// linebreakdef.h is not wrapped by this, unlike linebreak.h
Guess my indentation (to make that stuff an aside) is confusing :)

I'd still explicitly move both under C linking here,

This would result in
extern "C" { extern "C" { <linebreak.h content> } }
right ? No issue with that ? It compiles.

Oh, right, I see what you meant. I initially read that as "I'm not wrapping this...", while you actually meant the header itself ;).

I hadn't thought about the nested externs, but if it builds, I'll take it ,p.

Specs apparently say:

Linkage specifications nest. When linkage specifications nest, the innermost one determines the language linkage.

So, we're good to go ;).
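
For future readers, a small self-contained sketch of the nesting rule quoted above. It does not use libunibreak; `lb_demo_add` and `demo_overload` are hypothetical names. The first part mimics a header that already wraps itself being wrapped again by the includer, the second part shows that the innermost specification is the one that counts.

```cpp
// Outer wrapper, as added by the code that includes the header.
extern "C" {
// Inner wrapper, as a self-wrapping header would bring along itself.
extern "C" {
int lb_demo_add(int a, int b);  // hypothetical C function
}
}

// The innermost linkage specification wins: overloading is only possible
// with C++ linkage, so these two declarations compile only because the
// inner extern "C++" overrides the outer extern "C".
extern "C" {
extern "C++" {
int demo_overload(int);
int demo_overload(double);
}
}

extern "C" int lb_demo_add(int a, int b) { return a + b; }
int demo_overload(int) { return 1; }
int demo_overload(double) { return 2; }
```

So doubling the `extern "C"` around an already wrapped header is redundant but harmless.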

NiLuJe · 2020-04-19T00:57:56Z

crengine/src/lvrend.cpp

+                if ( lang_cfg->hasLBCharSubFunc() ) {
+                    next_c = lang_cfg->getLBCharSubFunc()(txt+start, i+1, len-1 - (i+1));
+                }
+                int brk = lb_process_next_char(&lbCtx, (utf32_t)next_c);


:D

(I cringe every time I remember CRe uses uint16_t for text, which is just wrong on Linux).

(IIRC, in this context, that shouldn't be an issue with libunibreak, stuff is sane if you happen to point to the middle of a multibyte codepoint).

Okay, saving grace: it's actually a wchar_t, which is why stuff mostly works. Name is just confusing, because wrong on Linux (where wchar_t is actually sane and 32 bits, unlike on Windows where it's 16 bits for some probably stupid legacy reason).

Yep, we had this conversation before :) #252 (comment)

Note, wchar_t is 16 bits on Windows, because Windows uses UTF-16 for all unicode string handling. Therefore a null terminated array of wchar_t on Windows is a UTF-16 string.

Whereas on most other OSes, I imagine wchar_t is mainly used to store codepoints.

This also makes cross platform path handling a right PITA, because unicode filenames must be in UTF-16 (or UCS2), and fopen() doesn't work... :( The Win32 API SUCKS.
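
As an illustration of why the width matters when feeding an API that expects one UTF-32 value per character: with 16-bit units, code points above U+FFFD's plane are stored as surrogate pairs and have to be recombined first. The following is only a sketch of that conversion using the standard `char16_t`/`char32_t` types; `utf16ToUtf32` is a made-up helper, and nothing here is claimed to be what crengine does.

```cpp
#include <cstddef>
#include <string>

// Decode UTF-16 code units into UTF-32 code points.
// Unpaired surrogates are replaced with U+FFFD.
std::u32string utf16ToUtf32(const std::u16string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t u = in[i];
        bool isHigh = (u >= 0xD800 && u <= 0xDBFF);
        bool hasLow = (i + 1 < in.size()
                       && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF);
        if (isHigh && hasLow) {
            // Combine a high/low surrogate pair into one code point.
            char32_t low = in[i + 1];
            out.push_back(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00));
            ++i;  // the low surrogate has been consumed
        } else if (u >= 0xD800 && u <= 0xDFFF) {
            out.push_back(0xFFFD);  // lone surrogate
        } else {
            out.push_back(u);       // BMP code point, copied as-is
        }
    }
    return out;
}
```

Where the text type is already 32 bits wide, as described above for Linux, no such step is needed and each element can be handed over directly.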

@shermp : can I request your native english speaker opinion about the use of the word "Honor" in Would you like to honor or ignore embedded lang tags by default?, cf koreader/koreader#6072 (comment) and followup discussion ? Or alternative suggestions ? Thanks :)

Yeah, honor (or honour, because I speak the queen's English, dammit) is probably the right term here.

Although one might just ask: (Do you wish/would you like) to ignore embedded lang tags by default? with a yes/no, or perhaps, more verbosely: We honor embedded lang tags by default, would you like to ignore them instead?

poire-z · 2020-05-14T08:01:49Z

Some slightly related observation:

I have a book which uses inline-block for footnote links:

which, because inline-block and images are considered breakable before and after by http://www.unicode.org/reports/tr14/#CB , renders as:

which is not super nice - but Calibre renders it the same.

Some of these footnotes links follow a closing quote », and there, there is a difference depending on the chosen typography language:

French, which considers » as a closing punctuation, forbids a break before it, but allows a break after it:

but EnglishUS, which considers » as a quotation mark (http://www.unicode.org/reports/tr14/#QU), prevents a break on both sides:

So, I'd be happier with that EnglishUS rendering - but it does not help with the first case above when there is no » to help with it.

I considered for a moment adding an option to not enable language specific line breaking rules, that we could use with books that do it properly with appropriate nbsp, like this book does - but as this would not totally solve this situation (the first case above), I'm dropping the idea.

Anyway, in that case, as it shows badly with Calibre too, I guess it's a publisher issue.

In UAX#14:

Object-specific line break behavior is best implemented by querying the object itself, not by replacing the CB line breaking class by another class.

LB1 Assign a line breaking class to each code point of the input. Resolve AI, CB, CJ, SA, SG, and XX into other line breaking classes depending on criteria outside the scope of this algorithm.

LB20 Break before and after unresolved CB (= objects)
Conditional breaks should be resolved external to the line breaking rules. However, the default action is to treat unresolved CB as breaking before and after.
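
The interaction described above can be reproduced with a deliberately tiny model of three UAX#14 rules, applied in their numeric order: LB13 (no break before CL), LB19 (no break before or after QU) and LB20 (break before and after unresolved CB). This is a sketch with made-up names (`LB`, `breakAllowed`), not libunibreak's implementation, and it ignores every other rule and class.

```cpp
// Only the four line breaking classes needed for this example.
enum class LB { AL, CL, QU, CB };

// Returns true when a break opportunity exists between two adjacent
// characters of class `before` and `after`.
bool breakAllowed(LB before, LB after) {
    if (after == LB::CL) return false;                      // LB13
    if (before == LB::QU || after == LB::QU) return false;  // LB19
    if (before == LB::CB || after == LB::CB) return true;   // LB20
    return false;  // anything else is outside this sketch
}
```

With » classified as CL (the French case), `breakAllowed(LB::CL, LB::CB)` is true, so the inline-block footnote link can wrap to the next line. With » classified as QU (the EnglishUS case), LB19 applies first and the break is refused on both sides. When the inline-block directly follows a letter, `breakAllowed(LB::AL, LB::CB)` is true whatever the language.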

I guess it's fine/better to break before/after images in general, so probably best to not do anything in the code.

Any idea how I could go about solving that (preventing a break before such inline-block), with style tweaks or else?
Only idea that comes to mind would be using this (that crengine does not support):
a.footnotecall:before { content: " " }
that I guess would prevent libunibreak from allowing the break before.

Any other idea/thought?

Frenzie · 2020-05-14T09:11:12Z

Not really I'm afraid.

poire-z · 2020-06-04T17:56:47Z

Regarding my issue above, I can now solve it after #345 in 2 ways:

With:

a.footnotecall { display: inline !important; }
a.footnotecall:before { content: "\2060" }

or

inlineBox { white-space: nowrap; }

to prevent a wrap on both sides of the inline-block.

Actually, I initially went to quickly implement pseudo elements, to be able to add a non-breaking space before the inlineBox with ::before.
And when I got to test it, I realized (and I knew that all along while coding it since I read the specs) that the :before is inserted inside the inlineBox... so it doesn't help at all :)
Then, I realized I could just switch the publisher's display: inline-block to display: inline and it's mostly fine. But it's better when using content: "\2060".

But I was frustrated with the inline-block issue, so I went to hack white-space to be able to specify white-space: nowrap on images and inlineBox, so we can prevent these wraps around.

Oh, and when all that was coded and ready to test, I had finished that book where I needed it...

poire-z · 2020-06-29T19:24:03Z

@virxkane : I regularly follow your https://github.com/virxkane/coolreader/commits/koreader-merge-post - I look and usually pick your stuff - but when you're cherry picking some of my (huge) commits, I can't really notice if you did fix some bug when adapting them. Could you keep letting me know if you find some bug and fix it as part of the cherry picked commit (just bugs or typos, not the needed adaptations you have to do for the few differences we have).
You could just leave some small comment around the affected lines by reviewing our commit in https://github.com/koreader/crengine/commits/master - and I'll go look at how you fixed it around there in your cherry picked commit.

I just by chance noticed this minor thing in your picks from today - that I'll fix on my side:

-    friend TextLangCfg;
+    friend class TextLangCfg;

Btw, for the TextLangMan stuff, dunno if you saw that in the first post of this PR:

I'll bump this up to frontend first without any change to base and frontend, as it should work as-currently with our ReaderHyphenation module (just to have a nightly with this for reference).
And the next day, I'll do the ReaderHyphenation > ReaderTypography swap, that we can discuss in its PR.

Which means it should stay compatible with your current frontend code, and should not need any change as a first step: you can keep just setting hyphenation dicts with the current Hyphen:: methods, and it will pick the language associated.
There's just one thing that I have not tested, and you might need to check/fix:

Also note for CoolReader devs: CR on Android might use HyphMan::activateDictionaryFromStream(), which I tried to adapt and make right - but I couldn't test it.

virxkane · 2020-06-30T06:19:58Z

@poire-z

Could you keep letting me know if you find some bug and fix it as part of the cherry picked commit (just bugs or typos, not the needed adaptations you have to do for the few differences we have).

I always try not to change the source when cherry-picking (the exception being conflict resolution); I do the adaptation in the next commit. This is to prevent things like your commit ending up under someone else's authorship: plotn/coolreader@cba0e06. Or did you really write that one?
Yes, of course, if I find some bugs, I will write about them.

-    friend TextLangCfg;
+    friend class TextLangCfg;

This change is so small that I did not make a separate commit. But I think omitting the keyword 'class' in a 'friend' clause is not an error.

Btw, for the TextLangMan stuff, dunno if you saw that in the first post of this PR:

At the moment, not all of your things work yet. I must do some work around this PR. Can you upload some of the test files which you demonstrated in #337 (comment)?

poire-z · 2020-06-30T06:36:01Z

This is to prevent things like your commit ending up under someone else's authorship: plotn/coolreader@cba0e06. Or did you really write that one?

Of course not :) This fork/branch is really a mess and totally unusable/unfollowable. Hopefully, it's mostly android frontend changes, and nothing much about the engine.

Can you upload some of the test files which you demonstrated

linebreaking_lang_test_files.zip
A few test files I've been using these last months. The one you're after is test-linebreaking.html - but others might be useful, for some commits you haven't yet picked.

virxkane · 2020-07-01T19:30:24Z

Adapting for CoolReader... Sorry, but I can't not write this. This spaghetti code such... Om nom nom :)
Sorry, again.

poire-z · 2020-07-01T20:01:38Z

I initially thought the same about the whole crengine :)
And I always try to adjust to the style of what I'm modifying, so I guess I succeeded ! :)

Seriously, which part ?
HyphMan, that was initially spaghetti, and that I just tried to adapt, keeping the same API, and make it a wrapper to the new TextLangMan ? (doing this was painful, and I did it mainly for you :) I thought it shouldn't need any adaptation.)
Or TextLangMan itself, which is really simple :/ (if that, you'll have made me sad :)
Or else ?
Anyway, I'm always learning, so comments and suggestions welcome.

virxkane · 2020-07-01T20:14:59Z

Yes, crengine (HyphMan also) is already spaghetti.
But in the new code: TextLangMan -> HyphMan, HyphMan -> TextLangMan, static fields of TextLangMan penetrating LVDocView... uhh, it is very difficult to understand...
Of course, no complaints.

poire-z · 2020-07-01T20:28:22Z

Well, I made TextLangMan like HyphMan a single/global/static class instance - because somehow, that makes sense: hyphenation and TextLangCfg instances can (and should, to avoid duplicating hyphenation dicts or lang properties) be shared between multiple documents (you can have multiple docs on CR, we don't on KOReader).
As far as I can see, there are the same little things in lvdocview.cpp for TextLangMan and Hyphman: setting 4 or 5 properties by calling some methods of these 2 global static class instances.
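
To illustrate the sharing argument, here is a minimal sketch of a process-wide registry that creates one configuration object per language tag on first use and hands the same instance back afterwards. `LangCfg` and `LangRegistry` are hypothetical stand-ins, not the real TextLangCfg/TextLangMan classes, and the sketch is not thread-safe.

```cpp
#include <map>
#include <memory>
#include <string>

// Stand-in for a per-language configuration object; a real one would
// also hold the hyphenation dictionary and line-breaking tweaks.
struct LangCfg {
    std::string tag;
};

class LangRegistry {
public:
    // Returns the shared config for `tag`, creating it on first request.
    // The cache is static, so entries outlive any single document.
    static LangCfg* get(const std::string& tag) {
        static std::map<std::string, std::unique_ptr<LangCfg>> cache;
        auto it = cache.find(tag);
        if (it == cache.end()) {
            std::unique_ptr<LangCfg> cfg(new LangCfg{tag});
            it = cache.emplace(tag, std::move(cfg)).first;
        }
        return it->second.get();
    }
};
```

Two documents asking for "fr" get the same pointer, so anything expensive attached to that language would be loaded once.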

Oh, and yes: I think you should just use one of these 2! Either you use only the legacy Hyphman props like PROP_HYPHENATION_DICT - or you use the PROP_TEXTLANG_MAIN_LANG and friends.

And yes, the interaction between TextLangMan <> Hyphman are complicated and tedious. I did not want to change Hyphman too much (mainly, because I want a clean git history with a real log of the past), otherwise, I would have just taken the hyph methods code, and drop the rest.
So, yes, it's ugly. If you need help on some parts, just tell where.

For me, the only issue for you would have been with HyphMan::activateDictionaryFromStream() on Android, because there's no obvious lang associated: you just provide a stream. That's if as a first step, you keep using the old HyphMan/PROP_HYPHENATION_DICT from frontend.
If you want to switch to using from frontend PROP_TEXTLANG_MAIN_LANG and friends, yes, you'll need more work in your frontend code. But that's optional. You still benefit from the new stuff with PROP_HYPHENATION_DICT.

virxkane · 2020-07-01T20:58:38Z

Ok, @poire-z thank you very much.

https://github.com/virxkane/coolreader/commit/bf60ffe2b67aa0de38d7be33f27c7eb08fa80637
Android build not fixed yet, I think, we must change function prototype (add lang_tag, etc...).

poire-z · 2020-07-01T22:10:36Z

I had 2 targets:

- legacy behaviour: input is a hyph dict, get the associated lang to set a main lang, no embedded lang tag support
- new behaviour: input is only a lang tag, embedded lang tags are supported.

So, your adaptation looks a bit hybrid :)

Dunno if you went looking at our frontend changes for this switch from HyphMan to TextLangMan: see koreader/koreader#6072.
We have other mapping in our frontend code https://github.com/koreader/koreader/blob/master/frontend/apps/reader/modules/readertypography.lua, like HYPH_DICT_NAME_TO_LANG_NAME_TAG or LANGUAGES.
It's a bit ugly to have all this on both sides - but it was the simplest (otherwise, having all this in crengine, I would have needed some API to transfer the info from crengine to koreader to build the menu of available languages, etc... too much work).

virxkane · 2020-07-02T04:52:19Z

While testing this PR: if I replace the ISO639-1 language code with an ISO639-2 (or ISO639-3) one, the 'lang' tag does not work anymore: neither hyphenation nor HarfBuzz's font script selection. For example, replace 'bg' with 'bul'. The 'lang' tag specification:
https://www.w3.org/International/questions/qa-html-language-declarations
https://developer.mozilla.org/ru/docs/Web/HTML/Global_attributes/lang
BCP47: https://www.rfc-editor.org/rfc/bcp/bcp47.txt
It is clear why hyphenation doesn't work: the table '_hyph_dict_table' does not contain any ISO-639-2(3) codes. But what about HarfBuzz?
I think we must embed in the sources a full language table with ISO639-1, ISO639-2 and ISO639-3 codes and the full language name, and write some functions to look up a language (or language code) in this table.

added:
I think it’s not difficult to find a document with the specified language 'eng'.

virxkane · 2020-07-02T05:03:53Z

Ok, in https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry 3-ALPHA codes are omitted if a suitable 2-ALPHA code exists. But I'm unsure that all files on the internet strictly conform to the specification.

poire-z · 2020-07-02T05:21:47Z

I think 'eng' or 'bul' are not expected in HTML lang tags (and I guess HarfBuzz does not accept them, https://github.com/harfbuzz/harfbuzz/blob/d5439232946333b60f655d9ed37ec7dadf439287/src/hb-ot-tag-table.hh#L16-L114 ).
https://www.w3.org/International/articles/language-tags/
Dunno about other formats, like FB2.

But we may find them in books metadata.
We handle them (and translate them to 'en' or 'bg') in our frontend code:
https://github.com/koreader/koreader/blob/f7d538b108167a6bb4e89880d2b0cf8b4c69b42f/frontend/apps/reader/modules/readertypography.lua#L52-L62
(It's a lot easier for me to add that kind of stuff in Lua than it is in C :)
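
For anyone who would rather do this on the C++ side, a minimal sketch of such a translation, assuming a small hand-written mapping. `normalizeLangTag` and its table are hypothetical and cover only a few codes; this is not a port of the Lua code linked above.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Map a 3-letter ISO 639-2 primary subtag to its 2-letter ISO 639-1
// equivalent when known; leave everything else untouched.
std::string normalizeLangTag(const std::string& tag) {
    struct Pair { const char* alpha3; const char* alpha2; };
    // Illustrative excerpt only; a real table would be much longer.
    static const Pair kMap[] = {
        {"eng", "en"}, {"bul", "bg"}, {"rus", "ru"},
        {"fra", "fr"}, {"fre", "fr"},
        {"deu", "de"}, {"ger", "de"},
    };

    // Split the primary subtag from any region/script suffix.
    std::size_t sep = tag.find_first_of("-_");
    std::string primary = tag.substr(0, sep);
    std::string rest =
        (sep == std::string::npos) ? std::string() : tag.substr(sep);

    // Language subtags are case-insensitive; compare in lowercase.
    for (char& c : primary) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    for (const Pair& p : kMap) {
        if (primary == p.alpha3) {
            primary = p.alpha2;
            break;
        }
    }
    return primary + rest;
}
```

For example, `normalizeLangTag("bul")` gives "bg" and `normalizeLangTag("eng-US")` gives "en-US", while tags that are already two letters long pass through unchanged apart from lowercasing the primary subtag.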

virxkane · 2020-07-02T05:26:52Z

Dunno about other formats, like FB2.

I found fb2 book with specified language 'eng'.

We handle them (and translate them to 'en' or 'bg') in our frontend code

Ok, I'll think about it.

virxkane · 2020-07-02T11:14:16Z

@poire-z As you requested, a report about a SEGFAULT (related to this PR). You introduced multiple uses of the construct m_flags[pos-1] in lvtextfm.cpp, and when pos == 0 a SEGFAULT is caught. Found with the file Dostoievsky.RU.epub that you uploaded earlier.
Maybe you can fix it yourself; I'm not sure I won't break your code.

poire-z · 2020-07-02T11:22:37Z

You don't mention the line, but maybe I've already fixed it in bc4500a, which you may not have picked yet?
(No crash for me with our latest master on that Dostoievsky.RU.)

virxkane · 2020-07-02T11:29:25Z

It seems like that.

virxkane · 2020-07-15T10:30:11Z

@poire-z But the bug is still not fixed. Try the file Petra.AR.epub

crengine/crengine/src/lvtextfm.cpp

Line 1102 in c14ea51

m_flags[pos-1] |= LCHAR_ALLOW_WRAP_AFTER;

If pos is equal to zero, asan tells me about a heap-buffer-overflow.

virxkane · 2020-07-15T10:43:03Z

Am I corrected correctly?
https://github.com/virxkane/coolreader/commit/086c571c8bfaa711a8c6f9e13b9e52f349fdcf12

poire-z · 2020-07-15T11:01:59Z

But the bug is still not fixed. Try the file Petra.AR.epub

You mean you get a crash? I don't get any crash with the Petra.AR.epub from the DocumentsForTestingRTL.zip from buggins/coolreader#125 (comment) :/

If pos is equal to zero, asan tells me about a heap-buffer-overflow.

Crash or just analyzer warning ? Of course, if pos=0 and we write at pos-1, it should complain. But aren't we wrapping this with if ( pos > 0 ) ? I don't see any access to pos-1 not wrapped with pos > 0 in your lvtextfm.cpp...
So, need more info to understand what you mean :)

Am I corrected correctly?

Looks correct.
And OK :) I fixed it for text (that you picked), but I forgot to fix it for images and inlineBoxes (that your commit fixes the right way it seems)...
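
For reference, the kind of guard being discussed can be sketched as below. This is not the code from lvtextfm.cpp: `allowWrapAfterPrevious` is a made-up helper, a `std::vector` stands in for the formatter's flags buffer, and the flag value is a placeholder.

```cpp
#include <cstdint>
#include <vector>

// Placeholder value; the real LCHAR_ALLOW_WRAP_AFTER constant may differ.
const std::uint16_t kAllowWrapAfter = 0x0002;

// Sets the "allow wrap after" flag on the element preceding `pos`.
// Returns false, touching nothing, when there is no valid previous
// element: this covers pos == 0, which would otherwise read and write
// just before the start of the buffer.
bool allowWrapAfterPrevious(std::vector<std::uint16_t>& flags, int pos) {
    if (pos <= 0 || pos > static_cast<int>(flags.size())) {
        return false;
    }
    flags[pos - 1] |= kAllowWrapAfter;
    return true;
}
```

The same check has to be repeated at every site that touches `pos-1`, which is how the text path got fixed while the image and inlineBox paths were missed.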

poire-z · 2020-07-15T11:13:29Z

Or do you mean https://github.com/virxkane/coolreader/commit/086c571c8bfaa711a8c6f9e13b9e52f349fdcf12 did fix your asan issue and all is now fine.

(And it's just that for some reason I don't really need to know, I did not get a crash.)

virxkane · 2020-07-15T11:23:00Z

You mean you get a crash?

it's not a true crash, it's an AddressSanitizer error log (this does not make the bug less harmful).

Or do you mean virxkane/coolreader@086c571 did fix your asan issue and all is now fine.

Yes.

(And it's just that for some reason I don't really need to know, I did not get a crash.)

Ok, but you are overwriting some data on the heap:

==9505==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x604000963ace at pc 0x55e05f077987 bp 0x7ffc77859b50 sp 0x7ffc77859b40
READ of size 2 at 0x604000963ace thread T0
    #0 0x55e05f077986 in LVFormatter::copyText(int, int) coolreader/crengine/src/lvtextfm.cpp:1109
    #1 0x55e05f0a0acb in LVFormatter::processParagraph(int, int, bool) coolreader/crengine/src/lvtextfm.cpp:3387
    #2 0x55e05f0aa1f5 in LVFormatter::splitParagraphs() coolreader/crengine/src/lvtextfm.cpp:4101
    #3 0x55e05f0aada9 in LVFormatter::format() coolreader/crengine/src/lvtextfm.cpp:4149
    #4 0x55e05f0606c7 in LFormattedText::Format(unsigned short, unsigned short, int, BlockFloatFootprint*) coolreader/crengine/src/lvtextfm.cpp:4246
    #5 0x55e05ec2f523 in ldomNode::renderFinalBlock(LVRef<LFormattedText>&, RenderRectAccessor*, int, BlockFloatFootprint*) coolreader/crengine/src/lvtinydom.cpp:16640
    #6 0x55e05f0d66dc in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:7121
    #7 0x55e05f0d536f in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:6833
    #8 0x55e05f0d536f in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:6833
    #9 0x55e05f0d536f in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:6833
    #10 0x55e05f0d536f in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:6833
    #11 0x55e05f0d9381 in renderBlockElement(LVRendPageContext&, ldomNode*, int, int, int, int, int*, int) coolreader/crengine/src/lvrend.cpp:7337
    #12 0x55e05f0d9547 in renderBlockElement(LVRendPageContext&, ldomNode*, int, int, int, int, int*) coolreader/crengine/src/lvrend.cpp:7354
    #13 0x55e05f0820b3 in LVFormatter::measureText() coolreader/crengine/src/lvtextfm.cpp:1888
    #14 0x55e05f0a0b05 in LVFormatter::processParagraph(int, int, bool) coolreader/crengine/src/lvtextfm.cpp:3389
    #15 0x55e05f0aa1f5 in LVFormatter::splitParagraphs() coolreader/crengine/src/lvtextfm.cpp:4101
    #16 0x55e05f0aada9 in LVFormatter::format() coolreader/crengine/src/lvtextfm.cpp:4149
    #17 0x55e05f0606c7 in LFormattedText::Format(unsigned short, unsigned short, int, BlockFloatFootprint*) coolreader/crengine/src/lvtextfm.cpp:4246
    #18 0x55e05ec2f523 in ldomNode::renderFinalBlock(LVRef<LFormattedText>&, RenderRectAccessor*, int, BlockFloatFootprint*) coolreader/crengine/src/lvtinydom.cpp:16640
    #19 0x55e05f0d66dc in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:7121
    #20 0x55e05f0d536f in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:6833
    #21 0x55e05f0d536f in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:6833
    #22 0x55e05f0d536f in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:6833
    #23 0x55e05f0d536f in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:6833
    #24 0x55e05f0d536f in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:6833
    #25 0x55e05f0d9381 in renderBlockElement(LVRendPageContext&, ldomNode*, int, int, int, int, int*, int) coolreader/crengine/src/lvrend.cpp:7337
    #26 0x55e05f0d9547 in renderBlockElement(LVRendPageContext&, ldomNode*, int, int, int, int, int*) coolreader/crengine/src/lvrend.cpp:7354
    #27 0x55e05eb5f7ef in ldomDocument::render(LVRendPageList*, LVDocViewCallback*, int, int, bool, int, LVProtectedFastRef<LVFont>, int, LVFastRef<CRPropAccessor>) coolreader/crengine/src/lvtinydom.cpp:4583
    #28 0x55e05ef961ca in LVDocView::Render(int, int, LVRendPageList*) coolreader/crengine/src/lvdocview.cpp:2822
    #29 0x55e05ef48b14 in LVDocView::checkRender() coolreader/crengine/src/lvdocview.cpp:604
    #30 0x55e05efa474a in LVDocView::updateBookMarksRanges() coolreader/crengine/src/lvdocview.cpp:3304
    #31 0x55e05efb0c57 in LVDocView::restorePosition() coolreader/crengine/src/lvdocview.cpp:3791
    #32 0x55e05e92a83d in CR3View::loadDocument(QString) coolreader/cr3qt/src/cr3widget.cpp:474
    #33 0x55e05e981bcc in MainWindow::on_actionOpen_triggered() coolreader/cr3qt/src/mainwindow.cpp:248
    #34 0x55e05ea91cf2 in MainWindow::qt_static_metacall(QObject*, QMetaObject::Call, int, void**) coolreader-debug-build/cr3qt/src/moc_mainwindow.cpp:253
    #35 0x55e05ea92870 in MainWindow::qt_metacall(QMetaObject::Call, int, void**) coolreader-debug-build/cr3qt/src/moc_mainwindow.cpp:295
    #36 0x7fcb510d61fe  (/usr/lib64/libQt5Core.so.5+0x2d91fe)
    #37 0x7fcb51b20791 in QAction::triggered(bool) (/usr/lib64/libQt5Widgets.so.5+0x15d791)
    #38 0x7fcb51b23357 in QAction::activate(QAction::ActionEvent) (/usr/lib64/libQt5Widgets.so.5+0x160357)
    #39 0x7fcb51c2dc31  (/usr/lib64/libQt5Widgets.so.5+0x26ac31)
    #40 0x7fcb51c2dd86 in QAbstractButton::mouseReleaseEvent(QMouseEvent*) (/usr/lib64/libQt5Widgets.so.5+0x26ad86)
    #41 0x7fcb51d36929 in QToolButton::mouseReleaseEvent(QMouseEvent*) (/usr/lib64/libQt5Widgets.so.5+0x373929)
    #42 0x7fcb51b6f7a5 in QWidget::event(QEvent*) (/usr/lib64/libQt5Widgets.so.5+0x1ac7a5)
    #43 0x7fcb51d369da in QToolButton::event(QEvent*) (/usr/lib64/libQt5Widgets.so.5+0x3739da)
    #44 0x7fcb51b284ce in QApplicationPrivate::notify_helper(QObject*, QEvent*) (/usr/lib64/libQt5Widgets.so.5+0x1654ce)
    #45 0x7fcb51b3019d in QApplication::notify(QObject*, QEvent*) (/usr/lib64/libQt5Widgets.so.5+0x16d19d)
    #46 0x7fcb510a0ddf in QCoreApplication::notifyInternal2(QObject*, QEvent*) (/usr/lib64/libQt5Core.so.5+0x2a3ddf)
    #47 0x7fcb51b2f283 in QApplicationPrivate::sendMouseEvent(QWidget*, QMouseEvent*, QWidget*, QWidget*, QWidget**, QPointer<QWidget>&, bool, bool) (/usr/lib64/libQt5Widgets.so.5+0x16c283)
    #48 0x7fcb51b8bc85  (/usr/lib64/libQt5Widgets.so.5+0x1c8c85)
    #49 0x7fcb51b8ebbc  (/usr/lib64/libQt5Widgets.so.5+0x1cbbbc)
    #50 0x7fcb51b284ce in QApplicationPrivate::notify_helper(QObject*, QEvent*) (/usr/lib64/libQt5Widgets.so.5+0x1654ce)
    #51 0x7fcb51b2ff57 in QApplication::notify(QObject*, QEvent*) (/usr/lib64/libQt5Widgets.so.5+0x16cf57)
    #52 0x7fcb510a0ddf in QCoreApplication::notifyInternal2(QObject*, QEvent*) (/usr/lib64/libQt5Core.so.5+0x2a3ddf)
    #53 0x7fcb514a9d7c in QGuiApplicationPrivate::processMouseEvent(QWindowSystemInterfacePrivate::MouseEvent*) (/usr/lib64/libQt5Gui.so.5+0x128d7c)
    #54 0x7fcb514ab3e4 in QGuiApplicationPrivate::processWindowSystemEvent(QWindowSystemInterfacePrivate::WindowSystemEvent*) (/usr/lib64/libQt5Gui.so.5+0x12a3e4)
    #55 0x7fcb514847ea in QWindowSystemInterface::sendWindowSystemEvents(QFlags<QEventLoop::ProcessEventsFlag>) (/usr/lib64/libQt5Gui.so.5+0x1037ea)
    #56 0x7fcb499caec9  (/usr/lib64/libQt5XcbQpa.so.5+0x75ec9)
    #57 0x7fcb4fdf0c3c in g_main_context_dispatch (/usr/lib64/libglib-2.0.so.0+0x4fc3c)
    #58 0x7fcb4fdf0eb7  (/usr/lib64/libglib-2.0.so.0+0x4feb7)
    #59 0x7fcb4fdf0f4e in g_main_context_iteration (/usr/lib64/libglib-2.0.so.0+0x4ff4e)
    #60 0x7fcb510f895f in QEventDispatcherGlib::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) (/usr/lib64/libQt5Core.so.5+0x2fb95f)
    #61 0x7fcb5109f95a in QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>) (/usr/lib64/libQt5Core.so.5+0x2a295a)
    #62 0x7fcb510a7941 in QCoreApplication::exec() (/usr/lib64/libQt5Core.so.5+0x2aa941)
    #63 0x55e05e8efb63 in main coolreader/cr3qt/src/main.cpp:205
    #64 0x7fcb4ff2ce9a in __libc_start_main (/lib64/libc.so.6+0x23e9a)
    #65 0x55e05e8bd439 in _start (coolreader-debug-build/cr3qt/cr3+0x15ac439)

0x604000963ace is located 2 bytes to the left of 34-byte region [0x604000963ad0,0x604000963af2)
allocated by thread T0 here:
    #0 0x7fcb524f8d29 in realloc (/usr/lib/gcc/x86_64-pc-linux-gnu/9.3.0/libasan.so.5+0x10cd29)
    #1 0x55e05f0ab05f in unsigned short* cr_realloc<unsigned short>(unsigned short*, unsigned long) coolreader/crengine/src/../include/lvmemman.h:42
    #2 0x55e05f073093 in LVFormatter::allocate(int, int) coolreader/crengine/src/lvtextfm.cpp:880
    #3 0x55e05f0a0a83 in LVFormatter::processParagraph(int, int, bool) coolreader/crengine/src/lvtextfm.cpp:3385
    #4 0x55e05f0aa1f5 in LVFormatter::splitParagraphs() coolreader/crengine/src/lvtextfm.cpp:4101
    #5 0x55e05f0aada9 in LVFormatter::format() coolreader/crengine/src/lvtextfm.cpp:4149
    #6 0x55e05f0606c7 in LFormattedText::Format(unsigned short, unsigned short, int, BlockFloatFootprint*) coolreader/crengine/src/lvtextfm.cpp:4246
    #7 0x55e05ec2f523 in ldomNode::renderFinalBlock(LVRef<LFormattedText>&, RenderRectAccessor*, int, BlockFloatFootprint*) coolreader/crengine/src/lvtinydom.cpp:16640
    #8 0x55e05f0d66dc in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:7121
    #9 0x55e05f0d536f in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:6833
    #10 0x55e05f0d536f in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:6833
    #11 0x55e05f0d536f in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:6833
    #12 0x55e05f0d536f in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:6833
    #13 0x55e05f0d9381 in renderBlockElement(LVRendPageContext&, ldomNode*, int, int, int, int, int*, int) coolreader/crengine/src/lvrend.cpp:7337
    #14 0x55e05f0d9547 in renderBlockElement(LVRendPageContext&, ldomNode*, int, int, int, int, int*) coolreader/crengine/src/lvrend.cpp:7354
    #15 0x55e05f0820b3 in LVFormatter::measureText() coolreader/crengine/src/lvtextfm.cpp:1888
    #16 0x55e05f0a0b05 in LVFormatter::processParagraph(int, int, bool) coolreader/crengine/src/lvtextfm.cpp:3389
    #17 0x55e05f0aa1f5 in LVFormatter::splitParagraphs() coolreader/crengine/src/lvtextfm.cpp:4101
    #18 0x55e05f0aada9 in LVFormatter::format() coolreader/crengine/src/lvtextfm.cpp:4149
    #19 0x55e05f0606c7 in LFormattedText::Format(unsigned short, unsigned short, int, BlockFloatFootprint*) coolreader/crengine/src/lvtextfm.cpp:4246
    #20 0x55e05ec2f523 in ldomNode::renderFinalBlock(LVRef<LFormattedText>&, RenderRectAccessor*, int, BlockFloatFootprint*) coolreader/crengine/src/lvtinydom.cpp:16640
    #21 0x55e05f0d66dc in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:7121
    #22 0x55e05f0d536f in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:6833
    #23 0x55e05f0d536f in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:6833
    #24 0x55e05f0d536f in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:6833
    #25 0x55e05f0d536f in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:6833
    #26 0x55e05f0d536f in renderBlockElementEnhanced(FlowState*, ldomNode*, int, int, int) coolreader/crengine/src/lvrend.cpp:6833
    #27 0x55e05f0d9381 in renderBlockElement(LVRendPageContext&, ldomNode*, int, int, int, int, int*, int) coolreader/crengine/src/lvrend.cpp:7337
    #28 0x55e05f0d9547 in renderBlockElement(LVRendPageContext&, ldomNode*, int, int, int, int, int*) coolreader/crengine/src/lvrend.cpp:7354
    #29 0x55e05eb5f7ef in ldomDocument::render(LVRendPageList*, LVDocViewCallback*, int, int, bool, int, LVProtectedFastRef<LVFont>, int, LVFastRef<CRPropAccessor>) coolreader/crengine/src/lvtinydom.cpp:4583

SUMMARY: AddressSanitizer: heap-buffer-overflow coolreader/crengine/src/lvtextfm.cpp:1109 in LVFormatter::copyText(int, int)
Shadow bytes around the buggy address:
  0x0c0880124700: fa fa fd fd fd fd fd fa fa fa fd fd fd fd fd fd
  0x0c0880124710: fa fa fd fd fd fd fd fa fa fa fd fd fd fd fd fd
  0x0c0880124720: fa fa fd fd fd fd fd fd fa fa fd fd fd fd fd fd
  0x0c0880124730: fa fa fd fd fd fd fd fd fa fa fd fd fd fd fd fd
  0x0c0880124740: fa fa fd fd fd fd fd fa fa fa fd fd fd fd fd fa
=>0x0c0880124750: fa fa fd fd fd fd fd fa fa[fa]00 00 00 00 02 fa
  0x0c0880124760: fa fa 00 00 00 00 02 fa fa fa fa fa fa fa fa fa
  0x0c0880124770: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c0880124780: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c0880124790: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c08801247a0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==9505==ABORTING

Of course, line numbers are different ...
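
As a reading aid for the report above, here is a minimal sketch of how the faulting address maps onto the shadow-byte dump. It assumes the default ASan mapping for x86_64 Linux, `shadow = (addr >> 3) + 0x7fff8000`. The function names `shadowAddress`, `regionSize` and `checkAgainstReport` are made up for this illustration and are not part of ASan or crengine.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: default ASan shadow mapping on x86_64 Linux.
constexpr std::uint64_t kShadowScale  = 3;             // 1 shadow byte covers 8 application bytes
constexpr std::uint64_t kShadowOffset = 0x7fff8000ULL;

// Shadow address describing the 8-byte granule that contains `addr`.
constexpr std::uint64_t shadowAddress(std::uint64_t addr) {
    return (addr >> kShadowScale) + kShadowOffset;
}

// Bytes described by `fullBytes` shadow bytes of 00 followed by one
// partially addressable shadow byte with value `partial` (0 if there is none).
constexpr std::uint64_t regionSize(std::uint64_t fullBytes, std::uint64_t partial) {
    return fullBytes * 8 + partial;
}

// Checks the numbers quoted in the report against the mapping above.
void checkAgainstReport() {
    // 0x604000963ace lands on the bracketed [fa] byte: row 0x0c0880124750, column 9.
    assert(shadowAddress(0x604000963aceULL) == 0x0c0880124750ULL + 9);
    // "00 00 00 00 02" right after it describes the 34-byte region.
    assert(regionSize(4, 2) == 34);
}
```

Calling `checkAgainstReport()` from a test or from `main` passes silently if the mapping assumption holds for this report.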

poire-z · 2020-07-15T11:31:19Z

OK, I get it - I noticed in the past that I sometimes did not get a crash when I wrote just one byte to the left - I needed to write to the 2nd one to get a crash :)
But OK, your fix is perfect, picking it as part of #357.
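
The "2 bytes to the left of 34-byte region" line is consistent with a buffer of 17 16-bit characters (allocated through `cr_realloc<unsigned short>`) being accessed at index -1. The sketch below only illustrates that arithmetic and one way to reject such an index. `byteOffset`, `checkedStore` and `demoUnderflow` are hypothetical helpers, not crengine code and not the fix picked into #357.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Byte offset of element `index` from the start of a buffer of 16-bit characters.
constexpr std::ptrdiff_t byteOffset(std::ptrdiff_t index) {
    return index * static_cast<std::ptrdiff_t>(sizeof(std::uint16_t));
}

// Hypothetical helper: store `ch` at `index` only when the index is in range.
// Returns false (and writes nothing) for out-of-range indices such as -1.
bool checkedStore(std::vector<std::uint16_t>& buf, std::ptrdiff_t index, std::uint16_t ch) {
    if (index < 0 || static_cast<std::size_t>(index) >= buf.size())
        return false;
    buf[static_cast<std::size_t>(index)] = ch;
    return true;
}

void demoUnderflow() {
    std::vector<std::uint16_t> text(17);    // 17 * 2 = 34 bytes, as in the report
    assert(byteOffset(-1) == -2);           // index -1 is 2 bytes to the left
    assert(!checkedStore(text, -1, 0x20));  // underflow is rejected
    assert(checkedStore(text, 0, 0x20));    // first valid element is accepted
}
```

A raw pointer write at index -1 has no such check, which is why it may or may not crash without a sanitizer, depending on what happens to sit in front of the allocation.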

poire-z force-pushed the libunibreak_textlangman branch from 9b110d1 to 3e3b6c8 Compare April 17, 2020 19:57

poire-z force-pushed the libunibreak_textlangman branch from 3e3b6c8 to d17c777 Compare April 18, 2020 08:44

poire-z added 2 commits April 18, 2020 10:44

Fix a few clang-tidy warnings

ef460ba

Add support for <img src="data:image/png;base64,...>

f10a65e

Mostly some refactoring to make the private LVBase64Stream in lvxml.cpp be public in lvxml.h.

poire-z force-pushed the libunibreak_textlangman branch from d17c777 to d89ae37 Compare April 18, 2020 08:46

Frenzie reviewed Apr 18, 2020


poire-z force-pushed the libunibreak_textlangman branch from d89ae37 to 45b29ca Compare April 18, 2020 10:38

poire-z added 6 commits April 18, 2020 12:41

XML parsing: add more HTML5 named entities, optimize search

d10fcf7

Text: fix standalone BR not making an empty line

713c588

Fix BR with "display: block" not making an empty line

bb97584

Fix hyphens from soft-hyphens not part of highlighted segments

1550b17

Use libunibreak for line breaking

7a6f91f

This just adds generic support for libunibreak, which will be tweaked by next commit.

poire-z force-pushed the libunibreak_textlangman branch from 45b29ca to e19f4ff Compare April 18, 2020 10:42

poire-z merged commit 44eacb3 into koreader:master Apr 18, 2020

poire-z deleted the libunibreak_textlangman branch April 18, 2020 11:31

This was referenced Apr 18, 2020

bump crengine: text typography by language koreader/koreader-base#1082

Merged

bump crengine: text typography by language koreader/koreader#6069

Merged

NiLuJe reviewed Apr 19, 2020


This was referenced Apr 19, 2020

Force BR to always be display:inline #338

Merged

Adds ReaderTypography (replaces ReaderHyphenation) koreader/koreader#6072

Merged

Allow providing and using multiple fallback fonts #339

Merged

Frenzie mentioned this pull request May 1, 2020

bump crengine: multiple fallback fonts koreader/koreader#6090

Merged

poire-z mentioned this pull request Jun 27, 2020

Normalized xpointers fix, some FB2 footnotes tweaks #329

Merged

poire-z mentioned this pull request Aug 6, 2020

Linebreaking at dashes #364

Closed

poire-z mentioned this pull request Sep 27, 2020

Line break at the footnote number koreader/koreader#6718

Closed

virxkane mentioned this pull request Jan 10, 2021

Android: default the hyphenation dictionary to 'algorithm' buggins/coolreader#219

Closed

poire-z mentioned this pull request Aug 10, 2023

CSS: support for pseudo elements ::before & ::after #345

Merged

