-
Notifications
You must be signed in to change notification settings - Fork 45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speedup PreProcessXmlString() #526
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not familiar with the logic, so I'll let @poire-z comment, but I'm not seeing anything too egregious, and the reasoning is definitely sound ;).
Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @bbshelper)
crengine/src/lvxml.cpp
line 5436 at r1 (raw file):
static void decode_character(const lChar32 *end, const lChar32 *enc_table, const lChar32 *&src, lChar32 *&dst) {
Style nit: there's no real house style because it's a mess already, but I'm not a fan of the indent there ;).
crengine/src/lvxml.cpp
line 5437 at r1 (raw file):
static void decode_character(const lChar32 *end, const lChar32 *enc_table, const lChar32 *&src, lChar32 *&dst) { int state = 1, ch;
Shouldn't ch
be a lChar32
?
(Linters probably won't like this type of declarations anyway).
crengine/src/lvxml.cpp
line 5536 at r1 (raw file):
for (const lChar32 *src = str; src < end; ++src) { lChar32 ch = *src; if (ch <= '&') [[unlikely]] {
Can't rely on C++20 features, we've settled on GNU++17.
You'll want to go with compilers builtins instead, and that's... slightly trickier than expected in C++, e.g.,
// Likely/Unlikely branch tagging
#if defined(__cplusplus)
// c.f., https://stackoverflow.com/a/43870188/1067003
# define likely(x) __builtin_expect(static_cast<bool>((x)), 1)
# define unlikely(x) __builtin_expect(static_cast<bool>((x)), 0)
#else
# define likely(x) __builtin_expect(!!(x), 1)
# define unlikely(x) __builtin_expect(!!(x), 0)
#endif
e5281de
to
ae73f23
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @bbshelper)
Ah I thought I replied to your comments but it seems it wasn't sent. Regarding [[unlikely]] attribute, it's actually supported since GCC 9. Can I be lazy here? :p |
My understanding is that this is conditional on std being c++20 or gnu++20, not gnu++17 (which is what we enforce). (I haven't double-checked if that's actually what happens in practice, but that's what I get out of the GCC C++ support table). |
So, if it so holds true with the NDK's clang, good enough for me ;). |
Haven't tried Android build recently. What's the NDK version? |
r23c, IIRC. |
Your code is obviously clearer and easier to read. But comparing with diff or side by side, it's not obvious to see if it's strictly equivalent, so we'll have to trust you :) Btw, about "unlikely": wouldn't it just be a matter of inverting the tests ? if (the likely stuff) { ...} else { the unlikely stuff } ? Will the compiler ensure our test as-is if we don't use likely/unlikely ? |
Just tried that and it's supported. Hooray! |
I think you can safely skip the
Well that's tricky :) Before commit, I've tested against 100 largest epubs from Project Gutenburg and the result is the same between the old and the new, but they might be too "regular" to be considered enough. I've also written some simple tests but definitely haven't covered all corner cases. I can try fuzzing if you feel the need.
Assembly code is architecture dependent and doesn't have a strict mapping to C expressions/statements. With optimizations on, compilers are free to rearrange the assembly. |
I would trust the developper's own feeling about it :) How he feels, whether he rewrote it from scratch, or if he tried to follow each little branch and condition to have it strictly equivalent while rewriting it :) If you don't feel confident, the other solution (which would be the one I would have taken :) not being confident by nature :) would be to keep the old PreProcessXmlString as-is, add a new PreProcessXMLStringV2, and make the use of one or the other depending on a new bump of |
Do a quick check before comparing `ch` to '\n', '\r' and '\t' in most cases.
Don't write to `lch` if it's not pre.
ae73f23
to
c908af8
Compare
Well, fuzzing quickly showed many discrepancies, and it turned out that my previous commit is beyond fix. I experimented a lot and decided to commit a series of small changes, hopefully both easy to review and to reason about functional equivalence. Feel free to squash them if you prefer that way. The original code has quirks in handling malformed input (i.e. random combinations of |
Yes, we should support bad input, just not assume we'll always get clean and properly encoded stuff :)
Thank you for taking the time to do that :) About the first commit replace indices with pointers, I'm not friend with pointer arithmetic, so I'll let @NiLuJe who eats that for breakfast :) review its equivalence. Fine with the other commits. I guess the point of the No need to also include
I guess I'll prefer if we keep them.
So, I guess you did run fuzzing with this new code, and it's all 100% equivalent - and your author feeling is confident? |
Looks sound at a glance, yeah ;).
Yeah! \m/ |
Yes
Yes, in my test, they are estimated to use ~6% and ~13% fewer cycles than their respective parent commits.
Yes. I believe every change on this branch can be proven to be equivalent. Fuzzing really helped during the process, as some corner cases are hard to reason about, and even harder to invent. |
Gave you one week for afterthoughts :) If ok, can you add in the message on the first commit |
Yes. I tried really hard, just to eliminate that need :)
How about a merge commit? |
Wouldn't that just add a 7th commit, while keeping your 6 associated commits with their original timestamps? (I could Squash and merge and get a single commit, and just link to here for the details. But dunno, it might be nice to be able to see each of them when working with git, and checkout to an individual commit.) |
There are more than necessary comparions made on every char. By
we see ~50% fewer time spent in the function.
This change is