Optimize offsetGet #506

kamil-tekiela · 2023-09-07T21:09:56Z

We don't actually need to call getCharLength() if we inline the operation and use mb_substr with substr. This trick fetches 4 bytes from the current byte offset because UTF-8 characters are at most 4 bytes. mb_substr will then read the first Unicode character from the selected 4 bytes.
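A minimal sketch of the trick described above, assuming PHP 8 with the mbstring extension and valid UTF-8 input. The helper name `charAtByteOffset()` is hypothetical and is not the project's actual `offsetGet()` code; it only illustrates the `substr()` plus `mb_substr()` combination.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not the project's code): return the character that
 * starts at byte offset $byteOffset of a valid UTF-8 string.
 */
function charAtByteOffset(string $str, int $byteOffset): string
{
    // Take at most 4 bytes: the longest possible UTF-8 sequence.
    $window = substr($str, $byteOffset, 4);

    // Let mbstring decode the window and keep only its first character.
    // Trailing bytes that belong to the next character are simply dropped.
    return mb_substr($window, 0, 1, 'UTF-8');
}

$s = 'aé€😀z'; // 1 + 2 + 3 + 4 + 1 bytes; characters start at bytes 0, 1, 3, 6, 10

echo charAtByteOffset($s, 1), "\n"; // é
echo charAtByteOffset($s, 3), "\n"; // €
echo charAtByteOffset($s, 6), "\n"; // 😀
```

The byte offset must point at the first byte of a character; the sketch does not validate that.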

The previous technique was actually brilliant. The ASCII map optimization is the fastest way to calculate the character length, but it is also quite long. It would be perfect if PHP had a way to avoid the ord() call altogether. I benchmarked three different techniques I could come up with, and the ASCII map was the fastest. But the worst thing for performance is actually the function call itself, and inlining such a long implementation to avoid it wouldn't be wise. That's why I decided to use the mb_substr() technique.
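For comparison, here is a rough sketch of the general ord()-based approach: read the lead byte and derive the sequence length from it. The removed getCharLength() is not shown in this thread, so this is an assumption about the idea, not a copy of it, and both function names are hypothetical.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical sketch: number of bytes in the UTF-8 sequence introduced by
 * the lead byte $byte (a 1-byte string). Assumes valid UTF-8 and that $byte
 * really is a lead byte, not a continuation byte.
 */
function leadByteLength(string $byte): int
{
    $code = ord($byte);

    if ($code < 0x80) {
        return 1; // 0xxxxxxx: ASCII
    }

    if ($code < 0xE0) {
        return 2; // 110xxxxx
    }

    if ($code < 0xF0) {
        return 3; // 1110xxxx
    }

    return 4; // 11110xxx
}

/**
 * Hypothetical sketch: character starting at an in-range byte offset.
 * Costs one ord() call plus one user-land function call per lookup.
 */
function charViaLength(string $str, int $byteOffset): string
{
    return substr($str, $byteOffset, leadByteLength($str[$byteOffset]));
}

echo charViaLength('aé€😀z', 3), "\n"; // €
```

The branches are cheap; as the description notes, the cost that matters is the extra function call around them.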

Real-life benchmarks:
Before:

After:

As you can see, this is a micro-optimization, so I don't think we should worry too much about which one is the fastest. The difference in my benchmark is only 4 ms over 10k calls.

What do you think? Should I add a code comment explaining how this works?

We don't actually need to call getCharLength() if we inline the operation and use mb_substr with substr. This trick fetches 4 bytes from the current byte offset because UTF-8 characters are at most 4 bytes. mb_substr will then read the first Unicode character from the selected 4 bytes. Signed-off-by: Kamil Tekiela <[email protected]>

codecov · 2023-09-07T21:11:56Z

Codecov Report

Patch coverage: 100.00% and project coverage change: -0.02% ⚠️

Comparison is base (b737500) 96.46% compared to head (be2ca97) 96.45%.

Additional details and impacted files

@@             Coverage Diff              @@
##             master     #506      +/-   ##
============================================
- Coverage     96.46%   96.45%   -0.02%     
+ Complexity     2179     2170       -9     
============================================
  Files            66       66              
  Lines          5065     5045      -20     
============================================
- Hits           4886     4866      -20     
  Misses          179      179

Files Changed	Coverage Δ
src/Tools/CustomJsonSerializer.php	`0.00% <ø> (ø)`
src/UtfString.php	`100.00% <100.00%> (ø)`

MauricioFauth · 2023-09-07T22:03:15Z

> Should I add a code comment explaining how this works?

Definitely.

williamdes · 2023-09-08T12:39:32Z

tests/benchmarks/UtfStringBench.php

-     * @Assert("mode(variant.time.avg) < 800 microseconds +/- 20%")
-     * @Assert("mode(variant.time.avg) > 100 microseconds +/- 10%")
-     */
-    public function benchGetCharLength(): void


Can you benchmark the function and set a baseline, please?

kamil-tekiela · 2023-09-08

Shouldn't it remain the same? How do I run this benchmark?

williamdes · 2023-09-08

Well, since it's a more complete function and the old function was removed, maybe everything should be updated then.
But having benchmark data could help.

kamil-tekiela · 2023-09-08

I added a new commit. Is this what you had in mind? These are the results I get:

benchBuildUtfString.....................I19 ✔ Mo6.052ms (±3.59%)
benchUtfStringRandomAccessWithUnicode...I19 ✔ Mo63.459μs (±10.50%)

williamdes · 2023-09-10

Yes, that's good to have checked at all times.
As for the results, I can't tell: if the benchmark had been added earlier we could compare, but that's totally okay, never mind :)

williamdes · 2023-09-08T12:40:52Z

> What do you think? Should I add a code comment explaining how this works?

Could we keep the previous optimisation, which should normally be the fastest one, and keep yours to avoid calling ord()?

kamil-tekiela · 2023-09-08T14:06:10Z

> > What do you think? Should I add a code comment explaining how this works?
>
> Could we keep the previous optimisation, which should normally be the fastest one, and keep yours to avoid calling ord()?

No, because it's either one or the other. To use the ASCII map optimization one must call ord(). My solution is only slightly faster because it replaces two function calls, ord() and getCharLength(), with two others, substr() and mb_substr(), while avoiding any ifs. If there is a way to do this using only one function call or fewer, that would be the best solution.

But maybe I misunderstood what you mean. What exactly are you proposing? Do you see a way to merge these two solutions together?
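A rough way to check the trade-off yourself is sketched below, assuming PHP 8 with mbstring. Everything in it (the test string, the closures, `timeIt()`) is hypothetical and is not the project's phpbench suite. It only confirms that both lookups return the same characters and prints how long each took on your machine.

```php
<?php

declare(strict_types=1);

/** Hypothetical timing helper: returns [elapsed milliseconds, results]. */
function timeIt(callable $lookup, string $str, array $offsets): array
{
    $out = [];
    $start = hrtime(true);
    foreach ($offsets as $offset) {
        $out[] = $lookup($str, $offset);
    }

    return [(hrtime(true) - $start) / 1e6, $out];
}

$str = str_repeat('aé€😀', 1000);
$len = strlen($str);

// Collect the byte offset of every character once, outside the timed loops.
$offsets = [];
for ($i = 0; $i < $len;) {
    $offsets[] = $i;
    $c = ord($str[$i]);
    $i += $c < 0x80 ? 1 : ($c < 0xE0 ? 2 : ($c < 0xF0 ? 3 : 4));
}

// Variant 1: ord() on the lead byte, branch on its value, then substr().
$byOrd = static function (string $s, int $o): string {
    $c = ord($s[$o]);

    return substr($s, $o, $c < 0x80 ? 1 : ($c < 0xE0 ? 2 : ($c < 0xF0 ? 3 : 4)));
};

// Variant 2: substr() a 4-byte window, then mb_substr() its first character.
$byMb = static fn (string $s, int $o): string => mb_substr(substr($s, $o, 4), 0, 1, 'UTF-8');

[$msOrd, $charsOrd] = timeIt($byOrd, $str, $offsets);
[$msMb, $charsMb] = timeIt($byMb, $str, $offsets);

printf(
    "ord(): %.3f ms, mb_substr(): %.3f ms, same result: %s\n",
    $msOrd,
    $msMb,
    $charsOrd === $charsMb ? 'yes' : 'no'
);
```

Both variants pay the same closure-call overhead here, so the numbers only hint at the relative cost of the bodies. A single run like this is noisy; a tool such as phpbench, with several iterations, gives more trustworthy figures.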

williamdes · 2023-09-08T14:11:30Z

> No, because it's either one or the other. To use the ASCII map optimization one must call ord(). My solution is only slightly faster because it replaces two function calls, ord() and getCharLength(), with two others, substr() and mb_substr(), while avoiding any ifs. If there is a way to do this using only one function call or fewer, that would be the best solution.
>
> But maybe I misunderstood what you mean. What exactly are you proposing?

Indeed, I misunderstood the new solution.
But it feels like it requires many more function calls; is there no way to avoid some of them?
Not sure if it's just me over-optimizing the solution ^^

> Do you see a way to merge these two solutions together?

Not really, because of the use of delta and other things that seem to advance through the string much more quickly.

Maybe phpbench could show how quick this is?

kamil-tekiela · 2023-09-08T14:16:15Z

> But it feels like it requires many more function calls; is there no way to avoid some of them?

Yeah, it does. I wish there was a way to avoid it, but I can't think of any. Good thing that strlen isn't a function call but a dedicated opcode.

williamdes

Looks great !

kamil-tekiela force-pushed the UtfString branch from 33f2e72 to a66f0d5 Compare September 7, 2023 21:10

MauricioFauth approved these changes Sep 7, 2023

kamil-tekiela added 2 commits September 7, 2023 23:53

Add explanator comments

eb248b3

Signed-off-by: Kamil Tekiela <[email protected]>

Simplify offsetGet even more

905902c

Signed-off-by: Kamil Tekiela <[email protected]>

williamdes reviewed Sep 8, 2023

Update UtfStringBench.php

be2ca97

Signed-off-by: Kamil Tekiela <[email protected]>

kamil-tekiela force-pushed the UtfString branch from 28bf110 to be2ca97 Compare September 8, 2023 15:50

williamdes approved these changes Sep 10, 2023

williamdes added this to the 6.0.0 milestone Sep 10, 2023

MauricioFauth approved these changes Sep 16, 2023

MauricioFauth merged commit d57481d into phpmyadmin:master Sep 16, 2023
11 of 12 checks passed

MauricioFauth self-assigned this Sep 16, 2023

kamil-tekiela deleted the UtfString branch September 16, 2023 21:08
